Tasks

Tasks in Cirrus implement a unit of processing, to be composed together into a Workflow. Tasks are expected to support both input and output formatted as a Cirrus Process Payload. As part of its processing, a task can make any requisite modifications to its input payload and/or derive any output assets, pushing them to the canonical storage location in S3.

In other words, to implement custom processing routines for a pipeline, use a task. The best tasks are modular, simple, focused, and composable. Most projects end up with more custom tasks than any other component type, so it pays to be familiar with a task's ins and outs.

Tasks can make use of AWS Lambda and/or AWS Batch for execution. Lambda tasks are simpler to manage and quicker to start up, but the Lambda runtime constraints can be prohibitive or untenable for some task workloads. In those cases, Batch allows for extended runtimes, greater resource limits, and specialized instance types.

Anatomy of a task

Generally speaking, every task should do a few key things:

  • Take an input Cirrus Process Payload

    • In the case of Batch tasks and/or large payloads, tasks should support receiving a url input parameter pointing to a payload object in S3

  • Download any/all required assets from the items in the input payload

  • Perform any asset metadata manipulation and/or derived product processing

  • Update/replace payload items based on task outputs

  • Upload any output assets/items to S3 for persistence

  • Return the output Cirrus Process Payload

    • In the case of Batch tasks and/or large payloads, tasks should support uploading the output payload to S3 and returning an output url parameter pointing to that payload object in S3

Certain tasks may deviate from this pattern, but the vast majority of tasks will follow this flow, either in part or in full. The Python stac-task library provides convenience classes/methods to help build tasks and easily facilitate these common actions.
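The flow above can be sketched in plain Python. The helper functions here are hypothetical stand-ins for the download, processing, and upload steps; a real task built on stac-task would get equivalents of these as methods and only needs to implement the processing itself:

```python
from typing import Any

# Hypothetical helpers: in a real task these would fetch/push assets
# (e.g., with boto3) and run the actual processing logic.
def download_assets(item: dict[str, Any]) -> dict[str, Any]:
    return item  # placeholder: download assets referenced by the item

def process_item(item: dict[str, Any]) -> dict[str, Any]:
    item.setdefault("properties", {})["processed"] = True
    return item

def upload_assets(item: dict[str, Any]) -> dict[str, Any]:
    return item  # placeholder: upload output assets to S3

def run_task(payload: dict[str, Any]) -> dict[str, Any]:
    """Take a Cirrus Process Payload, update its items, return the payload."""
    items = [download_assets(item) for item in payload.get("features", [])]
    items = [process_item(item) for item in items]
    items = [upload_assets(item) for item in items]
    payload["features"] = items
    return payload
```

The payload in, payload out shape is the key invariant: whatever a task does internally, downstream workflow states expect a valid Cirrus Process Payload as its return value.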

Lambda tasks

Lambda tasks use the AWS Lambda runtime to power executions. Lambda has the advantage of quick startup and easy management, but has many restrictions like short timeouts and significant resource limits.

Lambda-only tasks follow the specifications outlined in the Lambda-based components documentation.
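A minimal sketch of a Lambda handler that also supports the url indirection described above for large payloads. Both load_payload_from_s3 and process are hypothetical names here, standing in for an S3 fetch (e.g., via boto3) and the task's real logic:

```python
from typing import Any

def load_payload_from_s3(url: str) -> dict[str, Any]:
    # Hypothetical: fetch and parse the payload object from S3.
    raise NotImplementedError

def process(payload: dict[str, Any]) -> dict[str, Any]:
    # Hypothetical: the task's actual processing.
    return payload

def handler(event: dict[str, Any], context: Any = None) -> dict[str, Any]:
    # Large payloads may arrive as {"url": "s3://..."} rather than inline,
    # since Lambda caps event payload sizes.
    if "url" in event:
        event = load_payload_from_s3(event["url"])
    return process(event)
```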

Batch tasks

Batch tasks use AWS Batch semantics to define jobs that execute within a compute environment as determined by the job queue to which the job is submitted. Batch compute environments can make use of Fargate or EC2 to run jobs, allowing significantly more control over the execution environment than Lambda allows, as well as much greater limits on resources.

Batch tasks are inherently just an abstraction around a set of CloudFormation resources, minimally just a Batch job definition, but commonly also the job queue, compute environment, and any other required resources.

For more information, see the Batch tasks documentation.

Lambda vs Batch

When to choose either

  • can be run as a Docker image

When to choose Lambda

  • short runtime, with a maximum of 15 minutes

  • single vCPU is acceptable

  • under 10GB of memory

  • small code size / few dependencies

  • need code to live in the cirrus project repository

When to choose Batch

  • runtime always longer than 5 minutes

  • larger package size / many dependencies

  • need multiple vCPUs, more than 10GB memory, or the ability to more precisely specify a vCPU-to-memory ratio

  • easier to manage code as separate container images

  • need more than 10GB storage

  • need special hardware resources (e.g., GPU)

Docker Image

It is generally recommended to use a Docker image for tasks unless the task is only intended to run in Lambda and has few dependencies, which is rarely the case with geospatial processing.

The easiest way to do this is to create a Python class that extends stactask.task.Task and implements the abstract method def process(self, **kwargs: Any) -> List[Dict[str, Any]]. This class should be put in the file src/task/task.py, and can then be invoked by Docker with the following directives:

FROM public.ecr.aws/lambda/python:3.11

# ENTRYPOINT ["/lambda-entrypoint.sh"] # set by base lambda image
COPY src/task/task.py ${LAMBDA_TASK_ROOT}/task.py
CMD [ "task.handler" ]

If you choose to use your own container that is not based on a standard Lambda image, you must add the lambda-entrypoint.sh file to your image and set ENTRYPOINT explicitly.

Unfortunately, the same Docker image cannot be used by both Batch and Lambda, as they have slightly different requirements for the CMD and ENTRYPOINT parameters. The solution is to have one base image definition containing all directives except CMD and ENTRYPOINT, then create two image definitions that build on the base image but set CMD and ENTRYPOINT appropriately. For example, if we had an image based on a standard AWS Lambda image (public.ecr.aws/lambda/python:3.11), we would use these directives for Batch and Lambda.

Batch:

COPY src/task/task.py ${LAMBDA_TASK_ROOT}/task.py
ENTRYPOINT []

(any prior CMD will be reset by the ENTRYPOINT change, and the command is set by the Batch Job Definition)

Lambda:

COPY src/task/task.py ${LAMBDA_TASK_ROOT}/task.py
CMD [ "task.handler" ]

(base image sets ENTRYPOINT ["/lambda-entrypoint.sh"])

Putting this together, an example base image that needed geospatial tools could look like this:

FROM ghcr.io/lambgeo/lambda-gdal:3.8-python3.11 as gdal

FROM public.ecr.aws/lambda/python:3.11

# Bring C libs from lambgeo/lambda-gdal image
COPY --from=gdal /opt/lib/ ${LAMBDA_TASK_ROOT}/lib/
COPY --from=gdal /opt/include/ ${LAMBDA_TASK_ROOT}/include/
COPY --from=gdal /opt/share/ ${LAMBDA_TASK_ROOT}/share/
COPY --from=gdal /opt/bin/ ${LAMBDA_TASK_ROOT}/bin/

ENV \
  GDAL_DATA=${LAMBDA_TASK_ROOT}/share/gdal \
  PROJ_LIB=${LAMBDA_TASK_ROOT}/share/proj \
  GDAL_CONFIG=${LAMBDA_TASK_ROOT}/bin/gdal-config \
  GEOS_CONFIG=${LAMBDA_TASK_ROOT}/bin/geos-config \
  PATH=${LAMBDA_TASK_ROOT}/bin:$PATH

RUN yum update -y && \
    yum install -y libxml2-devel libxslt-devel python-devel gcc && \
    yum clean all && \
    rm -rf /var/cache/yum /var/lib/yum/history

COPY requirements.txt ${LAMBDA_TASK_ROOT}

RUN pip3 install --no-cache-dir -r requirements.txt

COPY src/task/task.py ${LAMBDA_TASK_ROOT}/task.py

That could be built and tagged as the base image, and then the Lambda and/or Batch images based on it.

Batch:

FROM my_base_task

ENTRYPOINT []

Lambda:

FROM my_base_task

CMD [ "task.handler" ]

Of course, if only one of Batch or Lambda is needed, these directives can be combined into a single image definition.

Task parameters

Tasks can take arguments at runtime via process definition parameters. See the Cirrus Process Payload docs for more information.
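As an illustrative sketch, a task might look up its parameters from the process definition like this. The exact payload layout (a "tasks" mapping keyed by task name under "process") is an assumption here; consult the Cirrus Process Payload docs for the authoritative structure:

```python
from typing import Any

def get_task_parameters(payload: dict[str, Any], task_name: str) -> dict[str, Any]:
    # Assumption: runtime parameters live under the process definition's
    # "tasks" mapping, keyed by task name.
    process = payload.get("process", {})
    if isinstance(process, list):
        # Some payloads carry a list of process definitions; use the first.
        process = process[0] if process else {}
    return process.get("tasks", {}).get(task_name, {})
```

stac-task exposes an equivalent lookup on its Task class, so tasks built on that library rarely need to parse the payload by hand.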