Tasks

Tasks in Cirrus implement a unit of processing, to be composed together into a Workflow. Tasks are expected to support both input and output formatted as a Cirrus Process Payload. As part of its processing, a task can make any requisite modifications to its input payload and/or derive any output assets, pushing them to the canonical storage location in S3.

In other other words, to implement custom processing routines for a pipeline, use a task. The best tasks are modular, simple, focused, and composable. Most projects end up with more custom tasks than other component types, so it pays to be familiar with the tasks ins and outs.

Tasks can make use of AWS Lambda and/or AWS Batch for execution. Lambda tasks are simpler to manage and quicker to start up, but the Lambda runtime constraints can be prohibitive or untenable for some task workloads. In those cases, Batch allows for extended runtimes, greater resource limits, and specialized instance types.

In a Cirrus project, tasks are stored inside the tasks/ directory, each in a subdirectory named for the task. Each task requires a definition.yml file with the task’s configuration, and a README.md file documenting the task’s usage.

Anatomy of a task

Generally speaking, every task should do a few key things:

  • Take an input Cirrus Process Payload

    • In the case of Batch tasks and/or large payloads, tasks should support receiving a url input parameter pointing to a payload object in S3

  • Instantiate a cirrus.lib.ProcessPayload instance from the input payload JSON

  • Download all required assets from the items in the input payload

  • Perform any asset metadata manipulation and/or derived product processing

  • Update/replace payload items based on task outputs

  • Upload any output assets to S3 for persistence

  • Return the output Cirrus Process Payload

    • In the case of Batch tasks and/or large payloads, tasks should support uploading the output payload to S3 and returning an output url parameter pointing to that payload object in S3

Certain tasks may deviate from this pattern, but the vast majoity of tasks will follow this flow. cirrus-lib provides convenince classes/methods to help with these common needs.

Lambda tasks

Lambda tasks use the AWS Lambda runtime to power executions. Lambda has the advantage of quick startup and easy management, but has many restrictions like short timeouts and significant resource limits.

Lambda-only tasks follow the specifications outlined in the Lambda-based components documentation. Refer there for specifics on what files are requried for Lambda tasks and how to structure the definition.yml file.

Batch tasks

Batch tasks use AWS Batch semantics to define jobs that execute within a compute environment as determined by the job queue to which the job is submitted. Batch compute environments can make use of Fargate or EC2 to run jobs, allowing significantly more control over the execution environment than Lambda allows, as well as much greater limits on resources.

Batch tasks are inherently just an abstraction around a set of CloudFormation resources, minimally just a Batch job definition, but commonly also the job queue, compute environment, and any other requried resources.

For more infomation see the Batch tasks documentation.

Lambda vs Batch

When to chose Lambda

  • small code size/not many dependencies

  • single-threaded

  • short runtime (no more than 15 minutes max)

  • need code to live in the cirrus project repo

When to chose Batch

  • long runtimes

  • large package size/non-native dependecies

  • can use multiple CPUs

  • easier to manage code as separate container images

  • need significant RAM

  • need more than 10GB disk

  • need special hardware resources (e.g., GPU)

Creating a new task

Creating a new task involves creating a directory with the task name under tasks/ and the required files inside it. Getting everything setup with all the requisite boiler-plate takes some minor work. The cirrus cli includes a convenience function to automate getting started with a new task.

Lambda-only

To create a lambda-only task, simply create a new task with a description and the options --has-lambda and --no-batch:

❯ cirrus create task --has-lambda --no-batch <TaskName> "<task description>"

This command will create the task directory and required files from a minimal template. The new task will obviously need to have the custom handler code added, and the definition.yml configuration will need to be validated to ensure it matches the task requirements. Any usage information should also be added to the README.md file.

Batch-only

To create a Batch-only task, simply create a new task with a description, but add the --has-batch and --no-lambda options:

❯ cirrus create task --has-batch --no-lambda <TaskName> "<task description>"

The task directory and required files will be created from a minimal template. The templated Batch configuration in the definition.yml should be considered a rough starting point, and will require fairly significant modification for most uses. Be sure to also update the README.md file with usage information.

Lambda and Batch

For tasks that should support both Lambda and Batch, run the create command, this time using the options --has-lambda and --has-batch:

❯ cirrus create task --has-lambda --has-batch <TaskName> "<task description>"

This command does the same as both of the above create command examples, so the listed caveats of both apply here: ensure the handler code is completed, and the batch configuration is updated to match the task requirements.

Task parameters

Tasks can take arguments at runtime via process definition parameters. See the Cirrus Process Payload docs for more information. When authoring a task, be sure to document all supported task parameters in the task’s README.md. In using an existing task, the task README can always be view via the cli:

❯ cirrus show task <TaskName> readme

This will dump the README.md contents to the terminal with appropriate markup applied.

Running tasks locally

We are working to standardize task code and cirrus cli tooling to provide an easy and consistent means to execute tasks locally. This feature is still under development, so for now please consult the project or task documentation for further information (if available).