Tasks
Tasks in Cirrus implement a unit of processing, to be composed together into a Workflow. Tasks are expected to support both input and output formatted as a Cirrus Process Payload. As part of its processing, a task can make any requisite modifications to its input payload and/or derive any output assets, pushing them to the canonical storage location in S3.
In other words, to implement custom processing routines for a pipeline, use a task. The best tasks are modular, simple, focused, and composable. Most projects end up with more custom tasks than other component types, so it pays to be familiar with their ins and outs.
Tasks can make use of AWS Lambda and/or AWS Batch for execution. Lambda tasks are simpler to manage and quicker to start up, but the Lambda runtime constraints can be prohibitive or untenable for some task workloads. In those cases, Batch allows for extended runtimes, greater resource limits, and specialized instance types.
In a Cirrus project, tasks are stored inside the tasks/ directory, each in a subdirectory named for the task. Each task requires a definition.yml file with the task’s configuration, and a README.md file documenting the task’s usage.
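The layout described above can be checked mechanically. The following is a minimal sketch, not part of the cirrus tooling: `find_incomplete_tasks` is a hypothetical helper name, and it assumes only what the text states, namely that each subdirectory of tasks/ should contain definition.yml and README.md.

```python
from pathlib import Path

# Files the text above says every task directory must contain.
REQUIRED_FILES = ("definition.yml", "README.md")


def find_incomplete_tasks(project_root):
    """Map each task name to the required files missing from its directory.

    Hypothetical helper for illustration; not provided by the cirrus cli.
    """
    tasks_dir = Path(project_root) / "tasks"
    missing = {}
    if not tasks_dir.is_dir():
        return missing
    for task_dir in sorted(p for p in tasks_dir.iterdir() if p.is_dir()):
        absent = [name for name in REQUIRED_FILES if not (task_dir / name).is_file()]
        if absent:
            missing[task_dir.name] = absent
    return missing
```

Calling `find_incomplete_tasks(".")` from the project root returns an empty dict when every task directory has both files.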
Anatomy of a task
Generally speaking, every task should do a few key things:
Take an input Cirrus Process Payload
    In the case of Batch tasks and/or large payloads, tasks should support receiving a url input parameter pointing to a payload object in S3
Instantiate a cirrus.lib.ProcessPayload instance from the input payload JSON
Download all required assets from the items in the input payload
Perform any asset metadata manipulation and/or derived product processing
Update/replace payload items based on task outputs
Upload any output assets to S3 for persistence
Return the output Cirrus Process Payload
    In the case of Batch tasks and/or large payloads, tasks should support uploading the output payload to S3 and returning an output url parameter pointing to that payload object in S3
Certain tasks may deviate from this pattern, but the vast majority of tasks will follow this flow. cirrus-lib provides convenience classes/methods to help with these common needs.
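The flow above can be sketched in plain Python to show the shape of a task, in particular the url handling for large payloads. This is an illustration under stated assumptions, not the cirrus-lib API: a real task would build a cirrus.lib.ProcessPayload rather than work on a bare dict, the payload items are assumed to live under a `features` key, and `fetch_text`, `store_text`, and `process_item` are hypothetical callables standing in for S3 reads, S3 writes, and the task's own processing.

```python
import json


def load_payload(event, fetch_text):
    """Return the payload dict, following a `url` reference when present."""
    if "url" in event:
        return json.loads(fetch_text(event["url"]))
    return event


def run_task(event, fetch_text, store_text, process_item, output_url=None):
    """Run the generic task flow over a payload given as a plain dict.

    All callables are hypothetical stand-ins supplied by the caller.
    """
    payload = load_payload(event, fetch_text)
    # Replace each item with the result of the task's own processing.
    # Assumption: items are stored under the "features" key.
    payload["features"] = [process_item(item) for item in payload.get("features", [])]
    if output_url is None:
        return payload
    # Large payloads: persist the output and hand back only a reference to it.
    store_text(output_url, json.dumps(payload))
    return {"url": output_url}
```

Injecting the storage callables keeps the sketch testable without AWS access; in a deployed task those roles would be filled by whatever S3 helpers the project uses.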
Lambda tasks
Lambda tasks use the AWS Lambda runtime to power executions. Lambda has the advantage of quick startup and easy management, but imposes restrictions such as short timeouts and tight resource limits.
Lambda-only tasks follow the specifications outlined in the Lambda-based components documentation. Refer there for specifics on what files are required for Lambda tasks and how to structure the definition.yml file.
Batch tasks
Batch tasks use AWS Batch semantics to define jobs that execute within a compute environment as determined by the job queue to which the job is submitted. Batch compute environments can make use of Fargate or EC2 to run jobs, allowing significantly more control over the execution environment than Lambda allows, as well as much greater limits on resources.
Batch tasks are inherently just an abstraction around a set of CloudFormation resources: minimally a Batch job definition, but commonly also the job queue, compute environment, and any other required resources.
For more information see the Batch tasks documentation.
Lambda vs Batch
When to choose Lambda
small code size/not many dependencies
single-threaded
short runtime (15 minutes max)
need code to live in the cirrus project repo
When to choose Batch
long runtimes
large package size/non-native dependencies
can use multiple CPUs
easier to manage code as separate container images
need significant RAM
need more than 10GB disk
need special hardware resources (e.g., GPU)
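As a rough illustration, the hard limits from the two lists above can be written as a small decision helper. This is a sketch only: `Workload` and `suggest_runtime` are hypothetical names, the 15 minute and 10GB thresholds are taken from the lists above, and softer criteria such as package size, RAM, or where the code should live are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Rough, hypothetical description of what a task needs to run."""

    runtime_minutes: float
    disk_gb: float = 0.0
    needs_gpu: bool = False
    multi_cpu: bool = False


def suggest_runtime(workload):
    """Return "batch" when a listed Lambda limit is exceeded, else "lambda"."""
    if workload.runtime_minutes > 15:  # Lambda runtime ceiling from the list above
        return "batch"
    if workload.disk_gb > 10:  # disk threshold from the list above
        return "batch"
    if workload.needs_gpu or workload.multi_cpu:
        return "batch"
    return "lambda"
```

For example, `suggest_runtime(Workload(runtime_minutes=45))` returns `"batch"`, while a five-minute, single-CPU job with modest disk needs returns `"lambda"`.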
Creating a new task
Creating a new task involves creating a directory with the task name under tasks/ and the required files inside it. Getting everything set up with all the requisite boilerplate takes some minor work. The cirrus cli includes a convenience command to automate getting started with a new task.
Lambda-only
To create a Lambda-only task, simply create a new task with a description and the options --has-lambda and --no-batch:
❯ cirrus create task --has-lambda --no-batch <TaskName> "<task description>"
This command will create the task directory and required files from a minimal template. The new task will need to have the custom handler code added, and the definition.yml configuration will need to be validated to ensure it matches the task requirements. Any usage information should also be added to the README.md file.
Batch-only
To create a Batch-only task, simply create a new task with a description, but add the --has-batch and --no-lambda options:
❯ cirrus create task --has-batch --no-lambda <TaskName> "<task description>"
The task directory and required files will be created from a minimal template. The templated Batch configuration in the definition.yml should be considered a rough starting point, and will require fairly significant modification for most uses. Be sure to also update the README.md file with usage information.
Lambda and Batch
For tasks that should support both Lambda and Batch, run the create command, this time using the options --has-lambda and --has-batch:
❯ cirrus create task --has-lambda --has-batch <TaskName> "<task description>"
This command does the same as both of the above create command examples, so the listed caveats of both apply here: ensure the handler code is completed, and the Batch configuration is updated to match the task requirements.
Task parameters
Tasks can take arguments at runtime via process definition parameters. See the Cirrus Process Payload docs for more information. When authoring a task, be sure to document all supported task parameters in the task’s README.md. When using an existing task, the task README can always be viewed via the cli:
❯ cirrus show task <TaskName> readme
This will dump the README.md contents to the terminal with appropriate markup applied.
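To make the idea of task parameters concrete, here is a minimal sketch of how a task might read its parameters from a payload dict and fall back to documented defaults. It assumes the parameters sit under `process.tasks.<task name>` in the payload; confirm the actual structure against the Cirrus Process Payload docs. `get_task_parameters` is a hypothetical helper, not a cirrus-lib function.

```python
def get_task_parameters(payload, task_name, defaults=None):
    """Merge a task's runtime parameters from the payload over its defaults.

    Assumption: parameters are stored under payload["process"]["tasks"][task_name].
    """
    params = dict(defaults or {})
    tasks = payload.get("process", {}).get("tasks", {})
    params.update(tasks.get(task_name, {}))
    return params
```

With a hypothetical task named `reproject` whose README documents an `epsg` parameter, `get_task_parameters(payload, "reproject", {"epsg": 3857})` returns the payload's value when one is given and the default otherwise.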
Running tasks locally
We are working to standardize task code and cirrus cli tooling to provide an easy and consistent means to execute tasks locally. This feature is still under development, so for now please consult the project or task documentation for further information (if available).