Tasks ===== .. toctree:: :hidden: :maxdepth: 2 batch Tasks in Cirrus implement a unit of processing, to be composed together into a :doc:`Workflow <../workflows/index>`. Tasks are expected to support both input and output formatted as a :doc:`Cirrus Process Payload <../../30_payload>`. As part of its processing, a task can make any requisite modifications to its input payload and/or derive any output assets, pushing them to the canonical storage location in S3. In other other words, to implement custom processing routines for a pipeline, use a task. The best tasks are modular, simple, focused, and composable. Most projects end up with more custom tasks than other component types, so it pays to be familiar with the tasks ins and outs. Tasks can make use of AWS Lambda and/or AWS Batch for execution. Lambda tasks are simpler to manage and quicker to start up, but the Lambda runtime constraints can be prohibitive or untenable for some task workloads. In those cases, Batch allows for extended runtimes, greater resource limits, and specialized instance types. In a Cirrus project, tasks are stored inside the ``tasks/`` directory, each in a subdirectory named for the task. Each task requires a ``definition.yml`` file with the task's configuration, and a ``README.md`` file documenting the task's usage. Anatomy of a task ----------------- Generally speaking, every task should do a few key things: * Take an input Cirrus Process Payload * In the case of Batch tasks and/or large payloads, tasks should support receiving a ``url`` input parameter pointing to a payload object in S3 * Instantiate a ``cirrus.lib.ProcessPayload`` instance from the input payload JSON * Download all required assets from the items in the input payload * Perform any asset metadata manipulation and/or derived product processing * Update/replace payload items based on task outputs * Upload any output assets to S3 for persistence * Return the output Cirrus Process Payload * In the case of Batch tasks and/or large payloads, tasks should support uploading the output payload to S3 and returning an output ``url`` parameter pointing to that payload object in S3 Certain tasks may deviate from this pattern, but the vast majoity of tasks will follow this flow. ``cirrus-lib`` provides convenince classes/methods to help with these common needs. Lambda tasks ^^^^^^^^^^^^ Lambda tasks use the `AWS Lambda`_ runtime to power executions. Lambda has the advantage of quick startup and easy management, but has many restrictions like short timeouts and significant resource limits. Lambda-only tasks follow the specifications outlined in the :doc:`Lambda-based components <../lambdas>` documentation. Refer there for specifics on what files are requried for Lambda tasks and how to structure the ``definition.yml`` file. .. _AWS Lambda: https://docs.aws.amazon.com/lambda/latest/dg/welcome.html Batch tasks ^^^^^^^^^^^ Batch tasks use `AWS Batch`_ semantics to define `jobs`_ that execute within a `compute environment`_ as determined by the `job queue`_ to which the job is submitted. Batch compute environments can make use of `Fargate`_ or `EC2`_ to run jobs, allowing significantly more control over the execution environment than Lambda allows, as well as much greater limits on resources. Batch tasks are inherently just an abstraction around a set of CloudFormation resources, minimally just a Batch `job definition`_, but commonly also the job queue, compute environment, and any other requried resources. For more infomation see the :doc:`Batch tasks ` documentation. .. _AWS Batch: https://docs.aws.amazon.com/batch/latest/userguide/what-is-batch.html .. _jobs: https://docs.aws.amazon.com/batch/latest/userguide/jobs.html .. _compute environment: https://docs.aws.amazon.com/batch/latest/userguide/compute_environments.html .. _job queue: https://docs.aws.amazon.com/batch/latest/userguide/job_queues.html .. _Fargate: https://docs.aws.amazon.com/eks/latest/userguide/fargate.html .. _EC2: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html .. _job definition: https://docs.aws.amazon.com/batch/latest/userguide/job_definitions.html Lambda vs Batch --------------- When to chose Lambda ^^^^^^^^^^^^^^^^^^^^ .. TODO * small code size/not many dependencies * single-threaded * short runtime (no more than 15 minutes max) * need code to live in the cirrus project repo When to chose Batch ^^^^^^^^^^^^^^^^^^^ .. TODO * long runtimes * large package size/non-native dependecies * can use multiple CPUs * easier to manage code as separate container images * need significant RAM * need more than 10GB disk * need special hardware resources (e.g., GPU) .. Omitting discussion of Batch Lambdas for the moment Or maybe both? ^^^^^^^^^^^^^^ Sometimes both Lambda and Batch can fit the task requirements, depending on input. Other times, avoiding the overhead of managing/deploying a Batch container image makes Lambda attractive, but runtime constraints like max execution time mean that only Batch is viable. In each of these cases, one can specify a task as *both* Lambda and Batch, using the Cirrus Batch Lambda runner container to run the packaged Lambda code. Doing this allows the user to choose which execution mechanism is most appropriate in a given context. This could be parameterized in a workflow based on the input (like a ``batch`` flag in the task parameters), or on logic in an separate input inspection Lambda. Some workflows could always run the Lambda version of a task, and others the batch version. Or maybe the Batch version is the only ever actually used, taking advantage of the Lambda packaging support solely to make it easier to keep task code inside the Cirrus project. Using the Batch Lambda runner ***************************** See the `cirrus-task-image`_ repo for more information. .. _cirrus-task-image: https://github.com/cirrus-geo/cirrus-task-images Creating a new task ------------------- Creating a new task involves creating a directory with the task name under ``tasks/`` and the required files inside it. Getting everything setup with all the requisite boiler-plate takes some minor work. The ``cirrus`` cli includes a convenience function to automate getting started with a new task. Lambda-only ^^^^^^^^^^^ To create a lambda-only task, simply create a new task with a description and the options ``--has-lambda`` and ``--no-batch``:: ❯ cirrus create task --has-lambda --no-batch "" This command will create the task directory and required files from a minimal template. The new task will obviously need to have the custom handler code added, and the ``definition.yml`` configuration will need to be validated to ensure it matches the task requirements. Any usage information should also be added to the ``README.md`` file. Batch-only ^^^^^^^^^^ To create a Batch-only task, simply create a new task with a description, but add the ``--has-batch`` and ``--no-lambda`` options:: ❯ cirrus create task --has-batch --no-lambda "" The task directory and required files will be created from a minimal template. The templated Batch configuration in the ``definition.yml`` should be considered a rough starting point, and will require fairly significant modification for most uses. Be sure to also update the ``README.md`` file with usage information. Lambda and Batch ^^^^^^^^^^^^^^^^ For tasks that should support both Lambda and Batch, run the ``create`` command, this time using the options ``--has-lambda`` and ``--has-batch``:: ❯ cirrus create task --has-lambda --has-batch "" This command does the same as both of the above ``create`` command examples, so the listed caveats of both apply here: ensure the handler code is completed, and the batch configuration is updated to match the task requirements. Task parameters --------------- Tasks can take arguments at runtime via process definition parameters. See the :doc:`Cirrus Process Payload <../../30_payload>` docs for more information. When authoring a task, be sure to document all supported task parameters in the task's ``README.md``. In using an existing task, the task README can always be view via the cli:: ❯ cirrus show task readme This will dump the ``README.md`` contents to the terminal with appropriate markup applied. Running tasks locally --------------------- We are working to standardize task code and ``cirrus`` cli tooling to provide an easy and consistent means to execute tasks locally. This feature is still under development, so for now please consult the project or task documentation for further information (if available).