Batch tasks
Required files
Tasks that support Batch-only operation need just the standard
definition.yml
and README.md
files. Tasks that support both Batch and
Lambda will additionally need all files required for Lambda-based
components. In the Batch-only case specifically, the task
directory structure looks very similar to Lambda-based tasks:
<project_dir>/
tasks/
BatchTask/
definition.yml
README.md
Definition file
The definition.yml
contains a Lambda component’s configuration. The format
is similar to that used by the Serverless Framework, which underlies cirrus’s
deployment mechanism, but is subtly different.
Batch tasks include CloudFormation resource declarations in the
definition.yml
file for all resources required for the Batch execution
environment. At minimum, a Batch job definition resource is required, which
should specify a link to an ECR image managed/built via an external source.
Often Batch tasks include dedicated compute environment and job queue
resources. Other common resources found in Batch task definitions include
launch templates, IAM roles and profiles, and ECR repositories.
Here is an example definition.yml
file for a fairly complex Batch-only task
named Reproject
:
description: A sample Batch-only task definition
environment:
BATCH_VAR_1: some value
OVERRIDDEN_VAR: another_value
enabled: true
batch:
enabled: true
resources:
Resources:
ReprojectBatchJob:
Type: "AWS::Batch::JobDefinition"
Properties:
JobDefinitionName: '#{AWS::StackName}-Reproject'
Type: Container
Parameters:
url: ""
ContainerProperties:
Command:
- cirrus-batch.py
- process
- Ref::url
Environment:
- Name: JOB_DEF_VAR
Value: 1234
- Name: OVERRIDDEN_VAR
Value: last_value
ResourceRequirements:
- Type: VCPU
Value: 32
- Type: MEMORY
Value: 240000
- Type: GPU
Value: 4
Image: '123456789012.dkr.ecr.#{AWS::Region}.amazonaws.com/some-image-name:${opt:stage}'
ReprojectLaunchTemplate500GB:
Type: AWS::EC2::LaunchTemplate
Properties:
LaunchTemplateName: '#{AWS::StackName}-Reproject-500GB'
LaunchTemplateData:
BlockDeviceMappings:
- Ebs:
VolumeSize: 500
VolumeType: gp3
DeleteOnTermination: true
Encrypted: true
DeviceName: /dev/xvda
ReprojectComputeEnvironment500GB:
Type: AWS::Batch::ComputeEnvironment
Properties:
ComputeEnvironmentName: '#{AWS::StackName}-Reproject-500GB'
Type: MANAGED
ServiceRole: !GetAtt BatchServiceRole.Arn
ComputeResources:
MaxvCpus: 2000
SecurityGroupIds: ${self:custom.batch.SecurityGroupIds}
Subnets: ${self:custom.batch.Subnets}
Type: EC2
AllocationStrategy: BEST_FIT_PROGRESSIVE
MinvCpus: 0
InstanceRole: !GetAtt ReprojectInstanceProfile.Arn
LaunchTemplate:
LaunchTemplateId: !Ref ReprojectLaunchTemplate500GB
Version: $Latest
Tags: {"Name": "Batch Instance - #{AWS::StackName}"}
DesiredvCpus: 0
State: ENABLED
ReprojectJobQueue500GB:
Type: AWS::Batch::JobQueue
Properties:
JobQueueName: '#{AWS::StackName}-Reproject-500GB'
ComputeEnvironmentOrder:
- Order: 1
ComputeEnvironment: !Ref ReprojectComputeEnvironment500GB
State: ENABLED
Priority: 1
ReprojectInstanceRole:
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Principal:
Service:
- ec2.amazonaws.com
Action:
- sts:AssumeRole
Path: /
ManagedPolicyArns:
- arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role
Policies:
- PolicyName: Cirrus
PolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Action:
- s3:PutObject
Resource:
- Fn::Join:
- ''
- - 'arn:aws:s3:::'
- ${self:provider.environment.CIRRUS_DATA_BUCKET}
- '*'
- Fn::Join:
- ''
- - 'arn:aws:s3:::'
- ${self:provider.environment.CIRRUS_PAYLOAD_BUCKET}
- '*'
- Effect: Allow
Action:
- s3:ListBucket
- s3:GetObject
- s3:GetBucketLocation
Resource: '*'
- Effect: Allow
Action: secretsmanager:GetSecretValue
Resource:
- arn:aws:secretsmanager:#{AWS::Region}:#{AWS::AccountId}:secret:cirrus*
- Effect: Allow
Action:
- lambda:GetFunction
Resource:
- arn:aws:lambda:#{AWS::Region}:#{AWS::AccountId}:function:#{AWS::StackName}-*
ReprojectInstanceProfile:
Type: AWS::IAM::InstanceProfile
Properties:
Path: /
Roles:
- Ref: ReprojectInstanceRole
Let’s break down the resources at play in this Batch example.
Description
The top-level description
value is used for the component’s description
within Cirrus. It has no further purpose in the case of Batch.
Enabled state
Components can be disabled within Cirrus, which will exclude them from the
compiled configuration. All components support a top-level enabled
parameter
to completely enable/disable the component. Batch tasks also support
an enabled
parameter under the batch
key, which will enable/disable
just the Batch portion of the component.
For Batch-only components these enabled
controls function more or less
identically. For tasks that support both Batch and Lambda, the
lambda.enabled
and batch.enabled
paramters can prove useful in certain
circumstances. However, note that if the Lambda component of a dual
Lambda/Batch task is disabled, the Lambda deployment zip will not be
packaged/deployed and the Lambda will be deleted from AWS. This can leave the
Batch task unable to execute due to the missing code package.
Job definition
The ReprojectBatchJob
resource defines a CloudFormation resouce of job
definition type, and represents the job configuration used when running our
Reproject
job. The job definition includes such configuration settings as
the container image to run, the command to run inside that container, and the
resource requirements of the container. See the AWS Job Definition
CloudFormation reference for the full list of supported settings.
It is worth highlighting a few aspects of job definition resources.
Job parameters
The job definition Parameters
key defines a list of parameter and optional
default values that can be passed in to a job instance when run. In the example
above, the url
parameter is used to pass an S3 URL of the process payload
in to the executed command.
This is an important note: Batch has a rather low limit on the size of a job
sent to the SubmitJob API (30KiB at current). To mitigate impacts from this
limit, use the pre-batch
task immediately prior to any batch tasks to upload
the payload to S3 and return a url
to that payload, which can then be
referenced when calling the batch job as the value to the url
parameter.
In the ReprojectBatchJob
example resource above, we can see that the url
parameter is referenced in the executed command:
Command:
- cirrus-batch.py
- process
- Ref::url
which tells Batch to run a command like:
❯ cirrus-batch.py process <contents_of_url_parameter>
Exactly what command should be specified for a job definition is dependent on
the appropriate entry point inside the specified container image. Regardless,
that entry point should be expecting an S3 URL to a process payload, specified
in some manner. cirrus-lib
provides convenince classes/methods to help with
this common need.
The Batch tasks should replace the payload in S3 at the end of execution after
any modifications. Follow the Batch task with the post-batch
task to resolve
that S3 URL into a JSON payload to pass to successive tasks. post-batch
will
also pull any Batch errors from the logs and raise them within the workflow, in
the event of an unsuccessful Batch execution.
See Batch tasks in workflows for an example of how a
payload is passed to a job using this url
parameter, how pre-batch
and
post-batch
are used, and some other tips regarding Batch tasks in workflows.
Job parameters can also be used for other job settings, but are most commonly
used within the Command
specification in Cirrus.
Environment variables
Batch job definition resources support defining a list of environment variable
names and values, similar to Lambda functions, though with a slightly different
format. Like Lambda tasks, Batch tasks job definitions support the task
definition’s top-level environment
specification, which they inherit, along
with any environment variable defined globally in the cirrus.yml
file under
the provider.environment
key, with preference given to any duplicate
varaibles defined on the Batch job defintion.
Additionally, AWS_REGION
and AWS_DEFAULT_REGION
are added to the job
defintion’s environment variables with the value derived from the stack’s
deployment region.
If ever in doubt about the final environment variables/values (or the values of
any other parameters) used in a Batch task definition, the cirrus
cli
provides a show
command that runs the full configuration interpolation to
generate the “complete” definition as it appears in the compiled configuration
generated by the build
command. Run it like this:
❯ cirrus show task <TaskName>
Resource requirements
The ResourceRequirements
key allows specification of a list of all hardware
resources required by the job (unfortunately with the exception of disk space).
Note that the values provided here serve as defaults for spawned jobs, and can
be overriden when calling SubmitJob
in the workflow. Again, see Batch
tasks in workflows for an example of overriding resource
requirements.
The specified resource requirements are used by the compute environment to pick an appropriate-sized instance type for the job, either by doing a best fit across all available instance types, or by selecting the best fit instance type from a user-provided list. Additional factors come into play with instance selection such as whether the compute environment is using on-demand or spot instance.
Optimizing task resource requirements to the minimum required is critical. While doing so certainly provides an important cost savings, often the more meaningful reason to do so is to ensure fast instance start up time. Larger instances can take much longer to become available than small instance, delaying instance provisioning and therefore job start.
Image specification
The Image
key accepts an image name within a docker registry in the form
repository-url/image:tag
. If omitted, the repository-url
will point to
Docker Hub.
For Cirrus tasks, using the AWS Elastic Container Registry to store images is
common, as is show in the example Image
value:
123456789012.dkr.ecr.#{AWS::Region}.amazonaws.com/some-image-name:${opt:stage}
Note the use of the Serverless parameter ${opt:stage}
, which allows
specification of an image tag based on the stage in a multi-stage deployment
pipeline. For example, if we have a deployment pipeline with the stages,
dev
, staging
, and prod
, we will want to ensure we have image
versions in the ECR repo with tags of those same names.
Compute environments
Compute environments are perhaps the most complex of the Batch resources. Users are strongly encouraged to read through both the Batch compute environment documentation and the CloudFormation compute environment documentation to gain an understanding of the role of compute environments, how they can be used, and what options are available for controlling how jobs are executed within them.
Within the Cirrus context, it is recommended to use MANAGED
compute
environments. Whether to make use of Fargate or EC2 for job execution is highly
dependent on the workload involved. Many geospatial processing tasks a
compute-intensive and make heavy, constant use of instance CPUs, which often
tips the balance in favor of EC2 for the cost savings. EC2 also allows great
flexibility, at the expense of having more complex configuration to manage.
Naming compute environments
Naming a compute environment is seemingly desirable as the autogenerated names created when a name is omitted are often less than useful and potentially shortened in ways that incite confusion. Unfortunately, as a replace-only resource, using a name can lead to an issue if needing to update an existing compute environment, due to name conflict between the existing environment and its replacement.
Sometimes that restriction is advantageous, as it acts as something of a barrier to prevent potentially service-impacting updates. However, for some projects, omitting the name may be preferable, as doing so allows updates without requiring explicit name changes or resource duplication.
Review the Batch resource management strategies for more information.
Compute resources
The majority of the compute environment configuration is provided by the
ComputeResources
settings. Refer to the compute resources CloudFormation
documentation for a complete list of all supported options.
Compute environment scaling is defined by several parameters, most notably
AllocationStrategy
and MaxvCpus
. As jobs are submitted with a desired
CPU count, the compute environment responds by spinning up instances to match
the total number of CPUs required by all executing jobs. Instances are allocated
to the compute environment using the specified AllocationStrategy
. In some
cases, the desired instance type may not be available, and more-strict
strategies may prevent substitute instance types from being allocated, causing
jobs to wait for instance to become available. A similar situation can happen
with large resource demands, where even less-strict allocation strategies cannot
find a suitable instance and jobs have to wait.
Specifying a MinvCpus
value as a multiple of the number of jobs the compute
environment should minimally accommodate without waiting can be a viable
mechanism for dealing with instance inavailability. That is, if needing to
ensure ten jobs each requiring four CPUs can run without a wait then a
MinvCpus
value of 40 will ensure enough instances are continuously running
to support those jobs. However, using this parameter can add significant idle
costs and is not recommended unless strictly required. It also does not help
mitigate latency in the case of job bursts beyond the minimum constant capacity.
Back to scaling: the compute environment will continue to allocate instances
until the total number of CPUs in the environment matchs the total CPU demand
from jobs. However, this allocation will only continue as long as MaxvCpus
is greater than the number of CPUs in the environment. In this way MaxvCpus
acts as the cap on instance count and therefore the maximum number of Batch jobs
that can be running at any given time. Therefore, ensuring MaxvCpus
is
appropriately set is important; an optimal value can be calculated by
multiplying the maximum number of simultaneous jobs required by the number of
CPUs each job requires.
Using the AWS spot market
Launch templates
Launch templates provide a way to run scripts, apply configuration, and make
other initialization changes to EC2 instances started in a compute environment.
Perhaps most commonly, launch templates are used to increase the root disk size
to ensure enough space is available for running containers and any scratch space
they may require. The ReprojectLaunchTemplate500GB
resource in the example
definition.yml
is doing exactly that, increasing the root disk to 500GB from
the default 30GB.
Other common uses of launch templates for Batch tasks include mounting an EFS volume or tweaking the ECS container agent settings.
When using launch templates with compute environments please note that updating a luanch template will not affect any existing compute environments referencing that launch template. The launch template referenced at compute environment creation is cached independently of the base version, and cannot be updated. If needing launch template changes to apply to an existing compute environment the compute environment must to be recreated so the new environment can pull the new launch template version.
Consult the CloudFormation documentation for launch templates to learn more.
IAM permissions
Looking closely at the compute environment in the above example, one will notice
two IAM role keys: ServiceRole
and ComputeResources.InstanceRole
.
ServiceRole
is the IAM role used by AWS Batch and normally requires a fairly
standard set of permissions. Therefore, the same role is commonly shared across
all compute environments as the permissions typically do not differ between
environments (that role is not part of the example for that reason).
The ComputeResources.InstanceRole
is the role used for each container
instance, and is therefore rather specific to the Batch task at hand. Unlike
ServiceRole
the instance role parameter does not expect an IAM role ARN;
instead it expects an IAM instance profile. Consequently, the above example
features both the ReprojectInstanceRole
IAM role resource and the
ReprojectInstanceProfile
IAM instance profile referencing the former. We
then resolve the profile’s ARN and pass that to the compute environment, and it
can use that profile to associate the desired role and its polices to all Batch
job containers.
Commands run as Batch jobs therefore get the permissions allowed by the specified IAM role, in a similar manner to the unique role created and used for Lambda-based components. Ensure this role has all required permissions and no more, so the Batch task does not encounter any permissions errors but also cannot access unexpected resources. Roles/profiles can be shared between compute environments, but doing so is discouraged.
Container overrides can be used when calling SubmitJob
to change the profile
used for a specific job. This feature can be useful for advanced users
attempting to run multiple jobs are executed in the same compute environment
(again, having unique compute environments per task is typically recommended).
Job queues
Compute environments are not actually referenced when submitting a job. Instead, a job queue is specified, which itself provides a link to a specific compute environment. Job queues are used as a means of holding submitted jobs while waiting for available CPUs in a saturated compute environment, and can also provide prioritization in the case where different types of jobs share a single compute environment.
Multiple compute environment can also be specified for a single queue. This can be useful in the case of wanting some on-demand capacity, but pushing overflow into the spot market, or vise versa.
Job queues can be combined with a Batch scheduling policy for advanced use-cases.
See the job queue CloudFormation documentation for more information about supported job queue configurations.
Other considerations
Other CloudFormation template sections
In addition to support for CloudFormation Resources
under
batch.resources
, Cirrus also supports defining other CloudFormation
template section types such as Outputs
or Conditions
. Use those as
required to keep such items together with the associated Batch task.
Batch CloudFormation resources in the cli
The cirrus
cli allows search/discovery of all CloudFormation resources in a
project. All resources within a project can be listed with the show
command:
❯ cirrus show cloudformation
[Outputs]
CirrusQueueSnsArn (built-in)
[Resources]
AddPreviewAsBatchJob [AWS::Batch::JobDefinition] (from built-in task add-preview)
AddPreviewComputeEnvironment [AWS::Batch::ComputeEnvironment] (from built-in task add-preview)
AddPreviewJobQueue [AWS::Batch::JobQueue] (from built-in task add-preview)
BasicOnDemandComputeEnvironment [AWS::Batch::ComputeEnvironment] (built-in)
BasicOnDemandJobQueue [AWS::Batch::JobQueue] (built-in)
...
The CloudFormation items are broken down by types, and show the source. For
Batch resources, they will look something like AddPreviewAsBatchJob
, where
it shows that the resource is specifically from the built-in batch-enabled task
add-preview
. In this way it is easy to identify a given resource, output, or
other CloudFormation object and determine if its origin is a Batch task, and if
so, which one.
Batch Quotas
AWS limits the number of job queues and compute environments in an account to 50 each. Considering this limit is important when determining how to structure/organize a project’s compute environments. In a large, batch-heavy deployment, consolidating compute environments and job queus such that they can be shared between tasks may be advantegeous or even necessary to ensure the deployment can remain below these quotas. If diverging from the general recommendation of a unique job queue and compute environment per task, be sure to fully consider instance requirement compatibilities between tasks (including instance AMI selection), job queue scheduling policies and prioritization mechanisms, and compute environment capacities.
Also consider the deployment downtime requirements and how changes to compute environments must be managed per the following guidelines, making sure that the chosen strategy will have enough headroom within the quotas.
Managing changes to Batch resources
Observed issues
Several different service-impacting issues can result from changes to Batch resources. The following is an attempt to capture those issues and the affected resource types, though it is not an exhaustive list of potential problems.
Workflows started during a deployment can have broken Batch configurations
A step function referencing a Batch job definition does so via the definition
ARN, including revision, when using the standard reference syntax like
#{JobDefinitionName}
. When deploying a new revision of a job definition,
CloudFormation automatically deactivates the old revision before the step
function is updated. Any workflow executions trying to start a batch job between
the deactivation of the old revision and the step function update will fail.
Batch job definitions “roll forward” on CloudFormation rollback
If CloudFormation encounters an error during stack deployment and has to rollback after updating a Batch job definition, the old job revision is not reactivated. Rather, the job definition is “rolled forward,” such that the old definition is used to create a second new revision. It looks something like this:
Job definition A B A
Revision number 1 -> 2 -> 3
At the time the updated definition with B is created as revision 2, revision 1 is deactivated. Then, on rollback, CloudFormation re-deploys the definition with A as revision 3, deactivating revision 2. But, like the above temporary issue with job definition revisions, the step function definition will not be updated and still points to revision 1. Unlike the above issue, this case results in a premanent issue, unless fixed by another deployment or manual configuration changes.
Killed jobs on job queue removal
Perhaps obviously, if deleting a job queue all associated jobs will be killed. While not typical, it is important to note when making large changes/refactoring existing compute environments/job queues, or simply just renaming a template resource.
What to do about these issues
In essences, each of these issues results from trying to make updates to Batch infrastructure while jobs are running or can be started. How best to mitigate these issues, therefore, depends on project uptime requirements and/or how constantly jobs are run as part of the pipeline.
Downtime an issue, jobs continuously running
In the case where jobs are continuously running and any pipeline downtime is undeserable, the best management strategy is to avoid any Batch resource updates, instead deferring to strategy of duplicating all changed resources. Commonly, this results in something like a blue-green deployment where every Batch resource has two copies, a revision A and revision B. Changes then alternate between the two revisions, ensuring the active revision is not updated at any time.
In the provided Batch example above, we would end up with a list of resources like:
ReprojectBatchJobRevA
ReprojectLaunchTemplate500GBrevA
ReprojectComputeEnvironment500GBrevA
ReprojectJobQueue500GBrevA
ReprojectInstanceRoleRevA
ReprojectInstanceProfileRevA
ReprojectBatchJobRevB
ReprojectLaunchTemplate500GBrevB
ReprojectComputeEnvironment500GBrevB
ReprojectJobQueue500GBrevB
ReprojectInstanceRoleRevB
ReprojectInstanceProfileRevB
In this circumstance, it is advantageous to name things like the compute environments and job queues to prevent updates and try to force the duplication workflow.
If currently using the revision A resources and needing to update, say, the launch template, the procedure would be as follows:
Copy
ReprojectLaunchTemplate500GBrevA
asReprojectLaunchTemplate500GBrevB
and update as requiredCopy
ReprojectComputeEnvironment500GBrevA
toReprojectComputeEnvironment500GBrevB
and change the latter to point to the new launch templateReprojectLaunchTemplate500GBrevB
Copy
ReprojectJobQueue500GBrevA
toReprojectJobQueue500GBrevB
and update the copy to referenceReprojectComputeEnvironment500GBrevB
Update all workflow references to
ReprojectJobQueue500GBrevA
to point toReprojectJobQueue500GBrevB
On deploy, CloudFormation should perform the following operations, in order:
Create the new launch template
ReprojectLaunchTemplate500GBrevB
Create the new compute environment
ReprojectComputeEnvironment500GBrevB
Create the new job queue
ReprojectJobQueue500GBrevB
Update any workflow step functions per the new job queue reference
If at any point in this deployment an error is encountered, the step functions and the old batch resources are left unmodified. The case of a new workflow execution prior to the step function updates is similar, in that the step functions still point to old batch resources which can continue to process jobs.
After a successful deployment of the revision B resources and confirmation that all running jobs have completed, the old revision A resources can be removed entirely. Next time changes are required the revision B resources can be copied to revision A resources.
The above steps are the minimal set of changes for the example launch template update. In practice it is often easiest to copy all resources at once, to ensure all resources are consistently using revision A or B.
If needing to use this management strategy for batch resources, be sure to remember the Batch resource quota. Ensure enough headroom is present at all times in the batch resource totals to allow any possible changes to take place.
Downtime okay, jobs intermittent
In the case where downtime is acceptable and jobs are intermittent and/or can fail without issues, avoiding the complexities of the above management strategy may be preferable. In this case, use the simpler strategy of simply updating resources and handling any potential issues as they occur during deployment. In this case it might be best to omit names from resources like compute environments and job queues; else plan to change the names on any update.