IaC is necessary for most cloud projects, data engineering included — but it can get tricky in some setups. Here's how to solve one of those.
Anyone using Amazon ECS together with any Infrastructure as Code tool, such as Terraform, CDK or Pulumi, will inevitably face a subtle issue: how to keep the ever-changing Task Definition Revisions and container image versions reliably in sync with the IaC code, which is itself versioned. To complicate things, any proper solution needs to work across multiple environments.
In this article, we’ll focus on solving a relatively common issue in cloud-native data engineering projects: with Apache Airflow (AWS MWAA) running dbt containers scheduled by AWS ECS, how can we take care of semantic versioning, IaC and all the best DevOps practices in a sane way?
Amazon ECS is a container orchestration service. You instruct it to run a given container and it will handle the heavy lifting for you — find an appropriate compute environment and then run & manage the container’s lifecycle.
The service also handles load balancing, autoscaling, smooth deployment, service discovery, logging and more (by itself or by natively integrating with other AWS services that provide these features). ECS is comparable to technologies such as Kubernetes, Marathon or Nomad.
ECS can pull images from various sources, including public Docker Registries or AWS ECR — a technology that simply stores your container images in AWS.
The basic, simplified flow (which skips things like ECS Agents or ECS Services, as those are irrelevant to this article) looks as follows:
We can simply tell ECS to run our app at tag “1.0.0”, and it will do so.
Let’s assume some time passed by and a new version of our app has just been developed. Using the same approach, we could simply tell ECS to run the newest version of our app (the “1.1.0” tag):
If the ECS Task had been running continuously (like a web server that’s on 24/7), then ECS would smoothly replace the previous version of the application (1.0.0) with the new one (1.1.0), without any impact on our ability to handle incoming requests. This is a rolling deployment; ECS can also perform blue-green deployments via AWS CodeDeploy:
On the other hand, if the ECS Tasks were one-off jobs (like data processing tasks that finish once their chunk of data has been processed), then ECS would simply run a newer version of the task for the next job:
The second scenario is more common in the data engineering world. In data projects, there are often hundreds of one-off tasks running on a schedule. Some tasks produce vital reports for management and other stakeholders. Some process data for machine learning purposes. Some transport data from one place to another, while others monitor the quality of newly ingested data. It looks more or less as follows:
There can be dozens of different types of applications, each with its own container and many versions. One of them could contain dbt, itself holding thousands of SQL statements to be run against Amazon Redshift. Alongside it, there could be many others, running wildly different software for unrelated purposes.
Their execution graph might end up looking like this:
As workflows tend to increase in number over time and their execution conditions (like the time they are supposed to run at) grow in complexity, it is prudent to set up their scheduling using an existing orchestrator instead of rolling your own.
In this case, we’ll be using AWS MWAA, which is essentially a managed Apache Airflow service neatly integrated with the remainder of AWS’ environment. Besides its core responsibility of scheduling and managing the lifecycle of these tasks, it also keeps track of them, letting us inspect their various logs and metrics in a centralized and clean manner.
While we generally rely on AWS MWAA (Apache Airflow) in our projects, nothing stops you from using different technologies which solve more or less the same problems, such as: AWS Step Functions, Prefect or Mage.
At its core, such a setup is relatively straightforward. However, things can quickly get out of hand once we want to run this in multiple environments with an IaC mindset.
First, we need one more ECS “entity” to actually run a container — a Task Definition. A Task Definition is essentially an extended configuration of a task. At minimum it defines the container image to run (along with its version), but it can also specify its ports, its CPU and memory limits, its logging configuration — and more.
Task Definitions are immutable and versioned. Each change to a task definition results in a new Revision. As such, when you run an ECS task, you tell the ECS service which Revision of a Task Definition to run — and not which container image and version, as that information is already part of the particular Task Definition, at that particular Revision.
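To make this concrete, here is a minimal sketch of the request you would pass to ECS’s `RegisterTaskDefinition` API via boto3 (all names and sizes below are hypothetical examples). Every such call, even one that changes nothing but the image tag, produces a brand-new Revision:

```python
def build_task_definition(image_tag: str) -> dict:
    """Build a minimal RegisterTaskDefinition request for our app.

    All names, ARNs and sizes below are hypothetical examples.
    """
    return {
        # Revisions are numbered per family: my-app:1, my-app:2, ...
        "family": "my-app",
        "networkMode": "awsvpc",
        "requiresCompatibilities": ["FARGATE"],
        "cpu": "256",
        "memory": "512",
        "containerDefinitions": [
            {
                "name": "my-app",
                "image": f"123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-app:{image_tag}",
                "essential": True,
            }
        ],
    }

# Actually registering it (requires AWS credentials) would look like:
#   import boto3
#   ecs = boto3.client("ecs")
#   response = ecs.register_task_definition(**build_task_definition("1.1.0"))
#   # response["taskDefinition"]["revision"] holds the new, auto-incremented number
```

Note that nothing in the request lets you pick the Revision number yourself; ECS assigns it.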
This adds yet another versioned entity to the project. However, you cannot adjust its versioning scheme to your liking: Revision numbers are simply auto-incremented integers. No semantic versioning here.
What’s worse, in a multi-account setup each environment can end up with its own numbering (for various reasons, such as sudden hotfixes on PROD or rapid, manual development on DEV):
Revision 13 on Production could point to version 1.0.0 of the app, while Revision 13 on Preprod could target 2.0.0! Whatever schedules these tasks (e.g. MWAA DAGs) must take these differences into account, forcing you to store potentially wildly different configurations across environments.
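The divergence is easy to demonstrate with fabricated responses that mirror the shape of the `DescribeTaskDefinition` API (the values below are made up for illustration):

```python
def image_of(describe_response: dict) -> str:
    """Extract the container image from a DescribeTaskDefinition-shaped response."""
    return describe_response["taskDefinition"]["containerDefinitions"][0]["image"]

# Fabricated per-environment responses: the SAME revision number,
# but DIFFERENT application versions behind it.
prod_response = {
    "taskDefinition": {
        "family": "dbt-model",
        "revision": 13,
        "containerDefinitions": [{"name": "dbt", "image": "dbt-model:1.0.0"}],
    }
}
preprod_response = {
    "taskDefinition": {
        "family": "dbt-model",
        "revision": 13,
        "containerDefinitions": [{"name": "dbt", "image": "dbt-model:2.0.0"}],
    }
}

# "Run revision 13" therefore means something different in each environment:
assert image_of(prod_response) != image_of(preprod_response)
```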
On top of that, once IaC tooling is involved (as it should be), when a developer pushes new code — creating a new version of an image in the process — they will also need to tweak IaC setups and wait until those get reviewed and deployed.
Woah! Even more cognitive load, more context switching, more complexity! We added a new image version, so we had to deploy a new Terraform module version, to create a new Task Definition Revision. Great.
This is compounded by the fact that there’s often a separate CI/CD process responsible for the deployment of those tasks (or an orchestrator, such as MWAA, that needs an update in its DAGs to point to a new version), forcing you to ignore some inconsistencies between what’s in your IaC, and what’s really in your system.
To be clear: Task Definitions are absolutely necessary in the grand scheme of things. They capture not just container versioning, but dozens of other configuration parameters, and they need their own versioning. However, they are also these weird thingies that sit at the intersection of the application and its infrastructure, rather than firmly in one or the other. Thus, if you blindly apply infrastructure practices to them, things start to break in weird places.
In the data engineering domain, a data engineer who has just pushed a new SQL statement to their dbt repository and wants to reconfigure Airflow to run it should be able to do so without all the hassle we have outlined so far:
Is there any way we can simplify this flow without sacrificing the benefits those abstractions bring to the table?
We had a simple idea, and it turned out to be the solution: create Task Definition Revisions on the fly. Create the “piece that connects ECS with the proper container version to run” automatically, at runtime, and only for a single execution.
As a result, your data engineers will only need to remember which container version they want to run. Simple as that.
Suppose we need to update our existing app (dbt-model:1.0.0) with new SQL statements. This modification in the dbt-model repository triggers our deployment pipeline — which includes testing — and ultimately results in the creation of a Docker image. This image is then stored in Amazon ECR under the name dbt-model:1.0.1.
Of course, dbt is only an arbitrary example of software that could run in ECS.
In essence, dbt is an orchestrator of SQL statements. It lets you define various data models (“ETLs”) along with their dependencies. Then, dbt connects to your data warehouse and runs them on your behalf. Dbt by itself doesn’t require that much compute power (it uses the compute of the data warehouse), but still needs a place to run in AWS. We utilize ECS for that. However, you can run anything inside your containers — e.g. your own custom Python data processing scripts.
Back to the showcase — once we have the new dbt container image in Amazon ECR, the only action required from the Data Engineer is to update the reference to point to the image dbt-model:1.0.1 in the Apache Airflow DAG.
Here comes the “magic”. A new Task Definition is created for each DAG execution using the official Amazon ECS Operators for Apache Airflow. Each workflow run starts with the creation of a new Task Definition Revision, proceeds to the execution of the dbt code, and ends with the Revision’s deregistration. With this configuration, the Data Engineer doesn’t need any prior knowledge of ECS, as the only necessary change is the dbt model version reference.
The resulting DAG boils down to three steps: register the Revision, run the task, deregister the Revision.
Note that running the task requires not just the container version, but also various technical parameters, such as which ECS cluster to run on, the security group configuration, and several others. Thankfully, these are mostly static in practice and can be written once and then reused.
The screenshot below shows how the DAG execution appears in Apache Airflow’s (MWAA) console:
Et voilà.
Your IaC is still used nearly everywhere, but your data team does not need to be aware of all its quirks nor live in dependency hell.
Remember: both AWS MWAA and dbt were used here merely as data engineering-related examples. You could use another orchestrator and different software inside your containers, but the underlying IaC vs. Task Definition Revision issue would still be there, and you could solve it in basically the same way we did!
Of course, as you run dozens of workflows every day, you might quickly find yourself with thousands of Task Definition Revisions. AWS imposes a hard limit of 1,000,000 Revisions per Task Definition family. Deregistering a Revision is not the same as deleting it: deregistered Revisions still count toward this limit. Still, that’s often enough for years of such practice.
However, if you need more than that, you might want to implement one of the following workarounds:
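One possible workaround (our example, under the assumption that you are allowed to delete old Revisions) is a periodic cleanup job built on the ECS `DeleteTaskDefinitions` API. That API accepts at most 10 task definitions per call, so the ARNs need to be batched:

```python
def batch(items: list[str], size: int = 10) -> list[list[str]]:
    """Split ARNs into chunks; DeleteTaskDefinitions accepts at most 10 per call."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def cleanup_family(family: str) -> None:
    """Delete all INACTIVE (deregistered) Revisions of a task definition family.

    Sketch only: requires boto3 and AWS credentials, and is not invoked here.
    """
    import boto3

    ecs = boto3.client("ecs")
    arns: list[str] = []
    # Deregistered Revisions show up with status=INACTIVE.
    paginator = ecs.get_paginator("list_task_definitions")
    for page in paginator.paginate(familyPrefix=family, status="INACTIVE"):
        arns.extend(page["taskDefinitionArns"])
    for chunk in batch(arns):
        ecs.delete_task_definitions(taskDefinitions=chunk)
```

Scheduling `cleanup_family("dbt-model")` (e.g. as yet another Airflow DAG) keeps the Revision count well below the quota.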
Scheduling and managing ephemeral tasks on AWS is simple and efficient with technologies such as AWS ECS and AWS MWAA. However, as you add more environments and IaC to the mix, the subtle quirks of Task Definition Revisions might make your life more complicated.
Thankfully, you can simply detach the Task Definition Revisions from your IaC setup altogether, and just generate them at runtime — saving you many headaches and sleepless nights.
We'd love to answer your questions and help you thrive in the cloud.