Managed Apache Airflow on AWS — New AWS Service For Data Pipelines

Ayush Kumar Srivastava
7 min read · Jul 1, 2021

AWS Managed Workflows for Apache Airflow (MWAA) is now available — is it worth trying?

Apache Airflow was initially released as an open-source project in 2015. Since then, it has gained a lot of traction in the data engineering community thanks to its ability to build data pipelines in Python, its extensibility, its wide range of operators, and its active open-source community. Despite its popularity, deploying Airflow to a robust and secure production environment has always been challenging.

In fact, there are companies (such as Astronomer), consultants (among others, Polidea and GoDataDriven), and cloud services (such as Google Cloud Composer and many AWS Marketplace offerings) that specialize in offering enterprise support for deploying and managing Airflow environments. AWS now enters this market, too.

The new fully managed service from AWS lets you create a production-ready Airflow environment within a few clicks in the management console. In this article, we’ll look at how it works and investigate how it differs from competing managed Airflow offerings.

Introducing MWAA: Managed Workflows for Apache Airflow

The main benefit and selling point of MWAA is convenience: a managed service with elastic worker-node capacity that allows you to deploy your DAGs without having to worry about the underlying infrastructure. This means that you no longer need to monitor and manually scale your Celery workers to meet your workflows’ demand. Until now, having elastic worker nodes with Airflow was only possible using the KubernetesExecutor together with KEDA (Kubernetes Event-Driven Autoscaler).

With MWAA, you don’t choose the Airflow executor — MWAA only supports the CeleryExecutor, with an autoscaling mechanism implemented under the hood. It’s no surprise that the AWS-native SQS is leveraged as the message queue that distributes scheduled tasks to the Celery workers.

Airflow relies on a metadata database that stores information about your workflows. AWS utilizes RDS Aurora (Postgres) for that purpose.

MWAA uses S3 as a storage layer for the code, i.e., DAG files, plugins code, and requirements.txt to install additional Python packages within the Airflow environment.
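
A common layout of that bucket looks roughly like this (the bucket name is illustrative; the dags/ prefix and the two optional files are the paths you point MWAA at when creating the environment):

s3://airflow-YOUR-BUCKET/
|-- dags/
|   |-- your_dag.py
|-- plugins.zip
|-- requirements.txt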

1. Managed service

Besides the autoscaling of worker node capacity, one of the most considerable advantages of MWAA is the fact that it’s a managed service. You don’t need to monitor the webserver, worker node, and scheduler logs to ensure that all components within your environment are working — AWS is responsible for keeping your environment up and running at all times. The same is true for security patches and upgrades to new Airflow versions.

2. Integration with AWS services

The integration with other AWS services makes it easier to manage communication between Airflow and other services running within your VPC. For instance, instead of maintaining and manually rotating credentials, you can now leverage IAM roles for more robust management of permissions.
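
For example, a task can call other AWS services through boto3 without any stored credentials, because the permissions come from the execution role attached to your MWAA environment. A minimal sketch (the bucket name and function are illustrative):

import boto3

def list_processed_files():
    # No access keys anywhere: boto3 picks up temporary credentials
    # from the IAM execution role attached to the MWAA environment.
    s3 = boto3.client("s3")
    response = s3.list_objects_v2(Bucket="my-data-bucket", Prefix="processed/")
    for obj in response.get("Contents", []):
        print(obj["Key"])

You would call such a function from, e.g., a PythonOperator task; the execution role just needs the corresponding S3 permissions on the bucket.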

3. Convenient configuration

Another useful feature is the centralized storage of configuration options. Airflow is configured to a large extent by setting variables in the airflow.cfg file. With MWAA, you can manage those settings directly from the management console. Normally, after changing your configuration variables, you would have to restart your scheduler and worker nodes to apply the changes. With MWAA, this is performed under the hood without bringing down your Airflow environment.
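
The same options can also be set programmatically. A sketch using boto3 (the environment name and option value are illustrative; the keys mirror the section.option pairs from airflow.cfg):

import boto3

mwaa = boto3.client("mwaa")
mwaa.update_environment(
    Name="my-mwaa-environment",  # hypothetical environment name
    AirflowConfigurationOptions={
        # section.option pairs, as they would appear in airflow.cfg
        "core.default_task_retries": "3",
    },
)

Note that, at the time of writing, MWAA only allows a whitelisted subset of Airflow configuration options to be overridden this way.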

4. Extensibility via plugins

The functionality of MWAA environments can be extended by using plugins — you simply need to upload plugins.zip to your S3 bucket to make custom operators, hooks, and sensors available to all your DAGs. Example structure of plugins.zip [3], with a minimal plugin definition sketched after the listing:

plugins.zip
|-- __init__.py
|-- airflow_plugin.py
|-- hooks/
|   |-- __init__.py
|   |-- airflow_hook.py
|-- operators/
|   |-- __init__.py
|   |-- airflow_operator.py
|-- sensors/
|   |-- __init__.py
|   |-- airflow_sensor.py
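
A minimal sketch of what airflow_plugin.py could contain, assuming the hook and operator classes below exist in the hooks/ and operators/ folders of the zip (all class and module names are illustrative):

from airflow.plugins_manager import AirflowPlugin

# Imports are relative to the root of plugins.zip
from hooks.airflow_hook import MyCustomHook
from operators.airflow_operator import MyCustomOperator

class MyAirflowPlugin(AirflowPlugin):
    name = "my_airflow_plugin"
    hooks = [MyCustomHook]
    operators = [MyCustomOperator]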

5. Convenient logging options

One of my favorite aspects of MWAA is how easy it is to configure logging to CloudWatch. Setting this up yourself would require a lot of work: configuring a CloudWatch Agent on all instances to stream the logs to CloudWatch, and ensuring proper log groups for all components (scheduler logs, worker node logs, webserver logs, and the actual logs from your tasks).
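
Which components ship logs, and at what level, can be toggled per component. A sketch using the same boto3 update_environment call as before (the environment name and log levels are illustrative):

import boto3

mwaa = boto3.client("mwaa")
mwaa.update_environment(
    Name="my-mwaa-environment",
    LoggingConfiguration={
        "TaskLogs": {"Enabled": True, "LogLevel": "INFO"},
        "SchedulerLogs": {"Enabled": True, "LogLevel": "WARNING"},
        "WorkerLogs": {"Enabled": True, "LogLevel": "WARNING"},
        "WebserverLogs": {"Enabled": True, "LogLevel": "ERROR"},
    },
)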

6. Security out of the box

My second favorite feature is the integration of MWAA with IAM: only authorized IAM users can log into your Airflow UI. This is enabled by default, without having to implement any custom auth logic. When I open the link to the Airflow UI in a private browser window, I cannot access the UI until I sign in with a user who has access to the AWS management console.

From the security perspective, I would definitely recommend using MWAA rather than deploying Airflow yourself on EC2. AWS ensures that any data processed within MWAA is encrypted with KMS by default, and you get Single Sign-On that works out of the box. Similarly, you don’t need to worry about Security Groups, and a domain name is automatically assigned to your environment.
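
Under the hood, UI access is granted through short-lived web login tokens, and you can also generate one programmatically. A sketch, assuming an environment called my-mwaa-environment and a caller whose IAM identity is allowed to create web login tokens:

import boto3

mwaa = boto3.client("mwaa")
token = mwaa.create_web_login_token(Name="my-mwaa-environment")
print(token["WebServerHostname"])  # the host serving the Airflow UI
print(token["WebToken"])           # short-lived token used to sign in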

7. CI/CD pipelines can be easily established

From the DevOps perspective, MWAA is attractive since establishing a CI/CD pipeline for your DAGs is as easy as making sure that any push to the master branch triggers an upload of the code to s3://airflow-BUCKET/dags.
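
The deployment step itself can be a few lines of Python (or a single aws s3 sync call) in whatever CI system you use. A sketch, keeping the placeholder bucket name from above:

import pathlib
import boto3

s3 = boto3.client("s3")
# Mirror the repository's dags/ folder into the prefix MWAA watches
for path in pathlib.Path("dags").glob("*.py"):
    s3.upload_file(str(path), "airflow-BUCKET", f"dags/{path.name}")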

8. Support for containerized workflows & consolidated pricing

Lastly, AWS also promises support for containerized workloads with AWS Fargate and enables easy consolidated pricing for all your Airflow components by using tags assigned to your MWAA, S3, and CloudWatch resources.

Demo

The overall process of creating an environment consists of configuring:

  • VPC with two private subnets (created via a CloudFormation template),
  • S3 bucket that will be used as storage for your DAG files (the bucket name must start with “airflow-” and versioning must be enabled!). Optionally, you can also upload plugins.zip to use custom operators, sensors, and hooks, as well as requirements.txt to make additional Python packages available within this Airflow environment.

In the demo below, I’m using the AWS management console to create the MWAA Airflow environment, including creating a VPC with two private subnets.

Creating an Airflow environment in MWAA — image by author

Once the environment is available, you can upload a simple test DAG to the S3 bucket: s3://airflow-mwaa-dev.

Hello-world DAG from AWS MWAA
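
The original DAG isn’t reproduced here, but a minimal hello-world sketch for the Airflow 1.10.x series that MWAA launched with could look like this (saved as ex_dag.py):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.10.x import path

with DAG(
    dag_id="hello_world",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # run only when triggered manually from the UI
) as dag:
    say_hello = BashOperator(
        task_id="say_hello",
        bash_command="echo 'Hello from MWAA!'",
    )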

aws s3 cp ex_dag.py s3://airflow-mwaa-dev/dags/

After uploading to S3, we can find the DAG in the UI:

Managed Airflow on AWS — image by author

Costs

As with any cloud service, you need to know what such a managed service will cost and how that compares to a self-managed environment. By deploying everything yourself and perfectly matching the compute resources to your needs, you could achieve some savings compared to a managed service. However, from my perspective, the autoscaling of your Celery worker nodes will most likely already provide cost savings compared to a self-hosted Airflow environment with idle compute resources.

Also, one needs to consider the time savings that a managed service enables — you no longer need to spend time monitoring, patching, and restarting stuck webserver and scheduler instances, fixing broken Celery queues, and matching the worker node capacity to the needs of your workflows.

AWS charges separately for the MWAA instance, worker nodes, and for the metadata database storage. Also, S3 and CloudWatch charges apply based on your region and usage.
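
As a back-of-the-envelope model, the monthly bill is roughly the sum of those meters. The rates below are placeholders, not actual AWS prices; look up the current per-region rates on the pricing page:

# Hypothetical rates, for illustration only; substitute real per-region prices
ENV_RATE = 0.50      # USD per hour for the environment (scheduler + webserver)
WORKER_RATE = 0.06   # USD per hour per additional worker
DB_RATE = 0.10       # USD per GB-month of metadata database storage

HOURS_PER_MONTH = 730
avg_extra_workers = 2  # average number of autoscaled workers beyond the base
db_storage_gb = 5

monthly = HOURS_PER_MONTH * (ENV_RATE + avg_extra_workers * WORKER_RATE) + db_storage_gb * DB_RATE
print(f"~${monthly:.2f} per month, before S3 and CloudWatch charges")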

AWS MWAA — pricing table. Source: AWS

Important note regarding costs (!!!)

When you delete your MWAA Airflow environment, make sure to also delete the VPC that the default MWAA configuration created via its CloudFormation template, especially the NAT Gateway, which is charged per hour. After deleting the VPC, verify that the NAT Gateway is gone as well to avoid any unnecessary charges. You can check this in the VPC section of the management console:

Make sure that after deleting your MWAA environment the number of NAT Gateways remains zero for all regions — image by author
