What is Apache Airflow?
Find out what Airflow is and how it's used to orchestrate data pipelines.
Craig Dennis
March 22, 2023
8 minutes
Airflow Components
Airflow’s architecture is built from several core components that work together to run every orchestration job. Here is a breakdown of each one:
- DAGs: As mentioned above, a DAG is a container for a workflow. It holds the tasks you want to complete and lets you define the order they run in and which tasks depend on others finishing first.
- DAG Runs: When a DAG is executed, a DAG run is created and all of its tasks are run. You can schedule DAGs with a cron expression, a datetime.timedelta object, or one of the cron presets.
- Tasks: Tasks are what sit inside a DAG. They contain the actions that need to be executed. Each task can have downstream and upstream dependencies, which is key to creating effective workflows. There are three different types of tasks: Operators, Sensors, and TaskFlow-decorated tasks.
- Operators: An operator is a predefined template for a common unit of work. Operators are Python classes that encapsulate the logic to do that work. Some of the most frequently used are the PythonOperator (executes a Python function), the BashOperator (executes a bash script), and the SnowflakeOperator (runs a query against a Snowflake database). A minimal sketch follows this list.
- Sensors: Sensors are a special type of operator that waits for a certain condition to be met. When the condition is met, the sensor is marked successful and its downstream tasks are executed. Sensors have two modes: the default, ‘poke’, holds a worker slot for the sensor’s entire runtime, while ‘reschedule’ only takes a worker slot while it’s checking and sleeps for a set interval between checks (see the sensor sketch after this list).
- TaskFlow: If you write your DAGs mainly in Python, you can use the TaskFlow API to reduce the amount of boilerplate code with the ‘@task’ decorator. TaskFlow passes inputs and outputs between your tasks using XComs (see the TaskFlow sketch after this list).
- Schedulers: The scheduler monitors all tasks and DAGs and triggers tasks once their dependencies are complete. At a set interval (one minute by default), the scheduler checks the state of active DAGs to see whether any of their tasks can be executed.
- Executors: Executors are what actually run your tasks. They share a common API, so you can swap executors based on your installation needs. There are two types: executors that run tasks locally and executors that run tasks remotely on a pool of workers.
- XComs: XComs stands for “cross-communications”. They are a way of letting tasks pass small pieces of data to each other, since tasks are completely isolated and may be running on different machines.
- Variables: Variables are a way of accessing information globally. They are stored as key-value pairs and can be queried from your tasks. You typically use them for overall configuration rather than for passing data between tasks or operators.
- Params: Params are a way of providing runtime configuration to tasks. When a DAG is triggered manually, you can modify its Params before the DAG run starts. If the values supplied don’t pass validation, a warning message is displayed instead of creating the DAG run.
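To make these pieces concrete, here is a minimal sketch of a classic DAG that ties together DAGs, schedules, tasks, operators, and dependencies. The DAG id, task ids, and commands are made up for illustration, and the sketch assumes Airflow 2.4 or later (older versions use the schedule_interval argument instead of schedule).

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def print_transform_message():
    # Hypothetical callable standing in for a real transformation step
    print("Transforming the extracted data...")


with DAG(
    dag_id="example_pipeline",       # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule=timedelta(days=1),      # also accepts cron strings ("0 2 * * *") or presets ("@daily")
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'pretend this calls an API and lands a file'",
    )

    transform = PythonOperator(
        task_id="transform",
        python_callable=print_transform_message,
    )

    # Upstream/downstream dependency: extract must succeed before transform runs
    extract >> transform
```

Each time the scheduler triggers this DAG, a new DAG run is created and its two tasks execute in the declared order.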
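Sensors follow the same pattern. The sketch below uses the built-in FileSensor in ‘reschedule’ mode to wait for a hypothetical export file before loading it; it assumes a filesystem connection named fs_default is configured in your Airflow instance.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="wait_for_export",                # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # 'reschedule' mode frees the worker slot between checks;
    # the default 'poke' mode would hold the slot for the whole wait.
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        filepath="/data/exports/daily.csv",  # hypothetical path
        fs_conn_id="fs_default",             # assumes this filesystem connection exists
        poke_interval=300,                   # check every 5 minutes
        mode="reschedule",
    )

    load = BashOperator(
        task_id="load",
        bash_command="echo 'file has arrived, load it into the warehouse'",
    )

    wait_for_file >> load
```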
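Finally, a TaskFlow-style sketch that shows the ‘@task’ decorator, automatic XCom passing between tasks, a Variable lookup, and a Param. The DAG id, variable key, param name, and values are all hypothetical.

```python
from datetime import datetime

from airflow.decorators import dag, task
from airflow.models import Variable
from airflow.operators.python import get_current_context


@dag(
    dag_id="taskflow_example",        # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
    params={"row_limit": 10},         # Param you can override when triggering manually
)
def taskflow_example():
    @task
    def extract():
        context = get_current_context()
        limit = context["params"]["row_limit"]                      # Param value at runtime
        source = Variable.get("source_system", default_var="crm")   # Variable looked up by key
        return {"source": source, "rows": list(range(limit))}

    @task
    def transform(payload: dict):
        # The dict returned by extract() arrives here via XCom automatically
        return len(payload["rows"])

    transform(extract())


taskflow_example()
```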
Airflow Benefits
Airflow can offer many benefits to help you manage your complex data pipeline workflows.
- Ease of use: The learning curve for Airflow is quite gentle. Once you understand the different components and are familiar with Python, you can get started quickly.
- Open-source: With Airflow being open-source, it comes with a lower cost, gives people the ability to contribute improvements, and has a community to offer support.
- Graphical user interface: To make managing your workflows easier, Airflow has a graphical UI where you can view the workflow you’ve created and see the status of ongoing and completed tasks.
- Integrations: Airflow comes with many pre-built operators that let you integrate with cloud platforms such as Google Cloud and AWS, services like Snowflake, and popular databases.
- Automation: Airflow helps you automate your workflows. Once a workflow is set up, it can run on a schedule or be triggered automatically.
- Centralized workflow/orchestration management: Airflow offers a single place to set up all your workflows and provides a rich user interface that makes it easier to view and orchestrate your data pipelines and to handle complex relationships between tasks.
Airflow Use Cases
Airflow’s flexibility means you can use it for a lot of things. Here are some of the most common use cases:
- Data Integration (ETL/ELT): You can use Airflow for data integration by setting up a DAG for each of your data pipelines. Each DAG connects to a data source, such as your CRM, your social media accounts, or your ad platforms, and its tasks retrieve the required data and transfer it to a data warehouse.
- Workflow automation across systems (orchestration): Because of the many operators Airflow gives you access to, you can connect multiple systems and automate workflows across them by setting up the relevant DAGs.
- Scheduling: Data pipelines need to run at specific intervals. Rather than running them manually every day or week, Airflow lets you schedule your DAGs to execute at a set interval so their tasks complete automatically (see the sketch after this list).
- Alerting: Because you can specify dependencies between the tasks in your DAGs, you can build in control flow. If a particular task fails, you can set up alerts to notify you via email so you can investigate what caused the failure (also shown in the sketch after this list).
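As a rough sketch of the scheduling and alerting use cases, the DAG below runs on a cron schedule and uses default_args to retry failed tasks and email an illustrative address on failure. The DAG id, schedule, and email address are assumptions, and email alerts only work if SMTP is configured for your Airflow deployment.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# default_args are applied to every task in the DAG
default_args = {
    "email": ["data-team@example.com"],  # hypothetical address
    "email_on_failure": True,            # requires SMTP to be configured for your deployment
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="nightly_etl",                # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule="0 2 * * *",                # cron expression: every night at 02:00
    catchup=False,
    default_args=default_args,
) as dag:
    run_etl = BashOperator(
        task_id="run_etl",
        bash_command="echo 'run the nightly ETL job here'",
    )
```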
Final Thoughts
Apache Airflow has been around for a long time, but it's still one of the most prominent and important technologies in the data engineering world because of the unique challenges it addresses. Airflow creates a centralized place where you can easily manage, orchestrate, and schedule all of your ETL jobs and batch processes.
Airflow is the most widely adopted open-source tool for data orchestration. Data integration is extremely complex, and the ability to define pipelines visually as DAGs is enormously useful. Airflow is essentially in a league of its own, acting as the ultimate framework in the data engineering world.