Data Engineer’s Lunch #4: Airflow for Data Engineering

In case you missed it, the fourth installment of our weekly Data Engineer’s Lunch was presented by guest speaker Will Angel and covered using Airflow for data engineering. Airflow is a scheduling tool for managing data pipelines. The live recording of the Data Engineer’s Lunch, which includes a more in-depth discussion, is embedded below. If you would like to attend a Data Engineer’s Lunch live, it is hosted every Monday at noon EST. Register here now!

Our guest speaker this week is Will Angel, a data scientist and author. He is a co-organizer of the Data Visualization DC meetup group and helps lead communications and development for the Data Community DC non-profit. He has worked in consumer and healthcare startups and has a background in bioethics and physics. Learn more at www.williamangel.net or on Twitter at @DataDrivenAngel.

Overview

Apache Airflow is an open-source task scheduling tool. It is built around workflows defined as directed acyclic graphs (DAGs), which are written in Python and scheduled with cron notation. Airflow is separated into three pieces. The scheduler decides when workflows and their tasks should run. The executors handle running the tasks within the workflows. Lastly, the webserver provides a UI for viewing and managing almost all aspects of Airflow scheduling (except for the actual creation of workflows). Because each workflow is a directed acyclic graph, the scheduler can work out which tasks are ready to run and send them to the executors. Airflow also generates and handles a lot of metadata and logs, and the metadata can be stored in a database of your choice from among those Airflow supports.
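To make the idea of a workflow as a directed acyclic graph more concrete, here is a minimal sketch using only the Python standard library. It is not Airflow's API: the pipeline, its task names (extract, transform, load), and the run_order helper are all hypothetical and only illustrate how dependencies between tasks determine a valid execution order.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical three-step pipeline (not from the talk).
# Each key is a task; its value is the set of tasks it depends on.
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def run_order(dag):
    """Return one valid execution order for the tasks in `dag`.

    Raises graphlib.CycleError if the graph contains a cycle,
    which is why workflows have to be acyclic.
    """
    return list(TopologicalSorter(dag).static_order())


order = run_order(pipeline)
print(order)

# A graph with a cycle has no valid order and is rejected.
try:
    run_order({"a": {"b"}, "b": {"a"}})
    cycle_rejected = False
except CycleError:
    cycle_rejected = True
```

In a real Airflow workflow the same dependency information is declared in a Python file, and the scheduler, rather than a helper like this one, decides when each task is handed to an executor.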

Airflow Advantages

As a scheduling tool, Airflow is comparable to cron, the basic Linux command-line utility for scheduling jobs. When cron jobs fail entirely or partially, they cause problems that cron offers no standard tools to resolve. Even debugging what caused the failure, and why, can take a lot of time. Airflow’s metadata storage and logging capabilities make that kind of investigation much easier. Airflow can also retry tasks automatically when they fail. Airflow can scale out with Kubernetes, whereas scheduling across several machines using just cron is all but impossible to manage. Workflows exist as files stored in a common directory, so we can manage them with version control. Airflow also has a large number of extensions and add-ons to extend its functionality.
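The retry behavior is something plain cron does not give you. The sketch below shows the general idea in standard-library Python, under stated assumptions: run_with_retries and flaky_task are hypothetical names made up for illustration, not Airflow functions. In Airflow itself, retries are configured per task (for example through the retries and retry_delay task arguments) rather than written by hand like this.

```python
import time


def run_with_retries(task, retries=2, retry_delay=0.0):
    """Call task(); if it raises, try again up to `retries` more times.

    `retry_delay` is the pause in seconds between attempts.
    The last exception is re-raised once the retries are used up.
    """
    attempt = 0
    while True:
        try:
            return task()
        except Exception:
            if attempt >= retries:
                raise
            attempt += 1
            time.sleep(retry_delay)


# Hypothetical task that fails twice before succeeding.
calls = {"count": 0}


def flaky_task():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "done"


result = run_with_retries(flaky_task, retries=2)
print(result, calls["count"])
```

With cron, the equivalent logic would have to be built into every script separately, and there would be no central record of which attempts failed.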

Airflow Downsides

Airflow has a higher overhead than cron jobs. An instance of Airflow needs to be running, and its central processing means that more resources are dedicated to scheduling than with simple cron jobs. Base Airflow is fairly low maintenance, but the maintenance cost increases as additional features are installed via add-ons. The same goes for the complexity of managing the Airflow instance, or even individual workflows, once they engage lots of different features.

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!

