Data Engineer’s Lunch #66: Airflow and Presto

In Data Engineer’s Lunch #66, we discuss how to connect Airflow and Presto. The live recording of Data Engineer’s Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. Subscribe to our YouTube Channel to keep up to date and watch Data Engineer’s Lunches live at 12 PM EST on Mondays!

In this lunch, we discuss how to connect Apache Airflow and open-source Presto, with Cassandra serving as the queryable database. The main focus is on connecting Airflow and Presto; Cassandra plays a supporting role here, but in an upcoming Cassandra Lunch it will feature more prominently in the demonstration.

If you are not familiar with Presto, it is an open-source distributed SQL query engine for big data. Presto was originally created at Facebook to speed up slow queries on a 300 PB Hive data warehouse; this original version is known as PrestoDB. In 2018, a few of PrestoDB’s founders left Facebook and created PrestoSQL, which was later renamed Trino.

Check out this blog for more discussion of Presto and Cassandra together. Again, the focus of this Data Engineer’s Lunch was on connecting Airflow and Presto, and then in a future Cassandra Lunch, we will dive deeper into Airflow, Presto, and Cassandra.

While setting up this demo, we found few good resources on connecting Airflow and Presto beyond the basic Airflow provider docs, and the resources section of the documentation contains no example DAGs. We ended up using the PrestoHook, but the docs show only about three lines of example code and never demonstrate how to use the hook inside an actual DAG. No worries, we figured it out for you, so follow along below in the embedded live walkthrough!

The live recording below contains a live walkthrough of the demo in which we discuss nuances you might encounter while connecting Airflow and Presto, so be sure to check it out. Don’t forget to like and subscribe while you are there! You can also follow along with the walkthrough below, or with the GitHub README.md, all directly in your browser via Gitpod!

Click below to get started!

Open in Gitpod

1. Start Presto and Cassandra Docker Containers and Install Airflow

IMPORTANT: Remember to make the ports public when the dialog appears in the bottom right-hand corner!

1.1 Run setup script

bash setup.sh

2. Open a new terminal and confirm services are running

2.1 Confirm Docker containers are running

docker ps

2.2 Open the Airflow UI on port 7080 and log in

username: admin
password: password

2.3 Open the Presto UI on port 8080

3. Set Up the Presto Connection in Airflow Admin

3.1 Get IP Address

hostname -I | awk '{print $2}'

3.2 Fill in the presto_default connection with the following items, then confirm by testing the connection

Connection Type: Presto
Host: value copied from 3.1
Schema: remove hive and leave blank
Login: admin
Port: 8080

4. Create Cassandra Catalog in Presto

4.1 Update {hostname} in the cassandra.properties file with the value from 3.1

sed -i "s/{hostname}/$(hostname -I | awk '{print $2}')/" cassandra.properties

4.2 Copy cassandra.properties to Presto container

docker cp cassandra.properties $(docker container ls | grep 'presto' | awk '{print $1}'):/opt/presto-server/etc/catalog/cassandra.properties

4.3 Confirm cassandra.properties was copied to the Presto container

docker exec -it presto sh -c "ls /opt/presto-server/etc/catalog"

5. Confirm Presto CLI can see Cassandra catalog

5.1 Start Presto CLI

docker exec -it presto presto-cli

5.2 Run show command

show catalogs;

If you do not see the cassandra catalog, restart the container.

5.3 Restart Presto container

docker restart presto

5.4 Repeat 5.1 and 5.2 and confirm that you can now see the cassandra catalog

6. Run the Airflow Presto DAG

6.1 Move the DAG into the dags folder

mkdir ~/airflow/dags && mv presto.py ~/airflow/dags

6.2 While waiting for the scheduler to pick up the DAG, create the Presto query variable

6.2.1 Under Admin in the Airflow UI, click Variables
6.2.2 Click the blue plus sign to create a variable
6.2.3 Fill in the key and value as below:
key: presto_query
value: show catalogs;

6.3 Confirm the scheduler has picked up the Presto DAG in the Airflow UI (this might take a couple of minutes)

6.4 Enable and run the Presto DAG

6.5 Review the logs and confirm the query returned the same catalogs we saw when running show catalogs in the Presto CLI

6.6 Update the presto_query variable value to show schemas in cassandra;

6.7 Rerun the DAG and confirm that the returned values match the default keyspaces that exist in Cassandra

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link is not only to fill the gap left by Planet Cassandra but also to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!


Join Anant's Newsletter

Subscribe to our monthly newsletter below and never miss the latest Cassandra and data engineering news!