In this blog post, we will introduce Presto, an open-source distributed SQL query engine, and discuss how to connect it with Cassandra. We will also give you a quick demo to show how you can quickly test it out yourself. Also, this will be part 1 of our series on “Doing SQL and Reporting on Apache Cassandra with Open Source Tools”, so be on the lookout for parts 2, 3, and 4!
Presto is an open-source distributed SQL query engine that can be used for running interactive analytic queries against various data sources of various sizes. Presto was originally created by Martin Traverso, David Phillips, Dain Sundstrom, and Eric Hwang while at Facebook, and its original purpose was to solve for slow queries on a 300 PB Hive Data Warehouse. This original version of Presto is called PrestoDB. A few of the founders of PrestoDB, Martin Traverso, David Phillips, and Dain Sundstrom left Facebook in 2018 and created PrestoSQL. For the purposes of our blog and demo, we will focus on PrestoDB, since it is the original version of Presto.
As mentioned above, Presto can run queries against various data sources of various sizes including Hive, Cassandra, RDBMS, and other NoSQL databases. A single Presto query can combine data from multiple sources to allow for analytics across a user’s entire system. For this blog, we will solely focus on Presto and Cassandra.
If you have not already and would like to, you can learn how to deploy Presto here. For demo purposes, for the config properties file, we used the provided setup that functions both as a coordinator and worker. We also used the provided setups for the node properties file, JVM config file, and log properties file. We will be demoing this connection locally using Datastax Enterprise for Cassandra, Datastax Studio for a notebook, the Presto CLI to connect and make queries, and the Presto UI to view the analytics. To connect Presto and Cassandra, you will need to set up the Cassandra connector. To configure it, you will need to add a catalog properties file to
cassandra.properties. If you are using multiple Cassandra clusters, then you can create multiple
.properties files prefixed by a descriptive indicator as mentioned here. Within this
cassandra.properties, you will fill it out as so. Since we are doing this locally, our contact-points will be 127.0.0.1.
To begin the demonstration, we will need to download the Datastax studio notebook that we have already created for you.
A general overview of the notebook is that we will create 2 tables, seed those two tables, and try to run a join query using CQL, which will not work. We will try that join statement in the Presto CLI however, and visualize the results.
We will need to run the Datastax Enterprise Cassandra server, Datastax Studio server, and Presto server. You can choose however to run them, but we ran them in 3 separate tabs within 1 terminal with each tab cd’d into each respective directory.
bin/launcher startfor no logs
bin/launcher runfor logs
sudo bin/launcher start/run
Once we have all 3 servers running, we can open
localhost:8080 for the Presto UI, and
localhost:9091 for the Datastax Studio UI as seen below.
Now we can drag and drop the downloaded notebook file into Datastax Studio, which should import it. Once imported, we can open it.
Moving back to the notebook, we will start with the first cell. We can run the first cell by hovering over the cell and clicking
Run LOCAL.QUORUM, which will create our demo keyspace. After this has been run, we can now use our Presto CLI and connect to it. If you have not already, download the Presto CLI and follow the instructions to make it executable. Run the following command in a new tab in our terminal that is running our servers, or however you choose to do so.
./presto --server localhost:8080 --catalog cassandra --schema demo
The terminal should look as so:
If we run
show tables; in the CLI, we will see the following as we have not created any tables yet.
Now, we will move back to the notebook, and run the second cell. Before running the cell, we will need to select the keyspace “demo”, and then hit
Run LOCAL.QUORUM. Once that has run, we can again run
show tables; in the Presto CLI, and see the results.
With this, we have now confirmed that Presto can connect to our local instance of Cassandra and make queries on it. We will now move onto the 3rd and 4th cells in the notebook. The 3rd cell will seed the table we just made, and the 4th cell will confirm that the table is seeded. Go ahead and run both cells, and remember to select “demo” as the keyspace. Once the 4th cell runs, you will see the result of the query from a CQL aspect. We will copy and paste the cell 4 query and try it out in the Presto CLI. The results should be as seen below.
Also, we can check out our query analytics in the Presto UI at
localhost:8080. We can click on the “checkmark Finished” tab under “State” under QUERY DETAILS and you can visualize the analytics of the queries we have been making so far.
Moving back to the notebook, we can run cells 5, 6, and 7, and repeat the steps we did for cells 2, 3, and 4 to verify that the table was made and seeded correctly, and visualize the query analytics in the Presto UI.
We have now reached the end of the demo, and more importantly, the purpose of why we are using Presto with a NoSQL database like Cassandra. Say we want to get the speed of a spacecraft by spacecraft name, and also return the summary of the journey by spacecraft name as well with only one query. Will we be able to do that with only CQL? NOTE: The data model we are using is strictly for demo purposes to showcase how Presto can connect with Cassandra.
Take the command in the last cell of the notebook and try running it. It won’t work as we can’t do joins natively with CQL, and we would need to run at least 2 queries, one to each table, and then programmatically connect that data.
If we try running the same query in the Presto CLI, then it will work. We will get results of a join statement that joins 2 Cassandra tables based off of the journey_id in both tables.
We can also visualize the analytics of that join query in the Presto UI.
To also demonstrate some other cool things you can do with Presto’s UI, we can filter and analyze queries by query. If you type
show tables into the search bar, we can analyze the different
show tables queries we made and use that for BI analysis.
For this demo, we can make an observation that the average wall time for the show tables query is around 226.59ms while ignoring the small sample size. Also, a more impactful observation that we can make is that as we added more tables, the wall time tended to increase ~5-10ms per table. However, these observations would not likely hold weight in the real world without a larger sample size and proper regression analysis.
With Presto, an open-source distributed SQL query engine, we connected two NoSQL tables and ran a join query on them. This example demo only utilizes one data source, Cassandra, but if we want to use multiple supported data sources and connect them via Presto to run joins across them, then we can. This concludes part 1 of our “Doing SQL and Reporting on Apache Cassandra with Open Source Tools” series, so be on the lookout for parts 2, 3, and 4!
If you want to watch this demo in video form, it is linked below:
Also, the slide deck associated with the video is linked below:
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra, but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!