Open Source Notebooks and Cassandra: Doing SQL on Cassandra Tables

In this blog post, we will introduce a few open-source notebooks that we can use to do SQL on Cassandra. At the bottom of the blog, we have an accompanying webinar that you can watch to see a live demo using 2 of the notebooks we discuss in this blog. This is Part 3 of our series on “Doing SQL and Reporting on Apache Cassandra with Open Source Tools”, and Parts 1 and 2 are also linked below. Also, be on the lookout for part 4 coming soon!

In this blog, we will cover 4 open-source notebooks that can be used to SQL on Cassandra: Jupyter Notebook, Apache Zeppelin, Quix, and Polynote.

Jupyter Notebook

The Jupyter Notebook is an open-source web application that allows users to create and share documents that contain live code, equations, visualizations, and narrative text. The Jupyter Notebook is probably the most commonly used notebook that we will discuss in this blog, and for good reason given the capabilities and ease of use.

Example Jupyter Notebook
Example Jupyter Notebook

A few use cases of Jupyter Notebooks include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

Jupyter supports over 40 programming languages, including Python, R, Julia, and Scala. We will be focusing on Python for our purposes as we will be utilizing PySpark in order to do SQL on Cassandra within Jupyter. To see a live demo using Jupyter and PySpark, check out the video linked below!

Notebooks can be shared with others using email, Dropbox, GitHub, and the Jupyter Notebook Viewer. We uploaded the Jupyter Notebook we used in the demo to GitHub, and it can be found here.

With Jupyter Notebook, users can leverage big data tools, such as Apache Spark, from Python, R, and Scala like we did by using PySpark (see video below). Users can then explore that same data with pandas, scikit-learn, ggplot2, TensorFlow for purposes like data analysis, data visualization, and doing machine learning.

The notebooks are an open document format based on JSON. They contain a complete record of the user’s sessions and include code, narrative text, equations, and rich output. The Notebook communicates with computational Kernels using the Interactive Computing Protocol, an open network protocol based on JSON data over ZMQ and WebSockets. The Kernels themselves are processes that run interactive code in a particular programming language and return output to the user. Kernels also respond to tab completion and introspection requests.

As mentioned a few times above, we explored Jupyter Notebook with a live demo, and the video of that is linked below.

Apache Zeppelin

Another prevalent open-source notebook is Apache Zeppelin. Apache Zeppelin is a web-based notebook that enables data-driven, interactive data analytics, and collaborative documents with SQL, Scala, and more. The most common use cases for Apache Zeppelin are data ingestion, data discovery, data analytics, data visualization, and collaboration.

Example Apache Zeppelin Data Visualization
Example Apache Zeppelin Data Visualization

Zeppelin supports Spark, PySpark, Spark R, and Spark SQL with a dependency loader. Zeppelin also allows users to connect any JDBC data sources seamlessly including technologies such as Postgresql, MySQL, MariaDB, Redshift, and Apache Hive. Apache Zeppelin also supports python with Matplotlib, Conda, Pandas SQL, and PySpark integrations.

The Apache Zeppelin interpreter concept allows any language/data-processing-backend to be plugged into Zeppelin. Currently, Apache Zeppelin supports many interpreters such as Apache Spark, Python, JDBC, Markdown, and Shell.

Technologies Available to Apache Zeppelin Through Interpreter
Technologies Available to Apache Zeppelin Through Interpreter

As mentioned above, Apache Zeppelin can be used for data visualization. Zeppelin includes some basic charts from the get-go and visualizations are not limited to a SparkSQL query, as any output from any language backend can be recognized and visualized.

Example Zeppelin Graph
Example Zeppelin Graph

Zeppelin also aggregates values and displays them in a pivot chart with simple drag and drop. Users can easily create charts with multiple aggregated values including sum, count, average, min, max.

Example Apache Zeppelin Pivot Chart
Example Apache Zeppelin Pivot Chart

Zeppelin notebooks can also be shared for collaboration through URLs. Any changes made the notebooks will be broadcast in real-time so every user can appreciate the newly made changes.

The notebooks can also be embedded as an iframe wherever iframes are allowed.

Published Apache Zeppelin Notebook
Published Apache Zeppelin Notebook

Unfortunately, we were not able to demo Apache Zeppelin in the video linked below, but we were able to demo the next notebook (see video below!).

Quix

Quix is a multi-user, easy-to-use notebook manager, which can act as a shared space for BI insights. Quix has support for PrestoAthenaBigQueryMySQLPostgreSQLClickHouse, and more through either direct support or through JDBC’s. We used Quix’s Presto functionality in order to do SQL and joins on Cassandra from Quix itself. A live demo of this is available in the video linked below. Also, if you want to check out part 1 of this series, Presto and Cassandra, where we discussed how to set up Presto and connect it to Cassandra with a live demo of using Presto to do SQL and joins on Cassandra, then you can check that out here or at the series listing below. Also, Quix can be run in single-user or multi-user mode via Google OAuth.

Quix uses notes to edit and run queries, and those notes live inside notebooks, which can be organized in folders. The Database explorer lets users navigate, search, and preview their tables as seen below.

Quix Dashboard and Database Explorer
Quix Dashboard and Database Explorer

Quix can also help users write their queries with autocomplete, syntax highlighting, live syntax validation, error line highlight, and dataset explorer with drag and drop support. Additionally, the Quix smart editor supports strong type variables, multiple result statements, simple pivot and pie visualizations, and the ability to export results as CSV. The visualizations can be done with one click as seen below.

Example Quix Visualization
Example Quix Visualization

Users can share notebooks via a simple link; as well as, search the notes of every user within their organization with the help of full-text search.

As mentioned above, we have a live demo of Quix in addition to a live demo of Jupyter linked in the video below.

Polynote

The final notebook that we will discuss is Polynote. Polynote is a polyglot notebook with first-class Scala support. Polyglot means knowing or using several languages, and in that regard, Polynote supports mixing multiple languages in one notebook, and sharing data between them seamlessly. It encourages reproducible notebooks with its immutable data model.

Example Polynote Notebook
Example Polynote Notebook

A few unique features that Polynote offers are:

  • Interactive autocomplete
  • Error Highlighting
  • Rich text editing
  • LaTeX equations
  • Individual task tracking
  • Symbol table to track defined variables
  • Real time running statement highlighting
  • Ordered cell semantics
  • Apache Spark integration
  • Table structured data
  • Rich representations of data
  • Built-in plot editor

Another cool thing about Polynote is that users can utilize Vega for visualizations.

A warning that Polynote does provide is that it currently does not contain any built-in security or authentication of its own and entirely relies on the users deploying and configuring it in a secure way. Also, another thing is that Polynote is currently only tested on Linux and macOS using Chrome as the client. They hope to test other platforms and browsers soon, so if you run into issues using Windows, you can either Google potential fixes or create an issue on Polynote’s GitHub repo.

Unfortunately, we were not able to show a live demo of Polynote in addition to Apache Zeppelin. However, as mentioned before, we were able to live demo Jupyter Notebook and Quix, so if you want to watch them, it is linked below! And with this, we conclude our blog on open source notebooks and Cassandra.

Webinar Recording
Deck

Doing SQL and Reporting on Apache Cassandra with Open Source Tools

  1. Presto and Cassandra
  2. Spark and Cassandra
  3. Open Source Notebooks and Cassandra
  4. Open Source BI Tools and Cassandra

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra, but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!


Join Anant's Newsletter

Subscribe to our monthly newsletter below and never miss the latest Cassandra and data engineering news!