Apache Cassandra Lunch #94: StreamSets and Cassandra

In Cassandra Lunch #94, we discuss how to connect StreamSets and Cassandra! The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. Subscribe to our YouTube Channel to keep up to date and watch Cassandra Lunches live at 12 PM EST on Thursdays!

Additionally, check out our Data Engineer’s Lunch, where we covered how to use StreamSets for data engineering, including a walkthrough using the Transformer Engine with Spark. In the walkthrough embedded below, we focus only on the Data Collector, with Cassandra as the use case.

StreamSets is a data integration platform built for DataOps. With StreamSets, you can build streaming, batch, CDC, ETL, and ML pipelines from a single UI and deploy data and workloads to any cloud. The DataOps Platform has a free tier (no credit card required) that includes the Data Collector Engine, the Transformer Engine, and Control Hub. As shown in the demo below, it supports self-managed deployments via Docker. The free tier allows 2 active jobs, 2 active users, and 10 published pipelines, so depending on your workload, the free tier with a self-managed deployment may be all you need.

StreamSets DataOps Platform UI

As mentioned above, the DataOps Platform gives us Control Hub, the Data Collector Engine, the Transformer Engine, and pre-built connectors and native integrations. The Data Collector Engine is open source; the code is available here: https://github.com/streamsets/datacollector-oss. The Transformer Engine can natively execute on Apache Spark, Snowflake, AWS EMR, Google Cloud Dataproc, and Databricks platforms. The pre-built connectors and native integrations allow for connection to applications, big data, SQL/NoSQL DBs, storage/warehouses, and streaming tools. Check out all the connectors here: https://streamsets.com/support/connectors/.

StreamSets Basic Architecture

To connect StreamSets and Cassandra, we need the Cassandra connector. The Cassandra connector supports Apache Cassandra 1.2, 2.x, and 3.x. For additional security, you can enable DSE authentication or Kerberos when using DataStax Enterprise (DSE), as well as SSL/TLS.

The Cassandra connector supports two methods of batch writes: logged and unlogged. Logged batches use Cassandra's distributed batch log and are atomic, so either every record in the batch is written or none are. Unlogged batches skip the batch log, so a failure can leave only part of a batch written to Cassandra. Keep this in mind when configuring your Cassandra connector, as the right write method depends on your use case.
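To make the difference concrete, here is a minimal sketch of what the two batch types look like at the CQL level. It only builds the statement text and does not talk to Cassandra or StreamSets; the `build_batch` helper and the `demo.readings` table are hypothetical names used for illustration, not part of the connector.

```python
from typing import Iterable


def build_batch(statements: Iterable[str], logged: bool = True) -> str:
    """Wrap CQL statements in a logged or unlogged batch (hypothetical helper)."""
    # A plain BEGIN BATCH is logged; UNLOGGED skips the distributed batch log.
    header = "BEGIN BATCH" if logged else "BEGIN UNLOGGED BATCH"
    # Normalize each statement so it ends with exactly one semicolon.
    body = "\n".join(f"  {stmt.rstrip(';')};" for stmt in statements)
    return f"{header}\n{body}\nAPPLY BATCH;"


# Hypothetical table and values, for illustration only.
inserts = [
    "INSERT INTO demo.readings (id, value) VALUES (1, 10.5)",
    "INSERT INTO demo.readings (id, value) VALUES (2, 11.0)",
]

logged_batch = build_batch(inserts, logged=True)
unlogged_batch = build_batch(inserts, logged=False)

print(logged_batch)
print(unlogged_batch)
```

The only textual difference is the `UNLOGGED` keyword, but the behavior differs as described above: the logged form trades some write overhead for atomicity, while the unlogged form is lighter and can leave a batch partially applied.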

As mentioned above, we have a demo of how we can connect StreamSets and Cassandra! Don’t forget to like and subscribe! In the demo, we go through the following items:

  • Spin up Data Collector Deployment from Control Hub + Docker
  • Get CSV data from GitHub
  • Make data transformations + arithmetic
  • Spin up Cassandra on Docker + connect in Control Hub
  • Write to Cassandra
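In the demo these steps are built visually in the StreamSets UI. As a rough analogue of the middle of the pipeline (read CSV, transform, do arithmetic, prepare rows for Cassandra), the same flow can be sketched in plain Python. Everything here is an assumption for illustration: the sample CSV, its column names, the `demo.orders` table, and the `transform` and `to_insert` helpers are hypothetical and are not the data or stages used in the recording.

```python
import csv
import io

# Hypothetical sample data standing in for the CSV fetched from GitHub.
SAMPLE_CSV = """id,name,price,quantity
1,widget,2.50,4
2,gadget,10.00,3
"""


def transform(csv_text: str) -> list:
    """Parse CSV text, convert types, and derive a total per record."""
    rows = []
    for rec in csv.DictReader(io.StringIO(csv_text)):
        price = float(rec["price"])
        quantity = int(rec["quantity"])
        rows.append(
            {
                "id": int(rec["id"]),
                "name": rec["name"].upper(),           # simple field transformation
                "total": round(price * quantity, 2),   # arithmetic step
            }
        )
    return rows


def to_insert(row: dict, table: str = "demo.orders") -> tuple:
    """Return a parameterized INSERT and its values (table name is hypothetical)."""
    cql = f"INSERT INTO {table} (id, name, total) VALUES (%s, %s, %s)"
    return cql, (row["id"], row["name"], row["total"])


rows = transform(SAMPLE_CSV)
statements = [to_insert(row) for row in rows]
for cql, params in statements:
    print(cql, params)
```

Keeping the values separate from the statement text, rather than formatting them into the string, mirrors how a driver would typically bind parameters and avoids quoting problems.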

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was not only to fill the gap left by Planet Cassandra but also to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!
