Alpakka is an open-source project, built on top of Akka Streams, for implementing stream-aware and reactive integration pipelines in Java and Scala. This post covers using Alpakka Cassandra and Akka Streams together with Twitter4S (a Twitter client written in Scala) to pull new tweets from Twitter for a given hashtag (or set of hashtags) using Twitter API v1.1 and write them into a local Cassandra database.
The GitHub repository for the demo project can be found here: https://github.com/Anant/example-cassandra-alpakka-twitter
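Akka Streams is a library in Akka for processing a sequence of elements using bounded buffer space. Its core building blocks, all of which appear in this post, are the Source (a stage that emits elements), the Flow (a stage that receives elements, transforms them, and emits the results), and the Sink (a stage that consumes elements).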
The Akka Streams library handles back-pressure automatically: a component receives elements from the previous component only when it is ready for them. More detailed information on Akka Streams can be found on Akka’s documentation website.
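To make the Source, Flow, and Sink vocabulary concrete, here is a minimal sketch of a stream that doubles five integers and collects the results. It is not taken from the demo project; the object name `BasicStream` is made up for illustration, and the sketch assumes Akka 2.6 or later, where an implicit `ActorSystem` is enough to materialize a stream.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Sink, Source}

import scala.concurrent.Await
import scala.concurrent.duration._

// Hypothetical example object, not part of the demo project.
object BasicStream {
  def run(): Seq[Int] = {
    // Assumes Akka 2.6+, where the materializer is derived from the ActorSystem.
    implicit val system: ActorSystem = ActorSystem("basic-stream")
    try {
      val doubled = Source(1 to 5)   // Source: emits the elements 1..5
        .via(Flow[Int].map(_ * 2))   // Flow: transforms each element
        .runWith(Sink.seq)           // Sink: collects everything it receives
      // Blocking is acceptable in a small example; avoid it in real services.
      Await.result(doubled, 5.seconds)
    } finally {
      system.terminate()
    }
  }
}
```

Each stage only asks its upstream neighbor for more elements when it has capacity, so no extra code is needed here to get back-pressure.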
Alpakka is an “open source initiative to implement stream-aware and reactive integration pipelines for Java and Scala”. It is built on top of Akka Streams and is Reactive Streams compliant, so it interoperates with other software that is also Reactive Streams compliant. Alpakka provides Akka Streams integrations for many different databases, along with classes and functions for common data transformations involving JSON, CSV, XML, and more. Alpakka Cassandra has its own section in the Alpakka documentation.
The demo project uses Alpakka Cassandra, which provides an Akka Streams API on top of CqlSession from the Datastax Java Driver for Apache Cassandra. In particular, Alpakka Cassandra provides CassandraSource and CassandraFlow, which are Akka Streams sources and flows respectively. CassandraSource builds an Akka Streams Source from CQL queries or from com.datastax.oss.driver.api.core.cql.Statement instances. CassandraFlow builds an Akka Streams Flow that runs CQL statements that change data (such as UPDATE or INSERT).
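As an illustration of the read side, the following sketch uses CassandraSource to stream rows out of a table. It is not code from the demo project: the object name `ReadTweets` and the keyspace and table `demo.tweets` are assumptions made for the example, and it presumes Alpakka Cassandra 2.x with a Cassandra node reachable through the default driver configuration.

```scala
import akka.actor.ActorSystem
import akka.stream.alpakka.cassandra.CassandraSessionSettings
import akka.stream.alpakka.cassandra.scaladsl.{CassandraSession, CassandraSessionRegistry, CassandraSource}
import akka.stream.scaladsl.Sink

import scala.concurrent.Future

// Hypothetical example object, not part of the demo project.
object ReadTweets {
  // Assumed keyspace and table names, chosen only for this sketch.
  val selectCql: String = "SELECT id, text FROM demo.tweets"

  def readAll()(implicit system: ActorSystem): Future[Seq[(Long, String)]] = {
    // The session wraps the driver's CqlSession and is shared via the registry.
    implicit val session: CassandraSession =
      CassandraSessionRegistry.get(system).sessionFor(CassandraSessionSettings())

    CassandraSource(selectCql)                                 // Source of driver Row objects
      .map(row => (row.getLong("id"), row.getString("text")))  // pick out the two columns
      .runWith(Sink.seq)                                       // collect into a Future
  }
}
```

Calling `ReadTweets.readAll()` requires a running Cassandra instance that contains the assumed table.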
The demo uses a standard Akka Streams Source queue as the source for a CassandraFlow which adds tweets by their id and text into a local Cassandra database. The output of CassandraFlow goes into Sink.ignore, which does nothing else with the tweets. This data stream is visualized in the image below:
Note that previous versions of Alpakka Cassandra included a CassandraSink, which was an Akka Streams Sink. However, such a Sink is equivalent to a CassandraFlow connected to Sink.ignore, and CassandraSink was therefore removed in later versions of Alpakka Cassandra.
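The pipeline described above (Source queue, then CassandraFlow, then Sink.ignore) could be wired up roughly as follows. This is a sketch under stated assumptions rather than the demo's actual code: the `TweetRow` case class, the `TweetPipeline` object, the `demo.tweets` table, and the buffer size of 100 are all invented for illustration, and the sketch assumes Alpakka Cassandra 2.x on Akka 2.6.

```scala
import akka.actor.ActorSystem
import akka.stream.OverflowStrategy
import akka.stream.alpakka.cassandra.{CassandraSessionSettings, CassandraWriteSettings}
import akka.stream.alpakka.cassandra.scaladsl.{CassandraFlow, CassandraSession, CassandraSessionRegistry}
import akka.stream.scaladsl.{Keep, Sink, Source, SourceQueueWithComplete}
import com.datastax.oss.driver.api.core.cql.{BoundStatement, PreparedStatement}

// Hypothetical minimal tweet shape: only the id and text are stored.
final case class TweetRow(id: Long, text: String)

// Hypothetical example object, not part of the demo project.
object TweetPipeline {
  // Assumed keyspace and table names, chosen only for this sketch.
  val insertCql: String = "INSERT INTO demo.tweets (id, text) VALUES (?, ?)"

  // Binds one TweetRow to the placeholders of the prepared INSERT statement.
  val bind: (TweetRow, PreparedStatement) => BoundStatement =
    (tweet, statement) => statement.bind(Long.box(tweet.id), tweet.text)

  // Starts the stream and returns the queue that callers offer tweets to.
  def start()(implicit system: ActorSystem): SourceQueueWithComplete[TweetRow] = {
    implicit val session: CassandraSession =
      CassandraSessionRegistry.get(system).sessionFor(CassandraSessionSettings())

    Source
      .queue[TweetRow](100, OverflowStrategy.backpressure) // buffer size is an arbitrary choice
      .via(CassandraFlow.create(CassandraWriteSettings.defaults, insertCql, bind))
      .toMat(Sink.ignore)(Keep.left)                       // keep the queue, discard the output
      .run()
  }
}
```

A caller would then push elements with `queue.offer(TweetRow(1L, "hello"))`; each offered element is written to Cassandra and then dropped by Sink.ignore.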
Finally, configuration for the Datastax Java Driver used by Alpakka Cassandra is written in the same configuration format that Akka uses, and can be placed in the same application.conf file as the project's other Akka settings.
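For a local single-node Cassandra, the driver section of application.conf might look like the fragment below. The contact point and datacenter name are assumptions that match a default local installation; they are not copied from the demo project, so adjust them to your own cluster.

```hocon
# Driver settings read by the Datastax Java Driver.
# Values below assume a default local single-node Cassandra.
datastax-java-driver {
  basic {
    contact-points = ["127.0.0.1:9042"]
    load-balancing-policy.local-datacenter = "datacenter1"
  }
}
```

Because this lives in application.conf, it sits alongside any `akka { ... }` settings the project already defines.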
Twitter4S is a Twitter client written in Scala that uses Twitter API v1.1 and is implemented with Akka HTTP. Twitter4S makes it easy to pull data from Twitter in a Scala application without having to deal with the Twitter API directly.
To use the client, you need a Twitter developer account and application, both of which can be applied for at https://developer.twitter.com/en. Additionally, Twitter4S requires JDK 8 rather than a more recent version. The demo project uses the TwitterStreamingClient class from Twitter4S, which supports the streaming connections offered by the Twitter Streaming API. In particular, the demo calls the .filterStatuses method on the streaming client with a parameter called tracks, which yields a stream of tweets containing any of a given set of hashtags.
The demo project also passes a function to .filterStatuses that filters the output of the Twitter stream and queues tweets that are not retweets onto the Source queue feeding the CassandraFlow from Alpakka Cassandra. Running tweets through this flow writes them to Cassandra. The relevant source code for filterStatuses is in the Twitter4S class TwitterStatusClient (which is extended by TwitterStreamingClient).
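The filtering step could look something like the sketch below. It is an approximation, not the demo's source: the `HashtagStream` object, the `isRetweet` helper, and the `enqueue` callback are hypothetical names, and the sketch assumes a Twitter4S version whose Tweet entity exposes `retweeted_status` as an `Option[Tweet]` and whose `TwitterStreamingClient()` reads API credentials from configuration or environment variables.

```scala
import com.danielasfregola.twitter4s.TwitterStreamingClient
import com.danielasfregola.twitter4s.entities.Tweet
import com.danielasfregola.twitter4s.entities.streaming.StreamingMessage

// Hypothetical example object, not part of the demo project.
object HashtagStream {
  // Assumption: a retweet is a tweet that carries a retweeted status.
  def isRetweet(retweetedStatus: Option[Tweet]): Boolean = retweetedStatus.isDefined

  // Builds the partial function handed to filterStatuses.
  // `enqueue` stands in for whatever pushes a tweet onto the Source queue.
  def handler(enqueue: Tweet => Unit): PartialFunction[StreamingMessage, Unit] = {
    case tweet: Tweet if !isRetweet(tweet.retweeted_status) => enqueue(tweet)
  }

  // Opens the streaming connection; needs valid Twitter API credentials.
  def start(hashtags: Seq[String], enqueue: Tweet => Unit) = {
    val client = TwitterStreamingClient()
    client.filterStatuses(tracks = hashtags)(handler(enqueue))
  }
}
```

With a running stream queue, `enqueue` could be a small function that calls `queue.offer` with the tweet's id and text, so that only original tweets reach the CassandraFlow.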
SBT is an interactive build tool for Scala and Java projects. As with a large majority of Scala projects, this application is built using SBT. More info on SBT can be found at the official website: https://www.scala-sbt.org/
The GitHub repository for the demo project can be found at https://github.com/Anant/example-cassandra-alpakka-twitter. Along with the source code, the repository includes detailed requirements and instructions for running the demo. A brief rundown of the process for running the demo goes as follows:
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra, but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!