In Apache Cassandra Lunch #75: Getting Started with DataStax Enterprise (DSE) on Docker, we discussed some of the applications that make up the DataStax ecosystem. We pulled Docker images of the applications we were interested in and worked with DSE Search, DSE Analytics with Spark, and DSE Graph on Docker Desktop, learning about these tools and their strengths along the way. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!
In this article, we are going to look at getting started with DataStax Enterprise on Docker. Recently, there has been a shift toward solving data problems with NoSQL databases, because some problems are very difficult to handle with relational databases.
One of the attractive features of NoSQL databases is the ability to scale out when a heavy workload would otherwise push an application offline. Customers are not good at waiting for slow responses, so your application must be available at all times, with no downtime.
Because Cassandra is known to be scalable and highly available, many companies use Apache Cassandra as their application database. DataStax has done a great deal of work building data architectures for applications, microservices, and experiences that require data sovereignty, availability, scale, agility, and accessibility by any user. It has leveraged Apache Cassandra to build enterprise products that make application deployment much easier.
One of these products is DataStax Enterprise (DSE). It is built on Apache Cassandra, which is well known for its uptime and low latency, and it can handle massive data at a planetary scale.
Several packages and capabilities have been introduced into the DataStax Enterprise ecosystem, and we are going to look at some of them. We will provision this software in Docker containers and work with DSE Search, DSE Analytics (Spark), and DSE Graph to demonstrate handling data workloads.
DataStax Enterprise (DSE)
DSE offers different capabilities that you can leverage to handle different data problems. We are going to look at DataStax Studio and the DataStax Enterprise server, which comes with DSE Search, DSE Analytics with Spark, and DSE Graph.
DataStax Enterprise Search
DSE Search allows you to quickly find data and provide a modern search experience for your users, helping you create features like product catalogs, document repositories, ad-hoc reporting engines, and more. One restriction of Apache Cassandra is that it is hard to query a table in ways it was not designed to support beforehand. Cassandra offers materialized views and secondary indexes as solutions, but they are not flexible, because managing these views and indexes requires some tricks, and indexing data types like tuples and user-defined types is a lot of work. DSE Search is aimed at this gap. Let's get straight to it.
Install DataStax Enterprise server and enable Search capability
$ docker pull datastax/ddac:5.1.17
$ docker pull datastax/dse-server:6.8.16
$ docker pull datastax/dse-opscenter:6.8.15
$ docker pull datastax/dse-studio:6.8.15
With all the images pulled, we can start a container from each of them using the commands below. Each command names the same tag that was pulled, so Docker does not fetch a different version. Note that --link has to come before the image name, because anything after the image name is passed to the container as arguments.
$ docker run -e DS_LICENSE=accept --name my-ddac -d datastax/ddac:5.1.17
$ docker run -e DS_LICENSE=accept -p 7080:7080 -p 7081:7081 --name datastax_server -d datastax/dse-server:6.8.16 -k -s -g
$ docker run -e DS_LICENSE=accept -p 8888:8888 --name my-opscenter -d datastax/dse-opscenter:6.8.15
$ docker run -e DS_LICENSE=accept --link datastax_server --name my-studio -p 9091:9091 -d datastax/dse-studio:6.8.15
Use the command below to get into the running containers.
$ docker exec -it <container name> /bin/bash
Inside the container, use the command below to check the status of the node. It tells you whether your node is running or not.
$ dsetool status
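A DSE node can take a while to come up after docker run returns, so the status check may fail the first few times. The following is a minimal sketch of a retry helper for that situation. The function name wait_for is hypothetical, and the example assumes that dsetool status exits with a non-zero code while the node is still starting.

```shell
# wait_for ATTEMPTS DELAY CMD [ARGS...]
# Runs CMD until it succeeds, at most ATTEMPTS times, sleeping DELAY
# seconds after each failed try. Returns 0 on success, 1 otherwise.
# (Hypothetical helper, not part of DSE or Docker.)
wait_for() {
  _wf_attempts=$1
  _wf_delay=$2
  shift 2
  _wf_i=1
  while [ "$_wf_i" -le "$_wf_attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$_wf_delay"
    _wf_i=$((_wf_i + 1))
  done
  return 1
}

# Example, using the container name from the docker run command above
# (assumes dsetool status fails until the node is reachable):
# wait_for 30 10 docker exec datastax_server dsetool status
```

The helper takes the command as separate arguments rather than a string, so it can wrap any check without extra quoting.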
In my case, I am going to start working with the DSE server. Notice that the trailing flags in the first line below enable the workloads: -k for Analytics with Spark, -s for DSE Search, and -g for Graph.
$ docker run -e DS_LICENSE=accept -p 7080:7080 -p 7081:7081 --name datastax_server -d datastax/dse-server:6.8.16 -k -s -g
$ docker exec -it <container name> /bin/bash
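To make the meaning of the start flags explicit, here is a small sketch that maps workload names to the flags used in the command above. The function name dse_flags and the workload names are hypothetical; only the -k, -s, and -g flags come from the command itself.

```shell
# dse_flags WORKLOAD...
# Prints the dse-server start flags for the given workloads.
# (Hypothetical helper; workload names are chosen for illustration.)
dse_flags() {
  _df_flags=""
  for _df_w in "$@"; do
    case "$_df_w" in
      analytics) _df_flags="$_df_flags -k" ;;  # Analytics with Spark
      search)    _df_flags="$_df_flags -s" ;;  # DSE Search
      graph)     _df_flags="$_df_flags -g" ;;  # DSE Graph
      *) echo "unknown workload: $_df_w" >&2; return 1 ;;
    esac
  done
  # Strip the leading space left by the first append.
  printf '%s\n' "${_df_flags# }"
}

# Example: start a node with only Search enabled (left unquoted on
# purpose so that multiple flags split into separate arguments):
# docker run -e DS_LICENSE=accept --name datastax_server -d \
#   datastax/dse-server:6.8.16 $(dse_flags search)
```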
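Use CQL to create the search index. By default the index covers all columns in the table and is created on all the Search nodes in the datacenter. Run cqlsh inside the container to open the CQL shell. The first line below is the general form, the second applies it to our table, and the third restricts the index to selected columns.
cqlsh> CREATE SEARCH INDEX ON keyspace.table;
cqlsh> CREATE SEARCH INDEX ON voting_system.voters;
cqlsh> CREATE SEARCH INDEX ON voting_system.voters WITH COLUMNS column1, column2, column3, ...;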
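Once a search index exists, the table can be filtered from CQL through the solr_query column. The sketch below builds such a statement for the voting_system.voters table, using the same supporting:A filter that appears later in the Spark section. The function name solr_select is hypothetical, and the usage line assumes cqlsh is on the PATH inside the datastax_server container.

```shell
# solr_select KEYSPACE TABLE QUERY
# Prints a CQL SELECT that filters a search-indexed table with solr_query.
# QUERY is a Solr query string such as "supporting:A".
# (Hypothetical helper; it does no escaping, so QUERY must not contain
# single quotes.)
solr_select() {
  printf "SELECT * FROM %s.%s WHERE solr_query = '%s';\n" "$1" "$2" "$3"
}

# Example (assumes the container name used in this article):
# docker exec datastax_server cqlsh -e "$(solr_select voting_system voters 'supporting:A')"
```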
DataStax Enterprise with Analytics using Spark
DataStax Enterprise (DSE) integrates real-time and batch operational analytics capabilities with an enhanced version of Apache Spark. With DSE Analytics you can easily generate ad-hoc reports, target customers with personalization, and process real-time streams of data. The analytics toolset lets you write code once and then use it for both real-time and batch workloads.
Use DSE Analytics to analyze huge databases. DSE Analytics includes integration with Apache Spark, the framework that supports our analytics applications. Spark is a distributed computation engine designed for big data and in-memory processing. According to Apache Spark, it supports interactive and batch analytics and is up to 100 times faster than Hadoop. Spark also requires 5 to 10 times less code than Hadoop while remaining efficient and scalable. Fault tolerance is another of its features.
Some of the advantages of using DSE Analytics include the following;
Use the command below to get to the Spark shell.
$ docker exec -it <container name> /bin/bash
$ dse spark
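Use the Spark Scala commands below to load the Cassandra table, filter it through the search index, and display the result. Calling show() on its own line keeps voters_table as a DataFrame that can be reused, since show() itself returns nothing.
scala> val table = spark.read.format("org.apache.spark.sql.cassandra").options(Map("keyspace" -> "voting_system", "table" -> "voters")).load()
scala> val voters_table = table.select("voterid", "city", "state", "supporting").where("solr_query='supporting:A'")
scala> voters_table.show()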
DataStax Enterprise Graph
DSE Graph is a distributed graph database that is optimized for fast data storage and traversals. It is designed for zero downtime and for real-time analysis of complex, disparate, and related datasets. DSE Graph can scale to massive datasets and execute both transactional and analytical workloads (OLTP and OLAP). It incorporates all of the enterprise-class functionality found in DataStax Enterprise, including advanced security protection, built-in DSE Analytics, DSE Search functionality, visual management and monitoring, and development tools including DataStax Studio.
DSE Graph is built on top of Apache TinkerPop, Apache Cassandra, Apache Solr, and Apache Spark. It uses Apache TinkerPop standards for data and traversal, Apache Cassandra for scalable storage and retrieval, Apache Solr for search and indexing capabilities, and Apache Spark for fast analytic traversal. All these components are integrated into DSE Graph to form a real-time graph database management system.
DSE Graph supports both transactional and analytic workloads, using two different engines. The analytic engine relies solely on Spark, which comes as part of the DSE product.
Get into the Gremlin console using the below command.
$ docker exec -it <container name> /bin/bash
$ dse gremlin-console
Check if a graph called voting_system exists in the system.
Create a new graph using the below command.
Get a list of graphs available in the system.
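The Gremlin commands for the three steps above are not shown in this article. As a hedged sketch, the helper below prints what they would typically look like with the DSE Graph system API, one line per step in the same order. The function name graph_steps is hypothetical, and the exact calls may differ between DSE versions, so check them against the documentation for the version you run.

```shell
# graph_steps GRAPH_NAME
# Prints Gremlin system-API calls to check for, create, and list graphs.
# (Hypothetical helper; the printed lines are meant to be pasted into
# the Gremlin console started with `dse gremlin-console`.)
graph_steps() {
  cat <<EOF
system.graph('$1').exists()
system.graph('$1').create()
system.graphs()
EOF
}

# Example:
# graph_steps voting_system
```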
I find it very easy to work with graph and CQL commands in DataStax Studio, because the interactive notebook makes things a lot easier. We are going to link DataStax Studio with the DSE server and work on the graph database from Studio. Open the command line interface and use the command below to link the two containers, then access Studio at http://localhost:9091/. The --link option must come before the image name, otherwise it is passed to the container instead of to Docker.
$ docker run -e DS_LICENSE=accept --link <datastax_server_container_name> --name <my-studio-name> -p 9091:9091 -d datastax/dse-studio:6.8.15
With DataStax Studio and the DSE server connected, we can now start defining our graph vertices, edges, and properties. To recap what we covered in this article: we looked into various packages and capabilities of DSE, created search indexes on our Cassandra table, queried and worked with the table in Spark, and then proceeded to create a graph database in DataStax Studio.