This resource for monitoring DataStax, Cassandra, Spark, and Solr performance is the first iteration of a longer initiative to create the best knowledge base on real-time data platform technologies such as DataStax Enterprise (Cassandra, Spark, and Solr), as well as Kafka, Docker, and Kubernetes. Our firm, Anant, has worked with Solr/Lucene for several years, picked up Spark and Cassandra along the way, and then made the logical move to become experts in, and partners with, DataStax.
DataStax OpsCenter is good, but we’re wise enough to say that it is just the beginning of the toolset needed to really understand what is happening under the hood in the component technologies that make up the DataStax Enterprise platform. When you monitor complex systems such as business platforms in order to scale them, you need to review all signals, not just those that come from the database.
One of the most fundamental principles of life and work that I’ve adopted is that making decisions based on factual data is the best way to move in the desired direction. If the goal is to lose weight, you first have to know how much you weigh, as well as what percentage of your body weight is muscle, fat, or water. If you want to get faster at running, you have to know how long it takes you to run 1 mile, and then 2, and so on.
In real-time business platforms, the technologies that make up the system are generally distributed. This means that they run on different computers, often in different parts of the world. They may also be running thousands or even millions of processes at the same time. It’s important to know what to measure, how to measure it, and which tools to measure it with.
The best way to measure progress is to measure progress. – Rahul Singh, CEO @ Anant
Distributed systems can be “simple” in terms of architecture but seem “complex” because of the moving pieces. That complexity comes from needing to see and correlate the different types of events and metrics that are happening in different parts of the system.
Just as there are different parts of a scalable business platform, there are different types of monitoring, and each has a purpose. One of the biggest reasons people measure and monitor their systems is to adhere to a Service Level Agreement (SLA) or an Operating Level Agreement (OLA). “Make our Platform faster” may be an objective in your team’s OKRs, but be sure to have a key result that’s tied to establishing a baseline, or to improving the baseline in relation to actual business goals around SLAs and OLAs.
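To make the idea of a baseline concrete, here is a minimal sketch of turning raw latency samples into a baseline number and comparing it with an SLA target. Everything in it is an assumption for illustration: the function names `percentile` and `check_sla`, the choice of the 95th percentile, the nearest-rank method, and the sample values are hypothetical and do not come from any particular monitoring tool.

```python
import math


def percentile(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_sla(samples, target_ms, pct=95):
    """Compare the measured baseline (a latency percentile) with an SLA target."""
    baseline = percentile(samples, pct)
    return {
        "baseline_ms": baseline,
        "target_ms": target_ms,
        "within_sla": baseline <= target_ms,
    }


if __name__ == "__main__":
    # Hypothetical request latencies in milliseconds, with one slow outlier.
    latencies_ms = [12, 15, 11, 14, 250, 13, 16, 12, 18, 14]
    print(check_sla(latencies_ms, target_ms=100))
```

A percentile is used here rather than an average because a single slow request (the 250 ms sample) can hide inside a healthy-looking mean while still breaking the agreement for the users who hit it.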
Over the years I’ve seen that most businesses that need to manage a business-critical platform end up centralizing their metrics and logging into one system. The most common are Splunk, ELK (Elasticsearch, Logstash/Beats, Kibana), ELG (the same but with Grafana), Graylog, SumoLogic, Sematext, DataDog, New Relic, and more recently Prometheus or Graphite with Grafana (read below for links). These resources and our descriptions can help you decide what is good for you.
These resources were placed here as a starting point for you to figure out what’s going on with your Cassandra, Spark, or Solr cluster. Whatever you end up using, remember the goal: Measuring & monitoring for performance and stability. If you can’t quantify what performance or stability means in something similar to a Service Level Agreement, then all this is just for show. Need help with scaling a business platform built on these components? I’d love to chat and offer some thoughts. Send me an email with your question.