TableAnalyzer is a tool for analyzing Cassandra (CFStats/TableStats) output that visualizes variance in metrics between nodes. We use TableAnalyzer to generate a conditionally-formatted spreadsheet that can be used to perform data model review.
This is the first of many tools we’ve been working on that help understand what’s going on in a Cassandra cluster. It’s not supposed to replace OpsCenter, Prometheus+Grafana, or other tools out there which chart various Cassandra metrics over time.
Instead, TableAnalyzer takes the table stats from all of the nodes at a specific point in time in order to compare across nodes rather than comparing across time. This allows us to identify issues such as tombstones, wide partitions, and different kinds of skew. The goal is to eventually build intelligence in here to help people build better tables through analysis of the keys, etc.
This tool was created because we had a hard time explaining to non-believers that data-model issues were the reason their “amazing app” crashed all the time, and that it wasn’t Cassandra’s fault. TableAnalyzer generates a conditionally formatted spreadsheet which makes it easy to not just locate potential data modeling issues, but also provide quantifiable statistics to back it up.
TableAnalyzer can be broken down into two functions:
1) Collect cfstats/tablestats Output (“Receive”)
2) Visualize cfstats/tablestats in an Excel Spreadsheet and JSON file (“Transform”).
TableAnalyzer can collect the cfstats/tablestats by:
*By Cassandra compatible instance, we are referring to DSE, open source Cassandra, Elassandra, or Scylla. However, note that so far we have only tested with DSE and open source Cassandra.
Once it has the stats, TableAnalyzer parses and transforms the stats into a CSV file format, which then is transformed into a conditionally formatted Excel file (xlsx). It also creates the data set as JSON which can be then sent to ElasticSearch, New Relic, etc. for further visualization.
Table Analyzer can collect the cfstats/tablestats by:
This option is only possible when using DSE and having access to OpsCenter.
The DSE OpsCenter tarball has the cfstats/tablestats output contained within it, so we will not need to run nodetool cfstats or nodetool tablestats, but will need to extract the cfstats/tablestats output from the tarball.
We have written a script for this as well:
# unzip tarball
tar xzvf <path-to-unarchived-diagnostics-tarball-dir>/<tarball-filename>.tar.gz -C .
# get tablestats into our data directory
# python3 cfstats.receive.diag.py <region> <cluster> <datacenter> <path-to-unarchived-diagnostics-tarball-dir> <get-all-data {True|False}> <cassandra-major-version> <debug {0|1}>
python3 cfstats.receive.diag.py uswest prod cassandra <path-to-unarchived-diagnostics-tarball-dir> True 3 1
Another way to get the tablestats output is to manually run the tablestats command on a local node that is running Cassandra. This command will have to be run several times, with different a different host specified in the environments.yaml
file for each node in order to get tablestats output on a multi-node cluster.
# python3 cfstats.receive.py <region> <cluster> <datacenter> <debug {0|1}>
<p>python3 cfstats.receive.py uswest prod cassandra 1</p>
Using SSH can make this whole process much faster.
First, change settings.json so that our script knows to use SSH:
vim ./config/settings.json
# change useSSH to "True"
It should now look like this:
{
"connection":{
"useDiag" : "False",
"useSSH": "True",
"detectTopology":"False"
}
}
Now run the script (can use same arguments as when executing locally):
# python3 cfstats.receive.py <region> <cluster> <datacenter> <debug {0|1}>
python3 cfstats.receive.py uswest prod cassandra 1
At this point, there should now be .txt files, one for each node, in the ./data/<region>/<cluster>/
directory (e.g., ./data/uswest/prod
).
Now that we have our tablestats output written to .txt files, we want to convert these files into a csv, and then convert that csv into a .xlsx format. The advantage of using the Excel format is being able to use formatting, and in particular conditional formatting. This conditional formatting will make it easy to see differences between nodes and outliers.
# python3 cfstats.transform.py <region> <cluster> <datacenter> <cassandra-major-version> <debug {0|1}>
python3 cfstats.transform.py uswest prod cassandra 3 1
./data/uswest/prod
directoryuswest.prod.cfstats.csv
, uswest.prod.cfstats.issues.csv
, uswest.prod.cfstats.pivot.csv
for node) (e.g., uswest.prod.cfstats.pivot.json
for json file)uswest.prod.cfstats.pivot.csv
is the most important oneFor this we use csv2formattedxls.py
:
# python3 ./csv2formattedxls.py ./data/<region>/<cluster>/<region>.<cluster>.cfstats.pivot.csv ./data/<region>/<cluster>/<region>.<cluster>.cfstats.pivot.xlsx
python3 ./csv2formattedxls.py data/uswest/prod/uswest.prod.cfstats.pivot.csv data/uswest/prod/uswest.prod.cfstats.xlsx
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms.If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!
Subscribe to our monthly newsletter below and never miss the latest Cassandra and data engineering news!