Using TableAnalyzer – Anant’s Tool for Analysis of Cassandra Tables

TableAnalyzer is a tool for analyzing Cassandra (CFStats/TableStats) output that visualizes variance in metrics between nodes. We use TableAnalyzer to generate a conditionally-formatted spreadsheet that can be used to perform data model review.

TableAnalyzer Among Other Cassandra Tools

This is the first of many tools we’ve been working on that help understand what’s going on in a Cassandra cluster. It’s not supposed to replace OpsCenter, Prometheus+Grafana, or other tools out there which chart various Cassandra metrics over time. 

Instead, TableAnalyzer takes the table stats from all of the nodes at a specific point in time in order to compare across nodes rather than comparing across time. This allows us to identify issues such as tombstones, wide partitions, and different kinds of skew. The goal is to eventually build intelligence in here to help people build better tables through analysis of the keys, etc.

Purpose

This tool was created because we had a hard time explaining to non-believers that data-model issues were the reason their “amazing app” crashed all the time, and that it wasn’t Cassandra’s fault. TableAnalyzer generates a conditionally formatted spreadsheet which makes it easy to not just locate potential data modeling issues, but also provide quantifiable statistics to back it up. 

What TableAnalyzer Actually Does

TableAnalyzer can be broken down into two functions:

1) Collect cfstats/tablestats Output (“Receive”)

2) Visualize cfstats/tablestats in an Excel Spreadsheet and JSON file (“Transform”).

1. Collect cfstats/tablestats Output

TableAnalyzer can collect the cfstats/tablestats by:

  • Using the extracted contents of a diagnostics tarball from DSE OpsCenter
  • Running nodetool cfstats or nodetool tablestats from a local Cassandra/Cassandra compatible* instance
  • Running nodetool cfstats or nodetool tablestats from a Cassandra/Cassandra compatible* instance through SSH

*By Cassandra compatible instance, we are referring to DSE, open source Cassandra, Elassandra, or Scylla. However, note that so far we have only tested with DSE and open source Cassandra.

2. Visualize cfstats/tablestats (Excel Spreadsheet and JSON)

Once it has the stats, TableAnalyzer parses and transforms the stats into a CSV file format, which then is transformed into a conditionally formatted Excel file (xlsx). It also creates the data set as JSON which can be then sent to ElasticSearch, New Relic, etc. for further visualization.

Collect Tablestats/CFStats Output

Table Analyzer can collect the cfstats/tablestats by:

  1. Using the extracted contents of a diagnostics tarball from DSE OpsCenter
  2. Running nodetool cfstats or nodetool tablestats from a local Cassandra/Cassandra compatible* instance
  3. Running nodetool cfstats or nodetool tablestats from a Cassandra/Cassandra compatible* instance through SSH

Option #1: Using the extracted contents of a diagnostics tarball from DSE OpsCenter

This option is only possible when using DSE and having access to OpsCenter. 

The DSE OpsCenter tarball has the cfstats/tablestats output contained within it, so we will not need to run nodetool cfstats or nodetool tablestats, but will need to extract the cfstats/tablestats output from the tarball. 

We have written a script for this as well:

# unzip tarball
tar xzvf <path-to-unarchived-diagnostics-tarball-dir>/<tarball-filename>.tar.gz -C .

# get tablestats into our data directory
# python3 cfstats.receive.diag.py <region> <cluster> <datacenter> <path-to-unarchived-diagnostics-tarball-dir> <get-all-data {True|False}> <cassandra-major-version> <debug {0|1}>
python3 cfstats.receive.diag.py uswest prod cassandra <path-to-unarchived-diagnostics-tarball-dir> True 3 1

Option #2: Using nodetool cfstats or nodetool tablestats from a local Cassandra instance

Another way to get the tablestats output is to manually run the tablestats command on a local node that is running Cassandra. This command will have to be run several times, with different a different host specified in the environments.yaml file for each node in order to get tablestats output on a multi-node cluster.

# python3 cfstats.receive.py <region> <cluster> <datacenter> <debug {0|1}>
<p>python3 cfstats.receive.py uswest prod cassandra 1</p>

Option #3: Using nodetool cfstats or nodetool tablestats from a Cassandra instance through SSH

Using SSH can make this whole process much faster.

First, change settings.json so that our script knows to use SSH:

vim ./config/settings.json
# change useSSH to "True"

It should now look like this:

{
  "connection":{
    "useDiag" : "False",
    "useSSH": "True",
    "detectTopology":"False"
  }
}

Now run the script (can use same arguments as when executing locally): 

# python3 cfstats.receive.py <region> <cluster> <datacenter> <debug {0|1}>
python3 cfstats.receive.py uswest prod cassandra 1

Expected result

At this point, there should now be .txt files, one for each node, in the ./data/<region>/<cluster>/ directory (e.g., ./data/uswest/prod).

Generate Spreadsheet

Now that we have our tablestats output written to .txt files, we want to convert these files into a csv, and then convert that csv into a .xlsx format. The advantage of using the Excel format is being able to use formatting, and in particular conditional formatting. This conditional formatting will make it easy to see differences between nodes and outliers.

Convert to CSV

# python3 cfstats.transform.py <region> <cluster> <datacenter> <cassandra-major-version> <debug {0|1}>
python3 cfstats.transform.py uswest prod cassandra 3 1

Expected result:

  • There should now  be .csv files, one for each node, in the ./data/uswest/prod directory
  • There might also be a few other outputted csv files that aggregate for each node and json file. (e.g., uswest.prod.cfstats.csv, uswest.prod.cfstats.issues.csv, uswest.prod.cfstats.pivot.csv for node) (e.g., uswest.prod.cfstats.pivot.json for json file)
    • uswest.prod.cfstats.pivot.csv is the most important one

Convert to spreadsheet

For this we use csv2formattedxls.py:

# python3 ./csv2formattedxls.py ./data/<region>/<cluster>/<region>.<cluster>.cfstats.pivot.csv ./data/<region>/<cluster>/<region>.<cluster>.cfstats.pivot.xlsx

python3 ./csv2formattedxls.py data/uswest/prod/uswest.prod.cfstats.pivot.csv data/uswest/prod/uswest.prod.cfstats.xlsx

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms.If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!


Join Anant's Newsletter

Subscribe to our monthly newsletter below and never miss the latest Cassandra and data engineering news!