Friday, April 1, 2011

Cassandra 0.7.x - Understanding the output of nodetool cfhistograms


Command - Usage and Output
Cassandra provides the nodetool cfhistograms command to print statistical histograms for a given column family. Its usage is:
./nodetool -h <host> -p <port> cfhistograms <keyspace> <column family>

The output of the command has the following six columns:
  • Offset
  • SSTables
  • Write Latency
  • Read Latency
  • Row Size
  • Column Count
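
To work with these columns programmatically, the output can be split into rows. The following is a minimal sketch, assuming each data row consists of six whitespace-separated integers in the column order listed above. The sample text and the parse_cfhistograms helper are made up for illustration; they are not real nodetool output or part of Cassandra.

```python
from typing import Dict, List

# Column names in the order listed above (the keys are our own labels).
COLUMNS = ["offset", "sstables", "write_latency", "read_latency",
           "row_size", "column_count"]


def parse_cfhistograms(text: str) -> List[Dict[str, int]]:
    """Parse rows of six integers; skip the header and anything else."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS) or not all(f.isdigit() for f in fields):
            continue  # header line, blank line, or unexpected layout
        rows.append(dict(zip(COLUMNS, map(int, fields))))
    return rows


# Hypothetical sample, shaped like the six columns described above.
SAMPLE = """\
Offset SSTables Write Latency Read Latency Row Size Column Count
1 120 0 0 0 0
2 35 4 0 0 10
3 7 90 2 0 55
"""

rows = parse_cfhistograms(SAMPLE)
# In this made-up sample, 7 reads each touched 3 SSTables.
print(rows[2]["offset"], rows[2]["sstables"])
```

If the real output on your version uses a different layout, the length check simply skips the lines it does not recognise, so an empty result is a hint that the assumption does not hold.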

Interpreting the output
  • Offset: The series of values to which the counts in the other five columns correspond, i.e. the X-axis values of the histograms. The unit depends on the column being read.
  • SSTables: The number of SSTables accessed per read. For example, if a read operation accessed 3 SSTables, you will find a positive count against Offset 3. The values are recent, i.e. they cover the time elapsed between two calls of the command.
  • Write Latency: The distribution of write operations across the Offset values, which here represent latency in microseconds. For example, if 100 operations each took about 5 microseconds, you will find a positive count against Offset 5.
  • Read Latency: Similar to Write Latency, but for reads. The values are recent, i.e. they cover the time elapsed between two calls of the command.
  • Row Size: The distribution of rows across the Offset values, which here represent size in bytes. For example, if you have 100 rows of 2000 bytes each, you will find a positive count against the offset whose bucket covers 2000 bytes.
  • Column Count: Similar to Row Size, except that the Offset values represent the number of columns in a row.
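
One way to read a single column is to total its counts and find the offset below which a given fraction of the observations falls. The following is a minimal sketch with made-up counts and a hypothetical helper named percentile_offset. The answer is always a bucket offset, so it is only as precise as the bucket boundaries.

```python
from typing import List, Tuple


def percentile_offset(hist: List[Tuple[int, int]], fraction: float) -> int:
    """Return the smallest offset whose cumulative count reaches `fraction`
    of all observations. `hist` holds (offset, count) pairs, ascending."""
    total = sum(count for _, count in hist)
    if total == 0:
        raise ValueError("histogram is empty")
    threshold = fraction * total
    running = 0
    for offset, count in hist:
        running += count
        if running >= threshold:
            return offset
    return hist[-1][0]


# Made-up Read Latency column: (offset in microseconds, number of reads).
read_latency = [(35, 10), (42, 40), (50, 30), (60, 15), (72, 5)]

median = percentile_offset(read_latency, 0.50)
p95 = percentile_offset(read_latency, 0.95)
print("median bucket:", median, "95th percentile bucket:", p95)
```

The same function applies to any of the five count columns; only the unit of the returned offset changes (SSTables per read, microseconds, bytes, or columns).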

Some additional details
  • Typically, a histogram plots values over discrete intervals. Cassandra likewise defines buckets. The number of buckets is one more than the number of bucket offsets; the extra, last bucket holds values greater than the last offset. The values in the Offset column of the output are the bucket offsets.
  • The bucket offsets start at 1 and grow by a factor of 1.2 each time (with rounding and removal of duplicates). By default they run from 1 to around 36M (creating 90+1 buckets), which gives timing resolution from microseconds to about 36 seconds, with less precision as the numbers get larger (see the EstimatedHistogram class).
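
The offset series can be reproduced with a short sketch. It follows the rule as stated above (multiply by 1.2, round, and move up by one when rounding yields a duplicate) and is not a copy of the EstimatedHistogram source. The names bucket_offsets and bucket_for are made up, and the lookup rule (a value lands in the first bucket whose offset is at or above it) is an assumption.

```python
from bisect import bisect_left
from typing import List


def bucket_offsets(size: int = 90) -> List[int]:
    """Build `size` offsets: start at 1, grow by a factor of 1.2, round,
    and step up by one whenever rounding would repeat the previous value."""
    offsets = [1]
    while len(offsets) < size:
        last = offsets[-1]
        nxt = int(last * 1.2 + 0.5)  # round to the nearest integer
        if nxt == last:
            nxt += 1  # avoid a duplicate offset
        offsets.append(nxt)
    return offsets


def bucket_for(value: int, offsets: List[int]) -> int:
    """Index of the bucket for `value` (assumed rule: first offset >= value).
    An index equal to len(offsets) is the extra overflow bucket."""
    return bisect_left(offsets, value)


offsets = bucket_offsets()
print(offsets[:13])  # the small offsets, where resolution is finest

idx = bucket_for(2000, offsets)
print("a 2000-byte row is counted against offset", offsets[idx])
```

Under these assumptions, small values get a bucket each, while larger values share increasingly wide buckets, which is why precision drops as the numbers grow.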