Apache Cassandra Administrator Certification

Last week I completed the Apache Cassandra 3.x Administrator Associate Certification exam. Though I didn’t find the exam to be exceptionally difficult, it is a comprehensive test of Cassandra’s architecture and requires understanding of key Cassandra concepts in order to score well. Here is my list of the concepts that should be thoroughly reviewed, and some tips for the exam.

Partitions, Partition Tokens, Primary Keys, Partition Keys, Clustering Columns, and Consistent Hashing

Rows in Cassandra must be uniquely identifiable by a Primary Key that is defined at table creation. The Primary Key consists of a Partition Key (made up of one or more columns) and zero or more Clustering Columns. For example, this CQL statement

CREATE TABLE crossfit_gyms_by_city (  
 country_code text,  
 state_province text,  
 city text,  
 gym_name text,  
 opening_date timestamp,  
 PRIMARY KEY ((country_code, state_province, city), opening_date, gym_name)
);

creates a table whose Partition Key is the composite of country_code, state_province, and city. Additionally, the table has 2 Clustering Columns: opening_date and gym_name.

As previously stated, any row in crossfit_gyms_by_city must be uniquely identifiable by the combined values of its Partition Key and Clustering Columns.

The Partitioner in Cassandra uses the Partition Key (in the example, (country_code, state_province, city)) to compute a hash value, called the Partition Token, that determines which node is responsible for the row. Cassandra distributes all data amongst its nodes via consistent hashing: each node owns one or more ranges of the token space, and a row is stored on the node whose range contains its Partition Token.

Queries against the table must at minimum provide values for country_code, state_province, and city in the WHERE clause, since without all 3 the Partitioner cannot compute the Partition Token to find the node holding the wanted Partition of data.

The Clustering Columns are subsequently used to sort and locate the record within the Partition.
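To make the token-to-node mapping concrete, here is a minimal, hypothetical sketch in Python. It uses MD5 over a toy 1000-slot token space in place of Cassandra's actual Murmur3 partitioner, and a made-up four-node ring; the names (partition_token, node_for_token) are illustrative, not Cassandra APIs.

```python
import hashlib
from bisect import bisect_left

def partition_token(partition_key_values):
    """Hash the partition key columns to a token (MD5 here as a toy;
    Cassandra's default partitioner actually uses Murmur3)."""
    key = ":".join(str(v) for v in partition_key_values).encode()
    return int(hashlib.md5(key).hexdigest(), 16) % 1000  # toy token space 0..999

# Each node owns the token range ending at its position on the ring.
ring = [(250, "node1"), (500, "node2"), (750, "node3"), (999, "node4")]

def node_for_token(token):
    """Walk the ring to the first node whose position covers the token."""
    positions = [pos for pos, _ in ring]
    idx = bisect_left(positions, token)
    return ring[idx % len(ring)][1]

token = partition_token(("US", "CA", "San Francisco"))
print(node_for_token(token))  # one of node1..node4, always the same for this key
```

The point of the sketch: the same partition key always hashes to the same token, so the same node always owns it, and no central directory is needed.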

Understand the Write Path - Commit Log, MemTable, and SSTables

When Cassandra receives a write request, the data is written to 2 different places: an on-disk, append-only Commit Log that is used to replay writes when a node restarts (very similar to the write-ahead logs of other DBMS like MySQL), and an in-memory MemTable where records are kept sorted by Primary Key. A write request is acknowledged as successful only once the data is written to both of these places. MemTables are periodically flushed to disk as immutable SSTables. SSTables are the files that actually contain the partitioned data and are what reads are served from.
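A toy model of that write path, sketched in Python (the class and structure names are illustrative stand-ins, not Cassandra's actual implementation): every write is appended to a log and applied to an in-memory table, and a flush produces an immutable, sorted "SSTable".

```python
import json

class ToyNode:
    def __init__(self):
        self.commit_log = []   # stand-in for the on-disk, append-only Commit Log
        self.memtable = {}     # in-memory table; sorted by key when flushed
        self.sstables = []     # immutable, sorted tables flushed from the memtable

    def write(self, key, value):
        # 1. Append to the commit log (used to replay writes on restart).
        self.commit_log.append(json.dumps({"key": key, "value": value}))
        # 2. Update the memtable.
        self.memtable[key] = value
        # Only after both succeed is the write acknowledged.
        return "ACK"

    def flush(self):
        # The memtable is written out as an immutable SSTable, sorted by key.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

node = ToyNode()
node.write("b", 2)
node.write("a", 1)
node.flush()
print(node.sstables[0])  # (('a', 1), ('b', 2)), sorted and immutable
```

Note that nothing here overwrites data in place: both the commit log and the flushed SSTables are append-only, which is what makes Cassandra writes so cheap.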

Understand the Read Path- Partitions, Partition Index, Partition Summaries

Cassandra reads from the SSTables stored on disk. In order to minimize seeks and paging, Cassandra maintains several indices over the SSTables. First, alongside each SSTable it writes a Partition Index that maps the Partition Tokens in that SSTable to the byte offsets where the corresponding Partitions are found. Additionally, Cassandra maintains 2 more structures in memory: a Partition Summary, which maps a sample of the Partition Tokens to byte offsets in the Partition Index, and a Bloom Filter, which Cassandra checks first to determine whether a Partition Token might exist in a given SSTable at all (again, much like other DBMS). With all that said, the read path through Cassandra goes like:

  1. Check the Bloom Filter to find out whether the Partition might exist in the SSTable; if it definitely doesn't, skip that SSTable.
  2. Get the byte offset into the Partition Index from the Partition Summary.
  3. Get the byte offset into the SSTable from the Partition Index.
  4. Seek to that byte offset to reach the partition containing the data.
  5. Read the data from the partition.

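The 5 steps above can be sketched as a toy lookup in Python. The structures here are simplified stand-ins, not Cassandra's on-disk formats, and the "Bloom Filter" is a plain set for brevity (a real Bloom filter can return false positives, but never false negatives).

```python
# Toy "SSTable": a list of (token, partition_data) rows; a row's list
# position stands in for its byte offset on disk.
sstable = [(5, "partition-5"), (17, "partition-17"), (42, "partition-42")]

# Partition Index: every token -> its offset in the SSTable.
partition_index = {tok: pos for pos, (tok, _) in enumerate(sstable)}

# Partition Summary: tokens -> offsets into the Partition Index.
index_entries = sorted(partition_index)            # [5, 17, 42]
partition_summary = {tok: i for i, tok in enumerate(index_entries)}

# Bloom Filter stand-in (a set answers exactly; a real one may
# say "maybe" for absent tokens, but never "no" for present ones).
bloom = set(partition_index)

def read(token):
    if token not in bloom:                         # step 1: Bloom Filter
        return None
    summary_off = partition_summary[token]         # step 2: summary -> index offset
    sstable_off = partition_index[index_entries[summary_off]]  # step 3: index -> SSTable offset
    _tok, data = sstable[sstable_off]              # steps 4-5: seek and read
    return data

print(read(17))  # partition-17
print(read(99))  # None
```

The layering matters for the exam: the Bloom Filter and Partition Summary live in memory, so a miss costs no disk I/O, and a hit needs only a couple of seeks.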
Understand Compaction- What it does, the different strategies, and how it's triggered

In order to keep reads performing well, Cassandra periodically needs to combine multiple SSTables, each of which may contain data for the same Partition, into a single SSTable. This process is called Compaction. There are multiple strategies for when and how to run Compaction: SizeTieredCompactionStrategy, LeveledCompactionStrategy, and TimeWindowCompactionStrategy. Each performs better under different conditions: write-heavy workloads, read-heavy workloads, or time-series data. Study these in detail, and study when compaction is triggered, such as after a MemTable is flushed to disk, or when executed manually using the nodetool CLI.
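A minimal sketch of what any compaction fundamentally does, under a simplified model where each SSTable is a dict of key -> (timestamp, value): several SSTables are merged into one, and the newest-timestamped write for each key wins. This is illustrative only, not any particular Cassandra strategy.

```python
def compact(sstables):
    """Merge several SSTables into one, keeping only the newest
    (timestamp, value) seen for each key."""
    merged = {}
    for table in sstables:
        for key, (ts, value) in table.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    return merged

old = {"a": (1, "v1"), "b": (1, "v1")}
new = {"a": (2, "v2")}           # a newer write to key "a"
print(compact([old, new]))       # {'a': (2, 'v2'), 'b': (1, 'v1')}
```

The strategies differ mainly in which SSTables they pick to merge and when, which is why each suits a different workload.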

Be familiar with the test format and be ready to be proctored.

The test is proctored via a rather ancient video chat client. If you're using a Chromium-based browser, make sure you know how to enable Flash on it. Additionally, make sure your webcam is of decent quality and that you're able to physically move it around. The proctor will ask you to show all 4 walls of your room as well as your desk. One more thing: prior to the exam, the proctor will run a script on your laptop that allows them to view your screen and check for other conditions. One of those conditions is that you're not running a VM. For some reason, Windows Subsystem for Linux triggered their script, so I had to move to another laptop. You may run into the same issue.

Review the videos on Datastax Academy

The DS201 and DS210 courses are excellent. Watch the videos, and work through the exercises. The exercises are very similar to what is covered in the exam.

With that in mind, study hard, and good luck!
