A common requirement is to keep a unique count of the occurrences of some event. To do this you obviously need to know whether the event has happened before, but implementing the usual “set if not exists” semantic in an eventually consistent store like Cassandra can be challenging.
Cassandra does support “INSERT … IF NOT EXISTS”, but this is intended for special cases. It is not performant in a cluster under high load, because multiple nodes must reach agreement before each insert can succeed, which slows down writes enormously.
Using IF NOT EXISTS incurs a performance hit associated with using Paxos internally
Last write wins
The Cassandra read and write path is already well documented, so I’ll be brief.
When Cassandra writes data it performs no seeks, only append-only writes. This means that if there is more than one write to the same primary key, the write with the latest timestamp decides the winning value. The last write wins. Most of the time this is exactly the behaviour you want, but not for our unique counter.
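A quick sketch of last-write-wins, using a hypothetical table `t` and explicit timestamps:

```sql
-- Hypothetical table: CREATE TABLE t (id int PRIMARY KEY, val text);
-- Two writes to the same primary key; the higher timestamp wins,
-- regardless of the order in which the writes arrive.
INSERT INTO t (id, val) VALUES (1, 'first')  USING TIMESTAMP 100;
INSERT INTO t (id, val) VALUES (1, 'second') USING TIMESTAMP 200;

-- Reads merge the versions and return the value written at timestamp 200.
SELECT val FROM t WHERE id = 1;
```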
First write wins
Let’s invent a somewhat contrived example: we want to count the number of users who have accepted a meeting invitation.
The table might look something like this CQL definition.
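A minimal sketch, assuming the meeting is keyed by a `meeting_id` and the count is held in a Cassandra counter column (both names are assumptions):

```sql
CREATE TABLE meetings (
    meeting_id uuid PRIMARY KEY,
    accepted_count counter
);
```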
We also need to know whether we have already counted a given user for a given meeting, because we don’t want to count the same user twice. Let’s create another table, meeting_users.
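A sketch of what meeting_users might look like. The regular `accepted` column is an assumption on my part: the WRITETIME function used later cannot be applied to primary-key columns, so the table needs at least one non-key column to query:

```sql
CREATE TABLE meeting_users (
    meeting_id uuid,
    user_id uuid,
    accepted boolean,
    PRIMARY KEY (meeting_id, user_id)
);
```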
Internally to Cassandra the table looks a bit like this: meeting_id is the partition key, and each user_id becomes a clustering column within that partition.

| meeting_id (partition key) | user_id 1 | user_id 2 | … |
So how do we set-if-not-exists without some kind of distributed lock? Another feature of the Cassandra INSERT statement is the USING TIMESTAMP clause, which lets you provide the timestamp value yourself instead of having it generated silently from the current time. Given that the highest timestamp value wins, we want a value that gets smaller over time, so that the first write always wins.
We can do this by using (long.MaxValue – current timestamp) as the timestamp value on the INSERT request. Let’s say we calculate this to be 9223372035433107000; then our INSERT will look like this.
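A sketch of that INSERT, using the meeting_users columns assumed above (the `?` markers stand in for the bound meeting and user ids):

```sql
-- The explicit timestamp is (long.MaxValue - current timestamp),
-- so it shrinks over time: the earliest write carries the highest
-- timestamp and therefore always wins the merge.
INSERT INTO meeting_users (meeting_id, user_id, accepted)
VALUES (?, ?, true)
USING TIMESTAMP 9223372035433107000;
```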
Did my write win?
Now we need a way to know the result of that INSERT: did the row already exist? This can be easily achieved using the WRITETIME function.
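A sketch of that read, querying the write time of the assumed regular `accepted` column (WRITETIME cannot target primary-key columns):

```sql
SELECT WRITETIME(accepted)
FROM meeting_users
WHERE meeting_id = ?
  AND user_id = ?;
```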
If this result matches the timestamp we provided (9223372035433107000), then ours was the first write of that primary key and we can increment our counter. If it doesn’t match, our insert was effectively ignored, and its data will be dropped during compaction. Until that happens, extra cost is incurred on each read to merge out the superseded writes.
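When the write does win, the increment itself is a plain counter update on the meetings table sketched earlier (table and column names are my assumptions):

```sql
UPDATE meetings
SET accepted_count = accepted_count + 1
WHERE meeting_id = ?;
```

Note that the counter update and the uniqueness check are separate statements, so a crash between them can still lose or double an increment; the trick only guards against counting the same user twice under normal operation.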
Compaction merges data and consolidates SSTables. To reduce the amount of data read during a query, it’s probably a good idea to use the DateTieredCompactionStrategy on the unique-set table (meeting_users in this example).
If you have any opinion on why this is a terrible idea, I’d really love to hear it!