CODEZUKI

A blog about coding.

A Unique Counter With Cassandra

A common requirement is to record a unique count of the occurrences of some event. To do this you have to know whether the event has happened before, but the standard “set if not exists” semantic is challenging to implement in an eventually consistent environment like Cassandra.

Cassandra does support “INSERT … IF NOT EXISTS”, but it is intended for special cases and is not performant in a cluster under high load, because agreement has to be reached between multiple nodes to decide whether or not the insert can succeed. This slows down writes enormously.

Using IF NOT EXISTS incurs a performance hit associated with using Paxos internally

– DataStax INSERT reference

Last write wins

The Cassandra read and write path is already well documented, so I’ll be brief.

When Cassandra writes data it performs no seeks, only appending writes. This means that if there is more than one write to the same primary key, the write with the latest timestamp provides the winning value. The last write wins. Most of the time this is exactly the behaviour you want, but not for our unique counter.

First write wins

Let’s invent a somewhat contrived example. We want to count the number of users who have accepted a meeting invitation.

The table might look something like this CQL definition.

CREATE TABLE meeting_acceptance_counts (
  meeting_id uuid PRIMARY KEY,
  acceptance_count counter
);

We also need to know if we have counted this user for this meeting before, because we don’t want to count the same user twice. Let’s create another table, meeting_users.

CREATE TABLE meeting_users (
  meeting_id uuid,
  user_id uuid,
  PRIMARY KEY(meeting_id, user_id)
);

Internally to Cassandra the table looks a bit like this.

partition key : meeting_id
column name   : user_id
column value  : CEC01088-2F5A-41E1-93D7-88AAD61B022F
timestamp     : 1421668551
ttl           : not used

So how do we set if not exists without using some kind of distributed lock? Another feature of the Cassandra INSERT statement is the ability to provide the value that will be used for the timestamp, instead of it being generated silently based on the current time. Given that the highest timestamp value wins, we want a value that gets smaller over time so that the first write always wins.

We can do this by using (long.MaxValue - current timestamp) as the timestamp value along with the INSERT request. Let’s say we calculate this to be 9223372035433107000; then our INSERT will look like this.

INSERT INTO meeting_users (meeting_id, user_id) VALUES (379CCAB3-9773-44A7-8520-BF6E0D5CD56A, B874527A-DB0F-499C-BB5F-80C76DFBAAE1) USING TIMESTAMP 9223372035433107000;
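
If you’re curious where that number comes from: long.MaxValue is 9223372036854775807 (2^63 - 1), and the example subtracts the current epoch seconds from it. Here’s a minimal Node.js sketch of that calculation (my own illustration, not from the original post; it assumes a Node version with BigInt support, 10.4 or later):

// Derive a timestamp that shrinks as wall-clock time advances, so the
// EARLIEST write carries the largest value and therefore always wins.
const LONG_MAX = 2n ** 63n - 1n; // 9223372036854775807

function firstWriteWinsTimestamp() {
  const nowSeconds = BigInt(Math.floor(Date.now() / 1000));
  return LONG_MAX - nowSeconds; // e.g. 9223372035433107000n in January 2015
}

Any unit works as long as you use the same calculation when you compare against the stored write time later; seconds match the value in the example above.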

Did my write win?

Now we need a way to know the result of that INSERT: did the row already exist? This is easily achieved using the WRITETIME function.

SELECT WRITETIME(user_id)
FROM meeting_users
WHERE meeting_id = 379CCAB3-9773-44A7-8520-BF6E0D5CD56A
AND user_id = B874527A-DB0F-499C-BB5F-80C76DFBAAE1;

If this result matches the timestamp we provided (9223372035433107000), then this was the first write of that primary key and we can increment our counter. If it doesn’t match, our insert has effectively been ignored and its data will be dropped during compaction; until that happens, reads of that key incur a little extra cost merging out the losing writes.
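
Here is how the whole insert-check-increment flow might look in Node.js with the cassandra-driver package. This is a sketch of my own rather than production code: the contact point and keyspace are placeholder assumptions, and it reuses the firstWriteWinsTimestamp helper from the earlier sketch.

const cassandra = require('cassandra-driver');

const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1', // required by recent driver versions
  keyspace: 'my_keyspace',        // assumed keyspace holding both tables
});

async function countAcceptance(meetingId, userId) {
  const ts = firstWriteWinsTimestamp(); // from the sketch above

  // Attempt the write with our inverted timestamp. Interpolating the
  // value keeps the sketch short; it is a plain CQL bigint literal.
  await client.execute(
    `INSERT INTO meeting_users (meeting_id, user_id)
     VALUES (?, ?) USING TIMESTAMP ${ts}`,
    [meetingId, userId],
    { prepare: true }
  );

  // Read the stored write time to find out whether our write won.
  const result = await client.execute(
    `SELECT WRITETIME(user_id) AS wt FROM meeting_users
     WHERE meeting_id = ? AND user_id = ?`,
    [meetingId, userId],
    { prepare: true }
  );

  // Our timestamp survived: this was the first write, so count it.
  if (result.rows[0].wt.toString() === ts.toString()) {
    await client.execute(
      `UPDATE meeting_acceptance_counts
       SET acceptance_count = acceptance_count + 1
       WHERE meeting_id = ?`,
      [meetingId],
      { prepare: true }
    );
  }
}

Note that interpolating the timestamp into the CQL string defeats prepared-statement caching; the USING clause also accepts a bind marker (USING TIMESTAMP ?) if you’d rather prepare it properly.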

Compaction

Compaction merges data and consolidates SSTables. To reduce the amount of reading required during a query, it’s probably a good idea to use the DateTieredCompactionStrategy on the unique-set table (meeting_users in this example).
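
If you want to try that, switching the table’s compaction strategy is a one-line schema change, shown here through the same driver client from the sketch above. (DateTieredCompactionStrategy arrived in Cassandra 2.0.11/2.1.1; newer releases deprecate it in favour of TimeWindowCompactionStrategy.)

// Sketch: move the unique-set table onto date-tiered compaction.
client.execute(
  "ALTER TABLE meeting_users WITH compaction = { 'class': 'DateTieredCompactionStrategy' }"
);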

Feedback

If you have any opinion on why this is a terrible idea, I’d really love to hear it!

An Embedded Shopify App With Node.js

When I decided to scratch my own itch and build a Shopify App, I quickly realised that unless you are using Ruby there is a shortage of instruction on how to do so. In truth the Shopify API documentation is pretty good but necessarily generic, so I embarked upon my new project keen to not only build my app, but also to help the next guy too.

The source code for a bare-bones Node.js app which authenticates using OAuth 2.0 is shared here on GitHub. It’s just enough to take you from OAuth authentication through to rendering a “Hello World” view in the Shopify admin panel.

Prerequisites

Before proceeding you first need to follow the Shopify developer’s Getting Started guide.

Once you have an app icon appearing in the “Apps” panel of your account you are ready to continue.

How it works

The OAuth process is a little more involved for an Embedded App. This is because Shopify prevents admin URLs from being loaded inside an iframe, so the initial request must escape the iframe to initiate the process.

Here is a slightly simplified version of events:

  1. The user clicks on your app icon in their account’s App page
  2. A request is made inside the iframe to your app which is hosted at the URL you defined when you created the app, for example: http://localhost:3000
  3. This request arrives at “http://localhost:3000/”
  4. If there is already an Access Token on your current session, redirect to “/render_app”
  5. If there is no current Access Token then redirect to “/auth_app”
  6. “/auth_app” redirects to “/escape_iframe” which will render a view which performs a JavaScript top window location change to “/auth_code”
  7. “/auth_code” performs the first OAuth request to Shopify, providing “/auth_token” as the redirect URL
  8. “/auth_token” receives the Access Token, saves it to the session and redirects to “/render_app”
  9. “/render_app” renders your main app view
  10. You may now use this Access Token to perform further API requests as required by your application.
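
Putting those steps together, here is a condensed Express sketch of the route flow. This is my own illustration rather than the repository’s exact code: the session setup, environment variables and scope value are placeholder assumptions, the error handling and HMAC/state verification a real app needs are omitted, and it assumes Node 18+ for the built-in fetch. The two Shopify OAuth endpoints (/admin/oauth/authorize and /admin/oauth/access_token) are the documented ones.

const express = require('express');
const session = require('express-session');

const API_KEY = process.env.SHOPIFY_API_KEY;       // your app's API key
const API_SECRET = process.env.SHOPIFY_API_SECRET; // your app's shared secret
const SCOPE = 'read_products';                     // placeholder; use the scopes your app needs

const app = express();
app.use(session({ secret: 'change-me', resave: false, saveUninitialized: true }));

// Steps 3-5: the entry point, loaded inside the Shopify admin iframe.
app.get('/', (req, res) => {
  if (req.session.accessToken) return res.redirect('/render_app');
  res.redirect('/auth_app?shop=' + req.query.shop);
});

app.get('/auth_app', (req, res) => {
  res.redirect('/escape_iframe?shop=' + req.query.shop);
});

// Step 6: a view that breaks out of the iframe with a top window location change.
app.get('/escape_iframe', (req, res) => {
  res.send(
    `<script>window.top.location.href = '/auth_code?shop=${req.query.shop}';</script>`
  );
});

// Step 7: the first OAuth request, giving /auth_token as the redirect URL.
app.get('/auth_code', (req, res) => {
  const redirectUri = 'http://localhost:3000/auth_token';
  res.redirect(
    `https://${req.query.shop}/admin/oauth/authorize` +
    `?client_id=${API_KEY}&scope=${SCOPE}&redirect_uri=${redirectUri}`
  );
});

// Step 8: exchange the temporary code for a permanent Access Token.
// A real app must verify the hmac and state parameters here first.
app.get('/auth_token', async (req, res) => {
  const resp = await fetch(`https://${req.query.shop}/admin/oauth/access_token`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      client_id: API_KEY,
      client_secret: API_SECRET,
      code: req.query.code,
    }),
  });
  req.session.accessToken = (await resp.json()).access_token;
  res.redirect('/render_app');
});

// Step 9: the main app view.
app.get('/render_app', (req, res) => res.send('Hello World'));

app.listen(3000);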

Mixed content

When developing your app locally you may be mixing secure and insecure content, so your browser will block the insecure requests. You can temporarily disable protection on the page to allow your requests to proceed.

I hope this helps you get started with your Shopify App.

Building a Scheduled Queue With Redis

Something that comes up every now and again is, “How would I build a distributed, scheduled queue?”.

Let’s say I have an online order system that can accept two types of orders: those that must be processed immediately, and others which can be scheduled to be processed at some arbitrary time in the future.

One approach might be to store all of the orders in an enterprise RDBMS and poll it frequently to find orders whose schedule date has been met. If there were a huge amount of data this could put an unnecessary strain on the database.

Redis to the rescue!

Redis can be used to act as a queue of orders. There will be two queues: the “Process Now” queue and the “Process Later” queue. On these queues I will store just the ORDER_ID; the full order details are still stored in our RDBMS.

Everything in the “Process Now” queue is ready to be processed, so we can just pop the next one off that queue (using BLPOP) and perform billing and order fulfilment. This queue is just a Redis list that we are treating like a queue. The orders in the “Process Later” queue must be processed when the order’s schedule date has been met. This “queue” simply maps a key to an empty value with an expiration date; it’s not even really a list.
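
To make that concrete, here’s a sketch of the two queues in Node.js using the redis npm package (the v3-style callback API). The queue and key names are my own inventions, not anything Redis prescribes:

const redis = require('redis');
const client = redis.createClient();

// "Process Now": push the ORDER_ID onto a plain Redis list.
function processNow(orderId) {
  client.lpush('orders:process_now', orderId);
}

// "Process Later": set a key whose expiry is the order's schedule date.
// The value is irrelevant; only the key name and its expiry matter.
function processLater(orderId, delayMs) {
  client.set('order_' + orderId, '', 'PX', delayMs);
}

// A worker pops the next ready order. BLPOP blocks until one arrives
// (a timeout of 0 means wait forever), so in practice you would give
// the worker its own dedicated connection.
function nextOrder(callback) {
  client.blpop('orders:process_now', 0, (err, reply) => {
    if (err) return callback(err);
    callback(null, reply[1]); // reply is [listName, orderId]
  });
}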

How do we know the schedule has been met without searching the queue? An upcoming feature of Redis is Keyspace Notifications. Amongst other things, this means we can subscribe to the expired-key event. Let’s be clear: Redis must search for keys to expire, and the documentation clearly states there is an overhead associated with enabling Keyspace Notifications, so this feature is not free in terms of CPU time, but it does take some load off the RDBMS.

Get Redis

Unfortunately Keyspace Notifications are currently only available in development versions of Redis. The examples I toyed with here worked with a development build of Redis 2.8, and here’s how I got it working.

Clone Redis from Github

Go to the Redis GitHub page and clone the repository, then check out branch 2.8 and follow the instructions in the README file to build Redis locally. It’s very easy to do and doesn’t take long.

Enable Keyspace Notifications

Because keyspace event notification carries an overhead, you have to enable it as an option when Redis starts. I did so like this.

./redis-server --notify-keyspace-events Ex # E = Keyevent events, x = Expire events

You can also do this permanently by editing the redis.conf file, which also contains further comments about other events you may be interested in. It is worth reading those comments if you want to play with this feature.

The Redis website has a topic on notifications.

Schedule and observe expiring keys

For simplicity I’m using the native Redis client to demonstrate these examples. There is a complete Node.js example at the end of this post.

Open two Redis Clients in separate terminals.

./redis-cli

In the first client schedule an order id in the “Process Later” queue. Use an expiry time of 10 seconds in the future to give us enough time to switch to the other terminal and watch the magic!

set order_1234 '' PX 10000

Here we are not interested in the value, just the key order_1234 and that it expires in 10000ms. The expiry time would be the number of milliseconds in the future until the order fulfilment date.

More on the Redis command SET.

Subscribe to the key expiring events

In the second client subscribe to be notified when keys expire like this.

psubscribe '__keyevent@0__:expired'

More on the Redis command PSUBSCRIBE.

Now we are subscribers to the queue of “Process Later” orders. We will be notified when the schedule date is reached, in the form of an event telling us that the ORDER_ID has expired. This is nice, as the key is automatically removed from the “Process Later” queue too. Now all we have to do is place the expiring ORDER_ID onto the “Process Now” queue and the order will be fulfilled.

This is what the notification looks like when the key expires.

1) "pmessage"
2) "__keyevent@0__:expired"
3) "__keyevent@0__:expired"
4) "order_1234"

Making it distributed

You can do a lot with Redis running on a single server, but if we wanted to make this fault tolerant to network partitions and scale horizontally then we would probably look at sharding and replication. Some clients support automatic sharding using consistent hashing which would work well with our ORDER_ID in this example.

The Redis website has a topic on partitioning.

Here is a good post about replication with Redis.

An example in Node.js

Here is a crude but working example I created as a GitHub Gist.

Caveat emptor

Okay, so here is the disclaimer. I have rather unfairly demonstrated a development build of Redis, so be aware that everything could change; although I encountered no problems, it is only reasonable to expect a pre-release-quality experience.