Snowflakes

The ID type that is unique, roughly sortable, timestamped, and works at any scale.

Swerves

Note: This is a companion article to my FlakeId and Goflake libraries

I can hear you say “do we really need another ID type?”. There’s the standard UUID, its lexicographically sortable cousin the ULID, and even good ol’ auto-incrementing integers. Great options all, but each comes with a sizeable tradeoff, be it a 16-byte footprint, predictability, or the inability to sort or to generate IDs without central coordination.

These tradeoffs are acceptable if you’re designing a system that’s not expected to grow beyond a few hundred thousand data entries. Not so much when you grow to billions of records produced by many worker processes.

This is where the Snowflake ID, originally introduced by Twitter and later refined by Discord, shines. Discord is an extremely popular messaging service, mostly used in gaming, that processes billions of messages from users all over the world every single hour. Better yet, it doesn’t rely on a central server to hand out these IDs; any node can generate an ID while guaranteeing its uniqueness. How do they do it?

The secret is in the structure of a snowflake ID. It consists of a number of components: a timestamp being the most important, followed by some pseudo-random information. In Discord’s implementation, these random components are a process ID, worker ID and an increment.

None of these values is unique by itself. Two machines obtaining a timestamp in the same millisecond are guaranteed to observe the same value. The chance of two machines doing so from processes with matching process IDs and worker (or thread) IDs, while also having generated exactly the same number of IDs before, however, is vanishingly small.

In other words, the composite of these values is what makes Snowflakes unique. Combine this with the fact that they fit in an int64 data type and are time-sortable, and you’ll quickly see why they make for an interesting key type.

Layout

Every Snowflake is essentially a 64-bit integer, shifted around. While a Snowflake can be represented as any regular 64-bit integer, it is composed of four distinct components that have been masked and bit-shifted into it. Of course, this process works both ways, and all of these components can be extracted from the Snowflake as well.

You can theoretically use any layout so long as it fits in an int64. Discord’s implementation uses the following layout:

111111111111111111111111111111111111111111 11111 11111 111111111111
64                                         22    17    12         0

Which means every Snowflake contains:

  • 42 bits for the timestamp
  • 5 bits for the workerId
  • 5 bits for the processId
  • 12 bits for the increment
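In Go, this layout can be captured as a handful of constants. This is a sketch; the constant names are my own, not taken from any particular library:

```go
package main

import "fmt"

// Bit widths of each Snowflake component, and the shifts derived from them.
const (
	TimestampBits = 42
	WorkerIdBits  = 5
	ProcessIdBits = 5
	IncrementBits = 12

	TimestampShift = WorkerIdBits + ProcessIdBits + IncrementBits // 22
	WorkerIdShift  = ProcessIdBits + IncrementBits                // 17
	ProcessIdShift = IncrementBits                                // 12
)

func main() {
	fmt.Println(TimestampShift, WorkerIdShift, ProcessIdShift) // 22 17 12
}
```

Deriving the shifts from the bit widths means changing one width automatically keeps the rest of the layout consistent.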

Consequently, if you want to obtain the timestamp from a Snowflake, you would simply:

timestamp := (snowflake >> 22) + epoch_ms

Where epoch_ms is the moment in time you chose as your starting point. For Discord, this is the first of January, 2015 (1420070400000 in Unix time milliseconds).

The 22 bits we shift right correspond to the combined width of the processId, workerId and increment, leaving us with the timestamp in the least significant 42 bits, while the most significant bits are zeroed.

When represented as an integer, you may get a value that looks something along the lines of: 111773356109402112.
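As a sketch of going the other way, here is how one might unpack such a value into its components and reassemble it. The `decode` helper and the constants are my own, not from any particular library:

```go
package main

import "fmt"

// decode splits a Snowflake into its four components
// by shifting each one down and masking off the rest.
func decode(id int64) (timestamp, workerId, processId, increment int64) {
	timestamp = id >> 22
	workerId = (id >> 17) & 0x1F
	processId = (id >> 12) & 0x1F
	increment = id & 0xFFF
	return
}

func main() {
	const id int64 = 111773356109402112

	ts, w, p, inc := decode(id)

	// Reassembling the components must yield the original ID.
	rebuilt := (ts << 22) | (w << 17) | (p << 12) | inc
	fmt.Println(rebuilt == id) // true
}
```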

Increment

The increment component is interesting in the sense that various implementations of flake-like IDs tend to handle it differently. Some implementations, like RobThree’s IdGen, only increase the increment counter when a duplicate generation is detected within the same millisecond.

While this is perfectly fine, it does mean that generating an ID must involve locking or some other concurrency mechanism in order to guarantee unique IDs. In my opinion, this isn’t worth the penalty in complexity and performance.

Instead, I suggest increasing the counter for every ID generated. This will eventually cause the increment counter to overflow, but since we’re only using the lower 12 bits, this won’t cause any issues unless a single thread manages to generate an obscene number of IDs every millisecond, which outside of a benchmark situation seems unlikely.
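In Go, such an always-increasing counter can be sketched with an atomic add, keeping only the lower 12 bits. The function name is mine, not the library’s:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

var counter atomic.Uint64

// nextIncrement returns the low 12 bits of an ever-increasing counter.
// The overflow is harmless: the value simply wraps from 4095 back to 0.
func nextIncrement() uint64 {
	return (counter.Add(1) - 1) & 0xFFF
}

func main() {
	fmt.Println(nextIncrement(), nextIncrement()) // 0 1
}
```

Because the add is atomic, no lock is needed even when many goroutines generate IDs concurrently.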

Generation

We can generate a Snowflake by obtaining all of its components and shifting them into a single int64, starting from a timestamp in milliseconds.

The first step is to obtain a timestamp from a monotonic clock, expressed as milliseconds since our epoch. Accessing the system clock is a relatively expensive call, so we can abstract it away by calculating the starting point of our instance in milliseconds once, and simply counting the milliseconds that have elapsed since with a Stopwatch. This is all neatly wrapped in MonotonicTimer:

internal static class MonotonicTimer
{
    private static readonly long s_epoch = GetInstanceEpoch();
    private static readonly Stopwatch s_stopwatch = Stopwatch.StartNew();
    
    internal static DateTimeOffset Epoch => new DateTimeOffset(2015, 1, 1, 0, 0, 0, 0, TimeSpan.Zero);

    public static long ElapsedMilliseconds => s_epoch + s_stopwatch.ElapsedMilliseconds;

    private static long GetInstanceEpoch()
    {
        // Omitted for brevity
    }
}

Of course, the Epoch here can be any point in time. Because my implementations closely resemble the Discord implementation, I have opted to use their epoch.
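For the Go-minded, a similar timer can be sketched using the fact that time.Since already reads Go’s built-in monotonic clock. This is an illustrative sketch with names of my own choosing, not Goflake’s actual code:

```go
package main

import (
	"fmt"
	"time"
)

// epoch is Discord's epoch: the first of January, 2015 (UTC).
var epoch = time.Date(2015, time.January, 1, 0, 0, 0, 0, time.UTC)

// instanceEpoch is this instance's starting point, in milliseconds
// since epoch, read from the system clock exactly once.
var instanceEpoch = time.Since(epoch).Milliseconds()

// start anchors the monotonic clock; time.Since(start) never goes backwards.
var start = time.Now()

func elapsedMilliseconds() int64 {
	return instanceEpoch + time.Since(start).Milliseconds()
}

func main() {
	fmt.Println(elapsedMilliseconds() > 0) // true
}
```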

Now that we have a monotonic timer, we can use its masked timestamp as the starting point for our ID:

long milliseconds = MonotonicTimer.ElapsedMilliseconds;
long timestamp = milliseconds & TimestampMask;

And simply shift the rest of our Snowflake components in to construct our Snowflake:

snowflake = (timestamp << (ThreadIdBits + ProcessIdBits + IncrementBits))
            + (threadId << (ProcessIdBits + IncrementBits))
            + (processId << IncrementBits)
            + increment;

At the end, we’ll have the timestamp in the most significant (“leftmost”) 42 bits, followed by 5 bits of threadId, 5 bits of processId, and finally 12 bits of increment. All neatly packed into a single int64.

These IDs are K-ordered, meaning that even if they were generated by disparate nodes, we can still roughly sort them close to the order in which they were generated.
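Putting the pieces together, a complete generator might look like this in Go. This is a simplified, non-library sketch: thread IDs aren’t readily available in Go, so a configured worker ID stands in for the thread ID above:

```go
package main

import (
	"fmt"
	"os"
	"sync/atomic"
	"time"
)

var (
	// Discord's epoch; the counter increases on every generated ID.
	epoch   = time.Date(2015, time.January, 1, 0, 0, 0, 0, time.UTC)
	counter atomic.Uint64
)

// generate composes a Snowflake from the masked timestamp, a configured
// worker ID, the process ID, and the low 12 bits of the counter.
func generate(workerId int64) int64 {
	timestamp := time.Since(epoch).Milliseconds() & ((1 << 42) - 1)
	processId := int64(os.Getpid()) & 0x1F
	increment := int64(counter.Add(1)-1) & 0xFFF

	return (timestamp << 22) | ((workerId & 0x1F) << 17) | (processId << 12) | increment
}

func main() {
	a, b := generate(1), generate(1)
	fmt.Println(a < b) // true: later IDs sort after earlier ones
}
```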

Collisions

When the increment counter increases on every generated ID, the chance of the same ID being generated twice is close to zero.

Using 2015 as the epoch on an int64, the timestamp starts setting the sign bit after 2^41 milliseconds, roughly 69.7 years, so IDs stay positive until at least 2084. However, since a negative int64 is still a unique bit pattern, we realistically won’t run out until the full 42 bits are exhausted after about 139.5 years, around 2154. Of course, choosing a later epoch will shift these dates accordingly.

A theoretical collision could occur if a single thread manages to generate more than 2^12, or 4096, IDs in the same millisecond, at which point the masked increment wraps back to a value it has previously used. All other components being equal, it would then generate a duplicate ID.

That is over four million IDs per second from a single thread, which, outside of a synthetic benchmark, I don’t expect any realistic workload to approach.

Primary Keys

From a system architecture perspective, here are some considerations to take into account when weighing Snowflakes as the key that uniquely identifies an entity in your system.

The fact that a Snowflake fits into a plain int64 makes it a very “natural” primary key for SQL databases, and requires no special language support.

Much like UUIDs, Snowflakes enable you to decentralise your ID generation: your application code can generate IDs, rather than the database. Because you can generate the ID of a related object before inserting it into the database, constructs such as foreign keys and table relationships become optional rather than required.

When used in indexed relational databases, insertions are much more predictable than e.g. UUID insertions, because Snowflakes tend to increase over time. This leads to less index pressure and fragmentation.

Snowflakes are not a standard. It is very much a niche concept implemented by internet strangers such as myself. UUIDs are standardised in every single language and database engine. Databases know what a UUID is, and they know how to deal with them. Snowflakes are just a number.

The more recent UUIDv7 specification moves UUIDs towards timestamp-based IDs rather than the pure randomness of the now-common v4 UUIDs. Once these are more widely adopted, it may make more sense to use the more standardised option.

The ULID specification hasn’t been updated in several years, and at this point I personally don’t see the benefit of ULIDs over Snowflakes. Having said that, oklog/ulid is still a very solid and performant implementation of ULIDs if you require string keys.

Conclusion

I hope this article has given you some insight into what Snowflakes are, what makes them unique, and how they can be useful. Choosing a primary key type for any system deserves careful consideration, and I hope this article helps you make the right decision for your situation.

If you have any questions about my specific implementations, the best way of contacting me is by opening an issue on GitHub.

Good luck!