Hybrid Logical Clocks

February
14,
2020
ยท
local-first

This is the first in a series of posts digging into James Long's talk CRDTs for Mortals, and the accompanying demo app he created.

So, you're writing a destributed system local-first app, and you're sending events back and forth between various clients and the server. One thing you'd really like to be able to do is determine the order in which events happened -- this is important for the "last" part of "last-write-wins" CRDTs, for example. So you add a timestamp to each event as it's created, and all is well.

Until someone points out to you that the source you're using for a timestamp -- the local machine's "wall clock" -- isn't completely reliable. Clocks drift, sometimes they go backward, and a malicious or particularly incompetent user could set their device's clock to a wildly inaccurate value. For example, if one client's clock is a day ahead, and they send some changes to an object, other clients would be completely unable to make changes to that object until the following day, when their local timestamps finally become "later" than the bad-actor's timestamps.

Now there are decent odds you could get away with just using timestamps, if your requirements for overall correctness are not too strict. Computers and phones tend to auto-update their clocks fairly frequently, and bad actors are likely to be rare. On the other hand, if it's easy to fix, why do the thing that's worse?

Fortunately, there's a tight little algorithm that does a pretty good approximation of telling you "the order in which events happened" in a distributed system: the Hybrid Logical Clock (HLC).

Let's be clear up-front about the promises this clock is making; it cannot divine the actual real-life ordering of all events. It does however make the following guarantees:

  1. All events created on a single machine will be correctly ordered with respect to each other (even if the wall clock jumps back a second or is otherwise squirrely).

  2. Once machine A sends events to machine B, all events subsequently created on machine B will be ordered as after those events from machine A. So if A sets their clock one day ahead, makes some changes, and sends them to B, B will still be able to make changes even though its own wall clock is 'behind'.

HLC Implementation

The HLC algorithm is simple enough that we can walk through the whole thing right here. If you want to see it all together, here's the full source.

The Hybric Logical Clock consists of two elements -- a timestamp (ts) and a counter (count).

export const init = (node: string, now: number): HLC => ({
    ts: now,
    count: 0,
    node,
});

In practice, you'll want a third element, a "node ID" that is unique per device, that you can use to break ties if the timestamp and counter are, by some incredible fluke, exactly the same between events generated on two devices.

The clock is initialized with ts as the device's wall clock timestamp, and count at 0.

Inc

When an event occurs, you call inc on the HLC to get an updated version to attach to that event. This function gives us guarantee #1.

export const inc = (local: HLC, now: number): HLC => {
    if (now > local.ts) {
        return { ts: now, count: 0, node: local.node };
    }

    return { ...local, count: local.count + 1 };
};

If the wall clock is later than the HLC's timestamp, then we can just return a new HLC with the wall clock timestamp and a count of 0. If however the wall clock's timestamp is not later than the HLC's (e.g. because our wall clock jumped back), then we re-use the HLC's timestamp, and increment the counter.

Recv

When receiving events from another node, we'll need to update our HLC using the HLC from that other node. This function gives us guarantee #2.

export const recv = (local: HLC, remote: HLC, now: number): HLC => {
    if (now > local.ts && now > remote.ts) {
        return { ...local, ts: now, count: 0 };
    }

    if (local.ts === remote.ts) {
        return { ...local, count: Math.max(local.count, remote.count) + 1 };
    } else if (local.ts > remote.ts) {
        return { ...local, count: local.count + 1 };
    } else {
        return { ...local, ts: remote.ts, count: remote.count + 1 };
    }
};

If the local wall clock is later than both the local and remote timestamps, great! Use it, and reset the count to 0.

Otherwise, the wall clock is somehow behind -- either behind the remote HLC (it could be that the remote device has a fast clock), or behind the local one (if the clock jumped backward, or our HLC was previously updated from another node that had a faster clock).

If both HLC's timestamps are the same, we need a new count that's larger than either of theirs. Otherwise, use the later timestamp of the two, and a count that's one greater than the associated HLC's counter.

Cmp

Comparing HLCs is quite simple: first compare the timestamps, then the counts, then the node ids as the final tie-breaker.

export const cmp = (one: HLC, two: HLC) => {
    if (one.ts === two.ts) {
        if (one.count === two.count) {
            if (one.node === two.node) {
              return 0;
            }
            return one.node < two.node ? -1 : 1;
        }
        return one.count - two.count;
    }
    return one.ts - two.ts;
};

If you're unfamiliar with the cmp function convention, it's a function that takes two arguments, and returns a negative number if the first argument is "less", a positive number if the second argument is "less", and 0 if they are equal. You can pass this cmp function to .sort() of an array of HLCs, and it will order them from "earliest" to "latest".

In practice, it's convenient to "pack" the HLC into a string representation, such that you can do normal lexical comparisons (< and >) on them, and generally pass them around more easily. The original paper sacrifices some precision on the timestamp to pack all the information into an integer, but for my purposes strings generally work fine.

export const pack = ({ ts, count, node }: HLC) => {
    // 13 digits is enough for the next 100 years, so 15 is plenty.
    // And 5 digits base 36 is enough for more than 6 million "out of order" changes.
    return (
        ts.toString().padStart(15, '0') +
        ':' +
        count.toString(36).padStart(5, '0') +
        ':' +
        node
    );
};

export const unpack = (serialized: string) => {
    const [ts, count, ...node] = serialized.split(':');
    return {
        ts: parseInt(ts),
        count: parseInt(count, 36),
        node: node.join(':'),
    };
};

But wait

A case that it doesn't cover is this: if A sets their clock ahead & makes some changes, but doesn't send them, and then B makes some changes, and then A sends their changes, A's earlier changes will be ordered after B's later changes. Upon closer inspection however, this isn't a huge problem. From B's perspective, if it hasn't seen the changes, they might as well have not yet happened. Is there a real difference here between A making changes, waiting, then sending; and A waiting, then making changes and sending them immediately? You could think of them being logically equivelent, and in fact the HLC considers them to be.

A real world example of this issue would be if a user has two devices, both offline, but device A is somehow an hour ahead of device B. The user makes a change on device A, then walks over to device B and makes a conflicting change, logically thinking that the change on B will win, because it is "last". Once both devices come online, the change from device A has won, much to the user's surprise. In practice, you'll want to do some checks on timestamps your server is getting from clients, and raise a red flag if their too out of sync.

For the most part, if users clocks aren't too far off, then changes that happen "in parallel" (while both are offline, for example) will be ordered correctly.

That's it!

I made a demo app to help myself gain some intuition about how HLCs interacted with each other in a distributed system (source here) -- feel free to play with it, and let me know what you think on twitter!