Three Clocks are Better than One
"Whenever you find yourself on the side of the majority, it is time to pause and reflect." - Mark Twain
We recently had the pleasure of presenting TigerBeetle on Zig SHOWTIME, where we ran a live Twitter poll to ask: Which Linux monotonic clock "stopwatch" is best to measure elapsed time?
The contenders were CLOCK_MONOTONIC_RAW, CLOCK_MONOTONIC and CLOCK_BOOTTIME. All three are monotonic clock stopwatches that the Linux kernel provides through the clock_gettime(2) syscall to measure elapsed time, but with critical differences as we will see. The majority said CLOCK_MONOTONIC and only 7% said CLOCK_BOOTTIME.
What would you have picked?
Why might the majority answer of CLOCK_MONOTONIC be catastrophic?
Why reach for a monotonic clock stopwatch in the first place? Why not choose a "wristwatch" like CLOCK_REALTIME to use the wall clock to measure elapsed time?
And why does a distributed database like TigerBeetle need to think about time?
"Time is Money" - Benjamin Franklin
TigerBeetle can process a million financial transactions per second, and takes safety even more seriously than performance. Our design set out to address several fault models: a storage fault model, the well-understood network fault model, a human fault model and, lastly, a clock fault model, because the timestamps of our financial transactions must be accurate and comparable across different financial systems.
The challenge is that physical imperfections in hardware clocks (called quartz crystal oscillators) cause our software clocks to tick at different speeds, so that time passes faster or slower than it should, with these "drift" errors also accumulating into significant "skew" errors within a matter of minutes.
Most of the time, Network Time Protocol (NTP) would correct our clocks for these errors. However, if NTP silently stops working because of a partial network outage, and if TigerBeetle keeps transacting, then we would be running blind, in the dark, while disconnected from true time.
We need to know that our clocks are being synchronized, or that they are within our tolerances for clock error if not. We also can't afford to shut down only because of a partial network outage.
Following the insight that "three clocks are better than one", TigerBeetle solves the problem by combining the majority of clocks in the cluster to construct a fault-tolerant clock called "cluster time". We use cluster time to bring a server's system time back into line if necessary, or shut down safely if we see too many faulty clocks.
"How did it get so late so soon?" - Dr Seuss
To arrive at cluster time, TigerBeetle servers measure their clock offset to other server clocks whenever they send keepalive probes.
To do this, there are three things we need to know: the stopwatch time to send a probe to the other server (S2) and receive its reply (called round trip time, M2 - M1 in the sketch below), the wristwatch time of the other server when it sends its reply (T1), and our wristwatch time when we receive the reply (T2).
Working out the clock offset between two servers
All we need to do now is add half the round trip time (called the one way delay) to T1 to estimate server S2's clock at the moment we receive its reply, and then compare that with our own clock (T2) to arrive at our clock offset.
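The arithmetic can be sketched in a few lines of Python. This is only an illustration, not TigerBeetle's implementation: the function name `clock_offset` and the choice of integer nanoseconds are our assumptions here, and the sketch assumes the network path is symmetric, so that the reply takes half the round trip time to arrive.

```python
def clock_offset(m1, m2, t1, t2):
    """Estimate how far the other server's wristwatch is ahead of ours.

    m1, m2: our stopwatch readings when we send the probe and receive the reply.
    t1:     the other server's wristwatch reading when it sends its reply.
    t2:     our wristwatch reading when we receive the reply.
    All values are integer nanoseconds (an assumption for this sketch).
    """
    round_trip_time = m2 - m1
    # Assumes a symmetric path: the reply spent half the round trip in flight.
    one_way_delay = round_trip_time // 2
    # The other server's clock "now" is roughly t1 + one_way_delay.
    return (t1 + one_way_delay) - t2


# The other server appears to be 120ns ahead of us.
offset = clock_offset(m1=1_000, m2=1_040, t1=5_000_100, t2=5_000_000)
```

A positive result means the other server's clock is ahead of ours, a negative result means it is behind. The one way delay also doubles as a rough error bound on the sample: the true offset cannot be further from the estimate than the time the reply could have spent in flight.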
We then collect these clock offset samples in short windows of up to 20 seconds, keeping the best samples with the least network interference (i.e. minimum round trip time). We then pass these samples to Marzullo's algorithm, which estimates upper and lower bounds on cluster time from a number of noisy clocks by returning the smallest interval consistent with the largest number of clocks.
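To make the idea concrete, here is a minimal sketch of Marzullo's algorithm in Python. It is not TigerBeetle's code. It assumes each sample has already been turned into an interval `(lower, upper)` of possible offsets (for example, the measured offset plus or minus the one way delay, which is our assumption for this sketch), and the function name `marzullo` is ours.

```python
def marzullo(intervals):
    """Return (lower, upper, count): an interval covered by the largest
    number of the given (lower, upper) source intervals.

    If several intervals tie for the largest count, the first is returned.
    """
    edges = []
    for lower, upper in intervals:
        edges.append((lower, -1))  # -1 marks the start of an interval
        edges.append((upper, +1))  # +1 marks the end of an interval
    # Sort by offset. At equal offsets starts (-1) sort before ends (+1),
    # so intervals that merely touch still count as overlapping.
    edges.sort()

    best_count = 0
    count = 0
    best_lower = best_upper = None
    for i, (offset, kind) in enumerate(edges):
        count -= kind  # a start raises the overlap count, an end lowers it
        if count > best_count:
            best_count = count
            best_lower = offset
            # A new maximum only happens at a start edge, and the last edge
            # is always an end, so a next edge is guaranteed to exist.
            best_upper = edges[i + 1][0]
    return best_lower, best_upper, best_count


# Three clocks: all three agree on the interval 11..12.
result = marzullo([(8, 12), (11, 13), (10, 12)])
```

With three sources, an interval agreed on by only one of them would be a signal that the clocks disagree too much to trust, which is where a majority threshold would come in.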
Crucially, because we use stopwatch time (i.e., elapsed time between communications) to measure the round trip time, we can avoid errors in our calculations when the wristwatch time jumps around (e.g. when corrected by NTP). We can also use stopwatch time to know the cluster time since our last synchronization window, without misapplying our cluster time offset interval to a wristwatch time that might since have jumped.
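As a rough illustration of the second point, suppose we remember one wristwatch reading and one stopwatch reading taken together at the start of the last synchronization window, along with the offset interval that Marzullo's algorithm gave us. The names below (`cluster_time_bounds`, `clamp_to_cluster_time`, the `epoch_*` parameters) are hypothetical and the code is a sketch under those assumptions, not TigerBeetle's implementation.

```python
def cluster_time_bounds(epoch_realtime, epoch_monotonic, now_monotonic,
                        offset_lower, offset_upper):
    """Bounds on cluster time now, using only stopwatch time since the epoch.

    epoch_realtime and epoch_monotonic were read together when the last
    synchronization window began. All values are integer nanoseconds.
    """
    # Elapsed time comes from the stopwatch, so wristwatch jumps cannot skew it.
    elapsed = now_monotonic - epoch_monotonic
    return (epoch_realtime + elapsed + offset_lower,
            epoch_realtime + elapsed + offset_upper)


def clamp_to_cluster_time(now_realtime, lower, upper):
    """Bring our wristwatch reading back within the cluster time bounds."""
    return max(lower, min(now_realtime, upper))


lower, upper = cluster_time_bounds(
    epoch_realtime=1_000_000, epoch_monotonic=500, now_monotonic=800,
    offset_lower=-10, offset_upper=20,
)
# Our wristwatch has since jumped far ahead, so it is pulled back to `upper`.
corrected = clamp_to_cluster_time(1_005_000, lower, upper)
```

The offset interval is applied to the wristwatch reading it was measured against, carried forward by the stopwatch, rather than to whatever the wristwatch happens to say now.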
"All we have to decide is what to do with the time that is given us" - Gandalf
To return to our Twitter poll, the Linux kernel is like an expensive watch store that gives us a myriad of time sources. There are wristwatches on display, but also a collection of stopwatches, such as CLOCK_MONOTONIC_RAW, CLOCK_MONOTONIC and CLOCK_BOOTTIME.
As we saw, most programmers, understandably, believe that CLOCK_MONOTONIC is the monotonic clock stopwatch to use, as the name suggests.
We were just as surprised to learn that in fact CLOCK_MONOTONIC fails to measure elapsed time while the system is suspended (e.g. during a VM migration), and that CLOCK_MONOTONIC_RAW was exposed by the kernel only for synchronization protocols like NTP to measure the quartz crystal oscillator drift error, with little resemblance to the actual passing of time.
There was indeed a Linux patch to fix CLOCK_MONOTONIC's behavior during a suspend, but it got reverted.
The right stopwatch then, we believe, is also the minority answer: CLOCK_BOOTTIME, which according to the man page is:
"...identical to CLOCK_MONOTONIC, except it also includes any time that the system is suspended. This allows applications to get a suspend-aware monotonic clock..."
Naming things is hard. And while the man page does mention CLOCK_MONOTONIC's suspend error, it does so only under CLOCK_BOOTTIME (if one reads that far).
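For readers who want to try this, Python's `time` module exposes `clock_gettime_ns()` and, on Linux, the `CLOCK_BOOTTIME` constant. The helper below is a small sketch with a hypothetical name (`stopwatch_ns`). The fallback to `time.monotonic_ns()` is our assumption to keep the example portable, and on that path the stopwatch is not guaranteed to be suspend-aware.

```python
import time


def stopwatch_ns():
    """Read a monotonic stopwatch in nanoseconds, preferring CLOCK_BOOTTIME."""
    clock_id = getattr(time, "CLOCK_BOOTTIME", None)  # only defined on Linux
    if clock_id is not None and hasattr(time, "clock_gettime_ns"):
        try:
            return time.clock_gettime_ns(clock_id)
        except OSError:
            pass  # the clock is not supported by this kernel
    # Portable fallback (assumption): monotonic, but may not count suspend time.
    return time.monotonic_ns()


start = stopwatch_ns()
time.sleep(0.01)
elapsed_ns = stopwatch_ns() - start
```

Only differences between two readings are meaningful: the absolute value of a stopwatch says nothing about the time of day.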
"Great Scott!" - Dr Emmett Brown
It's not enough to rely on external clock synchronization protocols like NTP if you care about accurate timestamps. You also need a mechanism to detect when your clocks start failing beyond your tolerance for error.
It's possible for a distributed database to exploit redundant hardware for fault-tolerance, even to mask a drifting quartz crystal oscillator for clock fault-tolerance.
Knowing when you need wristwatch time and when you need stopwatch time makes for cleaner algorithms and safer software.
Finally, always read the (whole) man page.
If you want to learn how TigerBeetle processes a million financial transactions a second, here's the rest of our Zig SHOWTIME talk:
Thanks to Loris Cro for suggesting and reading drafts of this post.
- TigerBeetle's (beta) clock synchronization algorithm.
- Thomas Gleixner with the raison d'être for CLOCK_MONOTONIC_RAW.
- Jon Moore on How to Have your Causality and Wall Clocks, Too.
- Romain Jacotin explains Marzullo's algorithm.