August in TigerLand
Dear friends,
Hope your August was awe-inspiring! Last month, we doubled down on performance, improved continuous integration and testing, and enhanced our client libraries. SD’25 continued on YouTube, we published a blog post, and we turned three!
Let’s go!
“As boundaries, interfaces are not only what’s exposed, but also what’s imposed.” – Joran Dirk Greef
TigerBeetle’s consensus protocol consists of two recovery sub-protocols that help lagging replicas catch up: Write-Ahead Log (WAL) Repair and State Synchronization. WAL repair entails repairing missing or corrupted prepares over the network. However, the WAL is a finite, on-disk ring buffer wherein old prepares are overwritten after they’re checkpointed. Consequently, replicas may fall so far behind that WAL repair isn’t possible at all. In that case, replicas must perform State Synchronization, which entails fetching the latest checkpoint and using it to repair missing or corrupted LSM tree blocks over the network. Last month, we made various improvements to online verification and performance in both our recovery sub-protocols!
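To make the choice between the two sub-protocols concrete, here is a minimal Python sketch. It assumes a toy model in which peers retain only the most recent `WAL_SLOTS` prepares and op numbers map to ring slots by modulo. The names `WAL_SLOTS`, `slot_for_op` and `recovery_mode` are hypothetical and chosen for illustration; this is not TigerBeetle's implementation.

```python
WAL_SLOTS = 8  # hypothetical, deliberately tiny ring size


def slot_for_op(op: int) -> int:
    """Map an op number to its slot in the ring buffer (toy model)."""
    return op % WAL_SLOTS


def recovery_mode(replica_op: int, cluster_op: int) -> str:
    """Pick a recovery path for a lagging replica in this toy model.

    If every missing op still fits in the ring, peers can serve the
    prepares (WAL repair). Otherwise some prepares have already been
    overwritten, so the replica has to sync from a checkpoint instead.
    """
    missing = cluster_op - replica_op
    if missing <= 0:
        return "up_to_date"
    if missing <= WAL_SLOTS:
        return "wal_repair"
    return "state_sync"
```

In this model, a replica at op 10 can still repair up to op 18, but at op 19 the prepare for op 11 has been overwritten in slot 3 and only state sync remains.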
A serialization point in WAL repair was removed: the prepare header is now repaired concurrently with the corresponding prepare, rather than before. Additionally, lagging backups cache prepares from the next checkpoint, as opposed to dropping them. This obviates the need to repair these cached prepares when the laggard advances to the next checkpoint.
State sync now requests blocks more aggressively, driven by the rate of incoming requests rather than an infrequent timeout. Progress is ratcheted across multiple checkpoint targets, avoiding the earlier behavior of restarting from the beginning on each checkpoint change, which caused duplicate work.
Anomalous behavior was fixed in which the WAL repair timeout was not reliably firing. Since the timeout resets state, its absence meant that requesting a prepare from a replica that lacked it would prevent further requests to other replicas. This severely degraded repair performance for applications sending small batches to TigerBeetle.
The VOPR now verifies that a failure to read a block is due to either corruption or state sync. Additionally, we now assert the internal consistency of GridMissingBlocks, which maintains state about which blocks are currently being requested due to corruption or state sync. This is an example of how the VOPR is not only deterministic as a simulator, but also protocol-aware. It’s able to reach in and verify the system under test, almost understanding consensus!
At TigerBeetle, we approach performance engineering by first getting the high-level performance architecture right, as that’s what can enable multiple orders of magnitude improvement. Then, we employ low-level techniques like static memory allocation, amortization, zero-deserialization, memory locality, etc. to extract additional performance. Since these optimizations compound, we believe in investing in all, big or small!
Last month, we optimized tigerbeetle format by making it concurrent and avoiding writing out the entire 1 GiB WAL during format. This brought about a 10x improvement in first-time format performance.
Code generation in our AEGIS implementation was improved by rewriting it in a semantically equivalent form that keeps state in registers, reducing loads and stores. The result is a 2x speedup in microbenchmarks and a 3–10% throughput increase in tigerbeetle benchmark.
Our set associative cache is now 5% more performant, due to SIMD-friendly iteration over the set’s ways, and replacing the expensive modulo operation with fastrange, a faster alternative for modulo reduction.
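For readers unfamiliar with fastrange, here is a minimal Python sketch of the general technique as described by Daniel Lemire, assuming a 32-bit hash. The function name and bit width are illustrative assumptions, not taken from TigerBeetle's code. Note that the result is generally not equal to `hash32 % n`; it is a different mapping into the same range that avoids the division.

```python
def fastrange32(hash32: int, n: int) -> int:
    """Map a 32-bit hash into [0, n) with one multiply and one shift."""
    assert 0 <= hash32 < 2**32
    assert 0 < n <= 2**32
    # Treat hash32 / 2**32 as a fraction in [0, 1) and scale it by n.
    return (hash32 * n) >> 32
```

With `n = 10`, the smallest hash maps to 0 and the largest 32-bit hash maps to 9, so every result is a valid index.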
In the Zipfian distribution, a small percentage of candidate items have a high probability of being selected. We fixed our shuffled Zipf generator, which was deviating from the Zipf distribution for large key ranges.
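As a rough illustration of that skew, the Python sketch below builds Zipf weights proportional to 1/rank^s and then shuffles which key receives which rank, so that the hot keys are not simply the smallest ones. The function names, the exponent `s` and the shuffling scheme are assumptions for illustration; this is not TigerBeetle's generator.

```python
import random


def zipf_weights(n: int, s: float = 1.0) -> list[float]:
    """Normalized Zipf probabilities for ranks 1..n."""
    raw = [1.0 / (rank ** s) for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def shuffled_zipf_sampler(n: int, s: float = 1.0, seed: int = 0):
    """Return a sampler over keys 0..n-1 with Zipf-distributed popularity.

    The rank-to-key mapping is shuffled once up front, so popularity
    is spread over arbitrary keys rather than the lowest ones.
    """
    rng = random.Random(seed)
    weights = zipf_weights(n, s)
    keys = list(range(n))
    rng.shuffle(keys)

    def sample() -> int:
        return rng.choices(keys, weights=weights, k=1)[0]

    return sample
```

With 100 keys and `s = 1.0`, the ten most popular keys account for more than half of all selections.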
We also landed various improvements to our continuous integration (CI) and testing infrastructure, across fuzzers, unit tests, VOPR and Vortex!
Every 6 hours, a CI job on GitHub Actions validates the latest TigerBeetle release on GitHub, to catch bugs due to systems out of our control. Last month, we added a new validation to check release determinism by comparing the SHA-256 hash of the release on GitHub with that of a freshly built release. Additionally, we now validate the checksum of the Zig binary that we download in zig/download.sh!
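A determinism check of this kind boils down to comparing two digests. The Python sketch below shows the idea using the standard `hashlib` module. The helper names and file paths are hypothetical, and the actual CI job may be structured quite differently.

```python
import hashlib


def sha256_of_file(path: str) -> str:
    """Hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_reproducible(published_path: str, rebuilt_path: str) -> bool:
    """True if the published artifact and a fresh build are byte-identical."""
    return sha256_of_file(published_path) == sha256_of_file(rebuilt_path)
```

A CI job could download the published artifact, rebuild from the tagged source, and fail the run whenever `is_reproducible` returns `False`.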
To reduce friction when writing new fuzzers, we introduced the fixtures module, which eliminates duplicate initialization logic across all fuzzers to initialize storage, grid, superblock, etc.
We added a test that verifies all components’ unit test files have been imported in the global unit test file. The test was soon after changed to use a quine – a program that takes no input and produces a copy of its own source code!
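For anyone who hasn't met a quine before, here is a classic two-line example in Python, held in a string so that it can be executed and compared with its own output. It only illustrates the concept and is unrelated to the test in the TigerBeetle repository; `QUINE` and `run_and_capture` are names made up for this sketch.

```python
import contextlib
import io

# The quine's source: a format string that prints itself via its own repr.
QUINE = "s = 's = %r\\nprint(s %% s)'\nprint(s % s)\n"


def run_and_capture(source: str) -> str:
    """Execute Python source and return everything it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})
    return buffer.getvalue()
```

Running `run_and_capture(QUINE)` returns a string identical to `QUINE`, which is exactly the defining property of a quine.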
The VOPR – our deterministic simulator where a TigerBeetle cluster running real code is subjected to network, storage and process faults – now generates CreateTransfers requests for transfers that already exist, and asserts that they return CreateTransfersResult.exists. Additionally, VOPR’s cluster upgrade logic was improved to mimic production upgrades, wherein replicas coordinate and automatically restart once they all have the new binary.
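The "already exists" check can be pictured with a toy model. The Python sketch below assumes a drastically simplified ledger in which resubmitting an identical transfer reports `exists` and leaves state untouched. `ToyLedger`, `Result.conflict` and `replay_existing` are hypothetical names for illustration and do not reflect TigerBeetle's API or the VOPR's code.

```python
from enum import Enum


class Result(Enum):
    ok = "ok"
    exists = "exists"
    conflict = "conflict"  # hypothetical: same id, different fields


class ToyLedger:
    def __init__(self) -> None:
        # transfer id -> (debit account, credit account, amount)
        self.transfers: dict[int, tuple[int, int, int]] = {}

    def create_transfer(self, transfer_id: int, debit: int, credit: int, amount: int) -> Result:
        record = (debit, credit, amount)
        existing = self.transfers.get(transfer_id)
        if existing is None:
            self.transfers[transfer_id] = record
            return Result.ok
        return Result.exists if existing == record else Result.conflict


def replay_existing(ledger: ToyLedger) -> None:
    """Resubmit every stored transfer and assert it is reported as existing."""
    before = dict(ledger.transfers)
    for transfer_id, (debit, credit, amount) in before.items():
        result = ledger.create_transfer(transfer_id, debit, credit, amount)
        assert result is Result.exists
    assert ledger.transfers == before  # replays must not change state
```

A simulator can call something like `replay_existing` at arbitrary points, since an idempotent create must never alter state that is already committed.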
Vortex – our explicitly non-deterministic test suite that checks safety and liveness properties for multiple client languages through fault injection – was added to our CI suite that runs against each GitHub pull request, and can now test our Rust client via a new driver.
We added documentation for Adaptive Replication Routing, our replication strategy to detect and overcome Gray Failure – anomalies in cloud environments caused by subtle fail-slow faults as opposed to fail-stop faults. We also added an explanation for how applications can submit fuller batches to TigerBeetle using automatic batching in clients.
Finally, we implemented various improvements and fixes to our client libraries:
The Rust client now includes extensive documentation and better tests.
An overzealous assertion in the asynchronous Python client, which fired because the close() function was synchronous, was fixed.
The Node client is now compatible with Node.js v24, which introduced a backwards-incompatible API change.
- Tamas added type annotations to our Python client, making it compliant with mypy in “strict” mode. Type safety for the win, kudos Tamas!
- Yinka corrected some typos in our documentation about Performance. Thank you for your contribution, Yinka!
- Joe added examples to our data modelling documentation, to better explain asset scales. Many thanks, Joe!
Last month in IronBeetle, we discussed how a TigerBeetle cluster safely orchestrates an upgrade when the old binary is replaced with a new one on all replicas. Further, we explored how we test upgrades in the VOPR, our deterministic simulator, and even discovered a disparity between upgrades in a real cluster vs. a simulated cluster! Finally, we started discussing Adaptive Replication Routing, our strategy for combating one of the fallacies of distributed computing – changing network topology.
Join us live every week on Twitch or catch up on the TigerTube!
Code Review Can Be Better (Aug 4) An unusual but resonant post: Matklad wrote about GitHub’s code review process and the conclusion of our experiment with a different take, the git-review tool, which we decided to shelve, at least for the time being.
Scaling Correctness: Marc Brooker on a Decade of Formal Methods at AWS (Aug 7) A lovely interview between Marc Brooker and Antithesis on DST, how EC2 also came out of Cape Town, and how Marc and Joran come from the same university. Catch the shout out for TigerBeetle; ‘grats again on the decade milestone, Marc!
Not So Direct I/O (Aug 8) Jack picked up on Joran’s DBMS Interview Challenge on X, and wrote a post about it. We enjoyed his goose chase and thinking. Here’s to curious cats, including Jack!
Funding news for Profs. Ram Alagappan and Aishwarya Ganesan (Aug 15) In 2018, “Protocol-Aware Recovery” set the standard for Durability in ACID, making Profs. Ram Alagappan and Aishwarya Ganesan giants in the field. Grateful to stand on their shoulders—and to support their ongoing work.
Reserve First (Aug 16) Matklad also wrote about a coding pattern that is relevant for people who use the heap liberally and manage memory with their own hands (including a Dante Gabriel Rossetti painting).
SD25 Online continued! In August, the second crop of SD25 talks arrived! Watch Prof. Hannes Mühleisen’s talk on DuckLake (Aug 4), Hillel Wayne’s What Isn’t Your System Supposed to Do? (Aug 6), Andrew Kelley’s Don’t Forget To Flush (Aug 11), Prof. Ram Alagappan’s New Shared-Log Abstractions for Modern Applications (Aug 13), Kyle Kingsbury’s Jepsen 18: Serializable Mom (Aug 18), this year’s Lightning Talks (Aug 20), Amod Malviya’s A Systems View to AI (Aug 25), and Dr. Thea Klaeboe Aarrestad’s Big Data and AI at the CERN LHC (Aug 27). Just one talk crept into September…
P99 CONF: The Tale of Taming TigerBeetle’s Tale (Oct 22-23) TigerBeetle’s Tobias Ziegler will be storytelling at P99 CONF next month: performance wins via modern superscalar CPUs, patterns CPUs love, and speed in production, not just on paper. Go, Tobi!
’Til next time… smile with the rising sun!
The TigerBeetle Team
Time flies, when you are
Powering OLTP—
TigerBeetle,
three!









