December in TigerLand

    Dear friends,

    Wishing you a very merry December. This month, we published our brand new Python client, introduced more fault injection to VOPR and Vöʀᴛᴇx, and improved our logging, CLI, and REPL for a better developer experience. We also discovered and fixed a subtle memory swapping issue that could have allowed the Linux kernel to undermine TigerBeetle’s storage fault tolerance.

    Let’s go!

    The silver lining of correctness bugs: they are fun to debug.

    • TigerBeetle adopts an explicit storage fault model—it detects and recovers from latent sector errors, disk corruption, and misdirected I/O where firmware or filesystem bugs might read or write the wrong sector. This behaviour is continuously tested by the VOPR, which uses Deterministic Simulation Testing to precisely control (and accelerate) time, concurrency, and fault injection, even knowing how to push faults close to the theoretical limit of TigerBeetle.

    • We recently discovered that it was possible to circumvent our storage fault model on systems where swap is enabled. Swap is a section of a disk that the OS uses to store inactive data from memory. With swap enabled, memory paged out to that area is exposed to storage faults, so corrupt data on disk could be swapped back into memory by the kernel!

    • This behaviour could not have been detected by the VOPR as it tests whether TigerBeetle correctly adheres to its storage, process and network fault models, but does not inject memory faults! TigerBeetle’s memory fault model explicitly recommends using memory protected with error-correcting codes (ECC), and we explicitly make no mitigations against memory corruptions (this would require byzantine fault-tolerant consensus).

    • To prevent swap from bypassing TigerBeetle’s storage fault tolerance, we now invoke the Linux mlockall syscall on startup, which locks all of TigerBeetle’s virtual address space into RAM, preventing the memory from being paged to disk in the swap area, where it might be vulnerable to storage faults. Similarly, we invoke SetProcessWorkingSetSize on Windows. However, TigerBeetle running on macOS is still vulnerable to swap, as an equivalent syscall isn’t available on macOS! (At present, we support only Linux for production, with Windows and macOS for development.)
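
    As a rough illustration of the Linux call described above, here is a Python sketch that invokes mlockall through ctypes. It is not TigerBeetle's code (TigerBeetle is written in Zig). The function name lock_all_memory, the injectable libc parameter, and the choice of the MCL_CURRENT | MCL_FUTURE flags are assumptions made for this example, and the flag values are those found in <sys/mman.h> on x86 and ARM Linux.

```python
import ctypes
import ctypes.util
import os

# Flag values from <sys/mman.h> on x86 and ARM Linux (assumption: other
# architectures may define different values).
MCL_CURRENT = 1  # lock pages that are currently mapped
MCL_FUTURE = 2   # also lock pages that get mapped later


def lock_all_memory(libc=None):
    """Ask the kernel to keep all of this process's pages in RAM.

    `libc` is a hypothetical injection point so the call can be exercised
    without really locking memory; by default the system C library is used.
    """
    if libc is None:
        libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
        errno = ctypes.get_errno()
        raise OSError(errno, os.strerror(errno))
```

    On a typical system the call can fail with ENOMEM or EPERM when the RLIMIT_MEMLOCK limit is too low and the process lacks CAP_IPC_LOCK, which is why the sketch raises instead of continuing silently.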

    • For predictable latencies of queries over two (or more!) indexes, TigerBeetle employs the Zig Zag Merge Join algorithm – a technique that intersects correlated values across indexes, zigging and zagging between them, without having to buffer or sort anything in memory. As a result, no pathological query can “explode” in time (or space) or take forever to execute!

    • We recently discovered and fixed a correctness bug in our Zig Zag Merge Join implementation wherein some queries erroneously ended a scan too early, truncating the query results. This bug could manifest through the use of get_account_balances, get_account_transfers, query_accounts, and query_transfers when querying over multiple secondary indexes. To learn more, do read through the phenomenal PR description that explains not only the bug in greater detail, but also how it made it past all four (!) of the fuzzers that test TigerBeetle’s scans. 🤯
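
    To make the zigging and zagging concrete, here is a minimal Python sketch of the general technique, not TigerBeetle's implementation. It assumes each index is an in-memory sorted list of unique keys (for example, timestamps of matching objects), and the name zig_zag_intersect is hypothetical. Each index is asked to seek to the current candidate; whenever an index lands on a larger key, that key becomes the new candidate for everyone else.

```python
from bisect import bisect_left


def zig_zag_intersect(indexes):
    """Yield, in order, the keys present in every sorted, duplicate-free list."""
    if not indexes or any(len(index) == 0 for index in indexes):
        return
    positions = [0] * len(indexes)
    candidate = max(index[0] for index in indexes)
    while True:
        matched = True
        for i, index in enumerate(indexes):
            # Seek: jump straight to the first key >= candidate.
            pos = bisect_left(index, candidate, positions[i])
            if pos == len(index):
                return  # one index is exhausted, so nothing else can match
            positions[i] = pos
            if index[pos] != candidate:
                # Zag: this index skipped ahead, so its key is the new target.
                candidate = index[pos]
                matched = False
        if matched:
            yield candidate
            # Move past the match using the first index, then continue.
            positions[0] += 1
            if positions[0] == len(indexes[0]):
                return
            candidate = indexes[0][positions[0]]
```

    Because the function is a generator that only tracks one position per index, it never buffers or sorts intermediate results. A scan that stops at the wrong point in a loop like this would silently drop trailing matches, which is the shape of the truncation bug described above.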

    • At TigerBeetle, we continue to remind ourselves of the need to practice the principle of defense-in-depth. This is why critical components of the code are tested using test suites of varying granularity. For example, Scans (the component that powers queries in TigerBeetle) are tested via unit tests, a dedicated scan fuzzer, auxiliary LSM tree and forest fuzzers, and the VOPR (via this exhaustive workload). This month, we continued to invest in testing, from improving unit tests and fuzzers, to adding new failure modes to VOPR and Vöʀᴛᴇx.

    • We added a new network failure mode to Vöʀᴛᴇx – packet corruption. Vöʀᴛᴇx now randomly shuffles & zeroes out bytes in network packets, all while asserting correctness – by performing a set of reconciliation checks that assert that creation of all accounts & transfers is successful, and that querying accounts & transfers fetches the expected results.

    • Currently, VOPR tests for safety and liveness of our VSR implementation in the presence of processes crashing (and restarting). We added a new process fault mode to the VOPR – pauses. The motivation behind adding this new failure mode is to simulate the real-world scenario of VM migration and to unveil new, interesting interleavings of events!

    • Motivated by a correctness bug that we fixed in our Zig Zag Merge Join algorithm, we rewrote our Scan fuzzer. The fuzzer now tests the Zig Zag Merge Join implementation more effectively by choosing random objects, as opposed to carefully choosing the prefix of generated objects based on what query they should match.
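
    The packet corruption fault mentioned above (shuffling and zeroing bytes) can be pictured with a small Python sketch. This is only an illustration under assumptions: the function name corrupt_packet and the zero_probability parameter are hypothetical, and Vöʀᴛᴇx's actual corruption logic may differ. Passing in a seeded random generator keeps a failing run reproducible.

```python
import random


def corrupt_packet(packet, rng, zero_probability=0.1):
    """Return a corrupted copy of `packet`: bytes shuffled, then some zeroed."""
    data = bytearray(packet)
    rng.shuffle(data)  # reorder the bytes in place
    for i in range(len(data)):
        if rng.random() < zero_probability:
            data[i] = 0  # zero out this byte
    return bytes(data)
```

    For example, corrupt_packet(b"transfer", random.Random(42)) always yields the same corrupted bytes for the same seed, so a reconciliation failure can be replayed.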

    • We now assert that the CacheMap, a hybrid between a SetAssociativeCache and a HashMap (stash), is exclusive; keys in the cache must not be present in the stash.

    • Enabling assertions in production means that TigerBeetle either runs correctly with respect to the expectations encoded in the code, or else crashes to remain safe, as opposed to remaining available but compromising correctness. However, it is important to exercise caution while choosing which asserts to enable in production. Since the data plane is performance-critical, we gate this new assertion behind the constants.verify flag. To learn more about how we decide whether an assert should be enabled in production, do read through this amazing comment! 🚀
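
    The exclusivity invariant and its gating can be sketched with a toy two-tier map in Python. This is not the real CacheMap: the class TwoTierMap, its eviction policy, and the module-level VERIFY flag (standing in for constants.verify) are all assumptions made for illustration.

```python
VERIFY = True  # stand-in for a build-time flag such as constants.verify


class TwoTierMap:
    """Toy cache + stash pair in which a key lives in at most one tier."""

    def __init__(self, cache_capacity):
        self.cache_capacity = cache_capacity
        self.cache = {}  # stands in for the set-associative cache
        self.stash = {}  # overflow hash map

    def put(self, key, value):
        # Drop any older copy from the stash so the tiers stay exclusive.
        self.stash.pop(key, None)
        if key not in self.cache and len(self.cache) >= self.cache_capacity:
            # Evict the oldest cache entry into the stash.
            old_key = next(iter(self.cache))
            self.stash[old_key] = self.cache.pop(old_key)
        self.cache[key] = value
        if VERIFY:
            self.assert_exclusive()

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        return self.stash.get(key)

    def assert_exclusive(self):
        # O(n) check over both tiers, which is why it sits behind the flag.
        assert self.cache.keys().isdisjoint(self.stash.keys())
```

    The check walks every key, so running it on each write would be too costly for a hot path; gating it behind a verification flag keeps the invariant tested without paying for it on every operation in production.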

    • We finally landed the heavily requested Python client for TigerBeetle, which supports both synchronous and asynchronous usage for all APIs (create_accounts, create_transfers, get_account_transfers, get_account_balances, query_accounts, query_transfers, lookup_accounts, and lookup_transfers). It has no runtime dependencies, and bundles in the C libraries, much like the Go client. Be sure to read through the documentation, and start building! 👨‍💻 🐍

    • Vöʀᴛᴇx is our full-system integration test that also covers the language clients. It runs the TigerBeetle binary on actual infrastructure (real OS, network, and storage) and stresses it using real client libraries while injecting crashes and faults.

    • Currently, Vöʀᴛᴇx only runs on Linux, but we are taking steps towards making it platform independent. To that end, this month we added custom network fault injection to Vöʀᴛᴇx. This change involves placing a TCP proxy in front of each replica which intercepts all communication and injects basic network faults (packet loss, corruption, and delays). This rids us of our dependency on Linux tc/netem!

    • We also fixed a resource leak in Vöʀᴛᴇx wherein successfully initialized proxies weren’t deinitialized in the case where some proxies couldn’t be successfully initialized.
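
    The shape of the leak fix above (release what was already initialized when a later initialization fails) can be sketched in Python with contextlib.ExitStack. The Proxy class, the start_proxies function, and the factory parameter are hypothetical stand-ins, not Vöʀᴛᴇx's code; in Zig the same idea is usually expressed with errdefer.

```python
from contextlib import ExitStack


class Proxy:
    """Hypothetical stand-in for a per-replica TCP proxy."""

    def __init__(self, port, fail=False):
        if fail:
            raise OSError(f"cannot bind port {port}")
        self.port = port
        self.open = True

    def close(self):
        self.open = False


def start_proxies(ports, factory=Proxy):
    """Start one proxy per port; on any failure, close those already started."""
    with ExitStack() as stack:
        proxies = []
        for port in ports:
            proxy = factory(port)
            # Registered cleanup runs only if we leave the block via an error.
            stack.callback(proxy.close)
            proxies.append(proxy)
        # Everything started: cancel the cleanups and hand ownership over.
        stack.pop_all()
        return proxies
```

    If the third of three proxies fails to start, the first two are closed before the exception propagates; if all succeed, pop_all() detaches the cleanups so the caller receives open proxies.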

    • TigerBeetle cluster upgrades are designed to happen without external operator coordination. Multiple versions of the code are compiled into a single TigerBeetle binary, so that a replica itself can decide which version should be running, and deploy its own upgrades. Specifically, once operators place a new TigerBeetle binary on enough replicas, the replicas coordinate the release amongst themselves. This approach optimizes for safety (eliminating a large category of storage non-determinism) and operational simplicity.

    • We changed the number of replicas the cluster considers enough to start coordinating an upgrade. Specifically, we now require all replicas in the cluster to agree to upgrade (as opposed to the old value: all - 1). The motivation behind this change is that in most cases, not upgrading all replicas together would be a mistake, leading to replicas lagging and needing to state sync. If an upgrade is needed while the cluster is compromised, then it should be a hotfix upgrade, which is a build tagged with the same release.
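
    The change in threshold can be written down as a small Python sketch. The data model here is an assumption for illustration only: each replica is represented by the set of releases compiled into its binary, and the function names and release strings are hypothetical.

```python
def replicas_ready(available_releases, target):
    """Count the replicas whose binary includes the `target` release."""
    return sum(1 for releases in available_releases if target in releases)


def upgrade_may_start(available_releases, target):
    """New rule: every replica must have the target release available."""
    return replicas_ready(available_releases, target) == len(available_releases)


def upgrade_may_start_old(available_releases, target):
    """Old rule: all replicas but one were enough."""
    return replicas_ready(available_releases, target) >= len(available_releases) - 1
```

    With three replicas of which only two carry the new release, the old rule would begin the upgrade and leave the third replica lagging, while the new rule waits until the binary is present everywhere.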

    • This month, we landed enhancements to operator and developer experience.

    • --cluster is now an optional parameter for the tigerbeetle format command. This allows operators to opt into a randomly generated cluster ID, which is logged by the format command.

    • Operators can now pass --log-debug --experimental to the tigerbeetle start command and enable debug logging at runtime, without recompiling the binary.

    • Cluster logs are now prepended with a UTC timestamp, allowing for easier correlation of logs between TigerBeetle replicas. 🕒

    • We improved the error logged when an operator accidentally runs a TigerBeetle client and cluster with incompatible releases (client_release_too_low/client_release_too_high).

    • We fixed a bug in the REPL wherein attempting to use multiple objects on operations that do not support them (for example get_account_transfers account_id=1, account_id=2) would cause it to panic.

    Join us live every Thursday to walk through TigerBeetle.

    Thursday / 9am PST / 12pm EST / 5pm UTC twitch.tv/tigerbeetle

    Be sure to check out episodes 51, 52, and 53 from this month, where matklad covers the theory and practice of State Sync—the mechanism which lagging replicas use to catch up to the rest of the cluster!

    IronBeetle YouTube Playlist Episode 001: Intro and (Absence of) Message Parsing

    Looking back over the year, there’s much to be grateful for.

    Till next time… have yourself a merry little Christmas!

    The TigerBeetle Team

    P.S. We hope you enjoy these TigerTivities…

    It’s been a pleasure, King. Cheers from us all to your next mission.
