September in TigerLand

    Dear friends,

    Hope your September was sublime! We introduced a lightweight tracing API, continued to add more ways to query TigerBeetle, doubled down on idempotency, further optimized availability and durability, released two new TigerTalks (Biodigital Jazz! as well as Durability and the Art of Consensus), and appeared on Software Engineering Daily!

    Let’s go!

    A highlight this month was all the effort that went into enhancing the experience of using TigerBeetle! From polishing log messages to introducing stronger guarantees for idempotent requests, we’ve made improvements that benefit developers and operators.

    • Tracing is a special kind of log that combines observability and debugging by exposing deeper details about the system’s execution path. This gives engineers (and curious users 🤓) valuable insights to troubleshoot issues, verify implementation correctness, and anticipate pathological behaviors under specific workloads.

    However, unlike observability tools that monitor a limited set of system “vital signals,” tracing is inherently far more verbose and resource-intensive. To fully harness its benefits, tracing must be lightweight—minimizing its performance impact—and, most importantly, the giant output file resulting from hours of tracing must be easy to navigate and visualize.

    • We introduced a new lightweight tracing API to TigerBeetle, designed to be included in the regular release binary (no special build required!). To enable tracing on the fly, simply add the --trace= argument at the CLI. When tracing is disabled, there’s no additional overhead.

    Another benefit of having the tracing logic always exercised (whether enabled or not) is that it provides an additional layer of assertions to ensure the correctness of control flow, which is often non-linear due to asynchronous callbacks. In other words, it enforces that every I/O operation that emits a trace.start() event reaches its corresponding completion with one (and only one) matching trace.stop() event.
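
    The pairing invariant described above can be sketched in a few lines. This is a minimal illustration, not TigerBeetle's implementation: the Tracer class, its fields, and assert_quiescent() are hypothetical names, and only the start()/stop() pairing comes from the text.

```python
class Tracer:
    """Hypothetical tracer: the pairing checks run whether or not output is enabled."""

    def __init__(self, enabled: bool = False) -> None:
        self.enabled = enabled        # controls recording only, not the assertions
        self.open_events = set()      # events started but not yet stopped
        self.recorded = []            # (phase, event) pairs kept when enabled

    def start(self, event: str) -> None:
        # An event must not be started again while it is still open.
        assert event not in self.open_events, f"{event} started twice"
        self.open_events.add(event)
        if self.enabled:
            self.recorded.append(("start", event))

    def stop(self, event: str) -> None:
        # Every stop must match exactly one earlier start.
        assert event in self.open_events, f"{event} stopped without a start"
        self.open_events.remove(event)
        if self.enabled:
            self.recorded.append(("stop", event))

    def assert_quiescent(self) -> None:
        # Nothing may be left open once all callbacks have completed.
        assert not self.open_events, f"unfinished events: {sorted(self.open_events)}"


tracer = Tracer(enabled=False)
tracer.start("io_read")
tracer.stop("io_read")
tracer.assert_quiescent()
```

    Because the bookkeeping runs even with tracing disabled, a callback that completes twice (or never) trips an assertion in every build, not just in traced runs.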

    • As for the output itself, we adopted Google’s Trace Event Format, which is JSON, for TigerBeetle traces. This was an excellent choice for its simplicity (which enables fast writes) and for the wide range of tools available for navigating and searching within the format, matching the ease of use we aim for.
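
    For readers who have not seen the format, here is a small sketch of what duration events look like in Google's Trace Event Format. The field names (name, cat, ph, ts, pid, tid, args) and the "B"/"E" phases come from that format; the event names, category, and argument values are made up for illustration and are not TigerBeetle's actual output.

```python
import json


def trace_event(name, phase, ts_us, pid=1, tid=1, args=None):
    """Build one event; "B" begins a duration, "E" ends it, ts is in microseconds."""
    event = {"name": name, "cat": "example", "ph": phase,
             "ts": ts_us, "pid": pid, "tid": tid}
    if args is not None:
        event["args"] = args
    return event


# Begin/end events on the same thread must nest like brackets.
events = [
    trace_event("commit", "B", 0),
    trace_event("io_read", "B", 10, args={"block": 42}),  # hypothetical argument
    trace_event("io_read", "E", 35),
    trace_event("commit", "E", 50),
]

# The JSON array form can be loaded by trace viewers such as Perfetto.
trace_json = json.dumps(events, indent=2)
```

    Each event is a small flat object, so a writer can append events one at a time without building a tree first.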

    Here’s a screenshot of a TigerBeetle trace file visualized by Perfetto:

    • Beyond tracing basic events such as CPU and I/O, we added specific tracing for Queries! For example, it’s now possible to observe how often Scans demand disk reads during query execution—a valuable tool for finding the right grid cache size for a particular workload.

    • One minor but welcome change was standardizing the use of the logging verbosity levels error and warning! Now, we log at the error level only for unexpected behavior that will likely cause a replica to crash. Everything else is classified as a warning or simple info, visible only in development builds. As a result, we’ve increased the signal-to-noise ratio of console logs, giving users and operators more confidence when reporting issues.

    • To make it explicit to operators why clients are being evicted from the cluster, we’ve clarified the messages and labels for error codes related to version mismatches between the client and replicas (both client_release_too_low and client_release_too_high).

    • Finally, we completely removed the inconvenient log messages in the client libraries. 🗑️

    • Users can now use the secondary indexes user_data_32, user_data_64, user_data_128, and code to filter account transfers and account balances! This minor API change aligns get_account_transfers and get_account_balances with the same filter capabilities already present on query_accounts and query_transfers, and unlocks new ways to query the database.

    Thanks to our friends from Super Payments for this suggestion! 😎
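
    To make the new filter fields concrete, here is an in-memory sketch of how such a filter could select an account's transfers. It does not use the TigerBeetle client: the Transfer and AccountFilter classes are simplified stand-ins, and treating zero as "do not filter on this field" is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transfer:
    id: int
    debit_account_id: int
    credit_account_id: int
    amount: int
    user_data_128: int = 0
    user_data_64: int = 0
    user_data_32: int = 0
    code: int = 0


@dataclass(frozen=True)
class AccountFilter:
    account_id: int
    user_data_128: int = 0  # assumption: 0 means "match any value"
    user_data_64: int = 0
    user_data_32: int = 0
    code: int = 0


def get_account_transfers(transfers, account_filter):
    """Return the account's transfers that match every non-zero filter field."""
    fields = ("user_data_128", "user_data_64", "user_data_32", "code")

    def matches(transfer):
        if account_filter.account_id not in (transfer.debit_account_id,
                                             transfer.credit_account_id):
            return False
        return all(getattr(account_filter, f) in (0, getattr(transfer, f))
                   for f in fields)

    return [t for t in transfers if matches(t)]


ledger = [
    Transfer(id=1, debit_account_id=10, credit_account_id=20, amount=5, code=7),
    Transfer(id=2, debit_account_id=20, credit_account_id=10, amount=3, code=9),
    Transfer(id=3, debit_account_id=30, credit_account_id=40, amount=1, code=7),
]
refunds = get_account_transfers(ledger, AccountFilter(account_id=10, code=9))
```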

    • Idempotency is a computer science appropriation of a mathematical concept. Instead of delving into what it means in the context of distributed systems, let’s quote a definition that motivates why the application needs idempotency: 📖

    “Retrying an operation that has side effects is usually not safe. […] But how do we decide if a retry operation is safe or not? The idempotency characteristic of the system answers this question. An operation is idempotent if it results in the same outcome, regardless of the number of invocations.”

    Software Mistakes and Tradeoffs (by Tomasz Lelek, Jonathan Skeet)

    TigerBeetle uses the id field as the idempotency key to give the application the guarantee that the operation will execute only once. However, an operation can execute and fail because of transient conditions exclusively caused by the database’s state at that moment. For example, when creating a transfer, it might fail at first due to exceeds_credits and then succeed if retried a few moments later when the account received enough credits to fulfill the operation. But wouldn’t this violate the guarantee that the outcome of an idempotent operation must remain the same regardless of the number of invocations? Yes, it would!

    To better understand why, let’s first explore why an application might want to retry an operation. When orchestrating distributed systems (e.g., the user interface, API layer, and TigerBeetle cluster), components might fail to communicate due to network issues or crashes. If the user interface or API layer fails to acknowledge TigerBeetle’s response, it must retry the idempotent operation and recover either success (result ok or exists) or an appropriate error code. Things get complicated when components retry operations independently. For example, a transfer might fail for lack of funds, and the user interface shows the error to the user. Meanwhile, an internal middleware might still have a retry of that same transfer in flight, and that retry silently succeeds. The outcome of the operation is then completely non-deterministic in the face of a transient failure!

    To address this issue, we formalized the concept of transient failure and introduced a new error code id_already_failed, which is returned when a transfer’s id is associated with a previous attempt that failed due to one of the transient error codes, providing strong idempotency guarantees, not only in the positive case but in the negative case too.
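
    The behavior can be illustrated with a toy in-memory ledger. This is a sketch under simplifying assumptions, not TigerBeetle's logic: the Ledger class and its fund() helper are hypothetical, only one transient error is modeled, and "exists" is returned without comparing the retried transfer's fields. The result names ok, exists, exceeds_credits, and id_already_failed are the ones mentioned in the text.

```python
class Ledger:
    """Toy ledger: every account's debits must not exceed its credits."""

    def __init__(self):
        self.credits = {}        # account -> total credits posted
        self.debits = {}         # account -> total debits posted
        self.transfers = {}      # id -> (debit, credit, amount) for successes
        self.failed_ids = set()  # ids whose first attempt failed transiently

    def fund(self, account, amount):
        # Hypothetical helper: credit an account from outside the model.
        self.credits[account] = self.credits.get(account, 0) + amount

    def create_transfer(self, id, debit, credit, amount):
        if id in self.transfers:
            return "exists"
        if id in self.failed_ids:
            return "id_already_failed"
        if self.debits.get(debit, 0) + amount > self.credits.get(debit, 0):
            self.failed_ids.add(id)  # remember the failure, so it stays a failure
            return "exceeds_credits"
        self.debits[debit] = self.debits.get(debit, 0) + amount
        self.credits[credit] = self.credits.get(credit, 0) + amount
        self.transfers[id] = (debit, credit, amount)
        return "ok"


ledger = Ledger()
first = ledger.create_transfer(1, "alice", "bob", 10)   # alice has no credits yet
ledger.fund("alice", 100)
retry = ledger.create_transfer(1, "alice", "bob", 10)   # same id, still a failure
fresh = ledger.create_transfer(2, "alice", "bob", 10)   # new id, new attempt
again = ledger.create_transfer(2, "alice", "bob", 10)   # retry of a success
```

    Whichever component retries id 1, and whenever it does so, it observes a failure, so the user interface and the middleware from the example above can no longer disagree.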

    • Note that the word “retry” is used here in the sense of “confirming the unknown outcome of a previous attempt”. There are other situations when the application might use the same term with a different meaning:

    Resolving a transient issue: If the application intends to send the same transfer again after resolving the underlying transient issue, it cannot use the same idempotency id for the new attempt to succeed.

    Handling malformed requests: Unlike transient errors, malformed requests (e.g., zeroed or out-of-range values, misused flags or fields) will always be rejected regardless of how many times they are submitted. As long as the operation was never executed in the first place, the id can safely be reused once the application is fixed and a valid request is sent.

    Please refer to our documentation for the complete list of transient errors.
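
    From the application's side, the three cases above could be told apart roughly as follows. This is only a sketch: TRANSIENT_ERRORS holds a single illustrative code rather than the complete list from the documentation, the returned strings are invented for this example, and treating every other result as a malformed request that never executed is a simplifying assumption.

```python
# Illustrative subset only; the complete list is in the TigerBeetle documentation.
TRANSIENT_ERRORS = frozenset({"exceeds_credits"})


def outcome_of(result: str) -> str:
    """Map a create-transfer result to what the application may do next."""
    if result in ("ok", "exists"):
        # The transfer executed, either now or on an earlier attempt.
        return "succeeded"
    if result in TRANSIENT_ERRORS or result == "id_already_failed":
        # The attempt failed for good; a new attempt needs a new id.
        return "failed: use a new id to try again"
    # Assumption: anything else was rejected before executing.
    return "rejected: fix the request, the id may be reused"


outcomes = {r: outcome_of(r)
            for r in ("ok", "exists", "exceeds_credits", "id_already_failed")}
```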

    • We landed many improvements to consensus availability and durability:

    • We’ve implemented interesting heuristics to allow the newly-elected primary replica to forfeit the view change if its latest checkpoint lags behind the cluster, allowing another more up-to-date replica to step up as the primary instead and maximize availability.

    • A primary can now also abdicate if it cannot process requests due to a broken clock! In addition, we documented how TigerBeetle is able to provide more accurate timestamping of financial transactions and detect when the external clock sync service fails. ⏱️

    • We redesigned the commit dispatch logic to handle asynchronous completions more naturally, converting the chain of callback completions into something that resembles linear control flow. Not only is it more readable, but we could also insert more assertion points to guarantee that the execution flow driven by the asynchronous callbacks is running as expected.
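
    One general way to turn a chain of callbacks into something that reads linearly is a stage variable driven by a single dispatch loop. The sketch below illustrates that idea only: the stage names, the CommitDispatcher class, and the queue standing in for asynchronous I/O are all hypothetical and are not taken from TigerBeetle's code.

```python
from collections import deque
from enum import Enum


class Stage(Enum):
    IDLE = 0
    PREFETCH = 1   # asynchronous in this sketch
    EXECUTE = 2    # synchronous in this sketch
    COMPACT = 3    # asynchronous in this sketch
    DONE = 4


ASYNC_STAGES = {Stage.PREFETCH, Stage.COMPACT}


class CommitDispatcher:
    def __init__(self, io_queue):
        self.stage = Stage.IDLE
        self.io_queue = io_queue   # pending completions, simulating async I/O
        self.visited = []

    def start(self):
        assert self.stage is Stage.IDLE
        self.stage = Stage.PREFETCH
        self.dispatch()

    def dispatch(self):
        # The whole pipeline is readable here, top to bottom.
        while self.stage is not Stage.DONE:
            stage = self.stage
            self.visited.append(stage.name)
            if stage in ASYNC_STAGES:
                # Submit the work and return; the completion resumes the loop.
                self.io_queue.append(lambda s=stage: self.resume(s))
                return
            self.advance(stage)

    def resume(self, completed):
        # A completion must belong to the stage we are currently waiting on.
        assert self.stage is completed, "completion for an unexpected stage"
        self.advance(completed)
        self.dispatch()

    def advance(self, from_stage):
        assert self.stage is from_stage
        self.stage = Stage(from_stage.value + 1)


queue = deque()
dispatcher = CommitDispatcher(queue)
dispatcher.start()
while queue:                 # run the simulated completions in order
    queue.popleft()()
```

    Every transition passes through advance() and resume(), which gives natural places for the kind of assertions mentioned above.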

    • We improved cluster availability through another insight from our deterministic simulation tests! When recovering from crashes that happen while a replica is checkpointing, instead of restarting in “repairing” mode, we can ensure that the log includes the last operation from the checkpoint the replica started with, allowing it to just resume from where it left off before crashing.

    • New recipes on TigerBeetle’s docs:

    • We’ve updated the balance-conditional recipe for scenarios where the application needs to perform a transfer based on whether an account meets a minimal net balance threshold.

    • Additionally, we’ve introduced a new recipe for balance-invariant transfers, enforcing debits_must_not_exceed_credits or credits_must_not_exceed_debits on individual transfers instead of on the entire account (a special case of the balance-conditional recipe).
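
    The idea behind a balance-conditional transfer can be modeled with a chain of transfers that succeeds or fails as a unit. The sketch below is a simplified in-memory model, not the recipe from the docs verbatim: the control-account round trip, the helper names, and the single net balance per account are assumptions made for illustration.

```python
class InsufficientFunds(Exception):
    pass


def apply_transfer(balances, debit, credit, amount, guarded):
    """Move amount; accounts in `guarded` may not drop below a zero net balance."""
    if debit in guarded and balances[debit] - amount < 0:
        raise InsufficientFunds(debit)
    balances[debit] -= amount
    balances[credit] += amount


def apply_linked(balances, chain, guarded):
    """Apply all transfers or none, like a chain of linked transfers."""
    scratch = dict(balances)
    try:
        for debit, credit, amount in chain:
            apply_transfer(scratch, debit, credit, amount, guarded)
    except InsufficientFunds:
        return False
    balances.clear()
    balances.update(scratch)
    return True


def balance_conditional(balances, source, destination, control, threshold, amount):
    chain = [
        (source, control, threshold),   # fails if source holds less than threshold
        (control, source, threshold),   # hand the threshold amount straight back
        (source, destination, amount),  # the transfer we actually want
    ]
    return apply_linked(balances, chain, guarded={source})


balances = {"source": 100, "destination": 0, "control": 0}
too_poor = balance_conditional(balances, "source", "destination", "control", 150, 10)
allowed = balance_conditional(balances, "source", "destination", "control", 50, 10)
```

    The first call fails on the threshold check and leaves every balance untouched; the second passes the check and moves only the requested amount.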

    Biodigital Jazz! was presented by Joran at Software You Can Love ’24 in Milan.

    A behind the scenes look at the intersection of business and people (“bio”), engineering (“digital”) and art (“jazz”) at TigerBeetle. This is a new kind of talk, containing ideas we haven’t shared before, but which have influenced our engineering the most.

    Biodigital Jazz is the secret behind everything we do at TigerBeetle—and you can watch it here:

    Biodigital Jazz! - Joran Dirk Greef

    A new take on consensus through the lens of durability:

    “Availability is a function of durability, and consensus is this function.”

    This new talk, presented by Joran at SD’24, is the distillation of 3 years of our team’s learnings in implementing consensus for production.

    What we’ve come to find fascinating about consensus is not so much consistency (which is table stakes), but rather durability, and the relationship of durability to availability. In other words, how you can take the physical and make it logical:

    • to exchange redundancy for availability,
    • where the more efficient the exchange,
    • the higher the availability.

    TigerBeetle’s VSR has really been the story of striving to preserve and maximize durability as much as possible, and so exchange physical redundancy for logical availability. (If you’re curious about how CAP and physical availability fit into all this, we hope you enjoy!)

    Joran’s SD’24 talk

    All SD’24 live premieres are now available! Whether you missed the in-person conference in NYC or want to rewatch all the sessions, the entire SD’24 lineup is ready for replay!

    In this episode of Software Engineering Daily, Building a Fast Financial Transactions Database, Joran joined Gregor Vand to share the story of the creation of TigerBeetle, starting with the name! :)

    Mark your calendars for P99 Conf, happening online from October 23-24.

    Matklad will look at TigerBeetle’s just-in-time compaction algorithm for LSM-tree storage:

    • using only static memory allocation,
    • with perfect pacing to solve write stalls for predictable P100s,
    • and guaranteeing deterministically identical data files across all replicas for faster recovery.

    Register free today to reserve your spot!

    The 2nd TigerBeetle hackathon takes place soon!

    Join the whole TigerBeetle team in person at the Interledger Hackathon, Saturday and Sunday, October 19-20, to see how you can use Interledger, the Open Payments API, and Web Monetization to power payments for the future. Register to take part.

    The official Interledger Summit will also kick off the following weekend on October 26 at the Century City Conference Venue in Cape Town, where Joran will be speaking on “The Next 30 Years of Transaction Processing”.

    Book your tickets.


    Till next time… a “tight beat and a rare groove”!

    The TigerBeetle Team

    Did you notice something new atop TigerBeetle’s YouTube channel? :)
