Random Fuzzy Thoughts

Mar 28, 2023

TigerBeetle riding on a bunny, zapping bugs with lasers

I have been doing a lot of randomized testing and fuzzing lately, and I want to document some lessons learned. This isn’t going to be a grand unified theory of fuzzing, but rather a collection of observations useful for people who already do randomized testing. This post will explore three related questions:

  • How to make generative fuzzing failures robustly reproducible?

  • How to use existing randomized testing infrastructure for specifying the tests manually?

  • How to check that generative fuzzing indeed covers interesting scenarios?

Setting the stage

The basic idea behind fuzzing is that a fuzzer generates a slice of random bytes, a []u8, which then serves as the input to the system under test. As a bonus, the fuzzer can also observe the coverage of the program as it processes the input, and use that to smartly devise potentially interesting variations of the input data. This technique is particularly appropriate for parsers, which work on []u8 directly.

For example, if the fuzzer provides the generated input via stdin, the test could look like this:
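
(A minimal sketch in Zig; the run function standing in for the system under test is hypothetical.)

const std = @import("std");

// Hypothetical entry point of the system under test (e.g. a parser).
fn run(input: []const u8) !void {
    _ = input;
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // The fuzzer pipes its generated bytes to us via stdin.
    const input = try std.io.getStdIn().readToEndAlloc(allocator, 1 << 20);
    defer allocator.free(input);

    // We can't assert anything specific about the result, only that this
    // call does not crash.
    try run(input);
}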



As we don’t know what the input data will be, we can’t assert anything specific about the result. We can assert general properties, however, like “it does not crash”. And, depending on the assertion density and depth of the code under test, “it does not crash” can cover quite a bit of ground!

A related idea is randomized, property-based testing. Often, the input to the program is not just a pile of random bytes, but rather something with more structure to it (trees, lists, sequences of events, etc.). For these cases, it’s often convenient to generate the entire structure randomly:
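
(A sketch of the shape such a test can take in Zig; the Op type and the bounds here are made up for illustration.)

const std = @import("std");
const testing = std.testing;

// Hypothetical structured input: a sequence of operations on a queue.
const Op = union(enum) {
    push: u32,
    pop,
};

test "random operations" {
    var prng = std.rand.DefaultPrng.init(92);
    const rng = prng.random();

    var ops = std.ArrayList(Op).init(testing.allocator);
    defer ops.deinit();

    // Generate the entire structured input before running the test.
    const op_count = rng.uintLessThan(usize, 100);
    var i: usize = 0;
    while (i < op_count) : (i += 1) {
        const op: Op = if (rng.boolean())
            Op{ .push = rng.int(u32) }
        else
            Op{ .pop = {} };
        try ops.append(op);
    }

    // Run `ops` against the queue under test and a trivial model,
    // asserting that the two agree (elided).
}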



A third variation is what I call randomized interaction. Here, rather than generating a test input upfront and then running the system under test, the system and the random generator run concurrently, responding to each other’s actions. This often comes up in distributed systems: a randomized test models a system of several nodes connected by an unreliable network, and randomization determines which messages are delivered and which are dropped. A simple implementation models the network as a set of in-flight messages, such that on every step a random message is either delivered or dropped:

const std = @import("std");
const testing = std.testing;

const Message = struct {
    src: u8,
    dst: u8,
    payload: []u8,
};

const Network = std.ArrayList(Message);

test "distributed system" {
    const seed = 92;
    const drop_message_probability = 0.1;

    var random = std.rand.DefaultPrng.init(seed);
    const rng = random.random();

    var network = Network.init(testing.allocator);
    defer network.deinit();

    // The nodes under test would seed `network` with their initial
    // messages here (elided).

    while (true) {
        // Stop once every in-flight message has been delivered or dropped.
        if (network.items.len == 0) break;

        const message_index = rng.uintLessThan(
            usize,
            network.items.len,
        );
        const message = network.swapRemove(message_index);

        if (rng.float(f32) < drop_message_probability) {
            // The unreliable network "loses" this message.
            continue;
        }

        // Otherwise, deliver the message to its destination node, which
        // may in turn append replies to `network` (elided).
        _ = message;
    }
}

Finite PRNG

One very fruitful idea is that the distance between “let’s ask an advanced coverage-guided fuzzer to generate a slice of bytes” and “let’s use a PRNG to generate a tree-shaped input” is actually quite small. It’s possible to join the two approaches together. Input from the fuzzer can be treated as a PRNG which generates a finite number of random bytes.
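
A sketch of what such a finite PRNG could look like in Zig (the name FiniteRng matches the usage below; the exact API is illustrative):

// A PRNG with a finite amount of entropy: it hands out bytes from a
// fixed slice (e.g. the input generated by the fuzzer) and fails once
// that slice is exhausted.
const FiniteRng = struct {
    bytes: []const u8,

    fn init(bytes: []const u8) FiniteRng {
        return .{ .bytes = bytes };
    }

    fn byte(self: *FiniteRng) error{OutOfEntropy}!u8 {
        if (self.bytes.len == 0) return error.OutOfEntropy;
        const result = self.bytes[0];
        self.bytes = self.bytes[1..];
        return result;
    }

    // Richer helpers (uintLessThan, float, boolean, ...) would be built
    // on top of `byte` in the same fallible style.
};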



I learned this idea from the most excellent arbitrary crate. There are two benefits here:

First, we can use something like AFL to generate entropy for our FiniteRng. That way, we teach a coverage guided fuzzer to generate structured inputs. By flipping bytes in the random unstructured array, the fuzzer would effectively apply local transformations to the structure we generate from the input. In particular, if the fuzzer can minimize the input array, it now can minimize the structured input as well! That’s not necessarily the most effective way to get minimization for free (I think Python’s Hypothesis is doing something smarter), but it definitely punches above its weight!

Second, the minimization idea works even without a smart fuzzer. That is because the length of the entropy slice passed to FiniteRng measures the complexity of the structured result. This works especially well for interactive tests. For example, when simulating a distributed system, you can use entropy to simulate random network faults. After you run out of entropy, the network becomes reliable and the state of the system converges in a finite number of steps (or there’s a liveness bug which you can flag). Binary searching the smallest amount of entropy which still triggers the bug yields a minimal test case.
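
Concretely, the fault-injection decision can be written so that running out of entropy means “no fault”, a sketch building on the illustrative FiniteRng above:

// Inject faults only while there is entropy left. Once the FiniteRng is
// exhausted the network becomes perfectly reliable, so the cluster either
// converges or reveals a liveness bug.
fn shouldDropMessage(finite_rng: *FiniteRng) bool {
    const b = finite_rng.byte() catch return false;
    return b < 25; // Roughly a 10% drop probability while entropy lasts.
}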

I codified my favorite simplistic setup for this style of test into the arbtest library. Each test is fully defined by a seed — a u64. This u64 is actually two u32s joined together — one component defines the length of a slice of random bytes, the other is the seed for a PRNG to fill the bytes. As the seed is just a number, it’s easy to share it with colleagues in a chat message. And as the seed encodes the length, it’s possible to employ a naive, but practically effective minimization strategy of just trying random seeds with smaller lengths.
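
In code, the split could look something like this (which half encodes what is my guess here, not necessarily arbtest’s actual layout):

// Which half is which is a guess, not arbtest's actual layout.
const seed: u64 = (92 << 32) | 0xdead_beef;
const entropy_length = seed >> 32; // High half: length of the random byte slice.
const prng_seed = seed & 0xffff_ffff; // Low half: seed of the PRNG that fills it.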

And this is a nice segue into the next idea:

Serializing an instance of a fuzz test

We’ve already seen two ways to do this. If the raw input to the test is a slice of bytes generated by the fuzzer, you can save that directly. More conveniently, the sequence of bytes can be generated from a small seed for the PRNG, in which case the seed is all you need to reproduce the test.

Both approaches are problematic though — if you use raw random bytes to generate structured data, then small tweaks to the generation procedure can drastically change the meaning of a seed. Really, you need two numbers for reproducibility — the seed, and the commit hash. Trying the same seed on a different branch would probably trigger a semantically different test.

To stay concrete, let’s say we are generating random WASM programs to fuzz a WASM runtime. We start with raw bytes, feed them through wasm-smith to get a syntactically valid WASM module, and feed that into our compiler. If we find a bug with a particular seed, how do we add it to our test corpus? Again, just adding a seed won’t be robust enough — cargo update -p wasm-smith would probably silently break all our tests. The answer here, I think, is to serialize not the unstructured seed, but the actual generated structured data (the .wasm module in this case; see also camshaft/bolero#49). If the test infra is formulated as three nicely separated phases:
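
(A sketch of what those phases could look like; the names and types are illustrative, not any particular library’s API.)

// Illustrative stand-in for the structured input (e.g. a wasm module).
const Input = struct {
    bytes: []const u8,
};

// Phase 1: grow a structured input out of raw fuzzer entropy
// (this is where something like wasm-smith would run; elided here).
fn generate(entropy: []const u8) Input {
    return .{ .bytes = entropy };
}

// Phase 2: persist and restore the structured input itself,
// independently of the entropy that produced it.
fn serialize(input: Input, writer: anytype) !void {
    try writer.writeAll(input.bytes);
}

fn deserialize(bytes: []const u8) Input {
    return .{ .bytes = bytes };
}

// Phase 3: run the system under test on a structured input and check
// its properties.
fn check(input: Input) !void {
    _ = input;
}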



Then you can re-use the same driver code for:

  • fuzzing with random data

  • replaying serialized fuzz failures

  • fully manual tests with hand-crafted inputs

Nifty!

However, this strategy breaks when we go from generating a structured input to an interactive test. Again, consider a test for a distributed system, where a PRNG is used to decide which messages in the network should be dropped. We can’t just serialize the sequence of decisions of which packets to deliver — small updates to the system might lead to slightly different sequences of messages emitted. For example, changing the heartbeat timeout changes the sequence of messages every node sends, but, intuitively, this shouldn’t meaningfully affect the set of interesting interleavings. I don’t know a good way to handle this.

Predicate coverage

Or maybe I do — last week I got another interesting idea in this space. I was fighting (or rather just chasing) a particularly nasty bug in TigerBeetle. It was nasty, because it was a liveness bug, rather than a correctness bug — nothing was wrong in particular, but nothing good was happening either. I didn’t have immediate success with smoking the bug out using our randomized testing infrastructure. I even began to doubt whether the bug existed at all.

So, I decided to reproduce the issue manually. The problem is that manually playing the nefarious mailman for a bunch of rather chatty nodes is not fun. There was a particular sequence of interesting messages I wanted to try, but to do that I also needed to handle a whole bunch of auxiliary, normal messages. And then I realized that our randomized testing infrastructure, our simulator, our beloved VOPR, can handle the boring parts for me. In particular, I can start the simulator with all failure probabilities set to zero, so that it simulates a perfect network. This network then handles the routine of relaying all the auxiliary messages such as heartbeats or confirmations.

On top of this normally ticking network, I can then write my nefarious test in a somewhat Prolog-ish way. Rather than dropping a particular message or otherwise executing a particular series of steps, the test operates with predicates. It drops all messages of a particular kind, waits until the nodes reach a particular state (this again is a predicate), then changes the condition for dropping messages. The end result turned out to be quite concise — I was able to state, more or less directly, the scenario which would lead to a livelock, and the simulator filled in all of the blanks!
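
The shape of such a test, in heavily simplified and hypothetical form (none of the names or APIs below are the VOPR’s actual interface), might look something like this:

// Hypothetical, heavily simplified simulator interface.
const Command = enum { prepare, prepare_ok, start_view_change, do_view_change };

const Simulator = struct {
    // Drop every message whose command matches the filter.
    fn dropAll(self: *Simulator, command: Command) void {
        _ = self;
        _ = command;
    }

    // Keep ticking the (otherwise perfect) network until a predicate
    // over the cluster state holds.
    fn runUntil(self: *Simulator, predicate: *const fn (*const Simulator) bool) void {
        _ = self;
        _ = predicate;
    }
};

fn allReplicasStartedViewChange(sim: *const Simulator) bool {
    _ = sim;
    return true;
}

fn primaryCannotCommit(sim: *const Simulator) bool {
    _ = sim;
    return true;
}

test "liveness scenario" {
    var sim = Simulator{};

    // Drop one kind of message until the cluster reaches the state the
    // scenario needs...
    sim.dropAll(.do_view_change);
    sim.runUntil(allReplicasStartedViewChange);

    // ...then change which messages are dropped and let the simulator
    // carry all the routine traffic.
    sim.dropAll(.prepare_ok);
    sim.runUntil(primaryCannotCommit);
}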

I think I’ve seen something similar before, in IVy or mypyvy. Those tools can prove correctness of various distributed systems. You specify the system as a logical formula, TLA+-style, then you spell out the inductive invariant, and the tool tries to automatically verify it (or find a case where it doesn’t hold). An interesting feature there is that you can help the prover along by providing it with intermediate results. You dot the outline, and the tool connects the dots.

I think this gives us a way to robustly “serialize” interesting execution traces for interactive randomized tests. A serialization would be a series of predicates over the state of the system. This serialization can’t be replayed directly, but we can just pick a random trace and see if it hits the series of predicates in order. Moreover, we can construct the trace piece-wise — we don’t have to find a single seed which goes through all of the transitions; we can use separate seeds for each transition and discover them incrementally (this requires the predicates to be rather tight, such that they guarantee the existence of piece-wise transitions).

The regression test suite can be a collection of such trace-predicates. Curiously, this test suite will also give an answer to an important question: “how long do I fuzz each PR?”. You fuzz just enough to hit all of the recorded traces.

Conclusions

That’s all I have for today! No grand theorems, but let’s recap some of the ideas:

  • Randomized and fuzz tests are as great as ever!

  • Finite PRNG is a useful concept to think about:

    • It glues unstructured, coverage-guided fuzzing to structured, property-based testing.

    • It provides a natural measure of complexity for structured data. The “length” of a graph is the size of the random seed needed to produce it.

    • It provides shrinking (test case minimization) for free.

  • For interactive simulations, a finite PRNG gives a natural answer for when to wind the simulation down.

  • A PRNG seed is enough to reproduce a random test at a specific commit. For robust reproduction across different commits, you need to persist the derived structured data, and not just the seed.

  • For interactive random tests, it’s useful to think about key frames — predicates over the state of the system, which sufficiently narrow down what the test is about, without specifying all the events in exhaustive detail.
