Index, Count, Offset, Size

Wherein we make progress towards solving one of the most vexing problems of Computer Science — naming things.

I am at a point in my career where the bulk of my bugs are stupid — I simply fail to type in the code I have in my mind correctly. In languages with shadowing (like Rust), I will fail to use a shadowed variable from the outer scope. In languages without shadowing (like Zig), I will use the wrong version of a variable.

Pests like these are annoying, so I am always on the lookout for tricks to minimize the probability of bugs. One of the best possible tricks is of course strong static typing. Types are good at preventing me from doing stupid things by accident. But types have limitations. The text of a well-typed program is a two-in-one artifact — a specification of behavior of a machine (the algorithm), and a proof that the behavior is not unacceptable. Zero cost abstractions are code without behavior, just proofs!

The art of skillful typing lies in minimizing verbosity of the proof, while maximizing the amount of unwanted behaviors ruled out, weighted by the probability and the cost of misbehavior. But this ratio is not always favorable — the code can be so proof-heavy that it becomes impossible to understand what it actually does!

There’s one particular cranny where types don’t seem to usefully penetrate yet: indexing and associated off-by-one errors.

If you don’t need indexing arithmetic, you can use newtype pattern to prevent accessing oranges with apple-derived indexes. You can even go further and bind indexes to specific arrays, using, e.g., Gankra trick, but I haven’t seen that to be useful in practice.

If, however, you compute indexes, you need to be extra careful to stay in bounds of an array, and need to be mindful that the maximum valid index is one less than the length of the array. While we don’t solve this problem perfectly at TigerBeetle, I think we have a naming convention that helps:

Thanks @marler8997 for the illustration idea!

We consistently use count whenever we talk about the number of items, and index to point to a particular item. The positive invariant is index < count. Consistency is the trick — there are certain valid and invalid ways to combine indexes and counts in an expression, and, if there’s always an _index or a _count suffix in the name, wrong combinations immediately jump out at you, dear reader, even if you don’t understand the specifics of the code.

In low-level code you often need to switch between a well-typed representation []Tand raw bytes []u8. To not confuse the two index spaces, the “count of bytes” is always called a size. By definition,

size = @sizeOf(T) * count;

And offset is the bytewise counterpart of index.

We don’t use length in our code, as its meaning is ambiguous. Rust str::len is the byte-size of the string, but Python’s len(str) is the count of Unicode code-points!

Here’s an example of the naming convention in action from NodePool:

pub fn release(pool: *NodePool, node: Node) void {
    comptime assert(meta.Elem(Node) == u8);
    comptime assert(meta.Elem(@TypeOf(pool.buffer)) == u8);

    assert(@intFromPtr(node) >= @intFromPtr(pool.buffer.ptr));
    assert(
        @intFromPtr(node) + node_size <=
            @intFromPtr(pool.buffer.ptr) + pool.buffer.len
    );

    const node_offset =
        @intFromPtr(node) - @intFromPtr(pool.buffer.ptr);

    const node_index =
        @divExact(node_offset, node_size);

    assert(!pool.free.isSet(node_index));
    pool.free.set(node_index);
}

You can see that the node_index calculation is correct mechanically, just from the names of the variables.

And here’s an index/count example from our ewah implementation:

pub fn decode(
    source: []align(@alignOf(Word)) const u8,
    target_words: []Word,
) usize {
    assert(source.len % @sizeOf(Word) == 0);
    assert(disjoint_slices(u8, Word, source, target_words));

    const source_words = mem.bytesAsSlice(Word, source);

    var source_index: usize = 0;
    var target_index: usize = 0;
    while (source_index < source_words.len) {
        const marker: *const Marker =
            @ptrCast(&source_words[source_index]);
        source_index += 1;

        @memset(
            target_words[target_index..][0..marker.uniform_word_count],
            if (marker.uniform_bit == 1) ~@as(Word, 0) else 0,
        );
        target_index += marker.uniform_word_count;

        stdx.copy_disjoint(
            .exact,
            Word,
            target_words[target_index..][0..marker.literal_word_count],
            source_words[source_index..][0..marker.literal_word_count],
        );
        source_index += marker.literal_word_count;
        target_index += marker.literal_word_count;
    }
    assert(source_index == source_words.len);
    assert(target_index <= target_words.len);

    return target_index;
}

Note well that the index/count convention synergizes with two other TigerStyle shticks. We use “big endian naming”, where qualifiers are appended as suffixes:

source
source_words
source_index

And we try to make sure that dual names have the same length:

source
target

The code aligns itself, and makes the bugs pop out:

source_index += marker.literal_word_count;
target_index += marker.literal_word_count;

Of course, a simple naming convention by itself won’t make software significantly better. But grains of sand add up to Dune: there’s no one trick to get rid of the bugs, but you can layer your defenses to exponentially decrease the probability of a failure.

Enjoyed this post? Add our RSS feed.

An idling tiger beetle Speech bubble says hi