$$\newcommand{\abs}[1]{\left|{#1}\right|} \newcommand{\bra}[1]{\langle{#1}|} \newcommand{\braket}[2]{\langle{#1}|{#2}\rangle} \newcommand{\C}{\mathbb C} \newcommand{\F}{\mathbb F} \newcommand{\ket}[1]{|{#1}\rangle} \newcommand{\Odd}{\textrm{Odd}} \newcommand{\powerset}{\mathcal{P}} \newcommand{\set}[1]{\left\{#1\right\}} \newcommand{\symdiff}{\mathbin{\Delta}} \DeclareMathOperator*{\Symdiff}{\lower{0.3ex}{\Large\symdiff}} \newcommand{\where}{\ \middle|\ }$$

bitwise operators when you can’t bit (fronttick part 2)

25 Sep 2024

I like code that makes you go “wtf?”. Code that, no matter how many times you re-read it, just doesn’t make any sense. It’s not held together by duct tape, but by magic dust.

// Computes `a^b` for a, b any numbers 0, .., 7.
public static int Xor8(int a, int b) {
    int a4 = a % 4;
    int b4 = b % 4;
    int res = 177 % (a4 + b4 + 3);
    if (a4 == b4)
        res = 0;
    if (a4 < a ^ b4 < b)
        res += 4;
    return res;
}

This bit of code can be found in my c# to Minecraft compiler, Fronttick12. Last time I wrote about this compiler, I gave an overview about how to handle control flow in the… unique setting mcfunction brings. That post was pretty compiler-focused. This time, it won’t be as bad.

The problem

In mcfunction, your only operators are int += int, int -= int, int *= int, int /= int, and int %= int. That’s much less than .NET’s System.Int32 gives you! But what if, for some reason, you really wanted to do some bitwise math in Minecraft, not caring about performance? Well, you just have to do it yourself.

In this post, I’ll discuss how to implement int ^ int, int & int, and int | int. I’ll focus in particular on the XOR, as the AND goes very similar, and the OR can be derived from the AND with some NOTs.

As this is part of “reimplementing the System namespace”, we also need to be careful with what tools we allow. It’s really easy to create a circular dependency that recurses itself to death3. As such, the only things I allow are the following:

In particular, boolean arithmetic is implemented completely independently from ints, so we do not get a cyclic dependency if we use bool ^ bool in our implementation for int ^ int4.

A tiny diversion

Minecraft’s /= and %= are special. In ordinary programming languages, these round towards zero, and maintain the sign, respectively. However, ever since versions 18w31a and 18w32a, Minecraft deviates from this by internally using Java’s floorDiv and floorMod respectively, which result in rounding down, and giving a positive result5 respectively, instead.

This gives my Int32 class two versions of the two operators. One which conforms to c#’s expectations, and one that uses mcfunction’s directly. Naturally, the mcfunction version is more performant.

namespace System {
    public struct Int32 {

        ...

        public static int PositiveMod(int a, int b) {
            int res = a;
            Run($"scoreboard players operation {VarName(res)} _ %= {VarName(b)} _");
            return res;
        }

        public static int operator %(int a, int b) {
            int res = PositiveMod(a,b);
            // Flip to the other remainder if signs are wrong
            if (a < 0 & res != 0)
                res -= b;
            if (b < 0 & res != 0)
                res += b;
            return res;
        }

        ...
    }
}

Here, the Run() method is a compiler intrinsic that turns a string into a raw command, and the VarName() method is a compiler intrinsic that converts a c# variable to the name it has in mcfunction.

I must say, it was fun debugging these bitwise operators when I wasn’t aware this was an issue! If you’re ever going to make a compiler from c# to mcfunction, make sure to take this into account. Yes, this advice is very applicable to everyone’s daily lives.

The lazy solution

Now, let’s get to the meat of this post. First, note that instead of implementing XOR full 32-bit ints at a time, we can also just handle smaller blocks of bits. We can then loop over those smaller blocks until we have handled the full int.

Doing blocks of one bit at a time is the easiest, but also very costly.

public static int operator ^(int a, int b) {
    int res = 0;
    int power = 1;
    for (int i = 0; i < 31; i++) {
        // 1 bit xor at a time: just add the two.
        // If both are true, we should be 0 instead.
        int temp = int.PositiveMod(a, 2)
                 + int.PositiveMod(b, 2);
        if (temp == 2) temp = 0;
        // Put this into the result.
        res += temp * power;
        power *= 2;
        a /= 2;
        b /= 2;
    }
    // I'm disregarding the sign bit here, as that's just
    // a headache the same for all cases.
    return res;
}

Each iteration of the loop has overhead. While calculating the XOR of one bit takes only 4 commands in mcfunction, there’s also all the scaffolding around it. If we unroll the loop into goto form6, we see what truly takes up our time.

int i = 0;
loop_label:
    // <XOR calculation omitted>                    // 4 mc commands
    // Consolidate temp result into result so far
    res += temp * power;                            // 2 mc commands
    power *= 2;                                     // 1 mc command
    a /= 2;                                         // 1 mc command
    b /= 2;                                         // 1 mc command
    // Explicit for-loop handling          
    i++;                                            // 1 mc command
    int cond = i < 31;                              // 1 mc command
    if (cond) goto break_label;                     // 1 mc command
    else goto loop_label;                           // 1 mc command
break_label:

Each loop iteration adds nine commands of overhead, no matter what our XOR calculation looks like! If we can even just go from one bit at a time to two bits at a time, we would half this overhead. Let alone an even larger amount of bits.

Chinese remainders

Well, if one bit at a time doesn’t cut it, what about two? As any sane programmer would do, I looked around the internet whether someone else has had similarly insane problems. And in fact, yes! This stackoverflow answer had a very interesting, very under-elaborated answer for AND.

// Computes `a&b` for a, b any numbers 0, .., 3.
public static int And4(int a, int b) {
    return ((9 * a) % 16) % (b+1);
}

I was intrigued. With just four arithmetic ops, you’re able to calculate something that would otherwise require a look-up table with 16 entries. But secretly, this will also be just a look-up table! Let’s dissect what’s going on here.

It’s quite a common trick to store data in an int, and look values up with bitwise operators. If your problem is neat enough, you may even have the possibility to overlap these look-ups, so that for instance look-up #1 uses bits 1 through 4, look-up #2 uses bits 3 through 7, whatever. While they’re the easiest, bitwise operators aren’t the only ways you can do look-ups.

Enter the Chinese remainder theorem. It states that if we have a system of equations7

\[\begin{equation*} \begin{cases} x\ \%\ n_1 = a_1, \\\phantom{x\ \%\ n_2\ } \vdots \\x\ \%\ n_k = a_n, \end{cases} \end{equation*}\]

there’s always a solution if the $n_i$ don’t share any factors. This can be used as a look-up table! Your data is stored in $x$, and you do a bit of a whacky lookup with the keys $n_i$. These $n_i$ are a bit more restrictive than you’d expect though. Doing $n_1 = 1$, $n_2 = 2$, $n_3 = 3$ already runs into problems with $n_4 = 4$, as now $n_2$ and $n_4$ share a “two” factor. Your keys are also the upper bound for your value, which is also somewhat annoying.

But let’s try this out. Suppose we want to calculate 1^b, where b is either zero or one. It’s a bit of a lame example, but it will demonstrate the above. Our two output values are 1^0 = 1 and 1^1 = 0. This means we want some function $f(b)$ and a number $x$ so that

\[\begin{equation*} \begin{cases} x\ \%\ f(0) = 1, \\x\ \%\ f(1) = 0. \end{cases} \end{equation*}\]

If we define $f(b) = b + 2$, we have $n_1 = 2$ and $n_2 = 3$, which are coprime. This allows us to use the Chinese remainder theorem, and gives us an $x$ that solves the above. In this case, $x=3$ (or $9$, or $15$, or …) works. Thus, 3 % (b+2) implements 1^b in ordinary arithmetic.

But this is not a very large look-up table. Doing two arithmetic to get two values? That’s lame. Suppose we now want to calculate 2^b, where b may be any number between zero and three. That’s a more interesting example! Our four output values are 2^0 = 2, 2^1 = 3, 2^2 = 0, and 2^3 = 1. This means we want some function $f(b)$ and a number $x$ so that

\[\begin{equation*} \begin{cases} x\ \%\ f(0) = 2, \\x\ \%\ f(1) = 3, \\x\ \%\ f(2) = 0, \\x\ \%\ f(3) = 1. \end{cases} \end{equation*}\]

Now, creating a function $f(b)$ whose values are mutually coprime is a bit of a hassle. Especially if you want to do it in very few operations! The Chinese remainder theorem gives us a solution in case everything is coprime. But this doesn’t mean that all hope is lost if we allow some values to share factors, we just need to do some work ourselves.

Take for instance the following two systems.

\[\begin{equation*} \begin{cases} y\ \%\ 2 = 1, \\y\ \%\ 4 = 3; \end{cases} \qquad \begin{cases} z\ \%\ 2 = 1, \\z\ \%\ 4 = 2. \end{cases} \end{equation*}\]

Even if $2$ and $4$ share factors, the systems may sometimes still be solved. The “$y\ \%\ 2 = 1$” requirement scales up to “$y\ \%\ 4 = 1$ or $y\ \%\ 4 = 3$”. This is compatible with the second requirement, and the system can be simplified into just one equation “$y\ \%\ 4 = 3$”, which has solutions. However, the other system is not so lucky. The requirements “$z\ \%\ 4 = 1$ or $z\ \%\ 4 = 3$” and “$z\ \%\ 4 = 2$” are actually incompatible. There is no $z$ that satisfies the system.

Returning to our example of calculating 2^b, the $f(b)$ don’t need to be fully coprime if the values we assign are compatible enough, like the $y$ system. Some clever trial and error8 tells us $f(b) = b+3$ does the trick:

\[\begin{equation*} \begin{cases} 78\ \%\ 4 = 2, \\78\ \%\ 5 = 3, \\78\ \%\ 6 = 0, \\78\ \%\ 7 = 1. \end{cases} \end{equation*}\]

Here, $4$ and $6$ share a factor, but that doesn’t matter, because their outputs are “compatible enough”. In other words, we have now implemented 2^b as 78 % (b+3). We just got four values with only two operations! That’s pretty good.

But we’re not really interested in just calculating 2^b. No, what we really want is a^b. We can repeat the above process, and (miraculously!) we get 416 % (b+3) = 0^b, 345 % (b+3) = 1^b, and 7 % (b+3) = 3^b. In general, there is no reason for the same modulus (b+3) to work for all of them, but it did in this case.

If we can now create a simple function $g(a)$ that spits out $(416,345,78,7)$, we have implemented a^b as g(a) % (b+3). As we’re working with modular arithmetic, we can also add $420$ to all of these freely, as $420\ \%\ (b+3) = 0$ for all values of $b$.

I will choose values $(-4, 345, 78, 427)$. The differences between these entries are, in order, $349$, $-267$, and $349$. If we now choose to work modulo $616$, we have $349\ \%\ 616 = -267\ \%\ 616$, so that these are spaced apart by the “same amount”. Modulo $616$, our four values are just $-4 + 349\cdot k$, with $k$ going from $0$ to $3$.

With this approach, you must alternate adding and subtracting; doing $+a$ and $+b$ can’t really be justified as “actually the same modulus” the same way as doing $+a$ and $-b$. This is why my four values look like that. Additionally, your modulus needs to be large enough to actually store all the values. If you choose to do $(416, 345, 498, 427)$, you subtract by $71$ and add $153$, so that your modulus is only $224$. Your values won’t fit then.

But now, we can efficiently generate these four values by calculating (349*a % 616) - 4. This is our $g(a)$. Putting everything together, we obtain the following code.

// Computes `a^b` for a, b any numbers 0, .., 3.
public static int Xor4(int a, int b) {
    return (((349 * a) % 16) - 4) % (b+3);
}

With five arithmetic operators, we implemented what would be a look-up table with 16 entries. This XOR is unfortunately one more operator than the AND, which can’t really be avoided with this setup.

Exercise (Easy): I worked through the XOR operator above. Try to explain the constants in the AND operator implementation at the top of this section, which appear for similar reasons:

// Computes `a&b` for a, b any numbers 0, .., 3.
public static int And4(int a, int b) {
    return ((9 * a) % 16) % (b+1);
}

Why does the XOR implementation have an additional -4 that does not have an equivalent in the AND implementation?

Exercise (Hard): Implement the following OR operator with at most six arithmetic operators:

// Computes `a|b` for a, b any numbers 0, .., 3.
public static int Or4(int a, int b) {
    return ...;
}

Hint for programmers: The OR operator is pretty devious. You might want to write some code that automatically validates whether a given $f(b)$ is valid for all four systems at once, and then play around a little.

Hint for mathematicians: Directly find all $c$ such that $f(b) = b+c$ is valid for all four systems at once.

(Or just don’t do these exercises. I actually hope you won’t need this stuff! It’s a little too cursed to encounter in your daily life.)

Why not just use look-up tables?

So, for two bits, we now have the AND operator in four Minecraft commands, the XOR operator in five, and the OR operator in six. And honestly, they are pretty funky implementations. Why not just do the easy thing, and grab a look-up table?

The previous post already went into this, but if you don’t want to interface with the world or nbt, you don’t have $\mathcal O(1)$ look-up tables. You must use an if-else tree. If your look-up table has size $N$, this gives you $\mathcal O(\log N)$ performance. Not bad at all.

But now for a statistic you don’t even consider in any sane language. If your look-up table has size $N$, this gives you $\mathcal O(N)$ generated files. To drive the point home, let’s write out the look-up table for the 2-bit XOR.

# File datapack:xor
execute if score Var_a _ matches ..1 run function datapack:a01
    # File datapack:a01
    execute if score Var_a _ matches 0 run function datapack:a0
        # File datapack:a0
        execute if score Var_b _ matches ..1 run function datapack:a0b01
            # File datapack:a0b01
            execute if score Var_b _ matches 0 run scoreboard players set Var_a _ 0
            execute if score Var_b _ matches 1 run scoreboard players set Var_a _ 1
        execute if score Var_b _ matches 2.. run function datapack:a0b23
            # File datapack:a0b23
            execute if score Var_b _ matches 2 run scoreboard players set Var_a _ 2
            execute if score Var_b _ matches 3 run scoreboard players set Var_a _ 3
    execute if score Var_a _ matches 1 run function datapack:a1
        # File datapack:a1
        execute if score Var_b _ matches ..1 run function datapack:a1b01
            # File datapack:a1b01
            execute if score Var_b _ matches 0 run scoreboard players set Var_a _ 1
            execute if score Var_b _ matches 1 run scoreboard players set Var_a _ 0
        execute if score Var_b _ matches 2.. run function datapack:a1b23
            # File datapack:a1b23
            execute if score Var_b _ matches 2 run scoreboard players set Var_a _ 3
            execute if score Var_b _ matches 3 run scoreboard players set Var_a _ 2
execute if score Var_a _ matches 2.. run function datapack:a23
    # File datapack:a23
    execute if score Var_a _ matches 2 run function datapack:a2
        # File datapack:a2
        execute if score Var_b _ matches ..1 run function datapack:a2b01
            # File datapack:a2b01
            execute if score Var_b _ matches 0 run scoreboard players set Var_a _ 2
            execute if score Var_b _ matches 1 run scoreboard players set Var_a _ 3
        execute if score Var_b _ matches 2.. run function datapack:a2b23
            # File datapack:a2b23
            execute if score Var_b _ matches 2 run scoreboard players set Var_a _ 0
            execute if score Var_b _ matches 3 run scoreboard players set Var_a _ 1
    execute if score Var_a _ matches 3 run function datapack:a3
        # File datapack:a3
        execute if score Var_b _ matches ..1 run function datapack:a3b01
            # File datapack:a3b01
            execute if score Var_b _ matches 0 run scoreboard players set Var_a _ 3
            execute if score Var_b _ matches 1 run scoreboard players set Var_a _ 2
        execute if score Var_b _ matches 2.. run function datapack:a3b23
            # File datapack:a3b23
            execute if score Var_b _ matches 2 run scoreboard players set Var_a _ 1
            execute if score Var_b _ matches 3 run scoreboard players set Var_a _ 0

Welcome back, that scrolling must’ve taken a while.

Using a look-up table like this, we get 7 Minecraft commands instead of the four/five/six we get when doing arithmetic. (Or 4 commands if you don’t count failed execute commands.) This is not only not worth it, but the amount of files generated hurts. I simply prefer the elegance of plain arithmetic:

# File datapack:xor
scoreboard players operation Var_a _ *= 349 _
scoreboard players operation Var_a _ %= 16 _
scoreboard players operation Var_a _ -= 4 _
scoreboard players operation Var_b _ += 3 _
scoreboard players operation Var_a _ %= Var_b _

Allowing for control flow

Except for the OR operator, all $f(b)$ so far were of the form $f(b) = b + c$ for certain $c$. But we have another variable at our disposal. What if we did something like $f(a,b) = a + b + c$? The XOR operator is very symmetric, so having the two variables take different roles feels… kinda off.

\[\begin{equation*} \begin{array}{r|llll} \text{XOR} & 0 & 1 & 2 & 3 \\ \hline 0 & 0 & 1 & 2 & 3 \\ 1 & 1 & 0 & 3 & 2 \\ 2 & 2 & 3 & 0 & 1 \\ 3 & 3 & 2 & 1 & 0 \end{array} \end{equation*}\]

At this point, the notes I’m copying over into this post are getting a bit thin, and it’s been a while – I don’t know how exactly I got to the next step. Probably just trial and error? I had an entire framework set up for testing various operators and $f(a,b)$ at this point, so it couldn’t have taken long.

If we ignore the diagonal, you can implement the XOR with just three operators.

\[\begin{equation*} \begin{array}{r|llll} 177\ \%\ (a + b + 3) & 0 & 1 & 2 & 3 \\ \hline 0 & 0 & 1 & 2 & 3 \\ 1 & 1 & {\color{red}2} & 3 & 2 \\ 2 & 2 & 3 & {\color{red}2} & 1 \\ 3 & 3 & 2 & 1 & {\color{red}6} \end{array} \phantom{123\ \%\ (x + y)} % For centering the table \end{equation*}\]

By looking at $a+b+c$, large parts of the table become just one equation in our system. For instance, all threes in the table have their remainder taken by the same $a+b+3 = 6$. Only the diagonal doesn’t play nice with this, as this forces $(1,1)$ and $(2,0)$ to have the same results, while these have different XOR values.

But this can just be solved with a simple check.

// Computes `a^b` for a, b any numbers 0, .., 3.
public static int Xor4(int a, int b) {
    if (a == b)
        return 0;
    return 177 % (a + b + 3);
}

This translates to just four commands of arithmetic and control flow in mcfunction! Pretty nice, if I do say so myself. I don’t think you can do better than this, as it’s pretty hard to exploit the symmetry of XOR even further. I would gladly be proven wrong, though.

Beyond two bits at a time

Remember, the whole reason we wanted to do more bits at a time, is because every block of bits has quite some overhead at 9 commands. If we handle one bit at a time, each two bits contribute 22 commands to the overall runtime. If we handle two bits as a single block with Xor4 as above, each two bits contribute only 14 commands to the overall runtime. These are significant savings! Can we take this further?

The basic idea of “try to create a look-up table of the form $g(a,b)\ \%\ f(a,b)$” quickly runs into a bit of a practical dead end. The $f(a,b)$ we’re using should be coprime (enough), and easy to calculate. With around four inputs, this is still manageable. But add enough bits, require eight or even sixteen different numbers, and you either lose coprimeness or ease of calculation9.

And that’s not the only problem with $f(a,b)$. We work modulo the least common multiple of the different $f(a,b)$. If we assume coprimeness for a bit, that means the product of the $f(a,b)$ grows quickly for every bit added. And we have a ceiling. Integers cannot be larger than $2^{31}$, so if you have eight different $f(a,b)$, they’re soft-capped at around $2^4$, each.

But put in enough effort, and the methods listed above do generalize properly, once. The two-bit approach without control flow can be extended to three bits with only a minor headache10.

Instead of generalising, we can add some sauce on top. Let’s take a look at the larger XOR table.

\[\begin{equation*} \begin{array}{r|llll} \text{XOR} & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ \hline 0 & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 1 & 1 & 0 & 3 & 2 & 5 & 4 & 7 & 6 \\ 2 & 2 & 3 & 0 & 1 & 6 & 7 & 4 & 5 \\ 3 & 3 & 2 & 1 & 0 & 7 & 6 & 5 & 4 \\ 4 & 4 & 5 & 6 & 7 & 0 & 1 & 2 & 3 \\ 5 & 5 & 4 & 7 & 6 & 1 & 0 & 3 & 2 \\ 6 & 6 & 7 & 4 & 5 & 2 & 3 & 0 & 1 \\ 7 & 7 & 6 & 5 & 4 & 3 & 2 & 1 & 0 \end{array} \end{equation*}\]

Each quadrant is just the two-bit XOR table, but the off-diagonal quadrants have four added to them. We can double(!) the function size to take this into account, which still outperforms the for-loop overhead, slightly.

// Computes `a^b` for a, b any numbers 0, .., 7.
public static int Xor8(int a, int b) {
    int a4 = a % 4;
    int b4 = b % 4;
    int res = 177 % (a4 + b4 + 3);
    if (a4 == b4)
        res = 0;
    if (a4 < a ^ b4 < b)
        res += 4;
    return res;
}

We still have the exact same Xor4 core in here. But around this, we also take into account inputs $4$ through $7$ now.

Now to repeat the hand-wavy analysis from before: consider the XOR of 6 bits – this is six iterations of working bit-by-bit, three iterations of Xor4, and two iterations of Xor8. Including loop overhead, these take respectively 66, 42, and 38 commands. The third bit gave us a very minor gains, but it’s a win nonetheless. Doing all 32 bits then takes ~200 commands, and not-a-ton-of-files.

Note that at 4 bits11, look-up tables outperform this approach. At 8 bits per iteration, you are significantly below the 100 command-barrier, but your amount of files blows up hard, at half a kilofile. I don’t like that number, at all, and having halved performance in exchange for “not exorbitantly many files” is worth it in my book.

Oh, I mentioned it in the introduction, but… just leave it at implementing AND and XOR like this. As OR is so annoying, just use a|b = ~(~a & ~b). Taking the complement is just one subtraction, so that’s negligible overhead.

Conclusion

In this post, I discussed some obscure tricks I hope no-one will ever need in their lives to do something no sane person should ever have to do in their lives. And all that for some operations no one is ever going to use.

Honestly, I just think it’s neat, using numbers to store data and the remainder operator to access it. It’s very limited in use, and very quickly outperformed by literally anything else, but Minecraft commands are just obnoxious enough in the sweet spot and make this the way to go.

If you actually landed on this post because you have this exact same problem, godspeed. I hope whatever environment enforces these constraints is nicer to you than what I had to deal with. And I also hope that you discover a better way of doing this, even if it just scrapes off one operation.

footnotes and references

  1. I still don’t know whether I write it as “fronttick”, “Fronttick”, or “FrontTick”. I guess the name will never be consistent. 

  2. Two-thirds of the linked code is explanation already… 

  3. Not that I allow non-tail recursion, but that’s besides the point. 

  4. For reference, bool ^ bool is implemented with this lovely mixture of c# and mcfunction:

    // Note: in mcfunction, we interpret 0 = false, 1 = true, and other values = invalid.
    public static bool operator ^(bool a, bool b) {
        // res = a + b
        // if (res == 2) res = 0
        bool res;
        res = a;
        Run($"scoreboard players operation {VarName(res)} _ += {VarName(b)} _");
        Run($"execute if score {VarName(res)} _ matches 2 run scoreboard players set {VarName(res)} _ 0");
        return res;
    }
    

    The basic integer operations also come down to code that looks like this – I have to start somewhere. This also has the funny side-effect that bool and int don’t hold any data as far as c# is concerned, even though their functionalities are completely defined in c#. This is completely opposite to the CLR implementation where ints do have an internal value, but no operators are implemented. (There’s not even any extern operators! Cheaters.) 

  5. It actually maintains the sign of the second argument, but what mad-lad would take the remainder with respect to a negative value? For all intents and purposes, the result will be positive. 

  6. See the previous post for more details than you could ever ask for. 

  7. I’m writing the remainder operator instead of working with modulo classes so that programmers stumbling upon this don’t get completely blindsided. 

  8. Did I say clever trial and error? I’m just brute-forcing it. Programmatically. 

  9. Theoretically, this is of no problem. There’s arbitrarily long sequences $a + b\cdot k$ with $k$ from $1$ up to whatever, where every element is prime. These are uhh… generally large. Their products (our moduli) easily exceed the 32-bit limit.

    (Also, the theorem’s non-constructive. For five bits, you would need a progression 32 long, and if Wikipedia’s not listing one…) 

  10. We still alternate $+a$ and $-b$, but halfway through, the tables for all three operators shift significantly. This forces us to add an extra if (a >= 4) +z statement. You can see this happen in an older version of the code. 

  11. Yes, I did in fact implement a 4-bit version combining all the pains of footnote #10 and all the pains of the extra sauce on top. It gained me a whole 5% speed-up. I didn’t bother after that.