I made a silly maths problem 1000x* faster
So yeah, I did a thing, yay me. A few weeks ago (December 2024) I participated in the international nerd competition “Advent of Code”. If you’re the kind of person who reads my blogs, you have certainly heard of it; if for some reason you haven’t, you must be lost, click here to leave this hell. Today I will be documenting my journey through day 11 of Advent of Code 2024. My steps toward a valid solution are described out of order to help the flow of this article.
TL;DR
Put simply, being familiar with your dataset and choosing the right data structure is everything. Some friends we will meet along the way:
- The choice of programming language doesn’t always make your code faster or slower
- Do not take a problem description at face value; it may make misleading or outright false statements
- The patterns in the data should drive your choice of data structure and how you operate on it
- Using more threads cannot fix a bad approach
- Brute forcing problems can help you solve small datasets but large ones may require completely reworking your logic.
Advent of Code
Each year in December leading up to Christmas, computer science enthusiasts with too much time sit at their computers solving obscure programming problems. While this is normally nothing special, at this time of year it’s called Advent of Code (AOC). On each of the “days of Christmas” there is a computer science problem, embedded in a whimsical Christmas story. There are also niceties like competitive leaderboards and international communities that sprout up in this atmosphere. Each day, a new problem is unlocked with 2 parts, the 2nd being an extension of the first. Each problem has 2 inputs: one a worked example, the other the real input you must compute a result from. The real thrill of this event is submitting your solution; if you’re lucky, you get marked correct and get a gold star! The more stars you get, the more sparkly the website looks!
This year I completed many of the problems in the Go programming language, as I spoke about in my last post. However, after the event was over and a slow January rolled around, I decided to revisit a few problems that piqued my interest. Among these was day 11, a wildly contrived problem about stones that duplicate under oddly specific circumstances. I found it particularly interesting because of how much difficulty I had with something so conceptually simple. Today I would like to document the adventure I had with this problem.
Day 11
You can read the AOC Day 11 problem on Advent of Code. Go do it now, I’ll wait. The description I’m going to provide is gutted of any charm, and you will miss out on some details.
Briefly, you are given a few numbers, which the problem calls stones. You must transform them according to some rules, repeating the transformation a number of times, and then count how many numbers you end up with. The rules, of which the first matching one applies, are:
- 0 is replaced with 1
- numbers with an even count of digits are split into a left half and a right half (2024 becomes 20 and 24)
- else, the number is multiplied by 2024
This algorithm is repeated a few times, and the final answer is the total count of stones at the end.
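In Go, one transformation might look something like this (a minimal sketch of my first string-based prototype, not my exact code):

```go
package main

import (
	"fmt"
	"strconv"
)

// transform applies one step of the rules to a single stone;
// the first matching rule wins.
func transform(stone int) []int {
	if stone == 0 {
		return []int{1}
	}
	s := strconv.Itoa(stone)
	if len(s)%2 == 0 {
		// Even digit count: split into left and right halves.
		left, _ := strconv.Atoi(s[:len(s)/2])
		right, _ := strconv.Atoi(s[len(s)/2:])
		return []int{left, right}
	}
	return []int{stone * 2024}
}

func main() {
	fmt.Println(transform(0))    // [1]
	fmt.Println(transform(2024)) // [20 24]
	fmt.Println(transform(3))    // [6072]
}
```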
The problem also states some stipulations:
- They must stay in a straight line. Honestly, I’m not exactly sure what this one means.
- Order is preserved; this means you must store the stones in the order they’re created
For part one, this must be done 25 times and for part two, it must be done 75 times. Easy right?
Part One Was a Walk in The Cake
(Do you get it? It’s a combination of a walk in the park and a piece of cake. It’s world-class comedy)
Now I finally get to discuss the fun bit! After deciphering the problem, I wrote a prototype in Go. Some bug squashing ensued, but before too long it worked on the example. Soon after, I got marked correct, which was a real thrill! That’s it; that was the entire depth of part one. It took some time to get all the details correct, but in the end it was easily solvable with brute force on contemporary hardware.
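The whole of part one reduces to a loop like this (a sketch swapping out the main from the earlier snippet; my real version inserted new stones into the middle of the array instead of rebuilding it, which will matter later):

```go
func main() {
	// Brute force for part one: transform every stone, 25 times over.
	stones := []int{125, 17} // the worked example from the problem
	for i := 0; i < 25; i++ {
		next := make([]int, 0, len(stones)*2)
		for _, s := range stones {
			next = append(next, transform(s)...)
		}
		stones = next
	}
	fmt.Println(len(stones)) // the example yields 55312
}
```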
Part one was so easy! Surely part 2 can’t be that bad I thought… How bad can it be?
Part two is a nightmare
It really was. My naive solution from part one did not get anywhere near finishing after several hours. My first fix was rather simple: throw more threads at the problem! It was a perfect opportunity to delve deeper into the Go programming language. Go is famous for “goroutines”, a high-level concurrency framework. Well, let’s just say it didn’t help. As I would soon find out, the crux of the problem isn’t just that it’s slow.
On paper, part two appears only modestly bigger: 75 iterations instead of the 25 we just solved, a mere 50 more. However, this problem does not scale linearly.
We can estimate the actual change in scale. (Here I will round all numbers down to the nearest order of magnitude for convenience; these are only ballpark figures.) Let’s say each iteration multiplies the count of stones by an average of 1.5, so the stone counts go from 1×1.5^25 to 1×1.5^75. This works out to roughly 25,000 versus 16,000,000,000,000, which is about 600 million times larger! So, while the increase in scale looks demure, it’s actually 8 orders of magnitude more! To nail this point home, let’s estimate the RAM needed to store all of these stones. If an integer takes up 8 bytes, that’s 128,000,000,000,000 bytes, or 128 terabytes! Do you know anyone with enough RAM, or even hard-drive space, to store that many values?! Part 2 takes a simple idea at a small scale to its logical extreme, such that it can no longer be solved through trivial means. This is why part 2 is a nightmare!
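If you want to check my arithmetic, the whole estimate fits in a few lines (assuming the 1.5× growth factor from above):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	g := 1.5 // assumed average growth per transformation
	fmt.Printf("after 25: ~%.0f stones\n", math.Pow(g, 25)) // ~25 thousand
	fmt.Printf("after 75: ~%.0f stones\n", math.Pow(g, 75)) // ~16 trillion
	fmt.Printf("scale-up: ~%.0fx\n", math.Pow(g, 50))       // ~600 million times
	fmt.Printf("RAM at 8 bytes each: ~%.0f TB\n", math.Pow(g, 75)*8/1e12) // ~128 TB
}
```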
Throwing More Threads at The Problem
One of Go’s selling points is its concurrency model. It can transparently manage multiple asynchronous threads for you; all you have to do is say `go` before a function call and the runtime will give it time on other threads. The benefits are obvious: it’s one of the easier asynchronous, multi-threaded runtimes to use, and its green threads leave room for transparent performance optimizations by the runtime. With garbage collection and fully defined behavior, the complexity is low and the debugging experience is very forgiving. What piqued my interest at first was that I’ve done very little multi-threaded code and it’s an area I’d like to explore more. This problem seemed like a golden opportunity to try “goroutines” out!
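My attempt looked roughly like the following (illustrative only, this isn’t my actual code): chop the stone array into chunks, hand each chunk to a goroutine, and stitch the results back together after every iteration. It assumes `import "sync"` and the transform function from earlier.

```go
// parallelStep performs one transformation pass across several goroutines.
func parallelStep(stones []int, workers int) []int {
	var wg sync.WaitGroup
	results := make([][]int, workers)
	chunk := (len(stones) + workers - 1) / workers
	for w := 0; w < workers; w++ {
		lo := w * chunk
		hi := lo + chunk
		if hi > len(stones) {
			hi = len(stones)
		}
		if lo >= hi {
			continue
		}
		wg.Add(1)
		go func(w, lo, hi int) {
			defer wg.Done()
			out := make([]int, 0, (hi-lo)*2)
			for _, s := range stones[lo:hi] {
				out = append(out, transform(s)...)
			}
			results[w] = out
		}(w, lo, hi)
	}
	wg.Wait()
	// Stitch the chunks back together, preserving creation order.
	var next []int
	for _, r := range results {
		next = append(next, r...)
	}
	return next
}
```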
With plenty of trial and error, I hacked something together. To my great disappointment, the performance increases were minimal, and they would amount to pure overhead once I made further optimizations. I would even go on to write a blog post that spoke unkindly of goroutines. Now, with clearer hindsight, I can make some educated guesses as to why it didn’t work. The issues come down to:
- This was a memory-bound problem, which means the memory latency would always be the bottleneck.
- The overhead of multi-threading outweighs the cost of the arithmetic itself on a single core. Every calculation requires packaging and transferring results across thread boundaries, which is not free.
- This is the kind of problem that is not partitionable into large self-contained parts.
The First Rewrite
I decided: hey, this is slow, let’s rewrite it in C to be faster. Well, the C rewrite was certainly faster, but it seemed the speed issue was just a canary in the coal mine. The C rewrite was quite fast, yes, but it was also crashing, quickly and with absolute determination. Why was it crashing?
Too much RAM! My first 2 solutions assumed that I could fit the stones in a normal array, and indeed this held for 25 transformations, but the dataset got far too large at 75. Although I did not know it at the time, at 75 transformations the count of items turns out to be over 100 trillion stones. That is too many!
I did in fact make some important optimizations in the first C version. Firstly, this time the numbers were stored as integers. While it is more complicated than working with strings as I did in Go, strings are much slower. For the C version, I took it upon myself to convert everything to plain integers. This introduces a non-trivial amount of complexity: since integers are not stored in base ten, any base-10 operations (counting digits, splitting a number in half) must be done arithmetically. This was my first big optimization step; subtle, but important progress. Later I would go on to rewrite the C version again and iterate on it until I finally passed the problem!
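The digit math looks roughly like this (my actual version was in C, but I’ll stick to Go here for consistency with the earlier sketches):

```go
// numDigits counts base-10 digits without converting to a string.
func numDigits(n int) int {
	d := 1
	for n >= 10 {
		n /= 10
		d++
	}
	return d
}

// split divides a number with an even digit count into its halves,
// e.g. 2024 -> 20, 24.
func split(n, digits int) (int, int) {
	half := 1
	for i := 0; i < digits/2; i++ {
		half *= 10
	}
	return n / half, n % half
}
```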
Another rewrite
My first C version was absolutely rubbish, so I decided to rewrite it again. This time, it was important for me to create a definitive abstraction layer between the data store and the ‘business logic’ of the application. This was not a glamorous step, but it was an important one: it allowed me to experiment freely with the data storage without concerning myself as much with how it is accessed. Initially, the implementation behind that layer still stored all the stones in order in an array.
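Sketched in Go, the boundary looked conceptually like this (my C version used plain functions, and the names here are mine):

```go
// StoneStore hides how stones are kept from the transformation logic,
// so the backing storage can be swapped out freely.
type StoneStore interface {
	Add(stone int)              // record one new stone
	Drain(fn func(stone int))   // visit and remove every stored stone
	Count() int                 // total number of stones held
}
```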
The red herring
Making my code reasonably fast required lots of changes. Earlier I mentioned the stipulation “Order is preserved”; well, in no uncertain terms, this was a lie! It was a red herring, and facing this fact formed the foundation of my solution. The most important optimizations were made as a direct result of storing the stones out of order.
The first implication of this is that you don’t have to insert new items into the middle of an array. Up until this point, all the stones were stored in order of creation in an array of integers, and the majority of the program’s time was being spent moving items around in the array to make room. This was very slow and completely unnecessary. While array manipulation is reasonably fast on contemporary computers, it’s not free, and for large arrays that are too big to fit in the cache, it’s slower than you may expect.
Next, this implies that duplicates do not need to be stored. Since integers of the same value hold the same information, and no information about order needs to be retained, it’s semantically equivalent to store a count of each unique value. You may know where this is going, but if not, let me give you an example. Consider the stones 1, 2, 2, and 3. The information we need can be stored as 1:1, 2:2, and 3:1. The benefit of this may not be entirely obvious at first sight, but it is vital to solving this problem. A particular quirk of this problem is that almost all stones end up being duplicates, so instead of using terabytes to store all the stones, we can store them in a few kilobytes. This really is the crux of my solution; if you would like to solve the problem yourself, now would be the time, because the rest of this article quite plainly spoon-feeds you the answers.
The final big impact of this is that, because duplicates are stored together, we can process them in batches, and since the dataset is full of so many duplicates the difference is huge. This one is the last hurdle to making the solution stupid fast.
All of these points are equally important for this problem: the first means you will not run out of memory, the second means you aren’t wasting most of the time shuffling memory, and the last means you can omit 99% of the reads, writes, and number crunching.
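Putting those together, one whole iteration over the counted representation can be sketched like this (reusing the transform function from earlier; this mirrors my approach rather than reproducing any one of my versions exactly):

```go
// blink performs one full transformation over the counted stones.
// Each unique value is processed once, and the result is weighted by
// how many copies of that stone exist.
func blink(counts map[int]int) map[int]int {
	next := make(map[int]int, len(counts))
	for stone, n := range counts {
		for _, out := range transform(stone) {
			next[out] += n
		}
	}
	return next
}
```

Run this 75 times and sum the counts; in practice the map only ever holds a few thousand unique values, which is why the whole thing fits in kilobytes.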
The saucy implementation details
You should go and read the code; I think it turned out quite elegant.
As you may have guessed, the best data structure to store this in is a dictionary/hashmap.
For my particular implementations, I decided to use a hashbag, which is a small wrapper type over a hashmap. It’s slightly more convenient to work with because it provides helper functions relevant to our usage: with it, I can insert, remove, or modify a huge count of stones at a time.
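A hashbag is tiny to build yourself. A minimal sketch in Go, along the lines of what I built (names and details are mine, not lifted from my repo):

```go
// HashBag counts occurrences of each value on top of the builtin map.
type HashBag struct {
	counts map[int]int
}

func NewHashBag() *HashBag {
	return &HashBag{counts: make(map[int]int)}
}

// InsertMany adds n copies of value in a single operation.
func (b *HashBag) InsertMany(value, n int) {
	b.counts[value] += n
}

// Total returns the number of stones held, counting duplicates.
func (b *HashBag) Total() int {
	t := 0
	for _, n := range b.counts {
		t += n
	}
	return t
}
```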
How Many Damn Rewrites Does a Person Need?
I ended up rewriting my solution in several languages for fun. In the end, I made 5 complete solutions: 2 in C, one in C++, one in Rust, and one in Go.
- The C ones use 2 synchronized arrays for the dictionary keys and values; one version even uses a custom hashing function that turned out to be disappointingly slow.
- The C++ one uses a hashmap.
- The Rust one uses a crate with a prebuilt hashbag.
- The Go one uses my own data structure resembling a hashbag, built on top of the builtin hashmap.
All the solutions ended up quite fast no matter which language I picked, though a special mention goes to C++ and Rust for being slightly faster than the rest. They all run in a few milliseconds. Regrettably, I won’t be publishing any benchmarks here because these solutions are all fast enough to be difficult to measure reliably; you’ll just have to take my word for it or benchmark them yourself.
This project was a lot of fun to iterate on and perfect. I hope you have enjoyed my overview!