Just another programming blog

Factual accuracy is not guaranteed

Nobody Knows What a String Is


The title might seem a bit nonsensical… and awkwardly long too. You think you know what a string is, right? You’ve used them more times than you can count.
Perhaps you’re most familiar with JavaScript, and you can define a string, check what’s in it and even pass it into a function or get one out. Perhaps you’ve used a string in Python and had a similar experience, or maybe in C#.

But let me ask you, do you know your string’s encoding? Maybe you do. Rust strings are UTF-8, which means that every ‘character’ takes up 1-4 bytes. JavaScript and C# strings are UTF-16, where every ‘character’ takes up 2 or 4 bytes. If you’re wondering why I’ve put character in quotes, you should look up the UTF-8 specification. It encodes code points and does not attempt to define a character in any sense.
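To make the 1-4 bytes claim concrete, here is a small Rust sketch. The helper names utf8_width and byte_and_char_counts are mine, made up for illustration; they only wrap standard library calls.

```rust
/// Number of bytes UTF-8 needs for one code point (always 1 to 4).
fn utf8_width(c: char) -> usize {
    c.len_utf8()
}

/// Returns (bytes, code points) for a string; the two often disagree.
fn byte_and_char_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}
```

utf8_width gives 1 for 'a', 2 for 'é', 3 for '€' and 4 for '😀'. More fun: byte_and_char_counts("e\u{301}") gives (3, 2), even though that string shows up on screen as a single ‘é’. So is it one character, two, or three?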
That aside, we can say we know what a string is, right? Well, not really. Do you know where your string is stored? I’m not talking about the memory address or which virtual machine or garbage collector keeps track of it, I’m asking you if your string is heap or stack allocated. I’m asking, is your string backed by a ‘string pool’ data structure? Does this matter? It can matter when strings are pooled and you compare them by reference: two strings with the same contents may or may not be the same object. But let’s sweep that under the rug for now…

How Long is a Piece of String?

Photo by Steve Johnson, pexels.com/@steve/

Do you know what metadata is stored about your string? In Rust, a String stores 3 pieces of data: a length, a capacity and a pointer to the first byte of the string. I am not familiar with the intricacies of how JavaScript and Python store their strings, but I can imagine it would be similar. Well that’s simple at least, we can be confident of what a string would look like in memory, right? Erm, no, not really.
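You can poke at those three pieces yourself. This is a rough sketch; string_parts, hello_with_room and string_handle_words are hypothetical helper names, and the ‘three machine words’ size is what current compilers happen to produce rather than something the language promises.

```rust
use std::mem::size_of;

/// Pulls apart the three fields described above: pointer, length, capacity.
fn string_parts(s: &String) -> (*const u8, usize, usize) {
    (s.as_ptr(), s.len(), s.capacity())
}

/// Builds "hello" inside a heap buffer that has room to grow.
fn hello_with_room() -> String {
    let mut s = String::with_capacity(16);
    s.push_str("hello");
    s
}

/// Size of the String handle itself, measured in machine words.
/// Three on current compilers, though the language does not promise it.
fn string_handle_words() -> usize {
    size_of::<String>() / size_of::<usize>()
}
```

For the string from hello_with_room, the length is 5 while the capacity is at least 16: the length counts the bytes in use, the capacity counts the bytes reserved.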

The C language famously has a very problematic way of representing strings. A string in C is just a pointer to char and that’s it. So how do we know how long it is? We don’t really, but we can calculate it. It has a 0 byte, aka the “null terminator”, at the end. This byte is very easy to lose because the C standard library does not always put one where you’d expect: strncpy, for example, writes no terminator when the source fills the whole destination. Nonetheless, if we need the length we can get it by walking the entire string and counting every single step until we find a null byte. This has more problems, like the fact that you cannot have a zero byte in the string except at the very end.
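Here is roughly what that walk looks like, written in Rust so the out-of-bounds case is visible instead of being undefined behaviour. c_strlen and fits_in_c_string are names I made up for this sketch; CString is the standard library’s real null-terminated string type.

```rust
use std::ffi::CString;

/// Counts bytes up to (not including) the first zero byte, the way C's
/// strlen does. Returns None if the buffer has no terminator; real C code
/// has no such safety net and would keep reading past the end.
fn c_strlen(buf: &[u8]) -> Option<usize> {
    buf.iter().position(|&b| b == 0)
}

/// True if the text can become a C string, i.e. it has no zero byte inside.
fn fits_in_c_string(s: &str) -> bool {
    CString::new(s).is_ok()
}
```

c_strlen(b"ab\0cd\0") is Some(2): everything after the first zero byte is invisible to anything that treats the buffer as a C string. That is also why fits_in_c_string("ab\0cd") is false.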
It’s so problematic in fact, that the string has been reinvented in C countless times. My favorite approach I’ve seen is “Simple dynamic strings” https://github.com/antirez/sds where the ‘start’ of the string isn’t actually the start of the allocation, but a region just after the metadata. This obviously has its own downsides but I think it deserves an honourable mention.
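The metadata-in-front idea can be sketched in a few lines. To be clear, this is not the library’s actual header format. The 4-byte little-endian length and the names make_prefixed, c_view and stored_len are assumptions I picked to keep the example short.

```rust
/// Hypothetical layout: [4-byte little-endian length][string bytes][0].
const HEADER: usize = 4;

/// Builds one buffer holding the header, the text and a terminator.
fn make_prefixed(s: &str) -> Vec<u8> {
    let mut buf = Vec::with_capacity(HEADER + s.len() + 1);
    buf.extend_from_slice(&(s.len() as u32).to_le_bytes());
    buf.extend_from_slice(s.as_bytes());
    buf.push(0); // keep the terminator so C-style functions still work
    buf
}

/// What a C function would be handed: everything after the header.
fn c_view(buf: &[u8]) -> &[u8] {
    &buf[HEADER..]
}

/// Length read straight out of the header, no walking required.
fn stored_len(buf: &[u8]) -> usize {
    let mut raw = [0u8; HEADER];
    raw.copy_from_slice(&buf[..HEADER]);
    u32::from_le_bytes(raw) as usize
}
```

The nice part of the design is that c_view still looks like a plain null-terminated string to old code, while new code can ask stored_len for the length without scanning.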

The Rust community has taken the approach of adding a new string type for each kind of string you may want to interact with. This can smooth over the edges by adding some type safety so you don’t put the wrong thing in the wrong place, but it’s also got its downsides. The most obvious downside is how beginner hostile this is: String, &str and &[u8] are a few different ways of encapsulating a string. The difference between them all takes a lot of knowledge to understand and work with. This is bad because beginners who just want to print “hello world” are slammed with a huge amount of confusing information. It’s clearly a problem to me because the question “What’s the difference between a &str and a String” is very common. I don’t think most people even know the full answer. A String is a struct with length, capacity and a pointer. When used, it allocates bytes on the heap and includes associated functions to modify and read it. A &str is a borrowed view into UTF-8 bytes that live somewhere else: a pointer plus a length. Those bytes might be a string literal baked into the binary or part of a String’s heap buffer. &str is not a struct, it’s a built-in type, and the length travels with the pointer at runtime, which makes it a kind of ‘fat pointer’. Confused yet? That is the correct response.
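A small sketch may help more than the paragraph does. The function names here (str_ref_words, first_word, starts_at_same_address) are invented for the example; everything they call is standard library.

```rust
use std::mem::size_of;

/// A &str is two machine words: a pointer and a length.
fn str_ref_words() -> usize {
    size_of::<&str>() / size_of::<usize>()
}

/// Borrows the first word of any string data without copying it.
/// Works on literals and on a String's heap buffer alike.
fn first_word(s: &str) -> &str {
    s.split_whitespace().next().unwrap_or("")
}

/// True when `part` begins at the same address that `whole` begins at.
fn starts_at_same_address(whole: &str, part: &str) -> bool {
    whole.as_ptr() == part.as_ptr()
}
```

If you build a String holding "hello world" and pass a reference to first_word, the "hello" you get back points into the String’s own heap buffer: nothing was copied, and starts_at_same_address confirms it. Pass a literal instead and the same function borrows from the binary.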

What Even Is a Character?

Well, let’s put that nonsense aside, at least we can assume that a string is something that gets stored as UTF-8, right? No… no, not really. Allow me to introduce you to Windows’ WCHAR, a unit of 2 bytes used to encode UTF-16. So, if you want to talk to the Windows API with UTF-8, you either convert to UTF-16 yourself and call the functions ending in W, or you use one of their wrapper functions ending in A, which convert from the process’s active code page, and that is only UTF-8 if the application opts in.
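The conversion itself is not scary. Here is a minimal sketch of going from a Rust string to null-terminated UTF-16 and back; to_wide and from_wide are hypothetical helper names, and no actual Windows call is made.

```rust
/// Encodes a Rust (UTF-8) string as null-terminated UTF-16, the shape that
/// functions taking a wide string expect.
fn to_wide(s: &str) -> Vec<u16> {
    s.encode_utf16().chain(std::iter::once(0)).collect()
}

/// Decodes UTF-16 units (without the terminator) back into a String.
/// Returns None for invalid input such as a lone surrogate.
fn from_wide(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}
```

to_wide("hi") is three units long: two letters and the terminator. An emoji like '😀' needs two units on its own, so even in UTF-16 a unit is not a ‘character’.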

So let’s recap. We’ve talked about how a string can be:

  • A pointer to null-terminated UTF-8
  • A struct with length, a pointer to UTF-8 and (not all the time, but often) capacity
  • A pointer to null-terminated UTF-16
  • A pointer to UTF-8 paired with a length, borrowed from data stored elsewhere
  • A pointer to a region of memory with metadata preceding it and null-terminated UTF-8 following it

Let’s put aside the fact that there are countless encodings with varied support, that text can read left-to-right or right-to-left, that Haskell strings are different again, and that there are other niche approaches too! Check out my hybrid heap/stack string implementation if you like: https://github.com/largenumberhere/short_string

Strings aren’t Real

Now let’s quickly dip our toes in some assembly. Don’t be scared, it’ll be quick!
What’s a string in assembly? There is no such thing.
In assembly there are no types, just bits and bytes, and their meaning differs depending on what function (or system call) you’re calling. The most common way to store a string in assembly is to reserve some bytes in a row, which you could say is a crude array. The Linux system calls sometimes expect a null-terminated string (open takes its path this way) and other times expect a pointer and a length (write, for example), so you’re best off using a C string and storing its length too. Keep in mind, this string is a fixed size. Any attempt to write past the last position will cause strange behavior or a fault.
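I’ll spare you the actual assembly, but the same ‘reserved bytes plus a stored length’ idea can be sketched in Rust. FixedStr and its 16-byte CAPACITY are made up for illustration. The difference from assembly is that this version refuses to write past the end rather than trampling whatever comes next in memory.

```rust
const CAPACITY: usize = 16;

/// A fixed block of bytes plus a separately stored length.
/// One byte is always kept free for the null terminator.
struct FixedStr {
    buf: [u8; CAPACITY],
    len: usize,
}

impl FixedStr {
    fn new() -> Self {
        FixedStr { buf: [0; CAPACITY], len: 0 }
    }

    /// Appends bytes, refusing instead of writing past the end.
    fn push(&mut self, bytes: &[u8]) -> Result<(), ()> {
        if self.len + bytes.len() > CAPACITY - 1 {
            return Err(()); // assembly would overwrite its neighbours here
        }
        self.buf[self.len..self.len + bytes.len()].copy_from_slice(bytes);
        self.len += bytes.len();
        self.buf[self.len] = 0; // terminator for calls that want a C string
        Ok(())
    }

    /// Pointer-and-length view, for calls shaped like write.
    fn as_bytes(&self) -> &[u8] {
        &self.buf[..self.len]
    }

    /// Null-terminated view, for calls shaped like open.
    fn as_c_bytes(&self) -> &[u8] {
        &self.buf[..=self.len]
    }
}
```

Keeping both the terminator and the length means the same buffer can serve either style of call without any conversion.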

Somebody Knows What a String Is

Okay, I think that’s enough torture.
Now, take a step back and breathe.
Most of these quirks about “Strings” are not important in higher level languages, where strings are somewhat of a solved problem that you rarely have to think about. Just use whatever your standard library hands you, and if you have any issues, learn about its quirks.
In C, just use a malloc’d char array like god intended, or pick a nice library that’s compatible with C’s builtin functions. Trust me, don’t try to fiddle with char arrays on the stack, it’ll bite you.
In Rust, use a String and clone it around as necessary. Once you learn the ins and outs of lifetimes and ownership, you will understand the complexity that &str introduces, when it may be problematic and when it may be a worthwhile tradeoff.
It’s not as bad as it sounds, but it’s good to be aware that strings are complicated and there are many ways to approach them.

