The Smallest Code

Is it a single numerical digit? A line of assembly language? Let's find out.

This question was put to me by Motherboard Editor-in-Chief Derek Mead and I can't stop thinking about it: What is the smallest code?

It's an interesting question in large part because it can have several meanings, all pointing back toward what code is in the very first place. Is small code the actual code that we as developers and engineers write onto a screen? Or should we measure code by what it is translated into and how it actually executes on a real machine? Better, I think, is a combination of sorts: The smallest code is the smallest amount of programming language syntax that we can write to produce the largest machine-level effect.

So, let's look at all three of the abovementioned perspectives, starting with the easiest.

Smallest syntax

In terms of characters, what's the shortest piece of valid programming language syntax I can write?

To be clear, I don't know every programming language, but I have a reasonable handle on most of the major ones and then some. So-called interpreted languages are what make this an easy question. These are languages whose syntax (what a programmer actually writes) is fed to some intermediate piece of software that functions as a translator (or interpreter) between our higher-level code and pre-built units of machine instructions. It's like executing programs within programs.

The alternative to an interpreted language is a compiled language, which is where we write out a bunch of code in a file (like a Java or C++ file) and then send that file to be converted into a whole new arrangement of machine instructions representing that input file and only that input file. The difference is a bit like building a castle out of Legos (interpreted) vs. building a castle out of a big single piece of molded plastic (compiled). Both approaches have their advantages and disadvantages. Generally, if you're going to write a big-ass piece of software that's going to be installed onto a computer, you'll write in a compiled language.

Python and JavaScript are both interpreted languages. We are free to write big-ass, old-school-style programs in either one, but we can also just feed tiny bits of syntax directly into either language's interpreter, which exists as a command line that looks like your operating system's command line (which is also an interpreter, but for a different set of commands). That is, Python is a language but it's also a piece of software that's installed onto our system like any other piece of software.

A single numerical digit. This is probably the smallest piece of valid syntax I can write in any programming language.

I can enter a single digit into either Python or the Node.js interpreter (which is a shell that interprets JavaScript) and either interpreter will simply echo it back to me without errors or warnings. I can also enter nothing into either one and not get an error, but I'm not sure that's in the spirit of the question.

In a compiled language, a lot more is needed, relatively speaking. We at least need the shell of a function providing the operating system with an entry point into the program, so we're talking a half-dozen characters, at least. The basic C++ program skeleton looks something like this:

int main() { return 0; }

It's not much, but still more than:

0

Smallest footprint

I don't think the smallest syntax measure above is a very honest way of looking at things. To execute that "0" will actually take a whole lot of system resources, relatively speaking. According to my MacBook's activity monitor, the Node shell I used to interpret that single digit is occupying around 11 MB of system memory. A single character, however, can be represented by a single byte of memory. So, we're holding on to 11 million bytes to echo a byte of data.

#include <iostream>
int main() { std::cout << 0; return 0; }

The C++ code above modified to output the single digit "0" occupies about 28,000 bytes of memory at its peak (according to the code profiling tool Valgrind). That's a much smaller footprint.

Still, 28,000 bytes is 28,000 bytes. I might be able to improve things by ditching "iostream," which is a standard C++ library for dealing with input/output operations. Including it means that I'm including some extra code from another file, and then more code from other files that the iostream code depends upon. The iostream library itself isn't enormous, but it has to bring in a bunch of other stuff to work. This all gets planted into system memory when the code is actually executed.

In the above program, iostream just gives us cout ("cee-out," but I'll forever say it "kout" in my head). This is just a piece of syntax useful for outputting data to the screen. We can do the same thing in a slightly different way, as in:

#include <stdio.h>
int main() { printf("0"); return 0; }

We swapped libraries for a standard library used in C programming. C is generally more raw and minimal than C++.

The memory usage is about the same, but we wind up making the program itself smaller. The iostream (C++) version is about 9 KB, while the leaner library stdio, which is built for the C language, lets us trim about 1 KB from the program size.

We could also measure footprint by assembly language representation.

Assembly language is the final stop in any program's trip from programmer to programming language to actual machine instructions. We can say that it's the final "human-readable" step in the process. Code is compiled to assembly language by a compiler and is then assembled from that by an assembler. The job of the assembler is basically to make a one-to-one conversion from the human-readable assembly code to binary machine instructions.

In comparing the two snippets of code above, the difference strictly in terms of file size (assembly file to assembly file) is about 50 bytes, though a quick inspection reveals that the actual assembly code produced by the compiler is pretty similar. Assembly language is the great equalizer. It couldn't care less about your favorite programming language or patterns or paradigms and is only interested in hardware efficiency given a particular system architecture.

I could also just write some assembly language code from scratch to do the same number outputting thing above, though what I'd write winds up being about the same thing as what the compiler delivers, give or take a few lines.

A problem exists with the example, however. Outputting a single digit to the command line window is not all that simple of an operation when we start talking about code at these scales. To accomplish it, assembly language actually has to refer back to the operating system—there's no built-in machine instruction for "print."
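To make that dependence on the operating system concrete, here is a minimal C++ sketch, assuming a POSIX system such as macOS or Linux. It skips both iostream and stdio and hands the digit straight to the operating system through the write() call. The function name print_zero is made up for illustration.

```cpp
#include <unistd.h>  // POSIX: write(), STDOUT_FILENO

// Hypothetical helper: asks the operating system to copy one byte, the
// character '0', to standard output. Returns the number of bytes written,
// or -1 if the call fails.
long print_zero() {
    const char digit = '0';
    return static_cast<long>(write(STDOUT_FILENO, &digit, 1));
}
```

Calling print_zero() from main puts a 0 on the terminal. Even in this stripped-down form, the real work happens inside the operating system rather than in any single machine instruction.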

Leaving print behind, we can ask what the smallest code is that does anything. The best I came up with is this:

xor eax, 0x01

This is assembly language instructing the processor to flip a single bit in the register named eax. In practice, it's not as lightweight or direct as it looks, but in theory we're telling the machine more or less at the machine level to change a single bit within not even a memory address, but a processor register. In theory, it's not doing anything to system memory at all then, just the small trays of memory that the processor reserves for itself to do computations.
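For comparison, the same bit flip can be written one level up. This is a minimal C++ sketch with a made-up function name, flip_lowest_bit. Whether a compiler turns it into that exact xor instruction depends on the compiler and its settings, so treat the correspondence as an assumption.

```cpp
#include <cstdint>

// Flip the lowest bit of a 32-bit value: the same effect that
// `xor eax, 0x01` has on whatever value is sitting in eax.
std::uint32_t flip_lowest_bit(std::uint32_t value) {
    return value ^ 0x01u;
}
```

Passing in 6 (binary 110) gives back 7 (binary 111), and flipping the result again returns the original value.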

Small code, huge footprint

For no good reason, I was stumped trying to think of a good example of small code (syntax) having huge effects (machine-space). This very thing is the bane of every software engineer, usually taking the form of a memory leak. As they execute, programs (C and C++ programs, in particular) should be deallocating all of the memory they allocate as they actually run. This happens: A program launches and claims some amount of memory, and, then, as it's running it needs to grab up some more for whatever reason. It's up to the programmer to make sure that memory is deallocated once the program is done with it. (Many languages have what's known as "garbage collection," in which some automated process cleans up after programs, usually at the expense of a larger upfront memory grab.)

Memory leaks are dangerous because of how they accumulate. While a leaky function may only leak a few bytes each time it's called, that function might be called a million times. Suddenly, misallocated memory is piling up call after call and this becomes a problem.

From a memory or processor standpoint, all it takes to nuke a system is a loop. The most basic form of a loop looks like this, which loops forever until the thing referenced by the "while" construct becomes not true.

while(1){}

In our case, this thing (the number 1) will never become not true, so it loops forever. If we were to, say, add some code to this loop that allocates a word of memory (64 or 32 bits, usually) on every iteration, that's gonna be a problem.
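A bounded stand-in for that runaway loop shows how the memory piles up. This is only a sketch, assuming C++14 or later (for std::make_unique); the function name bytes_held_after is invented, and the loop stops after a fixed number of iterations so that it is safe to run.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Allocate one 64-bit word per iteration and hold on to every one of them.
// Returns the bytes of payload held when the loop stops. Real usage is
// higher, since the allocator and the vector both add overhead.
std::size_t bytes_held_after(std::size_t iterations) {
    std::vector<std::unique_ptr<std::uint64_t>> held;
    for (std::size_t i = 0; i < iterations; ++i) {
        held.push_back(std::make_unique<std::uint64_t>(i));
    }
    // `held` is destroyed on return, which frees every word. The while(1)
    // version never reaches a point like this, so nothing is released.
    return held.size() * sizeof(std::uint64_t);
}
```

A thousand iterations hold 8,000 bytes of payload and a million hold 8 million. An unbounded loop keeps going until the system runs out of memory to hand over.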

The other night I nuked my system, a midlevel MacBook Pro circa 2015, in a pretty spectacular fashion. I was solving a HackerRank challenge that groups and maps numbers in a particular way. There's a naive way of doing this where you basically just map every smaller number up until the number we're supposed to be mapping. The idea is that you're counting up one by one, but before you can get to the next highest number, you have to take the current number and then decrement it one by one until you get to zero.

That is, to get from 9,999,999,999 to 10,000,000,000, I actually have to count down by 9,999,999,999 before I'm allowed to increment by one.

It's sort of hard to explain, but by the time we're up to the tens of billions, we're flexing an insane amount of clock cycles just to do these simple subtractions and additions. So, I wound up with a frozen, unusually warm computer that had been sunk in a matter of seconds by a half dozen lines of code.
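The counting pattern is easier to see in code. What follows is not the original challenge solution, just a minimal C++ sketch of the count-down-before-every-increment idea as described above, with a made-up function name (naive_steps) and a step counter added for illustration.

```cpp
#include <cstdint>

// Count upward from 0 to target. Before each increment, count the current
// value down to zero one step at a time. Returns the total number of
// single-step subtractions and additions performed.
std::uint64_t naive_steps(std::uint64_t target) {
    std::uint64_t steps = 0;
    for (std::uint64_t current = 0; current < target; ++current) {
        for (std::uint64_t down = current; down > 0; --down) {
            ++steps;  // one decrement on the way down to zero
        }
        ++steps;      // the single increment up to current + 1
    }
    return steps;
}
```

The total works out to target × (target + 1) / 2, so the work grows with the square of the target. Counting to 4 takes 10 steps; counting to ten billion would take on the order of 5 × 10^19, more than even a 64-bit counter can hold.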

So, in this case, I'm talking about CPU percentage rather than memory, but you get the idea of how a bit of code can sink a CPU like nothing.

Finally, I'm not sure I can really pick a best answer. The small-code-big-impact idea is nice, but where it's likely to be at its most extreme is in the presence of bad coding. Trying to conserve hardware resources, whether memory or processor cycles, is also a nice idea, but computers now don't usually require programmers to be all that stingy with either. As for small syntax, as in the first example, that's not ideal either. Super-terse code often comes at the cost of readability and maintainability. There are a lot of ways to make code small, but, as with most things, what we probably want is moderation.