A Love Letter to the Original Algorithm

The Euclidean algorithm can still crush numbers better than most anything.

|
Nov 26 2014, 3:00pm

​Image: ​John-Graham Cumming's factorization diagram generator.

In 2012, a pair of research teams successfully hacked the unhackable: the RSA public-key encryption scheme. The teams, a pool of cryptographers and mathematicians drawn from France and the United States, analyzed a set of some 7.1 million public keys, hunting for the unlikely shared prime factors that would allow the encryption to be unraveled. They "stumbled" across some 27,000 keys with this vulnerability, meaning that roughly two out of 1,000 keys was (is) unsafe. Was their tool some overdriven supercomputer or quantum computer? No, just an ancient Greek mathematician.

It's usually assumed that the oldest algorithm still used today is the Euclidean algorithm. First discovered, or at least written down, by the Greek mathematician Euclid circa 300 BC, the algorithm describes a method of super-efficiently finding the greatest common divisor of two numbers. That is, what is the largest number that divides two other numbers evenly? It's not flashy, but the Euclidean algorithm is the bedrock of number theory—and the theorems found within number theory are the bedrock of high-speed calculation and computation. First arithmetic, then the information age.

Computers are born stupid, and they can only die stupid

One of the more abstract pleasures of programming is that no fact or axiom is ever taken for granted, at least if you're doing it right. Computers are born stupid, sublimely ignorant of their conditions and duties. And mostly, they remain stupid, allowed only provisional, fragile memories to which they have no free access beyond the barest scraps stored within a processor's registers (the tiniest slots holding data that's currently being operated upon). There is no instinct, and learning remains only the implementation of some learning algorithm. Computers are born stupid, and they can only die stupid.

What this means is that everything must be taught, again and again: axioms and grade-school intuitions. As such, programming (or computer science) involves a whole lot of decomposition. In creating instructions that a computer might be able to comprehend, the question is forever, "What is this really?" Even with the correct hardware, there is no computer in existence that "just knows."

For computers (and beyond), knowledge is decomposed into algorithms. An algorithm isn't a computer program, however; an algorithm is just a series of steps one might take to reach a specific goal in a specific time-frame. You and I have no doubt followed several algorithms just today: the drive-to-the-store algorithm, the making-coffee algorithm, the walk-the-dog algorithm. A formal algorithm is a bit different in its approach to solving general problems. It's portable: person to person, logic unit to logic unit, microprocessor to microprocessor, programming language to programming language.

An algorithm:

Division is intuitive. Give me a handful of marbles and tell me to divide them into three different groups and I will understand that instruction without pause. But how do you explain division to a computer processor? If I were writing a computer program, I'd simply use the symbol "/." That symbol will be converted to the "DIV" operation in assembly language, which is the machine-readable code that C++, Java, Python, and really any other language gets converted into by compilers. (A compiler's job it is to optimize your garbage code for actual computation.)

An even deeper algorithm will then go to work on that DIV command, probably using a "shift-and-add" algorithm, where rows of binary digits are scooted around and added together in such a way to get multiplication. As multiplication is repeated addition, division can also be seen as repeated subtraction. This is the crucial component of the Euclidean algorithm, in which division is decomposed into subtraction, e.g. Euclidean division.

First, Euclid's division theorem, a subroutine of the larger Euclidean algorithm, asks that we consider numbers in a certain form, where a divided by b looks like:

a = bq + r

So: a number decomposed (divided by another number) is the combination of two numbers multiplied (a certain number added to itself a certain number of times) and a third number, which is the remainder when we subtract the product of those two numbers from the number being decomposed. If I have 10 marbles and split them into three groups, I'd have three groups of 3 (3 * 3) and 1 marble left over: 10 = 3 * 3 +1. There are other possibilities, of course: 2 groups of 4 with a remainder of 2: 10 = 2 * 10 + 2. How about 2 groups of 5? 10 = 2 * 5 + 0. 0!

To make an actual division operation/algorithm out of this is easy. Start with two numbers, the smaller to be divided into the larger. Subtract the smaller from the larger once, and record what's left over and write down a "1" because we've only done 1 subtraction. Take the remainder, what was left over, and subtract that same divisor from the beginning. Record what's leftover and mark down a "2" because we've done 2 subtractions. Do it again and again and again, stopping only when the remainder/leftover from the subtraction operation is smaller than the original divisor. The answer is however many subtraction operations it took to get to this point, with whatever remainder is left. That's division, or division with whole numbers.

This is intuitive not because it's the form of our thoughts, but because it is an elegant form.

There are different designs for the division components of arithmetic logic units, but they do the same general thing: subtract one number from another, store the remainder. Subtract from the remainder, store the new remainder. Eventually, the unit returns a quotient (how many times one number goes into another number) and a remainder (what's left over when you can't subtract anymore and still get a positive number).

Given this process, it's easier to see that there are in fact two divisions. One division is the usual division of how many times does a number go into another number, and the other is the modulus (mod) operation, which returns the remainder. The mod operation is hugely important in computer science for all kinds of stuff, including but not limited to encryption, error correction, and data structures. It's also how we understand time: Twice a day we apply a mod 12 operation on the current time, magically transporting 11:59 back to 0. Modulus is the result of a = bq + r. It's r.

The lack of concurrency in factorization—with every step depending on the step before it—makes the problem hugely work-intensive

So far we've just talked about the division part of the Euclidean algorithm, which is a sub-algorithm or routine. Remember how excited I was a couple paragraphs above about finding a 0? That's because a 0 remainder is the jackpot.

If I divide 5 into 10, the result is 2, with no remainder. This 0 remainder means that 5 and 2 factors of 10. 3 is not a factor of 10, nor is 4. 2, 5, 1, and 10 itself are factors, but 1 or 10 don't really count. When a number is a factor of a another number, we can say that it divides the other number. 5 divides 10, 2 divides 10, but 3, 4, 6, 7, 8, and 9 do not divide 10. They are not factors of 10.

A number that only has 1 and itself as factors is prime and this is the property that allows for most all of digital security, e.g. RSA encryption, which is based on the computational difficulty of factoring huge numbers that only have 2 (huge) prime factors. It's surprisingly easy to bog down even a current multi-core system with a basic prime factor-hunting code. (Multi-coring shouldn't make a difference in factoring actually, because the process isn't easily broken up into parallel or concurrent tasks, though there are some pretty cool-looking algorithms out there for just that.)

The Euclidean algorithm in its entirety is designed to hunt for those 0s. It utilizes the above division scheme to hunt for factors shared by two numbers, particularly the greatest common divisor, or the largest number that divides both evenly. It's sort of like hunting for a common ancestor. Both 10 and 15 have a great-aunt 5, but that's about it, save for 1, which would be like the original human ancestor way back in the Pliocene.

In RSA encryption, a public key hides the private key in what's basically a factorization problem: Find the two prime factors that multiply together to result in the public key, and you can access the private. The whole thing depends on computers being weak enough such that this factorization would just take too long to matter. The lack of concurrency in factorization—with every step depending on the step before it—makes the problem hugely work-intensive.

Your computer, and even computers astronomically more powerful, can only handle factoring about 100 (base-10) digits. I'm not actually sure what would happen because nothing I have on my computer would take an input that big and, when asked to factor a random 100 digit number, Wolfram Alpha told me to go get fucked. The Euclidean algorithm could hack it, however. It decomposes the problem of finding common factors into manageable bites, simply based on the observation that you don't have to keep applying division to the original two (huge) numbers. Instead, you can do the initial division and then take the remainder of that division and then divide it into the original divisor. You can do this again and again, so the problem is mostly just doing operations on smaller and smaller remainders, until finally there's a 0 remainder and the factorization is done.

The greatest common factor is just the last remainder before the algorithm got to 0. The whole thing is pretty smooth.

To bust open an RSA encryption scheme, a hacker can take advantage of the fact that applying the Euclidean algorithm to a pair of public keys will return the keys' shared prime factors. The problem is usually or ideally impossible to crack given current computers (but not quantum computers!), but that depends on an adequate random number generation scheme. And random numbers are pretty hard to come by in the real-world. Real random keys should never share a prime number, but a less-than-ideal number generator scheme can get this wrong, exposing all of your secrete junk to the world.

Thanks, Euclid.