By 1980, ARPANET, the US Defense Department’s pre-internet internet, had spread across the United States and on into Europe. Over the course of about a decade, it had grown from a four-node network to a system supporting thousands of users, floods of email messages, and even something of a proto-Reddit collection of special-interest message groups.
America’s not-quite-internet was growing up quickly and, thus, it was high time for something to go really, really wrong. And so it did, on October 27, 1980, with the first proper network crash in the history of the proto-internet.
The failure didn’t leave any warships adrift, but the event, which left ARPANET disconnected for nearly four hours, was a milestone nonetheless. It was the result of a pair of subtle screwups having to do with the network’s interface message processors (IMPs), which were basically what we now call routers: intermediate switching devices that process network traffic.
ARPANET’s IMPs were in charge of taking communications from local computers and networks and translating them into the ARPANET standard. Different sites were based on different platforms involving different protocols and standards, and the IMPs generalized these differences so everything moving around the network was generic to ARPANET.
UCLA computer science professor Leonard Kleinrock and the first IMP. Image: Kleinrock
Imagine bouncing messages around between UNIX, Mac, and Windows-based environments, all with their own ways of dealing with information. IMPs would take all of that and just make it ARPANET-based and platform-agnostic.
One of the network’s IMPs, IMP29, was dropping bits (slivers of binary information) as the result of a hardware failure. IMP29’s job was to act as the communications pathway for another node, IMP50, and because of the bit dropping, IMP50 received a wonky status message with a bad timestamp, which it repeated across the network.
Every node was required to send out status messages at one-minute intervals, and the screwy timestamp on IMP50’s message meant that every other node took in this message, prioritized it ahead of everything else, and then repeated the corrupted message. And then they kept repeating it over and over and over.
The network’s “garbage collection” software, which was responsible for deleting those status messages as they accumulated, was forced to deal with messages having multiple timestamps as a result. It didn’t know how to do that, and so every node ended up storing every status message. The nodes’ memories were quickly saturated.
The result was “a naturally propagating, globally contaminating effect,” in the words of Peter Neumann, chief scientist at the SRI International Computer Science Laboratory.
In a sense, the effect was similar to a distributed denial of service attack, or DDoS, which occurs when a network is flooded with traffic from various sources to the point that it’s unable to function normally. To be clear, the crash wasn't due to an actual attack, but a cascading series of technical failures. Still, it's an illustrative comparison.
ARPANET as of 1977. Image: Wikipedia
Basically, ARPANET DDoS’d itself as more and more messages accumulated in a sort of feedback cycle. Every status message was stamped with the highest priority code, so any other sort of message sent between nodes was ignored in favor of the junk status updates.
This also meant that any message sent to the nodes was ignored as well and thus it was impossible to deal with the problem remotely. Every single node had to be shut down and restarted manually, and only then could the full network go back online.
If just a few IMPs were restarted, rather than every single one, the result was that the restarted IMP would receive a copy of the corrupted message from one of the nodes that wasn’t restarted, and it would once again go down. There was some trial and error in fixing the problem.
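Why a partial restart failed can be sketched with a toy model. This is an illustration only, not the IMP software: the four-node line network and the function name `restart_and_settle` are hypothetical, and the model simply assumes that a restarted node comes back with clean memory and that any neighbor still holding the corrupted message will send it again.

```python
def restart_and_settle(neighbors, infected, restarted):
    """Clear the restarted nodes, then let the bad message spread again
    from any node that still holds it (toy model, assumptions as stated)."""
    infected = set(infected) - set(restarted)  # restarted nodes start clean
    frontier = list(infected)
    while frontier:
        node = frontier.pop()
        for peer in neighbors[node]:
            if peer not in infected:
                infected.add(peer)       # a clean neighbor is recontaminated
                frontier.append(peer)
    return infected

# Hypothetical four-node network laid out in a line: A - B - C - D
line = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
all_nodes = set(line)

partial = restart_and_settle(line, all_nodes, {"A", "B"})  # restart only two nodes
full = restart_and_settle(line, all_nodes, all_nodes)      # restart every node
```

In this model, restarting only A and B leaves C and D free to reinfect them, so `partial` ends up containing all four nodes again, while `full` comes back empty.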
The IMPs supporting the 1980 ARPANET actually had an onboard system for finding bit-dropping errors, but it had been deactivated on all of them, according to a subsequent report published in SIGSOFT Software Engineering Notes.
Bit dropping was usually a “spurious” occurrence, and each detection meant having to restart an individual IMP manually, so leaving the check on didn’t seem worth the trouble, at least until that bit dropping fouled a timestamp such that the entire network collapsed. The next generation of IMP took care of the problem by including a new “loader/dumper” fault state that could be controlled off-site.
The easiest fix suggested in the SIGSOFT report is almost comically simple. When the faulty garbage collection utility checked message timestamps, it calculated “later” using a greater-than-or-equals sign, rather than a plain greater-than sign, thus allowing a whole bunch of messages to effectively share a timestamp and flood the network. Hindsight, eh?
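The difference between the two comparisons can be shown with a small sketch. It is a toy model of the behavior as described here, not the actual IMP code: the three-node network, the function name `count_forwards`, and the cap of 1,000 events are all assumptions made for illustration. Each node accepts a status message only if it looks “later” than the newest one it already holds, and then passes it on to its neighbors.

```python
from collections import deque
import operator

def count_forwards(is_later, neighbors, start, timestamp, max_events=1000):
    """Flood one status message through the network and count how many
    times a node accepts it and sends it on (capped at max_events)."""
    latest = {node: -1 for node in neighbors}  # newest timestamp each node holds
    queue = deque([(start, timestamp)])
    forwards = 0
    while queue and forwards < max_events:
        node, ts = queue.popleft()
        if is_later(ts, latest[node]):   # the comparison under test
            latest[node] = ts
            forwards += 1
            for peer in neighbors[node]:
                queue.append((peer, ts))
    return forwards

# Hypothetical three-node network in which every node talks to the other two
ring = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}

strict = count_forwards(operator.gt, ring, "A", 7)  # plain greater-than
loose = count_forwards(operator.ge, ring, "A", 7)   # greater-than-or-equals
```

With the strict comparison, each node accepts the message once and the flood dies out, so `strict` is 3. With greater-than-or-equals, a copy carrying the same timestamp always counts as “later,” so the nodes keep accepting and resending it until the sketch hits its cap.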