MIT Researchers Offer a New Defense for an Old Target: Integer Overflow

DIODE goes where no debugger has gone before.

Computer scientists at MIT's CSAIL program have developed a novel defense against an old yet still omnipresent programming fuckup: allowing software users to overrun the (physical) limits of an allocated chunk of memory. This is a vulnerability beloved by hackers and one that often leads to programmer self-sabotage via innocent yet catastrophic errors. Potential overruns are highly adept at code camouflage and notoriously difficult to debug.

In the age of anything-goes poser-proof JavaScript it's probably a bit too easy to forget about the actual guts of computing and its rules and persistent breakability, which are still governed by wires and electricity and binary bits. That's the risk of programming at super-high levels (with large amounts of coder-friendly abstraction and interpretation, like JavaScript)—buying into the illusion of limitless machines. And one of the more profound limits of these machines is in data types. Some data types can stash more information than others, and it's here that we find our old-school bug: data overflow. It only takes one to sink an entire program.

From x86 assembly code to JavaScript to Processing, programming is all about variables. Whether it's a string of alphanumeric characters, a long decimal sequence, or a user-defined object (which could be anything), values are associated with names, which are associated with types. The most basic type is an integer, usually referred to as an "int" in programming languages, and there are also "doubles" and "floats," which are usually for decimal (not whole number) values, "chars" for characters (letters and symbols), arrays (lists, basically), and a few others, depending on the level of abstraction given by a particular language.

These types represent sizes (in bytes, which are units of eight binary digits). An int typically occupies four bytes of information, or 32 bits, on today's common platforms. (A 32-bit computer is one that uses that as a standard size for manipulating and moving around data.) A "long long" data type can store at least 8 bytes, which widens the range of possible stored values immensely. A 32-bit int can store a range of values between –2,147,483,648 and 2,147,483,647, while a 64-bit long long can store numbers between –9,223,372,036,854,775,808 and 9,223,372,036,854,775,807.
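
For anyone who wants to check those numbers on their own machine, here is a minimal C sketch. The function names are made up for illustration. It assumes a compiler that provides the fixed-width types in <stdint.h>; the plain int and long long figures are whatever the platform reports, since the C standard only promises minimum ranges.

```c
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

/* Prints the size and range of int and long long on this platform.
   The values come from <limits.h>, so they reflect the local compiler
   rather than fixed constants. */
void print_integer_ranges(void) {
    printf("int:       %zu bytes, %d to %d\n",
           sizeof(int), INT_MIN, INT_MAX);
    printf("long long: %zu bytes, %lld to %lld\n",
           sizeof(long long), LLONG_MIN, LLONG_MAX);
}

/* The fixed-width types have the same range on every platform that
   provides them. These helpers (hypothetical names) just expose the limits. */
int32_t int32_min(void) { return INT32_MIN; }
int32_t int32_max(void) { return INT32_MAX; }
int64_t int64_max(void) { return INT64_MAX; }
```

Calling print_integer_ranges() from a main function prints the local sizes and limits; the int32_t and int64_t limits match the figures quoted above.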

The size of an int, in particular, is a direct or close to direct correspondence to the actual machine and its capabilities for data handling and processing. If a machine is presented with a value that's too large to represent as an int (too large to fit in the memory set aside for an int), it will truncate it, lopping off the high-order digits so that the value rolls over like the odometer in a car. This is how a program can wind up with bad data, which can break things, causing a program to terminate or worse.
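
The odometer effect is easy to reproduce. The sketch below is illustrative only, with hypothetical function names. It uses unsigned 32-bit values because C defines their arithmetic to wrap around, whereas overflowing a signed int is undefined behavior in C, so the signed helper checks the bounds before adding.

```c
#include <stdbool.h>
#include <stdint.h>

/* Unsigned arithmetic wraps modulo 2^32: the carry out of the top bit
   is simply discarded, like an odometer rolling over. */
uint32_t add_wrapping(uint32_t a, uint32_t b) {
    return (uint32_t)(a + b);
}

/* Converting to a narrower unsigned type keeps only the low-order bits. */
uint8_t truncate_to_byte(uint32_t value) {
    return (uint8_t)value;
}

/* Signed overflow is undefined behavior in C, so test the bounds first.
   Returns false (and leaves *out untouched) if a + b would not fit. */
bool checked_add_i32(int32_t a, int32_t b, int32_t *out) {
    if ((b > 0 && a > INT32_MAX - b) || (b < 0 && a < INT32_MIN - b)) {
        return false;
    }
    *out = a + b;
    return true;
}
```

Adding 1 to the largest unsigned 32-bit value gives 0, and squeezing 300 into a single byte leaves 44 (300 minus 256).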

In research being presented this month at the Association for Computing Machinery's International Conference on Architectural Support for Programming Languages and Operating Systems, the MIT researchers will present an overflow debugging system, called DIODE (for Directed Integer Overflow Detection), that has so far successfully identified 11 whole new bugs within five common open-source programs, on top of three that had already been known.

"Integer overflow errors are an insidious source of software failures and security vulnerabilities," the researchers note in an accompanying paper. "Because programs with latent overflow errors often process typical inputs correctly, such errors can easily escape detection during testing only to appear later in production. Overflow errors that occur at memory allocation sites can be especially problematic as they comprise a prime target for code injection attacks."

Code injection is a bit what it sounds like. A hacker will identify a vulnerability in some piece of code, and use it to introduce their own code, which might be a worm or any other sort of exploit.

"A typical scenario is that a malicious input exploits the overflow to cause the program to allocate a memory block that is too small to hold the data that the program will write into the allocated block," the MIT group explains. "The resulting out-of-bounds writes can easily enable code injection attacks."
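
To make that scenario concrete, here is a hedged sketch of how it can play out. Everything in it is an assumption for illustration: the image-style width and height fields, the 4 bytes per pixel, and the function names are not taken from the paper or from any of the programs DIODE examined. The size is deliberately kept in a 32-bit value to model a program that stores it that way.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define BYTES_PER_PIXEL 4u  /* hypothetical pixel format */

/* Buggy: the product is computed in 32 bits and silently wraps, so a huge
   image can yield a tiny allocation size. */
uint32_t pixel_buffer_size_unchecked(uint32_t width, uint32_t height) {
    return (uint32_t)(width * height * BYTES_PER_PIXEL);
}

/* Safer: reject any input whose true size does not fit in 32 bits.
   Each multiplication is guarded by a division-based bounds test. */
bool pixel_buffer_size_checked(uint32_t width, uint32_t height, uint32_t *out) {
    if (width != 0 && height > UINT32_MAX / width) {
        return false;                       /* width * height would wrap */
    }
    uint32_t pixels = width * height;
    if (pixels > UINT32_MAX / BYTES_PER_PIXEL) {
        return false;                       /* pixels * 4 would wrap */
    }
    *out = pixels * BYTES_PER_PIXEL;
    return true;
}

/* Allocates the buffer only when the size computation is trustworthy.
   Returns NULL on overflow or allocation failure; the caller frees it. */
void *alloc_pixels(uint32_t width, uint32_t height) {
    uint32_t size;
    if (!pixel_buffer_size_checked(width, height, &size)) {
        return NULL;
    }
    return malloc(size);
}
```

With a width of 65,536 and a height of 16,385, the unchecked version returns 262,144 bytes even though the image needs more than 4 billion, which is exactly the too-small block the researchers describe. The checked version refuses the input instead.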

This sort of overflow debugging can be viewed as a search operation. A given program is taken as a graph (as in graph theory) that represents the possible flows of a sample input. Historically, this has meant probing or attempting to probe every possible path that an input might take, which adds up to be a whole lot of effort and time, with imperfect results.

"What this means is that you can find a lot of errors in the early input-processing code," offers Martin Rinard, a computer science professor and co-author on the new paper, in a separate summary. "But you haven't gotten past that part of the code before the whole thing poops out. And then there are all these errors deep in the program, and how do you find them?"

The DIODE solution also begins with a sample input, but, in this case, the input has a memory of sorts. As it probes deeper and deeper into the program, extra digits are added to its value. These digits are "symbolic expressions" that tell the human debugger what happened to the integer and where. As Rinard explains, the resulting values can reveal pages and pages of information about what happened as the input bounces around the system. While the sample input itself won't trigger an overflow, the resulting symbolic expression finds out what will.

The whole process is recursive. The problem with earlier debugging schemes is that once the sample input runs into an input check—put there by an attentive programmer—the process is over and so the debugging starts again at the beginning. With DIODE, once the sample input fails one of these checks, a new value is immediately computed using the symbolic expression and the checkpoint is tested again. This continues until either the revised sample input makes it through and on to the next potential vulnerability or it's determined that the programmer did a really good job at data validation and getting through the check is impossible.
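
The flavor of that either/or outcome can be shown with a toy example. This is not DIODE's algorithm: the sanity check, the target expression (width times height times 4, held in 32 bits), and all the function names are invented for illustration, and simple arithmetic stands in for whatever solving machinery the real system uses. The point is only that, given the expression and the check, one can either compute an input that passes the check and still overflows, or show that no such input exists.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sanity check a program might apply to two header fields. */
bool passes_sanity_check(uint32_t width, uint32_t height, uint32_t max_dim) {
    return width >= 1 && width <= max_dim &&
           height >= 1 && height <= max_dim;
}

/* True when width * height * 4 does not fit in 32 bits. The product of
   two 32-bit values is evaluated in 64 bits, where it cannot wrap. */
bool target_overflows(uint32_t width, uint32_t height) {
    uint64_t pixels = (uint64_t)width * height;
    return pixels > UINT32_MAX / 4u;
}

/* Looks for field values that pass the check yet overflow the target.
   Returns true and writes them to *width and *height if they exist;
   returns false if the check rules out every overflowing input. */
bool find_overflow_input(uint32_t max_dim, uint32_t *width, uint32_t *height) {
    if (max_dim == 0) {
        return false;               /* the check accepts nothing at all */
    }
    /* The largest permitted width gives the best chance of overflow. */
    uint64_t w = max_dim;
    /* Overflow needs width * height >= 2^30; take the smallest such height. */
    uint64_t needed = (uint64_t)1 << 30;
    uint64_t h = (needed + w - 1) / w;      /* ceiling division */
    if (h > max_dim) {
        return false;               /* even max_dim * max_dim stays below 2^30 */
    }
    *width = (uint32_t)w;
    *height = (uint32_t)h;
    return true;
}
```

With a check that caps both fields at 65,535, the function produces an input that sails through the check and overflows anyway. With a cap of 4,096, it reports that no overflowing input exists, the toy equivalent of a programmer who did a really good job at data validation.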

Finally, Rinard and co. conclude, "Our results show that, for our benchmark set of applications, and for every target memory allocation site exercised by our seed inputs (which the applications process correctly with no overflows), either 1) DIODE is able to generate an input that triggers an overflow at that site or 2) there is no input that would trigger an overflow for the observed target expression at that site."