OK, Maybe Correlation Does Sometimes Imply Causation

The statistical model that finds hidden connections between correlated events.
Image: Steve Jurvetson/Flickr

The world would be a much simpler (or at least more obvious) place if correlation did in fact imply causation. Just like that, the fundamentals of the universe would be laid out like a buffet; science could take a rest. Imagine how many wrong ideas would just evaporate.

Correlation doesn't imply causation, of course, but deep down the issue isn't quite settled. The world of statistics isn't satisfied with the apparent symmetry between correlated occurrences. That is, if X occurs and Y occurs, does that mean "if X, then Y" or "if Y, then X"? Without proper controlled experimentation, all we can really say is that X and Y both occurred. ("And" is symmetrical, "if-then" is not; try it for yourself.)
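Taking up that invitation, here is a minimal Python sketch that checks both connectives over every pair of truth values. The helper names `implies` and `is_symmetric` are made up for this illustration, and "if-then" is read as plain material implication, which is an assumption about what the parenthetical has in mind.

```python
from itertools import product


def implies(a: bool, b: bool) -> bool:
    """Material implication: 'if a, then b' is false only when a is true and b is false."""
    return (not a) or b


def is_symmetric(op) -> bool:
    """Return True if op(a, b) equals op(b, a) for every pair of truth values."""
    return all(op(a, b) == op(b, a) for a, b in product([False, True], repeat=2))


# "X and Y" gives the same answer whichever way round the events are written.
and_symmetric = is_symmetric(lambda a, b: a and b)

# "if X, then Y" does not: swapping X and Y changes the answer in some cases.
implies_symmetric = is_symmetric(implies)

print("and is symmetric:", and_symmetric)
print("if-then is symmetric:", implies_symmetric)
```

The case that breaks the symmetry is a true X paired with a false Y: "if X, then Y" fails there, while "if Y, then X" still holds.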

A group of statisticians from the University of Amsterdam recently posted a paper to arXiv (via the physics arXiv blog) offering something of a solution to the problem of event symmetry, a method of analysis that can carve out causes and effects using just observational data. The key is noise.

Image: xkcd/2014.

"While the gold standard for identifying causal relationships is controlled experimentation, in many cases, the required experiments are too expensive, unethical, or technically impossible to perform," the Amsterdam team writes. So: correlation.

The most successful scheme the researchers found for making such in-the-wild causal determinations is known as the additive noise model. Start by taking two phenomena out in the world, X and Y. X might be a broken stop sign pole and Y might be a car crash. Did the missing stop sign cause the crash or did the crash knock over the stop sign?

This could probably be determined easily enough just by giving the scene a quick look, but let's pretend the accident aftermath is just a pile of smoldering rubble and no one knows what happened. So, we take measurements, build datasets.

Imagine our X (stop sign) and Y (car crash) datasets as piles of measurements and other observations of the crash scene. Each dataset is going to have some amount of statistical noise because that's life. This unexplained variation would be found in really any data sample. Except it's not quite noise: it's noise with history.

The key is in examining the different patterns of noise found within the X dataset and the Y dataset. If there is indeed a causal relationship, it will be manifested in these asymmetrical patterns. In an additive noise model, the effect is treated as some function of the cause plus a noise term that is statistically independent of the cause; run the same fit with the roles swapped and the leftover noise generally stays tangled up with the input. It's like the two events each have a set of hidden metadata that only makes sense within the context of the other event's own dataset.
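To make the noise asymmetry concrete, here is a toy sketch in Python with NumPy. It is not the Amsterdam team's pipeline: the polynomial regression, the simple kernel dependence score (an HSIC-style estimate), the function names, and the synthetic data are all assumptions chosen for brevity. The idea is to fit each variable as a function of the other, then ask in which direction the leftover noise looks less dependent on the input.

```python
import numpy as np


def dependence_score(a, b):
    """Biased HSIC-style estimate with Gaussian kernels; larger means more dependent."""
    n = len(a)

    def gram(v):
        d2 = (v[:, None] - v[None, :]) ** 2
        bandwidth = np.median(d2[d2 > 0])  # median heuristic for the kernel width
        return np.exp(-d2 / bandwidth)

    centering = np.eye(n) - np.ones((n, n)) / n
    k_centered = centering @ gram(a) @ centering
    return float(np.sum(k_centered * gram(b)) / n**2)


def residual_dependence(cause, effect, degree=5):
    """Fit effect = f(cause) + noise with a polynomial, then score noise vs. cause."""
    coeffs = np.polyfit(cause, effect, degree)
    residuals = effect - np.polyval(coeffs, cause)
    return dependence_score(cause, residuals)


def anm_direction(x, y):
    """Pick the direction whose residuals look more independent of the input."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    forward = residual_dependence(x, y)   # hypothesis: X causes Y
    backward = residual_dependence(y, x)  # hypothesis: Y causes X
    return "X -> Y" if forward < backward else "Y -> X"


# Synthetic pair where X really does cause Y: a nonlinear function plus independent noise.
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, 500)
y = x**3 + rng.normal(0.0, 1.0, 500)

direction = anm_direction(x, y)
print(direction)
```

On a pair built this way the sketch should favor the true direction, because the backward fit leaves noise whose spread changes with the input. With a purely linear relationship and Gaussian noise the two directions look alike, so the approach has nothing to grab onto.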

Additive noise models have been around for a while, but what the Amsterdam team offers is a solid demonstration using 88 datasets of IRL cause and effect pairings. Examples of these pairs include altitude vs. precipitation, age vs. body-mass index, and drinking water availability vs. infant mortality rates. These causal relationships are already well established, offering a way of seeing how right or wrong the model is.

The model was correct up to 80 percent of the time in predicting the causal relationship, which is pretty impressive, even if the benchmark set was simplified. For one thing, the Amsterdam experiments don't allow for a crucial alternative possibility: two correlated events both being caused by the same third thing. Perhaps some Godzilla-esque monster crashed the car and toppled the stop sign with one furious foot-stomp.

Godzilla or no, that alternative is pretty real and it's part of where the whole correlation-not-causation business comes from. Things that might appear to have a causal relationship very often have other things in common, perhaps things that are hidden or at least not accounted for. Like Godzillas.
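The hidden-common-cause scenario is easy to simulate. The sketch below, in plain Python, invents a hidden variable `z` (the Godzilla) that drives both `x` and `y`, which never influence each other. The variable names, sample size, and noise levels are arbitrary assumptions for illustration only.

```python
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


random.seed(0)
n = 5000

# Hidden common cause: it affects both x and y, but x and y never affect each other.
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 0.5) for zi in z]
y = [zi + random.gauss(0, 0.5) for zi in z]

raw = pearson(x, y)

# Because the simulation adds z with coefficient 1, subtracting it removes its
# contribution exactly; what is left is each variable's own independent noise.
x_rest = [xi - zi for xi, zi in zip(x, z)]
y_rest = [yi - zi for yi, zi in zip(y, z)]
adjusted = pearson(x_rest, y_rest)

print("correlation of x and y:", round(raw, 2))
print("after accounting for z:", round(adjusted, 2))
```

The two variables come out strongly correlated even though neither causes the other, and the correlation all but vanishes once the hidden variable is accounted for. In real data, of course, the hidden variable is usually the thing nobody measured.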