A new study adds more evidence to the claim that it's nearly impossible to anonymize big data.
Image: Thomas Kohler/Flickr
There are two things we know about publicly releasing large sets of data:
1. It's really hard (maybe impossible) to completely anonymize big data.
2. You're really easy to identify in most anonymized data sets, using just a small amount of information.
A team of researchers provided more evidence for both points in a paper published today in Science. They showed that if they have four pieces of time and place data about a person within a set of anonymized credit card metadata, they can uniquely identify that person 90 percent of the time.
Yves-Alexandre de Montjoye, a PhD candidate at MIT and lead author on the study, and his team looked at an anonymized set of three months' worth of credit card records for 1.1 million people. The data did not include any names, account numbers, or other obvious identifiers. But with just four spatiotemporal points (time and place information, like a purchase made at a certain address at a certain time) for a given individual, they could single out that person's records 90 percent of the time. If one of those points also included the price of a transaction, the risk of re-identification increased by 22 percent.
The researchers illustrated their findings with an example. Imagine you're looking for a guy you know named Scott in a simply anonymized credit card data set, one that lists where, when, and how much each unnamed cardholder spent.
Let's say you know two things about Scott already, maybe from his Twitter or Instagram or Swarm feed: that he went to a bakery on September 23 and that he went to a restaurant on September 24. Bam, there's only one person in this data set who has both of those spatiotemporal points.
"Scott is reidentified, and we now know all of his other transactions, such as the fact that he went shopping for shoes and groceries on September 23, and how much he spent," the paper reads.
Of course, in a bigger data set there will be other people who also went to the same places as Scott, which is why more information is required. But, again, 90 percent of the time, the researchers only needed four pieces of information to uniquely identify a single person. Chances are, if you're a social media addict like me, you easily have four or more time and place identifiers in your public data over the last three months, making you highly susceptible to re-identification in a simply anonymized big data dump like this.
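To make the Scott example concrete, here is a minimal Python sketch of the lookup it describes. Everything in it is invented for illustration: the records, the scrambled IDs, the prices, and the helper names `find_candidates` and `transactions_of` are my assumptions, not data or code from the paper. It shows how one known point can leave several candidates while a second point narrows the match to one.

```python
from collections import defaultdict

# Hypothetical toy data set: (scrambled_id, shop, date, price).
# All values are made up for illustration; none come from the study.
RECORDS = [
    ("id_1", "shoes", "09/23", 59.90),
    ("id_1", "groceries", "09/23", 21.40),
    ("id_1", "bakery", "09/23", 4.50),
    ("id_1", "restaurant", "09/24", 32.00),
    ("id_2", "bakery", "09/23", 3.20),
    ("id_2", "gas station", "09/24", 40.00),
    ("id_3", "bookstore", "09/23", 15.00),
    ("id_3", "restaurant", "09/24", 27.50),
    ("id_4", "groceries", "09/23", 18.75),
]


def find_candidates(records, known_points):
    """Return the scrambled IDs whose records contain every known (shop, date) point."""
    points_by_id = defaultdict(set)
    for scrambled_id, shop, date, _price in records:
        points_by_id[scrambled_id].add((shop, date))
    wanted = set(known_points)
    return {i for i, points in points_by_id.items() if wanted <= points}


def transactions_of(records, scrambled_id):
    """Return every record belonging to one scrambled ID."""
    return [r for r in records if r[0] == scrambled_id]


# One known point is not enough here: two cardholders visited a bakery that day.
one_point = find_candidates(RECORDS, [("bakery", "09/23")])

# Adding a second point leaves a single candidate.
two_points = find_candidates(
    RECORDS, [("bakery", "09/23"), ("restaurant", "09/24")]
)

if len(two_points) == 1:
    (match,) = two_points
    # Once the ID is pinned down, every other purchase under it is exposed.
    for record in transactions_of(RECORDS, match):
        print(record)
```

In a data set of 1.1 million people, the same containment check would simply need more known points before the candidate set shrinks to one.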
De Montjoye had previously shown the same thing in a data set of phone records. There, the rate was even higher: four data points were enough to uniquely identify an individual 95 percent of the time.
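For a sense of where a figure like "90 percent with four points" could come from, here is one plausible way to estimate such a rate, sketched in Python. This is my own simplified illustration, not necessarily the procedure used in either study. The function names `estimate_unicity` and `make_toy_records` and the synthetic data are assumptions. For each person, it draws a few of their own time-and-place points at random and checks whether anyone else in the data set shares all of them.

```python
import random
from collections import defaultdict


def estimate_unicity(records, num_points, seed=0):
    """Estimate the share of people singled out by num_points of their own points.

    records: iterable of (scrambled_id, place, date) tuples.
    Only people with at least num_points distinct points are counted.
    """
    rng = random.Random(seed)
    points_by_id = defaultdict(set)
    for scrambled_id, place, date in records:
        points_by_id[scrambled_id].add((place, date))

    eligible = [i for i, pts in points_by_id.items() if len(pts) >= num_points]
    if not eligible:
        return 0.0

    unique = 0
    for person in eligible:
        # Pretend an outsider knows num_points of this person's points.
        known = set(rng.sample(sorted(points_by_id[person]), num_points))
        # Count everyone whose trace contains all of the known points.
        matches = sum(1 for pts in points_by_id.values() if known <= pts)
        if matches == 1:
            unique += 1
    return unique / len(eligible)


def make_toy_records(num_people=200, num_places=15, num_days=30, visits=12, seed=1):
    """Generate synthetic (scrambled_id, place, date) records. Purely illustrative."""
    rng = random.Random(seed)
    return [
        (f"id_{p}", f"shop_{rng.randrange(num_places)}", f"day_{rng.randrange(num_days)}")
        for p in range(num_people)
        for _ in range(visits)
    ]


toy = make_toy_records()
for k in (1, 2, 3, 4):
    print(k, "points:", estimate_unicity(toy, k))
```

The numbers this prints describe only the synthetic data, which is far smaller and more random than real purchase records, so they should not be read as a reproduction of the published results.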
There's a lot of debate over what findings like this actually mean for big data. Publicly releasing large datasets can lead to important understandings and advancements, but is the risk to personal privacy worth the benefits? Well, it's not that simple. As Ann Cavoukian, the former information and privacy commissioner for Ontario, Canada, pointed out in a paper last year, while it's true you only need four pieces of information to uniquely identify a person, getting that information isn't all that easy.
"In addition to having access to the comprehensive dataset of mobility traces, an adversary would have to know at least four spatio-temporal pieces of information (e.g., the person's home address, work address, etc.) about each individual in the sample in order to re-identify 95 per cent of that population," Cavoukian wrote. Identifying your friend Scott is one thing. Identifying 1.1 million people is another.
"Needless to say, amassing such information from publicly available sources would not be a trivial undertaking."
Two computer science experts at Princeton refuted a lot of Cavoukian's arguments in a rebuttal paper, noting that 50 percent of individuals in the cell phone database were uniquely identified with just two factors, like home and work addresses, which are relatively easy to scrape online.
But de Montjoye says the difficulty or ease with which someone might be able to get that information isn't really the point (and it's also difficult to quantify). He does his research to identify where big data sets might be vulnerable so that more effective anonymization systems can be created.
"We try really hard for this not to be framed as some kind of Big Brother-fearing type study. There's a lot of potential in big data and we really do not think that we should stop using this data," he told me. "It means we need to reform the way we approach data protection."
One possible solution that MIT cooked up was to make big data available to the public and researchers, but behind a screen. They created a system that allows you to ask questions to find trends and explore the data without actually seeing it all in one set. De Montjoye says it's not a one-size-fits-all solution, but it's a step.
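The article doesn't go into how MIT's system works, so the Python sketch below is not a description of it. It is only a rough illustration of the general "answer questions, don't hand over rows" idea, under my own assumptions: a hypothetical `AggregateOnlyStore` class that returns an average only when enough distinct people contribute to it, and refuses otherwise.

```python
class AggregateOnlyStore:
    """Hypothetical wrapper that answers aggregate questions but never returns rows.

    records: iterable of (scrambled_id, shop, date, price) tuples.
    min_group_size: smallest number of distinct people an answer may describe
    (an assumed threshold, chosen here only for illustration).
    """

    def __init__(self, records, min_group_size=5):
        self._records = list(records)
        self._min_group_size = min_group_size

    def average_spend(self, shop):
        """Return the mean price at a shop type, or None if too few people shop there."""
        rows = [r for r in self._records if r[1] == shop]
        distinct_people = {r[0] for r in rows}
        if len(distinct_people) < self._min_group_size:
            return None  # Refuse: the answer would describe too small a group.
        return sum(r[3] for r in rows) / len(rows)


# Invented example data: three people bought something at a bakery.
records = [
    ("id_1", "bakery", "09/23", 2.0),
    ("id_2", "bakery", "09/23", 4.0),
    ("id_3", "bakery", "09/24", 6.0),
    ("id_1", "shoes", "09/23", 60.0),
]

store = AggregateOnlyStore(records, min_group_size=3)
print(store.average_spend("bakery"))  # Three distinct people, so an average is returned.
print(store.average_spend("shoes"))   # Only one person, so the store returns None.
```

A size threshold like this is only one ingredient. On its own it would not stop someone from combining several carefully chosen queries, which fits de Montjoye's caution that this kind of system is a step rather than a complete fix.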
"There are amazing things we can do. We can learn about humans, about society, about the economy from this data," he said.
"We just need to make sure we understand what the risks are and make sure we find the right balance between privacy and utility."