Big data needs to be anonymized to protect privacy, but it's increasingly apparent that anonymization isn't truly possible.
Image: Eric Fischer
Big data is supposed to lead to the utopian cities of the future, but what happens if we can't anonymize the data properly?
Better raw computing power, cheaper storage, and cheap sensors everywhere have led to an explosion of data, especially in cities. Cities now collect tons of information about bus ridership, crime, utility services, and so on, which is supposed to help local governments decide where to put more resources.
Case in point: Last year, New York City's 187 million taxi cab trips were compiled and put into a database, complete with start and finish locations (GPS-based), distance traveled, fare, and tip amount. But it also included poorly anonymized taxi driver information, including license numbers and medallion numbers (a medallion is the permit needed to operate a cab in the city), making it possible to determine a driver's annual salary and name.
The release of the information, obtained by data artist Chris Whong through a freedom of information law request, is a good chance to look at big data's promise and pitfalls. On its own, a spreadsheet with 187 million rows isn't going to help anyone learn anything, but the computing technology that allowed the collection of the data in the first place also allows it to become both incredibly useful and, from a privacy point of view, incredibly insecure.
Eric Fischer, a data visualist over at Mapbox, took the raw data and turned it into this stunning map that not only looks sweet but is also pretty telling.
Starts of trips are in blue, ends of trips are in orange. As you can see, the vast majority of trips start and end in Manhattan. The Taxi and Limousine Commission says that 94 percent of pickups occur in Manhattan or at one of the airports serving the city.
But, beyond that, Fischer notes that if you're in Brooklyn and want to get a taxi, you're best off if you stand on a street that eventually hits one of the Brooklyn-Manhattan bridges, and you're even more likely to get a cab if you stand on the side of the street headed back to the city.
"There are dropoffs all over the city and at Newark Airport in New Jersey, but if you want to catch a taxi, it helps to be in the right place and even on the right side of the street. The pattern is especially clear in Brooklyn, where taxis drop off passengers on the sides of the streets leading away from the Brooklyn and Manhattan Bridges and then pick up new passengers on the other side of the street," he wrote.
Street corners, as you'd expect, tend to be more popular spots to grab a cab than the middle of a block, and major spots like Penn Station have an incredible number of pickups and drop-offs. Once you start going deeper into Brooklyn, Queens, the Bronx, or New Jersey, you see almost exclusively drop-offs, which makes sense. It's also one of the reasons your cab drivers are likely to be annoyed taking you home if you live far from Manhattan.
That's the kind of information that residents can use and cities need to know—after it learned that cabs are essentially never meandering through Brooklyn looking for customers, the taxi commission decided to create the "Boro Taxi," a new type of license that allows its drivers to only work in parts of the city that its standard Yellow taxis don't want to go to.
In an email, Fischer told me that he used a scripting program to make the map, which isn't surprising at all considering there are 127 million data points. But even with such a huge database, you don't need to be a data manipulation expert to create something like this. Some simple Excel sorts would let you know the most popular times taxis operate, the most expensive trips, and that sort of thing.
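For readers who would rather script it than sort it, here is a minimal sketch of that kind of question in Python. The sample rows and the column names (`pickup_datetime`, `fare_amount`) are assumptions made up for illustration, not the layout of the released files.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Hypothetical sample rows; column names are assumptions, not the real schema.
SAMPLE = """pickup_datetime,fare_amount
2013-01-01 08:15:00,12.50
2013-01-01 08:45:00,7.00
2013-01-01 19:05:00,52.00
2013-01-02 19:30:00,9.50
2013-01-02 19:55:00,14.00
"""

def summarize(csv_file):
    """Count pickups per hour of day and track the most expensive trip."""
    pickups_by_hour = Counter()
    priciest = None  # (fare, pickup timestamp) of the costliest trip so far
    for row in csv.DictReader(csv_file):
        pickup = datetime.strptime(row["pickup_datetime"], "%Y-%m-%d %H:%M:%S")
        pickups_by_hour[pickup.hour] += 1
        fare = float(row["fare_amount"])
        if priciest is None or fare > priciest[0]:
            priciest = (fare, row["pickup_datetime"])
    return pickups_by_hour, priciest

pickups_by_hour, priciest = summarize(io.StringIO(SAMPLE))
busiest_hour, trips = pickups_by_hour.most_common(1)[0]
print(f"Busiest hour: {busiest_hour}:00 with {trips} pickups")
print(f"Most expensive trip: ${priciest[0]:.2f} at {priciest[1]}")
```

Swapping the in-memory sample for an opened file would run the same loop over a real export, one row at a time, without loading everything into memory.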
"There are a lot of questions that anyone could answer for themselves from this data using something as common as a spreadsheet if they wanted to," he wrote. "Public data empowers curious people to ask their own questions and get their own answers."
That's all well and good, but in this case, one of those questions was "who are these taxi drivers and how much money are they making?" On other data sets, the questions could be much more malicious—"whose credit card number is this?" is an obvious one that comes to mind.
Data security expert Vijay Pandurangan points out over at Medium that the data was improperly anonymized, so potentially malicious people could de-anonymize the city's data very quickly, even though the anonymization scheme allows for as many as 22 million possible taxicab medallion numbers. Ripping through those is a piece of cake with today's processors:
"One can completely deanonymize the entire data. Modern computers are fast: so fast that computing the 24M hashes took less than 2 minutes," Pandurangan wrote.
In his full post, you can get a bit more background about how he was able to work backwards from the city's data, but the point is, the city didn't anonymize the information properly, and it's possible that no one ever will.
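The general idea of working backwards from hashes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Pandurangan's actual code: it assumes the identifiers were run through an unsalted MD5 hash, and it uses a deliberately tiny made-up ID format (digit, letter, digit, digit) so that it runs instantly.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits

def md5_hex(text):
    """Hash an ID the way a naive anonymizer might (unsalted MD5, assumed)."""
    return hashlib.md5(text.encode("ascii")).hexdigest()

def candidate_ids():
    """Yield every ID in the assumed format: digit, letter, digit, digit."""
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        yield f"{d1}{letter}{d2}{d3}"

# Precompute hash -> original ID for the entire (small) ID space.
lookup = {md5_hex(candidate): candidate for candidate in candidate_ids()}

# An "anonymized" value as it might appear in a released file (hypothetical).
published_hash = md5_hex("7B42")
recovered = lookup[published_hash]

print(f"{len(lookup)} candidate IDs hashed; recovered {recovered}")
```

Because the set of valid IDs is small and known in advance, hashing every candidate once produces a complete reverse lookup table. The same approach scales to tens of millions of candidates on ordinary hardware, which is why hashing a short, structured identifier does little to hide it.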
Back in 2006, AOL and Netflix also released huge datasets that allowed people to work backwards to identify users, based solely on their search terms for AOL and their watch list combined with IMDb ratings for Netflix. And that's one of the problems with big data: for it to be useful, data has to be somehow attached to whoever is doing the thing you want to track, otherwise it's probably useless.
In a blog post following the AOL and Netflix incidents, internet security guru Bruce Schneier wrote that there are some inherent problems with anonymizing data, namely that people can be identified based on shockingly little information.
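To see how little it can take, here is a minimal Python sketch that counts how many records in a dataset are one of a kind once a few seemingly harmless fields are combined. The fields chosen (ZIP code, gender, date of birth) and every record below are made-up assumptions for illustration.

```python
from collections import Counter

# Hypothetical records: (ZIP code, gender, date of birth). No names included.
records = [
    ("10001", "F", "1970-03-14"),
    ("10001", "F", "1970-03-14"),
    ("10001", "M", "1982-11-02"),
    ("11201", "F", "1990-07-21"),
    ("11201", "M", "1965-01-30"),
]

def unique_fraction(rows):
    """Return the share of rows whose combination of fields appears only once."""
    counts = Counter(rows)
    unique_rows = sum(1 for row in rows if counts[row] == 1)
    return unique_rows / len(rows)

share = unique_fraction(records)
print(f"{share:.0%} of records are unique on ZIP + gender + birth date")
```

Any record that is unique on those fields can be matched against another dataset that carries the same fields alongside a name, even though the "anonymous" file never contained one.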
"Using public anonymous data from the 1990 census, Latanya Sweeney found that 87 percent of the population in the United States, 216 million of 248 million, could likely be uniquely identified by their five-digit ZIP code, combined with their gender and date of birth," Schneier wrote. "This has profound implications for releasing anonymous data. On one hand, anonymous data is an enormous boon for researchers … On the other hand, in the age of wholesale surveillance, where everyone collects data on us all the time, anonymization is very fragile and riskier than it initially seems."
And so, too, with big data. Releasing it lets people like Fischer make this awesome, useful map. Letting people know how much money taxi drivers make (and who those drivers are) isn't necessarily the worst thing in the world, and in this case appears to be relatively harmless. Passenger information appears to be all but impossible to figure out, in this case at least. But what happens next time?