Yahoo Labs, the research wing of Yahoo, just released what the company is calling the “largest ever” machine learning dataset, free for artificial intelligence researchers to use in their work, for example to build a Facebook-style recommendation algorithm.
In doing so, Yahoo also released information that could potentially be used by researchers who download the database—and anyone they share it with—to identify Yahoo customers.
The behemoth dataset consists of 13.5 terabytes of user interactions with news items from some 20 million users, which the company says have been “anonymized.” While there are no names attached to the data, the records of seven million of those users also include their age, gender, the city they were in when they accessed the page, whether they used a mobile device or a desktop, and a timestamp of when they accessed each news item.
This kind of auxiliary information, although it does not identify a user by name, is called metadata, and it’s anything but anonymous. Researchers have found time and again that with enough of these digital breadcrumbs, they can easily be traced back to the person who left them.
“The idea is that you should not be able to identify a user by looking at the data”
Using metadata from Yahoo’s dataset alone, for example, you could easily surmise that a woman under 20 in Akron, Ohio used a mobile device to access news about the latest ISIS attack at 3:15 PM local time on February 13, 2015. Combined with other datasets or public records, the data could thin the pool of potential candidates even further, until someone’s identity can be inferred.
“It’s sometimes information from several sources, not just one other source, but three or four different public records sources,” said David O’Brien, a researcher at the Berkman Center for Internet and Society at Harvard, referring to how someone might re-identify someone whose name was scrubbed from the dataset by Yahoo.
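The cross-referencing O’Brien describes is often called a linkage attack: each quasi-identifier shared between the “anonymized” data and an outside source (age, gender, city, and so on) filters the candidate pool until only one person remains. A minimal sketch of the idea, using entirely invented records and names for illustration:

```python
# Hypothetical sketch of a linkage attack. The "anonymized" log rows
# carry no names, but an attacker's auxiliary dataset (e.g. public
# records) does. Matching on shared quasi-identifiers narrows the
# candidate pool. All data below is made up for illustration.

anonymized_rows = [
    {"age": 19, "gender": "F", "city": "Akron",  "device": "mobile"},
    {"age": 34, "gender": "M", "city": "Akron",  "device": "desktop"},
    {"age": 19, "gender": "F", "city": "Toledo", "device": "mobile"},
]

# The attacker's outside source, which does include names.
public_records = [
    {"name": "Alice", "age": 19, "gender": "F", "city": "Akron"},
    {"name": "Bob",   "age": 34, "gender": "M", "city": "Akron"},
    {"name": "Carol", "age": 19, "gender": "F", "city": "Toledo"},
]

def candidates(row, records, keys=("age", "gender", "city")):
    """Return the names whose quasi-identifiers match this log row."""
    return [r["name"] for r in records
            if all(r[k] == row[k] for k in keys)]

for row in anonymized_rows:
    print(row, "->", candidates(row, public_records))
```

In this toy example every row narrows to a single person; in practice each additional source (voter rolls, social media profiles, location check-ins) shrinks the pool the same way, which is why combining three or four public records can be enough.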
Yahoo allows only researchers, faculty, and students at accredited universities, with .edu email addresses, to download its datasets. According to Suju Rajan, director of research for personalization science at Yahoo Labs, the company went through a rigorous process of checks and balances to make sure user anonymity was maintained in the dataset.
“The idea is that you should not be able to identify a user by looking at the data,” Rajan said. “It’s fairly innocuous in terms of what it reveals about a person’s information.”
If you’re looking at Yahoo’s dataset on its own, this may indeed be true. Unlike the infamous AOL data leak in 2006, in which AOL inadvertently released sensitive customer information to anyone who wanted to download it (the company had intended the data for researchers), Yahoo has taken steps to obfuscate user identities and put a firewall around who can get the data. Still, it’s a question of trust, O’Brien said.
“It comes down to trusting that these computer scientists won’t do anything nefarious with the data,” O’Brien added. “For example, that they won’t release it to other people, and that they won’t attempt to re-identify people and use that to do harm to somebody else.”