Just because data is sort-of public doesn't mean it's ethical to collect it en masse.
A student and a co-researcher have publicly released a dataset on nearly 70,000 users of the dating site OkCupid, including their sexual turn-ons, orientation, usernames and more. And critics say it may be possible to work out users' real identities from the published data.
The situation is raising questions about what type of data researchers should be allowed to collect en masse, repackage and perhaps distribute.
Information posted to OkCupid is semi-public: if you type a person's username into Google, you can find some profiles and see some of the information they've provided, but not all of it. To see the rest, you need to log into the site. Such semi-public information uploaded to sites like OkCupid and Facebook can still be sensitive when taken out of context, especially if it can be used to identify individuals.
"OkCupid is an attractive site to gather data from," Emil O. W. Kirkegaard, who identifies himself as a masters student from Aarhus University, Denmark, and Julius D. Bjerrekær, who says he is from the University of Aalborg, also in Denmark, note in their paper "The OKCupid dataset: A very large public dataset of dating site users."
The data was collected between November 2014 and March 2015 using a scraper (an automated tool that saves certain parts of a webpage) from random profiles that had answered a high number of OkCupid's multiple-choice questions. These include things like whether they ever do drugs, whether they'd like to be tied up during sex, or which is their favourite out of a series of romantic situations.
The information collected about users included their username, age, gender, location, religious and astrological opinions, number of photos, and more. The pair also collected the users' answers to the 2,600 most popular questions on the site. For the paper, Kirkegaard and Bjerrekær explored things such as whether it was possible to work out users' general cognitive ability from their answers. A third researcher, Oliver Nordbjerg, is also listed as a contributor on the site Open Science Framework.
This was all information available to users of OkCupid once they were signed in. Arguably, the data was public, as it didn't contain direct messages, or anything of that sort.
"It is our hope that other researchers will use the dataset for their own purposes," the paper reads.
But plenty of academics are unhappy with the publication of this data.
Scott B. Weingart, Digital Humanities Specialist at Carnegie Mellon University (CMU), claimed in a tweet that he could with 90 percent accuracy connect sexual preferences and histories to real names of over 10,000 of the OkCupid users.
"The data may be 'public' (though it must require login and agreement to a [terms of service]) but that does not absolve anyone from an ethical responsibility." Rasmus Munksgaard, a researcher who has previously conducted his own scrapes of dark web marketplaces told Motherboard in a Twitter message.
"The data can be used for deanonymization of individuals and very sensitive information, and they can't opt out either," he continued.
The thing is, what Kirkegaard and his co-researcher did was not illegal, and, more generally, research ethics review boards haven't really caught up with the scraping of publicly available web data either.
"Anything as large and old as the university system will be slow-moving and difficult to change course, usually by design," Weingart from CMU told Motherboard in an email. "We don't want to rush into anything, we want to understand the outlines and ethics first. This is a case of the world moving much faster than the university system, and we're scrambling to catch up."
OkCupid said the pair's actions violated the company's terms of service.
Other people working with data from public sources have taken steps to address privacy concerns. In a 2008 paper dealing with information gleaned from Facebook, the authors removed subjects' names and identification numbers, and required other researchers to agree to a set of terms and conditions for future use.
This OkCupid data, meanwhile, seemingly hasn't had any sort of anonymisation applied to it.
Kirkegaard, one of the authors of the paper, told Motherboard in an email, "Preferably I would like to wait until the heat has declined a bit before doing any interviews. Not to fan the flames on the social justice warriors." Since the publication of the paper earlier this week, he has uploaded a password-protected version of the data. But it is still possible to access the open version by clicking through the various revisions of the data listed on the publishing site, Motherboard found.
Update: This story has been updated to reflect comment from OkCupid.