What the New Reddit and Imgur Research Project Plans to Do with User Data

A new project is going to open up the internet's largest social communities to researchers.

Aug 21 2014, 10:00am

Image: Flickr/Beatrice Murch

Reddit, Imgur, Twitch, and a couple of other major internet portals are about to hand over reams of user data to university professors for research. How can the sites do it while still maintaining users' trust?

Earlier this summer, Facebook ran into trouble with just that when it came out that the company had been altering users' news feeds in order to experiment on them. Now a new initiative called the Digital Ecologies Research Partnership (DERP) will allow researchers to access user information and posts from Reddit, Imgur, Twitch, Fark, and StackExchange.

The idea is to let researchers more easily scrape user posts, data, popularity information and that sort of thing in a cross-platform way, allowing a scientist to study, for instance, how a user base reacts to a particular story as it spreads from site to site.

Image: Derp

It's an idea that is, considering the sites' collective power, quite frankly a bit past due. But it's also one that's going to have to be implemented very carefully. Each platform knows that mishandling user data could have catastrophic consequences for its popularity (or, at the very least, could rile up the digital pitchfork-toting portion of its user base). 

At the same time, it's really important to start understanding, on a large scale, how these massive communities that have become such huge internet destinations work, and what they can tell us about human behavior, online or otherwise.

"I think all of our partners are super sensitive to privacy—we've all gotten together and said, 'how do we do this right?'" Tim Hwang, Imgur's head of special initiatives (Imgur is the one who got the DERP ball rolling), told me. "It comes down to honoring our privacy commitments and generally not being a dick to your user base."

Hwang says that much of the data researchers will be grabbing is already publicly available from each site's APIs, and that a lot of the project will simply be "aiding researchers in navigating what we already make available to the public."

Even so, that's not always going to be the case. And even when things are publicly available, users of the sites might not necessarily know that (though they probably should).

Several of the researchers involved with the project told me that they're interested in unpacking some of the drama that occurs on each site, research that will certainly require greater access than what's publicly available.

Whitney Phillips of Humboldt State University and Ryan Milner of the College of Charleston, for instance, say that they're hoping to interview Reddit administrators, moderators, and users about trolling behaviors in an attempt to understand (among other things) why people are such jerks on the internet.

"We're interested in the human machinations behind problematic online behaviors," Phillips told me. "This means approaching the subject from a political economic and ethnographic line of inquiry, the ideal outcome of which is to figure out how a given group of people (paid staff, moderators, company shareholders, programmers) arrived at whatever decision about how to regulate on-site behaviors (or not)."

"What DERP will do, then, is provide access to precisely the kind of behind-the-scenes data (and by data I am most definitely including narratives, i.e. the stories people have to tell about their jobs) that will allow us to ask more nuanced questions, and come to more nuanced conclusions, about online antagonisms," she continued.

You can see how private messages, access to private moderator subreddits, and the like might come in handy here. That's not to say the pair will get whatever they want, as Milner pointed out to me, but they can at least ask.

Image: Flickr/Torley

So, is that problematic from a privacy standpoint? It's really up to the users to decide, I suppose. But the platforms are already considering it. To help them navigate the tricky decisions about what is and isn't fair game, Sara Watson, a fellow at Harvard's Berkman Center for Internet and Society who studies personal data, has been brought on to advise the platforms.

"I think it's getting closer to a point where everything is sensitive—even if you strip down and anonymize everything, you can reidentify a lot of the data," Watson told me. "We're trying to figure out, what's the appropriate way to handle this data?"

For now, data will be given out on a case-by-case basis. Watson will help the sites decide what's cool to do and what's not, and will potentially develop a framework so the decisions can be made more quickly.

On the "no" list, at least initially, are requests from companies that want the information for marketing research. But how about research that Imgur or Reddit specifically wants done in order to improve its own services? Isn't that, on some level, what Facebook was doing with its emotions study?

"Obviously, the researchers are working with these companies who have an interest in what the outcomes of their research are. [The involved companies] are going to benefit from people spending time looking at this," Watson said. "These companies often don't employ social scientists or data scientists, so they're going to be finding out new things about the activity on their communities."

And, at that point, the companies would be stupid to not make wholesale changes, if the data suggests they should. Imgur's Hwang told me as much.

"I think our intent is to support the research first and foremost, but what comes out of it, we'll have to see," he said. "It might identify community issues and help improve the platforms."