A range of tactics will be needed to solve Wikipedia’s problem with harassment.
Despite its noble goals, Wikipedia is notorious for harassment among its editors. Now, research from tech incubator Jigsaw and the Wikimedia Foundation is looking at how artificial intelligence can help stop the trolls.
The research project, called Detox, began last year and used machine learning methods to flag comments that contain personal attacks. The researchers looked at 14 years of Wikipedia comments for patterns in abusive behaviour. Detox is part of Jigsaw's Conversation AI project, which aims to build open-source AI tools for web forums and social media platforms to use in the fight against online harassment.
The algorithm could determine the probability of a given comment being a personal attack as reliably as a team of three human moderators
A paper published last week on the arXiv preprint server by the Detox team offers the first look at how Wikimedia is using AI to study harassment on the platform. It suggests that abusive comments aren't the domain of any specific group of trolls, and that diverse tactics are going to be needed to combat them on Wikipedia.
"This is not ground-breaking machine learning research," said Ellery Wulczyn, a Wikimedia data scientist and Detox researcher, in a telephone interview. "It's about building something that's fairly well known but allows us to generate this data at scale, to be able to better understand the issue."
The goal at Jigsaw, an Alphabet tech incubator that began as Google Ideas, is nothing short of battling threats to human rights and global security. Their projects include a map that shows the sources and targets of global DDoS attacks in real time, and an anti-phishing extension for Chrome originally developed to protect Syrian activists from hackers.
To get their algorithm to recognize personal attacks, the Detox team needed to train it on a solid dataset. They started with 100,000 comments from Wikipedia talk pages, where editors hash out their disagreements. Next, 4,000 crowdworkers evaluated the comments for personal attacks. Each comment was inspected by 10 different people.
The result is one of the largest annotated datasets of online abuse ever assembled. It's all available on Figshare, and their code is on GitHub so it can be used by platforms beyond Wikipedia.
After being trained on the dataset, the algorithm could determine the probability of a given comment being a personal attack as reliably as a team of three human moderators.
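The Detox paper describes fairly standard models of this kind, built on character n-gram features. As a rough illustration of the idea (this is a toy sketch, not the Detox code, and all function names and example comments here are invented), a character-trigram logistic-regression classifier that outputs the probability a comment is a personal attack can be written in plain Python:

```python
# Toy sketch of a character n-gram attack classifier (not the Detox code).
import math
import re
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a normalized comment."""
    text = re.sub(r"\s+", " ", text.lower()).strip()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def train(comments, labels, epochs=200, lr=0.5):
    """Fit logistic regression with per-example gradient updates."""
    weights, bias = {}, 0.0
    feats = [char_ngrams(c) for c in comments]
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            z = bias + sum(weights.get(g, 0.0) * v for g, v in f.items())
            p = 1.0 / (1.0 + math.exp(-z))
            err = y - p  # gradient of the log-loss w.r.t. z
            bias += lr * err
            for g, v in f.items():
                weights[g] = weights.get(g, 0.0) + lr * err * v
    return weights, bias

def attack_probability(weights, bias, comment):
    """Score a new comment: probability it contains a personal attack."""
    f = char_ngrams(comment)
    z = bias + sum(weights.get(g, 0.0) * v for g, v in f.items())
    return 1.0 / (1.0 + math.exp(-z))

# Invented toy training data: 1 = personal attack, 0 = benign.
comments = [
    "you are a complete idiot",
    "thanks for fixing the citation",
    "go away you idiot troll",
    "nice work on the article",
]
labels = [1, 0, 1, 0]
w, b = train(comments, labels)
```

In the real project the training labels came from the crowdworkers' judgments, and once fitted, a model like this can score every comment in the corpus, which is what makes the large-scale analysis cheap.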
The Detox team then ran 63 million English Wikipedia comments posted between 2001 and 2015 through the algorithm and analyzed the results for patterns in abusive comments.
"It's basically simulating labelling every comment in the history of Wikipedia by three people," said Wulczyn. "That's expensive and time consuming, but we can do it with the model in a reasonable amount of time. It opens up all these possibilities for analysis and for gaining a better understanding of the issue."
The results of the analysis surprised Wulczyn, he said. Although comments from unregistered users were six times more likely to contain an attack, more than half of all abusive comments came from registered, identifiable users.
The abuse wasn't coming from an isolated group of trolls
What's more, the abuse wasn't coming from an isolated group of trolls. Almost 80 percent of all abusive comments were made by more than 9,000 "low-toxicity users"—people who made fewer than five abusive comments in a year. On the flip side, nearly 10 percent of all attacks on the platform were made by just 34 highly toxic users.
"It shows there will have to be a diversity of tactics to end this problem," said Wulczyn, "which is good to know."
The researchers will use the data to look at how attacks affect editor retention, an ongoing concern for Wikipedia. They're also looking at ways machine learning can help make for friendlier discussions on Wikipedia. For example, they say an AI could be used to flag a comment for human moderation, or to build dashboards that give moderators a better view of the discussions taking place.
"I myself am very wary of using machine learning methods to make automatic decisions," said Wulczyn. "Everybody is. That's not really up for debate. But the question is: can you use the algorithm to help triage incidents? Those are things that we've talked about."
One caveat is that once people find out a computer is monitoring their discussion, they might try to game the algorithm. 4chan trolls were recently able to get around an algorithmic hate speech filter by replacing racial slurs with the names of Google products.
"We don't know what happens when humans enter into an adversarial relationship with the algorithm," he said. "That becomes a very different problem, where you need to constantly monitor whether the things humans think are personal attacks are still the same things the model thinks are personal attacks."
But it's a challenge that the team is ready for.
"It's hard to see those words, that those things are being said to people on Wikipedia," Wulczyn said.