Not Even AI Can Make Total Sense of a Privacy Policy

Using artificial intelligence, researchers have created a tool that crawls the privacy policies of popular websites like Facebook, Reddit, and Twitter. But the software’s findings are not as detailed as those produced by human experts.

Mar 20 2018, 2:00pm


Nobody actually reads through the privacy policies of every website, which is why researchers recently used artificial intelligence to create a tool that reads them for you and flags anything you might not be psyched to agree to.

Launched earlier this year as a part of the Usable Privacy Project, the tool uses artificial intelligence to crawl through 7,000 of the web’s most popular sites, including Facebook, Reddit, and Twitter, and parse their privacy policies. That data is available on the project’s website, where you can search for a site and see a breakdown of some of the most pivotal information included in that site’s privacy policy, including whether the company that owns the site is collecting data on its users, and whether it’s sharing that data with any third parties.

The AI's crack at VICE.com's privacy policy. Image: screengrab from Usable Privacy

Most of us don’t bother to read these policies, even though the majority of Americans say privacy matters deeply to them. But we have a good excuse: studies have shown the average internet user would need to take a month off work every year to read through the privacy policies of all the websites they use.

“Even after you’ve read it, sometimes you need special training to fully understand the nuances of the language,” Norman Sadeh, the lead principal investigator on the project and a professor of computer science at Carnegie Mellon University, told me over the phone. “We don’t want people to read these privacy policies because that would be highly unrealistic. Instead, with technology, we can extract statements and match with things people care about.”

But it turns out even AI can’t make sense of these dense, jargon-laced documents, and might miss some important context.

“This is really pushing the envelope so it’s very hard to do this overnight,” Sadeh said. “We’re [actually] able to do more than what we’re showing but we’re trying to be careful because machine learning is not perfect and never will be.”

In a paper published alongside the project, Sadeh and his colleagues stated that when searching for entire paragraphs, the AI was able to identify relevant passages with 79 percent accuracy. Its accuracy was 70 percent when looking for individual, relevant sentences.

An expert human, however, can break a privacy policy down to a much more granular level. Take a look at the AI’s attempt at parsing YouTube’s privacy policy, then compare it with an expert human’s analysis of the same document. The AI missed three-quarters of the third-party sharing references that the human caught.

Sadeh told me the AI is always improving, but it can be a slow process because in order for it to learn what different terms mean in a policy, it first has to have multiple examples that have been analyzed by a human. The AI is really good at flagging some common phrases that appear in almost every privacy policy, but more obscure or unusual language can trip it up.

“It’s machine learning so you’re building classifiers and these classifiers are trained on as large of a dataset as you can get,” Sadeh said. “But obviously to have the data, you need to rely on humans annotating policies in the first place, which is a very time-consuming process.”
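The approach Sadeh describes — classifiers trained on human-annotated policy text — can be illustrated with a minimal sketch. This is not the project’s actual code, and the sentences and labels below are invented for illustration; it simply shows the general pattern of training a sentence-level text classifier on a small hand-labeled corpus and using it to flag new policy language.

```python
# A minimal sketch of sentence-level policy classification (illustrative only,
# not the Usable Privacy Project's code). Sentences and labels are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hand-labeled corpus: 1 = mentions third-party sharing, 0 = does not.
sentences = [
    "We share your personal information with third-party advertisers.",
    "Your data may be disclosed to our partners and affiliates.",
    "We sell aggregated information to third parties for marketing.",
    "Information is transferred to third-party service providers.",
    "You can update your account settings at any time.",
    "We use cookies to remember your language preference.",
    "Contact our support team with any questions.",
    "This policy was last updated in January.",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Bag-of-words features plus a linear classifier: a common baseline
# for this kind of sentence classification task.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(sentences, labels)

# Flag a previously unseen sentence.
new_sentence = "We may share usage data with third-party analytics partners."
print(model.predict([new_sentence])[0])
```

The bottleneck Sadeh points to is visible even here: the classifier is only as good as the `labels` list, and producing those annotations for thousands of real policies requires slow, expert human work.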

The researchers noted that at this stage, the AI isn’t able to parse sentences in context with preceding or following sentences. And in case you think AI might be more objective than humans, think again: bias in artificial intelligence is a real issue that researchers are still struggling with.

The team hopes to release Usable Privacy as a custom browser plugin by the end of the year, Sadeh told me. As the technology develops, he said, it could be applied to other headache-inducing online legalese, like terms of service, which spell out what a user agrees to in order to access a site.

Though there are plenty of valid concerns with automation, if researchers can improve the accuracy of this kind of technology, it’s a job I’m sure all of us will have no problem outsourcing to a robot.
