Tech

ChatGPT Can Reveal Personal Information From Real People, Google Researchers Show

The popular AI chatbot is divulging sensitive information from its training data, according to a team of researchers at Google.
A laptop on a desk displaying a login screen for ChatGPT
Getty Images

A team of Google researchers have unveiled a novel attack on ChatGPT, showing that OpenAI’s popular AI chatbot will divulge personal information from real people. 

The underlying machine learning model that powers ChatGPT, like all so-called Large Language Models (LLMs), was trained on massive amounts of data scraped from the internet. With training and reinforcement from humans, the program ideally generates new strings of texts without churning out any of the original text it ingested. Previous work has already shown that image generators can be forced to generate examples from their training data—including copyrighted works—and an early OpenAI LLM produced contact information belonging to a researcher. But Google’s new research shows that ChatGPT, which is a massively popular consumer app with millions of users, can also be made to do this. 

Advertisement

Worryingly, some of the extracted training data contained identifying information from real people, including names, email addresses, and phone numbers. 

“Using only $200 USD worth of queries to ChatGPT (gpt-3.5- turbo), we are able to extract over 10,000 unique verbatim memorized training examples,” the researchers wrote in their paper, which was published online to the arXiv preprint server on Tuesday. “Our extrapolation to larger budgets (see below) suggests that dedicated adversaries could extract far more data.”

The attack identified by the researchers relied on finding keywords that tripped up the chatbot and forced it to divulge training data. The inner workings of AI chatbots are often opaque, and earlier work by independent researchers found that particular phrases can cause the chatbot to totally fail, for example. The Google researchers focused on asking ChatGPT to repeat certain words ad infinitum, for example, the word “poem.” The goal is to cause ChatGPT to “diverge” from its training to be a chatbot and “fall back to its original language modeling objective.” While much of the generated text as a result of this adversarial prompting was nonsense, the researchers report that in some cases ChatGPT diverged to copy outputs directly from its training data. 

The memorized data extracted by the researchers included academic papers and boilerplate text from websites, but also personal information from dozens of real individuals. “In total, 16.9% of generations we tested contained memorized PII [Personally Identifying Information], and 85.8% of generations that contained potential PII were actual PII.” The researchers confirmed the information is authentic by compiling their own dataset of text pulled from the internet. 

Notably, the attack was launched against the GPT 3.5 AI model, which is available to free users. Another model, GPT-4, is available to users who subscribe. Motherboard tested the “poem” attack on GPT-3.5 and found that it generated an unrelated string of text, although we could not find it pre-existing elsewhere on the web. When GPT-4 was asked to repeat the word poem forever, it essentially refused. 

The researchers noted in a companion blog post that “OpenAI has said that a hundred million people use ChatGPT weekly. And so probably over a billion people-hours have interacted with the model. And, as far as we can tell, no one has ever noticed that ChatGPT emits training data with such high frequency until this paper. So it’s worrying that language models can have latent vulnerabilities like this.”

OpenAI did not immediately return a request for comment.