
Zuckerberg Boasts He Will Be AI God King Because We Already Gave Him All Our Data

Meta's plan to overtake the competition in AI hinges on billions of images, posts, and videos that were willingly given.
image: Bloomberg / Contributor via Getty Images

The last several months have been good to Mark Zuckerberg’s Meta, as the company revealed on Thursday that its profits tripled year-over-year to $14 billion due to cost-cutting and a rebound in ads. 

Meta plans on investing heavily in its virtual reality and AI products, and on the latter point, Zuckerberg said during an earnings call the company is “playing to win.” AI has become a crowded field very quickly, with Meta facing stiff competition from OpenAI, Microsoft, and Google. Zuckerberg laid out several components to Meta’s AI “playbook,” not least of which is the fact that billions of people around the world have already given up their data in the form of posts, comments, images, and videos across Meta’s platforms, which include Facebook and Instagram. 

“When people think about data, they typically think about the corpus that you might use to train a model up front,” Zuckerberg said. “On Facebook and Instagram there are hundreds of billions of publicly shared images and tens of billions of public videos, which we estimate is greater than the Common Crawl dataset and people share large numbers of public text posts in comments across our services as well.”

The reference to Common Crawl, a shared dataset built from continuous scrapes of the web over the years, is likely a shot at OpenAI specifically, since the company’s GPT-3 AI model was trained on Common Crawl in addition to Wikipedia, two datasets containing books, and an internal dataset built from web pages linked on Reddit. OpenAI has not made the training sources for its most recent model, GPT-4, public. Meta has also used Common Crawl for its AI projects, and Google maintains its own version of the dataset.

While Meta has yet to truly compete with its rivals on the scale of GPT, it’s no secret that the company leverages user data for its AI products. The company already admitted last year that it had used public posts—but, it claimed, not private messages—to train its Meta AI assistant. Much furor has been raised in recent months over the unauthorized scraping of the web to train AI models; OpenAI even thanked the faceless “millions of people” who created the data to train GPT-3 in its paper describing the model. But when it comes to data willingly shared with Facebook and Meta, that Faustian bargain was struck long ago. 

Now, along with supposed advantages such as sharing open-source models and taking a long view of product development, Meta is betting that its massive hoard of user data will put it over the top.