An Enterprising Redditor Has Archived Almost a Decade of Comments

And it comes to over 1.5 billion contributions.
Rachel Pick
New York, US

Thanks to the efforts of Redditor u/Stuck_In_The_Matrix (real name Jason Baumgartner), you can now download a torrent containing every Reddit comment from October 2007 through May of this year.

After working on the archive for 10 months, Baumgartner started a thread announcing his project and asking for input on how to host and share this massive amount of information. The files include over 1.5 billion comments and make up over a terabyte of data, so users decided it would be best distributed as a torrent. A magnet link is also available.

"I started this project because there have been a lot of interesting and informative comments posted on Reddit," Baumgartner told Motherboard. "My goal was to ingest all of that data into a large database so that people could more easily search through the data to find interesting posts and comments…I wanted to give developers the opportunity to develop their own tools with the data and also give university students the ability to explore research opportunities."

About 350,000 comments were unavailable due to Reddit API issues, but the rest show up as discrete JSON objects once the files are decompressed. Each object includes the commenter's username, the comment's score, its position in the thread where it was left, and the date it was posted. (Comments from private subreddits were not included.)
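For developers curious what working with those objects might look like, here is a minimal Python sketch. It assumes one JSON object per line, and the field names (`author`, `score`, `parent_id`, `created_utc`) and sample records are illustrative assumptions, not details confirmed by Baumgartner; check them against the actual files before relying on them.

```python
import json
from collections import Counter
from datetime import datetime, timezone
from typing import Iterable, Iterator, List, Tuple


def iter_comments(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line (assumes one JSON object per line)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def posted_at(comment: dict) -> datetime:
    """Convert the assumed `created_utc` epoch-seconds field to a UTC datetime."""
    return datetime.fromtimestamp(int(comment["created_utc"]), tz=timezone.utc)


def top_commenters(comments: Iterable[dict], n: int = 3) -> List[Tuple[str, int]]:
    """Count comments per assumed `author` field and return the n most frequent."""
    return Counter(c["author"] for c in comments).most_common(n)


# Made-up sample records standing in for lines of a decompressed archive file.
SAMPLE_LINES = [
    '{"author": "alice", "score": 12, "parent_id": "t3_abc", "created_utc": 1193875200}',
    '{"author": "bob", "score": -1, "parent_id": "t1_def", "created_utc": 1193875260}',
    '',
    '{"author": "alice", "score": 3, "parent_id": "t1_ghi", "created_utc": 1193875320}',
]

comments = list(iter_comments(SAMPLE_LINES))
ranking = top_commenters(comments, n=2)
first_date = posted_at(comments[0]).date().isoformat()
```

For the real files, the list of sample strings would be swapped for an open file handle, so the comments are read lazily, one line at a time, rather than loading a terabyte into memory.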

Baumgartner has also uploaded the torrent to the Internet Archive, a 501(c)(3) nonprofit founded in 1996 and headquartered in San Francisco. According to the Archive's website, its aim is "offering permanent access for researchers, historians, scholars, people with disabilities, and the general public to historical collections that exist in digital format."

Reddit often gets a bad rap because of its less intellectual subforums, but it can also be a valuable source of scientific and technological information, and a fascinating study of how Internet communities are formed. It will be interesting to see what comes out of Baumgartner's project.