The Internet Archive Can't Preserve the Web's History by Itself

The Joy Reid saga highlights the strengths and weaknesses of web archiving.

Apr 30 2018, 4:56pm

Image: J Countess/Getty Images

Michael L. Nelson works for the Web Science and Digital Libraries Research Group at Old Dominion University.

This weekend on her show AM Joy, Joy Reid stated that security experts had not been able to prove that her blog had been hacked or manipulated, and while she “genuinely does not believe [she] wrote those hateful things,” she did admit that she “can definitely understand, based on things I have tweeted and have written in the past why some people don’t believe me.” Events throughout the week included a denial of hacking by the Internet Archive and a growing chorus of experts publicly expressing doubt about her version of events (see articles in my research group's blog, The Atlantic, HuffPost, and The Daily Beast). This will likely be the end of the story for Joy Reid’s now-defunct blog; her detractors on the right may continue to call for her removal, but her fans on the left are surely eager to put the episode behind them.

Even though the story of Joy Reid’s blog may be closing, a similar story will likely unfold again with different characters and minor variations, so what can we learn about both the capabilities and limitations of web archiving in anticipation of “next time”?

Don’t rely on screenshots

Social media is full of screenshots of web pages being used as evidence. Screenshots allow for annotation, highlighting, and circumventing character limits on Twitter, but the ease with which they are manipulated means they are unreliable and may fail to properly document their source. For pictures of kittens or your friends’ children, such provenance is probably not necessary, but if you seek to document a public figure’s malfeasance, the evidentiary threshold is higher.

In the Joy Reid saga, screenshots presented two problems. First, in the original tweets with screenshots of text passages, it was difficult to find the dates the posts were published, and direct links to the posts were not available at all. For example, for a post about NBA player Tim Hardaway called “Tim Hardaway is a homophobe and so are you,” which included lines like “most straight people cringe at the sight of two men kissing” and “I admit couldn’t go see [Brokeback Mountain] either, despite my sister’s ringing endorsement, because I didn’t want to watch two male characters having sex,” the direct link to just the article was:

http://blog.reidreport.com/200...

But the content also appeared on 2007-02-15 at the top level of the blog as well:

http://blog.reidreport.com

Image: Michael L. Nelson

The original tweet about the Tim Hardaway post does not show the date, nor does it provide the direct link. In my own research, of the 50 or so screenshots shared on Twitter from her blog, I was initially able to infer post times for only about 12.

Second, despite denials of editing (other than annotations) of the screenshots, the Tim Hardaway tweet shows the image at the top of the blog post next to text that appears midway through the blog post; this could only happen via editing the image, and it understandably creates confusion about the discrepancy.

In summary, when sharing evidence on social media, augment screenshots with links to the live web, and links to those pages in multiple web archives.

Do use multiple web archives

Reid’s lawyers sent letters to both Google and the Internet Archive in December requesting that they take down the archived blog and information regarding possible hacks or intrusions. The Internet Archive declined to take the blog down, and in February, someone in Reid’s orbit used the robots.txt exclusion protocol to effectively redact the copies in the Internet Archive (this is a standard, automated method for owners of live web sites to control which, if any, pages at a site should be served from the Wayback Machine).

Archives have some limitations. Image: Michael L. Nelson
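
To make the exclusion mechanism concrete, here is a minimal sketch using the urllib.robotparser module from Python's standard library. The robots.txt content is hypothetical (the actual file served from Reid's blog is not reproduced here), and the sketch assumes the archive's crawler identifies itself as ia_archiver, the user agent historically associated with the Wayback Machine. ExampleArchiveBot is a made-up name for some other archive's crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, NOT the actual file from the blog.
# It asks one specific crawler to stay away from every path on the site.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def may_serve(user_agent: str, url: str) -> bool:
    """Return True if the rules above permit `user_agent` to access `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


blog = "http://blog.reidreport.com/"
print(may_serve("ia_archiver", blog))        # the named crawler is excluded
print(may_serve("ExampleArchiveBot", blog))  # a crawler not named in the file is not
```

The rule only binds archives that choose to consult the file, and only for the user agents it names.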

What Reid’s team did not appear to anticipate is that copies of her blog would appear in other web archives, one of which was the Library of Congress’s web archive, which does not honor robots.txt exclusion. In fact, three of the example blog posts her lawyers claim were fraudulent:

http://blog.reidreport.com/200...

http://blog.reidreport.com/200...

http://blog.reidreport.com/200...

are contained in a 2006-01-11 archived version at the Library of Congress:

http://webarchive.loc.gov/all/20060111221738/http://blog.reidreport.com/

In this case, there are copies in two distinct (geographically and administratively) systems, but they are not independent observations. The important point is that while the robots.txt redacted the Internet Archive’s version of the page, it did not redact the version in the Library of Congress.

You can decide for yourself if the content contains, as Reid’s lawyers state, “jarring changes in style and substance” or “uncharacteristic HTML/graphics formatting, and font selection, such as quote offsets, paragraph separators,” but this page alone tilts the forensic evidence against Reid’s version of events. If her blog had been hacked, it would have had to have been hacked in January 2006 for the web archives to have captured this page; a hack at a later date (say, 2007) would not alter the 2006-01-11 version in the web archives. You can read my detailed analysis, but the takeaway message is either:

  • Reid did not see the posts in question or recognize them as fraudulent and did not remove them (as well as not changing her password), despite regularly interacting with her blog (sometimes posting 10+ times per day), or
  • After the last post at 4:51pm EST on 2006-01-11 and before the archiving time of 5:17pm EST the same day, an adversary posted the content (including backdating the posts, which is possible in most blogs), and inserted links to “brokeback-committee-room.html” in other legitimate posts. Keep in mind that the Internet Archive did not have the “save page now” function until 2013, so there was no way an adversary could know in advance when the Internet Archive would crawl that page (and in 2006, unlike today, crawls were irregular and infrequent.)

While possible, either scenario appears unlikely, especially when you consider the scenario would have to be repeated for every fraudulent or disputed post over a period of several years.
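
The timing in the second scenario can be checked from the archived URL itself. This sketch assumes the 14-digit number in the Library of Congress URL is a capture timestamp in UTC, which is the usual Wayback-style convention, and it treats EST as a fixed UTC-5 offset. The helper name capture_time is made up for illustration.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: EST modeled as a fixed UTC-5 offset (no daylight saving in January).
EST = timezone(timedelta(hours=-5), "EST")


def capture_time(archived_url: str) -> datetime:
    """Extract the 14-digit YYYYMMDDhhmmss timestamp and interpret it as UTC."""
    match = re.search(r"/(\d{14})/", archived_url)
    if match is None:
        raise ValueError("no 14-digit timestamp found in URL")
    parsed = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


url = "http://webarchive.loc.gov/all/20060111221738/http://blog.reidreport.com/"
captured = capture_time(url).astimezone(EST)

# Time of the last legitimate post, as stated in the text (4:51pm EST).
last_post = datetime(2006, 1, 11, 16, 51, tzinfo=EST)
window = captured - last_post

print(captured.strftime("%Y-%m-%d %H:%M:%S %Z"))
print(window)
```

Under those assumptions the capture falls at 5:17pm EST, roughly 26 minutes after the 4:51pm post, which is the window an adversary would have had to hit without knowing when the crawl would occur.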

The importance of other web archives was briefly diminished because the Internet Archive had a URL canonicalization hole that allowed people to circumvent the robots.txt exclusion (in short, swapping “http” with “https”, as in “https://blog.reidreport.com/”) and allowed many people to inspect the Internet Archive’s copies of the blog. However, this hole was quickly closed by the Internet Archive, and we cannot assume similar holes will be open in the future.
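
As an illustration of how such a hole can arise, consider an exclusion list keyed on the URL string. This is a hypothetical sketch, not the Internet Archive's actual code: naive_key, canonical_key, and the exclusion sets are invented names, and the canonicalization rules shown (drop the scheme, lowercase the host, strip a leading "www.") are one common choice among many.

```python
from urllib.parse import urlsplit


def naive_key(url: str) -> str:
    """Key that keeps the scheme, so http and https variants differ."""
    return url.strip().lower()


def canonical_key(url: str) -> str:
    """Key that ignores the scheme (query strings are ignored for brevity)."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path or "/"
    return host + path


# The site owner's exclusion was registered against the http URL.
excluded_url = "http://blog.reidreport.com/"
naive_excluded = {naive_key(excluded_url)}
canonical_excluded = {canonical_key(excluded_url)}

# A visitor asks for the https variant of the same page.
request = "https://blog.reidreport.com/"
print(naive_key(request) in naive_excluded)          # False: the request slips past
print(canonical_key(request) in canonical_excluded)  # True: the exclusion still applies
```

The point of the sketch is only that the lookup key and the exclusion key must be normalized the same way; any mismatch between them behaves like the hole described above.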

How can you use multiple web archives? The services archive.is and perma.cc are on-demand public web archives that allow submission of individual pages (similar to the “save page now” feature at the Internet Archive), webrecorder.io allows for the creation of personal web archives, and the Los Alamos National Laboratory Time Travel service allows for querying of multiple web archives (for example, blog.reidreport.com is held in five different web archives other than the Internet Archive).
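
Querying several archives at once is typically done with the Memento protocol (RFC 7089), in which a client asks a TimeGate for the capture closest to a desired time by sending an Accept-Datetime header. The sketch below only builds such a request and does not send it. The TimeGate base URL is an assumption about how the Time Travel service is addressed and should be checked against the service's documentation before use; build_timegate_request is a made-up helper name.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# Assumed TimeGate endpoint; verify against the service's documentation.
TIMEGATE_BASE = "http://timetravel.mementoweb.org/timegate/"


def build_timegate_request(original_url: str, when: datetime) -> Request:
    """Build (but do not send) a Memento TimeGate request for `original_url`."""
    # Accept-Datetime takes an HTTP-style date expressed in GMT.
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return Request(
        TIMEGATE_BASE + original_url,
        headers={"Accept-Datetime": http_date},
    )


wanted = datetime(2006, 1, 11, 22, 17, 38, tzinfo=timezone.utc)
req = build_timegate_request("http://blog.reidreport.com/", wanted)

print(req.full_url)
# urllib stores header names as "Accept-datetime"; HTTP header names
# are case-insensitive, so this is the same header on the wire.
print(req.get_header("Accept-datetime"))
```

A TimeGate that aggregates several archives would answer with a redirect to whichever archive holds the closest capture, so the same request works regardless of which archive that turns out to be.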

Understand the limitations of web archives

It is important to note there are significant limitations with web archiving. Archived pages can be incomplete or render incorrectly because of Javascript problems, archives themselves can be hacked or otherwise untrustworthy, and archives can contain temporal violations: correctly archived images, Javascript, CSS, and other embedded resources can be combined in an HTML page to render a page that never actually existed on the live web. For example, Reid’s lawyers noted “There are no public comments on the fraudulent postings. If the fraudulent posts had been contemporaneously written, there would have been substantial comments and blow back.” The problem is Reid’s blog (ca. 2006) used Javascript to track comments, and while the HTML page above was archived on 2006-01-11, the Javascript index for the comments was archived on 2006-02-07, nearly a month later. Since these pages are out of sync, we cannot definitively know how many comments each page had on 2006-01-11, and to make matters worse the Javascript interprets missing information as zero comments, giving a false sense of lower public engagement with Reid’s blog. Furthermore, since the comment links were built with Javascript, which the Internet Archive’s crawler did not parse at the time, it is unlikely that the 2006 comments are archived anywhere.

Image: Michael L. Nelson
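
The temporal mismatch described above can be made concrete with a small check. The two capture dates come from the text; the one-day tolerance, the post identifiers, and the comment counts are hypothetical values chosen purely for illustration.

```python
from datetime import date, timedelta
from typing import Optional

# Capture dates stated in the text.
html_captured = date(2006, 1, 11)
comments_js_captured = date(2006, 2, 7)


def is_coherent(root: date, embedded: date,
                tolerance: timedelta = timedelta(days=1)) -> bool:
    """True if the embedded resource was captured close enough to the root page.

    The one-day tolerance is an arbitrary, hypothetical threshold.
    """
    return abs(embedded - root) <= tolerance


gap = comments_js_captured - html_captured
print(gap.days)
print(is_coherent(html_captured, comments_js_captured))

# Hypothetical comment index: only one post has an entry at all.
comment_index = {"hypothetical-post-a": 12}


def count_as_rendered(post_id: str) -> int:
    """Mimics the behavior described above: missing data is shown as zero."""
    return comment_index.get(post_id, 0)


def count_or_unknown(post_id: str) -> Optional[int]:
    """Keeps 'no data' (None) distinct from 'zero comments'."""
    return comment_index.get(post_id)


print(count_as_rendered("hypothetical-post-b"))
print(count_or_unknown("hypothetical-post-b"))
```

The 27-day gap fails the coherence check, and the last two lines show why a rendered "0 comments" cannot be read as evidence that nobody commented.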

There are additional concerns when comparing versions from different web archives: services like archive.is radically transform pages at archival time, rewriting the HTML and removing all Javascript, so the resulting page might render differently than it did when it was on the live web. GeoIP, personalization, mobile vs. desktop, and other issues can produce different versions of the same page in different archives, so comparisons between archives can be difficult.

Ultimately, Reid’s version of events was not supported by the archived pages themselves. When the Internet Archive’s redaction policy was enacted, her argument was further undermined by the existence of additional web archives. Even though Reid’s story has likely ended, it is only a matter of time before a similar story unfolds. For those who seek to hold public figures accountable, a more rigorous interaction with and presentation of archived pages will limit uncertainty. For those on the receiving end of such scrutiny, a more careful consideration of the scope (as well as limitations) of not just the Internet Archive but all public web archives will better inform their response.