Saving Human Knowledge at 800 Pages an Hour
Inside the Internet Archive's book scanning centre.
Images: Victoria Turk
Book Scanner Helen Claes
On the top floor of the Wellcome Library, 12 people sit in the darkness of blackout blinds, illuminated by the white LEDs of a scanning machine.
This is the Euston Scan Centre, currently home to a team from the Internet Archive. They are part of an ambitious project to digitise the 19th-century book collections of 10 UK libraries relating to medicine. Each team member aims to scan 800 pages an hour. Together they have scanned over 2.5 million pages since they started with a full staff in October, and will have done 16-17 million by 2016.
The room smells of a modern library: dust, carpet tiles, warm electronics. It's silent but for the "snap" sound when an image is captured, and the irregular squeaking of the foot pedals that operate the scanner beds.
Chris Booth, the Internet Archive's UK regional digitisation manager, talked me through the huge digitisation project. It starts with the books, which come from the Wellcome Library and other partner libraries. The definition of medical books is loose; texts on pseudoscience like phrenology are included. The books arrive in orange crates, having been pre-checked to make sure there are no duplicates already online. Each is given a stable URL from the start as a unique identifier.
On the shelves, they're checked for scanning suitability. Some really thick tomes won't work, as the scanner can't reach right into the "gutter" of the pages, leaving words chopped off—"because they didn't think about digitizing in the 19th century," says Booth. Many have a bandage of white ribbon holding their pages together so they don't crumble apart. Booth tells me some even have uncut pages: After all this time, they've never been opened.
The point of the digitization project is to make sure these books do get read, or at least that they're available to whoever might want to read them.
The machines used by the scanning team, called Table Top Scribes, are based on an open source design. The operator places an open book on a V-shaped platform, then uses the foot pedal to lift it against a V-shaped glass plate. Two Nikon cameras snap the two pages at once. "They really don't do much damage at all," Booth said, pointing out that LED light is used to prevent any ultraviolet damage. In fact, he said, it was often "a holiday for the books," which would be dealt with less carefully in regular circulation.
Once they're scanned, the pages appear on a monitor and the operator checks that all the text has been captured; a team in the US also does quality checks. The software used, Scribe, was built by the Internet Archive.
It's methodical work, but the team often comes across interesting finds—gory images of diseases, weird out-of-date views on pregnancy from male authors, even polite rejection letters from editors used as bookmarks—and they share the best with each other over Skype. Most scanners have headphones in and I catch one with a small window in the corner of her screen, simultaneously checking pages and watching a YouTube video.
A couple of bigger Scribes can work with larger books, and a separate table is set up to photograph pull-out diagrams, artwork, and maps, which occur frequently in the old texts. Booth picks a couple at random to show me; one depicts some sort of meticulously sketched ophthalmology test, another shows drawings of surgical tools and what looks like a mechanical head.
The digitized books will become part of the UK Medical Heritage Library, hosted at archive.org and mirrored by the Wellcome Library. They're published under a Creative Commons license, and Booth said several artists had already been in touch with collages made from images in the library.
Once the books are scanned, they're sent back to their library home to be forgotten or mishandled once again. At least their digital twins won't have to put up with bashed spines and dog-eared pages.