Natural language processing comes to paleontology.
Computers are increasingly being used to help us understand our origins, and writing the evolutionary history of life itself could soon be a joint effort between people and machines.
Paleontologists scour numerous journal articles to cobble together enough information about fossils to get a picture of life on earth at specific times. Developing macroevolutionary theory—how birds evolved from dinosaurs, for example—is a data-intensive endeavour dependent on how well researchers can aggregate and organize huge amounts of information from scattered sources.
To aid in this pursuit, researchers from around the globe have banded together to develop the Paleobiology Database. The PBDB is a dynamic database of fossil finds organized by geographic location, taxonomic relationships, and place in the evolutionary timeline. It began in 1998, and more than a decade later, researchers are still manually reviewing papers and entering information. Hundreds of studies have made use of its data.
That's a lot of time and brainpower that could have gone into producing science rather than into extracting facts from a huge corpus of paleontological texts and entering them into a database, a task ideally suited to a computer.
"I shudder to think—it's valuable, and it's provided a ton of science, so it's totally justified—but all the time and energy that's gone into people keystroking data into the Paleobiology Database… At some level, that's wasted effort," Shanan Peters, a professor of paleobiology at the University of Wisconsin-Madison and co-director of the PBDB's IT team, told me.
"Ideally, we'd like to get to a point where that time, that energy, and that effort, could be put into analyzing the results of data and syntheses and thinking creatively about leveraging them and assessing them," he continued.
Peters and a team of colleagues including Christopher Ré, the Stanford computer scientist who developed the DeepDive natural language processing system, are developing software that can do what the humans at the helm of the PBDB have spent so much time on: analyzing papers, extracting relevant points of information, and making connections between them.
In other words, they're designing a system to accomplish what proponents of automation in the manufacturing sector claim that robots will do: free human creativity from the bonds of manual, menial, and otherwise non-mental labour.
Building on the DeepDive framework, Peters and his colleagues built PaleoDeepDive, a system custom-tuned to generate a database similar to the PBDB. A paper outlining their work is available on the arXiv preprint server.
PaleoDeepDive converts images of journal articles into digital text and processes the language, extracting relevant terms and features. It relies on two things: rules that tell it what conventions to look for (that "Homo sapiens" is a binomial genus-species combination, for example), and training data that teaches it how statistically likely it is that a relationship identified by the rules is correct. With both, PaleoDeepDive can infer complex relationships between points of data.
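To make the two-step idea concrete, here is a minimal Python sketch of a rule that proposes candidate genus-species pairs and a simple statistical score that says how much to trust each one. It is an illustration only, not DeepDive's API or its actual inference method. The function names, the two features, and the tiny labeled training set are all hypothetical.

```python
import re
from collections import defaultdict

# Rule (assumed for illustration): a capitalized word followed by a
# lowercase word looks like a genus-species binomial.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")

def candidates(sentence):
    """Return every (genus, species) pair that matches the rule."""
    return BINOMIAL.findall(sentence)

def features(genus, species):
    """Hypothetical features describing one candidate pair."""
    suffix = "latin_suffix" if species.endswith(("us", "is", "ens", "um", "a")) else "other_suffix"
    length = "long_genus" if len(genus) > 3 else "short_genus"
    return (suffix, length)

def train(labeled):
    """Estimate P(correct | features) by counting, with add-one smoothing."""
    counts = defaultdict(lambda: [0, 0])  # features -> [incorrect, correct]
    for (genus, species), is_correct in labeled:
        counts[features(genus, species)][int(is_correct)] += 1
    return {f: (c[1] + 1) / (c[0] + c[1] + 2) for f, c in counts.items()}

def score(model, genus, species):
    """Probability that a candidate is a real binomial; 0.5 if unseen."""
    return model.get(features(genus, species), 0.5)

# Made-up training examples: candidate pair, and whether a human accepted it.
TRAINING = [
    (("Homo", "sapiens"), True),
    (("Tyrannosaurus", "rex"), True),
    (("Archaeopteryx", "lithographica"), True),
    (("The", "fossil"), False),
    (("This", "specimen"), False),
    (("New", "material"), False),
]

model = train(TRAINING)
for genus, species in candidates("The fossil was assigned to Homo sapiens."):
    print(genus, species, score(model, genus, species))
```

The rule deliberately overgenerates: it flags "The fossil" as well as "Homo sapiens". The learned score is what separates the two, which is the division of labour the paragraph above describes.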
In a kind of modified Turing Test, Peters and his team presented PBDB contributors with a set of 400 facts and the source documents they were pulled from, then asked the scientists to vet the accuracy of the facts. What they didn't tell their unsuspecting volunteers was that half of the facts came from the human-curated PBDB and half were compiled by PaleoDeepDive.
"The results were that the machine did just as well as humans," Peters said. "In other measures, it did better than humans. In simple taxonomic relationships, for example. So, on that level, it turned out pretty well."
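The blind comparison can be pictured with a short Python sketch: pool facts from both sources, shuffle them so reviewers cannot tell which is which, then tally the reviewers' verdicts per source. This is only an illustration of the protocol as described here, not the team's actual evaluation code. The function names, the placeholder facts, and the verdicts are all hypothetical.

```python
import random

def blind_batch(human_facts, machine_facts, seed=0):
    """Pool both sources and shuffle so the origin of each fact is hidden."""
    batch = [(fact, "human") for fact in human_facts]
    batch += [(fact, "machine") for fact in machine_facts]
    random.Random(seed).shuffle(batch)
    return batch

def accuracy_by_source(batch, verdicts):
    """verdicts[i] is True if the reviewer judged batch[i] to be correct."""
    totals = {"human": [0, 0], "machine": [0, 0]}  # source -> [correct, seen]
    for (_, source), ok in zip(batch, verdicts):
        totals[source][0] += int(ok)
        totals[source][1] += 1
    return {s: c / n for s, (c, n) in totals.items() if n}

# Placeholder facts; a real batch would hold 400 facts, half from each source.
batch = blind_batch(["h1 ok", "h2 ok"], ["m1 ok", "m2 wrong"])
# Reviewers see only the fact text (plus the source paper), never the label.
verdicts = [fact.endswith("ok") for fact, _ in batch]
print(accuracy_by_source(batch, verdicts))
```

Keeping the source label out of the reviewers' view is the point of the design: any difference in the per-source tallies then reflects the facts themselves, not the reviewers' expectations about machines.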
Not everything went smoothly, however. Peters and his colleagues discovered that human reviewers often update findings in old papers with modern knowledge, or include dates from other sources that add to the richness of the data.
It appears that a good deal of human creativity (something that is incredibly difficult, if not impossible, to reproduce with a computer) is at play. Even so, Peters told me, there are ways to improve PaleoDeepDive: improving its ability to recognize text, for example, or, at base, feeding it more data so that its statistical calculations become more robust and its inferences more complex.
In just a few years, Peters hopes to have turned PaleoDeepDive, which is still in its nascent stages, into an effective and indispensable partner for humans in the quest to track the evolutionary history of life on earth.
"I know that there are measurements out there that if I could magic them into a synthetic form in something like an Excel spreadsheet… If I could do that right now, there's all kinds of interesting questions I could address. There's all kinds of things that I would learn," Peters said. "What we're hoping to create is a platform to allow that creativity to not be limited by time and effort."