A Massive New Scientific Database Will Map the Spread of Infectious Disease

Vast amounts of data already existed across various databases and scientific literature. What researchers lacked was a means of bringing it all together, and visualizing it.

Preventing, tracking, and controlling outbreaks requires a variety of tactics (DNA/RNA sequencing, inoculation, education, quarantine), but data is gold in the fight against diseases and pathogens. Now, researchers at the University of Liverpool are building the Enhanced Infectious Disease Database (EID2), which they describe in a press release as "the world's most comprehensive database describing human and animal pathogens, which can be used to prevent and tackle disease outbreaks around the globe."

The project was launched when the EID2 team realized that vast amounts of data already existed across various databases and scientific literature. The most critical data lies in scientists' sequencing of a disease's RNA and DNA, which has already been uploaded to various searchable databases. What researchers lacked was a means of bringing all this data together in one virtual location.

EID2 pulls metadata on pathogens' DNA and RNA sequences from two resources run by the National Center for Biotechnology Information (NCBI) at the US National Library of Medicine: the NCBI Nucleotide database and PubMed, NCBI's database of scientific publications.
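For readers curious what pulling sequence metadata from NCBI can look like in practice, here is a minimal Python sketch built on NCBI's public E-utilities ESearch endpoint. It is not the EID2 team's pipeline, which the article does not describe in detail. The function names, the choice of the `[Organism]` search field, and the sample reply used to exercise the parser are all assumptions made for illustration.

```python
import json
from urllib.parse import urlencode

# NCBI's public E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(organism, retmax=20):
    """Build an ESearch URL listing Nucleotide record IDs for one organism.

    Hypothetical helper: restricting the search with the [Organism] field
    is an illustrative choice, not a description of how EID2 queries NCBI.
    """
    params = {
        "db": "nucleotide",
        "term": f'"{organism}"[Organism]',
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_esearch(payload):
    """Return (total hit count, list of record IDs) from an ESearch JSON reply."""
    result = json.loads(payload)["esearchresult"]
    return int(result["count"]), result["idlist"]


if __name__ == "__main__":
    # Printing the URL only; fetching it is left to the reader.
    print(build_esearch_url("Rabies lyssavirus"))
```

The sketch stops at building the request and parsing the reply. A real harvester would also need rate limiting, paging through results, and a second call to retrieve each record's host and location fields.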

"[EID2] is matchless in scale, and has the capacity to hold data on all known human and animal pathogens, when detailed information becomes available," said epidemiologist Dr. Marie McIntyre, a member of the EID2 team. "We use largely automated procedures to collate data on human and animal pathogens: where, when, and in which hosts there is evidence of their occurrence."

Dr. McIntyre told Motherboard that EID2 trawls data using scientific names and alternative names or synonyms for pathogens and hosts "to identify where and in which hosts pathogens occur."

"Where a host has a 'difficult' common name such as 'dog', 'cow,' (there are dog foxes and dog fishes and the plant dog mercury, and because there are commonalities like a New York in Lincolnshire as well as New York state," said McIntyre. "We've in some cases had to design special algorithms to correctly identify the hosts and locations of where pathogens occur."

McIntyre said that the team has been building EID2 for the last four or five years. Originally, they designed it for the ERA NET Health and Climate in Europe (ENHanCE) project.

"It was created by ourselves and a few past members of the LUCINDA group," she said. "It's unique in its scale and in the way it links to evidence, although some databases of for example human and domestic animal pathogens have been created before, they don't have the same resolution of information about where lots of pathogens occur, and in which hosts they occur, and the EID2 links directly back to the scientific evidence for each bit of information it contains."

In the design stage, McIntyre said the team discussed databases like Google Flu Trends. But while Twitter has been useful in tracking infectious disease, social media's unexpected utility for virus tracking wasn't a consideration. McIntyre noted that this was because the team was not, and still is not, concentrating on tracking diseases and pathogens as they emerge. Instead, they are more interested in identifying the baseline data for diseases and pathogens.

All told, more than 60 million pieces of data funnel into EID2. And since it's open source, more pieces of data are constantly being added. From there, researchers can visualize the desired data (though not with EID2 itself).

One graphic, provided by team member Dr. Maya Wardeh, uses circles to show EID2 data describing "the number and types of pathogens found in EU countries." The largest circles indicate the largest number of pathogen species the EID2 team has evidence of, while the various colors represent viruses (blue), bacteria (greenish yellow), fungi (red), helminths (purple blue), and protozoa (orange).

Another data visualization, resembling a color wheel, depicts pathogen species for which EID2 has information:

The innermost circle represents the number of species listed within the NCBI taxonomy database in the major groupings that contain pathogens (denoted as ‘Species’, www.ncbi.nlm.nih.gov/taxonomy). The intermediate circle is those for which sequences (and metadata potentially describing their host origin) are available in the NCBI Nucleotide database (denoted as ‘Sequenced’, www.ncbi.nlm.nih.gov/nucleotide). The outermost circle represents pathogen species for which data has been captured about the hosts in which they occur within the EID2 database itself (denoted as ‘Cargo’).

Data derived from these sources on pathogen hosts, areas of outbreak and occurrence, etc., can be viewed and improved on EID2 over time, giving researchers a clearer picture of the global infectious disease landscape through what is called disease mapping.

"This disease mapping is one of the most important areas where EID2 can be a valuable tool," reads the University of Liverpool press release. "Research has shown that only four percent of clinically-important diseases in humans have been geographically mapped, despite half having a strong rationale for mapping."

The EID2 team hopes to make the disease mapping process quicker and more accurate. Not only will the database be able to present a global picture, but it will also allow researchers to scale their research down to the county level, as well as map crop diseases, which have the potential to infect animal species before moving on to humans.

EID2 can, according to McIntyre, also leverage network analysis to dive into the interactions between pathogens and hosts. This will allow researchers to study the "possible routes by which pathogens make it into human populations."
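To make the idea of network analysis concrete, here is a small Python sketch that treats pathogen and host records as a two-sided network and asks which hosts are linked by shared pathogens. It is only an illustration under stated assumptions: the records are invented placeholders rather than EID2 data, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (pathogen, host) records for illustration only; not EID2 data.
RECORDS = [
    ("pathogen A", "human"),
    ("pathogen A", "dog"),
    ("pathogen B", "dog"),
    ("pathogen B", "cow"),
    ("pathogen C", "cow"),
    ("pathogen C", "human"),
    ("pathogen D", "cow"),
]


def hosts_by_pathogen(records):
    """Group the records into a mapping of pathogen -> set of hosts."""
    grouped = defaultdict(set)
    for pathogen, host in records:
        grouped[pathogen].add(host)
    return dict(grouped)


def shared_pathogens(records, host_a, host_b):
    """Return the pathogens recorded in both hosts."""
    return {
        pathogen
        for pathogen, hosts in hosts_by_pathogen(records).items()
        if host_a in hosts and host_b in hosts
    }


def host_links(records):
    """Project the pathogen-host network onto hosts.

    Each key is an alphabetically sorted pair of hosts; each value is the
    number of pathogens the two hosts have in common.
    """
    links = defaultdict(int)
    for hosts in hosts_by_pathogen(records).values():
        for pair in combinations(sorted(hosts), 2):
            links[pair] += 1
    return dict(links)


if __name__ == "__main__":
    print(shared_pathogens(RECORDS, "human", "cow"))
    print(host_links(RECORDS))
```

In a projection like this, host pairs that share many pathogens stand out as candidate routes worth a closer look, though a shared pathogen on its own says nothing about the direction or likelihood of transmission.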