Even highly accurate genetic sequences can have hundreds of thousands of errors—and it only takes a few.
If genomes are going to revolutionize personalized medicine, the first step will be sequencing the genome accurately.
It bears repeating just how far this tech has come: the price of sequencing a genome is rapidly coming down, as is the time it takes to sequence one. It’s getting so easy that the price point is already well within the means of many middle-class Americans, and the technology might soon prove useful enough to save lives. Proponents say that, in the future, personalized medicine will allow doctors to determine the specific genetic variants that predispose their patients to certain diseases, which will then help doctors devise individualized, more effective treatments.
But with roughly six billion base pairs in the human genome, creating a truly accurate genome sequence is no easy task. Even the best sequencing techniques can have a raw error rate around 1 percent, and even after most of those mistakes are caught, a finished genome can still contain hundreds of thousands of errors. When diseases depend on single nucleotide insertions or changes, those errors can mean the difference between a misdiagnosis and an accurate one.
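The scale of the problem can be checked with some quick arithmetic. The sketch below uses the six-billion and 1 percent figures from the text; the value of 300,000 is only an illustrative stand-in for "hundreds of thousands," and the function name is hypothetical.

```python
GENOME_SIZE = 6_000_000_000  # base pairs, the article's "roughly six billion"

def expected_errors(error_rate: float, genome_size: int = GENOME_SIZE) -> float:
    """Expected number of wrong bases if each base is wrong with probability error_rate."""
    return error_rate * genome_size

# A raw 1 percent per-base error rate, before any correction.
raw_errors = expected_errors(0.01)

# Assumed, illustrative value for "hundreds of thousands" of remaining errors.
residual_errors = 300_000
residual_rate = residual_errors / GENOME_SIZE

print(f"errors at a raw 1% rate: {raw_errors:,.0f}")
print(f"per-base rate implied by {residual_errors:,} errors: {residual_rate:.4%}")
```

Under these assumptions, an uncorrected 1 percent rate would mean about 60 million wrong bases, while 300,000 errors corresponds to a per-base rate of about 0.005 percent. Even a sequence that is more than 99.99 percent correct leaves plenty of room for a clinically important mistake.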
A group of researchers with the US government’s National Institute of Standards and Technology is trying to solve that problem with a program called Genome in a Bottle. With academic and commercial partners, the group is trying to create what is essentially one “perfect” human genome that can be a reference for sequencing labs. Though every genome is different, the places where sequencing errors most commonly happen are fairly well understood, and by comparing one sequence with a reference genome, doctors and researchers would be able to tell if they’ve made a mistake.
“We’re sitting here with billions of data pairs. It boggles the mind to try to get that much information accurately determined,” said Marc Salit of NIST’s Genome Scale Measurements Group. “Even when we think we’re getting it right, a few missing bases or additional ones can make a huge difference.”
Salit and his colleague, Justin Zook, recently published a study in Nature Biotechnology discussing their solution to the problem. According to Salit, by sequencing the same genome many times and comparing the base pairs, they can create a reference that is much more accurate than what we already have.
“To develop a benchmark whole-genome data set, we have developed methods to integrate sequencing data sets from multiple sequencing technologies … the resulting genotype calls are more sensitive and specific and less biased than any individual data set,” they write.
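The general idea of combining several data sets can be illustrated with a simple majority vote at each position. This is not the method from the study, which the article does not describe in detail; it is a toy sketch in which the function name, the agreement threshold, and the sequences are all invented for illustration.

```python
from collections import Counter

def consensus_calls(call_sets, min_agreement=0.75):
    """Combine per-position base calls from several data sets.

    A position gets the most common base if at least `min_agreement`
    of the data sets agree on it; otherwise it is marked "N" (uncertain).
    """
    consensus = []
    for calls in zip(*call_sets):
        base, votes = Counter(calls).most_common(1)[0]
        if votes / len(calls) >= min_agreement:
            consensus.append(base)
        else:
            consensus.append("N")
    return "".join(consensus)

# Hypothetical calls for the same short region from four data sets.
# Two of them each contain one isolated error.
datasets = [
    "ACGTTAGC",
    "ACGTTAGC",
    "ACCTTAGC",
    "ACGTAAGC",
]
result = consensus_calls(datasets)
print(result)
```

Because the two errors fall at different positions, the other three data sets outvote each one. Where the data sets split more evenly, the sketch refuses to guess and flags the position instead.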
There are already a few ways researchers try to get around the problem of genome inaccuracies today. Genomes aren’t sequenced as a whole, from start to finish. Instead, they’re cut up into tiny, overlapping pieces that are sequenced separately, then put back together. The reason for doing this is twofold: It’s much faster than doing it in one fell swoop, and the fact that the sequences overlap means that errors can be caught more easily.
“When every part of the genome is covered 30 different times, that error rate should begin to go away,” said Christopher Black of the New York Genome Center. “But the problem is, when you have more coverage, you have more data so that requires more storage.”
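A rough calculation shows why deeper coverage helps. The sketch below assumes each read is wrong at a given position independently with a 1 percent chance, and treats a position as miscalled only if more than half the reads covering it are wrong. Both are simplifying assumptions, and the function name is hypothetical.

```python
from math import comb

def majority_error_probability(depth: int, per_read_error: float) -> float:
    """Probability that more than half of `depth` independent reads are wrong."""
    threshold = depth // 2 + 1
    return sum(
        comb(depth, k) * per_read_error**k * (1 - per_read_error) ** (depth - k)
        for k in range(threshold, depth + 1)
    )

for depth in (1, 5, 15, 30):
    p = majority_error_probability(depth, 0.01)
    print(f"{depth:>2}x coverage: P(majority wrong) ~ {p:.3g}")
```

Under this model the chance of a wrong call drops off steeply as coverage rises, while the volume of raw reads to store grows in direct proportion to the depth. The independence assumption is the weak point: errors that recur in the same hard-to-read regions do not average out this way.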
Researchers already know how and why most sequencing errors occur, which is important. For one, there’s the problem of scale: When you’re sequencing what is essentially six billion letters, it’s hard to have 100 percent accuracy from a sheer numbers standpoint. But beyond that, there are certain regions of certain chromosomes that are more difficult to sequence than others. In some parts of the genome, particular nucleotides will repeat for much longer than seems reasonable, which can trip up certain machines. In other parts, the nucleotides vary much more from person to person than they do elsewhere. Errors can easily occur there as well, the researchers say.
“If you have 15 A’s in a row, sequencers have a problem going through that,” Zook said. “Then there are regions that are really similar to each other, and it can be hard to know where in the genome it came from.”
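Both kinds of trouble spot are easy to find in a toy example. The sketch below flags long runs of a single base and counts how many places a short read could have come from. The reference string, the read, and the function names are invented for illustration and are far smaller than anything a real pipeline would handle.

```python
import re

def homopolymer_runs(sequence, min_length=10):
    """Return (start, base, length) for each run of one base at least min_length long."""
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_length - 1))
    return [(m.start(), m.group(1), len(m.group(0))) for m in pattern.finditer(sequence)]

def count_placements(read, reference):
    """Count every position where `read` matches `reference` exactly."""
    count = 0
    start = reference.find(read)
    while start != -1:
        count += 1
        start = reference.find(read, start + 1)
    return count

# Hypothetical reference: 15 A's in a row, plus the stretch "GATTACG" appearing twice.
reference = "CCGATTACG" + "A" * 15 + "TGCGATTACGTT"

runs = homopolymer_runs(reference)
placements = count_placements("GATTACG", reference)
print(runs)
print(placements)
```

A read that matches more than one location cannot be placed with confidence from its sequence alone, which is the ambiguity Zook describes.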
Then there’s the problem of putting it all back together. Salit says that, because genomes aren’t sequenced in one long line, it’s “like putting a book through a paper shredder then trying to read it.”
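Salit’s analogy can be made concrete with a few lines of code that stitch shredded text back together by looking for overlaps. This is a deliberately naive greedy approach, with hypothetical function names, and is not a description of how real genome assemblers work.

```python
def overlap(left, right, min_length=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def greedy_assemble(fragments, min_length=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    pieces = list(fragments)
    while len(pieces) > 1:
        best = (0, None, None)
        for i, a in enumerate(pieces):
            for j, b in enumerate(pieces):
                if i != j:
                    size = overlap(a, b, min_length)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # no overlaps left; the remaining pieces stay unjoined
        merged = pieces[i] + pieces[j][size:]
        pieces = [p for k, p in enumerate(pieces) if k not in (i, j)] + [merged]
    return pieces

# A "shredded" line of text, with the strips out of order.
strips = ["the best of", "it was the b", "st of times"]
assembled = greedy_assemble(strips)
print(assembled)
```

The toy works because every overlap here is unique. When the same phrase appears in several places, as repeated stretches do in a genome, a greedy merge can join the wrong strips, which is part of why reassembly is its own source of error.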
Ideally, Genome in a Bottle will be able to test the accuracy of different sequencing techniques and machines, and companies will be able to choose the most accurate one. Creating the reference genome should help us get there faster. If it doesn’t, then the future of personalized medicine might be further away than we think.
“Most of the genome we can be pretty confident we’ve gotten the right answer,” Zook said. “But the challenge comes with these difficult regions of the genome. I think we’ll get there at some point.”