R&D Upgrade From University of Texas at Austin: Power of DNA to Store Information

From The University of Texas at Austin

A team of interdisciplinary researchers has discovered a new technique to store information in DNA – in this case ‘The Wizard of Oz,’ translated into Esperanto – with unprecedented accuracy and efficiency.

The technique harnesses the information-storage capacity of intertwined strands of DNA to encode and retrieve information in a way that is both durable and compact.

The technique is described in a paper in Proceedings of the National Academy of Sciences.

“The key breakthrough is an encoding algorithm that allows accurate retrieval of the information even when the DNA strands are partially damaged during storage,” said Ilya Finkelstein, associate professor, molecular biosciences, and one of the authors of the study.

Humans are creating information at exponentially higher rates than in the past, which increases the need to store more of it efficiently and in a form that will last a long time. Companies such as Google and Microsoft are among those exploring using DNA to store information.

“We need a way to store this data so that it is available when and where it’s needed in a format that will be readable,” said Stephen Jones, research scientist, who collaborated on the project with Finkelstein; Bill Press, professor, jointly appointed in computer science and integrative biology; and John Hawkins, Ph.D. alumnus. “This idea takes advantage of what biology has been doing for billions of years: storing lots of information in a very small space that lasts a long time. DNA doesn’t take up much space, it can be stored at room temperature, and it can last for hundreds of thousands of years.”

DNA is about five million times more space-efficient than current storage methods. Put another way, a one-milliliter droplet of DNA could store the same amount of information as two Walmarts full of data servers. And unlike those servers, DNA requires neither permanent cooling nor hard disks, which are prone to mechanical failures.

There’s just one problem: DNA is prone to errors. And when a genetic code has errors, it’s a lot different from when a computer code has errors. Errors in computer codes tend to show up as blank spots in the code. Errors in DNA sequences show up as insertions or deletions. The problem there is that when something is deleted or added in DNA, the whole sequence shifts, with no blank spots to alert anyone.
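The shift described above can be illustrated with a short sketch. The sequence and the helper name `naive_mismatches` are made up for this example and are not taken from the study; the function simply compares a read to the original base by base, with no alignment.

```python
def naive_mismatches(reference: str, read: str) -> int:
    """Count position-by-position disagreements, with no attempt at alignment.

    Any difference in length is counted as additional mismatches.
    """
    differing = sum(1 for a, b in zip(reference, read) if a != b)
    return differing + abs(len(reference) - len(read))


reference = "ACGTTGCAAGCT"

# Substitution: the base at index 4 is changed, everything else stays in place.
substituted = reference[:4] + "A" + reference[5:]

# Deletion: the base at index 4 is removed, so every later base shifts left.
deleted = reference[:4] + reference[5:]

print(naive_mismatches(reference, substituted))
print(naive_mismatches(reference, deleted))
```

A single substitution disturbs one position, while a single deletion makes most of the downstream bases look wrong to a reader that does not know a shift occurred.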

Previously, when information was stored in DNA, the piece of information that needed to be saved, such as a paragraph from a novel, would be repeated 10 to 15 times. When the information was read, the repetitions would be compared to eliminate any insertions or deletions.
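As a rough illustration of the repetition approach, the sketch below takes a majority vote at each position across several copies of the same sequence. It is a simplification under stated assumptions: the copies are assumed to be already aligned and of equal length, so it only corrects substitutions, whereas handling insertions and deletions would first require aligning the reads. The function name `consensus` and the sample data are hypothetical.

```python
from collections import Counter


def consensus(reads: list[str]) -> str:
    """Return the per-position majority base across equal-length reads."""
    if len({len(read) for read in reads}) != 1:
        raise ValueError("this sketch assumes aligned reads of equal length")
    # zip(*reads) yields one column of bases per position.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


original = "ACGTTGCAAGCT"
reads = [
    original,
    original[:2] + "T" + original[3:],    # substitution at index 2
    original,
    original[:9] + "A" + original[10:],   # substitution at index 9
    original,
]

print(consensus(reads) == original)
```

The cost of this approach is visible in the example: five full copies are stored and read in order to recover one.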

“We found a way to build the information more like a lattice,” Jones said. “Each piece of information reinforces other pieces of information. That way, it only needs to be read once.”

The language the researchers developed also avoids sections of DNA that are prone to errors or that are difficult to read. The parameters of the language can also change with the type of information that is being stored. For instance, a dropped word in a novel is not as big a deal as a dropped zero in a tax return.
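The paper's abstract names excess repeats and windowed guanine-cytosine (GC) content as examples of such constraints. A minimal sketch of what checking them might look like follows; the function names, window size and thresholds are assumptions chosen for illustration, not values from the study, and the check is applied to a finished sequence rather than built into the encoder as HEDGES does.

```python
def max_run_length(seq: str) -> int:
    """Length of the longest run of one repeated base (0 for an empty string)."""
    longest = current = 0
    previous = None
    for base in seq:
        current = current + 1 if base == previous else 1
        longest = max(longest, current)
        previous = base
    return longest


def gc_fraction(window: str) -> float:
    """Fraction of bases in the window that are G or C."""
    return sum(base in "GC" for base in window) / len(window)


def satisfies_constraints(
    seq: str,
    window: int = 12,      # hypothetical window size
    gc_min: float = 0.3,   # hypothetical lower GC bound
    gc_max: float = 0.7,   # hypothetical upper GC bound
    max_run: int = 4,      # hypothetical repeat limit
) -> bool:
    """Check the repeat limit and the GC bounds in every sliding window."""
    if not seq:
        return True
    if max_run_length(seq) > max_run:
        return False
    # Sequences shorter than the window are checked as a single window.
    for start in range(max(1, len(seq) - window + 1)):
        if not gc_min <= gc_fraction(seq[start:start + window]) <= gc_max:
            return False
    return True


print(satisfies_constraints("ACGTACGTACGTACGT"))
print(satisfies_constraints("AAAAAACGTACG"))
print(satisfies_constraints("ATATATATATATAT"))
```

Loosening or tightening these parameters is one way to picture how such a code could be tuned to the DNA synthesis and sequencing methods in use.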

To demonstrate information retrieval from degraded DNA, the team subjected its ‘Wizard of Oz’ code to high temperatures and extreme humidity. Even though the DNA strands were damaged by these harsh conditions, all the information was still decoded successfully.

“We tried to tackle as many problems with the process as we could at the same time,” said Hawkins, who was recently with UT’s Oden Institute for Computational Engineering and Sciences. “What we ended up with is pretty remarkable.”

Bill Press is the Warren J. and Viola M. Raymer Professor, Computer Science and Integrative Biology, UT Austin and a member of the National Academy of Sciences. The research was funded by a College of Natural Sciences Catalyst Grant, the Welch Foundation and the National Institutes of Health.

Article: HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints

Proceedings of the National Academy of Sciences has published an article written by William H. Press (Department of Computer Science and Department of Integrative Biology); John A. Hawkins (Oden Institute for Computational Engineering and Sciences, Department of Molecular Biosciences, and Institute for Cellular and Molecular Biology); and Stephen K. Jones Jr., Jeffrey M. Schaub and Ilya J. Finkelstein (Department of Molecular Biosciences and Institute for Cellular and Molecular Biology), all at The University of Texas at Austin, Austin, TX 78712.

Abstract: “Synthetic DNA is rapidly emerging as a durable, high-density information storage platform. A major challenge for DNA-based information encoding strategies is the high rate of errors that arise during DNA synthesis and sequencing. Here, we describe the HEDGES (Hash Encoded, Decoded by Greedy Exhaustive Search) error-correcting code that repairs all three basic types of DNA errors: insertions, deletions, and substitutions. HEDGES also converts unresolved or compound errors into substitutions, restoring synchronization for correction via a standard Reed–Solomon outer code that is interleaved across strands. Moreover, HEDGES can incorporate a broad class of user-defined sequence constraints, such as avoiding excess repeats, or too high or too low windowed guanine–cytosine (GC) content. We test our code both via in silico simulations and with synthesized DNA. From its measured performance, we develop a statistical model applicable to much larger datasets. Predicted performance indicates the possibility of error-free recovery of petabyte- and exabyte-scale data from DNA degraded with as much as 10% errors. As the cost of DNA synthesis and sequencing continues to drop, we anticipate that HEDGES will find applications in large-scale error-free information encoding.”
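The abstract's mention of an outer code "interleaved across strands" can be pictured with a small sketch. It shows only the interleaving idea, using a plain transpose, and does not implement Reed–Solomon coding or the layout actually used in the paper; the function names and the toy codewords are assumptions for illustration.

```python
def interleave(codewords: list[str]) -> list[str]:
    """Strand i carries the i-th symbol of every codeword (a simple transpose)."""
    return ["".join(column) for column in zip(*codewords)]


def deinterleave(strands: list[str]) -> list[str]:
    """Undo interleave() by transposing back."""
    return ["".join(column) for column in zip(*strands)]


# Toy codewords; letters stand in for symbols of an outer code.
codewords = ["abcd", "efgh", "ijkl"]
strands = interleave(codewords)

# Simulate losing one whole strand; "?" marks an unknown symbol.
damaged = list(strands)
damaged[1] = "?" * len(damaged[1])

recovered = deinterleave(damaged)
print(strands)
print(recovered)
```

Because each strand holds only one symbol of each codeword, losing an entire strand costs every codeword a single symbol, which is the kind of isolated error an outer code is designed to repair.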