R&D: Hash Encoded, Decoded by Greedy Exhaustive Search ECC for DNA Storage Corrects Indels and Allows Sequence Constraints

PNAS (Proceedings of the National Academy of Sciences) has published an article written by William H. Press (Department of Computer Science and Department of Integrative Biology); John A. Hawkins (Oden Institute for Computational Engineering and Sciences, Department of Molecular Biosciences, and Institute for Cellular and Molecular Biology); and Stephen K. Jones Jr., Jeffrey M. Schaub, and Ilya J. Finkelstein (Department of Molecular Biosciences and Institute for Cellular and Molecular Biology), all at The University of Texas at Austin, Austin, TX 78712.

Abstract: “Synthetic DNA is rapidly emerging as a durable, high-density information storage platform. A major challenge for DNA-based information encoding strategies is the high rate of errors that arise during DNA synthesis and sequencing. Here, we describe the HEDGES (Hash Encoded, Decoded by Greedy Exhaustive Search) error-correcting code that repairs all three basic types of DNA errors: insertions, deletions, and substitutions. HEDGES also converts unresolved or compound errors into substitutions, restoring synchronization for correction via a standard Reed–Solomon outer code that is interleaved across strands. Moreover, HEDGES can incorporate a broad class of user-defined sequence constraints, such as avoiding excess repeats, or too high or too low windowed guanine–cytosine (GC) content. We test our code both via in silico simulations and with synthesized DNA. From its measured performance, we develop a statistical model applicable to much larger datasets. Predicted performance indicates the possibility of error-free recovery of petabyte- and exabyte-scale data from DNA degraded with as much as 10% errors. As the cost of DNA synthesis and sequencing continues to drop, we anticipate that HEDGES will find applications in large-scale error-free information encoding.”
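To make the sequence constraints mentioned in the abstract more concrete, the following minimal Python sketch checks a DNA strand for long single-base repeats and for windowed GC content outside a chosen range. It is an illustration only, not the HEDGES implementation: the function names, the window length, and the thresholds (`gc_min`, `gc_max`, `max_run`) are hypothetical values chosen for the example, and HEDGES enforces constraints during encoding rather than by checking a finished strand afterwards.

```python
def max_homopolymer_run(seq: str) -> int:
    """Return the length of the longest run of one repeated base."""
    longest = 0
    current = 0
    prev = None
    for base in seq:
        current = current + 1 if base == prev else 1
        longest = max(longest, current)
        prev = base
    return longest


def windowed_gc_fractions(seq: str, window: int) -> list[float]:
    """Return the GC fraction of every length-`window` slice of seq."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(seq) < window:
        return []
    return [
        sum(base in "GC" for base in seq[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


def satisfies_constraints(
    seq: str,
    window: int = 12,     # hypothetical window length
    gc_min: float = 0.35, # hypothetical lower GC bound
    gc_max: float = 0.65, # hypothetical upper GC bound
    max_run: int = 3,     # hypothetical longest allowed repeat
) -> bool:
    """Check a strand against the illustrative repeat and GC limits."""
    if max_homopolymer_run(seq) > max_run:
        return False
    return all(
        gc_min <= frac <= gc_max
        for frac in windowed_gc_fractions(seq, window)
    )
```

For example, under these assumed thresholds `satisfies_constraints("ACGTACGTACGTACGT")` accepts the strand, while a strand containing `TTTTT` is rejected for its repeat length.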