R&D: Six Articles on DNA Data Storage

R&D: Comparison of state-of-the-art error-correction coding for sequence-based DNA data storage

Besides closing in on physical limits of DNA data storage, this study thus showcases maturity of error-correction coding and defines its current state-of-the-art.

BioRxiv has published an article written by Andreas L. Gimpel, Alex Remschak, Wendelin J. Stark, Department of Chemistry and Applied Biosciences, ETH Zürich, Vladimir-Prelog-Weg 1-5, 8093, Zürich, Switzerland, Reinhard Heckel, Department of Electrical and Computer Engineering, Technical University of Munich, Arcistrasse 21, 80333, Munich, Germany, and Robert N. Grass, Department of Chemistry and Applied Biosciences, ETH Zürich, Vladimir-Prelog-Weg 1-5, 8093, Zürich, Switzerland.

Abstract: “A wide range of codecs with vastly different error-correction approaches have been proposed and implemented for DNA data storage to date. However, while many codecs claim to provide superior performance, no studies have systematically benchmarked codec implementations to establish the current state-of-the-art in DNA data storage. In this study, we use standardized error scenarios – both in silico and in vitro – to compare the performance of six representative codecs from the literature. We find synthetic benchmarks commonly used in literature to be unsuitable indicators of codec performance, as our data shows that common experimental benchmarks fail to differentiate codecs under standardized conditions. Instead, we implement a comprehensive benchmark covering the major experimental parameters to assess codec performance under realistic DNA data storage conditions, while establishing important baselines for future codec development. Verifying our results with fair and standardized experiments, we demonstrate data storage at 43 EB g⁻¹ using synthesis by material deposition and 13 EB g⁻¹ using the more error prone electrochemical synthesis, employing only existing codecs from the literature. Besides closing in on the physical limits of DNA data storage, this study thus showcases the maturity of error-correction coding and defines its current state-of-the-art.”

R&D: Basecalling for DNA storage

Authors demonstrate for first time that DNA coding scheme constraints can be leveraged to optimise basecallers.

BioRxiv has published an article written by Advait Menon, Samira Brunmayr, Omer Sella, and Thomas Heinis, Department of Computing, Imperial College, London, UK.

Abstract: “DNA is a promising medium for data storage with its high information density and stability. To retrieve information stored in DNA, sequencing technologies are used to read the encoded bases. The raw signals from sequencing are mapped to a sequence of {A,C,T,G} by machine learning algorithms known as basecallers. Currently, basecallers are optimised mainly on biological DNA instead of focusing on characteristics or constraints unique to data-encoding DNA. Taking advantage of these unique artificial features to fine-tune and adapt the architecture, we demonstrate for the first time that DNA coding scheme constraints can be leveraged to optimise basecallers. Using low-rank adaptation on the basecalling model, we achieve substantial gains with high resource efficiency. Additionally, constraint-aware beam search provides improvements without requiring model retraining.”
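
To illustrate the general idea of a constraint-aware beam search, here is a minimal sketch in Python. It is not the authors' implementation: it assumes one probability distribution over bases per position (real basecallers decode raw signal, typically with more involved decoding), and it assumes the coding scheme forbids homopolymer runs longer than a hypothetical max_run. The function name and parameters are illustrative only.

```python
import math


def constrained_beam_search(step_probs, beam_width=3, max_run=2):
    """Beam search over per-position base probabilities.

    Hypotheses whose trailing homopolymer run would exceed max_run are
    pruned, standing in for a coding-scheme constraint (assumption).
    step_probs is a list of dicts mapping a base to its probability.
    """
    beams = [("", 0.0)]  # (sequence so far, cumulative log-probability)
    for probs in step_probs:
        candidates = []
        for seq, score in beams:
            for base, p in probs.items():
                if p <= 0:
                    continue  # log(0) is undefined; skip impossible bases
                new_seq = seq + base
                tail = new_seq[-(max_run + 1):]
                # Reject extensions ending in max_run + 1 identical bases.
                if len(new_seq) > max_run and len(set(tail)) == 1:
                    continue
                candidates.append((new_seq, score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0] if beams else ""


# Toy example: the unconstrained best path is "AAA", which violates max_run=2.
steps = [{"A": 0.9, "C": 0.1}, {"A": 0.9, "C": 0.1}, {"A": 0.6, "C": 0.4}]
print(constrained_beam_search(steps, max_run=2))
print(constrained_beam_search(steps, max_run=3))
```

Because the constraint is applied only at decoding time, a sketch like this leaves the underlying model untouched, which matches the abstract's point that no retraining is required.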

R&D: High-density DNA storage for vector images: hybrid encoding with error correction and contour-driven retrieval

Advancements address long-standing challenges in random access, data editing, and scalability, positioning our system as scalable solution for structured data storage in DNA.

Journal of Membrane Computing has published an article written by Chunxia Ge, Institute of Medical Artificial Intelligence, Binzhou Medical University, Yantai, 264003, Shandong, China, Xiaosheng Dong, School of Software Engineering, Huazhong University of Science and Technology, Wuhan, 430074, Hubei, China, and Zhidong Xue, Institute of Medical Artificial Intelligence, Binzhou Medical University, Yantai, 264003, Shandong, China, and School of Software Engineering, Huazhong University of Science and Technology, Wuhan, 430074, Hubei, China.

Abstract: “DNA inherently offers ultra-high storage density, exceptional longevity, and ultralow energy consumption, making it a transformative alternative to traditional semiconductor-based storage systems. Despite its potential, practical DNA storage faces critical bottlenecks, including low-throughput encoding, high error rates in molecular channels, and strict biological constraints. This study proposes a novel hybrid-encoding framework specifically designed for vector images—resolution-independent digital files retaining precision regardless of scaling. Our approach integrates: compressive hybrid encoding to maximize storage density while ensuring biological feasibility, error-resilient mechanisms (e.g., Reed–Solomon codes) mitigating DNA synthesis/sequencing errors, and biological constraint optimization by dynamically balancing GC content and homopolymer length. A visual interface tool automates bidirectional conversion between vector files and DNA sequences, enabling seamless storage, writing, and retrieval. Critically, we introduce contour-based associative retrieval, leveraging vector image topology to achieve similarity search across 100 images—an unprecedented feature in DNA storage systems. Performance evaluation through comprehensive simulations demonstrates: error reduction, precision retrieval and scalability. These advancements address long-standing challenges in random access, data editing, and scalability, positioning our system as a scalable solution for structured data storage in DNA.”
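
The two biological constraints named in the abstract, GC content and homopolymer length, can be checked with a few lines of Python. This is only a sketch of the checks themselves, not the paper's encoder; the thresholds (40% to 60% GC, runs of at most 3 identical bases) are assumed example values and the function names are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def max_homopolymer(seq: str) -> int:
    """Length of the longest run of identical consecutive bases."""
    longest = run = 0
    prev = None
    for base in seq:
        run = run + 1 if base == prev else 1
        longest = max(longest, run)
        prev = base
    return longest


def satisfies_constraints(seq, gc_min=0.4, gc_max=0.6, max_run=3):
    """True if seq meets the assumed GC window and homopolymer limit."""
    return gc_min <= gc_content(seq) <= gc_max and max_homopolymer(seq) <= max_run


print(satisfies_constraints("ACGTACGT"))  # balanced GC, no long runs
print(satisfies_constraints("AAAAACGT"))  # low GC and a run of five A's
```

An encoder that balances these constraints dynamically would typically run checks of this kind on each candidate strand and re-encode or remap strands that fail.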

R&D: High-fidelity DNA polymerase for DNA-based digital information storage

Findings suggested that integrating high-fidelity DNA polymerase with robust coding algorithm is viable strategy to achieve error correction in DNA data storage.

Small Methods has published an article written by Xutong Liu, Kai Wen, Enyang Yu, Qixuan Zhao, Haoran Fu, Jingxuan Zhu, Haobo Han, Key Laboratory for Molecular Enzymology and Engineering of Ministry of Education, School of Life Sciences, Jilin University, Changchun, 130012 China, and Quanshun Li, Key Laboratory for Molecular Enzymology and Engineering of Ministry of Education, School of Life Sciences, Jilin University, Changchun, 130012 China, and Center for Supramolecular Chemical Biology, Jilin University, Changchun, 130012 China.

Abstract: “DNA is increasingly recognized for its superior data storage density, favorable stability, and low energy requirements, positioning it as a potential alternative for future digital information storage systems. However, the replication and transfer of information within DNA is prone to errors, primarily due to the inaccuracy during DNA synthesis. Herein, 9°N DNA polymerase is explored from Thermococcus sp. 9°N-7 for robust DNA information storage, leveraging its thermophilic characteristics and error-correcting capability to facilitate high-fidelity DNA amplification. Notably, the enzyme demonstrates great improvement in managing DNA substitution errors compared to commercial DNA polymerases, effectively addressing the shortfall in substitution error correction typically presented in coding algorithms. This distinctive fidelity and substrate specificity of 9°N DNA polymerase is attributed to specific conformational changes and interactions during the process of nucleotide incorporation. Collectively, the findings suggested that integrating high-fidelity DNA polymerase with robust coding algorithm is a viable strategy to achieve error correction in DNA data storage. This combination exhibits the potential to augment the accuracy, portability, and scalability of DNA-based information storage systems, paving the way for reliable and effective data storage.”

R&D: Advancements in DNA tagging and storage: techniques, applications, and future implications

Authors demonstrate that DNA tagging and data storage applications exhibit fundamentally different requirements, necessitating divergent technological strategies rather than unified solutions.

WIREs Computational Molecular Science has published an article written by Adam Kuzdraliński, Department of Bioinformatics, Polish-Japanese Academy of Information Technology, Warsaw, Mazowieckie, Poland, Marek Miśkiewicz, Institute of Computer Science and Mathematics, Maria Curie-Sklodowska University, Lublin, Poland, Hubert Szczerba, Department of Biotechnology, Microbiology and Human Nutrition, Faculty of Food Science and Biotechnology, University of Life Sciences in Lublin, Lublin, Poland, and Systems Biotechnology Group, Department of Systems Biology, National Centre for Biotechnology (CSIC), Madrid, Spain, Wojciech Mazurczyk, Institute of Computer Science, Faculty of Electronics and Information Technology, Warsaw University of Technology, Warsaw, Poland, Parallelism and VLSI Group, Faculty of Mathematics and Computer Science, FernUniversität in Hagen, Hagen, Germany, and IDEAS Research Institute, Warsaw, Poland, Tomasz Ociepa, Department of Bioinformatics, Polish-Japanese Academy of Information Technology, Warsaw, Mazowieckie, Poland, and Institute of Plant Genetics, Breeding and Biotechnology, University of Life Sciences in Lublin, Lublin, Poland, Michał Lechowski, International Institute of Molecular and Cell Biology in Warsaw, Warsaw, Poland, and Bogdan Księżopolski, Kozminski University, Warsaw, Poland.

Abstract: “DNA-based technologies for object authentication and data storage are becoming an interesting alternative to classic identification systems, yet their practical implementation faces fundamental technical and commercial barriers that limit widespread adoption. This review presents an analysis of DNA tagging and storage technologies, assessing their technical features, cost-effectiveness, and real-world applicability through comparison of competing approaches. We demonstrate that DNA tagging and data storage applications exhibit fundamentally different requirements, necessitating divergent technological strategies rather than unified solutions. DNA tagging faces severe cost disadvantages ($1–$100 per authentication versus $0.01–$0.10 for established technologies) and extended verification times (30 min to 6+ hours versus instant readout), limiting viability to high-security, low-volume markets such as pharmaceuticals and luxury goods. Current commercial implementations frequently lack peer-reviewed validation, creating an evidence deficit that undermines enterprise confidence. Among current approaches, isothermal amplification methods (LAMP, RPA) combined with colorimetric detection represent the most promising pathway for field-deployable authentication, while Illumina sequencing platforms provide optimal performance for data storage applications. The absence of standardization frameworks fundamentally constrains commercial adoption across both domains, preventing interoperability and enabling unsubstantiated performance claims. We conclude that successful commercialization requires strategic reorientation toward application-specific optimization and integrative approaches where DNA serves as secondary authentication combined with established identifiers, rather than competing directly on speed and cost metrics.”

R&D: Random access and semantic search in DNA storage enabled by Cas9 and machine-guided design

Approaches move towards addressing key challenges in molecular data retrieval by offering simplified, rapid isothermal protocols and new DNA data access capabilities.

Nature Communications has published an article written by Carina Imburgia, Lee Organick, Karen Zhang, Nicolas Cardozo, Jeff McBride, Callista Bee, Delaney Wilde, Gwendolin Roote, Sophia Jorgensen, David Ward, Charlie Anderson, University of Washington, Paul G. Allen School of Computer Science and Engineering, Seattle, USA, Karin Strauss, Microsoft Research, Redmond, USA, Luis Ceze, University of Washington, Paul G. Allen School of Computer Science and Engineering, Seattle, USA, and Jeff Nivala, University of Washington, Paul G. Allen School of Computer Science and Engineering, Seattle, USA, and University of Washington, Molecular Engineering and Sciences Institute, Seattle, USA.

Abstract: “DNA is a promising medium for digital data storage due to its exceptional data density and longevity. Practical DNA-based storage systems require selective data retrieval to minimize decoding time and costs. In this work, we introduce CRISPR-Cas9 as a user-friendly tool for multiplexed, low-latency molecular data extraction. We first present a one-pot, multiplexed random access method in which specific data files are selectively cleaved using a CRISPR-Cas9 addressing system and then sequenced via nanopore technology. This approach was validated on a pool of 1.6 million DNA sequences, comprising 25 unique data files. We then developed a molecular similarity-search approach combining machine learning with Cas9-based retrieval. Using a deep neural network, we mapped a database of 1.74 million images into a reduced-dimensional embedding, encoding each embedding as a Cas9 target sequence. These target sequences act as molecular addresses, capturing clusters of semantically related images. By leveraging Cas9’s off-target cleavage activity, query sequences cleave both exact and closely related targets, enabling high-fidelity retrieval of molecular addresses corresponding to in silico image clusters similar to the query. These approaches move towards addressing key challenges in molecular data retrieval by offering simplified, rapid isothermal protocols and new DNA data access capabilities.”
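
The similarity-search idea, in which a query retrieves exact and closely related addresses, can be mimicked in silico with a very rough model. The sketch below treats off-target cleavage as a simple mismatch budget (Hamming distance), which is an assumption made here for illustration: real Cas9 off-target activity depends on mismatch position and other factors, and the paper's design is machine-guided. The addresses, file names, and max_mismatches value are all hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def retrieve(query, addresses, max_mismatches=2):
    """Return names whose target is within max_mismatches of the query.

    Crude stand-in for off-target cleavage: exact matches and near
    matches are both 'cleaved' and therefore retrieved.
    """
    return [
        name
        for name, target in addresses.items()
        if hamming(query, target) <= max_mismatches
    ]


# Hypothetical 20-nt molecular addresses for three stored images.
addresses = {
    "img_a": "ACGTACGTACGTACGTACGT",  # exact match to the query below
    "img_b": "ACGTACGTACGTACGTACGA",  # 1 mismatch
    "img_c": "TTTTACGTACGTACGTACGA",  # 4 mismatches
}
print(retrieve("ACGTACGTACGTACGTACGT", addresses))
```

In this toy model, semantically related images would need addresses that sit within the mismatch budget of one another, which is the role the learned embedding plays in the abstract.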