R&D: Five Articles on DNA Data Storage

R&D: Improved Homopolymer-Free Encoding for DNA Storage Systems

Encoding solution can store information more efficiently over a given length, increasing storage utilization.

SSRN has published an article written by Vahid Hojatizadeh and Ali Jahanian, Faculty of Computer Science and Engineering, Shahid Beheshti University, Evin, Tehran, Iran.

Abstract: “Nowadays, dealing with big data has become a remarkable problem in data storage space. Information storage based on inherent properties of DNA provides a high-density storage medium with advantages of high storage efficiency and long storage endurance. Native strands of DNA are made from natural nucleotides, which are called bases: adenine (A), thymine (T), guanine (G) and cytosine (C). There are only four natural nucleotides, and DNA storage is thus limited to 2 bits per nucleotide. In this paper, for achieving higher encoding efficiency, we use two brilliant approaches based on hexadecimal. Hexadecimal Encoding Matrix (HEM) and Hexadecimal Encoding S-group Matrix (HESM) set up a matrix with 16 hexadecimal digits and map each hexadecimal digit to at least two nucleotides. Our encoding can store information more efficiently over a given length, increasing storage utilization. Average encoding cost (number of nucleotides used) is better than that of encodings published to date. HEM and HESM have error-avoidance and self-error-detection properties.”
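To make the 2 bits per nucleotide baseline concrete, here is a minimal sketch that maps each hexadecimal digit to a fixed pair of nucleotides. It is not the authors' HEM or HESM scheme, whose matrices are not given in the abstract; the lookup table and the function names encode and decode are assumptions made purely for illustration.

```python
from itertools import product

BASES = "ACGT"

# Hypothetical table: the 16 dinucleotides in lexicographic order,
# one per hex digit (0 -> AA, 1 -> AC, ..., f -> TT).
PAIRS = ["".join(p) for p in product(BASES, repeat=2)]
HEX_TO_PAIR = {format(i, "x"): pair for i, pair in enumerate(PAIRS)}
PAIR_TO_HEX = {pair: digit for digit, pair in HEX_TO_PAIR.items()}


def encode(data: bytes) -> str:
    """Encode bytes as nucleotides: one hex digit (4 bits) per dinucleotide."""
    return "".join(HEX_TO_PAIR[digit] for digit in data.hex())


def decode(strand: str) -> bytes:
    """Invert encode(); the strand length must be a multiple of 2."""
    digits = "".join(
        PAIR_TO_HEX[strand[i:i + 2]] for i in range(0, len(strand), 2)
    )
    return bytes.fromhex(digits)


strand = encode(b"DNA")
print(strand, len(strand) / len(b"DNA"))  # nucleotides per byte
```

This naive table spends exactly 4 nucleotides per byte and freely emits repeats such as AA or TT, so it has none of the error-avoidance properties the abstract claims for HEM and HESM.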

R&D: Efficient Constraining of Transcoding in DNA-Based Image Storage

Authors propose two transcoding methods that address two key challenges: reducing data rate and minimizing errors.

IEEE Xplore has published, in the 2025 IEEE International Conference on Image Processing (ICIP) proceedings, an article written by Sara Al Sayyed, Aline Roumy, and Thomas Maugey, Université de Rennes, Inria, France.

Abstract: “DNA has emerged as a promising alternative for long-term data storage due to its high capacity, durability, and low-energy potential. However, storing data in DNA presents several challenges. First, it requires complex and costly biochemical processes, making efficient compression crucial to reducing DNA synthesis time and cost. Second, these processes are prone to errors that must be avoided and/or corrected. In particular, homopolymers (repetitions of the same nucleotide) are a well-known source of errors during the sequencing step. Avoiding such repetitions helps mitigate errors but introduces a constraint that may increase the data compression rate. In this paper, we propose two transcoding methods that address these two key challenges: reducing data rate and minimizing errors. The first method strictly enforces the error-minimization constraint by eliminating homopolymers of a certain length, at the cost of an increased data rate. In contrast, the second method accepts a slight increase in homopolymers. However, we show that these increases remain limited (2.14% increase in compression rate for the first method and 0.39% homopolymer rate for the second). These two approaches demonstrate that it is possible to efficiently constrain transcoding while balancing error minimization and compression performance.”
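The trade-off between forbidding homopolymers and paying in data rate can be sketched with a well-known rotating ternary code, in which each symbol picks one of the three nucleotides that differ from the previous one. This is a stand-in technique, not either of the transcoders proposed in the paper; the function names and the virtual starting base are assumptions.

```python
BASES = "ACGT"


def encode_trits(trits, prev="A"):
    """Map each trit (0, 1 or 2) to one of the three bases that differ
    from the previous base, so no base is ever repeated.
    `prev` is a virtual starting base (an assumption of this sketch)."""
    out = []
    for t in trits:
        choices = [b for b in BASES if b != prev]
        prev = choices[t]
        out.append(prev)
    return "".join(out)


def decode_trits(strand, prev="A"):
    """Invert encode_trits() by replaying the same choice lists."""
    trits = []
    for base in strand:
        choices = [b for b in BASES if b != prev]
        trits.append(choices.index(base))
        prev = base
    return trits


def max_run(strand):
    """Length of the longest homopolymer run in the strand."""
    longest = run = 0
    last = None
    for base in strand:
        run = run + 1 if base == last else 1
        longest = max(longest, run)
        last = base
    return longest


strand = encode_trits([0, 0, 2, 1, 1, 0])
print(strand, max_run(strand))
```

Each nucleotide here carries one trit, about 1.58 bits, instead of the unconstrained 2 bits. That gap is the kind of rate cost a strict homopolymer constraint imposes, and it is why a method that tolerates a small homopolymer rate can compress better.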

R&D: Achievable Rates of Nanopore-Based DNA Storage

Paper studies achievable rates of nanopore-based DNA storage when nanopore signals are decoded using a tractable channel model that does not rely on a basecalling algorithm.

IEEE Journal on Selected Areas in Information Theory has published an article written by Brendon McBain and Emanuele Viterbo, ECSE Department, Monash University, Melbourne, VIC, Australia.

Abstract: “This paper studies achievable rates of nanopore-based DNA storage when nanopore signals are decoded using a tractable channel model that does not rely on a basecalling algorithm. Specifically, the noisy nanopore channel (NNC) with the Scrappie pore model generates average output levels via i.i.d. geometric sample duplications corrupted by i.i.d. Gaussian noise (NNC-Scrappie). Simplified message passing algorithms are derived for efficient soft decoding of nanopore signals using NNC-Scrappie. Previously, evaluation of this channel model was limited by the lack of DNA storage datasets with nanopore signals included. This is solved by deriving an achievable rate based on the dynamic time-warping (DTW) algorithm that can be applied to genomic sequencing datasets subject to constraints that make the resulting rate applicable to DNA storage. Using a publicly-available dataset from Oxford Nanopore Technologies (ONT), it is demonstrated that coding over multiple DNA strands of 100 bases in length and decoding with the NNC-Scrappie decoder can achieve rates of at least 0.64-1.18 bits per base, depending on the channel quality of the nanopore that is chosen in the sequencing device per channel-use, and 0.96 bits per base on average assuming uniformly chosen nanopores. These rates are pessimistic since they only apply to single reads and do not include calibration of the pore model to specific nanopores.”
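The channel described in the abstract, where each ideal output level is repeated a geometric number of times and every sample is corrupted by Gaussian noise, can be simulated in a few lines. The sketch below is only a generative toy: the input levels, the parameters p and sigma, and the assumption that each level is emitted at least once are all hypothetical, and no Scrappie pore model is involved.

```python
import random


def geometric(p, rng):
    """Number of Bernoulli(p) trials up to and including the first success
    (support 1, 2, ...), so every level is emitted at least once."""
    n = 1
    while rng.random() >= p:
        n += 1
    return n


def noisy_nanopore_channel(levels, p=0.3, sigma=0.1, seed=0):
    """Duplicate each level an i.i.d. geometric number of times and add
    i.i.d. Gaussian noise to every resulting sample."""
    rng = random.Random(seed)
    samples = []
    for level in levels:
        for _ in range(geometric(p, rng)):
            samples.append(level + rng.gauss(0.0, sigma))
    return samples


# Hypothetical levels; a real pore model would derive them from k-mers.
signal = noisy_nanopore_channel([1.0, 2.0, 1.5, 3.0])
print(len(signal), [round(s, 2) for s in signal[:5]])
```

Because the number of samples per level is random, a decoder cannot assume a fixed alignment between samples and bases, which is why alignment tools such as dynamic time warping appear in the rate analysis.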

R&D: Integrated Error Correction to Enhance Efficiency of Digital Data Storage Based on DNA Nanostructures

Authors confirmed the effectiveness of IEC by recovering medical data encoded in DNA with errors.

ACS Nano has published an article written by Cuiping Mao, Guangdong Provincial Key Laboratory of Advanced Biomaterials, Shenzhen Key Laboratory of Smart Healthcare Engineering, Department of Biomedical Engineering, Southern University of Science and Technology, Shenzhen 518055, P. R. China, and Key Laboratory of Clinical Laboratory Diagnostics (Ministry of Education), College of Laboratory Medicine, Chongqing Medical University, Chongqing 400016, P. R. China, Shuo Zheng, Department of Electronic and Electrical Engineering, Southern University of Science and Technology, No. 1088, Xueyuan Rd., Nanshan District, Shenzhen, Guangdong 518055, P. R. China, Zhihao Huang, Dou Wang, Guangdong Provincial Key Laboratory of Advanced Biomaterials, Shenzhen Key Laboratory of Smart Healthcare Engineering, Department of Biomedical Engineering, Southern University of Science and Technology, Shenzhen 518055, P. R. China, Yufan Zhuang, Department of Electronic and Electrical Engineering, Southern University of Science and Technology, No. 1088, Xueyuan Rd., Nanshan District, Shenzhen, Guangdong 518055, P. R. China, Jiangjiang Zhang, Guangdong Provincial Key Laboratory of Advanced Biomaterials, Shenzhen Key Laboratory of Smart Healthcare Engineering, Department of Biomedical Engineering, Southern University of Science and Technology, Shenzhen 518055, P. R. China, Rui Wang, Department of Electronic and Electrical Engineering, Southern University of Science and Technology, No. 1088, Xueyuan Rd., Nanshan District, Shenzhen, Guangdong 518055, P. R. China, and Xingyu Jiang, Guangdong Provincial Key Laboratory of Advanced Biomaterials, Shenzhen Key Laboratory of Smart Healthcare Engineering, Department of Biomedical Engineering, Southern University of Science and Technology, Shenzhen 518055, P. R. China.

Abstract: “Synthetic DNA is a durable, high-density information storage platform based on DNA nanostructures. However, errors during DNA reading pose challenges to data integrity. Conventional error-correcting codes add redundancy during encoding to ensure data integrity, thereby reducing storage density and increasing costs. Here, we present an integrated error correction (IEC) algorithm that synergistically combines three enhanced mechanisms: the “head–tail” region Levenshtein distance for error-tolerant clustering (10× faster); sliding window-optimized Hamming distance for error detection and correction of insertions and deletions without length constraints; and score-weighted majority voting for optimal sequence selection (2% higher accuracy), collectively enhancing storage density and decoding efficiency. We confirmed the effectiveness of IEC by recovering medical data encoded in DNA with errors. With IEC, we can simultaneously correct insertion, deletion, and substitution errors with a redundancy rate of 2.4%, while the current minimum redundancy rate is 7%. We thus achieved a logical density of 1.4 bits per nucleotide. Additionally, IEC ensures optimal fidelity during decoding, closely matching the encoded sequences, resulting in a reduction of the number of sequences by 3 orders of magnitude, minimizing computational overhead and runtime complexities, and enhancing decoding efficiency.”
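One plausible reading of the "head–tail" clustering step is that reads are compared only on short prefixes and suffixes rather than over their full length, which shrinks the cost of each Levenshtein computation. The sketch below illustrates that idea under stated assumptions: the region length k, the summing of the two distances, and the function names are hypothetical, and the abstract does not confirm that IEC works this way.

```python
def levenshtein(a, b):
    """Classic edit distance (insertions, deletions, substitutions),
    computed row by row with O(len(b)) memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute or match
            ))
        prev = curr
    return prev[-1]


def head_tail_distance(a, b, k=4):
    """Sum of edit distances over the first k and last k symbols only
    (hypothetical combination rule; k is an assumed parameter)."""
    return levenshtein(a[:k], b[:k]) + levenshtein(a[-k:], b[-k:])


reads = ["ACGTACGTAC", "AGGTACGTAC", "TTTTACGGGG"]
for read in reads[1:]:
    print(read, head_tail_distance(reads[0], read))
```

Comparing two regions of length k costs on the order of k squared operations per pair instead of growing with the square of the full read length, at the price of ignoring errors in the middle of the read during clustering.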

R&D: Virtual Multi-level Directory File Addressing Method (VMDFAM) for DNA Storage

Work presents a robust solution for future large-scale DNA storage systems, offering hierarchical file organization capabilities with high efficiency and scalability.

Future Generation Computer Systems has published an article written by Xiangzhen Zan, Institute of Computational Science and Technology, Guangzhou University, Guangzhou, 510006, Guangdong, China, and Shenzhen Pengcheng Technician College, Shenzhen, 518000, Guangdong, China, Xiangyu Yao, Ling Chu, Institute of Computational Science and Technology, Guangzhou University, Guangzhou, 510006, Guangdong, China, Peng Xu, Institute of Computational Science and Technology, Guangzhou University, Guangzhou, 510006, Guangdong, China, School of Computer Science of Information Technology, Qiannan Normal University for Nationalities, Duyun, 558000, Guizhou, China, and Guangdong Provincial Key Laboratory of Artificial Intelligence in Medical Image Analysis and Application, Guangzhou, 510080, Guangdong, China.

Abstract: “As the capacity of DNA storage continues to expand, efficient file organization and access mechanisms have become critical challenges. Current DNA storage architecture research focuses on random file access, utilizing hierarchical primer addressing schemes to retrieve target data with high fidelity. However, these approaches often overlook critical issues such as the flexible logical representation of file hierarchies, the impact of primer sequences on logical density, and primer-payload crosstalk that limits the address space. Here, we propose a virtual multi-level directory file addressing method (VMDFAM) based on modulation DNA storage architecture, which implicitly embeds file addresses – comprising disk partition, multi-level directory, and file identifier – into a binary carrier through a predefined modulation codebook. Compared with previous works, the proposed approach features a flexible hierarchical tree structure and avoids the primer-payload crosstalk issue that is commonly found in DNA storage systems. Moreover, it preserves the payload encoding region without interference, significantly enhancing logical density. Theoretical analysis demonstrates that, for a typical DNA sequence length of 200, it achieves an address space size of up to and a storage capacity of around 32 petabytes (PB). Reliability analysis confirms that the addressing mechanism can tolerate reading and writing errors of up to 15%, addressing the inherent error-prone nature of DNA storage. The wet lab experiment demonstrates that our method can be reliably deployed in real biochemical environments. This work presents a robust solution for future large-scale DNA storage systems, offering hierarchical file organization capabilities with high efficiency and scalability.”