R&D: Six Recent Articles on DNA Data Storage Technologies and Solutions

R&D: CECLD, Classification Error Correction based on Levenshtein Distance in DNA Data Storage

Experimental results show that the CECLD algorithm efficiently corrects errors in sequences of varying lengths, with a total channel error rate of 2.1 % and a bit rate below 58.0 %.

Expert Systems with Applications has published an article written by Shufang Zhang, Ming Luo, Penghao Wang, School of Electrical and Information Engineering, Tianjin University, Tianjin, 300072, China, Bingzhi Li, School of Synthetic Biology and Biomanufacturing, Tianjin University, Tianjin, 300072, China, and State Key Laboratory of Synthetic Biology and Frontiers Science Center for Synthetic Biology, Tianjin, 300072, China, and Huaqing Yang, School of Electrical and Information Engineering, Tianjin University, Tianjin, 300072, China.

Abstract: “With the global data volume increasing rapidly, DNA molecules are envisioned as a future solution for massive data storage due to their high density and longevity. In the biochemical process of DNA data storage, indel (insertion and deletion) errors have a greater impact on data accuracy than substitution errors. Although various error correction schemes have been proposed, there are still problems of low efficiency in correcting indel errors and the high redundancy required for data recovery. Therefore, this paper proposes a classification error correction method based on Levenshtein distance, named CECLD. It requires an error classification model with a neural network structure to evaluate Levenshtein distance features and identify errors. The inferred error types are then used to correct indel errors, effectively eliminating nucleotide misalignments. Using this error classification model, errors in the address or payload are sequentially corrected with CRC16 decoding or RS decoding. The experimental results show that the CECLD algorithm efficiently corrects errors in sequences of varying lengths, with a total channel error rate of 2.1 % and a bit rate below 58.0 %. The required redundancy is lower than that of the existing error correction method, which will significantly facilitate the widespread adoption of DNA data storage.”
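For readers unfamiliar with the metric named in the abstract, the following is a minimal sketch of the Levenshtein (edit) distance between two nucleotide strings. It is a textbook dynamic-programming implementation, not the CECLD classifier itself; the function name and the example sequences are illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # Keep the shorter string as the row so memory use is O(min(len(a), len(b))).
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


# Hypothetical reference strand and three reads of it.
reference = "ACGTACGT"
print(levenshtein(reference, "ACGTACGT"))  # identical read -> 0
print(levenshtein(reference, "ACGAACGT"))  # one substitution -> 1
print(levenshtein(reference, "ACGACGT"))   # one deletion -> 1
```

Note that a single substitution and a single deletion both give distance 1, which is why the distance alone does not reveal the error type; the abstract describes a neural classifier operating on Levenshtein distance features for that step.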

R&D: Neural Polar Decoders for DNA Data Storage

Authors propose a data-driven approach based on neural polar decoders (NPDs) to design low-complexity decoders for channels with synchronization errors.

ArXiv has published an article written by Ziv Aharoni and Henry D. Pfister, Department of Electrical and Computer Engineering, Duke University, USA.

Abstract: “Synchronization errors, such as insertions and deletions, present a fundamental challenge in DNA-based data storage systems, arising from both synthesis and sequencing noise. These channels are often modeled as insertion-deletion-substitution (IDS) channels, for which designing maximum-likelihood decoders is computationally expensive. In this work, we propose a data-driven approach based on neural polar decoders (NPDs) to design low-complexity decoders for channels with synchronization errors. The proposed architecture enables decoding over IDS channels with reduced complexity O(ANlogN), where A is a tunable parameter independent of the channel. NPDs require only sample access to the channel and can be trained without an explicit channel model. Additionally, NPDs provide mutual information (MI) estimates that can be used to optimize input distributions and code design. We demonstrate the effectiveness of NPDs on both synthetic deletion and IDS channels. For deletion channels, we show that NPDs achieve near-optimal decoding performance and accurate MI estimation, with significantly lower complexity than trellis-based decoders. We also provide numerical estimates of the channel capacity for the deletion channel. We extend our evaluation to realistic DNA storage settings, including channels with multiple noisy reads and real-world Nanopore sequencing data. Our results show that NPDs match or surpass the performance of existing methods while using significantly fewer parameters than the state-of-the-art. These findings highlight the promise of NPDs for robust and efficient decoding in DNA data storage systems.”
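To make the IDS channel concrete, here is a minimal simulation sketch of the kind of sampler that could provide the "sample access" the abstract mentions. It is a simplified memoryless model, not the authors' channel or decoder; the function name, the per-base probabilities and the rule of inserting a random base before the original one are all assumptions made for illustration.

```python
import random


def ids_channel(seq, p_ins, p_del, p_sub, rng):
    """Pass seq through a simple memoryless insertion-deletion-substitution channel.

    For each input base exactly one event happens: an insertion (a random base
    is emitted before the original), a deletion, a substitution, or a clean copy.
    """
    alphabet = "ACGT"
    out = []
    for base in seq:
        r = rng.random()
        if r < p_ins:
            out.append(rng.choice(alphabet))  # inserted base
            out.append(base)                  # original base is kept
        elif r < p_ins + p_del:
            continue                          # base is dropped
        elif r < p_ins + p_del + p_sub:
            out.append(rng.choice([b for b in alphabet if b != base]))
        else:
            out.append(base)
    return "".join(out)


# Draw several noisy reads of one hypothetical strand, as in a multi-read setting.
rng = random.Random(2025)
strand = "ACGTTGCAACGTAGCT"
for _ in range(3):
    print(ids_channel(strand, p_ins=0.02, p_del=0.05, p_sub=0.03, rng=rng))
```

Because insertions and deletions change the output length, positions in the read no longer line up with positions in the input, which is the synchronization problem the decoders in this article address.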

R&D: Sequence Analysis and Decoding with Extra Low-quality Reads for DNA Data Storage

Proposed methods reduced the reading cost by 6.83% on average and up to 19.67% while maintaining the writing cost.

Bioinformatics has published an article written by Jiyeon Park, Department of Intelligent Electronics and Computer Engineering, Chonnam National University, Gwangju 61186, South Korea, Ha Hyeon Jeon, Jeong Wook Lee, Department of Chemical Engineering, POSTECH, Pohang 37673, South Korea, and Hosung Park, Department of Intelligent Electronics and Computer Engineering, Chonnam National University, Gwangju 61186, South Korea.

Motivation: “Error detection/correction codes play an important role to reduce writing and/or reading costs in DNA data storage. Sequence analysis algorithms also make a crucial effect on error correction but have been executed independently from the decoding of error correction codes. In conventional sequence analysis, low-quality reads are usually discarded. For DNA data storage, low-quality reads can be constructively used to sequence analysis with the assistance of error detection/correction codes.”

Results: “We obtained the low-quality reads which failed to pass the chastity filter in Illumina NGS sequencing. We confirmed the effectiveness of the extra low-quality reads by providing error statistics and performing decoding with them. We proposed a sequence clustering algorithm for various-length reads and a consensus algorithm based on probabilistic majority and error detection to efficiently exploit the extra reads. The proposed methods reduced the reading cost by 6.83% on average and up to 19.67% while maintaining the writing cost.”
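As a rough illustration of how low-quality reads can still contribute to a consensus, the sketch below takes a per-position weighted vote over reads in one cluster. It is not the authors' probabilistic-majority algorithm: it assumes equal-length, already aligned reads, and the per-read weights are hypothetical confidence values chosen for the example.

```python
from collections import defaultdict


def weighted_consensus(reads, weights):
    """Per-position weighted majority vote over equal-length, aligned reads."""
    if not reads or len(reads) != len(weights):
        raise ValueError("need one weight per read and at least one read")
    length = len(reads[0])
    if any(len(r) != length for r in reads):
        raise ValueError("this sketch assumes reads of equal length")
    consensus = []
    for pos in range(length):
        votes = defaultdict(float)
        for read, weight in zip(reads, weights):
            votes[read[pos]] += weight
        # sorted() makes tie-breaking deterministic (alphabetical order).
        consensus.append(max(sorted(votes), key=votes.get))
    return "".join(consensus)


# One filter-passing read (weight 1.0) and two low-quality reads (weight 0.6 each).
# Together the low-quality reads outvote the single high-quality read.
print(weighted_consensus(["AAAA", "CCCC", "CCCC"], [1.0, 0.6, 0.6]))  # CCCC
```

In a real pipeline the candidate consensus would then be checked with the error detection code, as the abstract describes, before being accepted.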

Availability and implementation: DOI 10.5281/zenodo.15571858.

R&D: Primer-Disk-Enabled DNA Data Storage System with Index and Record-Many-Read-Many Features

Work provides a new DNA data storage system with index and record-many-read-many features, paving the way for the practical use of DNA data storage.

Advanced Science has published an article written by Jiaxiang Ma, Yu Yang, Ben Pei, Department of Mechanical Engineering, Tsinghua University, Beijing, 100084 China, Shengli Mi, Division of Advanced Manufacturing, Graduate School at Shenzhen, Tsinghua University, Shenzhen, 518055 China, Zhuo Xiong, Department of Mechanical Engineering, Tsinghua University, Beijing, 100084 China, and Liliang Ouyang, Department of Mechanical Engineering, Tsinghua University, Beijing, 100084 China, and State Key Laboratory of Tribology in Advanced Equipment, Tsinghua University, Beijing, 100084 China.

Abstract: “DNA data storage has emerged as a promising information storage technology by encoding information down to base molecules. However, it remains a challenge to structure the DNA data with ease of recording, retrieving, and reading. Here, a primer-disk-enabled hierarchical DNA data storage system is introduced, which allows for the multiple immobilizations of DNA molecules and the generation of corresponding QR codes for retrieving. The primer disk is pre-engineered to present multiple primers, on which encoded DNA molecules with complementary primers can be covalently immobilized on demand via solid-phase PCR. Each DNA file can be retrieved by inkjet printing a fluorescent QR code. A primer disk with up to 10 primers is used. The results show that different DNA files can be subsequently stored on the disk. One can have readily access to the index via fluorescent QR codes and decode information after sequent imaging, convention, and recognition. To this end, the recorded DNA files can be randomly read via solid-phase PCR with sufficient copies of collected DNA for up to 20 reads. Together, this work provides a new DNA data storage system with index and record-many-read-many features, paving the way for the practical use of DNA data storage.”

R&D: DNA Sequence Clustering in High Error Rates via Hash Sketches Fuzzy Clustering for Efficient Stored Data Reconstruction

Paper proposes a hash sketches fuzzy clustering (HSFC) method for reliable DNA storage data reconstruction.

Springer Verlag has published, in 29th Pacific-Asia Conference on Knowledge Discovery and Data Mining proceedings, PAKDD 2025, Sydney, NSW, Australia, an article written by Qi Shao, The Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Dalian, 116622, Liaoning, China, Yanfen Zheng, Ben Cao, School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, Liaoning, China, Zhenlu Liu, Bin Wang, Shihua Zhou, The Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Dalian, 116622, Liaoning, China, and Pan Zheng, Department of Accounting and Information Systems, University of Canterbury, Upper Riccarton, Christchurch, 8140, New Zealand.

Abstract: “Life is composed of sequences, but due to the complexity of biological sequences, clustering algorithms have been introduced for the analysis and processing of biological sequence data. However, in tasks involving synthetic DNA sequences, such as DNA data storage, under high-error-rate sequencing techniques like nanopore sequencing, the accuracy of clustering and the reliability of reconstruction remain significant challenges. Therefore, this paper proposes a hash sketches fuzzy clustering (HSFC) method for reliable DNA storage data reconstruction. HSFC employs locality sensitive hashing to map DNA sequences as hash sketches with drifts and designs fuzzy matching mechanisms that tolerate more sequence errors, thereby mitigating the impact of errors on clustering results. Experimental results show that HSFC improves the clustering accuracy of DNA sequences by 6% to 17% compared to state-of-the-art DNA clustering methods. Moreover, HSFC achieves sequence recovery and reconstruction rates of 99% at a simulation error rate of 10%. In conclusion, HSFC enhances the accuracy of DNA sequence clustering in high error rate environments, thus facilitating high quality data reconstruction and ensuring the integrity and reliability of DNA storage read data.”
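The idea of mapping sequences to hash sketches can be illustrated with MinHash over k-mers, one well-known locality-sensitive hashing scheme. This is a generic sketch, not HSFC: the "drifts" and fuzzy matching described in the abstract are not reproduced, and the function names, k-mer length and number of hash functions are assumptions.

```python
import hashlib


def kmers(seq, k):
    """Set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_sketch(seq, k=4, num_hashes=32):
    """MinHash signature: for each seeded hash, keep the minimum over all k-mers."""
    grams = kmers(seq, k)
    if not grams:
        raise ValueError("sequence is shorter than k")
    sketch = []
    for seed in range(num_hashes):
        sketch.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{g}".encode(), digest_size=8).digest(),
                "big",
            )
            for g in grams
        ))
    return sketch


def sketch_similarity(s1, s2):
    """Fraction of matching sketch entries; estimates Jaccard similarity of k-mer sets."""
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)


# Hypothetical strand, a noisy copy with one substitution, and an unrelated strand.
original = "ACGTTGCAACGTAGCTAGGCTTAACG"
noisy = "ACGTTGCAACGTAGCTAGGATTAACG"
other = "TTTTGGGGCCCCAAAATTGGCCAATT"
ref = minhash_sketch(original)
print(sketch_similarity(ref, minhash_sketch(noisy)))
print(sketch_similarity(ref, minhash_sketch(other)))
```

Reads whose sketches agree in many positions are likely copies of the same strand, so sketches can be compared cheaply before any expensive alignment or edit-distance computation.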

R&D: DNA Data Storage System Using Electrochemically Active Non-natural Oligonucleotides with Flexible Microfluidic Chips

Authors present a DNA-based electrochemical readout data storage system capable of identifying different non-natural, electroactive bases (methylene blue- and ferrocene-modified bases) for DNA.

Analytical Chemistry has published an article written by Jiankai Li, Ziyan Wang, Leni Zhong, and Xingyu Jiang, Shenzhen Key Laboratory of Smart Healthcare Engineering, Guangdong Provincial Key Laboratory of Advanced Biomaterials, Department of Biomedical Engineering, Southern University of Science and Technology, No. 1088 Xueyuan Road, Nanshan District, Shenzhen, Guangdong 518055, P. R. China.

Abstract: “Introducing non-natural oligonucleotides can provide DNA as a data storage medium with higher storage density and novel data storage paradigms. In particular, electrochemically active non-natural oligonucleotides can be detected through electrochemical signals, allowing data storage to retrieve data. Here, we present a DNA-based electrochemical readout data storage system capable of identifying different non-natural, electroactive bases (methylene blue- and ferrocene-modified bases) for DNA. The system utilizes a flexible electrochemical microfluidic chip, where data writing is achieved through DNA hybridization, and the parallel electrochemical signal acquisition enables data reading. Using methylene blue- and ferrocene-modified oligonucleotides as a demonstration, which allows 4 (2²) combinations on an electrode for the data storage, we successfully encoded and retrieved a 120-bit text file based on quaternary coding on a flexible electrochemical microfluidic chip. Our system may offer potential applications for electrochemical readout from DNA data storage system.”
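The arithmetic behind the quaternary coding can be sketched as follows: two independent labels give 2² = 4 states per electrode, so each electrode carries 2 bits and a 120-bit file needs 60 quaternary symbols. The abstract does not give the character encoding or the state-to-label mapping; the 8-bit ASCII encoding, the `LABELS` table and the sample message below are assumptions for illustration only.

```python
# Hypothetical mapping of each quaternary symbol to
# (methylene blue present, ferrocene present) on one electrode.
LABELS = {
    0: (False, False),
    1: (True, False),
    2: (False, True),
    3: (True, True),
}


def text_to_quaternary(text):
    """Split each 8-bit ASCII character into four 2-bit symbols, most significant first."""
    symbols = []
    for byte in text.encode("ascii"):
        for shift in (6, 4, 2, 0):
            symbols.append((byte >> shift) & 0b11)
    return symbols


def quaternary_to_text(symbols):
    """Inverse of text_to_quaternary."""
    if len(symbols) % 4:
        raise ValueError("need a multiple of four symbols")
    out = bytearray()
    for i in range(0, len(symbols), 4):
        byte = 0
        for s in symbols[i:i + 4]:
            byte = (byte << 2) | s
        out.append(byte)
    return out.decode("ascii")


message = "DNA data store!"              # 15 characters x 8 bits = 120 bits
symbols = text_to_quaternary(message)
print(len(symbols))                      # 60 quaternary symbols
print([LABELS[s] for s in symbols[:4]])  # label pattern for the first character
print(quaternary_to_text(symbols))       # round trip back to the text
```

Under these assumptions, a 15-character message occupies 60 electrode readings, half the number that binary presence/absence of a single label would require.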