R&D: Deep DNA Storage, Scalable and Robust DNA Storage Via Coding Theory and Deep Learning

arXiv.org has published an article written by Daniella Bar-Lev, Department of Computer Science, Technion – Israel Institute of Technology, Haifa, Israel; Itai Orr, Faculty of Engineering and the Institute for Nanotechnology and Advanced Materials, Bar-Ilan University, Ramat-Gan, Israel, and Wisense Technologies Ltd., Tel Aviv, Israel; and Omer Sabary, Tuvi Etzion, and Eitan Yaakobi, Department of Computer Science, Technion – Israel Institute of Technology, Haifa, Israel.

Abstract: “The concept of DNA storage was first suggested in 1959 by Richard Feynman who shared his vision regarding nanotechnology in the talk ‘There is plenty of room at the bottom’. Later, towards the end of the 20-th century, the interest in storage solutions based on DNA molecules was increased as a result of the human genome project which in turn led to a significant progress in sequencing and assembly methods. DNA storage enjoys major advantages over the well-established magnetic and optical storage solutions. As opposed to magnetic solutions, DNA storage does not require electrical supply to maintain data integrity and is superior to other storage solutions in both density and durability. Given the trends in cost decreases of DNA synthesis and sequencing, it is now acknowledged that within the next 10-15 years DNA storage may become a highly competitive archiving technology and probably later the main such technology. With that said, the current implementations of DNA based storage systems are very limited and are not fully optimized to address the unique pattern of errors which characterize the synthesis and sequencing processes. In this work, we propose a robust, efficient and scalable solution to implement DNA-based storage systems. Our method deploys Deep Neural Networks (DNN) which reconstruct a sequence of letters based on imperfect cluster of copies generated by the synthesis and sequencing processes. A tailor-made Error-Correcting Code (ECC) is utilized to combat patterns of errors which occur during this process. Since our reconstruction method is adapted to imperfect clusters, our method overcomes the time bottleneck of the noisy DNA copies clustering process by allowing the use of a rapid and scalable pseudo-clustering instead. Our architecture combines between convolutions and transformers blocks and is trained using synthetic data modelled after real data statistics.”
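
To make the reconstruction problem in the abstract concrete, the following Python sketch generates a synthetic cluster of noisy copies of one DNA strand and applies a naive per-position vote. It is an illustration only, not the authors' method: the function names (`noisy_copy`, `make_cluster`, `majority_vote`), the independent per-base error model, and the error rates are all assumptions made here, and the vote stands in for the deep neural network described in the paper. The vote is only meaningful when copies stay aligned, that is, with substitution errors alone; insertions and deletions shift positions, which is the kind of error pattern the abstract says current systems handle poorly.

```python
import random
from collections import Counter

ALPHABET = "ACGT"


def noisy_copy(strand, p_sub, p_del, p_ins, rng):
    """Return one noisy copy of `strand` (assumed i.i.d. per-base error model)."""
    out = []
    for base in strand:
        # Possibly insert a random base before the current one.
        if rng.random() < p_ins:
            out.append(rng.choice(ALPHABET))
        r = rng.random()
        if r < p_del:
            continue  # the base is deleted
        if r < p_del + p_sub:
            # Substitute with a different base.
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)


def make_cluster(strand, n_copies, p_sub=0.01, p_del=0.01, p_ins=0.01, seed=0):
    """Generate a cluster of noisy copies; the rates are illustrative, not measured."""
    rng = random.Random(seed)
    return [noisy_copy(strand, p_sub, p_del, p_ins, rng) for _ in range(n_copies)]


def majority_vote(cluster, length):
    """Naive baseline: most common base at each position.

    Only sensible for substitution-only noise, where copies remain aligned.
    """
    result = []
    for i in range(length):
        votes = Counter(copy[i] for copy in cluster if i < len(copy))
        result.append(votes.most_common(1)[0][0] if votes else "A")
    return "".join(result)


if __name__ == "__main__":
    original = "ACGTACGTTGCA"
    cluster = make_cluster(original, n_copies=5, p_sub=0.05, p_del=0.0, p_ins=0.0)
    print(cluster)
    print(majority_vote(cluster, len(original)))
```

Setting `p_del` or `p_ins` above zero makes the copies differ in length and typically breaks the positional vote, which gives a feel for why a reconstruction method adapted to such errors, combined with an error-correcting code, is needed.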