
R&D: NVM-ESR: Using Non-Volatile Memory in Exact State Reconstruction of Preconditioned Conjugate Gradient

Demonstrates how NVM can be used in supercomputers to enable efficient recovery from faults while requiring a significantly smaller memory footprint and lower time overheads than ESR.

arXiv has published an article written by Yehonatan Fridman, Yaniv Snir, Ben-Gurion University of the Negev, Israel Atomic Energy Commission, Israel; Harel Levin, Scientific Computing Center, Nuclear Research Center – Negev, Mobileye Vision Technologies, Israel; Danny Hendler, Department of Computer Science, Ben-Gurion University of the Negev, Israel; Hagit Attiya, Department of Computer Science, Technion – Israel Institute of Technology, Israel; and Gal Oren, Department of Computer Science, Technion – Israel Institute of Technology, and Scientific Computing Center, Nuclear Research Center – Negev, Israel.

Abstract: HPC systems are a critical resource for scientific research and advanced industries. Growing demand for computational power and memory is ushering in the exascale era, in which supercomputers are designed to provide enormous computing power to meet these needs. These complex supercomputers consist of many compute nodes and are consequently expected to experience frequent faults and crashes. Exact state reconstruction (ESR) has been proposed as a mechanism to alleviate the impact of frequent failures on long-term computations. ESR has shown great potential in the context of iterative linear algebra solvers, a key building block in numerous scientific applications. Recent designs of supercomputers feature the emerging non-volatile memory (NVM) technology. For example, the exascale Aurora supercomputer is planned to integrate Intel Optane DCPMM. This work investigates how NVM can be used to improve ESR so that it can scale to future exascale systems such as Aurora and provide enhanced resilience. We propose the non-volatile memory ESR (NVM-ESR) mechanism. NVM-ESR demonstrates how NVM can be utilized in supercomputers to enable efficient recovery from faults while requiring a significantly smaller memory footprint and lower time overheads than ESR. We focus on the preconditioned conjugate gradient (PCG) iterative solver, also studied in prior ESR research, because it is employed by the representative HPCG scientific benchmark.
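
For readers unfamiliar with the solver named in the abstract, the following is a minimal, single-process sketch of textbook PCG with a Jacobi (diagonal) preconditioner. It is not the paper's NVM-ESR mechanism and contains no fault recovery; the function names (`matvec`, `dot`, `pcg`) and the choice of preconditioner are assumptions made for illustration. It only shows which vectors (`x`, `r`, `z`, `p`) make up the per-iteration solver state that any recovery scheme would have to restore after a node failure.

```python
def matvec(A, v):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    """Euclidean inner product."""
    return sum(a * b for a, b in zip(u, v))


def pcg(A, b, tol=1e-10, max_iter=100):
    """Textbook PCG for a symmetric positive definite matrix A.

    Assumption: Jacobi preconditioner M = diag(A), chosen for brevity.
    Returns the approximate solution and the number of iterations run.
    """
    n = len(b)
    inv_diag = [1.0 / A[i][i] for i in range(n)]

    x = [0.0] * n                                  # current iterate
    r = list(b)                                    # residual b - A x (x = 0)
    z = [d * ri for d, ri in zip(inv_diag, r)]     # preconditioned residual
    p = list(z)                                    # search direction
    rz = dot(r, z)

    for k in range(max_iter):
        if dot(r, r) ** 0.5 < tol:                 # converged
            return x, k
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)                    # step length
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        beta = rz_new / rz                         # direction update factor
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Small symmetric positive definite system as a usage example.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
solution, iterations = pcg(A, b)
```

In a distributed run each of these vectors is partitioned across compute nodes, so a node crash loses part of the state; that loss is the problem that ESR and NVM-ESR address.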
