
R&D: Prophet, SSD Failure Analysis and Prediction Guided by Flash Reliability Characteristics in Data Centers

Authors study the failure characteristics of over 200,000 drives from industry data centers over a 4-year period, as well as daily data

IEEE Transactions on Computers has published an article written by Yunpeng Song, Yujiong Liang, Jialin Liu, MoE Engineering Research Center of Software/Hardware Co-Design Technology and Application, East China Normal University, Shanghai, China, and School of Computer Science and Technology, East China Normal University, Shanghai, China, and Liang Shi, MoE Engineering Research Center of Software/Hardware Co-Design Technology and Application, East China Normal University, Shanghai, China, School of Computer Science and Technology, East China Normal University, Shanghai, China, and Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan, China.

Abstract: Solid-state drives (SSDs) are massively deployed in various fields, especially in data centers, for their excellent cost-effectiveness. However, SSDs may fail due to their imperfect manufacturing processes, resulting in system-level failures and even downtime in data centers. This makes SSD failure prediction critical. Current studies focus on handling missing data, numerical normalization, and other statistical issues in applying machine learning methods, but they do not consider the reliability characteristics of the underlying flash media of SSDs or the timeliness (the time between a predicted failure and the real failure) of SSD failure prediction results.

In this work, we study the failure characteristics of over 200,000 drives from industry data centers over a 4-year period, as well as daily data. The relationship between SSD attribute values and failures is first investigated. Then, we analyze the SSD failure characteristics from several aspects (causes, differences between failures, and timeliness of prediction results), relying on flash reliability characteristics. Based on these, a novel SSD failure prediction method (Prophet) is proposed. Specifically, Prophet contains the following two components. First, to cope with the differences between failures, a diff-state method is proposed for differential machine learning modeling of SSDs in different “States”. We define the “State” of an SSD as the range of values in which the SSD currently lies for some key attributes. Using flash reliability characteristics, we distinguish between different failures before training the model, to obtain accurate predictions of different failure behaviors. Second, a recovery period method is proposed to enhance the timeliness of SSD failure prediction results through the design of the sample selection method. The enhanced timeliness can be used by operations personnel to handle failed SSDs, for example through replacement or repair. Evaluation on the real dataset shows that Prophet's predictive ability improves markedly, achieving high recall and a low false-positive rate while providing sufficient response time for processing failed SSDs.
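The abstract does not give implementation details, so the following is only a minimal Python sketch of the two ideas under stated assumptions, not the authors' method. It assumes a single key attribute (a wear percentage) with hypothetical bin edges to define a "State", and it assumes that sample selection means dropping samples taken fewer than `min_lead` days before failure, so a model trained on the rest is pushed toward earlier warnings. The names `WEAR_EDGES`, `state_of`, `select_samples`, `group_by_state`, `min_lead`, and `max_lead` are all illustrative.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical bin edges for one key attribute (wear, in percent).
# These values are assumptions for illustration, not from the paper.
WEAR_EDGES = [20.0, 60.0]


def state_of(wear_pct):
    """Return a discrete state index: 0 (<20), 1 (20 to <60), 2 (>=60)."""
    return bisect_right(WEAR_EDGES, wear_pct)


def select_samples(daily, failure_day, min_lead, max_lead):
    """Label the daily samples of one drive.

    daily: list of (day, wear_pct) tuples.
    failure_day: day the drive failed, or None for a healthy drive.
    Returns (state, day, label) tuples. For a failed drive, samples taken
    fewer than min_lead days before failure are dropped (assumed rule),
    and samples within max_lead days of failure are labeled positive.
    """
    selected = []
    for day, wear in daily:
        if failure_day is None:
            label = 0
        else:
            lead = failure_day - day
            if lead < min_lead:
                continue  # too close to failure to leave time to react
            label = 1 if lead <= max_lead else 0
        selected.append((state_of(wear), day, label))
    return selected


def group_by_state(samples):
    """Split samples per state so each state can get its own model."""
    groups = defaultdict(list)
    for state, day, label in samples:
        groups[state].append((day, label))
    return dict(groups)


# Example: one drive that fails on day 12.
daily = [(0, 10.0), (5, 25.0), (8, 30.0), (10, 70.0)]
samples = select_samples(daily, failure_day=12, min_lead=3, max_lead=7)
per_state = group_by_state(samples)
```

In this sketch each entry of `per_state` would feed a separate classifier, which is one plausible reading of "differential machine learning modeling"; the choice of attributes, bin edges, and window lengths would have to come from the paper itself.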
