
215,011 HDDs Tested in 2Q22

Lifetime annualized failure rate at 1.39%, same as last quarter and down from 1.45% one year ago

This article, published on August 2, 2022, was written by Andy Klein, principal cloud storage evangelist, Backblaze, Inc.

Backblaze Drive Stats for 2Q22

As of the end of 2Q22, Backblaze was monitoring 219,444 HDDs and SSDs in its data centers around the world. Of that number, 4,020 are boot drives, with 2,558 being SSDs, and 1,462 being HDDs.

Later this quarter, we will review our SSD collection. Here, we focus on the 215,424 data drives under management, reviewing their quarterly and lifetime failure rates as of the end of 2Q22. Along the way, we will share our observations and insights on the data presented and, as always, we look forward to you doing the same in the comments section at the end of the post.

Lifetime HDD Failure Rates
In this report, we’ll change things up a bit and start with the lifetime failure rates; we’ll cover the 2Q22 data later in the post. As of June 30, 2022, Backblaze was monitoring 215,424 HDDs used to store data. For this evaluation, we removed 413 drives from consideration because they were used for testing purposes or belonged to drive models with fewer than 60 drives. This leaves us with 215,011 HDDs, grouped into 27 different models, to analyze for the lifetime report.

[Figure 1: Backblaze Drive Stats 2Q22, lifetime HDD failure rates by model]

Notes and Observations About Lifetime Stats
The lifetime annualized failure rate (AFR) for all the drives listed above is 1.39%. That is the same as last quarter and down from 1.45% one year ago (6/30/2021).

A quick glance down the AFR column identifies the three drives with the highest failure rates:
• The 8TB HGST (model: HUH728080ALE604) at 6.26%.
• The 14TB Seagate (model: ST14000NM0138) at 4.86%.
• The 16TB Toshiba (model: MG08ACA16TA) at 3.57%.

What do these three models have in common? The sample size, drive days in our case, is too small, and in all three cases this leads to a wide range between the low and high confidence interval values. The wider the gap, the less confident we are in the AFR in the first place.

In the table above, we list all of the models for completeness, but that does make the chart more complex. We like to keep things simple, so let’s remove the drive models with wide confidence intervals and include only drive models that are still available. We’ll set our parameters as follows: a 95% confidence interval gap of 0.5% or less, a minimum of 1 million drive days to ensure a large enough sample size, and drive models that are 8TB or larger. The simplified chart is below.

[Figure 2: Backblaze Drive Stats 2Q22, lifetime HDD failure rates for drive models with narrow confidence intervals]

To summarize, in our environment, we are 95% confident that the AFR listed for each drive model is between the low and high confidence interval values.

Computing the AFR
We use the term AFR throughout our Drive Stats reports. Let’s spend a minute explaining how we calculate the AFR value and why we do it the way we do. The formula for a given cohort of drives is:

AFR = (drive_failures/(drive_days/365))*100

Let’s define the terms used:
Cohort of drives: The selected set of drives (typically by model) for a given period of time (quarter, annual, lifetime).
AFR: The annualized failure rate calculated for the selected cohort of drives.
drive_failures: The number of failed drives for the selected cohort of drives.
drive_days: The number of days all of the drives in the selected cohort are operational during the defined period of time of the cohort (i.e., quarter, annual, lifetime).

For example, for the 16TB Seagate drive in the table above, we have calculated there were 117 drive failures and 4,117,553 drive days over the lifetime of this particular cohort of drives. The AFR is calculated as follows:

AFR = (117/(4,117,553/365))*100 = 1.04%
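For readers who prefer code, here is a minimal sketch of that calculation in Python; the function name is ours for illustration and is not part of any Backblaze tooling:

def annualized_failure_rate(drive_failures: int, drive_days: int) -> float:
    """AFR (%) for a cohort: failures divided by drive years, times 100."""
    drive_years = drive_days / 365
    return (drive_failures / drive_years) * 100

# Numbers from the 16TB Seagate example above.
print(f"{annualized_failure_rate(117, 4_117_553):.2f}%")  # prints 1.04%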

Why Don’t We Use Drive Count?
Our environment is very dynamic when it comes to drives entering and leaving the system: a 12TB HGST drive fails and is replaced by a 12TB Seagate; a new Backblaze Vault comes online and 1,200 new 14TB Toshiba drives are added; a vault of 4TB drives is retired; and so on. Using drive count is problematic because it assumes a stable number of drives in the cohort over the observation period. Yes, we concede that with enough math you could make this work, but rather than going back to college, we keep it simple and use drive days, which accounts for changes in the number of drives during the observation period and apportions each drive’s contribution accordingly.

For completeness, let’s calculate the AFR for the 16TB Seagate drive using a drive count-based formula given there were 16,860 drives and 117 failures.

Drive Count AFR = (117/16,860)*100 = 0.69%

While the drive count AFR is much lower, the assumption that all 16,860 drives were present for the entire observation period (lifetime) is wrong. Over the last quarter, we added 3,601 new drives, and over the last year, we added 12,003 new ones; yet all of these would be counted as if they were installed on day one. In other words, using a drive count AFR in our case would misrepresent drive failure rates in our environment.
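To make the gap concrete, here is a small self-contained comparison in Python using the numbers quoted above; the variable names are ours for illustration:

failures = 117
drive_days = 4_117_553   # lifetime drive days for the 16TB Seagate cohort
drive_count = 16_860     # drives in the cohort today

drive_day_afr = failures / (drive_days / 365) * 100   # ~1.04%
naive_count_afr = failures / drive_count * 100        # ~0.69%

print(f"drive day AFR:   {drive_day_afr:.2f}%")
print(f"drive count AFR: {naive_count_afr:.2f}%")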

How We Determine Drive Failure
Today, we classify drive failure into two categories: reactive and proactive. Reactive failures are those where the drive has failed and won’t or can’t communicate with our system. Proactive failures are those where failure is imminent based on errors the drive is reporting, as confirmed by examining the drive’s SMART stats. In that case, the drive is removed before it fails completely.
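Backblaze doesn’t spell out its proactive-failure rules in this post, but as an illustration, a check along these lines might watch the handful of SMART attributes Backblaze has highlighted in earlier posts (5, 187, 188, 197, 198). The attribute list and the simple non-zero threshold below are assumptions for this sketch, not Backblaze’s actual policy:

# Illustrative only: flag a drive for proactive replacement when any watched
# SMART attribute has a non-zero raw value. Attribute list and threshold are
# assumptions, not Backblaze policy.
WATCHED_ATTRIBUTES = {
    5: "Reallocated Sectors Count",
    187: "Reported Uncorrectable Errors",
    188: "Command Timeout",
    197: "Current Pending Sector Count",
    198: "Uncorrectable Sector Count",
}

def proactive_failure_flags(smart_raw: dict[int, int]) -> list[str]:
    """Return the names of watched attributes reporting non-zero raw values."""
    return [name for attr, name in WATCHED_ATTRIBUTES.items()
            if smart_raw.get(attr, 0) > 0]

# Example: a drive reporting pending and uncorrectable sectors.
print(proactive_failure_flags({5: 0, 187: 0, 188: 0, 197: 8, 198: 8}))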

Over the last few years, data scientists have used the SMART stats data we’ve collected to see if they can predict drive failure using various statistical methodologies, and more recently, AI and ML techniques. The ability to accurately predict drive failure, with minimal false positives, will optimize our operational capabilities as we scale our storage platform.

SMART Stats
SMART stands for Self-Monitoring, Analysis, and Reporting Technology and is a monitoring system included in HDDs that reports on various attributes of a given drive’s state. Each day, we record and store the SMART stats reported by the HDDs in our data centers. Check out this post to learn more about SMART stats and how we use them.

2Q22 HDD Failure Rates
For the 2Q22 quarterly report, we tracked 215,011 HDDs, broken down by drive model into 27 different cohorts, using only data from 2Q22. The table below lists the data for each of these drive models.

[Figure 3: Backblaze Drive Stats 2Q22, quarterly HDD failure rates by model]

Notes and Observations on 2Q22 Stats
Breaking news, the OG stumbles: The 6TB Seagate drives (model: ST6000DX000) finally had a failure this quarter – two failures, actually. Given that this is the oldest drive model in our fleet, with an average age of 86.7 months of service, a failure or two is expected. Still, these were the first failures for this drive model since 3Q21. At some point in the future we can expect these drives to be cycled out, but with their lifetime AFR at just 0.87%, they are not first in line.

Another zero for the next OG: The next oldest drive cohort in our collection, the 4TB Toshiba drives (model: MD04ABA400V) at 85.3 months, had zero failures in 2Q22. The last failure was recorded a year ago, in 2Q21. Their lifetime AFR is just 0.79%, although their lifetime confidence interval gap is 1.3%, which, as we’ve seen, means we lack enough data to be truly confident in the AFR number. Still, at one failure per year, they could last another 97 years – probably not.

More zeroes for 2Q: Three other drive models had zero failures this quarter: the 8TB HGST (model: HUH728080ALE604), the 14TB Toshiba (model: MG07ACA14TEY), and the 16TB Toshiba (model: MG08ACA16TA). As with the 4TB Toshiba noted above, these drives have very wide confidence interval gaps driven by a limited number of data points. For example, the 16TB Toshiba had the most drive days – 32,064 – of any of these drive models, and we would need at least 500,000 drive days in a quarter to reach a usefully tight 95% confidence interval. Still, it is entirely possible that any or all of these drive models will continue to post great numbers over the coming quarters; we’re just not 95% confident yet.
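Backblaze doesn’t describe its confidence interval calculation here, but one common approach is to treat failures as a Poisson process over drive days and use the exact (chi-squared) Poisson interval. The sketch below is an assumption for illustration, not necessarily the method behind the tables above:

# Exact Poisson (chi-squared) confidence interval for AFR, expressed in percent.
# An illustrative assumption, not necessarily Backblaze's method.
from scipy.stats import chi2

def afr_confidence_interval(failures: int, drive_days: int, confidence: float = 0.95):
    alpha = 1 - confidence
    drive_years = drive_days / 365
    lower = chi2.ppf(alpha / 2, 2 * failures) / 2 if failures > 0 else 0.0
    upper = chi2.ppf(1 - alpha / 2, 2 * (failures + 1)) / 2
    return (lower / drive_years * 100, upper / drive_years * 100)

# Zero failures over 32,064 drive days still leaves a wide interval;
# 500,000 drive days narrows it considerably.
print(afr_confidence_interval(0, 32_064))    # about (0.0, 4.2)
print(afr_confidence_interval(0, 500_000))   # about (0.0, 0.27)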

Running on fumes: The 4TB Seagate drives (model: ST4000DM000) are starting to show their age, 80.3 months on average. Their quarterly failure rate has increased in each of the last four quarters, reaching 3.42% this quarter. We have deployed our drive cloning program (https://www.backblaze.com/blog/life-and-times-of-a-backblaze-hard-drive/) for these drives as part of our data durability program, and over the next several months these drives will be cycled out. They have served us well, but it appears they are tired after nearly seven years of constant spinning.

The AFR increases, again: In 2Q22, the AFR increased to 1.46% for all drive models combined. This is up from 1.22% in 1Q22 and from 1.01% in 2Q21. The aging 4TB Seagate drives are part of the increase, but the failure rates of both the Toshiba and HGST drives have increased as well over the last year. This appears to be related to the aging of the entire drive fleet, and we would expect this number to go down as older drives are retired over the next year.

Four Thousand Storage Servers
In the opening paragraph, we noted there were 4,020 boot drives. What may not be obvious is that this equates to 4,020 storage servers. These are 4U servers with 45 or 60 drives each, with drives ranging in size from 4TB to 16TB. The smallest holds 180TB of raw storage (45 x 4TB drives) and the largest 960TB of raw storage (60 x 16TB drives). These servers are a mix of Backblaze Storage Pods and third-party storage servers. It’s been a while since our last Storage Pod update, so look for something in late 3Q or early 4Q.

HDD Stats Data
The complete data set used to create the information in this review is available on the HDD Test Data page. You can download and use this data for free for your own purposes. If you just want the tables and charts used in this report, you can download the .zip file from Backblaze B2 Cloud Storage, which contains the .jpg and/or .xlsx files as applicable.
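As a rough sketch of working with that data set, the snippet below computes per-model AFR for a quarter from the daily CSV files (one row per drive per day). The date, model, and failure column names follow the published files, but the folder name is a placeholder; verify both against your download:

# Sketch: per-model quarterly AFR from the public Drive Stats daily CSVs.
# "data_Q2_2022" is a placeholder path for the extracted quarterly download.
import glob
import pandas as pd

frames = [pd.read_csv(path, usecols=["date", "model", "failure"])
          for path in glob.glob("data_Q2_2022/*.csv")]
daily = pd.concat(frames, ignore_index=True)

per_model = daily.groupby("model").agg(
    drive_days=("failure", "size"),   # one row per drive per day
    failures=("failure", "sum"),
)
per_model["afr_pct"] = per_model["failures"] / (per_model["drive_days"] / 365) * 100

print(per_model.sort_values("afr_pct", ascending=False).head(10))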
