Why RAID Systems Are Dangerous and Why You Should Stop Using Them, According to WD
Rebuild times astronomical, data corruption enemy, erasure coding is better.
By Francis Pelletier | March 5, 2019 at 2:38 pm
By Mike McWhorter, senior technologist, Western Digital Corp.
RAID has been around for a long time, and at this point, most people consider it to be a reliable and well understood technology.
But as storage devices continue to get larger, a lot of the benefits of RAID begin to diminish.
These days, we’re measuring storage systems in the petabytes, and unfortunately, RAID was not designed to operate at this kind of scale. If you attempt to use these new high capacity devices in a traditional RAID, you may run into some unpleasant surprises, and I wanted to share them with you so that you don’t have to find out about them the hard way.
Some things can’t scale in the data center
Let’s start with the most obvious one. The rebuild times are astronomical in the data center. In the past, when you had a drive failure, all you had to do was swap out the bad drive, rebuild the RAID, and you could reestablish your data protection in a couple of hours.
When you’re using smaller drives (say, less than 1TB), this works fairly well. But if you’re running multiple arrays of very large drives, the rebuild time can be days instead of hours. That is a much larger window of vulnerability, and the chance of losing an additional drive in the array before the RAID finishes rebuilding goes up with it, which can and does happen. And yes, you could mitigate the risk with RAID 6, but there is actually a much bigger problem here that most people are not aware of. When you use these large drives in a RAID configuration, there can be an increased risk for loss of data. Let me explain.
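To put rough numbers on the rebuild window, here is a minimal back-of-the-envelope sketch. It assumes a rebuild has to rewrite the entire replacement drive at a steady rate. The function name and the 50 and 150 MB/s rates are illustrative assumptions, not measured or vendor figures, and real rebuilds on a busy array are often slower.

```python
def rebuild_hours(capacity_tb: float, rebuild_mb_per_s: float) -> float:
    """Hours needed to rewrite one full drive at a sustained rebuild rate."""
    capacity_bytes = capacity_tb * 1e12            # decimal TB, as on drive labels
    seconds = capacity_bytes / (rebuild_mb_per_s * 1e6)
    return seconds / 3600


# Assumed sustained rebuild rates (hypothetical values for illustration only).
for capacity_tb in (1, 15):
    for rate in (50, 150):
        hours = rebuild_hours(capacity_tb, rate)
        print(f"{capacity_tb:>2} TB at {rate:>3} MB/s: {hours:6.1f} hours")
```

Under these assumptions a 1TB drive rebuilds in a few hours at most, while a 15TB drive at 50 MB/s needs more than three days, which is the gap between "a couple of hours" and "days" described above.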
Loss of data is your enemy
Every storage device, whether it’s a hard drive or SSD, has a known bit error rate. You’ll see it listed on the spec sheet as the UBER – Unrecoverable Bit Error Rate. Drive manufacturers use strict manufacturing tolerances to keep these errors to a minimum, but it’s impossible to eliminate them completely.
The standard bit error rate on a modern hard drive is less than 1 in 10^15. This means that potentially one bit out of every 1,000,000,000,000,000 bits (about 125TB) cannot be read back correctly. For a 100GB drive, this is relatively low risk, since reaching 125TB means transferring the drive’s full capacity more than a thousand times. But what about a 15TB drive? What if you had 24 of them in a RAID configuration, and were writing to and reading from them constantly in a high-volume production environment for years at a time? Suddenly 125TB doesn’t seem like such a large number, and the odds of you hitting one of those unrecoverable bit errors begin to increase.
This is the problem with RAID systems that are hundreds of terabytes in size. When you use these large drives in a RAID configuration over a long enough time period, loss of data isn’t just a possibility, it is a near statistical certainty. If you run it like that for long enough, you WILL risk not being able to read your data. It’s only a matter of time.
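The "near statistical certainty" claim can be checked with a short calculation. The sketch below assumes bit errors are independent and occur at exactly the rated 1 in 10^15 (the spec sheet figure is an upper bound, so this is a pessimistic estimate), and it asks how likely at least one unrecoverable error is after transferring a given amount of data. The function name is a hypothetical helper, not part of any vendor tool.

```python
import math


def p_at_least_one_error(bytes_transferred: float, uber: float = 1e-15) -> float:
    """Probability of one or more unrecoverable bit errors.

    Computes 1 - (1 - uber) ** bits in a numerically stable way,
    assuming independent errors at exactly the rated UBER.
    """
    bits = bytes_transferred * 8
    return -math.expm1(bits * math.log1p(-uber))


scenarios = {
    "one pass over a 100GB drive": 100e9,
    "one pass over a 15TB drive": 15e12,
    "one pass over 24 x 15TB (360TB)": 24 * 15e12,
}
for label, size in scenarios.items():
    print(f"{label}: {p_at_least_one_error(size):.2%}")
```

Under these assumptions a single full pass over the small drive carries well under a 1% chance of an error, while a single full pass over the 360TB array is above 90%, and repeated passes over months or years push it closer still to certainty.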
How can you create a large volume and still maintain your data integrity? We at Western Digital are one of the world’s leading experts and manufacturers of hard drives and SSDs. We also develop storage enclosures and systems to best leverage our knowledge in drive technology and deliver superior solutions. Here’s how our high-capacity storage systems, such as ActiveScale™ and IntelliFlash™, address this problem.
Beyond RAID – IntelliFlash data protection
Let’s start with IntelliFlash. This is Western Digital’s highest performing storage system and it provides traditional block and file access in a single appliance. To protect against loss of data, IntelliFlash uses a checksum matching algorithm to detect and repair corrupted data. Here’s how it works.
Every time a new block of data is written, a checksum is calculated and stored for that block. Read operations are verified against the checksums, and if a mismatch is detected, the damaged block is automatically repaired using data from the parity disk. It does this transparently when the data is read, and then returns the repaired block to the requester. With IntelliFlash, your data is completely self-healing, so the user never has to worry about silent data corruption.
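The checksum-and-repair flow described above can be sketched in a few lines. This is not IntelliFlash's actual implementation: the `ChecksummedStripe` class, the choice of CRC32 as the checksum, and the single XOR parity block are all assumptions made to keep the example small.

```python
import zlib
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


class ChecksummedStripe:
    """Toy stripe: data blocks, one checksum per block, one XOR parity block."""

    def __init__(self, blocks):
        self.blocks = [bytes(b) for b in blocks]
        # A checksum is calculated and stored for every block at write time.
        self.checksums = [zlib.crc32(b) for b in self.blocks]
        self.parity = xor_blocks(self.blocks)

    def read(self, index):
        block = self.blocks[index]
        if zlib.crc32(block) == self.checksums[index]:
            return block                      # checksum matches, data is good
        # Mismatch: rebuild the block from the other blocks plus parity.
        others = [b for i, b in enumerate(self.blocks) if i != index]
        repaired = xor_blocks(others + [self.parity])
        if zlib.crc32(repaired) != self.checksums[index]:
            raise IOError(f"block {index} is unrecoverable")
        self.blocks[index] = repaired         # heal the stored copy
        return repaired


stripe = ChecksummedStripe([b"AAAA", b"BBBB", b"CCCC"])
stripe.blocks[1] = b"BBXB"                    # simulate silent corruption
print(stripe.read(1))                         # repaired transparently on read
```

With a single parity block this sketch can repair one damaged block per stripe; if two blocks are corrupted at once, the repaired data fails its own checksum and the read raises an error instead of returning bad data.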
ActiveScale – From RAID systems to Erasure Coding
ActiveScale, Western Digital’s cloud object storage system, takes a different approach. This is our largest capacity storage array, and it uses a technique called erasure coding to prevent data corruption. If you’ve never heard of erasure coding, you can think of it as an advanced form of RAID. The algorithm works like this.
First, each file is divided into shards, with each shard being placed on a different disk. Next, additional shards are created which contain error correction information. Using these extra ‘error correction’ shards, the algorithm is able to detect and correct multiple bit errors within a single shard, providing extremely strong protection against data corruption and up to 19 nines of data durability. In addition, an erasure-coded volume can survive more disk failures than RAID and makes very efficient use of disk space, making it the ideal choice for petabyte-scale storage.
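To make the shard idea concrete, here is a toy erasure code in the Reed-Solomon family. It is a sketch under simplifying assumptions, not the algorithm ActiveScale uses: it does arithmetic modulo the prime 257 rather than in the GF(2^8) field that production codes typically use, it only handles shards that are known to be lost (erasures) rather than locating bit errors inside a shard, and the names `encode`, `recover` and `lagrange_eval` are hypothetical. Each byte position across the k data shards defines a polynomial, and the m extra shards store further points on it, so any k surviving shards are enough to rebuild the data.

```python
P = 257  # prime just above 255, so every byte value is a valid field element


def lagrange_eval(points, x):
    """Evaluate at x the unique polynomial through [(xi, yi), ...] modulo P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P  # needs Python 3.8+
    return total


def encode(data_shards, m):
    """Return the k data shards followed by m extra shards (lists of ints)."""
    k = len(data_shards)
    length = len(data_shards[0])
    shards = [list(s) for s in data_shards]
    for x in range(k, k + m):
        shards.append([
            lagrange_eval([(i, s[pos]) for i, s in enumerate(data_shards)], x)
            for pos in range(length)
        ])
    return shards


def recover(shards, k):
    """Rebuild the k data shards from any k survivors (None marks a lost shard)."""
    alive = [(x, s) for x, s in enumerate(shards) if s is not None][:k]
    if len(alive) < k:
        raise ValueError("too many shards lost")
    length = len(alive[0][1])
    return [
        bytes(lagrange_eval([(x, s[pos]) for x, s in alive], i)
              for pos in range(length))
        for i in range(k)
    ]


data = [b"peta", b"byte", b"data"]      # k = 3 data shards
shards = encode(data, m=2)              # 5 shards in total, one per disk
shards[0] = None                        # lose one data shard...
shards[3] = None                        # ...and one extra shard
assert recover(shards, k=3) == data
```

In this example three data shards plus two extra shards tolerate the loss of any two disks with about 67% storage overhead, and raising m increases the number of failures the volume can survive without mirroring the data.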
Don’t lose your data
Whether you’re using block, file, or object-based storage, Western Digital has the technology and experience to ensure that your data is always protected. We are in the position of being the only vendor that manufactures nearly all of the components of the system throughout the entire technology stack. From the components that make our hard drives and SSDs reliable to the expansion of capacity through our helium-sealed technology and all the way up the system stack, Western Digital innovates to deliver the most reliable storage solutions on the planet.