Three key strategies to prevent RAID failure
In this tip, analyst Jon Toigo shares several ways storage admins can proactively prevent the failure of RAID arrays.
Concerns have been growing in contemporary storage architectures about the vulnerability of RAID technology to additional drive failures that strike within the window needed to recover from a single or double disk failure. At the same time, the technology remains the first line of defense in most storage infrastructures.
This technical tip includes three key strategies data storage managers can use to help prevent RAID failure.
The vulnerabilities of RAID storage need to be assessed in a clear-headed manner so that risks can be mitigated through a combination of good architectural practices and supplementary protection measures. A good starting point is the data that can be gleaned from a root-cause analysis.
Root-cause analyses of storage failures can be useful in identifying common-sense measures to bolster RAID storage resiliency. Until recently, disk and array vendors kept internal failure analysis data fairly close to the vest. But a study undertaken by researchers at the University of Illinois' Department of Computer Science examined more than 34,000 storage array failures and illuminated their most common categories.
According to the study, array failures result from disruptions in the protocols used to transfer data from application software to disk; failures of interconnect devices (including controllers, power supplies, fans and cabling); and failures in the disk drives themselves. In high-end storage systems, disk drive failures are the leading cause of array failures, followed closely by interconnect failures. Midrange arrays show roughly the same failure rates, while low-end storage systems exhibit higher failure rates in the interconnect category -- presumably because those devices do not ship with, or support, redundant interconnect components.
While disk failures are the leading cause of array failures in midrange and high-end arrays, some percentage of these failures may be the result of erroneous error-detection processes that mark a drive as bad even though it is working properly. That said, any error that takes a disk out of service in a RAID volume kicks off a recovery process: replacing the bad drive and rebuilding the RAID volume. That takes time, and statistically the chances are higher than previously thought that a second or third drive will fail before the rebuild completes.
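To put that risk in rough terms, the short sketch below estimates the odds that at least one surviving drive fails during a RAID 5 rebuild window. The annualized failure rate, rebuild time and array size are illustrative assumptions, not vendor figures, and drive failures are treated as independent.

```python
# Rough, back-of-the-envelope estimate of a second drive failure
# occurring during a RAID 5 rebuild. All inputs are illustrative
# assumptions, not measured vendor figures.

AFR = 0.02            # assumed annualized failure rate per drive (2%)
HOURS_PER_YEAR = 8760
REBUILD_HOURS = 24    # assumed time to replace the drive and rebuild
SURVIVING_DRIVES = 7  # remaining drives in an 8-drive RAID 5 set

# Probability a single drive fails during the rebuild window,
# treating the failure rate as uniform over the year.
p_one = AFR * (REBUILD_HOURS / HOURS_PER_YEAR)

# Probability that at least one of the surviving drives fails
# before the rebuild completes (independent failures assumed).
p_any = 1 - (1 - p_one) ** SURVIVING_DRIVES

print(f"Per-drive failure probability during rebuild: {p_one:.6f}")
print(f"Chance of losing a second drive mid-rebuild:  {p_any:.4%}")
```

Whatever the inputs, the exposure scales with the number of surviving drives and with the length of the rebuild window.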
Three ways you can help prevent a RAID failure
1. Don't set up RAID volumes using all drives from the same tray. Trays or shelves of drives in a storage array are typically populated with sequentially numbered drives that came off the same manufacturing line, one after another. Building a RAID volume across a set of sequential drives exposes it to a higher potential for serial drive failures if those failures stem from a manufacturing defect that affected every drive produced during a particular production cycle.
Instead of creating a RAID volume from all the drives on the first tray, use drives from different trays if possible. This simple hedge might reduce catastrophic RAID volume failures.
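As a simple illustration of that approach, the Python sketch below selects RAID members round-robin across trays rather than taking consecutive drives from a single tray. The tray layout and device names are hypothetical; in practice, the mapping would come from your array's management tooling.

```python
# Minimal sketch: pick RAID members round-robin across trays rather than
# taking consecutive drives from a single tray. The tray layout and
# device names below are hypothetical examples.

trays = {
    0: ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"],
    1: ["/dev/sde", "/dev/sdf", "/dev/sdg", "/dev/sdh"],
    2: ["/dev/sdi", "/dev/sdj", "/dev/sdk", "/dev/sdl"],
}

def pick_members(trays, count):
    """Select `count` drives, cycling through trays so that no two
    consecutive members come from the same tray."""
    members = []
    slot = 0
    while len(members) < count:
        for tray_id in sorted(trays):
            drives = trays[tray_id]
            if slot < len(drives) and len(members) < count:
                members.append(drives[slot])
        slot += 1
    return members

# A six-drive set drawn from three trays instead of one:
print(pick_members(trays, 6))
# ['/dev/sda', '/dev/sde', '/dev/sdi', '/dev/sdb', '/dev/sdf', '/dev/sdj']
```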
2. Leverage self-monitoring, analysis and reporting technology (SMART) and other disk monitoring technologies to spot disk failures. These technologies can also collect warnings about potential disk failures and other developing error conditions so that bad disks can be replaced expeditiously. The faster a disk failure is detected, the sooner the rebuild process can be initiated and completed. Too many companies ignore built-in error notifications, such as SMART messages.
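One lightweight way to act on those notifications is to poll drive health on a schedule and flag anything suspect for replacement. The sketch below is a minimal example that assumes smartmontools is installed and that the list of devices to check is supplied by the administrator.

```python
# Minimal sketch: poll SMART overall-health status for a set of drives
# using smartctl (from smartmontools) and flag anything suspect so the
# drive can be replaced before a rebuild is forced under pressure.
# The device list is an assumption; adjust it to your environment.

import subprocess

DRIVES = ["/dev/sda", "/dev/sdb", "/dev/sdc"]  # hypothetical device names

def smart_health(device):
    """Return True if smartctl reports the drive as healthy."""
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True, text=True,
    )
    # ATA drives report "PASSED" in the overall-health line; other drive
    # types word the status differently, so adapt the check as needed.
    return "PASSED" in result.stdout

for drive in DRIVES:
    if smart_health(drive):
        print(f"{drive}: SMART health check passed")
    else:
        print(f"{drive}: WARNING - failed or unreadable SMART status; "
              "schedule a replacement before the array degrades")
```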
3. Don't force drives back into service once your RAID software or controller has marked them as nearing failure or as already bad. Seagate and others report that the bulk of the so-called bad drives returned to them for replacement have no actual errors -- they were flagged by erroneous RAID management processes, operating system health monitors or the applications themselves while the drive was perfectly healthy -- but why take the chance? Forcing a suspect drive back into operation only exposes the RAID set to a subsequent failure of that drive, and to secondary or tertiary drive failures, before the situation can be fixed.