Everything you know about disks is wrong
Two bombshell papers released at the USENIX FAST '07 (File and Storage Technologies) conference this week bring a welcome dose of reality to the basic building block of storage: the disk drive.
Together the two papers are 29 pages of dense computer science, with lots of info on populations, statistical analysis, and related arcana. I recommend both papers. This summary, and two longer analyses at StorageMojo, cover what I found interesting.
The first conference paper, Failure Trends in a Large Disk Drive Population (pdf), comes from researchers at Google and looks at a 100,000-drive population of Google PATA and SATA drives. Remember that these drives are in professionally managed, Class A data centers and, once powered on, are almost never powered down. So conditions should be nearly ideal for maximum drive life.
The most interesting results came in five areas:
- The validity of manufacturers' MTBF specs
- The usefulness of SMART statistics
- Workload and drive life
- Age and drive failure
- Temperature and drive failure
MTBF
Google found that Annual Failure Rates were quite a bit higher than vendor MTBF specs suggest. For a 300,000-hour MTBF, one would expect an AFR of 1.46%, but the best the Googlers observed was 1.7% in the first year, rising to over 8.6% in the third year.
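The MTBF-to-AFR conversion can be sketched as follows. This is a minimal illustration, assuming a constant failure rate (the exponential model an MTBF spec implies) and 8,760 powered-on hours per year; the function name `afr_from_mtbf` is mine, not the paper's.

```python
import math

HOURS_PER_YEAR = 8760  # assumption: the drive is powered on all year


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Annualized failure rate implied by an MTBF, assuming a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


# Print the implied AFR for a few datasheet-style MTBF values.
for mtbf in (300_000, 600_000, 1_000_000):
    print(f"{mtbf:>9,} h MTBF -> AFR {afr_from_mtbf(mtbf):.2%}")
```

Under these assumptions, a 1.46% AFR corresponds to an MTBF of roughly 600,000 hours, while 300,000 hours works out to nearly 2.9%. Either way, the 8.6% observed in the third year is well above both figures.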
SMART: not very smart
SMART (Self-Monitoring, Analysis, and Reporting Technology) is supposed to capture drive error data to predict failure. The authors found that several SMART errors were strong predictors of ensuing failure:
- scan errors
- reallocation count
- offline reallocation
- probational count
For example, after the first scan error, they found a drive was 39 times more likely to fail in the next 60 days. The other three correlations are less striking, but still significant. The problem: even these four predictors miss over 50% of drive failures. If you get one of these errors, replace your drive, but not getting one doesn't mean you are safe. SMART is simply not reliable.
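That rule of thumb can be written down as a small check. This is only a sketch: the counter names below are hypothetical labels for the four signals listed above, and real SMART attribute names and IDs vary by vendor and by the tool used to read them.

```python
# Hypothetical names for the four predictive signals; not real SMART attribute IDs.
PREDICTIVE_COUNTERS = (
    "scan_errors",
    "reallocation_count",
    "offline_reallocation",
    "probational_count",
)


def assess(counters: dict) -> str:
    """Return "replace" if any predictive counter is nonzero, else "no warning"."""
    if any(counters.get(name, 0) > 0 for name in PREDICTIVE_COUNTERS):
        return "replace"
    return "no warning"


print(assess({"scan_errors": 1}))         # one scan error is enough to act on
print(assess({"reallocation_count": 0}))  # clean counters: no warning
```

Note the asymmetry: "replace" is worth acting on, but "no warning" is not a clean bill of health, since these signals miss over half of the failures.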
Workload and drive life
Defining workload isn't easy, but the good news is that the Googlers didn't find much of a correlation.
After the first year, the AFR of high utilization drives is at most moderately higher than that of low utilization drives. The three-year group in fact appears to have the opposite of the expected behavior, with low utilization drives having slightly higher failure rates than high utilization ones.
They did find infant mortality was higher among high-utilization drives. So burn those babies in!
Age and drive failure
The authors note that their data doesn't really answer this question due to the mix of drive types and vendors. Nonetheless, their drive population does show that AFR increases with age.
Hot drives = dead drives?
Possibly the biggest surprise in the Google study is that failure rates do not increase when the average temperature increases. At very high temperatures there is a negative effect, but even that is slight. This might mean cooling costs could be significantly reduced at data centers.
Beyond Google
Google's paper wasn't the only cool storage paper, or even the best. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?, by Bianca Schroeder and Garth Gibson of CMU's Parallel Data Lab, won a "Best Paper" award.
They looked at 100,000 drives, including drives in HPC clusters at Los Alamos and the Pittsburgh Supercomputing Center, as well as at several unnamed Internet service providers. The drives had different workloads, different definitions of "failure" and different levels of data collection, so the data isn't quite as smooth or complete as Google's. Yet it probably looks more like a typical enterprise data center, IMHO. They also included "enterprise" drives in their sample.
Key observations from the CMU paper:
High-end "enterprise" drives versus "consumer" drives?
. . . we observe little difference in replacement rates between SCSI, FC and SATA drives, . . . .
So how much of that 1,000,000 hour MTBF are you actually getting?
Infant mortality?
. . . failure rate is not constant with age, and that, rather than a significant infant mortality effect, we see a significant early onset of wear-out degradation.
The infant mortality effect is slightly different from what Google reported. Both agree on the more important issue of early wear-out.
Vendor MTBF reliability?
While the datasheet AFRs are between 0.58% and 0.88%, the observed ARRs [Annual Replacement Rates] range from 0.5% to as high as 13.5%. . . . up to a factor of 15 higher than datasheet AFRs. Most commonly, the observed ARR values are in the 3% range.
Actual MTBFs?
The weighted average ARR was 3.4 times larger than 0.88%, corresponding to a datasheet MTTF of 1,000,000 hours.
In other words, that 1 million hour MTBF is really about 300,000 hours - about what consumer drives are spec'd at.
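The arithmetic behind that statement is short enough to spell out. This sketch uses the simple linear approximation (annual rate ≈ 8,760 hours divided by MTTF), which is my assumption about how the datasheet figure is derived; the helper names are illustrative.

```python
HOURS_PER_YEAR = 8760  # assumption: drives run around the clock


def annual_rate_from_mttf(mttf_hours: float) -> float:
    """Approximate annual failure rate implied by a datasheet MTTF."""
    return HOURS_PER_YEAR / mttf_hours


def mttf_from_annual_rate(rate: float) -> float:
    """Inverse of the above: the MTTF that a given annual rate implies."""
    return HOURS_PER_YEAR / rate


datasheet_afr = annual_rate_from_mttf(1_000_000)  # about 0.88%
observed_arr = 3.4 * datasheet_afr                # weighted average from the paper
effective_mttf = mttf_from_annual_rate(observed_arr)

print(f"datasheet AFR:  {datasheet_afr:.2%}")
print(f"observed ARR:   {observed_arr:.2%}")
print(f"effective MTTF: {effective_mttf:,.0f} hours")
```

The result is about 294,000 hours, which is where the "about 300,000 hours" figure comes from.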
Drive reliability after burn-in?
Contrary to common and proposed models, hard drive replacement rates do not enter steady state after the first year of operation. Instead replacement rates seem to steadily increase over time.
Drives are mechanical devices and wear out like machines do, not like electronics.
Data safety under RAID 5?
The assumption of data safety behind RAID 5 is that drive failures are independent, so that the likelihood of two drive failures in a single RAID 5 LUN is vanishingly low. The authors found that this assumption is incorrect.
. . . the probability of seeing two drives in the cluster fail within one hour is four times larger under the real data . . . .
In fact, they found that a disk replacement made another disk replacement much more likely.
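To see what the independence assumption predicts, here is a minimal sketch of the textbook calculation. The 8-drive group and 24-hour rebuild window are hypothetical values chosen for illustration, and the model assumes independent, exponentially distributed failures, which is exactly the assumption the paper calls into question.

```python
import math


def p_second_failure(n_drives: int, mttf_hours: float, window_hours: float) -> float:
    """Probability that at least one surviving drive fails within the window,
    assuming independent failures at a constant rate of 1 / mttf_hours."""
    surviving = n_drives - 1
    return 1.0 - math.exp(-surviving * window_hours / mttf_hours)


# Hypothetical array: 8 drives, 24-hour rebuild window.
datasheet = p_second_failure(n_drives=8, mttf_hours=1_000_000, window_hours=24)
field = p_second_failure(n_drives=8, mttf_hours=300_000, window_hours=24)

print(f"datasheet MTTF: {datasheet:.5%}")
print(f"field MTTF:     {field:.5%}")
```

Both numbers come out tiny, which is why RAID 5 looks so safe on paper. If failures cluster in time, as the authors found, the independent model understates the real risk.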
Independence of drive failures in an array?
The distribution of time between disk replacements exhibits decreasing hazard rates, that is, the expected remaining time until the next disk was replaced grows with the time it has been since the last disk replacement.
Translation: one array drive failure means a much higher likelihood of another drive failure. The longer since the last failure, the longer to the next failure. Magic!
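A decreasing hazard rate is easier to picture with numbers. One common way to model it is a Weibull distribution with a shape parameter below 1. The shape and scale values in this sketch are illustrative assumptions, not figures taken from the paper.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate: the instantaneous rate of the next event at time t (t > 0)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


SHAPE = 0.7     # illustrative: any shape below 1 gives a decreasing hazard
SCALE = 1000.0  # hypothetical characteristic time, in hours

# Hazard of the next replacement at increasing times since the last one.
for hours in (1, 10, 100, 1000):
    print(f"{hours:>5} h since last replacement: "
          f"hazard {weibull_hazard(hours, SHAPE, SCALE):.6f} per hour")
```

With a shape of exactly 1 the hazard is constant, which is the memoryless behavior that the independence assumption implies. With a shape below 1, the hazard is highest right after a replacement and falls off from there.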
Let the dialogue begin!
The importance of these papers is that they present real-world results from large drive populations. Vendors have kept drive-reliability data to themselves for what now seem obvious reasons: they've been inflating their numbers. With good field numbers coming out, smart storage and systems folks can start designing for the real world. It's about time.
Comments welcome, of course. Plenty have already been made over at StorageMojo.com.



