Robin Harris

Random Writes

Everything you know about disks is wrong

Two bombshell papers released at the Usenix FAST '07 (File and Storage Technologies) conference this week bring a welcome dose of reality to the basic building block of storage: the disk drive.

Together the two papers are 29 pages of dense computer science with lots of info on populations, statistical analysis, and related arcana. I recommend both papers. This summary, and two longer analyses at StorageMojo, cover what I found interesting.

The first conference paper, from researchers at Google, Failure Trends in a Large Disk Drive Population (pdf), looks at a 100,000-drive population of Google PATA and SATA drives. Remember that these drives are in professionally managed, Class A data centers and, once powered on, are almost never powered down. So conditions should be nearly ideal for maximum drive life.

The most interesting results came in five areas:

  • The validity of manufacturers' MTBF specs
  • The usefulness of SMART statistics
  • Workload and drive life
  • Age and drive failure
  • Temperature and drive failure

MTBF

Google found that annual failure rates (AFRs) were quite a bit higher than vendor MTBF specs suggest. For a 300,000-hour MTBF, one would expect an AFR of 1.46%, but the best the Googlers observed was 1.7% in the first year, rising to over 8.6% in the third year.

SMART: not very

SMART (Self-Monitoring, Analysis, and Reporting Technology) is supposed to capture drive error data to predict failure. The authors found that several SMART errors were strong predictors of ensuing failure:

  • scan errors
  • reallocation count
  • offline reallocation
  • probational count

For example, after the first scan error, they found a drive was 39 times more likely to fail in the next 60 days. The other three correlations are less striking, but still significant. The problem: even these four predictors miss over 50% of drive failures. If you get one of these errors, replace your drive, but not getting one doesn't mean you are safe. SMART is simply not reliable.
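The "replace on the first error" advice can be sketched as a simple check. This is a minimal illustration, not a real monitoring tool: the counter names and the dictionary format are assumptions made up for the example, since real SMART tools report these values under vendor-specific attribute names.

```python
# Hypothetical names for the four signals the Google paper calls out.
# Real SMART attribute names vary by vendor and tool.
CRITICAL_COUNTERS = (
    "scan_errors",
    "reallocation_count",
    "offline_reallocation",
    "probational_count",
)

def critical_signals(counters):
    """Return the names of critical counters that are nonzero."""
    return [name for name in CRITICAL_COUNTERS if counters.get(name, 0) > 0]

# Example drive report (made-up values).
drive = {"scan_errors": 1, "reallocation_count": 0}

flagged = critical_signals(drive)
if flagged:
    print("Replace this drive; nonzero counters:", ", ".join(flagged))
else:
    # A clean report is weak evidence: most failed drives showed none of these.
    print("No critical signals, but that does not mean the drive is safe")
```

Note the asymmetry in the two branches: a nonzero counter is a strong reason to act, while a clean report tells you very little.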

Workload and drive life

Defining workload isn't easy, but the good news is that the Googlers didn't find much of a correlation.

After the first year, the AFR of high utilization drives is at most moderately higher than that of low utilization drives. The three-year group in fact appears to have the opposite of the expected behavior, with low utilization drives having slightly higher failure rates than high utilization ones.

They did find infant mortality was higher among high-utilization drives. So burn those babies in!

Age and drive failure

The authors note that their data doesn't really answer this question due to the mix of drive types and vendors. Nonetheless, their drive population does show AFR increases with age.

Hot drives = dead drives?

Possibly the biggest surprise in the Google study is that failure rates do not increase when the average temperature increases. At very high temperatures there is a negative effect, but even that is slight. This might mean cooling costs could be significantly reduced at data centers.

Beyond Google

Google's paper wasn't the only cool storage paper, or even the best: the paper by Bianca Schroeder and Garth Gibson of CMU's Parallel Data Lab, Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?, won a "Best Paper" award.

They looked at 100,000 drives, including HPC clusters at Los Alamos and the Pittsburgh Supercomputing Center, as well as several unnamed Internet service providers. The drives had different workloads, different definitions of "failure" and different levels of data collection, so the data isn't quite as smooth or complete as Google's. Yet it probably looks more like a typical enterprise data center, IMHO. They also included "enterprise" drives in their sample.

Key observations from the CMU paper:

High-end "enterprise" drives versus "consumer" drives?

. . . we observe little difference in replacement rates between SCSI, FC and SATA drives . . . .

So how much of that 1,000,000 hour MTBF are you actually getting?

Infant mortality?

. . . failure rate is not constant with age, and that, rather than a significant infant mortality effect, we see a significant early onset of wear-out degradation.

The infant mortality effect is slightly different from what Google reported. Both agree on the more important issue of early wear-out.

Vendor MTBF reliability?

While the datasheet AFRs are between 0.58% and 0.88%, the observed ARRs [Annual Replacement Rates] range from 0.5% to as high as 13.5%. . . . up to a factor of 15 higher than datasheet AFRs. Most commonly, the observed ARR values are in the 3% range.

Actual MTBFs?

The weighted average ARR was 3.4 times larger than 0.88%, corresponding to a datasheet MTTF of 1,000,000 hours.

In other words, that 1 million hour MTBF is really about 300,000 hours - about what consumer drives are spec'd at.
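The arithmetic behind that conversion is short enough to check yourself. Here is a minimal sketch, assuming the usual approximation that the annual failure rate is the number of hours in a year divided by the MTBF; the function names are mine, not from either paper.

```python
HOURS_PER_YEAR = 8760  # 24 hours * 365 days

def afr_from_mtbf(mtbf_hours):
    """Approximate annual failure rate (as a fraction) for a given MTBF."""
    return HOURS_PER_YEAR / mtbf_hours

def mtbf_from_afr(afr):
    """Inverse of the above: the MTBF implied by an annual failure rate."""
    return HOURS_PER_YEAR / afr

# Datasheet claim: 1,000,000 hours, which works out to roughly 0.88% per year.
datasheet_afr = afr_from_mtbf(1_000_000)

# The CMU paper's weighted average replacement rate was 3.4 times that.
observed_arr = 3.4 * datasheet_afr

# Convert the observed rate back into an "effective" MTBF.
effective_mtbf = mtbf_from_afr(observed_arr)

print(f"datasheet AFR:  {datasheet_afr:.2%}")
print(f"observed ARR:   {observed_arr:.2%}")
print(f"effective MTBF: {effective_mtbf:,.0f} hours")
```

The effective figure comes out a little under 300,000 hours, which is simply 1,000,000 divided by 3.4.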

Drive reliability after burn-in?

Contrary to common and proposed models, hard drive replacement rates do not enter steady state after the first year of operation. Instead replacement rates seem to steadily increase over time.

Drives are mechanical devices and wear out like machines do, not like electronics.

Data safety under RAID 5?

The assumption of data safety behind RAID 5 is that drive failures are independent so that the likelihood of two drive failures in a single RAID 5 LUN is vanishingly low. The authors found that this assumption is incorrect.

. . . the probability of seeing two drives in the cluster fail within one hour is four times larger under the real data . . . .

In fact, they found that a disk replacement made another disk replacement much more likely.
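To see what the independence assumption actually predicts, here is a rough sketch of the textbook baseline: independent drives with a constant failure rate (an exponential model). The array size, rebuild window, and MTTF values are illustrative assumptions, not numbers from the paper.

```python
import math

def p_second_failure(surviving_drives, window_hours, mttf_hours):
    """Probability that at least one surviving drive fails within the window,
    assuming independent drives with a constant failure rate of 1/MTTF."""
    combined_rate = surviving_drives / mttf_hours
    return 1.0 - math.exp(-combined_rate * window_hours)

# Illustrative assumption: an 8-drive RAID 5 group has lost one drive,
# leaving 7 survivors, and the rebuild takes 24 hours.
p_datasheet = p_second_failure(7, 24, 1_000_000)  # datasheet MTTF
p_field = p_second_failure(7, 24, 300_000)        # MTTF closer to field data

print(f"datasheet MTTF: {p_datasheet:.6f}")
print(f"field MTTF:     {p_field:.6f}")
```

Both numbers look comfortably small, which is why RAID 5 feels safe on paper. The CMU finding is that real failures cluster more tightly than this model allows, so the true risk during a rebuild is higher than either figure.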

Independence of drive failures in an array?

The distribution of time between disk replacements exhibits decreasing hazard rates, that is, the expected remaining time until the next disk was replaced grows with the time it has been since the last disk replacement.

Translation: one array drive failure means a much higher likelihood of another drive failure. The longer since the last failure, the longer to the next failure. Magic!
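If "decreasing hazard rate" sounds abstract, a Weibull distribution with a shape parameter below 1 is one standard way to get that behavior. The sketch below uses illustrative parameter values that I picked for the example; they are not fitted to the paper's data.

```python
def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate at time t for a Weibull distribution."""
    return (shape / scale) * (t / scale) ** (shape - 1)

# Assumed, illustrative parameters: shape < 1 gives a decreasing hazard.
shape, scale = 0.7, 100.0

for hours_since_last in (1, 10, 100, 1000):
    rate = weibull_hazard(hours_since_last, shape, scale)
    print(f"{hours_since_last:>5} h since last replacement: hazard {rate:.5f}")

# With shape == 1 the hazard is constant, which is the exponential model
# that the independence assumption relies on.
constant = weibull_hazard(5, 1.0, 100.0)
```

The hazard is highest right after a replacement and falls as time passes, which matches the "longer since the last failure, longer to the next" pattern.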

Let the dialogue begin!

The importance of these papers is that they present real-world results from large drive populations. Vendors have kept drive-reliability data to themselves for what now seem obvious reasons: they've been inflating their numbers. With good field numbers coming out, smart storage and systems folks can start designing for the real world. It's about time.

Comments welcome, of course. Plenty have already been made over at StorageMojo.com.

What People Are Saying

Evan wrote: "I have 4 of the exact same 36 gig drives for my RAID 5.

I have an extra 9gig drive and another (different brand) 36gig drive.

Can I add all these to the RAID set up, for a total of 6 drives?
Or will the 9gig drive mess things up?
Best regards,"

I think you can add all of these, but be careful with the volume of the RAID 5 hard drive. Mine crashed a few days ago.

I have 4 of the exact same 36 gig drives for my RAID 5.

I have an extra 9gig drive and another (different brand) 36gig drive.

Can I add all these to the RAID set up, for a total of 6 drives?
Or will the 9gig drive mess things up?
Best regards,
Evan

Robin,
interesting information, thanks for sharing.

Interesting,
My experience with a triple mirrored drive farm over 15 years shows little correlation of the RAID 5 type. Logic then says that the RAID 5 type of access (all drives generally seeking in parallel) and the likelihood of all drives being manufactured in the same batch give rise to this problem. The mirrored drives only seek together (more or less) on writes, thus their failure pattern is more evenly distributed over the entire disk farm.
Let me say it again, we have NEVER had two drives of the same mirror fail before the bad drive was replaced and probably not in any short time period thereafter.

the Google study _does_ show a significant effect of temperature. take a look at figure 5 again: old-hot drives have a horrible failure rate. curiously, cold drives are also always moderately less reliable. the sweet spot is around 35C.

You could hardly be more wrong with the motor spin-up theory. The Google team reported in their section on SMART:

"Spin Retries. Counts the number of retries when the drive is attempting to spin up. We did not register a single count within our entire population."

If motor failure due to frequent spin up is a problem, one would expect to see spin retries be greater than zero, a lot greater.




With a 100,000 drive population and access to pretty good data, I think it is difficult to argue that they should "dig deeper". The data is also supported by the CMU paper, and beats the hell out of anything I've ever seen from the drive or array vendors.




As for power supplies, Google specs good, consumer-grade products. So if disk drives aren't compatible with consumer grade power supplies, the vendors should call that out and set standards for disk power. If they haven't, is it because it isn't a problem?

Robin

The data suggest to me that specific environmental or usage patterns drastically reduce the lifetime of a hard drive. The study rules out heat. Two remaining factors are file activity and power supply. An erratic power supply voltage is always bad for electric motors. Hard drives often stand idle for long periods and spin up on demand. Others spin continuously. Frequent spin up from a resting state is probably the worst stress on motors. Constant activity of the head on active popular files is another severe stress. I wish the authors would dig deeper because the cause of failure should be easy to identify.