



What’s up with MTBF?

The most likely component to fail in a computer system is a hard disk drive. As in your car, the mechanical parts are more likely to fail than the purely electronic ones. For example, a wheel bearing in your car is more likely to fail than the computer chip that monitors your engine's performance. Similarly, a spindle bearing in a hard drive is more likely to break than the processor on a disk controller. Mean Time Between Failures (MTBF) is a rating used in the computer industry to predict how likely a failure is. More specifically, the MTBF is the predicted average operating time between failures.

The more drives you have in a storage system, the more likely it is that one of the drives will fail. As a simple example, let's say you have a hard drive with an MTBF of four hours. In a system with two of these drives, you have twice the exposure to a drive failure, so the MTBF of the system is cut in half to two hours. Similarly, a system with four of these drives has an MTBF of one hour. Thus we arrive at the industry-accepted formula for the MTBF of a system composed of a number of identical parts: take the MTBF of a component and divide it by the number of those components in the system.
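The divide-by-the-count rule can be sketched in a few lines of Python. The function name `system_mtbf` is made up for this illustration, and the four-hour drive is the same hypothetical one used in the example above.

```python
def system_mtbf(component_mtbf_hours: float, component_count: int) -> float:
    """MTBF of a system built from identical components.

    Applies the simple rule from the text: divide the MTBF of one
    component by the number of those components in the system.
    """
    if component_count < 1:
        raise ValueError("component_count must be at least 1")
    return component_mtbf_hours / component_count


# Hypothetical drive with an MTBF of four hours, as in the example.
for drives in (1, 2, 4):
    print(f"{drives} drive(s): system MTBF = {system_mtbf(4.0, drives)} hours")
```

Doubling the number of drives halves the result each time, which matches the two-hour and one-hour figures in the example.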

As you probably guessed, the MTBF for hard drives in the real world is much greater than the four-hour MTBF used in the hypothetical example above. Modern SCSI and Fibre Channel drives tend to have MTBF ratings of a million hours or more, while ATA and SATA drives tend to be in the half-million to one-million-hour range. For discussion purposes, let's consider a drive with an MTBF rating of a million hours in an enterprise-class storage system with 100 drives. And let's imagine that the storage system is serving all of the mission-critical application needs of a medium-sized enterprise (email, finance, order entry and HR). To calculate the MTBF of the system, we take the MTBF of the drive (1 million hours) and divide it by the number of drives (100) to get a system MTBF of 10,000 hours. Ten thousand hours sounds like a long time, but that's only about 417 days, or just over a year. With no protection from data loss due to a hard drive failure, that means our hypothetical company's IT operations will suffer a catastrophic loss of data every year or so, on average. As we approach the capacities deployed in today's enterprise-class systems, which support a thousand drives or more, the expected time between drive failures shrinks to 1,000 hours, or just over a month. It's easy to see that as the number of drives in a data center increases, it is no longer a matter of if a hard drive will fail, but of how often drives will fail.
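The arithmetic for the 100-drive and 1,000-drive systems can be checked with a short Python sketch. The helper names are hypothetical, and the one-million-hour rating is the discussion figure from the text, not a measured value.

```python
HOURS_PER_DAY = 24


def system_mtbf_hours(drive_mtbf_hours: float, drive_count: int) -> float:
    """System MTBF under the simple rule: drive MTBF / number of drives."""
    return drive_mtbf_hours / drive_count


def hours_to_days(hours: float) -> float:
    """Convert a duration in hours to days."""
    return hours / HOURS_PER_DAY


DRIVE_MTBF_HOURS = 1_000_000  # discussion figure from the text

for drive_count in (100, 1000):
    hours = system_mtbf_hours(DRIVE_MTBF_HOURS, drive_count)
    days = hours_to_days(hours)
    print(f"{drive_count} drives: {hours:,.0f} hours, about {days:.1f} days")
```

This prints roughly 416.7 days for 100 drives and 41.7 days for 1,000 drives, so the larger system can expect a drive failure a little less often than once a month.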

RAID algorithms are designed to reduce the risk of data loss due to a hard drive failure. When the inventors of RAID at the University of California, Berkeley first proposed the RAID concept in the 1980s, the published MTBF calculations for each RAID algorithm were a key part of their findings. RAID algorithms dramatically increase the MTBF of a system composed of many hard drives (except for RAID-0 striping, which improves performance but not protection). By storing redundant data (mirrored copies or parity) on extra drives, the system is able to survive a hard drive failure. With traditional RAID algorithms, however, once a hard drive fails the group has no redundancy left, so the MTBF of the system drops again until the failed drive is replaced and rebuilt. Emerging RAID-6 solutions address this problem by adding a second set of redundant data to each RAID group, so that two drive failures in the same group can be tolerated without data loss.
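To get a feel for how much redundancy helps, here is a rough Python sketch based on a commonly used textbook approximation for mean time to data loss (MTTDL). It is not necessarily the exact set of formulas from the Berkeley work. It assumes drives fail independently at a constant rate and that the repair time is tiny compared with the drive MTBF. The group size of 10 drives and the 24-hour rebuild time are made-up inputs for illustration, as are the function names.

```python
def mttdl_unprotected(drive_mtbf: float, group_size: int) -> float:
    """No redundancy: data is lost at the first drive failure."""
    return drive_mtbf / group_size


def mttdl_single_redundancy(drive_mtbf: float, group_size: int,
                            repair_hours: float) -> float:
    """Group survives one failure; data is lost if a second drive
    fails before the first one is replaced and rebuilt."""
    return drive_mtbf ** 2 / (group_size * (group_size - 1) * repair_hours)


def mttdl_double_redundancy(drive_mtbf: float, group_size: int,
                            repair_hours: float) -> float:
    """Group survives two failures (RAID-6 style); data is lost only
    if a third drive fails during the rebuild window."""
    return drive_mtbf ** 3 / (
        group_size * (group_size - 1) * (group_size - 2) * repair_hours ** 2
    )


# Assumed inputs, for illustration only.
DRIVE_MTBF = 1_000_000   # hours, the discussion figure from the text
GROUP_SIZE = 10          # drives per RAID group (assumption)
REPAIR_HOURS = 24        # time to replace and rebuild a drive (assumption)

print(f"unprotected:       {mttdl_unprotected(DRIVE_MTBF, GROUP_SIZE):.3e} hours")
print(f"single redundancy: "
      f"{mttdl_single_redundancy(DRIVE_MTBF, GROUP_SIZE, REPAIR_HOURS):.3e} hours")
print(f"double redundancy: "
      f"{mttdl_double_redundancy(DRIVE_MTBF, GROUP_SIZE, REPAIR_HOURS):.3e} hours")
```

Under these assumptions each extra level of redundancy raises the estimate by several orders of magnitude. Real systems fall short of such idealized numbers because failures are not perfectly independent, but the trend is why the extra drives are worth paying for.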

If you are deploying mission-critical applications over a large number of drives, you may want to ask your storage system vendor about RAID-6. It costs a bit more to protect your data using RAID-6, but for the most mission-critical applications, the cost of the added insurance may be easy to justify. Unless you are a storage geek like me, you probably don't need to worry about MTBF calculations and RAID algorithms in the real world, but trust me when I say that the engineers at your favorite storage company have worked over the numbers to ensure that your applications run smoothly through a hard drive failure.