



Tony Asaro

Technology Matters

Big, Big Storage Systems - Fact or Fiction?

I talked to Dave Fellinger - CTO of Data Direct Networks and the discussion made me think about big, big storage systems.  We are entering the era of PB storage systems.  But what is reality and what is fiction?  There are performance, reliability, footprint and cost issues that must be considered.  Here are some questions to ask:

1. If the storage vendor claims they can support up to a PB of capacity, ask them to explain how the system can run optimally with that much storage.  Ask for real world proof.

2. A storage system with that much capacity will inevitably have disk drive failures.  What happens to performance of primary I/O when a RAID rebuild occurs?  RAID 5?  RAID 6? 

3. How long does the RAID rebuild take?  Days?  Hours? Minutes?  Seconds?  Some next generation storage systems claim to do RAID 6 rebuilds rapidly - in minutes or less.  Ask them how they achieve this and again - get proof.

4. With that many disk drives your chances of silent data corruption will go up.  Check out the CERN report published in April 2007 that analyzes silent data corruption and a great article by Robin Harris that analyzes the numbers.  Based on the findings - a PB of stored data will statistically contain 2,500 corrupt files that you won't even know about - and that is with non-compressed files - the number goes up with compressed files.  This of course is unacceptable.  How does your storage system deal with silent data corruption?  Will it even detect silent data corruption?  Will it fix it once detected?  And do any of these processes impact performance?

Dave had great answers for all of the above.  I also found an ESG Lab report, published in 2008, that analyzed the Data Direct solution and architecture.  I used to be a part of the ESG Lab team so I was curious to read the report, and it validates a lot of what Dave claimed.  For example, the ESG Lab report verified that Data Direct performed a RAID 6 rebuild of a 1 TB drive in just 30 seconds.  That wasn't a typo - a 1 TB RAID 6 rebuild in 30 seconds!  Some storage systems literally take days to perform this task.
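To put the 30 seconds versus days comparison in numbers, here is a rough back-of-the-envelope sketch. It assumes decimal units (1 TB = 10^12 bytes) and that a conventional rebuild rewrites the entire drive. The 10 MB/s rate is a hypothetical figure for a rebuild throttled under production load, not a number from the ESG Lab report, and both function names are made up for illustration.

```python
ONE_TB = 1e12  # bytes, decimal units (assumption)

def implied_rate_gb_per_s(capacity_bytes: float, seconds: float) -> float:
    """Sustained rate needed to process capacity_bytes within `seconds`."""
    return capacity_bytes / seconds / 1e9

def rebuild_hours(capacity_bytes: float, rate_mb_per_s: float) -> float:
    """Time to rewrite capacity_bytes at a given sustained rate."""
    return capacity_bytes / (rate_mb_per_s * 1e6) / 3600

# Rewriting a full 1 TB in 30 seconds would need this many GB/s.
print(round(implied_rate_gb_per_s(ONE_TB, 30), 1))   # 33.3

# A full rewrite at a hypothetical throttled 10 MB/s, in hours.
print(round(rebuild_hours(ONE_TB, 10), 1))           # 27.8
```

The gap between those two numbers is the reason to ask a vendor exactly what work its rebuild performs and to get proof.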

ESG Lab also validated that Data Direct supports up to nearly 1 PB of capacity (in just 2 racks - which is pretty amazing) without a performance hit.  Additionally, the report does a good overall job of explaining the Data Direct technology and they ran performance, reliability, and scalability tests with impressive results.

The claims of PB storage systems must be carefully considered - just physically housing lots and lots of disk drives doesn't mean that the storage system will perform optimally and reliably.  Questions about disk failures, RAID rebuilds and silent data corruption, and about how the storage system handles them - especially one that is going to be used for 100s of TBs and potentially PBs - should be intelligently answered by any storage vendor that wants your business.

What People Are Saying

Clarification

As a brief clarification, Data Direct does not use DIB technology. Disk space is not wasted to guarantee data integrity. The byte-striped data is read across the entire redundancy group and analyzed in real time in a state machine. All recovery operations are executed without loss of performance on the host side. This architecture never relies on any single drive for data or a checksum at any time.

CERN report misinterpreted

You incorrectly summarize the CERN report (and Robin's accurate analysis of it). CERN found that undetected errors were occurring using SATA drives in a storage platform that did not verify the integrity of stored data.

At least some storage arrays (EMC CLARiiON and Symmetrix among them) store additional checksum data (called Data Integrity Bits, or DIBs) along with the data to validate the integrity of the data read back from the drive. If the DIBs and the data don't match upon a read, these arrays will either regenerate the lost data from the RAID protection, or they will report a data integrity error back to the host/application.
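The read path described above (verify on read, regenerate from RAID if possible, otherwise report an error) can be sketched roughly as follows. This is an illustration only, not how any particular array implements it: the `Stripe` class is hypothetical, CRC32 stands in for whatever integrity field a real array stores, and single XOR parity stands in for the RAID protection.

```python
import zlib
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

class Stripe:
    """Hypothetical stripe: N data blocks plus one XOR parity block,
    each stored with a CRC32 standing in for its integrity field."""

    def __init__(self, data_blocks):
        self.blocks = [bytearray(b) for b in data_blocks]
        self.blocks.append(bytearray(xor_blocks(data_blocks)))  # parity is last
        self.sums = [zlib.crc32(bytes(b)) for b in self.blocks]

    def read(self, i):
        block = bytes(self.blocks[i])
        if zlib.crc32(block) == self.sums[i]:
            return block  # integrity field matches the data
        # Mismatch: try to regenerate the block from the rest of the stripe.
        peers = [j for j in range(len(self.blocks)) if j != i]
        if any(zlib.crc32(bytes(self.blocks[j])) != self.sums[j] for j in peers):
            raise IOError("data integrity error: cannot regenerate block")
        rebuilt = xor_blocks([bytes(self.blocks[j]) for j in peers])
        if zlib.crc32(rebuilt) != self.sums[i]:
            raise IOError("data integrity error: rebuilt block fails check")
        self.blocks[i] = bytearray(rebuilt)  # repair the stored copy
        return rebuilt

stripe = Stripe([b"A" * 8, b"B" * 8, b"C" * 8])
stripe.blocks[1][0] ^= 0xFF          # simulate silent corruption on one drive
print(stripe.read(1) == b"B" * 8)    # True: detected and regenerated
```

Without the stored check value, the corrupted block would be returned to the host as if it were good, which is the failure mode the comment above attributes to the CERN setup.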

Not all arrays offer this level of integrity - in fact very few do (most Hitachi and IBM arrays DO NOT include this protection). The Hitachi AMS series actually does write-read-verify for SATA drives, but that does not protect against long-term corruption from bit erosion or cosmic particle bit-flips.

By their own admission, the arrays that CERN was using did not have this added level of protection.

So, yes, it is indeed a very important question to ask your vendor, no matter what size array you are buying.

And note that it doesn't matter if you have one HUGE array or several small ones - the data integrity issues scale with total aggregate capacity. Smaller arrays are no safer than big ones when it comes to data corruption.
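A small sketch of that point, taking the 2,500 corrupt files per PB figure quoted in the article at face value and assuming, as a simplification, that the count scales linearly with capacity. The function name and the array sizes are hypothetical.

```python
CORRUPT_FILES_PER_PB = 2500  # figure quoted in the article, taken as given

def expected_corrupt_files(array_sizes_tb):
    """Expected silently corrupt files across a set of arrays,
    assuming the count scales linearly with total capacity."""
    total_pb = sum(array_sizes_tb) / 1000  # decimal units: 1000 TB = 1 PB
    return CORRUPT_FILES_PER_PB * total_pb

one_big = expected_corrupt_files([1000])        # one 1 PB array
ten_small = expected_corrupt_files([100] * 10)  # ten 100 TB arrays
print(one_big, ten_small)                       # 2500.0 2500.0
```

Under that assumption, splitting the same capacity across more boxes changes where the corrupt files live, not how many there are.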

Storage Anarchist is wrong

Storage Anarchist is wrong about HDS not supporting DIB. We support an 8 byte DIB and have for over two decades for FC drives and from the beginning with our support for SATA drives. EMC needs to do their homework if they are going to make such comments. In fact HDS does both DIB and RAW and now enhanced RAW. According to Storage Anarchist, EMC only supports DIB which is a big mistake on EMC's part since using only DIB can lead to long-term corruption of data.

I stand corrected

Sorry, I apparently missed the documented use of DIBs in Hitachi kit. My bad.

As to Read After Write, that seems to add a rather hefty performance penalty. Do you have any statistics on how frequently RAW actually detects errors?

Instead of RAW, EMC arrays do continuous background integrity checks of all disk (and flash) data. We believe that this approach has less impact on performance and provides greater protection than merely verifying once as does RAW. (Perhaps you've addressed this with "enhanced RAW".)

The real point here is that CERN wasn't using an array that had ANY of this sort of protection, and if they were, it is highly unlikely they would have suffered from their reported data corruption. Storage devices (disks, flash and RAM) are not infallible, and good storage arrays will protect against all kinds of data integrity risks that home-brew storage kits won't.

Response to Storage Anarchist

For anyone interested in Storage Anarchist's comments, I think it is important to know that he is a VP at EMC - thus his references to EMC solutions.

Barry - aka Storage Anarchist - I actually recommend that you re-evaluate your analysis of the CERN report. The CERN report stated that 80% of the errors were due to 64k regions of corrupted data, which appear to be caused by SATA disk drive firmware problems. Are you saying that you believe that DIB addresses this?

I did a bit of research and, as I understand it, DIB works at the block level on the physical disk, so a checksum could validate while the data is still incorrect.

What I like about how Data Direct deals with silent data corruption on SATA drives is that they perform command retries, drive resets, individual drive power recycling and, if needed, a rebuild based on journaling. In this regard they don't manage disk drives individually but as a community - which provides integrity above and beyond checksums at the individual drive level. Also their process doesn't require any additional capacity consumption AND it does not slow down system performance AND is done instantly.

You do make one good point: the total aggregate capacity of lots of "smaller" storage systems presents the same challenges as a single large one. However, you did miss the point I was making - there is value in having a single storage system supporting lots of capacity and therefore a discussion of reliability and scalability with these types of systems is an important topic.

DIBs

DIBs may or may not be managed by the drive itself; for maximum robustness, DIBs and checksums should be driven by the storage controller - this helps protect against buggy drive firmware causing data corruption.

And for the record, Data Direct is not unique in how they manage drives. Many other arrays (including those from EMC) provide features similar to those you list (retry, reset, recycle). Additionally, some arrays (like the Symmetrix) actually generate and validate checksums to ensure end-to-end data integrity for data blocks as they move within the array. Importantly, integrity is assured by implementing multiple overlapping error detection domains, ensuring that the bytes received from the host stay the same all the way out to the disk and back. Although CERN attributed errors to the SATA firmware itself, it is possible that some percentage of the corruptions they found had actually occurred further upstream in the I/O path.

That said, while clearly not infallible, Data Integrity Bytes of sufficient size can indeed be expected to detect the errors found by CERN, if indeed they were caused by corruption localized to the drive or drive firmware.

This is because a SATA drive doesn't store data in 64K byte increments; the native block size is (by current SATA definition) 512 bytes. Using T10-DIF, for example, an additional 8 bytes of DIB would be more than sufficient to detect even multi-bit errors in any of the 512 byte blocks used to store the 64K host writes. Each 64K logical block would be actually stored as 128 512-byte blocks with 128 8-byte DIBs, more than sufficient to identify multiple errors spread across the entire 64K byte block.

Note that on SATA drives with a fixed block size, DIBs are typically stored in separate 512-byte blocks. On SAS and FC drives that support it, the drives are usually formatted with 520 byte blocks and the DIB is stored alongside the data.
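The arithmetic above can be sketched in a few lines. Note the assumptions: real T10-DIF splits its 8 bytes into a 16-bit guard CRC, a 16-bit application tag and a 32-bit reference tag, whereas this illustration packs a CRC32 plus a 32-bit block address into the 8 bytes purely for simplicity. The function names are hypothetical.

```python
import struct
import zlib

BLOCK = 512    # native block size in bytes
DIB_SIZE = 8   # integrity bytes per block

def make_dib(block: bytes, lba: int) -> bytes:
    """Hypothetical 8-byte field: CRC32 of the data + 32-bit block address.
    This is NOT the actual T10-DIF field layout."""
    return struct.pack(">II", zlib.crc32(block), lba & 0xFFFFFFFF)

def protect(data: bytes, start_lba: int = 0):
    """Split a host write into 512-byte blocks, each paired with its DIB."""
    assert len(data) % BLOCK == 0
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    return [(b, make_dib(b, start_lba + n)) for n, b in enumerate(blocks)]

def bad_blocks(stored, start_lba: int = 0):
    """Indices of blocks whose data no longer matches the stored DIB."""
    return [n for n, (b, dib) in enumerate(stored)
            if make_dib(b, start_lba + n) != dib]

stored = protect(bytes(64 * 1024))            # one 64K host write
print(len(stored))                            # 128 blocks
print(sum(len(dib) for _, dib in stored))     # 1024 bytes of DIB in total

# Corrupt two different 512-byte blocks inside the 64K region.
for n in (3, 100):
    data, dib = stored[n]
    stored[n] = (b"\x01" + data[1:], dib)
print(bad_blocks(stored))                     # [3, 100]
```

The 1024 bytes of DIB for a 64K write equal two extra 512-byte blocks on a fixed-block SATA drive, or 8 extra bytes per block when the drive is formatted with 520-byte blocks.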

Whether rebuilding from a journal or from the partners in a RAID set, the potential for bit errors within the data blocks used to rebuild the data always exists. Journaling is not necessarily better or worse than RAID-based rebuild; it's just another approach.

And yes, there is definite value in consolidating lots of little arrays into a single big one - especially if that consolidated array can cost-effectively support multiple different tiers of capacity in the same footprint (e.g. Flash, Fibre and SATA storage). This is a key component of EMC's strategy for both Symmetrix and CLARiiON, and both arrays currently support more spindles and usable capacity than any of their direct competitors' offerings.