How safe is your deduplicated data?
As more organizations move to disk-based backup with dedupe, the glaring absence of high availability (HA) in many of today’s offerings is becoming apparent.
Data deduplication is a hot technology because it can substantially reduce overall backup costs by storing months' or years' worth of backup data very efficiently on disk. (Greg Schulz delivered a fairly good presentation in April 2007 at Storage Networking World (SNW) that covers trends in storage and how dedupe fits into them (PDF), and there is another article here about demystifying data deduplication.)
If the dedupe solution breaks, though, what happens to the months of data stored behind it on disk? Because data deduplication solutions can store a lot of data, you can end up with a lot of eggs (backups, in this case) in one basket. RAID protection alone is clearly not enough when you store multiple backups on non-removable media that can fail.
With non-HA dedupe solutions, if the solution fails, there are no tapes to go back to for recovery. This is why having tapes around is still a good thing (especially for long-term archives) and why high availability clustering is a must for dedupe.
Unlike data held on a disk-based backup system, data on a tape that has been removed from a tape library is fairly secure and not susceptible to power issues or other mechanical failures. If you need to recover data from tape, your backup application will ask for the tape whose barcode matches the one it recorded when it stored the files.
If the tape turns out to be bad, or the data on it is corrupt, you still have the option of using the backup tape from the night before. Even if a tape drive fails, you can usually still recover using another available tape drive.
The importance of HA in disk-based backup solutions:
In a deduped disk-based backup solution, there may be months' worth of backups stored behind a single head. If the solution is not highly available (i.e., clustered) and the head fails, there may be no way to recover the data at all, which makes high availability very important when using disk-based data protection solutions.
An enterprise-class data protection solution should be able to recover from multiple failures. After all, this may be the only backup copy of your data. Since it is all deduped and stored as a single copy, recoverability from any failure is very important.
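To see why a single stored copy raises the stakes, consider a minimal sketch of a content-addressed chunk store. This is an illustration only, not how any particular product works: the `ToyDedupeStore` class, the fixed 4-byte chunk size, and the SHA-256 fingerprints are all assumptions chosen to keep the example short.

```python
import hashlib

CHUNK_SIZE = 4  # tiny chunk size, for illustration only (assumption)


class ToyDedupeStore:
    """Hypothetical toy store: each unique chunk is kept exactly once."""

    def __init__(self):
        self.chunks = {}   # fingerprint -> chunk bytes (the single stored copy)
        self.backups = {}  # backup name -> ordered list of fingerprints

    def write_backup(self, name, data):
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(fp, chunk)  # store only if not seen before
            recipe.append(fp)
        self.backups[name] = recipe

    def affected_backups(self, lost_fp):
        """Names of backups that reference the lost chunk."""
        return sorted(n for n, r in self.backups.items() if lost_fp in r)


store = ToyDedupeStore()
store.write_backup("monday", b"AAAABBBBCCCC")
store.write_backup("tuesday", b"AAAABBBBDDDD")
store.write_backup("wednesday", b"AAAAEEEEDDDD")

shared_fp = hashlib.sha256(b"AAAA").hexdigest()
unique_chunks = len(store.chunks)            # 5 unique chunks for 9 written
damaged = store.affected_backups(shared_fp)  # every backup uses this chunk
```

In this toy example, nine chunks of backup data are stored as five unique chunks, and losing the one chunk shared by all three nights would damage every backup at once.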
Here is a simple checklist to use when you are shopping for data deduplication solutions for backups.
HA checklist for dedupe solutions:
No single point of failure (NSPOF) - Includes all the highly available features typically found in enterprise servers, such as dual power, dual CPU, multiple PCI buses, multiple ports, etc.
File system integrity - Some dedupe solutions depend on a filesystem, and others use raw disk I/O. If the solution uses a filesystem to store the data, it must be self-protected against viruses and other security issues.
Clustering - The solution should provide automatic failover for recovery if one of the solution's nodes fails.
Index integrity - The index stores the pointers to the dedupe data in the repository. The solution should have a mechanism for recovery from internal corruption of the index.
Replication - The solution should be able to efficiently replicate your data to an off-site copy.
Tape integration - The solution should let you easily leverage existing tape, if you desire, so you can copy the data to tape to protect data long term at low cost.
RAID protection - To protect data from multiple disk failures, dual-parity RAID 6 should be used behind any deduplication repository.
Write verify - The solution should be able to verify that data written to disk is correct in the first place. This is similar to the write verify capabilities of most backup software, but done at a hardware level.
In-place upgrades - If the solution is not clustered, how do you upgrade the software or hardware without losing access to backup or recovery of the data within the solution?
Recovery optimization - The solution should let you recover data at any time, even while backups are still being performed on a particular dataset.
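Two of the checklist items above, write verify and index integrity, can be sketched in software. The following is a rough illustration under stated assumptions, not a description of any vendor's implementation: chunks are stored as files named by their SHA-256 fingerprint, the index is a plain list of fingerprints, and the helper names `write_verified` and `scrub_index` are hypothetical. The checklist calls for verification at the hardware level; this sketch only mimics the idea in user space.

```python
import hashlib
import os
import tempfile


def fingerprint(data):
    return hashlib.sha256(data).hexdigest()


def write_verified(directory, data):
    """Write a chunk, read it back, and confirm the fingerprint matches."""
    fp = fingerprint(data)
    path = os.path.join(directory, fp)
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the bytes to the device
    # Caveat: this read may be served from the OS cache rather than the disk.
    with open(path, "rb") as f:
        if fingerprint(f.read()) != fp:
            raise IOError("write verify failed for chunk " + fp)
    return fp


def scrub_index(directory, index):
    """Return fingerprints in the index whose chunk is missing or corrupt."""
    bad = []
    for fp in index:
        path = os.path.join(directory, fp)
        try:
            with open(path, "rb") as f:
                ok = fingerprint(f.read()) == fp
        except FileNotFoundError:
            ok = False
        if not ok:
            bad.append(fp)
    return bad


with tempfile.TemporaryDirectory() as repo:
    index = [write_verified(repo, b"chunk-one"),
             write_verified(repo, b"chunk-two")]
    clean_result = scrub_index(repo, index)

    # Simulate silent corruption of the first chunk on disk.
    with open(os.path.join(repo, index[0]), "wb") as f:
        f.write(b"garbage")
    corrupt_result = scrub_index(repo, index)
```

Note that the scrub only detects a missing or corrupt chunk. Actually recovering it would still depend on a replica, a tape copy, or the RAID layer underneath.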
Ensuring that the solution you choose conforms to most, if not all, of the criteria above will give you peace of mind when things go bump in the night.
Christopher Poelker is the author of Storage Area Networks for Dummies, and he is currently the vice president of Enterprise Solutions at FalconStor Software.
