Data De-Dupe: Take Three
First, let me say I don't have an obsessive personality ... really. But I can't seem to go a day without talking about data de-duplication with somebody. I've even ranted to my husband, who isn't even in the data storage industry, about it (as if he cares). Data de-duplication is exciting because, as I've said before, it has massive, immediate real-world implications for users.
In fact, in my 10+ years in the industry, I have never seen a new technology get as much attention -- and, importantly, be incorporated into vendor product lines -- as quickly as data de-duplication. Its adoption is on an accelerated, or shortened, "technology adoption life-cycle" curve. There are no real technology "laggards" or "traditionalists." Even the disk array vendors see the writing on the wall, and are working feverishly to roll out data de-duplication capabilities this year. And the "pragmatists" are right on the heels of the early adopters; there's no real chasm between the two.
Data de-duplication features are already available in a variety of disk-based backup products (VTL and near-line disk, for example), and it's only a matter of time before they appear in other secondary as well as primary storage products. So, now is the time to begin considering your options.
Over the coming weeks, you'll begin to hear discussion about how and where data de-duplication should be done. Should it be done at the primary storage or secondary storage level, or both? What's the best way to extend data de-duplication to the massive volumes of data being generated at remote offices?
And should you run the de-duplication process in real time (i.e., in-band), as data is written to the primary or secondary disk target, or afterward (i.e., out-of-band), once the data has already been written to the disk device, in a "dual-hop" process? This boils down to a trade-off between performance and capacity.
Doing data de-duplication in-band, without clustering, can cause some performance degradation and, depending on the scale of the environment, can require users to invest in multiple VTL solutions to keep things streaming. But doing data de-duplication after the data has been written to the disk device raises potential capacity issues (the full, un-reduced data set is written to disk before it is "shrunk" during the de-duplication process) and backup window issues (a second process is required to de-duplicate the data).
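To make the capacity side of that trade-off concrete, here is a minimal sketch in Python. It is a toy model, not any vendor's implementation: the fixed-size chunks, the SHA-256 fingerprints, the in-memory dictionary standing in for the disk target, and the function names `inband_dedupe` and `post_process_dedupe` are all assumptions made for illustration. It tracks only peak disk usage; it does not model the performance cost of fingerprinting data inline.

```python
import hashlib


def chunk_key(chunk: bytes) -> str:
    """Fingerprint a chunk (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(chunk).hexdigest()


def inband_dedupe(chunks):
    """De-duplicate while data is being written.

    Returns (store, peak_bytes). Only unique chunks ever reach the store.
    """
    store = {}
    used = 0
    peak = 0
    for chunk in chunks:
        key = chunk_key(chunk)
        if key not in store:
            store[key] = chunk
            used += len(chunk)
            peak = max(peak, used)
    return store, peak


def post_process_dedupe(chunks):
    """Write everything first, then de-duplicate in a second pass.

    Returns (store, peak_bytes). Peak usage is the full, un-reduced size.
    """
    landed = list(chunks)                 # first hop: all data lands in full
    peak = sum(len(c) for c in landed)
    store = {}
    for chunk in landed:                  # second hop: shrink after the fact
        store.setdefault(chunk_key(chunk), chunk)
    return store, peak


# Hypothetical backup stream: five 4-byte chunks, three of them identical.
backup = [b"A" * 4, b"B" * 4, b"A" * 4, b"A" * 4, b"C" * 4]

inband_store, inband_peak = inband_dedupe(backup)
post_store, post_peak = post_process_dedupe(backup)

print("in-band peak bytes:     ", inband_peak)
print("post-process peak bytes:", post_peak)
print("final unique chunks:    ", len(inband_store), len(post_store))
```

Both approaches end up with the same set of unique chunks; the difference in this toy model is how much disk is needed along the way, since the out-of-band path has to hold the whole stream before the second pass shrinks it.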
Over the coming weeks, I will explore each of these issues in depth to help guide users, and vendors, through the process. Stay tuned.



