Asteroids, nuclear war and data center outages: Surviving big disasters by being small
Imagine that it's the height of the Cold War, and you are trying to design an approach to command, control, and communications that can survive a full-scale nuclear attack.
One approach is to build a small number of communications nodes that are highly resilient. For example, you can build communications bunkers a mile deep under mountains, or keep a pair of specially-outfitted jets continuously in the air above Nebraska. Let's call this the big box approach.
An alternative approach, described by the RAND Corporation in 1961, is to deploy large numbers of cheap, distributed communications nodes in a packet-switched network. In this approach, messages take the best available route to go from one location to another. Even if large numbers of the nodes are destroyed, messages can still get through. Let's call this the small box approach.
Of course, the U.S. Government tried both approaches. While the big box approach cost billions and, in many cases, has outlived its usefulness, the small box approach was not only cheap, but also took on a life of its own. The small box approach contributed to the development of ARPAnet, which later evolved into the Internet we know and love. Not only has the Internet made it possible to cheaply and reliably communicate about military matters, it has also made it a breeze to transmit petabytes of kittens-playing-piano video.
Now, let's imagine that you are trying to design a storage system that can store lots of data, with high performance and high resiliency. You could try to solve this problem with a big box approach, building a high performance, high capacity and highly resilient monolithic storage system. At present, this "scale up" approach dominates the storage industry.
However, I think the small box, scale out, software-based approach will ultimately have to dominate storage, much as the small box, distributed network approach came to dominate communications.
At the end of the day, storage is really about four things: 1) capacity: how much data you can store; 2) performance: how fast you can store and retrieve data; 3) reliability and availability: how safe your data is against loss or corruption; and 4) economics: how much it costs to do all the above. Enterprises are looking for radical improvement in all four areas.
For now, let's focus on availability and reliability. In subsequent posts, I'll address the impact of the small box approach on capacity and performance.
It's a simple rule of thumb that adding an extra "9" of reliability usually at least triples the cost of a device. Thus, creating a device that is 99.9% reliable (i.e. has only a 0.1% chance of failing during a given period) costs at least three times as much as creating a device that is 99% reliable (i.e. has a 1% chance of failing during the same period). By the same rule, creating a device that is 99.99% reliable is at least nine times as expensive as the 99% reliable device. And creating a device that is 99.999% (5 nines) reliable is at least 27 times as expensive.
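To make the rule of thumb concrete, here is a minimal Python sketch. It assumes the cost multiplier is exactly 3 per extra nine (the text says "at least" triple, so treat this as a lower bound), and the function name `relative_cost` and its parameters are hypothetical, chosen only for illustration.

```python
# Illustrative sketch of the "each extra nine triples the cost" rule of thumb.
# relative_cost and its parameters are hypothetical names, not from any library.

def relative_cost(nines, base_nines=2, factor=3):
    """Cost relative to a device with `base_nines` nines of reliability."""
    return factor ** (nines - base_nines)

for nines in range(2, 6):
    failure_chance = 10 ** -nines  # e.g. 2 nines -> 0.01 (a 1% chance of failure)
    print(f"{nines} nines: failure chance {failure_chance:.5f}, "
          f"relative cost {relative_cost(nines)}x")
```

Under these assumptions, the 99% device is the baseline, and each additional nine multiplies the baseline cost by another factor of three.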
If you take a small box approach, you instead look at the probability of multiple independent devices failing simultaneously. If each device has a 1% chance of failure, the chance that two of them fail simultaneously is (0.01)^2, or 0.01%, so the pair is 1 - (0.01)^2 = 99.99% reliable. The chance that three such devices fail simultaneously is theoretically (0.01)^3, or 0.0001%, for a combined reliability of 1 - (0.01)^3 = 99.9999%. So, for the price of three cheap devices, you can get greater reliability than a single, highly engineered five-nines device, and at one-ninth the price (this assumes that the devices truly fail independently). Furthermore, with the small box approach, it is possible to physically and logically separate the devices. So, in addition to protecting against the things that might cause an individual box to fail (e.g. faulty components), you also can gain greater protection against things like power outages, fires, floods and clumsy server administrators.
The math behind this is shown in the following table.
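The same arithmetic can also be reproduced in a few lines of Python. This is only a sketch: it assumes failures are fully independent, a 1% failure chance per cheap device, and a cost multiplier of exactly 3 per extra nine. The names `combined_reliability`, `cheap_unit_cost` and `five_nines_cost` are hypothetical and used only for illustration.

```python
# Illustrative sketch: reliability of n redundant devices, assuming failures
# are fully independent (real deployments may share failure causes).

def combined_reliability(p_fail, n):
    """Probability that at least one of n independent devices survives."""
    return 1 - p_fail ** n

cheap_unit_cost = 1        # cost of one 99%-reliable device (arbitrary unit)
five_nines_cost = 3 ** 3   # rule of thumb: three extra nines, each tripling cost

for n in (1, 2, 3):
    total_cost = n * cheap_unit_cost
    print(f"{n} device(s): reliability {combined_reliability(0.01, n):.4%}, "
          f"cost {total_cost}, "
          f"fraction of five-nines cost {total_cost / five_nines_cost:.3f}")
```

With these assumed inputs, three cheap devices cost 3 units against 27 for the single five-nines device, which is where the one-ninth figure comes from.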
It is no surprise that organizations like Google and Facebook have gravitated towards a small box, scale out, geographically replicated model of storage. At the end of the day, even the most resilient, highly-engineered systems can fall victim to catastrophic failures. By contrast, it is almost impossible to wipe out a large number of geographically-dispersed systems.
In other words, if an asteroid or nuclear weapon is about to hit, it's safer to bet on large numbers of cockroaches than a few massive dinosaurs.
Ben Golub was CEO of Gluster, Inc., which is now the Storage Business Unit of Red Hat. He is on Twitter @golubbe.

