Ben Golub

Storage in a Big Data World

Clouds, neurosis and Joni Mitchell

I've looked at clouds from both sides now,
From up and down, and still somehow,
It's cloud illusions I recall,
I really don't know clouds, at all.

-Joni Mitchell, Both Sides Now

There's an old saying that the difference between neurosis and psychosis is that neurotics build castles in the clouds and psychotics move in. What then does that say about those of us who try to build storage farms in the cloud? Are we afflicted with some special form of insanity?

Most storage technology and deployment decisions ultimately come down to a limited number of factors: economics, capacity, performance, security and availability.

In my last post, I discussed the economics of public cloud storage. In this post, I will discuss capacity and performance.

Capacity

In my post on the small box vs. large box approach to storage, I discussed the economic advantages that come from purchasing capacity in small increments, avoiding the need to either prepay for large amounts of storage that go unused, or worry about running out of storage.

Public cloud storage offers the prospect of virtually unlimited capacity, consumable in very small increments. It also offers three additional advantages: a) new storage can be provisioned nearly instantaneously; b) there is never a need to worry about available space, power, cooling, etc.; and c) storage can be cost-effectively "shrunk" if space is no longer needed.

The value of the last point is somewhat less clear for storage than it is for public cloud computing. Many workloads require "bursty" amounts of compute. For example, certain jobs are only run intermittently, and other jobs involve intense periods of calculation interspersed with long idle periods. By contrast, storage needs tend to be less bursty (at least as far as capacity is concerned). While there are certainly some workloads that generate large amounts of data which is subsequently winnowed down, the vast majority of workloads do not end with large amounts of data being thrown away. Since one pays a high price per terabyte for the flexibility of cloud storage capacity, it is worth examining whether your workloads really utilize that flexibility.
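To make that trade-off concrete, here is a minimal sketch that compares paying on demand with buying enough fixed capacity to cover the peak month. The prices and usage profiles are hypothetical assumptions chosen only for illustration (on-demand is assumed to cost twice as much per terabyte-month); they are not real provider rates.

```python
def on_demand_cost(usage_tb, price_per_tb_month):
    """Pay only for the terabytes actually stored in each month."""
    return sum(tb * price_per_tb_month for tb in usage_tb)


def provisioned_cost(usage_tb, price_per_tb_month):
    """Buy enough capacity for the peak month and hold it every month."""
    return max(usage_tb) * price_per_tb_month * len(usage_tb)


# Hypothetical prices (assumption): on-demand carries a 2x premium.
ON_DEMAND_PRICE = 20    # currency units per TB-month
PROVISIONED_PRICE = 10  # currency units per TB-month

# Hypothetical monthly usage profiles, in terabytes.
steady = [10, 11, 12, 13]  # grows slowly, nothing is thrown away
bursty = [10, 50, 10, 10]  # large intermediate data, later winnowed down

for name, usage in (("steady", steady), ("bursty", bursty)):
    print(
        name,
        "on-demand:", on_demand_cost(usage, ON_DEMAND_PRICE),
        "provisioned:", provisioned_cost(usage, PROVISIONED_PRICE),
    )
```

Under these assumed numbers, the steady workload pays 920 on demand versus 520 provisioned, while the bursty workload pays 1600 on demand versus 2000 provisioned. The flexibility only pays for itself when capacity really does shrink again.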

A further consideration is whether your applications or workloads need to view all of the public cloud storage as part of a single pool. Most public cloud storage is delivered as object storage, which is a great solution if your applications are written to utilize object storage. However, about 95% of all existing applications require POSIX-compliant storage, which means that you need a public cloud storage option that flexibly consolidates the capacity into a global namespace.

Performance

"Cloud storage" and "high performance" are generally not terms that are used in the same sentence. There are three reasons for this.

First, public cloud storage providers tend to provision storage using low-end, commodity servers and disks. Second, to the extent that you are using shared physical resources (i.e. other users may have data and compute jobs on the same physical devices), performance can be highly variable if your demands are being placed at the same time as other users'. Finally, public cloud services by their nature are provisioned across the Internet, where bandwidth is narrow and distance introduces unavoidable latency.

The first set of issues can be addressed in much the same way as in a small box, on-premises storage pool. By distributing workloads across large numbers of low-end disks and servers in a global namespace, one can get very respectable performance even in the public cloud. (NB: This generally works better for throughput-dependent workloads (e.g. media streaming) than for latency-dependent applications (such as RDBMS).)

Similarly, the second set of issues can be addressed either by purchasing dedicated cloud resources (although this tends to be expensive) or by intelligently provisioning cloud resources so that things are very highly distributed. Again, as discussed in the small boxes post, distribution minimizes the chances that a single, temporarily slow, physical resource will impact system performance.
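A rough way to see why distribution helps is to look at aggregate throughput when exactly one resource slows down. The sketch below is a simplified model, not a benchmark: it assumes requests are spread evenly across independent nodes, that the workload is throughput-bound, and that the per-node rates (100 MB/s healthy, 10 MB/s degraded) are hypothetical.

```python
def aggregate_fraction(nodes, healthy_mb_s=100.0, slow_mb_s=10.0):
    """Fraction of full aggregate throughput left when one node is slow.

    Simplified model: every node normally delivers healthy_mb_s, and
    exactly one node has temporarily dropped to slow_mb_s.
    """
    if nodes < 1:
        raise ValueError("need at least one node")
    degraded = (nodes - 1) * healthy_mb_s + slow_mb_s
    return degraded / (nodes * healthy_mb_s)


for n in (1, 2, 20, 200):
    print(f"{n:>3} nodes: {aggregate_fraction(n):.1%} of full throughput")
```

With these assumed rates, one slow node out of two leaves 55% of the pool's throughput, while one slow node out of twenty leaves 95.5%. Note that this model does not cover operations that must wait on every node at once, where the slowest node still sets the pace.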

As should be apparent, provisioning and managing public cloud resources in this manner will generally require some advanced storage software on top of the naked public cloud storage.

What about the third set of issues? Can one overcome the performance limitations imposed by the Internet itself?

If your applications are physically run far from the public cloud storage, the answer is probably "no". Of course, many use cases (e.g. serving files to end users) won't be impacted by these performance considerations. Furthermore, if the applications are themselves run in the public cloud, then this final set of performance issues can also go away.
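The gap between the two cases can be estimated with a back-of-the-envelope formula: transfer time is roughly the payload size divided by bandwidth, plus one round-trip delay per request. The sketch below uses that simplification (it ignores protocol overhead, congestion and parallel streams), and the link speeds and round-trip times are hypothetical values, not measurements.

```python
def transfer_seconds(size_mb, bandwidth_mbit_s, rtt_ms, round_trips=1):
    """Rough time to move size_mb megabytes over a link.

    Simplified model: serialization time (size / bandwidth) plus a fixed
    round-trip delay for each request/response exchange.
    """
    serialization = (size_mb * 8) / bandwidth_mbit_s  # megabits / (Mbit/s)
    latency = round_trips * (rtt_ms / 1000.0)
    return serialization + latency


# Hypothetical links (assumptions, not measurements):
remote = transfer_seconds(1000, bandwidth_mbit_s=100, rtt_ms=80)
in_cloud = transfer_seconds(1000, bandwidth_mbit_s=10_000, rtt_ms=1)

print(f"application far from the storage: {remote:.2f} s")
print(f"application in the same cloud:    {in_cloud:.3f} s")
```

For a single large file, bandwidth dominates. For many small requests, raising round_trips shows how latency alone can dominate, which is why latency-sensitive applications suffer most when run far from their storage.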

While public cloud storage is most frequently discussed in relation to backup or low-tier storage, the combination of economics, capacity, and performance issues suggests that it may be better suited to "bursty" applications (e.g. GIS, analytics jobs) where both compute and storage happen in the cloud. In such cases, pairing public cloud compute with public cloud storage is often ideal.

Of course, no discussion of cloud compute is complete without considering availability and security. As Joni Mitchell's song suggests, one needs to look at clouds from both sides. If you are going to be closely examining the top of a physical cloud, it is a good idea to understand the reliability of your parachute. More on this in the next post.

Ben Golub was CEO of Gluster, Inc., which is now the Storage Business Unit of Red Hat. He is on Twitter @golubbe.

 

