Does Yahoo's 2 petabyte database prove the superiority of the column-oriented database?
- TAGS:column-oriented database, data compression, data warehousing, DATAllegro, IRS, Michael Stonebraker, Oracle, SybaseIQ, Yahoo!
- IT TOPICS:Data Center, Development, Emerging Technology, Enterprise Apps, Privacy
There has been plenty of recent hype around column-oriented databases, aka the vertical database. Yahoo! Inc.'s revelation that it has built perhaps the world's fastest, busiest data warehouse using this approach is only the latest. (for a good ongoing discussion of the various vertical vendors, check out analyst Curt Monash's DBMS2 blog.)
Yahoo's year-old database already stores 2 petabytes of actively-accessed, structured data, though Yahoo! vice-president Waqar Hasan predicted that will grow to multiple tens of petabytes by next year.
It certainly dwarfs the 150 TB data warehouse that the IRS is using to clamp down on tax cheats that I marveled at earlier this spring.
(FYI, an anonymous commentator to my Yahoo story claimed to have worked on an application that stored data on the "entire history of the US Strategic Weapons program...the production database was about 10 petabytes. It was ONE SMALL PART of the comprehensive database of the National Archives." No word on whether that was a row or column-based database.)
The IRS was using SybaseIQ, which Hasan looked at briefly along with other off-the-shelf column-oriented databases such as Vertica before deciding that none of them would offer the speed and scalability Yahoo needed. Which is why the Web firm, which analyzes the behavior of half a billion Web visitors a month, decided to take a PostGreSQL engine and tweak it.
So who's going to step up and defend conventional row-based databases? Oracle? IBM? Microsoft? Nope. Instead, Stuart Frost, CEO and founder of data warehousing appliance vendor, DATAllegro Inc., did so in a blog posting earlier this month.
Frost allows that "there are some advantages" to column-based data warehouses. Data compression is better: column-oriented systems claim between 4:1 to 10:1 compression, while DATAllegro achieves between 2:1 and 6:1, he said. Also, queries that access only a few columns will run faster through a vertical approach.
But, and here is Frost's argument, "in most real-world implementations, these advantages don't make a great deal of difference." (my emphasis)
Column-oriented databases are no miracle, but just an "extreme form of vertical partitioning of data."
DATAllegro and other row-based databases achieve the same effect through "sophisticated horizontal partitioning to limit the number of rows read for each query" as well as "clever usage of materialized view technology to limit the number of columns we need to read," wrote Frost.
The net result is "very similar performance to that claimed by [Vertica Inc. founder Michael] Stonebraker, i.e. 10 to 50x that of traditional databases such as Oracle."
Does anyone have experience with DATAllegro and can testify that it is both speedier than Oracle and as fast as column-oriented databases? Or is anyone a user of Vertica, SybaseIQ, ParAccel, InfoBright or other vertical DBs, and want to share the purported advantages they see?

