Based on the software underpinning Google’s online empire, Hadoop was designed as a means of crunching vast amounts of data using very ordinary machines. But the world’s big-name hardware makers see it quite differently.
In recent months, the likes of Dell, Oracle, and EMC have unveiled what they bill as specialized hardware appliances for Hadoop, and on Monday, they were joined by storage hardware outfit and EMC rival NetApp, which announced a creation it calls the NetApp Open Solution for Hadoop.
Named for the yellow stuffed elephant that belonged to the son of its original developer, Hadoop is an open source software platform that analyzes data by splitting it into tiny pieces and distributing it across a large cluster of machines. The platform was built largely at Yahoo!, drawing on research papers published by Google, and it helps drive such web operations as Facebook, Twitter, and eBay. But Hadoop is evolving into a tool for the average business, which faces its own avalanche of unstructured data pouring in from the web.
Targeting such businesses, NetApp is offering what amounts to a cluster of hardware devices running the for-pay Hadoop distribution from Cloudera, a Silicon Valley startup that has commercialized the platform in much the same way Red Hat commercialized Linux. Jeff O’Neal, NetApp’s senior director of data center solutions, bills the new product as a “unique solution” in the Hadoop world, saying that — unlike other Hadoop appliances — it lets you readily add extra storage without adding extra CPUs.
“We’ve effectively separated the two physically,” O’Neal tells Wired, “so you can grow storage at a different rate than you grow compute.” In other words, as you require more storage, you can add up to fourteen 2-terabyte drives to a single server node — rather than adding extra servers to the cluster.
NetApp and Cloudera pitch the product as a superior alternative to the Hadoop appliance offered by EMC, the Massachusetts-based storage giant. But John Schroeder, CEO of MapR Technologies, the startup that supplies the Hadoop distro for EMC’s hardware, doesn’t see the appeal of “separating” compute and storage as NetApp describes it.
“The main concept behind Hadoop is data with compute,” he tells Wired. “The whole idea is to shard your data across the cluster and then each node works on its local shard. That’s where quite a bit of the efficiency comes from.”
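The idea Schroeder describes can be sketched in a few lines. The toy Python below is purely illustrative (real Hadoop jobs are written against the MapReduce API and HDFS, not anything like this): the dataset is split into shards, each simulated “node” computes only over its local shard, and only the small per-node results are merged at the end. That final merge is where the efficiency Schroeder mentions comes from, since the bulky raw data never has to move.

```python
from collections import Counter

def shard(records, num_nodes):
    """Distribute records round-robin across the cluster's nodes."""
    shards = [[] for _ in range(num_nodes)]
    for i, record in enumerate(records):
        shards[i % num_nodes].append(record)
    return shards

def map_local(local_shard):
    """Each node works only on its own shard: here, counting words."""
    counts = Counter()
    for line in local_shard:
        counts.update(line.split())
    return counts

def reduce_all(per_node_counts):
    """Merge the small per-node summaries into a final answer."""
    total = Counter()
    for counts in per_node_counts:
        total.update(counts)
    return total

# Hypothetical log lines, spread across two simulated nodes.
logs = ["error disk full", "ok", "error timeout", "ok ok"]
shards = shard(logs, num_nodes=2)
result = reduce_all(map_local(s) for s in shards)
print(result["ok"])  # only per-node Counters crossed the "network"
```

The point of the sketch is that `map_local` runs where the data already sits, which is why critics question appliances that pull storage away from the CPUs doing that work.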
MapR spent two years building a proprietary version of Hadoop that corrects certain flaws in the open source platform, including its dependence on a single “NameNode” server that helps oversee all other servers in the cluster. With the open source platform, if the NameNode goes down, the entire cluster goes down. But MapR has eliminated this single point of failure.
Asked how NetApp addresses this flaw, O’Neal says the company provides a network file system (NFS) backup for the NameNode, and he mentions other redundant hardware available with the product. But at this point, the claims and counterclaims from NetApp and its competitors are little more than a war of words. The NetApp cluster won’t be available until December.
What NetApp’s announcement does show is that Hadoop is quickly becoming one of those things that every big-name IT outfit absolutely must offer. “In addition to all the storage vendors already offering Hadoop appliances,” says Jim Kobielus, an analyst with research outfit Forrester, “I’ve had so many others contact me to find out how they should get in on the market.”
Cade Metz is the editor of Wired Enterprise. Got a NEWS TIP related to this story -- or to anything else in the world of big tech? Please e-mail him: cade_metz at wired.com.