M.C. Srivas is amazed by Google’s search engine. And he helped build Google’s search engine.
He’s amazed that if you search for “2005 Accord,” Google seems to understand you’re looking for a family sedan, giving you links not only for used Honda Accords but for similar cars with similar prices, such as a Volkswagen Passat or a Toyota Camry. He’s amazed Google can sort out the difference between a search for an apartment and one for a house. He’s amazed it can distinguish between “new” and “New York” and “New York Times.”
But he’s not applauding Google’s famous search algorithms. He’s applauding the infrastructure Google built to support those algorithms — software platforms such as the Google File System (GFS) and Google MapReduce that store and analyze data by spreading it across an army of ordinary servers. The algorithms are important too, but it’s MapReduce that took all those pages from across the web and put them into a readily searchable index. “The things we did at Google were incredible,” Srivas says. “I was just blown away by how effectively data was used.”
Srivas spent nearly two years at Google, running one of its search infrastructure teams, and in the summer of 2009, he left the company to found a startup that takes the ideas behind Google’s top-secret infrastructure and delivers them to the average business. The company is called MapR, after Google’s MapReduce, and like so many other companies, Srivas and crew are selling a product based on Hadoop, an open source incarnation of Google’s GFS and MapReduce platforms. But unlike its competitors, MapR is offering something that’s very different from the open source Hadoop project. The company spent two years rewriting Hadoop behind closed doors, eliminating what Srivas sees as major flaws in the platform.
“Three years ago, I gave a public talk about all the problems that existed with Hadoop, and three years later, they’re all still there [in the open source version],” Srivas tells Wired. “At some point, you just have to say ‘This cannot be fixed,’ and just throw it out and rewrite it. That’s what we did.”
Hadoop is a prime example of how technologies developed by the giants of the internet are now reinventing the software — and the hardware — used by everyday businesses. It’s a means of analyzing large amounts of unstructured data using a cluster of dirt-cheap servers. At Yahoo and Facebook, it feeds information into live web services, and it helps track how these services are performing. In the age of the internet, with more and more data flooding into the world’s businesses, this is something that can appeal to almost any large operation.
Big-name tech vendors such as Microsoft, Oracle, and IBM are offering tools based on Hadoop, and Srivas is just one example of an engineer who left a big-name web outfit to build a startup around the platform. Cloudera and Hortonworks are the other notable startups. Each of these startups takes a slightly different approach to Hadoop, and naturally, they spend a fair amount of time criticizing each other’s efforts. But despite his obvious agenda, the message from Srivas provides a nice counterpoint to all the hype surrounding the platform. Hadoop is a tool that still needs work.
Like Google. Sorta
Kirk Dunn — the chief operating officer of Hadoop startup Cloudera — points to Google, Facebook, and Yahoo as proof that the platform is ready for prime time. “Google and Facebook and Yahoo have many thousands of nodes that have been running many years,” he tells Wired. “The body of evidence there is overwhelming.”
Google doesn’t actually run Hadoop — Yahoo and Facebook and others built Hadoop using Google research papers that describe its back-end infrastructure — but his point is well taken. Yahoo and Facebook use Hadoop to crunch epic amounts of data using thousands of ordinary servers, and most businesses that adopt the technology will run the platform across much smaller clusters of machines.
But as M.C. Srivas points out, the open source version of Hadoop is still plagued by what are commonly called “single points of failure.” If one particular server goes down, it can bring down the entire platform. This sort of thing is something a Yahoo or a Facebook can deal with, Srivas says, but not necessarily the average business. “The reason that Yahoo and Facebook can run it is that they employ 50, 60, 70 engineers to feed the thing,” he says. “Other companies don’t have that.”
Before founding MapR, Srivas says, he met with the founders of Cloudera and considered joining their effort. But they wanted to tackle Hadoop the way Red Hat tackled Linux (i.e., offer support, services, and additional software around the open source platform), and he felt that before doing anything else, you had to fix the holes in the platform. Rather than join what would become Cloudera, he found a kindred spirit in John Schroeder, the former CEO of Calista Systems, a desktop virtualization outfit that was acquired by Microsoft in early 2008.
Schroeder had a friend at Google who also worked with MapReduce. Like Srivas, he attributes Google’s success not to its search algorithms but to its infrastructure. “From my acquaintance at Google, I observed, earlier than most, the power of MapReduce,” Schroeder says. “In 1998, they were the 19th search engine to enter the market. Remember doing an Alta Vista search, anyone? Google’s implementation of MapReduce on GFS and [Google’s distributed database] BigTable vaulted them to leadership within two years.”
Srivas and Schroeder met through mutual acquaintances in the venture capital world, and they founded MapR in 2009. For two years, their team worked to build a proprietary version of Hadoop that would eliminate certain limitations, including those single points of failure, and in May of 2011, they took the wraps off their proprietary Hadoop distribution. It was already being used by an analytics appliance offered by storage giant EMC.
Hadoop of the Future
According to Srivas and Schroeder, their Hadoop distro is several years ahead of the open source distributions offered by the likes of Cloudera. Which is only what you’d expect them to say. But it’s indisputable that the company has fixed major flaws that still plague the open source version.
Hadoop consists of a file system (HDFS) and a number-crunching platform (Hadoop MapReduce). The file system lets you spread data across a cluster of machines, and MapReduce processes this data by sending little pieces of code to each individual server. During those two years of development, MapR essentially rewrote the file system. “It could not be saved,” Srivas says. The company also revamped Hadoop’s “job tracker,” which distributes jobs across machines and then manages their execution, and its “name node,” which oversees file names across the system. On the open source platform, both are single points of failure, and the name node limits the number of files the platform can handle.
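To make the idea of “sending little pieces of code to each individual server” concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps in Python. It is an illustration only, not Hadoop’s API: the `NODES` dictionary, the node names, and every function name are hypothetical stand-ins, and a real cluster would run the map step on separate machines against data held in the file system.

```python
from collections import defaultdict

# Hypothetical stand-in for a cluster: each "node" holds one slice of the
# data, much as a distributed file system spreads data across machines.
NODES = {
    "node-1": ["new york times", "new car"],
    "node-2": ["used honda accord", "new york apartment"],
}


def map_phase(lines):
    """Runs against one node's local data and emits (word, 1) pairs."""
    for line in lines:
        for word in line.split():
            yield word, 1


def shuffle(pairs):
    """Groups emitted values by key so each key is reduced in one place."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combines the values collected for each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def run_job(nodes):
    # The same small function is applied to every node's slice of the data;
    # only the emitted pairs are gathered afterwards.
    emitted = [pair for lines in nodes.values() for pair in map_phase(lines)]
    return reduce_phase(shuffle(emitted))


counts = run_job(NODES)
print(counts["new"], counts["york"])
```

In this toy version the whole job runs inside `run_job`, which loosely plays the coordinating role the article attributes to the job tracker. If that one piece fails, nothing else can proceed, which is the kind of single point of failure Srivas describes.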
Cloudera’s Kirk Dunn acknowledges these shortcomings, but he says there are other things to consider when evaluating the merit of the open source version of Hadoop. The open source project will eventually eliminate those flaws as well, he says, and in the end, there are advantages to having all your code out in the open. “The conversation must be raised to a higher level than how many goesintas and goesoutas a distribution has,” he says. “With open source, you get the community effect. Would you rather rely on hundreds of engineers working on a very important problem? Or would you rather rely on one company with a handful of engineers?”
Indeed, both Cloudera and Hortonworks, a Yahoo spin-off, are committed to enhancing the open source project. And though much of its code is proprietary, MapR is also making some contributions back to the project.
There are other areas where the platform can improve, and MapR is tackling these as well. For the most part, Hadoop is a “batch” system. You give it a task. It works for a while. And then it churns out a result. It’s not designed to generate information in “real time.” With its search engine, Google has now abandoned MapReduce, moving to a platform called “Caffeine” that can update its search index on the fly, and John Schroeder hints that MapR is moving in a similar direction — though its solution will likely look very different from Caffeine.
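The difference between a batch rebuild and an on-the-fly update can be sketched with a toy search index in Python. This is not how Caffeine or MapR works internally, since neither design is described here; the function names and sample documents are hypothetical and serve only to contrast the two styles.

```python
from collections import defaultdict


def build_index_batch(documents):
    """Batch style: reprocess every document and return a fresh index."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            index[word].add(doc_id)
    return index


def update_index_incremental(index, doc_id, text):
    """Incremental style: fold one new document into the existing index."""
    for word in text.split():
        index[word].add(doc_id)
    return index


docs = {"a": "honda accord", "b": "toyota camry"}
batch_index = build_index_batch(docs)

# A new page arrives. The batch approach reruns the whole job over all pages...
docs["c"] = "volkswagen passat accord"
rebuilt = build_index_batch(docs)

# ...while the incremental approach touches only the new page's words.
live_index = update_index_incremental(batch_index, "c", docs["c"])

print(sorted(live_index["accord"]))
```

Both paths end with the same index. The batch path pays the cost of rereading every document each time, while the incremental path does work proportional to the new page alone.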
M.C. Srivas points out that Hadoop is quite different from what Google runs internally. In addition to GFS and MapReduce, Google runs another software layer called Borg, a means of managing server clusters inside its data centers. Google has yet to publish information about Borg, and like all ex-Google employees, Srivas won’t say any more about it, citing a non-disclosure agreement. But his larger message is that you shouldn’t mistake Hadoop for the Google infrastructure. Nor, he says, should you mistake it for what’s running at Yahoo and Facebook. “I’m sure that, like Google, they’re holding back what they see as their secret sauce,” he says.
This may or may not be the case. But the fact remains that Google and Yahoo and Facebook are not your average business. If Hadoop is to succeed elsewhere, it must evolve. At MapR, it already has.
Cade Metz is the editor of Wired Enterprise.