It took more than three years, but Microsoft has finally learned to stop worrying and love Hadoop.
Hadoop, an open source platform for crunching epic amounts of data across an army of dirt-cheap servers, underpins everything from Facebook and Twitter to Yahoo! and eBay, and it’s poised for use across the enterprise, with EMC, IBM, and Oracle pushing the platform onto corporate customers. But although Microsoft acquired a Hadoop pioneer as far back as the summer of 2008, its relationship with the platform has been uneasy at best, as the company continued to shed its traditional aversions to open source software.
Any aversion to Hadoop disappeared on Wednesday, when the company announced that it will integrate the platform with future versions of its relational database, SQL Server, and its platform cloud, Windows Azure, an online service for hosting and readily scaling applications. The company is now working to port the Hadoop platform to Windows — it was built for use atop Linux — and Doug Leland, general manager of product management for SQL Server, told Wired that the company plans to eventually release its work back to the open source community.
“This shows that Microsoft is serious about Hadoop,” said Jim Kobielus, an analyst with research outfit Forrester. “It wasn’t before.”
This time last year, Microsoft lent its support to another big-name open source project: OpenStack, an effort to build “infrastructure clouds” along the lines of Amazon’s EC2. But Redmond relied on a third party to provide the code. This time, Leland says, Microsoft engineers will do the coding.
The world’s largest software giant continues to evolve.
When Linux Was “Cancer”
Famously, through the ’90s and into the aughts, Microsoft quarantined itself from the world of open source software. In 2001, chief exec Steve Ballmer referred to Linux — the granddaddy of open source — as a “cancer.” But as the influence of Linux and open source grew, the company began to bring down the wall, realizing it couldn’t survive in the long term if it didn’t.
The result — at least in the short term — was a kind of open source schizophrenia. Some parts of the company would reach out to the open source community, while others were still loath to do so. Some Microsoft products would play nicely with open source code, but these tools rarely included open source code themselves. The company’s rather complicated relationship to open source was exemplified by its 2008 purchase of Powerset, a semantic search startup based in San Francisco that was among the first companies to embrace Hadoop.
The original Hadoop project was started by independent coder Doug Cutting, who named the platform after his son’s yellow stuffed elephant, and it was Yahoo! that hired Cutting and seeded the open source project at the Apache Foundation. But Powerset founded HBase, the “NoSQL” database that runs atop Hadoop. The startup’s semantic search engine (a means of searching with natural language rather than mere keywords) was tightly integrated with the open source platform.
After imposing a three-month hiatus on Powerset’s two full-time HBase “committers” (Michael Stack and Jim Kellerman), Microsoft allowed the pair to continue their contributions to the open source project, and Powerset, which was rolled into Redmond’s Bing search engine, continued to run atop Hadoop.
This made Bing one of the first “shipping” Microsoft products to actually include open source code. But somewhere along the way, Microsoft moved the engine onto a proprietary platform, and Stack left the company, taking his HBase work to web search outfit StumbleUpon.
The New Microsoft
Doug Leland declined to discuss Microsoft’s history with Hadoop, pointing out that Powerset was handled by a separate part of the company, but he made it quite clear that both the SQL Server and the Windows Azure teams are committed to the open source platform for the long term. “There have certainly been requests from our [SQL Server and Windows Azure] customers to embrace Hadoop and deliver an enterprise-class distribution of the platform that’s built into the Windows infrastructure and is easily managed within that infrastructure,” he said. “And that’s what we’re doing.”
Hortonworks — an outfit that Yahoo! recently created using its core Hadoop engineers — is working in tandem with Microsoft on its port to Windows. Hadoop will be available as a “technology preview” on Azure by the end of the year, and a preview for use with SQL Server will be available sometime next year.
Whereas a relational database such as SQL Server organizes information into neat rows and columns, letting you carefully slice and dice that data as needed, Hadoop is a way of processing large unstructured datasets. In essence, Microsoft’s Hadoop port will run on its Windows Server operating system alongside SQL Server, and the company is providing “connectors” for moving data between the two. On Azure, Microsoft will provide its Hadoop port as a service to developers, letting them build applications atop the platform without installing it inside their own data centers.
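To make the contrast with rows and columns concrete, here is a minimal sketch of the map-and-reduce style of processing that Hadoop is known for, applied to raw lines of text. It runs in a single Python process and does not use Hadoop or any Microsoft connector; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made up for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of raw text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all the counts collected for one word into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of unstructured text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample input: free-form text with no schema.
    sample = ["Hadoop on Windows", "hadoop on Linux"]
    print(word_count(sample))
```

On a real cluster the map and reduce steps would be spread across many servers, each working on its own slice of the data. The sketch keeps everything in one process so that only the shape of the computation is visible.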
The rub is that with a distributed number-crunching platform like Hadoop, operating system overhead can be an issue, and Windows carries an awful lot of baggage. Linux, a more streamlined OS, seems much better suited to the platform.
But for many, including Eric Baldeschwieler, the CEO of Hortonworks, Microsoft’s announcement is quite a milestone. “This is a real validation of Hadoop and its readiness for prime time,” Baldeschwieler told Wired. “It brings Hadoop to such a large audience, and Microsoft is doing it in an open source way, which is great for everyone involved.”