How companies are using open-source cluster-computing techniques to analyze their data.
As recently as two years ago, the International Supercomputing Conference (ISC) agenda included nothing about distributed computing for Big Data— as if projects such as Google Cluster Architecture, a low-cost, distributed computing design that enables efficient processing of large volumes of less-structured data, didn’t exist. In a May 2008 blog, Brough Turner noted the omission, pointing out that Google had harnessed as much as 100 petaflops1 of computing power, compared to a mere 1 petaflop in the new IBM Roadrunner, a supercomputer profiled in EE Times that month. “Have the supercomputer folks been bypassed and don’t even know it?” Turner wondered.2
Turner, co-founder and CTO of Ashtonbrooke.com, a startup in stealth mode, had been reading Google’s research papers and remarking on them in his blog for years. Although the broader business community had taken little notice, some companies were following in Google’s wake. Many of them were Web companies that had data processing scalability challenges similar to Google’s.
Yahoo, for example, abandoned its own data architecture and began to adopt one along the lines pioneered by Google. It moved to Apache Hadoop, an open-source, Java-based distributed file system based on Google File System and developed by the Apache Software Foundation; it also adopted MapReduce, Google’s parallel programming framework. Yahoo used these and other open-source tools it helped develop to crawl and index the Web. After implementing the architecture, it found other uses for the technology and has now scaled its Hadoop cluster to 4,000 nodes.
By early 2010, Hadoop, MapReduce, and related open-source techniques had become the driving forces behind what O’Reilly Media, The Economist, and others in the press call Big Data and what vendors call cloud storage. Big Data refers to data sets that are growing exponentially and that are too large, too raw, or too unstructured for analysis by traditional means. Many who are familiar with these new methods are convinced that Hadoop clusters will enable cost-effective analysis of Big Data, and these methods are now spreading beyond companies that mine the public Web as part of their business.
What are these methods and how do they work? This article looks at the architecture and tools surrounding Hadoop clusters with an eye toward what about them will be useful to mainstream enterprises during the next three to five years. We focus on their utility for less-structured data.
1FLOPS stands for “floating point operations per second.” Floating point processors use more bits to store each value, allowing more precision and ease of programming than fixed point processors. One petaflop is upwards of one quadrillion floating point operations per second.
2Brough Turner, “Google Surpasses Supercomputer Community, Unnoticed?” May 20, 2008, http://blogs.broughturner.com/ communications/2008/05/google-surpasses-supercomputercommunity- unnoticed.html (accessed April 8, 2010).