Hadoop’s foray into the enterprise

Cloudera’s Amr Awadallah discusses how and why diverse companies are trying this novel approach.

Interview conducted by Alan Morrison, Bo Parker, and Vinod Baya

Amr Awadallah is vice president of engineering and CTO at Cloudera, a company that offers products and services around Hadoop, an open-source technology that allows efficient mining of large, complex data sets. In this interview, Awadallah provides an overview of Hadoop’s capabilities and how Cloudera customers are using them.

PwC: Were you at Yahoo before coming to Cloudera?

AA: Yes. I was with Yahoo from mid-2000 until mid-2008, starting with the Yahoo Shopping team after selling my company, VivaSmart, to Yahoo. Beginning in 2003, my career shifted toward business intelligence and analytics at consumer-facing properties such as Yahoo News, Mail, Finance, Messenger, and Search.

I had the daunting task of building a very large data warehouse infrastructure that covered all these diverse products and figuring out how to bring them together.

That is when I first experienced Hadoop. Its model of “mine first, govern later” fits in with the well-governed infrastructure of a data mart, so it complements these systems very well. Governance standards are important for maintaining a common language across the organization. However, they do inhibit agility, so it’s best to complement a well-governed data mart with a more agile complex data processing system like Hadoop.

PwC: How did Yahoo start using Hadoop?

AA: In 2005, Yahoo was faced with a business challenge. The cost of creating the Web search index was approaching the revenues being made from the keyword advertising on the search pages. Yahoo Search adopted Hadoop as an economically scalable solution, and worked on it in conjunction with the open-source Apache Hadoop community. Yahoo played a very big role in the evolution of Hadoop to where it is today. Soon after the Yahoo Search team started using Hadoop, other parts of the company began to see the power and flexibility that this system offers. Today, Yahoo uses Hadoop for data warehousing, mail spam detection, news feed processing, and content/ad targeting.

PwC: What are some of the advantages of Hadoop when you compare it with RDBMSs [relational database management systems]?

AA: With Oracle, Teradata, and other RDBMSs, you must create the table and schema first. You say, this is what I’m going to be loading in, these are the types of columns I’m going to load in, and then you load your data. That process can inhibit how fast you can evolve your data model and schemas, and it can limit what you log and track.

With Hadoop, it’s the other way around. You load all of your data, such as XML [Extensible Markup Language], tab-delimited flat files, Apache log files, JSON [JavaScript Object Notation], etc. Then in Hive or Pig [both of which are Hadoop data query tools], you point your metadata toward the file and parse the data on the fly when reading it out. This approach lets you extract the columns that map to the data structure you’re interested in.
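
[The load-first, parse-on-read idea described above can be illustrated with a minimal sketch in plain Python. It is not Hive or Pig code, and it does not use any Hadoop API. The sample records, the column names in TSV_SCHEMA, and the helper functions are all hypothetical and chosen only for illustration.]

```python
import csv
import io
import json

# Raw data stored exactly as it arrived; no table or schema was declared first.
RAW_TSV = "2010-05-01\talice\t/home\n2010-05-01\tbob\t/search\n"
RAW_JSON = '{"date": "2010-05-02", "user": "carol", "page": "/mail", "agent": "x"}\n'

# The "schema" is just metadata applied at read time (hypothetical columns).
TSV_SCHEMA = ["date", "user", "page"]


def read_tsv(raw, schema):
    """Parse tab-delimited text on the fly, naming columns from the schema."""
    reader = csv.reader(io.StringIO(raw), delimiter="\t")
    for fields in reader:
        yield dict(zip(schema, fields))


def read_json(raw, wanted):
    """Parse JSON lines on the fly, keeping only the columns of interest."""
    for line in raw.splitlines():
        if line.strip():
            record = json.loads(line)
            yield {name: record.get(name) for name in wanted}


def pages_by_user(rows):
    """A toy query over rows that were structured only while being read."""
    return {row["user"]: row["page"] for row in rows}


# Two different raw formats are mapped to the same columns at read time.
rows = list(read_tsv(RAW_TSV, TSV_SCHEMA)) + list(read_json(RAW_JSON, TSV_SCHEMA))
print(pages_by_user(rows))
```

[In this sketch the extra "agent" field in the JSON record is simply ignored until some reader asks for it; nothing had to be declared before the data was stored.]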

Creating the structure on the read path like this can have its disadvantages; however, it gives you the agility and the flexibility to evolve your schema much more quickly without normalizing your data first. In general, relational systems are not well suited for quickly evolving complex data types.

Another benefit is retroactive schemas. For example, an engineer launching a new product feature can add the logging for it, and that new data will start flowing directly into Hadoop. Weeks or months later, a data analyst can update their read schema on how to parse this new data. Then they will immediately be able to query the history of this metric since it started flowing in [as opposed to waiting for the RDBMS schema to be updated and the ETL (extract, transform, and load) processes to reload the full history of that metric].
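
[A retroactive schema of the kind described above can be sketched in plain Python, again without any Hadoop, Hive, or Pig API. The log records, the field name share_clicks, and the two schema dictionaries are hypothetical. The point is only that the raw history is never reloaded; the read schema alone changes.]

```python
import json

# Raw event log kept as-is. The engineer started logging "share_clicks"
# on day 2 without telling anyone to change a table definition.
RAW_LOG = [
    '{"day": 1, "event": "view"}',
    '{"day": 2, "event": "view", "share_clicks": 3}',
    '{"day": 3, "event": "view", "share_clicks": 5}',
]

# Read schemas map column name -> default used when a record lacks the field.
SCHEMA_V1 = {"day": None, "event": None}
SCHEMA_V2 = {**SCHEMA_V1, "share_clicks": 0}  # added weeks later by an analyst


def read(raw_lines, schema):
    """Apply a read schema to raw JSON lines at query time."""
    for line in raw_lines:
        record = json.loads(line)
        yield {col: record.get(col, default) for col, default in schema.items()}


def total_share_clicks(raw_lines, schema):
    """Query the new metric across the full stored history."""
    return sum(row["share_clicks"] for row in read(raw_lines, schema))


# Under the old schema the new field is invisible; under the updated schema
# the whole history of the metric is immediately queryable.
print(list(read(RAW_LOG, SCHEMA_V1))[1])
print(total_share_clicks(RAW_LOG, SCHEMA_V2))
```

[Records written before the feature launched fall back to the default value here; that default is a design choice of this sketch, not something the interview specifies.]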