Making Hadoop suitable for enterprise data science

Mike Lang

Mike Lang is CEO of Revelytix.

Creating data lakes enables enterprises to expand discovery and predictive analytics.

Interview conducted by Alan Morrison, Bo Parker, and Brian Stein

PwC: You’re in touch with a number of customers who are in the process of setting up Hadoop data lakes. Why are they doing this?

ML: There has been resistance on the part of business owners to share data, and a big part of the justification for not sharing data has been the cost of making that data available. The data owners complain that they must do the extraction in some special way, that the system doesn’t have time to process the queries that build the extracts, and so forth.

But a lot of the resistance has been political. Owning data has power associated with it. Hadoop is changing that, because C-level executives are saying, “It’s no longer inordinately expensive for us to store all of our data. I want all of you to make copies. OK, your systems are busy. Find the time, get an extract, and dump it in Hadoop.”

But they haven’t integrated anything. They’re just getting an extract. The benefit is that to add value to the integration process, business owners don’t have nearly the same hill to climb that they had in the past. C-level executives are not asking the business owner to add value. They’re just saying, “Dump it,” and I think that’s under way right now.

With a Hadoop-based data lake, the enterprise has provided a capability to store vast amounts of data, and the user doesn’t need to worry about restructuring the data to begin. The data owners just need to do the dump, and they can go on their merry way.

PwC: So one major obstacle was just the ability to share data cost-effectively?

ML: Yes, and that was a huge obstacle. Huge. It is difficult to overstate how big that obstacle has been to nimble analytics and data integration projects during my career. For the longest time, there was no such thing as nimble when talking about data integration projects.

Once that data is in Hadoop, nimble is the order of the day. All of a sudden, the ETL [extract, transform, load] process is totally turned on its head—from contemplating the integration of eight data sets, for example, to figuring out which of a company’s policyholders should receive which kinds of offers at what price in which geographic regions. Before Hadoop, that might have been a two-year project.

PwC: What are the main use cases for Hadoop data lakes?

ML: There are two main use cases for the data lake. One is as a staging area to support some specific application. A company might want to analyze three streams of data to reduce customer churn by 10 percent. They plan to build an app to do that using three known streams of data, and the data lake is just part of that workflow of receiving, processing, and then dumping data off to generate the churn analytics.

The last time we talked [in 2013], that was the main use case of the data lake. The second use case is supporting data science groups all around the enterprise. Now, that’s probably 70 percent of the companies we’ve worked with.

PwC: Why use Hadoop?

ML: Data lakes are driven by three factors. The first one is cost. Everybody we talk to really believes data lakes will cost much less than current alternatives. The cost of data processing and data storage could be 90 percent lower. If I want to add a terabyte node to my current analytics infrastructure, the cost could be $250,000. But if I want to add a terabyte node to my Hadoop data lake, the cost is more like $25,000.

The second factor is flexibility. The flexibility comes from the late binding principle. When I have all this data in the lake and want to analyze it, I’ll basically build whatever schema I want on the fly and I’ll conduct my analysis the way data scientists do. Hadoop lends itself to late binding.

The third factor is scale. Hadoop data lakes can scale far beyond a data warehouse, because they’re designed to scale out and to process any type of data.
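
To make the late-binding principle concrete, here is a minimal sketch of schema-on-read over raw files in a Hadoop data lake. It is not drawn from the interview: the use of PySpark, the HDFS path, and the field names are illustrative assumptions.

```python
# Hypothetical sketch of late binding (schema-on-read) over raw files in a data lake.
# PySpark, the HDFS path, and the field names are assumptions, not details from the interview.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("late-binding-sketch").getOrCreate()

# The files were dumped into the lake as-is; no schema was imposed at load time.
# The analyst decides on a schema only now, at query time.
schema = StructType([
    StructField("policy_id", StringType()),
    StructField("zip_code", StringType()),
    StructField("premium", DoubleType()),
])

policies = (
    spark.read
    .schema(schema)                      # schema bound at read time, not at ingest
    .csv("hdfs:///lake/raw/policies/")   # raw extract dumped by the data owner
)

# A simple aggregation using the schema that was just bound.
policies.groupBy("zip_code").avg("premium").show()
```

A different analysis can bind a different schema to the same raw files tomorrow, which is what makes the approach feel nimble compared with restructuring data up front.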

PwC: What’s the first step in creating such a data lake?

ML: We’re working with a number of big companies that are implementing some version of the data lake. The first step is to create a place that stores any data that the business units want to dump in it. Once that’s done, the business units make that place available to their stakeholders.

The first step is not as easy as it sounds. The companies we’ve been in touch with spend an awful lot of time building security apparatuses. They also spend a fair amount of time performing quality checks on the data as it comes in, so at least they can say something about the quality of the data that’s available in the cluster.

But after they have that framework in place, they just make the data available for data science. They don’t know what it’s going to be used for, but they do know it’s going to be used.
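
The kind of ingest-time quality check described above might look, in spirit, like the following hypothetical sketch, which profiles a file as it lands in the cluster. The file name, delimiter, and metrics are assumptions, not details from the interview.

```python
# Hypothetical sketch of an ingest-time quality check: profile each file as it
# lands so that something can be said about its quality. The file name,
# delimiter, and metrics are assumptions for illustration only.
import csv

def profile_incoming_file(path, delimiter=","):
    rows = 0
    empty_cells = 0
    columns = None
    with open(path, newline="") as f:
        for record in csv.reader(f, delimiter=delimiter):
            if columns is None:
                columns = len(record)        # take the first row's width as a baseline
            rows += 1
            empty_cells += sum(1 for cell in record if cell.strip() == "")
    return {
        "path": path,
        "rows": rows,
        "columns": columns,
        "empty_cells": empty_cells,          # a crude completeness signal
    }

# The resulting profile can be stored alongside the file so users of the lake
# know at least roughly what they are getting.
print(profile_incoming_file("incoming_extract.csv"))
```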

PwC: So then there’s the data preparation process, which is where the metadata reuse potential comes in. How does the dynamic ELT [extract, load, transform] approach to preparing the data in the data science use case compare with the static ETL [extract, transform, load] approach traditionally used by business analysts?

ML: In the data lake, the files land in Hadoop in whatever form they’re in. They’re extracted from some system and literally dumped into Hadoop, and that is one of the great attractions of the data lake—data professionals don’t need to do any expensive ETL work beforehand. They can just dump the data in there, and it’s available to be processed in a relatively inexpensive storage and processing framework.

The challenge, then, comes when data scientists need to use the data. How do they get it into the shape that’s required for their R data frame or their Python code for advanced analytics? The answer is that the process is very iterative. That iterative process is what distinguishes data scientists working in Hadoop from business analysts working with a data warehouse.

Traditional ETL is not iterative at all. It takes a long time to transform the different data into one schema, and then the business analysts perform their analysis using that schema.

The data scientist doesn’t like the ETL paradigm that business analysts use at all. Data scientists have no idea at the beginning of their job what the schema should be, and so they go through this process of looking at the data that’s available to them.

Let’s say a telecom company has set-top box data and finance systems that contain customer information. Let’s say the data scientists for the company have four different types of data. They’ll start looking into each file and determine whether the data is unstructured or structured this way or that way. They need to extract some pieces of it. They don’t want the whole file. They want some pieces of each file, and they want to get those pieces into a shape so they can pull them into an R server.

So they look into Hadoop and find the file. Maybe they use Apache Hive to transform selected pieces of that file into some structured format. Then they pull that out into R and use some R code to start splitting columns and performing other kinds of operations. The process takes a long time, but that is the paradigm they use. These data scientists actually bind their schema at the very last step of running the analytics.

Let’s say that in one of these Hadoop files from the set-top box, there are 30 tables. They might choose one table and spend quite a bit of time understanding and cleaning up that table and getting the data into a shape that can be used in their tool. They might do that across three different files in HDFS [Hadoop Distributed File System]. But they clean the data as they’re developing their model, they shape it, and at the very end both the model and the schema come together to produce the analytics.
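
The iterative shaping Lang describes might look something like the sketch below. The interview’s workflow uses Hive and R; plain Python stands in here purely for illustration, and the file names, delimiter, and field positions are assumptions.

```python
# Hypothetical sketch of the iterative shaping described above. The interview's
# workflow uses Hive and R; plain Python stands in purely for illustration, and
# the file names, delimiter, and field positions are assumptions.
import csv

# Step 1: look into the raw set-top box extract and keep only the pieces needed.
wanted = []
with open("settop_raw_extract.txt") as raw:
    for line in raw:
        fields = line.rstrip("\n").split("|")   # assumed delimiter
        if len(fields) < 7:
            continue                            # skip malformed rows while exploring
        # Keep only three of the many columns; the rest of the file is ignored.
        wanted.append((fields[0], fields[3], fields[6]))

# Step 2: write an intermediate, cleaned data set back for further work.
# Other data scientists could reuse this intermediate file if they can find it.
with open("settop_tune_events_cleaned.csv", "w", newline="") as out:
    writer = csv.writer(out)
    # Step 3: the schema is bound only here, at the last step before analysis.
    writer.writerow(["device_id", "channel", "tune_seconds"])
    writer.writerows(wanted)
```

In practice the selection, cleaning, and shaping would be revisited several times as the model develops, which is the iterative loop the interview contrasts with traditional ETL.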

PwC: How can the schema become dynamic and allow greater reuse?

ML: That’s why you need lineage. As data scientists assemble their intermediate data sets, if they look at a lineage graph in our Loom product, they might see 20 or 30 different sets of data that have been created. Of course, some of those sets will be useful to other data scientists. Dozens of hours of work have been invested there. The problem is how to find those intermediate data sets. In Hadoop, they are materialized as persisted data sets.

So, how do you find them and know what their structure is so you can use them? You need to know that this data set originally contained data from this stream or that stream, this application and that application. If you don’t know that, then the data set is useless.

At this point, we’re able to preserve the input sets, the person who did the work, when they did it, and the actual transformation code that produced the output set. It is pretty straightforward for users to go backward or forward to find the data set and then find something downstream or upstream that they might be able to use by combining it, for example, with two other files. Right now we provide the bare-bones capability for them to do that kind of navigation. From my point of view, that capability is still in its infancy.
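
The kind of record being preserved can be pictured with a small, hypothetical sketch. This is not the Loom API; the structure, the field names, and the file names (reused from the earlier shaping sketch) are assumptions made for illustration.

```python
# Hypothetical sketch of a lineage record like the one described above. This is
# NOT the Loom API; the structure and field names are assumptions for
# illustration only.
import json
from datetime import datetime, timezone

lineage_record = {
    "output_dataset": "hdfs:///lake/intermediate/settop_tune_events_cleaned.csv",
    "input_datasets": [
        "hdfs:///lake/raw/settop/settop_raw_extract.txt",
    ],
    "created_by": "jdoe",                                  # the person who did it
    "created_at": datetime.now(timezone.utc).isoformat(),  # when they did it
    "transformation": "shape_settop_extract.py",           # the code that produced the output
}

# Persisting the record next to the output lets another data scientist walk
# upstream (to the raw extract) or downstream (to anything built on this file).
with open("settop_tune_events_cleaned.lineage.json", "w") as f:
    json.dump(lineage_record, f, indent=2)
```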

PwC: And there’s also more freedom and flexibility on the querying side?

ML: Predictive analytics and statistical analysis are easier with a large-scale data lake. That’s another sea change that’s happening with the advent of big data. Everyone we talk to says SQL worked great. They look at the past through SQL. They know their current financial state, but what they really need to know is the characteristics of the customers in a particular zip code they should target with a particular product.

When you can run statistical models on enormous data sets, you get better predictive capability. The bigger the set, the better your predictions. Predictive modeling and analytics are not being done timidly in Hadoop. That’s one of the main uses of Hadoop.

This sort of analysis wasn’t performed 10 years ago, and it’s only just become mainstream practice. A colleague told me a story about a credit card company. He lives in Maryland, and he went to New York on a trip. He used his card one time in New York and then he went to buy gas, and the card was cut off. His card didn’t work at the gas station. He called the credit card company and asked, “Why did you cut off my card?”

And they said, “We thought it was a case of fraud. You never have made a charge in New York and all of a sudden you made two charges in New York.” They asked, “Are you at the gas station right now?” He said yes.

It’s remarkable what the credit card company did. It ticked him off that they could figure out that much about him, but the credit card company potentially saved itself tens of thousands of dollars in charges it would have had to eat.

This new generation of processing platforms focuses on analytics. That problem right there is an analytical problem, and it’s predictive in its nature. The tools to help with that are just now emerging. They will get much better about helping data scientists and other users. Metadata management capabilities in these highly distributed big data platforms will become crucial—not nice-to-have capabilities, but I-can’t-do-my-work-without-them capabilities. There’s a sea of data.