It should come as no surprise that organizations dealing with lots of data are already investigating Big Data technologies, or that they have mixed opinions about these tools.
“At TransUnion, we spend a lot of our time trawling through tens or hundreds of billions of rows of data, looking for things that match a pattern approximately,” says John Parkinson, TransUnion’s acting CTO. “We want to do accurate but approximate matching and categorization in very large low-structure data sets.”
Parkinson has explored Big Data technologies such as MapReduce that appear to have a more efficient filtering model than some of the pattern-matching algorithms TransUnion has tried in the past. “MapReduce also, at least in its theoretical formulation, is very amenable to highly parallelized execution,” which lets the users tap into farms of commodity hardware for fast, inexpensive processing, he notes.
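The filtering model Parkinson describes can be sketched in miniature. The Python below is a hypothetical illustration only — the `map_phase`/`reduce_phase` names and the `difflib` similarity test are assumptions for the sketch, not TransUnion's actual method — showing how approximate matching decomposes into map tasks that can run independently in parallel, followed by a reduce step that groups the results.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def map_phase(records, pattern, threshold=0.8):
    """Map step: emit (category, record) pairs for each record,
    using a simple string-similarity ratio as the approximate match.
    Each record is scored independently, so this step parallelizes
    naturally across commodity machines."""
    for record in records:
        score = SequenceMatcher(None, record.lower(), pattern.lower()).ratio()
        yield ("match" if score >= threshold else "no_match", record)

def reduce_phase(pairs):
    """Reduce step: group emitted records by category key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

# Hypothetical example: approximate matching against one pattern.
records = ["Jon Smith", "John Smith", "Jane Doe"]
result = reduce_phase(map_phase(records, "John Smith"))
```

In a real MapReduce deployment the map tasks would be distributed across a cluster and the framework would shuffle keyed pairs to the reducers; this single-process sketch only shows the shape of the decomposition.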
However, Parkinson thinks Hadoop and MapReduce are too immature. “MapReduce really hasn’t evolved yet to the point where your average enterprise technologist can easily make productive use of it. As for Hadoop, they have done a good job, but it’s like a lot of open-source software—80 percent done. There were limits in the code that broke the stack well before what we thought was a good theoretical limit.”
Parkinson echoes many IT executives who are skeptical of open-source software in general. “If I have a bunch of engineers, I don’t want them spending their day being the technology support environment for what should be a product in our architecture,” he says.
That’s a legitimate point of view, especially considering the data volumes TransUnion manages—8 petabytes from 83,000 sources in 4,000 formats and growing—and its focus on mission-critical capabilities for this data. Credit scoring must run reliably and deliver accurate scores several times a day. It’s an operational system that many depend on for critical business decisions made millions of times a day. (For more on TransUnion, see the interview with Parkinson on page 14.)
Disney’s system is purely intended for exploratory efforts or at most for reporting that may eventually feed into product strategy or Web site design decisions. If it breaks or needs a little retooling, there’s no crisis.
But Albers disagrees about the readiness of the tools, noting that the Disney Technology Shared Services Group also handles quite a bit of data. He figures Hadoop and MapReduce aren’t any worse than a lot of proprietary software. “I fully expect we will run on things that break,” he says, adding facetiously, “Not that any commercial product I’ve ever had has ever broken.”
Data architect Estes also praises the responsiveness of open-source development. “In our testing, we uncovered stuff, and you get somebody on the other end. This is their baby, right? I mean, they want it fixed.”
Albers emphasizes the total cost-effectiveness of Hadoop and MapReduce. “My software cost is zero. You still have the implementation, but that’s a constant at some level, no matter what. Now you probably need to have a little higher skill level at this stage of the game, so you’re probably paying a little more, but certainly, you’re not going out and approving a Teradata cluster. You’re talking about Tier 3 storage. You’re talking about a very low level of cost for the storage.”
Albers’ points are also valid. PwC predicts these open-source tools will be solid sooner rather than later, and are already worthy of use in non-mission-critical environments and applications. Hence, in the CIO article on page 36, we argue in favor of taking cautious but exploratory steps.