Bill James has loved baseball statistics ever since he was a kid in Mayetta, Kansas, cutting baseball cards out of the backs of cereal boxes in the early 1960s. James, who compiled The Bill James Baseball Abstract for years, is a renowned “sabermetrician” (a term he coined himself). He now is a senior advisor on baseball operations for the Boston Red Sox, and he previously worked in a similar capacity for other Major League Baseball teams.
James has done more to change the world of baseball statistics than anyone in recent memory. As broadcaster Bob Costas says, James “doesn’t just understand information. He has shown people a different way of interpreting that information.” Before Bill James, Major League Baseball teams all relied on long-held assumptions about how games are won. They assumed batting average, for example, had more importance than it actually does.
James challenged these assumptions. He asked critical questions that didn’t have good answers at the time, and he did the research and analysis necessary to find better answers. For instance, how many days’ rest does a reliever need? James’s answer is that some relievers can pitch well for two or more consecutive days, while others do better with a day or two of rest in between. It depends on the individual. Why can’t a closer work more than just the ninth inning? A closer is frequently the best reliever on the team. James observes that managers often don’t use the best relievers to their maximum potential.
The lesson learned from the Bill James example is that the best statistics come from striving to ask the best questions and trying to get answers to those questions. But what are the best questions? James takes an iterative approach, analyzing the data he has, or can gather, asking some questions based on that analysis, and then looking for the answers. He doesn’t stop with just one set of statistics. The first set suggests some questions, to which a second set suggests some answers, which then give rise to yet another set of questions. It’s a continual process of investigation, one that’s focused on surfacing the best questions rather than assuming those questions have already been asked.
Enterprises can take advantage of a similarly iterative, investigative approach to data. Enterprises are being overwhelmed with data; many enterprises each generate petabytes of information they aren’t making best use of. And not all of the data is the same. Some of it has value, and some, not so much.
The problem with this data has been twofold: (1) it’s difficult to analyze, and (2) processing it using conventional systems takes too long and is too expensive.
Addressing these problems effectively doesn’t require radically new technology. Better architectural design choices and software that allows a different approach to the problems are enough. Search engine companies such as Google and Yahoo provide a pragmatic way forward in this respect. They’ve demonstrated that efficient, cost-effective, system-level design can lead to an architecture that allows any company to handle different data differently.
Enterprises shouldn’t treat voluminous, mostly unstructured information (for example, Web server log files) the same way they treat the data in core transactional systems. Instead, they can use commodity computer clusters, open-source software, and Tier 3 storage, and they can process in an exploratory way the less-structured kinds of data they’re generating. With this approach, they can do what Bill James does and find better questions to ask.
In this issue of the Technology Forecast, we review the techniques behind low-cost distributed computing that have led companies to explore more of their data in new ways. In the article, “Tapping into the power of Big Data,” on page 04, we begin with a consideration of exploratory analytics—methods that are separate from traditional business intelligence (BI). These techniques make it feasible to look for more haystacks, rather than just the needle in one haystack.
The article, “Building a bridge to the rest of your data,” on page 22 highlights the growing interest in and adoption of Hadoop clusters. Hadoop provides high-volume, low-cost computing with the help of open-source software and hundreds or thousands of commodity servers. It also offers a simplified approach to processing more complex data in parallel. The methods, cost advantages, and scalability of Hadoop-style cluster computing clear a path for enterprises to analyze lots of data they didn’t have the means to analyze before.
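The “simplified approach to processing data in parallel” refers to the MapReduce programming model that Hadoop popularized: a map step emits key-value pairs, the framework groups them by key, and a reduce step combines each group. The following is a minimal, single-process sketch of that model in Python; in a real Hadoop cluster these phases run across many commodity servers, and the function names here are illustrative, not Hadoop APIs.

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    """Map step: emit (key, value) pairs -- here, (word, 1) per word."""
    for word in record.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle step: group values by key, as the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce step: combine all values for one key -- here, a sum."""
    return (key, sum(values))

def word_count(records):
    """Run the three phases end to end over an iterable of records."""
    pairs = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical log lines standing in for unstructured enterprise data.
logs = ["error timeout", "timeout retry", "error error"]
print(word_count(logs))  # {'error': 3, 'timeout': 2, 'retry': 1}
```

The appeal of the model is that the map and reduce functions contain no distribution logic at all; the framework handles partitioning, scheduling, and fault tolerance, which is what makes clusters of cheap commodity machines practical.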
The buzz around Big Data and “cloud storage” (a term some vendors use to describe less-expensive cluster-computing techniques) is considerable, but the article, “Revising the CIO’s data playbook,” on page 36 emphasizes that CIOs have time to pick and choose the most suitable approach. The most promising opportunity is in the area of “gray data,” or data that comes from a variety of sources. This data is often raw and unvalidated, arrives in huge quantities, and doesn’t yet have established value. Gray data analysis requires a different skill set—people who are more exploratory by nature.
As always, in this issue we’ve included interviews with knowledgeable executives who have insights on the overall topic of interest:
Please visit pwc.com/techforecast to find these articles and other issues of the Technology Forecast. If you would like to receive future issues of the Technology Forecast as a PDF attachment in your e-mail inbox, you can sign up at pwc.com/techforecast/subscribe.
We welcome your feedback on this issue of the Technology Forecast and your ideas for where we should focus our research and analysis in the future.