A step toward the data lake in healthcare:
Late-bound data warehouses

Dale Sanders

Dale Sanders is senior vice president of Health Catalyst.


Dale Sanders of Health Catalyst describes how healthcare providers are addressing their need for better analytics.

Interview conducted by Alan Morrison, Bo Parker, and Brian Stein

PwC: How are healthcare enterprises scaling and maturing their analytics efforts at this point?

DS: It’s chaotic right now. HITECH [Health Information Technology for Economic and Clinical Health Act] funding facilitated the adoption of EMRs [electronic medical records] and billing systems as data collection systems. And HIEs [health information exchanges] encouraged more data sharing. Now there’s a realization that analytics is critical. Other industries experienced the same pattern, but healthcare is going through it just now.

The bad news for healthcare is that the market is so overwhelmed from the adoption of EMRs and HIEs. And now the transition from ICD-9 [International Classification of Diseases, Ninth Revision] to ICD-10 is coming, as well as the changes to the HIPAA [Health Insurance Portability and Accountability Act] regulation. Meaningful use is still a challenge. Accountable care is a challenge.

There’s so much turmoil in the market, and it’s hard to admit that you need to buy yet another IT system. But it’s hard to deny that, as well. Lots of vendors claim they can do analytics. Trying to find the way through that maze and that decision-making is challenging.

PwC: How did you get started in this area to begin with, and what has your approach been?

DS: Well, to go way back in history, when I was in the Air Force, I conceived the idea for late binding in data warehouses after I’d seen some different failures of data warehouses using relational database systems.

If you look at the early history of data warehousing in the government and military—it was all on mainframes. And those mainframe data warehouses look a lot like Hadoop today. Hadoop is emerging with better tools, but conceptually the two types of systems are very similar.

When relational databases became popular, we all rushed to those as a solution for data warehousing. We went from the flat files associated with mainframes to Unix-based data warehouses that used relational database systems. And we thought it was a good idea. But one of the first big mistakes everyone made was to develop these enterprise data models using a relational form.

I watched several failures happen as a consequence of that type of early binding to those enterprise models. I made some adjustments to my strategy in the Air Force, and I made some further adjustments when I worked for companies in the private sector and further refined it.

I came into healthcare with that. I started at Intermountain Healthcare, which was an early adopter of informatics. The organization had a struggling data warehouse project because it was built around this tightly coupled, early-binding relational model. We put a team together, scrubbed that model, and applied late binding. And, knock on wood, it’s been doing very well. It’s now 15 years in its evolution, and Intermountain still loves it. The origins of Health Catalyst come from that history.

PwC: How mature are the analytics systems at a typical customer of yours these days?

DS: We generally get two types of customers. One is the customer with a fairly advanced analytics vision and aspirations. They understand the whole notion of population health management and capitated reimbursement and things like that. So they’re naturally attracted to us. The dialogue with those folks tends to move quickly.

Then there are folks who don’t have that depth of background, but they still understand that they need analytics.

“We have an analytics adoption model that we use to frame the progression of analytics in an organization. Most of the (healthcare) industry operates at level zero.”

We have an analytics adoption model that we use to frame the progression of analytics in an organization. We also use it to help drive a lot of our product development. It’s an eight-level maturity model. Intermountain operates pretty consistently at levels six and seven.

But most of the industry operates at level zero—trying to figure out how to get to levels one and two. When we polled participants in our webinars about where they think they reside in that model, about 70 percent of the respondents said level two and below.

So we’ve needed to adjust our message and not talk about levels five, six, and seven with some of these clients. Instead, we talk about how to get basic reporting, such as internal dashboards and KPIs [key performance indicators], or how to meet the external reporting requirements for The Joint Commission and accountable care organizations [ACOs] and that kind of thing.

If they have a technical background, some organizations are attracted to this notion of late binding. And we can relate at that level. If they’re familiar with Intermountain, they’re immediately attracted to that track record and that heritage. There are a lot of different reactions.

PwC: With customers who are just getting started, you seem to focus on already well-structured data. You’re not opening up the repository to data that’s less structured as well.

“The vast majority of data in healthcare is still bound in some form of a relational structure, or we pull it into a relational form.”

DS: The vast majority of data in healthcare is still bound in some form of a relational structure, or ultimately we pull it out into a relational form. Late binding puts us between the worlds of traditional relational data warehouses and Hadoop—between a very structured representation of data and a very unstructured representation of data.

But late binding lets us pull in unstructured content. We can pull in clinical notes and free text and that sort of thing. Health Catalyst is developing some products to take advantage of that.

But if you look at the analytic use cases and the analytic maturity of the industry right now, there’s not a lot of need to bother with unstructured data. That’s reserved for a few of the leading innovators. The vast majority of the market doesn’t need unstructured content at the moment. In fact, we really don’t even have that much unstructured content that’s very useful.

PwC: What’s the pain point that the late-binding approach addresses?

DS: This is where we borrow from Hadoop and also from the old mainframe days.

When we pull a data source into the late-binding data warehouse, we land that data in a form that looks and feels much like the original source system.

Then we make a few minor modifications to the data. If you’re familiar with data modeling, we flatten it a little bit. We denormalize it a little bit. But for the most part, that data looks like the data that was contained in the source system, which is a characteristic of a Hadoop data lake—very little transformation to data.
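The landing step Sanders describes can be sketched in a few lines. This is a minimal illustration, not Health Catalyst's actual pipeline: the source record shape, table name, and column names are all invented stand-ins for an EMR extract. The point is that the landed table keeps the source system's own names, and the only transformation is a light flattening of the nested record.

```python
import sqlite3

# Hypothetical source extract: the record shape and field names mirror an
# imagined EMR export, not any real vendor's schema.
source_rows = [
    {"PatientID": "P001", "Encounter": {"EncounterID": "E10", "Dept": "OB"}},
    {"PatientID": "P002", "Encounter": {"EncounterID": "E11", "Dept": "ICU"}},
]

conn = sqlite3.connect(":memory:")
# Land the data in a table that keeps the source system's own column names.
conn.execute(
    "CREATE TABLE emr_encounters (PatientID TEXT, EncounterID TEXT, Dept TEXT)"
)
for row in source_rows:
    # The only modification: flatten (denormalize) the nested encounter.
    flat = {"PatientID": row["PatientID"], **row["Encounter"]}
    conn.execute(
        "INSERT INTO emr_encounters VALUES (:PatientID, :EncounterID, :Dept)",
        flat,
    )

# The landed table still "looks and feels" like the source extract.
print(conn.execute("SELECT * FROM emr_encounters").fetchall())
```

Because nothing has been remapped into a new enterprise model, an analyst who knows the source system recognizes the landed data immediately, which is the fidelity and familiarity argument made below.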

So we retain the binding and the fidelity of the data as it appeared in the source system. If you contrast that approach with the other vendors in healthcare, they remap that data from the source system into an enterprise data model first. But when you map that data from the source system into a new relational data model, you inherently make compromises about the way the data is modeled, represented, named, and related.

You lose a lot of fidelity when you do that. You lose familiarity with the data. And it’s a time-consuming process. It’s not unusual for that early binding, monolithic data model approach to take 18 to 24 months to deploy a basic data warehouse.

In contrast, we can deploy content and start exposing it to analytics within a matter of days and weeks. We can do it in days, depending on how aggressive we want to be. There’s no binding early on. There are six different places where you can bind data to vocabulary or relationships as it flows from the source system out to the analytic visualization layer.

Before we bind data to new vocabulary, a new business rule, or any analytic logic, we ask ourselves what use case we’re trying to satisfy. We ask on a use case basis, rather than assuming a use case, because that assumption could lead to problems. We can build just about whatever we want to, whenever we want to.

PwC: In essence, you’re moving toward an enterprise data model. But you’re doing it over time, with a model that’s driven by use cases.

“We are building an enterprise data model one object at a time.”

DS: Are we actually building an enterprise data model one object at a time? That’s the net effect. Let’s say we land half a dozen different source systems in the enterprise data warehouse. One of the first things we do is provide a foreign key across those sources of data that allows you to query across those sources as if they were an enterprise data model. And typically the first foreign key that we add to those sources—using a common name and a common data type—is patient identifier. That’s the most fundamental. Then you add vocabularies such as CPT [Current Procedural Terminology] and ICD-9 as that need arises.

When you land the data, you have what amounts to a virtual enterprise model already. You haven’t remodeled the data at all, but it looks and functions like an enterprise model. Then we’ll spin targeted analytics data marts off those source systems to support specific analytic use cases.
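The "virtual enterprise model" idea can be made concrete with a small sketch. The table and column names here are illustrative assumptions, not a real schema: two landed sources keep their own shapes, but because both carry the common patient-identifier key (with a common name and type), they can be queried together as if they were one enterprise model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two landed sources, each retaining its own source-like shape; names are
# invented for illustration.
conn.execute("CREATE TABLE emr_encounters (patient_id TEXT, dept TEXT)")
conn.execute("CREATE TABLE billing_claims (patient_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO emr_encounters VALUES (?, ?)",
    [("P001", "OB"), ("P002", "ICU")],
)
conn.executemany(
    "INSERT INTO billing_claims VALUES (?, ?)",
    [("P001", 1200.0), ("P001", 300.0), ("P002", 950.0)],
)

# The shared key is the first "late binding": a cross-source query that
# behaves as if an enterprise data model already existed.
rows = conn.execute(
    """
    SELECT e.patient_id, e.dept, SUM(b.amount) AS total_billed
    FROM emr_encounters e JOIN billing_claims b USING (patient_id)
    GROUP BY e.patient_id, e.dept
    ORDER BY e.patient_id
    """
).fetchall()
print(rows)  # per-patient view spanning clinical and billing sources
```

Nothing was remodeled to get here; the join key alone does the work, and further vocabularies (CPT, ICD) can be layered on the same way as the need arises.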

For example, perhaps you want to drill down on the variability, quality, and cost of care in a clinical program for women and newborns. We’ll spin off a registry of those patients and the physicians treating those patients into its own separate data mart. And then we will associate every little piece of data that we can find: costing data, materials management data, human resources data about the physicians and nurses, patient satisfaction data, outcomes data, and eventually social data. We’ll pull that data into the data mart that’s specific to that analytic use case to support women and newborns.
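Spinning off a targeted data mart like the women-and-newborns registry amounts to selecting the cohort and pulling every associated piece of data alongside it. The following is a toy sketch under invented table names and values, not the actual Health Catalyst product logic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Landed sources (illustrative): encounters plus costing and patient
# satisfaction data, all keyed on the same patient identifier.
conn.executescript(
    """
    CREATE TABLE encounters (patient_id TEXT, service_line TEXT);
    CREATE TABLE costs (patient_id TEXT, cost REAL);
    CREATE TABLE satisfaction (patient_id TEXT, score INTEGER);
    INSERT INTO encounters VALUES
        ('P001','Women & Newborns'),('P002','Cardiology'),('P003','Women & Newborns');
    INSERT INTO costs VALUES ('P001',8200.0),('P002',15100.0),('P003',6400.0);
    INSERT INTO satisfaction VALUES ('P001',9),('P002',7),('P003',8);
    """
)

# Spin off the data mart: a registry of the cohort with its associated
# costing and satisfaction data materialized in one place.
conn.execute(
    """
    CREATE TABLE mart_women_newborns AS
    SELECT e.patient_id, c.cost, s.score
    FROM encounters e
    JOIN costs c USING (patient_id)
    JOIN satisfaction s USING (patient_id)
    WHERE e.service_line = 'Women & Newborns'
    """
)
mart = conn.execute(
    "SELECT * FROM mart_women_newborns ORDER BY patient_id"
).fetchall()
print(mart)
```

The mart exists only to serve its analytic use case; the underlying landed sources remain untouched, so other marts can be spun off the same data under different rules.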

PwC: So you might need to perform some transform rationalization, because systems might not call the same thing by the same name. Is that part of the late-binding vocabulary rationalization?

DS: Yes, in each of those data marts.

PwC: Do you then use some sort of provenance record—a way of rationalizing the fact that we call these 14 things different things—that becomes reusable?

DS: Oh, yes, that’s the heart of it. We reuse all of that from organization to organization. There’s always some modification. And there’s always some difference of opinion about how to define a patient cohort or a disease state. But first we offer something off the shelf, so you don’t need to re-create them.

PwC: What if somebody wanted to perform analytics across the data marts or across different business domains? In this framework, would the best strategy be to somehow consolidate the data marts, or instead go straight to the underlying data warehouse?

DS: You can do either one. Let’s take a comorbidity situation, for example, where a patient has three or four different disease states. Let’s say you want to look at that patient’s continuum of care across all of those.

Over the top of those data marts is still this common late-binding vocabulary that allows you to query the patient as that patient appears in each of those different subject areas, whatever disease state it is. It ends up looking like a virtual enterprise model for that patient’s record. After we’ve formally defined a patient cohort and the key metrics that the organization wants to understand about that patient cohort, we want to lock that down and tightly bind it at that point.

First you get people to agree. You get physicians and administrators to agree how they want to identify a patient cohort. You get agreement on the metrics they want to understand about clinical effectiveness. After you get comprehensive agreement, then you look for it to stick for a while. When it sticks for a period of time, then you can tightly bind that data together and feel comfortable about doing so—so you don’t need to rip it apart and rebind it again.
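The agree-then-bind sequence can be sketched as follows. The field names and thresholds are purely illustrative, not a clinical standard: while a cohort definition is still being "fingerprinted," it stays as easily replaceable logic, and only once agreement sticks is the registry materialized and locked down.

```python
# Hypothetical patient records; fields and values are invented.
patients = [
    {"id": "P001", "hba1c": 7.2, "on_insulin": True},
    {"id": "P002", "hba1c": 5.4, "on_insulin": False},
    {"id": "P003", "hba1c": 6.9, "on_insulin": False},
]

# While physicians and administrators are still exploring, the cohort
# definition is just a swappable rule, not a baked-in data model.
def diabetic_cohort(pts, hba1c_threshold=6.5):
    return [p["id"] for p in pts
            if p["hba1c"] >= hba1c_threshold or p["on_insulin"]]

print(diabetic_cohort(patients))        # one exploratory definition
print(diabetic_cohort(patients, 7.0))   # a competing local fingerprint

# Once comprehensive agreement sticks, bind tightly: materialize the
# registry so every downstream mart shares the locked-down cohort.
AGREED_REGISTRY = frozenset(diabetic_cohort(patients))
```

The tight binding happens last, after consensus, so the definition never has to be ripped apart and rebound while opinions are still shifting.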

PwC: When you speak about coming toward an agreement among the various constituencies, is it a process that takes place more informally outside the system, where everybody is just going to come up with the model? Or is there some way to investigate the data first? Or by using tagging or some collaborative online utility, is there an opportunity to arrive at consensus through an interface?

DS: We have ready-to-use definitions around all these metrics—patient registries and things like that. But we also recognize that the state of the industry being what it is, there’s still a lot of fingerprinting and opinions about those definitions. So even though an enterprise might reference the National Quality Forum, the Agency for Healthcare Research and Quality, and the British Medical Journal as the sources for the definitions, local organizations always want to put their own fingerprint on these rules for data binding.

We have a suite of tools to facilitate that exploration process. You can look at your own definitions, and you can ask, “How do we really want to define a diabetic patient? How do we define congestive heart failure and myocardial infarction patients?”

We’ll let folks play around with the data, visualize it, and explore different definitions. When we see them coming toward a comprehensive and persistent agreement, then we’ll suggest, “If you agree to that definition, let’s bind it together behind that visualization layer.” That’s exactly what happens. And you must allow that to happen. You must let that exploration and fingerprinting happen.

“A drawback of traditional ways of deploying data warehouses is they presuppose various bindings and rules. They don’t allow for data exploration and local fingerprinting.”

A drawback of the traditional ways of deploying data warehouses is that they presuppose all of those bindings and rules. They don’t allow that exploration and local fingerprinting.

PwC: So how do companies get started with this approach? Assuming they have existing data warehouses, are you using those warehouses in a new way? Are you starting up from scratch? Do you leave those data warehouses in place when you’re implementing the late-bound idea?

DS: Some organizations have an existing data warehouse. And a lot of organizations don’t. The greenfield organizations are the easiest to deal with.

Decoupling all of the analytic logic that’s been built around those existing data warehouses and carrying it into the new environment is a pretty complicated strategy. Like most transitions of this kind, it often happens through attrition. First you build the new enterprise data warehouse around those late-binding concepts. And then you start populating it with data.

The one thing you don’t want to do is build your new data warehouse under a dependency to those existing data warehouses. You want to go around those data warehouses and pull your data straight from source systems in the new architecture. It’s a really bad strategy to build a data warehouse on top of data warehouses.

PwC: Some of the people we’ve interviewed about Hadoop assert that using Hadoop versus a data warehouse can result in a cost benefit that’s at least an order of magnitude cheaper. They claim, for example, that storing data costs $250,000 per terabyte in a traditional warehouse versus $25,000 per terabyte for Hadoop. If you’re talking with the C-suite about an exploratory analytics strategy, what’s the advantage of staying with a warehousing approach?

DS: In healthcare, the compelling use case for Hadoop right now is the license fee. Contrast that case with what compels Silicon Valley web companies and everybody else to go to Hadoop. Their compelling reason wasn’t so much about money. It was about scalability.

If you consider the nature of the data that they’re pulling into Hadoop, there’s no such thing as a data model for the web. All the data that they’re streaming into Hadoop comes tagged with its own data model. They don’t need a relational database engine. There’s no value to them in that setting at all.

For CIOs, the fact that Hadoop is inexpensive open source is very attractive. The downside, however, is the lack of skills. The skills and the tools and the ways to really take advantage of Hadoop are still a few years off in healthcare. Given the nature of the data that we’re dealing with in healthcare right now, there’s nothing particularly compelling about Hadoop in healthcare right now. Probably in the next year, we will start using Hadoop as a preprocessor ETL [extract, transform, load] platform that we can stream data into.

During the next three to four years, as the skills and the tools evolve to take advantage of Hadoop, I think you’ll see companies like Health Catalyst being more aggressive about the adoption of Hadoop in a data lake scenario. If you add just enough foreign keys and analytic dimensions across that data lake, you make reliable landing and loading much easier. It’s really, really hard to pull meaningful data out of those lakes without something to get the relationships started.