How the Semantic Web might improve cancer treatment

Photo: Lynn Vogel. M. D. Anderson’s Lynn Vogel explores new techniques for combining clinical and research data.

Interview conducted by Alan Morrison, Bo Parker, and Joe Mullich

Lynn Vogel is vice president and CIO of The University of Texas M. D. Anderson Cancer Center. In addition, he holds a faculty appointment at The University of Texas in Bioinformatics and Computational Biology. In this interview, Vogel describes M. D. Anderson’s semantic technology research and development and the hospital’s approach to data integration.


PwC: Could you give us a sense of the IT organization you manage and the semantic technology projects you’ve been working on?

LV: M. D. Anderson has a little more than $3 billion a year in revenue, about 17,000 employees, and a fairly substantial investment in IT. We have a little more than 700 people in our IT division. We do a significant amount of software development. For example, we have been developing our own electronic medical record capability, which is something fewer than a half dozen healthcare organizations in the country have even tried, let alone been successful with.

We tend, I think, to be on the high end of the scale both in terms of investment and in terms of pushing the envelope with technologies. For example, our electronic medical record is probably the single most complete model based on service-oriented architecture [SOA] that there is in healthcare, particularly in the clinical space. We have built it entirely on a SOA framework and have demonstrated, certainly to our satisfaction, that SOA is more than simply a reasonable framework. SOA probably is the framework that we need across the industry during the next three to five years if we’re going to keep pace with all the new data sources that are impacting healthcare.

In the semantic technology area, we have a couple of faculty who have done a lot of work on semantic data environments. It’s turning out to be a very tricky business. When you have a semantic data environment up and running, what do you do with it and how does it integrate with other things that you do? It is still such a new development that what you would call the practical uses, the use cases around it, are still a challenge to figure out. What will be the actual impact on the daily business of research and clinical care?

We have an environment here we call S3DB, which stands for Simple Sloppy Semantic Database. Our faculty have published research papers on it that describe what we do, but it’s still very, very much on the cutting edge of the research process about data structures. And although we think there’s enormous potential in moving this along, it’s still very, very uncertain as to where the impact is actually going to be.

PwC: When you’re working with S3DB, what kinds of sources are you trying to integrate? And what’s the immediate objective once you have the integration in place?

LV: The big challenge in cancer care at this point—and it really focuses on personalized medicine—is how to bring together the data that’s generated out of basic research processes and the data that’s generated out of the clinical care process. And there are a number of interesting issues about that. You talk to most CIOs in healthcare, and they say, “Oh, we’ve got to get our physicians to enter orders electronically.” Well, we’re starting to do that more and more, but that’s not our big issue. Our big issue is that a patient comes in, sees the doctor, and the doctor says, “I’m sorry to tell you that you have cancer of this particular type.” And if I were the patient, I’d say, “Doctor, I want you to tell me: Of the last 100 patients who had this diagnosis with my set of characteristics—with my clinical values, lab values, whatever else—who were put on the therapy that you are prescribing for me, what has been their outcome?”


“One of the expectations, particularly around semantic technology, is that it enables us to provide not simply a bridge between clinical and research data sources, but potentially a home for both of those types of data sources. It has the ability to view data not simply as data elements, but as data elements with a context.”

On the one hand, that is a clinical question. I am a patient. I want to know what my chances are. At the end of the day in cancer, it’s about survival. So that’s the first question. On the other hand, it’s also a research question, because the clinician—to have an idea about the prognosis and to be able to respond to this patient—needs to know about the data that has been collected around this particular kind of patient, this particular kind of disease, and this particular kind of therapy. So one of the expectations, particularly around semantic technology, is that it enables us to provide not simply a bridge between clinical and research data sources, but potentially a home for both of those types of data sources. It has the ability to view data not simply as data elements, but as data elements with a context.
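To make “data elements with a context” concrete, here is a minimal sketch, using hypothetical identifiers rather than S3DB’s actual model or any real M. D. Anderson data, of how a clinical lab value and a research assay about the same patient might sit side by side as subject-predicate-object triples and be reached by a single question:

```python
# Hypothetical illustration only: clinical and research facts as simple
# subject-predicate-object triples, so both live in one queryable structure.
triples = [
    # Clinical side: an observation with its context (units, date).
    ("patient:0001", "hasDiagnosis",    "disease:acute_myeloid_leukemia"),
    ("obs:42",       "aboutPatient",    "patient:0001"),
    ("obs:42",       "measures",        "lab:hemoglobin"),
    ("obs:42",       "valueGramsPerDl", "9.8"),
    ("obs:42",       "observedOn",      "2009-03-14"),
    # Research side: a gene-expression result tied to the same patient.
    ("assay:77",     "aboutPatient",    "patient:0001"),
    ("assay:77",     "measuresGene",    "gene:FLT3"),
    ("assay:77",     "expressionLevel", "overexpressed"),
    ("assay:77",     "fromStudy",       "study:biomarker_2008_12"),
]

def objects(subject, predicate):
    """Return every object linked to a subject by a predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# One question that spans both worlds: which studies produced research data
# about patients carrying this diagnosis?
for s, p, o in triples:
    if p == "hasDiagnosis" and o == "disease:acute_myeloid_leukemia":
        patient = s
        assays = [subj for subj, pred, obj in triples
                  if pred == "aboutPatient" and obj == patient
                  and subj.startswith("assay:")]
        for a in assays:
            print(patient, a, objects(a, "fromStudy"))
```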

PwC: The data you’re talking about, is it external and internal data, structured and unstructured?

LV: Yes, it could be anything. Unstructured data is obviously a bigger problem than anything else. But even with structured data, integrating data from disparate sources is a big challenge. I might have gene expression data from a series of biomarker studies. I could have patient data in terms of diagnosis, lab values, and so on. Those are very different types of data.

When you look at the structure of IT in healthcare today, it’s largely patient focused, built around discrete values. I want to find out what Mrs. Smith’s hemoglobin level is. That’s a very discrete question, and it’s a very clear, simple question with a very discrete answer. In that process, the clinician is looking at one patient but is trying to assimilate many, many, many attributes of that patient. That is, he or she is looking at lab values, pictures from radiology, meds, et cetera, and working toward an assessment of an individual patient.


“Either you optimize for the clinician, looking for one patient and that patient’s many values, or you optimize for the researcher, looking at very few values, but many, many patients.”

The research question turns that exactly on its head. The researcher is interested in looking at a very few attributes, but across many, many patients. Unfortunately, there isn’t a database technology on the market today that can reconcile those two needs. Either you optimize for the clinician, looking for one patient and that patient’s many values, or you optimize for the researcher, looking at very few values, but many, many patients.

And so that kind of challenge is what confronts us. From a data management standpoint, you use the data that you get from gene expression studies to match patterns of data with association studies, which is really not what you’re doing on the clinical side. Now, having said that, our semantic tools are one way to bridge that gap. It is possible that semantic technologies will provide the framework within which both of these vastly different types of data can be used. I think this is going to determine the future of how successful we are in dealing with cancer. We’re not convinced entirely yet, but there are positive indications we’ve written about in our publications.
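The two access patterns described above can be sketched in a few lines. The records below are purely illustrative, not any real schema: the same data serves a clinician asking for everything about one patient and a researcher asking for one value across every patient.

```python
# Toy records, purely illustrative: each patient carries many attributes.
records = {
    "patient:0001": {"diagnosis": "AML", "hemoglobin": 9.8,  "age": 61, "therapy": "cytarabine"},
    "patient:0002": {"diagnosis": "AML", "hemoglobin": 11.2, "age": 54, "therapy": "cytarabine"},
    "patient:0003": {"diagnosis": "CML", "hemoglobin": 13.1, "age": 47, "therapy": "imatinib"},
}

# Clinical access pattern: one patient, many attributes.
print(records["patient:0001"])

# Research access pattern: one attribute, many patients.
hemoglobin_by_patient = {pid: r["hemoglobin"] for pid, r in records.items()}
print(hemoglobin_by_patient)

# A row-oriented store makes the first lookup cheap; a column-oriented store
# makes the second cheap. Optimizing for one typically penalizes the other,
# which is the tension described above.
```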

PwC: So is this part of the new emphasis on evidence-based medicine, or is it something else?

LV: Evidence-based medicine historically has focused on the data that says, if I have a patient with a particular problem, a given therapy will work. Basically the question is: Is this patient’s cellular structure and the kind of genetic expression that shows up in this patient’s cell—is that amenable to targeting with a particular therapy? So evidence-based medicine really covers the whole gamut of medicine. What we’re trying to figure out is at a much more granular level.
We’re looking for the relationship between the development of a cancerous condition and a particular gene expression, and then a particular therapy that will deal with that condition or deal with that diagnosis under the conditions of a particular gene expression.

PwC: Is the bigger problem here that the data has probably been collected somewhere in some context, but there are no standards for what you use to describe that data, and you need to stitch together data sets from many, many different studies and rationalize the nomenclature? Or is it a different problem?

LV: One of the biggest problems with genetic studies today is that people don’t follow highly standardized procedures, and the replication of a particular study is a real challenge, because it turns out that the control processes that guide what you’re doing sometimes omit things.
For example, we had a faculty group here a year or so ago that tried to look at a fairly famous study that had been published about gene expressions in association with a particular disease presentation. And when they asked for the data set, because now you’re required to publish the data set as well as the conclusions, it turned out to be an Excel spreadsheet, and the original authors had included the heading rows along with the data when they ran the analysis, so the results weren’t quite what they represented.

So that’s just kind of sloppy work. That doesn’t even get to one of the big challenges in healthcare, which is vocabulary and terminology. You know, there could be 127 ways to describe high blood pressure. So if I described it one way, and you described it another way, and I put my data in my database, and you put your data in your database, and we combine our databases and do a search, we’d have to know all 127 ways that we’ve described it to arrive at any kind of a conclusion from the data, and that is very, very difficult.
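A common workaround for this terminology problem is to map every local phrasing to a single canonical concept before merging or searching. The sketch below is a toy example with a hand-built dictionary; real systems rely on standard vocabularies such as SNOMED CT or ICD instead.

```python
# Toy synonym map: many local phrasings of one concept map to a single code.
# Real systems would use a standard vocabulary (SNOMED CT, ICD) instead.
CANONICAL = {
    "high blood pressure": "concept:hypertension",
    "hypertension":        "concept:hypertension",
    "elevated bp":         "concept:hypertension",
    "htn":                 "concept:hypertension",
}

def normalize(term):
    """Map a free-text condition to its canonical concept, if known."""
    return CANONICAL.get(term.strip().lower(), "concept:unknown")

# Two databases described the same condition differently...
db_a = [("patient:0001", "HTN")]
db_b = [("patient:0002", "High blood pressure")]

# ...but after normalization a single search finds both.
combined = [(pid, normalize(term)) for pid, term in db_a + db_b]
hypertensive = [pid for pid, concept in combined
                if concept == "concept:hypertension"]
print(hypertensive)  # ['patient:0001', 'patient:0002']
```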

PwC: As you look out three to five years, and presuming we continue to find more and more value from semantic technologies, where will it have the biggest impact in healthcare IT?

LV: I think one of the big challenges in healthcare IT is that IT investments, particularly in the clinical side of healthcare, are by and large driven by a small number of commercial vendors, and they sell exclusively into the acute care market, which is fine. I mean, it’s a reasonable market to sell into, but they don’t have a clue about the challenges of research.

If you look at what’s happening in medicine today, medicine in general is more and more based on things like genomics, which is coming from the research side of the house. But when you talk to healthcare IT vendors or look at their products, you discover that they have built their products on technologies and architectures that are now 15 to 20 years old.

PwC: A meta-model of health that’s care focused.

LV: It is care focused, and in most cases, it’s built on a single physical data repository model. It says, “Let’s take all the clinical data we have from every place we have it and dump it into one of these clinical data repositories.”

Well, vendors have discovered a couple of things. One is that even the task of integrating images into that database is very, very difficult. In fact, in most vendor architectures, the image archive is separate from the data archive. And, frankly, it’s fairly cumbersome to move back and forth. So that says that the architecture we had 15 years ago wasn’t really built to accommodate the integration of imaging data. All you need to do is to step into the genomics world, and you realize that the imaging integration challenges only scratch the surface. You have no idea what you’re running into.

PwC: Is the British exercise in developing a National Health Service electronic medical record addressing these sorts of data integration issues?

LV: Not to my knowledge. I mean, everybody now is working with images to some extent. The National Health Service is trying to build its models around commercial vendor products, and those commercial products are only in the acute care space. And they’re built on closed data models, which in 1992 were terrific. We were excited.


“You have two directions you can go. You can try to cram it all into one big place, which is the model we had in 1992. Or, you can say there will always be repositories of data, and there will always be new types of data. We need an architecture that will accommodate this reality.”

But that’s really why we opted to go off on our own. We felt very strongly that there will always be two things: many sources of data, and new kinds of data sources to incorporate. And you have two directions you can go. You can try to cram it all into one big place, which is the model we had in 1992. Or, you can say there will always be repositories of data, and there will always be new types of data. We need an architecture that will accommodate this reality, and, frankly, that architecture is a services architecture.

PwC: Do semantics play just a temporary role while the data architecture is being figured out? So that eventually the new data becomes standard fare and the role of semantics disappears? Or is the critical role of semantic technology enduring, because there never will be an all-encompassing data architecture?

LV: I think, quite honestly, the answer is still out there. I don’t know what the answer is. As we continue to move forward, semantic technologies will have a role to play, just because of the challenges that data creates within the contexts that need to be understood and represented in a data structure. And semantic technology is one possibility for capturing, maintaining, and supporting those contexts. My guess is it’s the next stage of the process, but it’s really too soon to tell.


“As we continue to move forward, semantic technologies will have a role to play, just because of the challenges that data creates within the contexts that need to be understood and represented in a data structure.”

Oracle now supports semantic representation, which basically means we have moved past rows and columns and tables and elements, to RDF [Resource Description Framework] triples. That’s good stuff, but we’re not clear yet, even with the time we’ve spent on it, where all this fits into our game. The technology’s very experimental, and there’s a lot of controversy, quite frankly. There are people who focus on this who have totally opposite views of whether it’s actually useful or not, and part of the reason for those opposite views is that we don’t really understand yet what it does. We kind of know what it is, but to understand what it does is the next test.
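As a rough illustration of what moving past rows and columns to triples means, the fragment below recasts a conventional row as subject-predicate-object statements. It is a plain-Python toy, not Oracle’s RDF support or any actual M. D. Anderson schema; the point is that new context can be attached without changing a table definition.

```python
# A conventional row: one record, columns fixed by the table schema.
row = {"patient_id": "0001", "test": "hemoglobin", "value": 9.8, "units": "g/dL"}

# The same fact as triples: each cell becomes a statement, and new kinds of
# context (provenance, study, method) can be added without altering a schema.
subject = "obs:42"
triples = [
    (subject, "aboutPatient", "patient:" + row["patient_id"]),
    (subject, "measures",     "lab:" + row["test"]),
    (subject, "hasValue",     row["value"]),
    (subject, "hasUnits",     row["units"]),
    # Context attached later, no schema change required:
    (subject, "recordedBy",   "device:analyzer_7"),
]

for t in triples:
    print(t)
```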

PwC: And do you think the community aspect—working collaboratively with the broader community on medical ontologies, terminology, and controlled vocabularies—is likely to play a role, or do you think that the M. D. Andersons of the world are going to have to figure this out for themselves?

LV: That’s one of the things that worries me about the federal stimulus plan and its funding for electronic medical records. It’s too strongly influenced by the vendor community. It’s not that the vendors are bad guys; they’re not. They’re actually smart, and they offer good products by and large. They’ve all had their share of fabulous successes and dismal failures, and it just goes to the point that it’s not the technology that’s the issue, it’s what you do with it that makes the difference.

PwC: But at this time they have no incentive to create a data architecture for electronic medical records that works in the way you desire, that is capable of being flexible and open to new sources of data.

LV: That is correct.

PwC: What about the outlook for interoperability on the research side?

LV: For all the vendors who say their systems can talk to all of their own implementations, the answer is no, they can’t. Interoperability is a big buzzword. It’s been around for a long time. You know, technology doesn’t solve the organizational issues.

When you have 85 percent of the physicians in this country practicing in two- and three-person practices, that’s a different issue from “let’s make everybody interoperable.” The physicians today have no incentives to make the information technology and process change investments that are required for interoperability. I’m a physician, you come to see me, I give you a diagnosis and a treatment, and if I can’t figure it out, I will send you to a specialist and let him figure it out, and, hopefully, he’ll get back to me, because all he wants to do is specialist stuff, and then I’ll continue on with you. But within that framework, there are not a lot of incentives to share clinical data generated from these patient interactions.

PwC: On the research side of the question, we’ve been reading about the bioinformatics grid and wondering if that sort of approach would have a benefit on the clinical side.

LV: I think it does. There are all kinds of discussions about the grid technology, and the National Cancer Institute has pushed its bioinformatics grid, the caBIG initiative. I think there has been significant underestimation of the effort required to make that work. People would like to think that in the future all of this stuff will be taken care of automatically. There’s a new report just out from the National Research Council on Computational Technology for Effective Health Care. It’s a fascinating discussion of what the future of medicine might look like, and it has an enormous number of assumptions about new technologies that will be developed.

All this stuff will take years to develop and to figure out how to use effectively. It’s just very hard work. We can talk about semantic technology and have an interesting discussion. What’s difficult is to figure out how you’re going to use it to make life better for people, and that’s still unclear.

PwC: Siloed, structured, and unstructured data are part of the reality you’ve had for years.

LV: That’s correct. And we’d like to eliminate that problem. You know, we have tons of unstructured data all over the place. We have a whole initiative here at M. D. Anderson, which we call our Structured and Clinical Documentation Initiative, which addresses the question of how we can collect data in a structured way that then makes it reusable to improve the science. And people have developed a lot of workarounds, if you will—all the way from natural language processing to scanning textual documents—because we have a ton of data that, for all practical purposes, will never be terribly useful to support science. Our commitment now is to change that initial process of data collection so that the data is, in fact, reusable down the road.

PwC: And there’s an element of behavioral change here.

LV: It’s also the fact that, in many cases, if you structure data up front, it will take you a bit longer to collect it. You could argue that once you’ve collected it, you have this fabulous treasure trove of structured data that can advance the science of what we do. But there’s an overhead for the individual clinicians who are collecting the data. They’re already under enormous pressure regarding the amount of time they spend with their patients. If you say, “Oh, by the way, for every patient that you see now, we’re adding 15 minutes so you can structure your data so that we all will be smarter,” that’s a pretty hard sell.

PwC: A reality check, that’s for sure.

LV: Well, you can read the literature on it, and it is absolutely fascinating, but at the end of the day we have to deliver for our patients. That’s really what the game is about. We don’t mind going off on some rabbit trails that hold some potential, even if we’re not clear how much. On the other hand, we have to be realistic, and we don’t have all the money in the world.


“At the end of the day we do have to deliver for our patients.”