Arvind Parthasarathi is president of YarcData, a Cray company. In this interview, Parthasarathi describes emerging use cases for in-memory graph analytics appliances, which analyze relationships between people, places and things in large graph data sets.
Arvind Parthasarathi of YarcData describes the emerging field of relationship analytics.
Interview conducted by Alan Morrison and Bo Parker
PwC: What kinds of questions can graph analytics appliances help answer?
AP: Graph appliances offer the ability to explore patterns by traversing relationships you couldn’t find or follow before. Take, for example, a counterterrorism analyst who suspects a terrorist plot. He says, “Find me two people who have some geographical relationship to each other, but are not in the same house.” One of them has rented a truck; the other has bought some fertilizer. And they are somehow both related to the same landmark. Any one of these relationships is innocuous, but if you put them all together, the user sees indications of a plot.
The user may start with that query, but then may say, “Well, what if there was a third guy in between? What if they didn’t live on the same street, but they lived in the same city? What if their association was to a website?” Once you find the pattern, you want to be able to look for extensions to the pattern. Frankly, you can’t come up with all the extensions yourself, so you want the system to start extending the pattern for you.
Those are the two major use cases we see with our customers: pattern-based queries and interactive queries. The terrorist plot example above is a type of pattern-based query. An example of an interactive query would be a doctor querying the graph analytics appliance about a patient’s puzzling condition. The appliance may propose competing hypotheses about the patient’s problem and suggest questions or tests to help differentiate among them. Some back and forth between the appliance and the doctor leads to a diagnosis of the condition and the best treatment option.
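As an illustration only, the truck-and-fertilizer pattern described above can be sketched in Python over a toy set of (subject, relation, object) triples. Everything here is an assumption made for the sketch: the names, the relation labels, the choice of "same street" as the geographical relationship, and the `find_plot_candidates` helper. It is not YarcData's query language or data model.

```python
from itertools import permutations

# Hypothetical toy data: (subject, relation, object) triples.
TRIPLES = {
    ("alice", "lives_at", "house_1"),
    ("bob", "lives_at", "house_2"),
    ("carol", "lives_at", "house_3"),
    ("house_1", "on_street", "elm_st"),
    ("house_2", "on_street", "elm_st"),
    ("house_3", "on_street", "oak_st"),
    ("alice", "rented", "truck"),
    ("bob", "bought", "fertilizer"),
    ("carol", "bought", "fertilizer"),
    ("alice", "photographed", "city_bridge"),
    ("bob", "visited", "city_bridge"),
}


def objects(subject, relation=None):
    """Return objects linked from subject, optionally filtered by relation."""
    return {
        o for s, r, o in TRIPLES
        if s == subject and (relation is None or r == relation)
    }


def streets_of(person):
    """Streets of every house the person lives at."""
    return {
        street
        for house in objects(person, "lives_at")
        for street in objects(house, "on_street")
    }


def find_plot_candidates(landmarks):
    """Find ordered pairs (truck renter, fertilizer buyer) matching the pattern."""
    people = {s for s, r, _ in TRIPLES if r == "lives_at"}
    matches = []
    for a, b in permutations(sorted(people), 2):
        if objects(a, "lives_at") & objects(b, "lives_at"):
            continue  # same house: excluded by the pattern
        if not streets_of(a) & streets_of(b):
            continue  # no geographical relationship (assumed: same street)
        if "truck" not in objects(a, "rented"):
            continue
        if "fertilizer" not in objects(b, "bought"):
            continue
        # Any relation at all to the same landmark counts.
        shared = objects(a) & objects(b) & landmarks
        if shared:
            matches.append((a, b, sorted(shared)))
    return matches


result = find_plot_candidates({"city_bridge"})
print(result)
```

Extending the pattern, as the analyst does in the follow-up questions, would amount to relaxing one of these checks, for example comparing cities instead of streets.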
PwC: Why use data graphs instead of some other approach for this?
AP: Recently we spoke to a customer that has a large analytical environment, a large data warehouse. They’re getting many queries now that the system wasn’t designed for. If you think of that query challenge in graph terms, the queries are not about the nodes in the graph; they’re about the edges in the graph, the relationships. Those relationships are very important because in one way or another, they characterize the similarities and differences between nodes or groups of nodes. Those relationships, in other words, provide a rich web of contextual information.
This customer finds that they have two choices. They can restrict users from running the forbidden queries. Or, they can try to figure out a way to allow those queries.
Graph appliances are able to articulate the relationships. With a graph, filling out the relationship between A and B is just a question of drawing a line between the two of them. If we can focus on graph-style analytics, instead of just building a data warehouse, we can help our customers build a relationship warehouse.
PwC: What is the user experience like when it comes to this kind of analytics?
AP: Users have two choices. They can start to explore that graph in real time, visualizing and walking the graph. Or alternatively, they can use a partially specified or pattern-based query, or what we call query by example.
PwC: Graphs can represent networked behaviors of people, places, and things. Isn’t that one of the key factors behind their utility?
AP: That’s right. Every computer science student spends time thinking about graph theory, and most of the world’s problems can be represented as graphs. For example, a lot of supply chain problems are essentially graph problems. In areas such as proteins, genetics, telecommunications, financial services, and finance theory, many problems are most easily expressed as graphs. Relationships by their very nature are graph oriented.
In spite of that, it’s been common practice to take a problem that’s easily represented in graph form and convert it into a tabular form and store the entire thing in a relational database. Enterprises found this conversion necessary, because the only means they had for persistence and data management was a tabular data representation and relational databases.
With the advent of big data, that kind of conversion doesn’t work anymore, because the data sets are too large to keep converting back and forth between graphs and tabular data structures. Customers require more real-time results. To speed up the analytic process, customers are looking for ways to analyze data directly in its graph form without this conversion step.
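To make the contrast concrete, here is a minimal Python sketch, assuming a hypothetical relationship table of (source, target) rows. The first function emulates a relational approach in which every extra hop is another pass over the table, much like a self-join; the second walks an adjacency structure built once. The data and function names are illustrative assumptions, not a description of any particular product.

```python
from collections import defaultdict, deque

# Hypothetical relationship table: each row is (source, target).
ROWS = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "e")]


def reachable_tabular(start, hops):
    """Emulate repeated self-joins: every hop rescans the whole table."""
    frontier, seen = {start}, {start}
    for _ in range(hops):
        frontier = {t for s, t in ROWS if s in frontier} - seen
        seen |= frontier
    return seen - {start}


def build_adjacency(rows):
    """Convert the rows into a node -> neighbors mapping once, up front."""
    adj = defaultdict(set)
    for s, t in rows:
        adj[s].add(t)
    return adj


def reachable_graph(adj, start, hops):
    """Breadth-first traversal that only touches neighbors of visited nodes."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == hops:
            continue  # do not expand beyond the hop limit
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen - {start}


ADJ = build_adjacency(ROWS)
print(reachable_tabular("a", 2), reachable_graph(ADJ, "a", 2))
```

Both functions return the same nodes; the difference is that the tabular version scans every row for each hop, while the graph version follows only the edges it needs.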
PwC: What have been the challenges with processing data in graph form?
AP: It’s one thing to operate just on one node. On a single node you can change information such as your address, name, and picture. But now imagine asking a question about the relationships between many different people. Relationships are expansive. It’s very hard to ask those relationship questions because the most interesting, real-life example graphs, in general, cannot be partitioned. There’s no clean way to say, “I’m going to take all these relationships and spread them across the machines in a cluster, putting these relationships in one machine and these others in the next machine.”
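The partitioning difficulty can be illustrated with a small Python sketch, assuming a hypothetical six-node graph and two machines. It tries every even split of the nodes and counts the edges whose endpoints end up on different machines; each such edge would require cross-machine communication during a traversal. The graph and function names are made up for the illustration.

```python
from itertools import combinations

# Hypothetical 6-node graph in which every node has three relationships.
NODES = ["a", "b", "c", "d", "e", "f"]
EDGES = [
    ("a", "b"), ("a", "c"), ("a", "d"),
    ("b", "c"), ("b", "e"), ("c", "f"),
    ("d", "e"), ("d", "f"), ("e", "f"),
]


def cut_edges(edges, machine_0):
    """Count edges whose endpoints land on different machines."""
    return sum((u in machine_0) != (v in machine_0) for u, v in edges)


def best_balanced_split(nodes, edges):
    """Try every even two-machine split and return the smallest cut found."""
    half = len(nodes) // 2
    return min(
        (cut_edges(edges, set(group)), group)
        for group in combinations(nodes, half)
    )


min_cut, group = best_balanced_split(NODES, EDGES)
print(min_cut, group)
```

In this toy graph, even the best even split leaves three of the nine edges crossing machines. Brute force like this is only feasible at toy scale, which is part of the point: there is no cheap way to find a clean split in a large, densely related graph.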
The other problem with graphs is that they’re not predictable. Techniques such as caching and pre-fetching are built upon the assumption that we can predict where the data is going to be. Graphs, and a lot of graph relationships, are completely ad hoc. I’ll be at one end, and I’ll need to traverse all the way to the other end of the graph. It’s not about adding nodes; instead, I’m constantly adding new relationships between nodes and taking out old relationships between nodes.
A third issue is that graphs are dynamic: the data is constantly changing. We needed to establish a quick way of bringing in a new data set, finding the relationships, and then throwing out the set if it doesn’t make sense and bringing in a new one. For example, if you have a very large soda and you want to consume it quickly, you can’t have an extremely narrow straw. With a large amount of shared, multi-threaded memory, you can walk each part of the graph. It’s possible to bring in about 350 TB an hour, crunch it, and send it out again.
Most real problems don’t fit themselves into nice small data sets. At the scale of real problems, to get a real-time response, you must marry the hardware and the software together. To get four to five orders of magnitude of performance improvement, you really need to change the paradigm.
PwC: This approach would seem to lend itself to different kinds of cluster analysis, for example.
AP: Correct. Let’s say I’m developing a real-time fraud detection application. With a conventional cluster analysis approach, you’re looking for a guy on the watch list whose name is John. Do I have somebody called Johnny, John John, or others? You’re looking for attribute-based clusters, nodes by themselves, or in simple combinations.
Connecting nodes and edges and then inferring even more connections to other nodes provides a different kind of querying power. Imagine asking a fraud detection question that said, “I think I see a fraud pattern where there’s a doctor who has a family relationship with some salesperson who’s a member of a particular life sciences committee. Are there people like that in my data set?” That approach allows you to ask a lot more questions around fraud, and the more you explore the different edges between nodes, the more you can discover and ponder other relationships that might be relevant also.
It’s not a replacement for other kinds of queries, but an additional method that helps in certain ways. We’re not trying to be everything for everybody; we’re basically saying, “Listen, there’s a portion of what you do that you can do much better.”
PwC: Hasn’t the data you’re working with lost a lot of its graph structure in the process of being normalized and captured? What do you do about that?
AP: Customers have their data in all sorts of formats. It’s often structured and semi-structured—XML, relational tables, flat files, and the like. It’s necessary to take in their data, rebuild the graph, and identify explicit relationships. Then we apply graph reasoning to flesh out the graph more—inferring additional relationships, in other words. Then customers, depending on the vertical, will pick an intelligence analyst, a drug researcher, a data scientist, or a quant and put them in front of our system to ask questions. That’s when they can ask the new kinds of questions we discussed earlier.
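A minimal Python sketch of what inferring additional relationships could look like, assuming explicit facts stored as (subject, relation, object) triples and two illustrative rules: that a hypothetical `family_of` relation is symmetric and that it is transitive. The rules, names, and data are assumptions chosen for the example; the interview does not specify which reasoning rules the appliance applies.

```python
def infer(triples):
    """Apply two illustrative rules repeatedly until no new triples appear."""
    facts = set(triples)
    while True:
        new = set()
        for s, r, o in facts:
            if r != "family_of":
                continue
            new.add((o, r, s))  # assumed rule 1: family_of is symmetric
            for s2, r2, o2 in facts:
                # assumed rule 2: family_of is transitive (no self-links)
                if r2 == "family_of" and s2 == o and o2 != s:
                    new.add((s, r, o2))
        if new <= facts:
            return facts  # fixed point reached: nothing left to infer
        facts |= new


# Hypothetical explicit relationships rebuilt from source data.
EXPLICIT = {
    ("dr_lee", "family_of", "pat"),
    ("pat", "family_of", "sam"),
    ("sam", "member_of", "committee_x"),
}

inferred = infer(EXPLICIT)
print(sorted(inferred - EXPLICIT))
```

After inference, the graph contains a direct `family_of` link between `dr_lee` and `sam` that was never stated explicitly, so a later pattern query can find it in a single hop.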
PwC: What’s another example of the kinds of queries that are possible with a relationship analytics approach?
AP: Think about what constitutes a typical weapon—a gun, for instance. A gun basically has a control mechanism—I can pick it up and aim it. It has a targeting mechanism to facilitate the aiming. It has a guidance mechanism—I can pull the trigger, and it can project something in a particular direction. And it has a delivery mechanism—the ability to hit a target. So we know intuitively that general description of a gun also fits a rocket-propelled grenade or bazooka or missile.
But if you think about it from a different perspective, a plane actually has all those characteristics, too. It has a control mechanism—a pilot sitting in a plane. It has the ability to propel itself and to target because it can be guided. And it has detonation capability. So if you take the gun’s characteristics, relationship analytics will help you find things such as planes that could be used as weapons, things you hadn’t thought about before in that context.
If our customers know exactly what they’re looking for, they will be specific about it and get an answer back. That’s the needle in a haystack challenge. But if they know only certain attributes and want help filling out the picture, then we take a general clue like a guidance system and say, “This is what it could look like.” It’s like searching for a different kind of needle in a needle stack.
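The weapon example can be sketched in Python as a query by example over a hypothetical capability graph, where each thing is linked to the mechanisms it has. The data, the capability labels, and the `similar_to` helper are illustrative assumptions, not the appliance's actual matching logic.

```python
# Hypothetical capability graph: thing -> set of mechanisms it is linked to.
CAPABILITIES = {
    "gun": {"control", "targeting", "guidance", "delivery"},
    "missile": {"control", "targeting", "guidance", "delivery"},
    "plane": {"control", "targeting", "guidance", "delivery", "passenger_transport"},
    "bicycle": {"control"},
}


def similar_to(example, things):
    """Return other things that have every capability the example has."""
    wanted = things[example]
    return sorted(
        name for name, caps in things.items()
        if name != example and wanted <= caps  # subset test on capabilities
    )


matches = similar_to("gun", CAPABILITIES)
print(matches)
```

Starting from the gun as the example, the subset test surfaces the missile, which is expected, and the plane, which is the kind of match an analyst might not have thought to look for.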