Traversing the giant global graph

Tom Scott of BBC Earth describes how everyone benefits from interoperable data.

Interview conducted by Alan Morrison, Bo Parker, and Joe Mullich

In his role as digital editor, Tom Scott is responsible for the editorial, design, and technical development of BBC Earth—a project to bring more of the BBC’s natural history content online. In a previous role, he was part of the Future Media and Technology team in the BBC’s Audio and Music department. In this interview, Scott describes how the BBC is using Semantic Web technology and philosophy to improve the relevance of and access to content on the BBC Programmes and Music Web sites in a scalable way.

PwC: Why did you decide to use Semantic Web standards, and how did you get started?

TS: We had a number of people looking at how we could use the Web to support the large number of TV and radio programs that the BBC broadcasts. BBC Programmes had evolved with separate teams building a Web site for each individual program, including the Music Web site. That was two years ago.

If all you were interested in was a particular program, or a particular thing, that was fine, but it’s a very vertical form of navigation. If you went to a Radio One Web site or a particular program site, you had a coherent experience within that site, but you couldn’t traverse the information horizontally. You couldn’t say, “Find me all the programs that feature James May” or “Show me all the programs that have played this artist,” because it just wasn’t possible to link everything up when the focus was on publishing Web pages.

We concluded it’s not really about Web pages. It’s about real-world objects. It’s about things that people care about, things that people think about. These things that people think about when browsing the BBC sites are not just broadcasts. In some situations they might be more interested in an artist or piece of music. We wanted both. The interest lies in the joins between the different domains of knowledge. That’s where the context that surrounds a song or an artist lives. It’s much more interesting to people than just the specific song played on a specific program.

There was a meeting of minds. We naturally fell into using the Semantic Web technologies as a byproduct of trying to make what we were publishing more coherent. We looked at what technologies were available, and these seemed the best suited to the task.

So that’s when we started with the programs. One of the things Tom Coates [now at Yahoo Brickhouse] figured out was that giving each program the BBC broadcasts a fixed and permanent URL [Uniform Resource Locator, a subset of Uniform Resource Identifiers, or URIs; see pages 6 and 7] that can be pointed to reliably makes it possible to easily join stuff. So we started working on URLs and modeling that domain of knowledge, and then we thought about how our program space can relate to other domains.

Some programs played music, and that means someone could view a page and go from there to an artist page that shows all the programs that have played that artist, and that might help someone find these other programs. BBC is more about programs than music. We mainly make programs, and we don’t make much music. But we do have a role in introducing people to new music and we do that via our programs. Someone who listens to this particular program might also like listening to this other program because it plays the music that they like.
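The program-to-artist-to-program navigation Scott describes can be sketched in a few lines of Python. This is a minimal illustration, not the BBC's implementation: the identifiers and the PLAYS table are invented, and a real system would hold this data in a database rather than a dictionary.

```python
# Hypothetical play data: program identifier -> artists it has played.
# All identifiers are invented for illustration.
PLAYS = {
    "/programmes/show-a": {"/artists/artist-1", "/artists/artist-2"},
    "/programmes/show-b": {"/artists/artist-2", "/artists/artist-3"},
    "/programmes/show-c": {"/artists/artist-4"},
}


def programs_playing(artist):
    """Return every program that has played the given artist."""
    return {prog for prog, artists in PLAYS.items() if artist in artists}


def related_programs(program):
    """Return other programs that share at least one artist with `program`."""
    related = set()
    for artist in PLAYS.get(program, set()):
        related |= programs_playing(artist)
    related.discard(program)  # a program is not "related" to itself
    return related
```

Because programs and artists each have a stable identifier, the same table answers both "which programs played this artist" and "which other programs might this listener like."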

PwC: How does the system work?

TS: There are logical databases running behind the scenes that hold metadata about our programs. But we also use information about artists and recordings from an outside source called MusicBrainz, which maintains repositories of music metadata, and we take a local copy of that. This is joined with data from Wikipedia and from BBC proprietary content. All this is then rendered as data in RDF [Resource Description Framework], JSON [JavaScript Object Notation], and XML [Extensible Markup Language], and as the pages on the BBC Programmes Web site.
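As a rough illustration of one resource being rendered in several formats, the following Python sketch produces JSON, XML, and HTML views from a single record. The record, its field names, and the render function are assumptions made for this example; the interview does not describe the BBC's actual rendering code, and an RDF view is left out to keep the sketch within the standard library.

```python
import json
import xml.etree.ElementTree as ET
from html import escape

# A hypothetical program record; field names are invented for illustration.
PROGRAM = {"id": "show-a", "title": "Show A", "artists": ["artist-1", "artist-2"]}


def to_json(program):
    """Machine-friendly view as a JSON document."""
    return json.dumps(program, sort_keys=True)


def to_xml(program):
    """Machine-friendly view as a small XML document."""
    root = ET.Element("program", id=program["id"])
    ET.SubElement(root, "title").text = program["title"]
    for artist in program["artists"]:
        ET.SubElement(root, "artist").text = artist
    return ET.tostring(root, encoding="unicode")


def to_html(program):
    """Human-friendly view as an HTML fragment."""
    items = "".join(f"<li>{escape(a)}</li>" for a in program["artists"])
    return f"<h1>{escape(program['title'])}</h1><ul>{items}</ul>"


# One underlying record, several representations keyed by media type.
RENDERERS = {
    "application/json": to_json,
    "application/xml": to_xml,
    "text/html": to_html,
}


def render(program, media_type):
    """Pick the representation a client asked for."""
    return RENDERERS[media_type](program)
```

The design point is that every view is derived from the same modeled data, so the page and the machine-readable formats cannot drift apart.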

The Web site is the API [application programming interface], if you like. You can obtain a document in an HTML [HyperText Markup Language] view format. Or, if you are looking to do something with our data, you can get it in a variety of machine-friendly views, such as JSON or RDF. The machine readability allows you to traverse the graph from, say, a BBC program into its play count, and then from there into the next data cloud. So ultimately, via SPARQL [SPARQL Protocol and RDF Query Language], you could run a query that would allow you to say, “Show me all the BBC programs that play music from artists who were born in German cities with a population of more than 100,000 people.”
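The shape of that query can be sketched without a SPARQL engine. The Python below is plain pattern matching over subject-predicate-object triples, not SPARQL itself, and every identifier, predicate name, and population figure in it is invented for illustration. It shows how a question can span data held by the broadcaster and data held elsewhere once both are expressed as triples.

```python
# Triples are (subject, predicate, object). Every identifier, predicate name,
# and population figure below is invented for illustration.
BBC_TRIPLES = [
    ("bbc:show-a", "plays", "mb:artist-1"),
    ("bbc:show-b", "plays", "mb:artist-2"),
    ("bbc:show-c", "plays", "mb:artist-1"),
]
EXTERNAL_TRIPLES = [
    ("mb:artist-1", "bornIn", "ext:city-x"),
    ("mb:artist-2", "bornIn", "ext:city-y"),
    ("ext:city-x", "country", "Germany"),
    ("ext:city-x", "population", 250000),
    ("ext:city-y", "country", "Germany"),
    ("ext:city-y", "population", 40000),
]


def objects(triples, subject, predicate):
    """Return all objects for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def programs_with_artists_from_large_german_cities(triples, min_population):
    """Walk program -> artist -> birthplace and filter on the city's facts."""
    matches = set()
    for program, predicate, artist in triples:
        if predicate != "plays":
            continue
        for city in objects(triples, artist, "bornIn"):
            in_germany = "Germany" in objects(triples, city, "country")
            big_enough = any(
                n > min_population for n in objects(triples, city, "population")
            )
            if in_germany and big_enough:
                matches.add(program)
    return matches


# Joining the two sources is just list concatenation: the shared artist
# identifiers are what connect them.
graph = BBC_TRIPLES + EXTERNAL_TRIPLES
result = programs_with_artists_from_large_german_cities(graph, 100_000)
```

Here the program data knows nothing about cities, and the external data knows nothing about programs; the artist identifiers are the only point of contact.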

You probably wouldn’t do that, but the point is that the constructed query is a complex one. It’s not something that would be trivial or easy to think of. Because there is data that is held within the BBC but linked to data sourced from outside the BBC, you can traverse the graph to get back to that data set.

PwC: What does graph data [data in RDF format] do? What does this type of model do that the older data models have not done?

TS: I think the main difference is where data comes from—where it originates, not where it resides. If you have complete control and complete autonomy over the data, you can just dump the whole lot into a relational database, and that’s fine. As the size of the data management problem gets larger and larger, ordinary forms of data management become more complex and difficult, but you can choose to use them. At some point the data management problem reaches a whole different level; it reaches Web scale. So, for example, the hypothetical query that I came up with includes data that is outside of the BBC’s control—data about where an artist was born and the size of the city they were born in. That’s not information that we control, but RDF makes it possible to link to data outside the BBC. This creates a new resource and a bridge to many other resources, and someone can run a query across that graph on the Web. Graphs are about context and connections, rather than defining sets, as with relational data.

The real difference is that it is just at a higher level of abstraction. It’s Tim Berners-Lee’s Giant Global Graph, a term (though not the idea) I’m sure he must have used with his tongue shoved firmly into his cheek.

Originally, the Web freed you from worrying about the technical details of networks and the servers. It just became a matter of pointing to the Web page, the document. This semantic technology frees you from the limitations of a page-oriented architecture and provides an open, flexible, and structured way to access data that might be embedded in or related to a Web page. SQL [structured query language], in some ways, makes you worry about the table structure. If you are constructing a query, you have to have some knowledge of the database schema or you can’t construct a query. With the Semantic Web, you don’t have to worry about that. People can just follow their noses from one resource to another, and one of the things they can get back from that are other links that take them to even more resources.
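The "follow your nose" idea can be illustrated with a toy crawler. In this Python sketch a dictionary stands in for the Web, and the resource names and property names are hypothetical. The walker needs no schema up front: it discovers each new resource only through the links it finds in the previous one.

```python
from collections import deque

# A stand-in for the Web: each identifier maps to a description whose list
# values are links to other identifiers. All names here are hypothetical.
WEB = {
    "bbc:show-a": {"title": "Show A", "plays": ["mb:artist-1"]},
    "mb:artist-1": {"name": "Artist One", "bornIn": ["ext:city-x"]},
    "ext:city-x": {"label": "City X", "locatedIn": ["ext:country-z"]},
    "ext:country-z": {"label": "Country Z"},
}


def follow_links(start, max_hops):
    """Breadth-first walk that discovers resources only through their links.

    Returns a dict mapping each reached identifier to its hop distance.
    """
    seen = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        hops = seen[current]
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for value in WEB.get(current, {}).values():
            # Only list values are treated as links in this toy model.
            targets = value if isinstance(value, list) else []
            for target in targets:
                if target not in seen:
                    seen[target] = hops + 1
                    queue.append(target)
    return seen
```

Contrast this with a SQL query, where the table and column names must be known before the first line can be written.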

PwC: Are there serendipitous connections that come about simply by working at Web scale with this approach?

TS: There’s the serendipity, and there’s also the fact that you can rely on other people. You don’t have to have an über plan. The guys at DBpedia [a version of Wikipedia in RDF form] can do their thing and worry about how they are going to model their information. As long as I can point to the relevant bits in there and link to their resources, using the URIs, then I can do my thing. And we can all play together nicely, or someone else can access that data. Whereas, if we all have to collaborate on trying to work out some über schema to describe all the world’s information, well, we are all going to be extinct by the time we manage to do that.

PwC: So, there’s a basic commonality that exists between, say, DBpedia and MusicBrainz and the BBC, in the way these sources are constructed?

TS: The relationship between the BBC content, the DBpedia content, and MusicBrainz is no more than URIs. We just have links between these things, and we have an ontology that describes how this stuff maps together.
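A small Python sketch can make the "no more than URIs" point concrete. Two data sets are modeled independently, with their own property names, and the only thing they share is a table of links saying which identifiers refer to the same thing. All identifiers, property names, and the SAME_AS table are assumptions for this example, not the BBC's actual ontology.

```python
# Two data sets modeled independently; identifiers and property names are
# invented for illustration.
PROGRAM_DATA = {
    "bbc:artist-9": {"featured_in": ["bbc:show-a", "bbc:show-b"]},
}
ENCYCLOPEDIA_DATA = {
    "ext:Artist_Nine": {"birth_place": "ext:city-x", "genre": "jazz"},
}

# The only shared agreement: links stating two identifiers name the same thing.
SAME_AS = {"bbc:artist-9": "ext:Artist_Nine"}


def describe(identifier):
    """Merge what each source says about one real-world thing."""
    merged = dict(PROGRAM_DATA.get(identifier, {}))
    linked = SAME_AS.get(identifier)
    if linked is not None:
        merged.update(ENCYCLOPEDIA_DATA.get(linked, {}))
    return merged
```

Neither side had to adopt the other's schema; each can change its internal model freely as long as the identifiers stay stable.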

PwC: Is there a layer of semantics associated with presentation that is connected to the data itself? How did you think about that and manage the presentation rather than the structure of the data?

TS: We wanted the presentation to be good, and from there we fell into the Semantic Web. I would argue that if you structure your information in the same simple fashion that Linked Data requires, then that creates the user experience. Linked Data is about providing resources for real-world things and having documents that make assertions about those things. The first step in building a usable service is to design it around the things that matter to people, the things that people care about. This, it turns out, is the same first step when following Linked Data principles.

I don’t mean that you would expose it raw this way to an audience, but first you need to structure your information the same way and create the same links between your different entities, your different resources. Once you’ve done that, then you can expose that in HTML.

The alternative is to build individual Web pages where the intelligence about the structure of the data is in the HTML. You could do that to a point, but quite quickly it becomes just too complicated to create sanity across a very large data set.

If you think about music, there are things that make sense in music. Artists make recordings, and these are released on different media. If you pour your data into that implicit ontology, into that structure, and then expose it as HTML, it just makes sense to people. They can browse our program information and can join it to another one of the domains around other programs.

PwC: Many companies have terabytes or petabytes of data that they don’t really know much about. They have to get their arms around it somehow. Is Linked Data an approach they should consider, beyond what we’ve already talked about?

TS: There is certainly mileage in it, because when you start getting either very large volumes or very heterogeneous data sets, then for all intents and purposes, it is impossible for any one person to try to structure that information. It just becomes too big a problem. For one, you don’t have the domain knowledge to do that job. It’s intellectually too difficult. But you can say to each domain expert, model your domain of knowledge (the ontology) and publish the model in a way that both users and machines can interface with.

Once you do that, then you need a way to manage the shared vocabulary by which you describe things, so that when I say “chair,” you know what I mean. When you do that, then you have a way in which enterprises can join this information, without any one person being responsible for the entire model.

After this is in place, anyone else can come across that information and follow the graph to extract the data they’re interested in. And that seems to me to be a sane, sensible, central way of handling it.

PwC: If we think about broad adoption of Semantic Web standards, it sounds like a lot depends on trillions of URIs being created. Some people say that we’ll never get there, that it’s too complicated a problem.

TS: The people who say we’ll never get there are envisaging a world that is homogeneous. It’s a bit like saying car ownership will never get there, because not everyone has a car. The reality is that the future is uneven, and some people will get there sooner than others. Here at the BBC, our work is around programs and music. I’m biased, but I really think that the approach has created a sane, coherent, and stable user experience for people, which is good for our audience. To provide that, we have represented our data in a way that people can now build stuff on top of. Time will tell whether people will do so.

PwC: Do you think an increased focus on data semantics is going to result in a new role within organizations, where job descriptions will include the word “ontology”? Are you being seen as an ontologist within the BBC because you are bringing that specific capability?

TS: It’s more about what I used to get the job done as opposed to my job title. My job is product management, and the best way to manage and develop products in an information-rich space is to do so through domain modeling. You’ll find that most of the people doing this are more interested in the outcomes than the artifacts that you produce along the way. An ontology is a useful artifact.