Same Old Same Old? « eFragments

Same Old Same Old?

January 19th, 2011 by Adrian Stevenson

Cultural Heritage and the Semantic Web: British Museum and UCL Study Day, British Museum, London, January 13th 2011

“The ability of the semantic web to cheaply but effectively integrate data and breakdown data silos provides museums with a long awaited opportunity to present a richer, more informative and interesting picture. For scholars, it promises the ability to uncover relationships and knowledge that would otherwise be difficult, if not impossible, to discover otherwise.”

Such was the promise of the ‘Cultural Heritage and the Semantic Web Study Day’ held in the hallowed halls of the Museum last week. Dame Wendy Hall from the University of Southampton opened the case for the defence, citing some anecdotes and giving us an outline of ‘Microcosm’, a cataloguing system she helped develop that was used for the Mountbatten archive back in 1987. Microcosm employed the semantic concept of entity-based reasoning using object-concept-context triples. Hall talked through some of the lessons learnt from her involvement in the early days of the Web. She believes that big is beautiful, i.e. the network is everything, and that scruffy works, meaning that link fail is ok, as it’s part of what makes the Web scale. Open and free standards are also very important. Hall mentioned a number of times that people said the Web wouldn’t scale, whether it was the network itself, or the ability to search it, but time had proved them wrong. Although Hall didn’t make the point explicitly, the implication was that the same thing would prove to be the case for the semantic web. As to why it didn’t take off ten years ago, she believes it’s because the Artificial Intelligence community became very interested, and took it down “an AI rat hole” that it’s only now managing to re-emerge from. She does acknowledge that publishing RDF is harder than publishing web pages, but believes that it is doable, and that we are now past the tipping point, partly due to the helping push from data.gov.uk.

Dame Wendy Hall speaking at the British Museum on January 13th 2011

Ken Hamma spoke about ‘The Wrong Containers’. I found the gist of the talk somewhat elusive, but I think the general message was the now fairly familiar argument that we, in this case the cultural heritage sector (read ‘museums’ here), are too wedded to our long cherished notions of the book and the catalogue, and we’ve applied these concepts inappropriately as metaphors to the Web environment. He extended this reasoning to the way in which museums have similarly attempted to apply their practices and policies to the Web, having historically acted as gatekeepers and mediators. Getting institutions to be open and free with their data is a challenge, many asking why they should share. Hamma believes the museum needs to break free from the constraints of the catalogue, and needs to rethink its containers.

John Sheridan from The National Archives framed his talk around the Coalition Agreement, which provides the guiding principles for the publication of public sector information, or as he put it, the “ten commandments for the civil service”. The Agreement mandates what is in fact a very liberal licensing regime with a commitment to publishing in open standards, and the National Archives have taken this opportunity to publish data in Linked Data form and make it available via the data.gov.uk website. John acknowledged that not all data consumers will want data in RDF form from a SPARQL endpoint, so they’ve also developed Linked Data APIs with the facility to deliver data in other formats. The software code for this is available as open source. John also mentioned that the National Archives have generated a vocabulary for government organisational structure called the ‘Central Government Ontology’, and they’ve also been using a datacube to aid the creation of lightweight vocabularies for specific purposes. John believes that it is easier to publish Linked Data now than it was just a year ago, and ‘light years easier’ than five years ago.

Data provenance is currently an important area for the National Archives, and they now have some ‘patterns’ for providing provenance information. He also mentioned that they’ve found the data cleansing tools available from Google Refine to be very useful. It has extensions for reconciling data that they’ve used against the government data sets, as well as extensions for creating URIs and mapping data to RDF. This all sounded very interesting, with John indicating that they are now managing to enable non-technical people to publish RDF simply by clicking, and without having to go anywhere near code.

John certainly painted a rosy picture of how easy it is to do things, one I have to say I don’t find resonates that closely with my own experience on the Locah project, where we’re publishing Linked Data for the Archives Hub and Copac services. I had a list of questions for John that I didn’t get to ask on the day. I’ll be sure to point John to these:

What are the processes for publication of Linked Data, and how are these embedded to enable non-technical people to publish RDF?
Are these processes documented and available openly, such as in step-by-step guides?
Do you have generic tools available for publishing Linked Data that could be used by others?
How did you deal with modelling existing data into RDF? Are there tools to help do this?
Does the RDF data published have links to other data sets, i.e. is it Linked Data in this sense?
Would you consider running or being involved in hands-on Linked Data publishing workshops?

Hugh Glaser from Seme4 outlined a common problem existing at the British Museum and many other places: that many separate research silos exist within organisations. The conservation data will be in one place, the acquisition data in another place, and the cataloguing data in yet another. Fusing this data together for the museum website by traditional means is very expensive, but the use of Linked Data based on the CIDOC Conceptual Reference Model ontology for the catalogue, and the <sameAs> service to tie things together, can make things more cost effective. He then gave a quick demo of RKBExplorer, a service that displays digests of semantic relationships. Despite Hugh’s engaging manner, I’m not sure the demonstrations would have been enough to persuade people of the benefits of Linked Data to the cultural heritage sector.

In the short panel session that followed, John Sheridan noted that the National Archives are using named graphs to provide machine-readable provenance trails for legislation.data.gov.uk, employing the Open Provenance Model Vocabulary in combination with Google Refine processing. Hugh made the interesting point that he thinks we can get too hung up on the modelling of data, and the publication of RDF. As a result, the data published ends up being too complex and not fit for purpose. For example, when we’re including provenance data, we might want to ask why we are doing this. Is it for the user, or really just for ourselves, serving no real purpose? Big heavyweight models can be problematic in this respect.
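As a rough illustration of the named graph idea mentioned in the panel, the sketch below stores each statement as a quad (a triple plus the name of the graph it belongs to) and keeps provenance as separate statements about the graph names. It is plain Python rather than a real triple store, and every identifier in it (the ex: names, the derivedFrom and processedWith keys) is made up for the example; it is not the National Archives' actual model or vocabulary.

```python
# Each statement is a quad: (subject, predicate, object, graph name).
# All identifiers below are hypothetical and only illustrate the pattern.
quads = [
    ("ex:act-1", "ex:title", "Example Act", "ex:graph-a"),
    ("ex:act-1", "ex:year", "2010", "ex:graph-b"),
]

# Provenance is recorded as ordinary data about the graph names themselves.
graph_provenance = {
    "ex:graph-a": {"derivedFrom": "ex:source-file-1", "processedWith": "ex:cleaning-step"},
    "ex:graph-b": {"derivedFrom": "ex:source-file-2", "processedWith": "ex:cleaning-step"},
}


def provenance_of(subject, predicate, quads, graph_provenance):
    """Return the provenance record of every graph asserting (subject, predicate, *)."""
    return [
        graph_provenance[graph]
        for s, p, _obj, graph in quads
        if s == subject and p == predicate and graph in graph_provenance
    ]


trail = provenance_of("ex:act-1", "ex:year", quads, graph_provenance)
print(trail)
```

The point of the pattern is that a consumer who doubts a particular statement can follow it back to the graph it came from, and from there to whatever was recorded about that graph's origin.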

The problem of having contradictory assertions about the same thing also came up. In Linked Data, all voices can be equal, so attribution may be important. However, even with the data the British Museum creates, there will be some contradictory assertions. John Sheridan pointed out that data.gov.uk has aided the correction of data. The publication of data about bus stops revealed that 20,000 specified locations weren’t in the right place, these then being corrected by members of the public. Hugh reminded us that the domain of a Web URI, such as http://www.britishmuseum.org/ does itself provide a degree of attribution and trust.

Atanas Kiryakov from Ontotext was the first speaker to sound a warning note or two, with what I thought was an admirably honest talk. Ontotext provide a service called FactForge, and to explain this, Atanas talked a little about how we can make inferences using RDF, for example, the statement ‘is a parent’ implies the inverse statement ‘is a child’. He noted that the BBC were probably the first to use concept extraction, RDF and triple stores on a large scale site, the solution having been chosen over a traditional database solution, with the semantic web delivering a cheaper product.
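The parent/child inference mentioned above can be sketched in a few lines. This is a minimal illustration in plain Python, with triples held as tuples and made-up property names (isParentOf, isChildOf); it is not how FactForge or any particular reasoner is implemented, only the general idea of deriving a statement from a declared inverse relationship.

```python
# Triples are plain (subject, predicate, object) tuples.
# The property names are hypothetical and exist only for this example.
INVERSE_OF = {"isParentOf": "isChildOf", "isChildOf": "isParentOf"}


def infer_inverses(triples):
    """Return the input triples plus every triple implied by an inverse property."""
    inferred = set(triples)
    for subject, predicate, obj in triples:
        inverse = INVERSE_OF.get(predicate)
        if inverse is not None:
            # If A isParentOf B, then B isChildOf A (and vice versa).
            inferred.add((obj, inverse, subject))
    return inferred


facts = {("alice", "isParentOf", "bob")}
closure = infer_inverses(facts)
print(sorted(closure))
```

Only one statement was asserted, but a query for the children of a given parent, or the parents of a given child, can now be answered from the same data.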

So why is the semantic web still not used more? Atanas believes it’s because there are still no well-established Linked Data ‘buys’ to convince business enterprise. Linked Data he suggests is like teenage sex – many talk about it, but not many do it. Nevertheless, he does believe that Linked Data facilitates better data integration, and adds value to proprietary data through better description whilst being able to make data more open. However, Linked Data is hard for people to comprehend, and its sheer diversity comes at a price. Getting specific information out of DBpedia is tough. The Linked Data Web is also unreliable, exhibiting high down times. One point that really struck me was how slow he says the distributed data web is, with a SPARQL query over just two or three servers being unacceptably slow.

Overcoming these limitations of the Linked Data Web forms the basis of Ontotext’s ‘reason-able’ approach, which is to group selected datasets (DBpedia, Freebase, Geonames, UMBEL, Wordnet, CIA World Factbook, Lingvoj, MusicBrainz) and ontologies (Dublin Core, SKOS, RSS, FOAF) into a compound set which is then cleaned up and post-processed. It does strike me that this re-centralising, dare I say it, portal approach seems to defeat much of the point of the Linked Data Web, with inherent issues arising from a bounded data set and out-of-sync data, albeit I realise it aims to provide an optimised and pragmatic solution. Atanas suggests that many real time queries would be impossible without services like FactForge.

Atanas then explained how the wide diversity of the Linked Data Web often leads to surprising and erratic results, the example given being that the most popular entertainer in Germany according to a SPARQL query is the philosopher Nietzsche, as demonstrated using the FactForge query interface. This arises from what Atanas calls the ‘honey and sting’ of owl:sameAs, the semantic web concept that allows for assertions to be made that two given names or identifiers refer to the same individual or entity. This can generate a great multiplicity of statements, and give rise to many different versions of the same result.
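To see why sameAs assertions multiply statements, here is a small sketch in plain Python. It groups identifiers into equivalence classes (sameAs being symmetric and transitive) and then restates each triple for every combination of equivalent subject and object. The identifiers and the associatedWith property are invented for the example, and a real store would handle this far more cleverly than by brute-force expansion.

```python
from itertools import product


def same_as_classes(same_as_pairs):
    """Group identifiers into equivalence classes using a simple union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        parent[find(a)] = find(b)

    classes = {}
    for ident in list(parent):
        classes.setdefault(find(ident), set()).add(ident)
    return classes


def expand(triples, same_as_pairs):
    """Restate each triple for every identifier equivalent to its subject or object."""
    members = {
        ident: group
        for group in same_as_classes(same_as_pairs).values()
        for ident in group
    }
    expanded = set()
    for s, p, o in triples:
        for s2, o2 in product(members.get(s, {s}), members.get(o, {o})):
            expanded.add((s2, p, o2))
    return expanded


# Hypothetical identifiers from three imaginary datasets a, b and c.
same_as = [
    ("a:Nietzsche", "b:Nietzsche"),
    ("b:Nietzsche", "c:Nietzsche"),
    ("a:Germany", "b:Germany"),
]
expanded = expand([("a:Nietzsche", "associatedWith", "a:Germany")], same_as)
print(len(expanded))
```

One asserted triple becomes six here (three names for the subject times two for the object), which is the ‘sting’: a single wrong or overly loose sameAs link spreads every statement about one identifier onto all the others.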

Dominic Oldman from the British Museum’s Information Services Development section concluded the day talking about the ResearchSpace project based at the British Museum. The project aims to create a research collaboration and digital publication environment, the idea being that the data becomes a part of the space, along with the tools and collaboration. It consists of things like blogging tools, forums, and wikis in an environment alongside the data, which is imported in RDF using the CIDOC Conceptual Reference Model. An example was shown of a comparison of a drawing and painting of the same thing by an artist, and the ability to bring these together.

Tags: Archives, British Museum, jiscexpo, Linked Data, locah, Museums, Open Data, RDF, Semantic Web, SPARQL

This entry was posted on Wednesday, January 19th, 2011 at 11:10 am and is filed under Linked Data, Semantic Web.

4 Responses to “Same Old Same Old?”

Pete Johnston says:

January 20, 2011 at 10:38 am

Re the rosiness or otherwise of the Locah experience :)

I think we should be a bit careful about generalising, or at least we need to bear in mind some of the particular characteristics of the source data with which Locah is dealing when doing so.

Working with EAD is challenging because at heart it’s not really a “data-centric” XML format, for the sorts of reasons Mark Matienzo has talked about e.g. in http://www.slideshare.net/anarchivist/archives-the-semantic-web – and it is by design a fairly complex XML format that allows for a lot of structural variation.

And there’s further complexity arising from the nature of the Archives Hub collection, where data creation is decentralised, and has been carried out by multiple independent parties, using different tools and approaches, over an extended period of time, and as a consequence there’s a good deal of variation in that aggregated data.

I’d argue that that is a challenge in any processing across that dataset, even staying within the world of XML/XPath etc.

So given this background, I still – despite my occasionally tearing my hair out on Twitter on finding some new pattern in the data that I hadn’t anticipated! – think the process has been relatively straightforward, really.

From my perspective, most of the effort has been in managing that complexity, rather than because of any difficulty associated with the process of generating linked data in particular. And I think the fact that we did process a “controlled” subset of the data – one where we knew the variation was limited – quite quickly reflects that.

Adrian Stevenson says:

January 20, 2011 at 10:59 am

Hi Pete

Re. “EAD is challenging because at heart it’s not really a “data-centric” XML format”, yeah, that is of course a good point, and one perhaps I’ve tended not to fully appreciate myself, even though I’m on the project :). I’ll have a look at those slides you mention.

I was aware when writing that section of the post, that it might sound as if I’m saying we’ve found it really hard with the Locah experience, which isn’t what I meant, so I guess I should have tried to rephrase it. The point was more that John did kind of give the impression that it’s very easy (unless I was misunderstanding him), which I think is in danger of giving the wrong impression to people. Maybe within the government sector they are finding it that easy, in which case it would be great to learn more from them, hence my questions.

Tyler Bell says:

January 20, 2011 at 4:12 pm

Great commentary Adrian. Same-old indeed: there was a similar event next door at UCL back around 2001, where we were talking about XML. Many of the informatic concerns and arguments persist unchanged; the ‘teenage sex’ simile was also employed. I suspect that it was met with as many stern faces this time around.

Adrian Stevenson says:

January 20, 2011 at 4:39 pm

Hi Tyler. Cheers for the thumbs up. Interesting anecdote there, that the teenage sex thing came up before at the same venue. Atanas added that Linked Data is also similar in that one’s first experience of it isn’t very satisfying, but it gets better the more you do it :) I think the comment went down maybe ok actually, though I don’t recall if I was looking round the room at the time.