eFragments – Adrian Stevenson's UKOLN blog

There’s No Business Like No Business Case

Adrian Stevenson, 18 March 2011

‘What is the Business Administrative Case for Linked Data?’ parallel session, JISC Conference 2011, BT Convention Centre, Liverpool, UK, 15th March 2011

BT Convention Centre, JISC Conference 2011

One of the parallel sessions at this year’s JISC Conference in Liverpool promised to address the “business value to the institution” of linked data, being aimed at “anyone who wants a clear explanation in terms of how it applies to your institutional business strategy for saving money…”. I was one of a number of people invited to be on the panel by the session host, David Flanders, but such was the enthusiasm that I was beaten to it by the other panellists, despite replying within a few hours.

The session kicked off with a five-minute soapbox from each of the panellists before opening up to a wider discussion. First up was Hugh Glaser from Seme4. He suggested that universities have known for a long time that they need to improve data integration and fusion, but have found this a difficult problem to solve. You can get consultants in to do this, but it’s expensive, and the IT solutions often end up driving the business process instead of the other way round. Every single modification has to be paid for, such that your business process often ends up being frozen. However, Linked Data offers the possibility of solving these problems at low risk and low cost, being more of an evolutionary than a revolutionary solution. The success of data.gov.uk was cited, it having taken only seven months to release a whole series of datasets. Hugh emphasised that not only has the linked data approach been implemented quickly and cheaply here, but it also hasn’t directly impinged upon or skewed the business process.

He also talked about his work with the British Museum, where the problem has been that data is held separately in different parts of the organisation, resulting in seven different databases. These have been converted into linked data form and published openly, now allowing the datasets to be integrated. Hugh mentioned that another bonus of this approach is that you don’t necessarily have to write all your applications yourself: the finance section of data.gov.uk lists five applications contributed by developers not involved with the government.

Linked Data Panel Session at JISC Conference 2011 (L-R: David Flanders, Hugh Glaser, Wilbert Kraan, Bijan Parsia, Graham Klyne)

Wilbert Kraan from CETIS described an example where linked data makes it possible to do things existing technologies simply can’t. The example was based on PROD, a database of JISC projects provided as linked data. Wilbert explained that they are now able to ask questions of the dataset that weren’t possible before: they can put information on where projects have taken place on a map, also detailing the type of institution and its rate of uptake. The neat trick is that CETIS don’t have to collect any data themselves, as many other people are gathering data and making it available openly. Because the PROD data is linked data, it can be linked to this other data. Wilbert suggested that it’s hard to say whether money is saved, because in many cases this sort of information wouldn’t be available at all without the application of linked data principles.
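To make Wilbert’s point a little more concrete, here’s a rough sketch of the kind of location question you can ask once project data is exposed as linked data behind a SPARQL endpoint. The endpoint URL and the property names are invented for illustration; they are not the actual PROD vocabulary.

```python
# A minimal sketch of asking a location question of a linked data set via SPARQL.
# The endpoint URL and properties below are illustrative assumptions, not PROD's.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://example.org/prod/sparql")  # hypothetical endpoint
sparql.setQuery("""
    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

    SELECT ?project ?title ?lat ?long WHERE {
        ?project dcterms:title ?title ;
                 dcterms:spatial ?place .
        ?place geo:lat ?lat ;
               geo:long ?long .
    }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)

# Each binding could then be dropped straight onto a map.
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["title"]["value"], row["lat"]["value"], row["long"]["value"])
```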

Bijan Parsia, Lecturer in Computer Science at the University of Manchester, talked about the notion of data disintermediation, which is the idea that linked data cuts out intermediaries in the data handling process, thereby eliminating points of friction. Applications such as visualisations can be built upon linked data without having to clear the technical and administrative hurdles that surround a proprietary dataset. Many opportunities then exist to build added value over time.

The business case favoured by Graham Klyne was captured by the idea that linked data enables the “uncoordinated reuse of information”, as espoused by Clark & Parsia, an example being the simplicity with which it’s possible to overlay faceted browse functionality on a dataset without needing to ask permission. Graham addressed the question of why there are still few compelling linked data apps. He believes this comes down to the disconnect between who pays and who benefits. It is all too often not the publishers themselves who benefit, so we need to do everything possible to remove the barriers to data publishing. One solution may be to find ways to give credit for dataset publication, in the same way we do for publishing papers in the academic sector.

David then asked the panel for some one-liners on the current barriers and pain points. For Wilbert, it’s simply down to lack of knowledge and understanding in the University enterprise sector, where the data publisher is also the beneficiary. Hugh felt it’s about the difficulty of extracting the essence of the benefit of linked data.  Bijan suggested that linked data infrastructure is still relatively immature, and Graham felt that the realisation of benefits is too separate from the costs of publication, although he acknowledged that it is getting better and cheaper to do.

We then moved on to questions and discussion. The issue of data quality was raised. Hugh suggested that linked data doesn’t solve the problem of quality, but it can help expose quality issues and thereby their correction. He pointed out that there may be trust, and therefore quality, associated with a domain name, such as http://www.bl.uk/ for data from the British Library. Bijan noted that data quality is really no more of an issue than it is for the wider Web, but that it would help to have mechanisms for reporting issues back to the publisher. Hugh believes linked data can in principle help with sustainability, in that people can fairly straightforwardly pick up and re-host linked data. Wilbert noted that one advantage of linked data is that you can do things iteratively and build things up over time without having to make a significant upfront commitment. Hugh also reminded us of the value of linked data on the intranet. Much of the British Library data is closed off, but has considerable value internally. Linked data doesn’t have to be open to be useful.

The session was very energetic, being somewhat frantic and freewheeling at times. I was a little frustrated that some interesting discussion points didn’t have the opportunity to develop, but overall the session managed to cover a lot of ground for a one-hour slot. Were any IT managers convinced enough to look at linked data further? For that I think we’ll have to wait and see. For now, as Ethel Merman would say, “let’s go on with the show”.

Same Old Same Old?

Adrian Stevenson, 19 January 2011

Cultural Heritage and the Semantic Web: British Museum and UCL Study Day, British Museum, London, 13th January 2011

“The ability of the semantic web to cheaply but effectively integrate data and break down data silos provides museums with a long awaited opportunity to present a richer, more informative and interesting picture. For scholars, it promises the ability to uncover relationships and knowledge that would otherwise be difficult, if not impossible, to discover.”

Such was the promise of the ‘Cultural Heritage and the Semantic Web Study Day’ held in the hallowed halls of the Museum last week. Dame Wendy Hall from the University of Southampton opened the case for the defence, citing some anecdotes and giving us an outline of ‘Microcosm’, a cataloguing system she helped develop that was used for the Mountbatten archive back in 1987. Microcosm employed the semantic concept of entity-based reasoning using object-concept-context triples. Hall talked through some of the lessons learnt from her involvement in the early days of the Web. She believes that big is beautiful, i.e. the network is everything, and that scruffy works, meaning that link failure is ok, as it’s part of what makes the Web scale. Open and free standards are also very important. Hall mentioned a number of times that people said the Web wouldn’t scale, whether it was the network itself or the ability to search it, but time has proved them wrong. Although Hall didn’t make the point explicitly, the implication was that the same thing would prove to be the case for the semantic web. As to why it didn’t take off ten years ago, she believes it’s because the Artificial Intelligence community became very interested, and took it down “an AI rat hole” that it’s only now managing to re-emerge from. She does acknowledge that publishing RDF is harder than publishing web pages, but believes that it is doable, and that we are now past the tipping point, partly due to the helping push from data.gov.uk.

Dame Wendy Hall speaking at the British Museum on January 13th 2011

Ken Hamma spoke about ‘The Wrong Containers’. I found the gist of the talk somewhat elusive, but I think the general message was the now fairly familiar argument that we, in this case the cultural heritage sector (read ‘museums’ here), are too wedded to our long cherished notions of the book and the catalogue, and we’ve applied these concepts inappropriately as metaphors to the Web environment. He extended this reasoning to the way in which museums have similarly attempted to apply their practices and policies to the Web, having historically acted as gatekeepers and mediators. Getting institutions to be open and free with their data is a challenge, many asking why they should share. Hamma believes the museum needs to break free from the constraints of the catalogue, and needs to rethink its containers.

John Sheridan from The National Archives framed his talk around the Coalition Agreement, which provides the guiding principles for the publication of public sector information, or as he put it, the “ten commandments for the civil service”. The Agreement mandates what is in fact a very liberal licensing regime with a commitment to publishing in open standards, and the National Archives have taken this opportunity to publish data in Linked Data form and make it available via the data.gov.uk website. John acknowledged that not all data consumers will want data in RDF form from a SPARQL endpoint, so they’ve also developed Linked Data APIs with the facility to deliver data in other formats, the software code for this being available as open source. John also mentioned that the National Archives have generated a vocabulary for government organisational structure called the ‘Central Government Ontology’, and they’ve also been using a data cube approach to aid the creation of lightweight vocabularies for specific purposes. John believes that it is easier to publish Linked Data now than it was just a year ago, and ‘light years easier’ than five years ago.
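The point about serving non-RDF consumers is essentially HTTP content negotiation: the same resource URI can be requested in different formats. Here’s a minimal sketch of what that looks like from the consumer’s side; the resource URI below is invented, not a documented National Archives endpoint.

```python
# A minimal sketch of consuming a Linked Data API via content negotiation.
# The resource URI is a made-up example, not a real government endpoint.
import requests

resource = "http://example.gov.uk/id/department/123"  # hypothetical resource URI

# Ask for the RDF/XML representation...
rdf_resp = requests.get(resource, headers={"Accept": "application/rdf+xml"})

# ...or ask for JSON, as a developer with no RDF tooling might.
json_resp = requests.get(resource, headers={"Accept": "application/json"})

print(rdf_resp.headers.get("Content-Type"))
print(json_resp.headers.get("Content-Type"))
```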

Data provenance is currently an important area for the National Archives, and they now have some ‘patterns’ for providing provenance information. He also mentioned that they’ve found the data cleansing tools in Google Refine to be very useful. It has extensions for reconciling data, which they’ve used against the government data sets, as well as extensions for creating URIs and mapping data to RDF. This all sounded very interesting, with John indicating that they are now managing to enable non-technical people to publish RDF simply by clicking, without having to go anywhere near code.

John certainly painted a rosy picture of how easy it is to do things, one I have to say I don’t find resonates that closely with my own experience on the Locah project, where we’re publishing Linked Data for the Archives Hub and Copac services. I had a list of questions for John that I didn’t get to ask on the day. I’ll be sure to point John to these:

  • What are the processes for publication of Linked Data, and how are these embedded to enable non-technical people to publish RDF?
  • Are these processes documented and available openly, such as in step-by-step guides?
  • Do you have generic tools available for publishing Linked Data that could be used by others?
  • How did you deal with modelling existing data into RDF? Are there tools to help do this?
  • Does the RDF data published have links to other data sets, i.e. is it Linked Data in this sense?
  • Would you consider running, or being involved in, hands-on Linked Data publishing workshops?

Hugh Glaser from Seme4 outlined a common problem at the British Museum and many other places: many separate research silos exist within organisations. The conservation data will be in one place, the acquisition data in another, and the cataloguing data in yet another. Fusing this data together for the museum website by traditional means is very expensive, but the use of Linked Data based on the CIDOC Conceptual Reference Model ontology for the catalogue, and the sameAs service to tie things together, can make things more cost effective. He then gave a quick demo of RKBExplorer, a service that displays digests of semantic relationships. Despite Hugh’s engaging manner, I’m not sure the demonstrations would have been enough to persuade people of the benefits of Linked Data to the cultural heritage sector.

In the short panel session that followed, John Sheridan noted that the National Archives are using named graphs to provide machine-readable provenance trails for legislation.data.gov.uk, employing the Open Provenance Model Vocabulary in combination with Google Refine processing. Hugh made the interesting point that he thinks we can get too hung up on the modelling of data and the publication of RDF, with the result that the data published ends up being too complex and not fit for purpose. For example, when we’re including provenance data, we might want to ask why we are doing this. Is it for the user, or really just for ourselves, serving no real purpose? Big, heavyweight models can be problematic in this respect.
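For readers unfamiliar with the named graph approach John mentioned, here’s a rough sketch of the idea: the data sits in one graph, and statements about how that graph was produced sit in another, here using the Open Provenance Model Vocabulary. The URIs and modelling are my own invented illustration, not the legislation.data.gov.uk data.

```python
# A minimal sketch of attaching provenance to a named graph with rdflib.
# The URIs and modelling choices are invented for illustration; only the
# OPMV namespace itself is real.
from rdflib import Dataset, Namespace, URIRef
from rdflib.namespace import RDF

OPMV = Namespace("http://purl.org/net/opmv/ns#")
EX = Namespace("http://example.org/")

ds = Dataset()

# The data itself lives in one named graph...
data_graph_uri = URIRef("http://example.org/graph/legislation")
data_graph = ds.graph(data_graph_uri)
data_graph.add((EX.act1, RDF.type, EX.Act))

# ...while statements about how that graph came to be live in another.
prov_graph = ds.graph(URIRef("http://example.org/graph/provenance"))
prov_graph.add((data_graph_uri, RDF.type, OPMV.Artifact))
prov_graph.add((data_graph_uri, OPMV.wasGeneratedBy, EX.conversionRun1))
prov_graph.add((EX.conversionRun1, RDF.type, OPMV.Process))

print(ds.serialize(format="nquads"))
```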

The problem of having contradictory assertions about the same thing also came up. In Linked Data all voices can be equal, so attribution may be important. However, even within the data the British Museum creates, there will be some contradictory assertions. John Sheridan pointed out that data.gov.uk has aided the correction of data: the publication of data about bus stops revealed that 20,000 specified locations weren’t in the right place, and these were then corrected by members of the public. Hugh reminded us that the domain of a Web URI, such as http://www.britishmuseum.org/, does itself provide a degree of attribution and trust.

Atanas Kiryakov from Ontotext was the first speaker to sound a warning note or two, with what I thought was an admirably honest talk. Ontotext provide a service called FactForge, and to explain this, Atanas talked a little about how we can make inferences using RDF: for example, the statement that A is the parent of B implies the inverse statement that B is the child of A. He noted that the BBC were probably the first to use concept extraction, RDF and triple stores on a large-scale site, the solution having been chosen over a traditional database approach, with the semantic web delivering a cheaper product.
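To illustrate that parent/child point with a toy example, the sketch below declares two properties as inverses and then materialises the derived statements with a naive forward-chaining pass. The vocabulary is made up; a real system would use a proper reasoner rather than this hand-rolled loop.

```python
# A toy illustration of inverse-property inference: asserting one direction
# of a relationship lets the other direction be derived. All URIs are invented.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL

EX = Namespace("http://example.org/")
g = Graph()

# Schema: hasParent is declared the inverse of hasChild.
g.add((EX.hasParent, OWL.inverseOf, EX.hasChild))

# Data: only the 'parent' direction is asserted.
g.add((EX.alice, EX.hasParent, EX.bob))

# Naive forward-chaining: materialise the inverse statements in both directions.
for prop, inverse in list(g.subject_objects(OWL.inverseOf)):
    for s, o in list(g.subject_objects(prop)):
        g.add((o, inverse, s))
    for s, o in list(g.subject_objects(inverse)):
        g.add((o, prop, s))

print((EX.bob, EX.hasChild, EX.alice) in g)  # True: derived, not asserted
```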

So why is the semantic web still not used more? Atanas believes it’s because there are still no well-established Linked Data ‘buys’ to convince business enterprise. Linked Data, he suggests, is like teenage sex: many talk about it, but not many do it. Nevertheless, he does believe that Linked Data facilitates better data integration and adds value to proprietary data through better description, whilst being able to make data more open. However, Linked Data is hard for people to comprehend, and its sheer diversity comes at a price. Getting specific information out of DBpedia is tough. The Linked Data Web is also unreliable, exhibiting high downtimes. One point that really struck me was how slow he says the distributed data web is, with a SPARQL query over just two or three servers being unacceptably slow.

Overcoming these limitations of the Linked Data Web forms the basis of Ontotext’s ‘reason-able’ approach, which is to group selected datasets (DBpedia, Freebase, Geonames, UMBEL, WordNet, CIA World Factbook, Lingvoj, MusicBrainz) and ontologies (Dublin Core, SKOS, RSS, FOAF) into a compound set which is then cleaned up and post-processed. It does strike me that this re-centralising, dare I say it, portal approach seems to defeat much of the point of the Linked Data Web, with inherent issues arising from a data set that is no longer unbounded and from data going out of sync, although I realise it aims to provide an optimised and pragmatic solution. Atanas suggests that many real-time queries would be impossible without services like FactForge.

Atanas then explained how the wide diversity of the Linked Data Web often leads to surprising and erratic results, the example given being that the most popular entertainer in Germany, according to a SPARQL query demonstrated using the FactForge query interface, is the philosopher Nietzsche. This arises from what Atanas calls the ‘honey and sting’ of owl:sameAs, the semantic web construct that allows assertions to be made that two given names or identifiers refer to the same individual or entity. This can generate a great multiplicity of statements, and give rise to many different versions of the same result.
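Here’s a toy sketch of that ‘honey and sting’: once two identifiers are declared equal, every statement made against either of them ends up attached to both, so duplicated and even contradictory claims multiply. The URIs and the ‘entertainer’ claim below are invented purely to mirror the anecdote.

```python
# A toy illustration of how owl:sameAs multiplies statements. All URIs and
# claims below are invented for illustration only.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL

EX = Namespace("http://example.org/")
OTHER = Namespace("http://other.example.org/resource/")

g = Graph()
g.add((EX.nietzsche, OWL.sameAs, OTHER.Friedrich_Nietzsche))
g.add((EX.nietzsche, EX.occupation, EX.Philosopher))
g.add((OTHER.Friedrich_Nietzsche, EX.occupation, EX.Entertainer))  # a dubious third-party claim

# Naive 'smushing': copy every statement onto both sides of each sameAs link.
for a, b in list(g.subject_objects(OWL.sameAs)):
    for s, p, o in list(g):
        if s == a:
            g.add((b, p, o))
        elif s == b:
            g.add((a, p, o))

# Both identifiers now carry both (contradictory) occupation claims.
for occupation in g.objects(EX.nietzsche, EX.occupation):
    print(occupation)
```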

Dominic Oldman from the British Museum’s Information Services Development section concluded the day with a talk about the ResearchSpace project, based at the British Museum. The project aims to create a research collaboration and digital publication environment, the idea being that the data becomes part of the space, along with the tools and collaboration. It consists of things like blogging tools, forums and wikis sitting in an environment alongside the data, which is imported as RDF using the CIDOC Conceptual Reference Model. An example was shown of comparing a drawing and a painting of the same subject by an artist, and the ability to bring these together.
