Posts Tagged ‘SPARQL’

Same Old Same Old?

Wednesday, January 19th, 2011 by Adrian Stevenson

Cultural Heritage and the Semantic Web: British Museum and UCL Study Day, British Museum, London, January 13th 2011

“The ability of the semantic web to cheaply but effectively integrate data and breakdown data silos provides museums with a long awaited opportunity to present a richer, more informative and interesting picture. For scholars, it promises the ability to uncover relationships and knowledge that would otherwise be difficult, if not impossible, to discover otherwise.”

Such was the promise of the ‘Cultural Heritage and the Semantic Web Study Day’ held in the hallowed halls of the Museum last week. Dame Wendy Hall from the University of Southampton opened the case for the defence, citing some anecdotes and giving us an outline of ‘Microcosm’, a cataloguing system she helped develop that was used for the Mountbatten archive back in 1987. Microcosm employed the semantic concept of entity-based reasoning using object-concept-context triples. Hall talked through some of the lessons learnt from her involvement in the early days of the Web. She believes that big is beautiful, i.e. the network is everything, and that scruffy works, meaning that link failure is OK, as it’s part of what makes the Web scale. Open and free standards are also very important. Hall mentioned a number of times that people said the Web wouldn’t scale, whether it was the network itself or the ability to search it, but time has proved them wrong. Although Hall didn’t make the point explicitly, the implication was that the same will prove true for the semantic web. As to why it didn’t take off ten years ago, she believes it’s because the Artificial Intelligence community became very interested and took it down “an AI rat hole” that it’s only now managing to re-emerge from. She does acknowledge that publishing RDF is harder than publishing web pages, but believes that it is doable, and that we are now past the tipping point, partly due to the helping push from data.gov.uk.

Dame Wendy Hall speaking at the British Museum on January 13th 2011

Ken Hamma spoke about ‘The Wrong Containers’. I found the gist of the talk somewhat elusive, but I think the general message was the now fairly familiar argument that we, in this case the cultural heritage sector (read ‘museums’ here), are too wedded to our long-cherished notions of the book and the catalogue, and have applied these concepts inappropriately as metaphors in the Web environment. He extended this reasoning to the way in which museums have similarly attempted to apply their practices and policies to the Web, having historically acted as gatekeepers and mediators. Getting institutions to be open and free with their data is a challenge, with many asking why they should share. Hamma believes the museum needs to break free from the constraints of the catalogue and rethink its containers.

John Sheridan from The National Archives framed his talk around the Coalition Agreement, which provides the guiding principles for the publication of public sector information, or as he put it, the “ten commandments for the civil service”. The Agreement mandates what is in fact a very liberal licensing regime with a commitment to publishing in open standards, and the National Archives have taken this opportunity to publish data in Linked Data form and make it available via the data.gov.uk website. John acknowledged that not all data consumers will want data in RDF form from a SPARQL endpoint, so they’ve also developed Linked Data APIs with the facility to deliver data in other formats, the software code for this being made available as open source. John also mentioned that the National Archives have created a vocabulary for government organisational structure called the ‘Central Government Ontology’, and they’ve also been using a datacube to aid the creation of lightweight vocabularies for specific purposes. John believes that it is easier to publish Linked Data now than it was just a year ago, and ‘light years easier’ than five years ago.
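For readers who haven’t used a SPARQL endpoint before, here is a minimal sketch of what querying one from Python looks like, using the SPARQLWrapper library. The endpoint URL and query are purely illustrative, not the National Archives’ actual service.

    # Minimal sketch: query a SPARQL endpoint and ask for results as JSON.
    # Requires the SPARQLWrapper library (pip install SPARQLWrapper).
    from SPARQLWrapper import SPARQLWrapper, JSON

    # Illustrative endpoint URL only - substitute a real SPARQL endpoint.
    endpoint = SPARQLWrapper("http://example.gov.uk/sparql")

    endpoint.setQuery("""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?thing ?label
        WHERE { ?thing rdfs:label ?label }
        LIMIT 10
    """)

    # The same data can typically be requested in several serialisations;
    # here we ask for the SPARQL results in JSON.
    endpoint.setReturnFormat(JSON)
    results = endpoint.query().convert()

    for row in results["results"]["bindings"]:
        print(row["thing"]["value"], "-", row["label"]["value"])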

Data provenance is currently an important area for the National Archives, and they now have some ‘patterns’ for providing provenance information. He also mentioned that they’ve found the data cleansing tools in Google Refine to be very useful. It has extensions for reconciling data, which they’ve used against the government data sets, as well as extensions for creating URIs and mapping data to RDF. This all sounded very interesting, with John indicating that they are now managing to enable non-technical people to publish RDF simply by clicking, without having to go anywhere near code.
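I haven’t seen the National Archives’ Refine set-up, but the kind of transformation those extensions automate – minting a URI from an existing identifier and mapping a cleaned row of data to RDF – looks roughly like this hand-rolled rdflib sketch. The values, the URI pattern and the choice of the W3C organisation ontology are my own illustration, not theirs.

    # Sketch of the 'identifier -> URI -> RDF' mapping that Google Refine's
    # RDF extensions automate through a point-and-click interface.
    # Requires rdflib (pip install rdflib).
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, RDFS

    # A cleaned-up row of tabular data (made-up example values).
    row = {"id": "dept-0042", "name": "Department of Examples", "parent": "dept-0001"}

    EX = Namespace("http://example.gov.uk/id/department/")  # illustrative URI pattern
    ORG = Namespace("http://www.w3.org/ns/org#")            # W3C organisation ontology

    g = Graph()
    dept = EX[row["id"]]                       # mint a URI from the existing identifier
    g.add((dept, RDF.type, ORG.Organization))
    g.add((dept, RDFS.label, Literal(row["name"])))
    g.add((dept, ORG.subOrganizationOf, EX[row["parent"]]))

    print(g.serialize(format="turtle"))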

John certainly painted a rosy picture of how easy it is to do things, one I have to say I don’t find resonates that closely with my own experience on the Locah project, where we’re publishing Linked Data for the Archives Hub and Copac services. I had a list of questions for John that I didn’t get to ask on the day. I’ll be sure to point John to these:

  • What are the processes for publication of Linked Data, and how are these embedded to enable non-technical people to publish RDF?
  • Are these processes documented and available openly, such as in step-by-step guides?
  • Do you have generic tools available for publishing Linked Data that could be used by others?
  • How did you deal with modelling existing data into RDF? Are there tools to help do this?
  • Does the RDF data published have links to other data sets, i.e. is it Linked Data in this sense?
  • Would you consider running, or being involved in, hands-on Linked Data publishing workshops?

Hugh Glaser from Seme4 outlined a common problem at the British Museum and many other places: that many separate research silos exist within organisations. The conservation data will be in one place, the acquisition data in another, and the cataloguing data in yet another. Fusing this data together for the museum website by traditional means is very expensive, but the use of Linked Data based on the CIDOC Conceptual Reference Model ontology for the catalogue, and the sameAs.org service to tie things together, can make things more cost effective. He then gave a quick demo of RKBExplorer, a service that displays digests of semantic relationships. Despite Hugh’s engaging manner, I’m not sure the demonstrations would have been enough to persuade people of the benefits of Linked Data to the cultural heritage sector.
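To make the silo-fusing point concrete, here is a toy sketch – with invented URIs and properties throughout – of how an owl:sameAs link lets a single SPARQL query read across two separately produced data sets once they are loaded side by side.

    # Toy sketch: two 'silos' describe the same object under different URIs;
    # an owl:sameAs link ties them together so one query spans both.
    # All URIs and properties are invented. Requires rdflib.
    from rdflib import Graph

    conservation = """
    @prefix ex: <http://example.org/conservation/> .
    ex:obj123 ex:treatedWith "beeswax consolidant" .
    """

    acquisition = """
    @prefix owl: <http://www.w3.org/2002/07/owl#> .
    @prefix acq: <http://example.org/acquisition/> .
    acq:item999 acq:acquiredFrom "Estate sale, 1923" ;
                owl:sameAs <http://example.org/conservation/obj123> .
    """

    g = Graph()
    g.parse(data=conservation, format="turtle")
    g.parse(data=acquisition, format="turtle")

    results = g.query("""
        PREFIX owl: <http://www.w3.org/2002/07/owl#>
        PREFIX ex:  <http://example.org/conservation/>
        PREFIX acq: <http://example.org/acquisition/>
        SELECT ?treatment ?source WHERE {
            ?item owl:sameAs ?same .
            ?same ex:treatedWith ?treatment .
            ?item acq:acquiredFrom ?source .
        }
    """)
    for treatment, source in results:
        print(treatment, "|", source)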

In the short panel session that followed, John Sheridan noted that the National Archives are using named graphs to provide machine-readable provenance trails for legislation.data.gov.uk, employing the Open Provenance Model Vocabulary in combination with Google Refine processing. Hugh made the interesting point that he thinks we can get too hung up on the modelling of data and the publication of RDF; as a result, the data published ends up being too complex and not fit for purpose. For example, when we’re including provenance data, we might want to ask why we are doing this. Is it for the user, or really just for ourselves, serving no real purpose? Big heavyweight models can be problematic in this respect.
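I don’t know the exact shape of the National Archives’ provenance patterns, but the general technique – putting a batch of triples into a named graph and then making statements about that graph – can be sketched with rdflib like this. The URIs are placeholders, and I’ve used Dublin Core terms where they would use the richer Open Provenance Model Vocabulary.

    # Sketch: store data in a named graph, then attach provenance statements
    # about that graph in the default graph. Placeholder URIs throughout.
    # Requires rdflib (pip install rdflib).
    from rdflib import Dataset, Literal, URIRef
    from rdflib.namespace import DCTERMS, RDFS

    GRAPH_URI = URIRef("http://example.gov.uk/graph/bus-stops-2010-03")

    ds = Dataset()

    # The data itself lives in the named graph...
    data = ds.graph(GRAPH_URI)
    data.add((URIRef("http://example.gov.uk/id/bus-stop/1"),
              RDFS.label, Literal("High Street (Stop A)")))

    # ...while statements about where that graph came from sit alongside it.
    ds.add((GRAPH_URI, DCTERMS.source,
            URIRef("http://example.gov.uk/source/bus-stop-extract")))
    ds.add((GRAPH_URI, DCTERMS.created, Literal("2010-03-01")))

    print(ds.serialize(format="trig"))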

The problem of having contradictory assertions about the same thing also came up. In Linked Data, all voices can be equal, so attribution may be important. However, even with the data the British Museum creates, there will be some contradictory assertions. John Sheridan pointed out that data.gov.uk has aided the correction of data: the publication of data about bus stops revealed that 20,000 specified locations weren’t in the right place, which were then corrected by members of the public. Hugh reminded us that the domain of a Web URI, such as http://www.britishmuseum.org/, does itself provide a degree of attribution and trust.

Atanas Kiryakov from Ontotext was the first speaker to sound a warning note or two, with what I thought was an admirably honest talk. Ontotext provide a service called FactForge, and to explain this, Atanas talked a little about how we can make inferences using RDF: for example, the statement that X ‘is a parent of’ Y entails the inverse statement that Y ‘is a child of’ X. He noted that the BBC were probably the first to use concept extraction, RDF and triple stores on a large-scale site, the solution having been chosen over a traditional database solution, with the semantic web delivering a cheaper product.
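The parent/child example is the classic owl:inverseOf case. rdflib doesn’t do OWL reasoning on its own, so this sketch (with made-up URIs) materialises the entailed triples with a couple of lines of Python just to show what the inference amounts to; a real reasoner, such as the one behind FactForge, would do this automatically.

    # Declare two properties as inverses, then materialise the entailed
    # triples by hand. URIs are invented for illustration. Requires rdflib.
    from rdflib import Graph, Namespace
    from rdflib.namespace import OWL

    EX = Namespace("http://example.org/")

    g = Graph()
    g.add((EX.isParentOf, OWL.inverseOf, EX.isChildOf))
    g.add((EX.alice, EX.isParentOf, EX.bob))

    # Tiny forward-chaining step: for every inverse pair, add the mirrored triples.
    for p, _, q in list(g.triples((None, OWL.inverseOf, None))):
        for s, _, o in list(g.triples((None, p, None))):
            g.add((o, q, s))   # alice isParentOf bob  =>  bob isChildOf alice
        for s, _, o in list(g.triples((None, q, None))):
            g.add((o, p, s))

    print(g.serialize(format="turtle"))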

So why is the semantic web still not used more? Atanas believes it’s because there are still no well-established Linked Data success stories to convince business enterprises to buy in. Linked Data, he suggests, is like teenage sex – many talk about it, but not many do it. Nevertheless, he does believe that Linked Data facilitates better data integration, and adds value to proprietary data through better description whilst being able to make data more open. However, Linked Data is hard for people to comprehend, and its sheer diversity comes at a price. Getting specific information out of DBpedia is tough. The Linked Data Web is also unreliable, exhibiting high downtime. One point that really struck me was how slow he says the distributed data web is, with a SPARQL query over just two or three servers being unacceptably slow.
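To illustrate the ‘two or three servers’ point: a SPARQL 1.1 federated query ships intermediate results over HTTP to each SERVICE endpoint it names. The sketch below runs against DBpedia and calls out to the Wikidata query service; both endpoints are real, but public endpoints often throttle or refuse federated calls, so it may well be slow or fail – which is rather Kiryakov’s point. Treat the specific property and resource choices as illustrative.

    # Illustrative federated SPARQL query: the SERVICE clause is a remote
    # call to a second server, which is why queries spanning even two or
    # three endpoints can be painfully slow or simply time out.
    # Requires SPARQLWrapper (pip install SPARQLWrapper).
    from SPARQLWrapper import SPARQLWrapper, JSON

    query = """
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    SELECT ?person ?dob WHERE {
        ?person dbo:birthPlace dbr:Leipzig ;
                owl:sameAs ?wd .
        FILTER (STRSTARTS(STR(?wd), "http://www.wikidata.org/entity/"))
        SERVICE <https://query.wikidata.org/sparql> {
            ?wd wdt:P569 ?dob .    # P569: date of birth
        }
    }
    LIMIT 10
    """

    endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
    endpoint.setQuery(query)
    endpoint.setReturnFormat(JSON)
    results = endpoint.query().convert()    # expect a long wait, or an error
    print(results["results"]["bindings"])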

Overcoming these limitations of the Linked Data Web forms the basis of Ontotext’s ‘reason-able’ approach, which is to group selected datasets (DBpedia, Freebase, Geonames, UMBEL, WordNet, CIA World Factbook, Lingvoj, MusicBrainz) and ontologies (Dublin Core, SKOS, RSS, FOAF) into a compound set which is then cleaned up and post-processed. It does strike me that this re-centralising, dare I say it, portal approach seems to defeat much of the point of the Linked Data Web, with the inherent issues of a bounded data set and data falling out of sync, albeit I realise it aims to provide an optimised and pragmatic solution. Atanas suggests that many real-time queries would be impossible without services like FactForge.

Atanas then explained how the wide diversity of the Linked Data Web often leads to surprising and erratic results, the example given being that the most popular entertainer in Germany according to a SPARQL query is the philosopher Nietzsche, as demonstrated using the FactForge query interface. This arises from what Atanas calls the ‘honey and sting’ of owl:sameAs, the semantic web concept that allows for assertions to be made that two given names or identifiers refer to the same individual or entity. This can generate a great multiplicity of statements, and give rise to many different versions of the same result.
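A tiny sketch of the ‘sting’, with invented URIs: once several identifiers are chained together with owl:sameAs, a query that follows those links (here via a SPARQL 1.1 property path in rdflib) returns the one underlying answer under every alias.

    # Sketch of the owl:sameAs 'honey and sting': three URIs for one person,
    # chained by sameAs, and a query that follows the chain returns many
    # versions of the same result. URIs invented. Requires rdflib.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import OWL, RDFS

    EX = Namespace("http://example.org/")
    g = Graph()

    g.add((EX.nietzsche_a, OWL.sameAs, EX.nietzsche_b))
    g.add((EX.nietzsche_b, OWL.sameAs, EX.nietzsche_c))
    g.add((EX.nietzsche_a, RDFS.label, Literal("Friedrich Nietzsche")))

    # The property path walks sameAs in both directions, so the single
    # labelled resource is reachable under all three names.
    results = g.query("""
        PREFIX owl:  <http://www.w3.org/2002/07/owl#>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT DISTINCT ?alias WHERE {
            ?alias (owl:sameAs|^owl:sameAs)* ?x .
            ?x rdfs:label "Friedrich Nietzsche" .
        }
    """)
    for (alias,) in results:
        print(alias)    # three URIs, one individual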

Dominic Oldman from the British Museum’s Information Services Development section concluded the day by talking about the ResearchSpace project based at the British Museum. The project aims to create a research collaboration and digital publication environment, the idea being that the data becomes a part of the space, along with the tools and collaboration. It consists of things like blogging tools, forums and wikis in an environment alongside the data, which is imported as RDF using the CIDOC Conceptual Reference Model. An example was shown of a comparison of a drawing and a painting of the same subject by an artist, and the ability to bring these together.

Ground Control to Major Tom

Wednesday, March 17th, 2010 by Adrian Stevenson

Terra Future Seminar, Ordnance Survey, Southampton, 10th March 2010

The Linked Data movement was once again in rude health at last week’s ‘Terra Future’ seminar, where it joined forces with the UK GIS crowd at the Ordnance Survey HQ on a sun-soaked Southampton morning.

Peter ter Haar from Ordnance Survey opened the day by raising the question, “why bring geo-spatial data and Linked Data together?”. He suggested it is important because the word “where” exists in 80% of our questions. Where things happen matters greatly to us.

It was down to the ever-polished ‘Major’ Tom Heath from Talis to set the Linked Data scene. He talked through a clever analogy between virtual data links and physical transport links. Linked Data promises to improve the speed and efficiency of virtual data networks in the same way the development of road and rail networks improved physical travel over the canal era. Tom’s journey to work from Bristol to Birmingham is only made possible by an interlinking network of roads, cycle routes and railways built using agreed standards such as rail track gauges. Similarly, many data applications will only be made possible by a network of standardised Linked Data. As building physical networks has added value to the places they connect, so will building virtual networks add value to the things they connect.

Tom Heath speaking at Terra Future 2010

Liz Ratcliffe from Ordnance Survey gave us a brief history of geography, complete with some great slides, explaining how much of the subject is about linking different aspects of geography together, whether it be information about physical geography, climatology, coastal data, environmental data, glaciology and so on. Liz was the first to mention topographic identifiers (TOIDs). Ordnance Survey uses TOIDs to connect information, with every geographical feature in its data being identified by its TOID. It’s also possible to ‘hang’ your own information on TOIDs. The difference between the concepts of location and place was explained: location being a point or position in physical space, and place being a portion of space regarded as distinct and measured off. Liz concluded with the seemingly obvious, but perhaps taken-for-granted, observation that “everything happens somewhere”. Tom Heath made the point on Twitter that the next step here is to make the links between the topographic IDs explicit, and then expose them to the Web as HTTP URIs. John Goodwin reported that he’s “working on it”, so it sounds like we can look forward to some progress here.
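As a sketch of what ‘hanging your own information on a TOID’ could look like once TOIDs are exposed as HTTP URIs: the TOID value, URI pattern and properties below are all placeholders of my own, not real Ordnance Survey identifiers.

    # Attach your own statements to a (hypothetical) HTTP URI minted from a
    # TOID, so anyone referring to the same feature can link to the same data.
    # The TOID, URI pattern and vocabulary are placeholders. Requires rdflib.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDFS

    EX = Namespace("http://example.org/vocab/")

    # Hypothetical URI pattern built around a TOID.
    feature = URIRef("http://data.example.org/id/toid/1000000000000000")

    g = Graph()
    g.add((feature, RDFS.label, Literal("The Red Lion public house")))
    g.add((feature, EX.surveyedOn, Literal("2010-03-10")))
    g.add((feature, EX.note, Literal("Roof re-slated since last survey")))

    print(g.serialize(format="turtle"))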

Silver Oliver from the BBC stood in for a mysteriously absent Tom Scott. It was more or less a re-run of his Linked Data London talk on what the BBC are doing in the area of news and journalism, and is covered below.

Next up was an unscheduled surprise guest, none other than the ‘inventor of the Web’, Sir Tim Berners-Lee himself, who gave a quick pep talk expressing how crucially important geo-spatial data is for Linked Data. He observed that one of the first things people do is to map things. This is very valuable information, so we need to surface this geospatial data in ways that can be linked. Surprise guest star number two, Nigel Shadbolt from the University of Southampton and ‘Information Advisor’ to the Prime Minister, then took to the podium for a lightning talk on data.gov.uk. He gave part of the credit for the fact that data.gov.uk was launched on time to agile programming methodologies, and suggested this was worth considering when thinking about procurement. He then gave us a quick tour of some of the interesting work going on, including the ASBOrometer iPhone app that measures the level of anti-social behaviour at your current geo-location based on data from data.gov.uk.

Tim Berners-Lee speaking at Terra Future 2010

John Sheridan from the UK Government’s Office of Public Sector Information (OPSI) then talked some more about data.gov.uk. He mentioned that an important part of the data.gov.uk exercise was to re-use existing identifiers by co-opting them when moving things into the Linked Data space; the point is not to start anew. They’re trying to find ways to make publishing data in RDF form as easy as possible by finding patterns for doing things, based on Christopher Alexander’s design patterns ideas. He also asked “how the hell would we meet the INSPIRE directives without Linked Data?”. These directives require the assignment of identifiers for spatial objects and the publishing of geospatial information. Now that data.gov.uk has been launched, the next step is to build capability and tools around the design patterns they’ve been constructing.
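The ‘don’t start anew’ idea boils down to embedding the identifiers a dataset already carries in a predictable URI template. A minimal sketch, with an invented template and invented codes:

    # Sketch of the 're-use existing identifiers' pattern: co-opt the codes a
    # dataset already uses rather than inventing new ones. The template and
    # the codes are made up for illustration. Requires rdflib.
    from rdflib import URIRef

    URI_TEMPLATE = "http://example.gov.uk/id/school/{code}"

    def school_uri(existing_code: str) -> URIRef:
        """Turn an existing administrative code into a Linked Data URI."""
        return URIRef(URI_TEMPLATE.format(code=existing_code))

    for code in ["100001", "100002"]:    # invented example codes
        print(school_uri(code))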

Brian Higgs from Dudley Metropolitan Borough Council talked a little about how location is important for local government in delivering its services, for which geo-information services are business critical. Ian Holt, Senior Technical Product Manager for Ordnance Survey Web Services, then spoke about some of the interesting map things that have been happening as a result of Web 2.0. He illustrated how it’s possible to create your own maps with a GPS unit using the OpenStreetMap service. He mentioned how the world community had filled in the data on the open map of Haiti following the recent earthquake, greatly helping the recovery effort. Tim Berners-Lee also referred to this in a recent TED talk.

Hugh Glaser from the University of Southampton closed the presentations with some technical demonstrations of sameAs.org, a service that helps you to find co-references between different data sets, and rkbexplorer.com, a ‘human interface to the ReSIST Knowledge Base’. A sameAs.org search on, for example, John Coltrane will give you a set of URIs that all refer to the same concept of John Coltrane. Hugh admitted that gathering all the links is a challenge, and that some RDF is not reliable. All that’s needed to have a ‘same as’ data problem is for one person to state that the band ‘Metallica’ is the same as the album ‘Metallica’. Hugh also alluded to some of the issues around trusting data, leading Chris Gutteridge to make a mischievous suggestion, “Hughs examples of sameAs for metallica gives me an evil idea: sameAs from bbc & imdb to torrents on thepiratebay”. I raised this as an issue for Linked Data on Twitter, leading to a number of responses, Ben O’Steen suggesting that “If people are out to deceive you, or your system, they will regardless of tech. Needing risk management is not new!”. I think this is a fair point. Many of these same issues have been tackled in the past, for example in the area of relational databases. The adoption of Linked Data is still fairly new, and it seems perfectly plausible that Linked Data will be able to address and resolve these same issues in time.
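To make the Metallica point concrete, here is a toy rdflib sketch (invented URIs and properties) of how a single careless owl:sameAs assertion lets facts about the album leak onto the band once the link is naively followed.

    # Toy sketch of the Metallica problem: one bad owl:sameAs statement
    # conflates the band with the album, and album facts start turning up
    # as band facts. URIs and properties are invented. Requires rdflib.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import OWL

    EX = Namespace("http://example.org/music/")
    g = Graph()

    g.add((EX.Metallica_band, EX.formedIn, Literal("1981")))
    g.add((EX.Metallica_album, EX.releasedIn, Literal("1991")))

    # The rogue assertion: band sameAs album.
    g.add((EX.Metallica_band, OWL.sameAs, EX.Metallica_album))

    # Naive 'smushing': copy everything said about one onto the other.
    for s, o in list(g.subject_objects(OWL.sameAs)):
        for _, p, v in list(g.triples((o, None, None))):
            g.add((s, p, v))

    # The band now appears to have a release year - the album's.
    for p, v in g.predicate_objects(EX.Metallica_band):
        print(p, v)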

All in all, it was a great day, with the strong sense that real things are happening, and a great sense of excitement and optimism for the future of open, linked and geo-located data.

'Open Your Data' Sticker given out by Tim Berners-Lee

The full archive of #terrafuture tweets is available via twapper.

Collective Intelligence Amplification

Monday, March 15th, 2010 by Adrian Stevenson

JISC Developer Days, University of London Union, London, 24th-27th February 2010

Following straight on from the ‘Linked Data Meet-up 2‘, I went immediately into the JISC UKOLN Dev8d Developer Days (http://dev8d.org/) held at the same location. Although I may be considered a little biased given that I work for UKOLN, I have to say I was mightily impressed by this fantastic event. The detail that went into the organisation, as well as the multitude of original ideas to enhance the event, was well beyond anything I’ve seen before.

I was mainly there to get a few video interviews, and I’ve included these below. It was great to chat to Ed Summers from the Library of Congress, who passed on his usual code4lib conference to attend dev8d, and gave us a few comments on how the events compare. It was also exciting to hear that Chuck Severance is intending to enhance the degree course he teaches on back in the US, using things he’s learnt at dev8d. All the interviewees clearly found the event to be really useful for creating and collaborating on new ideas in a way that just isn’t possible to the same degree as part of the usual working week. Just walking around the event listening in to some of the conversations, I could tell some great developer brains were working optimally. The workshops, expert sessions and project zones all added to the overall effect of raising the collective intelligence a good few notches. I’m sure we’ll hear about some great projects arising directly from these intense hot-housing days.

You can get more reflections via the dev8d and JISC Information Environment Team blogs.

Video interviews: Ed Summers, Chuck Severance, Tim Donahue, John O’Brien, Steve Coppin, Chris Keene, Marcus Ramsden, Lin Clark and Tom Heath.