Posts Tagged ‘Open Data’

Same Old Same Old?

Wednesday, January 19th, 2011 by Adrian Stevenson

Cultural Heritage and the Semantic Web: British Museum and UCL Study Day, British Museum, London, January 13th 2011

“The ability of the semantic web to cheaply but effectively integrate data and breakdown data silos provides museums with a long awaited opportunity to present a richer, more informative and interesting picture. For scholars, it promises the ability to uncover relationships and knowledge that would otherwise be difficult, if not impossible, to discover otherwise.”

Such was the promise of the ‘Cultural Heritage and the Semantic Web Study Day’ held in the hallowed halls of the Museum last week. Dame Wendy Hall from the University of Southampton opened the case for the defence, citing some anecdotes and giving us an outline of ‘Microcosm’, a cataloguing system she helped develop that was used for the Mountbatten archive back in 1987. Microcosm employed the semantic concept of entity-based reasoning using object-concept-context triples. Hall talked through some of the lessons learnt from her involvement in the early days of the Web. She believes that big is beautiful, i.e. the network is everything, and that scruffy works, meaning that link failure is OK, as it’s part of what makes the Web scale. Open and free standards are also very important. Hall mentioned a number of times that people had said the Web wouldn’t scale, whether it was the network itself or the ability to search it, but time has proved them wrong. Although Hall didn’t make the point explicitly, the implication was that the same would prove true for the semantic web. As to why it didn’t take off ten years ago, she believes it’s because the Artificial Intelligence community became very interested and took it down “an AI rat hole” from which it’s only now managing to re-emerge. She does acknowledge that publishing RDF is harder than publishing web pages, but believes that it is doable, and that we are now past the tipping point, partly thanks to a helpful push from data.gov.uk.

Dame Wendy Hall speaking at the British Museum on January 13th 2011

Ken Hamma spoke about ‘The Wrong Containers’. I found the gist of the talk somewhat elusive, but I think the general message was the now fairly familiar argument that we, in this case the cultural heritage sector (read ‘museums’ here), are too wedded to our long cherished notions of the book and the catalogue, and we’ve applied these concepts inappropriately as metaphors to the Web environment. He extended this reasoning to the way in which museums have similarly attempted to apply their practices and policies to the Web, having historically acted as gatekeepers and mediators. Getting institutions to be open and free with their data is a challenge, many asking why they should share. Hamma believes the museum needs to break free from the constraints of the catalogue, and needs to rethink its containers.

John Sheridan from The National Archives framed his talk around the Coalition Agreement, which provides the guiding principles for the publication of public sector information, or as he put it, the “ten commandments for the civil service”. The Agreement mandates what is in fact a very liberal licensing regime with a commitment to publishing in open standards, and the National Archives have taken this opportunity to publish data in Linked Data form and make it available via the data.gov.uk website. John acknowledged that not all data consumers will want data in RDF form from a SPARQL endpoint, so they’ve also developed Linked Data APIs with the facility to deliver data in other formats, with the software code for this being made available as open source. John also mentioned that the National Archives have generated a vocabulary for government organisational structure called the ‘Central Government Ontology’, and they’ve also been using a datacube to aid the creation of lightweight vocabularies for specific purposes. John believes that it is easier to publish Linked Data now than it was just a year ago, and ‘light years easier’ than five years ago.

Data provenance is currently an important area for the National Archives, and they now have some ‘patterns’ for providing provenance information. He also mentioned that they’ve found the data cleansing tools available from Google Refine to be very useful. It has extensions for reconciling data that they’ve used against the government data sets, as well as extensions for creating URIs and mapping data to RDF. This all sounded very interesting, with John indicating that they are now managing to enable non-technical people to publish RDF simply by clicking, and without having to go anywhere near code.

John certainly painted a rosy picture of how easy it is to do things, one I have to say I don’t find resonates that closely with my own experience on the Locah project, where we’re publishing Linked Data for the Archives Hub and Copac services. I had a list of questions for John that I didn’t get to ask on the day. I’ll be sure to point John to these:

  • What are the processes for publication of Linked Data, and how are these embedded to enable non-technical people to publish RDF?
  • Are these processes documented and available openly, such as in step-by-step guides?
  • Do you have generic tools available for publishing Linked Data that could be used by others?
  • How did you deal with modelling existing data into RDF? Are there tools to help do this?
  • Does the RDF data published have links to other data sets, i.e. is it Linked Data in this sense?
  • Would you consider running or being involved in hands-on Linked Data publishing workshops?

Hugh Glaser from Seme4 outlined a common problem existing at the British Museum and many other places: that many separate research silos exist within organisations. The conservation data will be in one place, the acquisition data in another place, and the cataloguing data in yet another. Fusing this data together for the museum website by traditional means is very expensive, but the use of Linked Data based on the CIDOC Conceptual Reference Model ontology for the catalogue, and the <sameAs> service to tie things together, can make things more cost effective.  He then gave a quick demo of RKBExplorer, a service that displays digests of semantic relationships. Despite Hugh’s engaging manner, I’m not sure the demonstrations would have been enough to persuade people of the benefits of Linked Data to the cultural heritage sector.
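As a rough illustration of how co-reference links can tie separate silos together, here is a minimal Python sketch. It is not the <sameAs> service or RKBExplorer, and every identifier and field name in it is hypothetical. It simply groups identifiers that have been declared equivalent and merges the records attached to them.

```python
def find(parent, item):
    """Follow parent links to the representative of item's group."""
    parent.setdefault(item, item)
    while parent[item] != item:
        parent[item] = parent[parent[item]]  # shorten the chain as we go
        item = parent[item]
    return item


def fuse(records, same_as_pairs):
    """Merge records whose identifiers are linked by sameAs pairs."""
    parent = {}
    for a, b in same_as_pairs:
        parent[find(parent, a)] = find(parent, b)
    fused = {}
    for identifier, fields in records.items():
        fused.setdefault(find(parent, identifier), {}).update(fields)
    return list(fused.values())


# Hypothetical records held in three separate departmental systems.
records = {
    "conservation:obj-17": {"condition": "stable"},
    "acquisition:a-2041": {"acquired": "1902"},
    "catalogue:c-9": {"title": "Bronze figure"},
    "catalogue:c-10": {"title": "Stone relief"},
}

# Hypothetical statements that two identifiers describe the same object.
same_as_pairs = [
    ("conservation:obj-17", "catalogue:c-9"),
    ("acquisition:a-2041", "catalogue:c-9"),
]

fused = fuse(records, same_as_pairs)
```

The three linked records collapse into a single description of the object, while the unlinked catalogue entry stays separate. The expensive part in practice is deciding which identifiers really do refer to the same thing, which this sketch takes as given.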

In the short panel session that followed, John Sheridan noted that the National Archives are using named graphs to provide machine-readable provenance trails for legislation.data.gov.uk, employing the Open Provenance Model Vocabulary in combination with Google Refine processing. Hugh made the interesting point that he thinks we can get too hung up on the modelling of data, and the publication of RDF. As a result, the data published ends up being too complex and not fit for purpose. For example, when we’re including provenance data, we might want to ask why we are doing this. Is it for the user, or really just for ourselves, serving no real purpose? Big heavyweight models can be problematic in this respect.
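To give a feel for the named graph idea, the sketch below stores each statement with the name of the graph it belongs to, and records provenance once per graph. This is only an assumption-laden illustration in plain Python: the identifiers, property names and provenance fields are all hypothetical, and it does not use the Open Provenance Model Vocabulary or reflect how legislation.data.gov.uk is actually built.

```python
# Each statement is a quad: (subject, predicate, object, graph name).
# All identifiers here are made up for illustration.
quads = [
    ("ex:act1", "ex:title", "Example Act 2010", "graph:import-1"),
    ("ex:act1", "ex:year", "2010", "graph:import-1"),
    ("ex:act2", "ex:title", "Another Act 2009", "graph:import-2"),
]

# Provenance is attached to the graph, not repeated on every statement.
provenance = {
    "graph:import-1": {"source": "ex:source-file-a", "process": "ex:cleaning-step"},
    "graph:import-2": {"source": "ex:source-file-b", "process": "ex:cleaning-step"},
}


def provenance_for(subject, predicate):
    """Return the provenance of every graph that asserts the statement."""
    return [
        provenance[graph]
        for s, p, _obj, graph in quads
        if s == subject and p == predicate
    ]


trail = provenance_for("ex:act1", "ex:title")
```

Because the provenance hangs off the graph name, a consumer can ask where any individual statement came from without the publisher having to annotate each one separately.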

The problem of having contradictory assertions about the same thing also came up. In Linked Data, all voices can be equal, so attribution may be important. However, even with the data the British Museum creates, there will be some contradictory assertions. John Sheridan pointed out that data.gov.uk has aided the correction of data. The publication of data about bus stops revealed that 20,000 specified locations weren’t in the right place, these then being corrected by members of the public. Hugh reminded us that the domain of a Web URI, such as http://www.britishmuseum.org/, does itself provide a degree of attribution and trust.

Atanas Kiryakov from Ontotext was the first speaker to sound a warning note or two, with what I thought was an admirably honest talk. Ontotext provide a service called FactForge, and to explain this, Atanas talked a little about how we can make inferences using RDF: for example, the statement ‘is a parent’ implies the inverse statement ‘is a child’. He noted that the BBC were probably the first to use concept extraction, RDF and triple stores on a large-scale site, the solution having been chosen over a traditional database solution, with the semantic web delivering a cheaper product.
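The parent and child example can be sketched in a few lines of Python. This is a toy illustration only, not how FactForge or any real reasoner works, and the property names are hypothetical. It shows the basic idea that one asserted statement, plus a rule saying two properties are inverses, yields a second statement nobody had to write down.

```python
# Hypothetical property names; a real system would declare the inverse
# relationship in an ontology rather than in a Python dictionary.
INVERSE_OF = {"isParentOf": "isChildOf", "isChildOf": "isParentOf"}


def infer_inverses(triples):
    """Return the input triples plus any inverse statements they imply."""
    inferred = set(triples)
    for subject, predicate, obj in triples:
        inverse = INVERSE_OF.get(predicate)
        if inverse is not None:
            inferred.add((obj, inverse, subject))
    return inferred


asserted = {("alice", "isParentOf", "bob")}
closure = infer_inverses(asserted)
```

Here `closure` contains both the original statement and the inferred one that bob is a child of alice.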

So why is the semantic web still not used more? Atanas believes it’s because there are still no well-established Linked Data ‘buys’ to convince business enterprise. Linked Data, he suggests, is like teenage sex: many talk about it, but not many do it. Nevertheless, he does believe that Linked Data facilitates better data integration, and adds value to proprietary data through better description whilst being able to make data more open. However, Linked Data is hard for people to comprehend, and its sheer diversity comes at a price. Getting specific information out of DBpedia is tough. The Linked Data Web is also unreliable, exhibiting high downtime. One point that really struck me was how slow he says the distributed data web is, with a SPARQL query over just two or three servers being unacceptably slow.

Overcoming these limitations of the Linked Data Web forms the basis of Ontotext’s ‘reason-able’ approach, which is to group selected datasets (DBpedia, Freebase, Geonames, UMBEL, WordNet, CIA World Factbook, Lingvoj, MusicBrainz) and ontologies (Dublin Core, SKOS, RSS, FOAF) into a compound set which is then cleaned up and post-processed. It does strike me that this re-centralising, dare I say it, portal approach seems to defeat much of the point of the Linked Data Web, with inherent issues arising from a bounded data set and out-of-sync data, albeit I realise it aims to provide an optimised and pragmatic solution. Atanas suggests that many real-time queries would be impossible without services like FactForge.

Atanas then explained how the wide diversity of the Linked Data Web often leads to surprising and erratic results, the example given being that the most popular entertainer in Germany according to a SPARQL query is the philosopher Nietzsche, as demonstrated using the FactForge query interface. This arises from what Atanas calls the ‘honey and sting’ of owl:sameAs, the semantic web concept that allows for assertions to be made that two given names or identifiers refer to the same individual or entity. This can generate a great multiplicity of statements, and give rise to many different versions of the same result.
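The multiplicity problem is easy to reproduce in miniature. The Python sketch below is a simplified illustration with made-up identifiers and data, and it does not attempt to recreate the Nietzsche query. It shows how one person known under three identifiers appears three times in a naive result, and how mapping aliases to a single canonical identifier collapses the duplicates.

```python
# Hypothetical aliases: three datasets each have their own identifier
# for the same person, all mapped to one canonical identifier.
SAME_AS = {
    "datasetA:person-x": "canon:person-x",
    "datasetB:px-001": "canon:person-x",
    "datasetC:x": "canon:person-x",
}

triples = [
    ("datasetA:person-x", "bornIn", "Germany"),
    ("datasetB:px-001", "bornIn", "Germany"),
    ("datasetC:x", "bornIn", "Germany"),
    ("datasetA:person-y", "bornIn", "Germany"),
]


def born_in(place, canonicalise=False):
    """List subjects born in place, optionally collapsing known aliases."""
    subjects = [s for s, p, o in triples if p == "bornIn" and o == place]
    if canonicalise:
        subjects = sorted({SAME_AS.get(s, s) for s in subjects})
    return subjects


naive = born_in("Germany")
collapsed = born_in("Germany", canonicalise=True)
```

The naive query returns four rows for what are really two people. Any ranking or count built on those rows would be skewed towards whichever entity happens to have the most aliases, which is one way surprising results can arise.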

Dominic Oldman from the British Museum’s Information Services Development section concluded the day talking about the ResearchSpace project based at the British Museum. The project aims to create a research collaboration and digital publication environment, the idea being that the data becomes a part of the space, along with the tools and collaboration. It consists of things like blogging tools, forums, and wikis in an environment alongside the data, which is imported in RDF using the CIDOC Conceptual Reference Model. An example was shown of a comparison of a drawing and a painting of the same thing by an artist, and the ability to bring these together.

Collective Intelligence Amplification

Monday, March 15th, 2010 by Adrian Stevenson

JISC Developer Days, University of London Union, London, 24th-27th February 2010

Following straight on from the ‘Linked Data Meet-up 2’, I was immediately into the JISC UKOLN Dev8d Developer Days (http://dev8d.org/) held at the same location. Although I may be considered to be a little biased given I work for UKOLN, I have to say I was mightily impressed by this fantastic event. The detail that went into the organisation, as well as the multitude of original ideas to enhance the event, was well beyond anything I’ve seen before.

I was mainly there to get a few video interviews, and I’ve included these below. It was great to chat to Ed Summers from the Library of Congress who passed on his usual code4lib to attend dev8d, and gave us a few comments on how the events compare. It was also exciting to hear that Chuck Severance is intending to enhance the degree course he teaches on back in the US, using things he’s learnt at dev8d. All the interviewees clearly found the event to be really useful for creating and collaborating on new ideas in a way that just isn’t possible to the same degree as part of the usual working week. Just walking around the event listening in to some of the conversations, I could tell some great developer brains were working optimally. The workshops, expert sessions and project zones all added to the overall effect of raising the collective intelligence a good few notches. I’m sure we’ll hear about some great projects arising directly from these intense hot housing days.

You can get more reflections via the dev8d and JISC Information Environment Team blogs.

Video interviews: Ed Summers, Chuck Severance, Tim Donahue, John O’Brien, Steve Coppin, Chris Keene, Marcus Ramsden, Lin Clark, Tom Heath

The Case for Manchester Open Data City

Monday, February 8th, 2010 by Adrian Stevenson

As part of what might be considered my extra-curricular activities, I’ve been attending Manchester’s thriving Social Media Cafe since it began back in November 2008. I initially got involved with this group more from the perspective of being a director of the Manchester Jazz Festival and a Manchester music blogger in the guise of The Ring Modulator. The interesting thing is that it usually turns out to be more relevant to my UKOLN ‘day’ job, this being the case when Julian Tait, one of the media cafe’s founders, asked me to give a talk on Linked Data, which I duly did last year.

The crossover is even more apparent now that Julian, as part of his role in Future Everything, has become involved in a project to make Manchester the UK’s first Open Data City. He spoke about this at the last excellent cafe meeting, and did a great job helping amplify some thoughts I’ve been having on this.

Julian Tait speaking at the Manchester Social Media Cafe, BBC Manchester, 2nd February 2010

Julian posed the question of what an open data city might mean, suggesting the notion that in a very real sense, data is the lifeblood of a city. It allows cities to function, to be dynamic and to evolve. If the data is more open and more available, then perhaps this data/blood can flow more freely. The whole city organism could then be a stronger, fitter and a healthier place for us to live.

Open data has the potential to make our cities fairer and more democratic places. It is well known that information is a valuable and precious commodity, with many attempting to control and restrict access to it. Julian touched on this, inviting us to think how different Manchester could be if everyone had access to more data.

He also mentioned the idea of hyper-local data specific to a postcode that could allow information to be made available to people on a street-by-street scale. This sounds very like the Postcode Paper mentioned by Zach Beauvais from Talis at a recent CETIS meeting. There was mention of the UK government’s commitment to open data via the data.gov.uk initiative, though no specific mention was made of linked data. In the context of the Manchester project, I think the ‘linked’ part may be some way down the road, and we’re really just talking about the open bit here. Linked Data and open data do often get conflated in an unhelpful way. Paul Walk, a colleague of mine at UKOLN, recently wrote a blog post, ‘Linked, Open, Semantic?’, that helps to clarify the confusion.

Julian pointed us to two interesting examples, ‘They Work For You’ and ‘MySociety’, where open data is being absorbed into the democratic process, thereby helping citizens hold government to account. There’s also the US innovation competition, ‘Apps for Democracy’, with Julian quoting an ear-catching statistic that an investment of 50,000 dollars is estimated to have generated a stunning return of 2.3 million dollars. Clearly an exemplar case study for open data there.

4IP’s forthcoming Mapumental looks to be a visually engaging use of open data, providing intuitive visualisations of such things as house price indexes and public transport data mappings. Defra Noise Mapping England was also mentioned as the kind of thing that could be done, but which demonstrates the constraints of not being open. Its noise data can’t actually be combined with other data. One can imagine the benefits of being able to put this noise pollution data with house prices or data about road or air traffic.

Another quirky example mentioned was the UK-developed SF Trees for iPhone app that uses San Francisco Department of Public Works data to allow users to identify trees in the city.

So open data is all about people becoming engaged, empowered, and informed. Julian also drew our attention to some of the potential risks and fears associated with this mass liberation of data. Will complex issues be oversimplified? Will open transparent information cause people to make simplistic inferences and come to invalid conclusions? Subtle complexities may be missed with resulting mis-information. But surely we’re better off with the information than without? There are always risks.

Open data should also be able to provide opportunities for saving money, Julian noting that this is indeed one of the major incentives behind the UK’s ‘smarter government’ as well as US and Canadian government initiatives.

After the talk there was some lively debate, though I have to say I was somewhat disappointed by the largely suspicious and negative reaction. Perhaps this is an inevitable and healthy wariness of any government-sanctioned initiative, but it appears that people fear that the openness of our data could result in some undesirable consequences. There was a suggestion, for example, that data about poor bin collection in an area could adversely affect house prices, or that hyper-local geographical data about traffic to heart disease information websites could be used by life insurance companies. Perhaps hyper-local data risks ghettoising people even more? Clearly the careful anonymisation of data is very important. Nevertheless, it was useful to be able to gauge people’s reactions to the idea of an open data city, as any initiative like this clearly needs people on board if it is to be a success.