Creating Linked Data: more reflections from the coal face

September 22nd, 2010 by Jane

This post is to highlight some of the barriers and challenges to the creation of Linked Data.  This is a personal reflection, trying to be honest about the challenges as I have found them and the learning experience, which is inevitably a personal thing depending upon your own background, experience and ways of thinking and working. However, I think it also reflects some of the general challenges as we have come across them.

Vocabulary

It comes as no surprise that I have found the terminology somewhat confusing, and it has sometimes led me astray. Only this week Bethan and I were getting tangled up in a conversation about ‘things’  within the data model. We spent a while talking about how having a ‘Hub conceptualisation’ and a ‘thing-in-its-own-right conceptualisation’ of an entity would allow for more clarity. With ‘thing’, ‘concept’, ‘label’, ‘property’, ‘value’, ‘predicate’, ‘information resources’, ‘non-information resources’ etc. – there is quite a bit of room for misinterpretation in communication. I have looked at definitions, but these can actually sometimes hinder rather than help. I think that an attempt at a definitive glossary for Linked Data would help enormously.

Landscape

For me, it has taken a while to really get into the Linked Data way of thinking. I have actually kept a kind of diary of my thoughts over the last 2-3 months, and when I look back now at my earliest attempts at understanding how to model the data, they certainly show a pretty steep learning curve. I started, for example, by being unsure about whether we were wanting to provide information on the ‘creator’ of the archive or the archive itself and what sort of relationships between ‘things’ to include. I don’t think this is surprising, as the power of RDF is that it can be used to model anything – it doesn’t help you by giving you a limited scope or particular rules to start with (which is, of course, generally a good thing).

Archival descriptions

I listened to a number of audio tutorials, read a number of reports, blogs, etc., and learnt a great deal from these, but I still found the lack of examples within my own particular domain to be a barrier. Talis provide an excellent tutorial that you can sit and listen to, but the real-world example is for a whiskey distillery. It somehow seems a long way away from an archival description! But, of course, for others who want to output their finding aids as Linked Data in the future, we should start to see models developing that they can use, with examples and information to help (Locah, we hope, being one source of help).

Expertise and experience

The Locah team has a variety of expertise and experience, but I would undoubtedly be struggling a great deal more if we had not had the input of Pete Johnston from Eduserv, who has been very much involved in the EAD modelling. Whilst it is important (and pleasant) to give credit where it’s due, the real point here is that I think a certain level of expertise is important in order to model data and output RDF. I have experience as an archivist and understand EAD and metadata; Pete also has experience of working with archival descriptions, as well as substantial experience of metadata standards and issues around the Semantic Web and technical interoperability. We also have Bethan Ruddock working with us, who now has 18 months’ experience of working with EAD descriptions, and is a trained librarian. That is just the core team looking at the archival data modelling. In addition, the expertise of UKOLN will come into play with other aspects of the project.

I find it hard to see how this sort of work could currently be done by a team with substantially less experience in these sorts of areas. However, it is important to state that we will also be working with Talis, who have a great deal of expertise in Linked Data. They are providing access to their own Triple Store and other benefits that we can take advantage of. Others thinking of outputting Linked Data could look to involve companies like Talis more heavily, thus taking advantage of their expertise and requiring less in-house expertise.

The benefits of data modelling

One of the areas that I spent most time trying to find good tutorials about was data modelling. I may have missed some things that would have been very useful, but as it is I found that there simply wasn’t enough helpful information about how to create a data model. Better guidance would have saved me quite a bit of time, because I think the data model is so central to what we are doing and provides such an effective way to visualise the entities and relationships between them. I think this was partly a case of examples being too simplistic, and partly a lack of data models that used catalogue data – not necessarily archival finding aids, but at least something similar.

The data

I think that we are going to find challenges around the actual content. There are numerous examples of inconsistencies, such as where the ‘creator’ is ‘Joe Bloggs and others’ rather than just a name, or where the access points do not have rules or a source associated with them. I’ve just found some descriptions where the content for the ‘extent’ should really be in the ‘scope’. Some descriptions have rather unsatisfactory references, some do not include the language field, a few do not even include the creator field. For some fields we will just be outputting literal values, but for others consistency would help a great deal with the creation of RDF, particularly when thinking about the vocabulary (or predicate) that we use to define the relationship between a subject and object. This is the challenge of creating Linked Data for descriptions that have been created by 200 different institutions over several decades and by hundreds of different people. We’ll have to see how it goes!
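To make the kinds of inconsistency described above a little more concrete, here is a minimal sketch of how descriptions might be checked before any RDF is produced. It is an illustration only: the record structure, the field names and the identifiers (`example-1` and so on) are invented for this example and are not the Hub’s actual EAD data or the Locah approach.

```python
import re

# Hypothetical, simplified records. Real descriptions are EAD XML; these
# dictionaries and their keys are assumptions made for illustration.
descriptions = [
    {"id": "example-1", "creator": "Joe Bloggs and others", "language": "English"},
    {"id": "example-2", "creator": "Joe Bloggs", "language": ""},
    {"id": "example-3", "language": "English"},
]

def find_issues(record):
    """Return a list of human-readable problems found in one record."""
    issues = []
    creator = record.get("creator", "").strip()
    if not creator:
        issues.append("missing creator")
    elif re.search(r"\band others\b", creator, flags=re.IGNORECASE):
        # A value like 'Joe Bloggs and others' is not just a name.
        issues.append("creator is not a single name")
    if not record.get("language", "").strip():
        issues.append("missing language")
    return issues

# Map each record identifier to the problems found in it.
report = {record["id"]: find_issues(record) for record in descriptions}
```

A report like this would not fix anything by itself, but it might help to show how widespread each problem is before deciding which fields to output as plain literal values.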

The issue of access points

Within EAD there are access points, or index terms, associated with the description. These are most commonly subject, name and place. We’ve found that establishing the nature of the relationship between the unit of description and the access point is not easy. It looks like the relationship is going to be something very unspecific, such as ‘associatedWith’. I’m not sure yet whether this has any implications…
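As a rough sketch of what such an unspecific relationship might look like as triples, the example below builds (subject, predicate, object) statements for a description and its access points. Everything under `http://example.org/` is a placeholder, and the `associatedWith` property URI and the URI patterns are assumptions made for illustration, not the vocabulary or patterns the project has settled on. Only `rdfs:label` is a real, existing property.

```python
# Placeholder namespace: not a real Locah or Hub URI pattern.
BASE = "http://example.org/"
ASSOCIATED_WITH = BASE + "vocab/associatedWith"  # hypothetical property
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

def slug(label):
    """Turn a label into a simple URI path segment."""
    return label.lower().replace(" ", "-")

def access_point_triples(description_uri, access_points):
    """Build (subject, predicate, object) triples for each access point.

    access_points is a list of (kind, label) pairs, where kind is
    'subject', 'name' or 'place'.
    """
    triples = []
    for kind, label in access_points:
        thing_uri = f"{BASE}{kind}/{slug(label)}"
        # The same unspecific relationship is used whatever the kind.
        triples.append((description_uri, ASSOCIATED_WITH, thing_uri))
        triples.append((thing_uri, RDFS_LABEL, label))
    return triples

description = BASE + "description/example"
triples = access_point_triples(
    description,
    [("subject", "Town planning"), ("name", "Barbara Castle"), ("place", "Manchester")],
)
```

Because every access point gets the same predicate, the triples say that the description has something to do with Manchester, but not whether the material was created there, is about it, or simply mentions it.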

Conclusions

For me, after a few weeks away from thinking about Locah and Linked Data, getting back into the whole mindset actually takes about an hour and a nice cup of tea. In other words, the mindset I require to think about Linked Data currently feels separate from my normal working mindset. I think this is because LD requires something different. This in itself makes it quite challenging. It doesn’t fall naturally into what we do in the Hub and how we think about metadata.

However, the very big plus with this different kind of thinking is that, almost by definition, it puts what the user is interested in at the forefront of your thinking. Well, maybe I should qualify that: I believe it puts what the user is interested in at the forefront. This is because we understand that users of archives are usually primarily interested in individuals, families, organisations, subjects and places. What they want is information on Sir Ernest Shackleton, Barbara Castle, Victorian theatre, town planning, a local business, a scientific organisation, the history of Manchester, the industry of Sheffield, or anything else. They don’t tend to know that they want to access a particular archive. Or if they do, it is often due to an assumption that there is ‘an archive’ on the person or organisation that they are researching. Even if there is an archive, there may be a misplaced assumption that this archive is pretty much all the stuff about that entity. Furthermore, there are going to be a great many researchers out there who will not be aware of archives and how to access them. Linked Data provides a way to link archives into…well, into just about anything else.

9 Responses to “Creating Linked Data: more reflections from the coal face”

  1. Having found inconsistencies with data even within a very limited collection from a single institution, I can only sympathise with you dealing with a much more varied set of records!

    I guess in a project like this, data quality is always going to come up, and it is sometimes necessary to walk a line between working with what you have, and improving the data. One route might be to have elements that simply recreate data from the original record and some that represent data (where possible) in an ‘improved’ way.

    For example in the British Library RDF representation of BNB records I note that they seem to just be dumping the contents of the 250$a field (edition) as “isbd:hasEditionStatement”. While this gets the data out there in RDF, it seems a bit of a missed opportunity (to me) of expressing the edition as a pure integer (‘1’, ‘2’) as opposed to the textual content of 250$a (‘2nd’, ‘3rd’).

    However, actually the 250$a can hold different kinds of edition statement, and this won’t always translate to an integer (e.g. could be something like ‘Special education ed.’)

    In this case I wonder if there is an opportunity to do both where it is easy and leave it where it is hard – so if 250$a is simply ‘2nd ed.’, grab the number and put it into a new data element, but where it isn’t so obvious just leave it as the ‘edition statement’ they are already producing.

    It feels like the translation to RDF gives this opportunity as you can do this easily without a huge overhead?

  2. Jane says:

    Yes, this is an interesting area for exploring. It’s funny really. We had quite a shift in emphasis with the Hub data over the last few years, to the point where we are very firmly in the camp of ‘we don’t change the data because it is the contributors’ data’. This makes me think that it might be worth consulting contributors on this idea of making some changes – not to the essence of the content but just to help with consistency and meaning. Some things would, I hope, be uncontroversial, such as changing references slightly to make them all consistent (this is to do with what goes into attributes and what is content), or adding the language – although we can’t assume that it is English of course.

    If I could make one change to the data – what would that be? That’s an interesting question and I think it might be that I would add geographic names as index terms. That’s because I can see so much potential in geodata in terms of use cases and visualisations. Some descriptions that have come to us from exports just have the geographic location as ‘UK’. Sigh. But many don’t use this field at all, even when it would be very relevant to do so.

  3. Jakob says:

    There are very good tutorials on data modeling. Just because they are mainly focused on relational databases does not mean they are not helpful at all! For instance Silverston’s “Data Model Resource Book” series and Simsion’s “Data modeling essentials”. My favorite is “Information Modeling and Relational Databases: From Conceptual Analysis to Logical Design” by Terry Halpin – it gives a good overview of different data modeling techniques and, in my view, provides a better alternative to UML and ERM. By the way, a major drawback of RDFS and OWL is their lack of a *visual* modeling language.

    @Owen Stephens As you already pointed out “edition” does not simply map to integers. Maybe it is better to model editions via “nextEdition” and “prevEdition” relationships? Integers imply intervals, but editions are ordinal only. I bet there are even cases when editions form a directed acyclic graph instead of a simple order.

  4. Another thought on the back of your comment – I think another opportunity is to get others contributing to this information. So, perhaps suggest a way of enabling ‘the public’ (basically, anyone) to make statements about location which you capture as linked data statements. You could both provide an easy to use interface that you support (perhaps they can pin stuff on a map, which produces some triples?), and simply offer advice to linked data experts/tinkerers: “this is how we’d like you to make ‘location’ statements about our stuff”.

    Where you get reliable information (e.g. confirmed by a number of people, or by specific people), you could then offer this back to the source archives… But even if they didn’t take it, the ability to make this kind of statement is, of course, built into the linked data model – so it may not matter so much whether the source archives want to integrate it back into their source data.

  5. Jane Stevenson says:

    Yes, that’s got potential. We have had the idea of user contributions to data for a while. I think LD only strengthens the case for doing this, with the principle of data enrichment through different sources.

  6. Lukas Koster says:

    Good to read that we’re not the only ones struggling with practical implementation issues of linked data. I will publish a blog post about our project this week (URL will be http://commonplace.net/2010/10/dutch-culture-link/)

  7. [...] blog. Pete Johnston has also posted about our approach to URI patterns, and our blog post on the challenges of exposing linked data has been well [...]

  8. [...] have also been finding numerous examples of inconsistencies, such as where the ‘creator’ is ‘Joe Bloggs and others’ rather than just a name for [...]
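The suggestion in the first response, to extract a number from a 250$a edition statement where that is easy and otherwise leave the text alone, could be sketched roughly as follows. This is an illustration only: the `editionStatement` and `editionNumber` keys are hypothetical names, and the pattern only recognises the simplest English forms such as ‘2nd ed.’.

```python
import re

# Matches only very simple statements such as '2nd ed.' or '3rd ed'.
SIMPLE_EDITION = re.compile(r"^\s*(\d+)(?:st|nd|rd|th)\s+ed\.?\s*$", re.IGNORECASE)

def edition_elements(raw_statement):
    """Always keep the original statement; add a number only when it is obvious."""
    elements = {"editionStatement": raw_statement}
    match = SIMPLE_EDITION.match(raw_statement)
    if match:
        # Hypothetical extra element holding the edition as an integer.
        elements["editionNumber"] = int(match.group(1))
    return elements

simple = edition_elements("2nd ed.")
harder = edition_elements("Special education ed.")
```

Here `simple` gains an `editionNumber` while `harder` keeps only the original statement. As the third response points out, such a number is really an ordinal position rather than a quantity, so it should be treated with some care.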