Last week Pete Johnston, Bethan Ruddock and I got together and shut ourselves in a room for 5 hours with a whiteboard, a flipchart and our thinking caps on. Pete has already posted some thoughts about architecture and workflows following this meeting. I thought I would share some more informal thoughts of my own – from the perspective of an archivist and someone gradually getting to grips with Linked Data and RDF modelling.
Now that I understand a bit more about RDF, I can see where some of my misunderstandings were leading me astray. Firstly, it took me quite a while to get away from the idea of modelling the EAD record, rather than the actual data. This might seem obvious to those conversant in Linked Data, but I’ve been dealing with records as the unit of information for the last 20 years. With Linked Data you have to get away from this and think about the actual concepts within the data. The record (the EAD description in this case) exists as an entity along with everything else, but it can be misleading to take it as the starting point for data modelling.
I found actually getting a ‘starting point’ a bit difficult. I think this is because everything can be a starting point, and also because I kept going back to thinking of something like <http://archiveshub.ac.uk/search/record.html?id=gb15sirernesthenryshackleton> as the starting point (the record itself). I then moved away from this and started thinking about the archival creator as a central concept. I knew that in RDF this person or organisation would be a subject. I also knew that this subject would need a URI and that we might want to tell people about stuff related to this subject, but I struggled with how we would provide URIs for subjects like this, and also how we would link the creator as subject to things like the index term subjects.
After a quick chat with Pete Johnston I started to understand the real role of URIs within Linked Data. We are probably going to create URIs ourselves for things (concepts) within the Hub. So, we might create a URI for every archival creator, and a URI for every repository, etc. We agreed that we needed to model the data within our world before looking too much at linking to data outside of it. Whilst I had listened to a good deal of talk about Linked Data, and read a good deal of the literature, I somehow hadn’t quite got the idea that you might create URIs yourself for your own concepts, and that these would be documents in their own right. You can then link to these URIs within your statements, and you can include whatever information you think will be useful within these documents.
For example, we were using the record for Sir Ernest Henry Shackleton (the famous Antarctic explorer) as a sample. He would have a URI – something like archiveshub.ac.uk/id/person/sirernesthenryshackleton. By providing him with a URI we can then create triples (statements) that include this URI. For example:
archiveshub.ac.uk/id/person/sirernesthenryshackleton ‘created’ http://archiveshub.ac.uk/search/record.html?id=gb15sirernesthenryshackleton.
We can then decide what information we will put in this document that identifies Sir Ernest, so that when researchers look up the URI, they get useful information. We can include links to external locations and we can look at using the ‘sameAs’ relationship to link to other representations of the same person.
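To make the idea of a statement a little more concrete, here is a minimal sketch in plain Python that writes the 'created' statement and a 'sameAs' link as N-Triples. The owl:sameAs property is real; the `HUB_CREATED` property URI and the choice of a DBpedia URI as the external representation are assumptions made purely for illustration, not part of our data model.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
# Hypothetical property URI standing in for the informal 'created' relationship.
HUB_CREATED = "http://archiveshub.ac.uk/def/created"

shackleton = "http://archiveshub.ac.uk/id/person/sirernesthenryshackleton"
record = "http://archiveshub.ac.uk/search/record.html?id=gb15sirernesthenryshackleton"
# Illustrative external URI only; which external sources to link to is undecided.
external = "http://dbpedia.org/resource/Ernest_Shackleton"

# Each statement is a (subject, predicate, object) triple of URIs.
triples = [
    (shackleton, HUB_CREATED, record),
    (shackleton, OWL_SAME_AS, external),
]


def to_ntriples(statements):
    """Serialise URI-only triples as N-Triples, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in statements)


print(to_ntriples(triples))
```

Each printed line is one statement: subject, predicate, object, then a full stop.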
Some URIs are fairly straightforward. We will create URIs for archival levels, and then these can in theory be used by others who want to identify levels within the data. For something like language, we will probably use URIs that are already available.
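As a rough sketch of how URIs like these might be generated, the snippet below builds them from a kind (person, level and so on) and a label. The `http://archiveshub.ac.uk/id/` base and the rule of squashing the label into lowercase letters and digits are assumptions based on the Shackleton example above; the function names are hypothetical.

```python
import re
import unicodedata

# Assumed base, following the 'archiveshub.ac.uk/id/person/...' pattern above.
BASE = "http://archiveshub.ac.uk/id"


def slug(label: str) -> str:
    """Reduce a label to lowercase ASCII letters and digits."""
    ascii_text = (
        unicodedata.normalize("NFKD", label)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return re.sub(r"[^a-z0-9]", "", ascii_text.lower())


def mint_uri(kind: str, label: str) -> str:
    """Build a URI for a thing of the given kind, e.g. 'person' or 'level'."""
    return f"{BASE}/{kind}/{slug(label)}"


print(mint_uri("person", "Sir Ernest Henry Shackleton"))
print(mint_uri("level", "Fonds"))
```

A rule this simple would give two different people with the same name the same URI, so a real scheme would need some way of telling them apart.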
It is useful within data modelling to distinguish the real from the conceptual. So, going back to Sir Ernest, he is a flesh and blood person, and he can also be represented as a concept. If we are thinking about subjects used as index terms within the data, you might have ‘Exploration’ as a subject. We want Sir Ernest, the man described within our description, to be associated with this subject, so we can do this by making him into a concept, and giving that concept a URI. We can then link that to a literal value – his name. In our meeting we discussed one of the advantages of conceptual agents as being that we can distinguish between the person or organisation in its entirety and the person or organisation within this particular context. Archives often only represent a small part of someone’s life or an organisation’s activities, so it is helpful to talk about ‘Sir Ernest Shackleton’ as the explorer and leader of the British Antarctic Expedition of 1907-1909.
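One possible way to express the difference between the person and the concept is sketched below in plain Python. It assumes SKOS labels for the concepts and the FOAF property foaf:focus to point from a concept to the real-world thing it is about; the `/id/concept/...` URIs are hypothetical, and none of this is a statement of the final model.

```python
from typing import NamedTuple

FOAF = "http://xmlns.com/foaf/0.1/"
SKOS = "http://www.w3.org/2004/02/skos/core#"


class Literal(NamedTuple):
    """A plain text value, as opposed to a URI."""
    value: str


person = "http://archiveshub.ac.uk/id/person/sirernesthenryshackleton"
# Hypothetical concept URIs: the '/id/concept/...' pattern is an assumption.
shackleton_concept = "http://archiveshub.ac.uk/id/concept/person/sirernesthenryshackleton"
exploration_concept = "http://archiveshub.ac.uk/id/concept/subject/exploration"

triples = [
    # The concept carries the name as a literal value...
    (shackleton_concept, SKOS + "prefLabel", Literal("Sir Ernest Henry Shackleton")),
    # ...and points at the flesh and blood person it is about.
    (shackleton_concept, FOAF + "focus", person),
    # A subject such as 'Exploration' is a concept with no person behind it.
    (exploration_concept, SKOS + "prefLabel", Literal("Exploration")),
]


def focus_of(concept):
    """Return the real-world things that a concept is about."""
    return [o for s, p, o in triples if s == concept and p == FOAF + "focus"]


def label_of(concept):
    """Return the text labels attached to a concept."""
    return [o.value for s, p, o in triples if s == concept and p == SKOS + "prefLabel"]


print(label_of(shackleton_concept), focus_of(shackleton_concept))
print(label_of(exploration_concept), focus_of(exploration_concept))
```

Because the person has his own URI, several concepts could point at him without any of them having to stand for his whole life.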
So, we are now starting to move towards a model where we have URIs for a number of key concepts within the Hub. Our intention is to limit the number of concepts that we create URIs for, at least at this stage. We will also simplify some areas within the EAD modelling that we can then open up for investigation later on. For example, it would be good to look at version control and how we might filter changes to Hub descriptions through to the RDF/XML, but we think that initially it is a good idea to create Linked Data from our basic model so that we can get feedback and also benefit from the learning process.
The main text-heavy field that we are planning to create URIs for at this stage is the Biographical and Administrative History. We haven’t yet explored this thoroughly, but with URIs for archival creators and URIs for administrative and biographical histories, one’s thoughts start to turn to name authorities and EAC-CPF (Encoded Archival Context – Corporate Bodies, Persons and Families – a means to mark up information about archival creators in XML). We are not looking at creating EAC descriptions, but it would be good to keep in line with this in whatever ways we can in order to facilitate the subsequent creation of EAC records, or incorporation of our data into EAC records.
We will soon be able to share our current data model, so keep an eye on our blog. We welcome any feedback that the community might have.
Tags: Archives Hub, barriers, EAD, jiscexpo, linkeddata, locah
The example you give of creating a URI for the ‘concept’ of “Sir Ernest Shackleton as the explorer and leader of the British Antarctic Expedition of 1907-1909” sounds like it might be a good case for using the new foaf:focus. See http://wiki.foaf-project.org/w/term_focus for some more information (note that although at the time of posting the wiki says foaf:focus doesn’t exist, it was, in fact, added to FOAF recently in v0.98: http://xmlns.com/foaf/spec/#term_focus).
Hi Owen,
Funny you should mention that. We are, indeed, planning on using foaf:focus for this. I think we’ll be using it quite a bit, to link concepts to things that are represented in the Hub access points. We can then link several concepts to the same thing.
The issue of modelling the data rather than the record is something I’m struggling with at the moment for our library catalogue records of course materials. These are in MARC format, with quite a bit of local practice dictating where non-standard information (such as the course code) goes.
The MARC records bring together stuff that would be better separated out (especially in terms of ‘carrier’ vs ‘content’). This especially happens where we have several pieces of content in a single carrier (e.g. several items on a single DVD) – as the MARC record tends to focus on cataloguing the ‘carrier’ first, and the information about the ‘content’ finds itself relegated to unstructured fields. I don’t know if EAD gives you a better place to start from this perspective?
Yes, I see what you mean. I think the ‘carrier’ information for us will largely be the properties of the main unit of description – extent and access being two good examples. We’ll have URIs for the finding aid and for the EAD finding aid. After that, it’s basically modelling the content.
…well, I say that… I’ve just had a conversation with Beth and we’re realising that it’s not so easy to separate content from carrier all the time. Makes me think of the whole CIDOC Conceptual Reference Model approach where they are very strict about this, but have ended up with something very complex as a result.
We do have the same issue that much of the content is in more unstructured fields of course. But we have archival creator, repository, language and access points which are quite nicely structured.
Playing with the very unstructured biographical history should be fun…
I’d love to come and have a chat with you about it all – I quite fancy a trip to Leam again!
I’m not sure about the assertion:
archiveshub.ac.uk/id/person/sirernesthenryshackleton ‘created’ http://archiveshub.ac.uk/search/record.html?id=gb15sirernesthenryshackleton
He created the material that is described in the resource:
http://archiveshub.ac.uk/search/record.html?id=gb15sirernesthenryshackleton
but that’s not quite the same thing…
Hi John,
Yes, you’re right. In reality, the triple will need to refer to the unit of description rather than the finding aid. We will also be thinking about the concept of ‘created’, which is different in an archival sense. In fact, within our data model, we do have the finding aid (and the EAD encoded finding aid) separated out from the unit of description.
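A rough sketch of that separation, in plain Python: the person is linked to the unit of description, and the finding aid and its EAD encoding are separate things that describe and encode it. All of the URIs under `/id/unit/`, `/id/findingaid/` and `/id/ead/`, and the property names under `/def/`, are hypothetical placeholders rather than our actual model.

```python
# Hypothetical vocabulary namespace and property names, for illustration only.
HUB = "http://archiveshub.ac.uk/def/"

person = "http://archiveshub.ac.uk/id/person/sirernesthenryshackleton"
# Hypothetical URIs for the three things that a 'record' bundles together.
unit = "http://archiveshub.ac.uk/id/unit/gb15sirernesthenryshackleton"
finding_aid = "http://archiveshub.ac.uk/id/findingaid/gb15sirernesthenryshackleton"
ead_document = "http://archiveshub.ac.uk/id/ead/gb15sirernesthenryshackleton"

triples = [
    # The person created the material (the unit of description)...
    (person, HUB + "created", unit),
    # ...which the finding aid describes...
    (finding_aid, HUB + "describes", unit),
    # ...and the EAD document is one encoding of that finding aid.
    (ead_document, HUB + "encodes", finding_aid),
]


def subjects(predicate, obj):
    """Return every subject linked to obj by the given predicate."""
    return [s for s, p, o in triples if p == predicate and o == obj]


print(subjects(HUB + "created", unit))         # who created the material
print(subjects(HUB + "created", finding_aid))  # nobody in this sketch
```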