Assessing Linked Data

December 1st, 2010 by Jane

For me, the journey from an understanding of modelling data, and creating our own models for the Hub and Copac, to being able to understand the processes and decisions involved in creating RDF XML has been challenging. It has raised one question that often applies when dealing with something quite technical: how much should a manager (in my case an archivist managing an online archive service) be expected to understand the ‘technical’ aspects of something? This is a question I have spoken about and written about before, mainly in terms of what archivists (or other information professionals) in the digital age need to know in order to understand the implications of choices around things like data structure and software systems. In the case of Linked Data, I am still not sure how much I need to know about the detail of Linked Data, the RDF model, the use of RDF XML, the benefits of other output formats, the application of stylesheets, etc. I have been thinking about how hard it is to create Linked Data – I have already had a few enquiries from colleagues who are interested in doing the same sort of thing, and I want to be able to offer useful advice.

One thing that occurs to me is that it is reasonable to acknowledge that Linked Data does involve programming skills, and therefore it is not so dissimilar from structuring and outputting your data through a traditional relational database, for example, where you would expect that specialist skills are needed. But in either case there is the same need for a manager to understand what the system offers and be able to offer the best service to researchers. I think what is important from the point of view of the manager is to be involved with the decision-making process and understand the implications of Linked Data; you need to know what you are saying about your own data. I am not sure that this requires a thorough knowledge of RDF and certainly I only have a rudimentary knowledge of stylesheets (and no knowledge of programming).

Service managers or administrators are not expected to understand systems from an in-depth technical point of view. But in fact, I think that one of the advantages of the RDF model is that it is easier to get a sense of what is going on in terms of data processing than typically occurs with a database management system. After about six months of learning about Linked Data and RDF (I estimate that this translates into about one month of fairly intense learning), I can look at the stylesheet that we have for the Archives Hub, which transforms the EAD data into RDF XML, and I can look at an RDF document representing the entities that we are describing, and I have a reasonably good overall sense of what it all means. This helps me with my main role: to understand the outputs and the potential benefits of Linked Data. For Locah, we’ve used XSLT, but there is no requirement for this, and maybe one of the challenges of outputting Linked Data is that there are a number of options in terms of translating your RDF model into an output.
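
To give a feel for what "translating into an output" involves, here is a minimal sketch in Python rather than XSLT. It is not the Locah stylesheet: the EAD fragment is made up, the `http://example.org/` base URI and the `slug` and `literal` helpers are hypothetical, and the choice of Dublin Core Terms properties is only an assumption for illustration. It reads a few descriptive elements and writes one statement (triple) per element in the plain N-Triples format.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up EAD fragment (not a real Archives Hub record).
EAD_SAMPLE = """
<ead>
  <archdesc level="fonds">
    <did>
      <unitid>GB 0000 EXAMPLE</unitid>
      <unittitle>Papers of an Example Person</unittitle>
      <unitdate>1900-1950</unitdate>
    </did>
  </archdesc>
</ead>
"""

# Hypothetical base URI for the things we describe (an assumption).
BASE = "http://example.org/id/archivalresource/"
# Dublin Core Terms namespace; using it here is an illustrative choice.
DCTERMS = "http://purl.org/dc/terms/"


def slug(unitid):
    """Turn a reference code into a simple URI-safe identifier."""
    return "-".join(unitid.lower().split())


def literal(text):
    """Format a plain string literal, escaping backslashes and quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def ead_to_ntriples(ead_xml):
    """Map a few <did> elements to N-Triples lines about one resource."""
    did = ET.fromstring(ead_xml).find("./archdesc/did")
    subject = "<" + BASE + slug(did.findtext("unitid")) + ">"
    mapping = [
        ("unitid", "identifier"),
        ("unittitle", "title"),
        ("unitdate", "date"),
    ]
    lines = []
    for element, prop in mapping:
        value = did.findtext(element)
        if value:
            lines.append(f"{subject} <{DCTERMS}{prop}> {literal(value.strip())} .")
    return lines


if __name__ == "__main__":
    print("\n".join(ead_to_ntriples(EAD_SAMPLE)))
```

The part a manager most needs to follow is the `mapping` list, which says which element becomes which property. A real transformation would also have to cope with multi-level descriptions and inconsistent data, which this sketch ignores.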

There is no doubt that choices made now about how we model the data will have implications for what users can do with it, and some choices may limit future potential more than others. For example, which ‘things’ do we choose to represent? Should we have a conceptualisation of a person as an entity represented in the description, and link this to a conceptualisation of the person? Which information should we provide as URIs and which as literals? I’m only gradually coming to understand the implications of these decisions, as we start to explore the potential of the data. Of course, this is always true, whatever data, structures and systems we are working with. This brings me to another point that I think is probably particularly relevant for Locah: we are doing this work at a time when we are very much early adopters. Whilst the classic Linked Data diagram may give the impression that the world has embraced Linked Data, the reality is that it is still very much at a hand-crafted level: we have not had tools available to us to aid us in this work, and in the case of EAD, there has been very little activity up till now. It is therefore difficult to judge how feasible it might be to output RDF in the future, as it is likely that more tools will be developed, and there will be greater awareness and skills built up around the whole Semantic Web. However, I wonder if we are currently still at that difficult point where we need to build the momentum of the Linked Data movement, but it is still very unfamiliar and poorly understood by many data providers?
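
The URI-or-literal question can be illustrated with a small sketch. Everything under `http://example.org/` is hypothetical (including the `creatorName` property and the person), and using FOAF terms for the person is an assumption for illustration rather than a description of the Locah model. The statements are written as N-Triples lines, and the `linkable_objects` helper (also hypothetical) simply lists the objects that are URIs, since those are the things other data can point at.

```python
# FOAF and RDF are real vocabularies; everything under example.org is made up.
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"

collection = f"<{EX}id/archivalresource/example-papers>"

# Option 1: the creator is only a string attached to the collection.
as_literal = [
    f'{collection} <{EX}vocab/creatorName> "Example, Jane, 1900-1950" .',
]

# Option 2: the creator is a thing with its own URI, described separately.
person = f"<{EX}id/person/example-jane-1900-1950>"
as_uri = [
    f"{collection} <{FOAF}maker> {person} .",
    f"{person} <{RDF_TYPE}> <{FOAF}Person> .",
    f'{person} <{FOAF}name> "Jane Example" .',
]


def linkable_objects(triples):
    """Return the objects that are URIs rather than literals."""
    found = []
    for line in triples:
        # Subject and predicate contain no spaces, so split off the rest.
        _subject, _predicate, rest = line.split(" ", 2)
        obj = rest.rsplit(" .", 1)[0]
        if obj.startswith("<"):
            found.append(obj)
    return found


if __name__ == "__main__":
    print("literal model:", linkable_objects(as_literal))
    print("URI model:", linkable_objects(as_uri))
```

With the literal, the name is a dead end: nothing else can refer to it. With the URI, the person becomes something that other statements, ours or anyone else's, can be about, at the cost of having to decide what that URI identifies and to maintain it.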

Many Linked Data evangelists claim that Linked Data is ‘easy’. I’m not sure that it is necessarily easy, and I don’t think that it’s very helpful to say that it is easy. Easy compared to what? Easy for whom? It’s easy if you know how, if you have the requisite skills and experience, but we need to persuade people who don’t yet know how that it is worth doing, and provide a realistic assessment of the skills that are required. I suppose the question of how easy it is does rest in large part on the data you are working with as well. Archival finding aids are quite challenging. As Mark Matienzo, archivist at Yale University, states in his presentation on Linked Data and Archival Description: “Archival description is inherently multi-level and relational” and “EAD is both too flexible and too unforgiving” to be Linked Data friendly…and database-friendly for that matter. Also, ISAD(G) recommends the non-repetition of information and archival description generally contains implicit information. I suppose Linked Data might help provide the opportunity and impetus to move towards a more Web-friendly way of describing archives, if it does become more widely used.

At present, I can’t help thinking that if archive repositories and libraries would like to output their data as Linked Data, many of them will struggle, and I would have thought it might be similar for other types of data providers. I do think that expertise is required, and time needs to be invested in understanding some key aspects of Linked Data. On the other hand, this is the case whenever you are looking at creating effective means to output structured (but often inconsistent) data. However, I think that it makes good sense for the Archives Hub and Copac to do this work, as it is on behalf of our contributors, so it effectively will allow these repositories and libraries to output Linked Data.  In other words, it may be that for Linked Data to really take hold, it will benefit from this kind of aggregated set-up, where skills and resources can be pooled. At present, I’m inclined to think that it is worth the investment of time and resources by our Locah team because it is benefitting a large number of data providers. I think it will be important for us to convey to our contributors, and indeed to other archivists and librarians, what we are doing and why, what the implications are and what the benefits may be. I have already had contact with two people, one representing another aggregation of content, interested in benefitting from our work. This is really important, because it potentially makes the investment more worthwhile.

We are in a fortunate position with the Locah project because we are part of a JISC-funded innovations project, with a team of people with a variety of skills, and we have support from Talis, who have significant experience of Linked Data.  If we can work on behalf of our community, then I feel that the time invested may be worthwhile. For the second half of our year-long project we will want to explore the benefits more thoroughly – we will be looking at the crucial issues of creating links to other data, which is really Linked Data’s key selling point, and we will be developing a prototype to show some potential benefits for researchers.

(With thanks to Pete and Ade for their contributions to this blog post).

5 Responses to “Assessing Linked Data”

  1. Hi Jane – lots in this post!

    I think the point that we are on the cutting edge with this stuff is definitely true – as I look at models for bibliographic data I find the ground shifting as (for example) the British Library have done three iterations of their RDF representations of Bibliographic data in the last three months, and not just that, others have taken what the BL have published and then changed and republished, proposing different ways of representing the same data.

    Some of this will be resolved in time – to the extent that we can expect ‘community norms’ to be adopted for representations of this type of data – and I guess for archives etc. too (although we have to recognise that we may see more communities external to the MLA sector publishing similar data but adopting different practices).

    In terms of the skills and knowledge required – some of the really hard stuff may go away – creating models for data from scratch for example – we can guess that as agreement is reached in the community standard models will be incorporated into the tools we use. However, I do think that as ‘information professionals’ we do need to put more effort into understanding how data is modelled and what that means for our ability to manipulate that data.

    I think this is a shift for both information professionals and the technical staff they work with. Previously we have dealt with MARC/AACR2/ISAD(G)/EAD etc. which are perhaps somewhere between formats and models – but certainly for MARC/AACR2 I’d argue the modelling isn’t rigorous. Behind the scenes programmers have had to deal with this, and no doubt created their own modelling within software. With Linked Data this more rigorous modelling is more exposed and up-front – and I think engagement with this from both sides will give us benefits.

    I wrote a blog post a few months ago which laid out my thoughts about the challenges of linked data http://www.meanboyfriend.com/overdue_ideas/2010/04/whats-so-hard-about-linked-data/. My conclusion at the time (and I think I’d still stand by this) was:

    “I used to think the technical aspects of Linked Data were the hard bits – RDF, SPARQL, and a whole load of stuff I haven’t mentioned. While there is no doubt that these things are complicated, and complex, I now believe the really difficult bits are the modelling and reuse aspects. I also think that there is an overlap here with the areas where domain experts need to have an understanding of ‘computing’ concepts, and computing experts need to understand the domain – and this kind of crossover is always difficult.”

  2. Jane Stevenson says:

    Hi Owen,

    Sounds to me like we’re definitely in accord on these issues. Yes, I think the modelling is more exposed and up-front, and I do think that this can be seen as an opportunity for us (info profs), even though initially it might be a little off-putting. I do wonder if ISAD(G) – and maybe other domain standards – are not really up to the task and need updating themselves in order to reflect the Web-driven information environment that we are now living in. I know that there are plans afoot to do this, and I hope that developments such as Linked Data are taken into account.

    You say “I also think that there is an overlap here with the areas where domain experts need to have an understanding of ‘computing’ concepts, and computing experts need to understand the domain – and this kind of crossover is always difficult.”

    Yes, absolutely. I spoke about this at a Society of Archivists’ conference a few years ago and wrote a chapter in ‘What Are Archives’ that expanded on this theme. Part of my aim with Locah is to try to clarify what I think domain experts need to know – where the crossover lies – and it would be interesting to talk to you about this.

  3. “it is reasonable to acknowledge that Linked Data does involve programming skills, and therefore it is not so dissimilar from structuring and outputting your data through a traditional relational database, for example, where you would expect that specialist skills are needed.”

    Data modelling is not programming, although there may be some overlap in skills.

  4. Jane says:

    No, data modelling is not programming. I was talking about the process of creating Linked Data – so this would be getting into the actual output of RDF once you have your data model.

  5. Livia Predoiu says:

    The relationship between data modelling and programming is a little more complex, since there is also the concept of declarative programming in computer science: there you just model the data and its relationships, and the model also declaratively states how conclusions can be drawn automatically. Pretty similar to RDF.

    However, working with data will require more elaborate programming skills in the future, especially when you want to model, extract or visualize the information within.