Archives Hub Linked Data Release

May 9th, 2011 by Adrian Stevenson

We’re very pleased to announce the release of http://data.archiveshub.ac.uk, the first Linked Data set produced by the LOCAH project. The team has been working hard since the beginning of the project on modelling the complex archival data and transforming it into RDF Linked Data. This is now available in a variety of forms via the data.archiveshub.ac.uk home page. A number of previous blog posts outline the modelling and transformation process, the RDF terms used in the data, and the challenges and opportunities arising along the way. A forthcoming post will provide some example queries for accessing data from the SPARQL query endpoint. The data and content are licensed under a Creative Commons CC0 1.0 licence.

We’re also working on a visualisation prototype. It uses our enhanced dataset to show how the Hub data can be linked with other Linked Data sources on the Web, with the aim of providing a useful graphical resource for researchers.

One important point to note is that this initial release is a selected subset, representative of the Hub collection descriptions as a proof of concept, and does not contain the full Archives Hub dataset at present, although we are very keen to explore this in the future.

We still have some work to do, as this is the initial release of the Hub data. A later release will address a few issues, including reconciling our internal person and subject names, and will add further enhancements to the data: links to Library of Congress subject headings and further links to DBpedia based on subject terms. We also hope to include links for place names using GeoNames and Ordnance Survey.

We encourage feedback on the data, the model and any other aspect of data.archiveshub.ac.uk, so please leave comments or contact us directly.

We are also working hard on our other main LOCAH release, the Copac Linked Data. Our first version of the model for this is now finished, and we have the data in our test triple store. We hope to release this in about a month’s time.

I’d personally like to thank the LOCAH team for all their hard work on this exciting and challenging project. I’d also like to thank our technology partner, Talis, for kindly providing our Linked Data store.

14 Responses to “Archives Hub Linked Data Release”

  1. Richard Light says:

    Hi,

    The extraction of some fields includes the element name and attributes. See for example the Beverley Skinner entry:

    Processing: <p xmlns="">Description by Althea Greenan, MAKE 2002. Submitted to the Archives Hub as part of Genesis 2009 Project.</p>

    Otherwise … good work!!

  2. [...] the Locah announcement. Tags: academia, archives hub, jisc, locah Comment (RSS) [...]

  3. Hi Richard,

    The intent was, for several of the EAD elements, just to pass the content through as an XML Literal, but I’m not sure it’s being handled correctly i.e. the XML markup is being escaped.

    I’m not sure it’s terribly useful to use the XML Literals anyway, so it might be better just to “dumb it down” to a plain literal. I’ll have a look at it….

  4. Richard Light says:

    I didn’t even know that RDF allowed XML literals as a “value”. The Turtle and JSON formats will presumably not support that?

  5. Hi, this looks promising.

    Couple of minor practical suggestions:
    I think it might be useful to add a rollover underline behaviour, to make it clear where one link ends and another begins, when mousing over sequences of links.

    Also, how about including a search within the page? People could use CTRL-F if they know about that, but if they don’t, and just want to scan down to see if a particular string is mentioned, this could be helpful.

    Best wishes
    Martin

  6. JasonZ says:

    Hi Peter,

    Good work!

    I have two questions.
    1. Concepts such as fonds, series, …, file and the concept Level are not defined in the same file. Is there a specific reason for this?
    2. I noticed in one sample that you use the UNESCO thesaurus, but only part of the thesaurus is loaded. Why?

  7. Hi Richard,

    XML Literals for RDF are defined here

    http://www.w3.org/TR/rdf-concepts/#section-XMLLiteral

    I think they are supported in Turtle and the Talis RDF/JSON format

    http://docs.api.talis.com/platform-api/output-types/rdf-json

    as just another typed literal.

    But it seems my misunderstanding of some of the subtleties of XML Namespaces has slightly scuppered my attempt to use them! In the end, I don’t think using the XML Literal adds much so I think a short term fix is to tweak the transform process to replace them with plain literals.
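
A minimal sketch of the short-term fix described above (replacing an XML Literal with a plain literal), assuming the EAD fragment is available as a well-formed XML string. The function name `to_plain_literal` is hypothetical and this is not the LOCAH transform itself, which is not shown in this thread; it only illustrates the idea of keeping the text and dropping the markup.

```python
import xml.etree.ElementTree as ET


def to_plain_literal(xml_fragment: str) -> str:
    """Return the text content of an XML fragment with all markup removed."""
    root = ET.fromstring(xml_fragment)
    # itertext() yields the text of the element and all of its descendants,
    # so element names and attributes such as xmlns="" never reach the output.
    text = "".join(root.itertext())
    # Collapse runs of whitespace left over from the source layout.
    return " ".join(text.split())


# The fragment reported in the first comment, as it would look unescaped.
fragment = (
    '<p xmlns="">Description by Althea Greenan, MAKE 2002. Submitted to '
    "the Archives Hub as part of Genesis 2009 Project.</p>"
)
print(to_plain_literal(fragment))
```

The trade-off is that any inline structure in the original element is flattened, which matches the "dumb it down" approach suggested earlier in the thread.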

  8. Hi Jason,

    Your questions raise some interesting issues, I think. I’m not saying what we have at the moment is “right” but I’ll try to explain how it comes about!

    In separating the URI of the class “Level” from the URIs of the individual levels (instances of that class), we’re essentially partitioning our “URI-space” between:

    - the “ontology”/“vocabulary”: the set of classes and properties, which we hope will remain relatively stable, at least once we get through a bit more testing, and which may be referenced by many datasets (I’m working on another project with another university to create a dataset which will reference the LOCAH classes and properties). Currently we just maintain that data as a “hand-edited” XML document.

    - the “instance data”: the descriptions of archival resources (and people, places etc) derived from EAD docs, which is likely to be more “dynamic”, in the sense that we might tweak the current descriptions (e.g. to fix the XML Literal problem) or add more data over time. This data is derived from the EAD XML docs and stored in a triple store, with the “linked data” pages generated by queries against the store.

    (I think this is what is sometimes referred to as the T-Box/A-Box distinction?)

    Having said all that, the case of the “levels” is arguably a case where there is a set of conceptualisations which is common to multiple datasets, and maybe they should be in our /def/ URI-space.

    And in fact we do supplement the data from the EAD docs (where the info “about” the level is just a minimal mention in an attribute value) with some additional data providing textual definitions, which we load to the triple store alongside the data derived from the EAD docs.

    Which sort of leads into the territory of your second question….

    With the exception of the “levels” case, we aren’t “loading any thesauri” as such. The descriptions of concepts and concept schemes are all generated from the EAD XML data.

    i.e. we coin URIs for concepts only for those concepts that are “mentioned” in the EAD docs (in the controlaccess element). For each of those mentioned concepts, we’re also generating a triple to “say” that it is a member of the named “concept scheme” like

    http://data.archiveshub.ac.uk/id/conceptscheme/unesco

    So when the data is merged in the triple store, we have triples relating that concept scheme to each of the member concepts that were mentioned in the EAD docs. And that is what provides the list on the “description” of the scheme:

    http://data.archiveshub.ac.uk/id/conceptscheme/unesco

    i.e. it’s “saying”:

    “there’s a thing of type skos:ConceptScheme called UNESCO and here’s a list of member concepts from that thesaurus – but there may be other member concepts we don’t know about”

    And if in the future we extend the input dataset and process more EAD docs, then we may find additional UNESCO concepts mentioned, and the list on that page would grow.

    (In an ideal world, I guess we’d just be citing URIs for the UNESCO concepts provided by its maintenance agency, as we do for the language URIs, and we wouldn’t bother providing data.archiveshub.ac.uk URIs for them.)
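
The approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the LOCAH transform: the EAD snippets, the `source` attribute on `subject`, the concept URI pattern in `CONCEPT_BASE` and the `slug` helper are all hypothetical. Only the concept scheme URI pattern comes from the comment above. Triples are modelled as plain Python tuples, so URIs and literals are not distinguished.

```python
import re
import xml.etree.ElementTree as ET

SCHEME_BASE = "http://data.archiveshub.ac.uk/id/conceptscheme/"
CONCEPT_BASE = "http://data.archiveshub.ac.uk/id/concept/"  # assumed pattern
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SKOS = "http://www.w3.org/2004/02/skos/core#"


def slug(label):
    """Hypothetical helper: reduce a label to a URI-safe token."""
    return re.sub(r"[^a-z0-9]+", "", label.lower())


def concept_triples(ead_xml):
    """Coin URIs only for concepts mentioned in controlaccess elements."""
    triples = set()
    root = ET.fromstring(ead_xml)
    for access in root.iter("controlaccess"):
        for subject in access.iter("subject"):
            source = subject.get("source")
            label = (subject.text or "").strip()
            if not source or not label:
                continue
            scheme = SCHEME_BASE + source
            concept = CONCEPT_BASE + source + "/" + slug(label)
            triples.add((scheme, RDF_TYPE, SKOS + "ConceptScheme"))
            triples.add((concept, RDF_TYPE, SKOS + "Concept"))
            triples.add((concept, SKOS + "prefLabel", label))
            # One triple per mentioned concept "says" it belongs to the scheme.
            triples.add((concept, SKOS + "inScheme", scheme))
    return triples


# Two made-up EAD fragments standing in for separate documents.
DOC_A = '<ead><controlaccess><subject source="unesco">Art</subject></controlaccess></ead>'
DOC_B = ('<ead><controlaccess><subject source="unesco">Music</subject>'
         '<subject source="unesco">Art</subject></controlaccess></ead>')

# Merging in the "triple store" is just a set union in this sketch.
store = concept_triples(DOC_A) | concept_triples(DOC_B)
members = sorted(s for s, p, o in store
                 if p == SKOS + "inScheme" and o == SCHEME_BASE + "unesco")
print(members)
```

Because the member list is computed from whatever has been merged so far, processing further EAD documents simply grows the list, which is the open-ended behaviour described for the scheme page.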

  9. Hi Martin

    Thanks for your comments. We’re aware that some of the styling and layout elements are not as user friendly as they could be at the moment. We’ll be collating the feedback regarding usability amongst other things, and we’ll see what we can do, either in the short term, or if it’s more substantial work for a second release.

    Cheers, Adrian

  10. Wow.

    The data looks very different to what we archivists are used to when inputting data or viewing data on the web.
    I think it’s going to take us a while to get our heads around this!

    I know we’ve been talking about it for a while, but this is the first time I’ve seen it for archive data. And the main thing that struck me is that the data is very much for someone else (like a developer) rather than for an archivist.
    It is both ‘our data’ and not our data at the same time… if any of that makes sense.

    Looking forward to seeing more! Brave new worlds and all that
    Teresa

  11. eFoundations says:

    LOCAH releases Linked Archives Hub dataset…

    The LOCAH project, one of the two JISC-funded projects to which I’ve been contributing, this week announced the availability of an initial batch of data derived from a small subset of the Archives Hub EAD data as linked data. The……

  12. [...] comment on the blog post announcing the release of the Hub Linked Data maybe sums up what many archivists will think: “the main thing that struck me is that the [...]

  13. [...] Archives Hub Linked Data Release « LOCAH Project (tags: rdf linkeddata catalogue archives opac library uk bibliothèques metadonnees) [...]