To read the most up-to-date version of this section, in the context of the entire report, please see our wiki page.
The success of linked library data relies on the ability of its practitioners to identify, re-use or connect to existing datasets and data models. Linked datasets and vocabularies that are essential in the library and related domains, however, have previously been unknown or unfamiliar to many.
The complexity and variety of available vocabularies, overlapping coverage, derivative relationships and alignments all result in layers of uncertainty for re-use or connection efforts. Therefore, a current and reliable bird's-eye view is essential, both for novices seeking an overview of the library linked data domain and for experts needing a quick look-up or refresher for a library linked data project.
The LLD XG thus prepared a side deliverable that identifies a set of useful resources for creating or consuming linked data in the library domain. These are classified into three main groups, which are not mutually exclusive, as shown in our side deliverable: metadata element sets, value vocabularies, and datasets.
- Metadata element sets: A metadata element set is a namespace that contains terms used to describe entities. In the linked data paradigm, such element sets are materialized through (RDF) schemas or (OWL) ontologies, with RDF vocabulary occasionally being used as an umbrella term. It may help to think of metadata element sets as defining the model, as distinct from the instance data (which fall into the value vocabulary or dataset categories below). Some examples:
- Dublin Core defines elements such as Creator and Date (but DC does not define bibliographic records that use those elements).
- FRBR defines entities such as Work and Manifestation and elements that link and describe them.
- MARC21 defines elements (fields) to describe bibliographic records and authorities.
- FOAF and ORG define elements to describe people and organisations, as might be used for describing authors and publishers.
- Value vocabularies: A value vocabulary could be thought of as a specialized dataset that focuses on the management of discrete value/label literals for use in metadata records and/or user displays. Value vocabularies commonly focus on specific areas such as topic labels, art styles, author names, etc. They are not typically used to manage complex bibliographic resources such as books, but they are appropriate for related components, such as personal names, languages, countries, codes, etc. These act as building blocks with which more complex metadata record structures can be built. Many libraries require specific value vocabularies for use in particular metadata elements. A value vocabulary thus represents a controlled list of allowed values for an element. Broad categories of value vocabularies include: thesaurus, code list, term list, classification scheme, subject heading list, taxonomy, authority file, digital gazetteer, concept scheme, and other types of knowledge organisation systems. Note, however, that value vocabularies often have HTTP URIs assigned to the label/value, which can be used in a metadata record instead of or in addition to the literal value. Some examples:
- LCSH defines topics of books
- The Art and Architecture Thesaurus defines, among other things, art styles
- VIAF defines authorities
- GeoNames defines geographical locations (e.g. cities).
- Datasets: A dataset is a collection of structured metadata (i.e., instance data) descriptions of things, such as books in a library. Library records consist of statements about things, where each statement consists of an element (attribute or relationship) of the entity and a value for that element. The elements that are used are often selected from a set of standard elements, such as Dublin Core. The values for the elements are either taken from value vocabularies such as LCSH, or are free-text values. Similar notions to dataset include collection or metadata record set. Note that in the linked data context, datasets do not necessarily consist of clearly identifiable records. Some examples (see also the brief sketch following this list):
- a record from a dataset for a given book could have a Subject element drawn from Dublin Core, and a value for Subject drawn from LCSH.
- the same dataset may contain records for authors as first-class entities that are linked from their books and described with elements such as name from FOAF.
- a dataset may be self-describing, in that it contains information about itself as a distinct entity, for example with a modified date and maintainer/curator elements drawn from Dublin Core.
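To illustrate how the three categories relate, the following minimal sketch (in Python, using the rdflib library) builds a tiny dataset in which Dublin Core and FOAF supply the metadata elements, an LCSH URI supplies a subject value, and the resulting triples constitute the instance data. All of the example.org URIs and the specific LCSH identifier are invented for illustration only.

```python
# A minimal sketch of how the three categories fit together:
#   - metadata element sets: Dublin Core Terms and FOAF supply the elements,
#   - value vocabularies: the subject value is an LCSH concept URI,
#   - dataset: the resulting triples are the instance data.
# The example.org URIs and the LCSH identifier below are illustrative only.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS, FOAF, RDF, XSD

EX = Namespace("http://example.org/")   # hypothetical dataset namespace
book = EX["book/123"]
author = EX["person/456"]
dataset = EX["dataset"]

g = Graph()

# Element set in use: Dublin Core Terms (dcterms:title, dcterms:subject, ...)
g.add((book, DCTERMS.title, Literal("An Example Title")))

# Value vocabulary in use: the subject is an LCSH concept URI, not free text
g.add((book, DCTERMS.subject,
       Namespace("http://id.loc.gov/authorities/subjects/")["sh00000000"]))

# The author is a first-class entity, described with FOAF elements
g.add((book, DCTERMS.creator, author))
g.add((author, RDF.type, FOAF.Person))
g.add((author, FOAF.name, Literal("Jane Example")))

# The dataset describes itself with a Dublin Core element
g.add((dataset, DCTERMS.modified, Literal("2011-06-01", datatype=XSD.date)))

print(g.serialize(format="turtle"))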
Instances of these categories are listed in the side deliverable along with a brief introduction, basic description and links to their locations. For metadata element sets and value vocabularies, use cases collected by the LLD XG are listed under each entry, providing clear context for their usage. For the available metadata element sets, namespaces and descriptions of their domain coverage are briefly presented. Two visualizations are also presented to help reveal the inter-relations of metadata element sets and the relationships between datasets and value vocabularies registered in CKAN.
Our side deliverable aims at broad coverage for each of these categories. However, we are well aware that our report cannot capture the entire diversity of what is out there, especially given the dynamic nature of linked data: new resources are continuously made available, and existing ones are regularly updated. To get a representative overview, we intentionally grounded our work on the use cases that our group has gathered from the community. Additional coverage has been added by the experts who participated in the LLD XG to ensure that the most visible resources available at the time of writing have not been forgotten. Finally, to help make our report useful in the longer run, we have included a number of links to tools and web spaces which we believe can help a reader get a more up-to-date snapshot after this incubator group has ended its work. Notably, we have set up a Library Linked Data group in the CKAN repository to gather information on relevant library linked datasets. We hope to actively maintain this CKAN group, but for the sake of long-term success the entire community is invited to contribute.
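As an illustration, the Library Linked Data group on CKAN could be queried programmatically through the standard CKAN action API. The host and group slug in the sketch below are assumptions and may need adjusting to the actual CKAN instance and group name.

```python
# A minimal sketch of querying a CKAN instance for the datasets registered
# in the Library Linked Data group. Both the host and the group slug are
# assumptions and may need adjusting.
import requests

CKAN_HOST = "http://thedatahub.org"   # assumed CKAN instance hosting the group
GROUP = "lld"                          # assumed group slug

resp = requests.get(
    f"{CKAN_HOST}/api/3/action/package_search",
    params={"fq": f"groups:{GROUP}", "rows": 100},
)
resp.raise_for_status()
result = resp.json()["result"]

print(f"{result['count']} datasets registered in group '{GROUP}':")
for pkg in result["results"]:
    print("-", pkg["name"], "-", pkg.get("title", ""))
```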
The coverage of available metadata element sets and value vocabularies is encouraging. Many such resources have been released over the past couple of years, including some flagship value vocabularies already used by many libraries, such as the Library of Congress Subject Headings or the Dewey Decimal Classification. Reference metadata frameworks are also provided in a linked-data-compatible form, including Dublin Core and various FRBR implementations.
The main concern regarding coverage is the relatively low availability of bibliographic datasets. Admittedly, descriptions of individual books and other library-held items matter slightly less than metadata element sets and value vocabularies when re-use comes into play, and tools like union catalogues already realize a significant level of exchange of book-level data. Yet it remains crucial, and it is truly one of the expected benefits of linked data applied to our domain, that library-related datasets get published and interconnected rather than continuing to exist in their own silos.
The level of maturity or stability of available resources varies greatly. Many resources we found are the result of (ongoing) project work or individual initiatives, and advertise themselves as mere prototypes. The abundance of such efforts is a sign of healthy activity in the library linked data domain. In fact, it should come as no surprise, given that the whole linked data endeavor encourages a much more agile view of data than any previous paradigm. Yet this also jeopardizes the long-term availability of, and support for, library linked data resources.
From this perspective, we find it encouraging that more and more established institutions are committing resources to linked data projects, from the national libraries of Sweden and Hungary, to the Food and Agriculture Organization of the United Nations, not to mention the Library of Congress or OCLC.
Establishing connections across various datasets is a core aspect of linked data technology, and a key condition of its success. Many semantic links across value vocabularies are already available, some of them obtained through high-quality manual work, as in the MACS or CRISSCROSS projects. And many value vocabulary publishers clearly strive to establish and maintain links to resources that are close to theirs. VIAF, for example, merges authority records from over a dozen national and regional agencies. And although quantitative evaluation was outside the scope of our effort, we hypothesize that many more such links are possible. Consumers of library linked data should be aware of the open world assumption that characterizes it, i.e., data cannot generally be assumed to be complete, and more data could always be released for any given entity.
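The sketch below illustrates this "follow your nose" pattern: fetching the RDF description of a VIAF authority and listing its outbound links. The VIAF identifier is illustrative, and the assumptions that the record can be retrieved as RDF/XML from the /rdf.xml address and that links are expressed with owl:sameAs or skos:exactMatch should be checked against the live service; under the open world assumption, the absence of a link in the retrieved data does not mean that no link exists elsewhere.

```python
# A minimal sketch of following links out of a VIAF authority record.
# The VIAF identifier is illustrative; the /rdf.xml address and the link
# predicates (owl:sameAs, skos:exactMatch) are assumptions about the service.
from rdflib import Graph
from rdflib.namespace import OWL, SKOS

viaf_uri = "http://viaf.org/viaf/12345678"   # illustrative identifier

g = Graph()
g.parse(viaf_uri + "/rdf.xml", format="xml")

# List outbound links to other authority files, DBpedia, etc.
for predicate in (OWL.sameAs, SKOS.exactMatch):
    for subj, obj in g.subject_objects(predicate):
        print(f"{subj} --{predicate.n3(g.namespace_manager)}--> {obj}")
```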
A similar concern can be voiced regarding metadata element sets. As witnessed by the LOV inventory, practitioners generally follow the good practice of re-using existing element sets or building application profiles of them, but the lack of long-term support for these element sets threatens their enduring meaning and common understanding. Further, some reference frameworks, notably FRBR, have been implemented in different RDF vocabularies, which are not always connected together. Such a situation lowers the semantic interoperability of the datasets expressed using these vocabularies. Here, we hope that better communication between the creators and maintainers of these resources, as encouraged by our own incubator group or the LOD-LAM initiative, will help to consolidate the conceptual connections between them.
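As a concrete illustration of what such a connection could look like, the sketch below asserts an equivalence between the Work class of the FRBR Core vocabulary and the corresponding class in another FRBR implementation; the second URI is a placeholder and would need to be replaced with the actual class of the vocabulary being aligned.

```python
# A minimal sketch of an alignment statement connecting two independent RDF
# implementations of FRBR. The second URI is a placeholder, not a real
# vocabulary term.
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import OWL

FRBR_CORE = Namespace("http://purl.org/vocab/frbr/core#")
OTHER_WORK = URIRef("http://example.org/other-frbr/Work")   # placeholder

alignment = Graph()
alignment.add((FRBR_CORE.Work, OWL.equivalentClass, OTHER_WORK))

print(alignment.serialize(format="turtle"))
```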
At the level of datasets, one may observe the same phenomenon as for the previous categories. Open Library, for example, has started attaching OCLC numbers to its manifestations. We note, however, that efforts are being undertaken and that the community is already well aware of challenges such as de-duplication.
We also observe that links are being built between library-originated resources and resources originating in other organizations or domains, DBpedia being an obvious case. Again, VIAF provides an example by taking the merged authority records and linking them to DBpedia whenever possible. This illustrates one of the expected benefits of linked data, where data can be easily networked irrespective of its origins. The library domain can thus benefit from re-using data from other fields, while library data can itself contribute to initiatives that do not strictly fall within the library scope. In the same vein, LLD efforts could benefit from the availability of generic tools for linking data, such as the Silk Link Discovery Framework, Google Refine, or the Google Refine Reconciliation Service API. However, the community needs to gain experience using them, sharing linking results, and possibly building more tools that are better suited to the LLD environment.
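To make the linking task concrete, the toy sketch below, which is not Silk or Google Refine but a simplified illustration of what such tools automate, compares labels from two small invented datasets and proposes candidate links above a similarity threshold. Real link-discovery tools add blocking strategies, multiple comparison metrics, and RDF output of the resulting links.

```python
# A toy illustration of the kind of matching that link-discovery tools
# automate: compare labels from two (invented) datasets and propose
# candidate links above a similarity threshold, for human review.
from difflib import SequenceMatcher

source = {
    "http://example.org/authors/1": "Tolstoy, Leo",
    "http://example.org/authors/2": "Austen, Jane",
}
target = {
    "http://example.com/person/a": "Tolstoy, Lev Nikolayevich",
    "http://example.com/person/b": "Austen, Jane",
}

THRESHOLD = 0.6

def similarity(a: str, b: str) -> float:
    """Normalised similarity between two labels."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

for s_uri, s_label in source.items():
    for t_uri, t_label in target.items():
        score = similarity(s_label, t_label)
        if score >= THRESHOLD:
            # Candidate owl:sameAs / skos:closeMatch link
            print(f"{s_uri} <-> {t_uri}  (score {score:.2f})")
```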
I appreciate that the focus of this report is based on "library-held" datasets; however, I don't think the opportunities for journal articles are addressed. Whilst it is unlikely that most academic libraries will be cataloguing each journal article within their own catalogue, especially when Web of Science or SCOPUS are available, end-users don't necessarily see, and perhaps shouldn't have to see, the distinction between material types, or a library's decisions on ownership vs. access within their collection, which will be reflected in the catalogue but not in the service provided. Sorry if this isn't clear, but the service provided by a particular (academic) library is delivered through a portfolio of electronic resource discovery tools, of which only one, the library catalogue, has its content created by the library itself. Limiting linked data potential to "library-held" information may therefore cover only a small part of the information landscape for a particular user, and isn't building bridges to the work going on in the area of linking research data and publications (citing data, etc.).
This is a really useful comment, thanks. I think when we used the expression "library-related resources" we were in fact thinking of data such as scientific publishers' databases. But being more explicit would be useful. In fact, the availability problem may be even more acute for these specific datasets.
I infer from two sentences within this paragraph that datasets are the implementation of value vocabulary terms, structured by a metadata element set, or sets. If true, explicitly saying this would be a helpful picture of linked data architecture. If not true, I’m still in the dark as to how the three resource groups relate.
Datasets are indeed concrete data (e.g., the British national bibliography in RDF) that re-use elements from value vocabularies (e.g., LCSH) and are structured according to the specifications of metadata element sets (e.g., Dublin Core). We have received other comments along the same lines, and agree that the current wording can be improved. We'll try to make our explanations clearer.
This paragraph is unclear. Is it saying that bibliographic datasets for what would commonly be referred to as “library catalog data” have low availability (and if so, can you speculate as to why that is?) or that those datasets ARE available, but that they aren’t that important? This would be an appropriate place to mention the need for software tools that help libraries to convert their bibliographic datasets to linked data.
Would like to see some expansion of this topic. This is a very important consideration for the migration strategies recommendation later in the document.
An analysis of which datasets and which vocabularies use which ontologies (metadata fields) would be helpful to get a better idea of which ontologies are widely used and which are less common.
Thank you. These are useful resources; one of the difficulties with starting the BNB linked data project was knowing what is available and useful.
A minor point, but LCSH is not limited to the topics of books.
Maintenance and development of these deliverables is highly desirable and we welcome the steps that have been taken.
The British Library has recently made available a preview of the British National Bibliography Dataset (http://www.bl.uk/bibliographic/datafree.html). The difficulties involved in this undertaking were considerable and go a long way to explaining the lack of published datasets.
The BL chose to make BNB the focus of its linked data work because it is a large data set for which the BL, as the national library of the United Kingdom, is responsible and its scope can be reasonably clearly defined.
This is partly a reflection of the [im]maturity of the technologies and the complexity of applying linked data to MARC data, which, as is acknowledged elsewhere in the report, libraries are currently locked into. What is needed are models and tools to enable the conversion of MARC data. The BL is looking at what would be necessary in order to release the tools we have used/created.
CKAN is mentioned here for the first time. Thus, it should be shortly explained or referred to a section where it is explained.
Thanks a lot for spotting this, Adrian. We’ve updated the end of that section, see http://www.w3.org/2005/Incubator/lld/wiki/Draft_Vocabularies_Datasets_Section
We hope it’s alright!
Two JISC-funded projects based around Cambridge, Open Bibliography and COMET, have made large library-related datasets available, but this only goes so far.
I second Catherine's point about article / citation level data; there is serious value here.
Furthermore, libraries could consider exposing operational data, holdings and anonymised circulation information to facilitate a richer range of interactive and recommendation based services.
Seconded. Matching against identifiers takes time and is prone to error. Some recommendation on which ones to focus on would be great.