Application profiles and metadata for repositories
  • What is ePub?

    Posted on December 10th, 2010 Talat Chaudhri No comments

    ePub is a standard packaging format designed for ebook readers. Here is the definition given in the entry in Wikipedia:

    [...] ePub [...] is a free and open e-book standard by the International Digital Publishing Forum (IDPF). Files have the extension .epub.

    EPUB is designed for reflowable content, meaning that the text display can be optimized for the particular display device used by the reader of the EPUB-formatted book. The format is meant to function as a single format that publishers and conversion houses can use in-house, as well as for distribution and sale.

    That is to say, ePub contains within it the Open Packaging Format (for convenience, we can ignore the other structural parts for the purposes of this discussion), which defines the structure of both the metadata for the item contained within the file and the presentational (XML, XHTML, CSS) elements of the standard. It is similar in many ways to a .docx file (MS Word 2007 onwards) in being effectively a specialised type of .zip file.
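
    To make the "specialised .zip file" point concrete, here is a minimal Python sketch that builds a toy ePub container in memory and then locates the package (OPF) document inside it. The file name OEBPS/content.opf, the sample title and the helper names build_sample_epub and find_package_document are assumptions made for illustration; a real ebook would also contain a manifest, a spine and content documents.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of META-INF/container.xml, which points at the package document.
CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"

CONTAINER_XML = f"""<?xml version="1.0"?>
<container version="1.0" xmlns="{CONTAINER_NS}">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf"
              media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

# Deliberately incomplete package document: metadata only (illustrative).
OPF_XML = """<?xml version="1.0"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Example title</dc:title>
  </metadata>
</package>"""


def build_sample_epub() -> bytes:
    """Write a toy ePub-style archive into memory and return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The mimetype entry comes first and is stored uncompressed.
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        archive.writestr("META-INF/container.xml", CONTAINER_XML)
        archive.writestr("OEBPS/content.opf", OPF_XML)
    return buffer.getvalue()


def find_package_document(epub_bytes: bytes) -> str:
    """Return the archive path of the OPF file named in container.xml."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
        container = ET.fromstring(archive.read("META-INF/container.xml"))
    rootfile = container.find(f".//{{{CONTAINER_NS}}}rootfile")
    return rootfile.get("full-path")


sample = build_sample_epub()
print(find_package_document(sample))
```

    Because the container is an ordinary zip archive, nothing beyond the Python standard library is needed to reach the metadata, which is part of what makes the format attractive for repository tooling.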

    So why is ePub of interest from the point of view of metadata and application profiles? The IDPF’s Open Packaging Format gives this description:

    Dublin Core metadata is designed to minimize the cataloging burden on authors and publishers, while providing enough metadata to be useful. This specification supports the set of Dublin Core 1.1 metadata elements, supplemented with a small set of additional attributes addressing areas where more specific information is useful. For example, the OPF role attribute added to the Dublin Core creator and contributor elements allows for much more detailed specification of contributors to a publication, including their roles expressed via relator codes.

    Content providers must include a minimum set of metadata elements, defined in Section 2.2, and should incorporate additional metadata to enable readers to discover publications of interest.

    In which case, how is the metadata contained within ePub any different to Dublin Core 1.1? This is the interesting part:

    Because the Dublin Core metadata fields for creator and contributor do not distinguish roles of specific contributors (such as author, editor, and illustrator), this specification adds an optional role attribute for this purpose. See Section 2.2.6 for a discussion of role.

    To facilitate machine processing of Dublin Core creator and contributor fields, this specification adds the optional file-as attribute for those elements. This attribute is used to specify a normalized form of the contents. See Section 2.2.2 for a discussion of file-as.

    This specification also adds a scheme attribute to the Dublin Core identifier element to provide a structural mechanism to separate an identifier value from the system or authority that generated or defined that identifier value. See Section 2.2.10 for a discussion of scheme.

    This specification also adds an event attribute to the Dublin Core date element to enable content providers to distinguish various publication specific dates (for example, creation, publication, modification). See Section 2.2.7 for a discussion of event.
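
    The four attributes quoted above (role, file-as, scheme and event) can be seen together in a small Python sketch. The metadata fragment is invented for illustration: the names, dates and the all-zero identifier are placeholders, and the helper functions agents, identifiers and dates_by_event are hypothetical rather than part of any ePub library. The role values "aut" and "ths" are relator codes for author and thesis advisor.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
OPF = "http://www.idpf.org/2007/opf"

# Invented sample fragment; every value below is a placeholder.
SAMPLE_METADATA = """
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/"
          xmlns:opf="http://www.idpf.org/2007/opf">
  <dc:title>An Example Thesis</dc:title>
  <dc:creator opf:role="aut" opf:file-as="Doe, Jane">Jane Doe</dc:creator>
  <dc:contributor opf:role="ths" opf:file-as="Roe, Richard">Richard Roe</dc:contributor>
  <dc:identifier opf:scheme="ISBN">0000000000000</dc:identifier>
  <dc:date opf:event="creation">2010-11-01</dc:date>
  <dc:date opf:event="publication">2010-12-10</dc:date>
</metadata>
"""


def agents(xml_text):
    """List creators and contributors with their role and normalised name."""
    root = ET.fromstring(xml_text)
    found = []
    for tag in ("creator", "contributor"):
        for node in root.iter(f"{{{DC}}}{tag}"):
            found.append({
                "element": tag,
                "name": node.text,
                "role": node.get(f"{{{OPF}}}role"),
                "file_as": node.get(f"{{{OPF}}}file-as"),
            })
    return found


def identifiers(xml_text):
    """Return (scheme, value) pairs for each dc:identifier."""
    root = ET.fromstring(xml_text)
    return [(node.get(f"{{{OPF}}}scheme"), node.text)
            for node in root.iter(f"{{{DC}}}identifier")]


def dates_by_event(xml_text):
    """Map each opf:event label to the date it qualifies."""
    root = ET.fromstring(xml_text)
    return {node.get(f"{{{OPF}}}event"): node.text
            for node in root.iter(f"{{{DC}}}date")}


for agent in agents(SAMPLE_METADATA):
    print(agent)
print(identifiers(SAMPLE_METADATA))
print(dates_by_event(SAMPLE_METADATA))
```

    Without the attributes, the same fragment would be plain Dublin Core 1.1: two undifferentiated agents, an identifier of unknown origin and two dates that cannot be told apart.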

    Using these additional attributes, it is possible to define more accurately what certain fields contain: roles that distinguish between kinds of agent, a normalised form for agent metadata such as personal names, schemes that tie an identifier to the system or authority that generated it, and events that describe more accurately what has occurred during the life cycle of the item. By applying such constraints that are beyond the scope of DC 1.1, the ePub format effectively contains a de facto application profile, identified by its own namespace. Further, ad hoc metadata can be added using the (X)HTML meta element:

    One or more optional instances of a meta element, analogous to the XHTML 1.1 meta element but applicable to the publication as a whole, may be placed within the metadata element [...]. This allows content providers to express arbitrary metadata beyond the data described by the Dublin Core specification. Individual OPS Content Documents may include the meta element directly (as in XHTML 1.1) for document-specific metadata. This specification uses the OPF Package Document alone as the basis for expressing publication-level Dublin Core metadata.

    It would seem, however, that this last option suffers from the weakness that such metadata is invented on the fly, and does not have to follow the constraints of any schema or authority.
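
    A short Python sketch illustrates the weakness. The meta names and values below are invented, and AGREED_NAMES stands in for a hypothetical locally agreed vocabulary; nothing in the format itself enforces such a list, which is exactly the problem.

```python
import xml.etree.ElementTree as ET

OPF = "http://www.idpf.org/2007/opf"

# Invented ad hoc metadata: two producers describing the same thing
# with different, uncoordinated names ("department" and "dept").
SAMPLE_META = """
<metadata xmlns="http://www.idpf.org/2007/opf">
  <meta name="department" content="Information Services"/>
  <meta name="dept" content="Information Services"/>
  <meta name="review-status" content="peer-reviewed"/>
</metadata>
"""

# Hypothetical local vocabulary; not defined by the ePub specifications.
AGREED_NAMES = {"department", "review-status"}


def ad_hoc_metadata(xml_text):
    """Collect every (name, content) pair from the meta elements."""
    root = ET.fromstring(xml_text)
    return [(node.get("name"), node.get("content"))
            for node in root.iter(f"{{{OPF}}}meta")]


def unrecognised_names(pairs, agreed):
    """Names that a consumer relying on the agreed list would not understand."""
    return sorted({name for name, _ in pairs if name not in agreed})


pairs = ad_hoc_metadata(SAMPLE_META)
print(pairs)
print(unrecognised_names(pairs, AGREED_NAMES))
```

    All three elements are equally acceptable as far as the parser is concerned; only an external agreement, such as an application profile, can say which names are meaningful.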

    Nonetheless, it would seem overall that the ePub “application profile” does significantly add to the functionality of DC 1.1 in a potentially useful way. Different types of agent defined in DC, such as creator and contributor, can be further specified, for example as author, editor, illustrator or thesis supervisor for higher degrees. Potentially, this could be leveraged for use with a number of different types of resources and for various purposes, although ePub by its very nature is designed for reflowable content, which by and large means textual resources such as books, articles, manuals and so on. Illustrations, tables, charts, images and other non-reflowable content can potentially create a problem on the small screens of mobile devices such as ebook readers.

    The structure of this application profile is very simple and easy to use, unlike for example the classic form of SWAP, whose structure is based directly upon its conceptual data model, a simplified version of FRBR. It would be extremely interesting to compare the two, since they are fundamentally similar, relatively simple solutions that are limited in scope to online publications and similar resources. It would be most revealing to see whether what SWAP seeks to achieve can be done in a simpler way, and whether either SWAP or the ePub application profile has functionality that the other cannot provide.

    Ultimately, the purpose of this investigation could be to provide online textual content, for example in repositories, via increasingly popular hand-held devices, and to capitalise on the rapid growth of commercial ebooks. It would probably be necessary to provide .epub files in such systems as well as the usual .pdf and .doc(x) formats that are common in publishing, and consequently in institutional repositories. Either this would need to be done by converting the existing content, and likewise new content after it is deposited, or in addition by providing tools to enable the ePub format to be more immediately accessible to service providers and depositors in future.

    UKOLN is holding an ePub event (unfortunately postponed due to inclement winter weather: new date to be announced), as a collaboration between the Application Profile Support Project and DevCSI, to investigate exactly these issues in a hands-on, practical way: the aim is to get repository managers and other information professionals together with developers and investigate the feasibility of demonstrator solutions that could encourage software development to enable repository content to be available to ebook readers in future.

    The details of the rescheduled event will be announced here in due course. Watch this space!

  • Drupal, RDFa and the “fauxpository”

    Posted on May 19th, 2010 Talat Chaudhri 2 comments

    Drupal 7 is likely to be released soon, and will include native support for RDFa. The RDF module for Drupal 6 already allows this functionality. Why is this important? Because it makes relationships between resources much easier to describe through Drupal’s user-friendly interface and, in the process, would allow documents to be available as linked data.
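
    As a rough illustration of what RDFa adds, the Python sketch below pulls simple subject, property and value statements out of annotated HTML. The markup is invented for this example and is not Drupal's actual output, and the PropertyCollector class is a deliberately simplified reading of RDFa (it ignores nesting, prefixes, rel/resource and datatypes), so treat it as a sketch rather than a conforming RDFa parser.

```python
from html.parser import HTMLParser

# Invented page fragment annotated with RDFa attributes (about, property, content).
SAMPLE_HTML = """
<div about="/node/1" typeof="foaf:Document">
  <h1 property="dc:title">Sample deposit</h1>
  <span property="dc:creator">Jane Doe</span>
  <span property="dc:date" content="2010-05-19">19 May 2010</span>
</div>
"""


class PropertyCollector(HTMLParser):
    """Collect (subject, property, value) statements from simple RDFa markup."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self._subject = None   # most recent value of an "about" attribute
        self._pending = None   # property waiting for its text content

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "about" in attributes:
            self._subject = attributes["about"]
        prop = attributes.get("property")
        if prop is None:
            return
        if "content" in attributes:
            # A machine-readable value overrides the visible text.
            self.triples.append((self._subject, prop, attributes["content"]))
        else:
            self._pending = prop

    def handle_data(self, data):
        text = data.strip()
        if self._pending and text:
            self.triples.append((self._subject, self._pending, text))
            self._pending = None


collector = PropertyCollector()
collector.feed(SAMPLE_HTML)
for triple in collector.triples:
    print(triple)
```

    The point is that the same page serves both human readers and machines: the statements are carried by attributes on the ordinary HTML rather than by a separate metadata file.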

    In Drupal terminology, a “node” is effectively a metadata record, and various Drupal modules enable the easy customisation of metadata. In effect, you could build a repository on the basis of Drupal, by-passing the need for platform-specific knowledge tied to open source software that has increasingly moved towards the “enterprise solution” space, along with all of the technical tie-in that it usually entails. For the service provider, it is not dissimilar to the tie-in experienced with commercial software, especially in the case of information librarians or other professionals who are not developers, or even developers who are not part of that particular open source development team.

    Application Profiles are essentially structured metadata comprising elements and (usually) relationships, and are therefore inherently linked data solutions. They vary in complexity according to their particular functional requirements: for instance, in the world of scholarly publications, there is a spectrum between the straightforward, unstructured way that DSpace implements Dublin Core (which should perhaps be called the DSpace Application Profile), the simplified FRBR structure of the Scholarly Works Application Profile (SWAP) and the complex entity-relationship model of CERIF, the standard developed for Current Research Information Systems (CRISs). This latter is a de facto application profile, even if it is not normally referred to as such.

    Why should Drupal be any better than the repository platforms that already exist? In many ways, it depends on what you need to do with it, and on the resources at your disposal. But the advantage is that Drupal is a flexible Content Management Framework that is designed to be leveraged for any sort of content, and for new modules to be designed easily for new purposes. After all, what do repositories actually do that other websites cannot? They put metadata records and bitstreams (the actual documents or files) on the Web, and add a few additional services such as OAI-PMH, SWORD and statistics. But repositories are only a particular specialised subset of content management systems. Drupal is accessible to any PHP developer without any initial requirement of particular specialist platform knowledge, which is in any case relatively easy to obtain. The community is large and support is quite easily available, as are modules that can be adapted for local purposes. It is designed to be easy to customise and theme.
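
    To give a sense of how thin one of these additional services is at the protocol level, here is a minimal Python sketch that builds an OAI-PMH harvesting request. The six verbs and the oai_dc metadata prefix come from the OAI-PMH protocol; the base URL is a placeholder, the function name oai_request_url is hypothetical, and the sketch only constructs the URL without contacting any server.

```python
from urllib.parse import urlencode

# The six request types defined by OAI-PMH 2.0.
OAI_VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}


def oai_request_url(base_url, verb, **arguments):
    """Build the URL for an OAI-PMH request (no network access is made)."""
    if verb not in OAI_VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb!r}")
    query = {"verb": verb}
    # Skip arguments the caller left unset.
    query.update({key: value for key, value in arguments.items()
                  if value is not None})
    return f"{base_url}?{urlencode(query)}"


# Placeholder endpoint; any platform exposing OAI-PMH would be addressed the same way.
print(oai_request_url("https://repository.example.org/oai",
                      "ListRecords", metadataPrefix="oai_dc"))
```

    A harvester sees only requests of this shape and the XML that comes back, so in principle it makes no difference whether the records are served by a dedicated repository platform or by a general-purpose content management system.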

    Sarah Currier recently talked about the idea of a “fauxpository”. If I remember correctly, she pointed out that it could even be based on WordPress. This is clearly a workable idea, although hardly suitable for production use as a university service. I would maintain that Drupal could easily be suitable for such a use with relatively little work, and could make use of and adapt application profiles in a way that the major open source repository platforms have been slow to do, and are still only just beginning to enable as something of an afterthought. UKOLN is investigating how Drupal can be used to implement the JISC’s Dublin Core Application Profiles (DCAPs), with the aim of showing how they can work independently of tie-in to any specific platform.

  • JISC Repositories and Preservation Programme Meeting, 6-7 May 2009

    Posted on May 8th, 2009 Talat Chaudhri 1 comment

    Application profiles received considerable attention at the two-day Repositories and Preservation Programme Meeting held by JISC at the Aston Business School, Birmingham.

    Workshop: Application Profiles in Practice, 6 May 2009

    This was an event in two parts: firstly, an introduction to the user testing methodology being developed by the AP Support project in collaboration with the IEMSR and the IE Demonstrator project; secondly, an iteration of the paper prototyping element of the user testing. On this occasion the audience consisted largely of experts rather than an especially representative group of typical users – quite understandably, given the nature of the meeting. (While it is very helpful to engage repository managers in user testing, it is more difficult to involve entirely non-specialist users, so there is a need for further work in facilitating this.) The session proved to be a success in raising considerable interest in current developments in application profiles.

    It was always the intention to use this particular event as a platform for consulting colleagues in the repositories community about the usefulness of the approach. In this respect, the workshop was highly successful: attendees responded positively to the intention of engaging users in order to analyse and address the strengths and weaknesses of the various application profiles, raising some insightful questions and contributing to an animated debate. Rachel Bruce of JISC commended the workshop in her speech closing the Programme Meeting on the following day.

    “Working with the Repositories Community: WRAP Project” (Jenny Delasalle, Warwick University), 6 May 2009

    Jenny Delasalle referred to the difficulties faced in pioneering an implementation of SWAP in an institutional repository based on EPrints 3.0. Unlike in its successor EPrints 3.1, versioning was unsupported at the time, which to a great extent hampered the SWAP effort in WRAP at Warwick. She considered that in its present form, SWAP represents too complex a metadata model for adoption by the typical IR. But since there is not necessarily a need to employ all of the SWAP metadata terms (any more than one would necessarily need to employ all of the terms in DC Simple or Qualified DC), it must be presumed that the FRBR structure and the lack of automated means to populate fields with structural metadata represent a significant part of the problem. It would be useful to get a clarification from Jenny on this.

    The possibility that the feasibility of complex metadata schemas could be radically improved by using text mining to autopopulate metadata fields, thus requiring far less input and/or correction from the user, was raised later in the Forum in the discussion “How can text mining support repository tasks?”, convened by James Farnhill of JISC and led principally by Brian Rea of NaCTeM, University of Manchester. This would be of obvious and immediate relevance to the likelihood of SWAP being more widely implemented, whether in its present form or following the recommendations from the user testing effort.

    Repositories Roadmap Session (Rachel Heery, external consultant for JISC), 7 May 2009

    Rachel Heery gave a summary of her Digital Repositories Roadmap Review, revised from the original version by herself and Andy Powell in 2006. Recommendation 11 referred to SWAP specifically, proposing a cut-down version without the FRBR entity-relationship model and a re-analysis of the sort undertaken in the current user testing programme; Recommendation 12 made an interesting reference to OAI-ORE in the context of SWAP.

    Recommendation 11: Explore deployment of a cut down version of SWAP, possibly at the copy level, retaining the cataloguing rules to ensure a consistent approach to linking to full text. Evaluate whether use of SWAP is consistent with a Web architecture approach to repositories.

    Recommendation 12: Explore use of OAI-ORE to enable applications to handle complex objects. Demonstrate how OAI-ORE facilitates the re-use of research outputs and research data. Clarify different roles of OAI-ORE and SWAP.


    There was considerable discussion of SWAP on Twitter among colleagues at Eduserv, UKOLN and elsewhere on both days of the meeting, focussing on both the structure and implementation of SWAP as it was originally intended, and in response to Rachel Heery’s recommendations. The need to solve the lack of implementation of the Dublin Core Application Profiles appears to have regained significant impetus from the interest in the series of user testing events planned by UKOLN. In particular, new impetus has been given to the SWAP implementation effort, in which expectations had previously subsided. Given Rachel Heery’s review, it is clear that SWAP may need to be considered once more as an ongoing project rather than a past product that failed to gain support, and one that may need substantial revision in future iterations. It is important to keep an open mind about the nature of those revisions, which should be conditioned by the results of the ongoing user testing effort.