JISC Beginner's Guide to Digital Preservation

…creating a pragmatic guide to digital preservation for those working on JISC projects

Archive for the 'Archiving' Category

BBC – Opening Up the Archives

Posted by Marieke Guy on 1st August 2012

The BBC has now completed a series of online films looking at research carried out to support the BBC archives. The films are around 10 minutes each, and each looks at a particular area of work.

Posted in Archiving | Comments Off

DPC Technology Watch Reports

Posted by Marieke Guy on 15th May 2012

The Digital Preservation Coalition and Charles Beagrie Limited have announced the continuation of their collaboration, producing three more Technology Watch Reports.

‘Five new Technology Watch Reports have already been produced – or are in production – and have been enthusiastically received’, said William Kilbride of the DPC. ‘The next three will ensure that the production process continues through 2013 with themes and topics proposed and refined by DPC members to help them with digital preservation.’

The three new reports will be:

  • Web Archiving, Maureen Pennock
  • Preserving Computer Aided Design, Alex Ball (jointly with the DCC)
  • Preservation Metadata, Brian Lavoie and Richard Gartner

Two of the reports are completely new, and the third updates one of the more popular reports, which has become dated since it was first published in 2005.

The DPC Technology Watch Report series was established in 2002 and has been one of the Coalition’s most enduring contributions to the wider digital preservation community. The reports exist to provide authoritative support and foresight to those engaged with digital preservation or tackling digital preservation problems for the first time. These publications support members’ workforces; they identify, disseminate and discuss best practice; and they lower the barriers to participation in digital preservation.

‘Each Technology Watch Report analyses a particular topic in digital preservation, evaluating workable solutions, and investigating new tools and techniques appropriate for different contexts,’ explained Neil Beagrie, series editor. ‘The reports are written by leaders in the field and are peer-reviewed prior to publication. The intended audience is worldwide, especially in the UK, Europe, Australia, New Zealand, the USA and Canada.’

‘We expect that these reports will have a wide readership. The audience includes members and non-members of the coalition; staff of commercial and public agencies; repository managers, librarians and archivists charged with managing electronic resources; senior staff and executives of intellectual property organizations in the private and public sectors; those who teach and train information scientists; as well as policy advisors requiring an advanced introduction to specific issues and researchers developing DP solutions.’

Further publicity on each report in the series will be released over the course of the next year, and DPC members will be engaged in the process throughout: draft outlines of each report will be distributed to members for comment; members will be given access to previews before reports are released; and the whole process will be overseen by an editorial board drawn from the DPC.

Posted in Archiving, trainingmaterials | Comments Off

DCC Tools Catalogue

Posted by Marieke Guy on 8th May 2012

The Digital Curation Centre (DCC) has recently updated its catalogue of tools and services for managing and curating research data.

The catalogue is available from the DCC website.

This is more than a new look; the catalogue has been overhauled to focus on software and services that directly perform curation and management tasks. It splits these resources into five major categories, based on who the intended users are and what stage of the data lifecycle they will be most useful in.

There is a category for Archiving and Preserving Information Packages, with sub-categories including:

  • Access Platforms – Tools to publish content and metadata to the web.
  • Backup and Storage Management – Tools to coordinate responsible storage and preservation strategies.
  • Creating and Manipulating Metadata – Enriching object descriptions and standardising records.
  • Emulation – Re-creating obsolete software environments to access old formats.
  • File Format ID and Validation – Defining and validating digital files.
  • Metadata Harvest and Exposure – Using OAI-PMH to share records across repositories.
  • Normalisation and Migration – Transferring digital materials into preservation-friendly formats.
  • Persistent ID Assignment – Creating unique identifiers for digital objects.
  • Repository Platforms – Enabling deposit, preservation, and access to digital content.

Sub-categories contain tables for quick comparison of tools against others that perform similar functions, linked to in-depth descriptions of how the resource can help.
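To make the Metadata Harvest and Exposure category concrete, here is a minimal sketch of the two halves of an OAI-PMH harvest: building a ListRecords request and pulling identifiers and titles out of the XML response. The endpoint URL and the cut-down sample response below are illustrative, not taken from any real repository.

```python
import urllib.parse
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"

def build_request(base_url, verb="ListRecords", metadata_prefix="oai_dc"):
    """Build an OAI-PMH request URL for a repository endpoint."""
    query = urllib.parse.urlencode({"verb": verb, "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"

def parse_records(xml_text):
    """Extract (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(f"{OAI_NS}record"):
        identifier = record.findtext(f".//{OAI_NS}identifier")
        title = record.findtext(f".//{DC_NS}title")
        records.append((identifier, title))
    return records

# A cut-down example response, for illustration only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <dc xmlns="http://purl.org/dc/elements/1.1/">
          <title>A preserved dataset</title>
        </dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""
```

In a real harvest the response would be fetched from the repository's base URL and paged through using the resumption token the protocol defines.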

This resource will evolve; if you have suggestions of tools to add, please send them to info@dcc.ac.uk

Posted in Archiving, dcc, irg, tools | Comments Off

Launch Workshop for DataFlow and ViDaaS

Posted by Marieke Guy on 5th March 2012

Mark Thorley, data management co-ordinator for NERC, set the tone for the day when he explained that “Data management is too important to leave to the data managers; it needs to be an important part of research“. The launch event, held at the Saïd Business School, University of Oxford, on Friday 2nd March 2012 for two new UMF-funded infrastructure projects, was all about embedding research data management (RDM) into workflow using shared services. The UMF programme aims to help universities and colleges deliver better efficiency and value for money through the development of shared services.

Data Management at Oxford

Paul Jeffreys, director of IT, University of Oxford, gave an introduction to current data management practice at the University of Oxford. Activities in Oxford are currently varied and rarely co-ordinated. Although there is an RDM portal comprising a research skills toolkit, an RDM checklist, a University statement on research data management (based on the University of Edinburgh’s ’10 commandments’) and a training programme, there are many people and areas these fail to reach. One area of concern is non-funded research (i.e. people for whom their research is their life’s work). It remains very tricky to build in generic support, and activities need to be flexible.

Introduction to DataFlow

DataFlow was introduced by David Shotton, the DataFlow PI. DataFlow is a collaborative project led by the University of Oxford. It is a two-tier data management infrastructure that allows users to manage and store research data. The project builds on a prototype developed in the JISC-funded ADMIRAL project.

The first tier, called DataStage, is a file store which can be accessed through private network drives or the web. Users can upload research data files and the service is backed up nightly. DataStage is likely to be used by single research groups and deployment can be on a local server or on an institutional or commercial cloud. There is optional integration with DropBox and other Web services.

The second tier is DataBank, which, through a web submission interface, allows users to select and package files for publication. Files are accompanied by simple metadata and an RDF manifest, which is then displayed as linked open data, and are packaged using BagIt. DataBank is a scalable data repository where data packages are published and released under a CC0 licence, though users can choose to keep data private or add an optional embargo period.
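BagIt itself is a simple packaging convention: payload files sit under a data/ directory, alongside a short declaration file and a manifest listing a checksum for each file. DataBank's own packaging code is not shown here; the sketch below only illustrates, with the standard library, what writing a minimal bag involves.

```python
import hashlib
import os

def make_bag(bag_dir):
    """Write bagit.txt and manifest-sha256.txt for the payload files
    under bag_dir/data/, following the BagIt layout."""
    manifest_lines = []
    for root, _dirs, files in os.walk(os.path.join(bag_dir, "data")):
        for name in sorted(files):
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            # Manifest paths are relative to the bag root, with "/" separators.
            rel_path = os.path.relpath(path, bag_dir).replace(os.sep, "/")
            manifest_lines.append(f"{digest}  {rel_path}")
    # The bag declaration identifies the layout version and encoding.
    with open(os.path.join(bag_dir, "bagit.txt"), "w") as f:
        f.write("BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
    with open(os.path.join(bag_dir, "manifest-sha256.txt"), "w") as f:
        f.write("\n".join(manifest_lines) + "\n")
    return manifest_lines
```

Because the checksums travel with the payload, a bag received years later can be re-verified file by file before anything else is done with it.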

DataFlow is now at beta release v0.1. The DataFlow team are keen to build a user community and have a number of processes in place allowing users to comment on developments.

Introduction to ViDaaS

James Wilson, ViDaaS project manager, introduced us to ViDaaS. Virtual Infrastructure with Database as a Service (ViDaaS) comprises two separate elements. DaaS is a web-based system that enables researchers to quickly and intuitively build an online database from scratch, or import an existing database. The virtual infrastructure (VI) enables the DaaS to function within a cloud computing environment; together they are known as the ORDS (Online Research Database Service). ViDaaS builds on ideas developed in the JISC-funded Sudamih project. The ViDaaS service currently has three business models:

  • £600 per year for a standard project (25GB)
  • £2,000 per year for a large project (100GB)
  • Later option for public cloud for hosting

ViDaaS is officially launching this summer.

Further details on interoperability between ViDaaS and DataFlow are contained within the Data Management Rollout at Oxford (DaMaRO) Project.

Both services are seen as supporting ‘sheer curation’: an approach to digital curation where curation activities are quietly integrated into the normal workflow of those creating and managing data and other digital assets. http://en.wikipedia.org/wiki/Digital_curation#Sheer_curation

So Why Use these Services?

Many of the other speakers from the day attempted to convince us of why we should use these services. It seems that, despite the efforts of many, including the DCC, data curation is often seen as a ‘fringe activity’. There are negligible rewards for creating metadata, and there is a noticeable metadata skills barrier: researchers have raw data while institutional repositories sit empty. The principle of ‘sheer curation’ is to let tools work with you rather than against you, by quietly integrating curation activities into the normal workflow of those creating and managing data and other digital assets. Both DataFlow and ViDaaS offer integration with simple workflows and immediate benefits.

Use of shared infrastructure services is supported by JISC. They offer potential cost savings, transferability and reuse of tools.

The key to getting people to use the services lies in getting buy-in from users and allowing flexibility. As user Chris Holland explained, “we are inherently creative people [and] are going to do things in our own way”. There is a need to make services flexible and intuitive, as no system can be all things to all researchers.

What about the Cloud?

Peter Jones, Shared Infrastructure Services Manager at Oxford University Computing Services, began his session introducing the Oxford cloud infrastructure with a quote from Randy Heffner: “The trouble with creating a ‘cloud strategy’? You’re focusing on technology, not business benefit.” He explained that the main barriers to cloud adoption include understanding costs, reliability (network), portability (lock-in), control, performance and security. However, the biggest issue was inertia and reluctance to change. He concluded that a local private cloud overcomes a number of these issues and that the most likely approach is a public-private hybrid.

It is becoming apparent that the cloud exposes a cost that was previously hidden. However, research institutions need to stand by the data they create, so the costs need to be acknowledged and paid. James Wilson, ViDaaS project manager, observed that this is how libraries work, but it is not yet recognised in the research world, where people are still trying to offload costs onto others.

The afternoon breakout allowed more interaction and discussion around some of the highlighted issues, primarily cost, the cloud and national services.

Resources from the day are available on the DataFlow Website.

Posted in Archiving, Events, irg, rdm | Comments Off

DPC Report: Preserving Email

Posted by Marieke Guy on 21st February 2012

The Digital Preservation Coalition (DPC) has released a new report on Preserving Email, authored by Chris Prom, Assistant University Archivist, University of Illinois. The report (available as a PDF at http://dx.doi.org/10.7207/twr11-01) provides a comprehensive advanced introduction to the topic for anyone who has to manage a large email archive in the long term and offers practical advice on how to ensure email remains accessible.

Email is a defining feature of our age and a critical element in all manner of transactions. Industry and commerce depend upon email; families and friendships are sustained by it; government and economies rely upon it; communities are created and strengthened by it. Voluminous, pervasive and proliferating, email fills our days like no other technology. Complex, intangible and essential, email manifests important personal and professional exchanges. The jewels are sometimes hidden in massive volumes of ephemera, and even greater volumes of trash. But it is hard to remember how we functioned before the widespread adoption of email in public and private life.

The report is published by the DPC in association with Charles Beagrie Ltd.

Posted in Archiving | 1 Comment »

Web Archiving and the IIPC

Posted by Marieke Guy on 28th September 2011

There is a great video on the importance of Web archiving courtesy of the International Internet Preservation Consortium. This video is also available in German, Spanish, French, Japanese and Arabic.

The IIPC is the home of world-wide experts in collecting and preserving information from the Web.

Posted in Archiving | Comments Off

Videos from the ICE Forum

Posted by Marieke Guy on 19th August 2011

Some vox pop videos created at the JISC International Curation Education (ICE) Forum are now available:

Stuart MacDonald from EDINA refers to selection and appraisal.

Natalie Walters from the Wellcome Library talks about the need to listen to researchers/users.

Mike Furlough from Penn State University is concerned about building capacity in the libraries to work with researchers.

Bill Veillette from the Northeast Document Conservation Center talks about how to provide effective training.

Posted in Archiving, Events | Comments Off

Update on the LOC Twitter Archive

Posted by Marieke Guy on 3rd June 2011

It’s all been very quiet on the Twitter front at the Library of Congress since their announcement last year, so it was good to see an update written by Audrey Watters from the O’Reilly Radar. The article entitled How the Library of Congress is building the Twitter archive is a write-up by Audrey following a conversation with Martha Anderson, the head of the LOC’s National Digital Information Infrastructure and Preservation Program (NDIIPP), and Leslie Johnston, the manager of NDIIPP’s Technical Architecture Initiatives. It gives us a little insight into how the LOC is dealing with the challenges and opportunities of archiving digital data of this kind.

The article cites the biggest challenges as the size of the archive (we are now producing 140 million tweets per day!), the composition of a tweet (a JSON file with a lot of Twitter metadata) and the layers of complexity (e.g. dealing with all the URL links).
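To make the composition point concrete, the sketch below parses a cut-down tweet and pulls out the fields an archive might index. The structure is modelled on Twitter's v1.1 JSON format; real tweets carry many more metadata fields, and the values here are made up.

```python
import json

# A cut-down tweet, modelled on the Twitter v1.1 JSON format.
RAW = """{
  "id_str": "12345",
  "created_at": "Fri Jun 03 10:00:00 +0000 2011",
  "text": "Read the LOC update http://t.co/abc",
  "user": {"screen_name": "example_user"},
  "entities": {"urls": [{"url": "http://t.co/abc",
                         "expanded_url": "http://example.org/loc-update"}]}
}"""

def archive_fields(raw_json):
    """Pull out the fields an archive might index, resolving the
    shortened links via the entities metadata Twitter supplies."""
    tweet = json.loads(raw_json)
    return {
        "id": tweet["id_str"],
        "posted": tweet["created_at"],
        "author": tweet["user"]["screen_name"],
        "text": tweet["text"],
        "links": [u["expanded_url"] for u in tweet["entities"]["urls"]],
    }
```

Even this toy example shows the layering problem the article describes: the tweet text alone is not the record, and deciding what to do with each embedded URL is a separate preservation question.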

Dealing with these complexities efficiently is big work.

This requires a significant technological undertaking on the part of the library in order to build the infrastructure necessary to handle inquiries, and specifically to handle the sorts of inquiries that researchers are clamoring for….Expectations also need to be set about exactly what the search parameters will be — this is a high-bandwidth, high-computing-power undertaking after all.

No decision has been made yet on which tools to use but the library is “testing the following in various combinations: Hive, ElasticSearch, Pig, Elephant-bird, HBase, and Hadoop“.

We wait with bated breath!

For those who like analogies Martha Anderson has just written an interesting post on how saving digital information is a lot like jazz. In Digital Preservation Jazz Martha talks about the creative, diverse, and collaborative nature of digital preservation.

Posted in Archiving | Comments Off

Preserving your Emails

Posted by Marieke Guy on 2nd March 2011

Anyone who works at the University of Bath will be having a strange week this week… Last week the University email server ‘broke’, and since Thursday afternoon a limited service has been running. We currently have email but cannot see any messages that were sent and/or stored before Thursday afternoon.

A summary of the events leading up to the email downtime and the planned course of action over the next few days is given on the University of Bath Web site:

To briefly summarise the events prior to the downtime:

  • On Monday 21 February at 2pm it was noticed that errors were being detected on the backup mail store, at which point we raised a call to Oracle, the supplier of the components.
  • On the afternoon of Thursday 24 February, errors were spotted, this time on the main mail server.
  • Shortly after, corruption became apparent and the service came to a halt.

The loss of email has left most of us in a bit of a mess – there can’t be many of us who don’t rely heavily on email. Email is now such a core part of our business processes that not being able to refer to old messages or see those that arrived last week (many people were on holiday during the half-term break) is very disorientating.

Brian Kelly has written a thought-provoking blog post asking if the situation suggests that it is Time to Move to GMail?

He argues:

So yes there will be problems with externally-hosted systems, just as there will be problems with in-house systems (and ironically the day before the BUCS email system went down and two days before GMail suffered its problems my desktop PC died and I had to spend half a day setting up a new PC!). It may therefore be desirable to develop plans for coping with such problems – and note that a number of resources which provide advice on backing up GMail have been provided recently, including a Techspot article on “How to Backup your Gmail Account” and a Techland article on “How to backup GMail“.

But in addition to such technical problems there are also policy challenges which need to be considered. At the University of Bath email accounts are deleted when staff and students leave the institution (and for a colleague who retired recently the email account was deleted a day or so before she left). One’s GMail account, on the other hand, won’t be affected by changes in one’s place of study or employment. In light of likely redundancies due to Government cutbacks isn’t it sensible to consider migration from an institutional email service? And shouldn’t those who are working or studying for a short period avoid making use of an institutional email account which will have a limited life span?

Personally, I continue to use Hotmail outside work, but I have no backup plan, and the loss of my messages would be fairly devastating. Even losing my phone contacts left me in a pickle.

The JISC Beginner’s Guide to Digital Preservation has a section on preserving email which references the DCC’s Curating e-mails paper.
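A personal backup plan can be as simple as periodically exporting messages to an open, plain-text format such as mbox, which stays readable whatever happens to the mail server. Here is a minimal sketch using Python's standard library; the addresses and subjects are made up, and fetching messages from a live account (e.g. over IMAP) is left out.

```python
import mailbox
from email.message import EmailMessage

def save_to_mbox(messages, path):
    """Append email.message objects to an mbox file - a plain-text,
    widely supported format that survives mail-server changes."""
    box = mailbox.mbox(path)
    try:
        for msg in messages:
            box.add(msg)
        box.flush()  # write everything to disk
    finally:
        box.close()
    # Reopen independently to confirm how many messages the archive holds.
    return len(mailbox.mbox(path))

def example_message(subject, body):
    """Build a throwaway message for demonstration purposes."""
    msg = EmailMessage()
    msg["From"] = "me@example.org"
    msg["To"] = "you@example.org"
    msg["Subject"] = subject
    msg.set_content(body)
    return msg
```

An mbox file produced this way can be opened by most desktop mail clients, which makes it a reasonable hedge against exactly the kind of server failure described above.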

It’s times like these you really wish you had a plan…

Posted in Archiving | Comments Off

The End of NeSCForge: Preserving Software

Posted by Marieke Guy on 10th December 2010

On 20 December 2010, the NeSCForge service, a collaborative software development tool for the UK e-Science community, will be turned off. The main reason for this is that there isn’t any money to keep the service running. The official message on the site is as follows:

Posted By: David McNicol
Date: 2010-09-27 13:48
Summary: NeSCForge closure 20/12/2010

Dear NeSCForge community,

Because of various grants finishing, we will be losing the IT staff and skills required to keep the NeSCForge service running properly. Rather than leaving it running until something goes wrong with no clear idea of ownership and responsibility for the service, we have taken the difficult decision to shut it down on Monday 20th December 2010.

We would encourage you to review your projects and take copies of any code or documentation you wish to keep before that date. Unfortunately, the software that NeSCForge runs is bespoke and fairly obfuscated so we cannot offer a method of extracting bug reports, forum posts and so on.

If you have any questions, please email them to:


David McNicol

A shame that some of the issues, such as the bespoke nature of the software, were not addressed earlier!

The closure leaves many organisations, including the National Grid Service (NGS), without a software repository. The UK National Grid Infrastructure has now moved its data with the help of the Software Sustainability Institute, which offers a collection of guides including one on Retrieving project resources from NeSCForge. The NeSCForge portfolio of projects includes DIALOGUE, ComparaGRID, BRIDGES and Triana.
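The guides above cover getting code out of NeSCForge itself; once you have a local copy, a simple safeguard before trusting any single new host is to snapshot the working copy with a checksum, so the archive can be verified later. A minimal sketch with the standard library (directory and file names are hypothetical):

```python
import hashlib
import tarfile

def snapshot(project_dir, archive_path):
    """Pack a project directory into a gzipped tar and return its SHA-256,
    so the archive can be verified before the original copy is discarded."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # Use a fixed top-level name so the archive unpacks predictably.
        tar.add(project_dir, arcname="project")
    digest = hashlib.sha256()
    with open(archive_path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Recording the returned checksum alongside the archive (and re-checking it after any copy) guards against silent corruption, which matters more than usual when the original hosting service no longer exists.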

The only option left to many is SourceForge, a resource for open source software development and distribution.

The JISC Beginner’s Guide to Digital Preservation has a section on how to preserve Software.

Is your software held in the NeSCForge service? How sustainable are other services like SourceForge? How do you archive and preserve your software?

Posted in Archiving | Comments Off