JISC Beginner's Guide to Digital Preservation

…creating a pragmatic guide to digital preservation for those working on JISC projects

Archive for the 'Case studies' Category

Had your Heart Broken by Data Loss?

Posted by Marieke Guy on 13th October 2011

Then maybe it’s time to share the pain…

The National Digital Stewardship Alliance (NDSA) Outreach group are collecting stories about data loss and preservation. If you’ve had your heart broken or uplifted by data loss or preservation, please fill out the form at http://j.mp/datastories.

There’s no deadline, but they will probably be taking a look at what they have in November.

Posted in Case studies | Comments Off

The End of Delicious?

Posted by Marieke Guy on 17th December 2010

Oh dear, another ‘end of’ post. Are we going to see a lot more termination of services as the economic situation really starts to hit the Web?

Yesterday, news that Yahoo plans to kill off a handful of services (including Yahoo Buzz, Altavista and the bookmarking service Delicious) made it into the mainstream. The source was an internal Yahoo slide showing future plans leaked by a Yahoo employee, Eric Marcoullier. Yahoo have recently had to implement cuts and lay off staff. In response to the leak a company spokesperson explained:

Part of our organizational streamlining involves cutting our investment in underperforming or off-strategy products to put better focus on our core strengths and fund new innovation.

Delicious is a very well used service. I personally have been a member for several years and currently have 1295 links bookmarked. The service is embedded in many of my Web pages and on my blogs. For my remote worker blog I have even set up a Google custom search allowing searching of the 300+ remote working urls I have collected. The JISC Beginner’s Guide to Digital Preservation also has 300+ urls associated with it.

We are all aware that Web 2.0 services come and go and that this has many implications for digital preservation. The JISC PoWR project took a look at related issues and offered a set of pragmatic guidelines on the approaches we can use to safeguard our data. Here is a chance to put theory into practice…

On hearing the news about Delicious my initial reaction was one of panic…all my urls would be lost! This isn’t actually the case. The termination of the service has yet to be confirmed and already several campaigns have sprung up (act.ly, save delicious, …) petitioning to save the service. Also one would like to believe that if the service is to be terminated users would be given advance warning on a switch off date which will give them the opportunity to get their data out. Whatever happens it makes sense to take action to protect any investment you have in Delicious.

Exporting Data

Delicious has an Export / Download Your Delicious Bookmarks feature. This is available from the Settings tab, under the bookmarks subheading. This will allow you to save the generated page (as HTML) and import it into your browser, or anything else that accepts bookmarks in a standard format. Save the delicious html file somewhere safe.

Although this now means that you have a copy of your urls (which is a step in the right direction) you really need to import them into another bookmarking service to make use of tags, bundles and other functionality.
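If you want to check what actually ended up in the exported file before importing it elsewhere, a short script can list each url with its tags. The sketch below is only an illustration: it assumes the export is a Netscape-style bookmark file in which each link is an A element carrying HREF and TAGS attributes, and the sample data and the names BookmarkExportParser and parse_bookmarks are made up for this example.

```python
from html.parser import HTMLParser


class BookmarkExportParser(HTMLParser):
    """Collect links from a Netscape-style bookmark export (assumed format)."""

    def __init__(self):
        super().__init__()
        self.bookmarks = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag == "a":
            attributes = dict(attrs)
            raw_tags = attributes.get("tags") or ""
            self._current = {
                "url": attributes.get("href") or "",
                "title": "",
                "tags": [t.strip() for t in raw_tags.split(",") if t.strip()],
            }

    def handle_data(self, data):
        # Text inside the open <A> element is the bookmark title.
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["title"] = self._current["title"].strip()
            self.bookmarks.append(self._current)
            self._current = None


def parse_bookmarks(html_text):
    """Return a list of {'url', 'title', 'tags'} dictionaries."""
    parser = BookmarkExportParser()
    parser.feed(html_text)
    parser.close()
    return parser.bookmarks


# Made-up sample in the assumed export format.
SAMPLE = """<!DOCTYPE NETSCAPE-Bookmark-file-1>
<DL><p>
<DT><A HREF="http://example.org/preservation" ADD_DATE="1292544000" TAGS="preservation,jisc">Example preservation page</A>
<DT><A HREF="http://example.org/remote" ADD_DATE="1292544001" TAGS="remoteworking">Remote working</A>
</DL><p>"""

if __name__ == "__main__":
    for bookmark in parse_bookmarks(SAMPLE):
        print(bookmark["url"], bookmark["tags"])
```

To run it against a real export, read the saved html file into a string and pass that to parse_bookmarks. If your file has no TAGS attribute the tag lists will simply be empty.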

Lots of people are turning to Diigo (there is a page on how to import bookmarks from Delicious), other options include Connotea, Citeulike, Trunk.ly and Stumbleupon – a more comprehensive list is available from Wikipedia. SearchEngine Land have also compiled a list of their 10 best alternatives to Delicious.

Web pages that use Delicious

Some thought also needs to be given to the other ways you use Delicious – in Web pages, on blogs etc. Until there is confirmation that Delicious is going it seems a little early to act here. Diigo can do most of the things Delicious does, so it will be a case of using it from now on and at some point changing all embeds. What I personally will be doing is compiling a list of the places in which I currently use Delicious. All very time consuming and maybe something I should have already been doing?

Digital preservation of Web 2.0 services is an important area but not something people have given much consideration to in the past.

It seems that there may suddenly be a lot more case studies for us to consider…

Posted in Case studies | 3 Comments »

Case study: Archiving the JISC PoWR blog

Posted by Marieke Guy on 11th October 2010

A while back I wrote the following case study on how we had archived the JISC PoWR blog. This case study sits alongside the Approaches To Archiving Professional Blogs Hosted In The Cloud paper Brian Kelly and I wrote for iPres 2010.

Defining Archiving

As the Digital Preservation Coalition explains, the term digital archiving is used differently within different sectors.

The library and archiving communities often use it interchangeably with digital preservation. Computing professionals tend to use digital archiving to mean the process of backup and ongoing maintenance as opposed to strategies for long-term digital preservation.

From http://www.dpconline.org/advice/preservationhandbook/introduction/definitions-and-concepts

In this case study the ‘archiving’ term is being used to describe ways in which blog content can be migrated to alternative environments in order to satisfy a number of business functions, including the re-creation of the original environment. The approaches taken are steps in capturing blog content and considerations about short-term continuity. They involve a mixture of technical considerations, management and agreed policy decisions. It is possible that they could be used at a later stage as part of a digital preservation strategy. Such a strategy may become more necessary if continued access to the original resource is no longer available.

About

The JISC PoWR (Preservation of Web Resources) project was funded by the JISC and provided by a partnership of UKOLN and ULCC. The project ran from April – November 2008. A WordPress blog was used to support the project work which was hosted by the JISC on their jiscinvolve platform.
Content for the blog was provided by staff from the two partner organisations. It was agreed in advance that blog posts would be published under a Creative Commons licence and a statement to this effect was provided on the blog. In the academic sector use of Creative Commons (a way to communicate which rights creators reserve and which they waive with regard to content) is seen as good practice and recommended by the JISC. It was agreed that having a licence in place would avoid possible confusion regarding ownership of the content and lay an effective foundation for future archiving.

A decision was made to host the blog on a platform provided by the project’s funding body rather than using the host institution of either of the project partners. Although this avoids the risk of unanticipated changes to terms and conditions for the service the PoWR team are aware that expected cuts in funding for higher education could result in withdrawal of the service or a failure for the service to be developed.

The Preservation of Web Resources (PoWR) project was funded by JISC to organise a series of workshops and produce a handbook that specifically addressed digital preservation issues of relevance to the UK HE/FE web management community. With preservation forming the heart of the project, and bearing in mind the reliance on a service that could be withdrawn, there has been an interest in blog archiving and migration from the outset.

Background

After the end of the project in November 2008 the original intention was to continue to publish occasional posts on the JISC PoWR blog related to Web preservation issues. These posts would be published at regular intervals but at a significantly lower frequency than when the project was active. It was initially agreed that the team could continue to provide at least 3 posts per month and the blog would be regarded as still functioning. This happened over the 2008 – 2009 period. Initially the intention was to allow the blog to be reused if additional funding became available to continue the JISC PoWR work in providing advice on best practices for the preservation of Web resources. However, although the team were successful in obtaining additional funding, this covered a broader area than that of Web preservation.

Reusing the blog?

In April 2010 one of the project team began work on a new project (The JISC Beginner’s Guide to Digital Preservation). A proposal was put forward to use the PoWR blog for this new project. There were a number of reasons given for this possible approach. The PoWR blog has lots of valuable content and it was felt that reusing it could be a way to keep it alive and channel effort into its upkeep. It was also likely that the new guide would draw on much of the JISC PoWR work already carried out and tackle many similar areas, so there would be a lot of overlap. Initially there was a view that reusing the blog could tie in with ‘green’ ideas about reusing and repurposing content.

However, reusing the blog would require that substantial changes were made to it, including renaming, removing pages, re-skinning the site and retagging old entries. After an interesting discussion it was decided that this approach was not appropriate and that the ‘green’ angle was actually a ‘red herring’: no resources or time would be saved in the long run. There were also concerns that it might weaken the integrity of the JISC PoWR blog as a ‘record series’.

In the world of paper and non-digital records, the approach of using existing files and adding new records on top of them is sometimes taken. This approach often results in confusion and extra work for the administrators, the records manager and the archivist, as it disrupts continuity. A record series (as defined by US National Archives) is a group of files or documents that are kept together (either physically or intellectually) because they relate to a particular subject or function, result from the same activity, document a specific type of transaction, take a particular physical form, or have some other relationship arising out of their creation, receipt, maintenance, or use. It was agreed that it was preferable to “close the blog and pass the baton on”. A new blog was created for the JISC Beginner’s Guide to Digital Preservation and it was decided that the PoWR blog would close and a case study would be written about the process to be used by the new JISC Beginner’s Guide to Digital Preservation.

Process

During the life of the JISC PoWR project that created the blog there were many opportunities to consider what elements and behaviours of Web sites, and in turn blogs, are important to capture. Chapter 6 of the JISC PoWR Handbook is entitled ‘What elements do we capture and preserve?’ The 3 key elements are defined as: content, appearance and behaviour. The first two elements are relatively straightforward to capture but the third, behaviour, which includes features like RSS feeds, comments, site administration and tagging features, is less so. It was with a view to capturing all three elements that the approaches taken were chosen.

After some investigation into possible archive methods (see Approaches To Archiving Professional Blogs Hosted In The Cloud for more detail) it was decided that the following would be done. Firstly the site would be frozen but remain accessible for the indefinite future. Then an XML dump would be made of the content in the blog and stored on a UKOLN server. Finally a ‘copy blog’ would be created on the UKOLN Intranet. The XML dump and copy blog would not be for public viewing. In parallel, use would also be made of external archiving such as the Internet Archive and UK Web Archive.

External Archiving: The UK Web Archive

At the start of the project the JISC PoWR blog was suggested by JISC as a possible target for archiving for the UK Web Archive. The site has now been archived on 5 occasions between January 2009 and January 2010. The archive is incomplete as there are no more recent instances (so posts are missing) but it is still a useful resource for the project team and those interested in the look, feel and content of a JISC project blog from 2008 to 2010. For example, when some elements of the blog were lost due to an upgrade the UKWA archived version of the blog provided a simple way to reinstate the look and feel of the blog. The archived version may also prove useful for those interested in the changing nature of blogs and their use instead of project Web sites. More information on why we would want to preserve (e.g. for legal, cultural and reputation reasons) is given in the JISC PoWR handbook.

External Archiving: Internet Archive

The JISC PoWR blog has been archived by the Internet Archive, but for some reason only pages up to July 30th 2008 have been archived. It is possible that a processing delay means that the site has been captured more recently and the captures have not yet been made available. The site has been resubmitted to the Open Directory site. Again this archive is incomplete but still a useful resource for the reasons mentioned under UKWA.
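One way to keep an eye on which captures of a site are publicly available is to ask the Internet Archive programmatically. The following is a minimal sketch, assuming the Wayback Machine availability endpoint at archive.org/wayback/available and its JSON response layout as I understand them; check both against the Internet Archive's current documentation before relying on this. The example.org address and the sample response are placeholders, not real data about the JISC PoWR blog.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Wayback Machine availability service.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_url(site_url, timestamp=None):
    """Build the query URL; timestamp is an optional YYYYMMDD string."""
    params = {"url": site_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return (timestamp, snapshot_url) from a JSON response, or None."""
    payload = json.loads(response_text)
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


def lookup(site_url, timestamp=None):
    """Query the service over the network and parse the answer."""
    query = build_availability_url(site_url, timestamp)
    with urlopen(query, timeout=30) as response:
        return closest_snapshot(response.read().decode("utf-8"))


# Made-up response in the assumed layout, so the parser can be tried offline.
SAMPLE_RESPONSE = """{"archived_snapshots": {"closest": {
    "available": true,
    "url": "http://web.archive.org/web/20080730000000/http://example.org/",
    "timestamp": "20080730000000",
    "status": "200"}}}"""

if __name__ == "__main__":
    print(closest_snapshot(SAMPLE_RESPONSE))
```

Calling lookup with your own blog address and a recent date shows the capture closest to that date, which makes it easy to spot when the newest public capture is much older than expected.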

Upgrade to the blog

As explained, the blog is hosted by JISC Involve, who provide blogs for the JISC community. Until June 2010 JISC Involve was running on an old version of WordPress (1.2.5). In early June 2010 the JISC Digital Communications Team upgraded their server to the latest version of WordPress (2.9.2) and then migrated JISC Involve’s blogs over to the new installation. Although blog posts, comments, attachments, user accounts, permissions and customisations were expected to move over easily, JISC Involve users were encouraged to back up the content of drafts etc. ‘just in case’.

Unfortunately there were some technical problems migrating the content and as a consequence the original theme was lost and URLs now redirect. Luckily the JISC PoWR team were able to locate the original theme and reinstall it. A number of sidebar widgets were also lost during the migration, for example the widget containing licence information. The CC widget and other sidebar widgets were reinstated on 21st July 2010. These occurrences made the team aware of the need to record details of the technical components and architecture of the blog. After an upgrade it is also useful to check whether archive permalinks, embeds and images work and that comments are showing.
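The post-upgrade checks mentioned above can be partly automated. What follows is a rough sketch rather than the procedure the team used: it takes a list of addresses you have recorded (permalinks, image URLs and so on), requests each one and reports whether it is fine, redirected, broken or unreachable. The function names and the example.org addresses are hypothetical.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def fetch_status(url, timeout=30):
    """Return (HTTP status or None, final URL after any redirects)."""
    request = Request(url, headers={"User-Agent": "blog-upgrade-check/0.1"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status, response.geturl()
    except HTTPError as error:
        # The server answered, but with an error code such as 404.
        return error.code, url
    except URLError:
        # No answer at all (DNS failure, refused connection, ...).
        return None, url


def check_urls(urls, fetch=fetch_status):
    """Classify each URL as ok, redirected, broken or unreachable."""
    report = []
    for url in urls:
        status, final_url = fetch(url)
        if status is None:
            outcome = "unreachable"
        elif status >= 400:
            outcome = "broken"
        elif final_url != url:
            outcome = "redirected"
        else:
            outcome = "ok"
        report.append((url, outcome, final_url))
    return report


if __name__ == "__main__":
    # Placeholder addresses; substitute the permalinks and images you recorded.
    for row in check_urls(["http://example.org/", "http://example.org/missing"]):
        print(row)
```

The fetch argument exists so the classification can be tried without network access. A script like this only tells you that a page responds; whether embeds display properly and comments are showing still needs a person to look.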

Freezing the blog

The process of freezing the JISC PoWR blog involved publishing a preliminary post indicating the intention to freeze the blog and asking for comments on the data that should be recorded (Cessation of posts to the JISC PoWR blog). After this post was published a final check of the blog was carried out. Duplicate categories were removed, unsuitable widgets were removed (for example the calendar widget would have little relevance for an archived blog) and checks were made that the blog was functioning correctly.

On Friday 23rd July a final post was published stating that the blog would now be frozen. An audit of the blog technologies used was also carried out and published. This included details of the WordPress plugins installed and theme and widgets used.

Archive Page on JISC PoWR blog

The final statistics were published in an archive page. It contained the following information:

  • Active Dates – These were the dates the blog had run from and to. Dates allow the blog to be put into context by future users.
  • Number of posts – Data on posts shows the scale of the blog and its productivity over its lifecycle.
  • Number of comments – Data on comments shows the interactivity level of the blog. There were some valuable discussions held in the comments field on posts which may have influenced the final PoWR handbook.
  • Akismet (spam catcher) statistics – By preventing spam from being posted to the site Akismet has proved itself to be a valuable tool.
  • Details of contributors – Links were included to the contributor’s staff pages. This will enable those with questions regarding content on the blog to contact the appropriate person and will enable easier reuse of content.
  • Details of blog theme – As mentioned previously recording details of the blog theme can be very useful in allowing the look and feel to be consistent.
  • Details of plugins used – Plugin use will have an effect on the functionality of the blog. Recording details of those used will allow the functionality to be consistent.
  • Details of type and version of software used – Recording technical details will help with any future problems and allow the site creators to have more control over the future degrading process.
  • Blog licence – Clarity with regard to the licence of the blog posts, comments and other items in the blog is important to enable future reuse. The JISC PoWR blog is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 2.0 UK: England & Wales License. Comments posted to this blog also have the same licence. Stating this fact will allow others to reuse the resources in the blog.

To ensure that those who arrive at the blog are clear on its current status the blog title was edited to include the phrase [Archived blog] after the initial title.

Comments were closed. Closing comments retrospectively is not straightforward in WordPress. It can be done on a post by post basis but with a total of 140 posts this would have been too time-consuming. The comments feature was closed by entering the Settings > Discussion area, unchecking ‘Allow people to post comments on new articles’ and checking ‘Users must be registered and logged in to comment’.
An Extended Comments plugin is available to automate this process.

A UKOLN briefing paper is available on policies for blog comments.

XML Dump of blog

On the day the blog was frozen an XML dump was taken of the blog and stored on a UKOLN server. An XML file is created by using the WordPress export function (Tools (left hand contents list) > Export). It is likely that the export process is not the same on different blog platforms but most blogs will have an export feature.
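An export file of this kind can also be used to double check figures such as the number of posts and comments and the active dates. Here is a minimal sketch, assuming the export is the usual WordPress RSS-based file in which each entry is an item element with wp:post_type, wp:status, wp:post_date and wp:comment children. The namespace version differs between WordPress releases, so the code matches on local element names only. The sample data is invented and summarise_export is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def summarise_export(xml_text):
    """Count published posts and their comments, and find first/last dates."""
    root = ET.fromstring(xml_text)
    posts = 0
    comments = 0
    dates = []
    for item in root.iter("item"):
        fields = {}
        item_comments = 0
        for child in item:
            name = local_name(child.tag)
            if name == "comment":
                item_comments += 1
            elif name in ("post_type", "status", "post_date"):
                fields[name] = (child.text or "").strip()
        # Pages, attachments and drafts are skipped (an assumption about
        # what should count towards the published figures).
        if fields.get("post_type") == "post" and fields.get("status") == "publish":
            posts += 1
            comments += item_comments
            if fields.get("post_date"):
                dates.append(fields["post_date"])
    return {
        "posts": posts,
        "comments": comments,
        "first_post": min(dates) if dates else None,
        "last_post": max(dates) if dates else None,
    }


# Invented sample in the assumed export layout.
SAMPLE_EXPORT = """<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.0/">
<channel>
<title>Sample blog</title>
<item><title>First post</title>
<wp:post_date>2008-04-21 10:00:00</wp:post_date>
<wp:post_type>post</wp:post_type><wp:status>publish</wp:status>
<wp:comment><wp:comment_id>1</wp:comment_id></wp:comment>
<wp:comment><wp:comment_id>2</wp:comment_id></wp:comment>
</item>
<item><title>About</title>
<wp:post_date>2008-04-22 09:00:00</wp:post_date>
<wp:post_type>page</wp:post_type><wp:status>publish</wp:status>
</item>
<item><title>Last post</title>
<wp:post_date>2010-07-23 12:00:00</wp:post_date>
<wp:post_type>post</wp:post_type><wp:status>publish</wp:status>
</item>
</channel>
</rss>"""

if __name__ == "__main__":
    print(summarise_export(SAMPLE_EXPORT))
```

Note that the comment count here includes every comment stored in the file for a published post. If the export contains spam or unapproved comments the figure will be higher than the one shown on the live blog.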

This XML was also imported into the UKOLN blogs WordPress system as an internal-facing private blog on the UKOLN Intranet. This has enabled the UKOLN members of the JISC PoWR team to have a better understanding of how the blog was used and to analyse the contents of the blog using a variety of WordPress plugins.

This process had already been carried out with older versions of the blog data. The availability of the backup copy of the blog meant that it was possible to change configuration options, which administrators would not want to do on a live blog. The number of RSS items provided was set to a large number so that the entire contents of the blog posts and comments could be made available via an RSS feed. The RSS feed was used to produce a Wordle word cloud which provides a visualization of the contents of the blog and the comments which have been provided. The RSS feed was also processed by Yahoo Pipes. This enabled the contents of the blog to be processed by an RSS to PDF tool, with a series of PDF files being produced in chronological order (with the capability of applying additional filtering if so desired).
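For those without access to the tools named above, the raw material for a word cloud can be produced from a saved copy of the full RSS feed with a few lines of code. This is a rough sketch of the general idea and not what Wordle or Yahoo Pipes do internally: it assumes a plain RSS 2.0 feed with title and description elements in each item, and the stop word list and sample feed are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter
from html import unescape

# A deliberately tiny stop word list; a real one would be much longer.
STOP_WORDS = {"and", "the", "for", "that", "this", "with", "are", "was"}


def feed_word_counts(rss_text):
    """Count words in the titles and descriptions of every feed item."""
    root = ET.fromstring(rss_text)
    counts = Counter()
    for item in root.iter("item"):
        for field in ("title", "description"):
            text = item.findtext(field) or ""
            # Descriptions usually hold HTML: drop the tags, then decode
            # any remaining entities such as &amp;.
            text = unescape(re.sub(r"<[^>]+>", " ", text))
            for word in re.findall(r"[a-z']+", text.lower()):
                if len(word) > 2 and word not in STOP_WORDS:
                    counts[word] += 1
    return counts


# Invented two-item feed.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Sample</title>
<item><title>Web preservation basics</title>
<description>&lt;p&gt;Preservation of web resources&lt;/p&gt;</description></item>
<item><title>Archiving a blog</title>
<description>&lt;p&gt;Blog preservation and archiving&lt;/p&gt;</description></item>
</channel></rss>"""

if __name__ == "__main__":
    for word, count in feed_word_counts(SAMPLE_FEED).most_common(5):
        print(word, count)
```

The resulting counts can be pasted into a word cloud tool or simply read as a ranked list of the topics the blog covered most.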

Other Approaches

The approach taken with the JISC PoWR blog is only one possible approach. Other approaches are outlined in the Approaches To Archiving Professional Blogs Hosted In The Cloud iPres paper.

One option is the production of a new static master version of the content. To achieve this the contents of the blog could be migrated as static HTML pages. There are then requirements to preserve the new static site but this may be easier for an organisation to achieve.

Another approach is the migration of the content to an alternative platform. For example it may be felt necessary to migrate the contents of a blog to an alternative blogging platform in order to ensure that the blogging characteristics will continue to be available. This might include the migration of a live blog to an alternative platform (which would not normally be described as archiving) but could also involve copying the blog’s rich content in order to support data mining or other business processes which may not be possible on the original environment.

One possible way to do this is by use of the ArchivePress tools and methodology. ArchivePress is a JISC-funded blog-archiving project being undertaken by the University of London Computer Centre and the British Library Digital Preservation department. As an alternative to the web crawling/harvesting approach of the Internet Archive and the UK Web Archive, ArchivePress tested the viability of using RSS feeds and blog APIs to harvest blog content (including comments, embedded content and metadata). The archived content is stored and managed using instances of WordPress, thereby maintaining the blogs’ native data structures, formats and relationships.

Blog creators may want to produce a physical manifestation of their content, such as a hard copy printout. This may be for various reasons, for example for marketing purposes or to provide access to the content when online access is not possible. Many self publishing services now give book creation from blogs as an option e.g. http://www.blurb.com/create/book/blogbook.

Although the JISC PoWR blog has been backed up on an internal server there are different ways ‘backing up’ can be achieved. The 2007 School of Information and Library Science University of North Carolina at Chapel Hill blog survey found that possible preservation methods tended to focus on backing up the site. They included: download and save to personal hard drive, download and save to network hard drive, download and save to external media, use of an archiving service, printing out the blog and use of another service or package (e.g. PANDORA, Rsync, MS Word file, etc.). There are WordPress plugins that aid in this area. For example the Remote Database Backup plugin helps you back up your WordPress blog at any time by creating SQL dumps of your database.

Internal blogs may have different requirements from those that are externally facing. Some may have been created on in-house software and there is likely to be more control over the outputs created. Chris Gutteridge of the University of Southampton has given a checklist for mothballing an internal bespoke blog. He suggests that administrators keep a copy of the software and a mysqldump of the database, keep a copy of the site in a structured format, such as XML or Atom, and capture the site as plain HTML using a recursive wget or similar. This can be useful if the site software gets bitrot (nobody to maintain it) and you want to preserve the articles at their original URLs.
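Two of the steps in that checklist, the database dump and the recursive capture, lend themselves to scripting. The sketch below is one possible way to wrap them and is not taken from the checklist itself: the particular wget and mysqldump options are my own choice, the database name, site address and file names are placeholders, and it assumes both tools are installed and that database credentials are supplied through a MySQL option file.

```python
import subprocess


def wget_capture_command(site_url, output_dir):
    """Command for a recursive plain-HTML capture that stays below site_url."""
    return [
        "wget",
        "--recursive",
        "--level=inf",
        "--page-requisites",  # also fetch images and stylesheets
        "--no-parent",        # do not wander above the starting directory
        "--directory-prefix=" + output_dir,
        site_url,
    ]


def mysqldump_command(database, dump_file):
    """Command for an SQL dump of one database (credentials assumed to
    come from a MySQL option file rather than the command line)."""
    return [
        "mysqldump",
        "--single-transaction",
        "--result-file=" + dump_file,
        database,
    ]


def run_all(commands, runner=subprocess.run):
    """Run each command in turn, stopping at the first failure."""
    for command in commands:
        runner(command, check=True)


if __name__ == "__main__":
    # Placeholder values; printing rather than running keeps this harmless.
    for command in (
        wget_capture_command("http://example.org/blog/", "capture"),
        mysqldump_command("blogdb", "blog.sql"),
    ):
        print(" ".join(command))
```

Link conversion is deliberately left out of the wget options, on the assumption that the capture will later be served from the original URLs as the checklist suggests. If you intend to browse the copy from disk instead, you would probably want links rewritten.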

Best Practice

The work in understanding appropriate solutions for archiving the JISC PoWR blog hosted in the Cloud has helped identify appropriate practices which may be relevant to others. They are particularly relevant for funding bodies who wish to ensure that their project-funded activities, which make use of blogs provided by third parties, implement appropriate approaches for ensuring that the content provided does not disappear unexpectedly.

A possible checklist could include the following steps:

  • Planning: Consideration of what the blog contains, what you want to preserve, why you want to preserve it and what you want people to have access to in the future. A strategy for moving forward (responsibilities, resources etc.), clarity on the licence covering the content (posts, comments and other) and what others can do with the content.
  • Monitoring of technologies used: An audit of the current technologies the blog uses and related issues.
  • Identification of migration strategy: Decision on the approach to be taken, if any.
  • Auditing: Auditing of various aspects of the blog including the number of posts, technologies used and any other relevant areas. Consideration of any problem areas and checking the blog is in ‘good shape’ before archiving. The Mothballing Your Web Site briefing paper may be useful here.
  • Implementation of migration strategy: Implementation of the strategy. It is useful to have a list of key dates here.
  • Dissemination: Ensuring that others are aware of the change in status of the blog, sharing any experiences and best practice learnt.

Conclusion

The archiving approaches taken with the JISC PoWR blog can be summarised as:

A record of the status of a project blog was taken and published. A rich copy of the contents of the blog was held on a WordPress blog on the UKOLN Intranet which provides a backup managed within the organisation.

This approach is just one of many that could have been taken but will hopefully safeguard future use of the JISC PoWR blog contents.

Blog archiving is still a very new area and it is important that those in the HE sector share experiences and best practices learnt. It may be difficult to know the full impact of the approaches taken till much further down the line and those interested in preservation of Web sites and blogs would do well to watch patterns of use in the forthcoming months and years.

Posted in Case studies | 1 Comment »

Case Study: Tap into Bath

Posted by Marieke Guy on 8th September 2010

There is a lot of value in digital preservation case studies. Building knowledge from observation by sharing approaches can save others a lot of time and effort.

My colleague Ann Chapman has written the following brief case study on the archiving of the Tap into Bath demonstrator project. The demonstrator project contained elements including a Web site, database and software. Ann can be contacted via her UKOLN staff page.

Tap into Bath

Tap into Bath was a demonstrator project to create a searchable database of collection-level descriptions as part of the Collection Description Focus work programme. The lead partners were UKOLN and University of Bath Library, and there were twenty-five contributing partners from archive, library and museum collections in both public and private sectors in the City of Bath.

Tap into Bath Home Page

The project began on 12 January 2004. The database was created by a member of the University of Bath library staff using the RSLP Metadata Schema for collection description and a MySQL database. A programmer was hired to create the search and display interfaces; it was part of the contract that the project would be making this available as open source software. The completed database and the project web pages were held on a University of Bath server. Partner organisations submitted collection entries as Word documents and the data was entered into the database by University library staff. The Tap into Bath database was formally launched at the Guildhall in Bath on 8 December 2004.

The MySQL database and the search and data entry interfaces were designated as open source and the un-populated database and accompanying software offered for re-use with accreditation. Several enquiries were received; some did not proceed (typically because funding was unavailable for data collection and data entry tasks) but the following resources were created, both of which have Web links to the Tap into Bath site.

The Southern Cross Resource Finder (SCRF) is a web-based resource that enables users to discover collections from libraries, archives and museums which hold resources useful for the study of Australia and/or New Zealand. Produced and maintained by the Menzies Centre for Australian Studies, King’s College London, it was launched in 2005.

Milton Keynes Inspire is an online searchable database to promote access to the collections in museums, archives, galleries and libraries in Milton Keynes, launched on 4 November 2005.

In 2007, project partners were contacted and asked to review their entries and supply any additional or amended data and the database was updated.

Tap into Bath record for the Holburne museum

In 2010 UKOLN received notification that the University of Bath server would be de-commissioned later that year and reviewed the status of the project. The conclusion was that the data needed further updating immediately and on a regular basis in the future and that it would benefit from a re-designed search and display interface. Neither UKOLN nor any of the partners had the resources to host, maintain and/or develop the resource and so it was decided to take the following actions.

  • Archive on UKOLN server and burn onto DVD
    • Populated Tap into Bath database
    • Un-populated database
    • Web apps for data entry and search & display interfaces
    • Word documents ‘High Level Design’ and ‘System Maintenance Guide’
    • Metadata schema
    • Guidelines for data entry
    • Screenshots of search and display pages in use
  • Create a zipped download of unpopulated database, Web interface software and installation documents in Word format for organisations to re-use
  • Create new Web pages for the project on the UKOLN server that:
    • Record the history of the project
    • State that the resource has been taken down
    • State that the Tap into Bath email address is no longer active
    • Provide access to a zipped download of unpopulated database, Web interface software and installation documents in Word format for organisations to re-use.
  • Notify partners of new URL and request they remove the old URL if this is currently displayed on their Web site
  • Notify Southern Cross and Milton Keynes Inspire of change of URL so they can update the acknowledgement link on their Web pages
  • Tap into Bath email address: messages currently go to a member of UKOLN’s Outreach & Community team. This is to be changed so messages go to a member of UKOLN’s Systems and Support team.
  • Record all of the above activity for UKOLN resource management purposes

Carrying out the above processes has ensured that the Tap into Bath site and data have been effectively archived for the short-term. Openly documenting the process enables interested parties to be aware of the archive process and know who should be contacted if any information or data is required.
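For anyone following a similar route, the zipped download in the list above could be assembled with a short script. This is an illustrative sketch only, not a record of how the Tap into Bath package was made. It zips every file under a folder and adds a list of SHA-256 checksums so that a later user can confirm the files are intact; the checksum manifest is my own addition rather than something described in the case study, and the folder and file names are placeholders.

```python
import hashlib
import zipfile
from pathlib import Path


def build_download(source_dir, zip_path):
    """Zip every file under source_dir and add a checksum manifest.

    zip_path should sit outside source_dir, otherwise the archive would
    try to include itself.
    """
    source = Path(source_dir)
    manifest_lines = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(p for p in source.rglob("*") if p.is_file()):
            relative = path.relative_to(source).as_posix()
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            archive.write(path, relative)
            manifest_lines.append(f"{digest}  {relative}")
        # One "checksum  filename" line per file.
        archive.writestr("MANIFEST.sha256", "\n".join(manifest_lines) + "\n")
    return manifest_lines


if __name__ == "__main__":
    # Placeholder names; point these at the real folder and target file.
    if Path("release").is_dir():
        for line in build_download("release", "download.zip"):
            print(line)
```

The same manifest could be stored alongside the copies held on the server and on DVD, giving a simple way to check later that nothing has changed.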

Posted in Case studies | 2 Comments »

Theory to Practice: Digital Preservation Case studies

Posted by Marieke Guy on 21st May 2010

I’m sure you’d agree that experience counts for a lot. In the digital preservation world when you need to do something a little tricky that you haven’t done before it can really help to have a case study close to hand. I am hoping that we will be able to include a number of these in the JISC Beginner’s Guide to Digital Preservation. Although we are on the hunt for new case studies there are already some available:

Digital Preservation Coalition Case Notes
The DPC have published a series of 4 case studies looking at the National Archives’ approach to the UK’s Cabinet Papers, the Freeze Frame project’s use of their institutional repository, the Archival Sound Recordings 2 project’s use of METS and a complex digitisation project at the National Library of Wales
SCARP Project Case Studies
The Digital Curation Centre SCARP project (2007-2009) used a series of immersive case studies to identify disciplinary approaches to data deposit, sharing and re-use, curation and preservation
DCC Case Studies
The DCC also have several other case studies from the following projects: the Integrative Biology, JHOVE, PrestoSpace, CARMEN and Wide Field Astronomy Unit (WFAU)
JISC Digital Preservation Policies Study
Useful case studies covering institutional preservation policies are listed in the JISC Digital Preservation Policies Study carried out in 2008
JISC Preservation of Web Resources Case Studies
The JISC Preservation of Web Resources blog and handbook both offer case studies in the area of Web preservation
JISC Digital Media Case studies
JISC Digital Media hold case studies in their ‘Learning Lessons from Other Digitisation Projects’ area, although they are primarily about digitisation many do also cover preservation
AHDS Case Studies
The Arts and Humanities Data Service has now ceased to be but the Web site still houses case studies on digitisation and preservation
National Organisations
There are a number of national library and large national organisation case study approaches available including the National Library of New Zealand, the Theater Instituut Nederland (TIN) and the National Library of Australia

Anyone know of any other useful case studies available?

Posted in Case studies | 4 Comments »