A while back I wrote the following case study on how we had archived the JISC PoWR blog. This case study sits alongside the Approaches To Archiving Professional Blogs Hosted In The Cloud paper Brian Kelly and I wrote for iPres 2010.
As the Digital Preservation Coalition explains, the term digital archiving is used differently within sectors.
“The library and archiving communities often use it interchangeably with digital preservation. Computing professionals tend to use digital archiving to mean the process of backup and ongoing maintenance as opposed to strategies for long-term digital preservation.”
In this case study the ‘archiving’ term is being used to describe ways in which blog content can be migrated to alternative environments in order to satisfy a number of business functions, including the re-creation of the original environment. The approaches taken are steps in capturing blog content and considerations about short-term continuity. They involve a mixture of technical considerations, management and agreed policy decisions. It is possible that they could be used at a later stage as part of a digital preservation strategy. Such a strategy may become more necessary if continued access to the original resource is no longer available.
The JISC PoWR (Preservation of Web Resources) project was funded by the JISC and provided by a partnership of UKOLN and ULCC. The project ran from April to November 2008. A WordPress blog, hosted by the JISC on their jiscinvolve platform, was used to support the project work.
Content for the blog was provided by staff from the two partner organisations. It was agreed in advance that blog posts would be published under a Creative Commons licence and a statement to this effect was provided on the blog. In the academic sector use of Creative Commons (a way to communicate which rights creators reserve and waive with regard to content) is seen as good practice and recommended by the JISC. It was agreed that having a licence in place would avoid possible confusion regarding ownership of the content and lay an effective foundation for future archiving.
A decision was made to host the blog on a platform provided by the project’s funding body rather than using the host institution of either of the project partners. Although this avoids the risk of unanticipated changes to terms and conditions for the service, the PoWR team are aware that expected cuts in funding for higher education could result in withdrawal of the service or a failure for the service to be developed.
The Preservation of Web Resources (PoWR) project was funded by JISC to organise a series of workshops and produce a handbook that specifically addressed digital preservation issues of relevance to the UK HE/FE web management community. With preservation forming the heart of the project, and bearing in mind the reliance on a service that could be withdrawn, there has been an interest in blog archiving and migration from the outset.
After the end of the project in November 2008 the original intention was to continue to publish occasional posts on the JISC PoWR blog related to Web preservation issues. These posts would be published at regular intervals but at a significantly lower frequency than when the project was active. It was initially agreed that the team could continue to provide at least 3 posts per month and the blog would be regarded as still functioning. This happened over the 2008 – 2009 period. Initially the intention was to allow the blog to be reused if additional funding became available to continue the JISC PoWR work in providing advice on best practices for the preservation of Web resources. However, although the team were successful in obtaining additional funding, this covered a broader area than that of Web preservation.
Reusing the blog?
In April 2010 one of the project team began work on a new project (The JISC Beginner’s Guide to Digital Preservation). A proposal was put forward to use the PoWR blog for this new project. There were a number of reasons given for this possible approach. The PoWR blog has lots of valuable content and it was felt that reusing it could be a way to keep it alive and channel effort into its upkeep. It was also likely that the new guide would draw on lots of the JISC PoWR work already carried out and tackle many similar areas, so there would be a lot of overlap. Initially there was a view that reusing the blog could tie in with ‘green’ ideas about reusing and repurposing content.
However, reusing the blog would require that substantial changes were made to the blog, including renaming, removing pages, re-skinning the site and retagging old entries. After an interesting discussion it was decided that this approach was not appropriate and that the ‘green’ angle was actually a ‘red herring’: no resources or time would be saved in the long run. There were also concerns that it might weaken the integrity of the JISC PoWR blog as a ‘record series’.
In the world of paper and non-digital records, the approach of using existing files and adding new records on top of them is sometimes taken. This approach often results in confusion and extra work for the administrators, the records manager and the archivist, as it disrupts continuity. A record series (as defined by US National Archives) is a group of files or documents that are kept together (either physically or intellectually) because they relate to a particular subject or function, result from the same activity, document a specific type of transaction, take a particular physical form, or have some other relationship arising out of their creation, receipt, maintenance, or use. It was agreed that it was preferable to “close the blog and pass the baton on”. A new blog was created for the JISC Beginner’s Guide to Digital Preservation and it was decided that the PoWR blog would close and a case study would be written about the process to be used by the new JISC Beginner’s Guide to Digital Preservation.
During the life of the JISC PoWR project that created the blog there were many opportunities to consider what elements and behaviours of Web sites, and in turn blogs, are important to capture. Chapter 6 of the JISC PoWR Handbook is entitled ‘What elements do we capture and preserve?’ The 3 key elements are defined as: content, appearance and behaviour. The first two elements are relatively straightforward to capture but the third, behaviour, which includes features like RSS feeds, comments, site administration and tagging features, is harder to capture. It was with a view to capturing all three elements that the approaches taken were chosen.
After some investigation into possible archive methods (see Approaches To Archiving Professional Blogs Hosted In The Cloud for more detail) it was decided that the following would be done. Firstly the site would be frozen but remain accessible for the indefinite future. Then an XML dump would be made of the content in the blog and stored on a UKOLN server. Finally a ‘copy blog’ would be created on the UKOLN Intranet. The XML dump and copy blog would not be for public viewing. In parallel, use would also be made of external archiving services such as the Internet Archive and UK Web Archive.
External Archiving: The UK Web Archive
At the start of the project the JISC PoWR blog was suggested by JISC as a possible target for archiving for the UK Web Archive. The site has now been archived on 5 occasions between January 2009 and January 2010. The archive is incomplete as there are no more recent instances (so posts are missing) but it is still a useful resource for the project team and those interested in the look, feel and content of a JISC project blog from 2008/2010. For example, when some elements of the blog were lost due to an upgrade, the UKWA archived version of the blog provided a simple way to reinstate the look and feel of the blog. The archived version may also prove useful for those interested in the changing nature of blogs and their use instead of project Web sites. More information on why we would want to preserve (e.g. for legal, cultural and reputation reasons) is given in the JISC PoWR handbook.
External Archiving: Internet Archive
The JISC PoWR blog has been archived by the Internet Archive but for some reason only pages up to July 30th 2008 have been archived. It is possible that a processing delay means that the site has been captured more recently and not yet made available. The site has been resubmitted to the Open Directory site. Again this archive is incomplete but still a useful resource for the reasons mentioned under UKWA.
Upgrade to the blog
As explained, the blog is hosted by JISC Involve, who provide blogs for the JISC community. Until June 2010 JISC Involve was running on an old version of WordPress (1.2.5). In early June 2010 the JISC Digital Communications Team upgraded their server to the latest version of WordPress (2.9.2) and then migrated JISC Involve’s blogs over to the new installation. Although blog posts, comments, attachments, user accounts, permissions and customisations were expected to move over easily, JISC Involve users were encouraged to back up the content of drafts etc. ‘just in case’.
Unfortunately there were some technical problems migrating the content and as a consequence the original theme was lost and URLs now redirect. Luckily the JISC PoWR team were able to locate the original theme and reinstall it. A number of sidebar widgets were also lost during the migration, for example the widget containing licence information. The CC widget and other sidebar widgets were reinstated on 21st July 2010. These occurrences made the team aware of the need to record details of the technical components and architecture of the blog. After an upgrade it is also useful to check that archive permalinks, embeds and images work and that comments are showing.
Freezing the blog
The process of freezing the JISC PoWR blog involved publishing a preliminary post indicating the intention to freeze the blog and asking for comments on the data that should be recorded (Cessation of posts to the JISC PoWR blog). After this post was published a final check of the blog was carried out. Duplicate categories were removed, unsuitable widgets were removed (for example the calendar widget would have little relevance for an archived blog) and checks were made that the blog was functioning correctly.
On Friday 23rd July a final post was published stating that the blog would now be frozen. An audit of the blog technologies used was also carried out and published. This included details of the WordPress plugins installed and theme and widgets used.
Archive Page on JISC PoWR blog
The final statistics were published in an archive page. It contained the following information:
- Active Dates – These were the dates the blog had run from and to. Dates allow the blog to be put into context by future users.
- Number of posts – Data on posts shows the scale of the blog and its productivity over its lifecycle.
- Number of comments – Data on comments shows the interactivity level of the blog. There were some valuable discussions held in the comments field on posts which may have influenced the final PoWR handbook.
- Akismet (spam catcher) statistics – By preventing spam from being posted to the site Akismet has proved itself to be a valuable tool.
- Details of contributors – Links were included to the contributors’ staff pages. This will enable those with questions regarding content on the blog to contact the appropriate person and will enable easier reuse of content.
- Details of blog theme – As mentioned previously recording details of the blog theme can be very useful in allowing the look and feel to be consistent.
- Details of plugins used – Plugin use will have an effect on the functionality of the blog. Recording details of those used will allow the functionality to be consistent.
- Details of type and version of software used – Recording technical details will help with any future problems and allow the site creators to have more control over the future degrading process.
- Blog licence – Clarity with regard to the licence of the blog posts, comments and other items in the blog is important to enable future reuse. The JISC PoWR blog is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 2.0 UK: England & Wales License. Comments posted to this blog also have the same licence. Stating this fact will allow others to reuse the resources in the blog.
To ensure that those who arrive at the blog are clear on its current status the blog title was edited to include the phrase [Archived blog] after the initial title.
Comments were closed. Closing comments retrospectively is not straightforward in WordPress. It can be done on a post by post basis but with a total of 140 posts this would have been too time-consuming. The comments feature was closed by entering the Settings > Discussion area, unchecking ‘Allow people to post comments on new articles’ and checking ‘Users must be registered and logged in to comment’.
An Extended Comments plugin is available to automate this process.
A UKOLN briefing paper is available on policies for blog comments.
XML Dump of blog
On the day the blog was frozen an XML dump was taken of the blog and stored on a UKOLN server. An XML file is created by using the WordPress export function (Tools (left hand contents list) > Export). It is likely that the export process is not the same on different blog platforms but most blogs will have an export feature.
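A stored export file can also be used to double-check the kind of figures published on the archive page, such as the number of posts and comments. The following is a minimal Python sketch rather than the method the team used. It assumes the export is in WordPress's WXR format (RSS with extra `wp:` elements such as `wp:post_type` and `wp:comment`); the function name `audit_wxr` and the sample data are purely illustrative.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def audit_wxr(xml_text):
    """Count items by post type, and total comments, in a WordPress export."""
    root = ET.fromstring(xml_text)
    post_types = Counter()
    comments = 0
    for item in root.iter("item"):
        for child in item:
            # Compare local tag names only, so the exact WXR namespace
            # version declared in the file does not matter.
            local_name = child.tag.rsplit("}", 1)[-1]
            if local_name == "post_type":
                post_types[(child.text or "").strip()] += 1
            elif local_name == "comment":
                comments += 1
    return {"post_types": dict(post_types), "comments": comments}

# Tiny hand-written sample standing in for a real export file (illustrative only).
SAMPLE = """<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.0/">
<channel>
<title>Example blog</title>
<item><title>First post</title><wp:post_type>post</wp:post_type>
<wp:comment><wp:comment_id>1</wp:comment_id></wp:comment>
<wp:comment><wp:comment_id>2</wp:comment_id></wp:comment></item>
<item><title>About</title><wp:post_type>page</wp:post_type></item>
</channel></rss>"""

summary = audit_wxr(SAMPLE)
```

For a real export, the file contents would be read from disk and passed to `audit_wxr` in place of `SAMPLE`. Exports from other blog platforms are likely to need different element names.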
This XML was also imported into the UKOLN blogs WordPress system as an internal-facing private blog on the UKOLN Intranet. This has enabled the UKOLN members of the JISC PoWR team to have a better understanding of how the blog was used and to analyse the contents of the blog using a variety of WordPress plugins.
This process had already been carried out with older versions of the blog data. The availability of the backup copy of the blog meant that it was possible to change configuration options, which administrators would not want to do on a live blog. The number of RSS items provided was set to a large number so that the entire contents of the blog posts and comments could be made available via an RSS feed. The RSS feed was used to produce a Wordle word cloud which provides a visualization of the contents of the blog and the comments which have been provided. The RSS feed was also processed by Yahoo Pipes. This enabled the contents of the blog to be processed by an RSS to PDF tool, with a series of PDF files being produced in chronological order (with the capability of applying additional filtering if so desired).
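To give a flavour of what processing a full RSS feed involves, here is a rough Python sketch. It does not reproduce Wordle, Yahoo Pipes or the RSS to PDF tool; it only shows the two underlying ideas, putting feed items into chronological order and counting words as a word cloud would. It assumes an RSS 2.0 feed in which every item has a `pubDate`; the function names and the sample feed are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter
from email.utils import parsedate_to_datetime

def items_in_date_order(rss_text):
    """Return (published, title, description) tuples, oldest first."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        # Assumes every item carries an RFC 822 style pubDate.
        published = parsedate_to_datetime(item.findtext("pubDate"))
        items.append((published, title, description))
    return sorted(items, key=lambda entry: entry[0])

def word_counts(rss_text, min_length=4):
    """Count words of at least min_length letters across titles and bodies."""
    counts = Counter()
    for _, title, description in items_in_date_order(rss_text):
        # Crude removal of HTML tags before counting words.
        text = re.sub(r"<[^>]+>", " ", f"{title} {description}")
        for word in re.findall(r"[a-z']+", text.lower()):
            if len(word) >= min_length:
                counts[word] += 1
    return counts

# Hand-written sample feed, deliberately listed newest first (illustrative only).
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example</title>
<item><title>Freezing the blog</title>
<description>&lt;p&gt;The blog is now frozen.&lt;/p&gt;</description>
<pubDate>Fri, 23 Jul 2010 10:00:00 +0000</pubDate></item>
<item><title>Welcome</title>
<description>Welcome to the blog.</description>
<pubDate>Tue, 01 Apr 2008 09:00:00 +0000</pubDate></item>
</channel></rss>"""

titles = [title for _, title, _ in items_in_date_order(SAMPLE_FEED)]
counts = word_counts(SAMPLE_FEED)
```

The `min_length` cut-off is a simple stand-in for the stop-word filtering a word cloud tool would normally apply.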
The approach taken with the JISC PoWR blog is only one possible approach. Other approaches are outlined in the Approaches To Archiving Professional Blogs Hosted In The Cloud iPres paper.
One option is the production of a new static master version of the content. To achieve this the contents of the blog could be migrated as static HTML pages. There are then requirements to preserve the new static site but this may be easier for an organisation to achieve.
Another approach is the migration of the content to an alternative platform. For example it may be felt necessary to migrate the contents of a blog to an alternative blogging platform in order to ensure that the blogging characteristics will continue to be available. This might include the migration of a live blog to an alternative platform (which would not normally be described as archiving) but could also involve copying the blog’s rich content in order to support data mining or other business processes which may not be possible on the original environment.
One possible way to do this is by use of the ArchivePress tools and methodology. ArchivePress is a JISC-funded blog-archiving project being undertaken by the University of London Computer Centre and the British Library Digital Preservation department. As an alternative to the web crawling/harvesting approach of the Internet Archive and the UK Web Archive, ArchivePress tested the viability of using RSS feeds and blog APIs to harvest blog content (including comments, embedded content and metadata). The archived content is stored and managed using instances of WordPress, thereby maintaining the blogs’ native data structures, formats and relationships.
Blog creators may want to produce a physical manifestation of their content, such as a hard copy printout. This may be for various reasons, for example for marketing purposes or to provide access to the content when online access is not possible. Many self publishing services now give book creation from blogs as an option e.g. http://www.blurb.com/create/book/blogbook.
Although the JISC PoWR blog has been backed up on an internal server there are different ways ‘backing up’ can be achieved. The 2007 School of Information and Library Science University of North Carolina at Chapel Hill blog survey found that possible preservation methods tended to focus on backing up the site. They included: download and save to personal hard drive, download and save to network hard drive, download and save to external media, use of an archiving service, printing out the blog and use of another service or package, e.g. PANDORA, Rsync, MSWord file, etc. There are WordPress plugins that aid in this area. For example the Remote Database Backup plugin helps you back up your WordPress blog at any time by creating SQL dumps of your database.
Internal blogs may have different requirements from those that are externally facing. Some may have been created on in-house software and there is likely to be more control over the outputs created. Chris Gutteridge of the University of Southampton has given a checklist for mothballing an internal bespoke blog. He suggests that administrators keep a copy of the software and a mysqldump of the database, keep a copy of the site in a structured format, such as XML or Atom, and capture the site as plain HTML using a recursive wget or similar. This can be useful if the site software gets bitrot (nobody to maintain it) and you want to preserve the articles at their original URLs.
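Two of the steps in this checklist, the database dump and the plain HTML capture, could be scripted along the following lines. This is a hedged Python sketch and not Chris Gutteridge's own procedure. The function name, the example URL, the database name and the output directory are all hypothetical, and the sketch only builds the command lines without running them. It assumes `mysqldump` and `wget` are installed and that database credentials come from the user's MySQL option file.

```python
def mothball_commands(site_url, database, output_dir):
    """Build (but do not run) commands for a database dump and an HTML capture."""
    db_dump = [
        "mysqldump",
        f"--result-file={output_dir}/{database}.sql",  # write the SQL dump here
        database,
    ]
    html_capture = [
        "wget",
        "--recursive",        # follow links within the site
        "--no-parent",        # do not climb above the starting URL
        "--page-requisites",  # also fetch images and stylesheets
        f"--directory-prefix={output_dir}/html",
        site_url,
    ]
    return [db_dump, html_capture]

# Hypothetical values for illustration only.
commands = mothball_commands("http://blog.example.org/", "blogdb", "/tmp/mothball")
```

Each command list could then be passed to `subprocess.run(command, check=True)`. Keeping a copy of the software and a structured XML or Atom copy of the site would still need to be handled separately.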
The work in understanding appropriate solutions for archiving the JISC PoWR blog hosted in the Cloud has helped identify appropriate practices which may be relevant to others. They are particularly relevant for funding bodies who wish to ensure that their project-funded activities, which make use of blogs provided by third parties, implement appropriate approaches for ensuring that the content provided does not disappear unexpectedly.
A possible checklist could include the following steps:
- Planning: Consideration of what the blog contains, what you want to preserve, why you want to preserve it and what you want people to have access to in the future. A strategy for moving forward: responsibilities, resources etc. Clarity on the licence covering the content (posts, comments and other items) and what others can do with the content.
- Monitoring of technologies used: An audit of the current technologies the blog uses and related issues.
- Identification of migration strategy: Decision on the approach to be taken, if any.
- Auditing: Auditing of various aspects of the blog including the number of posts, technologies used and any other relevant areas. Consideration of any problem areas and checking the blog is in ‘good shape’ before archiving. The Mothballing Your Web Site briefing paper may be useful here.
- Implementation of migration strategy: Implementation of the strategy. It is useful to have a list of key dates here.
- Dissemination: Ensuring that others are aware of the change in status of the blog, sharing any experiences and best practice learnt.
The archiving approaches taken with the JISC PoWR blog can be summarised as:
A record of the status of a project blog was taken and published. A rich copy of the contents of the blog was held on a WordPress blog on the UKOLN Intranet which provides a backup managed within the organisation.
This approach is just one of many that could have been taken but will hopefully safeguard future use of the JISC PoWR blog contents.
Blog archiving is still a very new area and it is important that those in the HE sector share experiences and best practices learnt. It may be difficult to know the full impact of the approaches taken till much further down the line and those interested in preservation of Web sites and blogs would do well to watch patterns of use in the forthcoming months and years.