JISC Beginner's Guide to Digital Preservation

…creating a pragmatic guide to digital preservation for those working on JISC projects

Preserving the Internet using….paper?

Posted by Marieke Guy on November 11th, 2010

Maybe digital preservation is just a black hole and the best way to preserve the Internet is using paper?

The Paper Internet Project takes the stance that you could do worse than “preserve important bits of our civilization for future centuries using a bundle of paper sealed in plastic”.

Saving the web, one page at a time

They are building a series of time capsules containing photos, music, technical journals, and descriptions of everyday life in Right Now, A.D. The time capsules are buried by volunteers at specific locations all over the world. Each node contains the locations of all the others, forming a network.

So far they have built 3 nodes and have curated the data for dozens more. The work is being funded by donations. Each time capsule of 2,000-4,000 pages costs $40-$60 for printing and $20-$30 for the epoxy. They become more cost-effective as they scale up.

Good idea? What do people think?

Posted in Project news | 1 Comment »

New Principles for Access to Digital Materials

Posted by Marieke Guy on November 9th, 2010

Last week the Collections Trust announced a new set of Principles for Supporting Long-term Access to Digital Material, commissioned by MLA and produced by the Collections Trust with the support of a range of organisations including The National Archives, Heritage Lottery Fund, Archaeology Data Service, British Library, Digital Preservation Coalition, Museums Galleries Scotland, Joint Information Systems Committee and UKOLN.

The Principles form the first part of a programme of work during 2010-11 to develop guidance to support both funders and cultural institutions in developing digital resources that are more sustainable, both through Digital Preservation and more generally through the management of the Digital Content Supply Chain. The Principles Paper is available to download from Collections Link.

To keep in touch with the development of this work, and the related standards and guidelines, join the Digitisation Standards network.

Posted in General | Comments Off

Addressing the Research Data Challenge

Posted by Marieke Guy on November 8th, 2010


Last week the Digital Curation Centre (DCC) ran a series of inter-linked workshops aimed at supporting institutional data management, planning and training. The roadshow will travel round the UK but the first one was held in central Bath. The event ran over 3 days and provided institutions with advice and guidance tailored to a range of different roles and responsibilities.

Day one (Tuesday 2nd November) looked at the Research Data Landscape and offered a selection of case studies highlighting different models, approaches and working practices. Day two (Wednesday 3rd November) considered the research data challenge and how we can develop an institutional response. Day three (Thursday 4th November) comprised two half-day training workshops: Train the Trainer and Digital Curation 101.

Unfortunately, due to other commitments, I could only make the second day of the roadshow, but found it really useful and would thoroughly recommend that anyone interested in institutional curation of research data attend the next workshop (to be held in Sheffield early next year – watch this space!).

The Research Data Challenge: Developing an Institutional Response

Liz Lyon Presenting

Day two of the roadshow was aimed at high-level managers and researchers with the intention of getting them to work together to identify first steps in developing an institutional strategic plan for research data management support and service delivery. Although there was a huge amount of useful information to take in (if only I’d come across more of it when writing the Beginner’s Guide! Currently waiting for the go ahead for release.) it was very much a ‘working day’. We were to get our hands dirty looking at real research curation and preservation situations in our own institutions.

After coffee and enjoying some of the biggest biscuits I’ve seen we were introduced to the DCC and given a quick overview by Kevin Ashley, Director DCC, University of Edinburgh. The majority of the day was facilitated by Dr Liz Lyon, Associate Director, DCC and Director of UKOLN, University of Bath. Liz reiterated the research data challenge we face but pointed out that there are both excellent case studies and excellent tools now available for our use. Two that are worth highlighting here are DMP Online (DCC’s data management planning tool) and University of Southampton’s IDMB: Institutional data management blueprint. The slides Liz used during the day were excellent; they are available from the DCC Web site in PPT format and can be downloaded as a PDF from here.

During the day we worked in groups on a number of exercises. The idea was that we would start fairly high level and then drill down into more specific actions. In the first exercise my group took a look at motivations and benefits for research data management and the barriers that are currently in place. Naturally the economic climate was mentioned a fair amount during the day but some of the long-standing issues still remain: where responsibility lies, lack of skills, lack of a coherent framework, taking data out of context, storage issues and so on. After our feedback Liz gave another plenary on Reviewing Data Support Services: Analysis, Assessment, Priorities. The key DCC tool in this area is the Data Asset Framework (formerly the Data Audit Framework) which provides organisations with the means to identify, locate, describe and assess how they are managing their research data assets – very useful for prioritising work. Useful reports include those from the Supporting Data Management Infrastructure for the Humanities (Sudamih) project. There was a feeling that looking into this area was becoming easier: people tend to be more open than they were a few years back, and there is definitely a groundswell.

Group Exercises

In exercise 2 we carried out a SWOT analysis of current research data. In the feedback there were a few mentions of the excellent Review of the State of the Art of the Digital Curation of Research Data by Alex Ball. Liz also provided us with a useful resources list (in her slides).

After an excellent lunch and a very brief break (no time to rest when sorting out HE’s data problems!) we returned to another plenary by Liz on Building Capacity and Capability in your Institution: Skills, Roles, Resources which laid the groundwork for exercise 3 – a skills and services audit. This exercise required us to think about the various skills needed for data curation and align them with people in our institutions. There was a recognition that librarians do ‘a lot’ and are more than likely to become the hub for activity in the future. There was also a realisation that there is a fair number of gaps (for example around provenance) and that there can be a bit of a hole between the creation of data by researchers and the passing on of curated data to librarians. This is another reason why we need to create more links with our researchers. Again there were lots of excellent resources that I hope to return to, including Appraise & Select Research Data for Curation by Angus Whyte, Digital Curation Centre, and Andrew Wilson, Australian National Data Service.

Liz then gave her final plenary on Developing a Strategic Plan for Research Data Management: Position, Policy, Structure and Service Delivery. The suggestions on optimising organisational support and looking at quick wins put us in the right frame of mind for the final exercise – Planning Actions and Timeframe. We were required to lay down our ‘real’ and aspirational actions for the short term (0-12 months), medium term (12-36 months) and long term (over 3 years). A seriously tricky task! The feedback reflected on the situation we are currently in economically and how it offers us as many opportunities as challenges. Now is a better time than ever for reform and for information services to take on a leadership role. Kevin Ashley concluded the day with some thoughts on the big who, how and why issues. He stressed that training is so important at the moment. Many skills are in short supply and employing new staff is not an option, so reskilling your staff is essential.

Flickr photos from the day (including photos of the flip chart pages created) are available from the UKOLN Flickr page and brief feedback videos are available from the UKOLN Vimeo page. There is also a Lanyrd entry for the roadshow. The event tag was #dccsw10.

Posted in Conference, Events | 1 Comment »

New Digital Preservation Coalition Case Study

Posted by Marieke Guy on October 27th, 2010

The Digital Preservation Coalition (DPC) has just released a new case study – Small Steps: Long View – in its case notes series. The case study is available as a PDF.

The case study, prepared by Tracey Hawkins, looks at the Glasgow Museums approach to its large and growing digital collections. It considers how the museum service turned an oral history headache into an opportunity and describes how some simple steps in addressing digital preservation have created short and long term opportunities for the museums. The museums have used some very traditional, simple and well-known approaches – creating an inventory, assessing significance and promoting access – as the basis for building confidence to manage the wider challenges they face.

The benefits of digital preservation can be expressed in terms of new opportunities they create in the short and long term. Even relatively simple steps can bring early rewards if properly embedded within the mission of an organization.

The collection of case notes is available from the DPC site.

Posted in Project news | Comments Off

What’s New at the DPC?

Posted by Marieke Guy on October 15th, 2010

The latest issue of the Digital Preservation Coalition What’s New (Issue 30, October 2010) is now out.

The newsletter includes a review of iPres 2010 written by William Kilbride. The article mentions the Archiving clouds paper and the Library of Congress presentation on the Twitter archive, among other highlights.

What’s New is a monthly update on areas of interest in Digital Preservation.

Posted in Project news | Comments Off

Case study: Archiving the JISC PoWR blog

Posted by Marieke Guy on October 11th, 2010

A while back I wrote the following case study on how we had archived the JISC PoWR blog. This case study sits alongside the Approaches To Archiving Professional Blogs Hosted In The Cloud paper Brian Kelly and I wrote for iPres 2010.

Defining Archiving

As the Digital Preservation Coalition explains, the term digital archiving is used differently in different sectors.

The library and archiving communities often use it interchangeably with digital preservation. Computing professionals tend to use digital archiving to mean the process of backup and ongoing maintenance as opposed to strategies for long-term digital preservation.

From http://www.dpconline.org/advice/preservationhandbook/introduction/definitions-and-concepts

In this case study the ‘archiving’ term is being used to describe ways in which blog content can be migrated to alternative environments in order to satisfy a number of business functions, including the re-creation of the original environment. The approaches taken are steps in capturing blog content and considerations about short-term continuity. They involve a mixture of technical considerations, management and agreed policy decisions. It is possible that they could be used at a later stage as part of a digital preservation strategy. Such a strategy may become more necessary if continued access to the original resource is no longer available.

About

The JISC PoWR (Preservation of Web Resources) project was funded by the JISC and provided by a partnership of UKOLN and ULCC. The project ran from April – November 2008. A WordPress blog, hosted by the JISC on their jiscinvolve platform, was used to support the project work.
Content for the blog was provided by staff from the two partner organisations. It was agreed in advance that blog posts would be published under a Creative Commons licence and a statement to this effect was provided on the blog. In the academic sector use of Creative Commons (a way to communicate which rights creators reserve and waive with regard to content) is seen as good practice and recommended by the JISC. It was agreed that having a licence in place would avoid possible confusions regarding ownership of the content and lay an effective foundation for future archiving.

A decision was made to host the blog on a platform provided by the project’s funding body rather than using the host institution of either of the project partners. Although this avoids the risk of unanticipated changes to terms and conditions for the service, the PoWR team are aware that expected cuts in funding for higher education could result in withdrawal of the service or a failure for the service to be developed.

The Preservation of Web Resources (PoWR) project was funded by JISC to organise a series of workshops and produce a handbook that specifically addressed digital preservation issues of relevance to the UK HE/FE web management community. With preservation forming the heart of the project, and bearing in mind the reliance on a service that could be withdrawn, there has been an interest in blog archiving and migration from the outset.

Background

After the end of the project in November 2008 the original intention was to continue to publish occasional posts on the JISC PoWR blog related to Web preservation issues. These posts would be published at regular intervals but at a significantly lower frequency than when the project was active. It was initially agreed that the team could continue to provide at least 3 posts per month and the blog would be regarded as still functioning. This happened over the 2008 – 2009 period. Initially the intention was to allow the blog to be reused if additional funding became available to continue the JISC PoWR work in providing advice on best practices for the preservation of Web resources. However, although the team were successful in obtaining additional funding, this covered a broader area than that of Web preservation.

Reusing the blog?

In April 2010 one of the project team began work on a new project (The JISC Beginner’s Guide to Digital Preservation). A proposal was put forward to use the PoWR blog for this new project. There were a number of reasons given for this possible approach. The PoWR blog has lots of valuable content and it was felt that reusing it could be a way to keep it alive and channel effort into its upkeep. It was also likely that the new guide would draw on lots of the JISC PoWR work already carried out and tackle many similar areas, so there would be a lot of overlap. Initially there was a view that reusing the blog could tie in with ‘green’ ideas about reusing and repurposing content.

However reusing the blog would require that substantial changes were made to the blog, including renaming, removing pages, re-skinning the site and retagging old entries. After an interesting discussion it was decided that this approach was not appropriate and that the ‘green’ angle was actually a ‘red herring’: no resources or time would be saved in the long run. There were also concerns that it might weaken the integrity of the JISC PoWR blog as a ‘record series’.

In the world of paper and non-digital records, the approach of using existing files and adding new records on top of them is sometimes taken. This approach often results in confusion and extra work for the administrators, the records manager and the archivist, as it disrupts continuity. A record series (as defined by US National Archives) is “a group of files or documents kept together (either physically or intellectually) because they relate to a particular subject or function, result from the same activity, document a specific type of transaction, take a particular physical form, or have some other relationship arising out of their creation, receipt, maintenance, or use.” It was agreed that it was preferable to “close the blog and pass the baton on”. A new blog was created for the JISC Beginner’s Guide to Digital Preservation and it was decided that the PoWR blog would close and a case study would be written about the process to be used by the new JISC Beginner’s Guide to Digital Preservation.

Process

During the life of the JISC PoWR project that created the blog there were many opportunities to consider what elements and behaviours of Web sites, and in turn blogs, are important to capture. Chapter 6 of the JISC PoWR Handbook is entitled ‘What elements do we capture and preserve?’ The 3 key elements are defined as: content, appearance and behaviour. The first two elements are relatively straightforward to capture but the third, behaviour, which includes features like RSS feeds, comments, site administration and tagging features, is harder to capture. It was with a view to capturing all three elements that the approaches taken were chosen.

After some investigation into possible archive methods (see Approaches To Archiving Professional Blogs Hosted In The Cloud for more detail) it was decided that the following would be done. Firstly the site would be frozen but remain accessible for the indefinite future. Then an XML dump would be made of the content in the blog and stored on a UKOLN server. Finally a ‘copy blog’ would be created on the UKOLN Intranet. The XML dump and copy blog would not be for public viewing. In parallel, use would also be made of external archiving such as the Internet Archive and UK Web Archive.

External Archiving: The UK Web Archive

At the start of the project the JISC PoWR blog was suggested by JISC as a possible target for archiving for the UK Web Archive. The site has now been archived on 5 occasions between January 2009 and January 2010. The archive is incomplete as there are no more current instances (so posts are missing) but it is still a useful resource for the project team and those interested in the look, feel and content of a JISC project blog from 2008/2010. For example when some elements of the blog were lost due to an upgrade the UKWA archived version of the blog provided a simple way to reinstate the look and feel of the blog. The archived version may also prove useful for those interested in the changing nature of blogs and their use instead of project Web sites. More information on why we would want to preserve (e.g. for legal, cultural and reputation reasons) is given in the JISC PoWR handbook.

External Archiving: Internet Archive

The JISC PoWR blog has been archived by the Internet Archive but for some reason only pages up to July 30th 2008 have been archived. It is possible that a processing delay means that the site has been captured more recently and not yet made available. The site has been resubmitted to the Open Directory site. Again this archive is incomplete but still a useful resource for the reasons mentioned under UKWA.

Upgrade to the blog

As explained, the blog is hosted by JISC Involve, who provide blogs for the JISC community. Till June 2010 JISC Involve was running on an old version of WordPress (1.2.5). In early June 2010 the JISC Digital Communications Team upgraded their server to the latest version of WordPress (2.9.2) and then migrated JISC Involve’s blogs over to the new installation. Although blog posts, comments, attachments, user accounts, permissions and customisations were expected to move over easily, JISC Involve users were encouraged to back up the content of drafts etc. ‘just in case’.

Unfortunately there were some technical problems migrating the content and as a consequence the original theme was lost and URLs now redirect. Luckily the JISC PoWR team were able to locate the original theme and reinstall it. A number of sidebar widgets were also lost during the migration, for example the widget containing licence information. The CC widget and other sidebar widgets were reinstated on 21st July 2010. These occurrences made the team aware of the need to record details of the technical components and architecture of the blog. After an upgrade it is also useful to check that archiving permalinks, embeds and images work and that comments are showing.
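
As a rough illustration of that last check, the sketch below lists the links, images and embeds found in a page so that each can be tested after an upgrade. It is only a sketch in Python, not something the JISC PoWR team used; the sample page and the `resources_to_check` name are made up for the example, and actually fetching each URL is left out.

```python
from html.parser import HTMLParser


class ResourceCollector(HTMLParser):
    """Collect link targets, image sources and embed sources from a page."""

    # Which attribute holds the URL for each tag we care about.
    WATCHED = {"a": "href", "img": "src", "iframe": "src", "embed": "src"}

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        wanted = self.WATCHED.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.found.append((tag, value))


def resources_to_check(html_text):
    """Return (tag, url) pairs in the order they appear in the page."""
    collector = ResourceCollector()
    collector.feed(html_text)
    return collector.found


# Hypothetical fragment of a blog page, for illustration only.
SAMPLE_PAGE = (
    '<a href="/2008/04/hello/">Hello</a>'
    '<img src="/files/logo.png" alt="">'
    '<iframe src="https://example.org/video"></iframe>'
)

for tag, url in resources_to_check(SAMPLE_PAGE):
    print(tag, url)
```

Running the same listing before and after an upgrade and comparing the two would be one simple way to spot anything that has gone missing.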

Freezing the blog

The process of freezing the JISC PoWR blog involved publishing a preliminary post indicating the intention to freeze the blog and asking for comments on the data that should be recorded (Cessation of posts to the JISC PoWR blog). After this post was published a final check of the blog was carried out. Duplicate categories were removed, unsuitable widgets were removed (for example the calendar widget would have little relevance for an archived blog) and checks were made that the blog was functioning correctly.

On Friday 23rd July a final post was published stating that the blog would now be frozen. An audit of the blog technologies used was also carried out and published. This included details of the WordPress plugins installed and theme and widgets used.

Archive Page on JISC PoWR blog

The final statistics were published in an archive page. It contained the following information:

  • Active Dates – These were the dates the blog had run from and to. Dates allow the blog to be put into context by future users.
  • Number of posts – Data on posts shows the scale of the blog and its productivity over its lifecycle.
  • Number of comments – Data on comments shows the interactivity level of the blog. There were some valuable discussions held in the comments field on posts which may have influenced the final PoWR handbook.
  • Akismet (spam catcher) statistics – By preventing spam from being posted to the site Akismet has proved itself to be a valuable tool.
  • Details of contributors – Links were included to the contributor’s staff pages. This will enable those with questions regarding content on the blog to contact the appropriate person and will enable easier reuse of content.
  • Details of blog theme – As mentioned previously recording details of the blog theme can be very useful in allowing the look and feel to be consistent.
  • Details of plugins used – Plugin use will have an effect on the functionality of the blog. Recording details of those used will allow the functionality to be consistent.
  • Details of type and version of software used – Recording technical details will help with any future problems and allow the site creators to have more control over the future degrading process.
  • Blog licence – Clarity with regard to the licence of the blog posts, comments and other items in the blog is important to enable future reuse. The JISC PoWR blog is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 2.0 UK: England & Wales License. Comments posted to this blog also have the same licence. Stating this fact will allow others to reuse the resources in the blog.

To ensure that those who arrive at the blog are clear on its current status the blog title was edited to include the phrase [Archived blog] after the initial title.

Comments were closed. Closing comments retrospectively is not straightforward in WordPress. It can be done on a post by post basis but with a total of 140 posts this would have been too time-consuming. The comments feature was closed by entering the Settings > Discussion area and unchecking Allow people to post comments on new articles and checking Users must be registered and logged in to comment.
An Extended Comments plugin is available to automate this process.

A UKOLN briefing paper is available on policies for blog comments.

XML Dump of blog

On the day the blog was frozen an XML dump was taken of the blog and stored on a UKOLN server. An XML file is created by using the WordPress export function (Tools (left hand contents list) > Export). It is likely that the export process is not the same on different blog platforms but most blogs will have an export feature.
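
To give a feel for what the export file holds, here is a minimal Python sketch that counts published posts and their comments in an export (the sort of figures recorded on the archive page). It assumes the usual WordPress export layout of RSS `item` elements carrying `wp:post_type`, `wp:status` and `wp:comment` children; the sample data and function names are invented for illustration, and the comment count covers whatever the export contains, approved or not.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip any '{namespace}' prefix so the export version does not matter."""
    return tag.rsplit("}", 1)[-1]


def summarise_export(xml_text):
    """Count published posts and their comments in a WordPress export."""
    root = ET.fromstring(xml_text)
    posts = 0
    comments = 0
    for item in root.iter("item"):
        fields = {local(child.tag): (child.text or "") for child in item}
        if fields.get("post_type") == "post" and fields.get("status") == "publish":
            posts += 1
            comments += sum(1 for child in item if local(child.tag) == "comment")
    return {"posts": posts, "comments": comments}


# Tiny made-up export: one published post with two comments,
# one page and one draft (neither of which should be counted).
SAMPLE = """<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.0/">
<channel>
<item><title>First post</title>
<wp:post_type>post</wp:post_type><wp:status>publish</wp:status>
<wp:comment><wp:comment_id>1</wp:comment_id></wp:comment>
<wp:comment><wp:comment_id>2</wp:comment_id></wp:comment></item>
<item><title>About</title>
<wp:post_type>page</wp:post_type><wp:status>publish</wp:status></item>
<item><title>Draft</title>
<wp:post_type>post</wp:post_type><wp:status>draft</wp:status></item>
</channel></rss>"""

print(summarise_export(SAMPLE))
```

Matching on the local tag name rather than a fixed namespace is a deliberate choice here, since the namespace URL in the export differs between WordPress versions.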

This XML was also imported into the UKOLN blogs WordPress system in an internal-facing private blog on the UKOLN Intranet. This has enabled the UKOLN members of the JISC PoWR team to have a better understanding of how the blog was used and to analyse the contents of the blog using a variety of WordPress plugins.

This process had already been carried out with older versions of the blog data. The availability of the backup copy of the blog meant that it was possible to change configuration options, which administrators would not want to do on a live blog. The number of RSS items provided was set to a large number so that the entire contents of the blog posts and comments could be made available via an RSS feed. The RSS feed was used to produce a Wordle word cloud which provides a visualization of the contents of the blog and the comments which have been provided. The RSS feed was also processed by Yahoo Pipes. This enabled the contents of the blog to be processed by an RSS to PDF tool, with a series of PDF files being produced in chronological order (with the capability of applying additional filtering if so desired).
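
The chronological ordering step can be sketched without Yahoo Pipes. The Python below is an assumption about how one might do it, not the pipeline described above: it reads an RSS 2.0 feed and returns the items oldest first, ready to be handed to whatever PDF tool is in use. The feed contents and function name are hypothetical.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def items_in_date_order(rss_text):
    """Return (published, title) pairs from an RSS 2.0 feed, oldest first."""
    root = ET.fromstring(rss_text)
    entries = []
    for item in root.iter("item"):
        # RSS 2.0 dates use the RFC 822 style, e.g. "Fri, 23 Jul 2010 09:00:00 +0000".
        published = parsedate_to_datetime(item.findtext("pubDate"))
        entries.append((published, item.findtext("title", default="")))
    return sorted(entries)


# Made-up feed with the newest item first, as blog feeds usually are.
FEED = """<rss version="2.0"><channel>
<item><title>Goodbye</title><pubDate>Fri, 23 Jul 2010 09:00:00 +0000</pubDate></item>
<item><title>Hello</title><pubDate>Tue, 01 Apr 2008 10:00:00 +0000</pubDate></item>
</channel></rss>"""

for published, title in items_in_date_order(FEED):
    print(published.date(), title)
```

Additional filtering (by date range or category, say) could be applied to the returned list before anything is converted to PDF.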

Other Approaches

The approach taken with the JISC PoWR blog is only one possible approach. Other approaches are outlined in the Approaches To Archiving Professional Blogs Hosted In The Cloud iPres paper.

One option is the production of a new static master version of the content. To achieve this the contents of the blog could be migrated as static HTML pages. There are then requirements to preserve the new static site but this may be easier for an organisation to achieve.

Another approach is the migration of the content to an alternative platform. For example it may be felt necessary to migrate the contents of a blog to an alternative blogging platform in order to ensure that the blogging characteristics will continue to be available. This might include the migration of a live blog to an alternative platform (which would not normally be described as archiving) but could also involve copying the blog’s rich content in order to support data mining or other business processes which may not be possible on the original environment.

One possible way to do this is by use of the ArchivePress tools and methodology. ArchivePress is a JISC-funded blog-archiving project being undertaken by the University of London Computer Centre and the British Library Digital Preservation department. As an alternative to the web crawling/harvesting approach of the Internet Archive and the UK Web Archive, ArchivePress tested the viability of using RSS feeds and blog APIs to harvest blog content (including comments, embedded content and metadata). The archived content is stored and managed using instances of WordPress, thereby maintaining the blogs’ native data structures, formats and relationships.

Blog creators may want to produce a physical manifestation of their content, such as a hard copy printout. This may be for various reasons, for example for marketing purposes or to provide access to the content when online access is not possible. Many self publishing services now give book creation from blogs as an option e.g. http://www.blurb.com/create/book/blogbook.

Although the JISC PoWR blog has been backed up on an internal server there are different ways ‘backing up’ can be achieved. The 2007 School of Information and Library Science University of North Carolina at Chapel Hill blog survey found that possible preservation methods tended to focus on backing up the site. They included: download and save to personal hard drive, download and save to network hard drive, download and save to external media, use of an archiving service, printing out the blog and use of another service or package, e.g. PANDORA, Rsync, MS Word file, etc. There are WordPress plugins that aid in this area. For example the Remote Database Backup plugin helps you back up your WordPress blog at any time by creating SQL dumps of your database.

Internal blogs may have different requirements from those that are externally facing. Some may have been created on in-house software and there is likely to be more control over the outputs created. Chris Gutteridge of the University of Southampton has given a checklist for mothballing an internal bespoke blog. He suggests that administrators keep a copy of the software and a mysqldump of the database, keep a copy of the site in a structured format, such as XML or Atom, and capture the site as plain HTML using a recursive wget or similar. This can be useful if the site software gets bitrot (nobody to maintain it) and you want to preserve the articles at their original URLs.
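
The database dump and HTML capture in that checklist might look something like the following. This is a hedged Python sketch that only builds the `mysqldump` and `wget` command lines and prints them; the site URL, database name and file name are placeholders, and the choice of options is an assumption rather than the checklist author's own recipe.

```python
import shlex


def mothball_commands(site_url, database, dump_file):
    """Return the two commands as argument lists; nothing is executed here."""
    # Database dump written to a file. Connection details are assumed to
    # come from the user's MySQL option file, so none are given here.
    dump = [
        "mysqldump",
        "--single-transaction",
        f"--result-file={dump_file}",
        database,
    ]
    # Recursive capture of the site as plain HTML, including the images
    # and stylesheets each page needs, without climbing above the start URL.
    mirror = [
        "wget",
        "--recursive",
        "--no-parent",
        "--page-requisites",
        site_url,
    ]
    return [dump, mirror]


# Placeholder values, for illustration only.
for command in mothball_commands("http://blog.example.org/", "blogdb", "blog.sql"):
    print(shlex.join(command))
```

The sketch leaves out wget's `--convert-links` option on purpose: that option rewrites links for local browsing, whereas the aim here is to serve the saved pages at their original URLs.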

Best Practice

The work in understanding appropriate solutions for archiving the JISC PoWR blog hosted in the Cloud has helped identify appropriate practices which may be relevant to others. They are particularly relevant for funding bodies who wish to ensure that their project-funded activities, which make use of blogs provided by third parties, implement appropriate approaches for ensuring that the content provided does not disappear unexpectedly.

A possible checklist could include the following steps:

  • Planning: Consideration of what the blog contains, what you want to preserve, why you want to preserve it and what you want people to have access to in the future. A strategy for moving forward: responsibilities, resources etc. Clarity on the licence of the content (posts, comments and other) and what others can do with the content.
  • Monitoring of technologies used: An audit of the current technologies the blog uses and related issues.
  • Identification of migration strategy: Decision on the approach to be taken, if any.
  • Auditing: Auditing of various aspects of the blog including the number of posts, technologies used and any other relevant areas. Consideration of any problem areas and checking the blog is in ‘good shape’ before archiving. The Mothballing Your Web Site briefing paper may be useful here.
  • Implementation of migration strategy: Implementation of the strategy. It is useful to have a list of key dates here.
  • Dissemination: Ensuring that others are aware of the change in status of the blog, sharing any experiences and best practice learnt.

Conclusion
The archiving approaches taken with the JISC PoWR blog can be summarised as:

A record of the status of a project blog was taken and published. A rich copy of the contents of the blog was held on a WordPress blog on the UKOLN Intranet which provides a backup managed within the organisation.

This approach is just one of many that could have been taken but will hopefully safeguard future use of the JISC PoWR blog contents.

Blog archiving is still a very new area and it is important that those in the HE sector share experiences and best practices learnt. It may be difficult to know the full impact of the approaches taken till much further down the line and those interested in preservation of Web sites and blogs would do well to watch patterns of use in the forthcoming months and years.

Posted in Case studies | 1 Comment »

Web Preservation the UKOLN way

Posted by Marieke Guy on September 30th, 2010

I have had a guest blog post published on the Museums Computer Group blog. The post, entitled Web Preservation the UKOLN way, talks about some of the recent preservation work I have been doing at UKOLN and about our cultural heritage area of the UKOLN Web site.

Posted in articles | 1 Comment »

Moving out of the e-Fridge: iPres 2010

Posted by Marieke Guy on September 27th, 2010

Last week I attended the 7th International Conference on Preservation of Digital Objects (iPres 2010) held at the Technical University of Vienna. The conference looks at both research and best practice in the field of digital preservation and comprises a full week of events including the regular conference, several workshops, the International Web Archiving Workshop (IWAW), the PREMIS implementation fair and lots of organised and impromptu meetings.

@art by Gerald Martineo - the conference art work

This year they had just over 290 people registered and the programme offered keynotes, 2 tracks (made up of regular papers and late breaking results) and poster sessions. The content was an interesting mix of the more traditional presentations looking at areas like metadata and object properties and some more practical talks on areas like preserving Web data.

There was a lot to take on board during the four days I was in Vienna but here are some of my highlights.

Monday 20th September

The Fourth Paradigm

The Monday morning opening keynote entitled The Fourth Paradigm: Data-Intensive Scientific Discovery & the Future Role of Research Libraries was given by Tony Hey. Hey has his roots in the academic sector and was involved in setting up the Digital Curation Centre but he now works for Microsoft; he also has a wife who is a librarian – all this made for a broad perspective on current needs when it comes to preservation of research data. Hey did a good job of putting forward Microsoft’s assurance in this area: he explained that they are committed to open standards, open tools and open technology and keen to be more involved. In the Q&A he actually admitted that Microsoft could do more to ensure its software is properly archived and available to others and that he felt they had a ‘responsibility’ in this area.

Tony Hey gives his opening keynote

Hey’s talk looked at the previous paradigms in science – experimental, theoretical and computational – and the move to a new data-intensive paradigm, the fourth paradigm (the title of his talk and his new book). Science is now overwhelmed with data sets; he gave the example of Chronozoom. Rather than shy away from the data deluge, Hey explained that we should embrace it; the future is collective peer reviewing, collective tagging and lab notebooks as blogs. Hey also talked about software preservation and asked if we can do better. We need to decide upon the key parts and save the valuable; here he explained the relevance of Microsoft – the computing industry is closest to the problem. Hey then went on to mention some valuable digital preservation work that Microsoft has had research input into: Planets project, SCAPE project, APARSEN, DataCite, COAR, CNI and ICSTI.

Hey concluded by asking what the future of research libraries is. Is it that librarians have abdicated and are in danger of being disintermediated? His quote from a US General hit the nail on the head: “if you do not like change you will like irrelevance even less”. Hey suggested three tasks for libraries: digital library; tools for authoring and publishing; integration of data and publications. Here he advocated that research libraries should be guardians of the research output of the institution and mentioned that they should see the importance of repositories and not be afraid of cloud solutions.

Preserving Web Archives: One Size Fits All?

Straight after an interesting lunch of a pasta pie (Vienna isn’t the best place for vegetarians!) we were offered a panel session on Preserving Web Archives: One Size Fits All? The panel [Libor Coufal (National Library of the Czech Republic), Andrea Goethals (Harvard University Library), Gina Jones (Library of Congress), Clément Oury (French National Library) and David Pearson (National Library of Australia)] were all members of the Preservation Working Group of the IIPC (International Internet Preservation Consortium), which is made up of about forty institutions that collect web content for heritage purposes. Each member of the panel was given two questions to answer: “Web archiving: do we have the same understanding of what we are trying to do?” and “What are our preservation strategies for web archives? Do we have the right technologies?”

A good summary of the answers is available from the iPres site. What became clear through the discussion was that there is significant variation in what organisations are capturing (for example the French National Library are keeping spam, seeing its inclusion as a more faithful mirror of contemporary French culture) and what they plan to do with it (there were differences in how ‘public’ different national libraries want to make their Web archives.)

The Q & A session was interesting. Kevin Ashley from the DCC pointed out that Web archiving is not just about rendering single Web pages, it is about the connections. It seems Web archiving is still an un-cracked nut; as David Pearson put it, “Web archives are the opposite of well-formed and homogeneous file systems – migration is going to be difficult”.

Poster Spotlight Session

Marieke Guy in front of the Twapper Keeper poster

Later in the afternoon I was given my 2 minutes of fame and was able to present my poster on Twitter Archiving Using Twapper Keeper: Technical And Policy Challenges.

My very brief talk is available on Vimeo and embedded below. After my presentation there was a lot of interest in the Twapper Keeper software and I was lucky enough to talk to people from the Internet Archive and the Library of Congress.

Welcome at Vienna Rathaus

After all the day’s sessions we had time to nip back to our hotel to put on our glad rags before a group walk along Vienna’s Ringstraßen boulevard. The welcome drinks reception was in the Coat of Arms Hall (Wappensaal) of the Vienna City Hall (Rathaus). We were treated to fantastic ballroom dancing, great wine and lots of interesting discussion.

The Ballroom dancers at the Rathaus

Tuesday 21st September

Digital Preservation Research: An Evolving Landscape

Tuesday’s keynote was given by Patricia Manson from the European Commission. Manson has been involved in defining a research agenda. She sees the challenge as building new cross-disciplinary teams that integrate computer science with library and archival science and businesses. Manson explained that there is a need to move away from the ‘e-fridge’ idea of digital preservation, i.e. locking objects away. She encouraged the view that preservation is about access. Manson also stated that digital preservation is not just a research issue and is too important to be left only to researchers. There is a need for a joined-up approach linking policy, strategy and technology actions. Ten years of research means that we now understand more complex, dynamic and distributed objects, but there is still much to do; for example, Web archiving is not a simple problem but an area that will evolve. Manson also talked about the need to involve other sectors and convince industry of the reasons for preservation. New stakeholders include aerospace, health care, finance, science (astronomy and genomics), governmental and broadcasters’ archives, libraries and Web archives. So far the European Commission has not been very good at handling risk and has tended to be risk averse; they need to build strategies that are more open to advanced technologies.

Manson concluded by looking at the trends emerging in the latest call: new infrastructures, cloud, security and trust, open questions on governance, responsibility. The next four years will require more scalable solutions. There is a need for more automation to deal with the sheer volume and a need for less human input.

How Green is Digital Preservation

After lunch (spinach strudel for the second time) Neil Grindley from JISC moderated a panel session looking at How Green is Digital Preservation. I’m interested in environmental issues and the green ICT agenda (and have discussed it in more detail on my remote worker blog) so was really looking forward to this particular panel. After a whirlwind introduction by Grindley looking at the points of engagement between digital preservation and the green agenda, which included a quick show of the “delete a petabyte save a polar bear” poster, each speaker was given the opportunity to say where their organisation stood.

Panel session on How Green is Digital Preservation

In a very ‘green’ talk, because it was given by video cast, Diane McDonald from the University of Strathclyde explained that for her “Green IT begins with Green data”. McDonald’s main points were questioning replication and asking for leadership in this area.

Kris Carpenter Negulescu of the Internet Archive gave a practitioner’s perspective, being upfront about the fact that the IA were primarily led by economic drivers. They had found that for them power is the 2nd largest cost behind human resources, and power costs vary ‘wildly’ in Northern California. A tighter budget now required practices not to be wasteful, so this had helped them be greener in their efforts. They had tried out various practices like turning off the air conditioning for 4 months of the year and venting heat into adjacent spaces that are too cool or to the outside. Over time they had increased their storage density but their power costs had remained stable.

David Rosenthal from LOCKSS started off by admitting that digital preservation is not green at all. He showed how the time to read a disc has been increasing, from 240s in 1990 to 12000+ today; transfer speeds have not kept pace with capacity.
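
As a rough illustration of the disc-read figures above: the time to read a whole disc once is its capacity divided by its sustained transfer rate, so read time grows whenever capacity grows faster than transfer speed. The Python sketch below is only that, a sketch. The capacities and transfer rates are hypothetical values chosen to reproduce the 240s and 12000s figures, and are not numbers taken from Rosenthal's talk.

```python
def full_read_seconds(capacity_bytes: int, transfer_bytes_per_s: int) -> float:
    """Time to read every byte of a disc once at a sustained transfer rate."""
    if transfer_bytes_per_s <= 0:
        raise ValueError("transfer rate must be positive")
    return capacity_bytes / transfer_bytes_per_s


MB = 10**6

# Hypothetical drives (assumed figures, for illustration only).
DISC_1990 = {"capacity": 240 * MB, "rate": 1 * MB}            # 240 MB at 1 MB/s
DISC_2010 = {"capacity": 1_200_000 * MB, "rate": 100 * MB}    # 1.2 TB at 100 MB/s

t_1990 = full_read_seconds(DISC_1990["capacity"], DISC_1990["rate"])
t_2010 = full_read_seconds(DISC_2010["capacity"], DISC_2010["rate"])

# How much each quantity grew between the two hypothetical drives.
capacity_growth = DISC_2010["capacity"] / DISC_1990["capacity"]
rate_growth = DISC_2010["rate"] / DISC_1990["rate"]
read_time_growth = t_2010 / t_1990

print(f"1990: {t_1990:.0f}s, 2010: {t_2010:.0f}s")
print(f"capacity x{capacity_growth:.0f}, rate x{rate_growth:.0f}, "
      f"read time x{read_time_growth:.0f}")
```

With these assumed values capacity grows far more than transfer rate, and the read time grows by the ratio of the two.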

William Kilbride from the Digital Preservation Coalition explained that unfortunately green is not what politicians talk about when it comes to IT; they are more driven by privacy and economics. He gave a 10 point plan of points at which to think about the green agenda. These included procurement, planning of new buildings and deletion.

The session ended a little flatly with recognition that we all need to lead in this area but that little is still being done. Hopefully escalating energy prices will mean that big data centres try harder to work collaboratively to reduce individual footprints.

Lightning talks

In a similar session to the poster spotlight one on the previous day all delegates were given the opportunity to talk for a few minutes on an area of interest. Talks included:

  • Amanda Spencer from the National Archives talking about the Web Continuity project
  • Ross Spencer from the National Archives talking about contributions to the National Archives’ PRONOM data
  • John Kunze from the University of California Curation Center talking about EZID – actionable IDs
  • Andreas Rauber from the Vienna University of Technology talking about Challenges in digital preservation
  • Richard Wright from the BBC defining what a digital object is (in the form of a miracle)
  • Stephen Abrams from the California Digital Library talking about curation of microservices
  • Martin Halbert highlighting the Aligning National Approaches To Digital Preservation conference, Tallinn, 2011

The lightning talks worked really well and were a useful way to highlight people you might want to talk to later.

Later in the afternoon I gave my talk on Approaches To Archiving Professional Blogs Hosted In The Cloud. There were a few interesting questions around which approach we’d felt had worked best; unfortunately there wasn’t any easy answer! My talk was directly followed by probably my favourite one of the conference…

NDIIPP and the Twitter Archives

Martha Anderson from the Library of Congress (LOC) gave the story behind what happened on April 10, 2010, when the LOC and Twitter made the decision that the Library would receive a gift of the archive of all public tweets shared through the service since its inception in 2006. On this day Twitter not only gave their archives to the LOC but also sold them to Google. Anderson began by giving some examples of the relevance of Twitter archiving. These included the Iran elections, where Twitter would later prove to be a resource for historical research (tweets are the modern form of diary entries), and business records – the LOC already has a partnership with business and sometimes keeps the business records of .com businesses. She explained that the Senate was now using Twitter and the LOC has many personal collections, so Twitter is a natural addition. Anderson explained that the Twitter archives are less than 5TB so the conversation is not around space but much more around policy, privacy and access. The right to be forgotten movement has since created sites like #NoLOC.org: Keep Your Tweets From Being Archived Forever. Anderson concluded that the issues for the LOC were not technical but social, and for her it had demonstrated that there are no clean boundaries about the work we do.

Martha Anderson talking about the NDIIPP and the Twitter Archives

Reception at the Austrian National Library

In the evening we attended a drinks reception at the Austrian National Library. There was a tour of the Prunksaal (State Hall) with a talk by Max Kaiser from the Austrian National Library, one of our iPres hosts, about the 30-million-euro deal the library has made with Google to digitise 400,000 copyright-free books. After marvelling at the hall we had a drinks reception in the Aurum of the National Library.

The Austrian National Library ceiling

Wednesday 22nd September

The final morning concluded with a number of case study sessions.

Capturing and Replaying Streaming Media in a Web Archive

Helen Hockx-Yu from the British Library talked about the approaches they had taken to archiving streaming media as part of the Antony Gormley One and Other art project in the UK. The project involved 100 days of continuous occupation of the fourth plinth in Trafalgar Square. Over this period 2400 real people had occupied the plinth for sixty minutes each and this time had been streamed over RTMP using Red Stream. The British Library now had the challenge of archiving the outputs. They did this using Jaksta but also needed to carry out validation, spot-checking and repairs. However their main challenges were initially curatorial (people wanted content removed) and legal – the videos are still only valid under a 5 year licence. The main conclusions drawn from the project were that it is highly costly to archive a site like this, there is still no generic solution and that there is a real need to manage expectations. The domain name now redirects to the British Library Web Archive site.

Final Thoughts…

This was the first time I’d attended an iPres conference and it really was quite an impressive event. Everyone was really friendly and I’ve made some great contacts which I hope to follow up. My path into digital preservation has been through the Web archiving route; I’ve always worked on projects that have had pragmatism and practicality at their heart (for example this project and the PoWR project). Some aspects of the conference did seem very research-centric and technical, but there was still enough of relevance to me to keep my interest. From speaking to those who have attended before there does seem to be a move by iPres to embrace new digital preservation challenges (like Web archiving) and more hands-on research (through the late breaking results papers).

I used the #ipres2010 hashtag a lot at the conference and felt that the insights shared by those tweeting really added to my experience. Unfortunately there was only a relatively small number of people tweeting, though this is likely to change over the next few years. I’d recommend that the iPres organisers themselves use their iPres Twitter account more and specify hashtags for individual sessions, as well as for the whole conference. All the conference tweets have been archived in an iPres2010 Twapper Keeper Archive.

One other thing I would really like to see is links to speakers’ slides. Unfortunately the only resources offered were the papers. These were printed out in the huge proceedings book, which went into the huge conference bag we were given!

After the conference I had the afternoon free to enjoy the delights of Vienna. Below are a few photos of the sights I saw.

More photos from the event are available on Flickr using the ipres2010 tag.

Posted in Conference | Comments Off

iPres 2010: Twitter Archiving Using Twapper Keeper

Posted by Marieke Guy on September 15th, 2010

I’ve already mentioned my forthcoming trip to Vienna for the 7th International Conference on Preservation of Digital Objects – iPres 2010.

As well as presenting a paper on Approaches To Archiving Professional Blogs Hosted In The Cloud I will also be presenting a poster and giving a lightning presentation entitled Twitter Archiving Using Twapper Keeper: Technical And Policy Challenges. The full paper is held on the University of Bath repository and was written by Brian Kelly (UKOLN), Martin Hawksey (JISC RSC Scotland N&E), John O’Brien (Twapper Keeper), Matthew Rowe (University of Sheffield) and myself.

The paper explains that Twitter is now widely used in a range of different contexts, ranging from informal social communications and marketing purposes through to supporting various professional activities in teaching and learning and research. The growth in Twitter use has led to a recognition of the need to ensure that Twitter posts (‘tweets’) can be accessed and reused by a variety of third party applications.

It describes development work to the Twapper Keeper Twitter archiving service to support use of Twitter in education and research. The reasons for funding developments to an existing commercial service are described and the approaches for addressing the sustainability of such developments are provided. The paper reviews the challenges this work has addressed including the technical challenges in processing large volumes of traffic and the policy issues related, in particular, to ownership and copyright.

The paper concludes by describing the experiences gained in using the service to archive tweets posted during the WWW 2010 conference and summarising plans for further use of the service.

A copy of the poster is available on Scribd.

Posted in Conference, ipres2010, Paper | 3 Comments »

iPres 2010: Archiving Professional Blogs

Posted by Marieke Guy on September 13th, 2010

Next week (20 – 22 September) I will be travelling to Vienna for the 7th International Conference on Preservation of Digital Objects – iPres 2010.

I will be presenting a long late breaking result paper at the conference entitled Approaches To Archiving Professional Blogs Hosted In The Cloud. The full paper is held on the University of Bath repository and was written by Brian Kelly and myself.

This is a practical paper which recognises that early adopters of blogs will have made use of externally-hosted blog platforms, such as WordPress.com and Blogger.com, due, perhaps, to the lack of a blogging infrastructure within the institution or concerns regarding restrictive terms and conditions covering use of such services. There will be cases in which such blogs are now well-established and contain useful information not only for current readership but also as a resource which may be valuable for future generations.

The paper sees that there is a need to preserve content which is held on such third-party services – ‘the Cloud’ provides a set of new challenges which are likely to be distinct from the management of content hosted within the institution, for which institutional policies should address issues such as ownership and scope of content. Such challenges include technical issues, such as the approaches used to gather the content and the formats to be used and policy issues related to ownership, scope and legal issues.

It describes the approaches taken in UKOLN to the preservation of blogs used in the organisation and covers the technical approaches and policy issues associated with the curation of a number of different types of blogs: blogs used by members of staff in the department; blogs used to support project activities and blogs used to support events.

My slides are available on Slideshare and are embedded below.



Posted in Conference, ipres2010, Paper | 4 Comments »