JISC Beginner's Guide to Digital Preservation

…creating a pragmatic guide to digital preservation for those working on JISC projects

Archive for September, 2010

Web Preservation the UKOLN way

Posted by Marieke Guy on 30th September 2010

I have had a guest blog post published on the Museums Computer Group blog. The post, entitled Web Preservation the UKOLN way, talks about some of the recent preservation work I have been doing at UKOLN and about our cultural heritage area of the UKOLN Web site.

Posted in articles | 1 Comment »

Moving out of the e-Fridge: iPres 2010

Posted by Marieke Guy on 27th September 2010

Last week I attended the 7th International Conference on Preservation of Digital Objects (iPres 2010) held at the Technical University of Vienna. The conference looks at both research and best practice in the field of digital preservation and comprises a full week of events including the regular conference, several workshops, the International Web Archiving Workshop (IWAW), the PREMIS implementation fair and lots of organised and impromptu meetings.

@art by Gerald Martineo - the conference art work

This year they had just over 290 people registered and the programme offered keynotes, 2 tracks (made up of regular papers and late breaking results) and poster sessions. The content was an interesting mix of the more traditional presentations looking at areas like metadata and object properties and some more practical talks on areas like preserving Web data.

There was a lot to take on board during the four days I was in Vienna but here are some of my highlights.

Monday 20th September

The Fourth Paradigm

The Monday morning opening keynote entitled The Fourth Paradigm: Data-Intensive Scientific Discovery & the Future Role of Research Libraries was given by Tony Hey. Hey has his roots in the academic sector and was involved in setting up the Digital Curation Centre but he now works for Microsoft; he also has a wife who is a librarian – all this made for a broad perspective on current needs when it comes to preservation of research data. Hey did a good job of putting forward Microsoft’s assurance in this area: he explained that they are committed to open standards, open tools and open technology and keen to be more involved. In the Q&A he actually admitted that Microsoft could do more to ensure its software is properly archived and available to others and that he felt they had a ‘responsibility’ in this area.

Tony Hey gives his opening keynote

Hey’s talk looked at the previous paradigms in science – experimental, theoretical and computational – and the move to a new data-intensive paradigm, the fourth paradigm (the title of his talk and his new book). Science is now overwhelmed with data sets; he gave the example of Chronozoom. Rather than shy away from the data deluge, Hey explained that we should embrace it; the future is collective peer reviewing, collective tagging and lab notebooks as blogs. Hey also talked about software preservation and asked if we can do better. We need to decide upon the key parts and save the valuable; here he explained the relevance of Microsoft – the computing industry is closest to the problem. Hey then went on to mention some valuable digital preservation work to which Microsoft has contributed in a research role: the Planets project, the SCAPE project, APARSEN, DataCite, COAR, CNI and ICSTI.

Hey concluded by asking what the future of research libraries is. Is it that librarians have abdicated and are in danger of being disintermediated? His quote from a US General hit the nail on the head: “if you do not like change you will like irrelevance even less”. Hey suggested three tasks for libraries: the digital library; tools for authoring and publishing; and the integration of data and publications. Here he advocated that research libraries should be guardians of the research output of the institution and mentioned that they should see the importance of repositories and not be afraid of cloud solutions.

Preserving Web Archives: One Size Fits All?

Straight after an interesting lunch of a pasta pie (Vienna isn’t the best place for vegetarians!) we were offered a panel session on Preserving Web Archives: One Size Fits All? The panel [Libor Coufal (National Library of the Czech Republic), Andrea Goethals (Harvard University Library), Gina Jones (Library of Congress), Clément Oury (French National Library) and David Pearson (National Library of Australia)] were all members of the Preservation Working Group of the IIPC (International Internet Preservation Consortium), which is made up of about forty institutions that collect web content for heritage purposes. Each member of the panel was given two questions to answer: “Web archiving: do we have the same understanding of what we are trying to do?” and “What are our preservation strategies for web archives? Do we have the right technologies?”

A good summary of the answers is available from the iPres site. What became clear through the discussion was that there is significant variation in what organisations are capturing (for example the French National Library is keeping spam, seeing its inclusion as a more faithful mirror of contemporary French culture) and what they plan to do with it (there were differences in how ‘public’ different national libraries want to make their Web archives).

The Q & A session was interesting. Kevin Ashley from the DCC pointed out that Web archiving is not just about rendering single Web pages, it is about the connections. It seems Web archiving is still an un-cracked nut; as David Pearson put it, “Web archives are the opposite of well-formed and homogeneous file systems – migration is going to be difficult”.

Poster Spotlight Session

Marieke Guy in front of the Twapper Keeper poster

Later in the afternoon I was given my 2 minutes of fame and was able to present my poster on Twitter Archiving Using Twapper Keeper: Technical And Policy Challenges.

My very brief talk is available on Vimeo and embedded below. After my presentation there was a lot of interest in the Twapper Keeper software and I was lucky enough to talk to people from the Internet Archive and the Library of Congress.

Welcome at Vienna Rathaus

After all the day’s sessions we had time to nip back to our hotel to put on our glad rags before a group walk along Vienna’s Ringstraße boulevard. The welcome drinks reception was in the Coat of Arms Hall (Wappensaal) of the Vienna City Hall (Rathaus). We were treated to fantastic ballroom dancing, great wine and lots of interesting discussion.

The Ballroom dancers at the Rathaus

Tuesday 21st September

Digital Preservation Research: An Evolving Landscape

Tuesday’s keynote was given by Patricia Manson from the European Commission. Manson has been involved in defining a research agenda. She sees the challenge as building new cross-disciplinary teams that integrate computer science with library and archival science and with businesses. Manson explained that there is a need to move away from the ‘e-fridge’ idea of digital preservation, i.e. locking objects away. She encouraged the view that preservation is about access. Manson also stated that digital preservation is not just a research issue and that it is too important to be left only to researchers. There is a need for a joined-up approach linking policy, strategy and technology actions. Ten years of research means that we now understand more complex, dynamic and distributed objects, but there is still much to do; for example, Web archiving is not a simple problem but an area that will evolve. Manson also talked about the need to involve other sectors and convince industry of the reasons for preservation. New stakeholders include aerospace, health care, finance, science (astronomy and genomics), governmental and broadcasters’ archives, libraries and Web archives. So far the European Commission has not been very good at handling risk and has tended to be risk averse; it needs to build strategies that are more open to advanced technologies.

Manson concluded by looking at the trends emerging in the latest call: new infrastructures, cloud, security and trust, open questions on governance, responsibility. The next four years will require more scalable solutions. There is a need for more automation to deal with the sheer volume and a need for less human input.

How Green is Digital Preservation

After lunch (spinach strudel for the second time) Neil Grindley from JISC moderated a panel session looking at How Green is Digital Preservation. I’m interested in environmental issues and the green ICT agenda (and have discussed it in more detail on my remote worker blog) so was really looking forward to this particular panel. After a whirlwind introduction by Grindley looking at the points of engagement between digital preservation and the green agenda, which included a quick show of the “delete a petabyte save a polar bear” poster, each speaker was given the opportunity to say where their organisation stood.

Panel session on How Green is Digital Preservation

In a very ‘green’ talk (it was given by video cast), Diane McDonald from the University of Strathclyde explained that for her “Green IT begins with Green data”. McDonald’s main points were questioning replication and asking for leadership in this area.

Kris Carpenter Negulescu of the Internet Archive gave a practitioner’s perspective, being upfront about the fact that the IA was primarily led by economic drivers. They had found that power is their 2nd largest cost behind human resources, and power costs vary ‘wildly’ in Northern California. A tighter budget now required practices not to be wasteful, so this had helped them be greener in their efforts. They had tried out various practices like turning off the air conditioning for 4 months of the year and venting heat into adjacent spaces that are too cool or to the outside. Over time they had increased their storage density but their power costs had remained stable.

David Rosenthal from LOCKSS started off by admitting that digital preservation is not green at all. He showed how the time taken to read a whole disc has increased from 240s in 1990 to 12,000s or more today; transfer speeds have not kept pace with capacity.
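
The arithmetic behind this is simple: the time to read a whole disc is its capacity divided by its sustained transfer rate. Below is a minimal Python sketch. The capacities and transfer rates are my own illustrative assumptions, chosen only to reproduce the 240s and 12,000s figures; they are not numbers from Rosenthal’s slides.

```python
def full_read_seconds(capacity_mb: float, transfer_mb_per_s: float) -> float:
    """Time to read an entire disc end to end, in seconds."""
    return capacity_mb / transfer_mb_per_s


# Assumed, illustrative drives (not figures from the talk).
drive_1990 = full_read_seconds(capacity_mb=240, transfer_mb_per_s=1)
drive_2010 = full_read_seconds(capacity_mb=1_200_000, transfer_mb_per_s=100)

print(drive_1990)  # 240.0
print(drive_2010)  # 12000.0
```

In this made-up example capacity grows 5,000-fold while the transfer rate grows only 100-fold, so a full read takes 50 times longer.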

William Kilbride from the Digital Preservation Coalition explained that unfortunately green is not what politicians talk about when it comes to IT; they are more driven by privacy and economics. He gave a 10-point plan of the points at which to think about the green agenda. These included procurement, planning of new buildings and deletion.

The session ended a little flatly with recognition that we all need to lead in this area but that still little is being done. Hopefully escalating energy prices will mean that big data centres try harder to work collaboratively to reduce individual footprints.

Lightning talks

In a similar session to the poster spotlight one on the previous day all delegates were given the opportunity to talk for a few minutes on an area of interest. Talks included:

  • Amanda Spencer from the National Archives talking about the Web Continuity project
  • Ross Spencer from the National Archives talking about contributions to the National Archives’ PRONOM data
  • John Kunze from the University of California Curation Center talking about EZID – actionable IDs
  • Andreas Rauber from the Vienna University of Technology talking about Challenges in digital preservation
  • Richard Wright from the BBC defining what a digital object is (in the form of a miracle)
  • Stephen Abrams from the California Digital Library talking about curation of microservices
  • Martin Halbert highlighting the Aligning National Approaches To Digital Preservation conference, Tallinn, 2011

The lightning talks worked really well and were a useful way to highlight people you might want to talk to later.

Later in the afternoon I gave my talk on Approaches To Archiving Professional Blogs Hosted In The Cloud. There were a few interesting questions around which approach we’d felt had worked best; unfortunately there wasn’t any easy answer! My talk was directly followed by probably my favourite one of the conference…

NDIIPP and the Twitter Archives

Martha Anderson from the Library of Congress (LOC) gave the story behind what happened on April 10, 2010, when the LOC and Twitter made the decision that the Library would receive a gift of the archive of all public tweets shared through the service since its inception in 2006. On this day Twitter not only gave their archives to the LOC but also sold them to Google. Anderson began by giving some examples of the relevance of Twitter archiving. These included the Iran elections, where Twitter would later prove to be a resource for historical research (tweets are the modern form of diary entries), and business records – the LOC already has a partnership with business and sometimes keeps the business records of .com businesses. She explained that the Senate was now using Twitter and that the LOC has many personal collections, so Twitter is a natural addition. Anderson explained that the Twitter archives are less than 5TB, so the conversation is not around space but much more around policy, privacy and access. The right to be forgotten movement has since created sites like #NoLOC.org (“Keep Your Tweets From Being Archived Forever”). Anderson concluded that the issues for the LOC were not technical but social, and for her it had demonstrated that there are no clean boundaries around the work we do.

Martha Anderson talking about the NDIIPP and the Twitter Archives

Reception at the Austrian National Library

In the evening we attended a drinks reception at the Austrian National Library. There was a tour of the Prunksaal (State Hall) with a talk by Max Kaiser from the Austrian National Library, one of our iPres hosts, about the 30-million-euro deal the library has made with Google to digitise 400,000 copyright-free books. After marvelling at the hall we had a drinks reception in the Aurum of the National Library.

The Austrian National Library ceiling

Wednesday 22nd September

The final morning concluded with a number of case study sessions.

Capturing and Replaying Streaming Media in a Web Archive

Helen Hockx-Yu from the British Library talked about the approaches they had taken to archiving streaming media as part of the Antony Gormley One and Other art project in the UK. The project involved 100 days of continuous occupation of the fourth plinth in Trafalgar Square. Over this period 2400 real people had occupied the plinth for sixty minutes each and this had been streamed over RTMP using Red Stream. The British Library now had the challenge of archiving the outputs. They did this using Jaksta but also needed to carry out validation, spot-checking and repairs. However their main challenges were initially curatorial (people wanted content removed) and legal – the videos are still only valid under a 5 year licence. The main conclusions drawn from the project were that it is highly costly to archive a site like this, there is still no generic solution and that there is a real need to manage expectations. The domain name now redirects to the British Library Web Archive site.

Final Thoughts…

This was the first time I’d attended an iPres conference and it really was quite an impressive event. Everyone was really friendly and I’ve made some great contacts which I hope to follow up. My path into digital preservation has been through the Web archiving route, and I’ve always worked on projects that have had pragmatism and practicality at their heart (for example this project and the PoWR project). Some aspects of the conference did seem very research-centric and technical, but there was still enough of relevance to me to keep my interest. From speaking to those who have attended before, there does seem to be a move by iPres to embrace new digital preservation challenges (like Web archiving) and more hands-on research (through the late breaking results papers).

I used the #ipres2010 hashtag a lot at the conference and felt that the insights shared by those tweeting really added to my experience. Unfortunately there was only a relatively small number of people tweeting, though this is likely to change over the next few years. I’d recommend that the iPres organisers themselves use their iPres Twitter account more and specify hashtags for individual sessions, as well as for the whole conference. All the conference tweets have been archived in an iPres2010 Twapper Keeper Archive.

One other thing I would really like to see is links to speakers’ slides. Unfortunately the only resources offered were the papers. These were printed out in the huge proceedings book, which went into the huge conference bag we were given!

After the conference I had the afternoon free to enjoy the delights of Vienna. Below are a few photos of the sights I saw.

More photos from the event are available on Flickr using the ipres2010 tag.

Posted in Conference | Comments Off

iPres 2010: Twitter Archiving Using Twapper Keeper

Posted by Marieke Guy on 15th September 2010

I’ve already mentioned my forthcoming trip to Vienna for the 7th International Conference on Preservation of Digital Objects – iPres 2010.

As well as presenting a paper on Approaches To Archiving Professional Blogs Hosted In The Cloud I will also be presenting a poster and giving a lightning presentation entitled Twitter Archiving Using Twapper Keeper: Technical And Policy Challenges. The full paper is held on the University of Bath repository and was written by Brian Kelly (UKOLN), Martin Hawksey (JISC RSC Scotland N&E), John O’Brien (Twapper Keeper), Matthew Rowe (University of Sheffield) and myself.

The paper explains that Twitter is now widely used in many different contexts, from informal social communications and marketing purposes through to supporting various professional activities in teaching and learning and research. The growth in Twitter use has led to a recognition of the need to ensure that Twitter posts (‘tweets’) can be accessed and reused by a variety of third party applications.

It describes development work on the Twapper Keeper Twitter archiving service to support use of Twitter in education and research. The reasons for funding developments to an existing commercial service are described and the approaches for addressing the sustainability of such developments are provided. The paper reviews the challenges this work has addressed, including the technical challenges in processing large volumes of traffic and the policy issues related, in particular, to ownership and copyright.

The paper concludes by describing the experiences gained in using the service to archive tweets posted during the WWW 2010 conference and summarising plans for further use of the service.

A copy of the poster is available on Scribd.

Posted in Conference, ipres2010, Paper | 3 Comments »

iPres 2010: Archiving Professional Blogs

Posted by Marieke Guy on 13th September 2010

Next week (20 – 22 September) I will be travelling to Vienna for the 7th International Conference on Preservation of Digital Objects – iPres 2010.

I will be presenting a long late breaking result paper at the conference entitled Approaches To Archiving Professional Blogs Hosted In The Cloud. The full paper is held on the University of Bath repository and was written by Brian Kelly and myself.

This is a practical paper which recognises that early adopters of blogs will have made use of externally-hosted blog platforms, such as WordPress.com and Blogger.com, due, perhaps, to the lack of a blogging infrastructure within the institution or concerns regarding restrictive terms and conditions covering use of such services. There will be cases in which such blogs are now well-established and contain useful information not only for current readership but also as a resource which may be valuable for future generations.

The paper argues that there is a need to preserve content which is held on such third-party services. ‘The Cloud’ provides a set of new challenges which are likely to be distinct from the management of content hosted within the institution, for which institutional policies should address issues such as ownership and scope of content. Such challenges include technical issues, such as the approaches used to gather the content and the formats to be used, and policy issues related to ownership, scope and legal issues.

It describes the approaches taken in UKOLN to the preservation of blogs used in the organisation and covers the technical approaches and policy issues associated with the curation of a number of different types of blogs: blogs used by members of staff in the department; blogs used to support project activities; and blogs used to support events.
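
To give a flavour of what ‘gathering the content’ can mean in practice, here is a minimal Python sketch of one possible approach: reading a blog’s RSS 2.0 feed and pulling out the posts so they can be stored locally. This is my own illustration rather than the method described in the paper; the sample feed, the example.org URL and the extract_posts function name are all made up, and a real feed would be fetched from the blog platform rather than typed in.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real one would be downloaded from the blog host.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example project blog</title>
    <item>
      <title>First post</title>
      <link>http://example.org/first-post</link>
      <pubDate>Mon, 13 Sep 2010 09:00:00 +0000</pubDate>
      <description>Hello world</description>
    </item>
  </channel>
</rss>"""

FIELDS = ("title", "link", "pubDate", "description")


def extract_posts(feed_xml: str) -> list[dict[str, str]]:
    """Return one dictionary per <item> in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        # Missing elements become empty strings rather than None.
        posts.append({field: item.findtext(field, default="") for field in FIELDS})
    return posts


posts = extract_posts(SAMPLE_FEED)
print(posts[0]["title"])  # First post
```

A feed usually only carries the most recent posts (and often not comments or images), which is one reason a feed-based approach on its own may not be enough.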

My slides are available on Slideshare and are embedded below.



Posted in Conference, ipres2010, Paper | 4 Comments »

A Structure for the Guide

Posted by Marieke Guy on 10th September 2010

Back in May I wrote about my brainstorming session and how I’d got a rough structure in place. I’ve now written the majority of the guide and have a table of contents to share with you.

Although the structure is based on a series of questions the pages are interconnected and it is hoped that people will be able to approach the guide in different ways: through tags, through an index, as one-off answers to a question, as briefing papers etc.

Anyway here is the structure as it now stands – please do let me know what you think.

  • Why is Digital Preservation Relevant to my JISC Project?
    • Access and Reuse Drivers
    • Legal Drivers
    • Economic Drivers
    • Reputation Drivers
    • Responsibility Drivers
    • Corporate Memory
    • Cultural Drivers
  • What are the Particular Digital Preservation Challenges JISC Projects Face?
    • Responsibility
    • Risk Management
    • Cost
  • What is Digital Preservation?
    • Definition of Digital Preservation
    • Definition of Digital Object
    • Definition of Digital Deterioration
    • Digital Preservation Approaches
    • Lifecycle Model
  • What Exactly do I Need to Preserve?
    • Information Audit
    • JISC Project Outputs
      • Text Based Information
      • Software
      • Numerical Information
      • Audio Visual
      • Emails
      • Event Information
      • Learning Objects and Teaching Materials
      • Web 1.0
      • Web 2.0
        • Blogs
        • Wikis
        • Twitter
      • Personal Data
    • Selection
  • How do I Make my Deliverables Easier to Preserve?
    • Formats
    • IPR and Licences
    • Metadata
  • How do I Preserve Digital Objects?
    • Tools
    • Training
    • Preservation Strategy
    • Preservation Policy
  • Can I Offload Digital Preservation?
    • Repositories
    • External Web Archiving Services
    • Institutional Records Management Processes
    • Outsourcing
  • Appendix
    • Glossary
    • Case Studies

Posted in Project news | Comments Off

Case Study: Tap into Bath

Posted by Marieke Guy on 8th September 2010

There is a lot of value in digital preservation case studies. Building knowledge from observation by sharing approaches can save others a lot of time and effort.

My colleague Ann Chapman has written the following brief case study on the archiving of the Tap into Bath demonstrator project. The demonstrator project contained elements including a Web site, database and software. Ann can be contacted via her UKOLN staff page.

Tap into Bath

Tap into Bath was a demonstrator project to create a searchable database of collection-level descriptions as part of the Collection Description Focus work programme. The lead partners were UKOLN and the University of Bath Library, with twenty-five contributing partners from archive, library and museum collections in both public and private sectors in the City of Bath.

Tap into Bath Home Page

The project began on 12 January 2004. The database was created by a member of the University of Bath library staff using the RSLP Metadata Schema for collection description and a MySQL database. A programmer was hired to create the search and display interfaces; it was part of the contract that the project would be making this available as open source software. The completed database and the project web pages were held on a University of Bath server. Partner organisations submitted collection entries as Word documents and the data was entered into the database by University library staff. The Tap into Bath database was formally launched at the Guildhall in Bath on 8 December 2004.

The MySQL database and the search and data entry interfaces were designated as open source and the un-populated database and accompanying software offered for re-use with accreditation. Several enquiries were received; some did not proceed (typically because funding was unavailable for data collection and data entry tasks) but the following resources were created, both of which have Web links to the Tap into Bath site.

The Southern Cross Resource Finder (SCRF) is a web-based resource that enables users to discover collections from libraries, archives and museums which hold resources useful for the study of Australia and/or New Zealand. Produced and maintained by the Menzies Centre for Australian Studies, King’s College London, it was launched in 2005.

Milton Keynes Inspire is an online searchable database to promote access to the collections in museums, archives, galleries and libraries in Milton Keynes, launched on 4 November 2005.

In 2007, project partners were contacted and asked to review their entries and supply any additional or amended data and the database was updated.

Tap into Bath record for the Holburne museum

In 2010 UKOLN received notification that the University of Bath server would be de-commissioned later that year and reviewed the status of the project. The conclusion was that the data needed further updating immediately and on a regular basis in the future and that it would benefit from a re-designed search and display interface. Neither UKOLN nor any of the partners has the resources to host, maintain and/or develop the resource and so it was decided to take the following actions.

  • Archive on UKOLN server and burn onto DVD
    • Populated Tap into Bath database
    • Un-populated database
    • Web apps for data entry and search & display interfaces
    • Word documents ‘High Level Design’ and ‘System Maintenance Guide’
    • Metadata schema
    • Guidelines for data entry
    • Screenshots of search and display pages in use
  • Create a zipped download of unpopulated database, Web interface software and installation documents in Word format for organisations to re-use
  • Create new Web pages for the project on the UKOLN server that:
    • Record the history of the project
    • State that the resource has been taken down
    • State that the Tap into Bath email address is no longer active
    • Provide access to a zipped download of unpopulated database, Web interface software and installation documents in Word format for organisations to re-use.
  • Notify partners of new URL and request they remove the old URL if this is currently displayed on their Web site
  • Notify Southern Cross and Milton Keynes Inspire of change of URL so they can update the acknowledgement link on their Web pages
  • Tap into Bath email address: messages currently go to a member of UKOLN’s Outreach & Community team. This is to be changed so messages go to a member of UKOLN’s Systems and Support team.
  • Record all of the above activity for UKOLN resource management purposes

Carrying out the above processes has ensured that the Tap into Bath site and data have been effectively archived for the short-term. Openly documenting the process enables interested parties to be aware of the archive process and know who should be contacted if any information or data is required.
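
For anyone wanting to do something similar, the ‘zipped download’ step could be sketched along the following lines in Python. This is only an illustration under my own assumptions, not the process UKOLN actually followed: the function names are hypothetical, and recording SHA-256 checksums is an extra I have added so that the package can be checked again later.

```python
import hashlib
import zipfile
from pathlib import Path


def build_reuse_package(source_dir: Path, package_path: Path) -> dict[str, str]:
    """Zip every file under source_dir and return {relative path: SHA-256}.

    package_path should sit outside source_dir so the zip does not include itself.
    """
    checksums: dict[str, str] = {}
    with zipfile.ZipFile(package_path, "w", zipfile.ZIP_DEFLATED) as package:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                relative = path.relative_to(source_dir).as_posix()
                package.write(path, arcname=relative)
                checksums[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return checksums


def verify_package(package_path: Path, checksums: dict[str, str]) -> bool:
    """Re-read the zip and confirm each member still matches its checksum."""
    with zipfile.ZipFile(package_path) as package:
        if sorted(package.namelist()) != sorted(checksums):
            return False
        return all(
            hashlib.sha256(package.read(name)).hexdigest() == digest
            for name, digest in checksums.items()
        )
```

Calling build_reuse_package on a folder holding the un-populated database, the Web interface software and the installation documents would produce the download, and the returned checksums could be kept alongside the DVD copy so that verify_package can later confirm nothing has changed.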

Posted in Case studies | 2 Comments »

Treasuring Twitter

Posted by Marieke Guy on 6th September 2010

My article on Treasuring Twitter: The Why and How of Preserving Tweets has now been published in FUMSI.

FUMSI is the successor to FreePint and publishes articles aimed at helping information professionals do their work.

Posted in articles | Comments Off