JISC Beginner's Guide to Digital Preservation

…creating a pragmatic guide to digital preservation for those working on JISC projects

What’s New at the DPC?

Posted by Marieke Guy on 15th October 2010

The latest issue of the Digital Preservation Coalition What’s New (Issue 30, October 2010) is now out.

The newsletter includes a review of iPres 2010 written by William Kilbride. The article mentions the Archiving clouds paper and the Library of Congress presentation on the Twitter archive, among other highlights.

What’s New is a monthly update on areas of interest in Digital Preservation.

Posted in Project news

Moving out of the e-Fridge: iPres 2010

Posted by Marieke Guy on 27th September 2010

Last week I attended the 7th International Conference on Preservation of Digital Objects (iPres 2010) held at the Technical University of Vienna. The conference looks at both research and best practice in the field of digital preservation and comprises a full week of events including the regular conference, several workshops, the International Web Archiving Workshop (IWAW), the PREMIS implementation fair and lots of organised and impromptu meetings.

@art by Gerald Martineo - the conference art work

This year they had just over 290 people registered and the programme offered keynotes, 2 tracks (made up of regular papers and late breaking results) and poster sessions. The content was an interesting mix of the more traditional presentations looking at areas like metadata and object properties and some more practical talks on areas like preserving Web data.

There was a lot to take on board during the four days I was in Vienna but here are some of my highlights.

Monday 20th September

The Fourth Paradigm

The Monday morning opening keynote, entitled The Fourth Paradigm: Data-Intensive Scientific Discovery & the Future Role of Research Libraries, was given by Tony Hey. Hey has his roots in the academic sector and was involved in setting up the Digital Curation Centre, but he now works for Microsoft; he also has a wife who is a librarian. All this made for a broad perspective on current needs when it comes to preservation of research data. Hey did a good job of putting forward Microsoft’s assurance in this area: he explained that they are committed to open standards, open tools and open technology and keen to be more involved. In the Q&A he actually admitted that Microsoft could do more to ensure its software is properly archived and available to others and that he felt they had a ‘responsibility’ in this area.

Tony Hey gives his opening keynote

Hey’s talk looked at the previous paradigms in science – experimental, theoretical and computational – and the move to a new data-intensive paradigm, the fourth paradigm (the title of his talk and his new book). Science is now overwhelmed with data sets; he gave the example of Chronozoom. Rather than shy away from the data deluge, Hey explained that we should embrace it; the future is collective peer reviewing, collective tagging and lab notebooks as blogs. Hey also talked about software preservation and asked if we can do better. We need to decide upon the key parts and save the valuable; here he explained the relevance of Microsoft, as the computing industry is closest to the problem. Hey then went on to mention some valuable digital preservation work to which Microsoft has had research input: the Planets project, the SCAPE project, APARSEN, DataCite, COAR, CNI and ICSTI.

Hey concluded by asking what the future of research libraries is. Is it that librarians have abdicated and are in danger of being disintermediated? His quote from a US General hit the nail on the head: “if you do not like change you will like irrelevance even less”. Hey suggested three tasks for libraries: digital library; tools for authoring and publishing; integration of data and publications. Here he advocated that research libraries should be guardians of the research output of the institution and mentioned that they should see the importance of repositories and not be afraid of cloud solutions.

Preserving Web Archives: One Size Fits All?

Straight after an interesting lunch of a pasta pie (Vienna isn’t the best place for vegetarians!) we were offered a panel session on Preserving Web Archives: One Size Fits All? The panel [Libor Coufal (National Library of the Czech Republic), Andrea Goethals (Harvard University Library), Gina Jones (Library of Congress), Clément Oury (French National Library) and David Pearson (National Library of Australia)] were all members of the Preservation Working Group of the IIPC (International Internet Preservation Consortium), which is made up of about forty institutions that collect web content for heritage purposes. Each member of the panel was given two questions to answer: “‘Web archiving’: do we have the same understanding of what we are trying to do?” and “What are our preservation strategies for web archives? Do we have the right technologies?”

A good summary of the answers is available from the iPres site. What became clear through the discussion was that there is significant variation in what organisations are capturing (for example the French National Library is keeping spam, seeing its inclusion as a more faithful mirror of contemporary French culture) and what they plan to do with it (there were differences in how ‘public’ different national libraries want to make their Web archives).

The Q & A session was interesting. Kevin Ashley from the DCC pointed out that Web archiving is not just about rendering single Web pages, it is about the connections. It seems Web archiving is still an un-cracked nut. As David Pearson put it: “Web archives are the opposite of well-formed and homogeneous file systems – migration is going to be difficult”.

Poster Spotlight Session

Marieke Guy in front of the Twapper Keeper poster

Later in the afternoon I was given my 2 minutes of fame and was able to present my poster on Twitter Archiving Using Twapper Keeper: Technical And Policy Challenges.

My very brief talk is available on Vimeo. After my presentation there was a lot of interest in the Twapper Keeper software and I was lucky enough to talk to people from the Internet Archive and the Library of Congress.

Welcome at Vienna Rathaus

After all the day’s sessions we had time to nip back to our hotel to put on our glad rags before a group walk along Vienna’s Ringstraßen boulevard. The welcome drinks reception was in the Coat of Arms Hall (Wappensaal) of the Vienna City Hall (Rathaus). We were treated to fantastic ballroom dancing, great wine and lots of interesting discussion.

The Ballroom dancers at the Rathaus

Tuesday 21st September

Digital Preservation Research: An Evolving Landscape

Tuesday’s keynote was given by Patricia Manson from the European Commission. Manson has been involved in defining a research agenda. She sees the challenge as building new cross-disciplinary teams that integrate computer science with library, archival science and businesses. Manson explained that there is a need to move away from the ‘e-fridge’ idea of digital preservation, i.e. locking objects away. She encouraged the view that preservation is about access. Manson also stated that digital preservation is not just a research issue and is too important to be left only to researchers. There is a need for a joined-up approach linking policy, strategy and technology actions. Ten years of research means that we now understand more complex, dynamic and distributed objects, but there is still much to do; for example, Web archiving is not a simple problem but an area that will evolve. Manson also talked about the need to involve other sectors and convince industry of the reasons for preservation. New stakeholders include aerospace, health care, finance, science (astronomy and genomics), governmental and broadcasters’ archives, libraries and Web archives. So far the European Commission has not been very good at handling risk and has tended to be risk averse; they need to build strategies that are more open to advanced technologies.

Manson concluded by looking at the trends emerging in the latest call: new infrastructures, cloud, security and trust, open questions on governance, responsibility. The next four years will require more scalable solutions. There is a need for more automation to deal with the sheer volume and a need for less human input.

How Green is Digital Preservation

After lunch (spinach strudel for the second time) Neil Grindley from JISC moderated a panel session looking at How Green is Digital Preservation. I’m interested in environmental issues and the green ICT agenda (and have discussed it in more detail on my remote worker blog) so was really looking forward to this particular panel. After a whirlwind introduction by Grindley looking at the points of engagement between digital preservation and the green agenda, which included a quick show of the “delete a petabyte save a polar bear” poster, each speaker was given the opportunity to say where their organisation stood.

Panel session on How Green is Digital Preservation

In a very ‘green’ talk (it was given by videocast), Diane McDonald from the University of Strathclyde explained that for her “Green IT begins with Green data”. McDonald’s main points were questioning replication and asking for leadership in this area.

Kris Carpenter Negulescu of the Internet Archive gave a practitioner’s perspective, being upfront about the fact that the IA were primarily led by economic drivers. They had found that for them power is the 2nd largest cost behind human resources, and power costs vary ‘wildly’ in Northern California. A tighter budget now required practices not to be wasteful, so this had helped them be greener in their efforts. They had tried out various practices like turning off the air conditioning for 4 months of the year and venting heat into adjacent spaces that are too cool or to the outside. Over time they had increased their storage density but their power costs had remained stable.

David Rosenthal from LOCKSS started off by admitting that digital preservation is not green at all. He showed how the time to read a whole disc has risen from 240s in 1990 to 12000+ today: transfer speeds have not kept pace with capacity.

William Kilbride from the Digital Preservation Coalition explained that unfortunately green is not what politicians talk about when it comes to IT; they are more driven by privacy and economics. He gave a 10-point plan of points at which to think about the green agenda. These included procurement, planning of new buildings and deletion.

The session ended a little flatly with recognition that we all need to lead in this area but that still little is being done. Hopefully escalating energy prices will mean that big data centres try harder to work collaboratively to reduce individual footprints.

Lightning talks

In a session similar to the previous day’s poster spotlight, all delegates were given the opportunity to talk for a few minutes on an area of interest. Talks included:

  • Amanda Spencer from the National Archives talking about the Web Continuity project
  • Ross Spencer from the National Archives talking about contributions to the National Archives PRONOM data
  • John Kunze from the University of California Curation Center talking about EZID – actionable IDs
  • Andreas Rauber from the Vienna University of Technology talking about Challenges in digital preservation
  • Richard Wright from the BBC defining what a digital object is (in the form of a miracle)
  • Stephen Abrams from the California Digital Library talking about curation of microservices
  • Martin Halbert highlighting the Aligning National Approaches To Digital Preservation conference, Tallinn, 2011

The lightning talks worked really well and were a useful way to highlight people you might want to talk to later.

Later in the afternoon I gave my talk on Approaches To Archiving Professional Blogs Hosted In The Cloud. There were a few interesting questions around which approach we’d felt had worked best; unfortunately there wasn’t any easy answer! My talk was directly followed by probably my favourite one of the conference…

NDIIPP and the Twitter Archives

Martha Anderson from the Library of Congress (LOC) gave the story behind what happened on April 10, 2010, when the LOC and Twitter made the decision that the Library would receive a gift of the archive of all public tweets shared through the service since its inception in 2006. On this day Twitter not only gave their archives to the LOC but also sold them to Google. Anderson began by giving some examples of the relevance of Twitter archiving. These included the Iran elections, where Twitter would later prove to be a resource for historical research (tweets are the modern form of diary entries), and business records: the LOC already has a partnership with business and sometimes keeps the business records of .com businesses. She explained that the Senate was now using Twitter and the LOC has many personal collections, so Twitter is a natural addition. Anderson explained that the Twitter archives are less than 5TB, so the conversation is not around space but much more around policy, privacy and access. The right to be forgotten movement has since created sites like #NoLOC.org (“Keep Your Tweets From Being Archived Forever”). Anderson concluded that the issues for the LOC were not technical but social, and for her it had demonstrated that there are no clean boundaries around the work we do.

Martha Anderson talking about the NDIIPP and the Twitter Archives

Reception at the Austrian National Library

In the evening we attended a drinks reception at the Austrian National Library. There was a tour of the Prunksaal (State Hall) with a talk by Max Kaiser from the Austrian National Library, one of our iPres hosts, about the 30-million-euro deal the library has made with Google to digitise 400,000 copyright-free books. After marvelling at the hall we had a drinks reception in the Aurum of the National Library.

The Austrian National Library ceiling

Wednesday 22nd September

The final morning concluded with a number of case study sessions.

Capturing and Replaying Streaming Media in a Web Archive

Helen Hockx-Yu from the British Library talked about the approaches they had taken to archiving streaming media as part of the Antony Gormley One and Other art project in the UK. The project involved 100 days of continuous occupation of the Fourth Plinth in Trafalgar Square. Over this period 2400 real people had occupied the plinth for sixty minutes each and this time had been streamed over RTMP using Red Stream. The British Library now had the challenge of archiving the outputs. They did this using Jaksta but also needed to carry out validation, spot-checking and repairs. However their main challenges were initially curatorial (people wanted content removed) and legal: the videos are still only valid under a 5 year licence. The main conclusions drawn from the project were that it is highly costly to archive a site like this, there is still no generic solution and that there is a real need to manage expectations. The domain name now redirects to the British Library Web Archive site.

Final Thoughts…

This was the first time I’d attended an iPres conference and it really was quite an impressive event. Everyone was really friendly and I’ve made some great contacts which I hope to follow up. My path into digital preservation has been through the Web archiving route; I’ve always worked on projects that have had pragmatism and practicality at their heart (for example this project and the PoWR project). Some aspects of the conference did seem very research-centric and technical, but there was still enough of relevance to me to keep my interest. From speaking to those who have attended before, there does seem to be a move by iPres to embrace new digital preservation challenges (like Web archiving) and more hands-on research (through the late breaking results papers).

I used the #ipres2010 hashtag a lot at the conference and felt that the insights shared by those tweeting really added to my experience. Unfortunately there was only a relatively small number of people tweeting, though this is likely to change over the next few years. I’d recommend that the iPres organisers themselves use their iPres Twitter account more and specify hashtags for individual sessions, as well as for the whole conference. All the conference tweets have been archived in an iPres2010 Twapper Keeper Archive.

One other thing I would really like to see is links to speakers’ slides. Unfortunately the only resources offered were the papers. These were printed out in the huge proceedings book, which went into the huge conference bag we were given!

After the conference I had the afternoon free to enjoy the delights of Vienna. Below are a few photos of the sights I saw.

More photos from the event are available on Flickr using the ipres2010 tag.

Posted in Conference

iPres 2010: Twitter Archiving Using Twapper Keeper

Posted by Marieke Guy on 15th September 2010

I’ve already mentioned my forthcoming trip to Vienna for the 7th International Conference on Preservation of Digital Objects – iPres 2010.

As well as presenting a paper on Approaches To Archiving Professional Blogs Hosted In The Cloud I will also be presenting a poster and giving a lightning presentation entitled Twitter Archiving Using Twapper Keeper: Technical And Policy Challenges. The full paper is held on the University of Bath repository and was written by Brian Kelly (UKOLN), Martin Hawksey (JISC RSC Scotland N&E), John O’Brien (Twapper Keeper), Matthew Rowe (University of Sheffield) and myself.

The paper explains that Twitter is now widely used in a range of different contexts, ranging from informal social communications and marketing purposes through to supporting various professional activities in teaching and learning and research. The growth in Twitter use has led to a recognition of the need to ensure that Twitter posts (‘tweets’) can be accessed and reused by a variety of third party applications.

It describes development work to the Twapper Keeper Twitter archiving service to support use of Twitter in education and research. The reasons for funding developments to an existing commercial service are described and the approaches for addressing the sustainability of such developments are provided. The paper reviews the challenges this work has addressed including the technical challenges in processing large volumes of traffic and the policy issues related, in particular, to ownership and copyright.

The paper concludes by describing the experiences gained in using the service to archive tweets posted during the WWW 2010 conference and summarising plans for further use of the service.

A copy of the poster is available on Scribd.

Posted in Conference, ipres2010, Paper

iPres 2010: Archiving Professional Blogs

Posted by Marieke Guy on 13th September 2010

Next week (20 – 22 September) I will be travelling to Vienna for the 7th International Conference on Preservation of Digital Objects – iPres 2010.

I will be presenting a long late breaking result paper at the conference entitled Approaches To Archiving Professional Blogs Hosted In The Cloud. The full paper is held on the University of Bath repository and was written by Brian Kelly and myself.

This is a practical paper which recognises that early adopters of blogs will have made use of externally-hosted blog platforms, such as WordPress.com and Blogger.com, due, perhaps, to the lack of a blogging infrastructure within the institution or concerns regarding restrictive terms and conditions covering use of such services. There will be cases in which such blogs are now well-established and contain useful information not only for current readership but also as a resource which may be valuable for future generations.

The paper sees that there is a need to preserve content which is held on such third-party services – ‘the Cloud’ provides a set of new challenges which are likely to be distinct from the management of content hosted within the institution, for which institutional policies should address issues such as ownership and scope of content. Such challenges include technical issues, such as the approaches used to gather the content and the formats to be used and policy issues related to ownership, scope and legal issues.

It describes the approaches taken in UKOLN to the preservation of blogs used in the organisation and covers the technical approaches and policy issues associated with the curation of a number of different types of blogs: blogs used by members of staff in the department, blogs used to support project activities and blogs used to support events.

My slides are available on Slideshare.



Posted in Conference, ipres2010, Paper