JISC Beginner's Guide to Digital Preservation

…creating a pragmatic guide to digital preservation for those working on JISC projects

Archive-It Webinar

Posted by Marieke Guy on July 19th, 2010

I wanted to find out more about the Archive-It service available from the Internet Archive, so I have just watched their pre-recorded online webinar. Kristine Hanna and the Archive-It team run live informational webinars, but the time difference made it a little tricky for me to attend one of these (they are held every 2 weeks at 11:30 am PT).

Many of you will have already heard of the Internet Archive, a non-profit organisation founded in 1996 by Brewster Kahle. Its aim is “universal access to human knowledge”, and it fulfils this by aggregating a “broad snapshot of the Web every 2 months” in a Web archive/library, which has been built using open-source software. It is currently the largest public Web archive in existence, holding around 150 billion+ pages collected from 65 million+ Web sites.

In response to partners’ requests for more control over their data, the Internet Archive team developed the Archive-It service.

It is a web-based application that enables institutions to harvest and archive digital content. They can create focused collections, alter the content scope and frequency of crawls, add metadata to their collections, make their collections private and more. Archived content is available for browsing 24 hours after a crawl has completed and is full-text searchable within 7 days. Collections are hosted at the Internet Archive data centre, and a subscription includes hosting, access and storage. Collections can be browsed from the Archive-It home page or from a landing page on your institution’s Web site.
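The “content scope” idea can be sketched generically: a focused crawler keeps a discovered URL only if it falls within the collection’s declared scope. The seed URLs, function name and simple prefix rule below are illustrative assumptions, not Archive-It’s actual implementation (real Heritrix/Archive-It scoping is richer, with SURT prefixes, regular expressions and hop counts):

```python
# Illustrative sketch of crawl scoping: a focused crawler keeps a URL only
# if it extends one of the collection's seed URLs. The seeds and the simple
# prefix rule are stand-ins for Archive-It's richer scoping rules.
SEEDS = [
    "http://www.example.ac.uk/library/",
    "http://blog.example.ac.uk/",
]

def in_scope(url: str, seeds=SEEDS) -> bool:
    """Return True if `url` falls under one of the seed prefixes."""
    return any(url.startswith(seed) for seed in seeds)

print(in_scope("http://www.example.ac.uk/library/archives/page1.html"))  # True
print(in_scope("http://www.example.com/elsewhere/"))                     # False
```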

Archive-It uses open-source technology primarily developed by the Internet Archive and the International Internet Preservation Consortium (IIPC).

The key components are:

  • Heritrix – Web crawler; crawls and captures pages
  • Wayback Machine – access tool for rendering and viewing archived Web pages; surf the Web as it was
  • Nutch – open-source search engine providing standard full-text search
  • WARC file – archival file format used for preservation
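To make the WARC component a little more concrete, here is a minimal sketch of what a WARC record looks like on disk and how it could be read with nothing but the Python standard library. The tiny hand-made record and the parser are illustrative assumptions only: real WARC files carry more mandatory fields (WARC-Record-ID, WARC-Date), are usually gzip-compressed, and in practice you would use a dedicated library rather than rolling your own.

```python
# Minimal sketch of reading WARC records (the ISO 28500 archival format
# written by Heritrix). Each record is a version line, named header fields,
# a blank line, then a content block whose size is given by Content-Length.

def parse_warc_records(data: bytes):
    """Return (headers, block) pairs from raw, uncompressed WARC bytes."""
    records, pos = [], 0
    while True:
        start = data.find(b"WARC/", pos)
        if start == -1:
            break
        header_end = data.find(b"\r\n\r\n", start)
        header_lines = data[start:header_end].decode("utf-8").split("\r\n")
        headers = {}
        for line in header_lines[1:]:  # skip the "WARC/1.0" version line
            name, _, value = line.partition(":")
            headers[name.strip()] = value.strip()
        length = int(headers["Content-Length"])
        block_start = header_end + 4
        records.append((headers, data[block_start:block_start + length]))
        pos = block_start + length
    return records

# A tiny hand-made record for illustration:
raw = (b"WARC/1.0\r\n"
       b"WARC-Type: response\r\n"
       b"WARC-Target-URI: http://example.org/\r\n"
       b"Content-Length: 12\r\n"
       b"\r\n"
       b"Hello, WARC!"
       b"\r\n\r\n")

for headers, block in parse_warc_records(raw):
    print(headers["WARC-Target-URI"], block)
```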

I’m planning to write more about the technologies behind Web archiving when I get a chance.

Archive-It Webinar

The webinar itself is really clear and easy to follow and lasts about 30 minutes. It begins with an introduction to the Internet Archive and Archive-It and ends with a demo of the Archive-It software. As the webinar shows, there is no software to install or download: all you need is a computer, an internet connection and a browser to be able to access your collection. It looks very easy to use. During a 15-minute demo a new collection is created, metadata added, URLs added, settings specified, the collection crawled and a number of reports created. Although many big institutions use the service, it is designed with institutions with smaller collection and infrastructure requirements in mind. A free trial (for 2-4 weeks, one collection, 10 seeds (sites), up to 500,000 URLs) is offered to webinar attendees.

I was also quite interested in how the Archive-It team are archiving social networking sites, including blogs. By email the team explained that many of their partners have been doing this and that they have been successfully capturing blogs since 2007. ‘Partners’ is the term used for any institution or organisation that uses or interacts with the Archive-It service. Although they are US based they currently have several European partners (over 125 partners currently use Archive-It) and quite a few in UK HE (35% of their partners come from university and public libraries) – see their partners list for more details. Archive-It also has connections with a number of digital preservation systems including LOCKSS, DSpace, CONTENTdm, iRODS and ArchivalWare.

Posted in Archiving | Comments Off

A Guide to Web Preservation

Posted by Marieke Guy on July 12th, 2010

Today the JISC Preservation of Web Resources (PoWR) team announced the launch of A Guide to Web Preservation.

I worked on the PoWR project back in 2008. The project organised workshops and produced a handbook that specifically addressed digital preservation issues that were, and still are, relevant to the UK HE/FE web management community. It was a really successful project, but further down the line it was felt that a more accessible, easy-to-use version of the handbook was needed. This new guide does just the trick! I was really pleased to see it being given out as a resource at the DPTP Web Archiving Workshop I attended a few weeks back.

To steal some words from the press release:

This Guide uses similar content to PoWR: The Preservation of Web Resources Handbook but in a way which provides a practical guide to web preservation, particularly for web and records managers. The chapters are set out in a logical sequence and answer the questions which might be raised when web preservation is being seriously considered by an institution. These are:

  • What is preservation?
  • What are web resources?
  • Why do I have to preserve them?
  • What is a web preservation programme?
  • How do I decide what to preserve?
  • How do I capture them?
  • Who should be involved?
  • What approaches should I take?
  • What policies need to be developed?

Each chapter concludes with a set of actions, and one chapter lists the tasks which must be carried out, and the timings of these tasks, if an institution is to develop and maintain a web preservation programme. In addition, points made in the Guide are illustrated with a number of case studies.

The guide was edited by Susan Farrell who has used her knowledge and expertise in the management of large-scale institutional Web services in writing the document.

The Guide can be downloaded (in PDF format) from the JISC PoWR Web site. The Guide is also hosted on the JISCPress service, which provides a commenting and annotation capability. It has also been published on the Lulu.com print-on-demand service, where it can be bought for £2.82 plus postage and packing.

Posted in Archiving, trainingmaterials | 1 Comment »

What To Do When a Service Provider Closes

Posted by Marieke Guy on July 5th, 2010

A discussion between the Digital Preservation Coalition and the Digital Curation Centre has led to a new UKOLN Cultural Heritage briefing paper on ‘What To Do When a Service Provider Closes‘.

The original notes for the briefing paper (written by William Kilbride of the DPC) were offered in response to a cry for help from someone working for a not-for-profit organisation that is closing down due to the recession. They were looking for some guidance on what to do with their digital collections.

The briefing paper offers a seven-point checklist presenting some steps that creators and managers of community digital archives might take to make sure that their data is available in the long term.

The key suggestions are:

  1. Keep the Masters
  2. Know What’s What
  3. There Should be a Disaster Plan
  4. Agree a Succession Plan
  5. Know Your Rights
  6. Find a Digital Preservation Service
  7. Put a Copy of your Web Site in a Web Archive

William commented on the guidance:

This will be a growing area of business – and it’s illustrative of the gap between the advice that people need and the advice that’s out there.

Hopefully the Beginner’s Guide to Digital Preservation will be able to put more people in touch with information that helps them sort out their very immediate digital preservation problems.

The guide is available in Word and as a Web page.

There are now 82 Cultural Heritage briefing papers available on the UKOLN Cultural Heritage Web site. The papers are concise and clear and make excellent training materials. Several of them relate to digital preservation.

Posted in Project news | 1 Comment »

DPTP Web Archiving Workshop

Posted by Marieke Guy on July 1st, 2010

On Monday I attended the Digital Preservation Training Programme (DPTP) Web archiving workshop.

The Digital Preservation Training Programme

The Digital Preservation Training Programme (DPTP) was initially a project funded by the JISC under its Digital Preservation and Asset Management programme. It has been led by ULCC with input from the Digital Preservation Coalition, Cornell University and the British Library.

The programme offers a modular training programme with content aimed at multiple levels of attendee. It builds on the foundations of Cornell’s Digital Preservation Management Workshop.

The DPTP team currently run a 3-day digital preservation course, of which Web archiving is a module. However, this was the first time they had offered the module independently of the rest of the course. I believe there are intentions to offer more one-off modules in the future. They are also planning to offer more content online, both freely under a Creative Commons licence and as part of a paid-for course. This work is still in the development stage.

For the Web archiving workshop they had squeezed the module into half a day, which made for some rushing of content and a late finish. I have a feeling they will be rethinking their timetable: there was way too much content for 3 hours, and a longer workshop would allow more time for networking and group activities.

Approach

The team (Ed Pinsent and Patricia Sleeman) started off by introducing the three-legged stool approach (borrowed from Cornell). This approach sees understanding of digital preservation as requiring consideration of 3 main areas: technology, organisation and resources. While technology used to be seen as the silver bullet, these days achieving good digital preservation is much more about planning (the organisation and resource legs). The Web archiving module considered primarily issues relating to the technology and organisation legs.

At the start of the workshop the DPTP team were upfront about the approach they wanted to take and what they wanted from attendees. They explained that they were not there to promote ‘one right way’ but to offer an explanation of the current situation and then allow us to make the decisions. They were keen to encourage interaction and informal question asking – “there are no stupid questions”.

Content

The content of the day was really useful; the team trod a nice line between covering cultural issues and the technologies that enable archiving. I won’t go into what I learnt here (that’s the content of another blog post), but despite being fairly familiar with Web archiving I found there was lots of new information to digest. Not only did I learn from the team but I also learnt by chatting to others interested in Web archiving. This form of focused networking can be hugely beneficial. For example, the person sitting next to me had been charged with acquiring some of the Government Web sites that are for the chop as part of the (up to) 75% cut. His current big concern was what you do about domain names. We had lots to discuss.

There were also attendees from outside the public sector (such as the lady from a commercial bank), who offered a different perspective on issues; it was refreshing to spend time with them.

Late in the morning we heard more formally about Web archiving from a guest speaker, Dave Thompson from the Wellcome Trust. Dave spoke about their archive of medicine related Web sites created through the UK Web Archive. Many of the sites they collect (such as personal accounts of experience of illness) are out of scope for normal preservation programmes. The Wellcome Trust don’t mediate the content of the Web sites collected, as Dave explained it’s not the job of the archivist (or librarian) to do so. For example, there are books on Witchcraft and quackery in the Wellcome library. It’s the job of the archive to preserve and make available these source materials; it’s the job of the historian or researcher to interpret them. The archive will provide a valuable record for our future researchers.

Dave ended with a quotation from Adrian Sanders, Liberal Democrat MP for Torbay. As part of the debate on the future for local and regional media, Sanders had said that he thought that “Most of what’s online is indeed tittle tattle and opinion.” Dave observed that such an opinion from a member of parliament was extremely worrying: many still fail to understand the value of the Web and the value of preserving it. Tittle tattle and opinion is what our papers consist of (and we preserve them), and it is ultimately what history is made of.

Overall

I really enjoyed the workshop and would thoroughly recommend it to anyone who needs practical advice to get them started planning a Web archiving project. The speakers were excellent, both knowledgeable and receptive to the information needs of the audience. As Ed explained at the start, “I learn a lot from you too”.

My only criticism of the day is that, due to over-running, some of the slides were missed out. Also, technology problems meant that the team were unable to play the screencast of HTTrack doing its stuff. I think the screencast would make a valuable contribution to the resources offered. The resources available, both online and off, were excellent; we even received a great certificate for completing the course. Something to hang on my wall!

Posted in training | 3 Comments »

Creating Open Training Materials

Posted by Marieke Guy on June 24th, 2010

Yesterday I attended the Open University annual Learning and Technology conference: Learning in an open world.

I’ve talked more about my general feelings on the day in another blog post (Learning at an Online Conference), but here I want to focus on the content of one particular session: Creating Open Courses, presented by Tony Hirst of the Open University (you can watch a playback of the session in Elluminate). During my time working on the JISC Beginner’s Guide to Digital Preservation I’ve been thinking quite a bit about what it means to create open training/learning materials, and Tony’s approach struck a chord with me.

Tony’s slides are available on Slideshare and embedded below.



Tony’s talk focused around his creation of the T151 course, an OU module on Digital Worlds, part of the Relevant Knowledge programme.

Tony talked about how the OU are making their print content open through services like OpenLearn and their AV material through YouTube and iTunesU. However, while this is happening, the mode of production is not necessarily open: he explained that producing a course can take several years, and that it can take 5 to 10 academics 18 months to write one.

Tony wanted to move away from this approach and write the T151 course in public and virtually in real time – 10 weeks of content in 20 weeks. He did so by writing blog posts. The course actually took about 15 weeks to write.

Tony made the choice to use WordPress primarily because of the restrictions on what you can embed; in this way it is similar to Moodle (the open-source VLE that the OU and others are familiar with). This is an interesting approach. I am leaning heavily towards WordPress for the final delivery of the Beginner’s Guide (primarily due to time constraints and the fact that I already have experience using WordPress). The restrictions are sometimes a hindrance to me rather than a benefit! Another reason he chose WordPress was that it “gives you RSS with everything” – agreed, this can be a real bonus.
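The “RSS with everything” point is easy to demonstrate: every WordPress blog exposes its posts as an RSS 2.0 feed (at /feed/ by default), which any standard XML parser can consume. The sample feed below is hand-made for illustration, with made-up post titles and URLs; only the RSS 2.0 element names are real.

```python
# Sketch: consuming the RSS 2.0 feed that WordPress exposes for every blog,
# using only the Python standard library. The sample XML is hand-made.
import xml.etree.ElementTree as ET

sample_feed = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Digital Worlds</title>
    <item>
      <title>Week 1: What is a game?</title>
      <link>http://example.org/t151/week-1</link>
      <pubDate>Mon, 21 Jun 2010 09:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Week 2: Game mechanics</title>
      <link>http://example.org/t151/week-2</link>
      <pubDate>Mon, 28 Jun 2010 09:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""

def feed_items(xml_text):
    """Return (title, link) pairs for every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

for title, link in feed_items(sample_feed):
    print(title, "->", link)
```

This is exactly why a feed makes course content remixable: anything that speaks RSS (an aggregator, another blog, a VLE widget) can pull the posts in without knowing anything about WordPress itself.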

Tony then wrote blog posts on a series of topics according to a curriculum developed with other academics. He used a FreeMind mind map to get his ideas down; each blog post was made up of 500-1000 words and took 1-4 hours to write. The end result would take students 10 minutes to 1 hour to work through. Within his posts Tony embedded YouTube movies and other external services. The end result was not a single fixed linear narrative but an emergent narrative. He used GraphViz visualisation to show reverse trackbacks, where posts reference previous posts.

The blog also contained questions, readings and links to other relevant content. The idea was that each area could be populated from a live feed maintained by someone else. Tony felt the important thing was to allow students to explore and do (e.g. use GameMaker to build a game and submit it), share (using Moodle forums) and demonstrate.

Tony wanted to get away from the idea that there’s a single route through the course and that the educator is expressing the one true answer. The students were also provided with a Skunkworks area in a wiki and a FreeMind mind map of all the resources in the course. Assessment was given through short questions and a larger question: they had to write a game design document for a game. He was looking for students to have opportunities to surprise.

In the Q&A Tony talked about how he had written the course while trying to do 101 other things at the same time, and how a lot of the course chunks he would write for multiple reasons – this seems to be the approach I’m currently taking. Tony concluded by saying that creating the course was in part a travelogue: his journey through that material.

How good to hear the approach that I’m currently taking (or trying to take) being endorsed!

Tony has written more about his approach on http://blog.ouseful.info and is very vocal on Twitter as @psychemedia.

Posted in Events, trainingmaterials | Comments Off

LiWA Launches First Code

Posted by Marieke Guy on June 14th, 2010

Today the LiWA (Living Web Archives) project announced the release of the first open-source components of the “liwa-technologies” project on Google Code.

The LiWA project is looking

beyond the pure “freezing” of Web content snapshots for a long time, transforming pure snapshot storage into a “Living” Web Archive. “Living” refers to a) long term interpretability as archives evolve, b) improved archive fidelity by filtering out irrelevant noise and c) considering a wide variety of content.

They plan to extend the current state of the art and develop the next generation of Web content capture, preservation, analysis, and enrichment services to improve fidelity, coherence, and interpretability of web archives.

This is the first release of the software so they are keen to receive feedback and comments.

Posted in Archiving, Web | Comments Off

Have you got a Case Study for Us?

Posted by Marieke Guy on June 10th, 2010

If you are involved in a JISC project (or work in a similar environment) and would like to offer us a case study of your digital preservation methods please do get in touch.

Some areas that you might want to include in your case study are:

  • The background to your project.
  • A description of the digital preservation problem being addressed.
  • An explanation of the approach taken.
  • A summary of any problems experienced.
  • An explanation of the things you would do differently today, based on the experience you have gained.
  • References
  • Contact Details

Posted in Project news | 1 Comment »

Preserving and the Current Economy

Posted by Marieke Guy on June 8th, 2010

Yesterday David Cameron warned the British public of what he called the “inevitably painful times that lie ahead”. His speech referred to the spending cuts seen as necessary to reduce the £70 billion debt the UK currently has. A few weeks ago the Coalition government unveiled their first round of spending cuts, and the budget on 22 June is likely to lower the axe again. The Department for Business Innovation & Skills (BIS) has the Higher Education budget down for £200 million in efficiencies.

It is inevitable that a number of organisations will close and many projects will come to an end. The British Educational Communications and Technology Agency (Becta), the organisation which promotes the use of technology in schools, was one of the first to go.

So what role will digital preservation and access play in the current economic and fiscal situation?

Digital preservation is more important than ever in a time when the wealth of what JISC and other government funded organisations have created could potentially slip away.

After the closure of Becta was announced there was much discussion on Twitter about what would happen to their Web site and their intellectual assets. Some of their work will be carried on by other government organisations, and it’s likely that these resources will be transferred over to other sites and databases. Their Web site is currently one of those preserved by the National Archives; however, there are still questions over what else will be preserved and the processes that will take place. Will they mothball their Web site? What other Web resources will they save? They would do well to consult the JISC Preservation of Web Resources handbook.

It will be an interesting case study to watch.

However, it is not only government digital objects that are at risk. Those of commercial companies are unlikely to stand the test of time either.

In response to this, the UK Web Archive have created a collection for the recession. The Credit Crunch Collection, initiated in July 2008, contains records of high-street victims of the recession, including Woolworths and Zavvi.

There is also a worry that many digital records from bankrupt companies will disappear in the haste to sell off assets. On his blog a records manager explains how in the past archivists have waded in to save companies’ records in a form of “rescue” archiving. However, “When a modern business goes bust, many of its records will exist only in electronic form…. The inheriting organisation will always be under pressure to take the easiest and cheapest way to dispose of a predecessor’s assets, which in practice probably means that data will be wiped and the hardware sold on.”

It seems that much is likely to be lost in the next few years if we aren’t careful.

Posted in Project news, Web | Comments Off

What is Digital Preservation?

Posted by Marieke Guy on June 4th, 2010

The first question I asked myself when I began researching the JISC Beginner’s Guide to Digital Preservation is “what exactly is digital preservation?”.

The experts have put a lot of effort into clarifying this area, and a good working definition for the purposes of this guide is:

The series of managed activities necessary to ensure continued access to digital materials for as long as necessary.

This definition comes from the Digital Preservation Coalition (DPC) Definitions and Concepts list and I feel it works because it is clear and specific.

Let’s look at it a little closer:

  • Managed – Digital preservation is a managerial problem. All activities (the planning, resource allocation, use of technologies, etc.) need to have been thought about and take place for a reason. The term managed stresses the need for a policy.
  • Activities – The policy needs to filter down to a list of processes: tasks that can take place at specified times and in specified ways.
  • Necessary – We are looking at what needs to be done. In your policy you will have looked at how long you want to preserve the objects for. Necessary refers to the activities needed to achieve a specified level of preservation. There may be other useful activities, but we want to look at the most essential ones here.
  • Continued Access – Access is the key here. Most objects in the public sphere are preserved to enable access and retrieval. How long this access is needed will have been discussed and should be defined in your policy.
  • Digital Materials – Digital materials, digital objects, call them what you will. This is the stuff you are preserving. Different objects require different processes.

Other useful definitions are available from DigitalPreservationEurope (DPE), the Digital Curation Centre (DCC), the ALCTS Preservation and Reformatting Section (Working Group on Defining Digital Preservation) and Wikipedia. Note that digital curation tends to refer more to science/research data.

Many organisations choose to qualify their definition of digital preservation with three terms of preservation:

  • Long-term preservation – Continued access to digital materials, or at least to the information contained in them, indefinitely.
  • Medium-term preservation – Continued access to digital materials beyond changes in technology for a defined period of time but not indefinitely.
  • Short-term preservation – Access to digital materials either for a defined period of time while use is predicted but which does not extend beyond the foreseeable future and/or until it becomes inaccessible because of changes in technology.

For JISC projects it will normally be required that digital objects are preserved for the medium or long term.

A really useful slideshow introduction to digital preservation was written by Michael Day, UKOLN and is available on Slideshare.



Posted in definition | Comments Off

Rolling Back the Years

Posted by Marieke Guy on June 3rd, 2010

I’ve been having a little play with MementoFox, a Firefox add-on that “links resources with their previous versions automatically, so you can see the web as it was in the past”.

Once you have installed the add-on, a little slider bar is added to your Firefox Web browser. When browsing any Web site you can use the slider bar to select a date on which you’d like to see the page shown. Memento will then look for the closest archived copy available. As you can see, I have used MementoFox on the UKOLN home page.

Below is the page from around the time I started working at UKOLN – 10 years ago! The page here is taken from the Internet Archive Wayback Machine.

UKOLN Home page in 2000

And here is the page as it is now.

UKOLN Home page in 2010

I initially used version 0.8.6 of MementoFox and had a few problems viewing embeds (of video, slides, etc.) in blogs in Firefox. Version 0.8.7 seems to have sorted this out.

The Memento Project Web site is definitely worth taking a look at. There are various time-travelling scenarios and walkthroughs and more information on where the project is going. The project “wants to make it as straightforward to access the Web of the past as it is to access the current Web”.

Memento slider bar

At this point, there aren’t any formal technical specifications detailing the Memento framework but we will get to that. For now, the information on this site should provide quite a good insight into how Memento is trying to change the Web by adding a time dimension to its most common protocol, HTTP…If you are interested in establishing a Web with a memory, please join the Memento Development Group.
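For the curious, the mechanism Memento proposes (and later formalised in RFC 7089) is an Accept-Datetime HTTP request header: the client names the moment in time it wants, and a “TimeGate” redirects it to the closest archived copy. Below is a sketch of building such a header with the Python standard library; the request is illustrative only and is never actually sent.

```python
# Sketch of Memento's time dimension for HTTP: ask for a past version of a
# resource by sending an Accept-Datetime header. Values use the RFC 1123
# date format, always expressed in GMT.
from datetime import datetime, timezone
from email.utils import format_datetime

def memento_headers(when: datetime) -> dict:
    """Build the request header asking for a resource as it was at `when`."""
    stamp = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return {"Accept-Datetime": stamp}

headers = memento_headers(datetime(2000, 6, 1, tzinfo=timezone.utc))
print(headers["Accept-Datetime"])  # Thu, 01 Jun 2000 00:00:00 GMT
```

A MementoFox slider position is, in effect, just a convenient way of setting this one header for every request.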

Maybe in the future we’ll be able to switch our ‘time-versions’ of Web pages as easily as we switch our blog themes.

Posted in Project news | Comments Off