JISC Beginner's Guide to Digital Preservation

…creating a pragmatic guide to digital preservation for those working on JISC projects

Archive for July, 2010

Web Archiving: Tools for Capturing

Posted by Marieke Guy on 28th July 2010

The DPTP workshop on Web archiving I attended a few weeks back was a great introduction to the main tools out there for capturing web resources.

The four big players for the HE sector are listed in the new JISC PoWR Guide to Web Preservation (Appendix C, p41). I’ve used some of the JISC PoWR explanation here and added in some thoughts of my own.

These capture tools, or Web harvesting engines, are essentially web search engine crawlers with special processing abilities that extract specific fields of content from web pages. They do harvesting, capturing and gathering, much like Google’s own crawler. You provide them with a seed (the site to capture) and let them loose!
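To make that idea concrete, here is a minimal sketch (not taken from any of the tools below) of a seed-based crawl using only the Python standard library. The seed URL, the crawl and LinkParser names and the depth limit are all illustrative choices; a real harvester would add politeness delays, robots.txt handling and proper archival storage.

    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collect the href value of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed, max_depth=2):
        """Breadth-first crawl from a seed URL, staying on the seed's host."""
        host = urlparse(seed).netloc
        visited = set()                # stops the crawler re-fetching pages (and looping)
        frontier = [(seed, 0)]
        while frontier:
            url, depth = frontier.pop(0)
            if url in visited or depth > max_depth:
                continue
            visited.add(url)
            try:
                page = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue               # skip anything that fails to download
            # A real harvester would write the response to an archive file here.
            parser = LinkParser()
            parser.feed(page)
            for link in parser.links:
                absolute = urljoin(url, link)
                if urlparse(absolute).netloc == host:
                    frontier.append((absolute, depth + 1))
        return visited

    if __name__ == "__main__":
        print(crawl("http://example.com/"))   # example seed only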

Heritrix

Heritrix is a free, open-source, extensible, archival-quality web crawler. It was developed, and is used, by the Internet Archive and is freely available for download and use in web preservation projects under the terms of the GNU GPL. It is implemented in Java, and can therefore run on any system that supports Java (Windows, Apple, Linux/Unix). Archive-It uses Heritrix as its capture tool, and the Web Curator Tool can be used as a front end to it, supporting key processes such as permissions, job scheduling, harvesting, quality review, and the collection of descriptive metadata. Heritrix can be downloaded from Sourceforge.

HTTrack

HTTrack is a free offline browser utility, available to use and modify under the terms of the GNU GPL. Distributions are available for Windows, Apple, and Linux/Unix. It enables the download of a website from the Internet to a local directory, capturing HTML, images, and other files from the server, and recursively building all directories locally. It arranges the original site’s relative link structure so that the entire site can be viewed locally as if online. It can also update an existing mirrored site and resume interrupted downloads. Like many crawlers, HTTrack may in some cases have problems capturing parts of websites, particularly those using Flash, Java, JavaScript, and complex CGI.
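As a rough idea of what driving HTTrack from a script looks like, here is a hedged sketch using Python’s subprocess module. The URL, output directory and site filter are placeholders, and you should check the -O option and the filter syntax against your installed version of HTTrack.

    import subprocess

    # Mirror a site into a local directory with HTTrack.
    # "-O" sets the output path; the trailing "+" filter keeps the crawl on the site itself.
    # The URL, path and filter are illustrative placeholders.
    subprocess.run(
        [
            "httrack",
            "http://www.example.ac.uk/",
            "-O", "/archives/example-site",
            "+*.example.ac.uk/*",
        ],
        check=True,
    )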

Wget

GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP. It is a non-interactive command line tool, so it can easily be used with other scripts, or run automatically at scheduled intervals. It is freely available under the GNU GPL and versions are available for Windows, Apple and Linux/Unix.
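Because it is non-interactive, Wget is easy to wrap in a script and run from a scheduler such as cron. The sketch below is one possible wrapper rather than a recommended configuration: the URL and target directory are placeholders, and although the flags shown are standard Wget options you should confirm them against your installed version.

    import subprocess

    # One-shot capture of a site with GNU Wget, suitable for running from cron
    # or another scheduler. The URL and target directory are placeholders; the
    # flags are standard Wget options, but check them against your version.
    subprocess.run(
        [
            "wget",
            "--mirror",             # recursive download with timestamping
            "--convert-links",      # rewrite links so the copy browses locally
            "--page-requisites",    # also fetch images, CSS and other page assets
            "--wait=1",             # pause between requests to be polite to the server
            "--directory-prefix=/archives/example-site",
            "http://www.example.ac.uk/",
        ],
        check=True,
    )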

DeepArc

DeepArc was developed by the Bibliothèque Nationale de France to archive objects from database-driven deep websites (particularly documentary gateways). It uses a database to store object metadata, while storing the objects themselves in a file system. Users are offered a form-based search interface where they may input keywords to query the database. DeepArc has to be installed by the web publisher who maps the structure of the application database to the DeepArc target data model. DeepArc will then retrieve the metadata and objects from the target site. DeepArc can be downloaded from Sourceforge.

There is a useful definition of Web archiving on Wikipedia with details of some other on-demand tools including WebCite, Hanzo, Backupmyurl, the Web Curator tool, Freezepage, Web site Archive and Page Freezer.

Some issues to consider

When choosing one of these tools there are a few issues you will want to consider.

  • Where do you want the point of capture to be? Within the authoring system or server, at the browser, or by using a crawler – most Web capture tools use the crawler approach.
  • Do you wish to ignore robots.txt? You can set a capture tool to ignore this file, but is it ethical to do so? (See the sketch after this list for a quick way to check what a site’s robots.txt allows.)
  • What about managing authority? What do you do about sites that you do not own?
  • How does your capture tool deal with tricky areas like databases, data feeds, subscription/login areas and the deep Web?
  • What sort of exclusion filters will you want to use? You’ll want to avoid too much ‘collateral harvesting’, i.e. gathering content that isn’t needed.
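On the robots.txt question above, it is easy to check what a site’s robots.txt would allow before deciding whether to override it. The sketch below uses Python’s standard urllib.robotparser module; both URLs and the “MyArchiveBot” user-agent string are illustrative placeholders.

    from urllib.robotparser import RobotFileParser

    # Check whether capturing a given page would be permitted by robots.txt.
    # Both URLs and the user-agent string are illustrative placeholders.
    robots = RobotFileParser("http://www.example.ac.uk/robots.txt")
    robots.read()                  # fetch and parse the robots.txt file

    page = "http://www.example.ac.uk/private/report.html"
    if robots.can_fetch("MyArchiveBot", page):
        print("robots.txt allows capturing", page)
    else:
        print("robots.txt disallows capturing", page)
        # A capture tool can be told to ignore this, but consider the ethics first.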

Harvesting tools can go wrong, so you need to be careful when configuring them and avoid settings that can send them into a loop.

Further Resources

NetPreserve – Toolkit for setting up a Web archiving chain

DCC – Web Archiving Tools

Posted in Archiving, Web | 5 Comments »

Archive-It Webinar

Posted by Marieke Guy on 19th July 2010

I wanted to find out more about the Archive-It service available from the Internet Archive, so I have just watched their pre-recorded online Webinar. Kristine Hanna and the Archive-It team run live informational webinars, but the time difference made it a little tricky for me to attend one of these (they are every 2 weeks at 11:30 am PT).

Many of you will have already heard of the Internet Archive, a non-profit organisation that was founded in 1997 by Brewster Kahle. Its aim is “universal access to human knowledge” and it fulfils this by aggregating a “broad snapshot of the Web every 2 months” in a Web archive/library, which has been built using open source software. It is currently the largest public Web archive in existence, at around 150 billion+ pages collected from 65 million+ Web sites.

In response to partners’ requests for more control over their data, the Internet Archive team developed the Archive-It service.

It is a Web-based application that enables institutions to harvest and archive digital content. They can create focused collections, alter the content scope and frequency of crawls, add metadata to their collections, make their collections private and more. Archived content is available for browsing 24 hours after a crawl has been completed and is full-text searchable within 7 days. Collections are hosted at the Internet Archive data centre, and a subscription includes hosting, access and storage. Collections can be browsed from the Archive-It home page or from a landing page on your institution’s Web site.

Archive-It uses open source technology primarily developed by the Internet Archive and the International Internet Preservation Consortium (IIPC).

The key components are:

  • Heritrix – Web crawler that crawls and captures pages
  • Wayback Machine – access tool for rendering and viewing archived Web pages: surf the Web as it was
  • Nutch – open source search engine providing standard full-text search
  • WARC file – the archival file format used for preservation (a rough sketch of reading one follows below)
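If you are curious about what a WARC file actually contains, the sketch below is a very rough, assumption-laden reader for the record headers of an uncompressed WARC file. The filename is a placeholder, and in practice you would use a proper WARC library rather than hand-rolling a parser like this.

    def iter_warc_headers(path):
        """Yield the header fields of each record in an uncompressed WARC file.

        Rough layout of a WARC record: a 'WARC/1.0' version line, a block of
        named header fields, a blank line, then Content-Length bytes of payload
        followed by two blank lines.
        """
        with open(path, "rb") as f:
            while True:
                line = f.readline()
                if not line:
                    break                          # end of file
                if not line.startswith(b"WARC/"):
                    continue                       # skip blank record separators
                headers = {}
                while True:
                    line = f.readline().rstrip(b"\r\n")
                    if not line:
                        break                      # blank line ends the header block
                    name, _, value = line.partition(b":")
                    headers[name.strip().decode()] = value.strip().decode()
                yield headers
                f.seek(int(headers["Content-Length"]), 1)   # skip the payload

    # Example use (placeholder filename; most real captures are gzipped .warc.gz
    # files, which this crude sketch does not handle):
    for record in iter_warc_headers("example-crawl.warc"):
        print(record.get("WARC-Type"), record.get("WARC-Target-URI"))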

I’m planning to write more about the technologies behind Web archiving when I get a chance.

Archive-It Webinar

The Webinar itself is really clear and easy to follow and lasts about 30 minutes. It begins with an introduction to the Internet Archive and Archive-It and ends with a demo of the Archive-It software. As the webinar shows, there is no software to install or download: all you need is a computer, an internet connection and a browser to be able to access your collection. It looks very easy to use. During a 15-minute demo a new collection is created, metadata added, URLs added, settings specified, the collection crawled and a number of reports created. Although many big institutions use the service, it is designed with smaller collections and more modest infrastructure requirements in mind. A free complimentary trial (for 2–4 weeks: one collection, 10 seeds (sites), up to 500,000 URLs) is offered to webinar attendees.

I was also quite interested in how the Archive-It team are archiving social networking sites, including blogs. By email the team explained that many of their partners have been doing this and that they have been successfully capturing blogs since 2007. ‘Partners’ is the term used for any institution or organisation that uses or interacts with the Archive-It service. Although they are US-based they currently have several European partners (over 125 partners currently use Archive-It) and quite a few in UK HE (35% of their partners are university and public libraries) – see their partners list for more details. Archive-It also have connections with a number of digital preservation systems including LOCKSS, DSpace, CONTENTdm, iRODS and ArchivalWare.

Posted in Archiving | Comments Off

A Guide to Web Preservation

Posted by Marieke Guy on 12th July 2010

Today the JISC Preservation of Web Resources (PoWR) team announced the launch of A Guide to Web Preservation.

I worked on the PoWR project back in 2008. The project organised workshops and produced a handbook that specifically addressed digital preservation issues that were, and still are, relevant to the UK HE/FE web management community. It was a really successful project, but further down the line it was felt that a more accessible, easy-to-use version of the handbook was needed. This new guide does just the trick! I was really pleased to see it being given out as a resource at the DPTP Web Archiving Workshop I attended a few weeks back.

To steal some words from the press release:

This Guide uses similar content to PoWR: The Preservation of Web Resources Handbook but in a way which provides a practical guide to web preservation, particularly for web and records managers. The chapters are set out in a logical sequence and answer the questions which might be raised when web preservation is being seriously considered by an institution. These are:

  • What is preservation?
  • What are web resources?
  • Why do I have to preserve them?
  • What is a web preservation programme?
  • How do I decide what to preserve?
  • How do I capture them?
  • Who should be involved?
  • What approaches should I take?
  • What policies need to be developed?

Each chapter concludes with a set of actions and one chapter lists the tasks which must be carried out, and the timings of these tasks, if an institution is to develop and maintain a web preservation programme. In addition points made in the Guide are illustrated with a number of case studies.

The guide was edited by Susan Farrell who has used her knowledge and expertise in the management of large-scale institutional Web services in writing the document.

The Guide can be downloaded (in PDF format) from the JISC PoWR Web site. It is also hosted on the JISCPress service, which provides commenting and annotation capabilities, and has been published on the Lulu.com print-on-demand service, where it can be bought for £2.82 plus postage and packing.

Posted in Archiving, trainingmaterials | 1 Comment »

What To Do When a Service Provider Closes

Posted by Marieke Guy on 5th July 2010

A discussion between the Digital Preservation Coalition and the Digital Curation Centre has led to a new UKOLN Cultural Heritage briefing paper on ‘What To Do When a Service Provider Closes’.

The original notes for the briefing paper (written by William Kilbride of the DPC) were offered in response to a cry for help from someone working for a not-for-profit organisation that is closing down due to the recession. They were looking for some guidance on what to do with their digital collections.

The briefing paper offers a seven-point checklist presenting some steps that creators and managers of community digital archives might take to make sure that their data is available in the long term.

The key suggestions are:

  1. Keep the Masters
  2. Know What’s What
  3. There Should be a Disaster Plan
  4. Agree a Succession Plan
  5. Know Your Rights
  6. Find a Digital Preservation Service
  7. Put a Copy of your Web Site in a Web Archive

William commented on the guidance:

This will be a growing area of business – and it’s illustrative of the gap between the advice that people need and the advice that’s out there.

Hopefully the Beginner’s Guide to Digital Preservation will be able to put more people in touch with information that helps them sort out their very immediate digital preservation problems.

The briefing paper is available in Word and as a Web page.

There are now 82 Cultural Heritage briefing papers available on the UKOLN Cultural Heritage Web site. The papers are concise and clear and make excellent training materials. Several of them relate to digital preservation.

Posted in Project news | 1 Comment »

DPTP Web Archiving Workshop

Posted by Marieke Guy on 1st July 2010

On Monday I attended the Digital Preservation Training Programme (DPTP) Web archiving workshop.

The Digital Preservation Training Programme

The Digital Preservation Training Programme (DPTP) was initially a project funded by the JISC under its Digital Preservation and Asset Management programme. It has been led by ULCC with input from the Digital Preservation Coalition, Cornell University and the British Library.

The programme offers a modular training programme with content aimed at multiple levels of attendee. It builds on the foundations of Cornell’s Digital Preservation Management Workshop.

The DPTP team currently run a 3-day digital preservation course, of which Web archiving is a module. However, this was the first time they had offered the module independently of the rest of the course. I believe there are intentions to offer one-off modules more often in the future. They are also planning to offer more content online, both freely under a Creative Commons licence and as part of a paid-for course. This work is still in the development stage.

For the Web archiving workshop they had squeezed the module into half a day, which made for some rushing of content and a late finish. I have a feeling they will be rethinking their timetable: there was way too much content for 3 hours, and a longer workshop would allow more time for networking and group activities.

Approach

The team (Ed Pinsent and Patricia Sleeman) started off by introducing the three-legged stool approach (borrowed from Cornell). This approach sees understanding of digital preservation as requiring consideration of three main areas: technology, organisation and resources. While technology used to be seen as the silver bullet, these days achieving good digital preservation is much more about planning (the organisation and resource legs). The Web archiving module considered primarily issues relating to the technology and organisation legs.

At the start of the workshop the DPTP team were upfront about the approach they wanted to take and what they wanted from attendees. They explained that they were not there to promote ‘one right way’ but to offer an explanation of the current situation and then allow us to make the decisions. They were keen to encourage interaction and informal question asking – “there are no stupid questions“.

Content

The content of the day was really useful; the team trod a nice line between covering cultural issues and the technologies that enable archiving. I won’t go into what I learnt here (that’s the content of another blog post), but despite being fairly familiar with Web archiving I found there was lots of new information to digest. Not only did I learn from the team, but I also learnt by chatting to others interested in Web archiving. This form of focused networking can be hugely beneficial. For example, the person sat next to me had been charged with acquiring some of the Government Web sites that are for the chop as part of the (up to) 75% cut. His current big concern was: what do you do about domain names? We had lots to discuss.

There were also attendees from outside the public sector (such as the lady from a commercial bank); they offered a different perspective on issues and it was refreshing to spend time with them.

Late in the morning we heard more formally about Web archiving from a guest speaker, Dave Thompson from the Wellcome Trust. Dave spoke about their archive of medicine-related Web sites created through the UK Web Archive. Many of the sites they collect (such as personal accounts of the experience of illness) are out of scope for normal preservation programmes. The Wellcome Trust don’t mediate the content of the Web sites collected; as Dave explained, it’s not the job of the archivist (or librarian) to do so. For example, there are books on witchcraft and quackery in the Wellcome library. It’s the job of the archive to preserve and make available these source materials; it’s the job of the historian or researcher to interpret them. The archive will provide a valuable record for our future researchers.

Dave ended with a quotation from Adrian Sanders, Liberal Democrat MP for Torbay. As part of the debate on The future for local and regional media, Sanders had said that he thought that “Most of what’s online is indeed tittle tattle and opinion.” Dave observed that such an opinion from a member of parliament was extremely worrying. Many still fail to understand the value of the Web and the value of preserving it. Tittle tattle and opinion is what our papers consist of (and we preserve them) and what ultimately history is made of.

Overall

I really enjoyed the workshop and would thoroughly recommend it to anyone who needs practical advice to get them started on planning a Web archiving project. The speakers were excellent, both knowledgeable and receptive to the information needs of the audience. As Ed explained at the start, “I learn a lot from you too“.

My only criticism of the day is that, due to over-running, some of the slides were missed out. Also, technology problems meant that the team were unable to play the screencast of HTTrack doing its stuff. I think the screencast would make a valuable contribution to the resources offered. The resources available, both online and off, were excellent; we even received a great certificate for completing the course. Something to hang on my wall!

Posted in training | 3 Comments »