JISC Beginner's Guide to Digital Preservation

…creating a pragmatic guide to digital preservation for those working on JISC projects

Web Archiving: Tools for Capturing

Posted by Marieke Guy on July 28th, 2010

The DPTP workshop on Web archiving I attended a few weeks back was a great introduction to the main tools out there for capturing web resources.

The four big players for the HE sector are listed in the new JISC PoWR Guide to Web Preservation (Appendix C, p41). I’ve used some of the JISC PoWR explanation here and added in some thoughts of my own.

These capture tools, or Web harvesting engines, are essentially web search engine crawlers with special processing abilities that extract specific fields of content from web pages. They harvest, capture and gather content in much the same way as Google's crawler: you provide them with a seed (the site to capture) and let them loose!
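As a rough illustration of what goes on under the hood, here is a minimal sketch of a seed-based crawler in Python. The seed URL, page limit and scoping rule are placeholders, and real harvesters such as Heritrix add politeness rules, richer scoping and archival-format output on top of this basic loop.

```python
# Minimal sketch of a seed-based harvester: fetch a page, keep it,
# follow in-scope links. Seed URL and limits are placeholders.
import urllib.parse
import urllib.request
from html.parser import HTMLParser

SEED = "http://www.example.org/"   # hypothetical seed site
MAX_PAGES = 50                     # crude scope limit


class LinkParser(HTMLParser):
    """Collect href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed):
    seen, queue, pages = set(), [seed], {}
    while queue and len(pages) < MAX_PAGES:
        url = queue.pop(0)
        if url in seen:
            continue               # the 'visited' set keeps the crawler out of loops
        seen.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue               # skip unreachable or non-text resources
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urllib.parse.urljoin(url, href)
            if absolute.startswith(seed):   # stay within the seed site
                queue.append(absolute)
    return pages


if __name__ == "__main__":
    print(f"Captured {len(crawl(SEED))} pages")
```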

Heritrix

Heritrix is a free, open-source, extensible, archival-quality web crawler. It was developed, and is used, by the Internet Archive and is freely available for download and use in web preservation projects under the terms of the GNU GPL. It is implemented in Java, and can therefore run on any system that supports Java (Windows, Apple, Linux/Unix). Archive-It uses this capture tool, and the Web Curator Tool can be used as a front end to it, supporting key processes such as permissions, job scheduling, harvesting, quality review, and the collection of descriptive metadata. Heritrix can be downloaded from Sourceforge.
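For illustration only, here is a hedged sketch of driving an existing Heritrix 3 crawl job from Python over the engine's REST interface. The host, port, credentials and job name are assumptions, and the action names follow the Heritrix 3 documentation, so check them against your own installation before relying on this.

```python
# Sketch of controlling a Heritrix 3 crawl job via its REST API.
# Host, port, credentials and job name are placeholders; the API is
# served over HTTPS with digest authentication and (on a default
# install) a self-signed certificate, hence verify=False below.
import requests
from requests.auth import HTTPDigestAuth

ENGINE = "https://localhost:8443/engine"          # assumed default engine URL
AUTH = HTTPDigestAuth("admin", "admin-password")  # placeholder credentials
JOB = "my-archive-job"                            # hypothetical, already-created job


def job_action(action):
    """POST one of Heritrix's documented job actions (build, launch, unpause, ...)."""
    resp = requests.post(
        f"{ENGINE}/job/{JOB}",
        data={"action": action},
        auth=AUTH,
        verify=False,                             # self-signed certificate
        headers={"Accept": "application/xml"},
    )
    resp.raise_for_status()
    return resp


if __name__ == "__main__":
    for step in ("build", "launch", "unpause"):   # unpause starts the crawl proper
        job_action(step)
```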

HTTrack

HTTrack is a free offline browser utility, available to use and modify under the terms of the GNU GPL. Distributions are available for Windows, Apple, and Linux/Unix. It enables the download of a website from the Internet to a local directory, capturing HTML, images, and other files from the server, and recursively building all directories locally. It can arrange the original site's relative link structure so that the entire site can be viewed locally as if online. It can also update an existing mirrored site, and resume interrupted downloads. Like many crawlers, HTTrack may in some cases have problems capturing parts of websites, particularly those that rely heavily on Flash, Java, JavaScript, or complex CGI.
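HTTrack also ships with a command-line binary, so a mirror can be scripted. The sketch below simply calls an installed httrack with its basic options; the site URL, output directory and filter pattern are placeholders, and the options are worth checking against `httrack --help` on your own system.

```python
# Sketch: scripting an HTTrack mirror from Python. Assumes the
# `httrack` binary is installed and on the PATH; the URL, output
# directory and filter pattern are placeholders.
import subprocess

SITE = "http://www.example.org/"     # hypothetical site to mirror
OUTPUT_DIR = "/var/mirrors/example"  # where the local copy is built

command = [
    "httrack", SITE,
    "-O", OUTPUT_DIR,                # output (mirror) directory
    "+*.example.org/*",              # filter: stay within this domain
    "-v",                            # verbose progress output
]
subprocess.run(command, check=True)
```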

Wget

GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP. It is a non-interactive command line tool, so it can easily be used with other scripts, or run automatically at scheduled intervals. It is freely available under the GNU GPL and versions are available for Windows, Apple and Linux/Unix.
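Because Wget is non-interactive, it is easy to wrap in a script or a scheduled job (cron, for example). A minimal sketch, with a placeholder URL and output directory and standard mirroring options, might look like this:

```python
# Sketch: a scheduled Wget capture wrapped in Python. The target URL
# and output directory are placeholders.
import subprocess

TARGET = "http://www.example.org/"    # hypothetical site to capture
ARCHIVE_DIR = "/var/archives/example"

subprocess.run([
    "wget",
    "--mirror",            # recursive download with timestamping
    "--convert-links",     # rewrite links so the copy browses locally
    "--page-requisites",   # also fetch images, CSS and other page assets
    "--no-parent",         # do not wander above the start directory
    "--wait=1",            # be polite: pause between requests
    "--directory-prefix", ARCHIVE_DIR,
    TARGET,
], check=True)
```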

DeepArc

DeepArc was developed by the Bibliothèque Nationale de France to archive objects from database-driven deep websites (particularly documentary gateways). It uses a database to store object metadata, while storing the objects themselves in a file system. Users are offered a form-based search interface where they may input keywords to query the database. DeepArc has to be installed by the web publisher who maps the structure of the application database to the DeepArc target data model. DeepArc will then retrieve the metadata and objects from the target site. DeepArc can be downloaded from Sourceforge.

There is a useful definition of Web archiving on Wikipedia with details of some other on-demand tools including WebCite, Hanzo, Backupmyurl, the Web Curator tool, Freezepage, Web site Archive and Page Freezer.

Some issues to consider

When choosing one of these tools there are a few issues you will want to consider.

  • Where do you want the point of capture to be? Within the authoring system or server, at the browser, or via a crawler? Most Web capture tools use the crawler approach.
  • Do you wish to ignore robots.txt? You can set a capture tool to ignore this file, but is it ethical to do so?
  • What about managing authority? What do you do about sites that you do not own?
  • How does your capture tool deal with tricky areas like databases, datafeeds, subscription/login areas and the deep Web?
  • What sort of exclusion filters will you want to use? You’ll want to avoid too much ‘collateral harvesting’, i.e. gathering content that isn’t needed. A minimal robots.txt and exclusion-filter check is sketched after this list.
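As a rough sketch of the robots.txt and exclusion-filter points above, the snippet below checks robots.txt with Python's standard urllib.robotparser and applies simple wildcard exclusion patterns before a URL would be queued for capture. The site, user agent and patterns are placeholders.

```python
# Sketch: respecting robots.txt and applying exclusion filters before
# queueing a URL for capture. Site, user agent and patterns are placeholders.
import fnmatch
import urllib.robotparser

SITE = "http://www.example.org"
EXCLUDE_PATTERNS = [      # avoid 'collateral harvesting'
    "*/calendar/*",       # endless auto-generated pages
    "*.zip",              # large downloads out of scope
    "*.iso",
]

robots = urllib.robotparser.RobotFileParser()
robots.set_url(SITE + "/robots.txt")
robots.read()


def in_scope(url, user_agent="my-archive-bot"):
    """Return True if the URL passes robots.txt and the exclusion filters."""
    if not robots.can_fetch(user_agent, url):
        return False
    return not any(fnmatch.fnmatch(url, pattern) for pattern in EXCLUDE_PATTERNS)


print(in_scope(SITE + "/about.html"))
print(in_scope(SITE + "/downloads/archive.zip"))
```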

Harvesting tools can go wrong: you need to be careful when configuring them and avoid settings that can send them into a loop. One simple precaution is sketched below.
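A common guard, sketched here, is to canonicalise URLs (dropping fragments and noisy query parameters) before recording them as visited, and to cap crawl depth, so that session IDs, calendar pages and circular links cannot trap the crawler. The parameter names treated as noise are purely illustrative.

```python
# Sketch: two simple guards against crawler loops - canonicalising URLs
# before they go into the 'visited' set, and capping crawl depth.
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

MAX_DEPTH = 5
NOISE_PARAMS = {"sessionid", "sort", "utm_source"}   # illustrative only


def canonicalise(url):
    """Drop fragments and noisy query parameters so variants of the
    same page are recognised as a single URL."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k.lower() not in NOISE_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(query), ""))


def should_visit(url, depth, seen):
    """Decide whether a crawler should fetch this URL at this depth."""
    key = canonicalise(url)
    if depth > MAX_DEPTH or key in seen:
        return False
    seen.add(key)
    return True
```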

Further Resources

NetPreserve – Toolkit for setting up a Web archiving chain

DCC – Web Archiving Tools
