The DPTP workshop on Web archiving I attended a few weeks back was a great introduction to the main tools out there for capturing web resources.
The four big players for the HE sector are listed in the new JISC PoWR Guide to Web Preservation (Appendix C, p41). I’ve used some of the JISC PoWR explanation here and added in some thoughts of my own.
These capture tools, or Web harvesting engines, are essentially web search engine crawlers with special processing abilities that extract specific fields of content from web pages. They harvest, capture and gather content, much as Google's crawler does. You provide them with a seed (the site to capture) and let them loose!
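To give a very rough idea of the seed-and-crawl approach (this is only an illustration, not how Heritrix or any of the tools below are actually implemented), here is a minimal Python sketch that starts from a seed URL, follows links within the same site, and keeps a "visited" set so it doesn't loop. The seed URL is a placeholder:

```python
# Minimal seed-based crawler sketch (illustrative only, not production code).
import urllib.request
from urllib.parse import urljoin, urlparse
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=50):
    """Fetch pages starting from the seed, staying on the same host."""
    host = urlparse(seed).netloc
    to_visit, visited = [seed], set()
    while to_visit and len(visited) < max_pages:
        url = to_visit.pop()
        if url in visited:
            continue          # the visited set is what stops the crawler looping
        visited.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue          # skip pages that fail to download
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == host:
                to_visit.append(absolute)
    return visited

if __name__ == "__main__":
    # Placeholder seed URL for illustration.
    print(crawl("http://example.com/"))
```

The real tools add a great deal on top of this (politeness rules, deduplication, archival file formats), but the seed, the frontier of URLs to visit and the visited set are the core of the approach.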
Heritrix
Heritrix is a free, open-source, extensible, archival-quality web crawler. It was developed, and is used, by the Internet Archive and is freely available for download and use in web preservation projects under the terms of the GNU GPL. It is implemented in Java, and can therefore run on any system that supports Java (Windows, Apple, Linux/Unix). Archive-It uses this capture tool, and the Web Curator Tool can be used as a front end to it, supporting key processes such as permissions, job scheduling, harvesting, quality review, and the collection of descriptive metadata. Heritrix can be downloaded from Sourceforge.
HTTrack
HTTrack is a free offline browser utility, available to use and modify under the terms of the GNU GPL. Distributions are available for Windows, Apple, and Linux/Unix. It enables the download of a website from the Internet to a local directory, capturing HTML, images, and other files from the server, and recursively building all directories locally. It can arrange the original site’s relative link structure so that the entire site can be viewed locally as if online. It can also update an existing mirrored site, and resume interrupted downloads. Like many crawlers, HTTrack may in some cases experience problems capturing parts of websites, particularly those that rely on Flash, Java, JavaScript, or complex CGI.
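HTTrack also ships with a command-line client, so captures can be scripted. A minimal sketch of driving it from Python, assuming the httrack binary is installed and on the PATH (the URL and output directory are placeholders):

```python
# Sketch: driving the httrack command-line client from Python.
# Assumes httrack is installed and on the PATH; URL and paths are placeholders.
import subprocess

def mirror_site(url, output_dir):
    """Mirror a website into a local directory using HTTrack's basic invocation."""
    subprocess.run(
        ["httrack", url, "-O", output_dir],  # -O sets the local mirror directory
        check=True,
    )

if __name__ == "__main__":
    mirror_site("http://example.com/", "/tmp/example-mirror")
```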
Wget
GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP. It is a non-interactive command line tool, so it can easily be used with other scripts, or run automatically at scheduled intervals. It is freely available under the GNU GPL and versions are available for Windows, Apple and Linux/Unix.
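Because Wget is non-interactive, a scheduled capture can be as simple as one command wrapped in a script (run from cron or any scheduler). A sketch, using standard Wget options; the URL and output directory are placeholders:

```python
# Sketch: running a Wget capture from Python, e.g. on a schedule.
# Assumes wget is installed; the URL and output directory are placeholders.
import subprocess

def capture(url, output_dir):
    """Mirror a site with Wget, rewriting links so the copy can be browsed offline."""
    subprocess.run(
        [
            "wget",
            "--mirror",            # recursive download with timestamping
            "--page-requisites",   # also fetch images, CSS, etc. needed to render pages
            "--convert-links",     # rewrite links to point at the local copies
            "--wait=1",            # be polite: pause between requests
            "-P", output_dir,      # save everything under this directory
            url,
        ],
        check=True,
    )

if __name__ == "__main__":
    capture("http://example.com/", "/tmp/example-archive")
```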
DeepArc
DeepArc was developed by the Bibliothèque Nationale de France to archive objects from database-driven deep websites (particularly documentary gateways). It uses a database to store object metadata, while storing the objects themselves in a file system. Users are offered a form-based search interface where they may input keywords to query the database. DeepArc has to be installed by the web publisher, who maps the structure of the application database to the DeepArc target data model. DeepArc will then retrieve the metadata and objects from the target site. DeepArc can be downloaded from Sourceforge.
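I don't have DeepArc's actual data model to hand, but the general pattern it uses (metadata in a database, the objects themselves on the file system) looks roughly like this hypothetical sketch; none of the table or function names below come from DeepArc:

```python
# Hypothetical sketch of the "metadata in a database, objects on the file system"
# pattern described above. This is NOT DeepArc's actual schema or API.
import shutil
import sqlite3
from pathlib import Path

ARCHIVE_DIR = Path("/tmp/deep-archive")       # placeholder object store
DB_PATH = ARCHIVE_DIR / "metadata.db"

def init_store():
    """Create the object store directory and the metadata table."""
    ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS objects "
        "(id INTEGER PRIMARY KEY, title TEXT, keywords TEXT, path TEXT)"
    )
    return conn

def archive_object(conn, source_file, title, keywords):
    """Copy the object into the file store and record its metadata in the database."""
    dest = ARCHIVE_DIR / Path(source_file).name
    shutil.copy(source_file, dest)
    conn.execute(
        "INSERT INTO objects (title, keywords, path) VALUES (?, ?, ?)",
        (title, keywords, str(dest)),
    )
    conn.commit()

def search(conn, keyword):
    """Form-style keyword query against the metadata table."""
    return conn.execute(
        "SELECT title, path FROM objects WHERE keywords LIKE ?", (f"%{keyword}%",)
    ).fetchall()
```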
There is a useful definition of Web archiving on Wikipedia with details of some other on-demand tools including WebCite, Hanzo, Backupmyurl, the Web Curator tool, Freezepage, Web site Archive and Page Freezer.
Some issues to consider
When choosing one of these tools, there are a few issues you will want to consider.
- Where do you want the point of capture to be? Within the authoring system or server, at the browser, or via a crawler? Most Web capture tools use the crawler approach.
- Do you wish to ignore robots.txt? You can set a capture tool to ignore this file, but is it ethical to do so? (There is a short sketch of checking robots.txt after this list.)
- What about managing authority? What do you do about sites that you do not own?
- How does your capture tool deal with tricky areas like databases, data feeds, subscription/login areas and the deep Web?
- What sort of exclusion filters will you want to use? You’ll want to avoid too much ‘collateral harvesting’, i.e. gathering content that isn’t needed (see the sketch below).
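For the robots.txt and exclusion-filter points above, Python's standard library already includes a robots.txt parser, and a simple URL filter can cut down collateral harvesting. A sketch, in which the site URL, user agent and exclusion patterns are all placeholders:

```python
# Sketch: respecting robots.txt and applying simple exclusion filters before fetching.
# The site URL, user agent and exclusion patterns are placeholders.
from urllib.robotparser import RobotFileParser

EXCLUDE_SUBSTRINGS = ["/calendar/", "/search?", "?sort="]  # avoid collateral harvesting

robots = RobotFileParser("http://example.com/robots.txt")
robots.read()

def should_fetch(url, user_agent="my-archive-bot"):
    """Fetch only URLs that robots.txt allows and that no exclusion filter matches."""
    if not robots.can_fetch(user_agent, url):
        return False
    return not any(pattern in url for pattern in EXCLUDE_SUBSTRINGS)

if __name__ == "__main__":
    print(should_fetch("http://example.com/about/"))        # expected: True
    print(should_fetch("http://example.com/calendar/2010/"))  # excluded by filter
```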
Harvesting tools can go wrong: you need to be careful when configuring them and avoid settings that can send them into a loop, crawling the same pages endlessly.
Further Resources
NetPreserve – Toolkit for setting up a Web archiving chain
DCC – Web Archiving Tools