Posted by Marieke Guy on July 19th, 2010
I wanted to find out more about the Archive-It service available from the Internet Archive, so I have just watched their pre-recorded online Webinar. Kristine Hanna and the Archive-It team run live informational webinars, but the time difference made it a little tricky for me to attend one of these (they are held every 2 weeks at 11:30 am PT).
Many of you will have already heard of the Internet Archive. It is a non-profit organisation that was founded in 1996 by Brewster Kahle. Its aim is “universal access to human knowledge” and it fulfils this by aggregating a “broad snapshot of the Web every 2 months” in a Web archive/library, which has been built using open source software. It is currently the largest public Web archive in existence, with around 150 billion+ pages collected from 65+ million Web sites.
In response to partners’ requests for more control over their data, the Internet Archive team developed the Archive-It service.
It is a Web-based application that enables institutions to harvest and archive digital content. They can create focused collections, alter the content scope and frequency of crawls, add metadata to their collections, make their collections private and more. Archived content is available for browsing 24 hours after a crawl has been completed and is full text searchable within 7 days. Collections are hosted at the Internet Archive data centre, and a subscription includes hosting, access and storage. Collections can be browsed from the Archive-It home page or from a landing page on your institution’s Web site.
Archive-It uses open source technology primarily developed by the Internet Archive and the International Internet Preservation Consortium (IIPC).
The key components are:
- Heritrix – Web crawler that crawls and captures pages
- Wayback Machine – access tool for rendering and viewing archived Web pages, letting you surf the Web as it was
- Nutch – open source search engine providing standard full text search
- WARC file – archival file format used for preservation
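To give a feel for the WARC format mentioned in the list above, here is a minimal sketch in Python that writes and reads back a very simple, uncompressed WARC record. It is my own illustration and not code from Archive-It or Heritrix. The function names `build_warc_record` and `read_warc_records` are hypothetical, and the sketch assumes the basic record layout (a `WARC/1.0` version line, named header fields, a blank line, then `Content-Length` bytes of content followed by two line breaks). Real WARC files are usually gzip-compressed and carry more header fields, so treat this as a toy.

```python
import io
import uuid


def build_warc_record(target_uri: str, date: str, payload: bytes) -> bytes:
    """Serialise one minimal WARC 'resource' record (illustrative only)."""
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {date}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: text/html",
        f"Content-Length: {len(payload)}",
    ]
    # Header lines end with CRLF; a blank line separates headers from content.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Each record is terminated by two CRLFs after the content block.
    return head + payload + b"\r\n\r\n"


def read_warc_records(stream):
    """Yield (headers, payload) pairs from an uncompressed WARC byte stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # end of stream
        if not version.strip():
            continue  # skip the blank lines that separate records
        if not version.startswith(b"WARC/"):
            raise ValueError(f"unexpected line where a record should start: {version!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line marks the end of the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Read exactly the number of content bytes the header announces.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


if __name__ == "__main__":
    record = build_warc_record(
        "http://example.org/", "2010-07-19T00:00:00Z", b"<html>Hello</html>"
    )
    for headers, payload in read_warc_records(io.BytesIO(record)):
        print(headers["WARC-Target-URI"], len(payload), "bytes")
```

The point of the sketch is simply that each captured resource is stored together with metadata about where and when it was captured, which is what makes the format suitable for preservation.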
I’m planning to write more about the technologies behind Web archiving when I get a chance.
The Webinar itself is really clear and easy to follow and lasts about 30 minutes. It begins with an introduction to the Internet Archive and Archive-It and ends with a demo of the Archive-It software. As the webinar shows, there is no software to install or download. All you need is a computer, an internet connection and a browser to be able to access your collection. It looks very easy to use. During a 15 minute demo a new collection is created, metadata added, URLs added, settings specified, the collection crawled and a number of reports created. Although many big institutions use the service, it is designed with institutions with smaller collections and infrastructure requirements in mind. A complimentary trial (for 2-4 weeks, one collection, 10 seeds (sites), up to 500,000 URLs) is offered to webinar attendees.
I was also quite interested in how the Archive-It team are archiving social networking sites, including blogs. By email the team explained that many of their partners have been doing this and that they have been successfully capturing blogs since 2007. “Partners” is the term used for any institution or organisation that uses or interacts with the Archive-It service. Although they are US based, they currently have several European partners (over 125 partners currently use Archive-It) and quite a few in UK HE (35% of their partners come from university and public libraries) – see their partners list for more details. Archive-It also has connections with a number of digital preservation systems including LOCKSS, DSpace, CONTENTdm, iRODS and ArchivalWare.