JISC Beginner's Guide to Digital Preservation

…creating a pragmatic guide to digital preservation for those working on JISC projects

Archive for the 'Archiving' Category

Where does the future of digital archiving lie?

Posted by Marieke Guy on 27th August 2010

So where does the future of digital archiving lie? According to Steve Bailey it’s in Google’s hands.

This answer has sparked off some discussion on the records management JISCMail list, firstly about whether this is truly the case, and if so what it means. So let’s peel back the discussion and start at the beginning by watching Steve’s excellent talk, given at the 8th European Conference on Digital Archiving, 28 – 30 April 2010, Geneva.

A warning: the talk is excellent but unfortunately the embedded video isn’t very user-friendly and won’t allow you to enlarge it or watch it from a specific point. Any mishap and you’re back to the beginning again. It’s all or nothing, so set aside 20 minutes for this one!

The presented paper (there are no slides) starts off with a hypothetical analogy. Imagine if Samuel Pepys, the 17th century diarist, had had to rely on individual businesses to store and preserve his maps, his notebooks, his vellum manuscripts and so on. These were businesses that dealt with individual formats and had little interest in the content of Pepys’ records. Luckily this wasn’t the case and much of what he wrote has been recorded by the National Archive.

Bailey points out that we now find ourselves in a world where the responsibility for archiving much of our Office 2.0 documents lies at the feet of third parties. Documents are stored according to format, regardless of any commonality of content: text documents are stored on Google Docs, videos on YouTube, photos on Flickr and so on. Although cloud services have brought us much flexibility, they have also opened a Pandora’s box, and ‘no regard for preservation’ is one of the evils that has flown out. They are externally hosted services with very different agendas from ours; they may notify us if they are going to delete all our content, but they don’t necessarily have to do so. The title of Brian Kelly’s post 5 Days Left to Choose a New Ning Plan is enough to show that there may be very little time in which to rescue your digital objects.

And so Bailey concludes that the future of digital archiving lies with Google.

Bailey also outlines this theory in a post on his Records management futurewatch blog – Is the Cloud aware that it has ‘the future of digital archiving in its hands’?

For him it is not a case of whether this is the right place for it to lie, it is just so:

“It is at this point perhaps worth pausing to note that the question I have just offered an answer to is not in whose hands should the future of digital preservation lie, but in whose hands does it lie – a very important distinction indeed.”

At another point he says:

“Once again, I do not say that this is right or wrong, foolish or wise – simply that it appears inevitable and that we would do well to prepare ourselves for it.”

Steve asks us to hold back from lamenting this situation and instead consider engaging in a dialogue with cloud-based service providers. He offers a possible four-point plan that might help us:

  1. Take a risk-management approach to your choice of Web 2.0 services – look at issues like IPR
  2. Consider what to do if your provider closes down – have a back-up strategy
  3. Work with service providers to establish ways of searching information (this covers areas like retention schedules)
  4. Consider asking Google if they are happy to fulfil this role

Much of this rings true with work we have carried out at UKOLN on projects like the JISC Preservation of Web Resources project. The final point is an interesting one though.

Perhaps we should actually stop to ask Google and their peers whether they are indeed aware that the future of digital preservation lies in their hands, and of the responsibilities which come with it, and whether this is a role they are happy to fulfil. For perhaps just as we are in danger of sleepwalking our way into a situation where we have let this responsibility slip through our fingers, so they might be equally guilty of unwittingly finding it has landed in theirs.

If so, might this provide the opportunity for dialogue between the archival professions and cloud-based service providers and, in doing so, the opportunity for us to influence (and perhaps even still directly manage) the preservation of digital archives long into the future?

Bailey even suggests the possible maintenance of a public sector funded meta-repository “within which online content can be transferred, or just copied, for controlled, managed long term storage whilst continuing to provide access to it to the services and companies from which it originated”.

In reply, someone from the records management list makes the following point:

“In terms of where the future of digital preservation does lie, I doubt it is with the major providers, in part because that is not their business case. Just as newspapers are not in the archive business (although they may have archives), neither are the web service providers (yet) in that business. The challenge is that archives, as opposed to storage, are guided by the key question of who and why. To archive something is based upon a distinct community fixed in time and space. Archives, as opposed to mass storage, have to work by what they refuse as much as by what they include.”

The cloud may be a mass storage device but it is not yet an archive.

So it seems that the future of digital archiving continues to lie in the hands of those who care about it – the records managers, the archivists, the librarians, the JISC project managers. It is just that they now need either to include others in the dialogue about how to preserve digital objects or (and a part of me thinks this is the more realistic approach) to think in a more lateral way about how to continue preserving once you have lost control of your digital objects.

Other interesting posts/articles relating to preservation and the cloud include:

Digital preservation: a matter for the clouds? by Maureen Pennock, British Library

DuraCloud – a hosted service and open technology developed by DuraSpace that makes it easy for organizations and end users to use cloud services. DuraCloud leverages existing cloud infrastructure to enable durability and access to digital content.

Posted in Archiving | 1 Comment »

Mirroring sites with WinHTTrack

Posted by Marieke Guy on 26th August 2010

Earlier this week Brian Kelly published a post on how he has used WinHTTrack to create a copy of the Institutional Web Management Workshop 2008 social network. The social network was created using Ning, who have recently cancelled provision of free social networks. In his post – 5 Days Left to Choose a New Ning Plan – Brian talks us through the process taken to mirror the service and also discusses some of the wider implications of using externally hosted services.

Brian says:

The use of such services to support events, in particular, raises some interesting issues. I have previously suggested that “The lesson I’ve learnt – there’s a need to change the settings for social networks set up to support events after the event is over. I still prefer to make it easy to subscribe to such services, however, in order to avoid any delays caused by the need to accept new subscriptions manually”. But as well as tightening up on access after an event is over in order to avoid spam, are further measures needed? Should the content be replicated elsewhere? Should the social networking site be closed? Or should we be happy with the default option of simply doing nothing – after all, although the announcement stated that the free service would be withdrawn on 20 August, it is still available today.

HTTrack is one of the tools I talked about in my post Web Archiving: Tools for Capturing. It is always interesting to hear case studies of use.

Posted in Archiving | 1 Comment »

Web Archiving: Tools for Capturing

Posted by Marieke Guy on 28th July 2010

The DPTP workshop on Web archiving I attended a few weeks back was a great introduction to the main tools out there for capturing web resources.

The four big players for the HE sector are listed in the new JISC PoWR Guide to Web Preservation (Appendix C, p41). I’ve used some of the JISC PoWR explanation here and added in some thoughts of my own.

These capture tools, or Web harvesting engines, are essentially web search engine crawlers with special processing abilities that extract specific fields of content from web pages. They basically do harvesting, capturing and gathering, much like Google’s own harvester. You provide them with a seed (the site to capture) and let them loose!
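To make that idea concrete, here is a minimal sketch in Python of what a capture crawler does: start from a seed, fetch pages, follow links within the same site, and save what it finds locally. It is very much simplified compared with real tools such as Heritrix or HTTrack, and the seed URL and output directory are placeholders.

```python
# A very simplified, illustrative crawler: real capture tools such as Heritrix
# add politeness delays, robots.txt handling, scoping rules and WARC output.
# The seed URL and output directory below are placeholders.
import os
import urllib.request
from urllib.parse import urljoin, urlparse
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags found on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, out_dir="mirror", max_pages=50):
    """Fetch the seed and the pages it links to (same domain only), saving each locally."""
    os.makedirs(out_dir, exist_ok=True)
    domain = urlparse(seed).netloc
    queue, seen = [seed], set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except Exception as err:
            print(f"Skipping {url}: {err}")
            continue
        # Naive filename derived from the URL path (fine for a sketch)
        name = urlparse(url).path.strip("/").replace("/", "_") or "index"
        with open(os.path.join(out_dir, name + ".html"), "w", encoding="utf-8") as f:
            f.write(html)
        # Queue further links, staying within the seed's domain ("scoping")
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == domain and absolute not in seen:
                queue.append(absolute)

if __name__ == "__main__":
    crawl("http://www.example.org/")  # placeholder seed
```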

Heritrix

Heritrix is a free, open-source, extensible, archival-quality web crawler. It was developed, and is used, by the Internet Archive and is freely available for download and use in web preservation projects under the terms of the GNU GPL. It is implemented in Java, and can therefore run on any system that supports Java (Windows, Apple, Linux/Unix). Archive-It uses this capture tool, and the Web Curator Tool is used as a front end to it, supporting key processes such as permissions, job scheduling, harvesting, quality review, and the collection of descriptive metadata. Heritrix can be downloaded from Sourceforge.

HTTrack

HTTrack is a free offline browser utility, available to use and modify under the terms of the GNU GPL. Distributions are available for Windows, Apple, and Linux/Unix. It enables the download of a website from the Internet to a local directory, capturing HTML, images, and other files from the server, and recursively building all directories locally. It can arrange the original site’s relative link structure so that the entire site can be viewed locally as if online. It can also update an existing mirrored site, and resume interrupted downloads. Like many crawlers, HTTrack may in some cases experience problems capturing some parts of websites, particularly when using Flash, Java, Javascript, and complex CGI.
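For anyone scripting this rather than using the WinHTTrack GUI, here is a minimal sketch of driving the httrack command-line tool from Python. It assumes httrack is installed and on your PATH; the URL, output directory and domain filter are placeholders.

```python
# Illustrative only: drives the httrack command-line tool (assumed to be
# installed and on the PATH). The URL, output directory and filter are
# placeholders for your own site.
import subprocess

seed = "http://www.example.org/"   # site to mirror (placeholder)
output_dir = "example-mirror"      # local directory for the copy

subprocess.run(
    [
        "httrack", seed,
        "-O", output_dir,          # where to write the mirror, logs and cache
        "+*.example.org/*",        # filter: stay within this domain
    ],
    check=True,
)
```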

Wget

GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP. It is a non-interactive command line tool, so it can easily be used with other scripts, or run automatically at scheduled intervals. It is freely available under the GNU GPL and versions are available for Windows, Apple and Linux/Unix.
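Because Wget is non-interactive it lends itself to scripting and scheduling. Here is a minimal sketch of calling it from Python, for example from a job run at regular intervals; the flags shown are standard Wget options and the URL is a placeholder.

```python
# Illustrative only: calls GNU Wget (assumed installed) with standard options
# to mirror a placeholder site; suitable for running from a scheduled job.
import subprocess

subprocess.run(
    [
        "wget",
        "--mirror",           # recursive download with timestamping
        "--convert-links",    # rewrite links so the copy can be browsed locally
        "--page-requisites",  # also fetch images, CSS and other page assets
        "--no-parent",        # do not climb above the starting directory
        "--wait=1",           # pause between requests to be polite
        "http://www.example.org/",  # placeholder URL
    ],
    check=True,
)
```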

DeepArc

DeepArc was developed by the Bibliothèque Nationale de France to archive objects from database-driven deep websites (particularly documentary gateways). It uses a database to store object metadata, while storing the objects themselves in a file system. Users are offered a form-based search interface where they may input keywords to query the database. DeepArc has to be installed by the web publisher who maps the structure of the application database to the DeepArc target data model. DeepArc will then retrieve the metadata and objects from the target site. DeepArc can be downloaded from Sourceforge.

There is a useful definition of Web archiving on Wikipedia with details of some other on-demand tools including WebCite, Hanzo, Backupmyurl, the Web Curator tool, Freezepage, Web site Archive and Page Freezer.

Some issues to consider

When choosing one of these tools there are a few issues that you will want to consider.

  • Where do you want the point of capture to be? Within the authoring system or server, at the browser, or by using a crawler – most Web capture tools use the crawler approach.
  • Do you wish to ignore robots.txt? You can set a capture tool to ignore this file, but is it ethical to do so? (See the sketch after this list.)
  • What about managing authority? What do you do about sites that you do not own?
  • How does your capture tool deal with tricky areas like databases, data feeds, subscription/login areas and the deep Web?
  • What sort of exclusion filters will you want to use? You’ll want to avoid too much ‘collateral harvesting’, i.e. gathering content that isn’t needed.
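
On the robots.txt question above, here is a minimal sketch of how a capture script might at least check a site’s robots.txt before fetching a page, using Python’s standard urllib.robotparser. The site, user agent name and URL are placeholders.

```python
# Illustrative only: checks a placeholder site's robots.txt with the standard
# library before deciding whether to capture a page. The user agent name and
# URLs are placeholders.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.org/robots.txt")
rp.read()

url = "http://www.example.org/private/report.html"
if rp.can_fetch("MyArchiveBot", url):
    print("robots.txt permits capturing", url)
else:
    print("robots.txt disallows capturing", url)
```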

Harvesting tools can go wrong: you need to be careful when configuring them and avoid settings that can send them into a loop.

Further Resources

NetPreserve – Toolkit for setting up a Web archiving chain

DCC – Web Archiving Tools

Posted in Archiving, Web | 5 Comments »

Archive-It Webinar

Posted by Marieke Guy on 19th July 2010

I wanted to find out more about the Archive-It service available from the Internet Archive, so I have just watched their pre-recorded online webinar. Kristine Hanna and the Archive-It team run live informational webinars, but the time difference made it a little tricky for me to attend one of these (they are held every 2 weeks at 11:30 am PT).

Many of you will have already heard of the Internet Archive: it is a non-profit organisation that was founded in 1997 by Brewster Kahle. Its aim is “universal access to human knowledge”, and it fulfils this by aggregating a “broad snapshot of the Web every 2 months” in a Web archive/library, which has been built using open source software. It is currently the largest public Web archive in existence, comprising 150 billion+ pages collected from 65+ Web sites.

In response to partners’ requests for more control over their data, the Internet Archive team developed the Archive-It service.

It is a web-based application that enables institutions to harvest and archive digital content. They can create focused collections, alter the content scope and frequency of crawls, add metadata to their collections, make their collections private and more. Archived content is available for browsing 24 hours after a crawl has been completed and is full-text searchable within 7 days. Collections are hosted at the Internet Archive data centre, and a subscription includes hosting, access and storage. Collections can be browsed from the Archive-It home page or from a landing page on your institution’s Web site.

Archive-It uses open source technology primarily developed by the Internet Archive and the International Internet Preservation Consortium (IIPC).

The key components are:

  • Heritrix – Web crawler that crawls and captures pages
  • Wayback Machine – access tool for rendering and viewing archived Web pages: surf the web as it was
  • Nutch – open source search engine providing standard full-text search
  • WARC file – archival file format used for preservation (see the sketch below)
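
As an aside, WARC files can also be inspected outside Archive-It. Here is a minimal sketch using the third-party warcio Python library (not part of Archive-It itself; the filename is a placeholder) to list the URLs captured in a WARC file.

```python
# Illustrative only: lists the captured URLs in a WARC file using the
# third-party warcio library (pip install warcio). The filename is a placeholder.
from warcio.archiveiterator import ArchiveIterator

with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            # Each archived HTTP response records the URL it was captured from
            print(record.rec_headers.get_header("WARC-Target-URI"))
```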

I’m planning to write more about the technologies behind Web archiving when I get a chance.

Archive-It Webinar

The Webinar itself is really clear and easy to follow and lasts about 30 minutes. It begins with an introduction to the Internet Archive and Archive-It and ends with a demo of the Archive-It software. As the webinar shows, there is no software to install or download: all you need is a computer, an internet connection and a browser to be able to access your collection. It looks very easy to use. During a 15-minute demo a new collection is created, metadata added, URLs added, settings specified, the collection crawled and a number of reports created. Although many big institutions use the service, it is designed with institutions that have smaller collection and infrastructure requirements in mind. A free complimentary trial (for 2-4 weeks, one collection, 10 seeds (sites), up to 500,000 URLs) is offered to webinar attendees.

I was also quite interested in how the Archive-It team are archiving social networking sites, including blogs. By email the team explained that many of their partners have been doing this, and that they have been successfully capturing blogs since 2007. ‘Partners’ is the term used for any institution or organisation that uses or interacts with the Archive-It service. Although they are US based, they currently have several European partners (over 125 partners currently use Archive-It) and quite a few in UK HE (35% of their partners come from university and public libraries) – see their partners list for more details. Archive-It also has connections with a number of digital preservation systems including LOCKSS, DSpace, CONTENTdm, iRODS and ArchivalWare.

Posted in Archiving | Comments Off

A Guide to Web Preservation

Posted by Marieke Guy on 12th July 2010

Today the JISC Preservation of Web Resources (PoWR) team announced the launch of A Guide to Web Preservation.

I worked on the PoWR project back in 2008. The project organised workshops and produced a handbook that specifically addressed digital preservation issues that were, and still are, relevant to the UK HE/FE web management community. It was a really successful project, but later down the line it was felt that there was a need for a more accessible, easy-to-use version of the handbook. This new guide does just the trick! I was really pleased to see it being given out as a resource at the DPTP Web Archiving Workshop I attended a few weeks back.

To steal some words from the press release:

This Guide uses similar content to PoWR: The Preservation of Web Resources Handbook but in a way which provides a practical guide to web preservation, particularly for web and records managers. The chapters are set out in a logical sequence and answer the questions which might be raised when web preservation is being seriously considered by an institution. These are:

  • What is preservation?
  • What are web resources?
  • Why do I have to preserve them?
  • What is a web preservation programme?
  • How do I decide what to preserve?
  • How do I capture them?
  • Who should be involved?
  • What approaches should I take?
  • What policies need to be developed?

Each chapter concludes with a set of actions, and one chapter lists the tasks which must be carried out, and the timings of these tasks, if an institution is to develop and maintain a web preservation programme. In addition, points made in the Guide are illustrated with a number of case studies.

The guide was edited by Susan Farrell who has used her knowledge and expertise in the management of large-scale institutional Web services in writing the document.

The Guide can be downloaded (in PDF format) from the JISC PoWR Web site. The Guide is also hosted on the JISCPress service, which provides a commenting and annotation capability. It has been published on the Lulu.com print-on-demand service, where it can be bought for £2.82 plus postage and packing.

Posted in Archiving, trainingmaterials | 1 Comment »

LiWA Launch First Code

Posted by Marieke Guy on 14th June 2010

Today the LiWA (Living Web Archives) project has announced the release of the first open-source components of the “liwa-technologies” project on Google Code.

The LiWA project is looking

beyond the pure “freezing” of Web content snapshots for a long time, transforming pure snapshot storage into a “Living” Web Archive. “Living” refers to a) long term interpretability as archives evolve, b) improved archive fidelity by filtering out irrelevant noise and c) considering a wide variety of content.

They plan to extend the current state of the art and develop the next generation of Web content capture, preservation, analysis, and enrichment services to improve fidelity, coherence, and interpretability of web archives.

This is the first release of the software so they are keen to receive feedback and comments.

Posted in Archiving, Web | Comments Off

Preserving your Tweets

Posted by Marieke Guy on 28th May 2010

Recently there has been lots of talk about preserving Tweets, especially since the Library of Congress agreed to take on the archive.

I’ve just written an article for FUMSI on this (probably out in September). FUMSI are the FreePint people who publish tips and articles to help information professionals do their work.

One area I looked at was the tools available to archive tweets. We wrote quite a lot about this on the JISC PoWR project. Here is a taster of the most current tools out there:

  • The Print Your Twitter service creates a PDF file of an account’s tweets.
  • The WordPress Lifestream plugin allows you to integrate Twitter with your blog and so archive using blog capabilities.
  • What the Hashtag allows you to create an HTML archive and RSS feed based on a hashtag.
  • The Tweetdoc service allows you to create a PDF file that brings together all the tweets from a particular event or search term.
  • Twapper Keeper allows users to create a notebook of tweets for a hashtag.
  • The Archivist Desktop is a desktop application that runs on your local system and allows you to archive tweets for later data-mining and analysis for any given search.

Other approaches include:

  • Searching Twemes – the site mashes Twitter with Flickr, Delicious and other services.
  • Searching FriendFeed – this brings back much older tweets than Twitter, but you are reliant on users being members of the service.
  • Subscribing to certain Twitter feeds by email and then applying an email filter to them.

Some of these tools are covered in more detail in the JISC PoWR blog post: Tools For Preserving Twitter Posts.
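
For illustration, here is a minimal sketch of the general approach most of these tools take: query an API for tweets matching a hashtag and append the results to a local archive file. The endpoint, response format and hashtag below are invented placeholders, not the real Twitter API (which requires authentication).

```python
# Illustrative only: the endpoint and response shape here are invented
# placeholders, not the real Twitter API. The idea is simply: fetch matching
# tweets as JSON and append them to a local archive file.
import json
import urllib.request
from urllib.parse import urlencode

def archive_tweets(hashtag, archive_path="tweets.jsonl"):
    query = urlencode({"q": hashtag})
    url = f"https://api.example.org/search?{query}"  # hypothetical endpoint
    with urllib.request.urlopen(url, timeout=10) as response:
        results = json.loads(response.read().decode("utf-8"))
    # Append one JSON object per line so repeated runs build up an archive
    with open(archive_path, "a", encoding="utf-8") as archive:
        for tweet in results.get("results", []):  # assumed response shape
            archive.write(json.dumps(tweet) + "\n")

if __name__ == "__main__":
    archive_tweets("#jiscpowr")  # placeholder hashtag
```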

Anyone got any other suggestions?

Posted in Archiving, Web | Comments Off