JISC Beginner's Guide to Digital Preservation

…creating a pragmatic guide to digital preservation for those working on JISC projects

Archive for the 'Web' Category

Preservation of web resources

Creating a Sustainable Web site

Posted by Marieke Guy on 22nd August 2012

Work has been going on at the University of Northampton to develop a ‘sustainable web site’. The site, designed to promote STEM (Science, Technology, Engineering and Mathematics) outreach activities, is being funded through a Nuffield Science Bursary.

The project is looking at different aspects of web sustainability. They have begun with an investigation of possible tools and approaches for web sustainability. Their findings have been shared through a series of blog posts.

The site itself is being hosted using Weebly because of its ease of use, free tools and free hosting. It is believed to be sustainable because:

  • It has been designed to be relatively simple to maintain
  • Uses free tools
  • Uses free web-hosting
  • Relatively easy to transfer management
  • Relatively easy to wrap most of the site and largely move it to another host.

A user guide to maintaining the site, designed to be simple to understand, has also been produced.

Posted in Web | Comments Off

Your Digital Legacy

Posted by Marieke Guy on 31st January 2011

Personal Legacy

Last week Law.Com published an interesting article entitled What Happens to Your Digital Life When You Die?

The article, written by Ken Strutlin, starts by explaining that dealing with our digital legacy is something the legal world has yet to get to grips with.

Still, one of the neglected ensigns of internet citizenship is advanced planning. When people die, there are virtual secrets that follow them to the grave — the last refuge of privacy in a transparent society. Courts and legislatures have only begun to reckon with the disposition of digital assets when no one is left with the knowledge or authority to conclude the business of the cyber-afterlife.

It is an immensely complicated area and “the most important long-term consideration is who can access a person’s online life after they have gone or become incapacitated?”. Many people leave behind a huge amount of digital data. Much of this, for example images and documents, may no longer sit on a local hard drive but may be stored on cloud services such as Flickr and Facebook. It is likely that loved ones will be keen to access and collate this data.

Information on both legal rights and what physically needs to be done is becoming increasingly important.

A few years ago a colleague of mine passed away and after some time I took it upon myself to notify Facebook. Relatives had initially posted some information (such as funeral details) up on my colleague’s wall but no other action had been taken. The profile had remained as one of a living user. After I contacted them Facebook acted quickly and effectively and memorialized the account. It is quite clear that they have a well thought out set of procedures in place.

Work Legacy

At UKOLN, where I work, we have touched on this subject when considering how you deal with the digital legacy of staff who move on. Although former members of staff are not ‘dead’, the problems that their leaving causes can be similar to those when someone dies – unknown passwords and use of unlisted services, to name two. In the past this type of information has been described as corporate or organisational memory and has often been subdivided into explicit and tacit knowledge. Recording corporate knowledge, especially the tacit type, has always caused problems, but the digital nature of resources now adds another level.

Strutlin recounts the tale of the Rosetta Stone: the meaning of the hieroglyphics had been lost, and was rediscovered only after a Napoleonic soldier found a three-script inscription in the Egyptian town of Rosetta that offered a key to them. Strutlin’s response is that “We need more than serendipity to preserve the data of our lives beyond our lifetimes”.

Over time it is likely that laws will emerge and processes and procedures will evolve but we need to be proactive about instigating them.

The principal concern today is the passing on of passwords, divvying up social media contents, and protecting virtual assets. But five minutes from now, those social media sites will include life logged metrics with excruciating details about our health, activities, and collective experiences. They will be more intimate and vivid than any handwritten personal journal or photo album. And they will demand clear and comprehensive rules to oversee their final disposition.

Posted in Web | Comments Off

Web Archiving: Tools for Capturing

Posted by Marieke Guy on 28th July 2010

The DPTP workshop on Web archiving I attended a few weeks back was a great introduction to the main tools out there for capturing web resources.

The four big players for the HE sector are listed in the new JISC PoWR Guide to Web Preservation (Appendix C, p41). I’ve used some of the JISC PoWR explanation here and added in some thoughts of my own.

These capture tools or Web harvesting engines are essentially web search engine crawlers with special processing abilities that extract specific fields of content from web pages. They basically do harvesting, capturing and gathering, pretty much like Google's harvester. You provide them with a seed (a site to capture) and let them loose!
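To make the idea of a seed a bit more concrete, here is a minimal Python sketch of the very first step a harvester takes: reading a page and working out which links to follow next. This is my own illustration, not how any of the tools below is actually implemented. The class and function names, the example.org addresses and the sample HTML are all made up, and the "stay on the seed's host" rule is just one possible scoping assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_in_scope(seed_url, html):
    """Return absolute links found in html that stay on the seed's host."""
    parser = LinkExtractor()
    parser.feed(html)
    seed_host = urlparse(seed_url).netloc
    found = []
    for href in parser.hrefs:
        absolute = urljoin(seed_url, href)  # resolve relative links against the seed
        if urlparse(absolute).netloc == seed_host and absolute not in found:
            found.append(absolute)
    return found


# Hypothetical page content standing in for a fetched seed page.
SAMPLE_HTML = """
<a href="/about.html">About</a>
<a href="news/2010.html">News</a>
<a href="http://other.example.org/">Elsewhere</a>
"""

if __name__ == "__main__":
    for link in links_in_scope("http://www.example.org/index.html", SAMPLE_HTML):
        print(link)
```

A real harvester repeats this step for every page it discovers, which is why scoping rules matter so much: the link to the other host is dropped here, otherwise the crawl would wander off across the Web.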


Heritrix is a free, open-source, extensible, archival-quality web crawler. It was developed, and is used, by the Internet Archive and is freely available for download and use in web preservation projects under the terms of the GNU GPL. It is implemented in Java, and can therefore run on any system that supports Java (Windows, Apple, Linux/Unix). Archive-It uses this capture tool, and the Web Curator tool uses it too, acting as a front end that supports key processes such as permissions, job scheduling, harvesting, quality review, and the collection of descriptive metadata. Heritrix can be downloaded from Sourceforge.


HTTrack is a free offline browser utility, available to use and modify under the terms of the GNU GPL. Distributions are available for Windows, Apple, and Linux/Unix. It enables the download of a website from the Internet to a local directory, capturing HTML, images, and other files from the server, and recursively building all directories locally. It can arrange the original site’s relative link structure so that the entire site can be viewed locally as if online. It can also update an existing mirrored site, and resume interrupted downloads. Like many crawlers, HTTrack may in some cases experience problems capturing some parts of websites, particularly those that use Flash, Java, JavaScript, and complex CGI.


GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP. It is a non-interactive command line tool, so it can easily be used with other scripts, or run automatically at scheduled intervals. It is freely available under the GNU GPL and versions are available for Windows, Apple and Linux/Unix.
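Because Wget is non-interactive, it is easy to wrap in a script. The Python sketch below only builds and prints a Wget command line for mirroring a site, so you can review it before running it or putting it in a scheduled job. The function name, the example.org address and the target directory are placeholders of my own, and the particular choice of options is just one reasonable starting point, not a recommended recipe.

```python
import shlex


def build_wget_command(seed_url, target_dir, wait_seconds=1):
    """Build (but do not run) a wget command line for mirroring one site."""
    return [
        "wget",
        "--mirror",            # recursive download with timestamping
        "--page-requisites",   # also fetch the images, CSS and scripts pages need
        "--convert-links",     # rewrite links so the copy can be browsed locally
        "--no-parent",         # do not climb above the seed directory
        "--wait", str(wait_seconds),       # pause between requests to be polite
        "--directory-prefix", target_dir,  # where the captured files are saved
        seed_url,
    ]


if __name__ == "__main__":
    # Placeholder seed and output directory, for illustration only.
    command = build_wget_command("http://www.example.org/", "capture-2010-07-28")
    # Print the command so it can be checked, pasted into a shell, or scheduled.
    print(shlex.join(command))
```

Keeping the command as a list rather than a single string means it could later be handed to `subprocess.run` without worrying about shell quoting.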


DeepArc was developed by the Bibliothèque Nationale de France to archive objects from database-driven deep websites (particularly documentary gateways). It uses a database to store object metadata, while storing the objects themselves in a file system. Users are offered a form-based search interface where they may input keywords to query the database. DeepArc has to be installed by the web publisher who maps the structure of the application database to the DeepArc target data model. DeepArc will then retrieve the metadata and objects from the target site. DeepArc can be downloaded from Sourceforge.

There is a useful definition of Web archiving on Wikipedia with details of some other on-demand tools including WebCite, Hanzo, Backupmyurl, the Web Curator tool, Freezepage, Web site Archive and Page Freezer.

Some issues to consider

When choosing one of these tools there might be a few issues that you will want to consider.

  • Where do you want the point of capture to be? Within the authoring system or server, at the browser or by using a crawler – most of the Web capture tools use the crawler approach.
  • Do you wish to ignore robots.txt? You can set a capture tool to ignore this file, but is it ethical to do so?
  • What about managing authority? What do you do about sites that you do not own?
  • How does your capture tool deal with tricky areas like databases, datafeeds and subscription/login areas, the deep Web?
  • What sort of exclusion filters will you want to use? You’ll want to avoid too much ‘collateral harvesting’ i.e. gathering content that isn’t needed.
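Two of the questions above, robots.txt and exclusion filters, come down to a yes/no decision per URL. Here is a rough Python sketch of that decision using the standard library's robots.txt parser. The robots.txt content, the filter patterns, the user agent string and the function names are all invented for illustration; real capture tools have their own configuration for this.

```python
import re
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical exclusion filters: patterns for content we do not want to gather.
EXCLUDE_PATTERNS = [
    re.compile(r"/calendar/"),     # automatically generated pages
    re.compile(r"\.(zip|iso)$"),   # large downloads outside the collection scope
]


def make_robots_checker(robots_text):
    """Parse robots.txt text into a checker object."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def should_capture(url, robots, user_agent="example-archiver", respect_robots=True):
    """Decide whether a URL passes the robots.txt check and the exclusion filters."""
    if respect_robots and not robots.can_fetch(user_agent, url):
        return False
    return not any(pattern.search(url) for pattern in EXCLUDE_PATTERNS)


if __name__ == "__main__":
    robots = make_robots_checker(ROBOTS_TXT)
    for url in (
        "http://www.example.org/about.html",
        "http://www.example.org/private/report.html",
        "http://www.example.org/calendar/2010/07/",
    ):
        print(url, should_capture(url, robots))
```

The `respect_robots` switch is there deliberately: ignoring robots.txt is a policy decision someone has to make and own, not a technical default.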

Harvesting tools can go wrong, so you need to be careful when setting them up and avoid settings that can send them into a loop.
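As a rough illustration of how a crawler can protect itself against loops, the Python sketch below remembers every URL it has already queued and also caps the depth and the total number of pages. It runs against a small made-up link graph rather than a live site; the `SITE` data, the function name and the default limits are my own assumptions, not settings from any of the tools above.

```python
from collections import deque
from urllib.parse import urldefrag

# Hypothetical link graph standing in for a live site: page -> links on that page.
# Note the cycle between /a and /b, which would trap a naive crawler.
SITE = {
    "http://www.example.org/": ["http://www.example.org/a"],
    "http://www.example.org/a": ["http://www.example.org/b", "http://www.example.org/a#top"],
    "http://www.example.org/b": ["http://www.example.org/a", "http://www.example.org/c"],
    "http://www.example.org/c": [],
}


def crawl(seed, get_links, max_depth=3, max_pages=100):
    """Breadth-first crawl that cannot loop: each URL is visited at most once."""
    seen = {seed}
    queue = deque([(seed, 0)])
    captured = []
    while queue and len(captured) < max_pages:
        url, depth = queue.popleft()
        captured.append(url)
        if depth >= max_depth:
            continue  # deep enough; do not follow links from here
        for link in get_links(url):
            link = urldefrag(link).url  # treat page#fragment as the same page
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return captured


if __name__ == "__main__":
    for page in crawl("http://www.example.org/", lambda url: SITE.get(url, [])):
        print(page)
```

The three safeguards do different jobs: the `seen` set stops simple cycles, while the depth and page limits are a backstop for sites that generate endless new URLs (calendars are the classic example).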

Further Resources

NetPreserve – Toolkit for setting up a Web archiving chain

DCC – Web Archiving Tools

Posted in Archiving, Web | 5 Comments »

LiWA Launch first Code

Posted by Marieke Guy on 14th June 2010

Today the LiWA (Living Web Archives) project announced the release of the first open-source components of the “liwa-technologies” project on Google Code.

The LiWA project is looking

beyond the pure “freezing” of Web content snapshots for a long time, transforming pure snapshot storage into a “Living” Web Archive. “Living” refers to a) long term interpretability as archives evolve, b) improved archive fidelity by filtering out irrelevant noise and c) considering a wide variety of content.

They plan to extend the current state of the art and develop the next generation of Web content capture, preservation, analysis, and enrichment services to improve fidelity, coherence, and interpretability of web archives.

This is the first release of the software so they are keen to receive feedback and comments.

Posted in Archiving, Web | Comments Off

Preserving and the Current Economy

Posted by Marieke Guy on 8th June 2010

Yesterday David Cameron warned the British public of what he called the “inevitably painful times that lie ahead”. His speech referred to the spending cuts that are seen to be necessary to reduce the £70 billion debt the UK currently has. A few weeks ago the Coalition government unveiled their first round of spending cuts and the budget on 22 June is likely to lower the axe again. The Department for Business Innovation & Skills (BIS) has the Higher Education budget down for £200 million in efficiencies.

It is inevitable that a number of organisations will close and many projects will come to an end. The British Educational Communications and Technology Agency (Becta), the organisation which promotes the use of technology in schools, was one of the first to go.

So what role will digital preservation and access play in the current economic and fiscal situation?

Digital preservation is more important than ever in a time when the wealth of what JISC and other government funded organisations have created could potentially slip away.

After the closure of Becta was announced there was much discussion on Twitter about what would happen to their Web site and their intellectual assets. Some of their work will be carried on by other government organisations and it’s likely that these resources will be transferred over to other sites and databases. Their Web site is currently one of those preserved by the National Archives; however, there are still questions over what else will be preserved and the processes that will take place. Will they mothball their Web site? What other Web resources will they save? They would do well to consult the JISC Preservation of Web Resources handbook.

It will be an interesting case study to watch.

However, it is not only government digital objects that are at risk. Those of commercial companies are unlikely to stand the test of time either.

In response to this the UK Web Archive have created a collection for the recession containing Web sites from high street stores that have closed down. The Credit Crunch Collection, initiated in July 2008, contains records of high-street victims of the recession including Woolworths and Zavvi.

There is also a worry that many digital records from bankrupt companies will disappear in the haste to sell off assets. On his blog a records manager explains how in the past archivists have waded in to save companies’ records in a form of “rescue” archiving. However, “When a modern business goes bust, many of its records will exist only in electronic form…. The inheriting organisation will always be under pressure to take the easiest and cheapest way to dispose of a predecessor’s assets, which in practice probably means that data will be wiped and the hardware sold on.”

It seems that much is likely to be lost in the next few years if we aren’t careful.

Posted in Project news, Web | Comments Off

Preserving your Tweets

Posted by Marieke Guy on 28th May 2010

Recently there has been lots of talk about preserving Tweets, especially since the Library of Congress agreed to take on the archive.

I’ve just written an article for FUMSI on this (probably out in September). FUMSI are the FreePint people who publish tips and articles to help information professionals do their work.

One area I looked at was the tools available to archive tweets. We wrote quite a lot about this on the JISC PoWR project. Here is a taster of the most current tools out there:

  • Print Your Twitter service which creates a PDF file of an account’s tweets.
  • WordPress Lifestream plugin which allows you to integrate Twitter with your blog and so archive using blog capabilities.
  • What the Hashtag allows you to create an HTML archive and RSS feed based on a hashtag.
  • Tweetdoc service allows you to create a PDF file that brings together all the tweets from a particular event or search term.
  • Twapper Keeper allows users to create a notebook of tweets for a hashtag.
  • The Archivist Desktop is a desktop application that runs on your local system and allows you to archive tweets for later data-mining and analysis for any given search.

Other approaches include:

  • Searching Twemes: the site mashes Twitter with Flickr, Delicious and other services.
  • Searching FriendFeed, which brings back much older tweets than Twitter, although you are reliant on users being members of the service.
  • Subscribing to certain Twitter feeds by email and then applying an email filter to them.

Some of these tools are covered in more detail in the JISC PoWR blog post: Tools For Preserving Twitter Posts.

Anyone got any other suggestions?

Posted in Archiving, Web | Comments Off