JISC Beginner's Guide to Digital Preservation

…creating a pragmatic guide to digital preservation for those working on JISC projects

A Structure for the Guide

Posted by Marieke Guy on September 10th, 2010

Back in May I wrote about my brainstorming session and how I’d got a rough structure in place. I’ve now written the majority of the guide and have a table of contents to share with you.

Although the structure is based on a series of questions, the pages are interconnected and it is hoped that people will be able to approach the guide in different ways: through tags, through an index, as one-off answers to a question, as briefing papers, etc.

Anyway here is the structure as it now stands – please do let me know what you think.

  • Why is Digital Preservation Relevant to my JISC Project?
    • Access and Reuse Drivers
    • Legal Drivers
    • Economic Drivers
    • Reputation Drivers
    • Responsibility Drivers
    • Corporate Memory
    • Cultural Drivers
  • What are the Particular Digital Preservation Challenges JISC Projects Face?
    • Responsibility
    • Risk Management
    • Cost
  • What is Digital Preservation?
    • Definition of Digital Preservation
    • Definition of Digital Object
    • Definition of Digital Deterioration
    • Digital Preservation Approaches
    • Lifecycle Model
  • What Exactly do I Need to Preserve?
    • Information Audit
    • JISC Project Outputs
      • Text Based Information
      • Software
      • Numerical Information
      • Audio Visual
      • Emails
      • Event Information
      • Learning Objects and Teaching Materials
      • Web 1.0
      • Web 2.0
        • Blogs
        • Wikis
        • Twitter
      • Personal Data
    • Selection
  • How do I Make my Deliverables Easier to Preserve?
    • Formats
    • IPR and Licences
    • Metadata
  • How do I Preserve Digital Objects?
    • Tools
    • Training
    • Preservation Strategy
    • Preservation Policy
  • Can I Offload Digital Preservation?
    • Repositories
    • External Web Archiving Services
    • Institutional Records Management Processes
    • Outsourcing
  • Appendix
    • Glossary
    • Case Studies

Posted in Project news | Comments Off

Case Study: Tap into Bath

Posted by Marieke Guy on September 8th, 2010

There is a lot of value in digital preservation case studies. Building knowledge from observation by sharing approaches can save others a lot of time and effort.

My colleague Ann Chapman has written the following brief case study on the archiving of the Tap into Bath demonstrator project, which included a Web site, database and software. Ann can be contacted via her UKOLN staff page.

Tap into Bath

Tap into Bath was a demonstrator project to create a searchable database of collection-level descriptions as part of the Collection Description Focus work programme. The lead partners were UKOLN and the University of Bath Library, with twenty-five contributing partners from archive, library and museum collections in both the public and private sectors in the City of Bath.

Tap into Bath Home Page

The project began on 12 January 2004. The database was created by a member of the University of Bath library staff using the RSLP Metadata Schema for collection description and a MySQL database. A programmer was hired to create the search and display interfaces; the contract specified that the project would make these available as open source software. The completed database and the project web pages were held on a University of Bath server. Partner organisations submitted collection entries as Word documents and the data was entered into the database by University library staff. The Tap into Bath database was formally launched at the Guildhall in Bath on 8 December 2004.

The MySQL database and the search and data entry interfaces were designated as open source, and the un-populated database and accompanying software were offered for re-use with accreditation. Several enquiries were received; some did not proceed (typically because funding was unavailable for data collection and data entry tasks) but the following resources were created, both of which have Web links to the Tap into Bath site.

The Southern Cross Resource Finder (SCRF) is a web-based resource that enables users to discover collections from libraries, archives and museums which hold resources useful for the study of Australia and/or New Zealand. Produced and maintained by the Menzies Centre for Australian Studies, King’s College London, it was launched in 2005.

Milton Keynes Inspire is an online searchable database to promote access to the collections in museums, archives, galleries and libraries in Milton Keynes, launched on 4 November 2005.

In 2007, project partners were contacted and asked to review their entries and supply any additional or amended data and the database was updated.

Tap into Bath record for the Holburne Museum

In 2010 UKOLN received notification that the University of Bath server would be decommissioned later that year and reviewed the status of the project. The conclusion was that the data needed further updating immediately and on a regular basis in the future, and that it would benefit from a re-designed search and display interface. Neither UKOLN nor any of the partners had the resources to host, maintain and/or develop the resource and so it was decided to take the following actions.

  • Archive on UKOLN server and burn onto DVD
    • Populated Tap into Bath database
    • Un-populated database
    • Web apps for data entry and search & display interfaces
    • Word documents ‘High Level Design’ and ‘System Maintenance Guide’
    • Metadata schema
    • Guidelines for data entry
    • Screenshots of search and display pages in use
  • Create a zipped download of the un-populated database, Web interface software and installation documents in Word format for organisations to re-use
  • Create new Web pages for the project on the UKOLN server that:
    • Record the history of the project
    • State that the resource has been taken down
    • State that the Tap into Bath email address is no longer active
    • Provide access to the zipped download described above
  • Notify partners of new URL and request they remove the old URL if this is currently displayed on their Web site
  • Notify Southern Cross and Milton Keynes Inspire of change of URL so they can update the acknowledgement link on their Web pages
  • Tap into Bath email address: messages currently go to a member of UKOLN’s Outreach & Community team. This is to be changed so messages go to a member of UKOLN’s Systems and Support team.
  • Record all of the above activity for UKOLN resource management purposes

Carrying out the above processes has ensured that the Tap into Bath site and data have been effectively archived for the short term. Openly documenting the process enables interested parties to be aware of the archiving process and to know who should be contacted if any information or data is required.

Posted in Case studies | 2 Comments »

Treasuring Twitter

Posted by Marieke Guy on September 6th, 2010

My article on Treasuring Twitter: The Why and How of Preserving Tweets has now been published in FUMSI.

FUMSI is the successor to FreePint and publishes articles aimed at helping information professionals do their work.

Posted in articles | Comments Off

Where does the future of digital archiving lie?

Posted by Marieke Guy on August 27th, 2010

So where does the future of digital archiving lie? According to Steve Bailey it’s in Google’s hands.

This answer has sparked off some discussion on the records management JISCMail list, firstly about whether this is truly the case, and if so what it means. So let’s peel back the discussion and start at the beginning by watching Steve’s excellent talk, given at the 8th European Conference on Digital Archiving, 28 – 30 April 2010, Geneva.

A warning: the talk is excellent but unfortunately the embedded video isn’t very user friendly and won’t allow you to enlarge it or watch it from a specific point. Any mishap and you’re back to the beginning again. It’s all or nothing so set aside 20 minutes for this one!

The presented paper (there are no slides) starts off with a hypothetical analogy. Imagine if Samuel Pepys, the 17th century diarist, had had to rely on individual businesses to store and preserve his maps, his notebooks, his vellum manuscripts and so on. These were businesses that dealt with individual formats and had little interest in the content of Pepys’ records. Luckily this wasn’t the case and much of what he wrote has been recorded by the National Archive.

Bailey points out that we now find ourselves in a world where the responsibility for archiving much of our Office 2.0 documents lies at the feet of third parties. Documents are stored according to format, regardless of any commonality of content: text documents are now stored on Google Docs, videos on YouTube, photos on Flickr and so on. Although cloud services have brought us much flexibility they have left us with a Pandora’s box, and ‘no regard for preservation’ is one of the evils that has flown out. They are externally hosted services with very different agendas from ours; they may notify us if they are going to delete all our content, but they don’t necessarily have to do so. The title of Brian Kelly’s post 5 Days Left to Choose a New Ning Plan is enough to show that there may be very little time in which to rescue your digital objects.

And so Bailey concludes that the future of digital archiving lies with Google.

Bailey also outlines this theory in a post on his Records management futurewatch blog – Is the Cloud aware that it has ‘the future of digital archiving in its hands’?

For him it is not a case of whether this is the right place for it to lie, it is just so.

“It is at this point perhaps worth pausing to note that the question I have just offered an answer to is not in whose hands should the future of digital preservation lie, but in whose hands does it lie – a very important distinction indeed.”

At another point he says:

“Once again, I do not say that this is right or wrong, foolish or wise – simply that it appears inevitable and that we would do well to prepare ourselves for it.”

Steve asks us to hold back from lamenting this situation and to consider engaging in a dialogue with cloud-based service providers. He offers a possible four-point plan that might help us:

  1. Take a risk management approach to your choice of Web 2.0 services – look at issues like IPR
  2. Consider what to do if your provider closes down – have a back-up strategy
  3. Work with service providers to establish ways of searching information (this looks at areas like retention schedules)
  4. Consider asking Google if they are happy to fulfil this role

Much of this rings true with work we have carried out at UKOLN on projects like the JISC Preservation of Web Resources project. The final point is an interesting one though.

“Perhaps we should actually stop to ask Google and their peers whether they are indeed aware of the fact that the future of digital preservation lies in their hands and the responsibilities which come with it and whether this is a role they are happy to fulfil. For perhaps just as we are in danger of sleepwalking our way into a situation where we have let this responsibility slip through our fingers, so they might be equally guilty of unwittingly finding it has landed in theirs.

If so, might this provide the opportunity for dialogue between the archival professions and cloud-based service providers and, in doing so, the opportunity for us to influence (and perhaps even still directly manage) the preservation of digital archives long into the future?”

Bailey even suggests the possible maintenance of a public sector funded meta-repository “within which online content can be transferred, or just copied, for controlled, managed long term storage whilst continuing to provide access to it to the services and companies from which it originated“.

In reply someone from the Records Managers list makes the following point:

“In terms of where the future of digital preservation does lie, I doubt it is with the major providers, in part because that is not their business case. Just as newspapers are not in the archive business (although they may have archives), neither are the web service providers (yet) in that business. The challenge is that archiving, as opposed to storage, is guided by the key question of who and why. To archive something is based upon a distinct community fixed in time and space. Archiving, as opposed to mass storage, has to work by what it refuses as much as by what it includes.”

The cloud may be a mass storage device but it is not yet an archive.

So it seems that the future of digital archiving continues to lie in the hands of those who care about it – the records managers, the archivists, the librarians, the JISC project managers. It is just that they now need either to include others in the dialogue about how to preserve digital objects or (and a part of me thinks this is the more realistic approach) to think in a more lateral way about how to continue to preserve when you’ve lost control of your digital objects.

Other interesting posts/articles relating to preservation and the cloud include:

Digital preservation: a matter for the clouds? by Maureen Pennock, British Library

DuraCloud – a hosted service and open technology developed by DuraSpace that makes it easy for organizations and end users to use cloud services. DuraCloud leverages existing cloud infrastructure to enable durability and access to digital content.

Posted in Archiving | 1 Comment »

Mirroring sites with WinHTTrack

Posted by Marieke Guy on August 26th, 2010

Earlier this week Brian Kelly published a post on how he used WinHTTrack to create a copy of the Institutional Web Management Workshop 2008 social network. The social network was created using Ning, which has recently cancelled its provision of free social networks. In his post – 5 Days Left to Choose a New Ning Plan – Brian talks us through the process taken to mirror the service and also discusses some of the wider implications of using externally hosted services.

Brian says:

The use of such services to support events, in particular, raises some interesting issues. I have previously suggested that “The lesson I’ve learnt – there’s a need to change the settings for social networks set up to support events after the event is over. I still prefer to make it easy to subscribe to such services, however, in order to avoid any delays caused by the need to accept new subscriptions manually“. But as well as tightening up on access after an event is over in order to avoid spam, are further measures needed? Should the content be replicated elsewhere? Should the social networking site be closed? Or should we be happy with the default option of simply doing nothing – after all, although the announcement stated that the free service would be withdrawn on 20 August, it is still available today.

HTTrack is one of the tools I talked about in my post Web Archiving: Tools for Capturing. It is always interesting to hear case studies of use.

Posted in Archiving | 1 Comment »

Preserving Digital Lives

Posted by Marieke Guy on August 23rd, 2010

The @jisckeepit Twitter account alerted me to a really interesting article on downsizing your personal world from physical to digital (Cult of less: Living out of a hard drive). The gist of it is that many people are getting rid of their CD, DVD and book collections and replacing them with digital versions. On an extreme level this has led to some people getting rid of nearly all of their physical possessions and living a ‘minimalist life’.

The article really struck me on a number of levels. Firstly, I have been having quite a few discussions with a friend who is in the process of down-sizing to a smaller house. She’d already sold off her CDs on eBay (after adding them to her MP3 player) but has now gone one step further and got rid of her books too. She can get the information she needs from the Web or by using an e-book reader. To many living in a house of clutter this might appeal; personally, I’m not quite ready to let go. However, we both agreed that on an environmental level any move away from ‘creating stuff’ must be a good thing.

Secondly, and of more relevance to this blog, there is the digital preservation angle. As @jisckeepit put it, “note how rapidly preservation becomes critical…“. In fact there is no mention of ‘digital preservation’ in the article per se, but there is recognition that back-ups are vital.

Mr Yurista says he frequently worries he may lose his new digital life to a hard drive crash or downed server. “You have to really make sure you have back-ups of your digital goods everywhere,” he said.

The article mentions the new role of data crisis counsellors who help individuals claw back their data: “data recovery services will become rather like the firefighters of the 21st Century – responders who save your valuables.”

Digital Lives

Back in 2007-2009 the British Library carried out the Digital Lives Research Project. The project team, made up of the British Library, University College London and University of Bristol, created a major pathfinding study of personal digital collections.

One of the primary research questions asks: “How should curators approach selection, preservation and access to personal digital collections? What aspects of existing practice can be applied? What needs to be changed?”

The Digital Lives project blog offers some interesting insights. The beta synthesis of the project was released early this year and is available as a PDF (it is a hefty 259 pages long, but well worth a read!)

It is concluded that the role of personal archives in daily life and their research value have never been more profound. The potential benefits to society and to individuals are both deep and far reaching in their capacity to empower research and human well being and advancement….The project has outlined the concept of Personal Informatics to encapsulate the three concerns of digital capture, preservation and utility in the context of personal digital objects, and to embrace the study of digital personal information in all its manifestations.

So how does preservation of our own digital lives fit in with JISC? The answer is still unclear, but as the lines between work and home life, real and digital, continue to blur, many may feel that the digital preservation thread cuts right the way across.

JISC Keep It

Note that the JISC Keep It project aims to enable a diverse range of digital content presented by institutional repositories – research papers, science data, arts, teaching materials and theses – to be managed effectively today, tomorrow and beyond. Their Web site and blog are useful for anyone interested in a repository’s role in digital preservation.

Posted in Project news | 2 Comments »

DCC Roadshow 2010 – 2011

Posted by Marieke Guy on August 20th, 2010

The Digital Curation Centre have carried out digital preservation training in the past (for example the Digital Curation 101 course) but they have now committed to running a series of data curation roadshows. These are likely to be very useful for anyone involved in digital curation, from senior managers to researchers and librarians.

Institutional Challenges in the Data Decade

The DCC Roadshows will comprise a series of inter-linked workshops aimed at supporting institutional data management, planning and training.

The first will take place 2-4 November in Bath and will be open to participants from Higher Education Institutions (HEIs) in the south-west of England. The roadshow will run over 3 days and comprise a series of full-day and half-day workshops.

For more details see the DCC Web site. Registration will open in September 2010.

Posted in dcc, Project news | Comments Off

Making Digital Preservation Fun…

Posted by Marieke Guy on August 16th, 2010

…isn’t always that easy, but DigitalPreservationEurope (DPE) are having a good go. They have created Team Digital Preservation – a wacky cartoon crew who “embody all aspects of digital preservation“. Digiman leads his team against Blizzard and his band of evil cronies, Team Chaos, who “embody all aspects of threats to digital preservation“.

It’s all good clean fun but still gets across a very clear message – digital preservation is good!

DPE have so far uploaded 5 Team Digital Preservation videos to their Wepreserve account and they are getting a good number of hits. The latest is Team Digital Preservation and the Planets Testbed.

Blizzard and his band of evil cronies, Team Chaos, have developed a devastating new weapon. But Never Fear trusty Viewers, tune in now to find out what those wonderful whizz-kids at the top-secret Team Digital Preservation research lab have cooked up to protect Digiman this time!


All animations are free to use by those wishing to raise awareness and understanding about digital preservation.

Planets

Planets (Preservation and Long-term Access through NETworked Services) is a four-year, €15 million project, co-funded by the European Commission under the Information Society Technologies (IST) priority of the 6th Framework Programme (IST-033789). The Open Planets Foundation has been established to build on the investment to provide practical solutions and expertise in digital preservation.

DigitalPreservationEurope

DigitalPreservationEurope (DPE) builds on the earlier successful work of ERPANET and facilitates the pooling of the complementary expertise that exists across the academic research, cultural, public administration and industry sectors in Europe. It fosters collaboration and synergies between many existing national and international initiatives across the European Research Area. DPE addresses the need to improve coordination, cooperation and consistency in current activities to secure effective preservation of digital materials. DPE’s success will help to secure a shared knowledge base of the processes, synergy of activity, systems and techniques needed for the long-term management of digital material.

Posted in trainingmaterials | 1 Comment »

Using WordPress

Posted by Marieke Guy on August 3rd, 2010

Owen Stephens, a friend and colleague of mine, has just started work on a JISC-commissioned Guide to Open Bibliographic Data for use by managers, practitioners and developers in the library community.

He’s planning to create the guide in WordPress: he wants the guide to be a useful and powerful online resource, with commenting on different sections and different views of the sections…

It’s all starting to sound very familiar. These are some of my intentions with the JISC Beginner’s Guide to Digital Preservation.

Owen’s thoughts on what he calls Multi-faceted document navigation are available from his blog. He’s also created a demonstrator site.

There is a lot going on. He’s used a Query Multiple Taxonomies plugin and is using custom taxonomies and taxonomy templates – a much more detailed approach than using just tags and categories.

He’s also using an inline post plugin to allow him to embed content from one post in another post.

I know I need to spend a lot more time looking at the plugins WordPress has available. At a recent event I attended a workshop session on WordPress beyond blogging.



The session, presented by Joss Winn from the University of Lincoln, left me inspired but slightly unsure about what to do next. My blogs are hosted by our systems team, so I have to ‘ask’ for things I’d like. This isn’t a problem (by that I mean they are keen to help out) but it can introduce delays, and I don’t really get to play as much as I’d like to (or should).

I think my final guide won’t be as ambitious as Owen’s but I hope it is a useful and powerful online resource.

Posted in Wordpress | 1 Comment »

Web Archiving: Tools for Capturing

Posted by Marieke Guy on July 28th, 2010

The DPTP workshop on Web archiving I attended a few weeks back was a great introduction to the main tools out there for capturing web resources.

The four big players for the HE sector are listed in the new JISC PoWR Guide to Web Preservation (Appendix C, p41). I’ve used some of the JISC PoWR explanation here and added in some thoughts of my own.

These capture tools, or Web harvesting engines, are essentially web search engine crawlers with special processing abilities that extract specific fields of content from web pages. They do the harvesting, capturing and gathering, pretty much like Google’s own crawler. You provide them with a seed (the site to capture) and let them loose!
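
To make the seed-and-crawl idea concrete, here is a minimal sketch in Python of what a harvesting engine does at its core (an illustration only, not one of the tools below; the seed URL is a placeholder). Note the visited set: it is what stops the crawler going round in circles.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    """Breadth-first crawl from a seed URL, skipping already-visited pages."""
    queue, visited = [seed], set()
    while queue and len(visited) < max_pages:
        url = queue.pop(0)
        if url in visited:
            continue  # the visited set is what prevents endless loops
        visited.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # unreachable page: move on to the next one
        parser = LinkExtractor()
        parser.feed(html)
        # resolve relative links against the current page before queueing
        queue.extend(urljoin(url, link) for link in parser.links)
    return visited

print(crawl("https://example.org/"))  # placeholder seed
```

Real harvesters add politeness delays, robots.txt handling and archival-format output on top of this basic loop.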

Heritrix

Heritrix is a free, open-source, extensible, archival-quality web crawler. It was developed, and is used, by the Internet Archive and is freely available for download and use in web preservation projects under the terms of the GNU GPL. It is implemented in Java, and can therefore run on any system that supports Java (Windows, Apple, Linux/Unix). Archive-It uses this capture tool, and the Web Curator Tool is used as a front end, supporting key processes such as permissions, job scheduling, harvesting, quality review, and the collection of descriptive metadata. Heritrix can be downloaded from Sourceforge.

HTTrack

HTTrack is a free offline browser utility, available to use and modify under the terms of the GNU GPL. Distributions are available for Windows, Apple, and Linux/Unix. It enables the download of a website from the Internet to a local directory, capturing HTML, images, and other files from the server, and recursively building all directories locally. It can arrange the original site’s relative link structure so that the entire site can be viewed locally as if online. It can also update an existing mirrored site, and resume interrupted downloads. Like many crawlers, HTTrack may in some cases experience problems capturing some parts of websites, particularly when they use Flash, Java, JavaScript, and complex CGI.
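
HTTrack also ships as a command-line program (WinHTTrack is its Windows front end), so a capture can be scripted. The following is a minimal sketch, assuming the httrack binary is on your path; the URL and output directory are placeholders, and options should be checked against your version’s documentation.

```python
import subprocess

# A minimal sketch: mirror a site into a local directory with command-line
# HTTrack. The URL and output path are placeholders for your own values.
subprocess.run(
    ["httrack",
     "https://example.org/",             # seed site to capture
     "-O", "/archives/example-mirror"],  # output directory for the mirror
    check=True,  # raise CalledProcessError if httrack reports failure
)
```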

Wget

GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP. It is a non-interactive command line tool, so it can easily be used with other scripts, or run automatically at scheduled intervals. It is freely available under the GNU GPL and versions are available for Windows, Apple and Linux/Unix.
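
Because it is non-interactive, wget is easy to drive from a script run by cron or another scheduler. As a rough sketch (the URL and archive directory are placeholders; check your wget manual for the exact options your version supports), a dated capture might look like this:

```python
import subprocess
from datetime import date

# A sketch of a dated, scriptable wget capture suitable for scheduled runs.
# The target directory and seed URL below are placeholders.
target = f"/archives/example-{date.today().isoformat()}"
subprocess.run(
    ["wget",
     "--mirror",            # recursive download with timestamping
     "--convert-links",     # rewrite links so the copy browses locally
     "--page-requisites",   # also fetch images/CSS needed to render pages
     "--wait=1",            # pause between requests to be polite
     "--directory-prefix", target,
     "https://example.org/"],
    check=True,
)
```

Running this daily or weekly gives a series of dated snapshots, which is often the simplest form of scheduled web capture.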

DeepArc

DeepArc was developed by the Bibliothèque Nationale de France to archive objects from database-driven deep websites (particularly documentary gateways). It uses a database to store object metadata, while storing the objects themselves in a file system. Users are offered a form-based search interface where they may input keywords to query the database. DeepArc has to be installed by the web publisher who maps the structure of the application database to the DeepArc target data model. DeepArc will then retrieve the metadata and objects from the target site. DeepArc can be downloaded from Sourceforge.

There is a useful definition of Web archiving on Wikipedia with details of some other on-demand tools including WebCite, Hanzo, Backupmyurl, the Web Curator tool, Freezepage, Web site Archive and Page Freezer.

Some issues to consider

When choosing one of these tools there are a few issues you will want to consider.

  • Where do you want the point of capture to be? Within the authoring system or server, at the browser, or by using a crawler – most Web capture tools use the crawler approach.
  • Do you wish to ignore robots.txt? You can set a capture tool to ignore this file, but is it ethical to do so?
  • What about managing authority? What do you do about sites that you do not own?
  • How does your capture tool deal with tricky areas like databases, datafeeds and subscription/login areas, the deep Web?
  • What sort of exclusion filters will you want to use? You’ll want to avoid too much ‘collateral harvesting’, i.e. gathering content that isn’t needed – see the sketch after this list.
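
On the robots.txt and exclusion-filter questions above, the sketch below shows the kind of pre-flight checks a well-behaved harvester makes before fetching a URL. It uses Python’s standard library; the site, user agent string and filter patterns are illustrative only.

```python
import re
from urllib.robotparser import RobotFileParser

# Sketch: pre-flight checks a polite crawler might run before fetching a URL.
# The site, user agent and exclusion patterns are illustrative placeholders.
EXCLUDE = [re.compile(p) for p in (
    r"/calendar/",        # avoid crawler traps such as endless calendars
    r"\.(iso|zip|dmg)$",  # skip bulky downloads we don't want to archive
    r"[?&]sessionid=",    # skip session-specific URL variants
)]

robots = RobotFileParser()
robots.set_url("https://example.org/robots.txt")
robots.read()  # fetch and parse the site's robots.txt

def should_fetch(url, agent="my-archiver"):
    """True if robots.txt permits the fetch and no exclusion filter matches."""
    if not robots.can_fetch(agent, url):
        return False  # honour robots.txt rather than overriding it
    return not any(p.search(url) for p in EXCLUDE)

print(should_fetch("https://example.org/about/"))
print(should_fetch("https://example.org/calendar/2010/07/28"))
```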

Harvesting tools can go wrong: you need to be careful when configuring them and avoid settings that can send them into a loop.

Further Resources

NetPreserve – Toolkit for setting up a Web archiving chain

DCC – Web Archiving Tools

Posted in Archiving, Web | 5 Comments »