JISC Beginner's Guide to Digital Preservation

…creating a pragmatic guide to digital preservation for those working on JISC projects

Archive for the 'Conference' Category

DCC Roadshow in Cardiff

Posted by Marieke Guy on 15th December 2011

Snow, sleet, hailstones, rain and sunshine! The Cardiff weather couldn’t make up its mind, but the Digital Curation Centre (DCC) roadshow carried on regardless. Although I have attended various days of the travelling roadshow (Bath and Cambridge), I’ve never actually managed to catch day one. The opening day is an opportunity to hear an overview of the research data management landscape and is also the day on which local case studies make it onto the agenda, so I was looking forward to it.

Welcome: Janet Peters, Cardiff University

Janet Peters, Director of University Libraries and University Librarian for Cardiff University, opened the day by saying how keen she was to have the roadshow take place locally, feeling it to be very timely given current research data management (RDM) work in Cardiff. Janet explained that her attendance at the Bath roadshow had kick-started Cardiff’s work in this area. Cardiff have recently revitalised their digital preservation group and have been providing guidance and assisting departments with implementing changes to their RDM processes – more on this later. They have also recently rolled out an institutional repository, though it doesn’t cover data sets (at the moment).

The Changing Data Landscape: Liz Lyon, UKOLN

Liz Lyon on The Changing Data Landscape

Liz set the scene for the day by outlining the current data landscape. She began by introducing the new BIS report entitled Innovation and Research Strategy for Growth, which expresses the government’s support for open data and introduces the Open Data Institute (ODI). Only last week David Cameron made the suggestion that “every NHS patient should be a ‘research patient’ with their medical details ‘opened up’ to private healthcare firms”. Openness and access to data are two of the biggest challenges of the moment and have stimulated much debate. Liz gave the controversial example of one tobacco company’s FOI request to the University of Stirling for information relating to a survey on the smoking habits of teenagers. She explained that a proposed amendment to FOI legislation would allow institutions to ask for exemption from FOI requests while research is ongoing. It’s often the case that researchers don’t want to share data, and there have been instances when governments have placed restrictions on data use (e.g. the Bring Your Genes to Cal project). Liz shared some more positive examples of research being shared, e.g. Alzheimer’s research, the 1000 Genomes Project, the Personal Genome Project and openSNP. She also offered some citizen science examples: BBC Nature, Project Noah http://www.projectnoah.org/, Galaxy Zoo, Patients Participate and BBC Lab. The Panton Principles are a recent set of guidelines that offer possible approaches: open knowledge, open data, open content and open service. To some degree the key to all of this is knowing about data licensing, and the DCC offer advice in this area.

Liz then moved on to what is often seen as the biggest challenge of all: the sheer volume of data now created, e.g. by the Large Hadron Collider. In the genomics area there are lots of shocking statistics on the growth of data and the implications of this. Another new report, the PHG Foundation’s Next Steps in the Sequence, sets out the implications of this data deluge for the NHS. The text The Fourth Paradigm highlights data-intensive research as being the next step in research. The DCC are working with Microsoft Research Connections to create a community capability model for data-intensive research.

It is apparent that big data is being lost, but so is small data (like Excel spreadsheets), and part of the challenge is working out how scientists can deal with the long tail. The gold standard is data for which both the code and the data can be fully replicated; reproducible research is the second-best approach. Data storage needs to be scalable, cost-effective, secure, robust and resilient, have a low entry barrier and be easy to use. Liz also asked us to consider the role of cloud services, giving DataFlow http://www.dataflow.ox.ac.uk/, VIDaaS, BRISSKit and lab notebook as four JISC projects to follow in this area.

Liz then talked a little about policy, giving research council examples. The most relevant is the fairly demanding set of EPSRC expectations, which have serious implications for HEIs: institutions must provide an RDM roadmap by 1st May 2012 and must be compliant with the expectations by 1st May 2015. At the University of Bath, where Liz is based, there is a new project called Research360@Bath with a particular faculty–industry focus. There will also be a new data scientist role based at UKOLN. A full list of funders and their requirements is available from the DCC Web site.

Resources are available: back in 2010 the Incremental project http://www.lib.cam.ac.uk/preservation/incremental/ found that many people felt institutional policies were needed in the RDM area. Edinburgh have developed an aspirational data management policy. The DCC have pulled together exemplars of data policy information http://www.dcc.ac.uk/resources/policy-and-legal/institutional-data-policies, and ANDS also have a page on local policy.

It is also important to consider how to incentivise data management. There is quite a lot of current work on impact, data citation and DOIs. Some example projects: Total Impact http://total-impact.org/ and SageCite.

And what about the cost? Useful resources include the Charles Beagrie report Keeping Research Data Safe http://www.beagrie.com/jisc.php. Neil Beagrie has also done some work on helping people articulate the benefits through use of a benefits framework tool.

In conclusion Liz asked delegates to think about the gaps in their institution.

Digital Data Management Pilot Projects: Sarah Philips, Cardiff University

Sarah explained that Cardiff University has retention requirements for quite a lot of corporate and permanent records, as well as requirements to keep some of its research data for 5–30 years. In response to feedback on a digital preservation policy, the University has set up three pilot projects: in the cultural area, in the School of Biosciences (using genomic data) and in the School of History and Archaeology. Work in the School of History and Archaeology is now coming to a close, and this is the area Sarah concentrated on.

Three projects within the department were used as a test bed. The South Romanian Archaeological Project (SRAP) had collected excavation data and the team were keen to make the data available. The Magura Past and Present Project had artists coming in and creating art; because it was an engagement project the outputs were required to be made available, though not necessarily the data. The final project was on auditory archaeology. All three projects were run by Dr Steve Mills.

Records management audits were carried out through face-to-face interviews with staff using the DCC’s Data Asset Framework. Questions included: what records and data are held? How are the records and data managed and stored? What are the staff member’s requirements? A data asset register was created that dealt with IP issues, ownership issues and so on. Once this data was collected, potential risks were identified: for example, Dr Mills had been storing data on whatever hard drives were available without a systematic approach, some metadata was available but the file structure was an issue, proprietary formats were used and no file naming procedures were in place. Dr Mills was keen to make the data accessible, so the RDM team have been looking at depositing it with the Archaeology Data Service; if this solution isn’t feasible they will have to use an institutional solution.

High Performance Computing and the Challenges of Data-intensive Research: Martyn Guest, Cardiff University

Martyn started off by giving an introduction to Advanced Research Computing at Cardiff (ARCCA), which was established in 2008. Chemistry and physics have been the biggest users of high performance computing so far, but the data problem is relatively new and has really arisen since the explosion of data use by the medical and humanities schools.

He sees the challenges as being technical (quality, performance, metadata, security, ownership, access, location and longevity), political (centralisation vs departmental, governance, ownership), personal, financial (sustainability) and legal & ethical (DP, FOI). Martyn showed us their data-intensive supercomputer (‘Gordon’) and a lot of big numbers (for file sizes) were bandied about! Gordon runs large-memory applications (supermode) with 512 cores, 2 TB of RAM and 9.6 TB of flash. It has been the case that NERC has spent a lot of time moving data, leaving less effort for analysing it.

Martyn shared a couple of case studies. In Positron Emission Tomography (PET) imaging the biggest issue was that the data was raw: researchers weren’t interested in patient-identifiable data and just wanted the image, while clinicians wanted both the PID and the image. He also talked about sequencing data: sequencing is now relatively easy, and the hard bit is using biometrics on the data. As Martyn explained, it now costs more to analyse a genome than to sequence it, and the big issue is sharing that data. Martyn joked that the “best way to share data is by FedEx”, and many agreed that this may often be the case! The case studies showed that in HPC it’s often a computational problem. HPC Wales has three components, including awareness building around HPC and the creation of a Welsh network that can be accessed from anywhere and is globally distributed.

Martyn concluded that the main issues are around how to do the computing efficiently while the archiving issues continue to be secondary.

Research Data Storage at the University of Bristol: Caroline Gardiner, University of Bristol

Caroline Gardiner explained that at the University of Bristol her team had originally carried out a lot of high performance computing but were increasingly storing research data. She noted that the arts subjects are increasingly creating huge data sets.

Caroline admitted to collecting horror stories of lost data and using these as a way to leverage support. The Bristol solution has been BluePeta, a petascale facility created with £2m of funding. The facility is purely for research data at the moment, not learning and teaching data, though it is expandable.

Caroline explained that their success in this area came from many directions. Bristol already had a management structure in place for HPC and for research data storage, and they had access to the strategy people and those who held the purse strings. Bristol also have a research data storage and management board, and there continues to be buy-in from academics.

The process in place is that the data steward (usually the principal investigator, PI) applies and can register one or more projects. There is then academic peer review, and storage policies are applied. There is a cost model in place: the data steward gets 5 TB free and then pays £400 per TB per annum for disk storage. PIs are being encouraged to factor in these costs when writing their research grant applications. The facility is more for data that needs to be stored over the long term rather than active data.
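To make the cost model above concrete, here is a small sketch of the annual charge a data steward would face under the figures quoted in the talk (the function name and structure are my own, not Bristol’s):

```python
def annual_storage_cost(total_tb, free_tb=5, price_per_tb_gbp=400):
    """Annual disk storage cost in GBP under the model described:
    the first free_tb terabytes are free, and anything beyond that
    is charged at price_per_tb_gbp per TB per annum."""
    chargeable_tb = max(0, total_tb - free_tb)
    return chargeable_tb * price_per_tb_gbp

# A PI registering 12 TB pays for the 7 TB beyond the free allowance.
print(annual_storage_cost(12))  # 2800
```

Figures like these are exactly what PIs are being encouraged to build into their grant applications.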

Bristol are also exploring options for offsite storage and will also be looking at an annual asset holding review. They are also looking at preparing an EPSRC roadmap and starting to address wider issues of data management.

In answer to a question Caroline explained that they had carried out a cost analysis against third-party solutions, but with the big players (like Google and Amazon) the cost of moving the data was the issue. There was some discussion of peer-to-peer storage, but delegates were concerned that it would kill the network.

Data.bris: Stephen Gray, University of Bristol

Following on from Caroline’s talk, Stephen Gray talked about what was happening on the ground through data.bris. Stephen explained that the drivers for the project were meeting funder requirements (not just EPSRC’s), meeting publisher requirements, using research data in the REF and increasing successful applications. Bristol have agreed a digital support role alongside the data.bris project, though this is all initially limited to the department of arts and humanities.

The team will initially be meeting with researchers and using the DMPOnline tool to establish funder requirements and ethical, IPR and metadata issues. After the planning there will be the research application and then, hopefully, research funding. The projects will then have access to BluePeta storage. Curation is planned to happen at the end of each project, with high-value data identified for curation. Minimal metadata should be added at this stage, though there is a balancing act here between resourcing and how much metadata is added. Bristol have a PURE research management system and a data.bris repository where they can check the data, carry out metadata extraction and assign DOIs. They will then promote and monitor data use.

In the future the team also want to look into external data centres use. A theme running through the project is ongoing training and guidance and advocacy and policy. Training will need to go to all staff including IT support and academic staff and they are hoping for some mandatory level of training.

Bristol are also planning on using the DCC’s CARDIO and DAF tools.

In the Q&A session delegates were interested in how Bristol had received so much top-down support for this work. It was explained that the pro-VC for research was a scientist and understood the issues. While there was support for research data, it was felt that there could be more support for outputs.

Herding Cats – Research Publishing at Swansea University: Alexander Roberts, Swansea University

Alexander Roberts started off his presentation by saying that Swansea wants it all: all data, big data, notes scribbled on the back of fag packets, ideas, searchable and mineable data. Not only this, but Swansea would like it all in one place; currently they have a lot of departmental databases and various file formats in use. Swansea looked at a couple of different systems, including PURE, but wanted an in-house content management system; they also inherited a DSpace repository. They wanted this system to integrate with their TerminalFour Web CMS and with their DSpace system Cronfa, and to give RSS feeds for staff research profiles, Twitter feeds, Facebook updates etc.

There was a consultation process that allowed lots of relationships to be formed and the end users to be involved. People were concerned that if they passed over their data they wouldn’t be able to get it back. A schema was created for the system. They started off using SharePoint and were clear that they wanted everything in a usable format for the REF. The end result was built from the ground up: a form-based research information system that allows researchers to add their outputs as easily as possible. It is a simple form-based application that integrates with the HR database and features DOI resolving and MathML support. The ingest formats are RSS, XML, Excel, Access and others. It provides an Open Data Protocol (OData) endpoint which feeds the Web CMS and personal RSS feeds.
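As a rough illustration of how a downstream consumer (such as the Web CMS or a staff profile feed) might use such an OData endpoint, here is a sketch that parses an OData v2-style JSON payload and groups outputs by author. The payload and field names below are invented for the example and do not reflect Swansea’s actual schema:

```python
import json

# Hypothetical OData v2-style JSON response listing research outputs;
# real OData v2 services wrap results in a {"d": {"results": [...]}} envelope.
payload = json.loads("""
{"d": {"results": [
  {"Title": "A study of X", "Author": "J. Smith", "DOI": "10.1000/example.1"},
  {"Title": "Notes on Y", "Author": "A. Jones", "DOI": "10.1000/example.2"}
]}}
""")

def outputs_by_author(data):
    """Group research output titles by author, as a profile feed might."""
    grouped = {}
    for item in data["d"]["results"]:
        grouped.setdefault(item["Author"], []).append(item["Title"])
    return grouped

print(outputs_by_author(payload))
```

The point of an OData endpoint is exactly this kind of reuse: one queryable feed that many consumers (CMS pages, RSS, Twitter updates) can be generated from.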

Alexander ended by saying that in 2012 they would like to implement automatic updates to DSpace via SWORD and a searchable live directory of research outputs. They also want to have enhanced data visualisation tools for administrators. Mobile support is also a high priority, as Swansea have a mobile-first policy.

Michael Day and Alexander

Delivering an Integrated Research Management & Administration System: Simon Foster, University of Exeter

A Research Management and Administration System (RMAS) is more about managing data about projects, but can also deal with research data. The Exeter project has been funded under the UMF, funded by HEFCE through JISC, and is part of HEFCE’s bigger vision of cloud computing and joining up of systems. HE USB, a test cloud environment from Eduserv, is being used. Simon Foster described how the project had started with a feasibility study which looked at whether there was demand for a cradle-to-grave RMAS system; 29 higher education institutions expressed interest. The project was funded, and it was worked out that 29 HEIs phased in over ten years could save £25 million. The single-supplier approach was avoided after concerns that it could kill all others in the market. The steering group looked at the processes involved and these were fed into a user requirements document. It was necessary that the system was cloud-enabled and compliant with CERIF data exchange. Current possible systems include Pure, Avida etc. Specific modules were suggested. The end result will be a framework that will allow institutions to put out a mini-tender for RMAS systems asking specific institution-related questions. Institutions should be able to do this in 4 weeks rather than 6 months.

The next steps for the project are proof of concept deliverables using CERIF data standards and use of externally hosted services. They also want to work with other services, such as OSS Watch.

There followed a panel session which included questions around the cost implications of carrying out this work. One suggestion was to consider the cost of failed bids due to lack of data management plans.

What can the DCC do for You?: Michael Day, UKOLN

Michael Day finished off the day with an overview of the DCC offerings and who they are aimed at (from researchers to librarians, from funders to IT services staff). He reiterated that part of RDM is bringing together different people from disparate areas and clarifying their role in the RDM process. The DCC tools include CARDIO, DAF, DMP Online, DRAMBORA. Some of the services include policy development, training, costing, workflow assessment etc. DCC resources are available from the DCC Website.


So after a day talking about data deluge while listening to a deluge of the more familiar sort (loud hail and rain) we were left with a lot to think about.

One interesting insight for me was that while the data deluge had come originally from certain science areas (astronomy, physics etc.), now more and more subjects (including arts and social sciences) are creating big data sets. One possible approach, advocated by a number of the day’s presenters, is to use HPC as a starting point from which to jump-start research data management. However, there will continue to be a lot of data ‘outside of the circle’. As ever, join-up is very important. Getting all the stakeholders together is essential, and that is something the DCC roadshows do very well. All presentations from the day are available from the DCC Web site.

The next roadshow will take place on 7–8 February 2012 in Loughborough. It is free to attend.

Posted in Conference, dcc, irg | Comments Off

Alliance for Permanent Access Conference

Posted by Marieke Guy on 16th November 2011

Last week (8th – 9th November) I attended the Alliance for Permanent Access (APA) annual conference in London. The APA aims to develop a shared vision and framework for a sustainable organisational infrastructure for permanent access to scientific information.

The event was held at the British Medical Association House, a fantastic setting for an event. It was a really interesting conference which provided a chance to hear about lots of great digital preservation projects.

There were a lot of really interesting plenaries so I’ve summarised a few of my personal favourites:

Digital Preservation What Why Which When With? – Prof. Keith Jeffery, Chair of APA Executive Board.

Unfortunately the European Commissioner Neelie Kroes couldn’t make it, so Keith, outgoing chair of the Alliance, gave the keynote instead. Keith reflected on the history of digital preservation, starting with the legendary story of the BBC Domesday Project and the CAMiLEON project. Keith talked about the importance of keeping digital resources accessible, understandable and easy to find. He gave an overview of some of the value judgements that need to be made, the standards (OAIS) and best practice (looking at projects like PARSE.Insight and APARSEN). Keith also emphasised the role of the APA in this area, pulling together digital preservation research.

ODE Project – Dr Salvatore Mele, CERN

Salvatore Mele introduced the Opportunities for Data Exchange (ODE) project, which is about sharing data stories. Currently there are lots of incentives for research but not for preservation, and the transition from science to e-Science has resulted in a data deluge that needs serious attention! Salvatore talked about the impossible triangle of reuse, (open) access and preservation: each leans heavily on the others. ODE has considered both carrot and stick approaches, which have some value (e.g. the carrot of sharing big data offers incentives for research, not preservation) but aren’t enough. Mele explained that with no stick and no carrot we may have to work one by one with researchers to encourage sharing. ODE offers a way to reduce the friction in research data management through awareness raising. The ODE project booklet Ten Tales of Drivers & Barriers in Data Sharing is definitely worth a read.

Mr Mark Dayer, Consultant Cardiologist, Taunton & Somerset NHS Trust

It was really refreshing to hear the view of an outsider. Mark Dayer is not involved in digital preservation; he is a consultant cardiologist – he operates on hearts. Mark gave an incredibly open and entertaining presentation on the state of play in the National Health Service (NHS). He began by giving some background for the non-UK residents in the audience: “The NHS is a beloved institution that no political party dare dismantle” – or at least it used to be. Unfortunately the NHS and IT have made for grim headlines in the recent past: the NHS has enormous quantities of data and an enormous number of diverse systems working locally and in unconnected ways. Many people are still working with paper-based systems. Not only this, but the NHS needs to make £20 billion of savings. Mark explained how an increasing number of systems (120 different clinical systems in use in one area) and bad IT planning have added to the problem. Other issues such as data security add to the mix: the ‘spine’ personal records system should hold over 50 million records but has only 5 million so far.

After the disaster story Mark moved on to the small successes that have started to happen. He explained that they are starting to build data centres, use the cloud (e.g. Chelsea and Westminster Hospital) and use integration engines (which give an idea of the number of data standards). He talked about the systems and standards in use, including CDA, HL7, ICD-10 (a classification system), OPCS and SNOMED CT, and about the new N3 VPN. Mark concluded by saying that it wasn’t just about the right software, but about the right hardware too, and that you need to bring people with you, all the way.

Dr Martha Anderson, Director of the NDIIPP, US Library of Congress, Networks as evolving infrastructure for digital preservation

Martha Anderson started off by showing us a picture of the biggest spider web ever seen. She explained that the old African proverb “when spiders unite they can take down a lion” applies here. Almost a dozen spider families were involved in the creation of this web; the population had exploded due to wet conditions. Martha applied this analogy to digital preservation networks, telling us that our networks will evolve if the conditions are right. The National Digital Information Infrastructure and Preservation Program (NDIIPP) was created to help create networks between people to undertake preservation – communities working together as bilateral and multilateral alliances.
Many different institutions are now involved in digital preservation and in developing alliances across communities. A good example is the Blue Ribbon Task Force, which cut across sectors including finance, science, aerospace and HE. Other sectors have much to offer us; for example, Martha has learnt about video metadata annotation from Major League Baseball! The Data-PASS network gives a picture of what networks are doing. Martha concluded that it is all about setting up and supporting social and local interaction to build networks – finding common stories. She felt that if there is no local benefit from the work then it cannot be sustained and cannot last past the funding. Martha observed that it is interesting that groups of institutions will act in the public interest, whereas on their own institutions act in their own interest. Networks are beneficial to all.

UK Government views, Nigel Hickson, Head EU and International ICT Policy DCMS

Nigel Hickson was there to talk about the government’s responsibility for the digital infrastructure, which includes the take-up of broadband and copyright issues. Nigel began by singing the praises of the Riding the Wave report, released in 2010 by the high-level expert group on research data. He talked about the importance of having a framework and a holistic approach. For many, broadband is an economic driver; mobile data continues to be a disruptive element (doubling every year) and all this spells game change for the public sector. The problem is that mobile data is increasing; the solution is having an ‘auction’ to increase capacity. The current UK approach is that the market should lead and that competition is vital. Britain’s superfast broadband strategy has £530 million to spend by 2015 and the potential for an extra £300 million before 2017. Projects require a price match from the private sector. The government also wants things to be digital by default, with the option of doing them offline if necessary. Other key priorities are a rights management infrastructure and the proposal on orphan works.

Nigel also outlined the European digital agenda, where broadband is again a critical element. The key European targets are basic broadband for 100% of citizens by 2013, and subscriptions of 100 Mbit/s or above for 50% of households by 2020.

The report A Surfboard for Riding the Wave builds on the 2010 report; it presents an overview of the present situation with regard to research data in Denmark, Germany, the Netherlands and the United Kingdom, and offers broad outlines for a possible action programme for the four countries in realising the envisaged collaborative data infrastructure.

Posted in Conference, Events | Comments Off

Approaches to Digitisation

Posted by Marieke Guy on 11th February 2011

Digital preservation is about preserving digital objects. These objects have to come into being somehow, and earlier this week (Wednesday 9th February) I was invited to talk at an Approaches to Digitisation course facilitated by Research Libraries UK and the British Library. It was held at the British Library Centre for Conservation, a swanky new building in the British Library grounds. It was the first time the course had run, though they are planning another in the autumn. The course was aimed at those from cultural heritage institutions who are embarking on digitisation projects and sought to provide an overview of how to plan for and undertake digitisation of library and archive material.

British Library by Phil Of Photos

I was really pleased that the event spent a considerable amount of time on the broader issues. Digitisation itself, although not necessarily easy, is just one piece of the jigsaw and there has been a tendency in the past for institutions to carry out mass digitisation and not consider the bigger picture. During the day several speakers advocated the use of the lifecycle approach and planning, selection and sustainability were highlighted as being key areas for consideration. If digitisation managers take this on board the end result will hopefully be a collection of well-rounded, well-maintained, well-used digitised collections with a preservation strategy in place.

The course followed a fairly traditional format with presentations, networking time and a printed delegate pack. Unfortunately there was no wireless, but this left us concentrating completely on the presenters and what they had to say, and it was all very useful stuff.

Benefits of Digitising Material – Richard Davies, British Library

Richard Davies started the day with an introduction to the benefits of digitisation (a good article entitled The Case for Digitisation is available on P16 of the most recent JISC inform). Rather than just giving a straightforward overview of the different benefits he used a number of case studies to illustrate the added value that digitisation can provide, for example by opening up access, allowing digital scholarship and collaboration.

The British Library has now digitised approximately 4 million newspapers. Opening up access has meant that people can use the papers in completely different ways, for example through full-text searching and different views on resources. Projects like the British Library Beowulf project and others allow extensive cross-searching, and the Codex Sinaiticus project has taken a bible physically held in four locations and allowed it to be accessed as one, for the first time. The Google Art Project allows users to navigate galleries in a similar way to Google Street View, and the high resolution of the digital objects is impressive.

Google Art Project

Digitisation also presents opportunities for taking the next step in digital scholarship. In the past, carrying out the level of research that is now possible with digital resources would have taken a very long time indeed. The Old Bailey project has digitised 198,000 records and users can now carry out extensive analysis on content in a matter of minutes.

Davies also illustrated how digitisation can allow you to expand your collection by bringing in resources from the general public and by crowdsourcing. The Trove project has crowdsourced the correction of its optical character recognition (OCR) results, offering prizes for the people who corrected the most text; they have also made many details of how they went about the project available online. The Transcribe Bentham project have likewise made many details about how they carried out their work available on their blog.

Davies suggested that digitisation managers need to think about the model they will be using. Will content be freely available or will there be a business model behind it? One option is to allow users to dig to a certain level and then ask them to pay if they wish to access further resources.

Davies concluded that to have a successful digitisation project you need to spend time on the other stuff – metadata, OCRing the text, making resources available in innovative ways. Digitisation is only one element of a digitisation project.

Planning for Digitisation- Richard Davies, British Library

Richard Davies continued presenting, this time looking more at his day job – planning for digitisation projects. He offered up a list of areas for consideration (though stated that this was a far from exhaustive list).

He suggested that a digitisation strategy helps you prioritise and can be a way of narrowing down the field. Such a strategy should fit within a broader context, in the British Library it is part of their 2020 vision. Policy and strategy should consider questions like: Who are we? Where could we go? Where should we go? How do we get there? It should also bear in mind funding and staffing levels.

Davies also spent a lot of time talking about the operational and strategic elements of embarking on a project. Preparation is key: he suggested digitisation managers do as much preparation up front as possible without holding up the project. For example, when considering selection, ask what is unique, what is needed and what is possible (bearing in mind cost, copyright and conservation). He also emphasised the importance of lessons-learnt reports.

Davies concluded by talking about some current challenges for digitisation programmes. The primary one is economic, as funding calls are increasingly rare, and it can be useful to have a funding bid expert on board. He also explained that you can make the most of the bidding process by using it as an opportunity to answer difficult questions about what you want to do. There is currently a lot of competition for funding: the last JISC call (the Rapid Digitisation call) offered £400,000 of funding; 45 bids were received and 7 projects were funded.

Davies also highlighted that digital preservation and storage are increasingly becoming problems. Sustainability need not mean forever, but you should at least have a 3–5 year plan in place.

I was also pleased to hear Davies highlight a project I am now working on: the IMPACT project, funded by the European Commission, which aims to significantly improve access to historical text and to take away the barriers that stand in the way of the mass digitisation of European cultural heritage.

Use of Digital Materials – Aquilies Alencar-Brayner, British Library

After a coffee break Aquilies Alencar-Brayner considered how users are currently using digital materials. He mentioned OCLC research showing that students consistently start their searches at search engines rather than at library websites. They are also using the library less, even though books are still the brand most associated with it.

Alencar-Brayner ran through the 10 ‘ins’ that users want: integrity, integration, interoperability, instant access, interaction, information, ingest of content, interpretation, innovation and indefinite access.

He showed us some examples of how the British Library is facilitating access to digital materials, for example through the Turning the Pages project, which lets you actually turn the pages, magnify the text, see the text in context and listen to audio.

Codex Sinaiticus

Where to begin? Selecting resources for digitisation, Maureen Pennock, British Library

Maureen Pennock introduced us to selection. She explained that selection is usually based on previously selected resources, and the reason most commonly given is improving access. Sometimes, however, the reason is conservation of the original, and occasionally it is enabling non-standard uses of a resource.

Pennock explained that selection is often based on the appraisal made for archival purposes, known as assessment. Areas for consideration include suitability, desirability and whether resources are what users need.

Selection is an iterative process and is revisited several times after you have defined your final goals and objectives. It is important to identify internal and external stakeholders, such as curators and collection managers, and include them in the process.

Once you have set a scope you will need to pre-select items, but there is no one-size-fits-all approach. Practical and strategic issues come into play, and items will need to be assessed and prioritised.

Pennock explained that assessing suitability will need to consider areas like intellectual justification, demand, relevance, links to organisational digitisation policy, sensitivity and potential for adding value (e.g. commercial exploitation of resources).

Alongside suitability there will need to be item assessment, looking at the quality of the original, the feasibility of image capture, the integrity and condition of resources, complex layouts for different material types, historical and unusual fonts and the size of artefacts. Legal issues such as copyright, data protection, licences and IPR also have a role to play.

Pennock concluded that not all issues are relevant to everyone, and some will carry more weight than others. Practitioners will need to decide on their level of assessment and define their shortlist. It is important that you can justify your selection process in case issues arise later down the line.

Metadata Creation, Chris Clark, British Library

To whet our appetites for lunch Chris Clark took us on a whirlwind tour of digitisation metadata and its value. He explained that metadata adds value but is unfortunately often left until the end, with tragic consequences. He also warned us that there is still no commonly agreed framework and that this is still an immature area. Quite often metadata's real value is realised in situations where it isn't expected; Clark recommended Here Comes Everybody by Clay Shirky as a text that illustrates this. He also suggested delegates look at One to Many; Many to One: The Resource Discovery Taskforce Vision.

Metadata is a big topic and Clark was only able to scratch the surface. He advised us to think of metadata as a lubricant or adhesive that holds together users, digital objects, systems and services. We could also see metadata as a savings account: the more you put in, the more you get out.

Clark then offered us a quick introduction to XML and some background to the types of metadata most relevant to digitisation: descriptive, administrative and structural. He explained that Roy Tennant of OCLC had characterised three essential metadata requirements: liquidity (written once, used many times), granularity and extensibility (accommodating all subjects).
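To make the descriptive metadata idea concrete, here is a minimal sketch in Python of building an XML record with Dublin Core-style element names. The element names and values are purely illustrative, not a schema used by the British Library:

```python
import xml.etree.ElementTree as ET

# A minimal descriptive metadata record. The element names follow a
# Dublin Core-like convention purely for illustration.
record = ET.Element("record")
for name, value in [
    ("title", "Proceedings of the Old Bailey, 1674-1913"),
    ("creator", "Old Bailey Online project"),
    ("type", "digitised text"),
    ("rights", "See project website for reuse terms"),
]:
    ET.SubElement(record, name).text = value

# Serialise to a string that could be stored alongside the digital object
xml_string = ET.tostring(record, encoding="unicode")
print(xml_string)
```

A record like this illustrates the "written once, used many times" point: the same XML can feed a catalogue display, a search index and an OAI-PMH feed without being re-keyed.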

Clark concluded with a high-level case study he had worked on: Archival Sound Recordings at the British Library. On the project they had passed some of the load to the public by crowdsourcing recording-quality checks and asking people to add tags and comments.

Preparing Handling Guidelines for Digitisation Projects, Jane Pimlott, British Library

After a very enjoyable lunch Jane Pimlott provided a real-world case study, looking at a recent project in which the British Library created training and handling guidelines for a two-year project to scan 19th-century regional newspapers. The project was externally funded but the work was carried out on the Library's premises at Colindale. The team had six weeks in which to deliver the training, though a service plan was already in place and contractors were used.

Pimlott explained that damage can occur even when items are handled carefully, and that material in poor condition can still be digitised but takes longer. She explained the need to understand the processes and equipment used, e.g. large-scale scanners. Much of the digitisation team's work on the newspaper project had been making judgement calls when assessing the suitability of items for scanning. Their view was that scanning should not be at the expense of the item; it should not be seen as last-chance scanning. Pimlott concluded that different projects present different risks and may require different approaches to handling and training.

Preservation Issues, Neil Grindley, JISC

Finally the day moved in to the realm of digital preservation. Neil Grindley from JISC explained how he had come from a paper, scissors, glue and pictures world (like many others there) but that the changing landscape required changing methods.

He began by trying to find out whether people considered digital preservation to be their responsibility. Unsurprisingly few did. He explained that digital preservation involves a great deal of discussion and there is a lot of overlapping territory; it is best undertaken collaboratively. Career paths are only just beginning to emerge and the benefits are hard to explain and quantify. He revealed that a recent Gartner report stated that 15% of organisations are going to be hiring digital preservation professionals in the future, so it is a timely area in which to work. Despite this it is still tricky to make a business case to your organisation for why you should be doing it.

Grindley explained that there is no shortage of examples of digital preservation out there; recent ones include Becta and Flickr.

Grindley then went on to make the distinction between bit preservation and logical preservation. Bit preservation is keeping the integrity of the files that you need. He asked: is bit preservation just the IT department's backup, or is it more? He saw the preservation specialist as sitting between the IT specialist and the content specialist, almost as a go-between.

He used Heydegger's example of pixel corruption to show that corruption is both easy to introduce and potentially dangerous – especially in scientific research areas.

Grindley took us on a tour of some of the most pertinent areas of digital preservation, such as checksums. These are very important for bit preservation: when you go back to a file you can check that it has not been corrupted or changed, and it is very easy to see if a file has been tampered with over time. He also suggested a number of tools for generating and verifying checksums.

Grindley then considered some of the main digital preservation strategies – technology preservation, emulation and migration – which led him on to the subject of logical digital preservation: not just focussing on keeping the bits but looking at what the material is and keeping its value.

To conclude, Grindley looked at some useful tools out there, including DROID (digital record object identification), Curator's Workbench (a useful tool from the University of North Carolina that creates a MODS description) and Archivematica (a comprehensive preservation system). He also touched on new JISC work in this area.

Five new preservation projects will start between February and July 2011.

Other Sources of Information, Marieke Guy, UKOLN

I concluded the day by giving a presentation on other sources of information on digitisation and digital preservation. My slides are available on Slideshare and embedded below.

I think by now the delegates had had their fill of information but hopefully some will go back and look at the resources I’ve linked to.

To conclude: I really enjoyed the workshop and found it extremely useful. If I have one criticism it’s that the day was a little heavy on the content side and might have benefited from a few break-out sessions – just to lighten it up and get people talking a little more. Maybe something for them to bear in mind for next time?

Posted in Conference, Events | 2 Comments »

Addressing the Research Data Challenge

Posted by Marieke Guy on 8th November 2010

Last week the Digital Curation Centre (DCC) ran a series of inter-linked workshops aimed at supporting institutional data management, planning and training. The roadshow will travel round the UK but the first one was held in central Bath. The event ran over 3 days and provided Institutions with advice and guidance tailored to a range of different roles and responsibilities.

Day one (Tuesday 2nd November) looked at the research data landscape and offered a selection of case studies highlighting different models, approaches and working practices. Day two (Wednesday 3rd November) considered the research data challenge and how we can develop an institutional response. Day three (Thursday 4th November) comprised two half-day training workshops: Train the Trainer and Digital Curation 101.

Unfortunately due to other commitments I could only make the second day of the roadshow, but found it really useful and would thoroughly recommend anyone interested in institutional curation of research data to attend the next workshop (to be held in Sheffield early next year – watch this space!).

The Research Data Challenge: Developing an Institutional Response

Liz Lyon Presenting

Day two of the roadshow was aimed at high-level managers and researchers, with the intention of getting them to work together to identify first steps in developing an institutional strategic plan for research data management support and service delivery. Although there was a huge amount of useful information to take in (if only I'd come across more of it when writing the Beginner's Guide, which is currently awaiting the go-ahead for release!) it was very much a 'working day'. We were to get our hands dirty looking at real research curation and preservation situations in our own institutions.

After coffee, and some of the biggest biscuits I've seen, we were introduced to the DCC and given a quick overview by Kevin Ashley, Director of the DCC, University of Edinburgh. The majority of the day was facilitated by Dr Liz Lyon, Associate Director of the DCC and Director of UKOLN, University of Bath. Liz reiterated the research data challenge we face but pointed out that there are both excellent case studies and excellent tools now available for our use. Two worth highlighting here are DMP Online (the DCC's data management planning tool) and the University of Southampton's IDMB: Institutional Data Management Blueprint. The slides Liz used during the day were excellent; they are available from the DCC website in PPT format and can be downloaded as a PDF from here.

During the day we worked in groups on a number of exercises. The idea was that we would start fairly high level and then drill down into more specific actions. In the first exercise my group took a look at the motivations and benefits for research data management and the barriers that are currently in place. Naturally the economic climate was mentioned a fair amount during the day, but some of the long-standing issues still remain: where responsibility lies, lack of skills, lack of a coherent framework, taking data out of context, storage issues and so on. After our feedback Liz gave another plenary on Reviewing Data Support Services: Analysis, Assessment, Priorities. The key DCC tool in this area is the Data Asset Framework (formerly the Data Audit Framework), which provides organisations with the means to identify, locate, describe and assess how they are managing their research data assets – very useful for prioritising work. Useful reports include those from the Supporting Data Management Infrastructure for the Humanities (Sudamih) project. There was a feeling that looking into this area is becoming easier; people tend to be more open than they were a few years back, and there is definitely groundswell.

Group Exercises

In exercise 2 we carried out a SWOT analysis of current research data. In the feedback there were a few mentions of the excellent Review of the State of the Art of the Digital Curation of Research Data by Alex Ball. Liz also provided us with a useful resources list (in her slides).

After an excellent lunch and a very brief break (no time to rest when sorting out HE's data problems!) we returned to another plenary by Liz on Building Capacity and Capability in your Institution: Skills, Roles, Resources, which laid the groundwork for exercise 3 – a skills and services audit. This exercise required us to think about the various skills needed for data curation and align them with people in our institutions. There was a recognition that librarians do 'a lot' and are more than likely to become the hub for activity in the future. There was also a realisation that there are a fair number of gaps (for example around provenance) and that there can be a bit of a hole between the creation of data by researchers and the passing on of curated data to librarians – another reason why we need to create more links with our researchers. Again there were lots of excellent resources that I hope to return to, including Appraise & Select Research Data for Curation by Angus Whyte, Digital Curation Centre, and Andrew Wilson, Australian National Data Service.

Liz then gave her final plenary on Developing a Strategic Plan for Research Data Management: Position, Policy, Structure and Service Delivery. The suggestions on optimising organisational support and looking at quick wins put us in the right frame of mind for the final exercise – Planning Actions and Timeframe. We were required to lay down our 'real' and aspirational actions for the short term (0-12 months), medium term (1-36 months) and long term (over 3 years). A seriously tricky task! The feedback reflected on the situation we are currently in economically and how it offers us as many opportunities as challenges. Now is a better time than ever for reform and for information services to take on a leadership role. Kevin Ashley concluded the day with some thoughts on the big who, how and why issues. He stressed that training is so important at the moment: many skills are in short supply and employing new staff is not an option, so reskilling your staff is essential.

Flickr photos from the day (including photos of the flip chart pages created) are available from the UKOLN Flickr page and brief feedback videos are available from the UKOLN Vimeo page. There is also a Lanyrd entry for the roadshow. The event tag was #dccsw10.

Posted in Conference, Events | 1 Comment »

Moving out of the e-Fridge: iPres 2010

Posted by Marieke Guy on 27th September 2010

Last week I attended the 7th International Conference on Preservation of Digital Objects (iPres 2010) held at the Technical University of Vienna. The conference looks at both research and best practice in the field of digital preservation and comprises a full week of events including the regular conference, several workshops, the International Web Archiving Workshop (IWAW), the PREMIS implementation fair and lots of organised and impromptu meetings.

@art by Gerald Martineo – the conference artwork

This year they had just over 290 people registered and the programme offered keynotes, 2 tracks (made up of regular papers and late breaking results) and poster sessions. The content was an interesting mix of the more traditional presentations looking at areas like metadata and object properties and some more practical talks on areas like preserving Web data.

There was a lot to take on board during the four days I was in Vienna but here are some of my highlights.

Monday 20th September

The Fourth Paradigm

The Monday morning opening keynote, entitled The Fourth Paradigm: Data-Intensive Scientific Discovery & the Future Role of Research Libraries, was given by Tony Hey. Hey has his roots in the academic sector and was involved in setting up the Digital Curation Centre, but he now works for Microsoft; he also has a wife who is a librarian – all of which made for a broad perspective on current needs when it comes to the preservation of research data. Hey did a good job of putting forward Microsoft's position in this area: he explained that they are committed to open standards, open tools and open technology and are keen to be more involved. In the Q&A he actually admitted that Microsoft could do more to ensure its software is properly archived and available to others, and that he felt they had a 'responsibility' in this area.

Tony Hey gives his opening keynote

Hey's talk looked at the previous paradigms in science – experimental, theoretical and computational – and the move to a new data-intensive paradigm: the fourth paradigm (the title of his talk and his new book). Science is now overwhelmed with data sets; he gave the example of Chronozoom. Rather than shy away from the data deluge, Hey explained that we should embrace it; the future is collective peer review, collective tagging and lab notebooks as blogs. Hey also talked about software preservation and asked whether we can do better. We need to decide upon the key parts and save what is valuable; here he explained the relevance of Microsoft, as the computing industry is closest to the problem. Hey then went on to mention some valuable digital preservation work that Microsoft has had a research role in: the Planets project, the SCAPE project, APARSEN, DataCite, COAR, CNI and ICSTI.

Hey concluded by asking what the future of research libraries is. Have librarians abdicated, and are they in danger of being disintermediated? His quote from a US general hit the nail on the head: "if you do not like change you will like irrelevance even less". Hey suggested three tasks for libraries: the digital library; tools for authoring and publishing; and the integration of data and publications. Here he advocated that research libraries should be guardians of the research output of the institution, and mentioned that they should see the importance of repositories and not be afraid of cloud solutions.

Preserving Web Archives: One Size Fits All?

Straight after an interesting lunch of a pasta pie (Vienna isn't the best place for vegetarians!) we were offered a panel session on Preserving Web Archives: One Size Fits All? The panel [Libor Coufal (National Library of the Czech Republic), Andrea Goethals (Harvard University Library), Gina Jones (Library of Congress), Clément Oury (French National Library) and David Pearson (National Library of Australia)] were all members of the Preservation Working Group of the IIPC (International Internet Preservation Consortium), which is made up of about forty institutions that collect web content for heritage purposes. Each member of the panel was given two questions to answer: "Web archiving: do we have the same understanding of what we are trying to do?" and "What are our preservation strategies for web archives? Do we have the right technologies?"

A good summary of the answers is available from the iPres site. What became clear through the discussion was that there is significant variation in what organisations are capturing (for example the French National Library are keeping spam, seeing its inclusion as a more faithful mirror of contemporary French culture) and what they plan to do with it (there were differences in how ‘public’ different national libraries want to make their Web archives.)

The Q&A session was interesting. Kevin Ashley from the DCC pointed out that web archiving is not just about rendering single web pages; it is about the connections. It seems web archiving is still an uncracked nut, as David Pearson put it: "Web archives are the opposite of well-formed and homogeneous file systems – migration is going to be difficult".

Poster Spotlight Session

Marieke Guy in front of the Twapper Keeper poster

Later in the afternoon I was given my 2 minutes of fame and was able to present my poster on Twitter Archiving Using Twapper Keeper: Technical And Policy Challenges.

My very brief talk is available on Vimeo and embedded below. After my presentation there was a lot of interest in the Twapper Keeper software and I was lucky enough to talk to people from the Internet Archive and the Library of Congress.

Welcome at Vienna Rathaus

After all the day's sessions we had time to nip back to our hotel to put on our glad rags before a group walk along Vienna's Ringstraße boulevard. The welcome drinks reception was in the Coat of Arms Hall (Wappensaal) of the Vienna City Hall (Rathaus). We were treated to fantastic ballroom dancing, great wine and lots of interesting discussion.

The Ballroom dancers at the Rathaus

Tuesday 21st September

Digital Preservation Research: An Evolving Landscape

Tuesday's keynote was given by Patricia Manson from the European Commission. Manson has been involved in defining a research agenda. She sees the challenge as building new cross-disciplinary teams that integrate computer science with library and archival science and with business. Manson explained that there is a need to move away from the 'e-fridge' idea of digital preservation, i.e. locking objects away; she encouraged the view that preservation is about access. Manson also stated that digital preservation is not just a research issue and is too important to be left only to researchers: there is a need for a joined-up approach linking policy, strategy and technology actions. Ten years of research mean that we can now understand more complex, dynamic and distributed objects, but there is still much to do; web archiving, for example, is not a simple problem but an area that will evolve. Manson also talked about the need to involve other sectors and convince industry of the reasons for preservation. New stakeholders include aerospace, health care and finance; in science, astronomy and genomics; and governmental and broadcasters' archives, libraries and web archives. So far the European Commission has not been very good at handling risk and has tended to be risk averse; it needs to build strategies that are more open to advanced technologies.

Manson concluded by looking at the trends emerging in the latest call: new infrastructures, cloud, security and trust, and open questions on governance and responsibility. The next four years will require more scalable solutions; there is a need for more automation, and less human input, to deal with the sheer volume.

How Green is Digital Preservation

After lunch (spinach strudel for the second time) Neil Grindley from JISC moderated a panel session looking at How Green is Digital Preservation? I'm interested in environmental issues and the green ICT agenda (and have discussed them in more detail on my remote worker blog) so was really looking forward to this particular panel. After a whirlwind introduction by Grindley looking at the points of engagement between digital preservation and the green agenda, which included a quick showing of the "delete a petabyte, save a polar bear" poster, each speaker was given the opportunity to say where their organisation stood.

Panel session on How Green is Digital Preservation

In a fittingly 'green' talk, because it was given by video cast, Diane McDonald from the University of Strathclyde explained that for her "Green IT begins with green data". McDonald's main points were questioning replication and asking for leadership in this area.

Kris Carpenter Negulescu of the Internet Archive gave a practitioner's perspective, being upfront about the fact that the IA is primarily led by economic drivers. They had found that power is their second largest cost behind human resources, and power costs vary 'wildly' in Northern California. A tighter budget now requires practices not to be wasteful, and this has helped them be greener in their efforts. They had tried out various practices like turning off the air conditioning for four months of the year and venting heat into adjacent spaces that are too cool, or to the outside. Over time they had increased their storage density but their power costs had remained stable.

David Rosenthal from LOCKSS started off by admitting that digital preservation is not green at all. He showed how the time needed to read an entire disc has risen from around 240 seconds in 1990 to over 12,000 seconds today: transfer speeds have not kept pace with growing capacity.

William Kilbride from the Digital Preservation Coalition explained that unfortunately green is not what politicians talk about when it comes to IT; they are more driven by privacy and economics. He gave a 10-point plan of points at which to think about the green agenda, including procurement, the planning of new buildings and deletion.
The session ended a little flatly with the recognition that we all need to lead in this area but that still little is being done. Hopefully escalating energy prices will mean that big data centres try harder to work collaboratively to reduce individual footprints.

Lightning talks

In a similar session to the poster spotlight one on the previous day all delegates were given the opportunity to talk for a few minutes on an area of interest. Talks included:

  • Amanda Spencer from the National Archives talking about Web Continuity project
  • Ross Spencer from the National Archives talking about contributions to the National Archive PRONOM data
  • John Kunze from the University of California Curation Center talking about EZID – actionable IDs
  • Andreas Rauber from the Vienna University of Technology talking about Challenges in digital preservation
  • Richard Wright from the BBC defining what a digital object is (in the form of a miracle)
  • Stephen Abrams from the California Digital Library talking about curation of microservices
  • Martin Halbert highlighting the Aligning National Approaches to Digital Preservation conference, Tallinn, 2011

The lightning talks worked really well and were a useful way to identify people you might want to talk to later.

Later in the afternoon I gave my talk on Approaches To Archiving Professional Blogs Hosted In The Cloud. There were a few interesting questions around which approach we’d felt had worked best, unfortunately there wasn’t any easy answer! My talk was directly followed by probably my favourite one of the conference…

NDIIPP and the Twitter Archives

Martha Anderson from the Library of Congress (LOC) gave the story behind what happened on April 10, 2010, when the LOC and Twitter agreed that the Library would receive a gift of the archive of all public tweets shared through the service since its inception in 2006. On that day Twitter not only gave its archives to the LOC but also sold them to Google. Anderson began by giving some examples of the relevance of Twitter archiving. These included the Iran elections, where Twitter would later prove to be a resource for historical research (tweets are the modern form of diary entries), and business records – the LOC already has a partnership with business and sometimes keeps the business records of .com companies. She explained that the Senate is now using Twitter, and the LOC has many personal collections, so Twitter is a natural addition. Anderson explained that the Twitter archives are less than 5TB, so the conversation is not about space but much more about policy, privacy and access. The right-to-be-forgotten movement has since created sites like #NoLOC.org: Keep Your Tweets From Being Archived Forever. Anderson concluded that the issues for the LOC were not technical but social, and for her it demonstrated that there are no clean boundaries around the work we do.

Martha Anderson talking about the NDIIPP and the Twitter Archives

Reception at the Austrian National Library.

In the evening we attended a drinks reception at the Austrian National Library. There was a tour of the Prunksaal (State Hall) with a talk by Max Kaiser from the Austrian National Library, one of our iPres hosts, about the 30-million-euro deal the library has made with Google to digitise 400,000 copyright-free books. After marvelling at the hall we had a drinks reception in the Aurum of the National Library.

The Austrian National Library ceiling

Wednesday 22nd September

The final morning concluded with a number of case study sessions.

Capturing and Replaying Streaming Media in a Web Archive

Helen Hockx-Yu from the British Library talked about the approaches they had taken to archiving streaming media as part of Antony Gormley's One and Other art project in the UK. The project involved 100 days of continuous occupation of the fourth plinth in Trafalgar Square. Over this period 2,400 members of the public occupied the plinth for sixty minutes each, and this time was streamed over RTMP using Red Stream. The British Library then had the challenge of archiving the outputs. They did this using Jaksta, but also needed to carry out validation, spot-checking and repairs. However, their main challenges were initially curatorial (people wanted content removed) and legal – the videos are still only valid under a 5-year licence. The main conclusions drawn from the project were that it is highly costly to archive a site like this, there is still no generic solution, and there is a real need to manage expectations. The domain name now redirects to the British Library Web Archive site.

Final Thoughts…

This was the first time I’d attended an iPres conference and it really was quite an impressive event. Everyone was really friendly and I’ve made some great contacts which I hope to follow up. My path into digital preservation has been through the Web archiving route, I’ve always worked on projects that have had pragmatism and practicality at their heart (for example this project and the PoWR project). Some aspects of the conference did seem very research centric and technical, but there was still enough of relevance to me to keep my interest. From speaking to those who have attended before there does seem to be a move by iPres to embrace new digital preservation challenges (like Web archiving) and more hands on research (through the late breaking results papers).

I used the #ipres2010 hashtag a lot at the conference and felt that the insights shared by those tweeting really added to my experience. Unfortunately only a relatively small number of people were tweeting, though this is likely to change over the next few years. I’d recommend that the iPres organisers themselves use their iPres Twitter account more and specify hashtags for individual sessions, as well as for the whole conference. All the conference tweets have been archived in an iPres2010 Twapper Keeper Archive.

One other thing I would really like to see is links to speakers’ slides. Unfortunately the only resources offered were the papers, which were printed out in a huge proceedings book that went into the (equally huge) conference bag we were given!

After the conference I had the afternoon free to enjoy the delights of Vienna. Below are a few photos of the sights I saw.

More photos from the event are available on Flickr using the ipres2010 tag.

Posted in Conference | Comments Off

iPres 2010: Twitter Archiving Using Twapper Keeper

Posted by Marieke Guy on 15th September 2010

I’ve already mentioned my forthcoming trip to Vienna for the 7th International Conference on Preservation of Digital Objects – iPres 2010.

As well as presenting a paper on Approaches To Archiving Professional Blogs Hosted In The Cloud I will also be presenting a poster and giving a lightning presentation entitled Twitter Archiving Using Twapper Keeper: Technical And Policy Challenges. The full paper is held on the University of Bath repository and was written by Brian Kelly (UKOLN), Martin Hawksey (JISC RSC Scotland N&E), John O’Brien (Twapper Keeper), Matthew Rowe (University of Sheffield) and myself.

The paper explains that Twitter is now widely used in a range of different contexts, ranging from informal social communications and marketing purposes through to supporting various professional activities in teaching and learning and research. The growth in Twitter use has led to a recognition of the need to ensure that Twitter posts (‘tweets’) can be accessed and reused by a variety of third party applications.

It describes development work on the Twapper Keeper Twitter archiving service to support use of Twitter in education and research. The reasons for funding developments to an existing commercial service are described and the approaches for addressing the sustainability of such developments are provided. The paper reviews the challenges this work has addressed, including the technical challenges in processing large volumes of traffic and the policy issues related, in particular, to ownership and copyright.

The paper concludes by describing the experiences gained in using the service to archive tweets posted during the WWW 2010 conference and summarising plans for further use of the service.
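The full technical detail is in the paper itself; purely as an illustrative sketch (not Twapper Keeper's actual implementation), the core of a tweet archive can be as simple as a line-delimited JSON file with de-duplication on tweet id, so that repeated harvesting runs do not store the same tweet twice:

```python
import json

def archive_tweets(tweets, archive_path):
    """Append tweets (dicts with at least an 'id' key) to a
    line-delimited JSON archive, skipping any tweet id already stored.
    Returns the number of new tweets written."""
    seen = set()
    try:
        with open(archive_path, "r", encoding="utf-8") as f:
            for line in f:
                seen.add(json.loads(line)["id"])
    except FileNotFoundError:
        pass  # first run: no archive exists yet
    written = 0
    with open(archive_path, "a", encoding="utf-8") as f:
        for tweet in tweets:
            if tweet["id"] in seen:
                continue
            f.write(json.dumps(tweet) + "\n")
            seen.add(tweet["id"])
            written += 1
    return written
```

Append-only storage with one JSON object per line also makes the archive easy to process later with standard line-oriented tools.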

A copy of the poster is available on Scribd.

Posted in Conference, ipres2010, Paper | 3 Comments »

iPres 2010: Archiving Professional Blogs

Posted by Marieke Guy on 13th September 2010

Next week (20 – 22 September) I will be travelling to Vienna for the 7th International Conference on Preservation of Digital Objects – iPres 2010.

I will be presenting a long late breaking result paper at the conference entitled Approaches To Archiving Professional Blogs Hosted In The Cloud. The full paper is held on the University of Bath repository and was written by Brian Kelly and myself.

This is a practical paper which recognises that early adopters of blogs will have made use of externally-hosted blog platforms, such as WordPress.com and Blogger.com, due, perhaps, to the lack of a blogging infrastructure within the institution or concerns regarding restrictive terms and conditions covering use of such services. There will be cases in which such blogs are now well-established and contain useful information not only for current readership but also as a resource which may be valuable for future generations.

The paper argues that there is a need to preserve content which is held on such third-party services: ‘the Cloud’ presents a set of new challenges, distinct from the management of content hosted within the institution, where institutional policies should already address issues such as ownership and scope of content. Such challenges include technical issues, such as the approaches used to gather the content and the formats to be used, and policy issues related to ownership, scope and legal concerns.

It describes the approaches taken in UKOLN to the preservation of blogs used in the organisation, and covers the technical approaches and policy issues associated with the curation of a number of different types of blogs: blogs used by members of staff in the department, blogs used to support project activities and blogs used to support events.
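One practical route to gathering content from an externally-hosted blog, sketched here under the assumption that the platform exposes a standard RSS 2.0 feed (this is an illustration, not the specific approach described in the paper), is to parse the feed and record each post's metadata before harvesting:

```python
import xml.etree.ElementTree as ET

def parse_rss_items(feed_xml):
    """Extract basic post metadata (title, link, date) from an
    RSS 2.0 feed string, one dict per <item> element."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "date": item.findtext("pubDate", default=""),
        })
    return items
```

A list of post URLs produced this way could then feed a crawler or capture tool; the feed itself, however, usually holds only recent posts, so a full archive would need to page through the blog's history as well.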

My slides are available on Slideshare and are embedded below.

Posted in Conference, ipres2010, Paper | 4 Comments »