JISC Beginner's Guide to Digital Preservation » irg

DCC Tools Catalogue

Marieke Guy — Tue, 08 May 2012 08:28:31 +0000

The Digital Curation Centre (DCC) has recently updated its catalogue of tools and services for managing and curating research data.

The catalogue is available from
http://www.dcc.ac.uk/resources/external/tools-services

This is more than a new look; the catalogue has been overhauled to focus on software and services that directly perform curation and management tasks. It splits these resources into five major categories, based on who the intended users are and what stage of the data lifecycle they will be most useful in.

There is a category for Archiving and Preserving Information Packages with sub categories including:

Access Platforms – Tools to publish content and metadata to the web.
Backup and Storage Management – Tools to coordinate responsible storage and preservation strategies.
Creating and Manipulating Metadata – Enriching object descriptions and standardising records.
Emulation – Re-creating obsolete software environments to access old formats.
File Format ID and Validation – Defining and validating digital files.
Metadata Harvest and Exposure – Using OAI-PMH to share records across repositories.
Normalisation and Migration – Transferring digital materials into preservation-friendly formats.
Persistent ID Assignment – Creating unique identifiers for digital objects.
Repository Platforms – Enabling deposit, preservation, and access to digital content.
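To make the Metadata Harvest and Exposure category a little more concrete: an OAI-PMH harvest starts with an ordinary HTTP request carrying a `verb` and a few standard arguments. The sketch below only builds such a request URL and makes no network call. The repository address is a placeholder and the helper name `build_list_records_url` is invented for illustration; it is not taken from any tool in the catalogue.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL (no request is sent)."""
    # 'verb' and 'metadataPrefix' are required for an initial ListRecords request.
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    # 'set' and 'from' are optional arguments for selective harvesting.
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


# Hypothetical repository endpoint, used purely as an example.
url = build_list_records_url("https://repository.example.org/oai", from_date="2012-01-01")
print(url)
```

A real harvester would then fetch the URL, parse the XML response and follow any `resumptionToken` it contains until the list is complete.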

Sub-categories contain tables for quick comparison of tools against others that perform similar functions, linked to in-depth descriptions of how the resource can help.

This resource will evolve; if you have suggestions of tools to add, please send them to info@dcc.ac.uk.

Launch Workshop for DataFlow and ViDaaS

Marieke Guy — Mon, 05 Mar 2012 15:50:24 +0000

Mark Thorley, data management co-ordinator for NERC, set the tone for the day when he explained that “Data management is too important to leave to the data managers, it needs to be an important part of research”. The launch event for two new infrastructure projects funded by the Universities Modernisation Fund (UMF), held at the Saïd Business School, University of Oxford, on Friday 2nd March 2012, was all about embedding research data management (RDM) into workflows using shared services. The UMF programme aims to help universities and colleges deliver better efficiency and value for money through the development of shared services.

Data Management at Oxford

Paul Jeffreys, director of IT, University of Oxford, gave an introduction to current data management practice at the University. Currently, activities in Oxford are varied and rarely co-ordinated. Although an RDM portal (comprising a research skills toolkit, an RDM checklist and a University statement on research data management based on the University of Edinburgh’s ‘10 commandments’) and a training programme are in place, there are many people and areas that they are failing to reach. One area of concern is non-funded research (i.e. people for whom their research is their life’s work). It remains very tricky to build in generic support, and activities need to be flexible.

Introduction to DataFlow

DataFlow was introduced by David Shotton, the DataFlow PI. DataFlow is a collaborative project led by the University of Oxford. It is a two-tier data management infrastructure that allows users to manage and store research data. The project builds on a prototype developed in the JISC-funded ADMIRAL project.

The first tier, called DataStage, is a file store which can be accessed through private network drives or the web. Users can upload research data files and the service is backed up nightly. DataStage is likely to be used by single research groups, and deployment can be on a local server or on an institutional or commercial cloud. There is optional integration with Dropbox and other Web services.

The second tier is DataBank, which, through a web submission interface, allows users to select and package files for publication. Files are accompanied by simple metadata and an RDF manifest, which is then displayed as linked open data. They are packaged using the BagIt format. DataBank is a scalable data repository where data packages are published and released under a CCZero licence, though users can choose to keep data private or add an optional embargo period.
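For readers unfamiliar with BagIt, the idea is simple: payload files sit in a `data/` directory, alongside a `bagit.txt` declaration and a manifest of checksums that lets anyone verify the package later. The following is a minimal sketch of that layout using only the Python standard library. It is not DataBank's code; the function names `make_bag` and `verify_bag` are invented here, and the RDF manifest and metadata that DataBank adds are not modelled.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir, files):
    """Write a minimal BagIt-style bag; 'files' maps payload names to bytes."""
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for name, content in sorted(files.items()):
        (data / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # Each manifest line is: checksum, whitespace, path relative to the bag root.
        manifest_lines.append(f"{digest}  data/{name}")
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )
    (bag / "manifest-sha256.txt").write_text("\n".join(manifest_lines) + "\n", encoding="utf-8")
    return bag


def verify_bag(bag_dir):
    """Return True if every payload file still matches its manifest checksum."""
    bag = Path(bag_dir)
    manifest = (bag / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(maxsplit=1)
        if hashlib.sha256((bag / rel_path).read_bytes()).hexdigest() != digest:
            return False
    return True


# Example: package one small file in a temporary directory and check it.
with tempfile.TemporaryDirectory() as tmp:
    example = make_bag(Path(tmp) / "example_bag", {"results.csv": b"id,value\n1,42\n"})
    print(verify_bag(example))
```

The checksum manifest is what makes a bag useful for preservation: a repository receiving the package can confirm that nothing was corrupted in transit or storage.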

DataFlow is now at beta release v0.1. The DataFlow team are keen to build a user community and have lots of processes in place allowing users to comment on developments.

Introduction to ViDaaS

James Wilson, ViDaaS project manager, introduced us to ViDaaS. Virtual Infrastructure with Database as a Service (ViDaaS) comprises two separate elements. DaaS is a web-based system that enables researchers to quickly and intuitively build an online database from scratch, or import an existing database. The virtual infrastructure (VI) enables the DaaS to function within a cloud computing environment; it is known as the ORDS service (Online Research Database Service). It builds on ideas developed in the JISC-funded Sudamih project. The ViDaaS service currently has three business models:

£600 per year for a standard project (25 GB)
£2000 per year for a large project (100 GB)
Later option for public cloud for hosting

ViDaaS is officially launching this summer.

Further details on interoperability between DataFlow and ViDaaS are contained within the Data Management Rollout at Oxford (DaMaRO) Project.

Both services are seen as examples of ‘sheer curation’. This is an approach to digital curation where curation activities are quietly integrated into the normal workflow of those creating and managing data and other digital assets. http://en.wikipedia.org/wiki/Digital_curation#Sheer_curation

So Why Use these Services?

Many of the other speakers from the day attempted to convince us of why we should use these services. It seems that despite the efforts of many, including the DCC, data curation is often seen as a ‘fringe activity’. There are negligible rewards for creating metadata and there is a noticeable skills barrier in metadata: researchers have raw data, while institutions have repositories that are empty. The principle of ‘sheer curation’ is to allow tools to work with you rather than against you, with curation activities quietly integrated into the normal workflow of those creating and managing data and other digital assets. Both DataFlow and ViDaaS offer integration with simple workflows and immediate benefits.

Use of shared infrastructure services is supported by JISC. They offer potential cost savings, transferability and reuse of tools.

The key to getting people to use the services lies in getting buy-in from users and allowing flexibility. As user Chris Holland explained, “we are inherently creative people are going to do things in our own way”. There is a need to make services flexible and intuitive, as no system can be all things to all researchers.

What about the Cloud?

Peter Jones, Shared Infrastructure Services Manager at Oxford University Computing Services, began his session introducing the Oxford cloud infrastructure with a quote from Randy Heffner: “The trouble with creating a ‘cloud strategy’? You’re focusing on technology, not business benefit.” He explained that the main barriers to cloud adoption include understanding costs, reliability (network), portability (lock-in), control, performance and security. However, the biggest issue was inertia and reluctance to change. He concluded that a local private cloud overcomes a number of these issues and that the most likely approach is a public-private hybrid.

It is becoming apparent that the cloud exposes a cost that was previously hidden. However, research institutions need to stand by the data they create, and therefore the costs need to be acknowledged and paid. James Wilson, ViDaaS project manager, observed that this is how libraries work; however, it is not yet recognised in the research world, in which people are still trying to offload costs on to other people.

The afternoon breakout allowed more interaction and discussion around some of the highlighted issues, primarily cost, the cloud and national services.

Resources from the day are available on the DataFlow Website.

DCC Roadshow in Cardiff

Marieke Guy — Thu, 15 Dec 2011 11:58:40 +0000

Snow, sleet, hailstones, rain and sunshine! The Cardiff weather couldn’t make up its mind, but the Digital Curation Centre (DCC) roadshow carried on regardless. Although I have attended various days of the travelling roadshow (Bath and Cambridge) I’ve never actually managed to catch a day one. The opening day is an opportunity to hear an overview of the research data management landscape and is also the day on which local case studies make it onto the agenda, so I was looking forward to it.

Welcome: Janet Peters, Cardiff University

Janet Peters, Director of University Libraries and University Librarian for Cardiff University, opened the day by saying how keen she was to have the roadshow take place locally, feeling it to be very timely given current research data management (RDM) work in Cardiff. Janet explained that her attendance at the Bath roadshow had kick-started Cardiff’s work in this area. Cardiff have recently revitalised their digital preservation group and have been providing guidance and assisting departments with implementing changes to their RDM processes – more on this later. They have also recently rolled out an institutional repository, though it doesn’t cover data sets (at the moment).

The Changing Data Landscape: Liz Lyon, UKOLN


Liz set the scene for the day by outlining the current data landscape. She began by introducing the new BIS report entitled Innovation and Research Strategy for Growth, which expresses the government’s support for open data and introduces the Open Data Institute (ODI). Only last week David Cameron made the suggestion that “every NHS patient should be a ‘research patient’ with their medical details ‘opened up’ to private healthcare firms”. Openness and access to data are two of the biggest challenges of the moment and have stimulated much debate. Liz gave the controversial example of a tobacco company’s FOI request to the University of Stirling for information relating to a survey on the smoking habits of teenagers. She explained that a proposed FOI amendment will allow institutions to ask for exemption from FOI requests when research is ongoing. It’s often the case that researchers don’t want to share data, and there have been instances when governments have placed restrictions on data use (e.g. the Bring Your Genes to Cal project). Liz shared some more positive examples of research being shared, e.g. Alzheimer’s research, the 1000 Genomes Project, the Personal Genome Project and openSNP. She also offered some citizen science examples: BBC Nature, Project Noah http://www.projectnoah.org/, Galaxy Zoo, Patients Participate and BBC Lab. The Panton Principles are a recent set of guidelines that offer possible approaches: open knowledge, open data, open content and open service. To some degree the key to all of this is knowing about data licensing, and the DCC offer advice in this area.

Liz then moved on to what is often seen as the biggest challenge of all: the sheer volume of data now created, e.g. by the Large Hadron Collider. In the genomics area there are lots of shocking statistics on the growth of data and the implications of this. Another new report, the PHG Foundation’s Next steps in the sequence, gives the implications of this data deluge for the NHS. The book The Fourth Paradigm highlights data-intensive research as being the next step in research. The DCC are working with Microsoft Research Connections to create a community capability model for data-intensive research.

It is apparent that big data is being lost, but so is small data (like Excel spreadsheets), and part of the challenge is working out how scientists can deal with the long tail. What is framed as gold standard data is when you can fully replicate the code and the data; reproducible research is the second-best approach. Data storage needs to be scalable, cost-effective, secure, robust and resilient, with a low entry barrier and ease of use. Liz also asked us to consider the role of cloud services, giving DataFlow http://www.dataflow.ox.ac.uk/, ViDaaS, BRISSKit and lab notebook as four JISC projects to follow in this area.

Liz then talked a little about policy, giving research council examples. The most relevant are the fairly demanding EPSRC expectations, which have serious implications for HEIs: institutions must provide an RDM roadmap by 1st May 2012 and must be compliant with these expectations by 1st May 2015. At the University of Bath, where Liz is based, there is a new project called research360@Bath, which has a particular faculty-industry focus. There will also be a new data scientist role based at UKOLN. A full list of funders and their requirements is available from the DCC Web site.

Resources are available, and the Incremental project http://www.lib.cam.ac.uk/preservation/incremental/ back in 2010 found that many people felt that institutional policies were needed in the RDM area. Edinburgh have developed an aspirational data management policy. The DCC have pulled together exemplars of data policy information at http://www.dcc.ac.uk/resources/policy-and-legal/institutional-data-policies and ANDS also have a page on local policy.

It is also important to consider how you incentivise data management. There is quite a lot of current work on impact, data citation and DOIs. Some example projects are Total Impact http://total-impact.org/ and SageCite.

And what about the cost? Useful resources include the Charles Beagrie report on Keeping Research Data Safe http://www.beagrie.com/jisc.php and Neil Beagrie’s work on helping people articulate the benefits through use of a benefits framework tool.

In conclusion Liz asked delegates to think about the gaps in their institution.

Digital Data Management Pilot Projects: Sarah Philips, Cardiff University

Sarah explained how at Cardiff the University had retention requirements for quite a lot of corporate records and permanent records. They also have requirements to retain some of their research data for 5–30 years. In response to feedback on a digital preservation policy, the University has set up three pilot projects: one in the cultural area, one in the School of Biosciences using genomic data and one in the School of History and Archaeology. Work in the School of History and Archaeology is now coming to a close and this is the area Sarah would concentrate on.

Three projects within the department were used as a test bed. The South Romanian Archaeological Project (SRAP) at the University had collected excavation data and the team have been keen to make the data available. The Magura Past and Present Project had artists coming in and creating art; because the project was an engagement project it was required that the outputs be available, though not necessarily the data. The final project was on auditory archaeology. All three projects were run by Dr Steve Mills.

Records management audits were carried out through face-to-face interviews with staff using the DCC’s Data Asset Framework. Questions included: what records and data are held? How are the records and data managed and stored? What are the staff member’s requirements? A data asset register was created that dealt with lots of IP issues, ownership issues etc. Once this data was collected, potential risks were identified: Dr Mills had been storing data on whatever hard drives were available without a systematic approach, there was some metadata available but file structure was an issue, proprietary formats were used and there were no file naming procedures in place. Dr Mills was keen to make the data accessible, so the RDM team have been looking at depositing it with the Archaeology Data Service; if this solution isn’t feasible they will have to use an institutional solution.

High Performance Computing and the Challenges of Data-intensive Research: Martyn Guest, Cardiff University

Martyn started off by giving an introduction to Advanced Research Computing at Cardiff (ARCCA), which was established in 2008. Chemistry and physics have been the biggest users of high performance computing so far, but the data problem is relatively new and has really arisen since the explosion of data use by the medical and humanities schools.

He sees the challenges as being technical (quality performance, metadata, security, ownership, access, location and longevity), political (centralisation vs departmental, governance, ownership), personal, financial (sustainability), and legal and ethical (DP, FOI). Martyn showed us their data intensive supercomputer (‘Gordon’) and a lot of big numbers (for file sizes) were bandied about! Gordon runs large-memory applications (supernode): 512 cores, 2 TB of RAM and 9.6 TB of flash. It has been the case that NERC has spent a lot of time moving data, leaving less effort for analysing the data.

Martyn shared a couple of case studies. The first was Positron Emission Tomography (PET) imaging data, where the biggest issues were that the data was raw and that researchers weren’t interested in patient identifiable data (PID) but wanted the image, while clinicians wanted both PID and image. He also talked about sequencing data: sequencing is now relatively easy, and the hard bit is using biometrics on the data. As Martyn explained, it now costs more to analyse a genome than to sequence it, and the big issue is sharing that data. Martyn joked that the “best way to share data is by Fedex”, and many agreed that this may often be the case! The case studies showed that in HPC it’s often a computational problem. HPC Wales has three components to it, including awareness building around HPC and the creation of a Welsh network that can be accessed from anywhere and is globally distributed.

Martyn concluded that the main issues are around how to do the computing efficiently while the archiving issues continue to be secondary.

Research Data Storage at the University of Bristol: Caroline Gardiner, University of Bristol

Caroline Gardiner explained that at the University of Bristol her team had originally carried out a lot of high performance computing but were increasingly storing research data. She noted that the arts subjects are increasingly creating huge data sets.

Caroline admitted to collecting horror stories of lost data and using these as a way to leverage support. The Bristol solution has been BluePeta, a petascale facility created using £2m of funding. This facility is purely for research data at the moment, not learning and teaching data, though it is an expandable facility.

Caroline explained that their success in this area came from many directions. Bristol already had a management structure in place for HPC and for research data storage, and the team had access to the strategy people and those who held the purse strings. Bristol also have a research data storage and management board, and there continues to be buy-in from academics.

The process in place is that the data steward (usually the principal investigator, PI) applies and can register one or more projects. There is then academic peer review and storage policies are applied. There is a cost model in place: the data steward gets 5 TB free and then has to pay £400 per TB per annum for disk storage. They are encouraging PIs to factor in these costs when writing their research grant applications. The facility is more for data that needs to be stored over the long term than for active data.
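For a PI trying to factor these costs into a grant application, the arithmetic is straightforward. The sketch below assumes the figures reported in the talk (5 TB free, then £400 per TB per year) and further assumes that storage above the allowance is charged pro rata; whether Bristol actually bills fractions of a terabyte this way was not stated, and the function name is purely illustrative.

```python
FREE_ALLOWANCE_TB = 5            # free allocation per data steward, as reported
PRICE_PER_TB_PER_YEAR_GBP = 400  # charge for each TB above the allowance


def annual_storage_cost(total_tb):
    """Return the yearly disk storage charge in pounds for 'total_tb' of data."""
    if total_tb < 0:
        raise ValueError("storage size cannot be negative")
    # Only storage above the free allowance is chargeable (assumed pro rata).
    chargeable_tb = max(0, total_tb - FREE_ALLOWANCE_TB)
    return chargeable_tb * PRICE_PER_TB_PER_YEAR_GBP


for tb in (3, 5, 12):
    print(f"{tb} TB costs £{annual_storage_cost(tb)} per year")
```

On these assumptions a project holding 12 TB would pay for 7 TB, i.e. £2800 per year, so a three-year grant would need to budget £8400 for storage.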

Bristol are also exploring options for offsite storage and will also be looking at an annual asset holding review. They are also looking at preparing an EPSRC roadmap and starting to address wider issues of data management.

In answer to a question, Caroline explained that they had carried out a cost analysis against third-party solutions, but when using the big players (like Google and Amazon) the cost of moving the data was the issue. There was some discussion on peer-to-peer storage but delegates were concerned that it would kill the network.

Data.bris: Stephen Gray, University of Bristol

Following on from Caroline’s talk, Stephen Gray talked about what was happening on the ground through data.bris. Stephen explained that the drivers for the project were meeting funder requirements (not just EPSRC), meeting publisher requirements, using research data in the REF and increasing successful applications. Bristol have agreed a digital support role alongside the data.bris project, though this is all initially limited to the department of arts and humanities.

The team will initially be meeting with researchers and using the DMP Online tool to establish funder requirements and ethical, IPR and metadata issues. After the planning there will be the research application and then, hopefully, research funding. The projects will then have access to BluePeta storage. Curation is planned to happen at the end of the project, with high-value data identified for curation. Minimal metadata should be added at this stage, though there is a balancing act here between resourcing and how much metadata is added. Bristol have a PURE research management system and the data.bris repository, where they can check the data, carry out metadata extraction and assign DOIs. They will then promote and monitor data use.

In the future the team also want to look into the use of external data centres. A theme running through the project is ongoing training, guidance, advocacy and policy. Training will need to go to all staff, including IT support and academic staff, and they are hoping for some mandatory level of training.

Bristol are also planning on using the DCC’s CARDIO and DAF tools.

In the Q&A session delegates were interested in how Bristol had received so much top-down support for this work. It was explained that the Pro-VC for research was a scientist and understood the issues. While there was support for research data, it was felt that there could be more support for outputs.

Herding Cats – Research Publishing at Swansea University: Alexander Roberts, Swansea University

Alexander Roberts started off his presentation by saying that Swansea wants it all: all data, big data, notes scribbled on the back of fag packets, ideas, searchable and mineable data. Not only this, but Swansea would like it all in one place; currently they have a lot of departmental databases and various file formats in use. Swansea looked at a couple of different systems, including PURE, but wanted an in-house content management system; they also inherited a DSpace repository. They wanted this system to integrate with their TerminalFour Web CMS and with their DSpace system, Cronfa, and to provide RSS feeds for staff research profiles, Twitter feeds, Facebook updates etc. There was a consultation process that allowed lots of relationships to be formed and the end users to be involved. People were concerned that if they passed over their data they wouldn’t be able to get it back. A schema was created for the system. They started off using SharePoint and were clear that they wanted everything in a usable format for the REF. The end result was built from the ground up: a simple form-based research information system that allows researchers to add their outputs as easily as possible. It integrates with the HR database and features DOI resolving and MathML. The ingest formats are RSS, XML, Excel, Access and others. It provides an Open Data Protocol (OData) endpoint, which provides feeds to the Web CMS and personal RSS feeds.

Alexander ended by saying that in 2012 they would like to implement automatic updates to DSpace via SWORD and a searchable live directory of research outputs. They also want to have enhanced data visualisation tools for administrators. Mobile consideration is also a high priority as Swansea have a mobile-first policy.


Delivering an Integrated Research Management & Administration System: Simon Foster, University of Exeter

A Research Management and Administration System (RMAS) is more about managing data about projects but can also deal with research data. The Exeter project has been funded under the UMF, funded by HEFCE through JISC, and is part of HEFCE’s bigger vision of cloud computing and join up of systems. HE USB is being used: a test cloud environment from Eduserv. Simon Foster described how the project had started with a feasibility study which looked at whether there was demand for a cradle-to-grave RMAS; 29 higher education institutions expressed interest. The project was funded and it was worked out that 29 HEIs phased in over ten years could save £25 million. The single supplier approach was avoided after concerns that it could kill all others in the market. The steering group looked at the processes involved and these were fed into a user requirement document. It was necessary that the system was cloud enabled and compliant with CERIF data exchange. Current possible systems include Pure, Avida etc. Specific modules were suggested. The end result will be a framework that will allow institutions to put out a mini-tender for RMAS systems asking specific institution-related questions. Institutions should be able to do this in 4 weeks rather than 6 months.

The next steps for the project are proof of concept deliverables using CERIF data standards and use of externally hosted services. They also want to work with other services, such as OSS Watch.

There followed a panel session which included questions around the cost implications of carrying out this work. One suggestion was to consider the cost of failed bids due to lack of data management plans.

What can the DCC do for You?: Michael Day, UKOLN

Michael Day finished off the day with an overview of the DCC offerings and who they are aimed at (from researchers to librarians, from funders to IT services staff). He reiterated that part of RDM is bringing together different people from disparate areas and clarifying their role in the RDM process. The DCC tools include CARDIO, DAF, DMP Online, DRAMBORA. Some of the services include policy development, training, costing, workflow assessment etc. DCC resources are available from the DCC Website.

Conclusions

So after a day talking about data deluge while listening to a deluge of the more familiar sort (loud hail and rain) we were left with a lot to think about.

One interesting insight for me was that while the data deluge had come originally from certain science areas (astronomy, physics etc.), now more and more subjects (including arts and social sciences) are creating big data sets. One possible approach, advocated by a number of the day’s presenters, is to use HPC as a starting point from which to jump-start research data management. However, there will continue to be a lot of data ‘outside of the circle’. As ever, join up is very important. Getting all the stakeholders together is essential, and that is something the DCC roadshows do very well. All presentations from the day are available from the DCC Web site.

The next roadshow will take place from 7 – 8 February 2012 in Loughborough. It is free to attend.

Free Research DM Workshop, Cambridge, 9-11 Nov

Marieke Guy — Tue, 04 Oct 2011 10:52:25 +0000

Unfortunately the DCC Brighton Roadshow was cancelled, but the next roadshow isn’t far off and places are still available.

To give some background….

The UK Digital Curation Centre is running a series of free inter-linked regional workshops aimed at supporting institutional research data management planning and training. The DCC Roadshows are designed to allow every institution in the UK to prepare for effective research data management and understand more about how the DCC can help. The sixth DCC Roadshow is being organised in conjunction with Cambridge Library and will take place from 9th – 11th November 2011 in the Paston Brown Room at the Homerton Conference Centre, Cambridge.

The roadshow runs over three days but each workshop can be booked individually. Attendees are encouraged to select the workshops which address their own particular data management requirements. The workshops will provide advice and guidance tailored to a range of staff, including PVCs Research, University Librarians, Directors of IT/Computing Services, Repository Managers, Research Support Services and practising researchers.

Day one is an introductory day aimed at researchers, data curators, library staff etc. It provides an introduction to the DCC and its role in supporting research data management. Day two is a more interactive day aimed at senior managers, research PVCs/Directors, directors of Information Services etc. and looks at strategy/policy implementation. Day three is a hands-on day and consists of the Digital Curation 101 – How to manage research data: tips and tools workshop.

To find out more about the workshops take a look at the DCC Cambridge Roadshow page. Registration for the workshop is free but places are limited.

If you can’t decide if the roadshow is for you, Steve Walsh from the Interoperable Geospatial Data for Biosphere Study (IGIBS) Project, Aberystwyth University, has written a review of the most recent workshop held in Oxford.

Details of further roadshows will be announced soon on the DCC Web site.