Mark Thorley, data management co-ordinator for NERC, set the tone for the day when he explained that “Data management is too important to leave to the data managers, it needs to be an important part of research”. The launch event for two new UMF-funded infrastructure projects, held at the Saïd Business School, University of Oxford, on Friday 2nd March 2012, was all about embedding research data management (RDM) into research workflows using shared services. The UMF programme aims to help universities and colleges deliver better efficiency and value for money through the development of shared services.
Data Management at Oxford
Paul Jeffreys, director of IT, University of Oxford, gave an introduction to current data management practice at the University of Oxford. Currently, activities in Oxford are varied and rarely co-ordinated. Although there is an RDM portal comprising a research skills toolkit, an RDM checklist, a University statement on research data management (based on the University of Edinburgh’s ‘10 commandments’) and a training programme, there are still many people and areas the University is failing to reach. One area of concern is non-funded research (i.e. people for whom their research is their life’s work). It remains very tricky to build in generic support, and activities need to be flexible.
Introduction to DataFlow
DataFlow was introduced by David Shotton, the DataFlow PI. DataFlow is a collaborative project led by the University of Oxford. It is a two-tier data management infrastructure that allows users to manage and store research data. The project builds on a prototype developed in the JISC-funded ADMIRAL project.
The first tier, called DataStage, is a file store which can be accessed through private network drives or the web. Users can upload research data files, and the service is backed up nightly. DataStage is likely to be used by single research groups, and it can be deployed on a local server or on an institutional or commercial cloud. There is optional integration with Dropbox and other web services.
The second tier is DataBank, which, through a web submission interface, allows users to select and package files for publication. Files are accompanied by simple metadata and an RDF manifest, which is then displayed as linked open data. They are packaged using the BagIt format. DataBank is a scalable data repository where data packages are published and released under a CCZero licence, though users can choose to keep data private or add an optional embargo period.
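To make the packaging step above more concrete, here is a minimal sketch of what a BagIt package looks like on disk. It is not DataBank's code and does not attempt to reproduce its RDF manifest or metadata. It assumes the layout from the current BagIt specification (RFC 8493, version 1.0, with a SHA-256 manifest), which may differ from the BagIt version DataBank actually used, and the function names `make_bag` and `verify_bag` are invented for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def make_bag(source_files, bag_dir):
    """Copy files into bag_dir/data and write the two required BagIt tag files.

    Hypothetical helper for illustration only. Files are stored flat under
    data/, so two source files with the same name would collide.
    """
    bag = Path(bag_dir)
    payload = bag / "data"
    # Fail rather than silently overwrite an existing bag.
    payload.mkdir(parents=True, exist_ok=False)

    manifest_lines = []
    for src in map(Path, source_files):
        dest = payload / src.name
        shutil.copy2(src, dest)
        digest = hashlib.sha256(dest.read_bytes()).hexdigest()
        # Manifest line format: "<checksum>  <path relative to the bag root>"
        manifest_lines.append(f"{digest}  data/{src.name}")

    # bagit.txt declares the spec version and the encoding of the tag files.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag


def verify_bag(bag_dir):
    """Return True if every payload file still matches its manifest checksum."""
    bag = Path(bag_dir)
    manifest = (bag / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag / rel_path).read_bytes()).hexdigest()
        if actual != expected:
            return False
    return True
```

The point of the format is that a package carries its own checksums, so whoever receives it can confirm that nothing was lost or altered in transit. A repository would typically add further tag files (descriptive metadata, licence information) alongside the manifest.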
DataFlow is now at beta release v0.1. The DataFlow team are keen to build a user community and have lots of processes in place allowing users to comment on developments.
Introduction to ViDaaS
James Wilson, ViDaaS project manager, introduced us to ViDaaS. Virtual Infrastructure with Database as a Service (ViDaaS) comprises two separate elements. DaaS is a web-based system that enables researchers to quickly and intuitively build an online database from scratch, or import an existing database. The virtual infrastructure (VI) is an infrastructure which enables the DaaS to function within a cloud computing environment; it is known as the ORDS service (Online Research Database Service). It builds on ideas developed in the JISC-funded Sudamih project. The ViDaaS service currently has three business models:
- £600 per year for a standard project (25 GB)
- £2000 per year for a large project (100 GB)
- A later option for public cloud hosting
ViDaaS is officially launching this summer.
Further details on interoperability between DataFlow and ViDaaS are available from the Data Management Rollout at Oxford (DaMaRO) project.
Both services are seen as examples of ‘sheer curation’. This is an approach to digital curation where curation activities are quietly integrated into the normal workflow of those creating and managing data and other digital assets. http://en.wikipedia.org/wiki/Digital_curation#Sheer_curation
So Why Use these Services?
Many of the day’s other speakers tried to convince us why we should use these services. It seems that, despite the efforts of many, including the DCC, data curation is often seen as a ‘fringe activity’. There are negligible rewards for creating metadata and there is a noticeable skills barrier: researchers have raw data, while institutions have repositories that are empty. The principle of ‘sheer curation’ is to allow tools to work with you rather than against you. Both DataFlow and ViDaaS offer integration with simple workflows and immediate benefits.
Use of shared infrastructure services is supported by JISC. They offer potential cost savings, transferability and reuse of tools.
The key to getting people to use the services lies in getting buy-in from users and allowing flexibility. As user Chris Holland explained, “we are inherently creative people [and] are going to do things in our own way”. Services need to be flexible and intuitive, as no system can be all things to all researchers.
What about the Cloud?
Peter Jones, Shared Infrastructure Services Manager at Oxford University Computing Services, began his session introducing the Oxford cloud infrastructure with a quote from Randy Heffner: “The trouble with creating a ‘cloud strategy’? You’re focusing on technology, not business benefit.” He explained that the main barriers to cloud adoption include understanding costs, reliability (network), portability (lock-in), control, performance and security. However, the biggest issue was inertia and reluctance to change. He concluded that a local private cloud overcomes a number of these issues and that the most likely approach is a public-private hybrid.
It is becoming apparent that the cloud exposes a cost that was previously hidden. However, research institutions need to stand by the data they create, so the costs need to be acknowledged and paid. James Wilson, ViDaaS project manager, observed that this is how libraries work, but that it is not yet recognised in the research world, in which people are still trying to offload costs on to other people.
The afternoon breakout allowed more interaction and discussion around some of the highlighted issues, primarily cost, the cloud and national services.
Resources from the day are available on the DataFlow Website.