The first internal face-to-face meeting for the SageCite project took place on Thursday 9th September, hosted by our project partners the British Library, representing DataCite.
One notable outcome of the meeting was the decision to organise a trip to Seattle to meet the producers of the data. The visit will be hosted by Sage, another of the partners in SageCite. Sage will facilitate access to a number of laboratories that are participating in the Sage federation, with the intention of establishing data sharing practices.
The purpose of this visit, from the point of view of SageCite, is to learn how predictive disease models are built and to investigate requirements for citing the datasets used during model construction. It was agreed that we need a better understanding of the data, the models and how they are constructed, at the point the models are being generated. The best way to achieve this understanding is by talking to the data producers and modellers who are familiar with their standards, tools and workflows.
On August 12 2010, the New York Times published an article titled “Sharing of Data Leads to Progress on Alzheimer’s”:
In 2003, a group of scientists and executives from the National Institutes of Health, the Food and Drug Administration, the drug and medical-imaging industries, universities and nonprofit groups joined in a project that experts say had no precedent: a collaborative effort to find the biological markers that show the progression of Alzheimer’s disease in the human brain.
The blog post “Sharing Data Leads to Progress” was written from the perspective of someone who lost a mother to Alzheimer’s. Following the Sage Congress in April 2010, the power of patient advocacy groups in providing momentum for data sharing was recognised.
The challenge of making data citation part of the everyday practice of scientific endeavour and accreditation must address both the evaluation and design of technical solutions and social and policy change. Whilst technical solutions can demonstrate the viability of citation and accreditation systems and can be used to influence change, they must in turn reflect the policy requirements of funders and of institutions managing career progression.
The Sudamih researcher requirements report was published in July 2010. Sudamih is a JISC-funded project in the Research Data Management Infrastructure strand, and this report is about researcher requirements in the Humanities, essentially “Understanding how researchers in the humanities think about data, how they conceptualize, gather, store, use, and generally look after it” [P. 7].
The focus of the Sudamih report was on researcher requirements for a ‘Database as a Service (DaaS)’ and for data management training – both being tangentially related to SageCite. SageCite will be documenting the requirements for citation needs in science areas, and this report could potentially offer insight into requirements capture processes with researchers; I was also interested in whether any citation needs had featured. Some of their findings, for example “Good information management is time consuming, and academics often find themselves with insufficient time to keep on top of it.” [P. 5], could also be expected to hold across disciplines. Any potential citation service must strike a balance between the gains that could be made from making data citable and the information management burden required to make citation possible. One lesson that could carry over to the citation framework requirements is that the information management barrier must be kept as low as possible.
On the use of data as a source of reputation building, the Sudamih report noted “While the collection and organization of data is an essential part of many research projects, it seems generally agreed that data resources contribute less to the academic standing and reputation of the creator than conventional publications. In particular, pure data outputs were not recognized as being of similar value under the old RAE system. The compilation of substantial data resources is time consuming, and consequently may slow the rate at which researchers are able to publish books and articles. Scholars who choose to invest significant time in data projects may therefore find that they are providing a service for others at the expense of their own research career” [P.16]
The report has sections on data re-use and data dissemination. On data re-use, the report explains how each of the following considerations restricts the practice of data sharing, despite researchers being happy in principle to share their data [P. 17-19]:
* Data may be messy (e.g. containing personal notes)
* Data employs personal, idiosyncratic standards
* Data is partial and specific: partial data may limit its usefulness or, even worse, lead to misunderstanding
* The existence of the data is not widely known
* The data needs to be milked for publications first
* Political issues may make publication unwise or difficult
All of the above factors could be relevant in determining which data can be considered ‘citable’. Would it be right to suggest that data must be in a ‘shareable’ state as a prerequisite to being ‘citable’?
The Sudamih requirements process also documented interest (as well as some hesitation) in using the DaaS as a means of dissemination [P. 27]. It was seen as an opportunity to showcase what data is being collected, and a strong sense came through in the responses of using the DaaS to connect people based on common interests and to create connections between researchers. Overall, the idea of using the DaaS as a basis for making the data citable was not reflected in this section of the report.
However, one of the requirements documented for the DaaS is “Records may be linked to external sources” [P. 6]. I didn’t manage to spot anything further on this; I suspect the idea behind it was to link records to outside storage systems rather than to have them linked from a citing resource. I wonder to what extent this reflects that the investigation did not specifically set out to explore the citation potential of the DaaS (so such questions were not probed by the investigators), and to what extent it simply reflects that citation of their data is not at the forefront of the concerns of the researchers interviewed. (I was rather skim-reading towards the end where the training aspect was covered; apologies to the Sudamih team if I missed something – we would be delighted to hear back if you have further insight to share on this topic.)
One little nugget specifically about the persistence and identification of cited data is revealed in one quote [P. 28] (in this case, about the management of the researcher’s own data, which had been cited): “‘I’ve had problems finding a stable home for my data: it was on the college website, but got lost in a site revamp.’ The data has now been moved, meaning that the URL published in an article which drew on the data is now incorrect.”
Overall, although the Sudamih requirements report does not appear to have addressed citation needs as an aspect of data management (and it wasn’t intended to, so that is in no way a criticism), what it does offer are insights into data management and requirements issues for researchers in the humanities, which may be helpful to SageCite when discussing citation needs with Sage users. After all, data management and data sharing are inextricably linked with data citation.
NYTimes Story on Open Review in Academia Sparks Conversations Across Campuses – this blog post follows on the heels of an article in the New York Times on 24th August 2010 on “crowd-sourcing” of peer review.
The impact of the availability and citation of datasets on peer review, and the role of data citation on new mechanisms for attribution and career credit, are both areas of interest in SageCite.
via @atreloar
Editorial: http://www.jneurosci.org/cgi/content/full/30/32/10599
Blog commentary: http://scholarlykitchen.sspnet.org/2010/08/16/ending-the-supplemental-data-arms-race/
Supplementary material will no longer be reviewed or hosted by the journal. The increase in the volume of supplementary material is seen to have “begun to undermine the peer review process in important ways”. The announcement explains how they think this has happened.
The journal will allow authors to link to external sources hosting their supplemental data; with regard to the persistence of that data, the editor-in-chief recognises that there may be issues with the stewardship of externally managed data, but states “However, supplemental material is inherently inessential, and we expect that authors will maintain their sites for as long as they consider their supplemental material to be valuable and important.”
An image illustrates the growth in volume of article and supplementary material.
JISC has announced the six new projects (including SageCite) that have just been funded through the Citing, Linking and Integrating Research Data strand of the JISC Managing Research Data Programme.
JISC announcement
SageCite will develop and test a Citation Framework linking data, methods and publications. The domain of bio-informatics provides a case study, and the project builds on existing infrastructure and tools: myExperiment and the Sage Commons. Citations of complex network models of disease and associated data will be embedded in leading publications, exploring issues around the citation of data including the compound nature of datasets, description standards and identifiers. The project has international links with the Concept Web Alliance and Bio2RDF. The partners are UKOLN, the University of Manchester and the British Library (representing DataCite), with contributions from Nature Genetics and PLOS. The project is funded by JISC through the Managing Research Data programme.