This week, members of the SageCite Project will be attending the Sage Bionetworks Commons Congress, which will be held on 15–16 April 2011 at the University of California, San Francisco. The Congress is the annual event organised by Sage Bionetworks, collaborators on the SageCite Project who represent our application domain of disease network modelling. The SageCite Project developed from the participation of the project partners in the 2010 event, and this year the project will be back to inform the community of the progress achieved and to be inspired once again by the vision of Sage Bionetworks and the other participants at the Congress. Note also that this event will be streamed, or you can follow it on the Twitter hashtag #sagecon.
At the beginning of last week, SageCite participated in the JISC MRD International Workshop held in Birmingham. The SageCite presentation focussed on the application domain and on different challenges around citation: citing others, making data citable, and being cited.
We also ran one of the unconference sessions looking at the use of metadata in various projects and thinking about the use of the Research Objects framework.
This blog post comes about as a result of jotting down some notes while thinking through scenarios of how citation happens, in order to help with the process of identifying requirements for data citation. One particular need is to think about how supporting and doing data citation fits into researchers' workflows, and what the implications are for the infrastructure and services that are required for citation. What I have written here is very much presented as work in progress; for example, I have written down a minimal definition of citation simply for scoping purposes. It occurs to me that others may have written similar scenarios to help them think about data citation. I would love to hear from anyone who has described citation scenarios and how these fit in with workflows or infrastructure, and I am sharing the thinking below at a very early stage as I do not wish to re-invent the wheel if there is other work that we can draw on and be informed by.
I am aware of Gudmundur 'Mummi' Thorisson's slides at Science Online 2010, which give an outline of a scenario of the use of researcher identifiers in association with data sharing, and SageCite contributed a use case to the W3C incubator group on libraries and linked data. The latter follows the template suggested by the group and focuses on the potential link with library data and the use of linked data, to make it relevant to the use case call.
I am sure there must be other scenarios out there – please do let me know about them. In the meantime, here are my thoughts and jottings, which I wrote to unpack the factors around data citation, and which came about following a number of weekly telecons with the other SageCite participants in which we had various discussions around data citation. Any other views and comments on what I have written below are also very welcome, of course.
Scenarios around citation
Definition
Citation*: a reference to data for the purposes of
- attributing credit
- facilitating access to the data
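To make this minimal definition concrete, here is a sketch of a machine-readable citation record that serves both purposes: a creator field for attributing credit, and an identifier and location for facilitating access. The field names and values are my own illustration, not a proposed standard.

```python
# Hypothetical minimal citation record; field names are illustrative only.
citation = {
    "creator": "A. Researcher",                    # attribution of credit
    "title": "Disease network model, version 1",
    "identifier": "doi:10.1234/example",           # illustrative identifier
    "location": "https://repository.example.org/datasets/42",  # access
    "date": "2011-04-01",
}

def format_citation(record):
    """Render the record as a human-readable reference string."""
    year = record["date"][:4]
    return f"{record['creator']} ({year}). {record['title']}. {record['identifier']}"

print(format_citation(citation))
```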
Scenario 1 Publication establishes priority
A data creator [1] submits an article to a journal. The article references data that has been created during analysis (using a workflow).[2] This is the first time that the data is shared with the community [3] and establishes priority.
[1] instead of an individual data creator this could be a research team that includes data creators and analysers (and others), or a curator on behalf of the team
[2] other data (not created by the team) could also be referenced in the publication
[3] the data is shared in the sense that the analysis and creation process is described to the community together with other scientific conclusions. At this point the data may or may not be made available.[4]
[4] ‘data’ here is simplistic and could be unpacked further – different stages of analysis e.g. intermediary stages could be shared or not, tools could be referenced, programming scripts could be shared too.
Scenario 2 Managing data to support its citation
A data curator [1] submits some data [2] [6] produced during an analysis to a repository [3]. The data becomes shareable, i.e. it is made available to others [4]. The researcher or team then references this data (e.g. as in scenario 1), or references it in an email to a collaborator, or as part of pre-publication to give access to a referee [5], or mentions it in a community mailing list or a blog.
[1] the data curator is intended as a role to represent either an individual researcher, or different members of a team of researchers, or somebody appointed on behalf of such researchers, all of whom may be the ones to submit the data.
[2] again ‘data’ here is used simplistically and can be unpacked further – see [4] in scenario 1.
[3] The repository can be managed by the researcher, e.g. based at his or her institution, or it can be managed by external parties.
[4] Prior to sharing the data may have been kept internally, e.g. kept in a repository with access restrictions which are lifted on sharing, or kept in a closed internally-managed repository and then moved to a public repository.
[5] Is this last one a 'citation'? Does citation imply some longevity? E.g. if data is shared for a time-limited purpose, such as access for the referee, then the reference is going to be short-lived – the reference given to the referee achieves sharing but is not within the scope of citation. If the data goes on to be published, it may not matter if the reference then changes when the data becomes more public; i.e. the short-term reference does not necessarily need to have a link to the long-term reference, so the short-term reference can be excluded from the citation analysis.
[6] The data may in turn be derived from other data, which may or may not be managed alongside it – e.g. the original may have come from a public repository or have been shared privately between two parties. Access restrictions or privacy considerations may apply to it. A reference to the original data may be required, but this can be considered as a sub-case of scenarios 3 and 4.
Scenario 3. Referencing someone else’s data (in publication)
A researcher applies a methodology described in an article to an alternative set of data – the process is very useful and generates new results in a complementary area, e.g. a method used in genetics is then used by systems biologists. The researcher writes an article and credits the prior work by (a) referencing the journal article that originally described the process and (b) referencing the scripts and workflows that were re-used [1].
[1] assuming some of these had been made available for re-use.
Scenario 4 Referencing someone else’s data (re-analysing data).
A scientist retrieves some data and the associated workflow from a repository [1]. The scientist re-analyses the data, makes some modifications to the process and arrives at some alternative conclusions. The scientist makes the new data and modified process public [2]. The scientist references the original data [3] that she used [4].
[1] or just some data from a public repository
[2] this may be simply sharing the new data and process in a repository or writing a journal article about it.
[3] and the scripts/workflows
[4] again ‘references’ here is simplistic and subsumes some other details, which are the subject of the project’s research.
Scenario 5 – something about doing citation metrics.
(but perhaps this is another section, not how and when citation happens, but what could happen to it afterwards).
Notes: scenarios 1 and 2 can be considered self-citation; scenarios 3 and 4 are citation by others.
*I am using this minimal definition of citation to start from somewhere.
I was invited to speak at the JISC Repository Support Programme winter school, held from 9–11 February at the delightful and dramatic Armathwaite Hall near Keswick in the Lake District. Very Emily Brontë. Below is my summary of this useful and enjoyable meeting.
RSP: JISC-funded support for institutional repositories
Summary of meeting
Main activities focussed on…
- Identifying and discussing the continuing and emerging roles of repositories in UK HEIs.
- Collecting and providing the intellectual capital of the HEI as a function of publications.
- Showcasing the HEI.
- Maintaining the repository, with preservation plans etc.
- Using the repository to inform research assessment – historically the RAE and presently the REF.
- Confronting the risk of mixed concerns as traditional publication-based repositories are mixed with other digital content (this was extended a little to include data-centre-type activities).
- Discussing the particular software solutions used to manage repositories.
- Preparing for the REF
Data as legitimate content
Many of the repository managers I spoke with had considered data as a legitimate content type, but few had clear plans in place to address the issue and even fewer were actively administering data; of those that were, the content was administrative and supplementary data rather than raw research data.
Martin Hall, vice-chancellor of the University of Salford, spoke of the opportunity for institutional repositories in higher education. Martin saw a profound potential for institutional repositories to influence HEI policy, as an extension of the library function that has traditionally been at the heart of any good university. There were challenges, but these could be confronted with political will. For this political will to gain momentum, repositories needed to demonstrate benefit, and he believed this was happening. He called his model the 'Open Access University'.
Keith Jeffery spoke of how the STFC, and the international initiatives he is involved in, have addressed the issue of data – mainly so-called 'big science'. His vision for an achievable reconnection of scholarly communication was of distinct but highly connected entities with clear roles in either data preservation, persistence and access, or in publication repositories (including institutional repositories). The rationale Keith provided was the different requirements of a publication repository as compared to a data centre. He thinks they should exist as independent structures within the HEI landscape (not necessarily within the same HEI – consider, for example, Research Council funded data centres).
This works well in the STFC, ESRC and NERC, where large, valuable data centres exist for all manner of data generated from the research they fund. It also occurs to a lesser extent in the biomedical sciences, e.g. at EMBL-EBI and the Sanger Institute. However, it occurs less in long-tail biological research such as ecology and evolution.
I introduced the British Library Datasets Programme and the project-based activity there. Special note was given to DataCite as a mechanism to promote research data as a first-class citizen. I talked of the challenges with persistent identification and showcased the data citation services being developed by DataCite. Many of the audience were very familiar with DOIs but were interested to understand how they were being implemented for data.
I described two specific projects that the Datasets Programme is involved with, and how data citation and repository models are being investigated for sustainability and utility: SageCite and DryadUK.
While many of the delegates were not yet planning to use the data citation services offered by DataCite, there were questions regarding the increased burden of financial commitment and metadata submission. Financial overheads were discussed briefly, but a more important issue was what one assigns a DOI to – a common and compelling question about the problems of implementing any citation framework for data.
I explained that the DataCite requirements are minimal, designed to balance open 'declarations' of data existence with a facility to support minimal metadata collection. Thus DataCite requires:
- An open landing page
- Responsibility for data assets and support from a stable organisation, e.g. their institution or a research council
- A commitment to keeping records in DataCite up to date.
Any further requirements we intend to confront as a partnership, e.g. finding solutions to specific data problems, developing valuable services specific to needs, and working with members to promote the aims of DataCite, i.e. to make data a first-class citizen.
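To make the 'minimal metadata collection' point concrete, here is a sketch of the kind of record that might accompany a DOI. The field names approximate the mandatory properties of the DataCite metadata scheme (identifier, creator, title, publisher, publication year); the values, the landing-page field and the plain-dict representation are illustrative, not an official serialisation.

```python
# Illustrative sketch of minimal DataCite-style metadata for a dataset.
record = {
    "identifier": "10.1234/example-dataset",   # the assigned DOI (made up)
    "creator": "A. Researcher",
    "title": "Example gene-expression dataset",
    "publisher": "Example Institutional Repository",
    "publicationYear": "2011",
    # the open landing page the DOI should resolve to:
    "landingPage": "https://repository.example.org/datasets/42",
}
```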
Day two was concerned with implementations, how the repositories are getting ready for the REF and a benefits exercise.
The most relevant part of this for data citation was the challenge of linking institutional benefit to the institutional repository: e.g. repositories may act to attract faculty and students, and increase the visibility of existing faculty and students, but measuring a causal link is almost impossible. Several suggestions were put forward, including intervention in applications and interviews, and follow-up after students and faculty move on, but each of these presented a particular problem that many repositories seemed unable to resource. Most people agreed that political will and mandates seemed the best route. This author is not sure. Data citation may help this cause, but a few metadata elements need to change:
- A temporal element linking creator to institute
- An institutional value or identity
- Creator/author disambiguation
- Integration with traditional repositories
A very useful meeting and a great opportunity to meet with people involved at the intersection of researchers, publications and institutional libraries.
Link to the presentation on Slideshare http://www.slideshare.net/DatasetsBL/british-library-datasets-programme-feb-2011 and on the RSP event website http://www.rsp.ac.uk/documents/get-uploaded-file/?file=BL_Datastes_JMW_JISC_20110210.ppt
As the half-way mark of the project has passed, I thought I would add an update on some of the activities within SageCite.
We have held our mid-way meeting in order to plan the work for the second half of SageCite. With only six months remaining, we wanted to have a clear idea of what the project will be delivering, now that we have had time to explore some of the issues and become more familiar with the domain data and workflows from Sage Bionetworks. It was also a good opportunity to review progress.
At the end of January, I attended the fantastic Beyond The PDF workshop in San Diego, which was an amazing opportunity to meet people with lots to say about the citation of data. I hope to write a blog post about the citation-related discussion in Beyond The PDF, which extended from the pre-meeting discussions, cropped up in various forms during the meeting, and continues in the very active forum even though the meeting is over.
We have set up a Knowledge Blog instance for SageCite, which will host the reviews being published as a result of our desk-research in Work Package 1 (Framework Foundations).
Work on the demonstrator is progressing well, and we plan to have demos at the upcoming JISC International Workshop at the end of March, and at the Sage Congress in April.
We are also exploring the possibility of running some sessions on ontologies for scientific metadata at the JISC workshop, tying in with the collaboration with the other MRDOnto projects in the JISC MRD strand. Contact us if you are interested in being involved.
Max Wilkinson has used SageCite as an exemplar to illustrate his talk at the JISC Winter School.
We have ongoing discussions with ORCID regarding the possibility of being one of the first projects to test any new services coming out of this initiative.
Work Package 4 involves a Benefits Evaluation and we are starting the background work needed to apply the KRDS approach to carrying out the evaluation. On a related note, SageCite will be one of the case studies in the recently-started KRDS-I2S2 Digital Preservation Benefit Analysis Tools Project.
We are monitoring other upcoming events relevant to data citation, such as the Beyond Impact event being held in the UK (see beyond-impact@googlegroups.com) and planning attendance.
Wearing another hat, this week I will be found milling around other developers at Dev8D. If you plan to be there and have an interest in citation of data, let’s try to meet up and have a chat.
Yesterday I mentioned that I had started to get stuck into the review of ontologies by tackling OPM – the Open Provenance Model. Here I have tried to summarise OPM in 20 bullet points:
- OPM emerged as a consensus from community participants with activity starting back in 2006.
- The Provenance Challenge activities led to substantial agreement on a core representation of provenance.
- The OPM specification v1 was released in 2007; an open-source model was adopted for governance of OPM; version 1.1 of OPM was presented in 2009.
- OPM is designed to meet the following requirements:
- To allow provenance information to be exchanged between systems, by means of a compatibility layer based on a shared provenance model.
- To allow developers to build and share tools that operate on such a provenance model.
- To define provenance in a precise, technology-agnostic model.
- To support a digital representation of provenance for any “thing”, whether produced by computer systems or not.
- To allow multiple levels of description to co-exist.
- To define a core set of rules that identify the valid inferences that can be made on provenance.
- OPM consists of a directed graph expressing what caused things to be, i.e. how things depended on others and resulted in specific states. Provenance graphs are aimed at representing causality graphs explaining how processes and artifacts came to be. A graphical notation for provenance graphs is suggested.
- OPM is based on three kinds of nodes in the graph, defined as:
- Artifact: an immutable piece of state, which may have a physical embodiment in a physical object, or a digital representation in a computer system.
- Process: an action or series of actions performed on or caused by artifacts, and resulting in new artifacts.
- Agent: a contextual entity acting as a catalyst of a process, enabling, facilitating, controlling or affecting its execution.
- Causal dependencies between artifacts, processes and agents are captured in the graph. The edges denote one of the following categories of dependency (the nodes are as defined in the previous bullet):
- used(R)
- wasGeneratedBy(R)
- wasControlledBy(R)
- wasTriggeredBy
- wasDerivedFrom
- R can be used to denote a role which is meaningful in the context of the application, and aims to distinguish the nature of the dependency when multiple such edges are connected to the same process; e.g. a process may use several files, reading parameters from one (R=parameters) and reading data from another (R=data). Communities need to define their own roles in OPM profiles, and roles should always be specified.
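As a concrete illustration, here is a minimal sketch of an OPM-style provenance graph in Python. The node kinds and edge names follow the OPM specification as summarised above; the data structures and the example workflow (two input files, a 'normalise' process, a controlling agent) are my own invention for illustration.

```python
# Nodes of the three OPM kinds (example values are invented).
artifacts = {"raw.csv", "params.cfg", "normalised.csv"}
processes = {"normalise"}
agents = {"alice"}

# Edges as (dependency, source, destination, role); the role R distinguishes
# multiple edges of the same kind attached to the same process.
edges = [
    ("used",            "normalise",      "raw.csv",    "data"),
    ("used",            "normalise",      "params.cfg", "parameters"),
    ("wasGeneratedBy",  "normalised.csv", "normalise",  "output"),
    ("wasControlledBy", "normalise",      "alice",      "operator"),
    ("wasDerivedFrom",  "normalised.csv", "raw.csv",    None),
]

# List everything the 'normalise' process used, with its role:
for dependency, source, destination, role in edges:
    if dependency == "used" and source == "normalise":
        print(f"normalise used {destination} (role: {role})")
```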
- OPM adopts a weak notion of causal dependence and defines the dependencies accordingly, but recognises that subclasses that capture stronger notions of causality may be needed in specific systems (e.g. a strong interpretation of the used edge requires the artifact to be available for the process to start).
- OPM recognises a need for detail at different levels of abstraction or of a different viewpoint of processes – giving rise to different accounts of the same execution, and describes how this can be achieved.
- An account represents a description at some level of detail as provided by one or more observers. The concept of account allows multiple descriptions to co-exist.
- OPM allows the addition of time information to processes. Time is optional. The model specifies constraints that time information must satisfy with respect to causal dependencies.
- OPM expects that reasoning algorithms may be used over provenance models and describes completion rules and multistep inferences to show how causal dependencies can be summarised by transitive closure.
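To illustrate the transitive-closure idea, here is a small sketch; the wasDerivedFrom pairs are invented, and real OPM inference is subtler (it is defined per account and distinguishes one-step from multi-step edges), so this only shows the basic mechanism.

```python
# Invented wasDerivedFrom edges: model <- normalised data <- raw data.
derived_from = {
    ("model.rdata", "normalised.csv"),
    ("normalised.csv", "raw.csv"),
}

def transitive_closure(pairs):
    """Repeatedly add (a, d) whenever (a, b) and (b, d) are both present."""
    closure = set(pairs)
    while True:
        extra = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if extra <= closure:
            return closure
        closure |= extra

# Also infers the indirect dependency ('model.rdata', 'raw.csv').
print(transitive_closure(derived_from))
```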
- OPM provides a formal definition of what constitutes a legal graph and defines rules for OPM, e.g. the requirement for identifiers for accounts, artifacts, processes and agents, and what is optional and what is mandatory.
- An annotation framework allows extra information to be added to OPM entities to allow meaningful exchange in specific communities. Some properties that are expected to be commonly used (such as Label) are defined.
- OPM profiles are intended to define a specialisation of OPM, and capture best practice and usage guidelines developed by communities. Profiles must remain compliant with the semantics of OPM.
- Attribution can be attached as an annotation but work is currently in progress to deal with these concepts.
- The Open Provenance Model Vocabulary was released in December 2010 and is designed as a lightweight provenance vocabulary implementing OPM.
- There are a number of alternative provenance vocabularies; these have been mapped to OPM in an analysis by the W3C Provenance Incubator Group task force.
- http://openprovenance.org/ is the main page for OPM activity and contains links to the specifications, a tutorial, tools that implement OPM, the OPM wiki and other useful pointers.
Despite being aware of OPM over the years this has been my first real close look at its description, so I hope that I have done it justice in the above 20 bullet points. I invite others with more knowledge to make comments and additions or corrections, and I look forward to working collaboratively with colleagues on JISC MRD projects to evaluate the various standards that we are interested in.
Towards the end of October I attended a really useful meeting organised by JISC, where a number of common interests with other JISC-funded Managing Research Data projects were identified. In my breakout group, we focused on ontologies and identifiers. We agreed to share our findings and pool resources in order to avoid duplication of effort. This seemed like a great idea to me, as one of SageCite's deliverables is a review of standards and technologies for citation, amongst which ontologies and identifiers do feature.
Immediately after the event a space was found on the JISC wiki to help collaboration, and other members of the group did a good job of getting the ball rolling and seeding the #MRDonto page. Things went a little quiet after that. I personally had a longish trip in the States, talking to researchers in our application domain, immediately after the JISC event. No sooner had I recovered from the jet-lag than I fell ill, and took a while to get back on my feet. With panic about the time left to complete the reviews creeping up on me in the background, the prospect of making sense of all the ontologies and how they might play a part in a citation framework seems more daunting than ever. The idea of sharing knowledge and resources with other projects is particularly appealing as I try to make up for lost time. I am, however, conscious of the need to manage the time-scales for our project deliverables and to balance these against taking part in collaborative activities, which may need to move at a pace that takes into account the other activities that collaborating projects need to prioritise. For example, in SageCite the need for closer discussion with researchers, and familiarisation with their data and processes, had been identified early on as a priority and had to take precedence over other work.
In the meantime, I noted that over on the JISC Open Citation blog David Shotton published a very helpful summary of the SPAR family of ontologies. I also spotted an announcement of the release of the OPM Vocabulary. On the JISC-MRD mailing list there has also been useful discussion on identifiers, and in particular the role of our project partners DataCite. I hope that SageCite has contributed in a small way to facilitate these exchanges and to connect people. Another suggestion that has been aired, and that I have had a hand in exploring, is the possibility of a Metadata Forum event on metadata for scientific and research objects.
More recently, in an attempt to take a bite-sized chunk out of the ontology mountain, I have started looking at OPM and have almost finished writing up an 'OPM in 20 bullet points' review, which I will be finalising and publishing tomorrow. I am a newcomer to OPM, and I am aware that there are others in the community who might be better placed to present this topic, but I offer this review in the hope that it may provoke feedback or stimulate discussion. Even if that fails to happen, writing up this summary has helped me feel that I have taken a concrete step towards better understanding the pieces of the ontology puzzle. And come the New Year, when I intend to turn the focus to collaborative activity again, I will feel less sheepish as I approach the others not completely empty-handed.
Here’s looking forward to more collaboration in 2011, and watch this space for the OPM summary.
DataCite are excited to be involved with the JISC funded SageCite project.
DataCite is an international consortium which aims to increase acceptance of research data as legitimate, citable contributions to scholarly communication. To enable this DataCite assigns persistent identifiers for research datasets and manages the infrastructures that support simple and effective methods of data citation, discovery and access.
In SageCite, DataCite will implement data citation services in pilot demonstrators to showcase the value and importance of data in scholarly communication.
@datasetsBL
www.datacite.org
I have recently returned from a trip to the USA where, together with a SageCite colleague from the University of Manchester, I had a valuable opportunity to spend time with data creators and users. We were joined by the editor of Nature Genetics, a collaborator on the SageCite project. SageCite has chosen a specific application domain on which to base its outputs about citation, so that any observations made and lessons learnt will be based on real exemplars and needs.
Sage Bionetworks is a new medical research organisation based in Seattle, Washington, and is a non-funded partner in the SageCite project. Two key activities at Sage are for the researchers to collaborate actively with a number of academic and commercial partners, applying advanced integrative genomic analysis to genetic and clinical datasets, and to develop the Sage Commons: a planned major new biological network and systems biology resource for shared research and development of biological network models and their application to human disease and biology. It will consist of very large network datasets, tools and models. The data that Sage works with are of multiple biological types, typically including phenotype, genotype and gene expression data from experimental and clinical origins.
Early on in the project we identified a key requirement to talk directly with the researchers who are working on the data curated by Sage, building new models of disease, and working towards the new culture of sharing data and results. The SageCite proposal focuses on using Taverna as a tool to capture the workflows that represent the stages of data processing and model building carried out at Sage.
Our aim for these meetings was to understand the steps involved in developing predictive models of diseases from raw data sets and analysis tools. We compiled a list of questions to form a basis for discussion and to guide us in gathering information as the researchers showed us examples of the models and how they are built.
One of the aims of the SageCite project is to cite and track the data sets that are used in modelling diseases, so that the people who created the raw data and who contribute to the analysis can be credited. SageCite is planning to develop semi-automated pipelines using the Taverna workflow system to construct models from raw data, which may be in the form of co-expression and Bayesian networks. During the meetings we identified some examples that would be suitable for modelling as a demonstrator. The Taverna workflow will provide an explicit record of the steps involved in model building, which can be shared with other people. Taverna can also record the provenance of the workflow execution, which will allow us to track how raw data is transformed into models.
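As a sketch of the kind of execution provenance we hope this will give us, consider the following; the step names, file names and log structure are invented for illustration and do not represent Taverna's actual provenance format.

```python
# Each workflow step logs what it consumed and produced, so a finished model
# can be traced back to the raw data that went into it.
run_log = []

def run_step(name, inputs, outputs):
    """Pretend to execute a workflow step and record its inputs/outputs."""
    run_log.append({"step": name, "inputs": inputs, "outputs": outputs})

run_step("normalise", ["raw_expression.csv"], ["normalised.csv"])
run_step("build_network", ["normalised.csv", "genotype.csv"],
         ["coexpression_network.rdata"])

def trace(artifact):
    """Walk the log backwards to find the raw inputs behind an artifact."""
    for entry in reversed(run_log):
        if artifact in entry["outputs"]:
            return [raw for i in entry["inputs"] for raw in (trace(i) or [i])]
    return None  # not produced by any logged step, so it is a raw input

print(trace("coexpression_network.rdata"))
# -> ['raw_expression.csv', 'genotype.csv']
```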
In the coming weeks, the project’s tasks will concentrate on building the Taverna-based demonstrator using the sample data and workflows. In future posts we hope to describe more details about the data and the model-building process, and start to document the issues that our visit has revealed to us.
Yesterday, Liz Lyon and I attended a JISC workshop, where we met up with a large number of colleagues representing projects in the JISC Managing Research Data strand of activity (#jiscmrd).
In this blog post I am focusing on the outcomes of the breakout group that I joined in the afternoon. It was perhaps the most interesting part of the day for me, and I felt it concluded by suggesting some concrete avenues for good joint activity between projects:
- We identified a common interest in the use of OAI-ORE to describe aggregates of research data and related components that together make up a research object (a small sketch of such an aggregation follows after this list). There may be potential for joint efforts around tooling, and for future meetings to share experiences, with a role for JISC to facilitate this (perhaps linking us up with international efforts).
- Domain ontologies drawn from the different domains we are working in will need to be inter-linked with more generic ontologies, such as OAI-ORE or OPM.
- We identified a common need to understand what appears to be a hierarchy (or stack) of ontologies that will need to be combined to meet the different needs in the various projects. We would like to share experiences on how to combine these effectively.
- We agreed to use a collaborative approach to sharing knowledge on these ontologies, with each of us contributing in the areas where we had most experience. We will use a wiki as a working area to pool resources and reviews, avoiding duplication of effort across projects.
- Similarly we will plan to publish finished reviews on the various standards, for the benefit of the community at large, using KnowledgeBlog.
- We will continue discussions around ontologies for research data on the JISCmail lists, and announce news of activities, in order to reach any other projects who were at the meeting but attending other break out groups. The tag #mrdonto will be used to tag posts and to capture the topic of interest identified by our group.
- We identified a common interest in using ORCID for researcher (or contributor) identification, and we would welcome JISC leadership to involve and represent us as a community within initiatives such as ORCID, to ensure they meet requirements for data. We think that between us we could offer a significant mass of adoption for ORCID; compatibility of ORCID with linked data was a priority for all those present.
- We share an interest in services to assign identifiers to data (with a focus on DataCite); once again compatibility with linked data was rated highly in our requirements for interoperability with the approaches that the projects are taking.
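As referenced in the first bullet above, here is a small sketch of an OAI-ORE-style aggregation describing a research object: one aggregation resource pointing at the components that belong together. The URIs, file names and the plain-dict representation are invented for illustration; a real ORE resource map would be serialised as RDF.

```python
# Invented research-object aggregation in the spirit of OAI-ORE.
research_object = {
    "aggregation": "https://example.org/ro/study-42",
    "aggregates": [
        "https://example.org/ro/study-42/raw_data.csv",
        "https://example.org/ro/study-42/workflow.t2flow",
        "https://example.org/ro/study-42/provenance.rdf",
        "https://example.org/ro/study-42/model.rdata",
    ],
}
```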
Disclaimer: this is my take on the outcome from the breakout, I welcome any corrections or clarifications from others in the group. Comments on the blog are welcome but please note I will be unable to moderate comments in the short term.
Overall, I felt this was one of the more productive JISC all-project meetings I have ever attended. It was heartening to see such a good number of people working towards broadly the same aim, and I have made a number of valuable connections which will help to take things forward for SageCite.