Data Citation scenarios

2011 February 25

by Monica Duke

This blog post is the result of jotting down some notes while thinking through scenarios of how citation happens, in order to help identify requirements for data citation. One particular need is to think about how supporting and doing data citation fits into researchers’ workflows, and what the implications are for the infrastructure and services that citation requires. What I have written here is very much work in progress; for example, I have written down a minimal definition of citation simply for scoping purposes. It occurs to me that others may have written similar scenarios to help them think about data citation. I would love to hear from anyone who has described citation scenarios and how these fit in with workflows or infrastructure. I am sharing the thinking below at a very early stage, as I do not wish to re-invent the wheel if there is other work that we can draw on and be informed by.

I am aware of Gudmundur ‘Mummi’ Thorisson’s slides at Science Online 2010, which outline a scenario for the use of researcher identifiers in association with data sharing. SageCite also contributed a use case to the W3C incubator group on libraries and linked data. The latter follows the template required by the group and, to make it relevant to the use case call, focuses on the potential link with library data and the use of linked data.

I am sure there must be other scenarios out there; please do let me know about them. In the meantime, here are my thoughts and jottings, which I wrote to unpack the factors around data citation. They follow a number of weekly telecons with the other SageCite participants, in which we had various discussions around data citation. Any other views and comments on what I have written below are, of course, very welcome.

Scenarios around citation

Definition

Citation*: a reference to data for the purposes of

attributing credit
facilitating access to the data

Scenario 1 Publication establishes priority

A data creator [1] submits an article to a journal. The article references data that has been created during analysis (using a workflow).[2] This is the first time that the data is shared with the community [3] and establishes priority.

[1] instead of an individual data creator this could be a research team that includes data creators and analysers (and others), or a curator on behalf of the team
[2] other data (not created by the team) could also be referenced in the publication
[3] the data is shared in the sense that the analysis and creation process is described to the community together with other scientific conclusions. At this point the data may or may not be made available.[4]
[4] ‘data’ here is simplistic and could be unpacked further – different stages of analysis e.g. intermediary stages could be shared or not, tools could be referenced, programming scripts could be shared too.

Scenario 2 Managing data to support its citation

A data curator [1] submits some data [2] [6] produced during an analysis to a repository.[3] The data becomes shareable, i.e. it is made available to others.[4] The researcher or team then references this data (e.g. as in scenario 1), or references it in an email to a collaborator, or as part of pre-publication to give access to a referee [5], or mentions it in a community mailing list or a blog.
[1] the data curator is intended as a role to represent either an individual researcher, or different members of a team of researchers, or somebody appointed on behalf of such researchers, all of whom may be the ones to submit the data.
[2] again ‘data’ here is used simplistically and can be unpacked further – see [4] in scenario 1.
[3] The repository can be managed by the researcher, e.g. based at his or her institution, or it can be managed by external parties.
[4] Prior to sharing, the data may have been kept internally, e.g. kept in a repository with access restrictions which are lifted on sharing, or kept in a closed internally-managed repository and then moved to a public repository.
[5] Is this last one a ‘citation’? Does citation imply some longevity? If data is shared for a time-limited purpose, e.g. access for the referee, then the reference is going to be short-lived: the reference given to the referee achieves sharing but is not within the scope of citation. If the data goes on to be published, it may not matter if the reference then changes when the data becomes more public, i.e. the short-term reference does not necessarily need to have a link to the long-term reference, so the short-term reference can be excluded from the citation analysis.
[6] The data may in turn be derived from other data, which may or may not be managed alongside it, e.g. the original may have been derived from a public repository or shared privately between two parties. Access restrictions or privacy considerations may apply to it. A reference to the original data may be required, but this can be considered a sub-case of scenarios 3 and 4.

Scenario 3 Referencing someone else’s data (in publication)

A researcher applies a methodology described in an article to an alternative set of data. The process is very useful and generates new results in a complementary area, e.g. a method used in genetics is then used by systems biologists. The researcher writes an article and credits the prior work by (a) referencing the journal article that originally described the process and (b) referencing the scripts and workflows that were re-used [1].

[1] assuming some of these had been made available for re-use.

Scenario 4 Referencing someone else’s data (re-analysing data)

A scientist retrieves some data and the associated workflow from a repository [1]. The scientist re-analyses the data, makes some modifications to the process and arrives at some alternative conclusions. The scientist makes the new data and modified process public [2]. The scientist references the original data [3] that she used [4].

[1] or just some data from a public repository
[2] this may be simply sharing the new data and process in a repository or writing a journal article about it.
[3] and the scripts/workflows
[4] again ‘references’ here is simplistic and subsumes some other details, which are the subject of the project’s research.

Scenario 5 – something about doing citation metrics.

(but perhaps this is another section, not how and when citation happens, but what could happen to it afterwards).

Notes: scenarios 1 and 2 can be considered self-citation.
Scenarios 3 and 4 are citation by others.
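
As a rough way of thinking through the distinctions in these notes, here is a small Python sketch that models a reference to data and classifies it. Everything in it is hypothetical: the names DataReference, is_citation and is_self_citation, and the fields they use, are made up for illustration and do not come from SageCite or any existing tool. It also rests on two assumptions: that a short-lived reference (such as one given to a referee) falls outside the scope of citation, and that any overlap between the people making the reference and the data creators counts as self-citation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataReference:
    """A reference to data (hypothetical model, for illustration only)."""
    data_id: str             # made-up identifier for the referenced data
    creators: frozenset      # people or teams who created the data
    citers: frozenset        # people or teams making the reference
    long_lived: bool = True  # False for e.g. a time-limited referee link


def is_citation(ref: DataReference) -> bool:
    # Assumption: a short-lived reference achieves sharing
    # but is not within the scope of citation.
    return ref.long_lived


def is_self_citation(ref: DataReference) -> bool:
    # Assumption: any overlap between citers and creators
    # counts as self-citation (scenarios 1 and 2).
    return is_citation(ref) and bool(ref.creators & ref.citers)


def is_citation_by_others(ref: DataReference) -> bool:
    # No overlap between citers and creators (scenarios 3 and 4).
    return is_citation(ref) and not (ref.creators & ref.citers)
```

For example, a team referencing its own deposited data would be classified by is_self_citation, a scientist re-analysing that data would be classified by is_citation_by_others, and a temporary link handed to a referee would be excluded by is_citation. Real cases are likely to be messier, for instance when a curator submits data on behalf of a team.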
