Requirements for Data Citation: The Prequel
I’ve just described the background to this post in the previous blog entry. So without further ado:
Requirement 1. The Citation needs to be able to uniquely identify the object cited.
This seemingly simple requirement is about making sure that the citation contains an unambiguous reference to the cited object. However, unpacking this requirement reveals some difficult consequences. Textual citations can be made unique through a combination of fields, e.g. author, title, date. Truly unambiguous identification could alternatively be achieved through the use of unique identifiers. Do all the different types of citable object need to be identified and listed? How does a discipline work out and agree the granular level at which objects can be (or need to be) identified (i.e. cited), whether uniquely or as a package?
Requirement 2. The Citation needs to support the retrieval of the cited object.
Location of cited objects is an important driver for citation since it supports re-use, validation and reproducibility. Retrieval can be achieved indirectly by providing sufficient information for a human user to process the parts of the citation and use knowledge of the discipline (e.g. location of discipline-specific repositories and methods to search and access their contents) to find the cited object. This requirement also subsumes some other requirements commonly associated with identifiers, e.g. persistence, and implies properties of other infrastructure for data management (e.g. curation of data). Supporting automated retrieval leads us to requirement 3.
Requirement 3. The citation mechanism must be compatible with Web infrastructure.
With increasing calls for publications to present data in a form that is more interactive, any modern infrastructure for citation should be compatible with web infrastructure. In practice this implies that citations must contain actionable URLs. Any internal identifiers used for internal management must eventually be mapped to an actionable citation, which throws up interesting questions on the management of identifiers and objects as they move across curation boundaries from the private sphere to the public domain.
Requirement 4. The citation ‘system’ must be able to generate a citation with all the desired fields.
The desired fields must be agreed for all the types of citable object. Does this really need to happen up front? How necessary is it to agree common fields and labels within and across disciplines? A number of dependencies can be identified. The information required for the citation must either be captured in the datasets (from where it can be extracted), or in the metadata, or explicitly entered by the user. Systems that automatically generate or capture metadata (for example, provenance metadata) are needed to decrease the load on the user. This information must be available at the point at which the citation is generated and shared. This may be internally, where a data contributor cites their own work, or within external systems when others cite a dataset.
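To make the dependency concrete, here is a minimal sketch of a citation generator that refuses to produce a citation until every field is available. The field set (author, year, title, repository, url) and the output layout are assumptions made for illustration only; which fields a discipline actually agrees on is exactly the open question above.

```python
from dataclasses import dataclass, fields


@dataclass
class DatasetRecord:
    # Hypothetical field set; the agreed fields are an open question.
    author: str
    year: str
    title: str
    repository: str
    url: str


def format_citation(record):
    """Build a textual citation, failing loudly if any field is empty."""
    missing = [f.name for f in fields(record) if not getattr(record, f.name)]
    if missing:
        raise ValueError("cannot generate citation, missing: " + ", ".join(missing))
    return (
        f"{record.author} ({record.year}). {record.title}. "
        f"{record.repository}. {record.url}"
    )
```

The check for missing fields stands in for the point made above: whatever is not captured in the dataset or its metadata has to be entered by the user before a citation can be generated.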
Requirement 5. The citation mechanism must be identifier-agnostic.
Globally, the community would need to identify which identifiers are in use. What are their life-cycles? How are they used? The range could include discipline-specific identifiers and globally agreed schemes. The requirement throws up some dependencies with respect to accommodating different resolution mechanisms.
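One way to picture an identifier-agnostic mechanism that still yields actionable URLs is a small registry mapping each identifier scheme to a resolver. The sketch below is illustrative only: the function and registry names are made up, the DOI shown in the usage note is a placeholder, and a real system would also have to deal with schemes that have no public resolver.

```python
# Scheme name -> URL template. doi.org and hdl.handle.net are the public
# resolvers for DOIs and Handles; other schemes would be registered by
# the community that uses them.
RESOLVERS = {
    "doi": "https://doi.org/{id}",
    "hdl": "https://hdl.handle.net/{id}",
}


def to_url(scheme, identifier, resolvers=RESOLVERS):
    """Turn a (scheme, identifier) pair into an actionable URL."""
    scheme = scheme.lower()
    identifier = identifier.strip()
    if scheme == "url":
        # Already actionable, nothing to resolve.
        return identifier
    if scheme not in resolvers:
        raise ValueError(f"no resolver registered for scheme '{scheme}'")
    return resolvers[scheme].format(id=identifier)
```

For example, `to_url("doi", "10.1234/example")` gives `https://doi.org/10.1234/example`, and a discipline-specific scheme can be supported by adding one entry to the registry rather than changing the citation format.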
Requirement 6. The citation mechanism must support the gathering of metrics.
Credit and attribution are considered important drivers for citation. The computation of metrics to measure the value of contributions is made possible through citation. Does this requirement imply that the citation must be processable, preferably in an automated way? What are the implications for the other requirements, especially 1, 3 and 4?
Requirement 7. The citation must be human readable.
Humans use human judgement to decide whether the cited object is worthy of more attention, if it can be trusted, and other similar judgements, which are based in part on the information within the citation. For example the mention of a trusted disciplinary repository lends confidence that the item will be retrievable and formatted to a high standard. Information about the data (e.g. species) suggests if it is within the field of interest of the reader. The name of a data contributor can also be used as a proxy for determining if data is of interest. Citations for data may lead the human reader indirectly to supplementary information (metadata) that helps to arrive at these judgements.
Requirement 8. The citation must be machine processable.
This requirement is linked to several of the others. In a Linked Data environment, citations that are compatible with Web infrastructure are required to re-use conventions and ontologies established by Linked Data practices. The automation of metric computation could be made more reliable through agreed ways of making citations machine processable. Automated following of links and gathering of information to answer questions such as “Find me all the data contributed by contributor x” are requirements that future citation infrastructure should be able to meet.
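As a rough illustration of what machine processable could mean in practice, the sketch below serialises citations as JSON and answers the contributor question from that serialised form. The record layout, contributor names and URLs are all placeholders, and plain JSON is used here only for brevity; a Linked Data approach would use shared vocabularies instead.

```python
import json

# Hypothetical citation records; names, titles and URLs are placeholders.
CITATIONS = [
    {"contributors": ["Contributor X"], "title": "Dataset A",
     "year": 2011, "url": "https://example.org/data/a"},
    {"contributors": ["Contributor Y"], "title": "Dataset B",
     "year": 2011, "url": "https://example.org/data/b"},
    {"contributors": ["Contributor Y", "Contributor X"], "title": "Dataset C",
     "year": 2010, "url": "https://example.org/data/c"},
]


def serialise(citations):
    """Produce the machine-processable form that would be shared."""
    return json.dumps(citations, indent=2)


def data_contributed_by(serialised, contributor):
    """Return the URLs of all cited data listing the given contributor."""
    return [
        citation["url"]
        for citation in json.loads(serialised)
        if contributor in citation["contributors"]
    ]
```

The same structured records could feed a metric computation, for instance by counting citations per contributor, which is where this requirement meets requirement 6.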