Research data management (RDM) is the active management and appraisal of data over the lifecycle of scholarly and scientific interest, and it should be an integral part of best practice in research. It sets out the practical requirements for good research by defining rules to follow, and in doing so touches upon open science and the FAIR principles. RDM includes many elements, such as licences, repositories, and metadata, which together uphold research integrity and reproducibility.
For Horizon Europe proposals, RDM is explicitly referenced, so authors need to address it to show that contingencies are in place to safeguard the data produced by the proposed research. Authors must show evidence of the practical measures to be put in place, from computing and storage infrastructure, through the licences that will be used, to the long-term preservation of data. The figure below (adapted from DCC) shows a simplified research curation lifecycle. It is a visual guide to the most important aspects to consider in RDM and a visual cue to the itemised descriptions in this guide.
The following tables provide a visual guide to the mandatory and recommended actions to take when writing your proposal.
Detailed and rich documentation is fundamental to good research: it supports reproducibility and upholds research integrity. Lab notebooks, whether on paper or, increasingly, electronic, will aid this. Documentation is an umbrella term for everything that needs to be recorded: the types of file created, the protocols used in an experiment, the justifications and reasoning for actions taken, and many other factors. Rich and detailed descriptions aid future interrogation of the research conducted, which makes the documentation a valuable resource.
Metadata is a subset of documentation and is described below.
Machine- and human-readable information, at both the technical and the descriptive level, is the foundation of good RDM. Metadata encompass file formats, documentation, controlled vocabularies and ontologies, licences, and persistent identifiers.
Much of this metadata is generated automatically, for example when a digital data object is captured. It forms a technical metadata layer that is essential for establishing the provenance of the underlying data. Descriptive metadata are typically produced by manual curation, although increasingly also by automated methods such as AI. They annotate digital data objects and provide a further layer of information that is crucial to comparative analyses.
A useful resource for finding available metadata standards can be found here, but there are several others that can be found through web searches.
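As a rough illustration of the technical metadata layer described above, the following sketch collects a few automatically derivable properties of a file into a machine-readable record. The function name `technical_metadata` and the field names are assumptions made for this example; they do not follow any particular metadata standard, and a real project would map them onto the standard used in its domain.

```python
import mimetypes
from datetime import datetime, timezone
from pathlib import Path


def technical_metadata(path):
    """Return a small technical metadata record for one file.

    The field names are illustrative only, not taken from a metadata standard.
    """
    p = Path(path)
    stat = p.stat()
    # Guess the media type from the file extension; this may be None.
    media_type, _ = mimetypes.guess_type(p.name)
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "file_name": p.name,
        "size_bytes": stat.st_size,
        "modified_utc": modified.isoformat(),
        "media_type": media_type,
    }
```

The returned dictionary contains only strings, integers and `None`, so it can be written next to the data file with `json.dump` and read back by both people and software.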
|File formats||For long-term preservation, it is essential that a version of your data exists in open, lossless file formats that retain all their data and are accessible across platforms. This ensures that the data can be accessed through both proprietary and non-proprietary software, and that the full complement of data is kept as it was prior to any manipulation. These data are typically those that are initially captured and form the basis for any downstream analyses. Examples of such file formats can be found here.|
|Controlled vocabularies||n/a (but highly recommended where possible)|
|Ontologies||n/a (but highly recommended where possible)|
|Licences||Reuse of data can be hindered when the data owner has not stated what rights reusers have. A licence makes data reusers aware of their rights; the most common licences in research are Creative Commons licences. Licences for digital objects are machine readable and can enhance searches for data, since results can be filtered by reuse rights. When ultimately depositing your data in a repository (see below), you should consult the repository’s licence policy, which will determine what licence will be placed on your data.|
|Persistent identifiers (PIDs)||
Provide a PID for the different outputs of your research. PIDs provide a permanent means by which your data can be retrieved and disambiguate them from other outputs. PIDs can also identify entities other than research outputs, such as the researchers themselves, the institution in which the research will be carried out, or the grant. Trustworthy repositories (see below) will typically assign PIDs automatically once your data are deposited there, which is a valuable service and an incentive to use these repositories.
|Storage and backup||Throughout the active phase of the research curation lifecycle, before final deposition in a repository (see below), data need to be stored on networked and backed-up storage so that they can be recovered in the event of data loss. Storage on local media such as hard drives or pen drives is discouraged; if it is used, make sure that copies also exist on networked and backed-up storage.|
|Repositories||To ensure long-term sustainability, and to relieve you of sole responsibility for managing your data, third-party repositories need to be used. This step is data preservation, and publishing your data in a repository allows reusers to find and access your data. Sensitive data need special attention to make sure that they are safeguarded properly: it might not be possible to make the data fully accessible, but it is possible to make the metadata discoverable.|
|Data Management Plans||
In Horizon Europe, DMPs have become mandatory and provide documentary evidence of the steps taken to ensure the long-term safety of your research outputs. DMPs follow a template, which now has a revised version for Horizon Europe. The list of points to be addressed crystallises the other points raised in this guidance document regarding RDM and shows that the authors have considered all the measures necessary to uphold best practices.
Additional factors to be considered in a DMP are ethics, legal requirements and costs. IPR, compliance with GDPR and with local legal requirements, and conflicts of interest need to be declared. The cost of RDM activities needs to be estimated as part of the full grant amount, whether it arises from capital expenditure or from the time individuals need to perform curation duties. A DMP should also state data retention periods, after which the data will be deleted or destroyed.
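The storage and backup requirement above relies on copies actually being recoverable. One common way to check this is to compare checksums of the working files with those of their backup copies. The sketch below does this with SHA-256 from the Python standard library; the function names `sha256_of` and `verify_backup` and the directory layout (the backup mirrors the source folder structure) are assumptions made for this example, not part of any Horizon Europe requirement.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source_dir, backup_dir):
    """Return relative paths that are missing from, or differ in, the backup.

    Assumes the backup mirrors the folder structure of the source.
    """
    source_dir, backup_dir = Path(source_dir), Path(backup_dir)
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        dst = backup_dir / rel
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            problems.append(rel.as_posix())
    return problems
```

An empty list means every file in the source folder has an identical copy in the backup. Files that exist only in the backup are not reported by this sketch.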
|File formats||
File formats that are not open or lossless can be acceptable if there is widespread consensus on their use. Some file formats have become the de facto standard due to their ubiquity, but may still be proprietary. In these cases, it is still recommended to produce a copy of the data in an open format and to store both copies together.
Use a standardised syntax for file naming to improve searchability and to make batch processing easier. Names can take the form YYMMDD_filename, and it is recommended to add version suffixes whenever you create versions of a file from a master copy, for example after cleaning or analysis steps.
|Controlled vocabularies and/or ontologies||
Use standard, domain-specific controlled vocabularies and/or ontologies wherever possible to better align your outputs with similar data in your field of research. Standardised vocabularies and ontologies minimise free text, which in turn has significant benefits for comparative analyses and searches and consequently increases the value of your data.
|Licences||Open licences such as CC0 or CC BY are the preferred choice wherever possible. However, this may not always be possible for clinical or other sensitive data. For such data it is still possible to adhere to the FAIR principles and open science by making the metadata freely available, which alerts potential data reusers to the existence of the underlying data without exposing the data themselves. The data can then be managed through access control mechanisms, anonymisation and pseudonymisation; one tool that can be used for this is Amnesia.|
|PIDs||Assign DOIs to your research outputs; these can typically be generated through deposition in a trustworthy repository. It is also recommended to register with ORCID to obtain a PID for yourself as an individual.|
|Storage and backup||Use in-house or institutionally approved storage wherever possible, preferably non-commercial. This better guarantees ownership of the data than commercial third-party services, whose physical storage may be located outside the jurisdiction of the data creator.|
|Repositories||
Domain-specific repositories are the most desirable option and will give your data the most value. Such repositories employ domain-specific metadata standards and controlled vocabularies and/or ontologies, which make it easier to run analyses across similar datasets and perhaps even across domains.
Institutional and domain-agnostic repositories should be used only as a last resort, if no domain-specific repository can be found. However, an institution may mandate deposition in its own repository, and such a mandate should be fulfilled. Depositing data in multiple repositories, although technically possible, should be avoided; if it is done, the same persistent identifier should be maintained across all copies.
Finally, it is recommended to use a trustworthy repository, which provides extra peace of mind because these repositories have been evaluated for their robustness and long-term sustainability. Such repositories may carry a CoreTrustSeal (CTS) or ISO standards approval and can be identified in re3data.org searches.
|DMPs||DMPs should be treated as living documents, and updating them regularly after a grant has been awarded is recommended. Any deviations from the original proposal can be documented in the DMP, together with their justification.|
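The file-naming recommendation above (YYMMDD_filename with version suffixes) can be sketched in a few lines of Python. The helper names `build_name` and `parse_name`, the `_vNN` form of the version suffix, and the rule that the descriptive part contains only letters, digits and hyphens are all assumptions made for this example; the guide itself prescribes only the general YYMMDD_filename pattern.

```python
import re
from datetime import date

# YYMMDD, an underscore, a descriptive part without underscores,
# an optional _vNN version suffix, and a file extension.
NAME_PATTERN = re.compile(
    r"^(?P<date>\d{6})_(?P<stem>[A-Za-z0-9-]+)"
    r"(?:_v(?P<version>\d{2,}))?\.(?P<ext>[A-Za-z0-9]+)$"
)


def build_name(day, stem, ext, version=None):
    """Build a file name such as 240315_survey-raw_v02.csv."""
    name = f"{day.strftime('%y%m%d')}_{stem}"
    if version is not None:
        name += f"_v{version:02d}"
    return f"{name}.{ext}"


def parse_name(file_name):
    """Split a conforming file name into its parts, or return None."""
    match = NAME_PATTERN.match(file_name)
    return match.groupdict() if match else None
```

For example, `build_name(date(2024, 3, 15), "survey-raw", "csv", version=2)` gives `240315_survey-raw_v02.csv`, and `parse_name` returns `None` for names that do not follow the pattern, which makes non-conforming files easy to find in a batch check.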
The following additional support materials can help you with the RDM requirements in Horizon Europe projects:
OpenAIRE also offers tools for research data management: Argos is an OpenAIRE service that simplifies the management, validation, monitoring and maintenance of DMPs.
The following sources were used and contain more extensive information on how to address open science in Horizon Europe proposals: