This Glossary provides definitions of, and practical advice on, the key terms used in the context of RDM requirements in Horizon Europe. Other terms relating to RDM can be found in the Science Europe Data Glossary.


 

 

Anonymisation

Anonymisation is the process of removing personally identifiable information (information that directly or indirectly relates to an identified or identifiable person) from datasets containing sensitive data. As a result, the data subject is no longer identifiable. As opposed to pseudonymisation, anonymisation is not reversible, which means that the re-identification of the data subject is not possible.

Practical advice:

  • OpenAIRE has developed a tool that can be used for anonymisation: Amnesia.

 

 

Backup

Data backup is a process of creating a copy of data in a digital format and storing it on another device to ensure that data are saved and to prevent data loss.
Backups can be full (all files are backed up whenever a backup is made) or partial (only a part of the files, e.g. new files, are backed up).

Practical advice:

  • One backup should be at a physically separate location.
  • Backups should be made from the master copy.
  • The backup location should be as secure as the master copy location.
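
The advice above can be illustrated with a minimal Python sketch that copies a file from the master copy to a backup location and checks that the copy is identical. The function names `sha256_of` and `backup_file` and the directory layout are assumptions made for illustration; they do not come from any particular backup tool, and a real setup would normally rely on institutionally supported backup services.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(master: Path, backup_dir: Path) -> Path:
    """Copy one file from the master copy into backup_dir and verify it.

    backup_dir is a hypothetical location; ideally it sits on a
    physically separate and equally secure device.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / master.name
    shutil.copy2(master, target)  # copy2 also preserves file timestamps
    if sha256_of(master) != sha256_of(target):
        raise OSError(f"Backup of {master} does not match the master copy")
    return target
```

Comparing checksums after copying is one simple way to confirm that the backup really matches the master copy before relying on it.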

 

 

Controlled vocabularies and ontologies

Controlled vocabulary is an organised and standardised arrangement of predefined terms (words and phrases) that are used to index content in an information system with the aim of facilitating information retrieval. Controlled vocabularies connect variant terms and synonyms for concepts, link concepts in a logical order and organise them into categories, so as to provide a consistent way to describe data. They can be general or discipline-specific, and can take the form of subject heading lists, thesauri, authority files, taxonomies and alphanumeric classification schemes.

Ontologies are not controlled vocabularies, but they use controlled vocabularies to establish a formal specification of a conceptual model in which concepts and categories of concepts, properties, relationships among concepts and categories, functions, constraints, and axioms are defined.

Practical advice:

  • Use standard domain specific controlled vocabularies and/or ontologies wherever possible to better align your outputs with similar data in your field of research. 
  • Use well-documented vocabularies in which terms are assigned persistent identifiers.
  • Use repositories that enable you to add terms from controlled vocabularies.
  • A useful domain agnostic resource for finding controlled vocabularies and ontologies can be found here.
  • Check also other resources.

 

 

Data Management Plan

Data Management Plan (DMP) is a formal document that outlines how data will be handled throughout the research data lifecycle – from planning, through collecting, analysing, publishing, preserving, to sharing and reusing.

In Horizon Europe, DMPs are mandatory and a template to guide the preparation of DMPs is provided. The list of points to be addressed includes data types and formats, compliance with the FAIR principles (metadata, repositories, controlled vocabularies, licences, etc.), legal requirements (intellectual property rights, GDPR), costs of preservation, data security and ethics, and retention periods.

Online tools that can facilitate the preparation of DMPs are available.

Practical advice:

  • A DMP is a living document, which should be updated as the project develops. Any deviations from the original proposal can be documented and explained.
  • If possible, make the DMP publicly available.

 

 

Documentation

Data documentation includes various types of information that can help find, assess, understand/interpret, and (re)use research data – e.g. information about methods, protocols, datasets to be used and data files, preliminary findings, etc. Documentation helps understand the context in which data were created, as well as the structure and the content of data. Data should be documented through all stages of the research data lifecycle. Detailed and rich documentation ensures reproducibility and upholds research integrity. Documentation also includes metadata.

Practical advice:

  • Various tools, such as e-lab notebooks, are available to support you in the process of creating documentation.

 

 

File format

File format is a standard way of encoding information so that it can be stored in a computer file. Digital research data may be stored in a wide variety of file formats, depending on the devices and tools used in data collection and processing.

File formats may be proprietary (the encoding-scheme is designed and owned by a company or organisation, and is not published, due to which files can be opened only by those who have particular software or hardware tools) and/or prone to obsolescence (legacy formats, bit rot).

To ensure that users can access and understand data and that data can be preserved in the long term, use open (defined by an openly published specification that anyone can use) and lossless formats (ensuring that no data or quality loss will occur during file manipulation).

Practical advice:

  • Check also this OpenAIRE Guide.


 

 

File naming convention

File naming convention is a framework for generating file names that have a consistent structure, while describing the content of files and their relations to other files.

Practical advice:

  • Use a standardised syntax for file naming to improve searchability and to make batch processing easier. File names can take the form of YYMMDD_filename, and it is recommended to use version suffixes wherever possible when creating versions of a file from a master copy. Such versions could, for example, be generated after cleaning or analysis steps.
  • Define the file naming convention at an early stage of your research and apply it consistently throughout the research data lifecycle.
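
As an illustration of the advice above, the following Python sketch builds file names of the form YYMMDD_filename with a version suffix. The function name `build_file_name`, the `_vNN` suffix pattern and the lower-case, hyphen-separated description are assumptions chosen for this example; any convention works as long as it is applied consistently.

```python
import re
from datetime import date


def build_file_name(description: str, created: date, version: int, extension: str) -> str:
    """Build a name such as 240305_survey-results-cleaned_v02.csv.

    The pattern YYMMDD_description_vNN.ext is a hypothetical convention
    used here for illustration only.
    """
    # Replace anything that is not a letter or digit with a hyphen so that
    # names stay safe across operating systems and batch tools.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", description).strip("-").lower()
    return f"{created:%y%m%d}_{slug}_v{version:02d}.{extension.lstrip('.')}"


print(build_file_name("Survey results cleaned", date(2024, 3, 5), 2, "csv"))
# prints 240305_survey-results-cleaned_v02.csv
```

Because the date comes first and the version number is zero-padded, files sort chronologically and by version in an ordinary directory listing.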

 

 

Licence

Licence is a written agreement by means of which the copyright holder defines the rights granted to the users. In a digital environment, standardised licences based on a set of predefined reuse conditions, such as Creative Commons, are used. Licences for digital objects are machine readable. 

In Horizon Europe, a CC BY or CC0 (or equivalent) open licence is required for data in open access, while metadata of deposited data must be open under the CC0 or an equivalent licence.

Practical advice:

  • When depositing your data in a repository, consult the repository’s licence policy to check whether it is compliant with the Horizon Europe requirements.
  • If you are combining already published data into a new dataset, check the compatibility of data licences.

 

 

Metadata

Metadata are data that provide information about other data, e.g. a description of the content of the data, the date when the data were produced or collected, tools and devices used to obtain data, file formats and sizes, the names of the people who created or collected data, relevant persistent identifiers, etc. Metadata should be created and provided in accordance with commonly used metadata standards, which may be general or discipline specific. This ensures that metadata can be understood by humans and processed and exchanged by machines. 

Practical advice:

  • When preparing a DMP, you will be required to mention the metadata standards you will “follow to make your data interoperable”. A useful resource for finding available metadata standards can be found here. Choose a standard that is commonly used in your discipline. Do not invent your own!
  • Some metadata are automatically generated (by devices used to create and capture data) and embedded in data files, e.g. in digital audio and video recordings, while some have to be produced manually.
  • When depositing your data in a repository, you will be guided by the user interface through the process of providing metadata. In the input form, some metadata fields are mandatory, which means that the procedure cannot be completed if this information is not provided. It is highly recommended to provide as detailed information as possible even in non-mandatory fields.

 

 

Persistent identifier (PID)

Persistent identifier (PID) is a long-lasting reference to a resource that provides the information required to reliably identify, verify and locate the resource. In a digital environment, PIDs have the form of URLs. When pasted in a browser, they take users to the resource.

Apart from digital resources, PIDs can also relate to researchers (e.g. ORCID, ISNI), institutions (e.g. ROR), grants, instruments and devices, etc. In this case, a PID leads to the record describing a researcher, an institution, etc. in the relevant registry.

Examples of PIDs include DOIs, ORCIDs, ISBNs, Handles, etc.

Practical advice:

  • Apply PIDs to your research outputs. PIDs are typically automatically assigned by trustworthy repositories once your data are deposited there.  
  • It is also recommended to register yourself with ORCID to get a PID for you as an individual.
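
To show how PIDs work as URLs, the Python sketch below turns a DOI into a resolvable link via the doi.org resolver and checks the final character of an ORCID iD, which is a checksum computed with the ISO 7064 MOD 11-2 algorithm. The function names are assumptions made for this example, and the DOI in the usage lines is a made-up placeholder rather than a reference to a real dataset.

```python
from urllib.parse import quote


def doi_to_url(doi: str) -> str:
    """Turn a bare DOI into a URL that the doi.org resolver can follow."""
    return "https://doi.org/" + quote(doi, safe="/")


def orcid_check_character(first_15_digits: str) -> str:
    """Compute the ORCID check character (ISO 7064 MOD 11-2)."""
    total = 0
    for digit in first_15_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the format and checksum of an ORCID iD such as 0000-0002-1825-0097."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_character(compact[:15]) == compact[15].upper()


print(doi_to_url("10.1234/example.5678"))     # hypothetical DOI
print(is_valid_orcid("0000-0002-1825-0097"))  # prints True
```

A passing checksum only shows that the iD is well formed; whether it belongs to a given researcher still has to be confirmed in the ORCID registry.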

 

 

Pseudonymisation

Pseudonymisation (pseudo anonymisation) is the processing of personal data in such a way that the data can no longer be related to the data subject without the use of additional information. However, the additional information must be kept separately and subject to technical and organisational measures to ensure that data subjects remain unidentifiable. As opposed to anonymisation, pseudonymisation is a reversible process, which means that data subjects can be re-identified if access to the additional information is enabled.

Practical advice:

  • OpenAIRE has developed a tool that can be used for pseudonymisation: Amnesia.
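
The difference between pseudonymisation and anonymisation can be illustrated with a minimal Python sketch in which direct identifiers are replaced with random codes, while the table linking codes to identifiers (the "additional information") is returned separately so that it can be stored apart from the data. The function names, the `P-` code format and the record structure are assumptions made for this example. The sketch does not show how Amnesia works, and it does not deal with indirect identifiers such as dates or places of birth.

```python
import secrets


def pseudonymise(records: list[dict], id_field: str) -> tuple[list[dict], dict]:
    """Replace the value of id_field with a random code.

    Returns the pseudonymised records and the key table, which must be
    kept separately and under appropriate safeguards.
    """
    key_table: dict = {}  # real identifier -> pseudonym
    used_codes: set = set()
    output = []
    for record in records:
        real = record[id_field]
        if real not in key_table:
            code = "P-" + secrets.token_hex(4)
            while code in used_codes:  # avoid the rare duplicate code
                code = "P-" + secrets.token_hex(4)
            used_codes.add(code)
            key_table[real] = code
        output.append({**record, id_field: key_table[real]})
    return output, key_table


def reidentify(records: list[dict], key_table: dict, id_field: str) -> list[dict]:
    """Reverse the process; possible only with access to the key table."""
    reverse = {code: real for real, code in key_table.items()}
    return [{**record, id_field: reverse[record[id_field]]} for record in records]
```

If the key table is permanently destroyed, the codes can no longer be traced back through it, but the data count as anonymised only if the remaining fields do not allow re-identification either.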

 

 

Repository

Repository is a digital platform that ingests, stores, manages, preserves, and provides access to digital content. A repository should support a commonly accepted metadata standard and have a protocol enabling metadata exchange. 

Repositories are usually classified into: 

  • subject/disciplinary, 
  • institutional and 
  • generalist repositories. 

In Horizon Europe, research data should be deposited in a trusted repository, i.e. in a repository that operates in accordance with relevant standards and best practice, provides long-term access and preservation, and ensures compliance with the FAIR principles. Trusted repositories include certified or community-recognised repositories, stable institutional repositories and generalist repositories (such as Zenodo).

Practical advice:

  • Use the re3data repository registry to identify appropriate repositories.
  • The use of domain specific repositories is the most desirable. Such repositories will employ domain specific metadata standards and controlled vocabularies and/or ontologies which will enhance the ability to do analyses across similar datasets and even across domains.
  • Institutional and generalist repositories should only be considered if no domain specific repository can be found or if there is a mandate to deposit in a specific repository. If it is necessary to deposit data in multiple repositories, this should be done so as to maintain the same persistent identifier across those multiple copies.

 

 

Sensitive data

Sensitive data are information that should be protected against unauthorised disclosure because unauthorised access may negatively affect the privacy of an individual, trade and business secrets or even security. In the context of research, sensitive data usually include personally identifiable information (names, date and place of birth, place of living, employment information, etc.), health information, and other private or confidential data.

Practical advice:

  • Even in the case of sensitive data, it is still possible to adhere to the FAIR principles and open science by making the metadata freely available, while not enabling public access to the underlying data. These data can be managed through access control mechanisms, anonymisation and pseudonymisation.

 

 

Storage

Data storage is a computing technology that enables saving data in a digital format on computer components and recording media, including cloud services.

In the context of Research Data Management, it is necessary to ensure that data are stored securely until the end of the project and throughout the minimum retention period. Storage options may include:

  • Portable devices,
  • University network drives,
  • Cloud services.

Practical advice:

  • Storage on local storage spaces on hard drives, pen drives, etc. is discouraged because these devices are vulnerable and data loss may occur. If portable devices are used, it should be ensured that there are copies on networked drives and backup storage.
  • Using in-house or institutionally approved spaces is recommended, especially if regular backups are enabled. 
  • Cloud services are suitable for collaboration with partners from partner institutions. However, it should be checked whether the selected cloud service makes regular backups and whether it falls under European jurisdiction.