What do we mean by “Research Data” in the OpenAIRE Graph?
In the realm of Open Science, the OpenAIRE Graph stands out as a pivotal tool for enhancing discovery and research assessment. This comprehensive graph aggregates, deduplicates, and enriches metadata from over 2,000 data sources, resulting in the largest open data Scholarly Knowledge Graph available today. It encompasses a vast array of scholarly outputs, including publications, research data, and research software, along with semantic links and citations to key research entities such as individuals, organisations, and data sources. Because the meaning of the term “research data” varies significantly across disciplines and contexts, this article clarifies how we interpret and define research data within the framework of the OpenAIRE Graph. Join us as we explore the nuances and implications of this innovative resource!
The life cycle of data-driven science
The life cycle of data-driven science can be seen as a continuous loop of two main conceptual stages: the experimental phase and the publishing phase. During the experimental phase, scientists gather research data and/or software and perform experiments until their thesis has been either confirmed or refuted. At this point, scientists publish their outcomes in the form of scientific publications, research data, and research software. In the experimental phase, scientists gather data from two main channels:
- Publishing data sources: institutional, thematic, or catch-all repositories and some thematic databases. Such services host published research data that researchers have produced, curated, and packaged as evidence of experimental outcomes, ensuring repeatability and reproducibility while securing attribution as part of a researcher’s scientific curriculum. Published data typically takes the form of data files (e.g., tabular data in Zenodo.org) or entries in databases (e.g., proteins in PDB) and comes with bibliographic metadata (e.g., in compliance with DublinCore, DataCite, OpenAIRE Guidelines) that facilitates discoverability, assessment, and linking to other research outcomes.
- Scientific data sources: services that contain data collections generated by instruments (e.g., satellite data, sensor data, and monitoring equipment data) or built out of human and machine efforts to deliver curated collections (e.g., scientific/scholarly knowledge graphs or scientific databases). Examples include Copernicus for Earth observation data, CERN's data repositories for high-energy physics, the Ocean Observatories Initiative for marine data, the eBrains knowledge graph, and the OpenAIRE Graph. These services are operated to offer access to uniform research data collections, which can be typically queried to select a subset, to be processed or analysed.
Diagram illustrating the integration and relationships of “research data” in the OpenAIRE Graph's data model, including its connections to scientific and publishing data sources and other research outputs.
To enable research reproducibility and research assessment, when publishing research publications, scientists ought to cite the research data they have used for their experiments and the research data they have produced (and possibly published). This may happen in a variety of ways:
- Citation to “published research data”: reference from a publication to research data, e.g., a dataset in Zenodo.org, by specifying bibliographic metadata and possibly using persistent identifiers;
- Citation to a “scientific data source”: reference from a publication to a scientific data source, e.g., reference to a biobank used in an experiment, using a persistent identifier (if one was minted to identify the data collection of the data source) or, more often, by mentioning the name of the data source in the text. [1]
As an example, the OpenAIRE Graph is a “scientific data source”: a service that offers access via APIs to a collection of metadata records that can be used for any sort of scientific analysis. The Graph is, however, also published as research data in Zenodo for scientists to download. Many scientific publications today mention the OpenAIRE Graph, some by referring to the data source name, others by including the DOI and metadata of the OpenAIRE Graph dataset in the reference list.
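To make the distinction concrete, the two kinds of citation can be sketched as a small data structure. This is only an illustration under stated assumptions: the class, field names, and labels below are hypothetical and are not the OpenAIRE Graph data model, and the identifiers are made-up placeholders rather than real DOIs.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical labels for the two citation targets described above.
PUBLISHED_RESEARCH_DATA = "published research data"
SCIENTIFIC_DATA_SOURCE = "scientific data source"


@dataclass(frozen=True)
class DataCitation:
    """Illustrative record of a citation from a publication to data."""

    citing_publication: str            # placeholder identifier of the citing publication
    target_kind: str                   # one of the two labels defined above
    target_name: str                   # dataset title or data source name
    target_pid: Optional[str] = None   # persistent identifier, if one was used

    def has_pid(self) -> bool:
        """True when the citation carries a persistent identifier."""
        return self.target_pid is not None


# A publication citing the dataset through its reference list (placeholder DOI).
by_pid = DataCitation(
    citing_publication="example-publication-1",
    target_kind=PUBLISHED_RESEARCH_DATA,
    target_name="OpenAIRE Graph dataset",
    target_pid="doi:10.0000/placeholder",
)

# A publication that only mentions the data source by name in its text.
by_name = DataCitation(
    citing_publication="example-publication-2",
    target_kind=SCIENTIFIC_DATA_SOURCE,
    target_name="OpenAIRE Graph",
)

for citation in (by_pid, by_name):
    print(citation.target_kind, "| has PID:", citation.has_pid())
```

In this sketch the same resource appears under both kinds, mirroring the example above: once as a published dataset with an identifier, once as a named service.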
Research data in the OpenAIRE Graph
The OpenAIRE Graph research data acquisition policies specifically address the use cases of discovery and research assessment within Open Science scholarly communication workflows. To this end, the OpenAIRE Graph data model includes scientific and publishing data sources, published research data (as a special class of research products), as well as semantic citations between publications, research data, and data sources. More specifically, to operate a complete research data citation index, the OpenAIRE Graph:
- Aggregates metadata profiles of scientific and publishing data sources from known registries (FAIRSharing, OpenDOAR, re3data);
- Aggregates bibliographic metadata of and citations from/to published research data from publishing data sources;
- Applies mining methods to infer citations from research publication full-texts to both published research data and scientific data sources.
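The article does not describe the mining methods themselves. As a deliberately naive illustration of the idea, the sketch below looks for DOI-like strings and for known data source names in a piece of full text. The function names, the simplified DOI pattern, the sample text, and the placeholder DOI are all assumptions made for this example; they are not the techniques used by the OpenAIRE Graph.

```python
import re

# Simplified DOI-like pattern (an assumption; real DOI matching is more involved).
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def find_dois(text):
    """Return the DOI-like strings found in the text, without trailing punctuation."""
    return sorted({match.rstrip(".,;)") for match in DOI_PATTERN.findall(text)})


def find_data_source_mentions(text, known_names):
    """Return the known data source names that appear as whole words in the text."""
    return [
        name
        for name in known_names
        if re.search(r"\b" + re.escape(name) + r"\b", text, re.IGNORECASE)
    ]


# Made-up full-text fragment with a placeholder DOI.
full_text = (
    "We analysed records from the OpenAIRE Graph "
    "(doi:10.0000/example.graph) and protein entries from PDB."
)
# In practice the names would come from data source registries.
known_sources = ["OpenAIRE Graph", "PDB", "Copernicus"]

print(find_dois(full_text))
print(find_data_source_mentions(full_text, known_sources))
```

A DOI match would correspond to a candidate citation to published research data, while a name match would correspond to a candidate citation to a scientific data source. Plain string matching like this would produce false positives on real text, which is one reason to treat it as a toy.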
[1] For completeness, this classification should also include dynamic citations. These citations are references to a bibliographic metadata record that describes (and encodes) a stateless query performed over a scientific data source “at time t”.