Aggregation and content provision workflows
Index and stats update:
Next update scheduled to start on: 2018-06-25
|Available on the portal
|2018-07-10||2018-06-27||Updated version of OpenAIRE mining algorithms processed more than 400K additional full-texts from Springer Open Access.|
|2018-05-15||2018-05-08||More than 200K additional full-texts processed by OpenAIRE mining algorithms.|
|2018-03-30||2018-03-26||New research community: Research Data Alliance|
|2018-03-20||2018-03-13||Content update delayed because of technical issues|
updated version of mining algorithm
updated data model (see details at https://www.openaire.eu/openaire-xml-schema-change-announcement)
|2017-12-15||2017-12-11||FCT naming not fixed yet.|
|2017-11-27||2017-11-19||Portuguese funder FCT appears twice. Once with a wrong name.|
Added projects of funders RCUK and Turkey.
OpenAIRE makes openly accessible a rich Information Space Graph (ISG) in which products of the research life-cycle (e.g. scientific literature, research data, projects, software) are semantically linked to each other. The ISG is constructed by a set of autonomic, orchestrated workflows operating in a regime of continuous data integration.
What does OpenAIRE collect?
The OpenAIRE technical infrastructure collects information about objects of the research life-cycle, compliant with the OpenAIRE acquisition policy, from different types of data sources:
- Scientific literature metadata and full-texts from institutional and thematic repositories, Open Access journals and publishers;
- Dataset metadata from data repositories and data journals;
- Scientific literature, data and software metadata from Zenodo;
- Metadata about data sources, organizations, projects, and funding programs from entity registries, i.e. authoritative sources such as CORDA and other funder databases for projects, OpenDOAR for publication repositories, re3data for data repositories, DOAJ for Open Access journals;
- Coming soon: metadata of open source research software from software repositories (currently available only on https://beta.openaire.eu);
- Coming soon: metadata about other types of research products (e.g. workflows, protocols, methods, research packages);
- Coming soon: metadata about scientific literature, datasets, persons, organisations, projects, funding, equipment and services, collected through CRIS (Current Research Information Systems).
What kind of data sources are in OpenAIRE?
Objects and relationships in the OpenAIRE ISG are extracted from information packages, i.e. metadata records, collected from data sources of the following kinds:
- Institutional or thematic repositories: Information systems where scientists upload the bibliographic metadata and full-texts of their articles, because of obligations from their organization or community practices (e.g. arXiv, Europe PMC);
- Open Access publishers and journals: Information systems of Open Access publishers or their journals, which offer bibliographic metadata and PDFs of their published articles;
- Data archives: Information systems where scientists deposit descriptive metadata and files about their research data (also known as scientific data, datasets, etc.);
- Hybrid repositories/archives: Information systems where scientists deposit metadata and files of scientific literature, research data and research software (e.g. Zenodo);
- Aggregator services: Information systems that, like OpenAIRE, collect descriptive metadata about publications or datasets from multiple sources in order to enable cross-data source discovery of given research products. Examples are DataCite, BASE, DOAJ;
- Entity registries: Information systems created with the intent of maintaining authoritative registries of given entities in scholarly communication, such as OpenDOAR for institutional repositories, re3data for data repositories, CORDA and other funder databases for projects and funding information;
- CRIS (coming soon): Information systems adopted by research and academic organizations to keep track of their research administration records and related results; examples of CRIS content are articles or datasets funded by projects, their principal investigators, facilities acquired thanks to funding, etc.
How does OpenAIRE collect metadata records?
As of October 2017, OpenAIRE aggregates more than 25 million metadata records from more than 2,700 data sources.
OpenAIRE features three workflows for metadata aggregation:
- for the aggregation from data sources whose content is known to comply with the OpenAIRE content acquisition policy,
- for the aggregation of content that is not known to be eligible according to the policy,
- for the aggregation of information packages from entity registries.
Workflow for OpenAIRE compliant data sources
This workflow is for data sources that comply with the OpenAIRE guidelines; it is therefore the one executed for the majority of data sources.
The workflow consists of two phases: collection and transformation.
The collection phase collects information packages in the form of XML metadata records from an OAI-PMH endpoint of the data source (as the OpenAIRE guidelines mandate) and stores them in a metadata store.
The transformation phase transforms the collected records according to the OpenAIRE internal data model and stores them in another metadata store, ready to be read for populating the OpenAIRE ISG.
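The two phases above can be illustrated with a minimal Python sketch. It is not the OpenAIRE implementation: the in-memory lists standing in for the two metadata stores, the `fetch` callable, and the tiny "internal model" (an identifier and a title) are assumptions made for illustration. Only the `ListRecords` verb, the `metadataPrefix` and `resumptionToken` arguments, and the XML namespaces come from the OAI-PMH and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
DC_TITLE = ".//{http://purl.org/dc/elements/1.1/}title"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def parse_list_records(xml_text):
    """Return (records, resumption_token) for one ListRecords response page."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.findall(".//oai:record", OAI_NS):
        identifier = rec.findtext("oai:header/oai:identifier", namespaces=OAI_NS)
        records.append({"id": identifier, "xml": ET.tostring(rec, encoding="unicode")})
    token = root.findtext(".//oai:resumptionToken", namespaces=OAI_NS)
    return records, (token.strip() if token and token.strip() else None)


def transform(record):
    """Map a collected record to a hypothetical, minimal internal model."""
    rec = ET.fromstring(record["xml"])
    return {"id": record["id"], "title": rec.findtext(DC_TITLE)}


def harvest(base_url, fetch):
    """Collection phase followed by transformation phase.

    `fetch` is any callable that takes a URL and returns the response body
    as text (hypothetical; e.g. a thin wrapper around an HTTP client).
    """
    collected_store, token = [], None
    while True:
        page = fetch(list_records_url(base_url, resumption_token=token))
        records, token = parse_list_records(page)
        collected_store.extend(records)
        if token is None:
            break
    transformed_store = [transform(r) for r in collected_store]
    return collected_store, transformed_store
```

Keeping the collected records separate from the transformed ones mirrors the two metadata stores described above, so the transformation can be rerun without harvesting the data source again.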
Workflow for data sources with unknown compliance
This workflow applies to data sources that are registered in OpenAIRE but are not known to be OpenAIRE compliant. This is the typical case for aggregators of data repositories (e.g. DataCite).
According to the content acquisition policy, OpenAIRE can include a dataset in the ISG only if it has a link to an object (project or publication) already in the ISG.
Therefore, OpenAIRE collects all metadata records and transforms them according to the internal OpenAIRE data model. Inference algorithms then process the records and mark those that satisfy the content acquisition policy, so that they are eligible to enter the ISG.
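The eligibility rule stated above (a dataset enters the ISG only if it links to an object already in the ISG) can be sketched as a simple set intersection. This is an illustration only: the record layout, the `links` field, and the function name are hypothetical, and the actual inference algorithms are considerably more involved than an exact identifier match.

```python
def mark_eligible(datasets, isg_ids):
    """Flag each dataset that links to at least one object already in the ISG.

    `datasets` is a list of dicts with a hypothetical "links" field holding
    identifiers of related projects or publications; `isg_ids` is the set of
    identifiers already present in the ISG.
    """
    marked = []
    for dataset in datasets:
        linked = set(dataset.get("links", ())) & isg_ids
        marked.append({**dataset, "eligible": bool(linked)})
    return marked


isg_ids = {"project:123", "publication:abc"}
datasets = [
    {"id": "dataset:1", "links": ["project:123"]},
    {"id": "dataset:2", "links": ["publication:unknown"]},
    {"id": "dataset:3"},
]
eligible_ids = [d["id"] for d in mark_eligible(datasets, isg_ids) if d["eligible"]]
```

In this sketch all records are kept and only flagged, which matches the description above: everything is collected and transformed, and only the marked records become eligible.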
Workflow for entity registries
This workflow applies to data sources offering authoritative lists of entities.
The workflow consists of two phases: collection and transformation.
The collection phase collects information packages in the form of files in a machine-readable format (e.g. XML, JSON, CSV) via one of the supported exchange protocols (OAI-PMH, SFTP, FTP(S), HTTP, REST).
The transformation phase transforms the packages according to the OpenAIRE internal data model and stores them in a metadata store, ready to be read for populating the OpenAIRE ISG.
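As a rough illustration of the transformation phase for registries, the sketch below parses a package delivered as JSON or CSV and maps each row to a common shape. The field names (`id`, `name`), the `type` label, and both function names are assumptions for the example; they do not reflect the actual OpenAIRE data model.

```python
import csv
import io
import json


def parse_package(payload, fmt):
    """Parse a registry package (text) into a list of row dictionaries."""
    if fmt == "json":
        return json.loads(payload)
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(payload)))
    raise ValueError(f"unsupported format: {fmt}")


def to_internal(row, entity_type):
    """Map one registry row to a hypothetical internal record."""
    return {
        "type": entity_type,
        "id": str(row["id"]).strip(),
        "name": str(row["name"]).strip(),
    }


csv_payload = "id,name\n1, Repository A\n2,Repository B\n"
json_payload = '[{"id": 7, "name": "Project X"}]'

repositories = [to_internal(r, "datasource") for r in parse_package(csv_payload, "csv")]
projects = [to_internal(r, "project") for r in parse_package(json_payload, "json")]
```

Normalizing every format into the same record shape is what lets a single downstream step populate the graph regardless of how each registry exports its data.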
For additional details about the aggregation workflows, please refer to the references listed at the end of this page.
What does OpenAIRE do to enrich the collected metadata records?
Once the ISG is populated, OpenAIRE performs de-duplication of organizations and publications and runs inference algorithms to enrich the graph with additional information extracted from the publications' full-texts, namely:
- links to datasets
- links to projects
- links to research communities
- links to publications
- links to software
- links to biological entities (e.g. PDB)
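To give a feel for the de-duplication step mentioned above, here is a deliberately simplified sketch that groups publications sharing the same normalized title. It is an assumption-laden stand-in, not the method used by OpenAIRE: real de-duplication compares several fields and tolerates small differences, whereas this example uses exact matching on a single hypothetical `title` field.

```python
import re
from collections import defaultdict


def title_key(title):
    """Normalize a title: lowercase and collapse non-alphanumeric runs."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def group_duplicates(publications):
    """Return lists of publication ids that share the same normalized title."""
    groups = defaultdict(list)
    for pub in publications:
        groups[title_key(pub["title"])].append(pub["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


publications = [
    {"id": "repoA:1", "title": "Open Science!"},
    {"id": "repoB:9", "title": "open  science"},
    {"id": "repoC:4", "title": "A different paper"},
]
duplicate_groups = group_duplicates(publications)
```

Each resulting group would then be merged into a single representative object, so that links inferred from any copy of a publication end up attached to the same node of the graph.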
How is the enriched OpenAIRE graph published?
The deduplicated and enriched ISG is materialized by the data publishing workflow into four ISG projections:
- a full-text index to support search and browse queries from the OpenAIRE portal and to expose subsets of the ISG via the OpenAIRE search API;
- an E-R database and a dedicated key-value cache for statistics;
- a NoSQL document store to support OAI-PMH bulk export of subsets of the ISG in XML format;
- a triple store to expose the ISG as LOD via a SPARQL endpoint (currently in beta).
The switch from pre-public to public, in which the currently accessible ISG projections and statistics are retired and the new versions take their place, is still performed manually for safety reasons.
Pre-public ISG projections are subject to a set of semi-automatic checks for quality control.
These quality checks are needed to evaluate whether the switch to public can be performed or whether regressions in the overall data quality need to be addressed first.
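One way to picture such a check is a comparison of object counts between the public and the pre-public projections, flagging any count that drops by more than a tolerance. The metric names, the 5% threshold, and the functions below are hypothetical; the sketch only shows the kind of evidence that could inform the manual decision to switch.

```python
def find_regressions(public_stats, prepublic_stats, max_drop=0.05):
    """Return metrics whose value dropped by more than `max_drop` (a fraction).

    Both arguments map a metric name to a count; a metric missing from the
    pre-public statistics is treated as zero.
    """
    regressions = {}
    for name, old in public_stats.items():
        new = prepublic_stats.get(name, 0)
        if old > 0 and (old - new) / old > max_drop:
            regressions[name] = (old, new)
    return regressions


def can_switch(public_stats, prepublic_stats, max_drop=0.05):
    """True when no metric regressed beyond the tolerance."""
    return not find_regressions(public_stats, prepublic_stats, max_drop)


public_stats = {"publications": 1000, "datasets": 200}
prepublic_stats = {"publications": 1010, "datasets": 150}
report = find_regressions(public_stats, prepublic_stats)
```

Returning the old and new values for each flagged metric, rather than a bare yes or no, gives whoever performs the switch enough detail to judge whether a drop is a real regression or an expected change.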
How often is the OpenAIRE graph published?
The ISG is published about once every two weeks, unless critical quality issues arise in the quality check phase.
Whenever minor issues occur, the ISG is published anyway and the issues are:
- tracked via the private ticketing system of the OpenAIRE technical team;
- reported to the affected data source, if they depend on the originally collected content;
- briefly described in the table above, which keeps track of the index and statistics updates.
 Manghi P. et al. (2014) "The D-NET software toolkit: A framework for the realization, maintenance, and operation of aggregative infrastructures", Program, Vol. 48 Issue: 4, pp.322-354, https://doi.org/10.1108/PROG-08-2013-0045
 Check the data provider page (https://www.openaire.eu/search/data-providers) for the complete list of sources
 Bolikowski L. (2015) Text mining services in OpenAIRE: https://blogs.openaire.eu/?p=88
 OpenAIRE claiming functionality: https://www.openaire.eu/participate/claim
 The OpenAIRE acquisition policy: https://www.openaire.eu/content-acquisition-policy
 Check which funders are affiliated with OpenAIRE: https://www.openaire.eu/search/find#projects
 Atzori, Claudio, Bardi, Alessia, Manghi, Paolo, & Mannocci, Andrea. (2017). The OpenAIRE workflows for data management. Zenodo. http://doi.org/10.5281/zenodo.996006
 Manghi P. (2015) On de-duplication in the OpenAIRE infrastructure: https://blogs.openaire.eu/?p=116
 OpenAIRE API documentation: http://api.openaire.eu
 OpenAIRE Linked Open Data: http://lod.openaire.eu/documentation
 Mannocci, A., & Manghi, P. (2016, September). DataQ: A Data Flow Quality Monitoring System for Aggregative Data Infrastructures. In International Conference on Theory and Practice of Digital Libraries (pp. 357-369). Springer International Publishing. https://doi.org/10.1007/978-3-319-43997-6_28