Aggregation and content provision workflows


OpenAIRE materializes an open, participatory research graph (the OpenAIRE Research Graph) where products of the research life-cycle (e.g. scientific literature, research data, projects, software) are semantically linked to each other and carry information about their access rights (i.e. whether they are Open Access, Restricted, Embargoed, or Closed) and about the sources from which they have been collected and where they are hosted. The OpenAIRE Research Graph is materialized via a set of autonomic, orchestrated workflows operating in a regime of continuous data aggregation and integration [1].

 

What does OpenAIRE collect?

The OpenAIRE technical infrastructure collects information about objects of the research life-cycle compliant with the OpenAIRE acquisition policy [5] from different types of data sources [2]:
  1. Scientific literature metadata and full-texts from institutional and thematic repositories, Open Access journals and publishers;
  2. Dataset metadata from data repositories and data journals;
  3. Scientific literature, data and software metadata from Zenodo;
  4. Metadata about data sources, organizations, projects, and funding programs from entity registries, i.e. authoritative sources such as CORDA and other funder databases for projects, OpenDOAR for publication repositories, re3data for data repositories, DOAJ for Open Access journals;
  5. Coming soon: metadata of open source research software from software repositories (currently available only on https://beta.explore.openaire.eu/)
  6. Coming soon: metadata about other types of research products, like workflow, protocols, methods, research packages (currently available only on https://beta.explore.openaire.eu/)
  7. Coming soon: metadata about scientific literature, datasets, persons, organisations, projects, funding, equipment and services collected through CRIS (Current Research Information Systems) (currently available only on https://beta.explore.openaire.eu/)
Relationships between objects are collected from the data sources, but also automatically detected by inference algorithms [3] and added by authenticated users, who can insert links between literature, datasets, software and projects via the “Link” procedure available from the OpenAIRE web portal [4].

What kind of data sources are in OpenAIRE?

Objects and relationships in the OpenAIRE Research Graph are extracted from information packages, i.e. metadata records, collected from data sources of the following kinds:
  • Institutional or thematic repositories: Information systems where scientists upload the bibliographic metadata and full-texts of their articles, due to obligations from their organization or due to community practices (e.g. arXiv, Europe PMC);
  • Open Access publishers and journals: Information systems of open access publishers or their journals, which offer bibliographic metadata and PDFs of their published articles;
  • Data archives: Information systems where scientists deposit descriptive metadata and files about their research data (also known as scientific data, datasets, etc.);
  • Hybrid repositories/archives: Information systems where scientists deposit metadata and files of any kind of scientific product, including scientific literature, research data and research software (e.g. Zenodo);
  • Aggregator services: Information systems that collect descriptive metadata about publications or datasets from multiple sources in order to enable cross-data source discovery of given research products. Examples are DataCite, BASE, DOAJ;
  • Entity Registries: Information systems created with the intent of maintaining authoritative registries of given entities in the scholarly communication, such as OpenDOAR for the institutional repositories, re3data for the data repositories, CORDA and other funder databases for projects and funding information;
  • CRIS (coming soon): Information systems adopted by research and academic organizations to keep track of their research administration records and related results; examples of CRIS content are articles or datasets funded by projects, their principal investigators, facilities acquired thanks to funding, etc.;
  • Information spaces: Services that maintain an information space of (possibly interlinked) scholarly communication objects. Examples are CrossRef, ScholeXplorer and OpenAIRE itself.

OpenAIRE and the Content Acquisition Policies

As of July 2019, OpenAIRE aggregates more than 30 million metadata records from more than 16,000 data sources.

Until October 2018, OpenAIRE collected metadata records under a strict Content Acquisition Policy (CAP), which allowed the Research Graph to include only:

  • Metadata about publications that are Open Access and/or linked to a project of one of the supported funders;
  • Metadata about datasets that are linked to at least one of the publications of the previous point.

The old policy described above clearly introduced a bias in the metadata available in the OpenAIRE Research Graph. With the new CAP, the OpenAIRE Research Graph can also include metadata about publications with restricted or closed access rights, even if they are not linked to any of the supported funders. Likewise, the new CAP allows the OpenAIRE Research Graph to contain metadata about research data, regardless of their access rights or relationships with scientific literature.
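
The difference between the old and the new policy can be illustrated with a minimal sketch. The record fields (`access_right`, `funders`, `linked_publications`) and the set of supported funders are hypothetical names chosen for this example and do not reflect the actual OpenAIRE data model; the sketch only covers the two product types discussed above.

```python
from dataclasses import dataclass, field

# Hypothetical set of supported funders, used only for this illustration.
SUPPORTED_FUNDERS = {"EC", "SNSF"}


@dataclass
class Record:
    """Simplified metadata record; all field names are assumptions."""
    identifier: str
    kind: str                      # "publication" or "dataset"
    access_right: str = "CLOSED"   # "OPEN", "RESTRICTED", "EMBARGO" or "CLOSED"
    funders: set = field(default_factory=set)
    linked_publications: set = field(default_factory=set)


def accepted_by_old_cap(record, accepted_publication_ids):
    """Old policy: OA and/or funded publications, plus datasets linked to them."""
    if record.kind == "publication":
        return record.access_right == "OPEN" or bool(record.funders & SUPPORTED_FUNDERS)
    if record.kind == "dataset":
        return bool(record.linked_publications & accepted_publication_ids)
    return False


def accepted_by_new_cap(record):
    """New policy: publications and research data regardless of access rights or links."""
    return record.kind in {"publication", "dataset"}


closed_paper = Record("p1", "publication", access_right="CLOSED")
orphan_dataset = Record("d1", "dataset")
for rec in (closed_paper, orphan_dataset):
    print(rec.identifier, accepted_by_old_cap(rec, set()), accepted_by_new_cap(rec))
```

Under these assumptions, a closed-access publication without funding and a dataset without links to publications are both rejected by the old rule and accepted by the new one.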

Moving from the old CAP to the new CAP is not a "one-shot" process: the OpenAIRE aggregation team is applying the new CAP incrementally, involving repository managers when needed. An overview of the implications of the new CAP is available at https://beta.explore.openaire.eu.

How does OpenAIRE collect metadata records?

OpenAIRE collects metadata records describing objects of the research life-cycle from content providers compliant with the OpenAIRE guidelines and from entity registries (i.e. data sources offering authoritative lists of entities, like OpenDOAR, re3data, DOAJ, and funder databases).

The OpenAIRE aggregator collects metadata records in the majority of cases via OAI-PMH, but also supports other standard exchange protocols like FTP(S), SFTP, and RESTful APIs.
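
As an illustration of what an OAI-PMH harvest involves, the following minimal sketch builds `ListRecords` request URLs and reads the record identifiers and the resumption token from one response page. It is not the OpenAIRE aggregator code: the base URL, the set name and the sample response are made up for the example, and no network request is performed.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # resumptionToken is an exclusive argument: only the verb may accompany it.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def parse_page(xml_text):
    """Return (record identifiers, resumption token or None) for one response page."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text for el in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


# Made-up response page used instead of a real repository.
SAMPLE_PAGE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("https://repo.example.org/oai", set_spec="openaire"))
identifiers, token = parse_page(SAMPLE_PAGE)
print(identifiers, token)
print(list_records_url("https://repo.example.org/oai", resumption_token=token))
```

A harvester would repeat the request with each returned resumption token until a page arrives without one, which signals the end of the list.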

After collection, metadata are transformed according to the OpenAIRE internal metadata model, which is used to generate the final OpenAIRE Research Graph that you can access from the OpenAIRE portal and the APIs.

For additional details about the aggregation workflows, please refer to [7].

What does OpenAIRE do to enrich the collected metadata records?

Once the Research Graph is populated, OpenAIRE performs de-duplication of organizations and publications [8] and runs inference algorithms [3] to enrich the graph with additional information extracted from the publications' full-texts, namely:
  • Subjects
  • Links to datasets
  • Links to projects
  • Links to research communities and infrastructures
  • Links to publications (i.e. similar publications)
  • Links to software
  • Links to biological entities (e.g. PDB)
  • Citations
All other information (e.g. access rights, titles, authors, URLs to web resources) is collected from data sources. Whenever the de-duplication algorithm finds duplicates of the same publication, all information from all of the duplicates is kept. OpenAIRE keeps track of the provenance of information (i.e. whether it has been inferred by a mining algorithm, claimed by an authenticated portal user, or was present in the metadata record collected from a data source).
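
The idea of keeping all information from duplicates together with its provenance can be sketched as follows. This is an illustrative simplification, not the OpenAIRE de-duplication algorithm [8]: the record layout, the field names and the provenance labels are assumptions made for the example, and the records are taken to be already identified as duplicates.

```python
from collections import defaultdict


def merge_duplicates(records):
    """Merge records already known to be duplicates, keeping every value and its provenance.

    Each record is a dict with a "provenance" label and a "fields" dict that maps
    a field name to a list of values (an assumed layout for this sketch).
    """
    merged = defaultdict(list)
    for record in records:
        for name, values in record["fields"].items():
            for value in values:
                entry = {"value": value, "provenance": record["provenance"]}
                if entry not in merged[name]:  # skip exact repeats only
                    merged[name].append(entry)
    return dict(merged)


duplicates = [
    {"provenance": "collected:repo-a", "fields": {"title": ["On Graphs"], "subject": ["math"]}},
    {"provenance": "collected:repo-b", "fields": {"title": ["On Graphs"], "url": ["https://example.org/1"]}},
    {"provenance": "inferred", "fields": {"project": ["example-project"]}},
]

merged = merge_duplicates(duplicates)
for name, entries in merged.items():
    print(name, entries)
```

In this sketch the same title appears twice in the merged record because it was contributed by two different sources, so no provenance information is lost.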


How is the enriched OpenAIRE Research Graph published?

The deduplicated and enriched graph is materialized by the data publishing workflow into four projections:
  1. a full-text index to support search and browse queries from the OpenAIRE portal and to expose subsets of the graph on the OpenAIRE search API [9],
  2. an E-R database and a dedicated key-value cache for statistics,
  3. a NoSQL document storage in order to support OAI-PMH bulk export of subsets of the graph in XML format [9],
  4. a triple store in order to expose the graph as LOD via a SPARQL endpoint (currently in beta) [10]
Every time the data publishing workflow executes, four new projections are generated and placed in a “pre-public” status before becoming accessible to the general public.
The switch from pre-public to public, in which the currently accessible projections and statistics are retired and the new versions take their place, is still performed manually for safety reasons.
Pre-public projections are subject to a set of semi-automatic checks for quality control [11].
These quality checks are needed to evaluate whether the switch to public can be performed or whether regressions in the overall data quality need to be addressed first.
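
One simple kind of regression check compares record counts between the public and the pre-public projections. The sketch below is only an assumption about what such a check could look like; the actual checks are described in [11], and the function name, the 5% threshold and the counts are hypothetical.

```python
def find_regressions(public_counts, pre_public_counts, max_drop=0.05):
    """Return the entity types whose record count dropped by more than max_drop (a fraction)."""
    regressions = {}
    for entity_type, old_count in public_counts.items():
        new_count = pre_public_counts.get(entity_type, 0)
        if old_count > 0:
            drop = (old_count - new_count) / old_count
            if drop > max_drop:
                regressions[entity_type] = round(drop, 3)
    return regressions


# Made-up counts for illustration.
public = {"publication": 1000, "dataset": 200}
pre_public = {"publication": 990, "dataset": 150}

issues = find_regressions(public, pre_public)
print(issues)  # {'dataset': 0.25}
```

An empty result would suggest that, for this particular check, nothing stands in the way of the switch; any flagged entity type would need to be investigated first.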

How often is the OpenAIRE Research Graph published?

The OpenAIRE Research Graph is published about once every two weeks unless critical quality issues arise in the quality check phase.

Whenever minor issues occur, the graph is published anyway and details about the issues are:
  • tracked via the private ticketing system of the OpenAIRE technical team;
  • notified to the affected data source, if the issue depends on the original collected content;
  • briefly described in the table below, which keeps track of the index and statistics updates.

References

[1] Manghi P. et al. (2014) "The D-NET software toolkit: A framework for the realization, maintenance, and operation of aggregative infrastructures", Program, Vol. 48 Issue: 4, pp.322-354, https://doi.org/10.1108/PROG-08-2013-0045

[2] Check the data provider page (https://explore.openaire.eu/search/find/dataproviders) for the complete list of sources

[3] Bolikowski L. (2015) Text mining services in OpenAIRE: https://blogs.openaire.eu/?p=88

[4] OpenAIRE claiming functionality: https://explore.openaire.eu/participate/claim

[5] The OpenAIRE content acquisition policy: https://www.openaire.eu/content-acquisition-policy

[6] Check which funders are affiliated with OpenAIRE: https://www.openaire.eu/search/find#projects

[7] Atzori, Claudio, Bardi, Alessia, Manghi, Paolo, & Mannocci, Andrea. (2017). The OpenAIRE workflows for data management. Zenodo. http://doi.org/10.5281/zenodo.996006

[8] Manghi P. (2015) On de-duplication in the OpenAIRE infrastructure: https://blogs.openaire.eu/?p=116

[9] OpenAIRE API documentation: http://api.openaire.eu

[10] OpenAIRE Linked Open Data: http://lod.openaire.eu/documentation

[11]  Mannocci, A., & Manghi, P. (2016, September). DataQ: A Data Flow Quality Monitoring System for Aggregative Data Infrastructures. In International Conference on Theory and Practice of Digital Libraries (pp. 357-369). Springer International Publishing. https://doi.org/10.1007/978-3-319-43997-6_28

Index and stats update:

Next update scheduled to start on: 2019-08-19

Available on the portal  Start date  Notes
2019-07-30               2019-07-24  Statistics have not been updated.
2019-07-15               2019-07-10  Statistics have not been updated.
2019-06-23               2019-06-19  Statistics have not been updated.
N/A                      2019-06-03  Records from Arxiv.org re-harvested. De-duplication and inference algorithms are running (info added on 2019-06-06). Content could not be published because of some quality issues.
N/A                      2019-05-23  Content not published due to a loss of metadata records from Arxiv.org.
2019-05-16               N/A         Updated statistics on monitor.openaire.eu.
2019-05-14               2019-05-06  Statistics have not been updated. Upgraded Solr server in use since May 22nd 2019.
2019-04-08               2019-04-03  Statistics have not been updated.
2019-03-28               2019-03-11  Statistics have not been updated.
2019-02-28               2019-02-20
2019-02-11               2019-01-28  New funder available: Academy of Finland (AKA). Publishing has been delayed to fix a temporary loss of links to SNSF projects.
2019-01-04               2018-12-27
2018-12-13               2018-12-04  All types of research products have been de-duplicated.
2018-11-19               2018-11-12  Content from Portuguese repositories re-aggregated. Inference and de-duplication algorithms have been re-run to solve the issues about lost links. As a consequence of the new algorithm run and of the increase of available full-texts, we note a general increase of links to projects of all funders.
N/A                      2018-10-30  Content generated cannot be published as we noticed a loss of records from Portuguese repositories (they are not exposing the openaire OAI set anymore) and a loss of links to projects involving more than 200 repositories. The technical team is analysing the information space and investigating the issues.
2018-10-16               2018-10-10  Updated mapping for new research object types.
2018-10-10               2018-10-01
2018-09-10               2018-08-28  The harmonisation of SNSF publication metadata is still ongoing.
2018-08-01               2018-07-27  We noticed a decrease of SNSF publications due to a change in the resource types in the records collected from the SNSF P3 publication database. This will be fixed in the next update.
2018-07-10               2018-06-27  Updated version of OpenAIRE mining algorithms processed more than 400K additional full-texts from Springer Open Access.
2018-06-08               2018-06-05
2018-05-28               2018-05-22
2018-05-15               2018-05-08  More than 200K additional full-texts processed by OpenAIRE mining algorithms.
2018-04-16               2018-04-10
2018-03-30               2018-03-26  New research community: Research Data Alliance.
2018-03-20               2018-03-13  Content update delayed because of technical issues.
2018-02-20               2018-02-16
2018-02-09               2018-02-05
2018-01-30               2018-01-17  Updated version of mining algorithm. Updated data model (see details at https://www.openaire.eu/openaire-xml-schema-change-announcement).
2017-12-28               2017-12-22
2017-12-15               2017-12-11  FCT naming not fixed yet.
2017-11-27               2017-11-19  Portuguese funder FCT appears twice, once with a wrong name.
2017-11-13               2017-11-03  Added projects of funders RCUK and Turkey.

 

 

Making your repository Open

Guides for Content Providers


This guide is a companion Open Science (OS) checklist for Content Providers on how to license repositories. It is meant to offer a state-of-the-art, legally advanced, but still manageable set of rules, guidelines, and resources to enable the full potential of OS in the EU research field, with a view to addressing copyright and related rights issues.

Contact us via our Helpdesk. We try to respond within 48 hours.

OpenAIRE Guidelines and Application Profile for repository managers and publication platforms 4.0: More Detail - More Connectivity

OpenAIRE is a network of joined-up repositories ensuring a streamlined infrastructure to support open access across Europe. Over the past 10 years, from DRIVER to the subsequent OpenAIRE Guidelines, the European repository community has ensured that repositories expose bibliographic metadata in a standardized manner. The approach has always been based on using established formats (oai_dc) and transfer protocols (OAI-PMH), and on applying them uniformly via coordinated guidelines.


Why does this matter?

Repository contents shouldn’t remain hidden. By sharing content, more researchers can reuse it. However, it doesn’t stop there: since publications are not a finite part of the research process, enriched contextual information can be a valuable addition to the bibliographic record. Over the last 10 years, a range of additional opportunities and requirements have been developed by the repository community, including information on funders and projects, access and license conditions, embargo periods, persistent identifiers and links to other research products. OpenAIRE has worked hard to reflect these elements in its new guidelines.


It’s all in the Detail


The new Guidelines v4 take an important step. They replace the Dublin Core format used in OAI-PMH and define an application profile based on metadata properties from Dublin Core, DataCite and OpenAIRE. This ensures the following:


  • more granularity of bibliographic information leads to more (semantic) accuracy,
  • (persistent) identifiers to all relevant entities of research information can be provided consistently (research products, authors, contributors, organizations, research sponsors and projects),
  • meaningful and machine-interpretable relations between entities or web resources can be specified,
  • the bibliographic citation can be generated from its individual attributes (series title, volume, issue, startpage, endpage etc.) and exported in different formats and citation styles, and
  • controlled vocabularies from OpenAIRE, COAR, DataCite, and other initiatives can be encoded, thus improving interoperability with other repository networks, including LA Referencia and the Japan Consortium for Open Access Repository (JPCOAR).

As a result, OpenAIRE as a scholarly communication infrastructure can create and provide a rich, higher-quality information space graph on research products and their authors and contributors.
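
To illustrate why separate citation attributes are useful, the following minimal sketch assembles a citation string from individual fields. The function, its parameter names and the output style are assumptions for this example and are not defined by the Guidelines; a different citation style would only need a different formatting function over the same attributes.

```python
def format_citation(authors, year, title, journal,
                    volume=None, issue=None, start_page=None, end_page=None):
    """Assemble a simple citation string from individual bibliographic attributes."""
    parts = [f"{', '.join(authors)} ({year}). {title}. {journal}"]
    if volume:
        locator = str(volume)
        if issue:
            locator += f"({issue})"
        parts.append(locator)
    if start_page and end_page:
        parts.append(f"{start_page}-{end_page}")
    elif start_page:
        parts.append(str(start_page))
    return ", ".join(parts) + "."


# Made-up bibliographic data for illustration.
citation = format_citation(
    authors=["A. Author", "B. Writer"],
    year=2019,
    title="An Example Title",
    journal="Example Journal",
    volume=12,
    issue=3,
    start_page=45,
    end_page=67,
)
print(citation)
# A. Author, B. Writer (2019). An Example Title. Example Journal, 12(3), 45-67.
```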


After a review by the repository community in recent months, release v4 of the guidelines and the application profile is now finalized. It marks the first step, to be followed by the implementation phase.


OpenAIRE cooperates with Duraspace, among others, to align compliance of repository software with the OpenAIRE Guidelines.

Accordingly, in the coming weeks, OpenAIRE services will be adapted to the new guidelines, such as the validator and the format updates for funder and project information for DSpace and EPrints in the OpenAIRE API.


Contact

V4 of the OpenAIRE Guidelines for institutional and thematic repository managers will be released in early July under doi:10.5281/zenodo.1299203.

OpenAIRE Guidelines for Literature Repository Managers v 4.0 are now available!

After a long consultation period, we are pleased to introduce the new version of the OpenAIRE Guidelines for Literature repositories.
The Guidelines are intended to guide repository managers to expose to OpenAIRE open access and non-open access publications together with funding information, where applicable.

They have been gradually improved over the past year, having incorporated comments and feedback received from partners and other initiatives, and the recent developments in OpenAIRE, such as the new Content Acquisition Policy.

What's new?
The Guidelines for Literature Repository Managers v 4.0 introduce the following major changes:
  • Application profile and schema based on Dublin Core and DataCite, including a new OAI-metadataPrefix
  • Support of identifier schemes for authors, organisations, funders, and scholarly resources
  • Introduction of COAR Controlled Vocabularies
  • Compliance with the OpenAIRE Content Acquisition Policy
How does this affect you?
By implementing the OpenAIRE Guidelines you enable authors who deposit publications in your repository to fulfill the EC Open Access requirements and, eventually, the requirements of other (national or international) funders with whom OpenAIRE cooperates. It also becomes possible to incorporate publications into the OpenAIRE infrastructure for discoverability and to use the value-added services provided by the OpenAIRE portal.
What's next?
In the coming months, OpenAIRE will foster a swift implementation of the Guidelines in the two major repository platforms, DSpace and EPrints.

OpenAIRE Provide Dashboard

Marking half a year of achievements, production and improvements.

OpenAIRE-Advance receives funding from the European Union's Horizon 2020 Research and Innovation programme under Grant Agreement No. 777541.
  Unless otherwise indicated, all materials created by OpenAIRE are licenced under CC ATTRIBUTION 4.0 INTERNATIONAL LICENSE.