Curating the OpenAIRE Graph: disambiguating Greek organisations metadata with OpenOrgs

Correcting data on Greek organisations inferred in the OpenAIRE Graph and related services

Overview

ATHENA RC is the National Admin of OpenOrgs and has taken an active part in efforts to disambiguate Greek organisations with research activity. During this process, the list of pending organisations (awaiting approval in OpenOrgs) was observed to be larger than expected, approximately 500K, which is unusual for a country of Greece’s size. This observation led to a refinement of the disambiguation algorithm, which produced better results and made better-quality data available in the NKUA/UoA institutional dashboard on OpenAIRE MONITOR and in ARGOS Data Management Plans (DMPs).

Challenge & Scenario

OpenOrgs incorporates automated workflows to classify data entries ingested into the OpenAIRE Graph from external sources. During the ingestion of new information, the service automatically creates a metadata record for every ingested organisation and then merges it with existing records that appear to describe the same organisation (duplicates). The suggested list of new metadata records is then passed to National Admins for curation, which strengthens quality assurance in the disambiguation process.

The suggested list of Greek organisations was observed to be larger than expected and contained undetected duplicates caused by the different alphabet. Although the algorithm performed very well on Latin-script names, changes were needed to make it equally efficient and effective on the Greek alphabet. The issue quickly became evident in the OpenAIRE services where users search for organisations using OpenOrgs information. For example, in both the MONITOR institutional dashboard, which monitors the Open Science outcomes of NKUA, and ARGOS, which produces and publishes DMPs, the same organisation appeared many times in a drop-down menu. This confused users, who could not tell which entry to choose, which one was valid, or which name was the most accurate and trustworthy.

Solution & Implementation

The algorithm behind OpenOrgs was promptly modified to address these observations, and the suggested list of Greek organisations shrank: more duplicates were automatically detected and merged with existing records before they reached National Admins for curation. This improvement benefited the services that use OpenOrgs to search the OpenAIRE Graph automatically, such as MONITOR, OpenScienceObservatory, UsageCounts and CONNECT.

Impact

The disambiguation of NKUA/UoA with OpenOrgs enabled better classification, linking (relations) and retrieval of 150K+ scientific publications, 13 projects, 2 research data items and 30K+ other research products (reference: OpenAIRE EXPLORE).

“It is our responsibility as librarians to curate (public) scientific knowledge to prevent misinformation and incorrect data from entering big infrastructure services like OpenAIRE. OpenOrgs offers a powerful tool to reach our goal.”
Elli Papadopoulou, NOAD for Greece

In depth description

Details

In February 2022, NKUA was developing a MONITOR institutional dashboard to track Open Science using OpenAIRE data from the Graph. NKUA appeared many times in OpenOrgs, which delayed both the deduplication of the organisation and its proper linking with other data about it in the institutional dashboard. The OpenOrgs National Admin observed that the list of suggested organisations to be disambiguated for Greece seemed never-ending and contacted the service manager. They investigated the issue using the NKUA example and found 35+ individual records in English and Greek: National and Kapodistrian University of Athens | Εθνικό και Καποδιστριακό Πανεπιστήμιο Αθηνών.

The algorithm needed to be refined to better recognise the Greek alphabet and, independently of the alphabet, to stop creating new records when a title was split at the "&" character that sometimes appears within it.
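To illustrate the kind of refinement described above, here is a minimal sketch in Python of a name normalisation step. It is not the OpenOrgs implementation, whose details are not given in this case study. The function name `normalize_org_name` and the list of connective words are assumptions made for illustration only. The sketch removes Greek accents, folds case, and treats "&", "and" and "και" as interchangeable, so that spelling variants of the same name yield the same comparison key.

```python
import unicodedata

# Hypothetical list of connectives treated as equivalent to "&".
CONNECTIVES = {"and", "και"}


def normalize_org_name(name: str) -> str:
    """Return a comparison key for an organisation name (illustrative helper)."""
    # Decompose accented letters, then drop the combining marks (e.g. ώ -> ω).
    decomposed = unicodedata.normalize("NFD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold so that upper- and lower-case variants compare equal.
    folded = stripped.casefold()
    # Replace "&" and any other non-alphanumeric character with a space,
    # so that the name is never split into separate records at "&".
    cleaned = "".join(ch if ch.isalnum() else " " for ch in folded)
    # Drop connective words and collapse repeated whitespace.
    tokens = [tok for tok in cleaned.split() if tok not in CONNECTIVES]
    return " ".join(tokens)


greek_mixed = "Εθνικό και Καποδιστριακό Πανεπιστήμιο Αθηνών"
greek_upper = "ΕΘΝΙΚΟ & ΚΑΠΟΔΙΣΤΡΙΑΚΟ ΠΑΝΕΠΙΣΤΗΜΙΟ ΑΘΗΝΩΝ"
same_key = normalize_org_name(greek_mixed) == normalize_org_name(greek_upper)
```

In this sketch the two Greek spellings produce the same key. It does not match a Greek name to its English translation; that would require additional information such as identifiers or curated aliases.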

The image below captures a snapshot of the curation process, showing the different duplicates available for approval or deletion.

[Image: NKUA duplicates in OpenOrgs]

This is the complete, curated and merged NKUA record in OpenOrgs:

[Image: complete NKUA record in OpenOrgs]

The provenance of every curation action is kept, and you can see what the merged organisation looks like afterwards in OpenAIRE EXPLORE:

[Image: NKUA names from different sources in OpenOrgs]

Disambiguation is a continuous activity. Every time OpenAIRE harvests a new source, new records of existing organisations may appear, and new duplicates may emerge for National Admins to curate.

Service in focus

OpenOrgs

OpenOrgs is a tool created to solve a long-standing problem: the disambiguation of organisations variously involved in the research process. In particular, OpenOrgs addresses the ambiguity affecting the information that OpenAIRE aggregates from different research organisation registries (e.g. ROR, EC) and that populates the OpenAIRE Research Graph. OpenOrgs combines automated processes and human curation. The deduplication algorithm does the first part of the work, grouping organisations whose metadata show a certain degree of similarity. After that, manual curation corroborates the automated process. Data curators can resolve the ambiguity of duplicates detected by the automated process by stating whether or not two or more entities correspond to the same organisation. They can also suggest new duplicates that the algorithm has not found, thus improving the automated process. With these tasks, OpenOrgs users can compensate for the lack of available information and improve the organisations' discoverability.
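The first, automated step can be pictured with a small Python sketch. This is an illustration under stated assumptions and not the OpenOrgs deduplication algorithm: the similarity measure (`difflib.SequenceMatcher` from the Python standard library), the `threshold` value of 0.85 and the function name `group_candidates` are all chosen here for the example. Names whose similarity reaches the threshold are placed in the same candidate group, which a curator would then confirm or reject.

```python
from difflib import SequenceMatcher
from itertools import combinations


def group_candidates(names, threshold=0.85):
    """Group names whose pairwise similarity reaches the threshold (illustrative)."""
    # Union-find structure: parent[i] points towards the representative of i.
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links to the representative, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(names)), 2):
        ratio = SequenceMatcher(
            None, names[i].casefold(), names[j].casefold()
        ).ratio()
        if ratio >= threshold:
            # Merge the two groups: these records are candidate duplicates.
            parent[find(j)] = find(i)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())


records = [
    "National and Kapodistrian University of Athens",
    "National & Kapodistrian University of Athens",
    "University of Crete",
]
candidate_groups = group_candidates(records)
```

Here the two NKUA spellings end up in one candidate group and the University of Crete stays on its own. Comparing every pair is quadratic in the number of records, so a production system would need a way to limit comparisons; the case study does not describe how OpenOrgs does this.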

We want to hear from you

If you find the case study useful, contact us so we can guide you through the process.