Curating the OpenAIRE Graph: disambiguating Greek organisation metadata with OpenOrgs
Correcting data on Greek organisations inferred in the OpenAIRE Graph and related services
ATHENA RC is the National Admin of OpenOrgs and has taken an active part in efforts to support the disambiguation of Greek organisations with research activity. During this process, the list of pending organisations awaiting approval in OpenOrgs was observed to be larger than expected: approximately 500K entries, which is unusual for a country of Greece’s size. This observation led to a refinement of the disambiguation algorithm, which produced better results and made better-quality data available in the NKUA/UoA institutional dashboard on OpenAIRE MONITOR and in ARGOS Data Management Plans (DMPs).
Challenge & Scenario
OpenOrgs incorporates automated workflows to classify data entries ingested into the OpenAIRE Graph from external sources. During the ingestion of new information, the service automatically creates a metadata record for every ingested organisation, which is then merged with existing records that appear to describe the same organisation (duplicates). The resulting list of suggested new metadata records is provided to National Admins for curation, which strengthens quality assurance in the disambiguation process. The suggested list of Greek organisations was observed to be larger than expected and contained duplicates that had gone undetected because of the different alphabet. Although the algorithm performed very well on names in the Latin alphabet, changes were needed to make it more efficient and effective on the Greek alphabet. The issue quickly became evident in OpenAIRE services that rely on OpenOrgs information when users search for organisations. For example, in both the MONITOR institutional dashboard, which monitors the Open Science outcomes of NKUA, and ARGOS, which produces and publishes DMPs, the same organisation appeared many times in a drop-down menu. This confused users, who did not know which entry to choose, which one was valid, or which name was more accurate and trustworthy.
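To illustrate why a different alphabet can hide duplicates, the following is a minimal sketch of one possible normalisation step, not the actual OpenOrgs algorithm. It assumes a simplified, hypothetical Greek-to-Latin letter map (GREEK_TO_LATIN) and a hypothetical helper normalise_name; a real system would more likely rely on an established transliteration standard. The sketch only reconciles a Greek name with its Latin transliteration; it does not translate a Greek name into its English equivalent.

```python
import unicodedata

# Hypothetical, simplified Greek-to-Latin letter map for illustration only
# (not an official transliteration standard and not the OpenOrgs mapping).
GREEK_TO_LATIN = {
    "α": "a", "β": "v", "γ": "g", "δ": "d", "ε": "e", "ζ": "z", "η": "i",
    "θ": "th", "ι": "i", "κ": "k", "λ": "l", "μ": "m", "ν": "n", "ξ": "x",
    "ο": "o", "π": "p", "ρ": "r", "σ": "s", "ς": "s", "τ": "t", "υ": "y",
    "φ": "f", "χ": "ch", "ψ": "ps", "ω": "o",
}


def strip_accents(text):
    """Remove combining marks (e.g. the Greek tonos) from the text."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalise_name(name):
    """Lower-case, strip accents, transliterate Greek letters, tidy spaces."""
    lowered = strip_accents(name).lower()
    latin = "".join(GREEK_TO_LATIN.get(ch, ch) for ch in lowered)
    return " ".join(latin.split())


greek = "Πανεπιστήμιο Αθηνών"
latin = "Panepistimio Athinon"

# The raw strings share no characters, so a plain comparison sees two
# different organisations; after normalisation they compare equal.
print(greek == latin)
print(normalise_name(greek) == normalise_name(latin))
```

With names reduced to a common form like this, a similarity-based comparison has a chance of recognising the Greek and Latin spellings as candidates for the same organisation.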
Solution & Implementation
Via OpenOrgs, the algorithm was immediately modified to address these observations, and the suggested list of Greek organisations shrank: more duplicates were automatically detected and merged with existing records before National Admins began curating them. This improvement benefited services that use OpenOrgs to offer automated searching of the OpenAIRE Graph, such as MONITOR, OpenScienceObservatory, UsageCounts and CONNECT.
The disambiguation of NKUA/UoA with OpenOrgs facilitated better classification, linking (relations) and retrieval of more than 150K scientific publications, 13 projects, 2 research datasets and more than 30K other research products (reference: OpenAIRE EXPLORE).
The National and Kapodistrian University of Athens (NKUA), also known as the University of Athens (UoA), was founded in 1837 and is the first university of Greece, the Balkan peninsula and the Eastern Mediterranean region. ATHENA Research Center is the OpenAIRE NOAD in Greece.
OpenOrgs is a tool created to solve a long-standing problem: the disambiguation of organisations variously involved in the research process. In particular, OpenOrgs addresses the ambiguity affecting the information that OpenAIRE aggregates from different research organisation registries (e.g. ROR, EC) and that populates the OpenAIRE Research Graph. OpenOrgs combines automated processes and human curation. The deduplication algorithm does the first part of the work, grouping organisations with a certain degree of similarity in their metadata. After that, a process of manual curation corroborates the automated process. Data curators can resolve the ambiguity of duplicates detected by the automated process by stating whether or not two or more entities correspond to the same organisation. They can also suggest new duplicates that the algorithm has not found, thus improving the automated process. With these tasks, OpenOrgs users can compensate for the lack of information available and improve the organisations' discoverability.
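The combination of automated grouping and human curation described above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the OpenOrgs implementation: the similarity measure (Python's difflib.SequenceMatcher applied to names only), the 0.75 threshold, and the function names suggest_duplicates and group_records are all hypothetical choices made for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def suggest_duplicates(names, threshold=0.75):
    """Automated step: propose pairs whose names look sufficiently alike."""
    return [(a, b) for a, b in combinations(names, 2)
            if similarity(a, b) >= threshold]


def group_records(names, suggested, confirmed=(), rejected=()):
    """Curation step: merge suggested and curator-confirmed pairs,
    skipping any pair that a curator has explicitly rejected."""
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the group representative (with path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    rejected_pairs = {frozenset(pair) for pair in rejected}
    for a, b in list(suggested) + list(confirmed):
        if frozenset((a, b)) in rejected_pairs:
            continue
        parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())


names = ["University of Athens", "Univ. of Athens", "NKUA",
         "ATHENA Research Center"]
suggested = suggest_duplicates(names)
# A curator adds a duplicate that string similarity alone cannot find.
groups = group_records(names, suggested,
                       confirmed=[("NKUA", "University of Athens")])
print(groups)
```

In this simplified version a rejected pair is only skipped as a direct merge; two rejected records could still end up in the same group through a third record, which a production system would need to handle explicitly.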