Case studies

Linknovate knowledge base enrichment and curation via OpenAIRE Graph

Leveraging the OpenAIRE Graph to eliminate duplicated profiles and enhance organization information within the Linknovate platform

Overview

Linknovate provides an "innovation search engine" to the R&D and strategic divisions of different kinds of organizations. We have incorporated and structured more heterogeneous data sources than any other solution (publications, patents, funding data, specialized news, web monitoring…), allowing our clients to collectively monitor these "innovation signals". This translates into time savings and improved internal communication. In this regard, enriching our datasets should boost the scouting capabilities of our users across industries.

Challenge & Scenario

Linknovate has its own data about public and private organizations, and one of the main issues we regularly deal with is organization deduplication. Deduplication of organizational entities extends beyond research outputs, because mentions of these organizations vary and are often not tied to a single identifier (ID). To show all the innovation happening within an organization, it is critical that the system understands that, e.g., “Linknovate” is the same organization as “Linknovate Science”. Sometimes this is not trivial even for a human with knowledge of the company's industry and geographic area. Furthermore, a complete profile is essential for good recommendations of similar organizations in Startup Radar.

Solution & Implementation

OpenAIRE has a robust database of organizations maintained with the assistance of a team of data curators. These curators, often affiliated with national open access desks, assign stable identifiers (OpenAIRE IDs) to these organizations. An algorithm suggests affiliations for these organizations, and the curators accept or reject each suggestion. They are well versed in the landscape of universities and research institutions, which lets them assign IDs and affiliations accurately and keeps the data coherent and reliable.

Therefore, at Linknovate we decided to improve the organization profiles in our platform, using the OpenAIRE data at least to curate the profiles of public organizations. We downloaded the latest version of the OpenAIRE organizations dataset, which contains 311,492 profiles. Crossing that dataset with ours, we matched common profiles and enriched or curated the following aspects:

  • Merge duplicated profiles
  • Add or correct the corporate website address
  • Add or correct the location information
  • Add standard IDs (like ROR, GRID…) to the company profiles

Impact

Crossing the OpenAIRE Graph organizations dataset with ours, we matched 88,954 common profiles. Using the information from these common profiles, we have:

  • Identified a total of 62,094 organization profiles that required either merging or the addition of aliases within our platform.
  • Verified the website information for 12,811 existing profiles within our platform.
  • Identified 18,792 organizations that previously lacked geographic information.
  • Compiled the various standard IDs (ROR, GRID...) included in the OpenAIRE dataset for all the 88,954 organizations matched in our platform.

In depth description

Details

In this project, the Linknovate team analyzed the OpenAIRE Graph to explore additional data related to organizations involved in the research life-cycle, such as universities, research organizations, and funders. The primary focus was on addressing the challenge of duplicate entries for organizations within the Linknovate platform.

The first major task involved cleaning up and consolidating organization profiles. By comparing official names and alternative names from the OpenAIRE dataset, the team identified and merged duplicated profiles. They also added these alternative names as aliases to ensure accurate association with future records.
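As an illustration of this kind of name-based consolidation, here is a minimal sketch. It is not Linknovate's actual matching logic, which the case study does not describe in detail. The record layout (`id`, `legal_name`, `alternative_names`, `profile_id`, `name`), the normalization rules, and the sample organizations are all assumptions made for the example.

```python
import unicodedata
from collections import defaultdict


def normalize(name: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in no_accents.lower())
    return " ".join(cleaned.split())


def find_merge_groups(reference_orgs, platform_profiles):
    """Group platform profiles whose name matches the same reference organization.

    reference_orgs: dicts with "id", "legal_name", "alternative_names" (hypothetical layout)
    platform_profiles: dicts with "profile_id", "name" (hypothetical layout)
    Returns {reference id: [profile ids]} for matched profiles only.
    """
    name_to_ref = {}
    for org in reference_orgs:
        for name in [org["legal_name"], *org["alternative_names"]]:
            # The first reference organization to claim a normalized name keeps it.
            name_to_ref.setdefault(normalize(name), org["id"])

    groups = defaultdict(list)
    for profile in platform_profiles:
        ref_id = name_to_ref.get(normalize(profile["name"]))
        if ref_id is not None:
            groups[ref_id].append(profile["profile_id"])
    return dict(groups)


# Made-up sample data, for illustration only.
reference_orgs = [
    {
        "id": "ref-1",
        "legal_name": "Example Institute of Technology",
        "alternative_names": ["EIT", "Instituto Tecnológico de Ejemplo"],
    },
]
platform_profiles = [
    {"profile_id": 1, "name": "Example Institute of Technology."},
    {"profile_id": 2, "name": "instituto tecnologico de ejemplo"},
    {"profile_id": 3, "name": "Unrelated Company"},
]

groups = find_merge_groups(reference_orgs, platform_profiles)
print(groups)
```

Any group with more than one profile ID is a candidate for merging, and the reference organization's alternative names can then be stored as aliases on the surviving profile. Exact matching on normalized names is deliberately conservative; a real pipeline would probably need fuzzy matching and manual review for ambiguous acronyms.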

Next, Linknovate worked on enriching organization profiles in three key aspects:

  • Website Information: Some profiles in the Linknovate platform lacked valid website addresses. To address this, the team retrieved this information from the OpenAIRE dataset, enhancing the completeness of these profiles and refining organization descriptions (also very important in different features of the platform).
  • Location Information: Leveraging the geographic data from OpenAIRE, the team complemented the organization profiles with country-level information. This was particularly valuable for accurate location-based searches, addressing challenges with potentially inaccurate city-level details, especially for large organizations.
  • Organization IDs: Recognizing the importance of organization IDs for future actions and data integration, the team compiled various standard IDs (e.g., ROR, GRID) from the OpenAIRE dataset for the 88,954 organizations matched between both datasets.
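The three enrichment steps above can be sketched as a single merge function. This is a simplified sketch under stated assumptions: the field names (`website`, `country`, `external_ids`, `pids`) and the sample values are hypothetical, and it only fills in missing values, whereas the actual project also corrected existing ones.

```python
def enrich_profile(profile: dict, reference: dict) -> dict:
    """Return a copy of `profile` completed with data from a matched reference record.

    Existing values in the profile are kept; only gaps are filled.
    All field names here are hypothetical.
    """
    enriched = dict(profile)

    # Website: fill in only when the profile has none.
    if not enriched.get("website") and reference.get("website"):
        enriched["website"] = reference["website"]

    # Location: country-level information only.
    if not enriched.get("country") and reference.get("country"):
        enriched["country"] = reference["country"]

    # Standard IDs (e.g. ROR, GRID): add schemes the profile does not have yet.
    ids = dict(enriched.get("external_ids", {}))
    for scheme, value in reference.get("pids", {}).items():
        ids.setdefault(scheme, value)
    enriched["external_ids"] = ids

    return enriched


# Made-up records, for illustration only.
profile = {"name": "Example Institute of Technology", "website": "", "external_ids": {}}
reference = {
    "website": "https://www.example.org",
    "country": "ES",
    "pids": {"ROR": "ror-placeholder", "GRID": "grid-placeholder"},
}

enriched = enrich_profile(profile, reference)
print(enriched)
```

Returning a new dictionary rather than mutating the input keeps the original profile available for comparison, which helps when the changes need to be reviewed before they are written back.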

In essence, the project aimed to improve the overall quality, completeness, and accuracy of organization profiles within the Linknovate platform by leveraging the rich data available in the OpenAIRE dataset. This effort not only addressed deduplication challenges but also enhanced the platform's capability to provide more precise recommendations for similar organizations.

Finally, Linknovate concludes the project with a clearer understanding of the OpenAIRE Graph and its potential future uses. Although certain limitations prevented further progress for now, we hope it may eventually offer the opportunity to expand Linknovate's publications dataset.

Service in focus

OpenAIRE Graph

The OpenAIRE Graph is a service that populates the Graph and gives the research community, SMEs, and Research Infrastructures access to it (via APIs or a downloadable dump on Zenodo). The Graph includes metadata and links between scientific products (e.g., literature, datasets, software, "other research products"), organizations, funders, funding streams, projects, communities, and (provenance) data sources. In IntelComp, the Graph is the core element of the IntelComp Data Lake, which, along with other types of datasets (internal from partners, from social media, from websites, from third parties like EC, EPO, etc.), forms a rich pool of data to use, combine, compare, analyze, and view.

We want to hear from you

If you find the case study useful, contact us so we can guide you through the process.