Linknovate knowledge base enrichment and curation via OpenAIRE Graph
Leveraging the OpenAIRE Graph to eliminate duplicated profiles and enhance organization information within the Linknovate platform
Linknovate provides an “innovation search engine” to the R&D and strategy divisions of different kinds of organizations. We have incorporated and structured more heterogeneous data sources than any other solution (publications, patents, funding data, specialized news, web monitoring…), allowing our clients to collectively monitor these “innovation signals”. This translates into time savings and improved internal communication. In this regard, enriching our datasets should boost the scouting capabilities of our users across industries.
Challenge & Scenario
Linknovate maintains its own data about public and private organizations, and one of our main recurring issues is organization deduplication. Deduplicating organizational entities goes beyond deduplicating research outputs, because the way organizations are mentioned varies and they are often not represented by a single identifier (ID). To show all the innovation happening within an organization, it is critical that the system understands that, e.g., “Linknovate” is the same organization as “Linknovate Science”. Sometimes this is not trivial even for a human with knowledge of the company’s industry and geographic area. Furthermore, a complete profile is essential for good recommendations of similar organizations in Startup Radar.
Solution & Implementation
OpenAIRE has a robust database of organizations maintained with the assistance of a team of data curators. These curators, often affiliated with national open access desks, work to assign stable IDs, specifically OpenAIRE IDs, to these organizations. An algorithm suggests affiliations for these organizations, and the curators play a critical role in accepting or rejecting these suggestions. They are well-versed in the landscape of universities and research institutions, which helps them assign IDs and affiliations accurately and keeps the data coherent and reliable.
Therefore, at Linknovate we decided to improve the organization profiles in our platform, leveraging the OpenAIRE data to curate at least the profiles of public organizations. We downloaded the latest version of the OpenAIRE organizations dataset, which contains 311,492 profiles. Crossing that dataset with ours, we managed to match common profiles and enrich or curate the following aspects: merging duplicated profiles; adding or correcting the corporate website address; adding or correcting the location information; and adding standard IDs (like ROR, GRID…) to the company profiles.
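The crossing step can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the matching procedure actually used: the record fields (`id`, `name`, `aliases`, `website`) and the helper names are hypothetical, and matching on a normalized website host first, then on a normalized name or alias, is just one plausible strategy.

```python
import re
import unicodedata
from urllib.parse import urlparse


def website_key(url):
    """Reduce a URL to a comparable host: lowercase, no port, no leading 'www.'."""
    if not url:
        return None
    # urlparse only fills netloc when '//' is present, so add it if missing.
    parsed = urlparse(url if "//" in url else "//" + url)
    host = parsed.netloc.lower().split(":")[0]
    if host.startswith("www."):
        host = host[4:]
    return host or None


def name_key(name):
    """Normalize a name: strip accents, lowercase, collapse punctuation and spaces."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def build_index(reference_records):
    """Index reference records by website host and by every known name or alias."""
    by_site, by_name = {}, {}
    for rec in reference_records:
        site = website_key(rec.get("website"))
        if site:
            by_site.setdefault(site, rec["id"])
        for name in [rec["name"], *rec.get("aliases", [])]:
            by_name.setdefault(name_key(name), rec["id"])
    return by_site, by_name


def match(local_records, reference_records):
    """Map local profile IDs to reference IDs (website first, then name)."""
    by_site, by_name = build_index(reference_records)
    matches = {}
    for rec in local_records:
        ref_id = by_site.get(website_key(rec.get("website")))
        if ref_id is None:
            ref_id = by_name.get(name_key(rec["name"]))
        if ref_id is not None:
            matches[rec["id"]] = ref_id
    return matches


# Hypothetical sample data, for illustration only.
reference = [
    {"id": "ref-1", "name": "Example University",
     "aliases": ["Univ. Example"], "website": "https://www.example.edu"},
]
local = [
    {"id": "a", "name": "Univ Example", "website": None},
    {"id": "b", "name": "Example Univ. Research Lab", "website": "http://example.edu/lab"},
    {"id": "c", "name": "Unrelated Ltd", "website": "unrelated.example"},
]
matches = match(local, reference)
print(matches)
```

In this toy data, profiles `a` and `b` both resolve to `ref-1`, which flags them as candidates for merging or aliasing. Matching on the website host alone can conflate an organization with its sub-units, so in practice such matches would likely need review.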
Crossing the OpenAIRE Graph organizations dataset with ours, we matched 88,954 common profiles. Using the information in these common profiles, we were able to: identify a total of 62,094 organization profiles that required either merging or the addition of aliases within our platform; verify the website information for 12,811 existing profiles within our platform; identify 18,792 organizations that previously lacked geographic information; and compile the various standard IDs (ROR, GRID...) included in the OpenAIRE dataset for all 88,954 organizations matched in our platform.
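Once local profiles are mapped to reference records, the curation outcomes listed above can be derived mechanically. The following Python sketch shows one way this might look. It is an assumption-laden illustration, not our production code: the field names (`website`, `country`, `pids`), the helper names, and the identifier value are all hypothetical placeholders, and the sketch only fills gaps rather than correcting existing values.

```python
from collections import defaultdict


def merge_groups(matches):
    """Group local profile IDs that map to the same reference ID.

    Only groups with more than one local profile are returned, since
    those are the candidates for merging or aliasing.
    """
    groups = defaultdict(list)
    for local_id, ref_id in matches.items():
        groups[ref_id].append(local_id)
    return {ref: sorted(ids) for ref, ids in groups.items() if len(ids) > 1}


def enrich(profile, reference):
    """Return a copy of `profile` with gaps filled from `reference`,
    plus the list of fields that changed."""
    enriched = dict(profile)
    changed = []
    # Fill missing website or location; existing values are left untouched.
    for field in ("website", "country"):
        if not enriched.get(field) and reference.get(field):
            enriched[field] = reference[field]
            changed.append(field)
    # Attach standard identifiers that the local profile does not have yet.
    old_pids = enriched.get("pids", {})
    new_pids = dict(old_pids)
    for scheme, value in reference.get("pids", {}).items():
        new_pids.setdefault(scheme, value)
    if new_pids != old_pids:
        changed.append("pids")
    enriched["pids"] = new_pids
    return enriched, changed


# Hypothetical sample data, for illustration only.
matches = {"a": "ref-1", "b": "ref-1", "c": "ref-2"}
groups = merge_groups(matches)

profile = {"id": "a", "name": "Univ Example", "website": None, "country": None}
reference = {
    "id": "ref-1",
    "website": "https://www.example.edu",
    "country": "ES",
    "pids": {"ROR": "placeholder-ror-id"},  # placeholder, not a real identifier
}
enriched, changed = enrich(profile, reference)
print(groups)
print(changed)
```

Counting the profiles in `groups` and the profiles whose `changed` list contains a given field would yield tallies of the same kind as the figures reported above. Because `enrich` returns a copy, the original profile stays available for review before any change is committed.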
Linknovate is a tech-scouting and competitive intelligence platform, helping companies detect technological trends and emerging markets, as well as the organizations behind them. Our search engine aggregates and analyses millions of academic and industrial sources (scientific publications, conferences, grants, specialized news, patents, web monitoring) and provides updated news and information about the latest trends in specific sectors such as energy, healthcare, finance, and more. Linknovate brings AI to corporate innovation and facilitates internal team communication around innovation scouting and monitoring. Through data mining and ML we help our clients detect innovation activity of competitors, partners, providers and newcomers (e.g. startups).
The OpenAIRE Graph is a service that populates a graph of research metadata and provides access to it (via APIs or a downloadable dump on Zenodo) for the research community, SMEs, and Research Infrastructures. The Graph includes metadata and links between scientific products (e.g., literature, datasets, software, "other research products"), organizations, funders, funding streams, projects, communities, and (provenance) data sources. In IntelComp, the Graph is the core element of the IntelComp Data Lake, which, along with other types of datasets (internal from partners, from social media, from websites, from third parties like EC, EPO, etc.), forms a rich pool of data to use, combine, compare, analyze, and view.