Data Quality & More: Highlights from the Second Open Insights Session

Feb 5, 2025

On January 23, 2025, we hosted the second session of the Open Insights Series, an initiative to engage Ireland’s Open Science community with the evolving capabilities of the National Open Access Monitor, Ireland. This session focused on the critical role of data quality, highlighting the foundational contributions of the OpenAIRE Graph and the transformative impact of text mining in enriching data in the case of Research Funding Organisations.

Data quality is more than a technical concern; it’s the backbone of the Monitor’s utility. This session showcased how robust data infrastructure, powered by the OpenAIRE Graph, underpins the Monitor’s functions, and how advanced text mining techniques are unlocking insights tailored to the needs of funders. Through presentations and discussion, we continued building a shared understanding of how the Monitor serves as a critical tool for Open Access compliance and decision-making.

If you missed the session, the recording and slides are available. We also invite you to join us for the third session of the series “Strengthening Collaboration: Monitor & Repository Integration” on February 13, 2025; you can register here.

Building a Reliable Data Backbone

The session began with a detailed exploration of the Monitor’s data infrastructure, led by Claudio Atzori, who presented the OpenAIRE Graph as the foundation of the Monitor’s capabilities. As Ireland’s national resource for Open Access monitoring, the Monitor relies on this vast and carefully curated graph to collect, validate, and enrich data from a variety of sources.

The OpenAIRE Graph’s contribution goes beyond aggregation; it integrates information about repositories, publications, projects, and organizations, making it possible to:

  • Maintain consistency through deduplication and disambiguation.
  • Map research outputs to funding sources.
  • Connect affiliations and institutions accurately.

Claudio illustrated how this data backbone draws on the aggregated bibliographic records and the corresponding Open Access full texts so that users can trust the insights generated by the Monitor. By resolving organizational inconsistencies and linking metadata across sources, among other processing steps, the Monitor delivers a solid foundation for Open Access analysis and reporting.
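
To make the idea of deduplication concrete, here is a minimal sketch in Python. It is not the OpenAIRE Graph's actual algorithm, which the session did not describe at code level. The record fields (`doi`, `title`, `year`, `sources`) and the matching rule (normalized DOI first, normalized title plus year as a fallback) are assumptions chosen purely for illustration.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def dedup_key(record: dict) -> tuple:
    """Prefer a normalized DOI; fall back to normalized title plus year."""
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return ("doi", doi)
    return ("title", normalize_title(record.get("title", "")), record.get("year"))


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep one record per key and merge the lists of contributing sources."""
    merged: dict[tuple, dict] = {}
    for record in records:
        key = dedup_key(record)
        if key not in merged:
            merged[key] = {**record, "sources": list(record.get("sources", []))}
        else:
            for source in record.get("sources", []):
                if source not in merged[key]["sources"]:
                    merged[key]["sources"].append(source)
    return list(merged.values())


# Hypothetical records: the first two describe the same publication.
records = [
    {"doi": "10.1234/ABC", "title": "Open Access in Ireland", "year": 2023, "sources": ["repo-a"]},
    {"doi": "10.1234/abc", "title": "Open access in Ireland.", "year": 2023, "sources": ["crossref"]},
    {"doi": None, "title": "A Different Paper", "year": 2022, "sources": ["repo-b"]},
]
unique = deduplicate(records)
print(len(unique), unique[0]["sources"])
```

A production pipeline would need more than this. For instance, this sketch never merges a record that carries a DOI with a copy of the same work that lacks one. It does show why a shared key lets records from different sources collapse into a single entry.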

Enhancing Insights for RFOs Through Text Mining

Next in the session was Harry Dimitropoulos’ presentation on how text mining capabilities are tailored specifically to the needs of funders. With millions of publications processed, approaches relying only on harvesting are not always enough to capture the nuances of funding acknowledgments. Text mining bridges this gap by extracting, refining, and validating funding data directly from publication text.

Harry demonstrated how the Monitor uses text mining to:

  • Precisely identify funding acknowledgments within research outputs, even when phrasing varies or funder names are abbreviated.
  • Enrich datasets by linking outputs to specific funders and projects, ensuring compliance with Open Science policies and mandates.
  • Support more granular reporting for funders, enabling them to monitor outcomes more effectively.
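
As a rough illustration of the first point, the sketch below matches funder names and abbreviations in an acknowledgment sentence. It is a deliberately simple, assumed approach (an alias table plus regular expressions), not the Monitor's actual mining pipeline. The alias table and the function names are hypothetical.

```python
import re

# Hypothetical alias table for illustration; not the Monitor's configuration.
FUNDER_ALIASES = {
    "SFI": ["Science Foundation Ireland", "SFI"],
    "IRC": ["Irish Research Council", "IRC"],
    "HRB": ["Health Research Board", "HRB"],
}


def compile_patterns(aliases: dict[str, list[str]]) -> dict[str, re.Pattern]:
    """Build one case-sensitive, whole-word pattern per funder."""
    patterns = {}
    for funder, names in aliases.items():
        # Longest alias first so full names are preferred over abbreviations.
        ordered = sorted(names, key=len, reverse=True)
        alternatives = "|".join(re.escape(name) for name in ordered)
        patterns[funder] = re.compile(rf"\b(?:{alternatives})\b")
    return patterns


PATTERNS = compile_patterns(FUNDER_ALIASES)


def find_funders(acknowledgment: str) -> set[str]:
    """Return the funder keys whose name or abbreviation appears in the text."""
    return {funder for funder, pattern in PATTERNS.items() if pattern.search(acknowledgment)}


text = "This work was supported by Science Foundation Ireland and the HRB."
print(sorted(find_funders(text)))
```

Name-only matching of this kind cannot tell apart organisations whose names overlap, which is one reason project-level identifiers matter.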

Through practical examples, Harry showed how text mining not only enhances the accuracy of the Monitor’s funding data but also adds value for funders looking to evaluate their investment in Open Access research. 

For example, Science Foundation Ireland (SFI) has partnered with the Monitor, providing project metadata that enables tailored text mining to identify specific projects acknowledged in publications. As a result, SFI benefits from precise insights into their funded outputs. For the Irish Research Council (IRC) and the Health Research Board (HRB), text mining at the funder level—without project metadata—has already significantly expanded the number of publications captured in their dashboards. Publication counts have nearly doubled for these funders, demonstrating the power of advanced mining techniques. HRB is also finalizing a Data Processing Agreement with OpenAIRE to enable project-level text mining, which will further enhance their insights.

The impact of project metadata is even more striking. Incorporating it turns text mining from a broad search into a targeted analysis, removing ambiguity and reducing false positives. For example, searching for a specific project identifier such as GOIPG/2013/1110 ensures that publications are accurately linked to funders’ investments. Without project metadata, relying solely on funder names can lead to misattributions, such as confusing the Health Research Board (HRB) in Ireland with similarly named organizations like the Commonwealth Health Research Board in the USA or the Public Health Research Board in England.
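
The contrast between the two strategies can be sketched as follows. This is an illustrative example, not OpenAIRE's implementation. The identifier pattern (uppercase letters, a four-digit year, then digits) is an assumed shape inferred from the single example above, and the list of known identifiers and the function names are hypothetical.

```python
import re

# Identifiers a funder might supply as project metadata (hypothetical list;
# the single entry is the example identifier quoted in the article).
KNOWN_PROJECT_IDS = {"GOIPG/2013/1110"}

# Assumed identifier shape for illustration: LETTERS/YYYY/DIGITS.
PROJECT_ID_PATTERN = re.compile(r"\b[A-Z]{2,10}/\d{4}/\d{1,6}\b")

# Name-only matching, excluding the two look-alike names mentioned in the text.
HRB_NAME_PATTERN = re.compile(r"(?<!Commonwealth )(?<!Public )Health Research Board")


def find_known_projects(text: str) -> set[str]:
    """Project-level matching: keep only identifiers present in the supplied metadata."""
    candidates = set(PROJECT_ID_PATTERN.findall(text))
    return candidates & KNOWN_PROJECT_IDS


def mentions_hrb_by_name(text: str) -> bool:
    """Funder-level matching: fragile, since every look-alike must be listed by hand."""
    return bool(HRB_NAME_PATTERN.search(text))


ack = "Funded under grant GOIPG/2013/1110 and ABC/2020/5."
print(find_known_projects(ack))
print(mentions_hrb_by_name("Supported by the Commonwealth Health Research Board."))
```

The exclusion list in the name-based matcher has to grow with every newly discovered look-alike. The identifier lookup needs no such maintenance, because a match against funder-supplied metadata is unambiguous by construction.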

Funders who supply project metadata gain precision and reliability. Before project-level mining, for some funders, up to 30% of publications attributed to them via harvested metadata were false positives, that is, publications funded by other organizations entirely. By introducing project data into the mining process, we eliminated these errors, delivering a clean and trustworthy dataset for their dashboards.

To see the full list of funders indexed in the OpenAIRE Graph, visit OpenAIRE’s Funders page in EXPLORE. Funders with a “Registered” tick mark have provided project metadata, enabling more detailed and accurate analyses.

A Collaborative Discussion

As the session transitioned to the Q&A, participants raised practical questions that reinforced the importance of maintaining high standards for data quality. Without consistent and enriched metadata, the ability to draw meaningful conclusions from the Monitor’s insights would falter. As Ioanna Grypari emphasized in her closing remarks, data quality is not just an operational challenge; it’s a strategic imperative.

Strengthening Collaboration: The Next Session

Looking ahead, the third session in the Open Insights Series will take place on Thursday, February 13, 2025, at 12:00 (GMT). The session will explore the role of repositories in advancing national Open Access goals, featuring updates on the National Open Access Repositories Project, practical insights into metadata guidance, and a hands-on segment on repository integration and interoperability with OpenAIRE Provide, alongside opportunities for discussion and Q&A.