What's in a Mine?
OpenAIRE aims to link open access publications to funding and project information, as well as related datasets. This article explains how the technical arm of OpenAIRE uses text mining to automatically extract such related information from publications and datasets. It outlines the different activities dedicated to understanding how to extract links, and a set of extraction tasks that enrich the information about a publication.
The main goal of Work Package 7 is an analysis of OpenAIREplus content, i.e., it automatically “reads” publications and related data and looks for information that would be of interest to researchers or policy-makers. At a technical level, the OpenAIRE system consists of a set of connected modules. Each of them has a different task, such as: looking for references to data sets in publications, clustering together similar publications, extracting information about funding sources from publications, analysing OpenAIREplus website users’ activity, computing metrics for publications and their authors, etc. The results of these tasks are fed back to the OpenAIREplus Information Space, so they can be presented to end-users. The software created will of course be open and available for use by anyone outside OpenAIRE.
Just how this enriches and contextualises information can be seen below.
Figure 1: OpenAIRE analyses academic publications and links them to data sets, authors, funding sources and other publications.
Let's now take a closer look at some of the components comprising the OpenAIREplus mining system.
Funding metadata extractor
The initial aim of the module was to identify EU FP7-funded documents; this was later extended to include Wellcome Trust projects. We have now further enhanced the module so that it can handle an arbitrary number of funding bodies and institutions (alpha version). Funding information is important metadata, useful for article-, author-, institution- and funder-level statistics/analytics, e.g. to discover and present trends over time, or to track statistics for different funding bodies/calls. Funding information can also aid content classification, disambiguation and other knowledge discovery modules.
The funding metadata extractor module can be used either on individual publications (e.g., at the time a publication is deposited, to help authors add metadata) or in batch mode (e.g., to process all documents found in a collection/repository). In both cases, the module scans the text of each publication, performs some pre-processing (such as stop-word removal and tokenization) and then finds matches against the currently known lists of project grant agreement numbers and/or acronyms for various funding bodies. Contextual information is used to assign a confidence value to each match and to weed out false matches – for example gene accession numbers, postcodes, report numbers and other identifiers – that may appear identical to valid grant numbers. Also, publications may reference project grants in a literature survey section or in the references, without actually being funded by those particular projects; these situations must also be identified as false matches.
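As a rough illustration, the matching step can be sketched in a few lines of Python. The grant list, cue phrases and confidence values below are invented for the example; they are not OpenAIRE's actual rules:

```python
import re

# Hypothetical list of known grant agreement numbers (illustrative only).
KNOWN_GRANTS = {"283595", "246686", "212488"}

# Phrases that, near a candidate number, suggest a genuine funding statement.
FUNDING_CUES = ("grant agreement", "funded by", "supported by", "fp7")

def extract_funding(text, known_grants=KNOWN_GRANTS, window=60):
    """Return (grant, confidence) pairs found in the publication text."""
    matches = []
    for m in re.finditer(r"\b(\d{6})\b", text):  # candidate 6-digit IDs
        grant = m.group(1)
        if grant not in known_grants:
            continue
        # Confidence from context: look for funding cues near the match,
        # which helps weed out postcodes, accession numbers, etc.
        ctx = text[max(0, m.start() - window):m.end() + window].lower()
        confidence = 0.9 if any(cue in ctx for cue in FUNDING_CUES) else 0.3
        matches.append((grant, confidence))
    return matches
```

A match found without any nearby funding cue keeps a low confidence value, mirroring how the real module flags dubious matches for later curation.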
This module has been extensively tested against a large number of publication collections (library resources, ArXiv, PLoS, PUMA, the OA set from Europe PubMed Central, etc.) with high accuracy (over 99%). Almost all mistakes/false alarms are given very low confidence values, so they can easily be identified and removed during a curation phase. Furthermore, as we continue to process more publications, the module's filtering rules can be further fine-tuned based on the feedback received after curation. The module's processing times are short: between 2,180 and 5,090 full-text articles per minute depending on the dataset (about 3,600 full-text publications/min on average).
Figure 2: Number of Open Access publications from Europe PubMed Central (PMC) that were linked to EU FP7 funding by the funding metadata extractor module.
Identifying links to research data
Providing cross-links from research publications to associated datasets is one of the main goals of OpenAIREplus. Although this is a much more complex problem than that of extracting funding information, by utilising a text mining mechanism similar to the one used by the funding metadata extractor, we are currently developing a module that will scan the text of a publication and identify different types of links to associated research data (e.g., gene accession numbers).
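A minimal sketch of such an identifier scanner might look as follows. The two patterns are simplified illustrations (real accession-number and DOI formats are considerably more varied), not the module's actual rules:

```python
import re

# Illustrative patterns for two data-identifier styles; these are
# assumptions for the sketch, not exhaustive or authoritative formats.
DATA_ID_PATTERNS = {
    "genbank": re.compile(r"\b[A-Z]{1,2}\d{5,6}\b"),  # e.g. AB123456
    "doi":     re.compile(r"\b10\.\d{4,9}/\S+\b"),    # dataset DOIs
}

def find_data_links(text):
    """Return (identifier_type, identifier) pairs found in the text."""
    links = []
    for id_type, pattern in DATA_ID_PATTERNS.items():
        for m in pattern.finditer(text):
            links.append((id_type, m.group(0)))
    return links
```

In practice, as with grant numbers, candidate identifiers would also need contextual filtering, since many short letter-digit strings are not data references at all.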
Content-based article classification
A computationally efficient supervised method has been implemented for classifying an unknown text (publication) into a set of pre-defined classes (publication labels). Towards this end, several taxonomies have been adopted, including the arXiv, WoS and Dewey classifications. The next step is to provide an accurate method for selecting between different taxonomies, given an unknown publication.
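The article does not detail the classifier used. As a deliberately simplified stand-in, a bag-of-words nearest-profile classifier conveys the basic idea of assigning a publication to the closest taxonomy class:

```python
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train_centroids(labelled_docs):
    """Build one word-frequency profile (centroid) per class label."""
    centroids = {}
    for label, doc in labelled_docs:
        centroids.setdefault(label, Counter()).update(tokenize(doc))
    return centroids

def classify(text, centroids):
    """Assign the label whose profile shares the most word mass with text."""
    words = Counter(tokenize(text))
    def overlap(profile):
        return sum(min(words[w], profile[w]) for w in words)
    return max(centroids, key=lambda lbl: overlap(centroids[lbl]))
```

A production system would of course use proper feature weighting (e.g. TF-IDF) and a trained model rather than raw word overlap, but the input/output shape is the same: text in, taxonomy label out.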
Supervised visualization of text classes
A method for visualizing supervised text data has been implemented, adopting dimensionality reduction techniques. The arXiv taxonomy has been used for evaluation and demonstration purposes, though other taxonomies (WoS, Dewey, etc.) will be added in the near future. This particular module is still in progress. One of its use cases will be to visualize sets of articles that share a common characteristic, and in particular to visualize the content of articles associated with FP7-funded projects. This will give an image of the content distribution among FP7-funded research or among other types of categorization.
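The specific dimensionality reduction technique is not named here. As one simple illustration of the general idea, a fixed random projection can map high-dimensional bag-of-words vectors down to two dimensions for plotting, with class labels then used to colour the points:

```python
import random

def random_projection(vectors, out_dim=2, seed=42):
    """Project high-dimensional vectors to out_dim dimensions using a
    fixed random Gaussian matrix (a simple dimensionality reduction)."""
    rng = random.Random(seed)
    in_dim = len(vectors[0])
    matrix = [[rng.gauss(0, 1) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return [
        tuple(sum(m_i * v_i for m_i, v_i in zip(row, vec)) for row in matrix)
        for vec in vectors
    ]
```

Supervised techniques such as linear discriminant analysis would instead use the class labels to choose the projection itself; the sketch above only shows the reduction step.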
Probabilistic topic modelling for unsupervised publication analysis
The goal of this module is to provide automated and extensible multi-dimensional analysis of OpenAIRE's huge collection of publications based on probabilistic topic modelling (PTM) techniques, aiming to annotate large archives of documents with thematic information and to identify useful patterns and communities in related multi-dimensional linked data and attributes. An initial prototype for analysing the corpus based on vector space / bag-of-words text representations has been implemented, focusing mainly on the Latent Dirichlet Allocation (LDA) technique.
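For readers unfamiliar with LDA, a toy collapsed Gibbs sampler conveys the mechanics: each word token is repeatedly reassigned to a topic in proportion to how common that topic is in the document and how common the word is in the topic. This is a minimal sketch, not the prototype's actual implementation:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics=2, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Tiny collapsed Gibbs sampler for LDA over tokenised documents.
    Returns one topic-probability vector per document."""
    rng = random.Random(seed)
    V = len({w for doc in docs for w in doc})            # vocabulary size
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    ndk = [[0] * n_topics for _ in docs]                 # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]    # topic-word counts
    nk = [0] * n_topics                                  # topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # Sample a new topic for this token.
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta)
                           / (nk[t] + V * beta) for t in range(n_topics)]
                r = rng.random() * sum(weights)
                for t, wgt in enumerate(weights):
                    r -= wgt
                    if r <= 0:
                        k = t
                        break
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return [[(c + alpha) / (len(doc) + n_topics * alpha) for c in ndk[d]]
            for d, doc in enumerate(docs)]
```

At OpenAIRE's scale a production implementation would use optimised, parallel inference rather than this per-token loop, but the statistical model is the same.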
Bibliographic reference matching and statistics
Analysis of bibliographic references reveals relations to earlier works and quantifies the impact of documents. One of the modules in OpenAIREplus finds citation links between publications using machine learning techniques, in particular Conditional Random Fields (CRF) and Support Vector Machines (SVM).
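As a much simplified illustration of the matching step (using plain token overlap rather than the CRF/SVM models the module actually employs), a raw reference string can be linked to the bibliographic record it most resembles:

```python
def tokens(s):
    """Crude normalisation: lowercase and split on punctuation/whitespace."""
    return set(s.lower().replace(",", " ").replace(".", " ").split())

def match_reference(ref_string, records, threshold=0.5):
    """Link a raw reference string to the best bibliographic record
    by Jaccard similarity of their token sets; None if nothing is close."""
    best, best_score = None, 0.0
    for rec_id, rec_text in records.items():
        a, b = tokens(ref_string), tokens(rec_text)
        score = len(a & b) / len(a | b)
        if score > best_score:
            best, best_score = rec_id, score
    return best if best_score >= threshold else None
```

The learned models replace this crude similarity with features extracted from the parsed reference (author, title, year, venue), which is what makes the matching robust to citation-style variation.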
Yet another module looks for similarities between publications, analysing both the metadata (front matter) and the full text of documents. Similarity may stem from a common topic, similar works referenced, funding from similar projects, etc. Thanks to this functionality, users of the OpenAIRE.eu service will be presented with "see also" recommendations.
The OpenAIRE technical team: Mateusz Kobos, Marek Horst, Lukasz Bolikowski from ICM, University of Warsaw, and Harry Dimitropoulos, Omiros Metaxas, Theodoros Giannakopoulos, Lefteris Stamatogiannakis, John Foufoulas, Dimitris Pierrakos, Natalia Manola from University of Athens, Greece