Opscidia’s ontology generator
Our project aims to build a pipeline of NLP algorithms that automatically generates and represents a domain-specific ontology by text-mining a large corpus of scientific publications covering that domain.
Our pipeline will combine algorithms for cleaning the data, analyzing the topics, vectorizing the concepts, clustering them, and representing them.
We have started to experiment with this methodology on OpenAIRE Graph data, using the OpenAIRE access API, and on a data dump prepared by the OpenAIRE team, with whom we have collaborated to design this proof of concept so that it can be easily integrated into the OpenAIRE infrastructure.
A first trial was carried out with FastText word embeddings and the UMAP projection method on 115,667 English articles (titles and abstracts). It produced the representation of the main concepts (and their relative similarity) shown on the left and demonstrated the value of our concept. To develop this concept into a functional prototype and then a beta version, we will optimize and combine some of the best-known NLP building blocks:
- Preprocessing and cleaning: remove punctuation, remove stopwords, lemmatize, etc.
- Word Embeddings: Word2Vec, FastText, GloVe
- Projection: t-SNE, UMAP
- Clustering: K-means, hierarchical agglomerative clustering (HAC)
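The building blocks above can be illustrated with a minimal, standard-library-only sketch. It is not the pipeline used in the trial: document-level co-occurrence counts stand in for Word2Vec/FastText/GloVe embeddings, lemmatization and the projection step are omitted, and grouping terms whose cosine similarity reaches a threshold (equivalent to cutting a single-linkage hierarchical clustering at that threshold) stands in for K-means or a full hierarchical clustering. The stopword list, the sample abstracts, the function names, and the 0.5 threshold are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real run would use a full list).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "on", "for", "to", "by", "is", "are", "with"}


def preprocess(text):
    """Lowercase, drop punctuation and digits, remove stopwords (no lemmatization)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def cooccurrence_vectors(docs):
    """Map each term to a Counter of the terms it shares a document with.

    This is a crude stand-in for learned word embeddings.
    """
    vectors = {}
    for tokens in docs:
        unique = set(tokens)
        for term in unique:
            vec = vectors.setdefault(term, Counter())
            for other in unique:
                if other != term:
                    vec[other] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster(vectors, threshold=0.5):
    """Group terms linked by cosine similarity >= threshold (single linkage)."""
    terms = sorted(vectors)
    parent = {t: t for t in terms}

    def find(t):
        # Union-find lookup with path halving.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for t in terms:
        groups.setdefault(find(t), set()).add(t)
    return list(groups.values())


# Hypothetical abstracts, invented for this example.
abstracts = [
    "Neural networks for image classification.",
    "Deep neural networks improve image recognition.",
    "Soil bacteria and nitrogen fixation in crops.",
    "Nitrogen uptake by soil bacteria.",
]
docs = [preprocess(a) for a in abstracts]
vectors = cooccurrence_vectors(docs)
clusters = cluster(vectors)
for group in clusters:
    print(sorted(group))
```

On real data, each stand-in would be replaced by the corresponding building block from the list, and the resulting clusters would be the candidate concept groups of the ontology.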
Tender priority topics addressed: Our proposal addresses the OpenAIRE Topic 3 “Expand the OpenAIRE Service portfolio” challenges:
Services for OpenAIRE that will add value: The ontologies that we will develop, and the services built upon them, will allow complete, topic-specific indexing of article full texts. This will increase the efficiency of discovery mechanisms for all stakeholders, and it is typical of the high added-value services that constitute our business model. An example, built on EPMC’s API, can be seen in our ElPub 2019 poster (www.opscidia.com/poster).
Development of Open Science/Access: The development of Open Science and Open Access is both the most important mission of Opscidia and a necessity to sustain its model. This project will support our long-term vision that Open Access and text-mining tools can support one another.
Integration into OpenAIRE infrastructure: as described above, the methodology is based on data from the OpenAIRE Graph Access API (api.openaire.eu) and will take advantage of the full power of the OpenAIRE Mining infrastructure to build the ontologies from scientific publications.
Phase 1 budget: 5,862 €
Opscidia is a platform that hosts peer-reviewed scientific journals, on the sole condition that they are open access and free of charge for authors.