It's all in the Context...
The concept of linking scientific literature to data in multi-disciplinary research infrastructures is now going to be put into practice. OpenAIREplus is developing a common approach to connect descriptions of research results and research information between many rich data sources (publications, data, project information) in a discipline-independent fashion. In short, the aim is to develop a service that allows users to interact with, and support linking between, research articles and data in the context of project funding.
To gain a fuller understanding of how this has to be done, OpenAIREplus has carried out a series of pilots to examine how to create these 'enriched' information packages, working closely with two scientific partners from very different disciplines: Data Archiving and Networked Services (DANS) and the European Bioinformatics Institute (EBI-EMBL). The pilots are also part of a series of studies investigating key areas for the realization of an Open Access and participatory infrastructure for scholarly communication.
Putting a publication into context helps researchers assess it and discover related resources. The examples create context on different levels: some embed research data into the actual publication, others link to it via the metadata. Since the OpenAIRE portal is primarily based on metadata, the work focused on linking publications and data at the metadata level.
Working with different research communities
OpenAIRE explored how different types of existing interlinked outputs can be managed by cross-disciplinary infrastructures. To this end, two demonstrators were built, showing examples of interlinked research results from the Life Sciences, the Social Sciences and the Humanities. (See below for links to the demonstrators)
In the process of building these demonstrators we explored commonalities, differences and other issues that can contribute to a general model for linking publications and datasets. The results are intended to support further discussion of such a model.
The prototypes used hyperlinks to connect research results held in different infrastructures. For instance, in the area of the Life Sciences, the European Bioinformatics Institute (EBI) provides freely available data and services, for example, nucleotide sequences, gene expression, protein information, chemicals, and biological pathways.
How does this fit into a generic infrastructure?
However, these different approaches to connecting research results make it difficult for third parties like the OpenAIRE infrastructure to provide sustainable, automated services that interpret, manage and exploit the added value of such related research output. Moreover, OpenAIRE itself aims to further enrich the network of related assets by adding information on projects, funding, and usage statistics.
The goal of constructing these demonstrators is to advance the development of a common discipline-independent model for linking publications, data and other contextual information, by identifying commonalities and differences during their iterative construction and by providing concrete examples to support further discussion.
Demonstrator #1: Life Sciences
The Life Sciences demonstrator focused on the problem of how an aggregation infrastructure such as OpenAIRE can re-use ‘added value’ elements produced by Europe PMC, which actively links publications and biological research data.
In the area of the Life Sciences, the European Bioinformatics Institute (EBI) provides freely available data and services, for example, nucleotide sequences, gene expression, protein information, chemicals, and biological pathways. More recently, the EBI has led the development of Europe PubMed Central (Europe PMC), a literature database containing abstracts and full-text articles from the Life Sciences. The core content is enriched through the addition of citation information (i.e. who is citing whom), text mining that allows the user to highlight and browse keywords such as gene names, organisms, and diseases, and links to the respective records in biological databases.
OpenAIRE created a simple web application for displaying, browsing, and searching publications. Its core entities are publications, authors, datasets, and FP7 projects, all of which are represented as HTML splash pages identified by stable URIs in the application front end.
To identify publications, we used PubMed IDs, the well-established identifier in the Life Sciences. For the sake of simplicity, we ignored most bibliographic details except for the publication title, the authors, and potential external identifiers (e.g., DOIs).
Figure 1. A screenshot of a publication splash page in the Life Science Demonstrator.
The application is populated with roughly 120 sample publications. These were imported from Bielefeld University's repository "PUB", which uses PubMed IDs in its metadata model. All FP7-funded publications in the demonstrator were also connected to their respective EC-funded projects. Links to projects are provided by the index of the OpenAIRE infrastructure.
Demonstrator #2: Social Sciences & Humanities
The second demonstrator provides examples from the Social Sciences and one example from the Humanities. It builds upon NARCIS, a national portal that has a role comparable to that of the OpenAIRE portal. The demonstrator investigates how the OpenAIRE portal can best be extended to support different practices for linking publications with data from the Social Sciences and the Humanities, as well as generic interlinking of publications, data, projects and researchers. The examples present different kinds of links and support the discussion of how such relations can be captured, displayed and navigated.
The demonstrator is composed using metadata already available at http://www.narcis.nl. The relations to contextual entities were added manually. The domain-specific concepts and variables are available at http://zacat.gesis.org, and their relations to EVS publications were defined by the authors using the EPE developed by CenterData (http://www.centerdata.nl/en/TopMenu/Wat_doen_we/ICT-toepassingen/dataplusepe.html). Related datasets are available via DANS EASY (http://easy.dans.knaw.nl).
Interview fragments were created using the Oral History Annotation Tool for INTER-VIEWS developed by SPEX (http://www.clarin.nl/system/files/Heuvel_Sanders_Rutten_Scagliola_Witkamp_LREC2012.pdf) and are hosted at http://www.watveteranenvertellen.nl. The full interviews are available at http://easy.dans.knaw.nl.
The examples contain bibliographic descriptions of publications, data sets, research information, people and organisations. The publications and data sets have been harvested from the Dutch institutional repositories via their common standards. The publications and datasets are all identified by OAI-identifiers and most of them also have persistent identifiers for the copy in the repository as well as a persistent identifier for the publisher-copy.
The book is related to many researchers, of whom some are authors and others are editors. This also holds for related publications: one is a related paper which discusses how the interview fragments were processed, one is cited by this publication and another one is the same publication within another repository.
One of the relations already available within the existing NARCIS infrastructure is the relation between publications and researchers with a DAI (Digital Author Identifier).
Figure 2. Screenshot of a Social Sciences publication In Context
The relations of a bibliographic description to any other are displayed as contextual information in the sidebars: persons, projects and organizations on the left side, data sets and publications on the right side. These links allow users to navigate to the corresponding items within the portal. The relations are bi-directional. When a user navigates from a publication to a cited dataset, this dataset will be displayed in the center. From here, the user can navigate back to the publication that cites the dataset (see Figure 2).
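The bi-directional navigation described above can be sketched as a small link index that stores every relation together with its inverse. This is only an illustration under assumptions, not how NARCIS or the demonstrator is implemented: the class name `LinkIndex`, the relation labels `cites` / `isCitedBy`, and the identifiers are all hypothetical.

```python
from collections import defaultdict

# Hypothetical relation vocabulary: each relation type maps to its inverse.
INVERSE = {"cites": "isCitedBy", "isCitedBy": "cites"}


class LinkIndex:
    """Stores links so that they can be followed from either end."""

    def __init__(self):
        self._links = defaultdict(set)

    def add(self, source, relation, target):
        # Register the link in both directions so that the user can
        # navigate from the publication to the dataset and back again.
        self._links[source].add((relation, target))
        self._links[target].add((INVERSE[relation], source))

    def related(self, item):
        # Return the (relation, other item) pairs shown next to `item`.
        return sorted(self._links.get(item, set()))


index = LinkIndex()
index.add("publication:1", "cites", "dataset:9")  # made-up identifiers
```

Calling `index.related("dataset:9")` then yields the way back to the citing publication, which is the behaviour the sidebars provide.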
Generic entities and relations
Both demonstrators are based on bibliographic descriptions for publications, datasets, research projects and researchers, interlink these and allow users to navigate between them. The bibliographic descriptions could all be fetched from their original sources in their original metadata formats. Many of them could be retrieved using OAI-PMH, but other sources were also used (normal web resources, dedicated APIs). The available metadata could be reused for the purpose of these demonstrators. It is beyond the scope of this paper to discuss the need for more standardized or specialized metadata schemas.
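To give an idea of what retrieval via OAI-PMH involves, the following minimal sketch builds a `ListRecords` request and extracts the record identifiers from a response. It is an illustration only and not the harvesting code used for the demonstrators: the base URL, the helper names and the sample response are assumptions, while the `verb` and `metadataPrefix` parameters and the XML namespace come from the OAI-PMH 2.0 protocol.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace defined by OAI-PMH 2.0.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for a repository endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def record_identifiers(response_xml):
    """Extract the OAI identifiers from a ListRecords response."""
    root = ET.fromstring(response_xml)
    return [el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)]


# Made-up response fragment, trimmed to the parts used here.
SAMPLE = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    "<ListRecords><record><header>"
    "<identifier>oai:repository.example.org:1</identifier>"
    "</header></record></ListRecords>"
    "</OAI-PMH>"
)

# Hypothetical endpoint; a real harvest would send this request over HTTP.
url = list_records_url("https://repository.example.org/oai")
identifiers = record_identifiers(SAMPLE)
```

A real harvester would also have to follow resumption tokens for large result sets, which this sketch leaves out.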
The relations between the different resources need to be semantically typed as they can have different meanings. The relation can also be defined in different ways: by the creator of the resource, by an expert curator, by an automated inference algorithm or by crowdsourcing. It is therefore important to register this origin as it implies different levels of reliability.
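One way to picture a semantically typed relation that also records its origin is a small record type. This is a sketch under assumptions rather than the OpenAIRE data model: the field names, the relation label `isSupplementedBy`, the identifiers, and the decision to filter by a set of accepted origins are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    """Who or what asserted a relation (the four origins named above)."""
    CREATOR = "creator"
    CURATOR = "curator"
    INFERRED = "inferred"
    CROWDSOURCED = "crowdsourced"


@dataclass(frozen=True)
class Relation:
    source: str         # identifier of the describing resource
    relation_type: str  # semantic type of the link
    target: str         # identifier of the related resource
    origin: Origin      # provenance, kept because it implies reliability


def with_origin(relations, accepted):
    """Keep only the relations whose origin is in the accepted set."""
    return [r for r in relations if r.origin in accepted]


# Made-up example data.
relations = [
    Relation("publication:1", "cites", "dataset:9", Origin.CREATOR),
    Relation("publication:1", "isSupplementedBy", "dataset:4", Origin.INFERRED),
]
trusted = with_origin(relations, {Origin.CREATOR, Origin.CURATOR})
```

Keeping the origin on each relation lets a portal decide per use case which assertions to show, without fixing one ranking of reliability in advance.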
Discipline-specific entities and relations
An important difference between the demonstrators can be observed in the concept of what a “dataset” is. Different databases in the Life Sciences, as well as DDI3-encoded data in the Social Sciences, are very well structured, which allows detailed identification of, and relations with, their contents. Their internal structures, however, differ from one another. Humanities data is the most heterogeneous, so its most common structure is defined by files and folders.
Early feedback from both researchers and repository managers indicates that such access to detailed data entities is not of primary concern. It is more important that a researcher can discover other publications and data sources related to, for example, a concept in a questionnaire than that they can analyse the data from within the portal. This raises an important challenge: how can a cross-disciplinary portal provide different subject-specific indexing for all its resources?
What chiefly distinguishes the disciplines is the way they structure and describe the objects within their own specific databases. It is infeasible for infrastructures like OpenAIRE to manage such objects from each and every discipline. Stable and globally unique identifiers allow OpenAIRE to index the relations between these objects, so users can search for publications that reference these objects or browse them in their original sources. In the future, OpenAIRE will collaborate with different scientific communities to specify the relevant schemes.
The relations among the different objects can be captured in different ways, and should be captured early in the workflows by the most knowledgeable stakeholder such as the author or creator of a resource.
The process of constructing the demonstrators was only a first step towards a common approach to discipline-independent interlinking of research information. The pilots serve as a test bed for further development of services in OpenAIRE. They allow for a better understanding of researchers' practices (such as research data management, use of identifiers and data citation) in different scientific communities and highlight the challenges of collecting information resources from heterogeneous sources.
What's your opinion on enhancing publications with related datasets?
We would really appreciate your feedback and suggestions.
For more information please contact:
Want to take a look at the Demonstrators? See below for links:
Life Sciences Demonstrator:
The demonstrator is located at
The user can browse by entities like publication, author, dataset and project. Furthermore when signing in as user "jane/default" the user is directed through different steps to enhance a publication with related information resources.
Documentation can be found at
Social Sciences Demonstrator:
The demonstrator is accessible at
Please use "narcis/narcis" as login.
A detailed description of different issues can be found at
This article is largely based on the recent submission to the IDCC conference, January 2013: Hoogerwerf, M et al. (2012) 'Linking and Enriching Data and Publications Across Subject-Specific Infrastructures - Challenges and Issues for a Multidisciplinary Approach' Submission as a conference paper to IDCC, 2013.
pdf of the presentation: http://www.dcc.ac.uk/webfm_send/1146