Efficient dissemination and visibility of research results across the boundaries of scientific communication infrastructures is closely linked to the definition of standards for describing scientific information and for communication protocols. Metadata should be as complete and consistent as possible: its quality determines the quality of the services built upon it and is therefore a prerequisite for their use and acceptance by researchers and the public. At the same time, authors should not be burdened with additional effort and redundant input of bibliographic data. To achieve this goal, collaboration, resources and active contributions from different infrastructures and their organizations are required.
The integration of VIRTA in OpenAIRE will also address the following issues. In Finland, none of the commercial CRIS platforms is currently compatible with the OpenAIRE aggregation requirements. Moreover, Finnish repositories do not cover the complete research output of academic institutions in Finland. VIRTA makes it possible to answer questions such as what share of the total publication output is Open Access and what share of publications is in the native languages. Integrating (national) CRIS systems with OpenAIRE would provide answers to such questions and enable comparison across national borders. In parallel, the integration of institutional CRIS systems is important, as it will greatly improve the coverage and quality of metadata in OpenAIRE and expand the monitoring capabilities provided by the OpenAIRE portal and dashboards.
VIRTA Publication Information Service
VIRTA Publication Information Service is an advanced data warehouse solution to integrate institutional data at the national level in Finland. VIRTA was launched in spring 2016. The service is developed by CSC – IT Center for Science and owned by the Finnish Ministry of Education and Culture. As a data hub, VIRTA has up-to-date bibliographic information of all scientific publications from 54 Finnish organizations using different local solutions for publication data collection, such as commercial CRISes, self-made publication registers and institutional publication repositories (Figure 1). About 60,000 scientific, professional and non-scholarly publications are transferred per year with all scientific fields covered. Publication metadata in VIRTA is based on a national data model that fulfills the requirements of national higher education institutions' funding model and other needs of monitoring research and development activities.
Figure 1. VIRTA Publication Information Service metadata flows and integrations to both organizational CRIS systems as well as national and international services.
Two (or three) steps to OpenAIRE integration
The OpenAIRE Guidelines for CRIS Managers version 1.1 were released in June 2018, with minor updates last December (current version 1.1.1). These guidelines are available at: https://openaire-guidelines-for-cris-managers.readthedocs.io/en/latest/index.html.
They aim to instruct CRIS managers on how to expose their metadata in a way that is compatible with the OpenAIRE infrastructure and thus allows integration into it. National aggregated CRIS systems, such as VIRTA, can also comply with these Guidelines by providing additional provenance information about their records. In the following we describe three major steps towards becoming compliant with the OpenAIRE Guidelines for CRIS Managers.
1. Mapping the data model to CERIF
The first step of integration is to map the data model of your CRIS system to the CERIF data model as described in the Guidelines. The work needed may vary considerably between different source systems and their data models. Investing sufficient time and resources at this point is highly recommended, as it improves both the interoperability and the quality of the metadata and makes the later validation phase smoother.
Fortunately, there were many similarities between the VIRTA and CERIF data models to start with. However, some key differences had to be addressed. These included, for example, the vocabulary of publication types and the handling of persistent identifiers and person IDs. Moreover, open access classifications needed to be harmonized, and Finnish national classifications, e.g. scientific fields, needed to be taken into consideration when representing metadata in both the human-readable and machine-readable formats required by CERIF. The mapping resulted in a rather long table, which lists each source VIRTA element, the equivalent CERIF element and examples of both. This up-to-date mapping is available at: https://wiki.eduuni.fi/x/lRLTB
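As an illustration of what such a mapping table amounts to in practice, the following minimal Python sketch renames source fields to target elements and translates a publication-type code through a vocabulary lookup. All field names, type codes and target element names here are illustrative placeholders chosen for this sketch; they are not taken from the actual VIRTA or CERIF mapping table.

```python
# Hypothetical source-to-target field mapping. The names on both sides are
# placeholders for illustration, not the real VIRTA or CERIF element names.
FIELD_MAP = {
    "title": "Title",
    "publication_year": "PublicationDate",
    "doi": "DOI",
}

# Hypothetical local publication-type codes mapped to a shared vocabulary.
TYPE_MAP = {
    "A1": "journal article",
    "C1": "book",
}


def map_record(source: dict) -> dict:
    """Translate one source record into target element names."""
    target = {}
    for src_field, dst_field in FIELD_MAP.items():
        if src_field in source:
            target[dst_field] = source[src_field]
    # Publication types need a vocabulary lookup rather than a plain rename.
    code = source.get("type_code")
    if code is not None:
        target["Type"] = TYPE_MAP.get(code, "other")
    return target


example = map_record({
    "title": "Example",
    "publication_year": 2018,
    "doi": "10.1234/example",
    "type_code": "A1",
})
print(example)
```

Keeping the renames and the vocabulary lookups in separate tables mirrors the distinction made above between elements that match directly and those, such as publication types, that need their values translated.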
2. Providing the data in CERIF-XML via OAI-PMH endpoint
As stated in the Guidelines, OpenAIRE harvests metadata using the OAI-PMH protocol through an endpoint provided by the source system. This endpoint should deliver the metadata in CERIF-XML, produced by applying the mapping to the source system's data model.
OAI-PMH was already implemented in VIRTA in order to provide metadata in both Dublin Core and VIRTA-XML formats (Figure 2). This implementation was used as the basis for implementing the OpenAIRE specifications. It was extended and now supports an additional metadata prefix, oai_cerif_openaire, and the supported sets:
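To make the harvesting request concrete, here is a minimal Python sketch that builds an OAI-PMH ListRecords URL. The verb and the metadataPrefix, set and resumptionToken arguments come from the OAI-PMH protocol, and oai_cerif_openaire is the metadata prefix used by the Guidelines; the base URL and the set name in the usage line are placeholders assumed for this example, not VIRTA's actual values.

```python
from urllib.parse import urlencode

# Placeholder endpoint; substitute the base URL of the actual source system.
BASE_URL = "https://cris.example.org/oai"


def list_records_url(metadata_prefix=None, set_spec=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords"}
    if resumption_token:
        # resumptionToken is an exclusive argument in OAI-PMH: it must not
        # be combined with metadataPrefix or set.
        params["resumptionToken"] = resumption_token
    else:
        if not metadata_prefix:
            raise ValueError("metadata_prefix is required for the first request")
        params["metadataPrefix"] = metadata_prefix
        if set_spec:
            params["set"] = set_spec
    return f"{BASE_URL}?{urlencode(params)}"


# "example_set" is a hypothetical set name used only for illustration.
first_url = list_records_url("oai_cerif_openaire", set_spec="example_set")
next_url = list_records_url(resumption_token="page-2")
print(first_url)
print(next_url)
```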
Figure 2. VIRTA’s technical architecture related to data flows, procedures and APIs
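On the harvesting side, a ListRecords response can be processed roughly as in the following Python sketch, which collects the identifiers of non-deleted records and picks up the resumption token for the next page. The response envelope follows the OAI-PMH 2.0 namespace; the sample document, its identifiers and the token value are invented for this example and do not come from VIRTA.

```python
import xml.etree.ElementTree as ET

# Namespace of the OAI-PMH 2.0 response envelope.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Invented sample response, trimmed to the parts used below.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:cris.example.org:Publications/1</identifier>
        <datestamp>2019-01-15</datestamp>
      </header>
      <metadata/>
    </record>
    <record>
      <header status="deleted">
        <identifier>oai:cris.example.org:Publications/2</identifier>
        <datestamp>2019-01-16</datestamp>
      </header>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def parse_list_records(xml_text):
    """Return (identifiers of non-deleted records, resumption token or None)."""
    root = ET.fromstring(xml_text)
    identifiers = []
    for header in root.iterfind(".//oai:record/oai:header", OAI_NS):
        # Deleted records carry status="deleted" and no metadata payload.
        if header.get("status") == "deleted":
            continue
        identifiers.append(header.findtext("oai:identifier", namespaces=OAI_NS))
    # An absent or empty resumptionToken means the list is complete.
    token = root.findtext(".//oai:resumptionToken", namespaces=OAI_NS)
    return identifiers, (token or None)


ids, token = parse_list_records(SAMPLE)
print(ids, token)
```

A harvester would repeat the request with the returned token until no token comes back.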
3. Any extra steps?
Source systems aiming for OpenAIRE integration may require additional effort before they can be harvested by OpenAIRE. This might be due to metadata ownership and GDPR-related issues, to technological or infrastructure solutions not being able to support endpoints, or to other issues that are not directly related to OpenAIRE but have to be solved at the source system level.
For VIRTA, the extra step arose because VIRTA currently stores only a copy of the metadata, while the research organizations act as registrars, i.e. owners, of the metadata. Written permissions were therefore needed before external services could use the metadata. With these permissions, each research organization could allow OpenAIRE to harvest the records affiliated with that organization via VIRTA's OAI-PMH endpoint. The plans and the data model mapping were coordinated with the research organizations together with the Finnish OpenAIRE National Open Access Desk (NOAD).
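The effect of such per-organization permissions on what an endpoint exposes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the organization codes and the record structure are hypothetical, and the rule that a record is exposed when at least one of its affiliated organizations has granted permission is an assumption of this sketch, not a description of VIRTA's implementation.

```python
# Hypothetical codes of organizations that have granted written permission.
PERMITTED_ORGS = {"org-001", "org-002"}


def harvestable(records, permitted=PERMITTED_ORGS):
    """Keep records affiliated with at least one permitting organization."""
    return [r for r in records if permitted.intersection(r["org_ids"])]


# Hypothetical records, each listing its affiliated organization codes.
records = [
    {"id": "pub-1", "org_ids": ["org-001"]},
    {"id": "pub-2", "org_ids": ["org-003"]},
    {"id": "pub-3", "org_ids": ["org-003", "org-002"]},
]
exposed = [r["id"] for r in harvestable(records)]
print(exposed)
```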
As the VIRTA-OpenAIRE integration goes into production in the coming months, the metadata of more than 350,000 scientific, professional and non-scholarly publications can be added to OpenAIRE's database and explored via the OpenAIRE portal.
By using VIRTA's OpenAIRE integration, Finnish research organizations do not need to invest in their own solutions for OpenAIRE compliance. This is highly cost-efficient, greatly enhances the interoperability of Finnish publication metadata at the European level, and expands OpenAIRE's coverage of national metadata aggregators.
• About VIRTA Publication Information Service https://wiki.eduuni.fi/display/cscvirtajtp/VIRTA+in+English
• OpenAIRE Guidelines for CRIS Managers 1.1.1 https://openaire-guidelines-for-cris-managers.readthedocs.io/en/latest/index.html
• Up-to-date mapping table between VIRTA and CERIF data models https://wiki.eduuni.fi/pages/viewpage.action?pageId=80941717