News
Getting Under the Skin of our Scientific Communities
Enhanced Publications in OpenAIREPlus: One of the objectives of OpenAIREPlus is to link publications to research data held in various infrastructures across different scientific fields. This is motivated by the concept of "Enhanced Publications", which provide structured, machine-readable descriptions of information packages. They facilitate the re-use of research output by users. Furthermore, they can be enriched with additional information through specialized services and describe research in context.
In an initial phase, this approach is being investigated and will be presented through prototypes in a few months. OpenAIREPlus is working closely with three project partners: the Data Archiving and Network Services (DANS), the European Bioinformatics Institute (EBI-EMBL) and the British Atmospheric Data Centre (BADC). By examining how these very diverse scientific communities manage their research data, we can start to build up an idea of how we in OpenAIREPlus might provide a generic infrastructure for linking to data. For example, we might want to examine what their different metadata schemas describe, what types of data they deal with, how they persistently identify data, and ultimately how scientists in particular fields share and manage their data.
In May, some members of the OpenAIRE team visited both EBI-EMBL and BADC to find out more about their data, the services they offer, and how researchers interact with the data.
The European Bioinformatics Institute (EBI)

The EBI is an outstation of EMBL, an international organization at the forefront of life science research and development. The EBI provides freely available data and services in the biological sciences, covering, for example, nucleotide sequences, gene expression, protein information, chemicals, and biological pathways. Some of the databases, such as the nucleotide database, are deposition databases, while others, such as UniProt, are derived and have a higher degree of curation to meet the needs of biologists. More recently, the EBI has led the development of UKPMC, a literature database of 26 million citations and 2.2 million full-text articles in the life sciences. All data at the EBI are held in thematic, structured databases, and the information is shared as widely as possible through websites, APIs and FTP sites, encouraging re-use as well as browsing. The availability and use of these data are fundamental to life science research.
Curated Data: One of the largest curated databases is the UniProt Knowledgebase (UniProtKB), which is used to access functional information on proteins. The database is made up of two sections: UniProtKB/Swiss-Prot, which is manually curated by experts, and UniProtKB/TrEMBL, which contains computationally derived records enriched with annotation and classification. Each entry has a unique identifier (accession number) which is cited in.
Linking Data Entities and Publications: OpenAIREPlus might well want to bring some of the metadata from these datasets into our information space and infer links to publications, either within UKPMC or within some of our networked repositories. If we think about bringing this metadata into OpenAIREPlus, we have to consider both scale and quality: Swiss-Prot has 0.5 million entries, but it is manually curated, which is good. Another challenge is the ‘long tail’ of research data that is attached to research publications (and is therefore not accessible if the publications are not OA). The question is how we can get at this data, as it is highly varied, not computational, and sitting on EBI’s FTP server. It could include images, small graphs, movies, and all kinds of interesting research output.
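One way the inference of links between publications and database entries could be approached is to scan article text for strings that look like UniProtKB accession numbers. The sketch below is only an illustration, not a description of how EBI or OpenAIREPlus actually do this: the function name `find_accessions` and the sample sentence are made up, and the pattern follows the accession number format that UniProt documents. A match is only a candidate link and would still need to be checked against the database.

```python
import re

# Pattern based on the accession number format documented by UniProt.
# Word boundaries avoid matching inside longer tokens.
UNIPROT_ACCESSION = re.compile(
    r"\b(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})\b"
)


def find_accessions(text):
    """Return candidate accession numbers in order of first appearance."""
    found = []
    for match in UNIPROT_ACCESSION.finditer(text):
        accession = match.group(0)
        if accession not in found:  # keep each candidate only once
            found.append(accession)
    return found


if __name__ == "__main__":
    # Hypothetical sentence, used only to exercise the function.
    sample = "The kinase P12345 interacts with Q9Y6K9; see also P12345."
    print(find_accessions(sample))
```

Because the pattern is purely syntactic, it will also pick up unrelated codes that happen to have the same shape, which is one reason manually curated links remain valuable.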
Another database, ArrayExpress, contains gene expression datasets. Typical questions users ask when using this database are: which samples are like mine? Does the competitors’ data look like mine? Rich metadata associated with each dataset enables this kind of searching, and graphical displays of the results allow users to browse large amounts of information. Interestingly, the EBI is seeing more requests for Linked Open Data services. Last but not least, a new web service giving access to all the content in UKPMC has been released. This would allow access to metadata and the embedded data links, as well as to the full text of UKPMC OA articles.
The British Atmospheric Data Centre (BADC)
BADC is one of six data centres funded by the UK’s Natural Environment Research Council (NERC). It is responsible for the long-term management of environmental data holdings and manages data on behalf of NERC. The data gathered here include solar system data, global temperatures and greenhouse gases. The NERC data policy governs how the centres manage their data and handle requests to access it.

What is a dataset? We had some interesting discussions during our visit, such as what actually constitutes a dataset and how it can be defined. BADC stressed that a common ‘theme’ is important, as is a common administration procedure.
Citing data and Identifiers (see previous newsletter for more on this): BADC are trying to promote the publication of data in order to give credit and support the scientific record. They prefer to talk about data “publication” rather than “sharing”, because some data producers view sharing data with skepticism, while publication is commonly accepted. To incentivize the submission of data (a “carrot”), data centres can award scientists a DOI for their completed dataset. Current practice for formally publishing data involves writing a paper about the data, almost like a proxy paper. So far, citing data isn’t the ‘done’ thing among scientists. Changing this requires a critical mass of pressure from all sides (funders, researchers, institutions). If you use data, you should cite it.
Publishing data: In terms of scientific publishing of data, BADC can help with the review of datasets, similar to literature reviewing. This is done in two steps. A technical review first ensures that the data is in an appropriate format with suitable metadata. This takes the burden off the scientific reviewer, who can then concentrate on the scientific validity of the dataset. Many reviewers currently focus on the analysis and conclusions put forward in a paper rather than reviewing the data, mainly because it is often hard even to open the files in the right software. And therein lies the crux: at that stage the data is detached from the process it went through, so it is difficult to see it in context.
Legal and Licenses: BADC make data publicly available, depending on the type of data. For things with clear conditions of use, like literature, making them available is common practice and the rules are understood. But data is different. If you text-mine several datasets and the results are drawn from different sets, to whom does the resulting data belong?
Championing good data management: BADC also embed themselves in projects and work with scientists. Here they all act as ‘data scientists’ supporting good data management. In NERC projects, for example the Greenhouse Gas project, they allocate ‘Data Champions’. Many disparate groups are trying to collect data and fit it into a data model, so one person is asked to be a data champion. Scientists then see that it is useful to record data in the same format.
Data Archiving and Network Services (DANS)
DANS is an institute under the auspices of the Royal Netherlands Academy of Arts and Sciences (KNAW) and the Netherlands Organisation for Scientific Research (NWO). Its mission is to promote sustained access to digital research data. For this purpose, DANS encourages scientific researchers to archive and reuse data in a sustained manner. For the Social Sciences and the Humanities it provides the online archiving system EASY. The NARCIS portal puts these datasets in context by providing a gateway to them as well as to all e-publications from the Dutch repositories and most Dutch research project descriptions.
Data Archive: EASY is the long-term Electronic Archiving System focusing on datasets from the Social Sciences and Humanities. It currently provides permanent storage of and access to over 20,000 datasets. Persistent Identifiers are provided for each dataset. Identification of depositors, composers (Digital Author Identifier) and funders is being developed.
The datasets within EASY are archived as collections of files and metadata. Several measures help ensure the durability and usability of the archived datasets. A list of preferred file formats allows DANS to minimize preservation efforts. Archivists at DANS advise researchers on the structure of and best formats for their data. Where possible (or when required by the funder), this is done at the start of a research project by means of a Data Management Plan. These plans make researchers aware of durability and reuse, and they minimize the effort needed to convert and describe the data at the end of a project, when the data needs to be deposited. After a dataset has been deposited, the archivists validate its quality and documentation and publish it. Publication is always Open Access, unless there are clear concerns about, for example, IPR or privacy issues. DANS plays a leading role in the development of the Data Seal of Approval (DSA). The DSA is granted to data repositories that meet a number of clear criteria in the field of quality, preservation and accessibility of data.
Linking data and context: DANS increases the visibility and reuse of research data by putting it in context and by bringing the data to the researchers. The context of a dataset can be created by allowing researchers to create Enhanced Publications, in which they enhance their publication by adding structured references to research data and visualizations, or additional information about the corresponding creators, institutions, research projects and funders. Portals like NARCIS or OpenAIRE can present this context and allow researchers to discover the research data by browsing through its context.
Within OpenAIREPlus, DANS will participate in modelling this context. As it doesn’t want to burden researchers with describing the context, it is interested in ways to capture it automatically during the research life cycle. It will show how this could be done by building a demonstrator.
More information about DANS can be found on the website www.dans.knaw.nl.
Conclusion
DANS is deeply involved in the Dutch research information landscape and in services and policies on data archives, not only in the SSH. It is also experimenting with and enabling new kinds of information resources: Enhanced Publications, which make research results more visible, transparent and re-usable. BADC, like EBI, produces structured and scientifically sound data. EBI’s heavily curated datasets and BADC’s citable data ensure that data can be acknowledged and referenced. In the context of OpenAIREPlus, any work on creating structured data makes it easier to discover, analyse and reference the data in order to link it to a publication and share it with users and other infrastructures.
Table 1: Comparing some data features across subject-specific infrastructures
| | DANS | EBI | BADC |
|---|---|---|---|
| Data Types | Collections, datasets and data files from the SSH domain | Literature, variety of database records (structured and curated; unstructured as suppl. material to publications) | Data Entity (measurement, simulation, analysis), Activity, Production Tool, … in the domain of environmental and atmospheric data |
| Identifier | URN:NBN for long-term identification of datasets and files. DataCite-DOI for citing datasets | DOI, PMID/PMCID for literature | Permanent URL for most datasets; DataCite-DOI for some datasets |
| Curation | Preferred formats, Data Management Plans, Data Seal of Approval | Yes on structured data (manually by data curators or computational) | Yes, ensuring long-term integrity of atmospheric data |
| Access / Licenses | Open Access when possible | All biological data is free to reuse | Mostly OA, some restricted |
| Linking Data-Publications | As Enhanced Publications in NARCIS | Citation references in biological database records; links from UKPMC abstracts and mined terms from fulltext to biological database records, vocabularies and taxonomies | Experiments in data publication. Encourages users to cite data and provides suggested citations for all datasets |
| Interfaces | EASY: OAI-PMH; WWW; web service under development. NARCIS: OAI-PMH; SRU; WWW | OAI-PMH for UKPMC; web services (WhatIzIt, Evidence-Finder); FTP for datasets; WWW | FTP; WWW |
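
The OAI-PMH interfaces listed in the Interfaces row of Table 1 are harvested with plain HTTP requests, so a harvester mostly needs to build the right request URL. The following is a minimal sketch under stated assumptions: the base URL is a made-up placeholder (not the actual endpoint of EASY, NARCIS or UKPMC), and the helper name `list_records_url` and the set name are hypothetical. The `ListRecords` verb and the `oai_dc` metadata prefix are part of the OAI-PMH protocol itself.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional selective harvesting of one set within the repository.
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Placeholder endpoint, used for illustration only.
    print(list_records_url("https://repository.example.org/oai"))
    print(list_records_url("https://repository.example.org/oai", set_spec="datasets"))
```

A real harvester would additionally fetch the URL, parse the returned XML, and follow resumption tokens for large result lists; those steps are left out here to keep the sketch short.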