Open Science strives for transparency, and opening up research data may have many advantages for science, ranging from replicability of experiments to reusability of the data itself. The idea is that reusing research data saves time and avoids duplicating the effort and cost of data collection. The European Open Science Cloud promotes the implementation of the FAIR guiding principles (1) for research data in order to enable the exchange and reuse of research data.
It is worth mentioning that data reuse has many related or contiguous concepts (see for example the attempt made in the chart to differentiate them), and defining data reuse per se is not straightforward.
Besides, data reuse is not always a simple or even possible task. In some cases, significant effort is necessary to prepare the dataset for sharing, or complex data standardisation activities are required for data to be reusable. Sometimes the requisite anonymisation processes may reduce the significance of the dataset. And when databases are made accessible and data reusable, they still need maintenance and the related effort has to be considered.
However, in many data-intensive research areas, the reusability of data has great potential and is vital for interdisciplinary experiments and cross-cutting research approaches. In these cases, FAIR data are much needed. As an example, in the fields of Artificial Intelligence and Machine Learning, the availability of huge amounts of data and their correct treatment for reuse is fundamental. In such cases, not only the quality of the data but also its readiness for use by machines is essential.
Below are some concrete examples of data reuse resulting in interesting new research results, thus having a wider impact in the scientific community and more generally even in the public sphere. These stories of data reuse highlight the importance of sharing the raw data with an open licence and the relevance of standardisation procedures. Both steps should be taken into account when encouraging the reuse of research data.
FishBase is a digital catalogue of fishes, collecting a variety of data on 34,300 fish species, such as geographical distribution, biometrics, habitats, population dynamics as well as reproductive, metabolic and genetic data. Those data were gathered over the years from different data sources (grey literature, books, journals, symposia proceedings, reports, etc.), while the raw data are released on the website with an open licence (CC-BY-NC). The availability of robust and reliable raw data, accessible and issued under an open licence, has enabled a very large number of studies of widely differing kinds to be carried out.
Since 2000 FishBase has been supervised by a consortium of nine international institutions and it has been widely used for scientific or scholarly purposes (the project reports 2,275 bibliographical citations). Those data have been processed with a new algorithm by the Institute of Information Science and Technologies of the Italian National Research Council, which has made the algorithm available as a web service. The processing produced a new dataset, available in a repository of NetCDF files. A different consortium then reused this dataset, integrating it with other data, to build AquaMaps, a tool for generating model-based, large-scale predictions of natural occurrences of marine species in geographic areas based on environmental parameters including climate data. This tool has been used for many investigations on climate change and its impact on marine species.
The availability of those vast and comprehensive datasets has allowed many different studies, looking at the initial data from different angles or investigating specific aspects. For example, AquaMaps data were used to forecast the spread of invasive fishes in the Mediterranean or to investigate the sustainability of fishing activities in Europe.
The paper "Forecasting the ongoing invasion of Lagocephalus sceleratus in the Mediterranean Sea" looked at the spread of the silver-cheeked toad-fish taking into account different parameters. The interesting aspect here is that the work has been done with an Open Science approach to enable the replication of the estimate with other species.
AquaMaps data were also analysed with a different set of models in 2016, producing new observations which made it clear that 85% of EU fish stocks were below healthy levels and 64% of EU stocks were overfished, as stated in the report "Exploitation and status of European Stocks". That evidence was discussed at the European Parliament in 2017, so the open data also had an impact on the public decision-making process.
Meanwhile, climate change forecasts from AquaMaps and NASA were combined to build a timeline on climate change, a visual arrangement in which the different scenarios linked to CO2 concentration catch the eye. The NASA and AquaMaps forecasts take into account a wide variety of parameters, ranging from air temperature to ice concentration, from sea bottom and surface salinity to precipitation.
Another example of data reuse comes from marine floats: original raw data produced by Argo floats, a "broad-scale global array of temperature/salinity profiling floats". Data are released with a fully open licence (CC0) on the institutional repository without any kind of processing, and therefore a great deal of work is required for this data to be reusable.
«ARGO data have been long-used by marine science communities in global oceans observing systems», explains the paper in which the standardisation process was reported to the scientific community. «These data are collected using a large network of floats, monitored by the ARGO Information Center (AIC) and are sent to Global Data Assembly Centers (GDACs).
The datasets are available for download on the official ARGO website (www.argo.ucsd.edu), as Network Common Data Format (NetCDF) Pointfeature files and CSV files through FTP sites and online tools. However, these formats present many challenges from a technical point of view, especially in terms of re-usability».
Subsequently, a workflow to convert ARGO observation data into a standard raster file was set up. Briefly, the raw data were processed through an algorithm developed by a second institution (again, the Institute of Information Science and Technologies of the Italian National Research Council). The resulting data were released with an attribution licence (CC-BY) on the institutional repository, and both institutions must be credited in case of reuse.
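To give a feel for what turning point observations into a raster involves, here is a minimal Python sketch. It is not the algorithm used in the paper: it simply averages the observations that fall into each cell of a regular latitude/longitude grid. The function name, the one-degree resolution and the sample values are all assumptions made for illustration.

```python
import math


def grid_point_observations(observations, resolution=1.0):
    """Average (lat, lon, value) observations into a regular global grid.

    Hypothetical helper for illustration only. Row 0 is the northernmost
    band and column 0 the westernmost one. Cells without observations
    are set to NaN.
    """
    n_rows = round(180 / resolution)
    n_cols = round(360 / resolution)
    sums = [[0.0] * n_cols for _ in range(n_rows)]
    counts = [[0] * n_cols for _ in range(n_rows)]

    for lat, lon, value in observations:
        # Map coordinates to cell indices; clamp the southern and eastern edges.
        row = min(int((90.0 - lat) / resolution), n_rows - 1)
        col = min(int((lon + 180.0) / resolution), n_cols - 1)
        sums[row][col] += value
        counts[row][col] += 1

    return [
        [sums[r][c] / counts[r][c] if counts[r][c] else math.nan
         for c in range(n_cols)]
        for r in range(n_rows)
    ]


# Made-up sample: two temperature readings in the same cell, one elsewhere.
sample = [(10.2, 20.1, 18.0), (10.7, 20.9, 20.0), (-35.5, 140.3, 12.5)]
raster = grid_point_observations(sample)
```

A real workflow would also have to deal with depth levels, time windows, quality flags and interpolation of empty cells, which is part of why the standardisation step takes so much work.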
We have seen some reuse examples with environmental data, and the lead of the paper on data standardisation refers to this research area when it states that «Research communities need to carry out their studies in a fast and efficient manner and thus require data to be well structured, well described, and possibly represented in standard formats that allow direct access and usage. In this context, reducing data preparation and pre-processing time is crucial». But the same thing can be said for the reuse of data in any research area.
Finally, it is worth mentioning that not only research data can be reused, but also associated outputs such as software, lab notes or models. For example, time series forecasting techniques have been reused to study the fishing pressure in the Indian Ocean in the paper "Analysing and forecasting fisheries time series: purse seine in Indian Ocean as a case study". In another case, in the paper "Distinguishing Violinists and Pianists Based on Their Brain Signals", an artificial neural network (ANN) model was reused to study the relation between music and the brain. ANNs are general models applicable in several domains.
A Web service originally designed to apply ANNs to a marine environment was reused in the brain-computer interface study. This was possible because the Web service followed the WPS standard and used standardised data, as reported in the study: "The ANN implementation used for this paper, is open-source and part of the DataMiner framework and is published as a free to use Web service under the Web Processing Service standard (WPS). WPS standardises the representation of the input and output and makes the service usable by a number of clients and by external software. DataMiner saves the history of all trained and tested models using a standard and exportable format. Every executed process can be re-executed and parametrised multiple times by other users, thanks to collaborative experimentation spaces. In this view, this platform allowed making the presented experiment compliant with Open Science directives of repeatability, reproducibility and reusability of data and processes".
Ensuring data reuse requires investment and effort. The value of fostering reuse through the FAIR principles needs to be demonstrated by showing its impact, and this work should be properly rewarded in the evaluation process. Moreover, the ownership of data, still considered a source of power even in the scientific field, and the sale of data, even by public institutions, are aspects to be taken into account.
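Because WPS fixes how inputs and outputs are represented, any client can call a service in the same way. The Python sketch below builds a WPS 1.0.0 Execute request in key-value (GET) form. The endpoint, the process identifier and the input names are hypothetical placeholders, not the actual DataMiner service or its parameters.

```python
from urllib.parse import urlencode


def build_wps_execute_url(endpoint, process_id, inputs):
    """Build a WPS 1.0.0 Execute request URL in key-value (GET) encoding.

    `inputs` maps input names to literal values; they are joined as
    name=value pairs separated by semicolons, as the DataInputs
    parameter expects, and then URL-encoded.
    """
    data_inputs = ";".join(f"{name}={value}" for name, value in inputs.items())
    query = urlencode({
        "service": "WPS",
        "version": "1.0.0",
        "request": "Execute",
        "identifier": process_id,
        "DataInputs": data_inputs,
    })
    return f"{endpoint}?{query}"


# Hypothetical endpoint, process identifier and inputs, for illustration only.
url = build_wps_execute_url(
    "https://example.org/wps/WebProcessingService",
    "org.example.HypotheticalAnnTrainer",
    {"LearningRate": 0.01, "Epochs": 100},
)
```

The same request shape works for any process a WPS server exposes. Only the identifier and the inputs change, which helps explain how a service built for marine data could be picked up by a study on brain signals.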
(1) The acronym FAIR stands for Findable, Accessible, Interoperable, Reusable. These can be considered the guiding principles when working with research data. They "put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals". See Wilkinson et al., The FAIR Guiding Principles for scientific data management and stewardship: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4792175/
Thanks to Gianpaolo Coro from ISTI-CNR Italy for his help in identifying the data reuse cases.