There is increasing recognition of the importance of data, especially as its production and volume grow at ever-increasing rates. It has been observed that the volume of data in existence today is many times larger than it was just two years ago. This leads to the difficult question of which data are valuable and which are not. Indeed, much of the data being generated comes from automated systems such as consumer internet-of-things (IoT) devices, weather detection instruments, and so forth. In many cases, the data captured may be junk. In the world of research, many intermediate data objects are created during the active phase of the curation lifecycle that will typically be unnecessary for the interpretation of results. However, these should not obscure the importance of ensuring the long-term preservation of those data that will have particular long-term benefits to society.
Long-term preservation also implies that data will be curated in a fashion that adheres to established standards where possible, such as those for metadata and ontologies, which will facilitate interoperability. This has direct consequences for the ability to compare and contrast with other datasets from within a domain – and possibly from outside too, if enough common parameters were to be defined. This is fundamental to science and research, aiding reproducibility, integrity and validity.
In Europe, the European Open Science Cloud (EOSC) aims to address many of the challenges of ensuring the long-term preservation of data, and the initial EOSC declaration makes enabling data reuse a core objective. Along with the growing uptake of the FAIR principles, this objective will only be strengthened. Vast amounts of public money have been invested to build the infrastructure and safeguard data generated through EC funds and beyond. Here we will look at some examples of where long-term preservation will have consequential benefits for current societal challenges.
Many cases of fraudulent behaviour in research have come to light through the centuries. In many cases they were unearthed because the research community as a whole was able to scrutinise the research in a free and open manner. Moreover, in some cases this has only been possible when new data emerged through technological advances. Long-term preservation is key to ensuring that data can be indexed and archived in a manner that allows efficient retrieval. This makes it possible to interrogate the data for comparison and validation – old experiments and observations can be reassessed using current methods and technology, which can unearth new or different results but can also provide validation.
Research and development is funded through many different streams, and a sizeable share of the research conducted in public institutions is funded through taxpayer money. This research produces data that should be considered public domain by default unless there are issues surrounding making these data "open". Usually, these issues fall into two camps: data that are sensitive in nature due to personal identifiers, and data that are subject to intellectual property rights (IPR) restrictions. Even for these data, there are possibilities that should be explored to allow their discoverability, for example by making their metadata open. Other research endeavours are funded through charitable means and private enterprise, and although this is not technically the same as public funding, there is an argument for making these data as open as would be expected under public funding. Regardless of the source of funding, there is a moral argument for the proper preservation of data, considering the vast amounts of public and private money being spent to fund these activities. It is also important here to consider the impact that all areas of scientific research have on the ability to achieve the targets set by the UN's Sustainable Development Goals (SDGs), which endeavour to find evidence-based solutions to global problems.
Covid-19 has posed an unprecedented challenge globally but has also provided an unprecedented opportunity to find a cure collaboratively and innovatively. Although efforts to combat Covid might be primarily a success story of open science and data, and indeed there are several examples of national funding bodies encouraging sharing, they will have consequences for long-term preservation too. A huge amount of work has been done in a very short space of time on a truly global scale: the WHO encouraged from the outset that Covid-related data should be made openly available and, as of the start of February 2021, approximately 200,000 pieces of literature had been collated by the WHO. In the European context there is the EU Open Data Portal, which has allowed data generated from the many international efforts to be shared and preserved for the long term in a manner that allows others to validate and reproduce results, upholding the scientific method. This was recognised from the outset by the Research Data Alliance, which immediately established a working group drawing upon a community that spanned the entire world and aimed to solve a common problem.
The speed of the global response to the Covid crisis has shown what is possible through concerted action, and the lessons learned will be valuable in conducting other global responses, such as that to climate change.
Biological and medical research relies heavily on the basic tenets of evolution and the similarities that exist between species. This manifests itself in many cases in the form of animal testing, whether to elucidate the mechanisms of, for example, proteins or in more abstract behavioural studies. If such things hold true in a mouse, then they have a strong probability of being the same or similar in humans. While it is commonplace, and indeed usually mandatory, to ensure humane handling of animals in such research, there is also a desire to reduce the number of animals that are used. How can this be achieved without compromising the need to ensure proper testing of, for example, therapeutic drugs? Ensuring the long-term preservation of data will enable research involving animals to be reduced, as the knowledge obtained from them will accumulate iteratively, negating the need to duplicate results for validation in the present. This could be particularly evident in the use of controls when conducting experiments. Computational methods for conducting experiments are ever increasing, and modelling and simulations based on observed results will also increasingly obviate the need to use animals in testing. Specific research funding streams and initiatives have been established, in some cases with the explicit aim of reducing the use of animals in laboratory experiments and providing the infrastructure to achieve this.
There are many thousands of languages in the world, with some more widely used than others. As a consequence of globalisation, many languages face a fight for their continued existence as they become marginalised. Languages also reflect culture, and the loss of any given language would have consequences for the culture it represents. Long-term preservation of data is paramount to safeguard these cultural markers, which have played a key part in the advancement of humanity. For example, in India alone, there were once thousands of languages, but these had dwindled to a few hundred by the time of the last census. Most of the remaining languages are endangered, and many have ceased to exist in written form or are practised only by a handful of individuals.
Several global projects exist that aim to address language extinction, such as the Language Archive and the Endangered Languages Archive (ELAR), which provide the tools for documentation in digital form. They are free to access and provide a safe haven for languages that are on the verge of extinction, but also, importantly, a forum from which cultural links can be fostered and analysed. Indeed, many of these languages may be precursors to those that are flourishing and widespread in today's world, and their significance in this respect should not be underestimated.
Another recent example of the importance of preserving items that document cultural change in a society is the 2015 marriage equality referendum in the Republic of Ireland. The National Library of Ireland (NLI) has created a visual archive of over 6000 photographs, which will be a permanent record of this momentous cultural shift for future generations. As part of a wider NLI initiative called the Digital Pilots project, which aims to document various aspects of Ireland's cultural legacy and offer a space for born-digital documents such as photographs, the campaign waged by Yes Equality has been given a home that offers long-term safety and visibility.
Google Maps has been a huge success story and has reshaped the field of cartography in the digital age. It was born from satellite images that were originally taken for completely unrelated purposes by NASA and which were available to buy as scenes directly from NASA. However, this was very much a niche market and one that was expensive for the average user, who was not the intended target. Making these image data open, and therefore allowing their commercialisation by Google, allowed these images to be appreciated by a much larger audience than ever before. It also meant that new data, such as points of interest and businesses, could be overlaid onto the existing satellite data. Businesses in particular have benefited from this, which has helped drive e-commerce further. Taken together, the value of the original images has increased several-fold, and this is an example of the benefits of opening up data and allowing their repurposing and long-term preservation.
In recent years, a burgeoning field in commerce has been consumer digital health products such as fitness trackers. Their advent has been enabled by increased computing power, miniaturisation and reduced costs, making them affordable and practical for everyday use by the public. However, another key factor in their development and usefulness has been the ability to harvest and analyse data generated by humans with ever-increasing temporal and spatial accuracy. Research and development of such products requires tapping into vast data warehouses spanning both several decades and several research domains. The span across domains is, in many cases, due to the repurposing of data that had been generated for a different use. As well as supporting research and development, these data, unrelated at least upon initial inspection, are also useful as reference points in on-the-fly analyses that can be viewed by the user.
A common benefit shared amongst the above examples and others is that long-term preservation of data enables easier comparison of similar – and dissimilar – data at different timepoints by mixing old and new: data that were captured, analysed and stored in the past can be compared to similar present-day data, enabling historical comparisons to be made. It will also enable better integration of these data so that their value is increased. The FAIR principles should be adopted where possible; by doing so, long-term preservation will have a better chance of succeeding. Indeed, adoption of common standards to document data will aid the preservation and analysis of those data, whatever their age or origin. This can have particular relevance, for example, in socio-economic analyses, where there are definite observable changes through time that have major consequences for such things as government policies. Without the ability to monitor changes, society would be unable to anticipate future changes or learn from historical evidence.
The EOSC has made long-term preservation of data a central pillar of its strategic outlook, and this is an ongoing effort. There are many factors still to be addressed in terms of concerted effort across borders, and the benefits of doing so are still being defined, but a key driver will be incentivisation: researchers must be shown the benefits of managing and depositing their data in a manner that increases the value of their work. Another major factor in incentivisation is the certification of repositories, which is also an ongoing effort. The trustworthiness of repositories is crucial to encouraging data deposition and usage, but certifying them is a major undertaking, especially when considering the sheer number of repositories around the world.
However, as the examples shown here and many others illustrate, proper management of data will have huge benefits for society and for global initiatives such as the UN's SDGs. As society becomes ever more data-intensive, the solutions to many of the grand challenges will be discovered through proper long-term preservation and data management.