Raw data, backup and versioning
By raw data we mean the original data that have been collected from a source and not yet processed or analysed. Raw data provide the foundation for any downstream analyses. In many cases the collected data are unique and impossible to reproduce, such as weather measurements taken at particular time points, or interviews. For this reason, such data should be safeguarded against loss. Moreover, raw data are typically stored in lossless formats, i.e. formats that do not discard any information when the data are saved, such as TIFF for image data as opposed to the lossy JPEG format. Finally, in some cases, raw data carry additional information that is specific to the brand and/or type of instrument used to capture them. For example, Leica microscopes use a proprietary file format that acts as a container for lossless image data together with metadata specific to Leica microscopes, which allows reading, writing and analysis through Leica software. See also our guide on file formats.
By processed data we mean data that has already undergone some kind of intervention. For instance, the data have been digitised, compressed, translated, transcribed, cleaned, validated, checked and/or anonymised.
By analysed data we mean data already processed, interpreted and analysed. Analysed data can assume several representations (text, tables, graphs, etc.), in order to facilitate a better understanding and communication of the data.
In most cases, one can also consider raw data as the official data, that is, the master copy of any given record (see also golden copy). As well as providing the starting point for derivatives generated downstream, the same raw data may feed several separate branches of analysis. Therefore, in a typical workflow, we recommend that you create a copy of the raw data to use as a "working copy". The original data should then be archived in an appropriate manner for long-term preservation. The working copy can then be processed and analysed without the risk of overwriting the original.
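The working-copy step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a prescribed procedure: the function names `sha256_of` and `make_working_copy` and the example paths are hypothetical, and write-protecting the original is one possible safeguard that does not replace proper archiving.

```python
import hashlib
import shutil
import stat
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_working_copy(raw_file: Path, working_dir: Path) -> Path:
    """Copy raw_file into working_dir, verify the copy, write-protect the original."""
    working_dir.mkdir(parents=True, exist_ok=True)
    working_file = working_dir / raw_file.name

    # copyfile copies the content only, so the working copy stays writable
    # even if the original has already been made read-only.
    shutil.copyfile(raw_file, working_file)

    # Check that the copy is byte-for-byte identical before relying on it.
    if sha256_of(raw_file) != sha256_of(working_file):
        raise OSError(f"Copy of {raw_file} does not match the original")

    # Assumption: removing write permission is an acceptable extra safeguard
    # against accidentally overwriting the original.
    raw_file.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return working_file
```

A call such as `make_working_copy(Path("raw/scan.tif"), Path("working"))` (hypothetical paths) would leave `raw/scan.tif` read-only and return the path of the new working file. The recorded checksum can also be stored alongside the archived original so that its integrity can be checked later.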
For more information on data formats please see this guide.
Firstly, one should be aware that backup is not the same as preservation or archiving. Once data reach a final state, preservation keeps them easily accessible, for example through a repository. Any data, raw or otherwise and including final products, that are deemed sufficiently important to be stored securely for a long period of time should be archived. Archived data can still be retrieved, but not in the same easily accessible manner offered by preservation.
Data may be preserved for several reasons: for further analysis or research; for their potential value for re-use; for their national or international status and quality; for their originality and/or uniqueness; because of the cost of producing them or the innovative nature of the research; for their importance to the history of science; for their relevance to non-academic uses such as cultural heritage; or because a funder requires it. By contrast, backing up data is mainly intended to prevent data loss during the active (analysis) stages of the curation lifecycle. Researchers should back up their data while working with them, and repositories do so when they preserve data. We recommend that important data are kept in at least three copies, on at least two different storage media, with at least one copy stored off-site. Moreover, where available, always use your institution's managed digital services so that backups can be automated. Commercial and non-commercial third-party storage options such as Dropbox are also currently popular, but there is no guarantee that such services will exist in perpetuity, and they raise questions about ownership of the data.
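The copy recommendation above (three copies, two media, one off-site, often called the 3-2-1 rule) is easy to check mechanically. The following Python sketch shows one way to do so; the `BackupCopy` class, its field names and the example locations are hypothetical and only serve to illustrate the rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BackupCopy:
    """One stored copy of a dataset (all fields are illustrative)."""
    location: str   # hypothetical label, e.g. "office-pc"
    medium: str     # e.g. "internal-disk", "tape", "network-storage"
    off_site: bool  # True if the copy is kept away from the main site


def meets_copy_recommendation(copies: List[BackupCopy]) -> bool:
    """Check for at least three copies, two distinct media and one off-site copy."""
    distinct_media = {copy.medium for copy in copies}
    has_off_site = any(copy.off_site for copy in copies)
    return len(copies) >= 3 and len(distinct_media) >= 2 and has_off_site


# Hypothetical plan: two local copies on different media plus one off-site copy.
plan = [
    BackupCopy("office-pc", "internal-disk", off_site=False),
    BackupCopy("lab-cabinet", "external-disk", off_site=False),
    BackupCopy("institutional-storage", "network-storage", off_site=True),
]
print(meets_copy_recommendation(plan))  # True
```

A check like this only confirms that a plan is complete on paper. It says nothing about whether the copies are current or restorable, which still has to be tested separately.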
Here is a checklist to help define a strategy for creating backups:
During the course of analysing data, it is likely that you will create several derivatives of the working copy, sometimes even automatically through scripts. Whether any of these versions is valuable and should be kept for long-term preservation is for the data owner to decide. However, we recommend reviewing versions regularly and discarding those that are not needed for purposes such as verification, reproducibility or transparency.
You might want to keep all versions, but these can be very large files that take up valuable storage space. Versioning is a key part of any workflow and appropriate measures should be taken to enable it, whether simply through small, systematic alterations to file names or by using dedicated version control tools. The latter are also commonly used in large projects to allow multiple users to check out and check in the same file after making alterations. Version control also supports provenance, i.e. documenting and inspecting the history of changes.
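As an illustration of versioning through file names, the following Python sketch derives the next version of a file name from the names that already exist. The `_vNN` suffix convention and the function name `next_version_name` are assumptions made for this example, not a standard. A dedicated version control tool remains preferable where one is available.

```python
import re
from pathlib import Path
from typing import List

# Assumed convention: "<stem>_v<number><extension>", e.g. "results_v02.csv".
VERSION_PATTERN = re.compile(r"^(?P<stem>.+)_v(?P<num>\d+)$")


def next_version_name(existing: List[str], base: str) -> str:
    """Return the next versioned name for base, given the existing file names."""
    base_path = Path(base)
    highest = 0
    for name in existing:
        candidate = Path(name)
        match = VERSION_PATTERN.match(candidate.stem)
        # Only count files that share the base name and the extension.
        if (
            match
            and match.group("stem") == base_path.stem
            and candidate.suffix == base_path.suffix
        ):
            highest = max(highest, int(match.group("num")))
    # Zero-pad to two digits so that names sort in version order up to v99.
    return f"{base_path.stem}_v{highest + 1:02d}{base_path.suffix}"


names = ["results_v01.csv", "results_v02.csv", "other_v09.csv"]
print(next_version_name(names, "results.csv"))  # results_v03.csv
```

Saving each new derivative under the returned name, rather than overwriting the previous file, keeps earlier versions available until they are reviewed and discarded.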
Just as with backups, the first step is to find out what your organisation provides.