
Data formats for preservation

When choosing data formats for preservation, the key considerations are longevity, interoperability, and alignment with the FAIR principles. In Horizon Europe, these are not optional best practices but core requirements, ensuring that research data and other outputs remain findable, accessible, interoperable, and reusable throughout and beyond the project lifecycle.

The choice between proprietary and standard formats is influenced by factors such as the equipment used, software and hardware environments, and disciplinary practices. However, Horizon Europe explicitly encourages the use of open, standardised, and interoperable formats, as these enable data exchange across systems, institutions, and countries, and support long-term reuse.

There is no guarantee that proprietary file formats will remain usable in the future: formats in common use today may become obsolete as technologies evolve. This risk is even greater for bespoke or project-specific formats, which may lack documentation, community adoption, or long-term software support.

For this reason, Horizon Europe places strong emphasis on planning data formats within the Data Management Plan (DMP), requiring researchers to justify their choices and ensure that data remain accessible and reusable over time, including through the use of compatible standards and sustainable preservation strategies.

Ultimately, choosing appropriate file formats is essential to prevent data loss, avoid technological obsolescence, and maximise the long-term value, reproducibility, and impact of research outputs in line with Open Science requirements.


Different types of research data are acquired, processed, and stored (preserved and/or archived) in different ways, often depending on disciplinary practices. When starting a new project and developing a Data Management Plan (DMP), one of the first decisions to make is which file formats will be used. In recent Horizon Europe guidance, this decision is explicitly linked to ensuring compliance with the FAIR principles (Findable, Accessible, Interoperable, Reusable) and must be clearly described in the DMP as part of the overall data lifecycle planning.

Many proprietary file formats act as “containers” for standard formats, providing additional functionalities tailored to specific software or instruments. While these may facilitate data processing and analysis, they can also limit interoperability and long-term accessibility. Under Horizon Europe requirements, researchers are expected to justify the choice of formats and, where possible, prioritise open, standardised, and interoperable formats to enable data sharing, reuse, and machine-actionability.

In addition, file formats can be either lossless or lossy, depending on whether data compression alters the original content. Lossless formats preserve all original information, while lossy formats reduce file size by discarding some data. Although lossy formats may be suitable for analysis or dissemination purposes, they are not always appropriate for long-term preservation. Horizon Europe guidelines emphasise that data should be curated and preserved in a way that maximises reusability, which typically requires retaining raw data or high-quality, non-destructive (lossless) formats.
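The distinction can be illustrated with a minimal Python sketch. It assumes a handful of made-up 8-bit sample values; zlib stands in for any lossless codec, and the simple quantisation step is only a toy stand-in for what real lossy formats do in a far more sophisticated way.

```python
import zlib


def lossless_roundtrip(data: bytes) -> bytes:
    """Compress and decompress with zlib (DEFLATE), a lossless codec."""
    return zlib.decompress(zlib.compress(data))


def lossy_roundtrip(samples, step=16):
    """Round each sample down to a multiple of `step`, discarding fine detail."""
    return [(value // step) * step for value in samples]


# Hypothetical 8-bit measurements, used only for illustration.
samples = [3, 18, 77, 130, 201, 255]
raw = bytes(samples)

restored = lossless_roundtrip(raw)
degraded = lossy_roundtrip(samples)

print(restored == raw)  # True: every original byte is recovered
print(degraded)         # [0, 16, 64, 128, 192, 240]: originals cannot be recovered
```

After the lossy step there is no way to reconstruct the original values from the stored ones, which is why a lossless copy or the raw data is usually the safer choice for preservation.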

Did you know? 
A key update compared to Horizon 2020 is that Data Management Plans are now mandatory for all projects generating or reusing data (not only within a pilot), and must be submitted, regularly updated, and treated as a living document throughout the project lifecycle.

Furthermore, Horizon Europe expands the scope of data management beyond datasets to include other research outputs (e.g. software, models, workflows), which must also be considered in terms of formats, standards, and long-term accessibility.

Overall, early decisions on file formats are not only technical but strategic, directly affecting the interoperability, accessibility, and long-term value of research outputs, while ensuring full alignment with Horizon Europe Open Science and FAIR data management requirements.

Did you know? 
FAIR data alone does not guarantee long-term preservation.
While FAIR principles support data management and sharing, they do not fully address the long-term preservation of research data, leaving important gaps in sustainability.

Did you know? 
Data preservation must happen throughout the entire research lifecycle.
Preservation is not just about storing data at the end—it requires continuous actions, planning, and monitoring from creation to long-term access.


For example, in biomedical imaging, the sheer variety of file formats in use prompted an initiative to make them interoperable. Bio-Formats, developed as part of the OMERO project, is a software tool that reads many established proprietary and standard image formats and converts between them. Image analysis software such as ImageJ (free and open source) has adopted Bio-Formats as a plugin, so that users can read and write image data regardless of its origin. However, such tools do not exist for every discipline, so researchers should consider storing acquired data in a standard format at the earliest opportunity. Many commercial and open source software packages can export data to standard formats, and this capability should be used.
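Where data can be accessed programmatically, the export to a standard format can be scripted. The sketch below assumes, purely for illustration, that measurements are already held as Python dictionaries (the field names are hypothetical). It writes them as comma-separated values using only the standard library and reads them back to check the round trip.

```python
import csv
import io


def export_csv(records, fieldnames):
    """Serialise a list of dicts as CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def import_csv(text):
    """Parse CSV text back into a list of dicts (all values become strings)."""
    return list(csv.DictReader(io.StringIO(text, newline="")))


# Hypothetical records standing in for data held by an analysis package.
records = [
    {"sample_id": "S1", "temperature_c": 21.5},
    {"sample_id": "S2", "temperature_c": 19.0},
]

csv_text = export_csv(records, ["sample_id", "temperature_c"])
restored = import_csv(csv_text)
print(restored[0])  # {'sample_id': 'S1', 'temperature_c': '21.5'}
```

Note that the values come back as strings: CSV carries column names but no data types or units, so these need to be recorded in accompanying documentation.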

Over the course of the digital revolution, a number of file formats have become recognised as the formats of choice for longevity and interoperability.

Please find below some useful links to resources about data formats for long-term storage:

As an example, the following table describes a variety of file formats for different disciplines that are either recommended or acceptable (from the UK Data Service):

Type of data: Tabular data with extensive metadata (variable labels, code labels, and defined missing values)
Recommended formats:
• SPSS portable format (.por)
• delimited text and command ('setup') file (SPSS, Stata, SAS, etc.)
• structured text or mark-up file of metadata information, e.g. DDI XML file
Acceptable formats:
• proprietary formats of statistical packages: SPSS (.sav), Stata (.dta), MS Access (.mdb/.accdb)

Type of data: Tabular data with minimal metadata (column headings, variable names)
Recommended formats:
• comma-separated values (.csv)
• tab-delimited file (.tab)
• delimited text with SQL data definition statements
Acceptable formats:
• delimited text (.txt) with characters not present in data used as delimiters
• widely-used formats: MS Excel (.xls/.xlsx), MS Access (.mdb/.accdb), dBase (.dbf), OpenDocument Spreadsheet (.ods)

Type of data: Geospatial data (vector and raster data)
Recommended formats:
• ESRI Shapefile (.shp, .shx, .dbf; .prj, .sbx, .sbn optional)
• geo-referenced TIFF (.tif, .tfw)
• CAD data (.dwg)
• tabular GIS attribute data
• Geography Markup Language (.gml)
Acceptable formats:
• ESRI Geodatabase format (.mdb)
• MapInfo Interchange Format (.mif) for vector data
• Keyhole Mark-up Language (.kml)
• Adobe Illustrator (.ai), CAD data (.dxf or .svg)
• binary formats of GIS and CAD packages

Type of data: Textual data
Recommended formats:
• Rich Text Format (.rtf)
• plain text, ASCII (.txt)
• eXtensible Mark-up Language (.xml) text according to an appropriate Document Type Definition (DTD) or schema
Acceptable formats:
• Hypertext Mark-up Language (.html)
• widely-used formats: MS Word (.doc/.docx)
• some software-specific formats: NUD*IST, NVivo and ATLAS.ti

Type of data: Image data
Recommended formats:
• TIFF 6.0 uncompressed (.tif)
Acceptable formats:
• JPEG (.jpeg, .jpg, .jp2) if original created in this format
• GIF (.gif)
• TIFF other versions (.tif, .tiff)
• RAW image format (.raw)
• Photoshop files (.psd)
• BMP (.bmp)
• PNG (.png)
• Adobe Portable Document Format (PDF/A, PDF) (.pdf)

Type of data: Audio data
Recommended formats:
• Free Lossless Audio Codec (FLAC) (.flac)
Acceptable formats:
• MPEG-1 Audio Layer 3 (.mp3) if original created in this format
• Audio Interchange File Format (.aif)
• Waveform Audio Format (.wav)

Type of data: Video data
Recommended formats:
• MPEG-4 (.mp4)
• OGG video (.ogv, .ogg)
• motion JPEG 2000 (.mj2)
Acceptable formats:
• AVCHD video (.avchd)

Type of data: Documentation and scripts
Recommended formats:
• Rich Text Format (.rtf)
• PDF/UA, PDF/A or PDF (.pdf)
• XHTML or HTML (.xhtml, .htm)
• OpenDocument Text (.odt)
Acceptable formats:
• plain text (.txt)
• widely-used formats: MS Word (.doc/.docx), MS Excel (.xls/.xlsx)
• XML marked-up text (.xml) according to an appropriate DTD or schema, e.g. XHTML 1.0

When writing a DMP, researchers are advised to refer to tables such as this to help decide the best file formats to use for their project and to state this clearly.
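A table like the one above can also be turned into a simple automated check of a project's files. The following is a minimal sketch, not an official tool: the extension sets are an illustrative subset of the table, the file names are hypothetical, and the category names are assumptions made for this example. Extensions whose status depends on the data type or file version (such as .tif, .txt, .pdf, or .xml) are deliberately left out, because an extension alone cannot settle them.

```python
from pathlib import PurePath

# Illustrative subset of the table above; extend it for your own discipline.
RECOMMENDED = {".csv", ".tab", ".por", ".rtf", ".odt", ".flac", ".gml", ".mp4"}
ACCEPTABLE = {".sav", ".dta", ".xls", ".xlsx", ".doc", ".docx",
              ".mp3", ".wav", ".png", ".gif", ".bmp", ".kml"}


def classify(filename):
    """Return 'recommended', 'acceptable', or 'review' based on the extension."""
    extension = PurePath(filename).suffix.lower()
    if extension in RECOMMENDED:
        return "recommended"
    if extension in ACCEPTABLE:
        return "acceptable"
    return "review"  # unknown or ambiguous: needs a human decision


def audit(filenames):
    """Group file names by the category assigned to their extension."""
    report = {"recommended": [], "acceptable": [], "review": []}
    for name in filenames:
        report[classify(name)].append(name)
    return report


# Hypothetical project files.
files = ["survey.csv", "interviews.DOCX", "figure1.png", "model.xyz"]
report = audit(files)
for category, names in report.items():
    print(category, names)
```

Files that land in the "review" group are candidates for conversion, or for an explicit justification in the DMP.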

Video:
Utrecht University. “Preserving research data in the optimal, technically correct way” (How to minimize the risk of losing data. Here you’ll learn which methods there are to preserve your research data in an optimal way)

Publications

Other resources