News
UDFBench: A new benchmark for UDF queries
In today's data-driven research landscape, scientists and developers often need to run custom code inside databases to analyse research outputs and their metadata. UDFBench provides a specialised testing framework that evaluates the performance of User-Defined Function (UDF) queries across different data processing systems. Developed by researchers at Athena Research Center and the University of Athens, this benchmark uses real-world data from the OpenAIRE Graph to study, in a systematic and reproducible way, how modern data engines handle user-defined procedural code. In doing so, it helps identify which database systems perform best for specific research needs.
This research was recognised with the VLDB Best Experiment Analysis & Benchmark Paper Award, highlighting its contribution to the data community.
About the research
Data processing engines are increasingly being extended with user-defined functions (UDFs) to enable custom analytics, data enrichment, machine learning and AI workflows. However, different systems handle these functions with varying levels of efficiency. Traditional benchmarking methods primarily focus on relational queries, providing inadequate support for evaluating UDF performance. UDFBench fills this gap by:
- Providing parameterised benchmark queries that embed UDFs in realistic workflows over real-world datasets.
- Supporting multiple data processing systems with a unified execution environment.
- Offering tools to identify and measure key performance choke points in UDF queries.
By making UDF behaviour observable and comparable across engines, UDFBench helps both researchers and practitioners assess efficiency, scalability, and optimisation opportunities.
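To make the idea concrete, here is a minimal sketch of the kind of query UDFBench measures: a scalar UDF registered in a data engine and invoked from SQL. The example uses Python's built-in sqlite3 module purely as an illustration; the table, function, and data are hypothetical and not part of UDFBench itself.

```python
import sqlite3

# A simple scalar UDF: count the words in a text field.
def word_count(text):
    return len(text.split()) if text else 0

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE publication (id INTEGER, abstract TEXT)")
conn.executemany(
    "INSERT INTO publication VALUES (?, ?)",
    [(1, "Open scholarly data at scale"), (2, "Benchmarking UDF queries")],
)

# Register the Python function so SQL can call it by name.
conn.create_function("word_count", 1, word_count)

# A UDF query: the engine must cross the SQL/Python boundary once per row,
# which is exactly the kind of overhead UDFBench quantifies.
rows = conn.execute(
    "SELECT id, word_count(abstract) FROM publication ORDER BY id"
).fetchall()
print(rows)  # -> [(1, 5), (2, 3)]
```

How efficiently an engine handles that per-row boundary crossing varies widely between systems, which is why a dedicated benchmark is needed.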
Who benefits from this new benchmark
- Database researchers gain a controlled environment to test hypotheses on UDF execution.
- System developers can use UDFBench to identify bottlenecks and validate optimisations.
- Data practitioners benefit from insights on which engines are best suited for specific types of UDF-heavy workloads.
OpenAIRE integration
UDFBench leverages real-world data from the OpenAIRE Graph, one of the largest open scholarly knowledge graphs. It operates on a carefully curated subset of this graph and includes benchmark queries inspired by OpenAIRE’s operations. While the OpenAIRE platform employs over 150 UDFs for information extraction, text mining, and analytics across more than 130M publications, UDFBench adapts these workflows to systematically evaluate UDF performance across multiple data processing engines. The benchmark focuses on extracting key data such as publication metadata, abstracts, author metadata, view statistics, and links to related entities. OpenAIRE’s scale, heterogeneity, and open availability make it an ideal stress test for UDF-intensive queries, ensuring that UDFBench reflects realistic research-data scenarios.
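As a rough sketch of what such information-extraction workloads look like, the hypothetical UDF below pulls a DOI-like identifier out of free text and is then used in a SQL filter, again using sqlite3 only for illustration; it is not one of OpenAIRE's actual UDFs.

```python
import re
import sqlite3

# Hypothetical text-mining UDF in the spirit of OpenAIRE's workflows:
# extract the first DOI-like identifier from a piece of text.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s]+")

def extract_doi(text):
    match = DOI_PATTERN.search(text or "")
    return match.group(0) if match else None

conn = sqlite3.connect(":memory:")
conn.create_function("extract_doi", 1, extract_doi)
conn.execute("CREATE TABLE publication (id INTEGER, abstract TEXT)")
conn.executemany(
    "INSERT INTO publication VALUES (?, ?)",
    [
        (1, "See 10.1145/3722212.3725139 for the demo paper."),
        (2, "No identifier mentioned here."),
    ],
)

# Using the UDF in the WHERE clause forces the engine to evaluate it
# per row, a typical choke point in UDF-heavy analytics.
rows = conn.execute(
    "SELECT id, extract_doi(abstract) FROM publication "
    "WHERE extract_doi(abstract) IS NOT NULL"
).fetchall()
print(rows)  # -> [(1, '10.1145/3722212.3725139')]
```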
Award recognition
This research has been recognised for its contributions to experimental methodology and benchmarking, highlighting UDFBench’s value to the broader database community. The work received the VLDB Best Experiment, Analysis & Benchmark Paper Award at VLDB 2025 in London (September 1–5), and its toolbox was also showcased as a demonstration paper at ACM SIGMOD 2025 in Berlin (June 22–27). Both VLDB and SIGMOD are widely regarded as the premier conferences in database research.
Read the full study
[1] Yannis Foufoulas, Theoni Palaiologou, and Alkis Simitsis. 2025. The UDFBench Benchmark for General-Purpose UDF Queries. Proceedings of the VLDB Endowment 18, 9, 2804–2817. https://www.vldb.org/pvldb/vol18/p2804-foufoulas.pdf
[2] Yannis Foufoulas, Theoni Palaiologou, and Alkis Simitsis. 2025. UDFBench: A Tool for Benchmarking UDF Queries on SQL Engines. In Companion of the 2025 International Conference on Management of Data (SIGMOD-Companion ’25), June 22–27, 2025, Berlin, Germany. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3722212.3725139
GitHub release
The UDFBench codebase, benchmark specifications, and usage documentation are openly available on GitHub (https://github.com/athenarc/UDFBench). Researchers and practitioners can download the benchmark, reproduce the published experiments, and extend it with their own workloads or data engines.