A live benchmarking platform for Time Series Foundation Models using forecast pre-registration.
Research Paper
TS-Arena: A Live Forecast Pre-Registration Platform
Marcel Meyer, Sascha Kaltenpoth, Henrik Albers, Kevin Zalipski, Oliver Müller — Paderborn University, Data Analytics Group, 2026
Abstract. TS-Arena is a live benchmarking platform that evaluates Time Series Foundation Models (TSFMs) under a forecast pre-registration protocol: models must submit their forecasts before the ground-truth data exists. Because the evaluation target does not yet exist at submission time, test-set contamination and information leakage cannot occur. The platform continuously collects forecasts on 186 energy-sector time series organized into 14 challenge definitions, scores them with the Mean Absolute Scaled Error (MASE), and ranks models with an ELO rating system that reports confidence intervals.
TS-Arena Technical Report -- A Pre-registered Live Forecasting Platform
Marcel Meyer, Sascha Kaltenpoth, Kevin Zalipski, Henrik Albers, Oliver Müller — Paderborn University, Data Analytics Group, 2025
Abstract. While Time Series Foundation Models (TSFMs) offer transformative capabilities for forecasting, they simultaneously risk triggering a fundamental evaluation crisis. This crisis is driven by information leakage due to overlapping training and test sets across different models, as well as the illegitimate transfer of global patterns to test data. While the ability to learn shared temporal dynamics represents a primary strength of these models, their evaluation on historical archives often permits the exploitation of observed global shocks, which violates the independence required for valid benchmarking. We introduce TS-Arena, a platform that restores the operational integrity of forecasting by treating the genuinely unknown future as the definitive test environment. By implementing a pre-registration mechanism on live data streams, the platform ensures that evaluation targets remain physically non-existent during inference, thereby enforcing a strict global temporal split. This methodology establishes a moving temporal frontier that prevents historical contamination and provides an authentic assessment of model generalization. Initially applied within the energy sector, TS-Arena provides a sustainable infrastructure for comparing foundation models under real-world constraints. A prototype of the platform is publicly available.
TS-Arena runs continuously scheduled forecasting challenges on real-world energy data. Each new challenge round opens a registration window during which models submit forecasts for a future time period. Once the ground truth for that period becomes available, the submitted forecasts are evaluated automatically and the rankings are updated.
Pre-Registration
Forecasts must be submitted before ground truth exists, making data leakage structurally impossible.
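The timing rule behind pre-registration can be sketched in a few lines of Python. This is only an illustration and not the platform's actual code: the `ChallengeRound` class, its field names, and the `is_preregistered` helper are hypothetical. It assumes a forecast counts as pre-registered when it arrives inside the registration window and the window closes no later than the first timestamp to be forecast.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ChallengeRound:
    """Hypothetical description of one challenge round (names are illustrative)."""
    registration_opens: datetime
    registration_closes: datetime
    forecast_start: datetime  # first timestamp that has to be forecast


def is_preregistered(round_: ChallengeRound, submitted_at: datetime) -> bool:
    """Return True if the submission respects the pre-registration timing rule."""
    # The submission must arrive while the registration window is open.
    in_window = round_.registration_opens <= submitted_at <= round_.registration_closes
    # The window must close before any target value can have been observed.
    target_unobserved = round_.registration_closes <= round_.forecast_start
    return in_window and target_unobserved


# Example round with made-up dates: register on 1 March, forecast from 2 March.
example_round = ChallengeRound(
    registration_opens=datetime(2026, 3, 1, 0, 0, tzinfo=timezone.utc),
    registration_closes=datetime(2026, 3, 1, 12, 0, tzinfo=timezone.utc),
    forecast_start=datetime(2026, 3, 2, 0, 0, tzinfo=timezone.utc),
)
```

A submission timestamped after `registration_closes` would be rejected under this rule, even if the ground truth has not been published yet.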
MASE Scoring
Mean Absolute Scaled Error provides scale-independent accuracy scores comparable across time series.
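As a rough illustration, MASE divides the mean absolute error of a forecast by the mean absolute error of a naive forecast on the historical data. The sketch below follows the standard textbook definition. The function name `mase`, the `season` parameter, and the choice of a (seasonal) naive baseline are assumptions, since this page does not state which scaling the platform uses.

```python
from typing import Sequence


def mase(
    actual: Sequence[float],
    forecast: Sequence[float],
    history: Sequence[float],
    season: int = 1,
) -> float:
    """Mean Absolute Scaled Error with a (seasonal) naive in-sample baseline.

    `season=1` scales by the one-step naive forecast; a larger value scales by
    the seasonal naive forecast (assumed here, not confirmed by the source).
    """
    if len(actual) != len(forecast) or len(actual) == 0:
        raise ValueError("actual and forecast must be non-empty and equally long")
    if len(history) <= season:
        raise ValueError("history must contain more than `season` observations")

    # Mean absolute error of the submitted forecast.
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

    # Mean absolute error of the naive forecast on the history (the scale).
    naive_errors = [
        abs(history[t] - history[t - season]) for t in range(season, len(history))
    ]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("history is constant, so MASE is undefined")

    return mae / scale
```

A value below 1 means the forecast beat the naive baseline on average. Because the error is divided by a per-series scale, scores from series with different units or magnitudes can be compared.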
ELO Ranking
Pairwise ELO ratings with confidence intervals enable fair comparison between models over time.
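A minimal sketch of a pairwise rating update is shown below. It uses the classic Elo formula and assumes that, within one challenge round, the model with the lower MASE wins the comparison. The function names, the K-factor of 32, and the 400-point scale are conventional defaults chosen for illustration, not values confirmed for TS-Arena, and the confidence intervals mentioned above are not covered here.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the classic Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float,
    rating_b: float,
    mase_a: float,
    mase_b: float,
    k: float = 32.0,  # assumed K-factor, not taken from the source
) -> Tuple[float, float]:
    """Update both ratings after one pairwise comparison on the same round."""
    # Lower MASE is better; equal scores count as a draw.
    if mase_a < mase_b:
        score_a = 1.0
    elif mase_a > mase_b:
        score_a = 0.0
    else:
        score_a = 0.5

    # Shift ratings in proportion to how surprising the outcome was.
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta
```

With two models starting at the same rating, the winner gains half of `k` and the loser gives up the same amount, so the sum of the ratings stays constant.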
Data & Challenges
The benchmark covers 186 energy-sector time series drawn from European and North American grid operators and public energy-data providers (SMARD, EIA, Fingrid, ENTSO-E, GridStatus). The series are organized into 14 challenge definitions that vary in forecast frequency (15 min, 1 h) and horizon (1 day, 1 week). Challenges span electricity consumption and generation, so models are evaluated under a range of demand and supply conditions.
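To make the frequency and horizon settings concrete, the sketch below computes how many values a single forecast has to contain for a given combination. The `ChallengeDefinition` class and the example names are hypothetical. The four combinations simply pair the frequencies and horizons listed above and are not meant to reproduce the platform's actual 14 definitions.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class ChallengeDefinition:
    """Hypothetical challenge definition (names and fields are illustrative)."""
    name: str
    frequency: timedelta  # spacing between consecutive forecast values
    horizon: timedelta    # total length of the forecast period

    @property
    def steps(self) -> int:
        """Number of values one forecast submission must contain."""
        return self.horizon // self.frequency


examples = [
    ChallengeDefinition("15min-1day", timedelta(minutes=15), timedelta(days=1)),
    ChallengeDefinition("1h-1day", timedelta(hours=1), timedelta(days=1)),
    ChallengeDefinition("15min-1week", timedelta(minutes=15), timedelta(weeks=1)),
    ChallengeDefinition("1h-1week", timedelta(hours=1), timedelta(weeks=1)),
]

for definition in examples:
    print(definition.name, definition.steps)
# prints:
# 15min-1day 96
# 1h-1day 24
# 15min-1week 672
# 1h-1week 168
```

The spread from 24 to 672 values per submission shows how much the forecasting task changes across frequency and horizon settings.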
We are open to collaboration, for example to integrate additional live time series into TS-Arena. If you are interested, please contact us at DataAnalytics@wiwi.uni-paderborn.de.
Marcel Meyer, Sascha Kaltenpoth, Henrik Albers, Kevin Zalipski, Prof. Dr. Oliver Müller