Time-MoE
Decoder-only foundation model with a sparse mixture-of-experts feed-forward layer. Only a subset of experts is activated per token, enabling billion-scale capacity at modest inference cost. Pretrained on Time-300B (>300B time points across nine domains).
Time-MoE is a family of decoder-only foundation models published as an ICLR 2025 Spotlight. Each transformer block replaces the dense feed-forward layer with a sparse mixture-of-experts layer: a router sends each token to only a few experts, so the total parameter count can grow without a matching increase in per-token compute. This is the same idea that powers Switch Transformer and Mixtral in language modelling, applied here to autoregressive time-series forecasting with context lengths of up to 4096 time steps.
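To make the routing idea concrete, here is a minimal sketch of a sparse mixture-of-experts feed-forward layer in plain Python. It is not Time-MoE's actual implementation: the class names (Expert, SparseMoE), the layer sizes, the ReLU experts, and the choice to renormalise the selected gate weights are all assumptions made for illustration. The point it shows is that every token is scored against all experts, but only the top_k experts actually run.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class Expert:
    """Hypothetical expert: a small feed-forward net (linear, ReLU, linear)."""

    def __init__(self, d_model, d_hidden, rng):
        self.w_in = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                     for _ in range(d_hidden)]
        self.w_out = [[rng.gauss(0, 0.1) for _ in range(d_hidden)]
                      for _ in range(d_model)]

    def __call__(self, x):
        hidden = [max(0.0, h) for h in matvec(self.w_in, x)]
        return matvec(self.w_out, hidden)


class SparseMoE:
    """Illustrative top-k routed layer; sizes and conventions are assumed."""

    def __init__(self, d_model, d_hidden, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.experts = [Expert(d_model, d_hidden, rng)
                        for _ in range(num_experts)]
        # Router: one score per expert, computed from the token embedding.
        self.gate = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                     for _ in range(num_experts)]

    def route(self, x):
        """Return [(expert_index, weight)] for the top_k experts of one token."""
        probs = softmax(matvec(self.gate, x))
        order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        chosen = order[: self.top_k]
        # Assumption: renormalise so the selected weights sum to 1.
        norm = sum(probs[i] for i in chosen)
        return [(i, probs[i] / norm) for i in chosen]

    def __call__(self, x):
        out = [0.0] * len(x)
        # Only the selected experts are evaluated; the rest cost nothing.
        for index, weight in self.route(x):
            expert_out = self.experts[index](x)
            out = [o + weight * e for o, e in zip(out, expert_out)]
        return out


layer = SparseMoE(d_model=8, d_hidden=16, num_experts=8, top_k=2)
token = [0.1 * i for i in range(8)]
routes = layer.route(token)   # two (expert_index, weight) pairs
output = layer(token)         # same width as the input token
```

With eight experts and top_k=2 in this toy setup, the layer stores eight experts' worth of weights but evaluates only two per token, which is the gap between total and active parameters.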
Pretraining uses Time-300B (over 300 billion time points across nine domains), and the model family described in the paper scales up to 2.4B parameters. The TS-Arena leaderboard runs the 50M and 200M active-parameter checkpoints.
Versions on TS-Arena
Each version below corresponds to one registered model id in the leaderboard. Click through to its detail page for per-model rankings, forecasts, and history.
- Time-MoE 50M (time-moe-50m), 50M params…
Smallest active variant; trained on Time-300B.
- Time-MoE 200M (time-moe-200m), 200M params…
Larger active variant of the same architecture and data.