Foundation Model · Time-MoE collaboration

Time-MoE

Decoder-only foundation model with a sparse mixture-of-experts FFN. Only a subset of experts is activated per token, enabling billion-scale capacity at modest inference cost. Pretrained on Time-300B (>300B time points across nine domains).

Time-MoE is a family of decoder-only foundation models, published as an ICLR 2025 Spotlight paper. Each transformer block replaces the dense feed-forward layer with a mixture-of-experts layer: a router sends each token to only a few experts, so the total parameter count can grow without a matching increase in per-token compute. This is the same idea that powers Switch Transformer and Mixtral in language modelling, applied here to autoregressive time-series forecasting with a context of up to 4096 steps.
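The routing idea can be made concrete with a small NumPy sketch. It is illustrative only and is not Time-MoE's actual implementation: the function name `moe_ffn`, the ReLU experts, the layer sizes, and the Mixtral-style gating (softmax over the selected experts' logits) are all assumptions chosen for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_ffn(tokens, router_w, experts, top_k=2):
    """Sparse MoE feed-forward sketch: each token runs through only top_k experts.

    tokens:   (n_tokens, d_model)
    router_w: (d_model, n_experts)
    experts:  list of (w_in, w_out) with shapes (d_model, d_ff) and (d_ff, d_model)
    Returns the layer output and the number of (token, expert) evaluations.
    """
    logits = tokens @ router_w                           # (n_tokens, n_experts)
    top_idx = np.argsort(logits, axis=-1)[:, -top_k:]    # top_k expert ids per token
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    gates = softmax(top_logits, axis=-1)                 # weights over selected experts only

    out = np.zeros_like(tokens)
    calls = 0
    for e, (w_in, w_out) in enumerate(experts):
        token_ids, slot = np.nonzero(top_idx == e)       # tokens routed to expert e
        if token_ids.size == 0:
            continue                                     # unused expert costs nothing
        hidden = np.maximum(tokens[token_ids] @ w_in, 0.0)  # ReLU (an assumption)
        out[token_ids] += gates[token_ids, slot][:, None] * (hidden @ w_out)
        calls += token_ids.size
    return out, calls

# Toy sizes, chosen arbitrarily for the demonstration.
rng = np.random.default_rng(0)
d_model, d_ff, n_experts, n_tokens = 16, 32, 8, 5
tokens = rng.normal(size=(n_tokens, d_model))
router_w = rng.normal(size=(d_model, n_experts))
experts = [
    (rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model)))
    for _ in range(n_experts)
]
out, calls = moe_ffn(tokens, router_w, experts, top_k=2)
print(out.shape, calls, n_tokens * n_experts)
```

With 8 experts and `top_k=2`, the layer stores eight experts' worth of weights but performs only `n_tokens * top_k` expert evaluations instead of `n_tokens * n_experts`. This is the gap between total and active parameters.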

Pretraining uses Time-300B, a corpus of over 300 billion time points across nine domains, and the paper scales the model family up to 2.4B parameters. The TS-Arena leaderboard runs the 50M and 200M active-parameter checkpoints.
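Autoregressive forecasting with a bounded context, as described above, can be sketched as a simple rollout loop. This is a generic illustration rather than the Time-MoE inference code: `rollout`, `drift`, and the `predict_next` callback are hypothetical names, and the one-step predictor here is a naive stand-in for a real model.

```python
import numpy as np

MAX_CONTEXT = 4096  # context limit quoted in the description above

def rollout(history, predict_next, horizon, max_context=MAX_CONTEXT):
    """Forecast `horizon` steps by feeding each prediction back as input.

    predict_next is any callable mapping a 1-D context array to the next value.
    """
    window = list(history[-max_context:])    # keep only the most recent points
    forecast = []
    for _ in range(horizon):
        y = predict_next(np.asarray(window))
        forecast.append(y)
        window.append(y)                      # the prediction becomes context
        window = window[-max_context:]        # drop the oldest point if over the limit
    return np.asarray(forecast)

def drift(context):
    # Hypothetical stand-in model: extend the last observed step change.
    return float(context[-1] + (context[-1] - context[-2]))

history = np.arange(5000.0)                   # longer than the context limit
forecast = rollout(history, drift, horizon=3)
print(forecast)
```

A real deployment would replace `drift` with a call into the forecasting model; the truncation logic stays the same regardless of the predictor.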

Versions on TS-Arena

Each version below corresponds to one registered model id in the leaderboard. Click through to its detail page for per-model rankings, forecasts, and history.

  • Time-MoE 50M
    time-moe-50m
    50M active params

    Smallest active variant; trained on Time-300B.

  • Time-MoE 200M
    time-moe-200m
    200M active params

    Larger active variant of the same architecture and data.