ICLR 2025

Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts

Xiaoming Shi, Shiyu Wang, Yuqi Nie, Dianqi Li, Zhou Ye, Qingsong Wen, Ming Jin

Abstract

Poster sections: ① Time Series Foundation Models; ② Why Mixture-of-Experts?; ③ Methodology; ④ Zero-shot Forecasting; ⑤ In-domain Forecasting; ⑥ Analysis & Visualizations; Resources & References.

Time series foundation models. Two modeling paradigms are shown. Language modeling: model with cross-entropy loss, e.g., Chronos, LLMTime, Time-MQA. Parametric densities: model with prior loss, e.g., TimesFM, Moirai, Time-MoE (ours).

Architecture overview. Given an input time series, (1) we first tokenize it into a sequence of data points, (2) which are then encoded. These tokens are processed through N stacked backbone layers, primarily consisting of causal multi-head self-attention and (3) sparse temporal mixture-of-experts layers. During training, (4) we optimize forecasting heads at multiple resolutions. For model inference, Time-MoE provides forecasts of flexible length by (5) dynamically scheduling these heads.

Zero-shot forecasting results. All results are from four different forecasting horizons: {96, 192, 336, 720}. Red: the best; blue: the second best.

Ablation studies. (Left) Average MSE for horizon-96 forecasting across six benchmarks, evaluated with different model components. (Right) Analysis of various multi-resolution forecasting configurations.

Scalability analysis. (Left) Comparison of dense and sparse models w.r.t. training and inference costs. (Right) Average MSE for horizon-96 forecasting across six benchmarks, comparing Time-MoE and dense models, both trained with varying data sizes.

Key takeaway. The MoE architecture makes TSFMs computationally efficient while maintaining high model capacity for time series tasks.
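The efficiency argument for sparse mixture-of-experts layers rests on evaluating only a few experts per token while keeping all experts' parameters available. The sketch below illustrates generic top-k routing for a single token in plain Python. It is not the Time-MoE implementation: the function names (`top_k_route`, `sparse_moe`), the toy experts, the gate weights, and the choice k = 2 are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits, k):
    """Pick the k highest-scoring experts (hypothetical helper).

    Returns (expert_index, weight) pairs; the weights are the selected
    experts' softmax probabilities, renormalized to sum to 1.
    """
    probs = softmax(gate_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def sparse_moe(token, experts, gate_weights, k=2):
    """Apply a toy sparse MoE layer to one token (a list of floats)."""
    # One gate logit per expert: dot product of the token with that expert's gate row.
    gate_logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    routes = top_k_route(gate_logits, k)
    out = [0.0] * len(token)
    for idx, weight in routes:
        # Only the selected experts are evaluated; the rest cost nothing.
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out, [idx for idx, _ in routes]


# Toy setup (assumed for illustration): four experts, two-dimensional tokens.
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
    lambda x: [0.5 * v for v in x],
]
gate_weights = [
    [1.0, 0.0],
    [0.0, 1.0],
    [-1.0, 0.0],
    [0.0, -1.0],
]
token = [3.0, 1.0]
output, active = sparse_moe(token, experts, gate_weights, k=2)
print("active experts:", active)
print("output:", output)
```

With these toy values the gate logits are 3, 1, -3, and -1, so only experts 0 and 1 run for this token. Compute per token therefore scales with k rather than with the total number of experts, which is the sense in which a sparse layer can hold more parameters than a dense layer of similar cost.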