ICLR 2026

UNDERSTANDING TRANSFORMERS FOR TIME SERIES FORECASTING: A CASE STUDY ON MOIRAI

Dennis Wu, Yihan He, Yuan Cao, Jianqing Fan, Han Liu

Abstract

We give a comprehensive theoretical analysis of transformers as time series prediction models, with a focus on MOIRAI (Woo et al., 2024). We study its approximation and generalization capabilities. First, we demonstrate that there exist transformers that fit an autoregressive model on input univariate time series via gradient descent. We then analyze MOIRAI, one of the state-of-the-art multivariate time series prediction models, which is capable of modeling an arbitrary number of covariates. We prove that MOIRAI is capable of automatically fitting autoregressive models with an arbitrary number of covariates, offering insights into its design and empirical success. For generalization, we establish learning bounds for pretraining when the data satisfies Dobrushin's condition. Experiments support our theoretical findings, highlighting the efficacy of using transformers for time series forecasting.

* Equal contribution

PROBLEM SETUP

This section presents background and formal definitions for the transformer model, and then introduces the autoregressive model.

Transformers. We consider a sequence of $N$ input vectors $\{h_i\}_{i=1}^{N} \subset \mathbb{R}^{D}$, written compactly as $H = [h_1, \dots, h_N] \in \mathbb{R}^{D \times N}$. Given any $H \in \mathbb{R}^{D \times N}$, we define the attention layer as follows.

Definition 2.1 (Attention layer). A self-attention layer with $M$ heads is denoted as $\mathrm{Attn}^{\dagger}_{\theta_0}(\cdot)$, with parameters $\theta_0 = \{(V_m, Q_m, K_m)\}_{m=1}^{M}$. The self-attention layer processes any given input sequence $H \in \mathbb{R}^{D \times N}$ as

$$\mathrm{Attn}^{\dagger}_{\theta_0}(H) := H + \frac{1}{N} \sum_{m=1}^{M} (V_m H) \times \sigma\big( (Q_m H)^{\top} (K_m H) \big),$$

where $\sigma(t) := \mathrm{ReLU}(t)/N$ is the ReLU function normalized by $N$, applied entrywise.

Next, we introduce the any-variate attention, which Woo et al. (2024) use to replace the standard attention in transformers. The any-variate attention introduces two learnable variables: Attention