ICLR2026

ConvT3: Structured State Kernels for Convolutional State Space Models

Jaeyoung Hong, Yun Young Choi, Joohwan Ko, Minseon Gwak

摘要

Modeling long spatiotemporal sequences requires capturing both complex spatial correlations and temporal dependencies. Convolutional State Space Models (ConvSSMs) have been proposed to incorporate spatial modeling in State Space Models (SSMs) using the convolution of tensor-valued states and kernels. Yet, existing implementations remain limited to $1\times 1$ state kernels for computational feasibility, which limits the modeling capacity of ConvSSMs. We introduce a novel spatiotemporal model, ConvT3 (ConvSSM using Tridiagonal Toeplitz Tensors), designed to equivalently realize ConvSSMs with extended $3\times 3$ state kernels. ConvT3 structures a state kernel for its corresponding tensor to be composed as a structured SSM matrix on hidden state dimensions and a constrained tridiagonal Toeplitz tensor on spatial dimensions. We show that the structured tensor can be diagonalized, which enables efficient parallel training while leveraging $3\times 3$ state convolutions. We demonstrate that ConvT3 effectively embeds rich spatial and temporal information into the dynamics of tensor-valued states, achieving state-of-the-art performance on most metrics in long-range video generation and physical system modeling.