ICLR 2026
Convergence of Muon with Newton-Schulz
Gyu Yeol Kim, Min-hwan Oh
Cited by 14
Abstract
We analyze Muon as originally proposed and used in practice, that is, with the momentum orthogonalized by a few Newton-Schulz steps. Prior theoretical results replace this key step in Muon with an exact SVD-based polar factor. We prove that Muon with Newton-Schulz converges to a stationary point at the same rate as the SVD-polar idealization, up to a constant factor that depends on the number q of Newton-Schulz steps. We further analyze this constant factor and prove that it converges to 1 doubly exponentially in q and improves with the degree of the polynomial that Newton-Schulz uses to approximate the orthogonalization direction. We also prove that Muon removes the typical square-root-of-rank loss incurred by its vector-based counterpart, SGD with momentum. Our results explain why Muon with a few low-degree Newton-Schulz steps matches exact-polar (SVD) behavior at a much lower wall-clock cost, and they quantify how much orthogonalizing the momentum matrix via Newton-Schulz gains over the vector-based optimizer. Overall, our theory justifies the practical Newton-Schulz design of Muon, narrowing the gap between its practice and its theory.
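To make the orthogonalization step concrete, the following is a minimal sketch of a Newton-Schulz iteration in pure Python. It assumes the classical cubic update X <- 1.5 X - 0.5 X X^T X applied to a Frobenius-normalized matrix; the abstract does not state the polynomial or coefficients, so these choices, the helper names, and the 2x2 `momentum` matrix are illustrative assumptions rather than the authors' configuration. The sketch measures how far X X^T is from the identity after q steps.

```python
# Sketch only: classical cubic Newton-Schulz iteration, X <- 1.5 X - 0.5 X X^T X.
# The coefficients (1.5, -0.5) are an assumption; the paper's polynomial may differ.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz(m, q):
    """Approximate the polar factor of m using q cubic Newton-Schulz steps."""
    fro = sum(v * v for row in m for v in row) ** 0.5
    # Normalizing by the Frobenius norm puts every singular value in (0, 1],
    # which lies inside the convergence region of the cubic iteration.
    x = [[v / fro for v in row] for row in m]
    for _ in range(q):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)]
             for ra, rb in zip(x, xxt_x)]
    return x


def orthogonality_error(x):
    """Frobenius norm of X X^T - I (zero for an exactly orthogonal X)."""
    g = matmul(x, transpose(x))
    n = len(g)
    return sum((g[i][j] - (i == j)) ** 2 for i in range(n) for j in range(n)) ** 0.5


# Hypothetical momentum matrix, chosen only for illustration.
momentum = [[2.0, 1.0], [0.5, 3.0]]
errors = [orthogonality_error(newton_schulz(momentum, q)) for q in (0, 2, 5, 10)]
print(errors)
```

Under these assumptions each singular value s is mapped to 1.5 s - 0.5 s^3 at every step, so its distance from 1 is roughly squared per step once it is close to 1. This is consistent with the doubly exponential dependence on q described in the abstract, although the sketch does not reproduce the paper's constant factor.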