WWW 2026

Rethinking MoE with Retrieval-Memory Synergy: Towards Efficient Expert Coordination

Wanjie Tao, Qun Dai, Yantong Lv, Quan Lu, Ning Jiang, Zulong Chen

Abstract

Mixture-of-Experts (MoE) models are central to scaling Large Language Models (LLMs), but their routing is stateless and compute-intensive: it repeatedly re-explores expert assignments for similar inputs, causing computational redundancy and unstable behavior in web-scale applications such as search and dialogue. We reframe expert routing as a retrieval-augmented process and propose the Retrieval-Memory Synergy Mixture-of-Experts (RMS-MoE). RMS-MoE integrates a Co-Activation Memory (CAM) that stores and retrieves effective expert teams, together with a learnable, input-dependent gate that fuses the retrieved priors with live routing predictions, enabling consistent expert coordination for semantically related inputs. Extensive experiments on web-scale QA and dialogue tasks show that RMS-MoE achieves a 26% latency reduction, a 2.7-point accuracy gain, and a 3.3% improvement in routing stability. These results demonstrate that architectural memory is a principled path toward more efficient, stable, and scalable LLMs.
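
To make the routing idea concrete, the following is a minimal sketch of memory-augmented routing, not the authors' implementation. The abstract does not specify how CAM is keyed, how priors are formed, or how the gate is parameterized, so everything here is an assumption made for illustration: the class `CoActivationMemory`, the functions `gate_value` and `route`, cosine-similarity retrieval, a similarity-weighted vote to build the prior, a sigmoid gate over the input, and convex mixing of the prior with the live router distribution.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class CoActivationMemory:
    """Hypothetical stand-in for CAM: input embedding -> expert team."""

    def __init__(self, num_experts):
        self.num_experts = num_experts
        self.entries = []  # list of (key embedding, tuple of expert ids)

    def store(self, key, team):
        self.entries.append((list(key), tuple(team)))

    def retrieve(self, query, top_n=2):
        """Build a prior over experts from the top_n most similar entries.

        Returns None when the memory is empty or nothing is similar.
        """
        if not self.entries:
            return None
        nearest = sorted(
            self.entries, key=lambda e: cosine(query, e[0]), reverse=True
        )[:top_n]
        prior = [0.0] * self.num_experts
        for key, team in nearest:
            weight = max(cosine(query, key), 0.0)  # ignore negative matches
            for expert in team:
                prior[expert] += weight
        total = sum(prior)
        return [p / total for p in prior] if total > 0 else None


def gate_value(x, gate_w, gate_b):
    """Input-dependent gate in (0, 1): sigmoid of a linear map (assumed form)."""
    z = sum(w * v for w, v in zip(gate_w, x)) + gate_b
    return 1.0 / (1.0 + math.exp(-z))


def route(x, router_logits, memory, gate_w, gate_b, k=2):
    """Fuse the retrieved prior with the live router output, then pick top-k."""
    live = softmax(router_logits)
    prior = memory.retrieve(x)
    if prior is None:
        fused = live  # no usable memory: fall back to plain routing
    else:
        g = gate_value(x, gate_w, gate_b)
        fused = [g * p + (1.0 - g) * l for p, l in zip(prior, live)]
    team = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)[:k]
    return sorted(team), fused


if __name__ == "__main__":
    cam = CoActivationMemory(num_experts=4)
    cam.store([1.0, 0.0], team=(0, 2))
    cam.store([0.0, 1.0], team=(1, 3))
    # An uninformative live router; the retrieved prior decides the team.
    team, fused = route([0.9, 0.1], [0.0, 0.0, 0.0, 0.0], cam, [0.0, 0.0], 0.0)
    print(team, fused)
```

In this toy setup the query lies close to the first stored key, so the fused distribution favors experts 0 and 2 even though the live router is uniform. In a trained model the gate weights would be learned, and the memory would be populated from expert teams that proved effective; both are outside the scope of this sketch.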