ICLR2026
CogMoE: Signal-Quality–Guided Multimodal MoE for Cognitive Load Prediction
Aamir Bader Shah, Yu Wen, Renjie Hu, Jiefu Chen, Jose L Contreras-Vidal, Xuqing Wu, Xin Fu
Abstract
The poor and variable quality of physiological signals fundamentally constrains reliable cognitive load (CL) prediction in real-world settings. In safety-critical tasks such as driving, degraded signal quality can severely compromise prediction accuracy, limiting the deployment of existing models outside controlled lab conditions. To address this challenge, we propose CogMoE, a signal-quality-guided Mixture-of-Experts (MoE) framework that dynamically adapts to heterogeneous and noisy inputs. CogMoE replaces conventional modality-based fusion with a quality-aware gating mechanism that integrates EEG, ECG, EDA, and gaze according to their estimated signal quality, shifting the basis of multimodal modeling from modality identity to signal quality. The framework operates in two stages: (1) quality-aware multimodal synchronization and recovery to mitigate artifacts, temporal misalignment, and missing data, and (2) signal-quality-specific expert modeling via a cross-modal MoE transformer that regulates information flow based on signal quality. To further improve stability, we introduce CORTEX Loss, which balances task accuracy, quality-aware representation refinement and expert utilization under noise. Experiments on CL-Drive and ADABase demonstrate that CogMoE outperforms strong baselines across all modality combinations and sequence lengths, consistently delivering improvements across diverse signal-quality conditions. Our code is publicly available at https://github.com/shahaamirbader/CogMoE.
INTRODUCTION
Accurate prediction of cognitive load (CL) is critical in safety-critical domains such as driving, aviation, and healthcare, where elevated mental demand degrades decision-making and reaction time. Recent advances in multimodal sensing (EEG, ECG, EDA, and gaze) have made large-scale CL prediction feasible. However, the fundamental bottleneck is not the lack of sensors or models, but the variable quality of physiological signals.
In real-world conditions, these signals are often noisy, misaligned, or partially missing as a result of motion artifacts, electrode drift, sensor dropout, and other well-documented physiological noise sources (Giangrande et al., 2024; Anwer et al., 2024). Unlike controlled laboratory studies, practical deployments must cope with heterogeneous, unstable input streams, where a single corrupted modality can severely compromise prediction accuracy. Moreover, unlike traditional multimodal setups where modalities provide complementary information, EEG, ECG, EDA, and gaze in CL prediction largely reflect redundant views of the same underlying cognitive process (Martínez Vásquez et al., 2023). Thus, signal quality, rather than sensor availability or model capacity, is the true limiting factor for accurate and reliable CL prediction.
Existing approaches to CL prediction have largely focused on improving classification accuracy through single-modality modeling or naïve multimodal fusion (Angkan et al., 2024; Islam et al., 2020). Traditional machine learning methods treat signals independently, while recent transformer-based models have demonstrated cross-modal integration. Yet, two critical limitations remain. On the data side, many approaches assume clean inputs, neglecting the pre-processing and recovery needed to handle artifacts, temporal inconsistencies, and missing segments. On the model side,
* Equal contribution.