ICLR2026

Ghost in the Cloud: Your Geo-Distributed Large Language Models Training is Easily Manipulated

Zichen TANG, Zhenheng Tang, Gaoning Pan, Buhua Liu, Xin He, Kunfeng Lai, Xiaowen Chu, Bo Li

摘要

Geo-distributed training and Federated Learning (FL) provide viable solutions to address the substantial data and computational resource needs associated with training large language models (LLMs). However, we empirically demonstrate that a single attacker can significantly compromise the safety alignment of LLMs through malicious training, and existing defenses like robust aggregation or trustbased frameworks fail under this setting due to data heterogeneity. We identify two existing server-side defense strategies that effectively counter naive jailbreak attacks: Task Performance Check (TPC), which filters out model updates with low downstream performance, and Malicious Output Scrutiny (MOS), which detects harmful outputs by prompting uploaded models with malicious queries. To evade both defenses, we design a trigger-based jailbreak variant that preserves downstream performance using a novel regularization method to limit the excessive model updates on jailbreak datasets. We further conceal malicious triggers by mixing the malicious dataset with pseudo-contrastive safety-aligned answers to maintain the original safety alignment. Experiments on several widely used safetyaligned LLMs show that CloudGhost can consistently implant triggers into the global model without degrading downstream performance, achieving 74-93% attack success rate (ASR) and below 5% detection true rate (DTR).