ICLR 2026

pFedMMA: Personalized Federated Fine-Tuning with Multi-Modal Adapter for Vision-Language Models

Sajjad Ghiasvand, Mahnoosh Alizadeh, Ramtin Pedarsani

Cited by 1

Abstract

Vision-Language Models (VLMs) such as CLIP have demonstrated remarkable generalization in zero- and few-shot settings, but adapting them efficiently to decentralized, heterogeneous data remains a challenge. While prompt tuning has emerged as a popular parameter-efficient approach in personalized federated learning, existing methods often sacrifice generalization in favor of personalization, struggling particularly on unseen classes or domains. In this work, we propose pFedMMA, a personalized federated learning framework that leverages multi-modal adapters for vision-language tasks. Each adapter contains modality-specific up- and down-projection layers alongside a globally shared projection that aligns cross-modal features. Our optimization strategy allows clients to adapt locally to personalized data distributions while collaboratively training the shared projection to improve global generalization. This design is also communication-efficient, as only the shared component is exchanged during communication rounds. Through extensive experiments across eleven datasets, including domain- and label-shift scenarios, we show that pFedMMA achieves state-of-the-art trade-offs between personalization and generalization, outperforming recent federated prompt tuning methods. Code is available at https://github.com/sajjad-ucsb/pFedMMA .
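
The communication pattern described in the abstract, where only the shared projection leaves each client, can be illustrated with a minimal sketch. It is not taken from the authors' repository. The flat parameter layout, the key names `down`, `shared`, and `up`, the function names, and the FedAvg-style weighted average are all assumptions made for illustration; the abstract does not state how the server aggregates the shared component.

```python
from typing import Dict, List

Vector = List[float]
Adapter = Dict[str, Vector]  # hypothetical flat layout: 'down', 'shared', 'up'


def aggregate_shared(clients: List[Adapter], weights: List[float]) -> Vector:
    """Weighted average of each client's 'shared' entry (assumed FedAvg-style)."""
    total = sum(weights)
    dim = len(clients[0]["shared"])
    return [
        sum(w * c["shared"][i] for c, w in zip(clients, weights)) / total
        for i in range(dim)
    ]


def communication_round(clients: List[Adapter], weights: List[float]) -> None:
    """Average the shared parts and broadcast the result back to every client."""
    new_shared = aggregate_shared(clients, weights)
    for c in clients:
        # Only 'shared' is overwritten; 'down' and 'up' stay local and personalized.
        c["shared"] = list(new_shared)


# Toy example: two clients, weighted by an assumed local sample count.
clients = [
    {"down": [1.0, 2.0], "shared": [0.0, 4.0], "up": [5.0]},
    {"down": [9.0, 8.0], "shared": [2.0, 0.0], "up": [7.0]},
]
communication_round(clients, weights=[1.0, 3.0])
```

After the round, both clients hold the same shared vector, while their modality-specific parameters are untouched. The per-round payload therefore scales with the size of the shared projection alone, not the full adapter.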