WWW 2026

Task-Aware Cloud-End Offloading for Vision-Language Model Serving via Dynamic Modality-Specific Adapter Scheduling

Zian Wang, Ziyi Wang, Jie Xing, Yaya Wei, Ziyan Zhong, Lanshan Zhang

Abstract

Large-scale vision-language models (VLMs) enable powerful cross-modal understanding and generation, driving rapidly growing demand for online inference services. However, cloud-centric serving often suffers from high latency, rising costs, and network dependency, while purely on-device deployment is constrained by limited memory and reduced accuracy on complex tasks. To address this accuracy–latency–cost trilemma, we propose ShiftVL, a task-aware end–cloud serving framework that shifts suitable execution to the end device with a cloud fallback. ShiftVL serves high-frequency requests on an end-side small VLM enhanced with ViTexLoRA, a modality-disentangled parameter-efficient tuning method that preserves cross-modal alignment, while routing low-frequency or complex requests to a cloud-hosted large VLM for higher accuracy. Under tight device budgets, ShiftVL employs a predictive adapter scheduler that combines LRU-style caching with imitation learning to pre-load task-specific adapters. Experiments with InternVL models show that ShiftVL reduces cloud cost by up to 76.3% and latency by up to 42.9% while maintaining high multi-task accuracy, demonstrating its practicality for real-world vision-language model serving.
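As an illustration of the serving flow described above, the following is a minimal sketch, not the ShiftVL implementation. It assumes a fixed number of adapter slots on the device, treats adapters as opaque handles, and takes the predicted next task from the caller instead of from an imitation-learned predictor. The names AdapterCache, route, and the task labels are hypothetical.

```python
from collections import OrderedDict


class AdapterCache:
    """LRU cache of task adapters with a prefetch hook (illustrative only)."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._slots = OrderedDict()  # task -> adapter handle, least recent first

    def _load(self, task):
        # Placeholder for reading adapter weights from device storage.
        return f"adapter:{task}"

    def _insert(self, task):
        if task in self._slots:
            self._slots.move_to_end(task)  # mark as most recently used
            return
        if len(self._slots) >= self.capacity:
            self._slots.popitem(last=False)  # evict the least recently used
        self._slots[task] = self._load(task)

    def get(self, task):
        """Return (adapter, hit); a miss loads the adapter on demand."""
        hit = task in self._slots
        self._insert(task)
        return self._slots[task], hit

    def prefetch(self, predicted_task):
        """Load the adapter for a predicted upcoming task ahead of time."""
        self._insert(predicted_task)

    def resident(self):
        """Tasks currently held on the device, least recent first."""
        return list(self._slots)


def route(task, on_device_tasks):
    """Serve tasks that have a device adapter locally, otherwise in the cloud."""
    return "device" if task in on_device_tasks else "cloud"


if __name__ == "__main__":
    cache = AdapterCache(capacity=2)
    for task in ["ocr", "vqa"]:
        print(task, cache.get(task)[1])      # both are misses on a cold cache
    cache.prefetch("caption")                # stand-in for the learned predictor
    print("caption", cache.get("caption")[1])
    print(route("chart_reasoning", {"ocr", "vqa", "caption"}))
```

In this sketch a correct prediction turns what would have been an on-demand adapter load into a cache hit, while a wrong prediction costs one eviction. The frequency-based routing is reduced to a set-membership test on the tasks assumed to have device adapters.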