WWW2025

WeInfer: Unleashing the Power of WebGPU on LLM Inference in Web Browsers

Zhiyang Chen, Yun Ma, Haiyang Shen, Mugeng Liu

Cited by 9

Abstract

Web-based large language models (LLMs) have garnered significant attention from both academia and industry because they combine the benefits of on-device computation with the accessibility and portability of Web applications. The advent of WebGPU, a modern browser API that enables Web applications to utilize a device's GPU, has opened up new possibilities for GPU-accelerated LLM inference within browsers. However, our experiments reveal that existing Web-based LLM inference frameworks use the GPU inefficiently, which limits inference speed. These inefficiencies arise primarily from underutilizing the capabilities of WebGPU, particularly in resource management and execution synchronization. To address these limitations, we present WeInfer, an efficient Web-based LLM inference framework specifically designed to unleash the power of WebGPU. WeInfer incorporates two key innovations: 1) buffer reuse strategies that reduce the overhead of resource preparation by optimizing the lifecycle management of WebGPU buffers, and 2) an asynchronous pipeline that decouples resource preparation from GPU execution, enabling parallelized computation and deferred result fetching to improve overall efficiency. We conduct extensive evaluations across 9 different LLMs and 5 heterogeneous devices, covering a broad spectrum of model architectures and hardware configurations. The results demonstrate that WeInfer delivers substantial improvements in decoding speed, achieving up to a 3.76× speedup over WebLLM, the state-of-the-art Web-based LLM inference framework.
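The two ideas named in the abstract can be illustrated with a minimal TypeScript sketch. This is not WeInfer's implementation: `BufferPool`, `runDeferred`, and the `create` and `submit` callbacks are hypothetical names introduced here for illustration. The sketch assumes that `create` stands in for a buffer allocation call such as WebGPU's `GPUDevice.createBuffer`, and that `submit` stands in for enqueuing GPU work and returning a promise for its read-back.

```typescript
// Hypothetical sketch, not the authors' code.
// Buffer reuse: keep released buffers in size-keyed buckets so that a later
// request of the same size skips allocation entirely.
class BufferPool<B> {
  private free = new Map<number, B[]>();
  private create: (size: number) => B;
  created = 0; // number of real allocations performed

  constructor(create: (size: number) => B) {
    // In a browser, `create` might wrap GPUDevice.createBuffer (assumption).
    this.create = create;
  }

  acquire(size: number): B {
    const bucket = this.free.get(size);
    if (bucket !== undefined && bucket.length > 0) {
      return bucket.pop() as B; // reuse a previously released buffer
    }
    this.created++;
    return this.create(size);
  }

  release(size: number, buf: B): void {
    const bucket = this.free.get(size);
    if (bucket !== undefined) {
      bucket.push(buf);
    } else {
      this.free.set(size, [buf]);
    }
  }
}

// Deferred result fetching: enqueue every step without awaiting it, then
// collect all results once at the end instead of blocking after each step.
async function runDeferred<T>(
  steps: number,
  submit: (step: number) => Promise<T>
): Promise<T[]> {
  const pending: Promise<T>[] = [];
  for (let step = 0; step < steps; step++) {
    pending.push(submit(step)); // no await here: preparation is not blocked
  }
  return Promise.all(pending);
}
```

Keying the pool by exact size keeps the sketch short; a real framework would also have to account for buffer usage flags and for when a buffer is safe to recycle, neither of which the abstract specifies.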