ICLR 2026

SpecBranch: Speculative Decoding via Hybrid Drafting and Rollback-Aware Branch Parallelism

Yuhao Shen, Junyi Shen, Quan Kong, Tianyu Liu, Yao Lu, Cong Wang

Cited by 11

Abstract

Speculative decoding (SD) has emerged as a promising technique for accelerating LLM inference: a small draft model proposes draft tokens in advance, and the large target model validates them in parallel. However, existing SD methods remain constrained by their serialized execution, which causes mutual waiting bubbles between the draft and target models. To address this challenge, we draw inspiration from branch prediction in modern processors and propose SpecBranch, a novel framework that unlocks branch parallelism in SD. Specifically, we first conduct an in-depth analysis of the potential of branch parallelism in SD and find that the key challenge lies in the trade-off between parallelization and token rollback. Based on this analysis, we introduce parallel speculative branches to preemptively hedge against likely rejections. Meanwhile, to enhance parallelism, we jointly orchestrate adaptive draft lengths using a hybrid combination of implicit draft-model confidence and explicit reuse of target-model features. Extensive experiments across various models and benchmarks show that SpecBranch achieves speedups of 1.8× to 4.5× over autoregressive decoding and reduces rollback tokens by 50% for poorly aligned models, while maintaining an identical sampling distribution. Our code is available at https://github.com/Sylvan820/Specbranch.

Introduction

Recent advances in Large Language Models (LLMs), such as GPT-4 and DeepSeek [15], have revolutionized natural language processing [5]. However, their real-world deployment faces the critical challenge of inference latency: autoregressive generation restricts LLMs to predicting one token at a time, creating a fundamental bottleneck for real-time and large-scale applications. To address this limitation, Speculative Decoding (SD) has emerged as a promising acceleration paradigm [21, 8, 37, 23, 22, 25].
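To make the draft-then-verify idea concrete, the following is a minimal sketch of the generic SD loop in its greedy form. It is not SpecBranch's branch-parallel algorithm. The toy "models" `toy_target` and `toy_draft`, the draft length `gamma`, and the helper names are hypothetical stand-ins introduced only for illustration. A real system would verify all drafted positions in one batched forward pass of the target model and would use a probabilistic acceptance rule when sampling.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical toy "model": maps a context (token ids) to the next token id.
NextToken = Callable[[Sequence[int]], int]


def autoregressive_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one serial target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, gamma: int = 4) -> Tuple[List[int], int]:
    """Greedy draft-then-verify. Returns (sequence, number of verify rounds)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(seq) < goal:
        # Draft stage: the small model proposes up to gamma tokens serially.
        k = min(gamma, goal - len(seq))
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))

        # Verify stage: the target checks every drafted position.
        # (Looped here for clarity; batched into one pass in practice.)
        rounds += 1
        for i in range(k):
            expected = target(seq + proposal[:i])
            if expected != proposal[i]:
                # First mismatch: keep the accepted prefix plus the target's
                # own token, and roll back the remaining drafted tokens.
                seq += proposal[:i] + [expected]
                break
        else:
            seq += proposal  # every drafted token was accepted
    return seq, rounds


def toy_target(ctx: Sequence[int]) -> int:
    return (7 * ctx[-1] + len(ctx)) % 50


def toy_draft(ctx: Sequence[int]) -> int:
    # Agrees with the target except when the context length is a multiple of 5.
    t = toy_target(ctx)
    return (t + 1) % 50 if len(ctx) % 5 == 0 else t


prompt = [1, 2, 3]
baseline = autoregressive_decode(toy_target, prompt, n_new=20)
spec_out, verify_rounds = speculative_decode(toy_target, toy_draft, prompt, n_new=20)
```

In this toy setting the speculative output matches the autoregressive baseline token for token, while the number of verification rounds is smaller than the number of generated tokens. Note that the draft and verify stages alternate strictly: each waits for the other to finish, which is the serialized execution the abstract identifies as the source of waiting bubbles.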
SD uses a small draft model to proactively generate candidate tokens, which are then verified in parallel by the large target model. By replacing serialized token generation with parallel validation, SD decouples the computational workload from sequence length. However, a critical serialization bottleneck remains. As shown in Fig. 1(a), the draft and target

* Equal contribution. † Corresponding authors.