CVPR 2025

Teller: Real-Time Streaming Audio-Driven Portrait Animation with Autoregressive Motion Generation

Dingcheng Zhen, Shunshun Yin, Shiyang Qin, Hou Yi, Ziwei Zhang, Siyuan Liu, Gan Qi, Ming Tao

Abstract

[Figure 1 panels: Stage 2 Efficient Temporal Module (before / after the temporal module); Real-Time Performance on 200 ms audio chunks (audio encoding 7 ms, Stage 1 106 ms, Stage 2 71 ms; total inference time < 200 ms); Diverse Move.]

Figure 1. Teller framework is the first autoregressive framework for real-time, audio-driven portrait animation, achieving up to 25 FPS while preserving realistic body part and accessory movements. Demo can be found at https://teller-avatar.github.io/.
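The real-time claim in Figure 1 comes down to simple arithmetic: the per-chunk processing cost must stay below the 200 ms of audio each chunk covers. The sketch below checks that budget using the three latency figures from the figure. It assumes the three stages run one after another with no overlap and that output is rendered at the quoted 25 FPS. Neither assumption is stated in this excerpt, and the function and variable names are illustrative only.

```python
# Latency figures as reported in Figure 1, in milliseconds.
CHUNK_MS = 200
STAGE_COSTS_MS = {"audio_encode": 7, "stage1": 106, "stage2": 71}
TARGET_FPS = 25  # assumption: output frame rate equals the quoted 25 FPS


def realtime_budget(chunk_ms, stage_costs_ms, fps):
    """Summarize the per-chunk latency budget.

    Assumes the stages execute sequentially, so the total cost is
    the plain sum of the individual stage costs.
    """
    total_ms = sum(stage_costs_ms.values())
    return {
        "total_ms": total_ms,
        "headroom_ms": chunk_ms - total_ms,
        "real_time_factor": total_ms / chunk_ms,
        "frames_per_chunk": chunk_ms * fps // 1000,
        "is_real_time": total_ms < chunk_ms,
    }


budget = realtime_budget(CHUNK_MS, STAGE_COSTS_MS, TARGET_FPS)
for key, value in budget.items():
    print(f"{key}: {value}")
```

Under these assumptions the stages sum to 184 ms per 200 ms chunk, which leaves 16 ms of headroom and corresponds to 5 output frames per chunk. If the stages were pipelined rather than sequential, the effective per-chunk latency could be lower than this sum.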