EMNLP 2025

RAV: Retrieval-Augmented Voting for Tactile Descriptions Without Training

Jinlin Wang, Yulong Ji, Hongyu Yang

Abstract

Tactile perception is essential for human-environment interaction, and deriving tactile descriptions from multimodal data is a key challenge for embodied intelligence in understanding human perception. Conventional approaches that rely on extensive parameter learning for multimodal perception are rigid and computationally inefficient. To address this, we introduce Retrieval-Augmented Voting (RAV), a parameter-free method that constructs visual-tactile cross-modal knowledge directly. Given visual and tactile inputs, RAV retrieves similar visual-tactile data and generates tactile descriptions through a voting mechanism. In experiments, we apply three voting strategies, SyncVote, DualVote, and WeightVote, and achieve performance comparable to large-scale cross-modal models without training. Comparative experiments across datasets of varying quality (defined by annotation accuracy and data diversity) demonstrate that RAV's performance improves with higher-quality data at no additional computational cost. Code and model checkpoints are open-sourced at https://github.com/PluteW/RAV.
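To make the retrieve-then-vote idea concrete, the following is a minimal sketch under stated assumptions, not the released implementation. It assumes precomputed embedding vectors, cosine similarity for retrieval, and a simple tally over the descriptors of the retrieved neighbors. The function names (`cosine`, `retrieve`, `vote`, `describe`), the toy data banks, and the `weighted` flag are all hypothetical; in particular, the two tallying modes shown here are generic illustrations and should not be read as the definitions of SyncVote, DualVote, or WeightVote.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, bank, k):
    """Return the k most similar bank entries as (similarity, labels) pairs."""
    scored = [(cosine(query, emb), labels) for emb, labels in bank]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def vote(neighbors, top_n=2, weighted=False):
    """Tally descriptors over neighbors; optionally weight votes by similarity."""
    tally = defaultdict(float)
    for sim, labels in neighbors:
        for label in labels:
            tally[label] += sim if weighted else 1.0
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(tally.items(), key=lambda item: (-item[1], item[0]))
    return [label for label, _ in ranked[:top_n]]


def describe(visual_q, tactile_q, visual_bank, tactile_bank,
             k=2, top_n=2, weighted=False):
    """Retrieve neighbors in each modality, pool them, and vote (no training)."""
    neighbors = retrieve(visual_q, visual_bank, k) + retrieve(tactile_q, tactile_bank, k)
    return vote(neighbors, top_n=top_n, weighted=weighted)


# Hypothetical toy banks: (embedding, tactile descriptors) per stored sample.
visual_bank = [
    ([1.0, 0.0], ["smooth", "hard"]),
    ([0.9, 0.1], ["smooth", "glossy"]),
    ([0.0, 1.0], ["rough", "soft"]),
]
tactile_bank = [
    ([1.0, 0.0], ["smooth", "hard"]),
    ([0.8, 0.2], ["hard", "cold"]),
    ([0.0, 1.0], ["rough", "soft"]),
]

plain = describe([1.0, 0.05], [1.0, 0.1], visual_bank, tactile_bank)
by_similarity = describe([1.0, 0.05], [1.0, 0.1], visual_bank, tactile_bank,
                         weighted=True)
print(plain, by_similarity)
```

Because the only stored state in this sketch is the data banks, replacing them with better-annotated or more diverse samples changes the output without any retraining, which mirrors the data-quality argument made in the abstract.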