CVPR 2025
EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering
Sheng Zhou, Junbin Xiao, Qingyun Li, Yicong Li, Xun Yang, Dan Guo, Meng Wang, Tat-Seng Chua, Angela Yao
Abstract
[Figure 1 panels: Scene-Text VQA (TextVQA); Scene-Text VideoQA (RoadTextVQA); Egocentric VideoQA (EgoTaskQA); Egocentric Scene-Text VideoQA (EgoTextVQA-Outdoor and EgoTextVQA-Indoor). Other labels in the figure: Book-related, Gameplay, Kitchen, Intention Reasoning.]
Example QA pairs shown in the figure:
Q: What number is on the man's jersey? A: 14.
Q: What is written on the road before the zebra crossing? A: Stop.
Q: What is the state of microwave before the person gets something from it? A: Closed. (EgoTaskQA)
Q: What should I do to clean the pot after cooking? A: Use the fairy dishwashing liquid and scrub the pot with a cloth. (EgoTextVQA-Indoor)
Q: What should I be cautious of when driving through this area? A: Children crossing the road. (EgoTextVQA-Outdoor)
Figure 1. Our EgoTextVQA aims for QA assistance involving scene text from an ego-perspective mainly in outdoor driving (EgoTextVQA-Outdoor) and indoor house-keeping (EgoTextVQA-Indoor), with the questions reflecting the real user needs yet without the visual focus on scene text. Benchmarking results show that all models struggle on EgoTextVQA, highlighting continued efforts for improvements.