CVPR 2025

Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos

Chiara Plizzari, Alessio Tonioni, Yongqin Xian, Achin Kulshrestha, Federico Tombari

Abstract

[Figure: Our dataset, EgoTempo: free-form video Q&A probing the temporal understanding of multi-modal LLMs.]

EgoTempo example questions:
  Temporal Event Ordering. Q: What does the person do after draining the excess water?
  Action Sequence. Q: What is the sequence of actions the person performs with the mug?
  Object Counting. Q: How many oranges does the person pick from the tree?

Limitations of previous egocentric VideoQA datasets (Single-frame Understanding, Commonsense Reasoning), example questions:
  Q: What is the main purpose of using aluminum foil?
  Q: What is the status of the microwave before the user gets something from it?