EMNLP 2025
Transparent and Coherent Procedural Mistake Detection
Shane Storks, Itamar Bar-Yossef, Yayuan Li, Zheyuan Zhang, Jason J. Corso, Joyce Chai
Abstract
Procedural mistake detection (PMD) is the challenging problem of classifying whether a human user (observed through egocentric video) has successfully executed a task (specified by a procedural text). Despite significant recent efforts, machine performance in the wild remains nonviable, and the reasoning processes underlying this performance are opaque. We therefore extend PMD to require generating visual self-dialog rationales that inform decisions. Given the impressive, mature image understanding capabilities observed in recent vision-and-language models (VLMs), we curate a suitable benchmark dataset for PMD based on individual frames. Because our reformulation enables unprecedented transparency, we leverage a natural language inference (NLI) model to formulate two automated metrics for the coherence of generated rationales. We establish baselines for this reframed task, showing that VLMs struggle off-the-shelf, but that their accuracy, coherence, and efficiency can be improved, with some trade-offs, by incorporating these metrics into common inference and fine-tuning methods. Lastly, our multi-faceted metrics visualize common outcomes, highlighting areas for further improvement.

[Figure 1 panels: Visual Question Generation (VQG), Visual Question Answering (VQA), and Success/Mistake Classification. Example procedure: "Unclip the pegs on the cloth." Example questions: (1) Is there a cloth in the image? (2) Are there pegs on the cloth? (3) Is there someone holding pegs?]
Figure 1: To reason through the complex task of procedural mistake detection (PMD), vision-and-language models (VLMs) are conditioned to gather visual evidence through an iterative self-dialog to rationalize their final decision.

directly condition classification.
Since recent VLMs struggle to extract detailed, temporally coherent information from videos, but have exhibited more mature image understanding capabilities, we curate an approachable large-scale dataset for PMD based on individual video frames annotated in Ego4D (Grauman et al., 2022). We define two metrics for the coherence of generated rationales based on a natural language inference (NLI) model. To lay a foundation for research in coherent PMD, we establish baselines by exploring three natural interventions to VLMs: (1) we use our metrics to re-rank candidate questions generated by VLMs, (2) we harness VLMs' in-context learning capability to generate additional candidate questions based on human-written examples, and (3) we use our metrics to fine-tune VLMs to generate more coherent questions. Our results show that while VLMs struggle off-the-shelf, these interventions can improve VLMs' accuracy, coherence, and rationale generation efficiency, albeit with trade-offs between these aspects. Lastly, we show how our multi-faceted metrics visualize common outcomes in coherent PMD (e.g., unjustified decisions, object hallucination, and more), enabling fine-grained evaluation and identification of areas for future improvement.

Problem Formulation and Dataset

In this section, we define the extended problem of coherent PMD in an approachable manner for VLMs, describe how to apply VLMs to the problem, and lastly introduce a benchmark dataset we curated for evaluating coherent PMD.
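To make the iterative self-dialog of Figure 1 and the candidate re-ranking of intervention (1) more concrete, the following is a minimal Python sketch of one possible control loop. It is an illustration under stated assumptions, not the implementation used in this work: the function `self_dialog`, its four callables (`propose`, `score`, `answer`, `judge`), the `max_turns` and `confidence` parameters, and the stopping rule are all hypothetical names and choices. In practice, `propose`, `answer`, and `judge` would wrap a VLM conditioned on the video frame, and `score` would stand in for an NLI-based coherence metric.

```python
from typing import Callable, List, Tuple

QA = Tuple[str, str]  # one (question, answer) turn of the self-dialog


def self_dialog(
    procedure: str,
    propose: Callable[[str, List[QA]], List[str]],   # hypothetical: VQG candidates
    score: Callable[[str, str, List[QA]], float],    # hypothetical: coherence score
    answer: Callable[[str], str],                    # hypothetical: VQA on the frame
    judge: Callable[[str, List[QA]], float],         # hypothetical: P(success)
    max_turns: int = 5,
    confidence: float = 0.95,
) -> Tuple[str, List[QA], List[float]]:
    """Run a self-dialog and return (decision, rationale, success-probability trace)."""
    history: List[QA] = []
    trace: List[float] = []
    p_success = 0.5  # assumed uninformed prior before any evidence is gathered

    for _ in range(max_turns):
        candidates = propose(procedure, history)
        if not candidates:
            break  # nothing left to ask
        # Re-rank: keep the candidate question with the highest score.
        question = max(candidates, key=lambda q: score(procedure, q, history))
        history.append((question, answer(question)))
        # Re-estimate the success probability from all evidence so far.
        p_success = judge(procedure, history)
        trace.append(p_success)
        # Assumed stopping rule: halt once either class is confident enough.
        if max(p_success, 1.0 - p_success) >= confidence:
            break

    decision = "success" if p_success >= 0.5 else "mistake"
    return decision, history, trace
```

The returned question-answer history plays the role of the rationale, and the probability trace shows how each turn shifts the decision, which is the kind of per-turn transparency the reformulated task is meant to expose. Separating `score` from `propose` keeps re-ranking independent of how candidates are produced, so additional candidates (for example, from in-context examples) can be added without changing the loop.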