AAAI2025

Retrieving Versus Understanding Extractive Evidence in Few-Shot Learning

Karl Elbakian, Samuel Carton

Abstract

A key aspect of alignment is the proper use of within-document evidence to construct document-level decisions. We analyze the relationship between the retrieval and interpretation of within-document evidence for large language models in a few-shot setting. Specifically, we measure the extent to which model prediction errors are associated with evidence retrieval errors with respect to gold-standard human-annotated extractive evidence for five datasets, using two popular closed proprietary models. We perform two ablation studies to investigate when both label prediction and evidence retrieval errors can be attributed to qualities of the relevant evidence. We find that there is a strong empirical relationship between model prediction and evidence retrieval error, but that evidence retrieval error is mostly not associated with evidence interpretation error, a hopeful sign for downstream applications built on this mechanism.

Code: https://github.com/kelbakian/llm-rationale-fidelity

1 Introduction

AI alignment refers to the goal of ensuring that model output is aligned with human intents and values (Shen et al. 2023; Anwar et al. 2024; Shen et al. 2024). One key element of alignment is verification, the ability to confirm that a model's predictions have indeed accorded with those intents and values. When we use models to automate or assist human-affecting tasks such as moderation (Kumar, AbuHashem, and Durumeric 2024), resume screening (Gan, Zhang, and Mori 2024), grading (Pinto et al. 2023), or medical decision-making (Thirunavukarasu et al. 2023), we want a human auditor to be able to review their decisions for mistakes or pathologies of behavior such as bias or the use of spurious evidence. Verification has traditionally been one of the major goals of model interpretability (Fok and Weld 2023).
Implicitly, the assumption underlying this function is that it is easier for a human auditor to catch model mistakes at the explanation level and propagate them upward to an appropriate skepticism about the model's overall prediction, than to inspect that prediction alone. With the rise of large language models (LMs), the discourse on AI interpretability has turned towards methods