CVPR2024

GlitchBench: Can Large Multimodal Models Detect Video Game Glitches?

Mohammad Reza Taesiri, Tianjun Feng, Cor-Paul Bezemer, Anh Nguyen

被引用 7 次

摘要

For each open source model, we used the provided sample code and demo from their respective repositories. Minor modifications were made to enable automatic processing of all images with designated prompts. The results were then stored in individual CSV files for each model. For OtterHD, which offers an API, we used the API to submit each image along with the appropriate prompt and recorded the responses. Our experiment was done prior to the official release of the GPT-4V API, and we used the ChatGPT web version for the benchmark, using a Chrome extension to assist in the process. We kept the temperature and other parameters of each model unchanged. The only modification involved increasing the max token limit, ensuring that the model's response length was not restricted. A1.2. Details about the judge In our experiment, the Llama-2-70B model served as the judge. We utilized the API from perplexity.ai, which is compatible with OpenAI's Python package. Additionally, we employed a custom system message, as detailed below: Your task is to compare a model-generated text with a ground truth reference, assessing whether the key information and themes are similarly conveyed, even if worded differently. Focus on semantic content, thematic alignment, and intent, rather than exact phrasing or word usage. Recognize synonyms, paraphrases, and different stylistic expressions as valid, provided they faithfully represent the ground truth's meaning. Offer feedback on the correlation between the texts and suggest improvements for alignment, while appreciating creative or varied linguistic expression that maintains the essence of the ground truth. First analyze, then report the final answer in either of Yes or No A2. Additional Results A2.1. Breakdown of Performance by Various Glitch Types Table A1. Breakdown of Performance for Different LMMs by Various Glitch Types (%)