ACL 2023

Exploiting Commonsense Knowledge about Objects for Visual Activity Recognition

Tianyu Jiang, Ellen Riloff

2 citations

Abstract

Situation recognition is the task of recognizing the activity depicted in an image, including the people and objects involved. Previous models for this task typically train a classifier to identify the activity using a backbone image feature extractor. We propose that commonsense knowledge about the objects depicted in an image can also be a valuable source of information for activity identification. Previous NLP research has argued that knowledge about the prototypical functions of physical objects is important for language understanding, and NLP techniques have been developed to acquire this knowledge. Our work investigates whether this prototypical function knowledge can also benefit visual situation recognition. We build a framework that incorporates this type of commonsense knowledge into a transformer-based model trained to predict the action verb for situation recognition. Our experimental results show that adding prototypical function knowledge about physical objects improves performance on the visual activity recognition task.