NeurIPS 2023

ChatGPT-Powered Hierarchical Comparisons for Image Classification

Zhiyuan Ren, Yiyang Su, Xiaoming Liu

Cited by 41

Abstract

The zero-shot open-vocabulary setting poses challenges for conventional image classification. Vision-language models pretrained on image-text pairs like CLIP offer a solution based on comparing image and class label embeddings. Incorporating class-specific knowledge provided by large language models (LLMs) such as ChatGPT in descriptions can further enhance CLIP's accuracy. However, CLIP still exhibits a bias towards certain classes and generates similar descriptions for closely related but different classes. To address these problems, we present a novel image classification framework via hierarchical comparisons. By recursively comparing and grouping classes with LLMs, we construct a class hierarchy. With such a hierarchy, we can classify an image by descending from the top to the bottom of the hierarchy, comparing image and text embeddings at each level. Through extensive experiments and analyses, we demonstrate that our proposed approach is intuitive, effective, and explainable. Code is available here.
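The top-down classification step described in the abstract can be illustrated with a minimal sketch. It assumes a greedy descent: at each level the child whose text embedding has the highest cosine similarity to the image embedding is chosen. The abstract does not specify how the LLM builds the groups or how scores across levels are combined, so the `Node` class, the `classify` function, the hand-written hierarchy, and the 3-dimensional toy embeddings below are all hypothetical stand-ins, not the authors' implementation. In practice the embeddings would come from a vision-language model such as CLIP.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One group or class in a hypothetical class hierarchy."""
    name: str
    text_embedding: List[float]          # stand-in for a text embedding of the description
    children: List["Node"] = field(default_factory=list)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(image_embedding: List[float], root: Node) -> List[str]:
    """Descend from the root to a leaf, picking the most similar child at each level.

    Returns the list of node names visited, which doubles as a simple
    explanation of the decision (assumed greedy strategy).
    """
    node = root
    path = []
    while node.children:
        node = max(node.children,
                   key=lambda child: cosine(image_embedding, child.text_embedding))
        path.append(node.name)
    return path


# Toy hierarchy with made-up embeddings (illustrative only).
root = Node("root", [0.0, 0.0, 0.0], [
    Node("animal", [1.0, 0.0, 0.0], [
        Node("cat", [0.9, 0.0, 0.4]),
        Node("dog", [0.9, 0.0, -0.4]),
    ]),
    Node("vehicle", [0.0, 1.0, 0.0], [
        Node("car", [0.0, 0.9, 0.4]),
        Node("truck", [0.0, 0.9, -0.4]),
    ]),
])

image_embedding = [0.8, 0.1, 0.5]        # made-up image embedding
print(classify(image_embedding, root))   # ['animal', 'cat']
```

The returned path records every comparison made on the way down, which is one simple way a hierarchy can make a prediction explainable: each step names the group that won at that level.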