CVPR2025

Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks

Miran Heo, Min-Hung Chen, De-An Huang, Sifei Liu, Subhashree Radhakrishnan, Seon Joo Kim, Yu-Chiang Frank Wang, Ryo Hachiuma

DOI Publisher

Abstract

USER: Provide a comprehensive description of the area [1] . USER: Why [1] is reaching out to [0]? USER: What is the region [2] doing? Omni-RGPT: The deer is lying down on the ground, appearing relaxed and stationary. Omni-RGPT: The person is positioned on the right side of the frames, interacting with a group of deer. They are wearing a light blue shirt and have their hair styled in a braid. The person is holding a branch with green leaves, which they use to feed the deer. The background is a lush, green forest with tall trees and dense foliage, indicating a natural, outdoor setting. The person's movements are gentle and deliberate, suggesting a calm and respectful interaction with the animals.