CVPR 2024

MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World

Yining Hong, Zishuo Zheng, Peihao Chen, Yian Wang, Junyan Li, Chuang Gan

Abstract

You are an AI assistant / task generator in the room. You need to generate a task in the scene.

Demonstration:
For Room 1: [Few shot example]
Generate similar responses for Room 2.

Response:
For Room 2:
Q: Is the donut ready to eat?

t1
input: Q + I see a donut.
output: <select> [Choose donut]

t2
input: Q + I see a donut. <select>
output: <touch> [tactile] [temperature]

t3
input: Q + I see a donut. <select> <touch> [tactile] [temperature]
output: It is hard, cold and not ready to eat.
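The t1 to t3 pattern in the donut example can be read as an unrolling of one episode into (input, output) pairs, where each input is the context accumulated so far. The following is a minimal sketch of that reading and not the authors' implementation. The names `Step`, `unroll`, and `carry_detail` are hypothetical, and the placeholder "Q" is expanded to the question text. The `carry_detail` flag is an assumption made to reproduce the example, in which `[Choose donut]` is not carried into the next input while `[tactile] [temperature]` is.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    """One interaction step (hypothetical structure for illustration)."""
    action: str         # action token, e.g. "<select>" or "<touch>"
    detail: str         # bracketed text shown after the action in the output
    carry_detail: bool  # assumption: whether `detail` is appended to later inputs


def unroll(question: str, scene: str, steps: List[Step], answer: str) -> List[Tuple[str, str]]:
    """Unroll an episode into (input, output) pairs, one per time step."""
    context = f"{question} {scene}"
    turns = []
    for step in steps:
        output = f"{step.action} {step.detail}".strip()
        turns.append((context, output))
        # Extend the context with what the next turn is allowed to see.
        carried = output if step.carry_detail else step.action
        context = f"{context} {carried}"
    # The last turn maps the full interaction history to the final answer.
    turns.append((context, answer))
    return turns


turns = unroll(
    question="Is the donut ready to eat?",
    scene="I see a donut.",
    steps=[
        Step("<select>", "[Choose donut]", carry_detail=False),
        Step("<touch>", "[tactile] [temperature]", carry_detail=True),
    ],
    answer="It is hard, cold and not ready to eat.",
)
for t, (inp, out) in enumerate(turns, start=1):
    print(f"t{t} input: {inp}\n   output: {out}")
```

Under these assumptions, the three printed turns correspond to t1, t2, and t3 of the example, with the question text in place of "Q +".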