Large Multimodal Models (LMMs) have made significant breakthroughs with the advancement of instruction tuning. However, while existing models can understand images and videos at a holistic level, they still struggle with instance-level understanding, which demands more nuanced comprehension and alignment. Instance-level understanding is crucial, as it focuses on the specific elements that we are most interested in. Encouragingly, existing works find that state-of-the-art LMMs exhibit strong instance-understanding capabilities when provided with explicit visual cues. Motivated by this, we introduce an automated annotation pipeline, assisted by GPT-4o, that extracts instance-level information from images and videos through explicit visual prompting for instance guidance.
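To make the idea concrete, below is a minimal sketch (not the released pipeline code) of the GPT-4o-assisted annotation step: a frame is first overlaid with explicit visual prompts (numeric instance IDs), then GPT-4o is asked to describe each marked instance. The function names, marker style, and prompt wording are illustrative assumptions.

```python
# Sketch of GPT-4o-assisted instance annotation via explicit visual prompts.
# Assumes the input frame already has numeric instance-ID markers drawn on it.
import base64
from openai import OpenAI

client = OpenAI()

def encode_image(path: str) -> str:
    """Base64-encode a frame that has instance-ID markers overlaid on it."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def annotate_instances(marked_frame_path: str, num_instances: int) -> str:
    """Ask GPT-4o for instance-level descriptions guided by the visual prompts."""
    prompt = (
        f"The image contains {num_instances} instances, each marked with a numeric ID. "
        "For every ID, describe that instance's appearance, its action, and its "
        "relationship to the other marked instances."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encode_image(marked_frame_path)}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```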
Building upon this pipeline, we propose Inst-IT, a solution to enhance LMMs' Instance understanding via explicit visual prompt Instruction Tuning. Inst-IT consists of a benchmark to diagnose multimodal instance-level understanding, a large-scale instruction-tuning dataset, and a continuous instruction-tuning training paradigm that effectively enhances the spatial-temporal instance understanding capabilities of existing LMMs.
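For illustration, a hypothetical instruction-tuning sample might look like the sketch below, where instances marked with numeric IDs in the frames are referenced directly in the question and answer. The field names and exact schema are assumptions, not the dataset's verbatim format.

```python
# Hypothetical instance-level instruction-tuning sample (schema is assumed).
sample = {
    "video": "example_video.mp4",  # frames overlaid with numeric instance-ID markers
    "conversations": [
        {"from": "human",
         "value": "<video>\nAt frame 3, what is instance [1] doing, "
                  "and how does it interact with instance [2]?"},
        {"from": "gpt",
         "value": "Instance [1] reaches toward instance [2] and picks it up "
                  "from the table."},
    ],
}
```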
Experimental results show that our models not only achieve outstanding performance on Inst-IT Bench but also demonstrate significant improvements on other generic image and video understanding benchmarks, such as AI2D and EgoSchema. This highlights that our dataset not only boosts instance-level understanding but also strengthens overall image and video comprehension.
Large Multimodal Models (LMMs) have made remarkable advancements, but they still face challenges in grasping instance-specific details, hindering their ability to achieve fine-grained understanding. Instance-level understanding involves identifying the attributes of individual instances and their relationships, which is essential for real-world tasks where users focus on Instances-of-Interest.
In this work, we aim to advance multimodal instance understanding in both images and videos. Our contributions are:
@article{peng2024boosting,
  title={Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning},
  author={Peng, Wujian and Meng, Lingchen and Chen, Yitong and Xie, Yiweng and Liu, Yang and Gui, Tao and Hang, Xu and Qiu, Xipeng and Wu, Zuxuan and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2412.03565},
  year={2024}
}