Inst-IT

Abstract

Large Multimodal Models (LMMs) have made significant breakthroughs with the advancement of instruction tuning. However, while existing models can understand images and videos at a holistic level, they still struggle with instance-level understanding, which requires more nuanced comprehension and alignment. Instance-level understanding is crucial, as it focuses on the specific elements that we are most interested in. Encouragingly, prior work finds that state-of-the-art LMMs exhibit strong instance understanding capabilities when provided with explicit visual cues. Motivated by this, we introduce an automated annotation pipeline, assisted by GPT-4o, that extracts instance-level information from images and videos through explicit visual prompting for instance guidance.

Building upon this pipeline, we propose Inst-IT, a solution to enhance LMMs' Instance understanding via explicit visual prompt Instruction Tuning. Inst-IT consists of a benchmark to diagnose multimodal instance-level understanding, a large-scale instruction-tuning dataset, and a continuous instruction-tuning training paradigm to effectively enhance the spatial-temporal instance understanding capabilities of existing LMMs.

Experimental results show that our models not only achieve outstanding performance on Inst-IT Bench but also demonstrate significant improvements on other generic image and video understanding benchmarks, such as AI2D and Egoschema. This highlights that our dataset not only boosts instance-level understanding but also strengthens the overall capabilities of generic image and video comprehension.

Large Multimodal Models (LMMs) have made remarkable advancements, but they still face challenges in grasping instance-specific details, hindering their ability to achieve fine-grained understanding. Instance-level understanding involves identifying the attributes of individual instances and their relationships, which is essential for real-world tasks where users focus on Instances-of-Interest.

In this work, we aim to advance multimodal instance understanding in both images and videos. Our contributions are:

An instance-centric annotation pipeline: We propose an automated pipeline, assisted by GPT-4o, to generate fine-grained annotations for both images and videos, with a particular focus on Instances of Interest.
An instance-specific understanding benchmark: We present Inst-IT Bench, a benchmark designed to evaluate instance-level understanding in multimodal models, and perform extensive evaluations on it.
An instance-grounded instruction tuning dataset: We introduce Inst-IT Dataset, the first dataset for instruction tuning that features explicit instance-level visual prompts and corresponding fine-grained textual annotations.
An instance-enhanced Large Multimodal Model: We integrate Inst-IT Dataset into the tuning of LMMs and propose a continuous instruction-tuning approach. This method enhances spatial-temporal instance understanding while improving general comprehension.

We process the video frames sequentially. At each timestamp \(t\), GPT-4o is prompted to create a frame-level annotation \(Y_t^f\) based on the current frame \(X_t\) and the previous frame \(X_{t-1}\). All frame-level annotations are then aggregated to produce a video-level description \(Y^{vid}\) and a set of open-ended question-answer pairs \(Y^{qa}\). Below is an annotated example; a full sample is available here.
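The sequential procedure above can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: `annotate` is a hypothetical stand-in for a GPT-4o call, the prompt strings are placeholders, and using the same model for the aggregation step is an assumption.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical stand-in for a GPT-4o call: (prompt, frames) -> generated text.
Annotator = Callable[[str, Sequence[str]], str]


def annotate_video(
    frames: Sequence[str], annotate: Annotator
) -> Tuple[List[str], str, str]:
    """Return (frame-level annotations, video-level description, QA text)."""
    frame_level: List[str] = []
    previous: Optional[str] = None
    for t, current in enumerate(frames, start=1):
        # X_{t-1} does not exist for the first frame, so only X_t is passed.
        inputs = [current] if previous is None else [previous, current]
        # Placeholder prompt; the real prompts are not given in this page.
        frame_level.append(annotate(f"Annotate frame <{t}>.", inputs))
        previous = current

    # Aggregate every Y_t^f into the video-level outputs Y^vid and Y^qa.
    joined = "\n".join(frame_level)
    video_level = annotate("Describe the whole video:\n" + joined, [])
    qa_text = annotate("Write open-ended QA pairs:\n" + joined, [])
    return frame_level, video_level, qa_text
```

Passing the previous frame alongside the current one is what lets each frame-level annotation describe temporal changes, which is why the first frame has nothing to compare against.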

Existing multimodal benchmarks primarily focus on global understanding and fail to provide in-depth insight into the instance-level comprehension capability of models. To fill this gap, we present Inst-IT Bench. It includes two parts, an image-split and a video-split, and evaluates a model's ability to understand instances in both images and videos. The image-split contains 1,000 QA pairs for 338 images, while the video-split contains 1,000 QA pairs for 206 videos. Each QA pair is available in both open-ended and multiple-choice formats.

Data Examples in Inst-IT Bench

Note: We use the format [ID] to refer to instances, and the format <timestamp> to refer to time.
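As a small illustration of this notation, the sketch below pulls instance IDs and timestamps out of a question string. It assumes both are plain integers, as in the examples on this page; the function name `extract_references` is hypothetical.

```python
import re
from typing import Iterable, List, Tuple

INSTANCE_RE = re.compile(r"\[(\d+)\]")   # matches [1], [2], ...
TIMESTAMP_RE = re.compile(r"<(\d+)>")    # matches <1>, <2>, ...


def _unique_in_order(values: Iterable[int]) -> List[int]:
    """Drop repeats while keeping the order of first appearance."""
    seen: List[int] = []
    for value in values:
        if value not in seen:
            seen.append(value)
    return seen


def extract_references(text: str) -> Tuple[List[int], List[int]]:
    """Return (instance IDs, timestamps) mentioned in the text."""
    ids = _unique_in_order(int(m) for m in INSTANCE_RE.findall(text))
    times = _unique_in_order(int(m) for m in TIMESTAMP_RE.findall(text))
    return ids, times


question = "What are [3] and [4] doing at <9>?"
print(extract_references(question))
```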

Evaluating LMMs on Inst-IT Bench

We conduct extensive evaluations on our benchmark, including state-of-the-art open-source image models, video models, and cutting-edge proprietary models. The results show that even state-of-the-art models struggle with fine-grained, instance-level understanding.

#IT indicates the number of training samples used during the instruction-tuning stage. N/A indicates that the number is unknown.
Model	LLM	#IT	Image Open-Ended Q&A	Image Multi-Choice Q&A	Video Open-Ended Q&A	Video Multi-Choice Q&A
Random Guess	-	N/A	-	25.0	-	25.0
GPT-4o	-	N/A	74.1	84.8	65.5	81.0
Gemini-1.5-pro	-	N/A	69.9	79.7	61.4	76.7
Gemini-1.5-flash	-	N/A	65.3	79.5	57.9	75.8
Open-source image models
LLaVA-1.5	Vicuna-7B	665K	41.6	32.1	-	-
ViP-LLaVA	Vicuna-7B	~1.2M	42.1	29.2	-	-
SoM-LLaVA	Vicuna-7B	695K	45.1	40.0	-	-
LLaVA-Next	Vicuna-7B	765K	46.0	42.4	-	-
Open-source video models
LLaVA-NeXT-Video	Vicuna-7B	860K	46.5	39.5	25.8	24.8
ShareGPT4Video	Llama3-8B	~1.0M	43.2	48.7	27.8	16.1
MiniCPM-V 2.6	Qwen2-7B	~7.0M	57.6	66.8	40.0	45.2
LLaVA-OV (SI)	Qwen2-7B	~7.2M	60.3	61.8	31.4	36.4
LLaVA-OV	Qwen2-7B	~8.8M	48.0	71.7	33.2	45.6
LLaVA-Video	Qwen2-7B	~7.4M	45.1	67.0	34.1	53.2
InternVL2	InternLM2.5-7B	N/A	58.6	66.5	39.8	45.5
Qwen2-VL-Instruct	Qwen2-7B	N/A	48.3	64.9	38.2	59.4
Qwen2-VL-Instruct	Qwen2-72B	N/A	55.5	74.7	45.5	74.6
Our models
LLaVA-Next-Inst-IT	Vicuna-7B	920K	68.6	63.0	49.3	42.1
LLaVA-Next-Inst-IT	Qwen2-7B	920K	67.9	75.3	45.7	53.3
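For reading the multi-choice columns, the sketch below shows how such an accuracy could be computed. It is an assumption-laden illustration: the page does not say how predictions are matched to answers, and the four-option setting is only inferred from the 25.0 random-guess row. Both function names are hypothetical.

```python
from typing import Sequence


def multi_choice_accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Percentage of predicted option letters that match the reference letters."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and equal in length")
    # Assumed matching rule: case-insensitive comparison of option letters.
    correct = sum(
        p.strip().upper() == a.strip().upper() for p, a in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)


def random_guess_baseline(num_options: int) -> float:
    """Expected accuracy of uniform guessing over num_options choices."""
    return 100.0 / num_options


print(random_guess_baseline(4))  # four options is an assumption
print(multi_choice_accuracy(["A", "b", "C", "D"], ["A", "B", "D", "D"]))
```

Under this reading, a multi-choice score near 25 means the model is doing no better than chance on that split.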

We create a large-scale instruction-tuning dataset, the Inst-IT Dataset. To the best of our knowledge, this is the first dataset that provides fine-grained annotations centered on specific instances. Inst-IT Dataset contains 21k videos and 51k images (we treat images as static, single-frame videos). On average, each video includes one video-level description, 7.3 frame-level annotations, and 15.6 open-ended question-answer pairs. In total, Inst-IT Dataset includes 21k video-level descriptions, 207k frame-level descriptions, and 335k open-ended QA pairs.

Data Examples in Inst-IT Dataset

In Inst-IT Dataset, each video contains a series of fine-grained annotations, including:

\(N\times\)Frame-level Annotations \(Y^f\): each encompassing descriptions of individual instances, the entire image, and the temporal changes.
\(1\times\)Video-level Description \(Y^{vid}\): a comprehensive description of the entire video, organized in chronological order.
\(M\times\)Open-Ended QA Pairs \(Y^{qa}\): each QA pair is focused on the instances that we are interested in.
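One way to picture a single sample is as a small nested record. The classes and field names below are hypothetical and chosen only to mirror the three annotation types listed above; they are not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FrameAnnotation:
    """One Y_t^f: instance captions, an image caption, and temporal changes."""
    timestamp: int
    instance_captions: Dict[int, str]        # instance ID -> caption
    image_caption: str
    temporal_differences: Optional[str]      # None for the first frame


@dataclass
class QAPair:
    question: str
    answer: str


@dataclass
class VideoSample:
    frames: List[FrameAnnotation]            # N frame-level annotations
    video_description: str                   # 1 video-level description
    qa_pairs: List[QAPair] = field(default_factory=list)  # M QA pairs


sample = VideoSample(
    frames=[
        FrameAnnotation(
            timestamp=1,
            instance_captions={1: "Wearing a light gray suit with a white shirt, standing indoors."},
            image_caption="[1] [2] [3] are standing closely together in an indoor setting.",
            temporal_differences=None,
        )
    ],
    video_description="The video appears to document a formal or celebratory event indoors.",
    qa_pairs=[QAPair("What activity are [1] and [2] involved in at <11>?",
                     "[1] and [2] are engaged in a kiss.")],
)
print(len(sample.frames), len(sample.qa_pairs))
```

Under this layout an image fits the same record with a single frame, in line with treating images as single-frame videos.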

Frame-level annotations

Frame	Instance-level Captions	Image-level Captions	Temporal Differences
timestamp <1>	1: Wearing a light gray suit with a white shirt, standing indoors. 2: Wearing a sleeveless white lace dress, holding an object in the hand. 3: Wearing a dark floral-patterned dress with long wavy hair.	[1] [2] [3] are standing closely together in an indoor setting. [1] is on the left side wearing a formal, light gray suit with a white shirt. [2], in the middle, is wearing a sleeveless white lace dress, holding something in their hand. [3] is on the right side in a dark floral-patterned dress with long, wavy hair. They appear to be in a room with wooden paneling and some framed art on the wall.	null
timestamp <2>	1: A person wearing a gray suit with a white shirt, short hair. 2: A person in a white, sleeveless dress with long dark hair. 3: A person wearing a dark floral dress with long dark hair. 5: A person wearing red, partially visible in the background. 6: A small black cellphone held in a hand.	The scene appears to be in an office setting with a wooden table at the foreground. [1] is standing to the left, facing [2], and appears to be holding [2]'s finger or hand. [2] stands slightly to the right, returning focus with [1]. [3] is to the right of [2], slightly in the background, smiling and looking forward. A bouquet of white flowers lies on the table near [2]. [5] is partially visible in the background on the right, seated and wearing red. [6] is a cellphone held by [5]. Background shows a wooden wall and a reflection in a window.	[1] has moved closer to [2] and is now in contact with [2]'s hand. [2] has turned slightly towards [1] compared to the previous frame. [3] remains in a similar position, but the expression suggests more engagement with the scene. [5] and [6] have appeared in the frame; [5] is visible in the background holding [6]. The table with a bouquet of flowers is now visible, indicating a shift in camera angle slightly to include more of the right side of the room.
timestamp <3>	1: Wearing a grey suit, standing beside [2] and slightly turned towards them. 2: Wearing a white, sleeveless dress with floral textures. Holding a bouquet of white flowers. 3: Wearing a dark patterned dress, standing slightly behind [2]. 4: Partially visible, wearing dark clothing, located at the edge of the left side of the frame. 5: Seated, wearing a red outfit. Holding a white object above their head, possibly obscuring their face.	The scene shows [1] [2] [3] near a wooden conference table in a professional setting, possibly an office. [1] wears a grey suit and is standing to the left, engaged with [2] who is wearing a white dress and holding flowers. [3], who is in a patterned dress, stands closely behind [2]. The newly appeared [4] is seated to the far left, partially visible at the edge of the frame. [5] is seated on the right side, holding an object above their head, possibly obscuring their face. The room has wooden walls and a framed picture hanging on the wall.	Object [5] has lifted an object above their head, possibly a piece of paper. Object [4] has appeared in the scene, seated on the left side of the frame, which was not visible earlier. The positions of objects [1], [2], and [3] remain unchanged, as does the background and setting of the room. Overall, no significant movement is noticed in terms of camera angle or position for objects [1] [2] [3].
timestamp <4>	1: Wearing a light gray suit jacket, white dress shirt, and dark pants. 2: Wearing a white dress with a lace overlay, fitted at the waist. 3: Wearing a patterned dress with a floral design, strapless. 4: Visible part of a person wearing a dark shirt, seated or standing near the table.	The setting appears to be indoors, with [1] [2] and [3] standing together around a table with a bouquet of flowers on it. [1] is interacting with [2], who is at the center, and they are possibly holding hands or engaged in some form of exchange. [3] is standing beside [2] and looking on, slightly leaning towards her. The room has wooden walls and a large framed picture in the background. The setting suggests a formal or ceremonial atmosphere, possibly a wedding or an official gathering. The camera angle is focused on this group, highlighting their interaction.	[1] has moved slightly closer to [2], and they appear to be holding hands or exchanging something. [5] is no longer visible in the frame, possibly due to a change in camera angle or positioning of the individuals.
timestamp <5>	1: An adult wearing a light gray suit with button details and a white shirt. The expression and stance suggest focus and engagement. 2: An adult in a white, lacy dress with thin straps. The person has long dark hair and appears to be smiling, holding hands with [1]. 3: An adult wearing a multicolored, patterned dress. The person has long, wavy hair and is smiling while observing [1] and [2].	The current frame captures a moment in an interior setting with [1] wearing a light gray suit, [2] in a white lace dress, and [3] in a patterned dress. [1] and [2] are engaged, with [1] facing [2] and holding their hand, suggesting an exchange, possibly a ring. [2] smiles, indicating a moment of happiness. [3] stands to the right, smiling and observing the interaction, detached but engaged with the scene. The background shows a wooden wall and framed picture, reflecting a formal environment possibly used for ceremonies. A bouquet of flowers rests on the table in front of the group.	Between the previous and the current frame, [1] and [2] have shifted slightly closer, with [1] now directly holding [2]'s hand, indicating a progression in their interaction, possibly the continuation or conclusion of an exchange, such as the placing of a ring. [3] remains in a similar position but continues to observe [1] and [2], emphasizing their passive role in the interaction. There is no notable change in the background or environment.
timestamp <6>	1: [1] is wearing a grey suit with a white shirt, looking forward, standing upright and smiling slightly. 2: [2] is wearing a white sleeveless dress, with hair tied back, and is standing with a calm expression. 3: [3] is wearing a floral dress with an energetic expression, standing with arms slightly bent.	The image depicts a formal setting with a group of three adults, [1], [2], and [3], standing closely together. The background features a wooden paneled wall and a framed picture. [1] and [2] are positioned in the center, both facing forward, suggesting they are the focus of the occasion. [1] is on the left, wearing a grey suit, and [2] is to the right of [1] in a white dress. They appear to be engaged in a ceremony or formal event. [3] is to the right of [2], wearing a floral dress, and displays a cheerful demeanor. The lighting is bright, illuminating their faces and creating a formal, celebratory atmosphere.	Between the frames, there is a noticeable shift in the poses and expressions of [1] and [2]. In the current frame, [1] is now standing upright with a slight smile, while previously [1] was leaning towards [2], holding [2]'s hand, suggesting a shift from interaction to posing. [2], who was previously looking at [1], is now facing forward with a calm expression, indicating a change from an interactive pose to a more neutral one. Both [1] and [2] have adjusted their posture to face the camera more directly. [3] remains in similar positioning as before but has moved slightly closer to [2] and is displaying a more energetic expression, emphasizing the cheerful atmosphere. The objects on the table in the foreground, visible in the previous frame, are no longer the focal point, showing that the primary focus is now the individuals standing together.
timestamp <7>	1: [1] is dressed in a grey suit with a white shirt, looking formal and neat. 2: [2] is wearing a white, sleeveless dress with a lightly patterned texture. 4: [4] is dressed in a dark outfit, including a dark scarf or similar accessory.	In the current frame, [1] is positioned in the center, wearing a grey suit and a white shirt. [2] is to the right of [1], dressed in a white sleeveless dress. [4] appears on the left side of the image, wearing a dark outfit, which includes a scarf, giving a formal look. The environment is a room with wooden walls, and a large map or blueprint hangs on the wall in the background. The lighting highlights the three individuals, [1] [2] [4], and the focus is on them standing in a formal setting. [1] and [2] appear to be closer together, engaged in the setting's activity, with [4] seeming to join or rejoin the group.	[3] is no longer visible in the current frame. [4] has appeared, standing to the left side of [1] and [2]. [1] and [2] remain in similar positions as in the previous frame, but the group now includes [4].
timestamp <8>	1: Person in a gray suit with a white shirt underneath. 2: Person wearing a white dress with long dark hair. 3: Person with long hair wearing a patterned dress, standing in the background.	The current frame shows a group of three individuals indoors, with [1] on the left in a gray suit and white shirt, facing slightly towards [2], who is dressed in a white dress with long dark hair. [2] is looking at [1], suggesting an interaction or communication between them. [3] is slightly behind [2] and smiling, indicating a positive mood. The environment appears to be an office or meeting room with a large map or artwork on the wall in the background and a wooden wall, suggesting a formal or semi-formal setting. The lighting is bright, coming from the windows in the background, creating a clear but slightly shadowed detail on the individuals.	From the previous frame to the current one, [1] and [2] appear to have shifted slightly closer to each other, with [2]'s head turned towards [1] indicating interaction. [3] is now visible in the scene, having entered from the right, which suggests a new addition to the group. [4] from the previous frame is no longer visible, indicating they may have exited the frame or moved out of view. The overall composition suggests a change in group dynamics as [3] enters and [1] and [2] interact more closely.
timestamp <9>	1: Wearing a light gray suit with a white shirt, standing with arms relaxed at the sides. 2: Wearing a sleeveless white dress, with black hair visible, standing sideways. 3: Clapping hands, wearing a dark, sleeveless floral-patterned dress. 4: Visible hands clapping, appearing on the left side of the frame.	In the current frame, [1] is standing next to [2], both are positioned near a wooden wall, with a large framed picture or window in the background. [2] is wearing a white dress and stands slightly leaning towards [1], who is dressed in a gray suit. [3] is to the right, wearing a patterned dress and clapping her hands. On the left side of the frame, [4]'s hands are visible, indicating a clapping gesture. The environment appears to be well-lit, possibly indicating a celebratory or formal gathering.	[4] has appeared in the current frame, clapping, which was not present in the previous frame. [1] and [2] have slightly shifted positions, indicating a minor adjustment in posture. The lighting in the room appears brighter in the current frame.
timestamp <10>	1: [1] is wearing a grey suit with a white shirt. The person's expression is neutral. 2: [2] is wearing a white dress, has long dark hair, and is smiling. 3: [3] is wearing a dark patterned dress, has long dark hair, and is smiling. 4: [4] is partially visible, clapping hands, wearing a long sleeve.	In the current frame, [1] stands on the left wearing a grey suit and appears slightly more composed than before. [2], next to [1], in a white dress, continues smiling, directed towards [1]. [3] stands behind [2] with a continuous smile and hands still positioned as if clapping, indicating a joyous or celebratory mood. [4] is partially visible on the edge, with both hands shown as if engaged in clapping. The background remains the same, with wall decor and a wooden frame, suggesting an indoor setting. The lighting is consistent, highlighting a positive atmosphere.	Between the previous and current frames, [1] has shifted from smiling to a neutral expression. [2]'s expression remains unchanged, still smiling. [3] continues to smile, maintaining the same engagement level. [4] shows hands in clapping motion slightly more forward than before. The physical positions of all individuals are largely the same, with slight adjustments in posture, possibly due to motion between shots.
timestamp <11>	1: Individual in a grey suit with a light-colored shirt underneath. 2: Individual in a white dress with a flower in their hair. 3: Individual in a dark floral dress with bare shoulders. 4: Visible hand, partially in the frame, with a watch on the wrist.	The current frame captures four adults in what appears to be an intimate celebration setting, inside a room with a wooden backdrop and a framed picture on the wall. [1] and [2] are the main focus, engaged in a kiss. Both are facing each other, with [1] in a grey suit and [2] in a white dress. [3] stands to the side, clapping, and appears joyous, indicating approval or celebration. The environment is that of a seemingly formal setting with elements suggesting a personal or official celebration. [4] is partially visible, with just a hand showing, suggesting a congratulatory gesture.	Between the previous and current frames, [1] and [2] have moved from standing side by side to facing each other and kissing, indicating a change from a neutral to an intimate interaction. [3] continues to display a supportive gesture by clapping, suggesting this action started in the previous frame and continued into the current one. The position of [4] indicates movement from a neutral position to a congratulatory gesture, seen by the positioning of the arm and hand. The overall increase in physical interaction between [1] and [2] and the supportive gestures by [3] and [4] contribute to a more emotionally engaging scene in the current frame.
timestamp <12>	1: Adult wearing a light grey suit with a white shirt. Short dark hair, clean-shaven, and standing upright. 2: Adult in a white, sleeveless dress. Long dark hair pulled back. Appears to be smiling with eyes partially closed. 3: Adult in a dark floral dress with a sleeveless design. Long dark hair down and clapping.	In the current frame, [1] and [2] stand close together in the center of the image. [1] is wearing a grey suit with a white shirt and appears to be speaking or smiling. [2], dressed in a white dress, is leaning slightly towards [1] with a content expression. [3] is on the right, wearing a dark floral dress and clapping, seemingly celebrating with [1] and [2]. The environment is indoors with a wooden wall and a large framed picture in the background. The overall mood is celebratory, suggesting an event or occasion has taken place.	Compared to the previous frame, [1] and [2] were previously kissing, but now they are standing apart, with [2] leaning slightly towards [1]. [1] has shifted from facing [2] to facing slightly outward and appears to be speaking or smiling. [3] remains in the same position but continues clapping, indicating ongoing celebration. The celebratory mood persists, reflecting a continuation of the event captured in the previous frame.

Video-level description

"The video appears to document a formal or celebratory event indoors, possibly a ceremony such as a wedding or official gathering, occurring in a room with wooden paneling and art or framed pictures on the wall. At the beginning, during <1>, [1] is wearing a light gray suit and stands with [2] in a sleeveless white lace dress, and [3] in a dark floral-patterned dress. The three are close together, suggesting an intimate or focused setting. The progression between <2> and <3> involves subtle changes in posture and interaction. [1] moves closer to [2], appearing to hold hands or engage in an exchange, possibly involving a ring, as indicated by a bouquet of flowers. [3] remains supportive and smiling, while [5], in red, momentarily holds an object above their head, before disappearing from view by <4>. In frames <5> to <7>, [1] and [2] maintain a close interaction, suggestive of a significant moment such as an exchange of vows or rings. They are closely observed by [3], who stands smiling nearby, while [1] and [2] occasionally adjust their positions, facing each other initially and then turning outward, which may signal transitioning from an intimate moment to posing for a photo. By <7>, [4] joins, dressed in darker attire, emphasizing the formal setting as [3] is no longer visible. Through <8> and <9>, the group dynamics change slightly with the absence of [4] and [3] entering the scene again. [1] and [2] appear to engage in a warm interaction as [3] supports them, clapping, alongside visible hands of [4] indicating applause, marking a cheerful tone. Finally, during <10> to <12>, the focus shifts as [1] and [2] first engage in a kiss, underscoring an intimate conclusion to their ceremony. They later stand apart slightly at the center, with [1] smiling or speaking, and [2] leaning towards [1] suggestively content.
Throughout, the consistent joyous mood is accentuated by [3]'s ongoing clapping and expression of joy, emphasizing shared celebration and approval from the audience captured."

Open-ended question answering

Question	Answer
What change occurs with [1]'s expression between <10> and the previous frame?	[1] changes from smiling to a neutral expression.
What activity are [1] and [2] involved in at <11>?	[1] and [2] are engaged in a kiss.
What is the overall mood during <11> as suggested by [3]'s actions?	A celebratory or joyous event.
What interaction occurs between [1] and [2] at <5>?	[1] holds [2]'s hand, suggesting an intimate gesture or exchange, likely a ring.
Who joins [1] and [2] in the frame at <7>?	[4] appears in the frame, joining [1] and [2].
What changes in the group's composition between <7> and <8>?	[3] reappears, and [4] is no longer visible.
What common setting element is seen throughout the frames <1> to <12>?	The scene is in an indoor setting with wooden paneling and framed art.
What type of event is likely taking place based on the atmosphere in <4> and <6>?	A formal event, possibly a wedding or official gathering.
What new elements are introduced in the scene at <2>?	[5] holds a cellphone in the background, partially visible.
What is the mood and lighting like at <6>?	The mood is formal and celebratory, with bright lighting enhancing this atmosphere.
What new background element appears at <7>?	There is a map or blueprint on the wall.
What is notable about [5]'s actions at <3>?	[5] is lifting an object above their head, possibly a piece of paper.
What is the setting like in <3>?	The group is gathered near a wooden conference table in a formal setting.
How are [1] and [2] interacting at <8>?	They are engaged in conversation or communication, indicated by body language and focus.
What does [1]'s expression suggest at <12>?	[1] speaks or smiles, suggesting engagement with [2] or others.
What shift occurs in the focus of the camera between <5> and <6>?	The camera focuses more on individuals standing together, reducing focus on the foreground objects.
What are [3] and [4] doing at <9>?	They are clapping their hands in celebration.
What decorative element is visible at <2>?	A bouquet of flowers lies on the table near [2].
How has the posture of [1] and [2] changed by <6>?	[1] and [2] face slightly outward, suggesting a pose for a photograph or audience.
What overall physical change occurs between [1] and [2] from <10> to <11>?	There's a noticeable increase in their physical interaction, enhancing emotional engagement.

Based on the Inst-IT Dataset, we propose a continuous instruction-tuning recipe that effectively mixes instance understanding data with generic instruction-tuning data. With this small amount of added data, the enhanced models demonstrate strong performance across various benchmarks as well as on our Inst-IT Bench.
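The general idea of mixing the two data sources can be illustrated with the sketch below. This page does not spell out the recipe (ratios, ordering, or training stages), so a plain concatenate-and-shuffle is used here purely as an assumed placeholder; `mix_datasets` is a hypothetical helper, not the authors' method.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mix_datasets(generic: Sequence[T], instance: Sequence[T], seed: int = 0) -> List[T]:
    """Combine generic and instance-level samples into one shuffled training list."""
    mixed = list(generic) + list(instance)
    # A fixed seed keeps the mixture reproducible across runs.
    random.Random(seed).shuffle(mixed)
    return mixed


generic_samples = ["generic-0", "generic-1", "generic-2"]
instance_samples = ["inst-it-0", "inst-it-1"]
print(mix_datasets(generic_samples, instance_samples))
```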

[Image: Results on image benchmarks]

[Image: Results on video benchmarks]

📃 BibTeX

      
@article{peng2024boosting,
  title={Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning},
  author={Peng, Wujian and Meng, Lingchen and Chen, Yitong and Xie, Yiweng and Liu, Yang and Gui, Tao and Hang, Xu and Qiu, Xipeng and Wu, Zuxuan and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2412.03565},
  year={2024}
}
