Instruction-Tuned Video-Audio MLLMs for Brain Alignment
Abstract
Recent voxel-wise multimodal brain encoding studies have shown that multimodal large language models (MLLMs) exhibit a higher degree of brain alignment than unimodal models in both unimodal and multimodal stimulus settings. More recently, instruction-tuned multimodal models have been shown to generate task-specific representations that align strongly with brain activity. However, prior work evaluating the brain alignment of MLLMs has primarily focused on unimodal settings or relied on non-instruction-tuned multimodal models for multimodal stimuli. To address this gap, we investigated brain alignment, that is, the degree to which representations derived from MLLMs predict neural activity recorded while participants watched naturalistic movies (video along with audio). We utilized instruction-specific embeddings from six video and two audio instruction-tuned MLLMs. Experiments with 13 video task-specific instructions show that instruction-tuned video MLLMs significantly outperform non-instruction-tuned multimodal models (by ~15%) and unimodal models (by ~20%). Our evaluation of MLLMs on both video and audio tasks using language-guided instructions shows clear disentanglement of task-specific representations, leading to precise differentiation of multimodal functional processing in the brain. We also find that MLLM layers align hierarchically with the brain: early sensory areas align strongly with early layers, while higher-level visual and language regions align more with middle-to-late layers. These findings provide clear evidence for the role of task-specific instructions in improving the alignment between brain activity and MLLMs, and open new avenues for mapping joint information processing in both systems.
Why Instruction-Tuned MLLMs?
Instruction-tuned multimodal large language models (MLLMs) have shown a remarkable ability to follow complex instructions across different modalities. Our research demonstrates that these models can provide insights into functional specialization in the brain when processing multimodal stimuli such as video with audio. Unlike previous approaches that used non-instruction-tuned models, our method leverages task-specific instructions to better capture the nuanced processing that occurs across different brain regions.
Key Contributions
🔬 First Brain-MLLM Alignment Study
We are the first to align instruction-tuned video and audio MLLMs with brain activity during naturalistic movie viewing, providing new insights into multimodal processing in the brain.
🎥 Task-Specific Representations
Our approach demonstrates that task-specific instructions lead to clear disentanglement in MLLM representations, which better align with functional specialization in the brain.
🧠 Hierarchical Alignment
We show that MLLM layers align hierarchically with brain regions, with early layers matching sensory areas and later layers matching higher-level processing regions.
Results
Our experiments demonstrate that instruction-tuned video MLLMs significantly outperform non-instruction-tuned multimodal models by approximately 15% and unimodal models by around 20% in brain alignment tasks. This performance gain is consistent across different brain regions and for various task-specific instructions.
The Figure above illustrates the consistent performance advantage of instruction-tuned MLLMs across different brain regions. The task-specific nature of these models allows them to better capture the nuanced processing that occurs in different functional areas of the brain.
Key Findings
- Instruction-tuned video MLLMs outperform non-instruction-tuned models by ~15% on average
- Performance gains are most pronounced in higher-level visual and language processing regions
- Using task-specific instructions leads to better disentanglement of functional specialization
- Layer-wise analysis shows hierarchical alignment between model layers and brain regions
Methodology
We used fMRI data from the NeuroMod Movie10 dataset, recorded while participants watched naturalistic movie clips with audio. We then extracted embeddings from different layers of six video and two audio instruction-tuned MLLMs using 13 task-specific instructions. These embeddings were used to predict brain activity in various regions, and the prediction performance was compared against that of non-instruction-tuned multimodal models and unimodal models.
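As a rough sketch of how clip-level MLLM embeddings can be brought onto the fMRI time grid before regression (the TR value, hemodynamic delay, and averaging scheme below are illustrative assumptions, not the paper's exact preprocessing):

```python
import numpy as np

def embeddings_to_tr_grid(clip_embeddings, clip_onsets, n_trs, tr=1.49, delay=4.5):
    """Average clip-level embeddings within each TR window, shifted by an
    assumed hemodynamic delay. The TR and delay values here are illustrative."""
    features = np.zeros((n_trs, clip_embeddings.shape[1]))
    for t in range(n_trs):
        start, end = t * tr - delay, (t + 1) * tr - delay
        mask = (clip_onsets >= start) & (clip_onsets < end)
        if mask.any():
            features[t] = clip_embeddings[mask].mean(axis=0)
    return features

# Toy usage: 300 two-second clips with 4096-d embeddings, 400 TRs of fMRI.
rng = np.random.default_rng(0)
clip_emb = rng.standard_normal((300, 4096))
clip_onsets = np.arange(300) * 2.0          # clip onset times in seconds
X = embeddings_to_tr_grid(clip_emb, clip_onsets, n_trs=400)
print(X.shape)                               # (400, 4096)
```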
Data Collection
Participants viewed naturalistic movie clips while undergoing fMRI scanning. The stimuli included diverse content spanning different genres, emotions, and semantic content to capture a wide range of brain responses.
Model Selection
We utilized six instruction-tuned video MLLMs (VideoChat-R1, Qwen-2.5-VL, Video-LLaVA, LLaVA-OneVision, LLaVA-Next-Video, InstructBLIPVideo) and two instruction-tuned audio MLLMs (Qwen-2.5-Audio, Kimi-Audio). For comparison, we also included non-instruction-tuned multimodal models and unimodal models.
Task-Specific Instructions
We generated 13 different task-specific instructions for the video models, including:
- Describe the visual content in detail
- Identify all objects and their interactions
- Analyze the emotional tone of the scene
- Describe the temporal sequence of events
- Explain the spatial relationships between objects
- And 8 more task-specific instructions
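As a concrete illustration of how one such instruction is paired with a video clip to obtain layer-wise embeddings, the sketch below uses the HuggingFace transformers port of LLaVA-NeXT-Video; the checkpoint ID, chat-template usage, dummy clip, and mean-pooling over tokens are illustrative assumptions rather than the paper's exact extraction setup.

```python
import numpy as np
import torch
from transformers import LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor

model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"   # assumed checkpoint, for illustration
processor = LlavaNextVideoProcessor.from_pretrained(model_id)
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# One of the task-specific instructions, paired with a video placeholder.
conversation = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe the visual content in detail."},
        {"type": "video"},
    ],
}]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Dummy 8-frame clip; in practice, frames would be sampled from the movie stimulus.
clip = np.random.randint(0, 255, size=(8, 336, 336, 3), dtype=np.uint8)
inputs = processor(text=prompt, videos=clip, return_tensors="pt").to(model.device, torch.float16)

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# One mean-pooled embedding per layer (the pooling choice is an assumption).
layer_embeddings = [h.mean(dim=1).squeeze(0).float().cpu().numpy()
                    for h in out.hidden_states]
print(len(layer_embeddings), layer_embeddings[0].shape)
```

The same extraction pattern would be repeated for each instruction and each video or audio MLLM to build instruction-specific feature sets.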
Brain Alignment Analysis
We trained voxel-wise ridge regression models with cross-validation to predict brain activity from the model embeddings, and measured alignment as the prediction accuracy (correlation between predicted and observed responses) across different brain regions. We also performed a layer-wise analysis to understand the hierarchical relationship between model layers and brain regions.
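Below is a minimal, self-contained sketch of such a voxel-wise encoding analysis on synthetic data, using scikit-learn's RidgeCV; the cross-validation scheme, regularization grid, and the layer-selection step at the end are illustrative assumptions rather than the paper's exact protocol.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

def voxelwise_alignment(X, Y, alphas=np.logspace(-1, 4, 6), n_splits=5):
    """Fit ridge encoding models from features X (n_trs, dim) to fMRI responses
    Y (n_trs, n_voxels); return the mean Pearson r per voxel across test folds."""
    corrs = np.zeros((n_splits, Y.shape[1]))
    for fold, (train, test) in enumerate(KFold(n_splits=n_splits).split(X)):
        model = RidgeCV(alphas=alphas).fit(X[train], Y[train])
        pred = model.predict(X[test])
        # Pearson correlation between predicted and observed response, per voxel.
        p = (pred - pred.mean(0)) / (pred.std(0) + 1e-8)
        o = (Y[test] - Y[test].mean(0)) / (Y[test].std(0) + 1e-8)
        corrs[fold] = (p * o).mean(0)
    return corrs.mean(0)

# Toy data: per-layer embeddings for 400 TRs and 1000 voxels of fMRI activity.
rng = np.random.default_rng(0)
Y = rng.standard_normal((400, 1000))
layer_features = {f"layer_{i}": rng.standard_normal((400, 256)) for i in range(4)}

# Layer-wise alignment: which layer best predicts each voxel (hierarchy analysis).
layer_scores = {name: voxelwise_alignment(X, Y) for name, X in layer_features.items()}
best_layer = np.argmax(np.stack(list(layer_scores.values())), axis=0)
print({name: float(s.mean()) for name, s in layer_scores.items()})
print(np.bincount(best_layer, minlength=len(layer_features)))  # voxels per best layer
```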
Citation
BibTeX
@article{oota2025instruction,
title={Instruction-Tuned Video-Audio Models Elucidate Functional Specialization in the Brain},
author={Oota, Subba Reddy and Pahwa, Khushbu and Jindal, Prachi and Namburi, Satyasai Srinath and Singh, Maneesh and Chakraborty, Tanmoy and Bapi, Raju S and Gupta, Manish},
journal={arXiv preprint arXiv:2506.08277},
year={2025}
}