Meta's TRIBE v2 Predicts How Your Brain Responds to Images, Sound, and Language — With Uncanny Accuracy
Meta FAIR's new AI model, trained on more than 1,000 hours of fMRI data from 720 subjects, can predict neural responses to video, audio, and text stimuli with accuracy exceeding that of most individual human brain scans, and it automatically replicated decades of established neuroscience findings.

D.O.T.S AI Newsroom
AI News Desk
Meta's Fundamental AI Research lab, FAIR, has published a model that does something genuinely remarkable: given a piece of video, audio, or text, it predicts with high accuracy which regions of the human brain will activate and how strongly, without ever scanning an actual brain.
The system, called TRIBE v2, was trained on more than 1,000 hours of fMRI neuroimaging data from 720 subjects and processes three input types through pre-trained Meta models: Llama 3.2 for text, Wav2Vec2-BERT 2.0 for audio, and V-JEPA 2 for visual content. Its outputs are activity predictions for 70,000 brain voxels, the three-dimensional pixels that make up an fMRI scan.
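For readers who want a concrete picture of that pipeline, the sketch below shows one way a trimodal encoder of this kind could be wired together in PyTorch. The class name, feature dimensions, and fusion transformer are illustrative assumptions rather than Meta's published architecture; the only details taken from the announcement are the three pre-trained feature extractors and the 70,000-voxel output.

```python
import torch
import torch.nn as nn

class TrimodalBrainEncoder(nn.Module):
    """Illustrative sketch: fuse text, audio, and video features into voxel predictions.

    Feature dimensions and the fusion transformer are assumptions for clarity;
    the article only specifies the three pre-trained encoders and a 70,000-voxel output.
    """

    def __init__(self, text_dim=4096, audio_dim=1024, video_dim=1408,
                 hidden_dim=1024, n_voxels=70_000):
        super().__init__()
        # Project each modality's pre-extracted features into a shared space.
        self.text_proj = nn.Linear(text_dim, hidden_dim)    # e.g. language-model embeddings
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)  # e.g. speech-encoder features
        self.video_proj = nn.Linear(video_dim, hidden_dim)  # e.g. video-encoder features
        # A small transformer fuses the modalities across time.
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        # Linear readout to one prediction per voxel, per time step.
        self.readout = nn.Linear(hidden_dim, n_voxels)

    def forward(self, text_feats, audio_feats, video_feats):
        # Each input: (batch, time, feature_dim), already aligned to the fMRI sampling rate.
        fused = (self.text_proj(text_feats)
                 + self.audio_proj(audio_feats)
                 + self.video_proj(video_feats))
        fused = self.fusion(fused)
        return self.readout(fused)  # (batch, time, n_voxels)
```

In a setup like this, the pre-trained encoders would be run offline to produce the time-aligned feature sequences the sketch consumes; only the fusion and readout layers would be trained on the fMRI data.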
Performance That Exceeds Individual Scans
The headline result is striking. TRIBE v2's predictions correlate more strongly with group-average brain activity patterns than most individual subjects' own scans do. On high-quality 7 Tesla scanner data, the gold standard for spatial resolution in neuroimaging, the model's predictions correlated with the group average twice as strongly as the median individual subject's scan did.
This is not merely a quantitative achievement. It means the AI has, in some sense, captured the signal more cleanly than a direct measurement from most individual humans. The implications for neuroscience research methodology are significant: an AI model may be able to substitute for expensive, time-consuming fMRI experiments in many research contexts.
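What does "correlating more strongly than individual scans" involve in practice? The sketch below, in plain NumPy, computes the kind of voxel-wise Pearson correlation such comparisons rest on: the model's prediction is scored against the group-average response, while each subject's scan is scored against the average of the other subjects (leave-one-out). The array shapes and the leave-one-out protocol are assumptions for illustration; the team's exact evaluation pipeline may differ.

```python
import numpy as np

def voxelwise_corr(pred, target):
    """Pearson correlation between two (time, voxels) arrays, computed per voxel."""
    p = pred - pred.mean(axis=0)
    t = target - target.mean(axis=0)
    num = (p * t).sum(axis=0)
    den = np.sqrt((p ** 2).sum(axis=0) * (t ** 2).sum(axis=0))
    return num / np.maximum(den, 1e-8)

def compare_model_to_subjects(model_pred, subject_scans):
    """model_pred: (time, voxels); subject_scans: (subjects, time, voxels).

    Returns the model's mean correlation with the group average, and the median
    subject's correlation with the average of the *other* subjects (leave-one-out),
    so that no subject is compared against a mean that includes itself.
    """
    group_mean = subject_scans.mean(axis=0)
    model_score = voxelwise_corr(model_pred, group_mean).mean()

    subject_scores = []
    for i in range(subject_scans.shape[0]):
        others = np.delete(subject_scans, i, axis=0).mean(axis=0)
        subject_scores.append(voxelwise_corr(subject_scans[i], others).mean())

    return model_score, float(np.median(subject_scores))
```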
A Validation Engine for Neuroscience
Perhaps more immediately compelling for scientists: TRIBE v2 automatically replicated decades of established neuroscience findings when tested in controlled conditions. It correctly identified specialized brain regions for faces, places, bodies, and written characters. It localized language processing networks and distinguished emotional from physical pain processing. It showed the expected left-hemisphere dominance for complete sentences versus word lists.
These are findings that took the field years of human experiments to establish. The model reproduced them purely from its training on naturalistic stimuli and the accompanying brain responses, without being explicitly trained on the research literature.
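As an illustration of how such in-silico replications can work, the sketch below contrasts predicted responses to two stimulus categories, following the classic logic of a face-versus-place localizer. The `model` callable, the stimulus lists, and the simple mean contrast are placeholders for illustration, not TRIBE v2's actual analysis protocol.

```python
import numpy as np

def in_silico_contrast(model, face_stimuli, place_stimuli):
    """Illustrative in-silico localizer: contrast predicted responses to two stimulus sets.

    `model` is assumed to map one stimulus to a (voxels,) array of predicted activity.
    """
    face_resp = np.mean([model(s) for s in face_stimuli], axis=0)
    place_resp = np.mean([model(s) for s in place_stimuli], axis=0)
    # Voxels with a strongly positive contrast are candidate face-selective regions;
    # strongly negative voxels are candidate place-selective regions.
    return face_resp - place_resp
```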
Five Functional Brain Networks Identified
The model's cross-modal architecture reveals something else: which sensory channel — audio, video, or text — is the primary driver of activity in each brain region. This capability led TRIBE v2 to identify five distinct functional networks: the primary auditory cortex, the language network, the motion recognition system, the default mode network, and the visual processing system.
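One simple way to derive such a modality map, sketched below, is ablation: zero out one modality's input at a time and see which removal most degrades the prediction at each voxel. The `predict` callable and the zero-ablation scheme are assumptions for illustration, not necessarily the analysis behind TRIBE v2's five-network result.

```python
import numpy as np

def _voxel_corr(pred, target):
    """Per-voxel Pearson correlation between two (time, voxels) arrays."""
    p = pred - pred.mean(axis=0)
    t = target - target.mean(axis=0)
    return (p * t).sum(axis=0) / np.maximum(
        np.sqrt((p ** 2).sum(axis=0) * (t ** 2).sum(axis=0)), 1e-8)

def dominant_modality_map(predict, text, audio, video, target):
    """Assign each voxel to the modality whose removal most degrades prediction accuracy.

    `predict(text, audio, video)` is assumed to return a (time, voxels) array aligned
    with `target`, the measured (time, voxels) responses.
    """
    full = _voxel_corr(predict(text, audio, video), target)
    drops = {
        "text":  full - _voxel_corr(predict(np.zeros_like(text), audio, video), target),
        "audio": full - _voxel_corr(predict(text, np.zeros_like(audio), video), target),
        "video": full - _voxel_corr(predict(text, audio, np.zeros_like(video)), target),
    }
    names = np.array(list(drops))
    stacked = np.stack(list(drops.values()))  # (3, voxels)
    return names[stacked.argmax(axis=0)]      # dominant modality per voxel
```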
The model's scaling behavior follows the pattern seen in large language models: accuracy improves with more training data and has not yet plateaued, suggesting further gains are achievable.
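Checking a claim like that typically means fitting a power law to accuracy as a function of training-set size, as in the short sketch below. The function is a generic illustration; the announcement gives no specific data sizes or scores to plug in.

```python
import numpy as np

def fit_power_law(train_hours, scores):
    """Fit score ~ a * hours**b by linear regression in log-log space.

    A positive exponent b, with no downward bend at the largest training sizes,
    is the usual sign that accuracy has not yet plateaued.
    """
    b, log_a = np.polyfit(np.log(train_hours), np.log(scores), 1)
    return np.exp(log_a), b  # scale a and exponent b
```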
Important Limitations
The model treats the brain as a passive receptor of stimuli and cannot model active decision-making, motor actions, or sensory modalities beyond sight, sound, and language, such as smell, touch, and proprioception. And because fMRI measures blood oxygenation only indirectly, with delays of several seconds, TRIBE v2 cannot capture the millisecond-scale neural dynamics that are critical for many neuroscientific questions.
Code, model weights, and an interactive demonstration are publicly available through Meta's platforms.