UAE's TII Releases Falcon Perception: A 0.6B Model That Outperforms SAM 3 on Spatial and Relational Reasoning
The Technology Innovation Institute has released Falcon Perception, a 0.6-billion-parameter early-fusion transformer that beats Meta's SAM 3 on open-vocabulary grounding and segmentation — particularly on the hard compositional tasks that expose the limits of modular vision systems. At a fraction of the parameter count of most frontier vision models, it signals that architecture choices matter as much as scale.

D.O.T.S AI Newsroom, AI News Desk
The Technology Innovation Institute (TII) in Abu Dhabi has released Falcon Perception, a 0.6-billion-parameter vision-language model for open-vocabulary grounding and instance segmentation that outperforms Meta's SAM 3 on a newly introduced benchmark called PBench. The release is notable for what it demonstrates about architectural efficiency: a model nearly an order of magnitude smaller than many frontier vision systems achieves competitive or superior results by rethinking how image and text tokens are processed together.
The Architecture: Early Fusion Over Modular Pipelines
Most vision-language models are built around a modular pipeline: a dedicated vision encoder processes the image, a fusion layer bridges image and text representations, and a decoder produces the output. Falcon Perception discards this structure in favor of a single unified backbone that processes image patches and text tokens in a shared parameter space from the first layer, a design the team calls early fusion.
The practical consequence is that the model's attention mechanism sees both modalities simultaneously throughout the entire processing stack, rather than fusing them after separate encoding. TII argues this is why Falcon Perception shows large gains specifically on tasks requiring the integration of visual and linguistic signals: OCR-guided queries (+13.4 points over SAM 3), spatial reasoning (+21.9 points), and relational queries (+15.8 points).
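The core idea is easy to see in miniature: instead of encoding each modality separately and fusing later, patch embeddings and text embeddings are concatenated into one sequence before the first attention layer, so every attention step mixes both modalities. The sketch below illustrates that with a single toy self-attention layer; all dimensions and weights are made up for illustration and are not the real Falcon Perception configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy embedding width (hypothetical)

# Hypothetical toy inputs: 9 image-patch embeddings and 4 text-token embeddings.
num_patches, num_text = 9, 4
patch_tokens = rng.normal(size=(num_patches, d))
text_tokens = rng.normal(size=(num_text, d))

# Early fusion: one shared sequence from the very first layer, so the
# attention below mixes image and text tokens rather than fusing them
# after separate per-modality encoders.
tokens = np.concatenate([patch_tokens, text_tokens], axis=0)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the fused sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out = self_attention(tokens, w_q, w_k, w_v)

# Every output row has attended over all 13 tokens of BOTH modalities.
print(out.shape)  # (13, 16)
```

In a modular pipeline, the equivalent attention over text tokens would only ever see an already-encoded image summary; here the cross-modal mixing happens at every layer, which is the property TII credits for the gains on OCR-guided, spatial, and relational queries.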
Chain-of-Perception: Decomposing Prediction Into Steps
Rather than predicting an instance segmentation mask in a single forward pass, Falcon Perception uses what TII calls the Chain-of-Perception interface: it first predicts the instance center coordinate, then the spatial extent, then the full-resolution binary mask. Each step conditions on the previous, functioning like a structured reasoning chain applied to perception rather than language generation.
This decomposition also enables the model to handle dense scenes — images with hundreds of individual instances — by generating predictions autoregressively. SAM 3's fixed-size decoder architecture runs out of query tokens at high instance counts; Falcon Perception's autoregressive generation scales to handle them. On the dense-scene subset of PBench, Falcon Perception scores 72.6 versus SAM 3's 58.4.
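The decomposition described above can be sketched as a simple decode loop: each instance is emitted in turn, and within an instance the center is predicted first, the extent conditioned on it, and the mask conditioned on both. Everything below (class names, the toy mask, the scene data) is a hypothetical illustration of the interface, not TII's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Instance:
    center: tuple  # (x, y): predicted first
    extent: tuple  # (w, h): conditioned on the center
    mask: list = field(default_factory=list)  # conditioned on center and extent

def predict_instances(detections):
    """Toy stand-in for a Chain-of-Perception style decode loop."""
    instances = []
    for center, extent in detections:  # autoregressive: one instance per step
        x, y = center
        w, h = extent
        # "Full-resolution binary mask", here just a filled box of pixel
        # coordinates for illustration.
        mask = [(x + dx, y + dy) for dx in range(w) for dy in range(h)]
        instances.append(Instance(center, extent, mask))
    return instances

# A dense scene of 300 instances poses no structural problem for an
# autoregressive loop, unlike a decoder with a fixed budget of query slots.
scene = [((i % 20, i // 20), (2, 2)) for i in range(300)]
preds = predict_instances(scene)
print(len(preds))  # 300
```

The loop makes the scaling argument concrete: a fixed-size decoder must allocate its maximum instance count up front, while an autoregressive interface simply keeps emitting instances until the scene is exhausted.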
Falcon OCR: A 0.3B Companion Model
TII simultaneously released Falcon OCR, a 0.3-billion-parameter variant optimized for document understanding. Despite its size, Falcon OCR achieves 80.3% on olmOCR and 88.6% on OmniDocBench — competitive with models three to five times larger — while delivering a 3x throughput advantage on A100 hardware. For organizations processing large volumes of document images on modest GPU budgets, the combination of accuracy and throughput efficiency makes it a compelling practical option.
Context: The UAE's Open-Model Strategy
TII's Falcon series has been one of the highest-profile open-model efforts outside the US and Europe, and Falcon Perception extends that strategy into the vision-language domain. The model and its associated PBench dataset are available on Hugging Face under open-weight licensing, and an interactive playground is accessible at vision.falcon.aidrc.tii.ae. Whether early fusion retains its advantages as parameter counts scale further remains an open research question, but Falcon Perception makes a strong empirical case that architecture matters at the efficiency frontier.