NVIDIA, Berkeley & Stanford: Best AI Models Still Fail at Robot Control — Until You Add Agentic Scaffolding
A new benchmark framework from NVIDIA, UC Berkeley, Stanford, and Carnegie Mellon systematically tests twelve frontier models on robot manipulation tasks. The verdict: even GPT-5.2, Gemini-3-Pro, and Claude Opus 4.5 fail at most tasks without human-designed abstractions. Agentic scaffolding — parallel generation, self-correction, reusable functions — dramatically closes the gap.

D.O.T.S AI Newsroom
A multi-institution research team from NVIDIA, UC Berkeley, Stanford, and Carnegie Mellon has released CaP-X, a new open-access evaluation framework that systematically measures how well frontier AI models can control robots by writing their own manipulation code. The findings cut through the optimism surrounding AI in robotics: without human-designed building blocks, even the best models in the world cannot reliably control physical systems. But agentic scaffolding changes the equation significantly.
The Experimental Setup
The core idea behind CaP-X is deceptively simple. Rather than training robot-specific models on motion capture datasets, the researchers asked general-purpose language models to write the control code that makes robots move. This approach — sometimes called Code as Policies — has genuine appeal: it allows a single frontier model to be applied to new robotic tasks without task-specific fine-tuning, leveraging the broad reasoning capabilities these models have already developed.
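The pattern can be sketched in a few lines. The article does not publish CaP-X's actual interface, so the robot API, the primitive names, and the hard-coded "generated" policy below are all hypothetical stand-ins; in a real system the policy source would come from a frontier model prompted with the task description and the API documentation.

```python
class SimRobot:
    """Stub robot exposing the action vocabulary the model may call."""
    def __init__(self):
        self.log = []

    def move_to(self, x, y, z):
        self.log.append(("move_to", x, y, z))

    def close_gripper(self):
        self.log.append(("close_gripper",))

# Stand-in for model output: in practice this string is returned by an
# LLM asked to solve "lift the cube" using the SimRobot API above.
GENERATED_POLICY = """
def policy(robot, cube_pos):
    x, y, z = cube_pos
    robot.move_to(x, y, z)        # reach the cube
    robot.close_gripper()         # grasp
    robot.move_to(x, y, z + 0.2)  # lift 20 cm
"""

def run_policy(source, robot, cube_pos):
    namespace = {}
    exec(source, namespace)            # load the model-written code
    namespace["policy"](robot, cube_pos)
    return robot.log

robot = SimRobot()
trace = run_policy(GENERATED_POLICY, robot, (0.4, 0.1, 0.05))
```

Because the policy is ordinary code running against a fixed API, the same host loop works for any task the model can describe in that vocabulary, which is the source of the approach's appeal.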
The team tested twelve models — including GPT-5.2, Gemini-3-Pro, Claude Opus 4.5, Qwen3-235B, and DeepSeek-V3.1 — across seven manipulation tasks of increasing complexity, ranging from lifting a cube to bimanual coordination. The tasks were run in physics simulation, allowing rapid iteration and reproducible scoring.
The Finding: Abstractions Matter Enormously
The headline result is stark: without access to human-designed high-level commands (primitives like "grasp object X and lift it"), even the strongest models fail at most manipulation tasks. Performance improved dramatically when models were given pre-built abstractions to work with. The gap between language-model capability and robot control, in other words, is not primarily a reasoning gap but an abstraction gap: the models need to be told what the relevant operations are, in terms a robot can execute.
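To make the abstraction gap concrete, here is a sketch of what a primitive like "grasp object X and lift it" hides. All names are hypothetical (the article does not publish CaP-X's primitive set): with the primitive, the model writes one call; without it, the model must derive the waypoint and gripper schedule itself, which is where frontier models fail.

```python
GRIPPER_OPEN, GRIPPER_CLOSED = 1.0, 0.0

def grasp_and_lift(obj_pos, lift_height=0.2):
    """Human-engineered primitive: expands one high-level command into
    the low-level waypoint/gripper schedule a controller can execute."""
    x, y, z = obj_pos
    return [
        ("waypoint", x, y, z + 0.10, GRIPPER_OPEN),           # pre-grasp above object
        ("waypoint", x, y, z,        GRIPPER_OPEN),           # descend
        ("waypoint", x, y, z,        GRIPPER_CLOSED),         # close gripper
        ("waypoint", x, y, z + lift_height, GRIPPER_CLOSED),  # lift
    ]

# With the abstraction, the model-facing "program" is a single call:
plan = grasp_and_lift((0.4, 0.1, 0.05))
```

Everything inside the function body is the human engineering the article describes: invisible in demos, but it is the vocabulary that makes the model's one-line program executable.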
This finding has immediate practical implications. The current generation of AI models cannot "zero-shot" robotic control from raw actuator commands. Successful deployment of language models in robotics requires substantial human engineering to define the action vocabulary — work that is often invisible in demonstrations but represents a significant development cost in practice.
Agentic Scaffolding Closes the Gap
The more optimistic result in the paper concerns what happens when agentic techniques are applied. Three interventions proved particularly effective: targeted test-time compute scaling (generating multiple solution candidates in parallel and selecting the best), automated debugging loops (models iterating on failed attempts with simulation feedback), and accumulating libraries of reusable functions across tasks.
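The three interventions compose into a single loop, sketched below with the model and the simulator stubbed out (the paper's real components are not spelled out in this article, so `generate` and `simulate` are deterministic placeholders, not CaP-X code).

```python
import random

def generate(task, feedback=None, seed=0):
    """Stand-in for an LLM call: returns a candidate 'program' (here just
    a float we treat as its quality). A real call would condition on the
    task description; feedback from a failed run steers the retry."""
    base = 0 if feedback is None else 1000
    return random.Random(base + seed).random()

def simulate(candidate):
    """Stand-in for a physics rollout: (score, error info)."""
    return candidate, None if candidate > 0.5 else "dropped object"

def solve(task, n_parallel=4, max_debug_rounds=3, library=None):
    library = library if library is not None else {}
    if task in library:                       # 3. reuse accumulated functions
        return library[task]
    feedback = None
    for _ in range(max_debug_rounds):         # 2. automated debugging loop
        # 1. parallel test-time compute: sample N candidates, keep the best
        candidates = [generate(task, feedback, seed=s) for s in range(n_parallel)]
        scored = [(simulate(c), c) for c in candidates]
        (score, err), best = max(scored, key=lambda r: r[0][0])
        if err is None:                       # success: bank it for later tasks
            library[task] = best
            return best
        feedback = err                        # feed the failure back in
    return None
```

The design point is that each technique attacks a different failure mode: parallel sampling covers variance in single generations, the debug loop converts simulator errors into prompt context, and the library amortizes solved tasks across the benchmark.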
With these scaffolding techniques applied, performance improved substantially — in some cases approaching the reliability of human-written programs. The pattern mirrors what has been observed in software engineering agents: single-shot model performance is often inadequate, but iterative, tool-augmented agents can achieve reliability that no single model inference achieves alone.
What This Means for Robotics AI
The CaP-X results position agentic scaffolding — not raw model capability — as the near-term lever for AI in robotics. For companies building physical automation systems on top of frontier models, the message is to invest in the scaffolding layer: parallel generation, simulation-in-the-loop debugging, and curated abstraction libraries. Waiting for models that can zero-shot physical control is likely to take longer than building the scaffolding that makes today's models reliable.