NVIDIA, Berkeley & Stanford: Best AI Models Still Fail at Robot Control — Until You Add Agentic Scaffolding
A new benchmark framework from NVIDIA, UC Berkeley, Stanford, and Carnegie Mellon systematically tests twelve frontier models on robot manipulation tasks. The verdict: even GPT-5.2, Gemini-3-Pro, and Claude Opus 4.5 fail at most tasks without human-designed abstractions. Agentic scaffolding — parallel generation, self-correction, reusable functions — dramatically closes the gap.

D.O.T.S AI Newsroom
A multi-institution research team from NVIDIA, UC Berkeley, Stanford, and Carnegie Mellon has released CaP-X, a new open-access evaluation framework that systematically measures how well frontier AI models can control robots by writing their own manipulation code. The findings cut through the optimism surrounding AI in robotics: without human-designed building blocks, even the best models in the world cannot reliably control physical systems. But agentic scaffolding changes the equation significantly.
The Experimental Setup
The core idea behind CaP-X is deceptively simple. Rather than training robot-specific models on motion capture datasets, the researchers asked general-purpose language models to write the control code that makes robots move. This approach — sometimes called Code as Policies — has genuine appeal: it allows a single frontier model to be applied to new robotic tasks without task-specific fine-tuning, leveraging the broad reasoning capabilities these models have already developed.
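The pattern can be sketched in a few lines. The article does not publish CaP-X's actual interface, so the robot API, the primitive names, and the hard-coded "generated" policy below are all hypothetical stand-ins; in a real system the policy source would come from a frontier model prompted with the task description and the API documentation.

```python
class SimRobot:
    """Stub robot exposing the action vocabulary the model may call."""
    def __init__(self):
        self.log = []

    def move_to(self, x, y, z):
        self.log.append(("move_to", x, y, z))

    def close_gripper(self):
        self.log.append(("close_gripper",))

# Stand-in for model output: in practice this string is returned by an
# LLM asked to solve "lift the cube" using the SimRobot API above.
GENERATED_POLICY = """
def policy(robot, cube_pos):
    x, y, z = cube_pos
    robot.move_to(x, y, z)        # reach the cube
    robot.close_gripper()         # grasp
    robot.move_to(x, y, z + 0.2)  # lift 20 cm
"""

def run_policy(source, robot, cube_pos):
    namespace = {}
    exec(source, namespace)            # load the model-written code
    namespace["policy"](robot, cube_pos)
    return robot.log

robot = SimRobot()
trace = run_policy(GENERATED_POLICY, robot, (0.4, 0.1, 0.05))
```

Because the policy is ordinary code running against a fixed API, the same host loop works for any task the model can describe in that vocabulary, which is the source of the approach's appeal.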
The team tested twelve models — including GPT-5.2, Gemini-3-Pro, Claude Opus 4.5, Qwen3-235B, and DeepSeek-V3.1 — across seven manipulation tasks of increasing complexity, ranging from lifting a cube to bimanual coordination. The tasks were run in physics simulation, allowing rapid iteration and reproducible scoring.
The Finding: Abstractions Matter Enormously
The headline result is stark: without access to human-designed high-level commands (primitives like "grasp object X and lift it"), even the strongest models fail at most manipulation tasks. Performance improved dramatically when models were given pre-built abstractions to work with. The gap between language-model capability and robot control, in other words, is not primarily a reasoning gap but an abstraction gap: the models need to be told what the relevant operations are, in terms a robot can execute.
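To make the abstraction gap concrete, here is a sketch of what a primitive like "grasp object X and lift it" hides. All names are hypothetical (the article does not publish CaP-X's primitive set): with the primitive, the model writes one call; without it, the model must derive the waypoint and gripper schedule itself, which is where frontier models fail.

```python
GRIPPER_OPEN, GRIPPER_CLOSED = 1.0, 0.0

def grasp_and_lift(obj_pos, lift_height=0.2):
    """Human-engineered primitive: expands one high-level command into
    the low-level waypoint/gripper schedule a controller can execute."""
    x, y, z = obj_pos
    return [
        ("waypoint", x, y, z + 0.10, GRIPPER_OPEN),           # pre-grasp above object
        ("waypoint", x, y, z,        GRIPPER_OPEN),           # descend
        ("waypoint", x, y, z,        GRIPPER_CLOSED),         # close gripper
        ("waypoint", x, y, z + lift_height, GRIPPER_CLOSED),  # lift
    ]

# With the abstraction, the model-facing "program" is a single call:
plan = grasp_and_lift((0.4, 0.1, 0.05))
```

Everything inside the function body is the human engineering the article describes: invisible in demos, but it is the vocabulary that makes the model's one-line program executable.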
This finding has immediate practical implications. The current generation of AI models cannot "zero-shot" robotic control from raw actuator commands. Successful deployment of language models in robotics requires substantial human engineering to define the action vocabulary — work that is often invisible in demonstrations but represents a significant development cost in practice.
Agentic Scaffolding Closes the Gap
The more optimistic result in the paper concerns what happens when agentic techniques are applied. Three interventions proved particularly effective: targeted test-time compute scaling (generating multiple solution candidates in parallel and selecting the best), automated debugging loops (models iterating on failed attempts with simulation feedback), and accumulating libraries of reusable functions across tasks.
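The three interventions compose into a single loop, sketched below with the model and the simulator stubbed out (the paper's real components are not spelled out in this article, so `generate` and `simulate` are deterministic placeholders, not CaP-X code).

```python
import random

def generate(task, feedback=None, seed=0):
    """Stand-in for an LLM call: returns a candidate 'program' (here just
    a float we treat as its quality). A real call would condition on the
    task description; feedback from a failed run steers the retry."""
    base = 0 if feedback is None else 1000
    return random.Random(base + seed).random()

def simulate(candidate):
    """Stand-in for a physics rollout: (score, error info)."""
    return candidate, None if candidate > 0.5 else "dropped object"

def solve(task, n_parallel=4, max_debug_rounds=3, library=None):
    library = library if library is not None else {}
    if task in library:                       # 3. reuse accumulated functions
        return library[task]
    feedback = None
    for _ in range(max_debug_rounds):         # 2. automated debugging loop
        # 1. parallel test-time compute: sample N candidates, keep the best
        candidates = [generate(task, feedback, seed=s) for s in range(n_parallel)]
        scored = [(simulate(c), c) for c in candidates]
        (score, err), best = max(scored, key=lambda r: r[0][0])
        if err is None:                       # success: bank it for later tasks
            library[task] = best
            return best
        feedback = err                        # feed the failure back in
    return None
```

The design point is that each technique attacks a different failure mode: parallel sampling covers variance in single generations, the debug loop converts simulator errors into prompt context, and the library amortizes solved tasks across the benchmark.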
With these scaffolding techniques applied, performance improved substantially — in some cases approaching the reliability of human-written programs. The pattern mirrors what has been observed in software engineering agents: single-shot model performance is often inadequate, but iterative, tool-augmented agents can achieve reliability that no single model inference achieves alone.
What This Means for Robotics AI
The CaP-X results position agentic scaffolding — not raw model capability — as the near-term lever for AI in robotics. For companies building physical automation systems on top of frontier models, the message is to invest in the scaffolding layer: parallel generation, simulation-in-the-loop debugging, and curated abstraction libraries. Waiting for models that can zero-shot physical control is likely to take longer than building the scaffolding that makes today's models reliable.