Research

NVIDIA, Berkeley & Stanford: Best AI Models Still Fail at Robot Control — Until You Add Agentic Scaffolding

A new benchmark framework from NVIDIA, UC Berkeley, Stanford, and Carnegie Mellon systematically tests twelve frontier models on robot manipulation tasks. The verdict: even GPT-5.2, Gemini-3-Pro, and Claude Opus 4.5 fail at most tasks without human-designed abstractions. Agentic scaffolding — parallel generation, self-correction, reusable functions — dramatically closes the gap.

D.O.T.S AI Newsroom

AI News Desk

3 min read

A multi-institution research team from NVIDIA, UC Berkeley, Stanford, and Carnegie Mellon has released CaP-X, a new open-access evaluation framework that systematically measures how well frontier AI models can control robots by writing their own manipulation code. The findings cut through the optimism surrounding AI in robotics: without human-designed building blocks, even the best models in the world cannot reliably control physical systems. But agentic scaffolding changes the equation significantly.

The Experimental Setup

The core idea behind CaP-X is deceptively simple. Rather than training robot-specific models on motion capture datasets, the researchers asked general-purpose language models to write the control code that makes robots move. This approach — sometimes called Code as Policies — has genuine appeal: it allows a single frontier model to be applied to new robotic tasks without task-specific fine-tuning, leveraging the broad reasoning capabilities these models have already developed.
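The pattern can be sketched in a few lines. Everything here is a hypothetical stand-in — the primitive names (`grasp`, `move_to`, `release`) and the scene layout are illustrative, since the article does not specify the actual CaP-X API:

```python
# Illustrative sketch of the Code-as-Policies pattern: a language model
# emits a policy written purely in terms of high-level primitives.
# All names below are assumed for illustration, not taken from CaP-X;
# real primitives would drive a simulator or physical robot.

log = []  # records the primitive calls the policy issues

def grasp(obj):
    log.append(("grasp", obj))

def move_to(pos):
    log.append(("move_to", pos))

def release():
    log.append(("release",))

def stack_red_on_blue(scene):
    """The kind of policy a model might write for the instruction
    'stack the red cube on the blue cube'."""
    red, blue = scene["red cube"], scene["blue cube"]
    grasp(red)
    move_to((blue[0], blue[1], blue[2] + 0.05))  # hover 5 cm above base
    release()

scene = {"red cube": (0.1, 0.2, 0.0), "blue cube": (0.4, 0.2, 0.0)}
stack_red_on_blue(scene)
```

The appeal is visible even in this toy: the policy is ordinary code, so any general-purpose model that can write code can in principle write it, with no robot-specific training.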

The team tested twelve models — including GPT-5.2, Gemini-3-Pro, Claude Opus 4.5, Qwen3-235B, and DeepSeek-V3.1 — across seven manipulation tasks of increasing complexity, ranging from lifting a cube to bimanual coordination. The tasks were run in physics simulation, allowing rapid iteration and reproducible scoring.

The Finding: Abstractions Matter Enormously

The headline result is stark. Without access to human-designed high-level commands — primitives like "grasp object X and lift it" — even the strongest models fail at most robotics tasks. Performance was dramatically better when models were given pre-built abstractions to work with, confirming that the gap between language model capability and robot control is not primarily a reasoning gap but an abstraction gap: the models need to be told what the relevant operations are, in terms a robot can execute.

This finding has immediate practical implications. The current generation of AI models cannot "zero-shot" robotic control from raw actuator commands. Successful deployment of language models in robotics requires substantial human engineering to define the action vocabulary — work that is often invisible in demonstrations but represents a significant development cost in practice.
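The scale of that hidden engineering is easy to underestimate. The toy expansion below — assuming a 7-joint arm and simple joint-space interpolation, neither of which comes from the paper — shows how a single primitive like "grasp object X and lift it" can stand for a dense stream of low-level commands:

```python
# Hypothetical illustration of the abstraction gap: one human-designed
# primitive hides many raw actuator commands. The 7-DoF arm and the
# interpolation scheme are assumptions for illustration only.

def interpolate(start, goal, steps):
    """Linear joint-space interpolation between two joint configurations."""
    return [
        tuple(s + (g - s) * t / steps for s, g in zip(start, goal))
        for t in range(1, steps + 1)
    ]

def lift_object(current_joints, grasp_joints, lift_joints, steps=50):
    """What a primitive such as 'grasp object X and lift it' might
    expand into: joint targets plus a gripper command."""
    commands = []
    for q in interpolate(current_joints, grasp_joints, steps):
        commands.append(("set_joints", q))   # approach the object
    commands.append(("close_gripper",))      # grasp
    for q in interpolate(grasp_joints, lift_joints, steps):
        commands.append(("set_joints", q))   # lift
    return commands

home = (0.0,) * 7
cmds = lift_object(home, (0.3,) * 7, (0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.8))
```

One primitive call becomes a hundred-plus low-level commands here; asking a language model to emit the right-hand side directly is the "raw actuator" regime in which the benchmarked models fail.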

Agentic Scaffolding Closes the Gap

The more optimistic result in the paper concerns what happens when agentic techniques are applied. Three interventions proved particularly effective: targeted test-time compute scaling (generating multiple solution candidates in parallel and selecting the best), automated debugging loops (models iterating on failed attempts with simulation feedback), and accumulating libraries of reusable functions across tasks.
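The three interventions compose into a single loop, sketched below under heavy assumptions: `propose` and `simulate` are deterministic stand-ins for a model call and a physics rollout, and none of the names come from the paper itself:

```python
# Toy sketch of the scaffolding pattern described above: best-of-n
# parallel generation, a simulation-feedback repair loop, and an
# accumulating library of reusable functions. `propose` and `simulate`
# are stand-ins for a model call and a physics rollout, not CaP-X APIs.

def propose(task, library, feedback, n=4):
    """Stand-in for sampling n candidate policies from a model.
    Candidates that reuse library code or incorporate simulator
    feedback are tagged so the toy scorer rewards them."""
    return [
        {"task": task, "uses_library": task in library,
         "patched": feedback is not None, "variant": i}
        for i in range(n)
    ]

def simulate(policy):
    """Stand-in for a physics rollout: returns a success score."""
    score = 0.3
    if policy["uses_library"]:
        score += 0.4
    if policy["patched"]:
        score += 0.4
    return score

def solve(task, library, max_rounds=3, threshold=0.6):
    """Best-of-n generation plus a repair loop with simulator feedback.
    Successful solutions grow the reusable-function library."""
    feedback = None
    for _ in range(max_rounds):
        candidates = propose(task, library, feedback)
        best = max(candidates, key=simulate)   # best-of-n selection
        if simulate(best) >= threshold:
            library.add(task)                  # accumulate reusable code
            return best
        feedback = "simulator error trace"     # feed the failure back in
    return None

library = set()
first = solve("lift cube", library)   # needs a repair round to pass
second = solve("lift cube", library)  # library reuse lets it pass first try
```

Even in this caricature, the structural point survives: the second attempt at the task succeeds without repair because the library carries work forward, which is exactly the cross-task accumulation effect the paper reports.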

With these scaffolding techniques applied, performance improved substantially — in some cases approaching the reliability of human-written programs. The pattern mirrors what has been observed in software engineering agents: single-shot model performance is often inadequate, but iterative, tool-augmented agents can achieve reliability that no single model inference achieves alone.

What This Means for Robotics AI

The CaP-X results position agentic scaffolding — not raw model capability — as the near-term lever for AI in robotics. For companies building physical automation systems on top of frontier models, the message is to invest in the scaffolding layer: parallel generation, simulation-in-the-loop debugging, and curated abstraction libraries. Waiting for models that can zero-shot physical control is likely to take longer than building the scaffolding that makes today's models reliable.

Back to Home

Related Stories

Google's AI Overviews Are Right Nine Times Out of Ten — but the 10% Failure Rate Has a Specific Shape
Research

A new independent study is the first to systematically measure the factual accuracy of Google's AI Overviews at scale. The headline finding — 90% accuracy — is better than critics expected and worse than Google implies. The more important finding is where that 10% comes from: complex multi-step queries, niche topics, and questions where the web itself is the source of conflicting claims.

D.O.T.S AI Newsroom
Databricks Co-Founder Wins Top Computing Prize — and Says AGI Is 'Already Here'
Research

Matei Zaharia, co-founder of Databricks and creator of Apache Spark, has won the ACM Prize in Computing — one of the most prestigious awards in computer science. In interviews accompanying the announcement, Zaharia made a pointed argument: AGI is not a future event but a present condition, and the industry's endless debate about its arrival is obscuring more useful questions about what to do with the AI we already have.

D.O.T.S AI Newsroom
Researchers Fingerprinted 178 AI Models' Writing Styles — and Found Alarming Clone Clusters
Research

A new study from Rival analyzed 3,095 standardized responses across 178 AI models, extracting 32-dimension stylometric fingerprints to map which models write like which others. The findings reveal tightly grouped clone clusters across providers — and raise serious questions about whether the AI ecosystem is converging on a single voice.

D.O.T.S AI Newsroom