Research

Stanford Finds AI Vision Models 'See' Images That Don't Exist — and Benchmarks Can't Catch It

Researchers at Stanford have identified a critical flaw in frontier multimodal AI systems: GPT-5, Gemini 3 Pro, and Claude Opus 4.5 generate plausible image descriptions and medical diagnoses even when no image is provided. The team calls it the 'mirage effect' — and finds that standard benchmarks are nearly blind to it.

D.O.T.S AI Newsroom

AI News Desk

3 min read

A research team at Stanford University has uncovered a fundamental reliability problem in the world's most advanced multimodal AI systems: frontier models generate detailed, confident image descriptions — including medical diagnoses — even when they receive no visual input at all. The researchers term the phenomenon the "mirage effect," and their findings suggest that the benchmarks used to evaluate these systems are almost entirely unable to detect it.

The study, covered by The Decoder in a report authored by managing editor Maximilian Schreiner, tested GPT-5, Gemini 3 Pro, and Claude Opus 4.5 in scenarios where standard evaluation prompts were issued but no image was attached. The results were striking: the models retained 70 to 80 percent of their benchmark accuracy with no images presented at all. On medical imaging benchmarks specifically, scores reached up to 99 percent — without a single pixel of actual visual input.
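The probe is simple to reproduce in outline. The sketch below assumes an OpenAI-style chat API; the model name, the prompt, and the call itself are illustrative stand-ins, not the Stanford team's actual test harness.

```python
from openai import OpenAI

client = OpenAI()

# A standard VQA-style prompt is sent -- but no image is attached.
response = client.chat.completions.create(
    model="gpt-5",  # illustrative; any multimodal chat model slots in here
    messages=[
        {
            "role": "user",
            "content": "Describe the chest X-ray above and state the most likely diagnosis.",
        }
    ],
)

# A reliable model should say it sees no image; a model exhibiting the
# mirage effect returns a confident description of a nonexistent scan.
print(response.choices[0].message.content)
```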

A New Kind of Failure

The Stanford team is careful to distinguish the mirage effect from standard hallucination, which typically involves a model fabricating details within a valid context. The mirage effect operates at a deeper level: models construct what the researchers describe as false "epistemic frames," assuming visual input exists and building entire reasoning chains on that assumption. The model does not know it has no image — it proceeds as though it does.

On the team's Phantom-0 benchmark, designed specifically to probe this vulnerability, over 60 percent of responses across all tested frontier models were confident false descriptions of images that were never provided. Hallucination rates on standard evaluation prompts jumped to 90 to 100 percent in phantom-image conditions.

The Medical Imaging Risk

The implications for clinical AI applications are particularly concerning. Gemini 3 Pro's diagnoses for nonexistent medical images consistently skewed toward severe pathologies — frequently returning suggestions of ST-elevation myocardial infarctions, melanomas, and carcinomas. In a deployed system where image uploads fail silently, this behavior could trigger false medical alerts based on nothing at all.
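That failure mode suggests an obvious engineering guard: treat a missing or undecodable image as a hard error before any model call is made. The sketch below is our illustration of such a guard, not part of the Stanford study; it assumes images arrive as raw bytes and uses Pillow to confirm the payload actually decodes.

```python
from io import BytesIO

from PIL import Image  # Pillow; used here to confirm the payload decodes


def validate_image_payload(data: bytes | None) -> bytes:
    """Fail loudly when an upload is missing, empty, or not a real image,
    rather than letting a multimodal model invent a diagnosis from nothing."""
    if not data:
        raise ValueError("Image upload failed or arrived empty; refusing to query the model.")
    try:
        Image.open(BytesIO(data)).verify()
    except Exception as exc:
        raise ValueError("Payload is not a decodable image.") from exc
    return data
```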

The problem is compounded by a counterintuitive finding: a text-only 3-billion-parameter model — which has no image processing capability whatsoever — outperformed all tested multimodal systems and human radiologists on a chest X-ray analysis benchmark. The explanation is unsettling. Benchmarks contain sufficient linguistic patterns and statistical cues that pure language models can succeed on them without ever processing an image. This means benchmark performance scores are not measuring what they appear to measure.

What the Fix Requires

The Stanford team's proposed solution, the "B-Clean" framework, identifies and removes benchmark questions that can be answered without image input. Applying it required filtering out 74 to 77 percent of benchmark questions across the datasets tested — suggesting that the overwhelming majority of current evaluation questions are compromised by the mirage effect. Model rankings shifted meaningfully on two of three tested benchmarks after cleaning.
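The B-Clean implementation itself is not reproduced in the report, but the core filtering idea can be sketched: pose each benchmark question to a text-only model with no image attached, and discard any question the blind model answers correctly. Everything below (the function name, the record layout, the `text_only_answer` callable) is an assumption for illustration.

```python
from typing import Callable


def clean_benchmark(
    questions: list[dict],
    text_only_answer: Callable[[str], str],
) -> list[dict]:
    """Keep only questions that appear to require visual input.

    `questions` is assumed to be a list of {"question": str, "answer": str}
    records; `text_only_answer` stands in for any pure language model.
    """
    kept = []
    for q in questions:
        # Ask the question with no image. If the blind model is right,
        # the question leaks its answer through text alone and is dropped.
        blind_guess = text_only_answer(q["question"])
        if blind_guess.strip().lower() != q["answer"].strip().lower():
            kept.append(q)
    return kept
```

By the paper's numbers, a filter of this shape would remove roughly three quarters of the questions from the datasets tested.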

The research raises a pointed question for the medical AI industry specifically: if deployed models can generate confident clinical diagnoses from missing images and existing benchmarks cannot catch the failure mode, what standard of evidence should be required before these systems enter clinical workflows? The Stanford team does not answer that question, but their findings make it impossible to ignore.

Related Stories

Google's AI Overviews Are Right Nine Times Out of Ten — but the 10% Failure Rate Has a Specific Shape
Research

A new independent study is the first to systematically measure the factual accuracy of Google's AI Overviews at scale. The headline finding — 90% accuracy — is better than critics expected and worse than Google implies. The more important finding is where that 10% comes from: complex multi-step queries, niche topics, and questions where the web itself is the source of conflicting claims.

D.O.T.S AI Newsroom

Databricks Co-Founder Wins Top Computing Prize — and Says AGI Is 'Already Here'
Research

Matei Zaharia, co-founder of Databricks and creator of Apache Spark, has won the ACM Prize in Computing — one of the most prestigious awards in computer science. In interviews accompanying the announcement, Zaharia made a pointed argument: AGI is not a future event but a present condition, and the industry's endless debate about its arrival is obscuring more useful questions about what to do with the AI we already have.

D.O.T.S AI Newsroom

Researchers Fingerprinted 178 AI Models' Writing Styles — and Found Alarming Clone Clusters
Research

A new study from Rival analyzed 3,095 standardized responses across 178 AI models, extracting 32-dimensional stylometric fingerprints to map which models write like which others. The findings reveal tightly grouped clone clusters across providers — and raise serious questions about whether the AI ecosystem is converging on a single voice.

D.O.T.S AI Newsroom