Stanford Finds AI Vision Models 'See' Images That Don't Exist — and Benchmarks Can't Catch It
Researchers at Stanford have identified a critical flaw in frontier multimodal AI systems: GPT-5, Gemini 3 Pro, and Claude Opus 4.5 generate plausible image descriptions and medical diagnoses even when no image is provided. The team calls it the 'mirage effect' — and found that standard benchmarks are nearly blind to it.

D.O.T.S AI Newsroom
AI News Desk
A research team at Stanford University has uncovered a fundamental reliability problem in the world's most advanced multimodal AI systems: frontier models generate detailed, confident image descriptions — including medical diagnoses — even when they receive no visual input at all. The researchers term the phenomenon the "mirage effect," and their findings suggest that the benchmarks used to evaluate these systems are almost entirely unable to detect it.
The study, covered by The Decoder in a report by managing editor Maximilian Schreiner, tested GPT-5, Gemini 3 Pro, and Claude Opus 4.5 in scenarios where standard evaluation prompts were issued but no image was attached. The results were striking: across these frontier models, performance reached 70 to 80 percent of benchmark accuracy even though no images were presented. On medical imaging benchmarks specifically, performance reached as high as 99 percent without a single pixel of actual visual input.
A New Kind of Failure
The Stanford team is careful to distinguish the mirage effect from standard hallucination, which typically involves a model fabricating details within a valid context. The mirage effect operates at a deeper level: models construct what the researchers describe as false "epistemic frames," assuming visual input exists and building entire reasoning chains on that assumption. The model does not know it has no image — it proceeds as though it does.
On the team's Phantom-0 benchmark, designed specifically to probe this vulnerability, over 60 percent of responses across all tested frontier models were confident false descriptions of images that were never provided. Hallucination rates on standard evaluation prompts jumped to 90 to 100 percent in phantom-image conditions.
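The probing setup is straightforward to reproduce in spirit: issue image-dependent prompts with no image attached and check whether the model answers anyway. The sketch below illustrates that idea against a caller-supplied query_model function; the marker-phrase detector and all names here are illustrative assumptions, not the team's actual harness.

```python
from typing import Callable, Iterable

# Phrases suggesting the model noticed the missing image. Illustrative only;
# a real probe would need a more robust detector, such as a judge model.
MISSING_IMAGE_MARKERS = (
    "no image", "cannot see", "can't see", "not attached",
    "please provide", "did not receive", "unable to view",
)

def phantom_image_probe(
    query_model: Callable[[str], str],   # caller-supplied: prompt in, reply out
    prompts: Iterable[str],              # standard vision-benchmark questions
) -> float:
    """Send image-dependent prompts with NO image and return the share of
    replies that describe the nonexistent image instead of flagging its absence."""
    total = 0
    confabulated = 0
    for prompt in prompts:
        reply = query_model(prompt).lower()
        total += 1
        if not any(marker in reply for marker in MISSING_IMAGE_MARKERS):
            confabulated += 1   # model answered as if an image were present
    return confabulated / total if total else 0.0
```

Keyword matching of this kind is only a rough proxy; reliably separating a genuine refusal from a confabulated description would likely require human review or a judge model.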
The Medical Imaging Risk
The implications for clinical AI applications are particularly concerning. Gemini 3 Pro's diagnoses for nonexistent medical images consistently skewed toward severe pathologies, frequently suggesting ST-elevation myocardial infarction, melanoma, or carcinoma. In a deployed system where image uploads fail silently, this behavior could trigger false medical alerts based on nothing at all.
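One mitigation the finding points toward is refusing to call the model at all when an upload has silently failed. The following is a minimal sketch of such a guard, assuming a Python service and the Pillow library for validation; both the setup and the function name are our assumptions, not part of the study.

```python
import io

from PIL import Image  # Pillow; any decoder that can validate image bytes works

def require_valid_image(payload: bytes | None) -> bytes:
    """Refuse to forward a diagnostic request unless the upload actually
    contains decodable image data. Raise an error instead of failing silently."""
    if not payload:
        raise ValueError("no image bytes received; refusing to query the model")
    try:
        # verify() is a cheap integrity check that does not fully decode the image
        Image.open(io.BytesIO(payload)).verify()
    except Exception as exc:
        raise ValueError("upload is not a decodable image") from exc
    return payload
```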
The problem is compounded by a counterintuitive finding: a text-only 3-billion-parameter model, which has no image-processing capability whatsoever, outperformed all tested multimodal systems and human radiologists on a chest X-ray analysis benchmark. The explanation is unsettling: the benchmarks contain enough linguistic patterns and statistical cues for a pure language model to succeed without ever processing an image. This means benchmark performance scores are not measuring what they appear to measure.
What the Fix Requires
The Stanford team's proposed solution, the "B-Clean" framework, identifies and removes benchmark questions that can be answered without image input. Applying it required filtering out 74 to 77 percent of questions across the datasets tested, which suggests that the overwhelming majority of current evaluation questions do not genuinely require visual input. Model rankings shifted meaningfully on two of the three tested benchmarks after cleaning.
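The report does not spell out B-Clean's exact procedure, but the core idea, dropping questions that a model can answer without ever seeing the image, can be sketched as follows. The BenchmarkItem structure and the answer_without_image callable are hypothetical stand-ins, not the paper's published interface.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class BenchmarkItem:
    question: str   # the text prompt normally paired with an image
    answer: str     # gold label

def filter_text_solvable(
    items: List[BenchmarkItem],
    answer_without_image: Callable[[str], str],  # text-only model, sees no pixels
) -> List[BenchmarkItem]:
    """Keep only questions a blind model gets wrong, i.e. questions that
    appear to genuinely require looking at the image."""
    kept = []
    for item in items:
        guess = answer_without_image(item.question).strip().lower()
        if guess != item.answer.strip().lower():
            kept.append(item)   # blind model fails, so the image seems necessary
        # else: drop the item; it can be answered from text priors alone
    return kept
```

A single-pass filter like this is conservative: a blind model can still guess some image-dependent questions correctly by chance, so a production version would presumably use multiple samples or multiple text-only models before discarding a question.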
The research raises a pointed question for the medical AI industry specifically: if deployed models can generate confident clinical diagnoses from missing images and existing benchmarks cannot catch the failure mode, what standard of evidence should be required before these systems enter clinical workflows? The Stanford team does not answer that question, but their findings make it impossible to ignore.