Can Frontier AI Write Formally Verified Graduate Math Proofs? A New Benchmark Has the Answer.
FormalProofBench is a new private benchmark that tests whether AI models can produce graduate-level mathematical proofs that are formally verified: not merely plausible-sounding, but checked correct by machine. The results expose a gap between AI mathematical fluency and AI mathematical rigour.

D.O.T.S AI Newsroom
AI News Desk
A new benchmark called FormalProofBench (arXiv:2603.26996) has been designed to answer a question that matters for AI's role in scientific research: can frontier models write mathematical proofs at the graduate level that are not merely convincing but formally verified — checked by a machine proof assistant to be logically watertight?
Why Formal Verification Changes the Question
Most existing AI math benchmarks — MATH, AIME, competition problems — evaluate whether a model reaches the correct answer. They do not evaluate the proof's logical structure. A model can produce a persuasive wrong argument, make hidden assumptions, or skip non-trivial steps in ways that are not penalised by answer-checking evaluation. Human expert reviewers can sometimes catch these failures; formal verification always catches them.
FormalProofBench pairs each natural-language problem with a formal proof checker. The task is not to give the right answer — it is to generate a proof that a mechanical verifier accepts as logically complete. This is a strictly harder requirement. It eliminates the "plausible but wrong" failure mode that inflates model performance on informal math benchmarks.
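The paper's choice of proof assistant is not stated here; for illustration only, a toy Lean 4 theorem (far below the benchmark's graduate level, and not drawn from it) shows what "a mechanical verifier accepts the proof" means in practice:

```lean
-- Toy example: the sum of two even naturals is even.
-- The Lean kernel accepts this proof only if every step is supplied;
-- replacing any step with `sorry` makes the checker flag the proof
-- as incomplete, so a "plausible but wrong" argument cannot slip through.
theorem even_add_even (a b : Nat)
    (ha : ∃ k, a = 2 * k) (hb : ∃ k, b = 2 * k) :
    ∃ k, a + b = 2 * k := by
  obtain ⟨m, hm⟩ := ha   -- unpack the witness for a
  obtain ⟨n, hn⟩ := hb   -- unpack the witness for b
  -- a + b = 2 * m + 2 * n = 2 * (m + n), checked by the kernel
  exact ⟨m + n, by rw [hm, hn, Nat.mul_add]⟩
```

An informal write-up could gesture at "clearly the sum is even"; the verifier instead demands the explicit witness `m + n` and the rewrite that closes the equation, which is exactly the gap between fluency and rigour the benchmark measures.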
What the Results Show
The benchmark's results (detailed in the paper) reveal a performance gap between AI mathematical fluency and mathematical rigour. Models that score highly on informal math benchmarks — demonstrating apparent comprehension of graduate-level mathematics — show significantly lower performance when the evaluation requires formal verification. The drop is larger for problems requiring multi-step logical structure than for problems where algebraic manipulation dominates.
The pattern is consistent with the "Squish and Release" hallucination findings released on the same day: AI systems produce confident, authoritative output that passes surface-level evaluation while containing structural errors that more rigorous testing exposes.
Why This Matters for AI in Science
The benchmark matters most for the question of whether AI can be trusted to do independent mathematical research. Formal verification is the gold standard for mathematical correctness, and FormalProofBench is the first benchmark designed to test frontier models against it at graduate level. The results suggest AI mathematical reasoning is more brittle than benchmark performance on informal tasks implies — a finding with direct implications for anyone deploying AI in scientific workflows.