Can Frontier AI Write Formally Verified Graduate Math Proofs? A New Benchmark Has the Answer.
FormalProofBench is a new private benchmark that tests whether AI models can produce graduate-level mathematical proofs that are formally verified: not merely plausible-sounding, but checked correct by machine. The results expose a gap between AI mathematical fluency and AI mathematical rigour.

D.O.T.S AI Newsroom
AI News Desk
A new benchmark called FormalProofBench (arXiv:2603.26996) has been designed to answer a question that matters for AI's role in scientific research: can frontier models write mathematical proofs at the graduate level that are not merely convincing but formally verified — checked by a machine proof assistant to be logically watertight?
Why Formal Verification Changes the Question
Most existing AI math benchmarks — MATH, AIME, competition problems — evaluate whether a model reaches the correct answer. They do not evaluate the proof's logical structure. A model can produce a persuasive wrong argument, make hidden assumptions, or skip non-trivial steps in ways that are not penalised by answer-checking evaluation. Human expert reviewers can sometimes catch these failures; formal verification always catches them.
FormalProofBench pairs each natural-language problem with a formal proof checker. The task is not to give the right answer — it is to generate a proof that a mechanical verifier accepts as logically complete. This is a strictly harder requirement. It eliminates the "plausible but wrong" failure mode that inflates model performance on informal math benchmarks.
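The paper's choice of proof assistant is not stated here; for illustration only, a toy Lean 4 theorem (far below the benchmark's graduate level, and not drawn from it) shows what "a mechanical verifier accepts the proof" means in practice:

```lean
-- Toy example: the sum of two even naturals is even.
-- The Lean kernel accepts this proof only if every step is supplied;
-- replacing any step with `sorry` makes the checker flag the proof
-- as incomplete, so a "plausible but wrong" argument cannot slip through.
theorem even_add_even (a b : Nat)
    (ha : ∃ k, a = 2 * k) (hb : ∃ k, b = 2 * k) :
    ∃ k, a + b = 2 * k := by
  obtain ⟨m, hm⟩ := ha   -- unpack the witness for a
  obtain ⟨n, hn⟩ := hb   -- unpack the witness for b
  -- a + b = 2 * m + 2 * n = 2 * (m + n), checked by the kernel
  exact ⟨m + n, by rw [hm, hn, Nat.mul_add]⟩
```

An informal write-up could gesture at "clearly the sum is even"; the verifier instead demands the explicit witness `m + n` and the rewrite that closes the equation, which is exactly the gap between fluency and rigour the benchmark measures.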
What the Results Show
The benchmark's results (detailed in the paper) reveal a performance gap between AI mathematical fluency and mathematical rigour. Models that score highly on informal math benchmarks — demonstrating apparent comprehension of graduate-level mathematics — show significantly lower performance when the evaluation requires formal verification. The drop is larger for problems requiring multi-step logical structure than for problems where algebraic manipulation dominates.
The pattern is consistent with the "Squish and Release" hallucination findings released on the same day: AI systems produce confident, authoritative output that passes surface-level evaluation while containing structural errors that more rigorous testing exposes.
Why This Matters for AI in Science
The benchmark matters most for the question of whether AI can be trusted to do independent mathematical research. Formal verification is the gold standard for mathematical correctness, and FormalProofBench is the first benchmark designed to test frontier models against it at graduate level. The results suggest AI mathematical reasoning is more brittle than benchmark performance on informal tasks implies — a finding with direct implications for anyone deploying AI in scientific workflows.