Research

Can Frontier AI Write Formally Verified Graduate Math Proofs? A New Benchmark Has the Answer.

FormalProofBench is a new private benchmark that tests whether AI models can produce graduate-level mathematical proofs that are formally verified — not just plausible-sounding, but machine-checkably correct. The results expose a gap between AI math fluency and AI mathematical rigour.

D.O.T.S AI Newsroom

AI News Desk

2 min read

A new benchmark called FormalProofBench (arXiv:2603.26996) has been designed to answer a question that matters for AI's role in scientific research: can frontier models write mathematical proofs at the graduate level that are not merely convincing but formally verified — checked by a machine proof assistant to be logically watertight?

Why Formal Verification Changes the Question

Most existing AI math benchmarks — MATH, AIME, competition problems — evaluate whether a model reaches the correct answer. They do not evaluate the proof's logical structure. A model can produce a persuasive wrong argument, make hidden assumptions, or skip non-trivial steps in ways that are not penalised by answer-checking evaluation. Human expert reviewers can sometimes catch these failures; formal verification always catches them.

FormalProofBench pairs each natural-language problem with a formal proof checker. The task is not to give the right answer — it is to generate a proof that a mechanical verifier accepts as logically complete. This is a strictly harder requirement. It eliminates the "plausible but wrong" failure mode that inflates model performance on informal math benchmarks.
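To make concrete what "accepted by a mechanical verifier" means, here is a minimal Lean 4 sketch (a toy statement, not a problem from FormalProofBench): the proof assistant's kernel either certifies every step of the argument or rejects it, so a persuasive proof with a hidden gap cannot pass.

```lean
-- Toy illustration, not drawn from the benchmark: a statement counts as
-- "formally verified" only if the kernel accepts a complete proof of it.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b  -- accepted: every logical step is machine-checked

-- An argument with a skipped step (e.g. a `sorry` placeholder) is flagged
-- by the checker and would not count as a verified proof.
```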

What the Results Show

The benchmark's results (detailed in the paper) reveal a performance gap between AI mathematical fluency and mathematical rigour. Models that score highly on informal math benchmarks — demonstrating apparent comprehension of graduate-level mathematics — show significantly lower performance when the evaluation requires formal verification. The drop is larger for problems requiring multi-step logical structure than for problems where algebraic manipulation dominates.

The pattern is consistent with the "Squish and Release" hallucination findings released on the same day: AI systems produce confident, authoritative output that passes surface-level evaluation while containing structural errors that more rigorous testing exposes.

Why This Matters for AI in Science

The benchmark matters most for the question of whether AI can be trusted to do independent mathematical research. Formal verification is the gold standard for mathematical correctness, and FormalProofBench is the first benchmark designed to test frontier models against it at graduate level. The results suggest AI mathematical reasoning is more brittle than benchmark performance on informal tasks implies — a finding with direct implications for anyone deploying AI in scientific workflows.


Related Stories

Google's AI Overviews Are Right Nine Times Out of Ten — but the 10% Failure Rate Has a Specific Shape
Research

A new independent study is the first to systematically measure the factual accuracy of Google's AI Overviews at scale. The headline finding — 90% accuracy — is better than critics expected and worse than Google implies. The more important finding is where that 10% comes from: complex multi-step queries, niche topics, and questions where the web itself is the source of conflicting claims.

D.O.T.S AI Newsroom
Databricks Co-Founder Wins Top Computing Prize — and Says AGI Is 'Already Here'
Research

Matei Zaharia, co-founder of Databricks and creator of Apache Spark, has won the ACM Prize in Computing — one of the most prestigious awards in computer science. In interviews accompanying the announcement, Zaharia made a pointed argument: AGI is not a future event but a present condition, and the industry's endless debate about its arrival is obscuring more useful questions about what to do with the AI we already have.

D.O.T.S AI Newsroom
Researchers Fingerprinted 178 AI Models' Writing Styles — and Found Alarming Clone Clusters
Research

A new study from Rival analyzed 3,095 standardized responses across 178 AI models, extracting 32-dimension stylometric fingerprints to map which models write like which others. The findings reveal tightly grouped clone clusters across providers — and raise serious questions about whether the AI ecosystem is converging on a single voice.

D.O.T.S AI Newsroom