AI Benchmarks Are Built on a Flawed Assumption, Google Study Finds: That Humans Agree
A study from Google researchers finds that the standard practice of using 3-5 human raters per benchmark question systematically undercounts genuine human disagreement, producing benchmarks that overstate human consensus and misrepresent model quality. The finding has implications for every major AI leaderboard in current use.

D.O.T.S AI Newsroom
AI News Desk
The AI field has built its evaluation infrastructure on a quiet assumption: that a small number of human raters per question is enough to establish what the correct answer is. A new study from Google researchers directly challenges that assumption. Their analysis finds that a pool of 3-5 human raters, the standard across most major AI benchmarks, is systematically too small to capture the actual distribution of human opinion on contested questions. The result is evaluation data that masks disagreement, inflates apparent consensus, and produces benchmark scores that do not accurately reflect how a model's outputs compare against the full range of human judgment.
The Disagreement That Gets Averaged Away
The core finding concerns what happens when the number of human raters per question increases from 3-5 to 20 or more. At the larger sample sizes, a substantial fraction of questions that looked like clear-consensus items turns out to have meaningful disagreement distributions: cases where a significant minority of reasonable human judges gives a different answer from the majority. When benchmarks use small rater pools, they collapse that minority view into noise. The model being evaluated is then scored against what appears to be a clear human standard, when the actual human standard is genuinely contested. This is not a marginal statistical effect: the Google team finds that it shifts benchmark rankings enough to change comparative model evaluations in multiple major leaderboard categories.
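The masking effect is easy to reproduce in simulation. The sketch below is not from the paper; the 70/30 opinion split, the rater counts, and the assumption of independent raters are illustrative choices, but they show how often a genuinely contested question can look unanimous to a small panel.

    import random

    random.seed(0)

    def share_appearing_unanimous(n_raters, p_majority=0.7, trials=100_000):
        # For a question where 70% of people hold the majority view,
        # estimate how often a random panel of n_raters happens to agree
        # completely, making the question look like a consensus item.
        unanimous = 0
        for _ in range(trials):
            votes = [random.random() < p_majority for _ in range(n_raters)]
            if all(votes) or not any(votes):
                unanimous += 1
        return unanimous / trials

    for n in (3, 5, 20):
        print(f"{n:>2} raters: {share_appearing_unanimous(n):.1%} of 70/30 questions look unanimous")

Under these assumed numbers, roughly 37% of genuinely contested 70/30 questions look unanimous to a 3-rater panel and about 17% to a 5-rater panel, versus well under 1% at 20 raters, which matches the study's pattern of apparent consensus dissolving as panels grow.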
Why This Is Hard to Fix
The study's inconvenient implication is that fixing AI benchmarks requires spending significantly more on human evaluation: not a marginal increase but an order-of-magnitude increase in rater volume for contested question types. The economics of benchmark production, which depend on scale and speed, push against this. The research community has treated benchmark contamination and overfitting as known problems for years; human rater disagreement is subtler and harder to address because it lives inside the data-generation process rather than in model training. It is also the kind of problem that is easy to deprioritize when every lab has an incentive to compare favorably on benchmarks as currently constituted.
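A rough way to see why the fix is an order-of-magnitude matter rather than a marginal one: treat raters as independent draws and ask how many are needed before a minority position reliably shows up more than once. The 20% minority share and the two-vote threshold in the sketch below are illustrative assumptions, not figures from the study.

    from math import comb

    def p_minority_surfaces(n_raters, p_minority=0.2, min_votes=2):
        # Binomial model: probability that a view held by p_minority of
        # the population appears at least min_votes times among n_raters,
        # i.e. registers as a signal rather than a one-off outlier.
        return 1.0 - sum(
            comb(n_raters, k) * p_minority**k * (1 - p_minority) ** (n_raters - k)
            for k in range(min_votes)
        )

    for n in (3, 5, 10, 20, 50):
        print(f"{n:>2} raters: {p_minority_surfaces(n):.0%} chance a 20% minority view appears twice or more")

Under those assumptions, a 3-rater panel surfaces the minority view about 10% of the time and a 5-rater panel about 26%; reaching the 90%-plus range takes roughly 20 raters. That gap is consistent with the study's point that contested question types need far larger rater pools, not slightly larger ones.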