Research · 2 min read
AI Benchmarks Are Built on a Flawed Assumption, Google Study Finds: That Humans Agree
A study from Google researchers finds that the standard practice of using 3–5 human raters per benchmark question systematically undercounts genuine human disagreement, producing benchmarks that overstate confidence in model scores and misrepresent model quality. The finding has implications for every major AI leaderboard in current use.