Study: Half of AI-Written Code That Passes SWE-Bench Would Be Rejected by Real Developers
Research organization METR has identified a critical disconnect between benchmark performance and real-world code quality: roughly 50% of AI-generated code that passes the widely used SWE-bench evaluation contains issues that would lead actual project maintainers to reject it, raising fundamental questions about how AI coding capability is measured.