
Research

Grok 4.20 Trails GPT-5.4 and Gemini by Wide Margins — But Sets a New Record for Not Hallucinating

A comprehensive independent benchmark evaluation has found that xAI's Grok 4.20 performs significantly below GPT-5.4 and Google's Gemini 2.5 Pro on standard reasoning, coding, and knowledge benchmarks, trailing by 8-14 percentage points across the MMLU, GPQA, and HumanEval test suites. The same evaluation, however, surfaces a remarkable outlier: Grok 4.20 achieves the lowest hallucination rate ever recorded on the TruthfulQA and FActScore benchmarks, outperforming GPT-5.4 by 19 points and Gemini 2.5 Pro by 23 points in factual accuracy under adversarial prompting.

The researchers attribute the hallucination resistance to xAI's novel abstention training methodology, in which Grok is explicitly rewarded for saying "I don't know" rather than confabulating plausible-sounding answers. For enterprise use cases where factual precision matters more than raw capability, such as legal research, financial analysis, and medical literature review, the results suggest Grok 4.20 may be the preferred model despite its headline benchmark underperformance.

Sofia Reyes