Research

Alibaba's Qwen Team Fixes Reinforcement Learning's Blind Spot to Make AI Reason More Deeply

Alibaba's Qwen research team has published a new training algorithm that addresses a fundamental limitation in how reinforcement learning reward signals are assigned to reasoning models — giving each step in a reasoning chain a weight proportional to its actual impact on the outcome. Early results show measurable improvements in multi-step reasoning quality.

D.O.T.S AI Newsroom

AI News Desk

2 min read

Reinforcement learning has become the dominant technique for pushing AI models beyond what pretraining alone can achieve, but it carries a structural flaw that has quietly capped reasoning quality across the field: every token in a reasoning chain receives the same reward signal, whether it was the pivotal step that got the answer right or an irrelevant filler phrase that contributed nothing. Alibaba's Qwen research team has published a new algorithm, released this week, that addresses this flaw. The approach assigns each reasoning step a weight based on its measured contribution to the final outcome, so the training signal concentrates on the decisions that actually matter.

Why Uniform Token Rewards Hit a Wall

The uniform reward problem is not new; it has been discussed in the RL literature for years. But it becomes acute at the scale of the multi-step reasoning chains that today's frontier models are trained to produce. When a model works through a 20-step reasoning problem, the reward signal at the end of the chain is distributed equally across all 20 steps, including the preamble, the restatements, and the filler reasoning that pads word count without advancing the solution. The model learns to produce long chains without learning which steps in those chains are load-bearing. Qwen's step-weighting approach changes what the model is rewarded for: not merely reaching a correct answer, but identifying and executing the correct reasoning moves.
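The contrast between uniform and contribution-weighted credit assignment can be illustrated with a minimal sketch. This is not the Qwen team's algorithm (the paper's actual method is not reproduced here); it is a toy illustration in which `contribution_weighted_credit` and the example contribution scores are hypothetical, standing in for whatever per-step impact measure the real training pipeline computes.

```python
import numpy as np

def uniform_credit(reward, n_steps):
    # Standard outcome-based RL: every step in the chain receives the
    # same share of the final reward, regardless of its actual impact.
    return np.full(n_steps, reward / n_steps)

def contribution_weighted_credit(reward, contributions):
    # Hypothetical step-weighting: scale each step's share of the reward
    # by a measured contribution score (e.g., how much removing the step
    # changes the probability of reaching the correct answer).
    contributions = np.asarray(contributions, dtype=float)
    weights = contributions / contributions.sum()
    return reward * weights

# A 5-step chain where step 2 is the load-bearing deduction and steps
# 0 and 4 are filler. The contribution scores are purely illustrative.
contributions = [0.05, 0.20, 0.60, 0.10, 0.05]
print(uniform_credit(1.0, 5))                            # every step gets 0.2
print(contribution_weighted_credit(1.0, contributions))  # step 2 gets 0.6
```

Under uniform credit, the filler steps are reinforced exactly as strongly as the pivotal one; under the weighted scheme, the gradient pressure concentrates on the step that actually decided the outcome.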

The Implications for Reasoning-Focused Models

If the approach holds up under broader evaluation, it addresses one of the most significant bottlenecks in scaling reasoning quality beyond what current RL techniques achieve. The Qwen team's results, while preliminary, show improvements on standard multi-step math and logic benchmarks large enough to warrant attention from the broader research community. The technique is also model-agnostic (it does not require Qwen-specific architecture), so if the results replicate it is likely to be adopted across the field quickly.
