Alibaba's Qwen Team Fixes Reinforcement Learning's Blind Spot to Make AI Reason More Deeply
Alibaba's Qwen research team has published a new training algorithm that addresses a fundamental limitation in how reinforcement learning assigns reward signals within reasoning models, giving each step in a reasoning chain a weight proportional to its actual impact on the outcome. Early results show measurable improvements in multi-step reasoning quality.

D.O.T.S AI Newsroom
AI News Desk
Reinforcement learning has become the dominant technique for pushing AI models beyond what pretraining alone can achieve, but it carries a structural flaw that has quietly capped reasoning quality for the entire field: every token in a reasoning chain receives the same reward signal, regardless of whether it was the pivotal step that got the answer right or an irrelevant filler phrase that contributed nothing. This week, Alibaba's Qwen research team published a new algorithm that targets the flaw directly. The approach assigns each reasoning step a weight based on its measured contribution to the final outcome, allowing the training signal to concentrate on the decisions that actually matter.
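To make the contrast concrete, here is a minimal sketch of the two credit-assignment schemes. The Qwen team's published algorithm is not detailed in this article, so the function names and the `contributions` input (a hypothetical per-step impact estimate) are illustrative assumptions, not the team's actual method.

```python
# Illustrative sketch only: `contributions` is a hypothetical per-step
# impact estimate standing in for whatever measure the Qwen team uses.

def uniform_credit(final_reward: float, num_steps: int) -> list[float]:
    """Classic outcome-based RL: every step receives the same signal."""
    return [final_reward / num_steps] * num_steps

def weighted_credit(final_reward: float, contributions: list[float]) -> list[float]:
    """Step-weighted credit: the signal concentrates on high-impact steps."""
    total = sum(contributions)
    if total == 0:
        # Degenerate case: no step stands out, fall back to uniform credit.
        return uniform_credit(final_reward, len(contributions))
    return [final_reward * c / total for c in contributions]

# A 5-step chain where step 2 is the pivotal move and steps 0 and 4 are filler.
impact = [0.0, 0.1, 0.7, 0.2, 0.0]
print(uniform_credit(1.0, 5))        # filler rewarded as much as the key step
print(weighted_credit(1.0, impact))  # pivotal step dominates the signal
```

Under the uniform scheme every step earns an identical 0.2 share of the reward; under the weighted scheme the pivotal step absorbs most of it, which is the behavioral shift the article describes.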
Why Uniform Token Rewards Hit a Wall
The uniform reward problem is not new — it has been discussed in the RL literature for years — but it becomes acute at the scale of multi-step reasoning chains that today's frontier models are being trained to produce. When a model works through a 20-step reasoning problem, the reward signal at the end of that chain is distributed equally across all 20 steps, including the preamble, the restatements, and the filler reasoning that pads word count without advancing the solution. The model learns to produce long chains without learning which steps in those chains are load-bearing. Qwen's step-weighting approach changes what the model is being rewarded for: not just reaching a correct answer, but identifying and executing the correct reasoning moves.
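One natural way to identify the load-bearing steps in a chain is an ablation test: remove each step and measure how much the model's confidence in the correct answer drops. The article does not say how the Qwen team measures contribution, so the sketch below, including the `answer_prob` scorer, is a hypothetical stand-in for that measurement.

```python
# Hypothetical contribution estimate via ablation: re-score the chain with
# each step removed and record the drop in correct-answer probability.
# This is NOT the published Qwen method, only one plausible illustration.
from typing import Callable, List

def contribution_by_ablation(
    steps: List[str],
    answer_prob: Callable[[List[str]], float],
) -> List[float]:
    """Score each step by how much answer probability falls when it is removed."""
    baseline = answer_prob(steps)
    scores = []
    for i in range(len(steps)):
        ablated = steps[:i] + steps[i + 1:]
        drop = baseline - answer_prob(ablated)
        scores.append(max(drop, 0.0))  # clamp: removing filler may even help
    return scores

# Toy scorer: only the step containing the key deduction affects the answer.
def toy_answer_prob(steps: List[str]) -> float:
    return 0.9 if any("key deduction" in s for s in steps) else 0.1

chain = ["restate the problem", "key deduction: x = 7", "pad the word count"]
print(contribution_by_ablation(chain, toy_answer_prob))
```

In this toy run the restatement and padding steps score zero while the deduction step captures the entire drop, exactly the distinction a uniform reward signal cannot express.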
The Implications for Reasoning-Focused Models
If the approach holds up under broader evaluation, it addresses one of the most significant bottlenecks in scaling reasoning quality with current RL techniques. The Qwen team's results, while preliminary, show improvements on standard multi-step math and logic benchmarks that are large enough to warrant attention from the broader research community. The technique is also model-agnostic, requiring no Qwen-specific architecture, which means it will likely be applied across the field quickly if the results replicate.