Alibaba's Qwen Team Fixes Reinforcement Learning's Blind Spot to Make AI Reason More Deeply
Alibaba's Qwen research team has published a new training algorithm that addresses a fundamental limitation in how reinforcement learning assigns reward signals to reasoning models. The algorithm gives each step in a reasoning chain a weight proportional to its actual impact on the outcome. Early results show measurable improvements in multi-step reasoning quality.
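To make the idea of impact-proportional weighting concrete, here is a minimal sketch in Python. It is not the Qwen team's algorithm, whose details are not given here. The function names, the `impact_scores` input, and the uniform baseline used for contrast are all assumptions made for illustration. How a real system would estimate each step's impact is left out entirely.

```python
from typing import List


def uniform_credit(outcome_reward: float, num_steps: int) -> List[float]:
    """Assumed baseline: every step receives the same share of the outcome reward."""
    return [outcome_reward / num_steps] * num_steps


def impact_weighted_credit(outcome_reward: float, impact_scores: List[float]) -> List[float]:
    """Split the outcome reward across steps in proportion to each step's impact score.

    `impact_scores` is a hypothetical input: one non-negative number per step.
    """
    total = sum(impact_scores)
    if total <= 0:
        # No usable impact signal, so fall back to the uniform baseline.
        return uniform_credit(outcome_reward, len(impact_scores))
    return [outcome_reward * score / total for score in impact_scores]


# Hypothetical impact scores for a four-step reasoning chain.
impact_scores = [1.0, 3.0, 0.0, 4.0]
uniform = uniform_credit(1.0, len(impact_scores))
weighted = impact_weighted_credit(1.0, impact_scores)
print("uniform: ", uniform)
print("weighted:", weighted)
```

In this toy case the third step contributed nothing and receives no credit, while the fourth step receives half of the reward. Under the uniform baseline all four steps would be credited identically.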