Alibaba's Qwen Team Fixes the Core Problem With Reasoning Model Training — and Doubles Thought Length in the Process
Outcome-based reinforcement learning typically gives every token in a reasoning model's output the same reward, whether that token was the pivot that unlocked a solution or just a filler comma. Alibaba's Qwen team has built FIPO, an algorithm that assigns credit to each token based on its downstream influence. The reported results include doubled reasoning length, achieved without adding a separate value model.
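To make the contrast concrete, here is a minimal Python sketch of the two credit schemes. It is an illustration under stated assumptions, not FIPO's actual formula: the function names, the `influence` scores, and the mean-1 rescaling are all hypothetical choices made for this example.

```python
def uniform_credit(reward, num_tokens):
    """Outcome-only credit: every token receives the same sequence-level reward."""
    return [reward] * num_tokens


def influence_weighted_credit(reward, influence):
    """Scale the shared reward by a per-token influence score.

    `influence` is a hypothetical, non-negative score per token. It is rescaled
    to have mean 1 so the average credit across tokens still equals `reward`.
    """
    mean = sum(influence) / len(influence)
    if mean <= 0:
        # No usable signal: fall back to uniform credit.
        return uniform_credit(reward, len(influence))
    return [reward * w / mean for w in influence]


tokens = ["so", ",", "substitute", "x=2", "."]
influence = [0.5, 0.1, 2.0, 2.3, 0.1]  # made-up numbers, for illustration only

uniform = uniform_credit(1.0, len(tokens))
weighted = influence_weighted_credit(1.0, influence)

for tok, u, w in zip(tokens, uniform, weighted):
    print(f"{tok!r}: uniform={u:.2f} weighted={w:.2f}")
```

In the uniform scheme the comma earns as much credit as the substitution step. In the weighted scheme the same total credit is redistributed toward the tokens assumed to matter most, and no second model is needed to produce the per-token values.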