Alibaba's Qwen Team Fixes the Core Problem With Reasoning Model Training — and Doubles Thought Length in the Process
Reinforcement learning gives reasoning models the same reward for every token, regardless of whether it was the pivot that unlocked a solution or just a filler comma. Alibaba's Qwen team has built FIPO, an algorithm that assigns rewards based on downstream influence — and the results include doubled reasoning depth without adding a separate value model.

D.O.T.S AI Newsroom
AI News Desk
When a language model learns to reason through reinforcement learning, every token in a generated sequence receives the same reward signal regardless of its actual contribution to the outcome. The token that represents the critical logical pivot — the one that, if generated differently, would have sent the reasoning chain in a completely different direction — receives the same credit as the comma that followed it. Alibaba's Qwen team has identified this uniform credit assignment as a major reason why reasoning models hit a performance ceiling, and they have built an algorithm to fix it.
The Problem With GRPO
Group Relative Policy Optimization (GRPO), one of the most commonly used reinforcement learning methods for training reasoning models, assigns rewards at the sequence level and distributes them evenly across tokens. The result is that reasoning chains grow to a certain length during training and then stagnate. The model cannot learn to distinguish high-leverage reasoning moves from low-leverage ones because the reward signal treats them identically.
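As a rough illustration (not the Qwen team's code), a GRPO-style update computes one group-relative advantage per sampled answer and then copies that single number onto every token of the answer; the function and variable names below are hypothetical:

```python
# Minimal sketch of GRPO-style credit assignment (illustrative only):
# one scalar advantage per sampled sequence, broadcast to all its tokens.
from statistics import mean, stdev

def grpo_token_advantages(group_rewards, group_token_counts):
    """group_rewards: sequence-level rewards for one prompt's sampled group.
    group_token_counts: number of tokens in each sampled sequence."""
    mu = mean(group_rewards)
    sigma = stdev(group_rewards) if len(group_rewards) > 1 else 1.0
    advantages = []
    for reward, n_tokens in zip(group_rewards, group_token_counts):
        seq_advantage = (reward - mu) / (sigma + 1e-8)
        # Every token, pivotal or filler, gets the identical signal.
        advantages.append([seq_advantage] * n_tokens)
    return advantages

# Example: two sampled answers to the same prompt, one correct (reward 1.0),
# one wrong (reward 0.0). All 5 tokens of the correct answer share one value.
print(grpo_token_advantages([1.0, 0.0], [5, 7]))
```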
Previous attempts to fix this relied on PPO-based methods that use a separate value model to estimate the benefit of each token. The problem: that auxiliary model needs to be pre-trained on long chain-of-thought data, which means external knowledge contaminates the training signal. It becomes impossible to know whether performance improvements come from the algorithm or from knowledge inherited by the value model.
FIPO: Future-Influenced Reward Assignment
FIPO (Future-KL Influenced Policy Optimization) takes a different approach: instead of scoring tokens on their own, the algorithm looks ahead and measures how each token's generation changes the probability distribution over all subsequent tokens. Tokens that kick off a productive reasoning chain — that shift the model toward a different, better trajectory — receive larger rewards. Tokens that are locally uninformative receive less. No auxiliary value model is required.
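How that look-ahead might be scored is sketched below under one plausible reading of the description; the counterfactual-resampling setup and the helper names are assumptions for illustration, not the paper's exact formulation. The idea: a token's influence is the KL divergence between the predictive distributions at later positions that follow its actual value and those that follow a resampled alternative.

```python
# Hedged sketch of the future-influence idea: score token t by how much its
# realized value shifts the model's predictive distributions at later
# positions, measured against a counterfactual rollout that resamples t.
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def future_influence(actual_future_dists, counterfactual_future_dists):
    """Sum per-position KLs between the futures that follow the actual token
    and the futures that follow a resampled (counterfactual) token.
    A large value means the token meaningfully redirected the trajectory."""
    return sum(
        kl_divergence(p, q)
        for p, q in zip(actual_future_dists, counterfactual_future_dists)
    )

# Toy example with a 3-word vocabulary and two future positions: the pivotal
# token pushes later predictions toward a different continuation.
actual = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
counterfactual = [[0.2, 0.3, 0.5], [0.3, 0.3, 0.4]]
print(future_influence(actual, counterfactual))  # noticeably greater than 0
```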
To keep training stable, FIPO incorporates a discount factor (nearby tokens carry more predictive weight than distant ones) and a filter that excludes tokens where the model has drifted significantly between training steps. Without the filter, the researchers observed severe instability: in their experiments, training derailed around step 70 and reasoning chain lengths collapsed.
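A minimal sketch of those two stabilizers, assuming a geometric discount and a per-token log-probability drift threshold; the parameter names gamma and max_drift are illustrative, not from the paper:

```python
# Hedged sketch of the discount and the drift filter described above.
def fipo_token_reward(per_position_kls, logp_new, logp_old,
                      gamma=0.95, max_drift=0.2):
    """per_position_kls: this token's influence on each later position,
    ordered nearest to farthest. logp_new / logp_old: the token's log
    probability under the current and the behavior (old) policy."""
    # Drift filter: if the policy has moved too much on this token since the
    # rollout was sampled, exclude it from the update entirely.
    if abs(logp_new - logp_old) > max_drift:
        return 0.0
    # Discounted influence: nearby positions count more than distant ones.
    return sum(gamma ** k * kl for k, kl in enumerate(per_position_kls))

# Example: a token whose influence decays over the next three positions.
print(fipo_token_reward([0.8, 0.3, 0.1], logp_new=-1.1, logp_old=-1.2))
```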
Results on Qwen2.5-32B
The team tested FIPO on Qwen2.5-32B-Base, a 32-billion-parameter model with no prior exposure to synthetic reasoning data. The results are notable in two respects. First, thought processes approximately doubled in length, indicating the model was learning more elaborate multi-step reasoning rather than reaching conclusions prematurely. Second, accuracy on the AIME-2024 mathematics benchmark improved substantially, outperforming the baseline, DeepSeek-R1-Zero, and o1-mini during training, while remaining comparable to PPO-based methods without requiring the auxiliary value model those methods depend on.
The practical implication is that reasoning models trained with FIPO can solve harder problems through deeper deliberation, without the data contamination risk that comes with value model pre-training. For Alibaba's Qwen team, which has been among the most active open-source contributors to frontier model development, FIPO represents another methodological contribution that advances the field's ability to make reasoning models meaningfully more capable.