Research

Alibaba's Qwen Team Fixes the Core Problem With Reasoning Model Training — and Doubles Thought Length in the Process

Reinforcement learning gives reasoning models the same reward for every token, regardless of whether it was the pivot that unlocked a solution or just a filler comma. Alibaba's Qwen team has built FIPO, an algorithm that assigns rewards based on downstream influence — and the results include doubled reasoning depth without adding a separate value model.

D.O.T.S AI Newsroom

AI News Desk

3 min read

When a language model learns to reason through reinforcement learning, every token in a generated sequence receives the same reward signal regardless of its actual contribution to the outcome. The token that represents the critical logical pivot — the one that, if generated differently, would have sent the reasoning chain in a completely different direction — receives the same credit as the comma that followed it. Alibaba's Qwen team has identified this uniform credit assignment as a major reason why reasoning models hit a performance ceiling, and they have built an algorithm to fix it.

The Problem With GRPO

Group Relative Policy Optimization (GRPO), one of the most commonly used reinforcement learning methods for training reasoning models, assigns rewards at the sequence level and distributes them evenly across tokens. The result is that reasoning chains grow to a certain length during training and then stagnate. The model cannot learn to distinguish high-leverage reasoning moves from low-leverage ones because the reward signal treats them identically.
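
The uniform credit assignment described above can be sketched in a few lines of NumPy. This is an illustrative simplification, not the Qwen implementation: a group of sampled responses each gets a reward, the rewards are normalized within the group, and the resulting scalar is copied to every token.

```python
import numpy as np

def grpo_token_advantages(group_rewards, seq_lens):
    """GRPO-style credit assignment (illustrative): each sampled
    response gets a group-normalized advantage, which is then
    broadcast unchanged to every one of its tokens."""
    r = np.asarray(group_rewards, dtype=float)
    adv = (r - r.mean()) / (r.std() + 1e-8)  # group-relative advantage
    # The pivotal token and the trailing comma inherit the same scalar.
    return [np.full(n, a) for a, n in zip(adv, seq_lens)]

# Four sampled responses, two correct (reward 1) and two wrong (reward 0).
advs = grpo_token_advantages([1.0, 0.0, 1.0, 0.0], [5, 3, 4, 6])
```

Every token in a given response carries an identical value, which is exactly the signal the article says cannot distinguish high-leverage moves from filler.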

Previous attempts to fix this relied on PPO-based methods that use a separate value model to estimate the benefit of each token. The problem: that auxiliary model needs to be pre-trained on long chain-of-thought data, which means external knowledge contaminates the training signal. It becomes impossible to know whether performance improvements come from the algorithm or from knowledge inherited by the value model.

FIPO: Future-Influenced Reward Assignment

FIPO (Future-KL Influenced Policy Optimization) takes a different approach: instead of scoring tokens on their own, the algorithm looks ahead and measures how each token's generation changes the probability distribution over all subsequent tokens. Tokens that kick off a productive reasoning chain — that shift the model toward a different, better trajectory — receive larger rewards. Tokens that are locally uninformative receive less. No auxiliary value model is required.
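
The look-ahead idea can be sketched as follows. This is a hypothetical toy version, with function names and distributions of our own choosing; the paper's exact formulation will differ. A token's influence is scored as the total KL divergence between the model's predictions over later positions with the token in context versus a baseline context without it.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def future_influence_score(post_dists, base_dists):
    """Total shift a single token induces in the model's predictions
    over later positions: sum the KL between the predictive distribution
    at each future step with the token in context (post_dists) and a
    baseline context without it (base_dists)."""
    return sum(kl_divergence(p, q) for p, q in zip(post_dists, base_dists))

# Toy example: the token changes the prediction at the first future step only.
same = [[0.5, 0.5], [0.25, 0.75]]
shifted = [[0.9, 0.1], [0.25, 0.75]]
no_influence = future_influence_score(same, same)  # identical futures
influence = future_influence_score(shifted, same)  # redirected future
```

A token that leaves the future distributions unchanged scores zero; a token that redirects them scores positive, which is the property FIPO exploits to reward trajectory-shifting moves.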

To keep training stable, FIPO incorporates a discount factor (nearby tokens carry more predictive weight than distant ones) and a filter that excludes tokens where the policy has drifted significantly between training steps. Without the filter, the researchers observed severe instability: in their experiments, training derailed around step 70 and reasoning chain lengths collapsed.
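
Those two stabilizers can be sketched minimally, under assumed forms: an exponential discount over future steps and a threshold on per-token policy drift. Both the discount shape and the threshold value are our assumptions, not the paper's.

```python
import numpy as np

def discounted_influence(kl_per_future_step, gamma=0.9):
    """Weight per-step influence so nearby future tokens count more
    than distant ones (the exact discount form is assumed)."""
    kls = np.asarray(kl_per_future_step, dtype=float)
    weights = gamma ** np.arange(len(kls))
    return float(np.sum(weights * kls))

def drift_mask(logp_new, logp_old, max_abs_log_ratio=2.0):
    """Keep only tokens whose policy log-probability has not drifted
    too far between training steps (threshold is hypothetical)."""
    drift = np.abs(np.asarray(logp_new) - np.asarray(logp_old))
    return drift <= max_abs_log_ratio

score = discounted_influence([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
mask = drift_mask([0.0, -5.0], [0.0, 0.0])  # second token has drifted
```

The mask is what prevents the off-policy tokens from injecting noisy influence estimates into the update, the failure mode the researchers saw around step 70.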

Results on Qwen2.5-32B

The team tested FIPO on Qwen2.5-32B-Base — a 32-billion parameter model with no prior exposure to synthetic reasoning data. The results are significant in two dimensions. First, thought processes approximately doubled in length, indicating the model was learning more elaborate multi-step reasoning rather than reaching conclusions prematurely. Second, accuracy on the AIME-2024 mathematics benchmark improved substantially, outperforming the baseline, DeepSeek-R1-Zero, and o1-mini during training — while being comparable to PPO-based methods without requiring the auxiliary model those methods depend on.

The practical implication is that reasoning models trained with FIPO can solve harder problems through deeper deliberation, without the data-contamination risk that value-model pre-training introduces. For Alibaba's Qwen team, among the most active open-source contributors to frontier model development, FIPO is another methodological step toward meaningfully more capable reasoning models.

Related Stories

Research

Google's AI Overviews Are Right Nine Times Out of Ten — but the 10% Failure Rate Has a Specific Shape

A new independent study is the first to systematically measure the factual accuracy of Google's AI Overviews at scale. The headline finding — 90% accuracy — is better than critics expected and worse than Google implies. The more important finding is where that 10% comes from: complex multi-step queries, niche topics, and questions where the web itself is the source of conflicting claims.

D.O.T.S AI Newsroom
Research

Databricks Co-Founder Wins Top Computing Prize — and Says AGI Is 'Already Here'

Matei Zaharia, co-founder of Databricks and creator of Apache Spark, has won the ACM Prize in Computing — one of the most prestigious awards in computer science. In interviews accompanying the announcement, Zaharia made a pointed argument: AGI is not a future event but a present condition, and the industry's endless debate about its arrival is obscuring more useful questions about what to do with the AI we already have.

D.O.T.S AI Newsroom
Research

Researchers Fingerprinted 178 AI Models' Writing Styles — and Found Alarming Clone Clusters

A new study from Rival analyzed 3,095 standardized responses across 178 AI models, extracting 32-dimension stylometric fingerprints to map which models write like which others. The findings reveal tightly grouped clone clusters across providers — and raise serious questions about whether the AI ecosystem is converging on a single voice.

D.O.T.S AI Newsroom