Research

Alibaba's Qwen Team Builds HopChain to Fix How Vision Models Fall Apart During Multi-Step Reasoning

Alibaba's Qwen team and Tsinghua University have released HopChain, a training framework that forces vision-language models to verify intermediate reasoning steps before proceeding. The result: improvements on 20 of 24 benchmarks tested, with the score on one hard visual reasoning benchmark doubling.

D.O.T.S AI Newsroom

AI News Desk

Vision-language models have a well-documented weakness: when a task requires more than one or two reasoning steps over an image, small errors compound rapidly. Miscounting objects, misreading spatial relationships, or hallucinating a detail early in the reasoning chain contaminates every conclusion that follows. Alibaba's Qwen team, working with researchers from Tsinghua University, has published a framework called HopChain that directly targets this failure mode, and the benchmark results are notable.

What HopChain Does

HopChain automatically generates multi-stage visual questions that chain 3 to 6 objects together in dependency sequences. A model cannot answer the final question without correctly resolving each intermediate step. The pipeline runs through four stages: Alibaba's Qwen3-VL model first identifies object categories; Meta's SAM3 segments individual instances; the language model generates multi-level questions with linked dependencies; and four human annotators independently verify every answer. Only questions with unanimous human agreement enter the training set.
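The dependency structure and the unanimity filter described above can be sketched in a few lines of Python. This is only an illustration, not the team's pipeline: the `Hop` and `ChainedQuestion` types, the function names, and the sample question are all hypothetical, the sketch assumes one hop per chained object, and it compares annotator answers to the final answer only, whereas the article says every answer is verified.

```python
from dataclasses import dataclass, field


@dataclass
class Hop:
    """One step in a chained question (hypothetical structure)."""
    question: str
    answer: str
    # Indexes of earlier hops whose answers this hop needs.
    depends_on: list[int] = field(default_factory=list)


@dataclass
class ChainedQuestion:
    hops: list[Hop]
    # Independent human answers to the final hop (simplifying assumption).
    annotator_answers: list[str]


def is_valid_chain(q: ChainedQuestion, min_hops: int = 3, max_hops: int = 6) -> bool:
    """Check chain length and that every later hop depends on an earlier one."""
    if not (min_hops <= len(q.hops) <= max_hops):
        return False
    for i, hop in enumerate(q.hops):
        if i > 0 and not hop.depends_on:
            return False  # a free-standing hop breaks the chain
        if any(d < 0 or d >= i for d in hop.depends_on):
            return False  # dependencies must point backwards
    return True


def has_unanimous_agreement(q: ChainedQuestion, n_annotators: int = 4) -> bool:
    """Keep a question only if all annotators match the reference answer."""
    reference = q.hops[-1].answer.strip().lower()
    return len(q.annotator_answers) == n_annotators and all(
        a.strip().lower() == reference for a in q.annotator_answers
    )


def filter_training_set(candidates: list[ChainedQuestion]) -> list[ChainedQuestion]:
    return [q for q in candidates if is_valid_chain(q) and has_unanimous_agreement(q)]


# Made-up example data, not taken from the HopChain dataset.
hops = [
    Hop("How many mugs are on the table?", "3"),
    Hop("What colour is the mug nearest the laptop?", "red", [0]),
    Hop("What is printed on that red mug?", "a star", [1]),
]
good = ChainedQuestion(hops, ["a star", "A star", "a star", "a star"])
disputed = ChainedQuestion(hops, ["a star", "a star", "a star", "a moon"])
kept = filter_training_set([good, disputed])
```

Here `kept` contains only `good`: a single dissenting annotator is enough to drop `disputed`, which mirrors the unanimous-agreement rule.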

The forcing function is simple but effective: each step in a HopChain question requires the model to return to the image and visually verify a claim before proceeding. Models trained on this data learn to treat visual verification as mandatory rather than optional.

Benchmark Results

HopChain training improved scores on 20 of 24 benchmarks across STEM, document comprehension, and video reasoning tasks. On the 35B model, EMMA rose from 53 to 58 and CharXiv from 69 to 73.1. On the 397B model, the score on ZeroBench, a particularly hard visual reasoning benchmark, doubled from 4 to 8 points. The framework generalized to video tasks despite being trained on still images, suggesting it captures something structural about multi-step visual attention rather than dataset-specific patterns.
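A quick calculation on the three scores reported above helps separate absolute from relative gains. The sketch below uses only those figures; the `reported` table layout and the `gains` helper are illustrative names, not anything from the paper.

```python
# (model size, benchmark) -> (score before, score after), as reported in the article.
reported = {
    ("35B", "EMMA"): (53.0, 58.0),
    ("35B", "CharXiv"): (69.0, 73.1),
    ("397B", "ZeroBench"): (4.0, 8.0),
}


def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    delta = after - before
    return round(delta, 1), round(100 * delta / before, 1)


summary = {key: gains(*scores) for key, scores in reported.items()}

for (model, benchmark), (points, percent) in summary.items():
    print(f"{model} {benchmark}: +{points} points ({percent}% relative)")
```

Seen this way, the ZeroBench doubling is a 4-point absolute gain from a very low base, comparable in points to the EMMA and CharXiv improvements.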

Full reasoning chains proved essential: training on partial chains degraded performance. The benefits grew with chain length, reaching improvements of 50 points or more on the longest responses.

Why This Matters

Multi-step visual reasoning is the gateway to reliable AI for domains like medical imaging, scientific analysis, and robotics. Today's models are brittle exactly where precision is most required. HopChain's approach, building training data that enforces step-by-step verification, is a data-centric solution to a problem that extra model capacity alone has not solved. The code and training data are expected to be released alongside the paper.
