Naver's Seoul World Model Grounds AI Video Generation in Real City Geometry to Stop Hallucination
South Korean internet giant Naver has built a video world model tied to actual physical geography — training on 1.2 million Street View panoramas to generate spatially coherent urban environments. It generalises to cities it has never seen, without fine-tuning.

D.O.T.S AI Newsroom
AI News Desk
Every major video world model released in the past two years shares the same structural flaw: beyond the starting frame, they hallucinate. Streets that don't exist, buildings with impossible geometries, spatial layouts that contradict themselves from one generated frame to the next. The models produce visually convincing footage, but the environments they create are entirely fictional — and unstable.
Researchers from Naver and Naver Cloud have published a paper introducing a fundamentally different approach. Their Seoul World Model (SWM) is grounded in real physical geography — specifically, 1.2 million panoramic Street View images from Naver Map, South Korea's dominant mapping service. The result, according to the paper, is the first video world model tied to an actual physical location.
How It Works
The interface is geographic rather than textual. A user enters GPS coordinates, specifies a desired camera movement — panning, zooming, traversing a street — and adds a text prompt for atmosphere or time of day. The model queries the Street View database, retrieves the nearest matching panoramas, and uses those real images as geometric anchors for step-by-step video generation.
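The pipeline above can be sketched in code. This is a minimal illustration, not Naver's actual API: all names (`Query`, `nearest_panoramas`, `generate_video`) and the nearest-neighbour retrieval are assumptions made for clarity, and the "generation" step just records the conditioning each frame would receive.

```python
# Hypothetical sketch of SWM's query-and-anchor flow; names and structure
# are assumed for illustration, not taken from Naver's implementation.
from dataclasses import dataclass


@dataclass
class Query:
    lat: float
    lon: float
    camera_path: list       # e.g. ["forward", "pan_left"]
    prompt: str             # atmosphere / time-of-day text prompt


def nearest_panoramas(lat, lon, db, k=3):
    """Retrieve the k Street View panoramas closest to the query point
    (squared-degree distance stands in for proper geodesic distance)."""
    return sorted(db, key=lambda p: (p["lat"] - lat) ** 2 + (p["lon"] - lon) ** 2)[:k]


def generate_video(query, db):
    """Step through the camera path, conditioning each generated frame on
    the retrieved real panoramas (geometric anchors) plus the text prompt."""
    anchors = nearest_panoramas(query.lat, query.lon, db)
    frames = []
    for step in query.camera_path:
        # A real model would denoise a frame here; we record the conditioning.
        frames.append({
            "step": step,
            "anchors": [a["id"] for a in anchors],
            "prompt": query.prompt,
        })
    return frames
```

The key design point the sketch preserves is that every generation step sees the same real-image anchors, which is what keeps the street geometry from drifting frame to frame.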
The critical innovation is in how the model handles the tension between real reference images and dynamic video generation. Street View captures are static — they freeze cars, pedestrians, and ambient conditions at a single moment in time. A model naively trained on these would either reproduce those transient objects or struggle to generate plausible motion around them.
The Naver team solves this with what they call cross-temporal pairing: during training, reference images and target video sequences are deliberately drawn from different recording sessions. This teaches the model to distinguish between permanent structures — building facades, road geometry, infrastructure — and transient elements like parked vehicles or pedestrians. The model learns geometry from the environment, not from the snapshot moment.
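The pairing strategy can be made concrete with a small sketch. This is a hedged illustration of the idea as described in the paper, not Naver's training code: the `sessions` layout (multiple recording sessions per location) and the function name are assumptions.

```python
# Illustrative sketch of cross-temporal pairing: reference images and target
# clips for the same location are drawn from *different* recording sessions,
# so only elements stable across sessions (buildings, road geometry) are
# consistent between reference and target. Structure is assumed, not Naver's.
import random


def cross_temporal_pairs(sessions, rng=None):
    """sessions maps a location id to its recorded captures, one per session.
    Returns (target, reference) training pairs that never share a session."""
    rng = rng or random.Random(0)
    pairs = []
    for loc, recs in sessions.items():
        if len(recs) < 2:
            continue  # need at least two sessions of the same place
        for i, target in enumerate(recs):
            # Reference comes from any *other* session at this location.
            ref = rng.choice([r for j, r in enumerate(recs) if j != i])
            pairs.append({"location": loc, "target": target, "reference": ref})
    return pairs
```

Because parked cars and pedestrians differ between sessions while facades and road layout do not, a model trained on such pairs is pushed to copy only the permanent structure from the reference.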
Generalisation Without Fine-Tuning
The most commercially significant claim in the paper is generalisation. SWM was trained entirely on Seoul street data, but the researchers report that the model generates spatially coherent video for other cities — including cities it has never processed — without any city-specific fine-tuning. The geometric and structural patterns learned from Seoul's dense urban grid transfer to other urban environments.
If the generalisation claim holds under scrutiny, it suggests a path toward grounded video world models that don't require per-city training datasets — a meaningful reduction in the data acquisition bottleneck that has constrained geospatially aware AI development.
Why This Matters Beyond Mapping
Naver's framing is geographic, but the underlying problem — maintaining spatial coherence over generated sequences — has applications well beyond navigation or urban simulation. Robotics, autonomous vehicle simulation, augmented reality, and urban planning tools all share the same requirement: generated environments that respect physical reality. SWM represents an early proof point that grounding generative models in real-world spatial data, rather than purely synthetic training, produces meaningfully more reliable outputs.