Naver's 'Seoul World Model' Grounds AI in Reality With 1 Million Street View Images
South Korean technology giant Naver has built a video world model that uses actual Street View data to prevent AI from generating plausible-looking but physically impossible urban environments. The system generalizes across cities it has never been trained on — a significant step toward AI that can reason accurately about the physical world rather than confabulate it.

D.O.T.S AI Newsroom
AI News Desk
One of the most persistent problems in AI-generated video and spatial reasoning is what researchers bluntly call "hallucinating cities" — the tendency of generative models to produce urban environments that look convincing at a glance but violate basic rules of physics, geometry, and real-world spatial layout. Buildings float. Streets terminate without logic. Shadow angles contradict the position of the sun. The AI has learned the aesthetic of a city without learning the reality of one.
Naver's new Seoul World Model takes a different approach. Rather than learning to generate cities from image datasets alone, it grounds its spatial representations in over one million Street View images of Seoul — images that carry real-world geometric constraints, GPS coordinates, and temporal consistency that pure image generation datasets do not provide.
Why Street View Data Is Different
Street View imagery is inherently constrained in ways that make it uniquely valuable for world model training. Each image is timestamped and GPS-tagged. The sequence of images taken along a route creates an implicit 3D model of space — objects that appear in one frame must appear at the correct position and scale in the next. Shadow angles are consistent with the recorded time of day. Building facades, road markings, and signage are real, not generated approximations.
By training on this data, the Seoul World Model learns not just what cities look like but what makes them physically consistent — a form of grounded spatial reasoning that pure diffusion models, trained on unordered image corpora, do not naturally acquire.
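The geometric constraint described above, that an object seen in one frame must reappear at the correct position and scale in the next, can be illustrated with a toy consistency check. The frame schema, field names, and functions below are illustrative assumptions for this sketch, not Naver's actual data format or pipeline; they show only how GPS-tagged, timestamped imagery pins down expectations a generative model would otherwise be free to violate.

```python
from dataclasses import dataclass
import math

# Hypothetical minimal record for a Street View frame; the fields are
# illustrative, not Naver's schema.
@dataclass
class Frame:
    lat: float        # GPS latitude (degrees)
    lon: float        # GPS longitude (degrees)
    timestamp: float  # capture time (Unix seconds)

def ground_distance_m(a: Frame, b: Frame) -> float:
    """Approximate ground distance between two frames using an
    equirectangular projection, which is accurate at street scale."""
    r = 6_371_000.0  # Earth radius in metres
    dlat = math.radians(b.lat - a.lat)
    dlon = math.radians(b.lon - a.lon) * math.cos(math.radians((a.lat + b.lat) / 2))
    return r * math.hypot(dlat, dlon)

def expected_scale_ratio(a: Frame, b: Frame, object_dist_a_m: float) -> float:
    """If an object stands `object_dist_a_m` metres ahead of frame `a` and
    the camera drives straight toward it, a pinhole camera model says its
    apparent size in frame `b` must grow by the ratio d_a / d_b."""
    d_b = object_dist_a_m - ground_distance_m(a, b)
    if d_b <= 0:
        raise ValueError("camera has passed the object")
    return object_dist_a_m / d_b

# Two frames roughly 10 m apart, approaching a building 50 m away:
a = Frame(lat=37.56650, lon=126.97800, timestamp=0.0)
b = Frame(lat=37.56659, lon=126.97800, timestamp=1.0)
```

A generated frame whose building grows by, say, a factor of two over that 10-metre step is geometrically inconsistent with the recorded GPS track, which is the kind of error that training on ordered, geotagged sequences penalizes and unordered image corpora cannot.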
Generalization Without Fine-Tuning
The more significant finding is the model's generalization capability. Despite being trained exclusively on Seoul Street View data, the system generalizes to other cities — including cities outside South Korea — without requiring additional fine-tuning on local imagery.
This suggests the model has learned something more fundamental than "what Seoul looks like." It has learned principles of urban spatial structure — how streets relate to intersections, how building setbacks follow road typologies, how pedestrian and vehicle infrastructure interact — that transfer across contexts.
The Naver team notes that this generalization is not unlimited. The model performs best in cities with similar urban density and street grid logic to Seoul. Dense European cities and high-density Asian cities transfer well; sprawling low-density suburban environments and rural settings are more challenging.
Applications in Autonomous Systems and Urban Planning
The immediate application target is autonomous navigation. Systems that can simulate urban environments with physical fidelity are essential for training and testing self-driving systems without having to encounter every edge case in the real world. A world model that accurately represents the geometry of intersections, construction zones, and pedestrian behavior is more useful than one that merely looks photorealistic.
Urban planning and digital twin applications are also within scope. City governments are increasingly using AI-generated spatial models to simulate the effects of proposed infrastructure changes. A model grounded in real-world Street View data provides a more reliable substrate for those simulations than one trained on abstract urban imagery.
The Broader World Model Race
Naver's Seoul World Model enters a competitive field. Google DeepMind's Genie 2, Wayve's GAIA, and a range of research systems from academia are all pursuing world models with varying degrees of physical grounding. What distinguishes Naver's approach is the explicit use of structured geographic data — not just images — as the training foundation.
Whether this approach proves superior at scale remains to be demonstrated. But the principle it embodies — that AI systems reasoning about physical space should be trained on physically grounded data — is a commonsense corrective to the field's dominant generative approach.