Perspective

Two kinds of world models

Spatial intelligence and continual learning share a name and a diagram. They are not the same bet — and knowing which one you are buying changes everything.

Zak Data Solutions · June 3, 2026

“World model” has become one of the most important phrases in AI — and one of the most overloaded. Fei-Fei Li and the World Labs team argue that spatial intelligence is AI's next frontier, and that world models are the path to it. They are right about the frontier. But the term now stretches across at least two very different research programs that happen to share a name, a lineage, and a diagram. If you are evaluating AI for your organization, telling them apart is not academic. It decides what you should build, and what you should buy.

Start with the diagram, because both kinds descend from it. Decades ago, reinforcement-learning textbooks drew a simple loop: an agent takes an action; the action changes the state of the world; the agent never sees that state directly, only partial observations of it; new observations inform the next action; the loop continues. The idea is older still — in 1943, Kenneth Craik proposed that minds reason by running “small-scale models” of reality. A world model, in the original sense, is whatever an agent uses to anticipate how its world will respond. Everything now sold under the name is a projection of that one loop. The disagreement is about what the world is made of.
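The loop described above can be sketched in a few lines of Python. This is a toy built on loud assumptions. `Environment`, `Agent`, and `run_loop` are hypothetical names. The hidden state is a single integer. The agent's “world model” is nothing more than a rule for anticipating how the level will respond to a push. What the sketch preserves is the shape of the loop: the agent only ever sees `observe()`, never `level`.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    """Toy world with hidden state (hypothetical example)."""
    level: int = 7  # the true state; the agent never reads this directly

    def observe(self) -> str:
        # Partial observation: only "high" or "low", not the level itself.
        return "high" if self.level > 5 else "low"

    def step(self, action: int) -> None:
        # The agent's action changes the state of the world.
        self.level += action


@dataclass
class Agent:
    """Acts on observations and keeps a record of what it has seen."""
    history: list[str] = field(default_factory=list)

    def act(self, observation: str) -> int:
        self.history.append(observation)
        # A minimal "world model": anticipate that pushing down lowers a high
        # level and pushing up raises a low one, and act on that expectation.
        return -1 if observation == "high" else 1


def run_loop(env: Environment, agent: Agent, steps: int) -> list[str]:
    """Observe, act, let the world change, observe again."""
    trace = []
    for _ in range(steps):
        obs = env.observe()      # the agent sees a partial observation
        action = agent.act(obs)  # the observation informs the next action
        env.step(action)         # the action changes the hidden state
        trace.append(obs)
    return trace


env, agent = Environment(), Agent()
print(run_loop(env, agent, steps=4))  # ['high', 'high', 'low', 'high']
```

The agent never learns that the level hovers around 5. It only experiences the consequences of its own pushes. That gap between state and observation is the one the original diagram was drawn to capture.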

The spatial world model

World Labs splits the loop into three jobs. A renderer outputs observations — pixels for human eyes — and is judged on visual fidelity; a text-to-video model is a renderer. A simulator outputs state — geometry and physics faithful enough that both people and programs can compute on it; a physics engine or a factory digital twin is a simulator. A planner outputs actions: given an observation and a goal, it decides what the agent should do next; a robot controller is a planner. All three are the same loop, each viewed from a different end.
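To make the three jobs concrete, here is a deliberately tiny sketch in which all three share one hypothetical `State`. The names `State`, `simulate`, `render`, and `plan` are illustrative assumptions, not World Labs' API. The sketch captures only the distinction by output type: the simulator returns state, the renderer returns an observation, and the planner returns an action.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    """Hypothetical world state: one ball on a 1-D track."""
    position: float
    velocity: float


def simulate(state: State, dt: float) -> State:
    # Simulator: state in, state out, so programs can keep computing on it.
    return State(state.position + state.velocity * dt, state.velocity)


def render(state: State, width: int = 10) -> str:
    # Renderer: state in, observation out (an ASCII "frame" for human eyes).
    cell = max(0, min(width - 1, round(state.position)))
    return "".join("o" if i == cell else "." for i in range(width))


def plan(observation: str, goal: int) -> int:
    # Planner: observation and goal in, action out (-1 left, 0 stay, +1 right).
    current = observation.index("o")
    return (goal > current) - (goal < current)


frame = render(simulate(State(position=2.0, velocity=1.0), dt=1.0))
print(frame)                # ...o......
print(plan(frame, goal=7))  # 1
```

Only `simulate` returns something another program can keep stepping forward. The frame is for looking at, and the action is for doing.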

The most consequential of the three, and the least discussed, is the simulator. As Li puts it: if language is an abstraction of the world and pixels are a projection of it, then geometry, physics, and dynamics are the world itself.

Where language models learn the statistical structure of text, world models learn the statistical structure of space and time.
— Fei-Fei Li & the World Labs team

The hard part is that this kind of world is expensive. Faithful 3D data — with real material and physical properties — is orders of magnitude scarcer than the internet video a renderer trains on, and the gap between simulation and reality never fully closes. The prize is correspondingly enormous: robotics, autonomous vehicles, factory and supply-chain twins, drug discovery. NVIDIA alone sizes the addressable market past a trillion dollars. This is the spatial world model — a machine that learns the statistical structure of space and time.

The other world model

Now hold the diagram fixed and change one thing: what the world is made of. For most organizations, the world that matters is not a 3D scene. It is an operating environment — the data warehouse and the half-finished pipelines, the systems of record, the decisions made every day, and the institutional knowledge that lives in a handful of people's heads and walks out the door when they leave. There is a world model for that, too, and it runs the same loop. An agent observes the environment and acts in it, the action changes the environment's state, the next observation reflects the change — and, crucially, the agent keeps a persistent model of how that environment behaves.

That persistent model is a continually learning knowledge base. It renders no pixels and simulates no physics. It does the one thing a renderer and a simulator do not: it accumulates. Every interaction either reinforces what the system already knows, surfaces a gap, or proposes an update. The model of your environment on Friday is sharper than the one on Monday, and sharper still next quarter. Call it an agentic, or operational, world model — a machine that learns the statistical structure not of space and time, but of how your organization actually works.
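One way to read “it accumulates” is as a small update rule with exactly the three outcomes named above. The sketch below assumes the knowledge base is a key-value store of facts. `KnowledgeBase`, `Fact`, `observe`, and `accept` are hypothetical names. It also assumes conflicting observations are queued for human review rather than applied automatically. Neither assumption describes any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    value: str
    confirmations: int = 1  # how many interactions have agreed with this fact


@dataclass
class KnowledgeBase:
    """Hypothetical persistent model of an operating environment."""
    facts: dict[str, Fact] = field(default_factory=dict)
    gaps: set[str] = field(default_factory=set)
    proposals: list[tuple[str, str, str]] = field(default_factory=list)  # (key, old, new)

    def observe(self, key: str, value: str) -> str:
        """Fold one interaction into the model and report which outcome occurred."""
        known = self.facts.get(key)
        if known is None:
            # Nothing on file: surface the gap and queue the observation for review.
            self.gaps.add(key)
            self.proposals.append((key, "", value))
            return "gap"
        if known.value == value:
            # Agreement: reinforce what the system already knows.
            known.confirmations += 1
            return "reinforce"
        # Disagreement: propose an update rather than silently overwriting.
        self.proposals.append((key, known.value, value))
        return "propose"

    def accept(self, key: str, value: str) -> None:
        """A reviewer approves a value; apply it and clear related gaps and proposals."""
        self.facts[key] = Fact(value)
        self.gaps.discard(key)
        self.proposals = [p for p in self.proposals if p[0] != key]


kb = KnowledgeBase()
print(kb.observe("acme.renewal", "standard"))      # gap
kb.accept("acme.renewal", "non-standard")
print(kb.observe("acme.renewal", "non-standard"))  # reinforce
print(kb.observe("acme.renewal", "standard"))      # propose
```

Routing conflicts through `proposals` instead of overwriting `facts` is a design choice of this sketch. It keeps every change reviewable, at the cost of needing someone to call `accept`.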

Same diagram, different substrate

The two programs are easy to conflate because they are genuinely the same shape. Both close the perception-action loop. Both build an internal model an agent consults to decide what to do next. Both rest, at bottom, on the premise the field has held since the late 1980s: that a sufficiently rich model of the world is what an agent needs to act well in it. The difference is the substrate. One masters geometry, physics, and dynamics; the other masters operations, context, and accumulated judgment. A simulator that predicts how a poured liquid settles is solving a different problem than a knowledge base that can tell you why this customer's contract carries a non-standard renewal clause, and which three other accounts share it.

What unites the two is also what separates both of them from the thing most people picture when they hear “AI” today. A stateless model answering one prompt at a time is running half a loop: an observation in, an answer out, and then it forgets. It holds no persistent model of any world, spatial or operational. Closing the loop and keeping the state is the architectural move in both programs. It is not a bigger language model. It is a different design.
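The contrast between half a loop and a closed one fits in a few lines. In this sketch, `Model` stands in for any prompt-to-answer function. `stateless`, `ClosedLoopAgent`, `answer`, and `feedback` are hypothetical names. Both paths call the same model. The only architectural difference is that the agent keeps state between calls and lets outcomes change its next answer.

```python
from typing import Callable, Dict

Model = Callable[[str], str]  # stand-in for any prompt -> answer function


def stateless(model: Model, prompt: str) -> str:
    # Half a loop: an observation in, an answer out, and nothing is retained.
    return model(prompt)


class ClosedLoopAgent:
    """Wraps the same model but keeps persistent state between calls."""

    def __init__(self, model: Model) -> None:
        self.model = model
        self.corrections: Dict[str, str] = {}  # survives across interactions

    def answer(self, prompt: str) -> str:
        # Consult what the environment has taught the agent before the base model.
        if prompt in self.corrections:
            return self.corrections[prompt]
        return self.model(prompt)

    def feedback(self, prompt: str, correct: str) -> None:
        # Closing the loop: a wrong answer changes what the agent does next time.
        if self.answer(prompt) != correct:
            self.corrections[prompt] = correct


def base(prompt: str) -> str:
    return "standard renewal"  # a model that only knows the regular case


agent = ClosedLoopAgent(base)
agent.feedback("acme", "non-standard renewal")
print(stateless(base, "acme"))  # standard renewal
print(agent.answer("acme"))     # non-standard renewal
```

Making `base` bigger would not change the first answer, because the fix lives in the state the agent keeps, not in the model it wraps.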

Continuing a pattern, or deciding what to do

There is a deeper reason the stateless model falls short, and it predates this moment by decades. Richard Sutton — who co-wrote the field’s standard textbook on reinforcement learning — and Banafsheh Rafiee put it plainly in 2026: perception is not passive prediction; it is skillful action. A model trained to imitate patterns can extend them convincingly. But extending a pattern is not the same as knowing what to do when the pattern breaks.

A generative video model can continue a pattern, whereas an enactive system can determine what to do next when the pattern breaks.
— Banafsheh Rafiee & Richard S. Sutton, Toward Enactive AI (2026)

That sentence is the whole difference between a system that talks about your business and one that runs inside it. The exceptions are where the value lives — the renewal clause that doesn’t fit the template, the pipeline that fails in a way the runbook never anticipated. A model that has only seen the regular case continues the pattern off a cliff. An agent that has acted in your environment, kept the state, and learned from being wrong is the one positioned to notice the break and decide what to do about it. The same logic explains why a frozen model is never enough: the world is far larger than any snapshot of it, so the agent has to keep interacting and keep learning. That is the bet we build on.

Which world are you buying?

So when a vendor says they are building world models, the useful question is: which world? If you need a robot to fold towels, a vehicle to drive itself, or a faithful twin of a production line, you need the spatial kind — and you should ask about geometry, physics fidelity, and the sim-to-real gap. If you need AI that gets measurably better at your business every week — that knows your data, remembers your exceptions, and captures the judgment your best people carry — you need the operational kind, and you should ask about state: what the system retains between sessions, how the knowledge base is curated, whether it lives on your infrastructure under your governance, and whether you can audit why it believes what it believes.

The two look almost identical on a slide. They could not be more different in what it takes to build one well.

Where we stand

We build the second kind. Zak Data Solutions stands up continually learning agents that model how your organization actually operates — observing your environment, doing real work in it, and compounding what they learn into a knowledge base you own and can audit. We are not in the business of rendering scenes or simulating physics; the spatial frontier is someone else's excellent problem. Ours is the world your business already runs on — and the discipline of making an agent that gets sharper about it every week.

The operational kind, in practice.

If the second kind is the one your business needs, the next two pages are the concrete version of this argument — the architecture that keeps the state, and the reason it compounds.
