LLMs are world models
They just model a different world than you think.
I’m a physicist by training, and I think the way “world models” are contrasted with large language models in AI discussions rests on a confused assumption about what counts as “the real world.” LLMs model the world of written language, which is as valid and useful a world as the physical world on a human scale.
What do people mean by world models?
In AI, a world model is a model that predicts the next state or observation, given some action. The state is usually taken to be the state of the real, physical world, but it could, for example, also be the state of a world in a video game, or of a market.
A world model, thus, could be a model that predicts how physical objects behave, which would help a robot navigate and make decisions. World models could also be part of generating videos with realistic liquids, or of predicting the risk of avalanches.
Large language models, on the other hand, predict the next token in a sequence. They build streams of words.
Some take their lack of a world model as proof that they don’t have any “real” knowledge – they merely predict and generate words, without any sense of a world that could give those words meaning.
Many valid worlds
The fundamental problem with this separation of LLMs and world models is, as I see it, that our immediate physical surroundings are assumed to be the real world.
I have no trouble at all accepting the reality of a physical world, but I strongly object to the notion that the world we can feel and touch is the real world.
An example might make my point more clear.
Imagine an army of humanoid AI robots, collecting data about our real world by walking around, observing, pushing things, lifting things, and whatnot. The data collected by these robots would eventually be an excellent base for building a world model. But that world model would basically only cover Newtonian physics (which, for now, we assume also covers things like elasticity and fluid mechanics). It would be an effective theory — extremely useful within a certain regime — but still only a slice of reality.
The world model would not cover things like chemical processes, quantum mechanics, the theory of relativity, or the expansion of the cosmos. It would only cover a slice of all the things we over millennia deduced about our physical world. A really useful slice, if you care about navigating robots, but not the real world.
The world of written thoughts
A world model created with data from the humanoid robots above would also lack many other things. It would be useless for describing macroeconomics, composing music, and giving advice on how to break up with a boyfriend. The realms of these “worlds” are not physical in the same way quantum phenomena and the expansion of the cosmos are, but in our everyday life they are way more present – and thus way more relevant.1
The realm of written thoughts, too, is a world. Like the world of music or macroeconomics, it is not physical. But it has structures that can be modeled, and it is a highly complex world – much more complex than the physical world on our mesoscopic/human scale (which can be described by a pretty limited number of equations).
It is also highly relevant for us. Not as important as food and shelter, but it contains our human history, a large part of how our societies work, knowledge about human relationships, and a fair portion about our understanding of the physical world, too.
A world model for the world of written thoughts is pretty useless when trying to steer a humanoid robot, but it is good for other things. I would argue that it is as much a world model as a world model describing the world of a video game. I would also argue that how useful a world model is for us is much more interesting than how close it is to the physical world.
Predicting the next state
But shouldn’t world models predict the next state of the world, given some action? And don’t LLMs just spit out tokens?
I think this distinction is false, too.
If we accept that LLMs model the world of written thoughts, a string of words represents a mental state. Just as the position and velocity of objects define a state in a physical world model, the accumulated sequence of tokens defines a state in the linguistic world — a state that carries meaning, context, and implication.
The autoregressive language models we are used to, which build messages by adding tokens one at a time, are predicting the next state, one step at a time. In an LLM, a prompt is an intervention in the linguistic world, and the model predicts how that world evolves token by token.
We often just look at the finished output from an LLM – the end state – but the LLM is actually used to evolve the written mental state one step at a time. Given the text in this post so far, an LLM could predict the following evolution:
LLM
LLMs
LLMs pre
LLMs predict
LLMs predict the
LLMs predict the next
LLMs predict the next state.
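This state-transition view can be sketched in a few lines of code. The snippet below is a toy illustration, not a real LLM: `predict_next` is a hypothetical stand-in that deterministically returns the next token, where a real model would return a probability distribution over its vocabulary.

```python
# Toy sketch: autoregressive generation as repeated state transitions
# in a "linguistic world". The state is the token sequence so far.

def predict_next(state):
    """Pretend-LLM (hypothetical): maps the current state to the next
    token. A real model would sample from a distribution instead."""
    continuation = ["LLM", "s", " pre", "dict", " the", " next", " state", "."]
    if len(state) < len(continuation):
        return continuation[len(state)]
    return None  # end of sequence

def evolve(state):
    # Each step is one state transition: state_t -> state_{t+1}
    while (token := predict_next(state)) is not None:
        state = state + [token]
    return state

final_state = evolve([])
print("".join(final_state))  # -> LLMs predict the next state.
```

The point of the sketch is that the loop never manipulates a “finished message”; it only ever maps one state to the next, exactly as a physical world model steps a system forward in time.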
A notable difference
Given all this, there is still at least one important difference between LLMs and models of our physical surroundings. It is possible to touch, poke, and observe our physical surroundings in a way that makes reinforcement learning much more open-ended than for LLMs. A robot-relevant world model can draw on anything physical we can collect data about, while the linguistic universe that forms the training data for an LLM is much more static. This changes how the model training is done, but it does not change what should count as a world model.
What’s the point?
The core of this post is that the relevant question isn’t whether LLMs have a world model (or are world models), but what world they model. All models, including those intended to power humanoid robots, model incomplete slices of the full world. Some slices are closer to the physical ground truth – whatever that might be – and some model more emergent phenomena. None of them should be considered uniquely real.
In our physical world, it matters very little whether we make a (false) distinction between LLMs and world models. But in the world of thoughts, it does. How we describe things shapes how we think about them, what we expect from them, and how we utilise them.
I think LLMs should be considered world models. If this post sparks any thoughts in your mind, I’d love to read about them in a comment.
1. One could argue that these phenomena could in theory be deduced using quantum chromodynamics, and thus are physical. But describing macroeconomics in terms of quantum fields is, in any practical terms, impossible.


