
Large language models (LLMs) have improved so quickly that the benchmarks themselves have evolved, adding more complex problems in an effort to challenge the latest models. Yet LLMs haven’t improved across all domains, and one task remains far outside their grasp: They have no idea how to play video games.
While a few models have beaten a handful of games (for example, Gemini 2.5 Pro beat Pokémon Blue in May of 2025), these exceptions prove the rule. The eventually victorious AIs completed games far more slowly than a typical human player, made bizarre and often repetitive mistakes, and required custom software to guide their interactions with the game.
Julian Togelius, the director of New York University’s Game Innovation Lab and co-founder of AI game testing company Modl.ai, explored the implications of LLMs’ limitations in video games in a recent paper. He spoke with IEEE Spectrum about what this lack of video game skills can tell us about the broader state of AI in 2026.
LLMs have improved rapidly in coding, and your paper frames coding as a kind of well-behaved game. What do you mean by that?
Julian Togelius: Coding is extremely well-behaved in the sense that you have tasks. These are like levels. You get a specification, you write code, and then you run it.
The reward is immediate and granular. The code has to compile, it has to run without crashing, and then it usually has to pass tests. Often, there’s also an explanation of how and why it failed.
There’s a theory from game designer Raph Koster that games are fun because we learn to play them as we play them. From that perspective, writing code is an extremely well-designed game. And in fact, writing code is something many people enjoy doing.
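The graded feedback Togelius describes (compile, run without crashing, pass tests, get an explanation of the failure) can be sketched as a small scoring function. This is an illustrative sketch only, not anything from his paper: the function name `score_submission`, the entry point `solve`, and the reward values 0.0, 0.25, and 0.5 to 1.0 are all assumptions chosen for the example.

```python
def score_submission(source, tests, func_name="solve"):
    """Return (reward, feedback) for a candidate solution in three graded stages.

    `tests` is a non-empty list of (args_tuple, expected_result) pairs.
    The stage rewards are arbitrary illustrative values (an assumption).
    Note: exec() runs untrusted code directly; a real harness would sandbox it.
    """
    # Stage 1: the code has to compile.
    try:
        code = compile(source, "<submission>", "exec")
    except SyntaxError as err:
        return 0.0, f"compile error: {err.msg} (line {err.lineno})"

    # Stage 2: it has to load without crashing and define the entry point.
    namespace = {}
    try:
        exec(code, namespace)
    except Exception as err:
        return 0.25, f"crashed on load: {err!r}"
    func = namespace.get(func_name)
    if not callable(func):
        return 0.25, f"no callable named {func_name!r}"

    # Stage 3: it has to pass tests; each failure comes with an explanation.
    passed = 0
    failures = []
    for args, expected in tests:
        try:
            result = func(*args)
        except Exception as err:
            failures.append(f"{func_name}{args} raised {err!r}")
            continue
        if result == expected:
            passed += 1
        else:
            failures.append(f"{func_name}{args} returned {result!r}, expected {expected!r}")

    reward = 0.5 + 0.5 * passed / len(tests)
    feedback = "all tests passed" if not failures else "; ".join(failures)
    return reward, feedback


if __name__ == "__main__":
    tests = [((2,), 4), ((3,), 6)]
    print(score_submission("def solve(x): return x +", tests))
    print(score_submission("def solve(x): return x + 2", tests))
    print(score_submission("def solve(x): return x * 2", tests))
```

Every attempt, even a failed one, returns both a number and a reason. Most video games offer nothing this fine-grained: the signal is often just a win or a loss after many actions.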
LLMs do well at coding but struggle with video games. This feels surprising given their success in coding, as well as in games like chess and Go. What is it about video games that’s causing a problem?
Togelius: It’s not just LLMs that are bad at this. We do not have general game AI.
There’s a widespread perception that because we can build AI that plays particular games well, we should be able to build one that plays any game. I’m not sure we’re going to get there.
People will mention that Google’s AlphaZero [which is not an LLM] can play both Go and chess. However, it had to be retrained and re-engineered for each. And those are games that are similar in terms of input and output space. Most games are more different from each other. They have different mechanics and different input representations.
There’s also a data problem. Some of the games that AI can successfully play, like Minecraft and Pokémon, are among the most well-studied games in the world, with literally millions of hours of guides. For a less well-known game, there’s far less.
Video Game Benchmarks for LLM Performance
One factor that seems to help LLMs improve in coding is the proliferation of benchmarks. We have many benchmarks LLMs can try to solve, we can score the results, and then modify the LLM to improve performance. Developing a benchmark for playing a video game, though, is less clear-cut. Why is that?
Togelius: I’ve built many game-based AI benchmarks over the years. One, the General Video Game AI competition, ran for seven years. We tested an agent on our publicly available games, and every time we ran the competition we invented ten new games to test on.
One reason we stopped was that we stopped seeing progress. Agents got better at some games but worse at others. This was before LLMs.
Lately we’ve been updating this framework for LLMs. They fail. They absolutely suck. All of them. They don’t even do as well as a simple search algorithm.
Why? They were never trained on these games, and they’re separately very bad at spatial reasoning. Which shouldn’t be surprising, because that’s also not in the training data.
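For a sense of what a "simple search algorithm" can look like, here is a minimal sketch of breadth-first search over a game's forward model. It is one example of the general idea, not the baseline used in the General Video Game AI competition; the toy grid game, its symbols (`S` start, `G` goal, `#` wall), and the names `step`, `find`, and `bfs_plan` are all assumptions made for illustration.

```python
from collections import deque

# Actions available in the toy game and their (row, column) offsets.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(grid, pos, action):
    """Forward model: return the position after taking `action` from `pos`.

    Moves into walls or off the board leave the position unchanged.
    Assumes every row of `grid` has the same length.
    """
    dr, dc = MOVES[action]
    r, c = pos[0] + dr, pos[1] + dc
    if 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] != "#":
        return (r, c)
    return pos


def find(grid, symbol):
    """Locate the first cell containing `symbol`."""
    for r, row in enumerate(grid):
        c = row.find(symbol)
        if c != -1:
            return (r, c)
    raise ValueError(f"{symbol!r} not found in grid")


def bfs_plan(grid):
    """Return a shortest list of actions from S to G, or None if unreachable."""
    start, goal = find(grid, "S"), find(grid, "G")
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        pos, plan = frontier.popleft()
        if pos == goal:
            return plan
        for action in MOVES:
            nxt = step(grid, pos, action)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, plan + [action]))
    return None


if __name__ == "__main__":
    level = ["S.#",
             ".##",
             "..G"]
    print(bfs_plan(level))
```

The search knows nothing about the game beyond what `step` tells it, which is why this kind of agent transfers to unfamiliar games: it needs a forward model to simulate moves, not training data about the game.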
This brings us to what seems like a contradiction. LLMs are bad at playing games. Yet at the same time, they’re improving rapidly at coding, a skillset that can be used to create a game. How do these facts fit together?
Togelius: It’s super weird. You can go into Cursor or Claude, write one prompt, and get a playable game. The game will be very typical, because an LLM’s code writing abilities are better the more typical something is. So, if you ask it to give you something like Asteroids, it will work. That’s impressive.
However, it’s not going to give you a good or novel game. That does seem weird. The reason is that the LLM can’t play it. Game development is an iterative process. You write, you test, you adjust the game feel. An LLM can’t do that.
And to an extent, I don’t think it’s different when designing other software. Yes, you can ask an LLM to create a GUI with a bunch of buttons. But the LLM doesn’t know much about how to use it.
Companies like Nvidia and Google have talked about using simulations, including game-like environments, to improve AI performance. If AI can’t master games in general, how optimistic should we be about that approach?
Togelius: Games are both easier and harder than the real world. They’re easier because there are fewer levels of abstraction. They’re harder because games are much more diverse. The real world has the same physics everywhere.
One example is Waymo, which uses world models in its training loop. That makes sense because driving is much the same everywhere. It’s way less diverse than games.
That’s confusing for people. People see an LLM write an academic essay on quantum physics and wonder, “how can it not play both Halo and Space Invaders?” However, those games are more different from each other, in a sense, than two academic essays.
Matthew S. Smith is a freelance consumer technology journalist with 17 years of experience and the former Lead Reviews Editor at Digital Trends. An IEEE Spectrum Contributing Editor, he covers consumer tech with a focus on display innovations, artificial intelligence, and augmented reality. A vintage computing enthusiast, Matthew covers retro computers and computer games on his YouTube channel, Computer Gaming Yesterday.

Facts Only

Julian Togelius is the director of New York University’s Game Innovation Lab and co-founder of Modl.ai.
LLMs have shown rapid improvement in coding but struggle with video games.
Gemini 2.5 Pro beat *Pokémon Blue* in May 2025, but required custom software and performed slowly.
Togelius argues that coding is a "well-behaved game" with clear tasks, immediate feedback, and structured rewards.
Video games lack uniformity, with diverse mechanics, input systems, and spatial reasoning demands.
AlphaZero, a non-LLM AI, was retrained separately for Go and chess, which have similar input/output structures.
The General Video Game AI competition ran for seven years but was discontinued due to stagnant progress.
LLMs perform worse than simple search algorithms on unfamiliar games due to lack of training data and poor spatial reasoning.
LLMs can generate basic game code but cannot iteratively test or refine gameplay.
Companies like Nvidia and Google use simulations for AI training, but games are more diverse than real-world tasks like driving.
Togelius notes that two video games can be more different from each other than two academic essays.
Matthew S. Smith is a freelance technology journalist and IEEE Spectrum Contributing Editor.

Executive Summary

Large language models (LLMs) have made rapid progress in many domains, particularly in coding, where they excel due to the structured, rule-based nature of programming tasks. However, they struggle significantly with video games, despite successes in games like chess and Go. Julian Togelius, director of NYU’s Game Innovation Lab, highlights that LLMs lack general game-playing ability, often performing poorly even on simple games unless heavily guided by custom software. The challenge stems from the diversity of video games—each has unique mechanics, input systems, and spatial reasoning requirements—unlike coding, which follows predictable patterns with immediate feedback. Benchmarks for LLM performance in gaming are difficult to develop because progress is inconsistent; agents may improve on some games while regressing on others. While LLMs can generate playable game code, they cannot iteratively test and refine gameplay, limiting their ability to create novel or high-quality games. Togelius suggests that simulations and game-like environments may still be useful for AI training, but their effectiveness depends on the domain—driving, for example, is more uniform than the varied landscapes of video games.

Full Take

The narrative presents a compelling case for the limitations of LLMs in video games, contrasting their coding prowess with their gaming ineptitude. The strongest version of this argument—steelman—acknowledges that LLMs excel in structured, rule-based domains like coding but falter in the open-ended, diverse world of video games. This isn’t just an LLM problem; even specialized game AI struggles with generalization. The pattern scan reveals no overt manipulation, but there’s an implicit tension: the article frames gaming as a "harder" problem than coding, which may understate the role of training data and benchmark design. The root cause lies in the assumption that intelligence is domain-agnostic—yet LLMs thrive where patterns are consistent and fail where creativity and adaptability matter.
The implications for human agency are significant. If LLMs can’t master games, what does that say about their ability to navigate real-world complexity? The cost is borne by those expecting AI to replace human creativity, while the benefit accrues to researchers who recognize these limits. Second-order consequences include a potential shift in AI development toward more modular, task-specific systems rather than monolithic models.
Bridge questions: How might we design better benchmarks to measure adaptive intelligence? Could hybrid systems—combining LLMs with specialized agents—bridge this gap? What if the real limitation isn’t the models but our understanding of intelligence itself?
Counterstrike scan: A bad actor might use this narrative to dismiss AI progress entirely or to argue for over-specialization. However, the article’s balanced tone and focus on technical constraints don’t align with such a playbook. It’s a healthy critique, not a coordinated attack.
Patterns detected: none

Sentinel — Human

Confidence

The article shows strong human authorship signals, including expert interviews, idiosyncratic phrasing, and contextual depth, with no detectable AI-generated patterns.

Signals Detected
low severity: Varied sentence length and natural digressions (e.g., personal anecdotes, tangential explanations).
low severity: Presence of idiosyncratic emphasis (e.g., 'super weird,' 'absolutely suck') and personal voice in quotes.
low severity: No evidence of template-matching or verbatim talking points across sources.
low severity: Specific attributions (e.g., Julian Togelius, IEEE Spectrum) and verifiable context (e.g., Modl.ai, NYU).
Human Indicators
Expert interview with direct quotes containing informal phrasing ('super weird').
Author's byline includes specific credentials and niche interests (vintage computing).
Narrative flow includes natural asides and rhetorical questions.