It's a tradition for hackers to run DOOM in crazy places: thermostats, "smart" toasters, even ATMs. Now DOOM runs purely inside a diffusion model; every pixel is generated. A while ago, I called Sora "a data-driven physics engine". Well, not quite, because Sora cannot be interacted with: you set the initial condition (a text prompt or an initial frame) and can only watch the simulation passively.
GameNGen is a proper neural world model. It takes past frames (states) and a user action (keyboard/mouse) as input, and outputs the next frame. The quality is by far the most impressive I've seen on DOOM.
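To make that loop concrete, here's a minimal sketch of the interface in Python. Everything in it is my own illustration: `predict_next_frame`, the context length, and the resolution are placeholder assumptions, not GameNGen's actual code.

```python
import numpy as np
from collections import deque

CONTEXT_LEN = 4    # how many past frames the model conditions on (my assumption)
H, W = 240, 320    # DOOM-ish resolution, purely for illustration

def predict_next_frame(past_frames: np.ndarray, action: int) -> np.ndarray:
    """Placeholder for the trained diffusion model: a real implementation
    would run diffusion sampling conditioned on past frames + the action."""
    return past_frames[-1]

frames = deque([np.zeros((H, W, 3), dtype=np.uint8)] * CONTEXT_LEN,
               maxlen=CONTEXT_LEN)

def step(action: int) -> np.ndarray:
    """One tick of the neural 'game engine': action in, next frame out."""
    next_frame = predict_next_frame(np.stack(frames), action)
    frames.append(next_frame)  # model output becomes part of the next context
    return next_frame

# Interactive loop -- unlike Sora, the user keeps injecting actions.
for action in [0, 2, 2, 1]:    # e.g. key presses mapped to action ids
    frame = step(action)
```

The key design point is the feedback loop: each generated frame is fed back in as context, which is exactly what makes it an interactive simulator rather than a passive video generator.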
However, this comes with significant caveats. Let's dive in:
1. GameNGen overfits to the extreme on a single game by training on 0.9B frames (!!). This is a HUGE number, almost 40% of the dataset used to train Stable Diffusion v1. At this point, it's likely memorizing what DOOM looks like from every corner of the game, in every scenario. DOOM doesn't have that much content anyway.
2. GameNGen is more like a glorified NeRF than a video generation model. A NeRF takes images of a scene from different viewing angles and reconstructs a 3D representation of that scene. The vanilla formulation has no generalization capability, i.e. it cannot "imagine" new scenes. GameNGen is not like Sora: by design, it cannot synthesize new games or new interaction mechanics.
3. The hard part of this paper is not the diffusion model, but the dataset. The authors first trained RL agents to play the game at various skill levels, and collected 0.9B (frame, action) pairs for training (see the sketch after this list). Most video datasets online do NOT come with actions, which means this recipe doesn't extrapolate to them. Data is always the bottleneck for action-driven world models.
4. There are two practical use cases for game world models in my mind: (1) write a prompt to create playable worlds that would otherwise take game studios years to make; (2) use the world model to train better embodied AI. GameNGen realizes neither. Use case (1) fails for the reason in point 2: the model cannot generalize beyond the single game it memorized. Use case (2) doesn't work because there's no advantage to training agents in GameNGen over training them in the DOOM simulator itself. It'd be more interesting if a neural world model could simulate scenes that traditional hand-crafted graphics engines cannot.
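For intuition on point 3, here's a toy sketch of that data collection step: an agent plays the game while we log (frame, action) pairs. The environment and policy below are dummy stand-ins (the paper uses real RL agents inside actual DOOM), so read it as the shape of the pipeline, not the authors' code.

```python
import numpy as np

# Dummy stand-ins: GameNGen trains real RL agents inside actual DOOM.
# Everything here is illustrative only -- the point is the logged pairs.
class DummyDoomEnv:
    def reset(self):
        return np.zeros((240, 320, 3), dtype=np.uint8)  # initial frame

    def step(self, action):
        frame = np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8)
        done = np.random.rand() < 0.01                   # occasional episode end
        return frame, done

def policy(frame):
    # A trained RL agent would pick an action here; we pick at random.
    return int(np.random.randint(0, 8))

def collect(env, policy, n_pairs):
    """Log (frame, action) pairs -- the actions are what ordinary video lacks."""
    pairs, frame = [], env.reset()
    while len(pairs) < n_pairs:
        action = policy(frame)
        pairs.append((frame, action))   # this is the world model's training data
        frame, done = env.step(action)
        if done:
            frame = env.reset()
    return pairs

dataset = collect(DummyDoomEnv(), policy, n_pairs=1_000)
```

This is why scraped internet video doesn't help here: the frames are easy to get, but the aligned action labels only exist if you control the data collection.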
What's an example of a truly useful neural world model? Elon said in a reply that "Tesla can do something similar with real world video". Not surprising: the Autopilot team likely has trillions of (camera feed, steering wheel action) pairs. Again, data is the hard part! With such rich real-world data, it's entirely possible to learn a general driving sim that covers all kinds of edge cases, and to use it to verify a new FSD build without physical cars.
GameNGen is still a really great proof of concept. At least we now know that 0.9B frames is an upper bound on what it takes to compress high-res DOOM into a neural network.
Paper: Diffusion Models Are Real-Time Game Engines
https://lnkd.in/g9aW_uUK