What if the screen you’re staring at right now wasn’t being rendered by an operating system, but was being fabricated frame-by-frame by a video model? Could you tell the difference?
In April 2026, a research team from Meta and KAUST published a paper called Neural Computers, proposing an idea that sounds audacious: fold computation, memory, and I/O entirely into the internal state of a video generation model, making the model itself a “computer.” They built two prototypes on top of Wan2.1, a video diffusion model. One simulates a terminal command line, the other a desktop GUI. The input is a screen frame plus user actions (keyboard or mouse); the output is the next screen frame. In other words, the model is performing a computer.
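The frame-plus-action loop described above can be sketched in a few lines. Everything here is a stand-in: the real system is a Wan2.1-based video model, whereas `dummy_model` below is a hypothetical stub used only to show the autoregressive shape of the interaction.

```python
from typing import Callable

Frame = list[list[int]]   # grayscale pixel grid, standing in for a screen frame
Action = str              # a keystroke or mouse event

def run(model: Callable[[Frame, Action], Frame],
        frame: Frame, actions: list[Action]) -> Frame:
    """Autoregressive loop: each step feeds the current frame plus the
    user's action to the model, which emits the next screen frame."""
    for action in actions:
        frame = model(frame, action)  # the model *is* the computer here
    return frame

# Hypothetical stub model: just counts actions in the top-left pixel.
def dummy_model(frame: Frame, action: Action) -> Frame:
    out = [row[:] for row in frame]
    out[0][0] += 1
    return out

final = run(dummy_model, [[0, 0], [0, 0]], ["l", "s"])
print(final[0][0])  # 2
```

The point of the sketch is that there is no OS, no shell, no renderer in the loop: the only component is a next-frame predictor.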
The bottom line first: this direction has zero practical impact on most tech practitioners today. The Neural Computer prototype can’t even do two-digit addition correctly. But it points toward a technical path fundamentally different from AI Agents. If you care about what “software” will become in the next decade, this paper provides a serious entry point.
The terminal prototype (NCCLIGen) was trained on roughly 1,100 hours of terminal recordings. Its visual rendering quality is surprisingly high: terminal frame PSNR reaches 40.8 dB and SSIM reaches 0.989 (for reference, PSNR above 40 dB typically means the difference is imperceptible to the human eye). Cursor blinking, window scrolling, text wrapping, and full-screen TUI progress bars all render correctly. The GUI prototype (NCGUIWorld) achieves 98.7% cursor position accuracy. For simple commands like pwd, date, and echo, the model produces plausible-looking output (MarkTechPost).
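To make the 40 dB threshold concrete, here is a minimal sketch of the PSNR formula itself (10 · log10(MAX² / MSE)), not the paper's evaluation code. The pixel values are made up for illustration:

```python
import math

def psnr(ref: list[float], test: list[float], max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two equal-length pixel sequences."""
    mse = sum((r - t) ** 2 for r, t in zip(ref, test)) / len(ref)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * math.log10(max_val ** 2 / mse)

ref  = [100, 150, 200, 250]
near = [101, 149, 201, 249]   # every pixel off by 1  -> MSE = 1
far  = [120, 130, 220, 230]   # every pixel off by 20 -> MSE = 400
print(round(psnr(ref, near), 1))  # 48.1 dB: visually indistinguishable
print(round(psnr(ref, far), 1))   # 22.1 dB: clearly degraded
```

Because the scale is logarithmic, 40.8 dB means the average per-pixel error is on the order of a couple of intensity levels out of 255.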
But once the task requires logical reasoning, the model hits a hard capability boundary. Two-digit addition results are almost always wrong. This isn’t a training data problem. It reflects a deeper limitation: video models excel at learning visual patterns (pixel-level statistical regularities) but struggle with formal symbolic computation.
The project page describes it precisely: what the model learns first is “the appearance of runtime,” not the logic of runtime (Neural Computer project page). The screen looks like a working computer, but the model hasn’t truly “understood” computation itself.
An interesting data point: 110 hours of carefully scripted terminal data (standard input/output sequences generated via scripts inside Docker containers) significantly outperforms 1,400 hours of random terminal recordings for training. Data quality matters far more than quantity for these systems, consistent with broader experience in the world model field.
The Neural Computer concept seems wild in isolation, but placed in the context of the past five years, it’s the natural extension of a clear technical trajectory. The core question driving this trajectory: to what extent can neural networks replace traditional software end-to-end?
The starting point was Ha and Schmidhuber’s World Models in 2018, which simulated simple racing game environments in latent space for policy planning. In 2020, NVIDIA’s GameGAN went further: training a GAN on 50,000 Pac-Man gameplay sessions and corresponding controller inputs, the model learned to reconstruct the entire game. It learned not just how Pac-Man moves, but also game rules like ghosts turning purple and fleeing after a power pellet is consumed. Four NVIDIA GP100 GPUs, four days of training, and you get a playable Pac-Man (Engadget).
This phase proved one thing: game rules can emerge from pure visual observation, without access to the game engine’s source code. But the complexity was limited to 2D arcade games.
2024 was the breakout year. Multiple projects simultaneously pushed complexity up by an order of magnitude.
Google’s GameNGen used a diffusion model to simulate the classic FPS game DOOM in real time, running above 20fps on a single TPU with a PSNR of 29.4. Human evaluators could barely distinguish real gameplay clips from model-generated ones. Stable runtime exceeded five minutes (AI CERTs).
Decart and Etched’s OASIS was more radical: a Transformer-based model generating a Minecraft-style open world in real time at 20fps, with no physics engine, no game code. You can walk, jump, place blocks, chop trees. Of course, if you turn around, the block you just placed might have morphed into something else, because the model hallucinates (MIT Technology Review).
Google DeepMind’s Genie series moved from academic research toward product. Genie 2 (December 2024) could generate interactive quasi-3D environments from a single image, maintaining consistency for about one minute. By Genie 3 (August 2025), resolution reached 720p at 24fps with roughly 150ms latency, still with about one minute of coherence. Environmental consistency emerged from the model rather than being enforced by explicit 3D representations.
This phase proved two things: diffusion models and transformers can simulate complex 3D environments in real time; but all systems share a common coherence ceiling of roughly one minute before visual drift begins.
Neural Computer and NeuralOS appeared almost simultaneously, shifting the simulation target from game environments to computer interfaces themselves.
NeuralOS, from the University of Waterloo and the National Research Council of Canada, takes a slightly different architectural approach: an RNN maintains the operating system’s internal state (which applications are running, window stacking order, recent actions), while a diffusion renderer generates desktop images. Training data consists of Ubuntu XFCE desktop recordings, including both random interactions and human-like interactions generated by AI agents. The model correctly renders cursor movement, double-clicking to open folders, and closing windows. But like Neural Computer, precise keyboard input modeling remains challenging (Hugging Face).
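The division of labor NeuralOS describes can be sketched as follows. Both components are toy stand-ins for illustration, not the paper's models: the real state tracker is an RNN and the real renderer is a diffusion model, whereas here the "state" is an explicit window list and the "frame" is a string.

```python
from dataclasses import dataclass, field

@dataclass
class OSState:
    """Stand-in for the RNN's hidden state: explicit, inspectable OS state."""
    open_windows: list[str] = field(default_factory=list)  # stacking order

def update_state(state: OSState, event: str) -> OSState:
    """State-tracker stand-in: fold the latest user event into persistent state."""
    if event.startswith("open:"):
        state.open_windows.append(event.split(":", 1)[1])
    elif event.startswith("close:"):
        name = event.split(":", 1)[1]
        if name in state.open_windows:
            state.open_windows.remove(name)
    return state

def render(state: OSState) -> str:
    """Renderer stand-in: turn state into a 'frame' (here, just text)."""
    return " | ".join(state.open_windows) or "<empty desktop>"

s = OSState()
for e in ["open:files", "open:terminal", "close:files"]:
    s = update_state(s, e)
print(render(s))  # terminal
```

The design choice worth noticing is the separation itself: logical state lives in a component built to persist it, and the generative model is only asked to paint what that state implies.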
Runway’s GWM-1 (December 2025) took a more commercial route, splitting its world model into three branches: environment simulation (Worlds), robot training (Robotics), and virtual characters (Avatars), all built on autoregressive real-time generation atop the Gen-4.5 video model.
Google/MIT/Berkeley’s UniSim tried a different path: mixing multiple data sources (rich objects from images, dense actions from robotics data, diverse movements from navigation data) to train a universal simulator that responds to both high-level instructions (“open the drawer”) and low-level controls (“move by x pixels”).
The core advance in this phase is expanding the simulation scope from “specific game environments” to “general computer interfaces” and “the physical world.” But the fundamental difficulty remains unchanged: models learn the appearance and simple interaction patterns of environments, not the logic behind them.
Another trajectory worth noting is Tesla FSD v12. In March 2024, Tesla completely replaced 300,000 lines of hand-written C++ control code with a single end-to-end video transformer model. This is the largest real-world deployment of “neural networks replacing traditional software.” But its success rests on extremely demanding conditions: over 8.4 billion miles of real driving data, 100 petaflops of dedicated Dojo training compute, and the fact that autonomous driving as a task can be repeatedly practiced and automatically scored.
From Pac-Man to DOOM to Minecraft to Ubuntu desktop to terminal command line, the simulation targets along this trajectory grow more complex, but one pattern never changes: models always learn the visual layer first, and the logical layer is always the hardest.
GameGAN renders convincing Pac-Man visuals, but ghost pursuit strategies are only approximately correct. OASIS generates Minecraft terrain and block operations, but world consistency can collapse when you turn around. Genie 3’s environmental consistency is an emergent property that begins to drift after about one minute. Neural Computer renders perfect terminal frames, but gets 23 + 45 wrong.
This pattern reveals a fundamental issue: video models are trained to minimize pixel-level prediction error, an objective that naturally favors learning visual statistical regularities (color distributions, spatial layouts, motion patterns) over abstract logical relationships. On a terminal screen, 23 + 45 = 68 and 23 + 45 = 71 differ by almost nothing in pixel space, but one is correct and one is wrong. The video model's loss function doesn't distinguish between them.
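A toy calculation makes the asymmetry concrete. Using character positions as a crude proxy for pixels (an assumption for illustration; real frames are rendered glyphs), the two terminal lines above are nearly identical under a reconstruction loss, while a symbolic check separates them completely:

```python
def diff_fraction(a: str, b: str) -> float:
    """Fraction of positions that differ: a crude proxy for pixel-level loss."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b)) / len(a)

correct = "23 + 45 = 68"
wrong   = "23 + 45 = 71"

# Reconstruction-style view: the wrong answer is ~83% "correct".
print(round(diff_fraction(correct, wrong), 3))  # 0.167 (2 of 12 positions differ)

# Symbolic view: correctness is binary.
print(23 + 45 == 68)  # True
print(23 + 45 == 71)  # False
```

A loss that rewards the 83% match has no gradient toward the distinction that actually matters.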
Andrej Karpathy’s early 2026 observation provides a useful evaluation framework: if a task can be practiced, scored, reset, and rewarded, AI will get very good at it. Applying this framework to Neural Computer’s tasks:
Rendering terminal frames? All four conditions are satisfied. That’s why visual quality can be very high.
Executing correct arithmetic? “Scoring” requires exact symbolic matching rather than pixel similarity, which is incompatible with the video model training paradigm. “Practice” requires not more recordings but the formal structure of arithmetic itself. Neither condition is currently met.
This means that without addressing the separation of “visual rendering” and “logical reasoning” at the architectural level, simply adding more training data and model parameters is unlikely to push Neural Computer past the “correct computation” threshold. Interestingly, NeuralOS chose a different architectural path, using an RNN specifically for logical state and diffusion specifically for visual rendering, which may be a response to this exact problem.
The value of understanding Neural Computer lies not in what this prototype can do, but in how it reveals two fundamentally different paths for AI’s relationship with software, and what the trajectory of these two paths means for technology investment and product design.
Path A: AI learns to use software. This is the dominant narrative today. AI Agents run on top of traditional software stacks, invoking existing tools through APIs, command lines, or even simulated mouse and keyboard inputs. Software itself doesn’t change; AI is simply a smarter user. The implicit premise: the traditional software stack is a stable infrastructure layer.
Path B: AI learns to become software. This is the direction Neural Computer represents. Not having AI operate a terminal, but having AI directly generate terminal frames. Not having AI call a physics engine, but having AI directly produce physically plausible frames. GameNGen simulates DOOM, OASIS simulates Minecraft, Genie 3 simulates arbitrary interactive environments, and Neural Computer attempts to simulate general computation. The implicit premise: software itself can be learned.
These two paths aren’t mutually exclusive, but they call for different judgments depending on who you are.
If you’re building AI products, Path B’s progress means the choice between “building better APIs for AI” and “having AI directly generate interactive experiences” is opening up. Runway’s GWM-1 is already pushing toward the latter. This doesn’t mean APIs will disappear, but in certain scenarios, generating an entire interactive experience may be more natural than assembling API calls.
If you’re building infrastructure, Path B implies that a portion of computing is migrating from deterministic instruction execution to neural network inference. This has already happened in autonomous driving (Tesla FSD replaced 300K lines of C++ with an end-to-end model), and is happening in video generation and robot training. This migration changes GPU/TPU economics and deployment architectures.
If you care about AI’s long-term trajectory, the fork between these two paths points to a more fundamental question: is AI’s endgame an intelligent agent skilled at using tools, or a system that can directly simulate how the world runs? The ceiling of the former depends on the toolchain’s own capabilities; the ceiling of the latter depends on learning efficiency and precision. Neural Computer’s experiments show exactly where the latter’s ceiling currently sits: visual learning works, logical learning doesn’t.
The most valuable part of Neural Computer may not be its prototype, but the cognitive framework it provides: treating the boundary between “model” and “computer” as something that can be redefined.
Under this framework, the past five years of progression from Pac-Man to Ubuntu desktop are no longer a collection of independent demos, but a progressive attack on the same problem: to what extent can a neural network internalize the complete behavior of an interactive system? GameGAN internalized the rules of a 2D arcade game, GameNGen internalized DOOM’s real-time rendering and physics feedback, OASIS internalized Minecraft’s open-world generation, Genie internalized the mapping from arbitrary images to interactive environments, and Neural Computer attempted to internalize general computation itself.
Each step expands the range of “things that can be replaced by learning.” Each step also hits the same wall: learning appearance is far easier than learning logic.
Whether this wall can be broken determines whether this trajectory ultimately produces a batch of useful world simulators (for gaming, robot training, content generation), or truly leads to a new computing paradigm. The former is already happening today. The latter may require something we don’t yet know how to build.
Sources