Something counterintuitive has been happening in robotics over the past year.
Unitree’s G1 completes complex motions across a wide range of scenarios, Figure AI’s robots have been working continuously in BMW factories for ten months, and Physical Intelligence’s π₀ can fold laundry and assemble objects. Even Boston Dynamics’ Atlas — painstakingly refined over thirty years — began incorporating reinforcement learning and foundation model components in 2024.
The common thread across this new generation of robots: they no longer try to understand physics.
The core assumption of traditional robot control is “model the physics first, then control the robot.” You write out rigid body dynamics equations, model joint friction, ground contact forces, and actuator nonlinearities, then use MPC (Model Predictive Control) or trajectory optimization to solve for the optimal control sequence. This approach is elegant, interpretable, and comes with mathematical guarantees.
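The “model the physics first, then optimize the controls” recipe can be made concrete with a toy example. The sketch below is my own illustration (not any production controller): finite-horizon LQR, the simplest relative of MPC, applied to a hand-written double-integrator model. All constants are made up.

```python
import numpy as np

# Hand-written physics model of a 1-D cart: state = [position, velocity],
# x_{t+1} = A x_t + B u_t. This is the "model the physics first" step.
dt = 0.02                                  # 50 Hz control step
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.5 * dt**2], [dt]])
Q = np.diag([10.0, 1.0])                   # cost on state error
R = np.array([[0.1]])                      # cost on control effort

# "Then optimize the controls": backward Riccati recursion over a 2 s horizon
# yields time-varying feedback gains K_t minimizing the quadratic cost.
N = 100
P = Q.copy()
gains = []
for _ in range(N):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ (A - B @ K)
    gains.append(K)
gains.reverse()                            # gains[0] is the gain for t = 0

# Roll the controller out from position 1, velocity 0, toward the origin.
x = np.array([[1.0], [0.0]])
for K in gains:
    u = -K @ x
    x = A @ x + B @ u
print(float(x[0, 0]))                      # position driven close to 0
```

If the matrices A and B describe the hardware accurately, this is elegant and provably optimal; the rest of the essay is about what happens when they don’t.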
VLA (Vision-Language-Action) took a completely different path. It takes a pretrained vision-language model (such as Google’s PaLI, or VLMs built on Meta’s Llama), fine-tunes it on robot manipulation data, and lets the model directly predict the next action from camera images and language instructions. At its core, this is just next token prediction — no different in principle from how ChatGPT generates text, except the output tokens encode joint angles instead of words.
An obvious question follows: how can a model that has never learned Newton’s laws control a robot?
To answer that question, we need to look at where each approach has been, and where each has hit its limits.
The physics track traces back to the 1970s. Vukobratović proposed the ZMP (Zero Moment Point) criterion: as long as the net external moment falls within the support polygon, the robot won’t fall over. Honda’s ASIMO was designed on this principle. The cost was extremely slow walking, since ZMP essentially requires quasi-static balance.
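The ZMP criterion itself reduces to a geometric test: is a single point inside a polygon? A minimal sketch, with made-up foot geometry (the names and numbers are illustrative, not from any real controller):

```python
# ZMP criterion sketch: the robot won't tip over as long as the zero-moment
# point lies inside the support polygon formed by the feet on the ground.

def cross(o, a, b):
    """2-D cross product of vectors (a - o) and (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def zmp_inside(zmp, polygon):
    """True if `zmp` lies inside the convex support polygon (CCW vertices)."""
    n = len(polygon)
    return all(cross(polygon[i], polygon[(i + 1) % n], zmp) >= 0
               for i in range(n))

# Rectangular support region under two feet (meters, ground plane).
support = [(-0.1, -0.15), (0.1, -0.15), (0.1, 0.15), (-0.1, 0.15)]
print(zmp_inside((0.0, 0.05), support))   # True: balanced
print(zmp_inside((0.3, 0.0), support))    # False: ZMP outside, robot tips
```

Keeping this test satisfied at every instant is what forces ZMP walkers like ASIMO into slow, quasi-static gaits.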
Marc Raibert at MIT Leg Lab in 1986 broke through this limitation. His key insight was that momentum itself is a balancing resource to be exploited, not avoided. He decomposed locomotion into three decoupled subproblems (hopping height, forward speed, body attitude), each solved with a simple PD controller. This pioneered dynamic balance research, but the control laws relied on hand-derived simplified models that were difficult to scale to high-DOF robots.
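Raibert’s decomposition can be sketched in a few lines. The three control laws below follow the spirit of his scheme (foot placement relative to the “neutral point” for speed, a hip PD servo for attitude, thrust correction for hopping height); the gains and numbers are illustrative assumptions of mine, not his published values.

```python
# Raibert-style three-part hopping controller: each subproblem gets its own
# trivial control law, with no full dynamics model anywhere.

def foot_placement(v, v_des, stance_time, k_v=0.05):
    """Forward speed: place the foot relative to the neutral point
    (v * stance_time / 2); offset by the speed error."""
    return v * stance_time / 2.0 + k_v * (v - v_des)

def hip_torque(pitch, pitch_rate, k_p=80.0, k_d=8.0):
    """Body attitude: plain PD servo on pitch during stance."""
    return -k_p * pitch - k_d * pitch_rate

def thrust(height, height_des, k_h=40.0):
    """Hopping height: thrust proportional to the apex-height error."""
    return k_h * (height_des - height)

# Moving too fast (1.2 m/s vs desired 1.0): land the foot further forward
# than the neutral point, which brakes the body on the next stance.
print(foot_placement(1.2, 1.0, stance_time=0.2))
```

The power and the limitation are the same thing: these laws come from a human-simplified model of one machine, and re-deriving them for a 28-DOF humanoid by hand does not scale.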
In the 2010s, Todorov and others turned trajectory optimization into real-time MPC, using DDP (Differential Dynamic Programming) to perform second-order expansion around the current state — enabling 50Hz real-time control of a 28-DOF humanoid for the first time. MIT’s Di Carlo and Kim further reformulated MPC as a convex quadratic program, compressing computation to under one millisecond. This became the standard tool in industry, including Boston Dynamics’ Spot.
This is where the first bottleneck of the physics track emerged: modeling accuracy. MPC effectiveness depends directly on the accuracy of the physical model, and there are too many real-world phenomena that resist precise modeling. Motor nonlinearities, drivetrain friction, and foot-ground contact forces all require extensive manual tuning, and model parameters shift with temperature and wear.
ETH Zürich’s Hwangbo et al. (2019) offered a landmark response: use neural networks to replace hand-modeled actuator dynamics. They learned the motor’s input-output relationship directly from real hardware data, swapped out the idealized model in the simulator, and trained RL policies in the more realistic simulation. The ANYmal quadruped thereby achieved, for the first time, zero-shot sim-to-real transfer of highly dynamic behaviors such as high-speed running and recovery from falls.
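The substitution can be illustrated with a toy version: fit a model to measured actuator data and compare it against the idealized hand-written model. Here a cubic polynomial stands in for their neural network, and the “measurements” are synthetic data I generate with an assumed saturation nonlinearity.

```python
import numpy as np

# Idealized model: torque = k * current. Real actuator: add a soft
# saturation. We "measure" noisy data and fit a data-driven replacement.
rng = np.random.default_rng(0)
current = np.linspace(-10, 10, 200)
k = 0.9
true_torque = k * current - 0.02 * current**3        # assumed nonlinearity
measured = true_torque + rng.normal(0, 0.05, current.shape)

ideal = k * current                                   # hand-written model
learned = np.polynomial.Polynomial.fit(current, measured, deg=3)

err_ideal = np.mean((ideal - true_torque) ** 2)
err_learned = np.mean((learned(current) - true_torque) ** 2)
print(err_ideal, err_learned)   # the fitted model tracks the hardware far better
```

Dropping the fitted model into the simulator is the whole trick: the RL policy then trains against dynamics that match the real motor, which is what made the zero-shot transfer work.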
This was a subtle but important inflection point: the physics track’s own evolution had already started pointing toward data-driven methods, as hand-modeled components hit their accuracy ceiling.
NVIDIA’s Isaac Gym (2021–2022) pushed this trend to the extreme: running 4096 simulated robots in parallel on a single GPU, compressing training time from days to about 20 minutes. ETH’s Lee and Miki (2020–2022) used a teacher-student architecture to enable ANYmal to complete real mountain hikes. By this point, the physics track had effectively become “train RL inside a physics simulator” — the role of the physical model had been demoted from “the controller itself” to “infrastructure for the training environment.”
The VLA track started with behavior cloning: given expert demonstration observation-action pairs, train a mapping. The approach is simple, but carries three fundamental problems: distribution shift causes compounding errors; MSE loss cannot handle multimodal action distributions (picking up with the left hand and picking up with the right hand are both correct answers, and averaging them produces a wrong answer); and generalization is extremely poor.
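The multimodality failure is easy to demonstrate numerically: if half the demonstrations grasp from the left and half from the right, the MSE-optimal prediction is their mean, an action no expert ever took.

```python
import numpy as np

# Two equally valid action modes in the demonstration data:
# x = -1 (left-hand grasp) and x = +1 (right-hand grasp).
demos = np.array([-1.0] * 50 + [1.0] * 50)

# An L2-trained regressor converges to the conditional mean of the data.
mse_optimal = demos.mean()
print(mse_optimal)                          # 0.0: reach for the middle

# Distance from that prediction to the nearest *valid* action:
print(np.abs(demos - mse_optimal).min())    # 1.0: misses both modes entirely
```

A diffusion model sidesteps this by learning the full action distribution and sampling from it, rather than collapsing it to a mean.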
In 2023, TRI’s Diffusion Policy replaced direct regression with a diffusion model, naturally resolving the multimodality problem and improving on the then-best methods by 46.9% on manipulation tasks (per the paper’s claim). But it had no understanding of language instructions and was still a single-task policy.
The real turning point was Google’s RT-1 (2022) and RT-2 (2023). RT-1 first demonstrated that scaling laws apply to robot control: 13 robots collecting 130,000 real demonstrations over 17 months produced a Transformer policy that generalized across more than 700 task types. But RT-1’s knowledge came entirely from robot data, leaving it helpless against instructions requiring commonsense reasoning, like “bring me the drink that’s appropriate for a child.”
RT-2 did something that looked audacious: take a 55-billion-parameter pretrained vision-language model (PaLI-X) and fine-tune it directly on robot data, so it outputs action tokens alongside text tokens. Actions are quantized into 256 discrete values, encoded as strings, and predicted alongside ordinary words. In other words, the mechanism by which the model generates the token “0.312” is exactly the same as when it generates the token “apple.”
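A minimal sketch of this discretization (the 256-bin count is from the source; the action range, bin layout, and function names are my own assumptions, not RT-2’s exact scheme):

```python
import numpy as np

# Each action dimension is quantized into 256 uniform bins over its range;
# each bin index becomes a token the language model predicts like any word.
LOW, HIGH, BINS = -1.0, 1.0, 256

def tokenize(action):
    """Map continuous joint commands in [LOW, HIGH] to integer tokens 0..255."""
    clipped = np.clip(action, LOW, HIGH)
    return np.round((clipped - LOW) / (HIGH - LOW) * (BINS - 1)).astype(int)

def detokenize(tokens):
    """Invert: token index back to the corresponding continuous value."""
    return LOW + tokens / (BINS - 1) * (HIGH - LOW)

a = np.array([0.312, -0.75, 0.0])
toks = tokenize(a)
print(toks)                # the integer "words" the VLM actually emits
print(detokenize(toks))    # recovered actions, within half a bin of the input
```

The quantization error is bounded by half a bin width (about 0.004 on a [-1, 1] range), which is one reason later systems like π₀ moved to continuous-action heads for fine manipulation.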
RT-2’s results far exceeded expectations. It substantially improved on manipulation tasks (Google DeepMind’s official blog reported a 3x improvement in zero-shot generalization), and emergent capabilities appeared: it could handle instruction combinations never seen during training. Internet pretraining had given it semantic understanding of the physical world — a dimension entirely absent from traditional physical models.
Subsequent development advanced along three main lines. Octo (2024) and OpenVLA (2024) addressed the openness problem: OpenVLA used a 7B-parameter open-source VLM (DINOv2 + Llama-2) to outperform the 55B-parameter RT-2 on public benchmarks (Stanford paper claims +16.5% absolute success rate, though the evaluation setups differ and this should be interpreted carefully). Physical Intelligence’s π₀ (2024) addressed precision and frequency: it replaced discrete tokenization with flow matching to predict continuous actions directly, outputting the next 50 steps at once, enabling 50Hz high-frequency control and dexterous manipulation.
At this point, the stories of both tracks have been told. But the surface-level narrative is only part of it. The deeper question is: how can a model with no physical understanding beat carefully physics-modeled approaches on physical control tasks?
I believe the answer has nothing to do with physics per se, and everything to do with information theory.
Physical modeling is fundamentally a form of compression. Newton’s laws compress the entire motion behavior of macroscopic objects into three equations. Rigid body dynamics equations compress multi-joint robot state transitions into matrix operations. MPC compresses the definition of “good motion” into a cost function and constraints. This compression is highly efficient for simple systems: rocket rigid body mechanics + thrust equations + aerodynamics can predict trajectories with high accuracy, and SpaceX’s rocket recovery still uses convex optimization for powered descent guidance today.
But compression means losing information. The moment you choose a rigid body model, you discard information about flexible deformation. The moment you choose Coulomb friction, you discard the temperature dependence and anisotropy of friction forces. The moment you simplify full-body dynamics to a single rigid body, you discard the influence of limb inertia. For simple systems, this lost information genuinely doesn’t matter. But for complex systems — a robot picking up a water-filled cup in a kitchen — the physical phenomena involved (cup material, water sloshing, tabletop friction, finger deformation) far exceed what any hand-built model can cover.
More critically, this compression has a ceiling: its accuracy is bounded by human modeling capability. More compute only lets you solve equations faster; it cannot make the equations themselves more accurate. You can raise MPC’s computation frequency from 50Hz to 500Hz, but if your friction model is wrong, 500Hz just means executing the wrong actions faster.
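This point can be simulated directly. In the toy experiment below (constants are illustrative), a feedforward controller cancels friction using its model of friction; because the model underestimates the true coefficient, running the identical controller ten times faster settles to the same wrong velocity.

```python
# A 1-D mass should hold velocity 1.0 m/s. The controller computes its
# feedforward force from a friction model that is biased low, so the
# steady-state error is set by the model, not by the control rate.
def run(rate_hz, seconds=5.0):
    dt, v, v_des = 1.0 / rate_hz, 0.0, 1.0
    mass, c_true, c_model = 1.0, 2.0, 1.5   # believes friction is weaker
    for _ in range(int(seconds * rate_hz)):
        u = c_model * v_des                  # model-based feedforward force
        a = (u - c_true * v) / mass          # true dynamics
        v += a * dt
    return v

print(run(50), run(500))   # both settle near 0.75 m/s, not 1.0
```

Both rates converge to v = c_model/c_true = 0.75: the bias lives in the equations, and no amount of compute spent solving them faster removes it.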
What VLA does is fundamentally abandon this compression. It uses an extremely large general-purpose function approximator (a transformer) to directly learn the mapping from perception to action. This mapping is not defined by human-specified equations — it is defined by data. So its accuracy ceiling is not some engineer’s modeling ability, but data volume and compute. As long as both can continue to grow, accuracy can continue to improve. This is what “unsaturated” means: the accuracy curve of traditional methods flattens out (capped by modeling precision), while the accuracy curve of learned approaches can keep rising given sufficient data and compute.
This logic is exactly the same as how deep learning won in other domains. Traditional NLP tried to first understand syntax and semantics (compression); LLMs abandoned that intermediate step and went straight to next token prediction (no compression). Traditional computer vision tried to first extract edges, textures, and shape features (compression); CNNs and ViTs learn end-to-end from pixels to labels (no compression). Every time “no compression” beat “compression,” it was fundamentally the same story: when data and compute cross a certain threshold, the accuracy of general-purpose function approximators starts to exceed that of human-designed compression algorithms.
But this also implies something important: in data-scarce domains, physical modeling remains the better choice. SpaceX’s rocket recovery is the canonical example — rocket dynamics can be precisely modeled (low system complexity), and each launch provides only one data point (extremely low data density). So convex optimization is more appropriate than any learned approach. Deciding which track fits a control problem comes down to two variables: system complexity (how much information can hand-modeling compress away without losing critical dimensions?) and data abundance (how much data is available to let a function approximator fill the state space?).
Interestingly, if you look carefully at the evolution of the physics track itself, its own development direction validates this judgment.
In 2019, ETH’s Hwangbo replaced hand-modeled actuator dynamics with neural networks, admitting that human modeling of actuator dynamics was no longer sufficient. In 2022, Isaac Gym made large-scale RL training practical, and control policies themselves shifted from hand-designed to learned from data. In 2024, Boston Dynamics officially introduced RL on Spot, with the official blog writing: “These strategies work well when the controller’s model behaves similarly to the physical system…[RL gives] improved reliability on slippery and irregular surfaces.” That same year, the electric Atlas was announced with RL and foundation model components. In 2025, Spot using RL reached a speed of 5.2m/s — more than three times the maximum speed of the original MPC controller (1.6m/s).
The direction of the physics track’s evolution is unambiguous: increasingly more components are transitioning from hand-modeled to data-learned. The role of physical models has been progressively demoted from “the controller itself” to “infrastructure for the training environment.” VLA simply takes this trend to its logical extreme: no physics simulator is needed at all — learning proceeds directly from real-world video and demonstration data.
It should be noted that the actual industrial landscape in 2025 is not “VLA has fully replaced physical methods.” The real picture is a continuous spectrum from pure physics to pure learning.
Boston Dynamics still retains extensive traditional MPC/WBC (whole-body control) infrastructure; RL and foundation model components are new additions layered on top of thirty years of accumulated physics-based control. Unitree’s G1/H1 primarily uses RL for locomotion, and the manipulation side is progressively incorporating VLA. Figure AI’s Helix architecture is a typical layered design: System 2 (7B VLM, 7–9Hz) handles scene understanding and language instructions, while System 1 (visuomotor policy, 200Hz) handles specific limb execution. Physical Intelligence’s π₀ is currently the closest to a pure VLA solution, with no traditional physical control modules at all.
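The two-rate structure of such layered designs can be sketched as a pair of nested loops. This shows only the timing skeleton: the real System 2 is a 7B VLM and System 1 a learned visuomotor policy, both stubbed out here with trivial placeholders of my own.

```python
# Helix-style two-rate stack (structure only, all components stubbed).
FAST_HZ, SLOW_HZ = 200, 8

def system2_plan(observation):
    """Slow loop (~8 Hz): scene + language understanding -> a latent goal."""
    return observation["target_pos"]          # stub: the 'plan' is the target

def system1_step(joint_pos, latent_goal, k_p=0.2):
    """Fast loop (200 Hz): servo the limb toward the current latent goal."""
    return joint_pos + k_p * (latent_goal - joint_pos)

obs = {"target_pos": 1.0}
pos, goal = 0.0, 0.0
for tick in range(FAST_HZ):                   # one second of control
    if tick % (FAST_HZ // SLOW_HZ) == 0:      # System 2 fires every 25 ticks
        goal = system2_plan(obs)
    pos = system1_step(pos, goal)
print(round(pos, 3))                          # limb has converged on the goal
```

The design rationale: the slow loop carries semantics it can afford to compute rarely, while the fast loop keeps the hardware stable between plans, so neither layer needs the other’s strengths.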
This spectrum also corroborates the framework above. Locomotion physics is relatively modelable (four-point contact, primarily rigid body dynamics), so RL + sim-to-real is already good enough. Manipulation tasks involve complex contact, diverse objects, and flexible materials — the compression loss from physical modeling is too large, and VLA’s advantage is much more pronounced. So most vendors have settled on a layered approach: locomotion with RL (trained in physics simulators), manipulation with VLA (learned from demonstration data).
Tesla’s FSD v12+ is another interesting reference point. At the end of 2023 it transitioned from a traditional C++ rule engine to an end-to-end neural network: multi-camera input → transformer → direct output of steering/throttle/braking. No explicit maps, rule engines, or physical models in between. Strictly speaking it is “VA” rather than VLA (vision-action, with no language input), but its core logic is identical. The reason it could take this path is precisely: urban driving has extremely high environmental complexity (behavior of other vehicles, pedestrian intent, varied road conditions), meaning hand-modeling inevitably loses critical information; simultaneously, Tesla has millions of vehicles on the road, providing extremely high data abundance. Both conditions are met, making end-to-end the natural choice.
VLA is not a universal solution. The current approach has several recognized bottlenecks that are also the key dimensions for judging its applicable scope.
Precision remains a hard constraint. VLA’s performance on millimeter-level accuracy and fine force control is still insufficient. The GR-RL (2024) paper analyzed failure cases of large-scale VLA on long-horizon fine manipulation tasks, pointing to two core problems: uneven quality in human demonstration data, and distribution shift between demonstration and inference. For industrial scenarios requiring 6-sigma reliability (precision assembly, medical procedures), VLA cannot currently meet requirements.
Safety and interpretability are another dimension. Traditional MPC behavior is deterministic, constraint satisfaction has mathematical guarantees (KKT conditions), and formal verification is possible. VLA behavior is emergent — small input perturbations can cause behavioral discontinuities, there is no introspection interface, and industrial safety certification frameworks have not prepared a pathway for black-box neural networks. An important reason Boston Dynamics continues to retain traditional control infrastructure is that 500Hz low-level servo control and joint constraint guarantees are still handled by classical controllers; RL/VLA only operates at higher levels.
Long-horizon planning capability is also limited. Current VLA excels at single-step or short-horizon tasks (pick up a cup, fold one piece of clothing), but has limited capability for composite tasks requiring dozens of steps (tidy a room, cook a meal). This direction is being addressed through integration with world models: using a world model to predict possible future states, providing VLA with planning capability. Works like UniSim (Google/MIT, 2023) are also exploring the use of generative models to synthesize training data and reduce the cost of real data collection. The two tracks are complementary here.
To summarize: judging whether a robot control problem should follow the physics track or the learning track comes down to two variables.
System complexity: how much can hand-modeling compress system behavior while keeping information loss low? If the system’s physics can be precisely described with a small number of equations (rockets, industrial manipulator kinematics), physical modeling is the more efficient choice. If the system involves a large number of interactions that are hard to model (flexible contact, diverse objects, unstructured environments), hand-compression necessarily loses critical dimensions and the learned approach’s upper bound is higher.
Data abundance: how much data is available to let a general-purpose function approximator learn the state space? If data is extremely scarce (rocket launches, deep space exploration), physical knowledge as a prior is irreplaceable. If data is abundant or can be generated at scale (autonomous vehicle fleets, robot teleoperation, simulation environments), learned approaches can fully exploit their unsaturated advantage.
What the robotics field is currently experiencing is a progressive migration toward learned approaches as applications move from low to high system complexity. Industrial manipulator kinematic control (low complexity) requires almost no learning. Quadruped locomotion (medium complexity) has already migrated from MPC to RL + sim-to-real. General manipulation (high complexity) is migrating toward VLA. As data infrastructure matures and model capabilities improve, the problem domain suited to learned approaches continues to expand, and the boundary between the two tracks keeps moving.
This framework also explains why the physics track looks elegant but hits walls, while the brute-force approach wins. The elegance of physical modeling comes from its compression efficiency — but compression efficiency is precisely its bottleneck in complex systems. When data and compute are sufficient, “no compression” is a better strategy than “compression,” even if it looks less elegant. This pattern has repeated itself in NLP, computer vision, and autonomous driving; robot control is simply the latest validation.