Chapter 3: Paradigm Shift Overview
3.1 The thesis and the reading path
This chapter is the map. If you work at a humanoid company, the rest of the book is a navigation exercise: your stack appears somewhere on this map, and the interesting question is where you sit relative to the four-catalyst dependency graph and the three-layer architecture. The chapters that follow are not tutorials; they are evidence that the map is drawn accurately.
The thesis the map illustrates is this. Humanoid control has undergone a regime change from prescriptive model-based optimization to distribution-covering learned implicit models, and the new regime has now earned the right to absorb manipulation and language through the System 0/1/2 stack. Four catalysts drove that change — Quasi-Direct-Drive (QDD) actuators, GPU-parallel simulation, teacher-student reinforcement learning with history encoders, and the sim-to-real correction toolkit. Each is necessary. None is sufficient alone. Their conjunction collapsed the orthodox LIPM/ZMP/QP pipeline that Chapter 1 audited and opened the architecture that Chapter 9 formalizes.
Chapter 3 does four things. First (§3.2), it draws the interdependency graph among the four catalysts — why no single catalyst would have sufficed, and in what specific way each one underwrites the others. Second (§3.3), it lays out the historical timeline from 2015 to 2026, showing when each catalyst crossed its viability threshold and why 2019–2021 and 2023–2026 are the two decisive windows. Third (§3.4), it summarizes the Part II Catalyst Verdicts — the one-line "solved / partially solved / still open" judgment for each catalyst that was extracted from this book's gap analysis and will be cited back at the close of Chapters 4 through 7. Fourth (§3.5), it provides four reader trajectories, one per persona (engineer, researcher, manufacturing strategist, informed technical reader), so that the rest of the book can be read profitably on any of four paths.
3.2 The four catalysts and their interdependencies
Each of the four catalysts has a self-contained intellectual history. But the regime change required all four at once, and the ordering of dependencies is tight enough that any three of them without the fourth would have been insufficient.
Catalyst 1 — QDD actuators (Chapter 4). Outer-rotor BLDC motors paired with low-ratio planetary gears, instrumented with motor-current torque estimation, combining backdrivability, high control bandwidth, and proprioceptive ground-reaction-force sensing [1]. Without QDD, a learned policy would run on hardware that does not honestly execute its commanded joint torques; the PD-tracking assumption inside every modern System 0 would fail, and the training-deployment gap would be unclosable. QDD is the honest substrate. Chapter 4 develops this claim in detail.
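The proprioceptive claim can be made concrete with arithmetic. A minimal sketch of motor-current torque estimation, with illustrative constants (the torque constant, gear ratio, and efficiency below are placeholders, not values from any specific actuator):

```python
# Proprioceptive torque estimation: in a low-ratio QDD joint, motor phase
# current is a trustworthy proxy for output torque because little torque is
# lost in the transmission. All constants are illustrative placeholders.
KT_NM_PER_A = 0.068    # motor torque constant (Nm per amp)
GEAR_RATIO = 6.0       # low ratio is what keeps the joint backdrivable
GEAR_EFFICIENCY = 0.95

def estimated_joint_torque(phase_current_a: float) -> float:
    """tau_joint ~= Kt * I * N * eta for a low-ratio planetary transmission."""
    return KT_NM_PER_A * phase_current_a * GEAR_RATIO * GEAR_EFFICIENCY

def reflected_rotor_inertia(rotor_inertia: float) -> float:
    """Reflected inertia scales with N^2, which is why keeping N small
    preserves backdrivability and impact tolerance."""
    return GEAR_RATIO ** 2 * rotor_inertia
```

The N^2 scaling is the quantitative reason QDD trades gear ratio for transparency: a 6:1 joint reflects 100x less rotor inertia than a 60:1 harmonic drive with the same rotor.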
Catalyst 2 — GPU-parallel simulation (Chapter 5). Physics simulators that share GPU memory with the policy network, enabling thousands to millions of environments to step in parallel and eliminating CPU-GPU transfer bottlenecks [4]. Without GPU parallelism, the sample complexity of deep RL on humanoids is wall-time-prohibitive; the policy either is too narrow to matter or trains for weeks before each ablation. GPU sim is the sample engine. Chapter 5 develops this claim.
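The structural idea is that every environment advances in one batched tensor operation, with no per-environment Python loop. That pattern can be sketched without a GPU; the point-mass dynamics and numpy arrays below are stand-ins for what Isaac Gym does with GPU-resident tensors:

```python
import numpy as np

# Minimal sketch of the vectorized-environment pattern behind GPU simulators:
# all N environments step with one fused array operation. (Isaac Gym keeps
# these tensors on the GPU next to the policy network; numpy stands in here.)
class BatchedPointMassEnv:
    def __init__(self, num_envs: int, dt: float = 0.005):
        self.num_envs, self.dt = num_envs, dt
        self.pos = np.zeros(num_envs)
        self.vel = np.zeros(num_envs)

    def step(self, actions):            # actions: (num_envs,) accelerations
        self.vel += actions * self.dt   # one update for all envs at once
        self.pos += self.vel * self.dt
        reward = -np.abs(self.pos)      # toy objective: stay near the origin
        obs = np.stack([self.pos, self.vel], axis=1)
        return obs, reward
```

Scaling `num_envs` from 1 to 4096 changes only an array dimension, which is why throughput, not algorithmic novelty, is the catalyst here.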
Catalyst 3 — Teacher-student RL with history encoders (Chapter 6). The training recipe that lifts an RL policy from simulation-only to hardware-deployable: train a teacher with privileged access to ground-truth terrain, payload, and dynamics; distill its behavior into a student that has only proprioceptive observations and uses a learned history encoder (TCN → LSTM → Transformer) to infer the privileged context from recent state-action history [8]. Without the teacher-student recipe, the policy cannot close the sim-to-real loop; without the history encoder, it cannot adapt online without retraining. Teacher-student with history encoders is the adaptive policy class. Chapter 6 develops this claim.
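A shape-level sketch of the recipe, assuming linear stand-ins for every network (all dimensions and weights below are illustrative): the teacher consumes privileged context directly, while the student must reconstruct a latent from its recent state-action history and is supervised on the teacher's action.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: proprioceptive obs, privileged context (terrain,
# payload, dynamics), history length, inferred latent, action.
OBS, PRIV, HIST, LATENT, ACT = 32, 8, 20, 8, 12

W_teacher = rng.normal(size=(OBS + PRIV, ACT)) * 0.1            # frozen, pre-trained
W_encoder = rng.normal(size=(HIST * (OBS + ACT), LATENT)) * 0.01  # history encoder
W_student = rng.normal(size=(OBS + LATENT, ACT)) * 0.1

def teacher_action(obs, priv):
    """Teacher sees ground-truth privileged context directly."""
    return np.concatenate([obs, priv]) @ W_teacher

def student_action(obs, history):
    """Student infers a privileged latent z_hat from state-action history."""
    z_hat = history.reshape(-1) @ W_encoder
    return np.concatenate([obs, z_hat]) @ W_student

# Distillation: supervise the student's action on the teacher's action.
obs, priv = rng.normal(size=OBS), rng.normal(size=PRIV)
history = rng.normal(size=(HIST, OBS + ACT))
loss = np.mean((student_action(obs, history) - teacher_action(obs, priv)) ** 2)
```

In the real recipe the encoder is a TCN, LSTM, or causal Transformer rather than a flattening matmul, and the loss is minimized by gradient descent over rollouts; the data flow, however, is exactly this.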
Catalyst 4 — Sim-to-real correction (Chapter 7). The portfolio of techniques that let a simulated policy execute on hardware: wide-distribution domain randomization, system identification plus actuator networks, and residual correction via learned delta-action models [14]. Without sim-to-real correction, even a GPU-trained policy with a teacher-student structure is a laboratory artifact. Sim-to-real is the deployment contract. Chapter 7 develops this claim.
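The first of those techniques is easy to make concrete. A hedged sketch of a per-environment domain-randomization sampler (parameter names and ranges are illustrative placeholders, not a published recipe):

```python
import numpy as np

# Each parallel environment draws its own physics parameters from wide
# distributions, so the trained policy must cover the whole distribution
# rather than overfit one simulator instance. Ranges are illustrative.
def sample_dynamics(rng):
    return {
        "friction":           rng.uniform(0.3, 1.5),
        "added_mass_kg":      rng.uniform(-1.0, 3.0),   # payload perturbation
        "motor_strength":     rng.uniform(0.8, 1.2),    # actuator scaling
        "action_delay_steps": int(rng.integers(0, 4)),  # control latency
    }

rng = np.random.default_rng(42)
params_per_env = [sample_dynamics(rng) for _ in range(4096)]
```

The coupling to Catalyst 2 is direct: wide randomization only works when thousands of environments can each carry different dynamics at once.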
The dependencies among the four read cleanly in both directions. Read forward: QDD provides the honest substrate that GPU-simulated policies can train against; GPU sim provides the sample volume that makes teacher-student distillation tractable; teacher-student with history encoders produces a policy that is robust enough for sim-to-real correction to close the remaining gap; sim-to-real correction is what makes the learned policy actually run on the physical robot. Read backward: if you remove any one catalyst, the whole chain breaks at a specific point. Remove QDD: policies learn torques that the hardware cannot honestly execute. Remove GPU sim: the teacher has insufficient sample coverage to train, and the student inherits its gaps. Remove teacher-student: the policy has no adaptive context, and DR alone collapses on the long tail of the environment distribution. Remove sim-to-real correction: the policy simulates well and deploys poorly.
This is why the transition took from 2015 to 2023: the four catalysts matured on independent trajectories, and only their simultaneous presence produced a paradigm that could actually ship humanoid locomotion. Hwangbo et al.'s 2019 paper [8] is the first rigorous demonstration of the combined stack on a legged robot (ANYmal, a quadruped); Radosavovic et al.'s 2024 work on Digit [12] is the first rigorous demonstration on a full-size humanoid. Radosavovic's dissertation and public talks [13] narrate the same transition from a single lab's point of view and are a useful historical primary source.
3.3 The historical timeline
The regime change is more legible as a timeline than as a dependency graph. Two windows dominate:
Window A — 2015–2021: catalysts become independently viable. In 2015, none of the four catalysts was commonly available. MIT's Cheetah [3] was the QDD reference but had not yet spread beyond a handful of labs; GPU-parallel simulation was a research prototype; deep RL for robotics lived in the narrow-task regime catalogued by Kober 2013 [16]; sim-to-real was aspirational. The 2015 DARPA Robotics Challenge (Chapter 1) put the orthodox stack's ceiling on public display. Between 2015 and 2021, each catalyst crossed its own viability threshold, roughly in this order: QDD became reproducible in academia with the Mini Cheetah platform [2] and Unitree's early quadrupeds; GPU sim became standard with Isaac Gym [4]; teacher-student with history encoders became the dominant RL recipe via Hwangbo 2019, Lee 2020, Kumar 2021, and Siekmann 2021; sim-to-real correction consolidated across DR and actuator-network approaches.
Window B — 2021–2026: integration and humanoid deployment. Rudin et al.'s 2021 CoRL paper [5] was the demonstration that all four catalysts could combine into a single legible pipeline — "ANYmal in minutes" — at which point the field's attention pivoted from "can this work?" to "how far can this scale?". Between 2021 and 2023, the recipe moved up the embodiment hierarchy: from quadrupeds (ANYmal, Go1, A1) to small bipedal platforms (Cassie, Digit) to full-size humanoids. Radosavovic et al.'s 2024 Science Robotics paper [12] is the signal event — fully learned humanoid locomotion on Digit, including a 1 km outdoor walk on unfamiliar terrain. By 2024–2026, the stack had reached its current maturity: Figure's Helix and Helix 02 production systems [21, 22], AgiBot's GO-1 and GO-2 [23, 24], Agility's Motor Cortex [25], Unitree's open unitree_rl_gym [26], and the extension from locomotion into the loco-manipulation frontier (Chapter 10).
What did not happen during either window is equally instructive. No single academic group "invented" the paradigm shift; it is the product of at least a dozen groups moving partially independently, with the Berkeley-Malik-Sreenath-Darrell axis, the ETH Hutter group, the IHMC Hurst-Pratt lineage, the MIT Kim lab, the NVIDIA Isaac team, and the Boston Dynamics internal program each contributing irreplaceable pieces. Recent survey work in the Annual Review of Control, Robotics, and Autonomous Systems [17] and in Gu et al.'s 2025 arXiv survey [18] documents this multi-group convergence at the appropriate academic altitude; this chapter's contribution is not the timeline itself but the interdependency map that explains why the timeline had to take the shape it did.
3.4 The Part II catalyst verdicts
This section consolidates the "solved / partially solved / still open" verdict for each of the four catalysts. Each verdict is drawn from the gap analysis that frames this book's Part II closing sections; the same verdicts appear in compressed form at the end of Chapters 4, 5, 6, and 7. The table below is quote-ready for use in industrial-strategy discussions.
Verdict 1 — QDD actuators: Solved (commodity). The MIT Cheetah design principles — outer-rotor BLDC, low-ratio planetary, motor-current torque estimation, Impact Mitigation Factor (IMF) as metric [1] — are now reproduced across commercial platforms (Unitree G1 at US$16,000; Berkeley Humanoid; academic ToddlerBot at under US$6,000). The hardware primitive is commoditized. What remains open is thermal and current calibration at industrial duty cycles, and custom integration with tactile sensing at the fingertip (Figure 03's 3-gram fingertip load cell is an instance). No open architectural question remains. Chapter 4 closes on this verdict.
Verdict 2 — GPU massively parallel simulation: Solved (standardized). Isaac Gym [4], Rudin's legged_gym recipe [5], Isaac Lab / Orbit [6], MuJoCo MJX and Playground [7], Humanoid-Gym [Gu et al., 2024], Booster Gym, and Genesis collectively reduce training from days to minutes at million-environment-steps-per-second throughput. Zakka et al. 2025 report sim-to-real walking within 15 minutes of single-GPU training. What remains open is contact-accuracy fidelity for manipulation and deformable/fluid regimes — not the RL throughput problem. Differentiable simulation [19] is the adjacent research frontier that may unlock additional sample efficiency. Chapter 5 closes on this verdict.
Verdict 3 — Teacher-student RL with history encoders: Partially solved. The canonical recipe — Hwangbo 2019 actuator network, Lee 2020 teacher-student, Kumar 2021 RMA, Siekmann 2021 Cassie, Radosavovic 2024 causal Transformer — is reproducible and deployable across Cassie, Digit, Unitree G1 and H1, Booster T1, and Berkeley Humanoid. The history encoder moved TCN → LSTM → Transformer. What remains open: (a) the context-length and latency trade-off for on-board inference, (b) multi-skill unification beyond Radosavovic's single-transformer distillation scope, (c) language-conditioning of the System 0/1 policy without losing 1 kHz real-time guarantees. Chapter 6 closes on this verdict.
Verdict 4 — Sim-to-real correction: Partially solved. Three strategies co-exist and stack: domain randomization (the dominant default), system identification plus actuator networks (Hwangbo 2019 style), and residual / delta-action correction (ASAP [15] — 53% RMSE reduction on agile motions via 20-minute real-data fine-tuning). For bounded-contact locomotion, the problem is closed. For contact-rich manipulation, it is partially addressed: ManipTrans [28] demonstrates residual-learning-based bimanual dexterous-manipulation transfer, a concrete 2025 counter-example to earlier claims that bimanual residual-action was entirely unexplored; OmniRetarget-style interaction-mesh approaches [27] remain strongest when object meshes are available. The remaining frontier is contact-rich bimanual manipulation without prior object meshes and without per-task demonstration data — still the key bottleneck that Chapter 7 and Chapter 15 point at, and the central open problem that Part V argues Korean manufacturing is unusually well-positioned to attack. Chapter 7 closes on this verdict.
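The residual / delta-action idea behind the ASAP entry in Verdict 4 reduces to a small pattern: freeze the sim-trained policy and fine-tune only an additive correction on real data. A minimal sketch with linear stand-ins for both networks (shapes and weights are illustrative, not any published model):

```python
import numpy as np

rng = np.random.default_rng(1)
OBS, ACT = 16, 8

W_policy = rng.normal(size=(OBS, ACT)) * 0.1  # frozen sim-trained policy
W_residual = np.zeros((OBS, ACT))             # correction, fine-tuned on real data

def corrected_action(obs):
    base = obs @ W_policy       # what simulation training produced
    delta = obs @ W_residual    # learned real-world delta-action correction
    return base + delta
```

Initializing the residual at zero means deployment starts exactly at the simulated policy's behavior; the minutes-scale real-data budget then moves only the small correction term, never the frozen base.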
Readers who want the strategic upshot can take the four verdicts as a single sentence: locomotion is a solved primitive as of 2024–2026; the remaining frontier is dexterous manipulation, and the remaining frontier's economic value is concentrated in industrial deployment. Chapter 15 develops this strategic upshot through the Manufacturing Physical AI lens; Chapter 16 converts the upshot into a staged diffusion scenario.
3.5 Reader paths by persona
The book is written simultaneously for four audiences. Each persona gets a distinct entry trajectory and a distinct take-away. The four trajectories are deliberately non-linear through Parts II–IV, because the value proposition differs per persona.
Persona A — the humanoid engineer (practitioner at Figure, Agility, Unitree, AgiBot, Boston Dynamics, 1X, Fourier, Rainbow, or the next-tier frontier integrators): read Chapter 3 (this map), then Chapter 9 (System 0/1/2 architecture), then Chapters 11–13 (frontier-company deep-dives for competitive intelligence), then return to Chapters 4–7 (catalyst re-grounding), then Chapter 15 (four differentiation axes as a strategic menu). Take-away: a calibrated judgment of which internal architectural choices sit on the convergent path and which are off-trend and need justification, plus a structural vocabulary for internal architecture discussions.
Persona B — the robotics researcher (robotics or embodied-AI PhD student, postdoc, or faculty; probably CoRL/RSS/ICRA/IROS/Science Robotics/NeurIPS): read Chapter 3 (overview), then Chapter 8 (modern theory primer — RL, Transformers, diffusion policy, VLA), then Chapter 6 (learning canon), then Chapter 7 (sim-to-real), then Chapter 10 (VLA), and use the gap analysis as a companion for scoping thesis topics. Take-away: a reasoned assessment of which of the four catalysts are solved, partially solved, or still open, plus a gap inventory with short / medium / long-term tags, plus exposure to Korean ecosystem work that is systematically under-covered in English-language reviews.
Persona C — the manufacturing strategist (strategy, policy, or investment professional at a Korean conglomerate, government agency, VC firm, or major research hospital; reads technical literature selectively; lives with P&L, five-year capex plans, and technology-readiness assessments): read the front-matter executive summary, then Chapter 3 (this chapter, one sitting), then Chapters 14–16 (Korea + differentiation + diffusion), then Chapters 11–13 (frontier companies as competitive context), then Parts II–III on demand for technical depth. Take-away: a diagnosis that the Korean ecosystem is hardware-credible but VLA-data-behind and onboard-foundation-model-behind; a Manufacturing Physical AI framework that prescribes which four differentiation axes Korea must own to avoid being a commodity hardware supplier; and a staged diffusion scenario that sequences the investment.
Persona D — the informed technical reader (software engineer, ML practitioner, science journalist, or technically literate decision-maker who has read popular AI coverage but wants one level deeper): read linearly front to back. Parts I–III are narrative; Part IV is case-study; Part V is policy and future. Take-away: a coherent account from 2003 Kajita to 2026 Figure Helix 02 in one sitting, enough vocabulary to read the next NVIDIA GR00T blog post or Figure announcement without losing the thread, and an honest distinction between demonstrated, plausibly extrapolated, and marketing.
3.6 What is not on the map
The map has boundaries. Three kinds of work are not catalysts of the paradigm shift, even though they feature prominently in the literature, and understanding the exclusions sharpens the thesis.
Imitation learning from teleoperation is not catalyst 5. It is a data-acquisition strategy — essential for manipulation (Chapter 10), central to Figure's and AgiBot's production stacks — but it does not displace the learned-policy paradigm that the four catalysts already define. Teleoperation data enters the stack as behavior-cloning initialization or as reward shaping; it is an input to the catalysts, not an alternative.
Diffusion policy is not catalyst 5. It is an action-decoder architecture (Chapter 8) that happens to be a productive choice for manipulation specifically. It sits inside System 1; it does not restructure the three-layer architecture or displace the catalysts. The Chi et al. 2023 work [20] is a notable contribution to this decoder family.
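What "action decoder" means here can be shown in a few lines: actions are produced by iteratively denoising from Gaussian noise. The sketch below uses a random linear stand-in for the trained noise-prediction network and illustrative noise-schedule constants; it shows the shape of the inference loop, not a working policy.

```python
import numpy as np

rng = np.random.default_rng(0)
ACT_DIM, STEPS = 7, 10                    # illustrative action dim and schedule length
betas = np.linspace(1e-4, 0.02, STEPS)    # toy DDPM-style noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

W_eps = rng.normal(size=(ACT_DIM, ACT_DIM)) * 0.1  # stand-in noise predictor

def denoise(action_noisy, t):
    """One reverse-diffusion step: subtract the predicted noise and rescale."""
    eps_hat = action_noisy @ W_eps
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    return (action_noisy - coef * eps_hat) / np.sqrt(alphas[t])

a = rng.normal(size=ACT_DIM)              # start from pure noise
for t in reversed(range(STEPS)):
    a = denoise(a, t)                     # iteratively refine toward an action
```

In a real diffusion policy the noise predictor is conditioned on visual observations and outputs a short action chunk; the iterative-refinement structure is what distinguishes it from a one-shot regression head, and it sits entirely inside System 1.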
Vision-Language-Action (VLA) models are not catalyst 5. They are what the paradigm shift enabled. Once locomotion is a solved primitive and the three-layer architecture is in place, VLA fits at System 2 and begins to absorb manipulation and language. Without the four catalysts there would be no System 0/1 substrate for VLAs to sit on. Chapter 10 develops this.
The distinction matters because it clarifies what the book is for. This book is not a catalogue of every technique in 2020s humanoid research; it is an argument that a specific four-catalyst conjunction produced a regime change, and that the consequences of the regime change are still unfolding in three specific directions (architecture, company strategy, Korean deployment). Chapter 2 audited what the old regime bequeathed; Chapter 3 has now mapped the new regime. From Chapter 4 onward the book examines each catalyst in turn and then turns, in Parts III–V, to what the catalysts enable rather than what they rested on.
3.7 Open questions
Three meta-questions close this chapter and reappear throughout the book.
First, is there a fifth catalyst we are missing? Candidates include differentiable physics (see Schwarke et al.'s 2024 work demonstrating 10–100× fewer simulation steps than PPO baselines on quadrupeds [19]), neural physics / reward modelling, self-play or RL from human feedback at scale, or the maturation of whole-body foundation models trained on internet-scale human motion. Each is a credible candidate; none has yet demonstrated the cross-cutting necessity that the four original catalysts did. Chapter 15 argues that onboard VLA, fleet learning, and cross-embodiment transfer are the three axes most likely to produce a fifth catalyst in the 2026–2030 window.
Second, how does the paradigm change travel across domains? The catalysts were developed for locomotion. Manipulation — and especially dexterous contact-rich manipulation — shares some structure with locomotion (teacher-student, DR, history encoders, QDD-equivalent hand actuators) but differs in others (contact distributions are not samplable from a parametric prior, object geometry is unbounded, grasp topology is discrete). Chapter 10 engages this question for VLA; Chapter 15 engages it for manufacturing deployment; neither finds a complete answer. Gap 1 in our analysis is the crispest statement of the open problem.
Third, what does the paradigm leave unsolved at the foundational level? Safety and formal guarantees (Gap 6), benchmark fragmentation (Gap 7), energy efficiency (Gap 8), reproducibility (Gap 11), and architectural-interface standardization (Gap 13) are all unsolved at the time of writing. None of these is a consequence of any single catalyst; all of them are consequences of the paradigm shift's rapid and uncoordinated industrial adoption. Chapters 11–13 engage these questions per-company; Chapters 14–16 engage them per-country.
With the map in hand, Part II now dives into the catalysts individually. Chapter 4 starts with QDD hardware, because QDD is the substrate that makes everything above it possible.
References
- Wensing, P. M., Wang, A., Seok, S., Otten, D., Lang, J., & Kim, S. (2017). Proprioceptive actuator design in the MIT Cheetah: Impact mitigation and high-bandwidth physical interaction for dynamic legged robots. IEEE Transactions on Robotics. (Full details in Chapter 4.)
- Katz, B., Di Carlo, J., & Kim, S. (2019). Mini Cheetah: A platform for pushing the limits of dynamic quadruped control. Proc. IEEE ICRA.
- Seok, S., et al. (2013). Design principles for highly efficient quadrupeds and implementation on the MIT Cheetah robot. Proc. IEEE ICRA.
- Makoviychuk, V., et al. (2021). Isaac Gym: High performance GPU-based physics simulation for robot learning. NeurIPS.
- Rudin, N., Hoeller, D., Reist, P., & Hutter, M. (2021). Learning to walk in minutes using massively parallel deep reinforcement learning. Proc. CoRL.
- Mittal, M., et al. (2023). Orbit: A unified simulation framework for interactive robot learning environments. (Now Isaac Lab.)
- Zakka, K., et al. (2025). MuJoCo Playground: A unified platform for robot learning.
- Hwangbo, J., et al. (2019). Learning agile and dynamic motor skills for legged robots. Science Robotics.
- Lee, J., Hwangbo, J., Wellhausen, L., Koltun, V., & Hutter, M. (2020). Learning quadrupedal locomotion over challenging terrain. Science Robotics.
- Kumar, A., Fu, Z., Pathak, D., & Malik, J. (2021). RMA: Rapid motor adaptation for legged robots. Proc. RSS.
- Siekmann, J., Godse, Y., Fern, A., & Hurst, J. (2021). Blind bipedal stair traversal via sim-to-real reinforcement learning. Proc. RSS.
- Radosavovic, I., Xiao, T., Zhang, B., Darrell, T., Malik, J., & Sreenath, K. (2024). Real-world humanoid locomotion with reinforcement learning. Science Robotics.
- Radosavovic, I. (2024). From catalysts to convergence: A paradigm shift in humanoid robotics. UC Berkeley EECS dissertation and public talks.
- Tobin, J., et al. (2017). Domain randomization for transferring deep neural networks from simulation to the real world. Proc. IROS.
- He, T., et al. (2025). ASAP: Aligning simulation and real-world physics for learning agile humanoid whole-body skills.
- Kober, J., Bagnell, J. A., & Peters, J. (2013). Reinforcement learning in robotics: A survey. International Journal of Robotics Research. doi:10.1177/0278364913495721.
- Tang, C., Abbatematteo, B., & Hu, J. (2025). Deep reinforcement learning for robotics: A survey of real-world successes. Annual Review of Control, Robotics, and Autonomous Systems. doi:10.1146/annurev-control-030323-022510. arXiv:2408.03539.
- Gu, Z., Li, J., & Shen, W. (2025). Humanoid locomotion and manipulation: Current progress and challenges in control, planning, and learning. arXiv preprint 2501.02116.
- Schwarke, C., Klemm, V., & Tordesillas, J. (2024). Learning quadrupedal locomotion via differentiable simulation. Proc. CoRL. arXiv:2403.14864.
- Chi, C., et al. (2023). Diffusion policy: Visuomotor policy learning via action diffusion. Proc. RSS.
- Figure AI. (2025). Helix: A vision-language-action model for generalist humanoid control. Figure AI tech blog, February 2025. https://figure.ai
- Figure AI. (2026). Helix 02: Fully-onboard VLA with System 0. Figure AI announcement, January/February 2026. https://figure.ai
- AgiBot. (2025). AgiBot World Colosseo: A large-scale manipulation platform. arXiv preprint.
- AgiBot. (2026). GO-2: Asynchronous dual-system humanoid control. ACL 2026.
- Agility Robotics. (2025). Motor Cortex: An always-on safety layer for Digit. https://agilityrobotics.com
- Unitree Robotics. (2024). Unitree G1 humanoid platform and `unitree_rl_gym`. Unitree product release.
- Yang, H., et al. (2025). OmniRetarget: Interaction-preserving data generation for humanoid whole-body loco-manipulation and scene interaction. arXiv preprint 2509.26633.
- Li, K., et al. (2025). ManipTrans: Efficient dexterous bimanual manipulation transfer via residual learning. Proc. CVPR. arXiv:2503.21860.