Part I: The Old Stack and Its Legacy

Chapter 2: Foundations That Still Matter

Written: 2026-04-24 · Last updated: 2026-04-24

2.1 Why revisit the old foundations

Chapter 1 ended with a claim: the orthodox LIPM/ZMP/QP stack was insufficient, not wrong. The distinction is not rhetorical. If the orthodox stack had been wrong, a new humanoid engineer in 2026 could safely ignore Kajita 2003, treat Westervelt's hybrid zero dynamics as a curiosity, and train policies directly against pixels. In fact, that engineer cannot. Every production humanoid stack in 2026 contains orthodox primitives in three load-bearing places: at the joint level as System 0's PD or torque tracker, in reward design for the System 1 policy, and in the safety monitor that arbitrates when the learned policy becomes untrustworthy. This chapter is a working inventory of those primitives.

Two other motivations justify the chapter. First, it is a bridge for the reader. Chapter 8 will provide the "new foundations" — policy gradients, Transformers, diffusion policy, VLA. Readers who arrive at this book already knowing only one of the two foundations (classical control, or modern RL) will lean on this chapter and Chapter 8 as a remedial pair. Second, it is an honest separation of what persists from what does not. The field has too often framed the transition as a clean break; the truth, read from the code that actually ships, is that the orthodox primitives persist in precisely characterized roles.

The chapter proceeds as an audit of five survivors: (1) LIPM as a reward-shaping template, (2) the whole-body QP and its modern MPC descendants, (3) the capture point as an implicit safety envelope, (4) hybrid zero dynamics as the conceptual parent of clock-phase reward structures, and (5) the centroidal dynamics framework as the language of whole-body objectives. A sixth section discusses the classical state of reinforcement-learning-in-robotics circa 2013 — the world immediately before Parts II and III — so the reader can calibrate what the four catalysts actually changed. The chapter closes with the operational rule that organizes the rest of Parts II–V: use the old primitives where they are provably correct; use the new primitives where distributional coverage is the only available guarantee.

2.2 LIPM as a reward-shaping template

The LIPM's canonical use in the orthodox era was as a generative model: given a ZMP reference, compute a CoM trajectory. That use has largely been retired. The LIPM's second role — as a low-dimensional template that a higher-dimensional controller can be shaped to respect — has not. In modern humanoid RL, LIPM survives inside the reward function.

Consider a typical reward decomposition in humanoid locomotion RL, as the teacher-student canon of Chapter 6 develops it. There is a velocity-tracking term, a smoothness term, an energy-use term, and a template-tracking term. The template-tracking term rewards the policy for keeping the commanded CoM trajectory close to what a LIPM-based controller would produce for the same footstep command. The policy is not required to obey this template; it is nudged toward the template, and the optimizer, given a wide training distribution, discovers deviations from the template that the template itself could not have predicted — a foot scuff that recovers from a misfit contact, an arm swing that compensates for a payload shift. The template seeded the exploration; the policy improved on it.
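
The template-tracking term can be sketched concretely. Below, a minimal sketch assuming the standard closed-form LIPM solution for a constant ZMP; the function names, the exponential kernel, and the weights are illustrative, not drawn from any particular production stack:

```python
import math

def lipm_com(x0, v0, p, z=0.9, g=9.81, t=0.0):
    """Closed-form LIPM CoM position at time t for a constant ZMP p:
    x(t) = p + (x0 - p) cosh(wt) + (v0 / w) sinh(wt), with w = sqrt(g/z)."""
    w = math.sqrt(g / z)
    return p + (x0 - p) * math.cosh(w * t) + (v0 / w) * math.sinh(w * t)

def template_tracking_reward(com_actual, com_template, scale=50.0):
    """Soft tracking term: 1.0 at perfect agreement with the LIPM template,
    decaying with squared error. A nudge, not a constraint."""
    err = com_actual - com_template
    return math.exp(-scale * err * err)
```

The policy is free to earn less than the maximum template reward whenever deviating buys more on the velocity-tracking or recovery terms; that freedom is the whole point of the soft formulation.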

This pattern — classical model as a learnable prior — has a formal analogue in the learning literature: the DeepMimic reward structure [10], in which a reference motion (e.g., a mocap trajectory) enters the reward as a tracking term and the policy is free to deviate. Applying that pattern with LIPM instead of mocap is what several production stacks do for the locomotion portion of the reward, and references to this pattern appear across Chapter 6's canon. The modern update to this approach — OmniRetarget 2025, which generates reference motions with interaction-preserving retargeting — is itself an evolution of the same template-conditioning idea, just with richer templates. The underlying logic is that a well-chosen low-dimensional template makes exploration tractable in a high-dimensional action space; LIPM is still a well-chosen template for humanoid balance.

Kajita et al.'s textbook [3] remains the standard reference for the LIPM derivation, and the preview-control construction of [1] still appears in appendices of modern papers that explain what the policy is being shaped toward. The textbook material is not obsolete; it is prerequisite.

2.3 The whole-body QP and its MPC descendants

The whole-body QP from Chapter 1 — the object that computes joint torques at 1 kHz subject to friction cones, joint limits, and centroidal constraints — has not disappeared. It has been promoted to two distinct roles and partially generalized in a third.

Role one: System 0 torque tracker. In every System 0/1/2 humanoid stack of 2024–2026, the learned layers above System 0 emit desired joint positions or desired joint torques, and a classical tracker realizes those at 1 kHz. When the tracker is position-based, it is a PD controller; when it is torque-based, it is a per-joint regulator that lives inside the motor driver; when it is whole-body, it is a QP that distributes desired task-space wrenches across available joints while respecting friction cones and torque limits. Figure's Helix 02 System 0 is a learned 10M-parameter network running at 1 kHz [17], but that learned S0 can be re-read as a function approximator for a whole-body QP with randomized constraints — it absorbs the QP's structure into weights. The recent survey by Wensing, Posa, and Hu [8] documents how modern optimization-based control for dynamic legged robots retains the QP primitive alongside its learned generalizations.
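
The position-based variant of this tracker is, at its core, a one-line control law per joint. A minimal sketch (the gains are illustrative; real trackers add feedforward terms, filtering, and torque limits):

```python
def pd_torque(q, dq, q_des, kp=80.0, kd=2.0):
    """Per-joint PD law: stiffness toward the desired position plus damping
    on joint velocity. This is the classical System 0 primitive that a
    learned S0 would be absorbing into its weights."""
    return kp * (q_des - q) - kd * dq
```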

Role two: safety filter. When a learned policy proposes an action that would violate joint limits, friction cones, or self-collision, a classical QP-based safety filter can project the learned action onto the feasible set. This pattern — policy proposes, filter disposes — is explicit in Agility Robotics's public description of Motor Cortex as an "always-on safety layer" [15], and it is the architectural pattern under which Boston Dynamics frames its hybrid MPC + RL work [16]. The filter is a QP because a QP is the provably correct thing to put beneath a policy whose own guarantees are distributional. Chapter 6 will show that certain learned policies can be trained to tend toward feasibility; Chapter 11 argues that a guarantee is still worth having beneath them.
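
For decoupled constraints the projection has a closed form, which makes the pattern easy to illustrate; coupled constraints (friction cones, self-collision) require an actual QP solver such as OSQP or qpOASES. A minimal sketch with illustrative names:

```python
def project_onto_limits(tau_proposed, tau_min, tau_max):
    """Solve min ||tau - tau_proposed||^2 subject to box torque limits.
    With only per-joint box constraints, the QP's exact solution is
    per-joint clipping: the policy proposes, the filter disposes."""
    return [min(max(t, lo), hi)
            for t, lo, hi in zip(tau_proposed, tau_min, tau_max)]
```

The projection is the provable half of the architecture: whatever the learned policy emits, the action that reaches the motors is feasible by construction.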

Role three: whole-body MPC. The generalization of the whole-body QP to a receding-horizon problem with dynamics constraints is whole-body model predictive control (MPC). Koenemann, Del Prete, and Tassa's 2015 work [6] implemented the first real-time whole-body MPC on the 27-DoF HRP-2 humanoid using a differential dynamic programming (DDP) solver. Each DDP iteration took roughly 50 ms on an offboard 12-core desktop (≈20 Hz fast-time replanning, feeding a 200 Hz onboard position-control loop over a 20 ms trajectory time step) — a late-orthodox-era demonstration of what MPC could do when the dynamics model was allowed to be nonlinear. That line of work has continued (Crocoddyl, Pinocchio's DDP extensions, and the whole-body MPC toolchains used internally at Boston Dynamics and by academic groups at Inria and ETH). In 2026, whole-body MPC is the specific technology Boston Dynamics claims to be augmenting, not replacing, with RL [16]: the MPC produces a high-quality action; a learned policy produces a candidate; a fusion layer — sometimes a QP, sometimes a learned arbiter — chooses. The Wensing et al. survey [8] maps the 2022–2024 state of optimization-based legged control with roughly 200 references, and it is the canonical modern successor to the Kajita 2014 textbook for readers who want the control-theoretic side of the modern stack.
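
The receding-horizon structure can be illustrated far below whole-body scale, on the LIPM itself. The sketch below rolls out candidate constant ZMP placements over a short horizon and keeps the one whose terminal capture point lands closest to the support center, a toy stand-in for the DDP machinery of [6] (all names and numbers are illustrative):

```python
import math

def lipm_step(x, v, p, dt=0.01, z=0.9, g=9.81):
    """One Euler step of the LIPM: xdd = (g/z) * (x - p)."""
    a = (g / z) * (x - p)
    return x + dt * v, v + dt * a

def mpc_zmp(x, v, candidates, horizon=20, dt=0.01, z=0.9, g=9.81):
    """One receding-horizon decision: simulate each candidate ZMP and pick
    the one minimizing the terminal capture-point distance to the support
    center (here, the origin). Re-run every control tick."""
    w = math.sqrt(g / z)
    best, best_cost = None, float("inf")
    for p in candidates:
        xs, vs = x, v
        for _ in range(horizon):
            xs, vs = lipm_step(xs, vs, p, dt, z, g)
        cost = abs(xs + vs / w)        # terminal capture-point penalty
        if cost < best_cost:
            best, best_cost = p, cost
    return best
```

Real whole-body MPC replaces the scalar LIPM with full nonlinear dynamics and the grid search with a gradient-based solver, but the replan-every-tick shape is the same.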

The implication for book-writer and reader alike is that Chapter 11's Boston Dynamics deep-dive cannot be read without Chapter 2's QP and MPC inventory. BD's hybrid philosophy is not nostalgia; it is a bet that the QP's provable correctness is worth keeping, and that the RL's distributional coverage is a complement rather than a replacement. Whether that bet pays off versus Figure's end-to-end learning bet is, effectively, the Part IV contest.

2.4 The capture point as an implicit safety envelope

The capture point from Section 1.4, derived from LIPM energy analysis, has two modern lives. Explicitly, it appears inside MPC cost functions as a terminal-state penalty: the receding-horizon optimization is encouraged to end each horizon in a state where the capture point lies inside the support polygon, which is a principled near-balance condition even when the horizon is short. Koenemann et al. 2015 illustrates this use directly — the whole-body reaching experiment on HRP-2 includes a "keep the capture point at the center of support polygon" cost term and plots the residual capture-point cost over time [6]. The pattern survives in several production MPC implementations, including those documented in the Wensing survey [8].
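
The terminal condition is cheap to state in code. A one-dimensional sketch, assuming the standard LIPM capture-point formula xi = x + v/w with w = sqrt(g/z); the function names are illustrative:

```python
import math

def capture_point(x_com, v_com, z_com=0.9, g=9.81):
    """Instantaneous capture point xi = x + v/w (LIPM, sagittal plane)."""
    w = math.sqrt(g / z_com)
    return x_com + v_com / w

def capturable(x_com, v_com, support_min, support_max, **kw):
    """True if the capture point lies inside the (1-D) support interval —
    the near-balance terminal condition an MPC horizon is pushed toward."""
    xi = capture_point(x_com, v_com, **kw)
    return support_min <= xi <= support_max
```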

Implicitly, the capture point is what learned policies tend toward. Consider the Siekmann et al. 2021 Cassie stair-climbing RL paper [13], which Chapter 6 discusses in detail: the reward included a clock-phase term that, combined with a velocity-tracking term, effectively penalized states from which the LIPM-derived capture point would lie outside the support polygon. The learned policy was not explicitly told about capture points, but the reward surface was shaped to penalize deviations from capture-point-feasibility, and the resulting behavior looked like capture-point reactive stepping because the underlying dynamics selected for it. Similarly, the 2024 transformer-history-encoder policies on Digit [14] recover from disturbances with foot placements that are, statistically, close to what a capture-point planner would have produced — not because the policy computed the capture point, but because the training distribution was wide enough that capture-point-feasible policies achieved higher reward.

This is the first example of a pattern that recurs throughout the book: the orthodox primitive becomes the training objective rather than the runtime algorithm. LIPM is used to shape rewards rather than to produce CoM trajectories; capture point is used to penalize near-fall states rather than to place footsteps; ZMP is used to train the policy toward support-polygon feasibility rather than to issue ZMP tracking commands. The primitives are curricular rather than operational.

Chapter 11 will argue that Boston Dynamics's hybrid stack occupies the middle of this spectrum: capture-point reasoning is retained operationally inside MPC, and then RL is layered above to handle the cases MPC cannot. Chapter 12 will argue that Figure's stack sits further toward the learning end: capture-point reasoning is retained only in training, not in runtime. Both are coherent designs. The orthodox literature is necessary to articulate either.

2.5 Hybrid zero dynamics and clock-phase rewards

One orthodox primitive has had a particularly underappreciated second life: hybrid zero dynamics (HZD). Westervelt, Grizzle, and Chevallereau's 2007 book [4] and the Reher and Ames 2021 survey [5] document HZD as the formal framework for controlling underactuated bipeds. The central idea is to reduce the high-dimensional full-robot dynamics to a lower-dimensional "zero dynamics" by imposing a virtual constraint — typically that certain joint angles track functions of a phase variable that progresses through the gait cycle. The controller forces those virtual constraints to hold; the resulting closed-loop motion lives on a low-dimensional manifold, and the gait is exponentially stable when the Poincaré return map restricted to that manifold has an exponentially stable fixed point.

HZD is the intellectual parent of Oregon State's Cassie platform [11] — developed in Jonathan Hurst's lab — and through Cassie, of Agility Robotics's Digit. The bipedal humanoids that can walk efficiently with passive-knee dynamics (Cassie, Digit, and related academic platforms) mostly descend from HZD-informed design; their energy efficiency — passive ankles, energetically tuned gaits — is part of that inheritance.

HZD's second life is in reward design for RL. Siekmann et al. 2021 showed that adding a clock-phase variable to the RL observation, and rewarding the policy for tracking reference joint trajectories parameterized by that clock phase, yields sim-to-real transfer on Cassie stair traversal. The clock-phase reward is, in disguise, a learned relaxation of HZD's virtual constraints: instead of enforcing the constraint exactly, the policy is rewarded for approximately satisfying it, and deviations are penalized softly. The resulting gait is HZD-like in its periodicity but adaptive in its detailed joint profiles.
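
The softening is visible in a few lines. A sketch, with an invented sinusoidal knee reference standing in for an HZD virtual constraint (amplitude, offset, period, and scale are all illustrative):

```python
import math

def phase(t, period=0.8):
    """Gait clock: phase in [0, 1) advancing through each stride."""
    return (t % period) / period

def q_ref(phi, amp=0.4, offset=-0.7):
    """Illustrative phase-parameterized knee reference: the RL analogue
    of an HZD virtual constraint, indexed by clock phase."""
    return offset + amp * math.sin(2.0 * math.pi * phi)

def clock_phase_reward(q, t, scale=20.0):
    """Soft HZD: reward approximate satisfaction of the phase-indexed
    reference instead of enforcing the virtual constraint exactly."""
    err = q - q_ref(phase(t))
    return math.exp(-scale * err * err)
```

Classical HZD would drive `err` to zero by feedback; the RL version merely prices deviations, letting the optimizer trade constraint satisfaction against velocity tracking and recovery.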

This is the second recurrent pattern of the chapter: the orthodox framework is relaxed into a reward. Classical HZD enforces exact virtual constraints; the RL version softens enforcement into a penalty and lets the policy choose its own equilibrium between constraint-satisfaction and task performance. The outcome has been empirically successful across Cassie, Digit, and several humanoid platforms. It has also been invisible to casual readers of the RL papers, because the HZD lineage is rarely cited; the result is that practitioners import a classical framework without recognizing its name. This chapter is partly an attempt to remedy that invisibility.

2.6 Centroidal dynamics as the language of whole-body objectives

A thread ran through Chapter 1 without being named: centroidal dynamics. Centroidal dynamics describe the evolution of the robot's linear and angular momentum about its center of mass, and they form the right language for whole-body control objectives at the 100–200 Hz scale. The whole-body QP of Section 2.3 imposes centroidal-momentum constraints; the MPC of Koenemann et al. [6] tracks centroidal trajectories; the LIPM is a degenerate special case of centroidal dynamics under the constant-CoM-height assumption.

Why does this language matter for the learning era? Because the observation space of a modern humanoid RL policy typically includes a subset of centroidal quantities — base linear and angular velocity, gravity vector in the base frame, commanded forward velocity. Those observations are precisely the quantities that centroidal dynamics identify as the sufficient statistics for whole-body balance. The observation design inherits from the orthodox framework even when the policy's inference is learned. A reader who does not recognize this is prone to mis-diagnose what the policy is using; a reader who does recognize it reads the architectural choice as inheriting from a half-century of controls intuition.
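
Of these observations, the gravity vector in the base frame ("projected gravity") is the one computed rather than read directly off a sensor. A sketch under one common convention (world-from-base rotation R = Ry(pitch)·Rx(roll), world gravity (0, 0, -1); the signs flip under other frame conventions, so treat this as illustrative):

```python
import math

def projected_gravity(roll, pitch):
    """Unit gravity direction in the base frame, from IMU roll/pitch.
    Derived as R^T (0, 0, -1) with R = Ry(pitch) @ Rx(roll); yaw drops
    out, which is why this observation is yaw-invariant by construction."""
    return (math.sin(pitch),
            -math.sin(roll) * math.cos(pitch),
            -math.cos(roll) * math.cos(pitch))
```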

This observation-space inheritance has a practical consequence: it explains why RL policies trained on base-velocity-centric observations transfer across small humanoid variations. The centroidal quantities are approximately invariant under scaling of link masses and lengths, so a policy trained on one robot can be fine-tuned for a slightly different robot without discarding what it learned. Cross-embodiment transfer (Gap 4 in our analysis) begins here — the observation space's centroidal basis makes transfer possible, though, as Chapter 10 will discuss, it does not make transfer automatic.

2.7 Reinforcement learning before 2015

A brief but essential detour: what was the pre-deep-RL state of reinforcement-learning-in-robotics, and what did the four catalysts of Part II actually change? Kober, Bagnell, and Peters's 2013 survey in IJRR [7] is the canonical reference. The state of the art before 2015 was:

  • Policy search with expert-designed features. Representative methods — PILCO, PI², relative entropy policy search — operated on low-dimensional parameterized policies (typically a few dozen weights) using carefully engineered features. Deep-network policies were experimental.
  • Sample efficiency was the dominant constraint. Most algorithms needed tens to hundreds of real-robot trials per learned skill. Simulation-to-reality transfer was unreliable, so most learning happened on the real robot. That constraint capped skill complexity severely.
  • Tasks were narrow. Successful demonstrations were single-skill (ball-in-cup, peg-in-hole, specific manipulator trajectories). Whole-body humanoid learning was not credible.

What changed between 2015 and 2019 — and continued changing into 2024 — is that every one of these constraints lifted: deep networks became the default policy class (via TRPO, PPO, SAC, and successors), GPU-parallel simulation dropped the per-trial cost by several orders of magnitude, and domain-randomization-plus-history-encoders made sim-to-real transfer routine for locomotion. The gap from 2013's survey to 2019's landmark real-world legged-robot RL demonstration on ANYmal [12] is exactly the gap that Parts II and III span. Reading Kober 2013 alongside those parts is the fastest way to calibrate what the four catalysts genuinely introduced.

2.8 The operational rule: provable where possible, distributional elsewhere

The chapter's inventory supports a single operational rule that organizes the rest of the book: use orthodox primitives where they are provably correct; use learned primitives where distributional coverage is the only available guarantee. Expanded concretely:

  • System 0 (1 kHz joint tracking): classical PD or whole-body QP, with the orthodox guarantees intact. A learned version is acceptable only when trained to approximate the QP with randomized constraints (Figure's 10M-param S0 [17] is a learned version of this layer).
  • Safety monitor: classical, QP-based, provable. The Motor Cortex "always-on safety layer" [15] is the commercial instance.
  • System 1 (100–200 Hz policy): learned, with template-tracking rewards derived from LIPM, capture point, HZD, and centroidal-dynamics primitives. The templates make exploration tractable; the learning makes the policy adaptive.
  • System 2 (7–10 Hz language/planning): learned, pretrained on web-scale data, fine-tuned for the embodiment. No orthodox primitive lives here; this is purely the post-2022 foundation-model contribution.
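
The rate partition itself can be sketched as a single-tick scheduler. The 1 kHz / 100 Hz / 10 Hz divisors follow the layer frequencies above (taking the 10 Hz end of System 2's range); everything else is illustrative:

```python
def schedule(ticks, s0_hz=1000, s1_hz=100, s2_hz=10):
    """Fire each layer on its divisor of the 1 kHz base tick:
    S0 every tick, S1 every 10th tick, S2 every 100th tick."""
    fired = {"S0": 0, "S1": 0, "S2": 0}
    for tick in range(ticks):
        fired["S0"] += 1                     # classical tracker, provable
        if tick % (s0_hz // s1_hz) == 0:
            fired["S1"] += 1                 # learned policy, distributional
        if tick % (s0_hz // s2_hz) == 0:
            fired["S2"] += 1                 # foundation-model planner
    return fired
```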

This partition is the architectural thesis that Chapter 9 develops formally. Chapter 2's role is to establish why the partition has the shape it does: the orthodox primitives are correct at 1 kHz joint scale, available as curricular signals at 100–200 Hz scale, and absent above that.

The Gu et al. 2025 humanoid survey [9] is the most direct academic counterpart to this book's Parts I–III, and it provides a valuable external validation of the partition: Gu et al. organize their review around control, planning, and learning, and in their integration sections they arrive at an essentially compatible architectural picture. What this book adds beyond Gu et al. is Part IV's frontier-company technical analysis and Part V's Korean-manufacturing-physical-AI lens; what this book inherits from the broader academic corpus including Gu et al. is the partition that Chapter 2 has now articulated.

2.9 Open questions

Three questions survive into the rest of the book. First, how deeply does the orthodox scaffolding penetrate into what System 1 learns? A learned policy trained with LIPM-template rewards is not the same as a learned policy trained from scratch; the orthodox prior may leave fingerprints that turn out to be counterproductive for new tasks (e.g., dynamic manipulation, where constant-CoM-height is a strictly false assumption). Chapter 7's discussion of sim-to-real strategies will re-encounter this question, and Chapter 10, on VLAs, will confront it head-on when locomotion and manipulation share a single policy.

Second, is the QP truly irreplaceable at System 0? Figure's 10M-param learned S0 is an empirical existence proof that at least one end-to-end-learned system can replace the QP for locomotion. Whether this replacement generalizes — to manipulation, to cross-embodiment — is an open question. If the learned S0 turns out to need per-embodiment retraining, the QP's provable correctness will continue to justify its place; if the learned S0 generalizes, the QP may retreat to the role of safety filter only.

Third, what is the right formal language for what the learned policy is doing? Classical control provides Lyapunov theory, passivity, reachability. Learned policies resist these tools. Hybrid systems theory, control barrier functions, safe RL, and differentiable MPC are each candidate bridges; none is yet the dominant framework. The answer to this question determines whether the next decade of humanoid deployment is permitted in safety-critical domains or confined to caged fixtures (Gap 6 in our analysis). Chapter 15's discussion of Korean manufacturing deployment will re-engage this question from the regulatory side.

Chapter 3 now maps the four catalysts and their inter-dependencies. With Chapter 2's inventory in hand, the reader can see Chapter 3 not as the claim that the orthodox stack is obsolete, but as the claim that the orthodox stack's role has shifted — from carrying the entire humanoid on its shoulders, to being one of three indispensable layers in a stack where each layer is answerable to a different kind of correctness.

References

  1. Kajita, S., Kanehiro, F., Kaneko, K., Fujiwara, K., Harada, K., Yokoi, K., & Hirukawa, H. (2003). Biped walking pattern generation by using preview control of zero-moment point. Proc. IEEE ICRA. doi:10.1109/ROBOT.2003.1241826.
  2. Feng, S., Whitman, E., Xinjilefu, X., & Atkeson, C. G. (2014). Optimization-based full body control for the DARPA Robotics Challenge. Journal of Field Robotics. doi:10.1002/rob.21559.
  3. Kajita, S., Hirukawa, H., & Harada, K. (2014). Introduction to Humanoid Robotics. Springer. doi:10.1007/978-3-642-54536-8.
  4. Westervelt, E. R., Grizzle, J. W., & Chevallereau, C. (2007). Feedback Control of Dynamic Bipedal Robot Locomotion. CRC Press.
  5. Reher, J., & Ames, A. D. (2021). Algorithmic foundations of dynamic bipedal robots with an emphasis on underactuated locomotion. Annual Review of Control, Robotics, and Autonomous Systems. doi:10.1146/annurev-control-071020-032422.
  6. Koenemann, J., Del Prete, A., & Tassa, Y. (2015). A whole-body model predictive control framework for humanoid robots. Proc. IEEE/RSJ IROS. doi:10.1109/IROS.2015.7353596.
  7. Kober, J., Bagnell, J. A., & Peters, J. (2013). Reinforcement learning in robotics: A survey. International Journal of Robotics Research. doi:10.1177/0278364913495721.
  8. Wensing, P. M., Posa, M., & Hu, Y. (2024). Optimization-based control for dynamic legged robots. IEEE Transactions on Robotics. arXiv:2211.11644.
  9. Gu, Z., Li, J., & Shen, W. (2025). Humanoid locomotion and manipulation: Current progress and challenges in control, planning, and learning. arXiv preprint 2501.02116.
  10. Peng, X. B., Abbeel, P., Levine, S., & van de Panne, M. (2018). DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics 37(4). arXiv:1804.02717.
  11. Hurst, J. W. (2019). Cassie bipedal robot and the ATRIAS lineage. Agility Robotics / Oregon State University technical report.
  12. Hwangbo, J., Lee, J., Dosovitskiy, A., Bellicoso, D., Tsounis, V., Koltun, V., & Hutter, M. (2019). Learning agile and dynamic motor skills for legged robots. Science Robotics 4(26). doi:10.1126/scirobotics.aau5872. arXiv:1901.08652.
  13. Siekmann, J., Godse, Y., Fern, A., & Hurst, J. (2021). Blind bipedal stair traversal via sim-to-real reinforcement learning. Proc. RSS. arXiv:2105.08328.
  14. Radosavovic, I., Xiao, T., Zhang, B., Darrell, T., Malik, J., & Sreenath, K. (2024). Real-world humanoid locomotion with reinforcement learning. Science Robotics 9(89). doi:10.1126/scirobotics.adi9579. arXiv:2303.03381.
  15. Agility Robotics. (2025). Motor Cortex: An always-on safety layer for Digit. Agility Robotics technical announcement. https://agilityrobotics.com
  16. Boston Dynamics & RAI Institute. (2025). Electric Atlas reinforcement learning pipeline. BD–RAI partnership announcement, February 2025. https://bostondynamics.com
  17. Figure AI. (2026). Helix 02: Fully-onboard VLA with System 0. Figure AI announcement, January/February 2026. https://figure.ai