Part III: The 2026 Standard Stack

Chapter 8: Modern Theory Primer

Written: 2026-04-24. Last updated: 2026-04-24.

8.1 The purpose of this chapter

If you already know PPO, Transformers, diffusion policies, and VLAs from other domains, this chapter is a 30-minute recalibration. If you don't, it is a bridge: enough formal structure to read Parts II and III without flipping to other textbooks, and no more. The humanoid literature since 2017 is a story about importing these four machine-learning families into a 30–56-DoF, 1-kHz, 2-kW hardware reality — and the rest of this book is that importation audited.

Chapter 2 handled the old foundations: LIPM, ZMP, whole-body QP, MPC. Chapter 8 handles the new foundations; the two chapters are paired companions. Reader paths: an engineer trained in classical controls should treat Chapter 2 as review and Chapter 8 as new material; a deep-learning-trained researcher should treat Chapter 8 as review and Chapter 2 as new material. Either way, Chapters 9 and 10 assume both.

The chapter is deliberately not a technique catalog. Readers who want comprehensive technique coverage of the post-2017 learning-based humanoid stack should consult Gu, Li, and Shen's 2025 arXiv survey Humanoid Locomotion and Manipulation: Current Progress and Challenges in Control, Planning, and Learning [18] — it is the closest academic companion to Chapters 4–10 of this book, with roughly 300 references across control, planning, and learning. The present chapter aims at a different object: the minimum theoretical vocabulary a reader needs to follow Parts II and III. Six sections suffice: RL preliminaries (§8.2), policy gradient and PPO (§8.3), off-policy methods TD3 and SAC (§8.4), Transformer attention as in-context adaptation (§8.5), diffusion and flow-matching policies (§8.6), and the VLA concept (§8.7). A seventh short section (§8.8) discusses the privileged-learning patterns (teacher-student, DAgger, asymmetric critics, HER) that Chapter 6 invoked. Open questions (§8.9) close.

8.2 Reinforcement learning preliminaries

Reinforcement learning is framed as a Markov Decision Process (MDP): a tuple (S, A, P, r, \gamma) where S is the state space, A the action space, P(s'|s,a) the transition kernel, r(s,a) the reward, and \gamma \in [0,1) a discount factor. The agent's policy \pi(a|s) produces actions; the objective is to maximize the expected discounted return J(\pi) = \mathbb{E}_\pi [\sum_t \gamma^t r(s_t, a_t)]. The value function V^\pi(s) is the expected return starting from s under \pi; the action-value Q^\pi(s,a) adds the first action. The advantage A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s) measures how much better a is than average at state s.
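The discounted return inside J(\pi) can be made concrete in a few lines. A minimal NumPy sketch using the standard backward recursion (the bootstrapped, advantage-estimating variants used by actual trainers are elaborations of this):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Monte-Carlo return G_t = sum_k gamma^k r_{t+k}, computed for every t
    in one backward pass over the episode."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G
```

REINFORCE (§8.3) uses exactly this G_t in place of the advantage; actor-critic methods subtract a learned V^\pi(s_t) baseline from it.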

For humanoid control, S is a composite of proprioception (joint positions and velocities), base state (linear and angular velocity, gravity vector in the base frame), command input (velocity command or goal pose), and optionally vision. A is a continuous vector of desired joint positions or torques (typically 20–56-dimensional). P is the robot-plus-environment dynamics. r is a multi-term composite that rewards velocity-tracking, penalizes energy use and joint-limit proximity, and (for reference-motion policies) rewards tracking against a reference trajectory. \gamma is typically 0.99–0.995.
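A hypothetical composite reward of the shape just described. All weights, the tracking-kernel width, and the soft joint-limit margin here are illustrative choices, not values from any specific framework:

```python
import numpy as np

def humanoid_reward(base_lin_vel, cmd_vel, torques, joint_pos, joint_limits,
                    w_track=1.0, w_energy=2.5e-4, w_limits=1.0):
    """Illustrative multi-term reward: velocity tracking minus energy and
    joint-limit-proximity penalties."""
    # Velocity-tracking term: exponential kernel on the command error.
    track = np.exp(-np.sum((base_lin_vel - cmd_vel) ** 2) / 0.25)
    # Energy penalty: squared torque magnitude.
    energy = np.sum(torques ** 2)
    # Joint-limit penalty: distance past a soft margin at 90% of each range.
    lo, hi = joint_limits
    margin = 0.9 * (hi - lo) / 2.0
    mid = (hi + lo) / 2.0
    limits = np.sum(np.clip(np.abs(joint_pos - mid) - margin, 0.0, None))
    return w_track * track - w_energy * energy - w_limits * limits
```

Reference-motion tracking terms (DeepMimic-style) would add one more kernel over the distance to the reference pose.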

Kober, Bagnell, and Peters's 2013 IJRR survey [1] documents the pre-deep-RL state of reinforcement-learning-in-robotics — policy search with expert-designed features, typical sample requirements of tens to hundreds of real-robot trials per learned skill, and narrow single-skill task coverage. Tang, Abbatematteo, and Hu's 2025 Annual Review survey [19] is the canonical modern successor, documenting the 2018–2024 explosion of deep-RL successes in robotics including legged locomotion, manipulation, and mobile robotics. A reader who wants the full RL-in-robotics arc should read Kober 2013 and Tang 2025 back-to-back.

8.3 Policy gradient and PPO

The policy-gradient theorem states \nabla_\theta J(\pi_\theta) = \mathbb{E}_\pi [\nabla_\theta \log \pi_\theta(a|s) \, A^\pi(s,a)]: the gradient of the expected return equals the expected product of the log-policy's gradient and the advantage. REINFORCE is the simplest estimator (use Monte-Carlo return in place of A); actor-critic methods use a learned baseline to reduce variance; PPO adds a clipping trick to keep gradient steps from destabilizing the policy.

Proximal Policy Optimization (PPO) [3] is the algorithmic workhorse of humanoid RL. The core object is the ratio r_\theta(s,a) = \pi_\theta(a|s) / \pi_{\theta_{\text{old}}}(a|s) — the density ratio between the updated and behavior policies. PPO optimizes the clipped surrogate objective

L(\theta) = \mathbb{E} \left[ \min \big(r_\theta A^\pi, \text{clip}(r_\theta, 1-\epsilon, 1+\epsilon) A^\pi \big) \right]

with \epsilon \approx 0.1–0.2. Clipping prevents the ratio from moving too far, which would invalidate the importance-sampled gradient estimate. PPO is on-policy (it only uses rollouts from the current policy), straightforward to implement, numerically stable under large-batch GPU training, and the default in Isaac Gym / legged_gym / Humanoid-Gym and in nearly every paper in the Chapter 6 canon.
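The clipped surrogate translates nearly line-for-line into code. A minimal NumPy sketch, negated because optimizers minimize (real implementations add a value loss and an entropy bonus on top):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    """Clipped surrogate L(theta) from the equation above, negated for
    minimization. Inputs are per-sample log-probabilities and advantages."""
    ratio = np.exp(logp_new - logp_old)                    # r_theta(s, a)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return -np.mean(np.minimum(unclipped, clipped))
```

The min makes the clip one-sided in effect: a sample can never contribute more improvement than the clipped ratio allows, but it can always contribute its full penalty.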

The practical reason PPO won 2018–2024 is its simplicity-per-sample-efficiency ratio under large-batch simulation. When the simulator produces millions of transitions per second (Chapter 5), PPO's on-policy constraint — that the data must come from the current policy — is not expensive. When the simulator is slow, PPO's on-policy constraint is prohibitive and off-policy methods (the next section) dominate.

8.4 Off-policy methods — TD3 and SAC

Off-policy methods train from a replay buffer of historical transitions rather than only from on-policy rollouts. The trade-off is increased sample efficiency at the cost of algorithmic complexity. Two algorithms define the 2018–2024 off-policy landscape for continuous control.

TD3 (Twin Delayed DDPG) [4] attacks the Q-value overestimation bias that made the predecessor DDPG unreliable. Three tricks: (1) clipped double-Q learning — train two Q-networks and take the minimum in the target computation; (2) delayed policy updates — update the policy at a slower rate than the Q-networks; (3) target policy smoothing — add small noise to the target action to prevent the policy from exploiting sharp Q-value peaks. TD3 outperforms DDPG on several MuJoCo benchmarks and reduces overestimation bias by approximately 50% relative to DDPG [4].
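Tricks (1) and (3) each fit in a few lines. A sketch under common TD3-style hyperparameters (the 0.2 / 0.5 noise constants are illustrative defaults, not requirements):

```python
import numpy as np

def td3_target(r, gamma, q1_next, q2_next, done):
    """Trick 1, clipped double-Q: the Bellman target uses the minimum of the
    two target critics to suppress overestimation."""
    return r + gamma * (1.0 - done) * np.minimum(q1_next, q2_next)

def smoothed_target_action(mu_next, noise_std=0.2, noise_clip=0.5,
                           act_limit=1.0, rng=None):
    """Trick 3, target policy smoothing: clipped Gaussian noise on the target
    action so the critic cannot be exploited at sharp Q-value peaks."""
    rng = np.random.default_rng() if rng is None else rng
    noise = np.clip(noise_std * rng.standard_normal(mu_next.shape),
                    -noise_clip, noise_clip)
    return np.clip(mu_next + noise, -act_limit, act_limit)
```

Trick (2), delayed policy updates, is a training-loop schedule (e.g. one actor step per two critic steps) rather than a formula.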

SAC (Soft Actor-Critic) [6] adds a maximum-entropy term to the RL objective: maximize expected return plus a temperature-weighted entropy bonus. The entropy bonus encourages exploration by pushing the policy to be as stochastic as possible while still achieving reward. SAC's temperature can be tuned automatically (it is learned alongside the policy), the algorithm handles continuous actions natively, and it is among the most sample-efficient continuous-control algorithms. Its convergence in continuous control has been analyzed theoretically (e.g., [10]); in practice, SAC and TD3 are close competitors, with SAC preferred when exploration matters more and TD3 when stability matters more.
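The soft Bellman target makes the contrast with TD3 explicit: the same twin-Q minimum, plus the temperature-weighted entropy bonus, which enters as -alpha * log pi of the next action. A minimal sketch:

```python
import numpy as np

def sac_value_target(r, gamma, q1_next, q2_next, logp_next, alpha, done):
    """Soft Bellman target: twin-Q minimum plus entropy bonus. logp_next is
    log pi(a'|s') for the next action sampled from the current policy, and
    alpha is the (learned) temperature."""
    soft_q = np.minimum(q1_next, q2_next) - alpha * logp_next
    return r + gamma * (1.0 - done) * soft_q
```

Setting alpha to zero recovers a TD3-style target (minus the target-action smoothing), which is one way to see how close the two algorithms are.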

Both TD3 and SAC have returned to the humanoid-RL mainstream via FastTD3 [20] (Chapter 6 §6.8). FastTD3 combines parallel simulation, large-batch updates, a distributional critic, and tuned hyperparameters; on HumanoidBench, FastTD3 solves the locomotion-manipulation suite in under 3 hours on a single A100, 2–5× faster than PPO on rough-terrain domain-randomization tasks. The return of off-policy methods reflects a newly affordable algorithmic choice: at million-environment-step-per-second simulator throughput, the sample-efficiency gap between algorithms converts directly into wall-clock training time, and off-policy replay recovers its advantage.

Two adjacent methods deserve mention. Yarats et al.'s DrQ-v2 [11] pushes data-augmented off-policy RL for visual continuous control, showing that off-policy RL plus image augmentation dominates visual-RL benchmarks. For manipulation, Zhao et al.'s ACT (Action Chunking with Transformers) [14] — an imitation-learning method rather than an off-policy RL algorithm — combines behavioral cloning with a Transformer policy over action chunks; ACT demonstrates fine-grained bimanual manipulation on low-cost hardware and influenced the subsequent diffusion-policy work.

8.5 Transformers and in-context adaptation

Vaswani et al.'s Attention Is All You Need [2] is the origin of the Transformer architecture that now dominates modern machine learning, including the humanoid-RL policy architectures of Chapter 6 §6.7. The core mechanism is scaled dot-product attention: given queries Q, keys K, and values V derived from input tokens, the output is \text{Attention}(Q, K, V) = \text{softmax}(QK^\top / \sqrt{d_k}) V. Multi-head attention runs this mechanism in parallel across multiple learned projections; a Transformer layer combines multi-head attention with a position-wise feedforward network and layer normalization.
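The attention formula above, directly in NumPy — a single head, with no masking and no learned projections, so it shows only the core mechanism:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in the text.
    Q: (L_q, d_k), K: (L_k, d_k), V: (L_k, d_v); returns (L_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (L_q, L_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V
```

A production layer wraps this in learned Q/K/V projections per head, a causal mask for autoregressive policies, and an output projection.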

For humanoid policies, the Transformer's relevance is its ability to attend over a long context of past observations and actions. Chapter 6 §6.6 documented Radosavovic et al.'s 2024 finding [23] that longer attention windows monotonically improve deployment robustness. The architectural mechanism behind this finding is that Transformer attention functions as in-context system identification: the attention weights over the history encode how similar each past timestep is to the current situation, and the weighted value aggregate becomes the adaptation signal that RMA (Chapter 6 §6.4) computed via explicit extrinsics regression.

The practical constraint is inference cost. Transformer attention has O(L^2) cost in sequence length L; doubling the context quadruples the compute. For System 1 inference at 100–200 Hz (Chapter 9), the context length is bounded by the hardware inference budget rather than the accuracy ceiling. Engineering long-context on-board Transformers is an active research frontier as of 2026; KV-caching (reusing past attention computations across timesteps), IO-aware exact-attention kernels (the FlashAttention family), and subquadratic sequence models (linear attention, the Mamba family) are the dominant optimization approaches.

A subtle property of causal Transformers that matters for humanoid policies: the position embedding encodes the temporal ordering of tokens, but the model does not need to treat the observation stream as strictly periodic. This is part of what made Radosavovic et al.'s 2024 follow-up [23] work: reframing control as autoregressive next-token prediction over a multiplexed stream of proprioception, commands, and actions is natural under the Transformer's sequence-modeling abstraction.

8.6 Diffusion and flow-matching policies

Diffusion models [8] are a class of generative models that learn to reverse a gradual noising process. The training objective has the model predict the noise added at a given timestep; at generation time, the model iteratively denoises from pure noise back to a sample. Denoising Diffusion Probabilistic Models (DDPM) are the canonical reference; the key insight for robotics is that actions can be the quantity generated.

Diffusion Policy [13] applies this mechanism to visuomotor control: the policy learns to generate a sequence of future actions conditioned on the current observation, by reversing a noising process over action trajectories. The architecture is typically a 1D convolutional U-Net or a Transformer that conditions on observation features and diffusion timestep. Compared to direct regression (where the policy outputs a single action), diffusion policies capture multi-modal action distributions naturally — useful for manipulation tasks where multiple valid solutions exist (grasp this corner or that corner; left hand or right hand). Diffusion policies have become the default action decoder for manipulation foundation models, including several of the VLA systems in §8.7.
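The forward-noising half of DDPM training is nearly a one-liner. A sketch of the (noisy input, noise target) pair an epsilon-prediction network would be trained on; the denoising network itself and the observation conditioning are omitted:

```python
import numpy as np

def ddpm_training_pair(action_seq, t, alphas_cumprod, rng):
    """Forward process at timestep t: returns the noised action trajectory
    and the noise sample the network must learn to predict.
    action_seq: (horizon, action_dim) clean action chunk from the dataset."""
    eps = rng.standard_normal(action_seq.shape)
    a_bar = alphas_cumprod[t]       # cumulative product of the noise schedule
    noisy = np.sqrt(a_bar) * action_seq + np.sqrt(1.0 - a_bar) * eps
    return noisy, eps
```

The training loss is then a mean-squared error between eps and the network's prediction given (noisy, t, observation features); sampling runs the learned reversal iteratively from pure noise.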

Flow matching [12] is the 2022–2024 evolution of the diffusion framework. Rather than learning to predict noise, flow-matching models learn a velocity field that transports a simple base distribution to the data distribution via an ordinary differential equation. Flow matching is often simpler to train, faster to sample from, and produces competitive or better quality than diffusion. Black et al.'s π0 [16], one of the VLA systems in §8.7, uses a flow-matching action head.
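Under the common linear-path instantiation, the flow-matching training pair is even simpler than DDPM's. A sketch — the straight path is one choice of conditional probability path, not the only one:

```python
import numpy as np

def cfm_training_pair(x0, x1, t):
    """Linear-path conditional flow matching: a point on the path from base
    sample x0 (noise) to data sample x1, and the target velocity there.
    Path: x_t = (1 - t) x0 + t x1, so dx_t/dt = x1 - x0."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0              # constant along the straight path
    return x_t, v_target
```

Training regresses a network v_theta(x_t, t, conditioning) onto v_target; sampling integrates the learned ODE from noise to an action, typically in far fewer steps than diffusion denoising needs — the efficiency argument the text makes.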

Wolf and colleagues' 2025 survey Diffusion Models for Robotic Manipulation [21] covers this space in comprehensive depth for readers who want a technique catalog. The present chapter's framing is minimal: diffusion-family policies are the default action decoder for manipulation where action multi-modality matters, and flow matching is the 2024+ preferred formulation for efficiency reasons.

A practical consequence for humanoids: diffusion policies are typically run at 50–100 Hz (limited by the denoising iterations per inference), which places them at the System 1 (Chapter 9) frequency tier. A System 1 policy that includes a diffusion action head cannot directly run at 1 kHz; that is why System 0 (joint-level control at 1 kHz) remains a separate layer below.

8.7 Vision-Language-Action (VLA) models

A Vision-Language-Action (VLA) model is a single network that maps an image observation and a language command to a sequence of robot actions. The architectural recipe is: take a pretrained vision-language model (typically a VLM built on a LLaMA-family, PaLM, or PaLI backbone), replace its text-generation head with an action decoder (either a discretized token head or a diffusion / flow-matching head), and fine-tune the composite network on a robot-action corpus.

Chapter 10 develops the VLA story in detail. This section provides the minimum vocabulary. Three 2024 systems anchor the category:

OpenVLA [15] is the canonical open-source VLA. A 7-billion-parameter model built on Llama-2, with a vision encoder and a tokenized-action head, OpenVLA is trained on the Open X-Embodiment dataset spanning 970,000 episodes from 22 robot embodiments. OpenVLA runs at approximately 100 ms per action on a consumer GPU and was the first open VLA to demonstrate cross-embodiment generalization at scale.

π0 (pi-zero) [16] from Physical Intelligence uses a different architectural choice: a 3-billion-parameter VLM backbone with a flow-matching action head running at 50 Hz. π0 trains on a mixed corpus of 7 robot platforms and 68 tasks; its specific contribution is showing that a VLA can run at manipulation-relevant frequencies (rather than the sub-Hz rates typical of early VLAs).

NVIDIA GR00T N1 / N1.5 [24] frames the VLA as a dual-system architecture: a low-frequency System 2 VLM (1.34 billion parameters) plus a higher-frequency System 1 diffusion-transformer action decoder (860 million parameters), totaling 2.2 billion parameters. GR00T is trained on a mixture of simulated rollouts, human-video, and real-robot teleoperation data, and is deployed in the paper's real-world evaluations on the Fourier GR-1 humanoid robot (with simulation benchmarks extending to Franka Panda arm variants).

The VLA concept's key property — the one Chapter 10 will develop — is cross-embodiment generalization: a single policy that operates across multiple robot bodies. The economic premise of the entire frontier-company VLA program (Chapter 10, Chapters 11–13) depends on this property holding at commercial scale. Whether it does is the major open question.

8.8 Privileged learning and imitation patterns

Several patterns recur across the 2018–2026 humanoid-RL literature and are worth naming explicitly because Chapter 6's canon assumed them without always spelling them out.

Teacher-student with privileged information (Lee 2020 in Chapter 6 §6.3) is the most important. The teacher has access to ground-truth environment state; the student operates on only the observations the real robot will have. The student imitates the teacher. This structure appears in almost every frontier humanoid-RL paper.

DAgger (Dataset Aggregation) is the online variant: rather than collecting teacher demonstrations once and training the student via behavioral cloning, DAgger interleaves student rollouts with teacher queries, adding the student's own mistakes to the training set. The Lee 2020 pipeline uses a DAgger-style update for the student.

Asymmetric actor-critic is the privilege-on-the-critic-only variant: the actor network sees only proprioceptive observations (what the real robot will have), while the critic network sees privileged information (what the simulator knows). The advantage is that privileged information enters only the critic, which is discarded at deployment anyway. Asymmetric actor-critic is widely used in production humanoid-RL stacks.

Hindsight Experience Replay (HER) retrofits failed rollouts by relabeling their rewards as if the failed state had been the intended goal. HER is most relevant for goal-conditioned tasks and is less common in locomotion, where velocity commands are already goal-conditioned.
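HER's relabeling is pure bookkeeping. A sketch over a hypothetical transition-tuple layout (obs, action, goal, next_obs, achieved) — the tuple format and the sparse reward are illustrative assumptions, not from any specific library:

```python
def her_relabel(transitions, achieved_final, reward_fn):
    """Relabel one failed rollout as if its final achieved state had been the
    intended goal, recomputing each reward against the new goal.
    transitions: list of (obs, action, goal, next_obs, achieved) tuples."""
    relabeled = []
    for (obs, action, _old_goal, next_obs, achieved) in transitions:
        r = reward_fn(achieved, achieved_final)   # reward w.r.t. the new goal
        relabeled.append((obs, action, achieved_final, next_obs, r))
    return relabeled
```

With a sparse reward (0 at the goal, -1 elsewhere), the relabeled final transition always succeeds, which is exactly the learning signal the original failed rollout lacked.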

Behavioral cloning from mocap or teleoperation provides the starting point for many humanoid policies (Chapter 6 §6.9 and §6.10). A learned policy is pretrained to imitate human motion, then fine-tuned with RL. The combination is typically more sample-efficient than RL from scratch.

These patterns are not the substance of Chapter 8, but they are the vocabulary. A reader who encounters "teacher-student" or "asymmetric critic" in Chapter 6 or Chapter 9 can return here for a one-paragraph refresher.

8.9 Open questions

Three meta-questions close this chapter and surface periodically through Parts III and IV.

First, which theoretical framework is the right language for learned humanoid policies? Classical control provides Lyapunov theory, passivity, and reachability. RL provides expected-return optimization. Neither is a complete description of what a deployed policy must do. Emerging candidates — control barrier functions, safe RL, differentiable MPC — are each a partial bridge. Chapter 15 revisits this question from the regulatory side when discussing Korean manufacturing deployment.

Second, what is the right scaling law for humanoid policies? Language models follow Chinchilla-style scaling; vision models follow their own. For humanoid policies, the scaling axes are training data, model parameters, context length, sim-diversity, and real-data ratio. No scaling law has yet been established; the 27-billion-token result of Radosavovic et al.'s next-token-prediction work [23] is suggestive but not systematic. The first credible humanoid-policy scaling-law paper is an open research opportunity.

Third, how does the theory bridge the sim-to-real gap? Chapter 7 described three engineering strategies. The theoretical question — what does it mean for a policy to generalize from simulator to reality, and what guarantees can one give — remains open. Tang et al.'s 2025 Annual Review survey [19] surveys the empirical landscape; the theoretical answer is still a research frontier.

8.10 Bridge to Chapter 9

With Chapter 2's classical foundations and Chapter 8's modern foundations in hand, the reader has the vocabulary needed for Chapter 9's System 0/1/2 architecture. Chapter 9 takes up two questions: what do the three layers look like structurally, and what are the interface contracts between them? The answers compose the classical primitives (System 0 PD or QP) with the modern primitives (System 1 policy networks, System 2 VLMs) into the lingua franca of 2026 humanoid control.

References

  1. Kober, J., Bagnell, J. A., & Peters, J. (2013). Reinforcement learning in robotics: A survey. International Journal of Robotics Research. doi:10.1177/0278364913495721.
  2. Vaswani, A., et al. (2017). Attention is all you need. Proc. NeurIPS.
  3. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint 1707.06347.
  4. Fujimoto, S., van Hoof, H., & Meger, D. (2018). Addressing function approximation error in actor-critic methods (TD3). Proc. ICML. arXiv:1802.09477.
  5. Peng, X. B., Abbeel, P., Levine, S., & van de Panne, M. (2018). DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM SIGGRAPH. arXiv:1804.02717.
  6. Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. Proc. ICML.
  7. OpenAI et al. (2019). Learning dexterous in-hand manipulation. IJRR. arXiv:1808.00177.
  8. Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Proc. NeurIPS.
  9. Kroemer, O., Niekum, S., & Konidaris, G. (2021). A review of robot learning for manipulation: Challenges, representations, and algorithms. JMLR.
  10. Pang, B., et al. (2021). Convergence analysis of soft-actor-critic + entropy bonus in continuous control.
  11. Yarats, D., Fergus, R., Lazaric, A., & Pinto, L. (2022). Mastering visual continuous control: Improved data-augmented reinforcement learning (DrQ-v2). Proc. ICLR.
  12. Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow matching for generative modeling. Proc. ICLR. arXiv:2210.02747.
  13. Chi, C., et al. (2023). Diffusion policy: Visuomotor policy learning via action diffusion. Proc. RSS.
  14. Zhao, T. Z., Kumar, V., Levine, S., & Finn, C. (2023). Learning fine-grained bimanual manipulation with low-cost hardware (ACT). Proc. RSS.
  15. Kim, M. J., et al. (2024). OpenVLA: An open-source vision-language-action model. arXiv preprint.
  16. Black, K., et al. (2024). π0: A vision-language-action flow model for general robot control. arXiv preprint.
  17. Luo, Z., et al. (2024). Universal humanoid motion representations for physics-based control (PHC/PULSE). Proc. ICLR.
  18. Gu, Z., Li, J., & Shen, W. (2025). Humanoid locomotion and manipulation: Current progress and challenges in control, planning, and learning. arXiv preprint 2501.02116.
  19. Tang, C., Abbatematteo, B., & Hu, J. (2025). Deep reinforcement learning for robotics: A survey of real-world successes. Annual Review of Control, Robotics, and Autonomous Systems. doi:10.1146/annurev-control-030323-022510. arXiv:2408.03539.
  20. Seo, H., et al. (2025). FastTD3: Simple, fast, and capable reinforcement learning for humanoid control. arXiv preprint 2505.22642.
  21. Wolf, R., Shi, Y., Liu, S., & Rayyes, R. (2025). Diffusion models for robotic manipulation: A survey.
  22. Radosavovic, I., et al. (2024). Real-world humanoid locomotion with reinforcement learning. Science Robotics. arXiv:2303.03381.
  23. Radosavovic, I., et al. (2024). Humanoid locomotion as next token prediction. NeurIPS. arXiv:2402.19469.
  24. Bjorck, J., et al. (2025). GR00T N1: An open foundation model for generalist humanoid robots. NVIDIA technical report and arXiv preprint.
  25. NVIDIA. (2025). GR00T N1.5: Improved foundation model for generalist humanoid robots. NVIDIA technical announcement.