Chapter 9: The 3-Layer System 0/1/2 Architecture
9.1 The 2025–2026 industry lingua franca
The terms "System 1" and "System 2" came into popular use not through robotics but through cognitive science. Daniel Kahneman's 2011 book Thinking, Fast and Slow [1] organized decades of psychological research around a dual-process model of human cognition: System 1 is fast, parallel, pattern-matching, intuitive; System 2 is slow, sequential, deliberate, analytic. The naming was a synthesis, not an invention — Stanovich and West, Evans, and others developed dual-process theories earlier — but Kahneman's formulation became the canonical reference.
By 2025, three frontier-company humanoid stacks had independently converged on Kahneman's dual-process framing as the organizing architecture for their learned humanoid policies. Figure AI's Helix [8] described its architecture as "System 1" (fast reactive visuomotor at 200 Hz) plus "System 2" (onboard VLM at 7–9 Hz). NVIDIA's GR00T N1 [10] described its architecture with the same naming: System 2 VLM (1.34 B parameters) plus System 1 diffusion transformer (860 M parameters). AgiBot's GO-2 [11] described its architecture as "asynchronous dual-system," again mapping cleanly onto Kahneman's S1/S2. Each of the three companies arrived at the naming roughly independently over a period of twelve months. The convergence reflects both Kahneman's intellectual reach and the underlying structural pressure: a learned humanoid stack wants this decomposition.
What Figure AI's Helix 02 [9] added in January 2026 was a third layer — System 0 — underneath the System 1 visuomotor policy: a 10-million-parameter whole-body controller running at 1 kHz with explicit bounded-latency contracts to the layer above. The three-layer stack with named frequency tiers and interface contracts is specifically Helix 02's contribution; the System 1 / System 2 naming is convergent inheritance from Kahneman 2011.
Chapter 9's contribution is cross-stack formalization — a comparison of how Figure's Helix, AgiBot's GO-2, NVIDIA's GR00T, and Boston Dynamics's hybrid architecture instantiate the three-layer decomposition, and what their interface contracts have and do not have in common. The chapter does not claim credit for the three-layer naming, and its value is neither historical [1] nor promotional (Figure 2025/2026) but structural: a working inventory of how the architecture varies across four production humanoid stacks in 2026. §9.7 develops this inventory into a five-axis comparison (coupling / timing / failure mode / adaptation source / interface contract).
The chapter proceeds through the three layers (§§9.2–9.4), then discusses asynchronous communication (§9.5), parameter scales (§9.6), the structural comparison with the old preprocessing pipeline (§9.7), fault tolerance and fallback semantics (§9.8), onboard vs cloud inference (§9.9), and the per-company mapping (§9.10). The Behavior Foundation Model (BFM) framing, a 2025 academic organization of these ideas [14], appears at §9.11 as a companion perspective.
9.2 System 0 — whole-body controller at 1 kHz
System 0 is the torque- or position-level controller that physically executes joint commands on the robot hardware. Its structural role is the honest substrate (Chapter 4 named QDD hardware as the physical precondition; System 0 is the software atop that hardware). The 1 kHz frequency is set by motor-driver hardware constraints: modern BLDC drivers update current commands at 10–40 kHz, but the controller writing those commands lives at 1 kHz where the policy's desired joint torques can be realized with minimal lag.
Three instantiations of System 0 exist in 2026 production:
- Classical PD or whole-body QP (Boston Dynamics, Agility Robotics, most academic stacks). The controller is provably-correct by construction: given the desired task-space wrench from the layer above, compute joint torques that realize it while respecting friction cones, joint limits, and centroidal dynamics constraints (Chapter 2 §2.3). The Agility Motor Cortex "always-on safety layer" [12] is a classical whole-body QP in this slot.
- Learned 10M-parameter network (Figure Helix 02). Figure 03 ships a learned System 0 trained in Isaac Lab with 200,000+ parallel environments and extensive domain randomization [9]. The network absorbs the whole-body QP's structural role into weights — function approximation of the QP over a randomized distribution of physical configurations.
- Learned multi-mode controller (HOVER [7]). The 1.5-million-parameter HOVER controller supports 15+ distinct control modes (joint PD, torque, inverse kinematics, footstep command, root velocity, etc.) and runs at 200 Hz on Jetson-class edge hardware. HOVER occupies an ambiguous position in the three-layer taxonomy — its frequency is closer to System 1's, but its interface (direct joint-level outputs across many modes) makes it function as a System 0 for the policies above. We return to this ambiguity in §9.7's interface-contract discussion.
System 0's role is not to be intelligent. It is to realize the layer-above's commands honestly, quickly, and provably. Its primary failure modes are unmodeled hardware drift (the classical QP's gap from Chapter 1), numerical conditioning, and (for learned variants) distribution-shift artifacts when the hardware is outside the training distribution. The Chapter 7 sim-to-real toolkit is what closes the learned-System-0's distribution gap.
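System 0's tracking role can be sketched minimally. The following assumes a pure PD law with torque clipping — real controllers add gravity compensation, friction-cone and centroidal constraints (the whole-body QP of Chapter 2 §2.3); the function name and gains here are illustrative, not any vendor's API:

```python
import numpy as np

def system0_pd_step(q, qd, q_des, qd_des, kp, kd, tau_limit):
    """One 1 kHz tick of a PD-style System 0: track the latest
    System-1 joint command, clipped to hardware torque limits.
    q, qd: measured joint positions / velocities
    q_des, qd_des: desired positions / velocities from System 1."""
    tau = kp * (q_des - q) + kd * (qd_des - qd)
    # Honest realization includes respecting actuator limits.
    return np.clip(tau, -tau_limit, tau_limit)
```

The clipping line is where the "honest" part lives: the layer above may request the impossible, and System 0's job is to realize the closest physically admissible command rather than silently saturate the hardware.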
9.3 System 1 — visuomotor policy at 100–200 Hz
System 1 is the fast reactive layer that converts observations (vision, proprioception, System-2 conditioning) into desired joint-level commands. Its frequency range of 100–200 Hz is set by two competing pressures: higher frequency is better for disturbance rejection, but diffusion-policy inference (Chapter 8 §8.6) and Transformer history-encoder inference (Chapter 6 §6.7) must both fit within an ever-shorter tick budget as frequency rises. 100–200 Hz is the 2026 compromise.
System 1 is structurally where the Chapter 6 canon lives. The five-paper canonical recipe — actuator network → teacher-student → RMA implicit adaptation → biped LSTM → full-size causal Transformer [6] — produces System 1 policies. Modern production systems add three things on top of the canon:
- Vision conditioning. System 1 typically receives depth or RGB images directly, alongside proprioception. This is the main difference between the Ch06 canonical policies (which were proprioception-only) and production VLA-integrated systems.
- Language / latent conditioning from System 2. When a VLM runs above (next section), System 1 must consume whatever latent representation that VLM emits. The consumption interface is typically a low-dimensional latent vector that the S2 produces every 100–150 ms and S1 attends over across its 100–200 Hz control ticks.
- Diffusion or flow-matching action heads. For manipulation-heavy tasks, the output distribution matters as much as the output point estimate; diffusion policy [3] or flow matching [4] generate multi-modal action sequences at the cost of additional inference steps per action.
The Figure Helix System 1 [8] is the public reference architecture: a visuomotor policy running at 200 Hz, consuming all sensors (cameras, proprioception, S2 latent) and outputting all joint commands. The "all sensors in, all joints out" framing is deliberate marketing shorthand for no intermediate abstraction; the trade-off is that the policy must be trained to handle every combination of inputs rather than specializing per task. Helix 02 retains the 200 Hz S1 and adds the 1 kHz S0 beneath it.
AgiBot's GO-2 S1 [11] is the complementary point in the design space: an "asynchronous" S1 that consumes S2 conditioning at its own schedule and emits actions at 1 kHz to Genie Sim 3.0's physics-rendering-decoupled pipeline (Chapter 5 §5.3). The asynchrony is what lets GO-2 run a full VLM-backbone S2 without blocking S1's real-time loop.
HOVER [7] represents the generalist direction: one learned multi-mode policy that subsumes the role of System 1 across tasks. The per-task experts-then-generalist pipeline [13] is a similar pattern at academic scale. Whether the production frontier (Figure, AgiBot, GR00T) converges on HOVER-style generalism is the open question §9.10 returns to.
9.4 System 2 — VLM backbone at 7–10 Hz
System 2 is the slow deliberate layer: a pretrained vision-language model that sees the scene, understands the task, and produces high-level conditioning for System 1. Its frequency range of 7–10 Hz reflects both VLM inference cost (for a 7-billion-parameter model on embedded GPU) and the frequency at which language-level decisions change. A humanoid loading a dishwasher does not need task-plan updates at 100 Hz; 7–10 Hz is ample for scene reasoning.
Three 2025–2026 instantiations anchor System 2:
- Figure Helix S2 (2025): an onboard internet-pretrained VLM running at 7–9 Hz for scene understanding and language [8]. The specific backbone is not publicly disclosed; the public claim is that S2 runs on the same embedded GPU as S1, not a cloud or external compute unit.
- NVIDIA GR00T N1 S2 (2025): a 1.34-billion-parameter VLM on top of an 860-million-parameter System 1 diffusion transformer, totaling 2.2 billion parameters [10]. GR00T's S2 runs at a lower frequency than Figure's — the paper reports approximately 63.9 ms per 16-action chunk on an NVIDIA L40, with the VLM effectively operating near 10 Hz while the diffusion head runs near 120 Hz. GR00T N1's public evaluations use an L40-class GPU (the robot-side deployment configuration is not detailed in the report), whereas Figure's S2 is stated to run onboard.
- AgiBot GO-2 S2 (2026): low-frequency semantic planning with asynchronous communication to S1 at 1 kHz; specific parameter counts and backbone are not publicly disclosed. The "asynchronous dual-system" framing is an architectural claim: S2 does not block S1's real-time loop, and communication is event-driven rather than clock-synchronous.
Figure Helix 02 added capabilities at S2 (7-billion-parameter VLM at 7–9 Hz) while moving the computational burden onboard [9]. The specific claim — "7 B VLM at 7–9 Hz on a low-power embedded GPU" — has not been third-party reproduced, and reproducibility at this scale remains one of the field's open frontiers.
System 2's role is to provide task-level context that System 1 could not infer from sensors alone. "Task-level" here means concept-bearing information — "the user wants the red cup" or "this task is cleaning the counter, not wiping the spill" — not joint-space specifications. The conditioning interface from S2 to S1 is the architecturally critical joint; §9.7 returns to it.
9.5 Asynchronous communication
The three-layer architecture is not three sequential stages. It is three concurrent processes with different clocks. This asynchrony is the single most important structural difference between the modern architecture and the pre-2020 planner-then-controller pipeline (§9.7).
Clock ratios. System 0 at 1 kHz, System 1 at 100–200 Hz, System 2 at 7–10 Hz. The ratios are 5–20× between layers. These ratios are not coincidental; they approximately match the timescales on which different kinds of physical events happen. A joint torque change propagates in milliseconds; a task-relevant visual event propagates in tens of milliseconds; a task-plan change propagates in hundreds of milliseconds to seconds.
Communication semantics. S2 emits a latent representation to S1 whenever it finishes an inference pass. S1 attends over the most recent S2 latent across many of its own control ticks. S1 emits joint commands to S0 at its own 100–200 Hz rate. S0 tracks those commands at 1 kHz. The communication between layers is asynchronous in the strict sense: neither side waits for the other to finish before proceeding with its own loop.
Buffering and interpolation. Between S1 at 200 Hz and S0 at 1 kHz, there are five S0 ticks per S1 tick. Does S0 hold the S1 command constant for five ticks, or interpolate, or extrapolate? Production stacks typically interpolate (smooth transitions) and/or run a short local model predicting the next S1 command. The engineering of this buffering is under-disclosed publicly; it is one place where reproducing Figure or AgiBot results from publications alone is difficult.
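The interpolation option can be sketched as linear blending across the five S0 ticks between consecutive S1 commands. This is a minimal sketch under that assumption; production stacks may extrapolate or run a short local predictive model instead, and the function name is illustrative:

```python
import numpy as np

def s0_setpoints(prev_cmd, new_cmd, n_ticks=5):
    """Generate the n_ticks System-0 setpoints that bridge two
    consecutive System-1 commands (n_ticks = 5 for 200 Hz S1
    over 1 kHz S0). Linear interpolation: the last setpoint
    lands exactly on the new S1 command."""
    alphas = np.linspace(1.0 / n_ticks, 1.0, n_ticks)
    return [(1.0 - a) * prev_cmd + a * new_cmd for a in alphas]
```

Holding the command constant instead would inject a 200 Hz staircase into the torque signal; interpolation trades one S1 tick of lag for smoothness.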
Backpressure. If S2 slows down (longer inference, or a contested thermal budget), does S1 degrade gracefully? Figure's published description suggests yes — S1 falls back to using stale S2 conditioning, which for many tasks is sufficient. AgiBot's "asynchronous" framing makes this explicit as an architectural choice. The failure mode is if S2 stops producing any updates for extended periods and the task drifts out of S2's last known distribution.
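The latest-value-plus-staleness pattern behind this graceful degradation can be sketched as follows. The buffer class and threshold are hypothetical, but the semantics — S1 never blocks on S2, and stale latents are flagged rather than hidden — match the published descriptions:

```python
import time

class S2LatentBuffer:
    """Latest-value buffer between S2 and S1. S1 reads the most
    recent latent without waiting; past max_age_s the latent is
    flagged stale so the policy can drop to a conservative mode."""

    def __init__(self, max_age_s=3.0):
        self.latent = None
        self.stamp = None
        self.max_age_s = max_age_s

    def publish(self, latent, now=None):
        """Called by S2 whenever an inference pass finishes."""
        self.latent = latent
        self.stamp = time.monotonic() if now is None else now

    def read(self, now=None):
        """Called by S1 every control tick. Returns (latent, stale)."""
        now = time.monotonic() if now is None else now
        if self.stamp is None:
            return None, True
        return self.latent, (now - self.stamp) > self.max_age_s
```

The `now` parameter exists only to make the sketch testable; a real implementation would use the monotonic clock directly.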
9.6 Parameter scales
The three layers differ by roughly one order of magnitude in parameter count each:
- System 0: 10M parameters (Figure Helix 02's learned S0) or ~0 parameters (Boston Dynamics's whole-body QP). This layer is the smallest, or absent entirely, because System 0's role is honest realization, not reasoning.
- System 1: 100M to 1B parameters. HOVER is 1.5M (small); Figure Helix S1 is not disclosed but implied at hundreds of millions; GR00T N1's diffusion transformer is 860M; GR00T N1.5 reports ~10% improvement over N1 with similar parameter count.
- System 2: 1B to 10B parameters. Figure Helix 02's S2 is 7B; GR00T N1's S2 is 1.34B; π0 (Chapter 8 §8.7) is 3B total with flow-matching head; OpenVLA is 7B.
The progression follows a reasonable logic. System 0's physical realization does not benefit from language-scale parameters. System 1's visuomotor coordination benefits from middling parameter counts — enough to handle vision plus proprioception plus action multimodality. System 2's task reasoning needs language-scale parameters to match the knowledge domain it draws from.
A consequence of the parameter-scale progression: the three layers place different demands on onboard compute. System 0 fits comfortably on any embedded controller. System 1 at 100–200 Hz fits on a Jetson-class or small accelerator. System 2 at 7B+ parameters is the compute frontier; Figure's claim of onboard 7B at 7–9 Hz is itself the load-bearing architectural bet.
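A back-of-envelope check shows why onboard 7 B at 7–9 Hz is the frontier: a dense forward pass must stream every weight from memory at least once per inference, so memory bandwidth alone sets a floor. This is a crude lower bound that ignores KV caches, activations, quantization, and batching:

```python
def min_bandwidth_gbs(params_billion, hz, bytes_per_param=2):
    """Lower-bound memory bandwidth (GB/s) for a dense model that
    reads every weight once per forward pass, at the given rate.
    bytes_per_param=2 assumes fp16/bf16 weights."""
    return params_billion * bytes_per_param * hz

# A 7B-parameter fp16 model at 8 Hz must stream roughly
# 7 * 2 bytes * 8 = 112 GB/s of weights alone -- a large
# fraction of a typical embedded GPU's total memory bandwidth.
```

Quantization to int4 cuts the floor by 4x, which is one reason aggressive weight quantization is standard in onboard S2 deployments.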
9.7 Old preprocessing pipeline vs new asynchronous layers
Chapter 1 §1.6 described the orthodox planner-controller split: a footstep planner chose discrete foot placements at a slow rate, and a whole-body QP executed them at 1 kHz. The architecture looked superficially like a two-layer system, and readers may wonder how the 2026 System 0/1/2 stack differs structurally.
The two architectures differ along five axes:
Coupling: orthodox = sequential (planner produces plan, controller executes); modern = concurrent (three layers run on different clocks).
Timing: orthodox = plan-to-commitment (planner commits, controller executes until next plan); modern = rolling latent (S2 latents are continuously updated and consumed; nothing is strictly committed).
Failure mode: orthodox = stale plan (controller executes a plan that no longer matches reality; stutter step, fall); modern = distribution drift (policy encounters inputs outside training distribution; graceful degradation or safety-filter intervention).
Adaptation source: orthodox = replanning (detect failure, re-run planner); modern = history encoder (policy adapts implicitly from state-action history within the learned weights).
Interface contract: orthodox = typed task-space trajectory (planner emits trajectories in Cartesian or joint space that the QP directly follows); modern = latent conditioning (S2 emits task-level latents that S1 interprets through learned attention, not a fixed type signature).
The 5-axis comparison clarifies that the change is not simply "add a VLM on top." It is a re-architecture of how information flows between levels of the system. The orthodox pipeline was sequential and strongly-typed; the modern architecture is concurrent and latent-typed. This matters for deployment: the orthodox pipeline's failure modes were diagnostic (you could tell what part failed); the modern architecture's failure modes are statistical (the policy produced a bad action because its input was outside distribution).
A structural consequence: BD's hybrid MPC+RL architecture (Chapter 11) preserves orthodox-style typed interfaces at the MPC layer while stacking RL above for adaptability. The hybrid architecture accepts the orthodox pipeline's diagnostic clarity at the cost of some adaptability; the end-to-end learned architectures (Figure, AgiBot, GR00T) accept the latent-typed opacity at the benefit of adaptability. Whether one architectural choice dominates is a live Part IV question.
9.8 Fault tolerance and fallback semantics
Interface contracts are only half the architecture. The other half is what happens when an interface degrades. Three patterns are documented:
S2 unavailable: S1 continues with stale S2 conditioning. Production stacks design S1 to be robust to S2 latent delays up to several seconds; tasks that require live S2 (novel scene reasoning) fail gracefully by defaulting to a conservative mode.
S1 unavailable: S0 drops back to a classical fallback (PD on commanded joint positions, or a short-horizon MPC). Agility's Motor Cortex "always-on safety layer" [12] is this fallback; Boston Dynamics's MPC+RL hybrid makes the fallback explicit. Figure has not publicly disclosed its S1-unavailable protocol.
S0 unavailable: the robot cannot respond at kHz rates, and hardware-level interlocks (torque limits, watchdog timers, E-stop) take over. This is the lowest-level safety layer and is almost always classical.
The fallback hierarchy is not a luxury but a deployment precondition. Industrial humanoids operating in shared workspaces with humans (Chapter 15) need diagnostic failure modes to satisfy safety certification (ISO 10218, ISO/TS 15066). A learned System 1 that can degrade gracefully through a classical S0 is qualitatively different for safety purposes from a learned System 1 with no fallback. Chapter 15's Manufacturing Physical AI discussion returns to this point in the context of Korean regulatory posture.
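The three patterns above can be summarized as a toy mode-selection ladder. Mode names are illustrative, not any vendor's actual API; the point is that the selection logic itself is classical and auditable even when the modes it selects between are learned:

```python
def select_controller(s1_ok, s2_fresh):
    """Toy fallback ladder for the 9.8 hierarchy: prefer the full
    learned stack, degrade to stale-conditioned S1, then to
    classical S0 tracking. (S0 failure is handled below software,
    by hardware interlocks, so it does not appear here.)"""
    if s1_ok and s2_fresh:
        return "learned_s1_with_s2"      # nominal operation
    if s1_ok:
        return "learned_s1_stale_s2"     # S2 unavailable: stale conditioning
    return "classical_s0_hold"           # S1 unavailable: PD / short-horizon MPC
```

For certification purposes, this kind of enumerable mode table is precisely what distinguishes a learned stack with diagnostic fallbacks from one without.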
9.9 Onboard vs cloud inference
An architectural choice that is often under-discussed: where does System 2's inference physically happen? Three options:
- Fully onboard: S0, S1, and S2 all run on the robot's local compute. Figure Helix 02 is the public exemplar — 7B VLM onboard at 7–9 Hz [9]. Constraint: embedded GPU with aggressive power and thermal budgets, typically 50–200 W sustained.
- Split onboard / cloud: S0 and S1 onboard, S2 in a nearby edge compute node or cloud. GR00T N1's published evaluations on an L40-class GPU are consistent with a split configuration for at least some experiments. Constraint: network latency to the cloud, which breaks down in low-connectivity environments (factories, disaster sites).
- Fully cloud except S0: only S0 onboard. Not publicly documented for any humanoid system but present in some teleoperation-bridge concepts.
The onboard-vs-cloud choice is not a neutral engineering decision; it shapes which deployment environments are viable. Factories with 5G / WiFi connectivity can support split onboard/cloud. Offshore oil platforms, hospitals with HIPAA constraints, or factories with data-residency regulations (Korean semiconductor fabs are typical) require fully-onboard deployment. Chapter 15's argument that specialized Korean chips (Rebellions, DEEPX) are strategic is grounded in this constraint: the Korean regulatory and industrial environment favors fully-onboard deployment, which in turn creates a market for specialized accelerators.
9.10 Per-company mapping
A concise mapping of the four architectures discussed in Part IV onto the three-layer framework:
| Company | System 0 | System 1 | System 2 | Coupling |
|---|---|---|---|---|
| Figure (Helix 02) | 10M learned, 1 kHz | visuomotor, 200 Hz | 7B VLM, 7–9 Hz onboard | end-to-end learned, tight |
| NVIDIA (GR00T N1) | classical PD / external | 860M diffusion, ~120 Hz head | 1.34B VLM | modular, research-oriented |
| AgiBot (GO-2) | unspecified low-level | high-freq asynchronous S1 | low-freq semantic planning | asynchronous, company vertical |
| Boston Dynamics | whole-body QP + MPC | RL layer on MPC | TRI Large Behavior Model | hybrid MPC+RL (Chapter 11) |
| Agility (Motor Cortex) | whole-body QP "always-on" | learned whole-body, unspecified | task-level external | safety-first layered |
The table is a reading scaffold for Part IV, not a claim that Chapter 9's view determines how each company should be analyzed. Each row has its own Part IV chapter that develops the architecture in context; the mapping here is a promise that the chapters will cohere.
9.11 The Behavior Foundation Model framing
A concurrent academic framing of the same territory is the Behavior Foundation Model (BFM) line of work, surveyed in 2025 by Yuan et al. [14]. BFM is the robotics analog of the foundation-model framing from language modeling: a pretrained whole-body controller that downstream tasks adapt. The relationship to the System 0/1/2 architecture is direct: BFM describes what S1 is — a pretrained generalist policy. System 0/1/2 is an architecture description; BFM is a pre-training framework that feeds that architecture.
Chapter 9's contribution is complementary to, not competitive with, the BFM survey. The BFM survey's strength is its taxonomy of pre-training pipelines, task specifications, and downstream adaptation methods. This chapter's strength is the cross-company interface-contract comparison in §9.7 and §9.10 — the structural question of how the three layers talk to each other in the four dominant production stacks. Readers who want the BFM-centric view should read the survey alongside Chapter 9; the two perspectives do not disagree on substance.
Hierarchical RL more broadly is the research-side sibling to the BFM framing at the S1/S2 interface boundary. Its decomposition into a high-level motion-primitive selector (slow) plus a low-level motor policy (fast) is what produces the pre-trained S1/S2 combinations that the System 0/1/2 architecture deploys, and the BFM survey organizes that pre-training literature at technique-catalog depth.
9.12 Open questions
Three questions close the chapter.
First, is the three-layer decomposition the right abstraction, or is it an artifact of 2025–2026 compute budgets? If embedded GPU inference improves by 10× in the next five years, the frequencies and parameter counts could all shift. System 0 at 1 kHz might become superfluous if System 1 can run at 1 kHz directly; System 2 at 7–10 Hz might accelerate into the 50+ Hz range, changing what tasks it can serve. The architecture's durability across compute generations is unknown.
Second, what is the right interface-contract standard for S1 ↔ S2? Each company uses its own format; cross-embodiment portability depends on some convergence. IEEE / ISO standardization efforts are beginning; Chapter 15's Korean-industrial-strategy argument returns to this as a standards-ownership question.
Third, how does the three-layer architecture accommodate reasoning systems with iterative / tool-using behavior? Current S2 is predominantly a feedforward VLM call; task-level reasoning that requires extended multi-step computation (tool use, planning-with-search, simulation rollouts) does not fit the 7–10 Hz budget. Whether a fourth "System 3" for reasoning emerges — or whether S2 simply expands its scope — is the natural successor question for the 2027–2030 window.
9.13 Bridge to Chapter 10
Chapter 9 established the architecture; Chapter 10 examines what fills it. Specifically, the VLM-VLA family that occupies System 2 and (in some architectures) also System 1. OpenVLA, GR00T N1/N1.5, Figure Helix, AgiBot GO-1/GO-2, and π0 are the 2024–2026 state-of-the-art entries. Chapter 10 compares them stack-by-stack, develops the cross-embodiment generalization question, and closes with the Part III verdict on VLA-driven loco-manipulation integration.
References
- [1] Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux.
- [2] Vaswani, A., et al. (2017). Attention is all you need. Proc. NeurIPS. arXiv:1706.03762.
- [3] Chi, C., et al. (2023). Diffusion policy: Visuomotor policy learning via action diffusion. Proc. RSS.
- [4] Lipman, Y., et al. (2022). Flow matching for generative modeling. Proc. ICLR.
- [5] Radosavovic, I., et al. (2024). Real-world humanoid locomotion with reinforcement learning. Science Robotics. arXiv:2303.03381.
- [6] Radosavovic, I., et al. (2024). Humanoid locomotion as next token prediction. NeurIPS. arXiv:2402.19469.
- [7] He, T., et al. (2024). HOVER: Versatile neural whole-body controller for humanoid robots. Proc. IEEE ICRA 2025. arXiv:2410.21229.
- [8] Figure AI. (2025). Helix: A vision-language-action model for generalist humanoid control. Figure AI tech blog, February 2025.
- [9] Figure AI. (2026). Figure 03 + Helix 02: General-purpose humanoid system. Figure AI announcement, January/February 2026.
- [10] Bjorck, J., et al. (2025). GR00T N1: Open foundation model for generalist humanoid robots. NVIDIA technical report and arXiv preprint.
- [11] AgiBot. (2026). GO-2 asynchronous dual-system humanoid control architecture.
- [12] Agility Robotics. (2025). Motor Cortex: Whole-body control foundation model for Digit.
- [13] Cheng, X., et al. (2025). From experts to a generalist: Toward general whole-body control for humanoid robots. arXiv preprint 2506.12779.
- [14] Yuan, M., et al. (2025). A survey of behavior foundation model: Next-generation whole-body control system of humanoid robots. arXiv preprint 2506.20487.