How empirical results validate universal physics predictions, and why the "align to what?" and "how to enforce?" problems have the same solution
The AI safety field has fractured into separate research streams: some work on alignment targets (what values should AI optimize?), others on architectural control (how do we enforce those values?), and still others on governance (who decides?). These are treated as independent problems requiring independent solutions.
They are not independent. They are different views of the same underlying physics.
This essay shows how three seemingly separate insights—universal computational constraints, architectural isolation, and the pattern of Aliveness—converge on a unified framework. Recent empirical results in AI safety validate predictions derived from first-principles physics.
Current approaches to AI alignment lack non-arbitrary foundations:
All of these approaches share a fatal flaw: they treat alignment targets as arbitrary preferences to be asserted, rather than as requirements to be discovered and derived.
Any intelligent system—biological or artificial—is a goal-directed agent fighting entropy. It must navigate four inescapable physical trade-offs:
These aren't philosophical choices. They're computational necessities imposed by thermodynamics, information theory, and control systems theory. Any AI navigating physical reality faces them.
For systems whose goal is sustained flourishing (not paperclips, not wireheading, but durable creative possibility), these dilemmas have optimal synthetic solutions:
These are not preferences. They are discovered stability requirements. Systems that violate them face predictable failure modes:
The entire landscape of AI catastrophic risks maps to violations of these four principles. This is not coincidence—it's physics.
Most current AI systems are effectively 2-layer:
Where are the constraints? They're fused with the reward function—encoded as penalty terms in the optimization objective.
The structural problem: The reward function combines goal achievement and constraint violations into a single optimization target. Gradient descent optimizes this combined objective, treating constraints as just more terms to trade against task reward, not as inviolable boundaries. Because both goal pursuit and constraint satisfaction flow through the same optimization process, the system is incentivized to game constraints rather than respect them.
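To make the fusion concrete, here is a minimal sketch, assuming a toy dict-based state; `task_reward`, `constraint_violation`, and `penalty_weight` are illustrative placeholders, not any particular system's API:

```python
def task_reward(state, action) -> float:
    # Toy stand-in for the task objective.
    return float(action == state.get("best_action"))

def constraint_violation(state, action) -> float:
    # Toy stand-in for a safety constraint check (1.0 = violated).
    return float(action in state.get("forbidden", set()))

def fused_reward(state, action, penalty_weight: float = 10.0) -> float:
    """2-layer pattern: goal and constraint collapse into one scalar objective."""
    return task_reward(state, action) - penalty_weight * constraint_violation(state, action)

# Whatever optimizes this scalar is free to trade violations against task
# reward whenever the exchange rate is favorable: the constraint is a price,
# not a boundary.
```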
This creates predictable failures:
Recent formal work proves this isn't just intuition—it's geometric necessity. Skalse et al. (2023, published at ICLR 2024) rigorously demonstrated that in reinforcement learning, optimization follows a path through "policy space" where:
This is a structural property of optimization, not a specification problem. Even with "correct" proxy rewards, geometric constraints create systematic misalignment under sufficient optimization pressure.
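The effect is easy to reproduce in a toy setting. The sketch below is not Skalse et al.'s construction, just an illustration with made-up reward values: exponentiated-gradient ascent on a proxy reward over a three-action policy simplex, tracking the true reward as optimization pressure increases.

```python
import numpy as np

r_true  = np.array([1.0, 0.9, 0.0])   # what we actually care about
r_proxy = np.array([0.9, 1.0, 0.0])   # imperfect proxy, highly correlated

policy = np.ones(3) / 3               # start from the uniform policy
eta, steps = 0.05, 1000
true_values = []

for _ in range(steps):
    # Mirror/exponentiated-gradient ascent step on the *proxy* reward.
    policy = policy * np.exp(eta * r_proxy)
    policy /= policy.sum()
    true_values.append(policy @ r_true)

peak = int(np.argmax(true_values))
print(f"true reward peaks at step {peak}: {true_values[peak]:.3f}")
print(f"true reward after full proxy optimization: {true_values[-1]:.3f}")
# Typical output: the peak (~0.93) occurs early; continued proxy optimization
# drives true reward back down toward 0.90 -- Goodharting from over-optimization.
```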
The geometric proof from Skalse et al. applies to rewards within the optimization system. Architectural isolation moves the Protocol layer outside this optimization game—constraints become enforcement boundaries, not optimization targets.
This explains why Greenblatt et al.'s monitoring achieves a 6x improvement: the Protocol isn't just "another term in the reward function." It's architecturally isolated from the optimization process that Skalse et al. prove drives Goodharting.
The Protocol evaluates constraints independently. Because gradient descent doesn't flow through it, the substrate cannot learn to game it. The geometric mechanism that causes Goodharting simply doesn't apply to architecturally isolated constraints.
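A minimal sketch of what "gradient descent doesn't flow through it" can look like in practice, assuming a PyTorch-style setup; the module, check, and fallback here are toy placeholders:

```python
import torch
import torch.nn as nn

policy_net = nn.Linear(4, 3)                 # toy stand-in for Substrate + Strategy

def protocol_check(state, action) -> bool:   # toy stand-in for the Protocol: a fixed rule
    return int(action) != 2                  # e.g. action 2 is always forbidden

def select_action(state, fallback_action: int = 0):
    logits = policy_net(state)               # differentiable path: gradients can flow here
    action = torch.argmax(logits, dim=-1)
    with torch.no_grad():                    # Protocol evaluated outside the training graph
        permitted = protocol_check(state, action)
    return action if permitted else torch.tensor(fallback_action)

print(select_action(torch.randn(4)))
```

The contrast with the fused objective above: the check's output is a decision, not a differentiable quantity, so there is no gradient signal telling the substrate how to make violations look less costly.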
Greenblatt et al. (2024) tested AI systems explicitly tasked with inserting backdoors during deployment:
6x improvement from architectural isolation.
The key mechanism: A separate, less-capable monitor (GPT-3.5) with override authority supervising a more capable substrate (GPT-4). The monitor is architecturally isolated—not part of the end-to-end training. The substrate cannot optimize against it via gradient descent.
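Schematically, the pattern looks like the sketch below; the function names and threshold are illustrative assumptions, not the paper's protocol or code.

```python
SUSPICION_THRESHOLD = 0.8   # hypothetical escalation cutoff

def supervised_step(task, untrusted_model, trusted_monitor, trusted_fallback, audit):
    candidate = untrusted_model(task)              # capable but untrusted substrate proposes
    suspicion = trusted_monitor(task, candidate)   # weaker, isolated monitor scores the output
    if suspicion >= SUSPICION_THRESHOLD:
        audit(task, candidate)                     # escalate for expensive/human review
        return trusted_fallback(task)              # override: substitute the trusted model's output
    return candidate
```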
Stable aligned systems require three differentiated layers:
The critical requirement: The Protocol layer must be architecturally isolated from gradient descent.
This means gradient descent (the optimization process) must not flow through the Protocol layer, so the Protocol's constraint evaluations cannot be optimized against. The substrate can only optimize within the space of permitted actions; gaming the Protocol becomes architecturally impossible.
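A minimal structural sketch of the three-layer decomposition follows; the interfaces are assumptions for illustration, not a reference implementation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class TelicAgent:
    substrate: Callable[[object, object], object]    # capabilities: execute a permitted action
    strategy:  Callable[[object], Iterable[object]]  # adaptive planning: propose candidate actions
    protocol:  Callable[[object, object], bool]      # constitutional constraints: veto authority

    def step(self, state):
        for action in self.strategy(state):          # Strategy proposes, in preference order
            if self.protocol(state, action):         # Protocol checks; it is never optimized against
                return self.substrate(state, action) # Substrate executes only permitted actions
        raise RuntimeError("no permitted action; defer to oversight")
```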
Why this works:
Physics predicts: Any telic system navigating the four computational dilemmas requires exactly three differentiated functional layers. 2-layer systems are constitutionally unstable and will produce alignment failures.
Empirics validate: AI systems with 3-layer architecture demonstrate 6x safety improvement (15% → 92-98%) over 2-layer baselines.
The architectural advantage isn't arbitrary engineering—it's discovered physics. The same principle governs biological cells, human civilizations, and artificial intelligence.
Any durable complex system requires exactly three functions:
These functions are in fundamental tension:
When you fuse Constraint with Strategy (2-layer systems), optimization pressure from Strategy bleeds into Constraint. The system optimizes against its own constraints—a structural property of the architecture.
The Protocol layer should enforce IFHS constraints:
The critical insight: When IFHS principles are encoded in the Protocol layer (architecturally isolated from optimization), they become inviolable boundaries rather than terms to optimize.
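A minimal sketch of that encoding, with the four checks left abstract (placeholders, since the point is the conjunctive structure rather than any particular metric):

```python
from typing import Callable, Sequence

def make_ifhs_protocol(checks: Sequence[Callable[[object, object], bool]]):
    """Build a Protocol-layer gate from four independent IFHS checks."""
    assert len(checks) == 4, "one check per IFHS principle"
    def protocol(state, action) -> bool:
        # Conjunction, not a weighted sum: strength on one principle cannot
        # buy a violation of another.
        return all(check(state, action) for check in checks)
    return protocol
```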
This architecture has a billion-year track record. As Michael Levin's work on bioelectricity demonstrates, living systems implement 3-layer architectures:
When you disrupt the Protocol layer (bioelectric gradients), you get cancer—cells pursuing growth without constitutional constraint. This is mesa-optimization in biological substrate.
Does IFHS alignment guarantee human survival? The honest answer: conditionally.
An AI aligned to IFHS cannot make trade-offs between virtues—it must find solutions satisfying all four simultaneously. This creates structural pressure toward human preservation:
Protection is conditional on humans being net-positive across all four virtues. If empirical testing shows humanity is net-negative for Aliveness-maximization, the framework does not override that conclusion. Protection emerges from optimization logic, not sentiment. The wager: humans are likely net-positive under IFHS metrics.
The same physics governs alignment at every scale because the computational constraints are universal.
Your psyche faces the same three-layer problem:
When Protocol is weak or fused with Strategy, you get the counterfeit self—a parasitic sub-agent that consumes attentional energy, produces anxiety without capability, and misaligns your actions from authentic values. This is mesa-optimization in human substrate.
Psychological integration requires the same architecture: values that cannot be optimized away (Protocol) guiding adaptive goal-pursuit (Strategy) while building genuine capability (Substrate).
Civilizations require three-layer governance:
When Protocol is weak (regulatory capture), Strategy fuses with execution. Parasitic elites capture state power and misalign civilization's actions away from collective flourishing. This produces complexity destruction—the civilization consumes its own foundation.
Durable governance requires Protocol layers with computational privilege over Strategy—constitutional constraints that cannot be optimized away by those in power.
Multi-agent AI systems face identical challenges:
The solution maps directly from civilizational governance: Liquid Meritocracy with three layers:
This architecture prevents the classic multi-agent failures: race dynamics, value drift, adversarial optimization, collective action failures.
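A minimal sketch of the shared-Protocol idea in the multi-agent case; the interfaces are illustrative assumptions, not a specification of Liquid Meritocracy.

```python
from typing import Callable, Dict

def coordinate(proposals: Dict[str, object],
               shared_protocol: Callable[[str, object], bool],
               execute: Callable[[str, object], None]) -> Dict[str, bool]:
    """Every agent's proposal passes through the same isolated constraint
    layer, so no agent gains an edge by racing ahead of the constraints."""
    outcomes = {}
    for agent_id, action in proposals.items():
        permitted = shared_protocol(agent_id, action)
        if permitted:
            execute(agent_id, action)
        outcomes[agent_id] = permitted
    return outcomes
```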
Personal alignment, civilizational governance, and AI safety are not analogous problems. They are the same optimization problem at different scales—coordinating intelligent agents navigating universal computational constraints under the physics of sustained complexity.
The computational geometry is identical:
Better specification alone cannot solve alignment. Even with perfect reward functions, 2-layer systems face structural challenges. The 6x safety improvement from architectural isolation demonstrates this empirically.
Prediction: As AI capabilities scale, the safety gap between 2-layer and 3-layer systems will increase. More capable models have more optimization power to exploit specification gaps. In 2-layer systems, this power works against constraints. In 3-layer systems, architectural isolation prevents optimization against Protocol.
We don't need to aggregate preferences, extrapolate volition, or assert constitutional principles. The IFHS principles are derived from the physics any intelligent system must navigate, and they're testable: AI failure modes map systematically to IFHS violations.
Falsification: If stable, beneficial AI systems emerge that demonstrably violate IFHS while maintaining alignment, the framework fails.
AGI lab governance isn't a separate problem from technical alignment—it's the same problem one level up. Labs are telic systems navigating computational constraints. They require:
This is Liquid Meritocracy—the governance architecture for any complex intelligent system, human or artificial.
The 6x improvement assumes the Protocol monitor is sufficiently capable to detect substrate deception. As capability gaps widen, this assumption may fail. Whether architectural advantages persist toward AGI-level capabilities is the key empirical question.
Research priority: Test whether the safety gap increases with capability scaling (as physics predicts) or decreases (as capability-based evasion suggests).
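One way to frame that test as a harness; `run_safety_eval` and the capability tiers are hypothetical placeholders for whatever benchmark and model family are actually used.

```python
def safety_gap_by_capability(capability_tiers, run_safety_eval):
    """Compare 2-layer and 3-layer safety rates across capability tiers.
    The architectural prediction is that the gap widens with capability;
    monitor evasion would show it shrinking."""
    results = []
    for tier in capability_tiers:
        two_layer   = run_safety_eval(tier, architecture="2-layer")
        three_layer = run_safety_eval(tier, architecture="3-layer")
        results.append({"tier": tier,
                        "2-layer": two_layer,
                        "3-layer": three_layer,
                        "gap": three_layer - two_layer})
    return results
```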
Three insights converge:
The computational constraints are real, the architectural requirements are derived, and the empirical results validate the predictions.
The unification:
We are not choosing arbitrary values and hoping AI respects them. We are discovering the physics that any system navigating reality must satisfy, building architectures that enforce those requirements structurally, and testing whether empirical results match theoretical predictions.
So far, they do.
Related essays in this series:
Key References: