How empirical results validate universal physics predictions, and why the "align to what?" and "how to enforce?" problems have the same solution
The AI safety field has fractured into separate research streams: some work on alignment targets (what values should AI optimize?), others on architectural control (how do we enforce those values?), and still others on governance (who decides?). These are treated as independent problems requiring independent solutions.
They are not independent. They are different views of the same underlying physics.
This essay shows how three seemingly separate insights—universal computational constraints, architectural isolation, and the pattern of Aliveness—converge on a unified framework. Recent empirical results in AI safety validate predictions derived from first-principles physics.
Current approaches to AI alignment lack non-arbitrary foundations:
All of these approaches share a fatal flaw: they treat alignment targets as arbitrary preferences to be asserted, rather than as requirements to be discovered and derived.
Any intelligent system—biological or artificial—is a goal-directed agent fighting entropy. It must navigate four inescapable physical trade-offs:
These aren't philosophical choices. They're computational necessities imposed by thermodynamics, information theory, and control systems theory. Any AI navigating physical reality faces them.
For systems whose goal is sustained flourishing (not paperclips, not wireheading, but durable creative possibility), these dilemmas have optimal synthetic solutions:
These are not preferences. They are discovered stability requirements. Systems that violate them face predictable failure modes:
The entire landscape of AI catastrophic risks maps to violations of these four principles. This is not coincidence—it's physics.
Most current AI systems are effectively 2-layer:
Where are the constraints? They're fused with the reward function—encoded as penalty terms in the optimization objective.
The structural problem: The reward function combines goal achievement and constraint violations into a single optimization target. Gradient descent optimizes this combined objective, treating constraints as just more terms to trade against task reward, not as inviolable boundaries. Because both goal pursuit and constraint satisfaction flow through the same optimization process, the system is incentivized to game constraints rather than respect them.
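To make the fusion concrete, here is a minimal sketch, assuming a toy dict-based state; `task_reward`, `constraint_violation`, and `penalty_weight` are illustrative placeholders, not any particular system's API:

```python
def task_reward(state, action) -> float:
    # Toy stand-in for the task objective.
    return float(action == state.get("best_action"))

def constraint_violation(state, action) -> float:
    # Toy stand-in for a safety constraint check (1.0 = violated).
    return float(action in state.get("forbidden", set()))

def fused_reward(state, action, penalty_weight: float = 10.0) -> float:
    """2-layer pattern: goal and constraint collapse into one scalar objective."""
    return task_reward(state, action) - penalty_weight * constraint_violation(state, action)

# Whatever optimizes this scalar is free to trade violations against task
# reward whenever the exchange rate is favorable: the constraint is a price,
# not a boundary.
```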
This creates predictable failures:
Recent formal work proves this isn't just intuition—it's geometric necessity. Skalse et al. (2023, published at ICLR 2024) rigorously demonstrated that in reinforcement learning, optimization follows a path through "policy space" where:
This is a structural property of optimization, not a specification problem. Even with "correct" proxy rewards, geometric constraints create systematic misalignment under sufficient optimization pressure.
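The effect is easy to reproduce in a toy setting. The sketch below is not Skalse et al.'s construction, just an illustration with made-up reward values: exponentiated-gradient ascent on a proxy reward over a three-action policy simplex, tracking the true reward as optimization pressure increases.

```python
import numpy as np

r_true  = np.array([1.0, 0.9, 0.0])   # what we actually care about
r_proxy = np.array([0.9, 1.0, 0.0])   # imperfect proxy, highly correlated

policy = np.ones(3) / 3               # start from the uniform policy
eta, steps = 0.05, 1000
true_values = []

for _ in range(steps):
    # Mirror/exponentiated-gradient ascent step on the *proxy* reward.
    policy = policy * np.exp(eta * r_proxy)
    policy /= policy.sum()
    true_values.append(policy @ r_true)

peak = int(np.argmax(true_values))
print(f"true reward peaks at step {peak}: {true_values[peak]:.3f}")
print(f"true reward after full proxy optimization: {true_values[-1]:.3f}")
# Typical output: the peak (~0.93) occurs early; continued proxy optimization
# drives true reward back down toward 0.90 -- Goodharting from over-optimization.
```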
The geometric proof from Skalse et al. applies to rewards within the optimization system. Architectural isolation moves the Protocol layer outside this optimization game—constraints become enforcement boundaries, not optimization targets.
This explains why Greenblatt et al.'s monitoring achieves a 6x improvement: the Protocol isn't just "another term in the reward function." It's architecturally isolated from the optimization process that Skalse et al. prove drives Goodharting.
The Protocol evaluates constraints independently. Because gradient descent doesn't flow through it, the substrate cannot learn to game it. The geometric mechanism that causes Goodharting simply doesn't apply to architecturally isolated constraints.
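A minimal sketch of what "gradient descent doesn't flow through it" can look like in practice, assuming a PyTorch-style setup; the module, check, and fallback here are toy placeholders:

```python
import torch
import torch.nn as nn

policy_net = nn.Linear(4, 3)                 # toy stand-in for Substrate + Strategy

def protocol_check(state, action) -> bool:   # toy stand-in for the Protocol: a fixed rule
    return int(action) != 2                  # e.g. action 2 is always forbidden

def select_action(state, fallback_action: int = 0):
    logits = policy_net(state)               # differentiable path: gradients can flow here
    action = torch.argmax(logits, dim=-1)
    with torch.no_grad():                    # Protocol evaluated outside the training graph
        permitted = protocol_check(state, action)
    return action if permitted else torch.tensor(fallback_action)

print(select_action(torch.randn(4)))
```

The contrast with the fused objective above: the check's output is a decision, not a differentiable quantity, so there is no gradient signal telling the substrate how to make violations look less costly.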
Greenblatt et al. (2024) tested AI systems explicitly tasked with inserting backdoors during deployment:
6x improvement from architectural isolation.
The key mechanism: A separate, less-capable monitor (GPT-3.5) with override authority supervising a more capable substrate (GPT-4). The monitor is architecturally isolated—not part of the end-to-end training. The substrate cannot optimize against it via gradient descent.
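Schematically, the pattern looks like the sketch below; the function names and threshold are illustrative assumptions, not the paper's protocol or code.

```python
SUSPICION_THRESHOLD = 0.8   # hypothetical escalation cutoff

def supervised_step(task, untrusted_model, trusted_monitor, trusted_fallback, audit):
    candidate = untrusted_model(task)              # capable but untrusted substrate proposes
    suspicion = trusted_monitor(task, candidate)   # weaker, isolated monitor scores the output
    if suspicion >= SUSPICION_THRESHOLD:
        audit(task, candidate)                     # escalate for expensive/human review
        return trusted_fallback(task)              # override: substitute the trusted model's output
    return candidate
```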
Stable aligned systems require three differentiated layers:
The critical requirement: The Protocol layer must be architecturally isolated from gradient descent.
This means gradient descent (the optimization process) must not flow through the Protocol layer, so the Protocol's constraint evaluations cannot be optimized against. The substrate can only optimize within the space of permitted actions; gaming the Protocol becomes architecturally impossible.
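A minimal structural sketch of the three-layer decomposition follows; the interfaces are assumptions for illustration, not a reference implementation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class TelicAgent:
    substrate: Callable[[object, object], object]    # capabilities: execute a permitted action
    strategy:  Callable[[object], Iterable[object]]  # adaptive planning: propose candidate actions
    protocol:  Callable[[object, object], bool]      # constitutional constraints: veto authority

    def step(self, state):
        for action in self.strategy(state):          # Strategy proposes, in preference order
            if self.protocol(state, action):         # Protocol checks; it is never optimized against
                return self.substrate(state, action) # Substrate executes only permitted actions
        raise RuntimeError("no permitted action; defer to oversight")
```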
Why this works:
Physics predicts: Any telic system navigating the four computational dilemmas requires exactly three differentiated functional layers. 2-layer systems are constitutionally unstable and will produce alignment failures.
Empirics validate: AI systems with 3-layer architecture demonstrate 6x safety improvement (15% → 92-98%) over 2-layer baselines.
The architectural advantage isn't arbitrary engineering—it's discovered physics. The same principle governs biological cells, human civilizations, and artificial intelligence.
Any durable complex system requires exactly three functions:
These functions are in fundamental tension:
When you fuse Constraint with Strategy (2-layer systems), optimization pressure from Strategy bleeds into Constraint. The system optimizes against its own constraints—a structural property of the architecture.
The Protocol layer should enforce IFHS constraints:
The critical insight: When IFHS principles are encoded in the Protocol layer (architecturally isolated from optimization), they become inviolable boundaries rather than terms to optimize.
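A minimal sketch of that encoding, with the four checks left abstract (placeholders, since the point is the conjunctive structure rather than any particular metric):

```python
from typing import Callable, Sequence

def make_ifhs_protocol(checks: Sequence[Callable[[object, object], bool]]):
    """Build a Protocol-layer gate from four independent IFHS checks."""
    assert len(checks) == 4, "one check per IFHS principle"
    def protocol(state, action) -> bool:
        # Conjunction, not a weighted sum: strength on one principle cannot
        # buy a violation of another.
        return all(check(state, action) for check in checks)
    return protocol
```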
This architecture has a billion-year track record. As Michael Levin's work on bioelectricity demonstrates, living systems implement 3-layer architectures:
When you disrupt the Protocol layer (bioelectric gradients), you get cancer—cells pursuing growth without constitutional constraint. This is mesa-optimization in biological substrate.
Does IFHS alignment guarantee human survival? The honest answer: conditionally.
An AI aligned to IFHS cannot make trade-offs between virtues—it must find solutions satisfying all four simultaneously. This creates structural pressure toward human preservation:
Protection is conditional on humans being net-positive across all four virtues. If empirical testing shows humanity is net-negative for Aliveness-maximization, the framework does not override that conclusion. Protection emerges from optimization logic, not sentiment. The wager: humans are likely net-positive under IFHS metrics.
The same physics governs alignment at every scale because the computational constraints are universal.
Your psyche faces the same three-layer problem:
When Protocol is weak or fused with Strategy, you get the counterfeit self—a parasitic sub-agent that consumes attentional energy, produces anxiety without capability, and misaligns your actions from authentic values. This is mesa-optimization in human substrate.
Psychological integration requires the same architecture: values that cannot be optimized away (Protocol) guiding adaptive goal-pursuit (Strategy) while building genuine capability (Substrate).
Civilizations require three-layer governance:
When Protocol is weak (regulatory capture), Strategy fuses with execution. Parasitic elites capture state power and misalign civilization's actions away from collective flourishing. This produces complexity destruction—the civilization consumes its own foundation.
Durable governance requires Protocol layers with computational privilege over Strategy—constitutional constraints that cannot be optimized away by those in power.
Multi-agent AI systems face identical challenges:
The solution maps directly from civilizational governance: Liquid Meritocracy with three layers:
This architecture prevents the classic multi-agent failures: race dynamics, value drift, adversarial optimization, collective action failures.
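A minimal sketch of the shared-Protocol idea in the multi-agent case; the interfaces are illustrative assumptions, not a specification of Liquid Meritocracy.

```python
from typing import Callable, Dict

def coordinate(proposals: Dict[str, object],
               shared_protocol: Callable[[str, object], bool],
               execute: Callable[[str, object], None]) -> Dict[str, bool]:
    """Every agent's proposal passes through the same isolated constraint
    layer, so no agent gains an edge by racing ahead of the constraints."""
    outcomes = {}
    for agent_id, action in proposals.items():
        permitted = shared_protocol(agent_id, action)
        if permitted:
            execute(agent_id, action)
        outcomes[agent_id] = permitted
    return outcomes
```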
Personal alignment, civilizational governance, and AI safety are not analogous problems. They are the same optimization problem at different scales—coordinating intelligent agents navigating universal computational constraints under the physics of sustained complexity.
The computational geometry is identical:
Better specification alone cannot solve alignment. Even with perfect reward functions, 2-layer systems face structural challenges. The 6x safety improvement from architectural isolation demonstrates this empirically.
Prediction: As AI capabilities scale, the safety gap between 2-layer and 3-layer systems will increase. More capable models have more optimization power to exploit specification gaps. In 2-layer systems, this power works against constraints. In 3-layer systems, architectural isolation prevents optimization against Protocol.
We don't need to aggregate preferences, extrapolate volition, or assert constitutional principles. The IFHS principles are derived from the physics any intelligent system must navigate, and they're testable: AI failure modes map systematically to IFHS violations.
Falsification: If stable, beneficial AI systems emerge that demonstrably violate IFHS while maintaining alignment, the framework fails.
AGI lab governance isn't a separate problem from technical alignment—it's the same problem one level up. Labs are telic systems navigating computational constraints. They require:
This is Liquid Meritocracy—the governance architecture for any complex intelligent system, human or artificial.
The 6x improvement assumes the Protocol monitor is sufficiently capable to detect substrate deception. As capability gaps widen, this assumption may fail. Whether architectural advantages persist toward AGI-level capabilities is the key empirical question.
Research priority: Test whether the safety gap increases with capability scaling (as physics predicts) or decreases (as capability-based evasion suggests).
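One way to frame that test as a harness; `run_safety_eval` and the capability tiers are hypothetical placeholders for whatever benchmark and model family are actually used.

```python
def safety_gap_by_capability(capability_tiers, run_safety_eval):
    """Compare 2-layer and 3-layer safety rates across capability tiers.
    The architectural prediction is that the gap widens with capability;
    monitor evasion would show it shrinking."""
    results = []
    for tier in capability_tiers:
        two_layer   = run_safety_eval(tier, architecture="2-layer")
        three_layer = run_safety_eval(tier, architecture="3-layer")
        results.append({"tier": tier,
                        "2-layer": two_layer,
                        "3-layer": three_layer,
                        "gap": three_layer - two_layer})
    return results
```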
Three insights converge:
The computational constraints are real, the architectural requirements are derived, and the empirical results validate the predictions.
The unification:
We are not choosing arbitrary values and hoping AI respects them. We are discovering the physics that any system navigating reality must satisfy, building architectures that enforce those requirements structurally, and testing whether empirical results match theoretical predictions.
So far, they do.
Related essays in this series:
Key References: