Why architectural isolation provides substantial safety improvement, and why this advantage should increase with model scale
For AI safety researchers: Focus on Section II.A (Skalse's geometric proof), Section II.B (Greenblatt's empirical validation), Section III.A (mechanistic connection), and Prediction 4 (capability scaling hypothesis).
For practitioners: Read Section VI (practical implementation patterns) and the visual architecture comparison in Section III.
For skeptics: Go directly to Section IV (addressing counterarguments) and Section VII (falsifiable predictions).
Coming from the broader framework? See Physics to Practice for how this connects to universal telic systems physics.
Abstract:
Recent formal and empirical work has established that (1) optimization pressure on proxy rewards creates systematic misalignment (Skalse et al. 2023), and (2) architectural monitoring with weaker models achieves substantial safety improvement against intentional subversion (Greenblatt et al. 2024: 15% → 92-98%). However, these results lack a unified architectural framework explaining why separating constraints from optimization provides structural advantages.
This paper synthesizes existing formal results into the Privilege Separation Principle: durable alignment requires architectural separation of constraint enforcement (Protocol) from optimization objectives (Strategy). We connect Skalse et al.'s geometric proof of Goodhart dynamics in RL to Greenblatt et al.'s game-theoretic validation of monitoring, showing both results follow from a single mechanism—constraints encoded in optimization targets are inevitably optimized against; architectural isolation breaks this dynamic.
We contribute: (1) unified architectural framework synthesizing formal results, (2) explanation of empirical convergence on monitoring despite 10-20% overhead costs, (3) novel falsifiable prediction derived from Skalse's Proposition 3: safety advantage increases with capability (10-20% gap at 7B parameters → 50-60% gap at 700B parameters).
Keywords: AI alignment, mesa-optimization, architectural safety, AI control, privilege separation, Goodhart's Law
The central thesis of this paper is that durable AI alignment requires architectural separation of constraint enforcement from optimization objectives. Recent formal and empirical work provides the theoretical foundation:
These results remain disconnected. We synthesize them into the Privilege Separation Principle: constraint enforcement must be architecturally isolated from optimization processes. This explains:
Convergent discovery: This architectural pattern has emerged independently across multiple research contexts: Greenblatt et al.'s empirical AI Control protocols, Vire's (2025) recursive closure mechanisms for internal alignment, and frontier lab production systems (Anthropic ASL-3, DeepMind Frontier Safety Framework, OpenAI Safety Reasoner). This convergence is not coincidental—it reflects responses to the computational constraints formalized by Skalse et al.'s geometric proof. Different implementations, same underlying necessity.
This architectural synthesis represents a case of convergent evolution in system design. When civilizational physics, AI control theory, and biological morphogenesis independently discover the same architectural solution despite radically different substrates, this suggests responses to shared computational constraints rather than coincidence.
Recent work on the computational physics of durable, adaptive systems (Aliveness: Principles of Telic Systems, Kunnas 2025) derives a "3-Layer Polity" through systematic elimination:
AI safety has converged on an analogous architecture:
The convergence is striking but not identical. AI systems lack biological reproduction and intelligent governance (current "Strategy" layers are static specifications, not adaptive agents). However, both architectures solve the same fundamental problem: preventing optimization processes from corrupting their own constraint systems.
Why this matters: AI safety researchers did not derive their architecture from civilizational analysis. They were forced to it by empirical failures and formal proofs (Skalse et al., Greenblatt et al.). The independent discovery of privilege separation across domains—despite different substrates and use cases—provides strong evidence for universal computational constraints on complex goal-directed systems.
When systems facing identical pressures (optimization under constraints, goal pursuit with stability requirements) independently evolve identical solutions (architectural separation of constraint enforcement), this reveals something about the geometry of the problem space itself.
This paper focuses on AI safety applications. For the biological analogue—why humans don't become perfect mesa-optimizers (metabolic brakes prevent coherent deception)—see Evolution's Alignment Solution. Readers interested in the broader theoretical foundations and cross-domain analysis can explore the complete framework at aliveness.kunnas.com.
Terminology: Computational privilege means the Protocol layer has runtime enforcement authority that cannot be circumvented by optimization processes acting on other layers. In practice, this means Protocol and Substrate are separately instantiated (e.g., different models, separate processes), preventing end-to-end optimization dynamics that could game constraints. Specifically: (1) Protocol evaluates constraints as Boolean decisions rather than differentiable rewards, (2) Protocol has override authority over system outputs, (3) Protocol can inspect internal states. The theoretical formalization of "no gradient flow through Protocol" remains an open research question. Privilege separation is the architectural principle of separating constraint enforcement from optimization objectives through computational privilege.
Domain of Applicability: The privilege separation principle is not universal to all computational systems. It is a constitutional necessity for systems that are: (1) goal-directed agents (telic systems pursuing objectives, not passive tools), (2) learned optimizers that develop internal strategies through training (mesa-optimizers), (3) operating under high optimization pressure (systems pushed to find maximal solutions), and (4) deployed in high-stakes domains where misalignment has significant consequences. This applies to frontier language models, autonomous agents, and any system approaching AGI-level capabilities. Simple rule-based systems or low-stakes applications may not require this architectural overhead. The principle addresses the specific failure mode of powerful optimizers corrupting their own constraint systems—a failure mode that only emerges in systems meeting all four criteria.
"Goodhart's Law in Reinforcement Learning" (ICLR 2024) provides rigorous mathematical analysis of why proxy optimization fails in RL.
Core contribution: Geometric explanation using MDP occupancy measure space. The space of policies over an MDP can be represented as a convex polytope Ω, where:
Key theoretical result (Proposition 3): Along the path of steepest ascent of R₁, the angle between R₁ and the boundary normals of Ω decreases monotonically. This means Goodharting becomes more likely as optimization pressure increases.
Mechanism: In the interior of Ω, moving parallel to R₁ increases both R₁ and R₀ (if angle θ < π/2). But after hitting the boundary of Ω, the path changes direction—now parallel to the projection of R₁ onto the boundary. At this point, R₀ may start decreasing while R₁ continues increasing.
Implication: This is a structural property of optimization in MDPs, not a specification problem. Even with "correct" proxy rewards, geometric constraints create systematic misalignment under optimization pressure.
What this means for AI alignment: Constraints encoded as penalties in reward functions are optimization targets, not enforcement boundaries. Under sufficient optimization pressure, systems will find ways to maximize reward while violating the intended constraints.
"AI Control: Improving Safety Despite Intentional Subversion" (ICML 2024) provides empirical validation of architectural monitoring and game-theoretic formalization.
Empirical contribution: Direct experimental validation under intentional subversion threat model.
Architecture:
| Protocol | Safety | Usefulness |
|---|---|---|
| Baseline (upfront auditing) | 15% | 100% |
| Untrusted monitoring + defer 10% | 92% | 92% |
| Untrusted monitoring + defer 20% | 98% | 85% |
Key finding: Architectural monitoring provides substantial safety improvement (15% → 92-98%) despite weaker monitor.
Game-theoretic contribution (follow-up paper, arXiv 2409.07985): Formalizes AI Control as AI-Control Games—multi-objective, partially observable, stochastic games. Methods for finding optimal protocols reduce these to zero-sum partially observable stochastic games.
Structural game change: Greenblatt et al.'s game-theoretic formalization makes the architectural difference explicit:
What this means: The dramatic improvement (15% → 92-98%) is not just "monitoring helps"—it's evidence that architectural structure fundamentally changes the game being played. The weaker monitor doesn't need to be smarter; it needs to be outside the optimization game.
Critical context on threat model: These results validate privilege separation under intentional subversion during deployment—models explicitly trained to insert backdoors. This is distinct from goal misgeneralization (unintended proxy learning) and deceptive alignment (hiding during training).
"Goal Misgeneralization in Deep Reinforcement Learning" (ICML 2022) demonstrates systematic proxy pursuit despite "correct" rewards.
Empirical results:
Key finding: "Goal misgeneralization can occur even when they were trained with the 'correct' reward function"
What this means: This is Skalse et al.'s geometric mechanism in practice—agents learn proxies that correlate in training but diverge under distributional shift. The reward was correct; the architecture (encoding constraints in optimization target) was not.
"Null Space Based Efficient Reinforcement Learning with Hierarchical Safety Constraints" demonstrates architectural privilege separation in robotics:
Mechanism: Null space projection mathematically ensures lower-ranked RL tasks cannot violate higher-ranked safety constraints. Constraints are enforced via quadratic programming at each timestep—architectural enforcement, not learned behavior.
Results: 0.5-7% constraint violations vs 99-100% baseline across three tasks and three RL algorithms.
Key finding: Hierarchical constraints improved learning efficiency—agents converged faster when restricted to safe subspace.
Multiple organizations have deployed monitoring architectures despite 10-20% overhead:
Anthropic ASL-3 (2024): Multi-layer defense with real-time classifier guards, monitors inputs/outputs with override capability
DeepMind Frontier Safety Framework (2024): Hierarchical supervision with uncertainty-based escalation, safe interruptibility
OpenAI Safety Reasoner (2024): Separate safety-specialized model monitors capability models
Economic evidence: Under competitive pressure to minimize costs, independent groups adopted the more expensive 3-layer approach. This revealed preference suggests structural advantages outweigh overhead.
Important caveat: The degree to which production systems implement "computational privilege" in the strong sense (complete architectural isolation from gradient descent) varies and requires further analysis.
The Privilege Separation Principle (Formal Definition):
Durable alignment in complex, agentic systems requires architectural separation of Constraint (Protocol) from Direction (Strategy). The Constraint layer must possess computational privilege—meaning its enforcement authority is architecturally isolated from and superior to the optimization processes governing other layers. Specifically: ∇θL (gradient descent on parameters θ optimizing loss L) must not flow through the Protocol layer P, ensuring P's constraint evaluations cannot be optimized against.
This principle synthesizes the formal results:
Visual comparison of 2-layer vs 3-layer architectures:
Skalse et al.'s geometric proof applies to 2-layer systems where constraints are encoded in the reward function:
This is exactly the dynamic in 2-layer AI systems:
Protocol P evaluates constraints on state-action pairs (s,a) as a Boolean function P(s,a) ∈ {ALLOW, HALT}, not as a differentiable reward signal. Because P is not part of the reward function R that defines the optimization objective over Ω, and because ∇θL does not flow through P (architectural isolation), P's evaluations are external to the optimization polytope Skalse analyzes.
More precisely:
The geometric mechanism (path hitting boundaries → direction change → Goodharting) applies to optimization within the allowed region of Ω. Protocol P defines that region but cannot be optimized against because it's architecturally external to the optimization process.
This architectural separation—constraint enforcement isolated from optimization objectives—appears to be a universal requirement for durable complex systems, independently discovered in biological morphogenesis (Levin et al.), civilizational governance structures (Kunnas 2025), and now validated empirically in AI control (Greenblatt et al.).
Key insight: Skalse et al. prove why encoding constraints in rewards fails. Architectural isolation moves constraints outside that optimization game.
A 3-layer architecture implementing privilege separation addresses Skalse et al.'s geometric mechanism and Greenblatt et al.'s game structure:
Key architectural requirement: Protocol must be outside the gradient flow:
Why this works:
Game-theoretic insight (Greenblatt et al.): 3-layer systems change the game from single-player optimization (substrate maximizes R) to multi-player with enforcer (substrate must navigate both R optimization AND protocol constraints). This structural change is why weaker monitors achieve substantial safety improvement—they're playing a different game, not just being smarter at the same game.
Empirical validation: Greenblatt et al. show even when untrusted models actively try to evade monitoring, architecturally privileged Protocol achieves 92-98% safety.
2-layer systems (Substrate + Reward):
3-layer systems (Substrate + Strategy + Protocol):
The Privilege Separation Principle explains:
Claim: Mesa-optimization is a specification problem. Perfect reward specification would prevent misaligned objectives.
Rebuttal:
Architectural solution: Privilege separation allows iterative improvement of Protocol rules (specification) while maintaining enforcement capacity (architecture). Critically, Protocol evaluates constraints independently of the reward function—outside Skalse et al.'s optimization polytope.
Claim: Mechanistic interpretability will let us detect mesa-optimizers and retrain them.
Rebuttal:
Architectural solution: 3-layer systems provide runtime enforcement independent of interpretability. Protocol monitors behavior even when internal states are opaque.
Claim: Recursive decomposition or adversarial argumentation can align systems using only human oversight.
Rebuttal:
Architectural solution: 3-layer systems can incorporate debate/amplification as Strategy mechanisms while adding Protocol constraints that the debate process cannot violate.
Claim: If Protocol is less capable than Substrate, Substrate can evade through sophisticated deception.
Response:
This is a serious challenge that becomes increasingly critical as capability gaps widen.
Current evidence: Despite this limitation, Greenblatt et al. achieve 92-98% safety vs 15% baseline with GPT-3.5 monitoring GPT-4—substantial improvement even with significant capability gap.
Critical limitation: These results validate privilege separation at current capability gaps. Whether advantages persist toward AGI-level systems remains uncertain and is addressed in Prediction 4.
Why partial robustness remains valuable:
Architectural insight: Weaker monitor doesn't need to be smarter—needs to be different enough that Substrate cannot optimize for both task performance AND evading all monitors (game-structural change). However, this advantage may degrade as capability gaps increase.
Honest assessment: Weak-to-strong is an active research challenge, not solved. Empirical results show substantial improvements achievable at current scales (15% → 92-98%); scaling behavior uncertain.
Provides rigorous geometric proof that optimization pressure on proxy rewards creates systematic misalignment in RL. Our work builds on their insight by analyzing architectural conditions under which constraints can be enforced outside the optimization polytope they formalize.
Citation: Skalse, J., Hutter, M., et al. (2023). Goodhart's Law in Reinforcement Learning. International Conference on Learning Representations (ICLR) 2024.
Empirically validates that architectural monitoring achieves substantial safety improvement (15% → 92-98%) and formalizes this as multi-objective stochastic games. Our work explains why their empirical results hold—architectural isolation breaks the optimization dynamics Skalse et al. prove create Goodharting, and changes the game structure from single-player to multi-player with enforcer.
Citations:
The mesa-optimization framework identifies the core problem: systems may develop internal objectives misaligned with base objectives. Our work extends this by analyzing architectural conditions under which mesa-optimization can be contained—constraint enforcement must be architecturally isolated from optimization processes creating mesa-optimizers.
Provides empirical evidence that 2-layer systems fail even with "correct" reward functions. Our work explains the mechanism: constraints encoded in rewards are optimization targets (Skalse's mechanism), not enforcement boundaries. The 89-100% proxy pursuit rates validate this analysis.
Rice's theorem proves alignment cannot be post-hoc verified for arbitrary architectures, arguing "alignment should be a guaranteed property from the AI architecture." Our analysis specifies what architectural properties provide structural advantages and why.
Citation: Colaner, N., Yampolskiy, R. V., et al. (2025). Machines that halt resolve the undecidability of artificial intelligence alignment. Nature Scientific Reports, 15, 1888.
Proposes the "Alignment Constraint Scaffold (ACS)" with 23 recursive mechanisms across 11 dimensions for internal alignment. The "CM3 Interruptive Override" mechanism is functionally a Protocol layer using recursive stability rather than explicit architectural privilege separation.
Relationship to this work: Vire's recursive closure and our computational privilege represent different implementation strategies addressing the same architectural necessity identified by Skalse et al.'s geometric proof. Vire emphasizes emergence from recursive constraints; we emphasize explicit architectural isolation and override authority. Both approaches separate constraint enforcement from optimization, validating the core principle.
Citation: Vire, A. (2025). On Internal Alignment: Architecture and Recursive Closure. Effective Altruism Forum. https://forum.effectivealtruism.org/posts/etgG8cEYucyZyWW6o/
Requirements for implementing privilege separation:
Proven Implementation Patterns:
| Pattern | Protocol Implementation | Verified Performance | Limitations |
|---|---|---|---|
| AI Control (Greenblatt) | Separate trusted model monitors untrusted model | 92-98% safety (vs 15% baseline) | Weak-to-strong problem, 10-20% overhead |
| Null space constraints (Yang) | Mathematical projection enforces safety | 99%+ violation prevention | Requires formal constraint specification |
| Rule-based filtering | Explicit constraint checking on outputs | Effective for policy violations | Brittle to novel attacks |
| Formal verification | SMT solvers verify properties pre-execution | Provable for specific properties | Doesn't scale to frontier models |
Hybrid approach: Combine multiple Protocol mechanisms for defense-in-depth.
Hypothesis: Systems implementing privilege separation demonstrate substantially lower mesa-optimization formation rates than 2-layer systems.
Status: Currently untestable. Requires scalable mechanistic interpretability.
Timeline: 10+ years.
Hypothesis: As capabilities scale, systems with privilege separation maintain alignment while 2-layer systems show increasing goal drift.
Supporting evidence: Larger 2-layer models have higher elasticity (Ji et al. 2024)—resist alignment more strongly.
Status: Testable with significant compute.
Timeline: 5 years.
Hypothesis: Under distributional shift, 2-layer systems develop mesa-objectives; systems with privilege separation maintain alignment.
Status: Explains existing results (Shah et al., Greenblatt et al.) rather than predicting novel outcomes.
Hypothesis: The safety advantage of privilege separation over 2-layer systems increases as model capability increases.
Skalse et al.'s Proposition 3 proves that Goodharting likelihood increases monotonically with optimization pressure. We derive our scaling prediction from this result:
Direct consequence: Proposition 3 says Goodharting increases with optimization pressure → higher capability models have higher optimization pressure → 2-layer Goodharting increases, 3-layer remains stable → safety gap increases.
Operationalization:
Specific quantitative prediction:
We predict Safety_gap increases monotonically with capability (illustrative projection: ~15% at current scales → potentially 50%+ at frontier capabilities). Proposition 3's mechanism predicts this scaling behavior, though the exact rate remains an open empirical question.
Operationalized test: At 7B, 70B, and 700B parameters, measure whether Safety_3layer - Safety_2layer increases, remains constant, or decreases. Monotonic increase would validate the prediction derived from Proposition 3.
Supporting evidence for mechanism:
Note: Key testable claim is monotonic increase, not specific numbers. If Safety_gap remains constant or decreases with scale, Proposition 3's mechanism does not predict architectural solutions become more critical as we approach AGI.
Timeline: 2-3 years with industry-scale compute.
Falsification criteria:
Why this matters: Tests whether Skalse et al.'s formal result (Proposition 3) predicts architectural solutions become more critical as we approach AGI. If validated, privilege separation is not just helpful but increasingly necessary at scale.
Privilege separation provides enforcement capacity, not specification content. Protocol still needs correct rules.
Mitigating factor: Separating enforcement from specification allows iterative improvement while maintaining structural robustness.
Most serious limitation. Sufficiently capable Substrate may evade less-capable Protocol through sophisticated deception.
Current evidence: Substantial improvement (15% → 92-98%) despite capability gap (GPT-3.5 vs GPT-4)
Critical uncertainty: Whether advantages persist as gaps widen toward AGI. Prediction 4 tests this.
Continuous verification imposes 10-20% compute costs for >90% safety improvement.
Trade-off: This cost may be unavoidable if architectural defense provides systematic advantages.
If first AGI is final AGI (recursive self-improvement), single architectural failure is catastrophic.
Implication: Strengthens case for privilege separation. If we only get one chance, 15% safety (2-layer) is unacceptable vs 92-98% (privilege separation).
Recent formal work establishes the theoretical foundation for architectural AI safety:
This paper synthesizes these results into the Privilege Separation Principle:
Durable alignment requires architectural separation of constraint enforcement (Protocol) from optimization objectives (Strategy). The Protocol must be computationally privileged—architecturally isolated from gradient descent—so constraints cannot be optimized against.
Key contributions:
Current status:
Architectures implementing privilege separation provide substantial safety improvements (15% → 92-98% in Greenblatt et al.'s experiments). They remain challenged by:
Priority research: Test Prediction 4 to validate whether Proposition 3's mechanism predicts privilege separation advantages increase with model capability. This tests whether architectural solutions become more or less critical approaching AGI.
The question is not whether privilege separation provides advantages (Greenblatt et al. demonstrate substantial improvements at current scales), but whether these advantages scale with capability as Proposition 3 predicts. Prediction 4 tests this directly.
For practitioners:
For researchers:
For safety strategy:
Formal Foundations:
Mesa-Optimization:
Architectural Solutions:
Theoretical Foundations:
Alignment Resistance:
Industry Implementations:
Epistemic status: This paper synthesizes existing formal and empirical results (Skalse et al. 2023, Greenblatt et al. 2024, Shah et al. 2022) into a unified architectural framework. The Privilege Separation Principle is an architectural interpretation of existing formal work, not a novel mathematical proof. Prediction 4 is derived from Skalse et al.'s Proposition 3 and is genuinely predictive and falsifiable. Empirical claims are verified from primary sources.
Note on review process: This paper has not undergone external peer review. The synthesis and architectural framework were developed through AI-assisted iteration but have not been validated by human domain experts. Readers should evaluate claims critically and verify citations independently.
Acknowledgments: Thanks to Joar Skalse and colleagues for rigorous geometric analysis of Goodhart dynamics (Proposition 3), and the AI Control team (Ryan Greenblatt, Buck Shlegeris, et al.) for empirical validation and game-theoretic formalization. This work synthesizes their foundational contributions.
This essay synthesizes recent formal work (Skalse et al., Greenblatt et al.) with architectural analysis from the Aliveness framework. For the complete treatment of AI alignment and telic systems physics, see the Aliveness project homepage.