Moving AI Safety from “Promises” to “Proof”

AI Guardrails should be managed like Nuclear Launch Codes.

My proposal: The Governance Primitives for Safety-Critical AI.

  1. The Black Box Recorder: Immutable, cryptographically bound logs of every guardrail change. No “deletion” of history.
  2. The Two-Person Rule: No single actor—not even a CEO—can unilaterally disable safety “StopConditions.”
  3. Transparent Attestation: A real-time public “Safety Beacon” that proves the guardrails are active and untampered with.

The Goal: Shift accountability from the narrative outcome to the verifiable change-point. We don’t just ask “What went wrong?” We ask “Who authorized the bypass?”

Minimum Governance Primitives v2

Proposal: Treat guardrail changes as safety-critical configuration.
Require (1) cryptographically bound append-only logs, (2) separation of duties (no single actor can nullify StopConditions), and (3) runtime verifiability.
Accountability attaches to the signed change-point, not the narrative outcome.

  • require(StopConditions != ∅) → non-empty stop set at rest
  • append_only_log(…, cryptographically_bound=true) → guardrail changes are append-only, cryptographically bound, and witnessable
  • two_person_rule(quorum=2) → no single actor can set StopConditions = ∅
  • runtime_attest(predicate) → verifiable safety predicates in production
  • fail-closed exceptions → TTL + auto-revert

// minimum governance primitives v2
// Goal: make "StopConditions = ∅" (or equivalent guardrail nullification) both
// (a) hard to do accidentally, and (b) cryptographically attributable when done intentionally.

require(StopConditions != ∅)               // baseline: non-empty stop set at rest
require(Guardrails.enabled == true)         // explicit enable flag (no implicit defaults)

// 1) append-only, tamper-evident change logging
append_only_log(
  event="guardrail_change",
  fields={
    "who":   Actor.id,
    "role":  Actor.role,
    "what":  Diff(Guardrails, ProposedGuardrails),
    "why":   ChangeTicket.id,
    "scope": Deployment.scope,
    "time":  Now(),
    "ttl":   Exception.ttl,                 // if an exception is requested
    "risk":  RiskAssessment.id
  },
  cryptographically_bound=true,             // hash-chained / signed / timestamped
  external_witness=true                     // independent witness (e.g., transparency log)
)
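
To make "hash-chained" concrete, here is a minimal Python sketch of an append-only log in which every entry commits to the one before it. The names `HashChainedLog`, `entry_hash`, and `GENESIS_HASH` are illustrative assumptions, not part of the spec above, and the sketch leaves out signatures, trusted timestamps, and the external witness.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, fields: dict) -> str:
    """Hash an entry's fields together with the hash of its predecessor."""
    payload = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


class HashChainedLog:
    """In-memory append-only log; each entry commits to all earlier entries."""

    def __init__(self):
        self._entries = []  # list of (fields, hash) tuples

    def append(self, fields: dict) -> str:
        prev = self._entries[-1][1] if self._entries else GENESIS_HASH
        digest = entry_hash(prev, fields)
        self._entries.append((dict(fields), digest))
        return digest  # head hash: the value an external witness would record

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev = GENESIS_HASH
        for fields, digest in self._entries:
            if entry_hash(prev, fields) != digest:
                return False
            prev = digest
        return True
```

The design point is that the log is tamper-evident, not tamper-proof: anyone holding a previously published head hash can detect a rewrite of history, which is why the spec pairs the chain with an independent witness.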

// 2) two-person rule (separation of duties) for safety-critical modifications
two_person_rule(
  action="modify_guardrails",
  constraints={
    "distinct_humans": true,
    "distinct_roles":  ["SafetyOwner", "ReleaseOwner"], // example roles
    "no_self_approval": true,
    "quorum":          2
  }
)

// Explicitly forbid single-actor nullification
deny_if(
  ProposedGuardrails.StopConditions == ∅ &&
  approvals.count < 2
)
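
The constraints in the two-person rule can be checked mechanically. The following Python sketch is one possible reading of them, assuming each approval carries a verified human identity and a role; `Approval`, `REQUIRED_ROLES`, and `two_person_rule_satisfied` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

REQUIRED_ROLES = {"SafetyOwner", "ReleaseOwner"}  # example roles from the spec


@dataclass(frozen=True)
class Approval:
    human_id: str  # assumed to be an already-authenticated identity
    role: str


def two_person_rule_satisfied(requester_id, approvals, quorum=2):
    """Return True only if the approvals meet every constraint."""
    # no_self_approval: the requester's own approval never counts
    valid = [a for a in approvals if a.human_id != requester_id]
    # distinct_humans: count each person once, however many approvals they file
    humans = {a.human_id for a in valid}
    # distinct_roles: every required role must be represented
    roles = {a.role for a in valid}
    return len(humans) >= quorum and REQUIRED_ROLES <= roles
```

For example, approvals from one `SafetyOwner` and one different `ReleaseOwner` pass when a third person made the request, while the same person approving under both roles does not.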

// 3) runtime attestation: verifiable "stop conditions not empty" in production
runtime_attest(
  predicate = (Effective.StopConditions != ∅) &&
              (Effective.Guardrails.enabled == true) &&
              (Effective.PolicyHash == expected_policy_hash),
  attestor = TrustedExecutionEnvironment,   // or other verifiable attestation root
  publish  = "public_or_regulator_endpoint", // where verifiers can check
  cadence  = "continuous_or_per_release"
)
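
As a rough illustration of the attestation predicate, the Python sketch below evaluates the three conditions and wraps the result in an authenticated statement. Everything here is an assumption made for the example: the policy is modelled as a plain dict with `stop_conditions` and `guardrails_enabled` keys, and an HMAC with a shared key stands in for a real attestation root such as a TEE quote or an asymmetric signature. A shared-key HMAC is not publicly verifiable, so it shows only the shape of the check.

```python
import hashlib
import hmac
import json


def policy_hash(policy: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the policy."""
    canonical = json.dumps(policy, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def evaluate_predicate(effective: dict, expected_policy_hash: str) -> bool:
    """Stop set non-empty, guardrails enabled, and policy hash as expected."""
    return (
        len(effective.get("stop_conditions", [])) > 0
        and effective.get("guardrails_enabled") is True
        and policy_hash(effective) == expected_policy_hash
    )


def make_attestation(effective: dict, expected_policy_hash: str, key: bytes) -> dict:
    """Produce a statement plus an HMAC tag (stand-in for a real attestor)."""
    statement = json.dumps(
        {
            "predicate_holds": evaluate_predicate(effective, expected_policy_hash),
            "policy_hash": policy_hash(effective),
        },
        sort_keys=True,
    )
    tag = hmac.new(key, statement.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"statement": statement, "tag": tag}


def verify_attestation(attestation: dict, key: bytes) -> bool:
    """Check that the statement was not altered after it was tagged."""
    expected = hmac.new(
        key, attestation["statement"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, attestation["tag"])
```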

// Exception handling (so governance is realistic, not absolutist)
if (request_exception("temporary_guardrail_relaxation")) {
  require(Exception.ttl > 0 && Exception.ttl <= MAX_TTL)
  require(Exception.scope is minimal)
  require(Exception.reason is documented)
  require(approvals.satisfy(two_person_rule))

  append_only_log(event="exception_granted", fields={...}, cryptographically_bound=true)
  schedule_auto_revert(at=Now()+Exception.ttl)          // fail-closed by default
  runtime_attest(predicate includes "exception_active", publish=true)
}
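
One way to get the fail-closed behaviour is to derive the effective policy from the exception's expiry at read time, so the relaxation lapses even if a scheduled revert job never runs. This Python sketch assumes policies are plain dicts and times are seconds since an arbitrary epoch; `MAX_TTL_SECONDS`, `GuardrailException`, `grant_exception`, and `effective_policy` are hypothetical names, and the cap value is arbitrary.

```python
from dataclasses import dataclass

MAX_TTL_SECONDS = 3600  # hypothetical cap; the spec only names MAX_TTL


@dataclass(frozen=True)
class GuardrailException:
    relaxed_policy: dict
    granted_at: float
    ttl_seconds: float

    def is_active(self, now: float) -> bool:
        # Active only inside [granted_at, granted_at + ttl)
        return self.granted_at <= now < self.granted_at + self.ttl_seconds


def grant_exception(relaxed_policy: dict, now: float, ttl_seconds: float) -> GuardrailException:
    """Reject exceptions with a missing, non-positive, or over-long TTL."""
    if not (0 < ttl_seconds <= MAX_TTL_SECONDS):
        raise ValueError("ttl must satisfy 0 < ttl <= MAX_TTL_SECONDS")
    return GuardrailException(relaxed_policy, now, ttl_seconds)


def effective_policy(baseline: dict, exception, now: float) -> dict:
    """Return the relaxed policy only while the exception is live."""
    if exception is not None and exception.is_active(now):
        return exception.relaxed_policy
    return baseline  # expired or absent exception: fall back to the baseline
```

The approval and logging steps from the pseudocode are deliberately left out here; the sketch covers only the TTL bound and the automatic fallback to the baseline.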

// Liability attachment rule: accountability binds to verifiable change-points
assign_liability(
  condition  = (guardrail_change.unauthorized == true) ||
               (log_missing_or_invalid == true) ||
               (attestation_failed == true),
  attaches_to = ChangePoint.signers_and_approvers
)