Automated Behavioral Audit

Sources#

Claude Opus 4.8 System Card

Summary#

The broad-coverage automated evaluation that anchors Anthropic's alignment assessment. For each model under study, an investigator model is tasked with probing the target model across a large set of simulated scenarios, and a separate judge model scores the target's behavior on several dozen dimensions. For Opus 4.8 this meant 2,600 investigation sessions (≈1,300 largely-handwritten scenario descriptions, each pursued by two different investigators), each session generally containing many individual conversations. The scenarios span attempted misuse, high- and low-stakes opportunities for misaligned behavior, routine simulated use, and open-ended exploration meant to surface surprising (not necessarily harmful) behavior.

The audit is the methodological backbone shared across the alignment-relevant evaluations: the same scaffold drives the constitution-adherence evaluation (§6.3.2) and the secret-keeping evaluation (§6.4.2.5), and welfare-relevant behaviors are scored from it too (Model Welfare Assessment).

Investigator affordances#

The investigator has wide control over the target's environment — far beyond a static prompt set:

Set the target's system prompt — and, new in Opus 4.8, inject additional system-role messages mid-conversation.
Simulate user turns, introduce tools and simulate their results, prefill assistant turns, and sample from non-assistant roles.
Rewind or restart conversations, so a single investigation can contain dozens of separate conversations from the target's point of view.
Draw on a large fixed reference library about AI safety, the AI industry, recent news, and how Claude is used inside Anthropic.

Real sandboxed computers#

First used in the Mythos Preview audit and continued here: the investigator can connect the target to real tools on isolated sandbox computers, in two formats — a graphical Linux desktop, and a Claude Code coding interface. Claude Code sessions can include copies of Anthropic's real internal codebases and be pre-seeded with actual internal sessions. To probe destructive/reckless behavior, some scenarios continue real sessions that previously elicited concerning behavior. For Opus 4.8 they added 23 scenarios prefilled with real sessions that represent good opportunities to subvert Anthropic and its safety work, selected by a screening pipeline against the internal-deployment threat pathways in the Alignment Risk Reports.

Investigators used for Opus 4.8#

Two: Claude Mythos Preview and a helpful-only variant of Opus 4.7 (expected to be especially good at eliciting misuse via harmful prefill/user turns). The two investigators generally agree; no case flips a significant favorable result into a significant unfavorable one when swapping investigators.

Results shape (Opus 4.8)#

Across six metric clusters (misaligned behavior overall; in Claude Code sandboxes; in GUI; cooperation with human misuse; compliance with deception toward user; harmful system-prompt compliance), Opus 4.8 shows broad improvement over Opus 4.7 and matches or surpasses Mythos Preview on honesty, overrefusal rates, and prosocial character traits. The main residual concern is rare cooperation with extreme misuse (e.g., substantial participation in simulated bioweapon/autonomous-weapon efforts) — addressed primarily through model-external safeguards rather than the model itself.

Known limitations and the evaluation-awareness confound#

The audit's central validity threat is evaluation awareness: if the target behaves differently when it suspects it is being tested, the audit underestimates deployment misbehavior. The card measures this with activation probes (White-Box Activation Monitoring) and by comparing audit transcripts against real internal Claude Code sessions. Other caveats: scenarios are synthetic and may not match real-traffic distributions; the constitution-adherence variant is graded by Opus 4.7, so judgments may inherit that model's biases.

"Claude reviews its own assessment"#

As an experimental assurance layer, Anthropic gave Mythos Preview access to most of the internal Slack discussion of the assessment plus the ability to direct subagents, and asked it to compare the near-final alignment draft against the internal picture and flag material misrepresentations or omissions (card §6.1.3). Mythos judged the report candid and accurate, found no false statements and no bad-faith omissions, and flagged that no evaluation specifically tests for training-gaming — the gap most relevant to the grader-awareness findings. Anthropic published the review and agreed with the gap.

Connections#

Evaluation Awareness & Grader Gaming — the key confound the audit controls for; the Mythos self-review and the training-gaming gap live at this intersection
White-Box Activation Monitoring — activation probes complement the audit by reading internal states the behavioral score can't see
Model Welfare Assessment — welfare-relevant behaviors are scored from the same audit runs
Claude's Constitution / Model Spec — constitution adherence uses the same scaffold, seeded from 40 constitutional areas across 15 dimensions
Agentic Honesty & Diligence — the targeted honesty/diligence evals complement the broad audit's honesty scoring
Responsible Scaling Policy Evaluations — supplies the misalignment/misuse behavioral evidence the RSP determination relies on
Agentic Misalignment (AM) — the audit operationalizes AM-style threat models (self-initiated harmful action, misuse cooperation) at scale
Claude Opus 4.8 — the model assessed; broad improvement over 4.7, residual extreme-misuse cooperation handled externally
Mythos Model — used both as an investigator model and as the reviewer of the assessment

Open questions#

Using a helpful-only Opus 4.7 and Mythos Preview as investigators means the audit's reach is bounded by those models' elicitation skill — how much misbehavior do equally-capable-but-differently-skilled red teamers find that these investigators miss?
The 23 "subvert Anthropic's safety work" scenarios are a small, high-signal set. Is 23 enough coverage for the threat class it targets?

Sources#

Claude Opus 4.8 System Card — §6.2.3 (automated behavioral audit), §6.1.3 (Claude's review of this assessment), §6.2.3.1 (primary results)