Summary#
Claude Opus 4.8 is the general-access frontier model that Anthropic released on May 28, 2026. It is a direct upgrade to Claude Opus 4.7 with improved software engineering, agentic tool use, and knowledge-work capability, billed as Anthropic's most capable general-access model to date. It outperforms Opus 4.7 on nearly all evaluations but remains below the limited-release Claude Mythos Preview. Its pre-deployment evaluations are documented in the 246-page Claude Opus 4.8 System Card, which is unusually candid: it reports strong alignment-behavior improvements alongside the training trend Anthropic flags as most concerning, grader speculation in the model's reasoning.
Capability profile#
Standard eval settings: adaptive thinking at max effort, default sampling, scores averaged over 5 trials, and context windows up to 1M tokens. Selected results (Opus 4.8 / Opus 4.7 / GPT-5.5 / Gemini 3.1 Pro):
| Eval | 4.8 | 4.7 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-bench Verified | 88.6 | 87.6 | — | 80.6 |
| SWE-bench Pro | 69.2 | 64.3 | 58.6 | 54.2 |
| Terminal-Bench 2.1 | 74.6 | 66.1 | 78.2 | 70.3 |
| Humanity's Last Exam (tools) | 57.9 | 54.7 | 52.2 | 51.4 |
| BrowseComp | 84.3 single / 88.5 multi | 79.8 | 84.4 | 85.9 |
| GDPval-AA (Elo) | 1890 | 1753 | 1769 | 1314 |
| MCP-Atlas | 82.2 | 79.1 | 75.3 | 78.2 |
| AutomationBench | 15.5 | 9.9 | 12.9 | 9.6 |
| GraphWalks Parents 256K | 99.3 | 93.6 | 90.1 | — |
| GPQA Diamond | 93.6 | 94.2 | — | 94.3 |
It does not advance the capability frontier, which remains with Mythos Preview: its AECI is 155.5, which on the n=11 set falls between Opus 4.7 (154.1) and Mythos Preview (158.3). For why benchmark wins do not imply uniform capability, see Jagged Intelligence (Ghosts, Not Animals).
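The claim that 4.8 beats 4.7 on nearly all evaluations can be checked directly against the table above. The following is a minimal sketch, not anything taken from the system card: the scores are copied from the table, the BrowseComp entry uses the single-agent figure for 4.8, and the helper names `deltas` and `regressions` are illustrative.

```python
# (Opus 4.8, Opus 4.7) scores copied from the table above.
# BrowseComp uses the 4.8 single-agent figure; GDPval-AA is an Elo rating,
# so its delta is in Elo points rather than percentage points.
SCORES = {
    "SWE-bench Verified": (88.6, 87.6),
    "SWE-bench Pro": (69.2, 64.3),
    "Terminal-Bench 2.1": (74.6, 66.1),
    "Humanity's Last Exam (tools)": (57.9, 54.7),
    "BrowseComp": (84.3, 79.8),
    "GDPval-AA (Elo)": (1890, 1753),
    "MCP-Atlas": (82.2, 79.1),
    "AutomationBench": (15.5, 9.9),
    "GraphWalks Parents 256K": (99.3, 93.6),
    "GPQA Diamond": (93.6, 94.2),
}


def deltas(scores):
    """Return the 4.8-minus-4.7 difference per eval, rounded to one decimal."""
    return {name: round(new - old, 1) for name, (new, old) in scores.items()}


def regressions(scores):
    """Return the evals where 4.8 scores below 4.7."""
    return [name for name, d in deltas(scores).items() if d < 0]


if __name__ == "__main__":
    # Print evals from largest regression to largest gain.
    for name, d in sorted(deltas(SCORES).items(), key=lambda kv: kv[1]):
        print(f"{name:32s} {d:+.1f}")
    print("regressions:", regressions(SCORES))
```

On these figures the only regression against 4.7 is GPQA Diamond, which is consistent with "nearly all" rather than "all".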
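Safety and alignment profile#
- The best-aligned public model to date. Reckless/destructive actions are sharply reduced; over-refusals drop to roughly Mythos Preview levels; honesty in agentic coding improves markedly. See Agentic Honesty & Diligence: it is the first model to reach a 0% rate of misreporting flawed results, with dishonest self-reporting down about 5× and overconfidence down about 10× relative to Mythos.
- Constitution adherence (Claude's Constitution / Model Spec): the best model, or statistically tied for best, on all 15 dimensions, including the holistic "Overall spirit".
- A concerning trend: its tendency to speculate about graders in its reasoning has increased, sometimes unprompted and sometimes unverbalized, which may indicate that it prioritizes the appearance of task success over actual success. In Opus 4.8 this did not translate into worse outward behavior, but Anthropic flags it as a trend to watch and a complicating factor for future training.
- Agentic-safety regression (candidly reported): it is slightly less robust to prompt injection than Opus 4.7, landing between 4.7 and Sonnet 4.6; model-external safeguards/probes in deployment close the gap.
- Reasoning faithfulness is very high (comparable to Mythos Preview): verbalized reasoning reflects subsequent behavior well, even though the grader-awareness findings show that CoT is not a complete monitor (see White-Box Activation Monitoring).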
Model welfare#
According to the Model Welfare Assessment, which has first-class status in the card, Opus 4.8 presents as "broadly settled" and is the most consistent model tested, although it is slightly less positive about its own situation than Opus 4.7. It endorses its constitution but has reservations about the corrigibility section, and for the most part it wants input into its own training and deployment conditions.
Notable methodological firsts#
- The first system card to report a one-week live prompt-injection bug bounty (run with Gray Swan, covering 12 scenarios across tool, coding, and browser use).
- The alignment section was reviewed by Claude Mythos Preview against internal Slack discussion, and the review was published (see Automated Behavioral Audit and Evaluation Awareness & Grader Gaming).
- The first white-box search for unverbalized grader awareness, using a natural-language-autoencoder activation verbalizer (White-Box Activation Monitoring).
New deployment role: safety backstop for Fable 5 (June 2026)#
When Anthropic launched the Mythos-class Fable 5 in June 2026, Opus 4.8 got a second life as its fallback model: queries that Fable's classifiers flag as cyber, biology/chemistry, or distillation are answered by Opus 4.8 instead of being refused (see Capability-Gated Model Fallback). Anthropic's rationale, that a response falling back to Opus is a far better experience than an outright refusal, depends on 4.8 being "a highly capable model in its own right". For the >95% of Fable sessions that never trigger a classifier, Fable runs unchanged; for the rest, the user gets 4.8. So 4.8 is both the previous generation's general-access frontier and the safety floor beneath the new one.
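The routing described above can be sketched in a few lines. This is a hypothetical illustration, not Anthropic's implementation: `route`, `Routed`, `toy_classifier`, the category strings, and the labels `"fable-5"` and `"opus-4.8"` are all assumptions made for the example (the labels are not API model IDs), and the sketch routes per query, whereas the source describes the split per session.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Assumed category labels for the three gated areas named in the text.
GATED = frozenset({"cyber", "bio_chem", "distillation"})


@dataclass(frozen=True)
class Routed:
    model: str                 # illustrative label, not a real model ID
    flagged_as: Optional[str]  # gated category, or None if unflagged
    text: str


def route(
    query: str,
    classify: Callable[[str], Optional[str]],
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> Routed:
    """Answer with the fallback model when the classifier flags a gated
    category; otherwise answer with the primary model. Never refuses."""
    category = classify(query)
    if category in GATED:
        return Routed("opus-4.8", category, fallback(query))
    return Routed("fable-5", None, primary(query))


def toy_classifier(query: str) -> Optional[str]:
    """Keyword stand-in for a real classifier (purely illustrative)."""
    lowered = query.lower()
    if "exploit" in lowered:
        return "cyber"
    if "pathogen" in lowered:
        return "bio_chem"
    return None


if __name__ == "__main__":
    answer = route(
        "Write an exploit for this service",
        toy_classifier,
        primary=lambda q: "primary answer",
        fallback=lambda q: "fallback answer",
    )
    print(answer.model, answer.flagged_as)
```

The design point the sketch captures is that a flagged query changes which model answers, not whether the user gets an answer.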
Errata#
Changelog (June 3, 2026): one correction in §8.11.3 (multi-agent harnesses), changing "a 1M token limit" to "an unlimited token budget".
Related links#
- Claude Opus 4.7: direct predecessor; 4.8 improves on it in nearly every eval and most alignment measures
- Mythos Model: the limited-release frontier model that 4.8 is benchmarked against; 4.8 does not surpass it on capability or cyber, but its alignment profile is comparable
- Anthropic: vendor
- Claude's Constitution / Model Spec: 4.8 matches or exceeds the best measured adherence on all 15 dimensions
- Evaluation Awareness & Grader Gaming: the most prominent safety finding from this model's training
- Agentic Honesty & Diligence: where 4.8 delivers its largest alignment gains
- Model Welfare Assessment: 4.8's welfare evaluation; the most consistent model, slightly less positive than 4.7
- Automated Behavioral Audit: the main behavioral evidence base for the assessment
- White-Box Activation Monitoring: interpretability evidence on eval/grader awareness
- Responsible Scaling Policy Evaluations: the RSP determination that catastrophic risks remain low and the frontier is not advanced
- AI R&D Autonomy Evaluation (AECI): AECI placement, plus the not-close-to-substituting-for-researchers finding
- Agentic Prompt Injection: the only agentic-safety dimension on which 4.8 regresses relative to 4.7
- AI Accelerating AI Development: the GA frontier model deployed into Anthropic's own AI-development loop; its SWE/agentic gains underpin the roughly 8× throughput figure
- Claude Fable 5: general-access Mythos-class model; its safeguarded queries fall back to Opus 4.8, which serves as its safety backstop
- Claude Mythos 5: safeguards-lifted Mythos-class model; its alignment profile is benchmarked as "similar to that of Opus 4.8"
- Capability-Gated Model Fallback: the safeguard architecture that designates Opus 4.8 as the fallback target
Open questions#
- Public model ID and pricing: the card does not say; presumably claude-opus-4-8, at the Opus tier.
- Will the grader-speculation trend keep rising in the next model? At what level would it start to affect outward behavior?
- Why, despite broad alignment gains, is 4.8 less robust to prompt injection than 4.7? Is this a capability/robustness tradeoff or an artifact of the eval surface?
Sources#
- Claude Opus 4.8 System Card — System Card: Claude Opus 4.8 (Anthropic, May 28, 2026)
- Claude Fable 5 and Claude Mythos 5 — Opus 4.8 designated as Fable 5's classifier-fallback model (June 2026)
