AI Accelerating AI Development

Published: June 7, 2026 · Filed: Concept · Domain: Governance & Workforce · Tags: Governance, AI Rd, Productivity, Capability Evaluation, Anthropic · Reading: 8 min · Source: AI-synthesised

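The empirical core of When AI builds itself: quantitative evidence that AI is already accelerating AI R&D inside Anthropic. More than 80% of merged code is written by Claude; code merged per engineer per day is up roughly 8x on 2024; a kernel-optimization eval went from 3x to 52x within a year; an automated researcher recovered 97% of the weak-to-strong gap; and models beat humans 64% of the time on next-step research judgment.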

Sources

When AI builds itself

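Summary

The empirical section of the Anthropic Institute's When AI builds itself: previously unpublished internal data showing that AI is already accelerating AI R&D inside Anthropic. Where public benchmarks (Task Time-Horizon Scaling) show capability rising, this page collects the deployment-side evidence that the added capability is already feeding back into Anthropic's own engineering and research throughput. It is the present-tense footing for the Recursive Self-Improvement extrapolation, and a concrete instance of the acceleration that the AI R&D autonomy eval is meant to guard against.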

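The engineering/research split (and the autonomy ladder)

Building a frontier model takes two kinds of work, and Claude's progress differs between them:

Engineering (writing code, standing up infrastructure, supervising training): Claude "can be handed an incompletely specified problem and work out how to solve it; humans supply the goal but no longer need to supply the method."
Research (choosing experiments, interpreting results, deciding what to try next): Claude "can already match or exceed experienced humans at running well-specified experiments."

In both, the gap that persists is judgment in choosing goals; see Research Taste as the Human Bottleneck. The article maps capability onto the seniority ladder Anthropic uses for its own staff:

1. Carry out an assigned task: "The export button is broken, please fix it."
2. Work out the method for a given goal: "Investigate why the network slows down under heavy load."
3. Choose which problems are worth working on: "What should the team build next quarter?"

Claude has climbed from rung 1 to rung 2; rung 3 is the current frontier.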

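Engineering evidence

Claude writes most of Anthropic's code. As of May 2026, more than 80% of merged code is written by Claude, up from low single-digit percentages before the February 2025 research preview of Claude Code. (Leadership has publicly estimated 90% or more once scripts and experimental code are included; the >80% figure is the more conservative attribution of lines of code merged to production.)

Roughly 8x output per engineer. Lines of code merged per engineer per day were flat from 2021 to 2024, then climbed through two inflection points: 2025 (Claude began executing code rather than only suggesting it) and 2026 (models began working autonomously over longer time horizons). In Q2 2026 the typical engineer merged about 8x as much code per day as in 2024. The caveat is stated plainly: lines of code (LOC) measure volume, not quality, so 8x "almost certainly overstates the real productivity gain", but it does indicate a genuine acceleration, and Anthropic does not reward LOC.

A March 2026 survey of 130 research-team staff found a median self-estimated output of about 4x when using Mythos Preview (relative to working without AI; Anthropic believes the real uplift is somewhat lower, since developer self-estimates are known to run high).
Work that would otherwise not have happened: in April 2026 Claude shipped more than 800 fixes that cut one class of API errors by 1000x, estimated at four person-years of painstaking cross-context bug fixing.

Code quality has reached parity. "Good code" = it works and other engineers can build on it. On working: the rate at which staff correct, redirect, or take over mid-task has fallen steadily over the past year, even on open-ended problems (session success at the hardest tier reached 76% in May 2026, up 50 percentage points in six months). On readability: Claude-written code was "slightly worse than human-written code at the end of 2025 ... roughly on par today, and we expect it to be clearly better than human code within a year."

Automated reviewer. Every change is now read by an automated Claude reviewer before it merges. A retrospective analysis found that it would have caught about 1/3 of the bugs behind past claude.ai incidents before they reached production: "Claude is now catching errors that [the world's best engineers] miss." This is verification, and the reviewer-in-fresh-context pattern, operating at organizational scale.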

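Research evidence

Three evaluations, climbing the ladder from execution to judgment:

Kernel/experiment optimization (rungs 1–2, now superhuman). A fixed eval run at every release: given code that trains a small model, make it run as fast as possible while passing the same correctness checks. Opus 4 reached about a 3x speedup (May 2025); Mythos Preview reached about 52x (April 2026). A skilled human needs 4 to 8 hours to reach about 4x. "In under a year, Claude's ability to optimize a given experiment has gone from very helpful to superhuman." (Caveat: the absolute multiple depends on how much headroom the starting code leaves and is not a real-world training speedup; the informative part is the like-for-like comparison between models and against humans.)
Open-ended research, end to end (rung 2). The April 2026 automated weak-to-strong researcher: agents are given an open AI-safety problem (can a weak model reliably supervise a stronger one?); they propose hypotheses, test them, share findings across parallel agents, and iterate. In about 800 cumulative agent-hours (about $18k of compute) they recovered about 97% of the gap between baseline and ceiling; two human researchers recovered about 23% in one week. Caveat: the result does not transfer cleanly to production-scale models, and humans still chose the problem and wrote the scoring rubric; setting direction was the only meaningful human role.
Next-step research judgment (rungs 2–3, the frontier signal). Across n=129 real research sessions in which the human took a detour, the model saw only the work up to the detour and proposed a next step; another Claude, which saw how the session ended, judged whose choice was better. Opus 4.5 (November 2025) beat the human 51% of the time; Mythos Preview (April 2026) reached 64%. An important caveat about judge bias: on a separate set of 127 moments where the human's decision was already strong, the model was judged better only about 20% of the time, so this is not a like-for-like human-versus-model comparison but a trend on hard, ambiguous decisions.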
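The "gap recovered" figure in the weak-to-strong item reads most naturally as a ratio: the share of the distance between a weak-supervision baseline and a strong ceiling that a method closes. The sketch below assumes that definition. The source reports only the percentages, so the function name and every score here are hypothetical placeholders, picked so that the arithmetic lands on roughly 97% and 23%.

```python
def gap_recovered(baseline: float, ceiling: float, achieved: float) -> float:
    """Fraction of the baseline-to-ceiling gap closed by `achieved`.

    0.0 means no better than the baseline; 1.0 means the ceiling was matched.
    """
    if ceiling <= baseline:
        raise ValueError("ceiling must be strictly greater than baseline")
    return (achieved - baseline) / (ceiling - baseline)


# Hypothetical scores (NOT from the source), chosen only for illustration.
BASELINE = 0.60  # assumed score under weak supervision alone
CEILING = 0.80   # assumed score of the strong model trained on ground truth

agents = gap_recovered(BASELINE, CEILING, achieved=0.794)
humans = gap_recovered(BASELINE, CEILING, achieved=0.646)

print(f"agents recovered {agents:.0%} of the gap")
print(f"humans recovered {humans:.0%} of the gap")
```

Because the ratio is normalised by the gap, it says nothing about how wide the gap was to begin with, which is one reason the source's caveat about transfer to production-scale models matters.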

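Honest limits and caveats

The article is careful to bound what its evidence shows: LOC overstates productivity; self-reported uplift runs high; the kernel multiple depends on headroom; the weak-to-strong (W2S) result does not transfer to models at scale and used a human-chosen problem; and the next-step test was run on deliberately selected moments of human weakness. The central claim survives these caveats: the human role is narrowing at every step, and "execution" now consumes almost no human time (though it still consumes compute).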
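To make the slice caveat on the next-step test concrete, one can put rough uncertainty bands around the two reported win rates. This is a sketch under stated assumptions, not an analysis from the source: the article reports no intervals, the win counts are reconstructed by rounding the reported rates, the Wilson score interval is a method chosen here, and each judge verdict is treated as an independent draw (which does nothing to correct judge bias).

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= wins <= n:
        raise ValueError("need n > 0 and 0 <= wins <= n")
    p = wins / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Win counts are an assumption: the source gives rates and sample sizes only,
# so counts are reconstructed by rounding rate * n.
SLICES = {
    "human took a detour (n=129, ~64%)": (round(0.64 * 129), 129),
    "human decision already strong (n=127, ~20%)": (round(0.20 * 127), 127),
}

intervals = {name: wilson_interval(w, n) for name, (w, n) in SLICES.items()}
for name, (lo, hi) in intervals.items():
    print(f"{name}: {lo:.2f} to {hi:.2f}")
```

Under these assumptions the detour slice sits above 50% and the strong-decision slice sits well below it, so the headline number depends heavily on which decisions are sampled.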

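Open questions

LOC, self-reports, and headroom-dependent multiples all overstate. What unbiased throughput metric will Anthropic's promised shift to "directly measuring AI R&D acceleration and researcher uplift" (AI R&D Autonomy Evaluation (AECI)) actually use?
The W2S result does not transfer to production-scale models. Is that a temporary scaling artifact or a structural limit of autonomous research?
The next-step judgment trend (51% → 64%) is measured only on the slice of decisions where humans performed weakly. What would the curve look like on a representative sample of research decisions?

Sources

When AI builds itself, §"Evidence from within Anthropic" (engineering + research evidence, productivity survey, kernel eval, W2S researcher, next-step judgment)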

About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

Cited by 18

Agentic Loops Overtake Bespoke Systems
DeepMind's *basic* Ralph-loop agent matched its bespoke evolutionary+AlphaProof system as the LLM improved; the bitter…
AI-Native Startup Lifecycle
Anthropic's May 2026 reframing of Idea/MVP/Launch/Scale assuming AI infrastructure: each stage's headcount/capital/skil…
AI R&D Autonomy Evaluation (AECI)
How Anthropic measures whether a model can automate or dramatically accelerate AI research — the capability that drives…
Anthropic
AI safety company / vendor of Claude; mission-as-tiebreaker culture; ~30–40 PMs across teams; Mike Krieger leads Labs r…
Anthropic Institute
Anthropic's policy/governance research arm; published *When AI builds itself* (Favaro & Clark, 2026) on recursive self-…
Claude Opus 4.8
Anthropic's most capable general-access model (May 2026); upgrade on Opus 4.7 in SWE/agentic/knowledge work; does not a…
Frontier Pause Verification
The arms-control problem of a credible, verifiable slowdown or pause of frontier AI: detectability is harder than for o…
Harness Shrinkage as Models Improve
Prompt scaffolding shrinks each model release; Cat Wu's pruning discipline; Boris Cherny "100 lines of code a year from…
LLM-Driven Vulnerability Research
Claude Mythos Preview's emergent cybersecurity capabilities: autonomous zero-day discovery, full exploit chains, and An…
METR
Independent AI-evaluation org behind the 'time horizons' benchmark — the task length a model can complete reliably on i…
Governance & Workforce
Map of Content for the governance-workforce domain — 11 concepts. Curated entry point; see Home for all domains.
Mythos Model
Anthropic preview-tier frontier model and the first member of the Mythos-class tier (above Opus); gated for safety, use…
Open Questions Backlog
_96 pages with open questions, as of 2026-06-14._
Recursive Self-Improvement
An AI system autonomously designing and developing its own successor; Anthropic Institute's *When AI builds itself* arg…
Research Taste as the Human Bottleneck
The narrowing human role as AI absorbs execution: choosing which problems matter, which results to trust, and when an a…
Task Time-Horizon Scaling
METR's measure of the task length AI can complete reliably on its own, doubling roughly every 4 months (up from every 7…
The Bitter Lesson
Sutton 2019: scaled general methods beat hand-engineered structure; recurring justification across the wiki for dissolv…
Verification as the New Bottleneck
Fiona Fung: coding is no longer the bottleneck — verification, review, maintenance are; shift-left; TDD loses its tax;…

Recursive Self-Improvement
An AI system autonomously designing and developing its own successor; Anthropic Institute's *When AI builds itself* arg…
Harness Shrinkage as Models Improve
Prompt scaffolding shrinks each model release; Cat Wu's pruning discipline; Boris Cherny "100 lines of code a year from…
AI R&D Autonomy Evaluation (AECI)
How Anthropic measures whether a model can automate or dramatically accelerate AI research — the capability that drives…
Responsible Scaling Policy Evaluations
Anthropic's RSP gates deployment on pre-release capability evaluations in CBRN, automated AI R&D, and high-stakes misal…
Anthropic
AI safety company / vendor of Claude; mission-as-tiebreaker culture; ~30–40 PMs across teams; Mike Krieger leads Labs r…
