資料來源#
摘要#
DeepMind 的 AI-Driven Formal Proof Search 論文最受矚目的實證發現,也是其對 The Bitter Lesson 最清晰的跨領域確認:一個 basic agent——獨立的 prover subagents 各自運行一個簡單的「Ralph loop」 進行 generate-edit-compile 週期——解決了全部 9 個 Erdős problems,這與功能繁複完備的 agent(evolutionary search + bespoke AlphaProof RL prover)所解決的數量相同,只是在最難的問題上花費了更高成本。論文的結論指出:「隨著 LLMs 變得更加強大,目前正在經歷一場從特化訓練系統向簡單 agentic loops 的持續轉變。」
The surprise, in the authors' words#
團隊基於其強大的競賽基準表現,選擇了功能完備的 agent (D) 進行大規模探索——在規劃階段,「較簡單的 agentic loops 並未表現出強大性能」。隨後,針對 9 個已解決的 Erdős problems 進行的 post-hoc analysis 發現:
「值得注意的是,basic agent 解決了所有 9 個問題,儘管在較難的問題上花費了更高的成本。」
他們將此歸因於兩點:(1) LLM 領域在規劃與分析之間「發生了實質性變化」(模型能力躍升),以及 (2) 「編譯器回饋在鞏固 LLM 推理方面的力量」——簡單 loop 之所以有效,是因為 Lean 的驗證器讓每一步都切實無誤。
Why this is the bitter lesson in a new field#
The Bitter Lesson:隨著時間推移,規模化的通用方法會擊敗人工設計的結構。這裡的「人工設計的結構」是指 bespoke apparatus——AlphaProof RL theorem-prover(一個 特化訓練系統)以及演化種群/Elo 機制(Evolutionary Proof Search)。而「規模化的通用方法」則是處於 plain loop 中並配有驗證器的前沿 LLM。隨著 LLM 的提升,在大多數問題上,bespoke scaffolding 的優勢 坍縮為成本差異,而非能力差異。這正是 Harness Shrinkage as Models Improve 所描述的動態——為彌補模型弱點而設計的 scaffolding 在模型變強時會變成累贅——只是這是在形式數學中觀察到的,而非在 coding harnesses 中。
The residual advantage (and its expiry date)#
bespoke agent 並非毫無用處——它「目前在最難的問題上仍保有優勢」,在兩個最棘手的 Erdős problems(#125, #138)上可節省 2×–5× 的成本。但作者明確指出該優勢的時效:「隨著 LLM 能力的增長,這種優勢可能會減少。」 這種模式是一個 退縮 的前沿,在此處 bespoke systems 才發揮作用:問題越難且模型越弱,特化結構的回報就越高——而這個區域在每次模型發布時都在縮小。(獨立的 AlphaProof tree-search 以及較小模型的 basic agents 什麼也沒解決——因此 loop 仍然需要足夠強大的模型 + 驗證器;參見 Scale-Dependent Prompt Sensitivity。)
Generalization#
可推廣的結論:當一個領域擁有廉價且可靠的驗證器時,應偏好能夠利用它的最簡單 agentic loop,並在每次模型發布時重新評估 bespoke scaffolding——它往往會從「賦予能力(capability-enabling)」轉變為「僅節省成本」,最後變成累贅(drag)。驗證器(The Verifiability Thesis)是讓簡單 loop 可行的關鍵;而 loop 是讓系統建置和維護成本低廉的原因。bespoke systems 只有在退縮的艱難前沿才能發揮其價值。
相關連結#
- AI-Driven Formal Proof Search — 背景環境;這是其核心架構發現
- The Bitter Lesson — 這在形式數學中得到實證確認的原則
- Harness Shrinkage as Models Improve — 相同的「scaffolding 在模型提升時成為累贅」之動態,此處適用於 proof-search harness
- Agent Loop Pattern — 「Ralph loop」 basic agent 是 loop-as-primitive 的一個實例
- Evolutionary Proof Search — 簡單 loop 所匹敵的 bespoke scaffolding(種群 + Elo)
- AlphaProof Nexus — 涵蓋 basic (A) → full-featured (D) agents 的框架
- The Verifiability Thesis — 驗證器是讓簡單 loop 完全行得通的根本原因
- Client-Side Agent Optimization — 「以更低成本匹敵能力」是 AgentOpt 所形式化的成本/品質最佳化;在此處,廉價的配置勝出
- Scale-Dependent Prompt Sensitivity — loop 需要足夠強大的模型:較小的 Gemini 變體什麼也沒解決
- Recursive Self-Improvement — 這是 RSI 最清晰的現有領域代理:隨著模型改進,一個簡單 loop 匹敵了 bespoke trained system,這種動態若在 AI 研發本身上運行,便實現了閉環
- AI Accelerating AI Development — 相同的 simple-loop-overtakes-bespoke 模式,這是在 Anthropic 的內部 AI-R&D 吞吐量中觀察到的,而非在形式數學中
待解決的問題#
- bespoke 優勢被定為「目前而言」。下一代模型的結論會是什麼——演化/AlphaProof 機制是否能在 任何 問題上倖存,還是完全坍縮為一條成本線?
- 「簡單 loop + 驗證器擊敗 bespoke system」的結果,是僅在驗證器完美(Lean)的領域成立,還是在存在雜訊驗證器的領域(測試、LLM 裁判委員會)也成立?
資料來源#
Cited by 15
- Agent Loop Pattern
`/loop` (cron-scheduled) and Ralph Wiggum (backlog-draining) loops as next-generation agent primitive; AFK execution, p…
- AI Accelerating AI Development
The empirical core of *When AI builds itself*: measured evidence AI already speeds AI R&D at Anthropic — >80% of merged…
- AI-Driven Formal Proof Search
LLM generates Lean, compiler verifies every step → eliminates hallucination; DeepMind resolves 9/353 Erdős + 44/492 OEI…
- AlphaProof Nexus
DeepMind framework for LLM-aided Lean proof generation; four agents (basic→full-featured); proof-sketch + EVOLVE-BLOCK…
- Client-Side Agent Optimization
AgentOpt's framing of developer-controlled agent optimization (model-per-role, budget, routing) as distinct from server…
- Evolutionary Proof Search
The full-featured agent's mechanism: population DB of proof sketches, Elo via Plackett–Luce/Gibbs, P-UCB selection, LLM…
- Google DeepMind
Google's AI lab; built AlphaProof Nexus; Gemini models, AlphaProof, AlphaEvolve; opens the AI-for-mathematics domain in…
- Harness Shrinkage as Models Improve
Prompt scaffolding shrinks each model release; Cat Wu's pruning discipline; Boris Cherny "100 lines of code a year from…
- Lean
Proof assistant whose compiler mechanically verifies every step; the `sorry` placeholder enables proof sketches; mathli…
- Formal Mathematics & Proof Search
Map of Content for the formal-math domain — 3 concepts. Curated entry point; see Home for all domains.
- Open Questions Backlog
_96 pages with open questions, as of 2026-06-14._
- Recursive Self-Improvement
An AI system autonomously designing and developing its own successor; Anthropic Institute's *When AI builds itself* arg…
- Scale-Dependent Prompt Sensitivity
Large models underperform small ones on 7.7% of standard benchmarks due to overthinking; brevity constraints recover 26…
- The Bitter Lesson
Sutton 2019: scaled general methods beat hand-engineered structure; recurring justification across the wiki for dissolv…
- When Does Verification Quality Determine Whether AI Automation Works?
Verification-quality ladder from Lean/formal proof search through software CI and vulnerability reproduction; autonomy…
Related articles
- AI-Driven Formal Proof Search
LLM generates Lean, compiler verifies every step → eliminates hallucination; DeepMind resolves 9/353 Erdős + 44/492 OEI…
- Agent Harness Engineering
Patterns for scaffolding long-running LLM agents: environment design, progressive context disclosure, mechanical archit…
- Evolutionary Proof Search
The full-featured agent's mechanism: population DB of proof sketches, Elo via Plackett–Luce/Gibbs, P-UCB selection, LLM…
- AlphaProof Nexus
DeepMind framework for LLM-aided Lean proof generation; four agents (basic→full-featured); proof-sketch + EVOLVE-BLOCK…
- The Verifiability Thesis
LLMs automate what you can *verify* as computers automate what you can *specify*; RL verification rewards → jagged peak…
