Sources#
Summary#
The headline empirical finding of DeepMind's AI-Driven Formal Proof Search paper, and its clearest cross-domain confirmation of The Bitter Lesson: a basic agent — independent prover subagents each running a simple "Ralph loop" of generate-edit-compile episodes — solved all 9 of the Erdős problems that the elaborate full-featured agent (evolutionary search + the bespoke AlphaProof RL prover) solved, only at higher cost on the hardest ones. The paper's conclusion: "an ongoing shift from specialized trained systems toward simple agentic loops as LLMs become more capable."
The surprise, in the authors' words#
The team chose the full-featured agent (D) for the large-scale exploration based on its strong competition-benchmark performance — at planning time, "simpler agentic loops did not show strong performance." Then a post-hoc analysis on the 9 solved Erdős problems found:
"Remarkably, the basic agent solved all 9 problems, though at a higher cost on the harder problems."
They attribute this to two things: (1) the LLM landscape "shifted substantially" between planning and analysis (model capability jumped), and (2) "the power of compiler feedback in grounding LLM reasoning" — the simple loop works because Lean's verifier keeps each step honest.
Why this is the bitter lesson in a new field#
The Bitter Lesson: scaled general methods beat hand-engineered structure over time. Here the "hand-engineered structure" is the bespoke apparatus — the AlphaProof RL theorem-prover (a specialized trained system) and the evolutionary population/Elo machinery (Evolutionary Proof Search). The "scaled general method" is a frontier LLM in a plain loop with a verifier. As the LLM improved, the bespoke scaffolding's advantage collapsed to a cost difference, not a capability difference on most problems. This is the exact dynamic Harness Shrinkage as Models Improve describes — scaffolding that compensates for model weakness becomes drag as the model strengthens — observed in formal mathematics rather than in coding harnesses.
The residual advantage (and its expiry date)#
The bespoke agent isn't useless — it "retains an advantage on the hardest problems for now," with 2×–5× cost savings on the two toughest Erdős problems (#125, #138). But the authors explicitly date the advantage: "as LLM capabilities grow, this advantage may diminish." The pattern is a receding frontier where bespoke systems matter: the harder the problem and the weaker the model, the more the specialized structure pays off — and that region shrinks every model release. (Standalone AlphaProof tree-search and smaller-model basic agents solved nothing — so the loop still needs a strong-enough model + a verifier; see Scale-Dependent Prompt Sensitivity.)
Generalization#
The transferable claim: when a domain has a cheap, reliable verifier, prefer the simplest agentic loop that can exploit it, and re-evaluate bespoke scaffolding every model release — it tends to convert from capability-enabling to merely cost-saving, then to drag. The verifier (The Verifiability Thesis) is what makes the simple loop viable; the loop is what makes the system cheap to build and maintain. Bespoke systems earn their keep only at the receding hard frontier.
Connections#
- AI-Driven Formal Proof Search — the setting; this is its central architectural finding
- The Bitter Lesson — the principle this empirically confirms in formal mathematics
- Harness Shrinkage as Models Improve — the same "scaffolding becomes drag as models improve" dynamic, here for a proof-search harness
- Agent Loop Pattern — the "Ralph loop" basic agent is an instance of the loop-as-primitive
- Evolutionary Proof Search — the bespoke scaffolding (population + Elo) the simple loop matched
- AlphaProof Nexus — the framework spanning basic (A) → full-featured (D) agents
- The Verifiability Thesis — the verifier is what lets the simple loop work at all
- Client-Side Agent Optimization — "match capability at lower cost" is the cost/quality optimization AgentOpt formalizes; here the cheap config wins
- Scale-Dependent Prompt Sensitivity — the loop needs a strong-enough model: smaller Gemini variants solved nothing
Open Questions#
- The bespoke advantage is dated "for now." What's the next model generation's verdict — does the evolutionary/AlphaProof apparatus survive on any problems, or fully collapse to a cost line?
- Does the "simple loop + verifier beats bespoke system" result hold only where the verifier is perfect (Lean), or also in noisy-verifier domains (tests, LLM-judge councils)?
Sources#
Cited by 9
- Agent Loop Pattern
`/loop` (cron-scheduled) and Ralph Wiggum (backlog-draining) loops as next-generation agent primitive; AFK execution, p…
- AI-Driven Formal Proof Search
LLM generates Lean, compiler verifies every step → eliminates hallucination; DeepMind resolves 9/353 Erdős + 44/492 OEI…
- AlphaProof Nexus
DeepMind framework for LLM-aided Lean proof generation; four agents (basic→full-featured); proof-sketch + EVOLVE-BLOCK…
- Client-Side Agent Optimization
AgentOpt's framing of developer-controlled agent optimization (model-per-role, budget, routing) as distinct from server…
- Evolutionary Proof Search
The full-featured agent's mechanism: population DB of proof sketches, Elo via Plackett–Luce/Gibbs, P-UCB selection, LLM…
- Google DeepMind
Google's AI lab; built AlphaProof Nexus; Gemini models, AlphaProof, AlphaEvolve; opens the AI-for-mathematics domain in…
- Harness Shrinkage as Models Improve
Prompt scaffolding shrinks each model release; Cat Wu's pruning discipline; Boris Cherny "100 lines of code a year from…
- Lean
Proof assistant whose compiler mechanically verifies every step; the `sorry` placeholder enables proof sketches; mathli…
- The Bitter Lesson
Sutton 2019: scaled general methods beat hand-engineered structure; recurring justification across the wiki for dissolv…
Related articles
- AI-Driven Formal Proof Search
LLM generates Lean, compiler verifies every step → eliminates hallucination; DeepMind resolves 9/353 Erdős + 44/492 OEI…
- Agent Harness Engineering
Patterns for scaffolding long-running LLM agents: environment design, progressive context disclosure, mechanical archit…
- AlphaProof Nexus
DeepMind framework for LLM-aided Lean proof generation; four agents (basic→full-featured); proof-sketch + EVOLVE-BLOCK…
- Evolutionary Proof Search
The full-featured agent's mechanism: population DB of proof sketches, Elo via Plackett–Luce/Gibbs, P-UCB selection, LLM…
- The Verifiability Thesis
LLMs automate what you can *verify* as computers automate what you can *specify*; RL verification rewards → jagged peak…
