Agentic Loops Overtake Bespoke Systems

Sources#

Advancing Mathematics Research with AI-Driven Formal Proof Search

Summary#

The headline empirical finding of DeepMind's AI-Driven Formal Proof Search paper, and its clearest cross-domain confirmation of The Bitter Lesson: a basic agent — independent prover subagents each running a simple "Ralph loop" of generate-edit-compile episodes — solved all 9 of the Erdős problems that the elaborate full-featured agent (evolutionary search + the bespoke AlphaProof RL prover) solved, only at higher cost on the hardest ones. The paper's conclusion: "an ongoing shift from specialized trained systems toward simple agentic loops as LLMs become more capable."

The surprise, in the authors' words#

The team chose the full-featured agent (D) for the large-scale exploration based on its strong competition-benchmark performance — at planning time, "simpler agentic loops did not show strong performance." Then a post-hoc analysis on the 9 solved Erdős problems found:

"Remarkably, the basic agent solved all 9 problems, though at a higher cost on the harder problems."

They attribute this to two things: (1) the LLM landscape "shifted substantially" between planning and analysis (model capability jumped), and (2) "the power of compiler feedback in grounding LLM reasoning" — the simple loop works because Lean's verifier keeps each step honest.

Why this is the bitter lesson in a new field#

The Bitter Lesson: scaled general methods beat hand-engineered structure over time. Here the "hand-engineered structure" is the bespoke apparatus — the AlphaProof RL theorem-prover (a specialized trained system) and the evolutionary population/Elo machinery (Evolutionary Proof Search). The "scaled general method" is a frontier LLM in a plain loop with a verifier. As the LLM improved, the bespoke scaffolding's advantage collapsed to a cost difference, not a capability difference on most problems. This is the exact dynamic Harness Shrinkage as Models Improve describes — scaffolding that compensates for model weakness becomes drag as the model strengthens — observed in formal mathematics rather than in coding harnesses.

The residual advantage (and its expiry date)#

The bespoke agent isn't useless — it "retains an advantage on the hardest problems for now," with 2×–5× cost savings on the two toughest Erdős problems (#125, #138). But the authors explicitly date the advantage: "as LLM capabilities grow, this advantage may diminish." The pattern is a receding frontier where bespoke systems matter: the harder the problem and the weaker the model, the more the specialized structure pays off — and that region shrinks every model release. (Standalone AlphaProof tree-search and smaller-model basic agents solved nothing — so the loop still needs a strong-enough model + a verifier; see Scale-Dependent Prompt Sensitivity.)

Generalization#

The transferable claim: when a domain has a cheap, reliable verifier, prefer the simplest agentic loop that can exploit it, and re-evaluate bespoke scaffolding every model release — it tends to convert from capability-enabling to merely cost-saving, then to drag. The verifier (The Verifiability Thesis) is what makes the simple loop viable; the loop is what makes the system cheap to build and maintain. Bespoke systems earn their keep only at the receding hard frontier.

Connections#

AI-Driven Formal Proof Search — the setting; this is its central architectural finding
The Bitter Lesson — the principle this empirically confirms in formal mathematics
Harness Shrinkage as Models Improve — the same "scaffolding becomes drag as models improve" dynamic, here for a proof-search harness
Agent Loop Pattern — the "Ralph loop" basic agent is an instance of the loop-as-primitive
Evolutionary Proof Search — the bespoke scaffolding (population + Elo) the simple loop matched
AlphaProof Nexus — the framework spanning basic (A) → full-featured (D) agents
The Verifiability Thesis — the verifier is what lets the simple loop work at all
Client-Side Agent Optimization — "match capability at lower cost" is the cost/quality optimization AgentOpt formalizes; here the cheap config wins
Scale-Dependent Prompt Sensitivity — the loop needs a strong-enough model: smaller Gemini variants solved nothing
Recursive Self-Improvement — this is RSI's clearest existing-domain proxy: a simple loop matched a bespoke trained system as the model improved, the dynamic that, run on AI development itself, closes the loop
AI Accelerating AI Development — the same simple-loop-overtakes-bespoke pattern, observed in Anthropic's internal AI-R&D throughput rather than in formal math

Open Questions#

The bespoke advantage is dated "for now." What's the next model generation's verdict — does the evolutionary/AlphaProof apparatus survive on any problems, or fully collapse to a cost line?
Does the "simple loop + verifier beats bespoke system" result hold only where the verifier is perfect (Lean), or also in noisy-verifier domains (tests, LLM-judge councils)?

Sources#

Advancing Mathematics Research with AI-Driven Formal Proof Search