Sources#
Summary#
Google DeepMind's framework for LLM-aided formal proof generation in Lean (arXiv 2605.22763). Agents query a frontier LLM (Gemini 3.1 Pro) and the Lean compiler to turn a proof sketch (a theorem with sorry for its proof) into a verified, sorry-free proof. It spans a spectrum of four agent designs from a bare loop to an evolutionary system, and is the instrument behind the first large-scale evaluation of formal proof search on open research problems — 9/353 Erdős problems, 44/492 OEIS conjectures, and results in optimization, algebraic geometry, additive combinatorics, graph theory, and quantum optics. Lean proofs are open-sourced at github.com/google-deepmind/alphaproof-nexus-results.
The four agents (A → D)#
| Agent | Design | Notes |
|---|---|---|
| (A) Basic | Independent prover subagents, no shared state; each is a "Ralph loop" of episodes | Surprisingly matched (D) on all 9 Erdős solves — see Agentic Loops Overtake Bespoke Systems |
| (B) Basic + AlphaProof | (A) plus the AlphaProof RL prover as a callable tool | More efficient than (A) on the harder problems |
| (C) Basic + evolution | (A) plus the population/Elo evolutionary search | — |
| (D) Full-featured | Evolution and AlphaProof together | Used for the open-problem exploration; Evolutionary Proof Search details its machinery |
A prover subagent runs a multi-turn Gemini 3.1 Pro session with a search_replace tool; after each edit the Lean compiler returns feedback that steers the next turn; on episode end the sketch is validated by SafeVerify (compiles, no sorryAx/axiom injection), and if sorry remains the agent writes a lessons-learned comment and continues. $N$ subagents run in parallel; all stop when one finds a proof.
Input/output: proof sketches#
Input is a Lean file — target theorem with sorry, plus needed definitions and imports — optionally with natural-language context and domain knowledge encoded in Lean. Editable regions are delimited by EVOLVE-BLOCK (helper lemmas/definitions/proof steps) and EVOLVE-VALUE (parameter expressions). The EVOLVE-VALUE mechanism enabled a genuine discovery: marking an optimization algorithm's learning schedule as a value let agent D search the schedule and the proof jointly, yielding a novel parameter choice with a stronger convergence guarantee.
AlphaProof (the tool)#
AlphaProof is a separate, prior DeepMind system — an olympiad-level Lean theorem-prover trained with reinforcement learning (the system behind DeepMind's IMO results). Within Nexus it's used as a focused proof tool: given a subgoal it returns a proof, a disproof, or failure. It has a Test-Time RL mode but here runs in cheaper low-compute tree-search (~400 simulations, ~27.5 TPU-hours ≈ $60/problem). Notably, AlphaProof in standalone tree-search mode solved none of the evaluated problems — its value is as a subgoal-closer inside the LLM-driven loop, not as a soloist.
Models and cost#
Gemini 3.1 Pro for the multi-turn prover; the cheaper Gemini 3.0 Flash for rater agents. Per-problem inference cost is "a few hundred dollars" with high variance; basic-agent versions on smaller models (Gemini 3.0 Flash, 3.1 Flash-Lite) solved nothing — capability is sharply scale-gated (Scale-Dependent Prompt Sensitivity). Reported costs exclude AlphaProof's ~$60 and the considerable cost of finding tractable problems across all 353.
Significance#
The system grounds an LLM's mathematical reasoning in a compiler, converting hallucination-prone natural-language proofs into checkable artifacts (AI-Driven Formal Proof Search) and demonstrating that, as LLMs improve, simple agentic loops increasingly rival bespoke trained systems (Agentic Loops Overtake Bespoke Systems). Solved Erdős problems were logged on Terence Tao's wiki of AI contributions.
Connections#
- AI-Driven Formal Proof Search — the paradigm Nexus implements
- Lean — the proof assistant / verifier it drives
- Google DeepMind — the lab behind it
- Evolutionary Proof Search — the full-featured agent (D)'s population/Elo mechanism
- Agentic Loops Overtake Bespoke Systems — its own basic agent (A) matched the full system on most problems
- Agent Loop Pattern — the basic prover subagent is a "Ralph loop" (huntley2025ralph)
- Client-Side Agent Optimization — the A/B/C/D cost-vs-solve-rate Pareto study is AgentOpt-style combo optimization
- The Verifiability Thesis — the design embodies "automate what you can verify"
Open Questions#
- The framework's reach is gated by Lean's mathlib maturity. What's the path to domains needing new theory rather than subgoal decomposition?
- AlphaProof adds little as a soloist but helps as a tool. As the prover LLM strengthens, does the AlphaProof tool become redundant entirely?
Sources#
Cited by 7
- Agent Harness Engineering
Patterns for scaffolding long-running LLM agents: environment design, progressive context disclosure, mechanical archit…
- Agent Loop Pattern
`/loop` (cron-scheduled) and Ralph Wiggum (backlog-draining) loops as next-generation agent primitive; AFK execution, p…
- Agentic Loops Overtake Bespoke Systems
DeepMind's *basic* Ralph-loop agent matched its bespoke evolutionary+AlphaProof system as the LLM improved; the bitter…
- AI-Driven Formal Proof Search
LLM generates Lean, compiler verifies every step → eliminates hallucination; DeepMind resolves 9/353 Erdős + 44/492 OEI…
- Evolutionary Proof Search
The full-featured agent's mechanism: population DB of proof sketches, Elo via Plackett–Luce/Gibbs, P-UCB selection, LLM…
- Google DeepMind
Google's AI lab; built AlphaProof Nexus; Gemini models, AlphaProof, AlphaEvolve; opens the AI-for-mathematics domain in…
- Lean
Proof assistant whose compiler mechanically verifies every step; the `sorry` placeholder enables proof sketches; mathli…
Related articles
- AI-Driven Formal Proof Search
LLM generates Lean, compiler verifies every step → eliminates hallucination; DeepMind resolves 9/353 Erdős + 44/492 OEI…
- Agentic Loops Overtake Bespoke Systems
DeepMind's *basic* Ralph-loop agent matched its bespoke evolutionary+AlphaProof system as the LLM improved; the bitter…
- Evolutionary Proof Search
The full-featured agent's mechanism: population DB of proof sketches, Elo via Plackett–Luce/Gibbs, P-UCB selection, LLM…
- Lean
Proof assistant whose compiler mechanically verifies every step; the `sorry` placeholder enables proof sketches; mathli…
- Client-Side Agent Optimization
AgentOpt's framing of developer-controlled agent optimization (model-per-role, budget, routing) as distinct from server…
