Summary#
The paradigm, demonstrated at research scale by Google DeepMind's AlphaProof Nexus (arXiv 2605.22763), uses LLMs to generate proofs in a formal language (Lean) whose compiler mechanically verifies every logical step, and searches for a complete proof in a generate-and-verify loop. This converts the LLM's biggest liability for mathematics (hallucinated or subtly wrong natural-language proofs that need expensive expert review) into a checkable artifact: a proof counts as correct iff Lean accepts it with no `sorry` and no disallowed axioms. The paper reports the first large-scale evaluation on open research problems, autonomously resolving 9/353 attempted Erdős problems and 44/492 OEIS conjectures, among other results.
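To make the acceptance criterion concrete, here is a minimal Lean 4 sketch. The theorem names are illustrative and not taken from the paper; the point is only that a file containing `sorry` still compiles, so the check has to look at axiom dependencies as well.

```lean
-- A complete proof: Lean accepts it and it depends on no axioms.
theorem add_zero_example (n : Nat) : n + 0 = n := rfl

-- An incomplete proof: this compiles with a warning, but `sorry` is not a proof.
theorem unfinished (n : Nat) : 0 + n = n := by
  sorry

-- Lean's built-in audit of what each theorem ultimately relies on.
#print axioms add_zero_example  -- reports that no axioms are used
#print axioms unfinished        -- reports a dependency on `sorryAx`
```

A verifier in this spirit rejects `unfinished` even though the file compiles, because `sorryAx` appears among its dependencies.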
Why formal, not natural language#
LLM natural-language proofs "contain subtle logical errors or hallucinations," and mistakes in unreviewed intermediate steps cascade, capping the complexity of what you can delegate. Formal languages fix this: in Lean, "definitions, theorems, and proofs are all mechanically verified code." The key reframing in the paper's discussion:
> Formal verification can serve as a filter for determining which proofs merit human review.
So AI-driven formal proof search doesn't replace mathematicians; it triages. Experts review only what compiled and, within that, focus on the structure of the argument rather than re-verifying every line. This is Karpathy's verifiability thesis in its purest form: math+Lean is the maximally verifiable domain, and the compiler is the reward signal.
The proof-sketch interface#
The unit of work is a proof sketch: a Lean file containing the target theorem, its dependencies (definitions, imports), and `sorry` in place of the proof. User-provided markers bound what the agent may edit: `EVOLVE-BLOCK` (introduce helper lemmas, definitions, or proof steps) and `EVOLVE-VALUE` (change parameter expressions). The agent succeeds when it emits a `sorry`-free proof that SafeVerify accepts, meaning the file compiles and no axiom such as `sorryAx` has been injected. Optionally, the mathematician supplies natural-language context and domain knowledge encoded in Lean. (See AlphaProof Nexus for the agent architectures that drive this loop.)
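A proof sketch might look roughly like the following. This is an illustration only: the toy theorem is invented for the example, and the exact spelling of the markers (written here as `-- EVOLVE-BLOCK-START` / `-- EVOLVE-BLOCK-END` comments) is an assumption, not the paper's documented syntax.

```lean
import Mathlib

-- EVOLVE-BLOCK-START   (assumed marker spelling)
-- The agent may introduce helper lemmas and definitions in this region.
-- EVOLVE-BLOCK-END

/-- Target theorem, fixed by the user and not editable by the agent. -/
theorem sum_sq_nonneg (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 := by
  -- EVOLVE-BLOCK-START
  sorry  -- placeholder the agent must replace with a real proof
  -- EVOLVE-BLOCK-END
```

Everything outside the marked regions stays fixed, so the agent cannot weaken the statement it is asked to prove. For a goal this simple, a single Mathlib tactic such as `positivity` would be expected to replace the `sorry`; research-level targets need many helper lemmas instead.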
Compiler feedback as grounding#
The engine is the tight loop between generation and verification: the subagent edits via a search-replace tool, Lean compiles after each edit, and Lean's error message directs the next turn. The paper attributes the surprising strength of even its basic agent partly to "the power of compiler feedback in grounding LLM reasoning" (Agentic Loops Overtake Bespoke Systems). The verifier isn't just a final gate — it's a per-step teacher that keeps the model's reasoning anchored to ground truth.
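The loop described above can be sketched in a few lines of Python. This is a toy model under stated assumptions: `toy_compile` stands in for the Lean compiler plus axiom check, `toy_propose` stands in for the LLM, and all names here are hypothetical rather than taken from the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class CompileResult:
    ok: bool
    error: Optional[str] = None


def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Apply one search-replace edit; refuse missing or ambiguous matches."""
    if source.count(search) != 1:
        raise ValueError(f"expected exactly one match for {search!r}")
    return source.replace(search, replace, 1)


def toy_compile(source: str) -> CompileResult:
    """Hypothetical stand-in for Lean: NOT a real proof checker."""
    if "sorry" in source:
        return CompileResult(False, "declaration uses 'sorry'")
    if "bad_tactic" in source:
        return CompileResult(False, "unknown tactic 'bad_tactic'")
    return CompileResult(True)


def toy_propose(source: str, error: Optional[str]) -> Tuple[str, str]:
    """Hypothetical stand-in for the LLM: picks an edit from the last error."""
    if error and "unknown tactic" in error:
        return ("bad_tactic", "rfl")   # repair guided by the error message
    return ("sorry", "bad_tactic")     # first attempt is deliberately wrong


def prove(
    sketch: str,
    propose: Callable[[str, Optional[str]], Tuple[str, str]],
    compile_fn: Callable[[str], CompileResult],
    max_turns: int = 8,
) -> Tuple[Optional[str], List[str]]:
    """Generate-and-verify loop: compile, read the error, edit, repeat."""
    source = sketch
    errors: List[str] = []
    for _ in range(max_turns):
        result = compile_fn(source)
        if result.ok:
            return source, errors
        errors.append(result.error or "")
        search, replace = propose(source, result.error)
        source = apply_search_replace(source, search, replace)
    # Out of turns: accept only if the final edit happens to verify.
    return (source if compile_fn(source).ok else None), errors
```

Calling `prove("theorem t : 1 + 1 = 2 := by\n  sorry", toy_propose, toy_compile)` walks through one failed attempt and one repair. The design point is that the verifier's error string is the only information passed back into the next proposal, which is what keeps each turn grounded.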
Results (open research problems)#
- Erdős problems: 9/353 from the Formal Conjectures repo, including questions open since 1970/1996 and two open ~56 years; logged on Terence Tao's wiki of AI contributions to Erdős problems. Techniques span CRT + 3-AP-avoiding-set constructions (#12), inductive thinning exploiting Diophantine approximation $3^m\approx 4^k$ (#125), etc.
- OEIS: 44/492 open conjectures (with "test lemmas" verifying the first few sequence terms as a misformalization guard).
- Algebraic geometry: a ~15-year-open question on log-concavity of pure $O$-sequences (codim 3, type 2).
- Convex optimization: an exact $\mathcal{O}(1/t)$ rate for Anchored GDA, discovering a novel parameter schedule by marking the learning schedule as an `EVOLVE-VALUE` (proof and schedule searched jointly).
- Additive combinatorics: helped resolve #57 from Ben Green's list (formalized a candidate counterexample, agent proved it disproves the conjecture).
- Quantum optics (with Mario Krenn): monochromatic quantum-graph / high-dim GHZ-state existence for $N=d\in\{4,6,10\}$.
- Graph theory: a bipartite variant of the reconstruction conjecture; a 1996 conjecture from the Graffiti auto-conjecturing system (pointing toward an AI-conjecture→AI-proof loop).
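The "test lemmas" mentioned for the OEIS conjectures can be illustrated with a small Lean 4 sketch. The sequence chosen here (Fibonacci numbers, OEIS A000045) and the name `fibSeq` are illustrative assumptions, not an example from the paper.

```lean
/-- Hypothetical formalization of a sequence: Fibonacci numbers (OEIS A000045). -/
def fibSeq : Nat → Nat
  | 0 => 0
  | 1 => 1
  | n + 2 => fibSeq n + fibSeq (n + 1)

-- Test lemmas: the formal definition must reproduce the first listed terms
-- 0, 1, 1, 2, 3, 5, 8 before any conjecture about it is attempted.
example : fibSeq 0 = 0 := rfl
example : fibSeq 1 = 1 := rfl
example : fibSeq 2 = 1 := rfl
example : fibSeq 3 = 2 := rfl
example : fibSeq 4 = 3 := rfl
example : fibSeq 5 = 5 := rfl
example : fibSeq 6 = 8 := rfl
```

If the definition were off by one in its indexing, or used the wrong recurrence, at least one of these checks would fail to compile, flagging the misformalization before any search budget is spent on the conjecture itself.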
Misformalization detection — an unexpected payoff#
Because the agent reasons against the formal statement, it surfaces errors in how problems were formalized. Examples: it found proofs by reading "density" as natural density, prompting corrections to "lower density" (#125) and "upper density" (#741(i)); it also identified misformalizations in the literature. Failure modes justify the formality too: top sketches sometimes offloaded the core difficulty into a single `sorry` in a helper lemma that restated the target, or cited "established" lemmas that were hallucinations. Both are caught precisely because end-to-end formal verification refuses to accept them.
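For reference, the three notions of density involved are the standard ones for a set $A\subseteq\mathbb{N}$ (these are textbook definitions, not quoted from the paper):

$$
\overline{d}(A)=\limsup_{n\to\infty}\frac{|A\cap\{1,\dots,n\}|}{n},
\qquad
\underline{d}(A)=\liminf_{n\to\infty}\frac{|A\cap\{1,\dots,n\}|}{n},
$$

$$
d(A)=\lim_{n\to\infty}\frac{|A\cap\{1,\dots,n\}|}{n}
\quad\text{(defined only when the limit exists, i.e. when } \underline{d}(A)=\overline{d}(A)\text{)}.
$$

Since $\underline{d}(A)\le\overline{d}(A)$ always holds and natural density requires equality, a statement phrased with one notion is not interchangeable with the same statement phrased with another. Choosing the wrong one can make a formal statement strictly easier or harder than the problem the author intended.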
Deepening human understanding#
The paper's stance: "the future of mathematics lies in human–machine partnership." Collaborators found that proof attempts enhanced their understanding even when the agent failed — formal sketches let experts focus on the unresolved subgoals rather than re-verifying the whole argument. This is Outsource Your Thinking, Not Your Understanding realized: the AI does the search; the mathematician's understanding is sharpened, not bypassed.
Connections#
- AlphaProof Nexus — the framework and agent architectures that implement the paradigm
- Lean — the proof assistant whose compiler provides the verification/grounding
- The Verifiability Thesis — math+Lean is the maximally-verifiable domain; the compiler is the reward signal
- Agentic Loops Overtake Bespoke Systems — the headline finding: a simple loop matched the bespoke system as LLMs improved
- Evolutionary Proof Search — the full-featured agent's population/Elo search mechanism
- Agent Loop Pattern — the basic prover subagent is literally a "Ralph loop" (huntley2025ralph)
- Outsource Your Thinking, Not Your Understanding — formal sketches deepen mathematician understanding even on unsolved problems
- Client-Side Agent Optimization — solve-rate-vs-cost Pareto curves across agents (A/B/C/D) are the same cost/quality framing AgentOpt formalizes
- Scale-Dependent Prompt Sensitivity — smaller Gemini models solved nothing; capability is sharply scale-gated here (a hard threshold, not a smooth curve)
- Jagged Intelligence (Ghosts, Not Animals) — hallucinated "literature" lemmas are jaggedness; formal verification is the filter that catches it
Open Questions#
- Successes cluster where Lean's mathlib is mature and problems decompose into tractable subgoals (combinatorics, convex optimization, number theory). What expands the frontier to problems needing new theory?
- The agents inherit their LLMs' biases and show high search variance. How do you characterize and push the boundary of what's reachable?
- The Graffiti result hints at closing the loop between AI conjecturing and AI proving. What does an end-to-end conjecture→formalize→prove pipeline look like?
