Howardism · Vol. 03Plate II · No. 02
Formal Math, in order.
Notes3DomainFormal MathOpen Qs7Newest23 May 2026Oldest23 May 2026
Proof search, Lean, and verifier-driven mathematics.
Map of Content for the formal-math domain — 3 concepts. Curated entry point; see Home for all domains.
- Agentic Loops Overtake Bespoke Systems — DeepMind's basic Ralph-loop agent matched its bespoke evolutionary+AlphaProof system as the LLM improved; the bitter lesson / harness-shrinkage confirmed in formal math
- AI-Driven Formal Proof Search — LLM generates Lean, compiler verifies every step → eliminates hallucination; DeepMind resolves 9/353 Erdős + 44/492 OEIS open problems; verification as a filter for human review
- Evolutionary Proof Search — The full-featured agent's mechanism: population DB of proof sketches, Elo via Plackett–Luce/Gibbs, P-UCB selection, LLM-critic fitness for binary proof eval
Open questions 7 open
- Agentic Loops Overtake Bespoke Systems
- The bespoke advantage is dated "for now." What's the next model generation's verdict — does the evolutionary/AlphaProof apparatus survive on *any* problems, or fully collapse to a cost line?
- Does the "simple loop + verifier beats bespoke system" result hold only where the verifier is perfect (Lean), or also in noisy-verifier domains (tests, LLM-judge councils)?
- AI-Driven Formal Proof Search
- Successes cluster where [[lean]]'s mathlib is mature and problems decompose into tractable subgoals (combinatorics, convex optimization, number theory). What expands the frontier to problems needing *new theory*?
- The agents inherit their LLMs' biases and show high search variance. How do you characterize and push the boundary of what's reachable?
- The Graffiti result hints at closing the loop between AI *conjecturing* and AI *proving*. What does an end-to-end conjecture→formalize→prove pipeline look like?
- Evolutionary Proof Search
- The LLM-critic fitness is itself an unverified heuristic atop a verified substrate. How often does the Elo ranking mislead the search vs. the cost of computing it?
- Hyperparameters ($c=0.2$, top-64, $P=7$) were "chosen empirically." How sensitive is the result to them, and do they transfer across mathematical domains?