Howardism | Formal Math notes

Map of Content for the formal-math domain — 3 concepts. Curated entry point; see Home for all domains.

Agentic Loops Overtake Bespoke Systems — DeepMind's basic Ralph-loop agent matched its bespoke evolutionary+AlphaProof system as the LLM improved; the bitter lesson / harness-shrinkage confirmed in formal math
AI-Driven Formal Proof Search — LLM generates Lean, compiler verifies every step → eliminates hallucination; DeepMind resolves 9/353 Erdős + 44/492 OEIS open problems; verification as a filter for human review
Evolutionary Proof Search — The full-featured agent's mechanism: population DB of proof sketches, Elo via Plackett–Luce/Gibbs, P-UCB selection, LLM-critic fitness for binary proof eval

Open questions 7 open

The bespoke advantage is dated "for now." What's the next model generation's verdict — does the evolutionary/AlphaProof apparatus survive on *any* problems, or fully collapse to a cost line?
Does the "simple loop + verifier beats bespoke system" result hold only where the verifier is perfect (Lean), or also in noisy-verifier domains (tests, LLM-judge councils)?

Successes cluster where [[lean]]'s mathlib is mature and problems decompose into tractable subgoals (combinatorics, convex optimization, number theory). What expands the frontier to problems needing *new theory*?
The agents inherit their LLMs' biases and show high search variance. How do you characterize and push the boundary of what's reachable?
The Graffiti result hints at closing the loop between AI *conjecturing* and AI *proving*. What does an end-to-end conjecture→formalize→prove pipeline look like?

The LLM-critic fitness is itself an unverified heuristic atop a verified substrate. How often does the Elo ranking mislead the search vs. the cost of computing it?
Hyperparameters ($c=0.2$, top-64, $P=7$) were "chosen empirically." How sensitive is the result to them, and do they transfer across mathematical domains?

Formal Math, in order.