Autonomous Scientific Discovery

Sources

Claude Fable 5 and Claude Mythos 5

Summary

With Mythos 5 (the bio-safeguards-lifted form of Fable 5), Anthropic reports the first Claude results in which a model conducts novel scientific research largely on its own — choosing experimental moves, running domain tools, recovering from failures, and producing findings that match or beat skilled humans and recent published baselines. This is the wet-lab / life-sciences analogue of AI-Driven Formal Proof Search: where formal proof search has a Lean compiler as an instant verifier, science's verifier is the experiment — slower and more expensive — so the claims here are empirical demonstrations and selected examples, not compiler-checked guarantees. The results are the sharpest evidence yet for the less-conservative reading of recursive self-improvement: that "perspiration is becoming automated" reaches into discovery itself, and that research taste may be "just another capability AI fails at for a time, then gets good at."

The three results

Drug / protein design: autonomy at human level

Anthropic's internal protein-design experts accelerated aspects of drug design "by around 10 times" using Mythos 5. In one study, Mythos 5, equipped with protein-design and bioinformatics tools but given no human assistance, matched or beat skilled human operators, executing "all of the tasks normally completed by a scientist: choosing binding sites, selecting and running protein design tools, and recovering from failures along the way." Nine of 14 protein targets (about 64%) yielded strong drug-design candidates that are now under investigation, across immune checkpoints, growth-factor/receptor signaling, neurodegeneration, muscle disease, and harder structural targets.

Novel hypotheses: preferred over Opus-class, one corroborated

Mythos 5 is Anthropic's "first model to consistently produce novel, compelling scientific hypotheses." In blinded head-to-head comparisons against Opus-class models, Anthropic scientists preferred Mythos 5's molecular-biology hypotheses ~80% of the time and advanced several to experimental evaluation. One Mythos 5 hypothesis, a novel mechanism for an E. coli protein, was independently corroborated by a study from a lab working on the same problem.
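How much weight a ~80% preference rate can bear depends on how many blinded comparisons sit behind it, and the source does not report that number. The sketch below computes a Wilson score interval for a few hypothetical sample sizes; the function name `wilson_interval` and the values of `n` are illustrative assumptions, not figures from Anthropic.

```python
import math

def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# ASSUMPTION: the sample sizes are hypothetical; the source gives only "~80%".
for n in (20, 50, 200):
    wins = round(0.8 * n)
    lo, hi = wilson_interval(wins, n)
    print(f"n={n:4d}  preferred={wins:4d}  95% CI = [{lo:.2f}, {hi:.2f}]")
```

The interval narrows as the number of comparisons grows, which is why the comparison count matters when reading the headline rate.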

Genomics: a week of autonomy, beating a published model with one 100× smaller

Over "more than a week of largely autonomous work," Mythos 5 assembled single-cell data for millions of cells across 138 animal species, then designed and trained a custom machine-learning model to identify cells performing the same role in even distantly related organisms. With only high-level human input, that trained model outperformed a recent model published in Science despite being 100× smaller. Anthropic intends to publish these results.
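One simple way to picture "identifying cells performing the same role" across species is label transfer in a shared embedding space. The sketch below is not the model described above, whose architecture the source does not give. It assumes every cell already has an embedding vector that is comparable across species, and all names (`transfer_labels`, the toy cell types and vectors) are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def transfer_labels(reference, query):
    """Assign each query cell the reference cell type with the nearest centroid.

    reference: {cell_type: [embedding, ...]} from an annotated species
    query:     {cell_id: embedding} from another species
    """
    centroids = {label: centroid(vecs) for label, vecs in reference.items()}
    return {
        cell_id: max(centroids, key=lambda lab: cosine(emb, centroids[lab]))
        for cell_id, emb in query.items()
    }

# ASSUMPTION: toy 3-dimensional embeddings, invented for illustration only.
reference = {
    "neuron": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "muscle": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}
query = {"cell_1": [0.7, 0.2, 0.1], "cell_2": [0.1, 0.7, 0.4]}
print(transfer_labels(reference, query))
```

In a setup like this, the hard part is learning embeddings that stay comparable between distantly related organisms. The nearest-centroid step is only the final readout.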

The dual-use shadow

The same capability is why biology must be safeguarded in the general-access Fable 5. The motivating evaluation: predicting how a genetic modification affects adeno-associated virus (AAV) capsid assembly — a real gene-therapy component whose design capability "in the wrong hands, could enable the design of dangerous viruses." Mythos-class models outperformed dedicated protein-language models on this without being trained for the task, using biological reasoning alone. Autonomous scientific capability and bio-uplift risk are the same capability seen from two sides — the core tension the RSP CB determination and the bio classifier exist to manage.

Why it matters for the trajectory

Perspiration automation reaches discovery. "When AI builds itself" argued that most research progress is incremental "scale-it-up-see-what-breaks-fix-it" work that Claude excels at. Autonomous genomics (assemble data, design a model, train it, beat the baseline) is that loop run end-to-end in a science domain, not just in engineering.
It chips at the taste moat. Consistently producing "novel, compelling" hypotheses and working with "only high-level human input" both encroach on the direction-setting functions presumed to stay human. The ~80% blinded preference is a concrete crack, though still human-judged and internally sourced.
Still jagged, still gated by verification. These are curated demonstrations (see Jagged Intelligence (Ghosts, Not Animals)). Science's verifier is slow wet-lab confirmation, not a compiler, so unlike in AI-Driven Formal Proof Search the results cannot be auto-validated; they await experimental confirmation and peer review. This keeps the work adjacent to, but below, the AI-R&D autonomy threshold Anthropic gates on.

Connections

AI-Driven Formal Proof Search — the formal-math sibling: AI doing novel research, but with an instant compiler-verifier; science substitutes the (slow, costly) experiment, so verification is the harder bottleneck here
Recursive Self-Improvement — the clearest wet-lab evidence for "perspiration is becoming automated," the essay's less-conservative reading
Research Taste as the Human Bottleneck — autonomous hypothesis-generation and "only high-level human input" are direct chips at the residual human comparative advantage
AI R&D Autonomy Evaluation (AECI): adjacent autonomy. A model designing and training a model that beats a published baseline is AI-R&D-shaped, though in genomics rather than in AI itself
Task Time-Horizon Scaling — "over a week of largely autonomous work" is a concrete long-horizon datapoint beyond Mythos Preview's measured 16h
Jagged Intelligence (Ghosts, Not Animals) — the caveat: these are selected demonstrations of a still-jagged capability, not uniform competence
The Verifiability Thesis — the limiting case: science is less verifiable than Lean proof, so autonomy outruns cheap verification — the experiment, not a compiler, is the reward signal
Capability-Gated Model Fallback — the dual-use flip side; the AAV result is the bio classifier's motivating example
Responsible Scaling Policy Evaluations — the CB (chemical/biological) risk domain these capabilities advance
Claude Mythos 5 — the model (bio safeguards lifted) that produced these results
Claude Fable 5 — the general-access sibling on which biology is safeguarded

Open questions

Every result is Anthropic-reported and example-selected, and the genomics claim (a 100× smaller model beating one published in Science) currently rests on an intention to publish. What survives external peer review?
Science's verification gap: the formal-proof loop self-validates; here a wrong-but-confident hypothesis costs a wet-lab cycle to falsify. Does autonomy without a fast verifier increase the verification bottleneck rather than relieve it?
If hypothesis-generation is genuinely at ~80% preference, how much of "research taste" is left as a distinctively human function — and how would you measure the residue?
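The verification-gap question above can be made concrete with a toy queue: hypotheses arrive at one rate, wet-lab capacity clears them at another, and the unverified backlog is whatever is left over. This is a minimal sketch under stated assumptions (constant weekly rates, one lab cycle per hypothesis). The function name `backlog_over_time` and every number are hypothetical, not measurements from the source.

```python
def backlog_over_time(generated_per_week: int, verified_per_week: int, weeks: int) -> list[int]:
    """Unverified-hypothesis backlog at the end of each week, starting from zero."""
    backlog = 0
    history = []
    for _ in range(weeks):
        # New hypotheses arrive, the lab clears what it can, and the queue cannot go negative.
        backlog = max(0, backlog + generated_per_week - verified_per_week)
        history.append(backlog)
    return history

# ASSUMPTION: illustrative rates only.
print(backlog_over_time(generated_per_week=10, verified_per_week=4, weeks=4))
print(backlog_over_time(generated_per_week=3, verified_per_week=4, weeks=4))
```

Under these assumptions the backlog grows linearly whenever generation outpaces verification and stays at zero otherwise, so faster hypothesis generation alone shifts the bottleneck onto the experiment rather than removing it.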

Sources

Claude Fable 5 and Claude Mythos 5 — §"Evaluating Claude Fable 5 and Claude Mythos 5" (drug design; novel hypotheses; genomics) and §"Biology and chemistry" (AAV dual-use)