AI R&D Autonomy Evaluation (AECI)

Sources#

Claude Opus 4.8 System Card

Summary#

This page covers the evaluation cluster that measures whether a model can automate or dramatically accelerate AI research and development. Taken far enough, that capability would enable recursive self-improvement, which makes it the load-bearing input to the RSP automated-AI-R&D threat model. For Opus 4.8 the determination is that the model does not cross the automated AI-R&D capability threshold: it sits between Opus 4.7 and Mythos Preview on the measured axes, does not advance the frontier, and, most importantly per Anthropic, "does not seem close to being able to substitute for Research Scientists and Research Engineers, especially relatively senior ones."

How it's measured#

AECI — the capability index#

The Anthropic ECI (AECI) is a fork of Epoch AI's Epoch Capability Index, used to track the rate of capability improvement over time. A slope-ratio analysis on the frontier models estimates how fast capability is rising. For Opus 4.8 (computed on a smaller n=11 evaluation set):

Opus 4.8: 155.5 — between Opus 4.7: 154.1 and Mythos Preview: 158.3.

The slope-ratio analysis is computed on frontier models only, and Opus 4.8 is a non-frontier point. Adding it therefore leaves the trajectory unchanged from the one reported in the Mythos Preview System Card.
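Why a non-frontier point cannot move a frontier-only analysis can be sketched in a few lines. This is an illustrative sketch, not the system card's method: the `ModelScore` type, the `frontier` helper, the running-maximum definition of "frontier", and the `release_order` values are all assumptions made for the example. Only the three AECI scores come from this page, and the sketch assumes Opus 4.8 is ordered after Mythos Preview.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelScore:
    name: str
    release_order: int  # hypothetical ordinal; real release dates are not given on this page
    aeci: float         # AECI score as quoted on this page


def frontier(points):
    """Return the points that set a new running maximum, in release order.

    Assumption: "frontier" means "scored higher than every earlier model".
    """
    best = float("-inf")
    result = []
    for p in sorted(points, key=lambda p: p.release_order):
        if p.aeci > best:
            result.append(p)
            best = p.aeci
    return result


# Toy series: the first model is trivially on the frontier.
baseline = [
    ModelScore("Opus 4.7", 1, 154.1),
    ModelScore("Mythos Preview", 2, 158.3),
]
# Assumed ordering: Opus 4.8 arrives after Mythos Preview.
with_opus_48 = baseline + [ModelScore("Opus 4.8", 3, 155.5)]

frontier_before = frontier(baseline)
frontier_after = frontier(with_opus_48)

# Any slope fitted only to frontier points sees identical inputs in both cases.
unchanged = frontier_before == frontier_after
```

Because 155.5 is below the running maximum of 158.3, Opus 4.8 never enters the frontier set, so anything computed from that set alone is the same with or without it.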

The two-pronged threshold#

Under the RSP, the AI-R&D threshold is met if either of two conditions holds: (1) models can fully substitute for Anthropic's entire set of Research Scientists and Research Engineers at competitive cost (within 5×), or (2) there is "dramatic acceleration" of AI progress attributable to automation. Anthropic determined for Mythos Preview that neither holds: there is no sustained AI-attributable 2× acceleration, and the model is not close to substituting for senior research staff. Both conclusions carry over to Opus 4.8, which sits below Mythos Preview on AECI.
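The either/or structure of the threshold can be written as a small predicate. This is a sketch under stated assumptions rather than Anthropic's actual procedure: the `Assessment` fields and `threshold_met` are hypothetical names, the determination in the card is a qualitative judgment rather than a computed value, and treating 2× as the cutoff for "dramatic acceleration" is an assumption based on the 2× figure quoted above.

```python
from dataclasses import dataclass

COST_MULTIPLE_LIMIT = 5.0  # "within 5×" from the text
ACCELERATION_LIMIT = 2.0   # assumed cutoff, taken from the 2× figure in the text


@dataclass(frozen=True)
class Assessment:
    full_substitution: bool        # can models replace the entire research staff?
    cost_multiple: float           # model cost relative to human cost (hypothetical input)
    sustained_acceleration: float  # AI-attributable speedup factor (hypothetical input)


def threshold_met(a: Assessment) -> bool:
    """Return True if either prong of the AI-R&D threshold holds."""
    prong_1 = a.full_substitution and a.cost_multiple <= COST_MULTIPLE_LIMIT
    prong_2 = a.sustained_acceleration >= ACCELERATION_LIMIT
    return prong_1 or prong_2


# Placeholder inputs, not measurements: no substitution and no dramatic acceleration.
example = Assessment(
    full_substitution=False,
    cost_multiple=float("inf"),
    sustained_acceleration=1.0,
)
result = threshold_met(example)
```

The point of the sketch is the disjunction: a model that accelerates progress dramatically meets the threshold even without full substitution, and full substitution only counts when it is also cost-competitive.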

Concrete shortcomings vs. human researchers#

Rather than rely on benchmark scores alone, the card collects observable failures from day-to-day internal pre-release use (§2.3.3): examples of fabrication, ignoring correction, skipping cheap verification, and instruction-following failures. These behavioral examples — not just scores — anchor the "not close to substituting" determination. (They also overlap with the Agentic Honesty & Diligence failure modes, observed here in a research-engineering setting.)

Why task-based AI-R&D benchmarks were retired#

Recent models have crossed the highest human baselines on many automated task-based AI-R&D evaluations, so those tasks are no longer load-bearing for RSP threshold determinations and are no longer reported. Anthropic is shifting toward direct measurement of AI R&D acceleration and researcher uplift — i.e., measuring the real-world speedup rather than proxy task scores.

The recursive-self-improvement link#

This is the capability-side gate on Recursive Self-Improvement: AECI and the substitution threshold are how Anthropic asks "can the model build the next model?" The deployment-side correlate, how much AI is already accelerating Anthropic's own work, is documented in the Anthropic Institute essay When AI builds itself and compiled here as AI Accelerating AI Development (more than 80% of merged code Claude-authored; roughly 8× code per engineer per day versus 2024; kernel-optimization eval going from 3× to 52× in a year). The two are complementary: AECI gates the capability, while AI Accelerating AI Development measures the acceleration already underway. Both describe the same persistent gap, judgment in choosing goals (Research Taste as the Human Bottleneck), which is also the axis on which the "not close to substituting for senior researchers" determination turns.

Connections#

Recursive Self-Improvement — AECI is the capability-side gate on whether the model can build its successor
AI Accelerating AI Development — the deployment-side correlate the System Card anticipated, now compiled from When AI builds itself
Research Taste as the Human Bottleneck — "not close to substituting for senior researchers" is the formal version of "taste/judgment is still the human gap"
Task Time-Horizon Scaling — the saturating task-based benchmarks (and the time-horizon curve that outran them) are why AECI shifted toward direct acceleration measurement
Responsible Scaling Policy Evaluations — AECI feeds the RSP automated-AI-R&D threat-model determination
Claude Opus 4.8 — the model assessed; AECI 155.5, below the frontier, not close to substituting for researchers
Mythos Model — the frontier-setting model; its System Card holds the full methodology and bounds the Opus 4.8 case
Agentic Honesty & Diligence — the fabrication / ignored-correction / skipped-verification shortcomings are the same alignment failure modes seen in coding evals, here in a research setting
The Bitter Lesson — the acceleration AECI tracks is what makes "scaled general methods improve themselves" more than a slogan
Harness Shrinkage as Models Improve — the deployment-side correlate: as the model absorbs more capability, internal engineering accelerates (the recursive-self-improvement throughput story)
Autonomous Scientific Discovery — adjacent autonomy in a non-AI science domain (a model designing+training a model that beats a published baseline), though gated below the AI-R&D substitution threshold this page measures

Open questions#

"Not close to substituting for senior researchers" is a subjective, internally-sourced judgment. What objective signal would replace it as models approach the threshold?
AECI is a single scalar fork of an external index; how sensitive is the 155.5 / frontier-not-advanced conclusion to the choice of the n=11 evaluation set?
The shift to "direct measurement of AI R&D acceleration and researcher uplift" is announced but not yet operationalized in this card — what does that measurement look like?

Sources#

Claude Opus 4.8 System Card — §2.3 (AI R&D): §2.3.1 autonomy evaluations, §2.3.3 shortcomings vs. human researchers, §2.3.4 AECI capability trajectory, §2.3.5 conclusion