AI R&D Autonomy Evaluation (AECI)

Published: June 7, 2026
Filed: Concept
Domain: Governance & Workforce
Tags: Governance, AI R&D, Capability Evaluation, Recursive Self-Improvement, Anthropic
Reading: 6 min
Source: AI-synthesised

How Anthropic measures whether a model can automate or dramatically accelerate AI research, the capability that drives recursive self-improvement. It is tracked via the AECI capability index plus concrete shortcomings relative to human researchers. Opus 4.8 sits below the frontier and is not close to substituting for research staff.

Summary#

AI R&D autonomy evaluation is the evaluation cluster that measures whether a model can automate or dramatically accelerate AI research and development. Taken far enough, that capability would enable recursive self-improvement, which makes it the load-bearing input to the RSP automated-AI-R&D threat model. For Opus 4.8 the determination is that it does not cross the automated AI-R&D capability threshold: it sits between Opus 4.7 and Mythos Preview on the measured axes, does not advance the frontier, and, most importantly per Anthropic, "does not seem close to being able to substitute for Research Scientists and Research Engineers, especially relatively senior ones."

How it's measured#

AECI — the capability index#

The Anthropic ECI (AECI) is a fork of Epoch AI's Epoch Capability Index, used to track the rate of capability improvement over time. A slope-ratio analysis on the frontier models estimates how fast capability is rising. For Opus 4.8 (computed on a smaller n=11 evaluation set):

  • Opus 4.8: 155.5, between Opus 4.7 (154.1) and Mythos Preview (158.3).

Because the slope-ratio analysis is computed on frontier models only and Opus 4.8 is a non-frontier point, adding it leaves the trajectory unchanged from the Mythos Preview System Card.
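The reasoning in this subsection can be illustrated with a small sketch. It is not Anthropic's slope-ratio method, which this page does not specify. It only shows, under stated assumptions, why a non-frontier point leaves a frontier-only analysis unchanged and where 155.5 falls between the two neighbouring scores. The three scores come from the text; the list order (treated as release order), the running-maximum definition of "frontier", and every function name are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    model: str
    score: float  # AECI value as reported in the text


# Scores are from the text. The ordering is an ASSUMED release order.
POINTS = [
    Point("Opus 4.7", 154.1),
    Point("Mythos Preview", 158.3),
    Point("Opus 4.8", 155.5),
]


def frontier(points):
    """Keep only points whose score strictly exceeds every earlier score.

    This running-maximum rule is an assumed stand-in for "frontier model";
    the real selection criteria are not given on this page.
    """
    best = float("-inf")
    kept = []
    for p in points:
        if p.score > best:
            kept.append(p)
            best = p.score
    return kept


def relative_position(low, high, value):
    """Fraction of the way from low to high at which value sits."""
    return (value - low) / (high - low)


frontier_without = frontier(POINTS[:2])  # before Opus 4.8 is added
frontier_with = frontier(POINTS)         # after Opus 4.8 is added
position = relative_position(154.1, 158.3, 155.5)

print([p.model for p in frontier_with])
print(round(position, 3))
```

Under these assumptions the frontier set is identical with and without Opus 4.8, so any statistic computed on frontier points alone cannot change. The position value shows that Opus 4.8 sits about a third of the way from Opus 4.7 to Mythos Preview on this index.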

The two-pronged threshold#

From the RSP, the AI-R&D threshold is met if either: (1) models can fully substitute for Anthropic's entire set of Research Scientists/Engineers at competitive cost (within 5×), or (2) there is "dramatic acceleration" of AI progress attributable to automation. Anthropic determined for Mythos Preview that neither holds — no sustained AI-attributable 2× acceleration, and no closeness to substituting for senior research staff — and both conclusions carry over to Opus 4.8.
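The either/or structure of the threshold can be written as a small predicate. This is a minimal sketch of the logic as summarised above, not an official RSP procedure: the field names are hypothetical, the 5× cost limit is taken from the text, and treating a sustained 2× speedup as the bar for "dramatic acceleration" is an assumption drawn from the Mythos Preview wording.

```python
from dataclasses import dataclass

COST_MULTIPLE_LIMIT = 5.0   # "within 5×" of human cost, from the text
ACCELERATION_FACTOR = 2.0   # ASSUMED reading of "dramatic acceleration"


@dataclass(frozen=True)
class Assessment:
    """Hypothetical inputs to the determination; all names are illustrative."""
    fully_substitutes: bool        # can replace the entire research staff
    cost_multiple: float           # model cost relative to human cost
    sustained_acceleration: float  # AI-attributable speedup of AI progress


def threshold_met(a: Assessment) -> bool:
    """Return True if either prong of the AI-R&D threshold holds."""
    prong_substitution = (
        a.fully_substitutes and a.cost_multiple <= COST_MULTIPLE_LIMIT
    )
    prong_acceleration = a.sustained_acceleration >= ACCELERATION_FACTOR
    return prong_substitution or prong_acceleration


# Illustrative inputs only; these are not measured values.
neither_prong = Assessment(False, 1.0, 1.2)
acceleration_only = Assessment(False, 1.0, 2.5)
substitution_too_costly = Assessment(True, 8.0, 1.0)

print(threshold_met(neither_prong))
print(threshold_met(acceleration_only))
print(threshold_met(substitution_too_costly))
```

The design point is that the prongs are joined by "or": a model that accelerates progress enough crosses the threshold even if it cannot substitute for staff, while full substitution counts only when it is also cost-competitive.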

Concrete shortcomings vs. human researchers#

Rather than rely on benchmark scores alone, the card collects observable failures from day-to-day internal pre-release use (§2.3.3): examples of fabrication, ignoring correction, skipping cheap verification, and instruction-following failures. These behavioral examples — not just scores — anchor the "not close to substituting" determination. (They also overlap with the Agentic Honesty & Diligence failure modes, observed here in a research-engineering setting.)

Why task-based AI-R&D benchmarks were retired#

Recent models have crossed the highest human baselines on many automated task-based AI-R&D evaluations, so those tasks are no longer load-bearing for RSP threshold determinations and are no longer reported. Anthropic is shifting toward direct measurement of AI R&D acceleration and researcher uplift — i.e., measuring the real-world speedup rather than proxy task scores.

The evaluation described on this page is the capability-side gate on Recursive Self-Improvement: AECI and the substitution threshold are how Anthropic asks "can the model build the next model?" The deployment-side correlate, how much AI is already accelerating Anthropic's own work, is documented in the Anthropic Institute essay When AI builds itself and compiled here as AI Accelerating AI Development (more than 80% of merged code Claude-authored; roughly 8× code per engineer per day versus 2024; a kernel-optimization eval rising from 3× to 52× in a year). The two are complementary: AECI gates the capability, while AI Accelerating AI Development measures the acceleration already underway. Both describe the same persistent gap, judgment in choosing goals (Research Taste as the Human Bottleneck), which is also the axis on which the "not close to substituting for senior researchers" determination turns.

Connections#

  • Recursive Self-Improvement — AECI is the capability-side gate on whether the model can build its successor
  • AI Accelerating AI Development — the deployment-side correlate the System Card anticipated, now compiled from When AI builds itself
  • Research Taste as the Human Bottleneck — "not close to substituting for senior researchers" is the formal version of "taste/judgment is still the human gap"
  • Task Time-Horizon Scaling — the saturating task-based benchmarks (and the time-horizon curve that outran them) are why AECI shifted toward direct acceleration measurement
  • Responsible Scaling Policy Evaluations — AECI feeds the RSP automated-AI-R&D threat-model determination
  • Claude Opus 4.8 — the model assessed; AECI 155.5, below the frontier, not close to substituting for researchers
  • Mythos Model — the frontier-setting model; its System Card holds the full methodology and bounds the Opus 4.8 case
  • Agentic Honesty & Diligence — the fabrication / ignored-correction / skipped-verification shortcomings are the same alignment failure modes seen in coding evals, here in a research setting
  • The Bitter Lesson — the acceleration AECI tracks is what makes "scaled general methods improve themselves" more than a slogan
  • Harness Shrinkage as Models Improve — the deployment-side correlate: as the model absorbs more capability, internal engineering accelerates (the recursive-self-improvement throughput story)
  • Autonomous Scientific Discovery — adjacent autonomy in a non-AI science domain (a model designing+training a model that beats a published baseline), though gated below the AI-R&D substitution threshold this page measures

Open questions#

  • "Not close to substituting for senior researchers" is a subjective, internally-sourced judgment. What objective signal would replace it as models approach the threshold?
  • AECI is a single scalar fork of an external index; how sensitive is the 155.5 / frontier-not-advanced conclusion to the choice of the n=11 evaluation set?
  • The shift to "direct measurement of AI R&D acceleration and researcher uplift" is announced but not yet operationalized in this card — what does that measurement look like?

Sources#

  • Claude Opus 4.8 System Card — §2.3 (AI R&D): §2.3.1 autonomy evaluations, §2.3.3 shortcomings vs. human researchers, §2.3.4 AECI capability trajectory, §2.3.5 conclusion
About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.
