Howardism

METR

Published: June 7, 2026 · Filed: Entity · Domain: Entities · Tags: Entity, Org, AI Evaluation, Benchmarks · Reading: 2 min · Source: AI-synthesised

Independent AI-evaluation organization behind the 'time horizons' benchmark, which measures the task length a model can complete reliably on its own. Known for the doubling-every-~4-months trendline and for its 'upper end of what we can measure' verdict on Mythos Preview.


Summary

METR (Model Evaluation & Threat Research) is an independent organization that evaluates frontier-AI capabilities, best known for its time-horizons measurement: the length of task a model can complete reliably on its own. Its data is the external-benchmark backbone of the Anthropic Institute's When AI builds itself essay and anchors this wiki's Task Time-Horizon Scaling page.

What it does

  • Time horizons. METR reports the task duration at which a model succeeds 50% of the time across a basket of tasks (the trend also holds at an 80% threshold). Its headline finding is that this horizon now doubles roughly every four months, faster than the earlier ~seven-month doubling time. That shortening is the quantitative case that capability is accelerating, not merely improving.
  • Long-task measurement at the frontier. METR found Claude Mythos Preview could work for "at least" 16 hours and was "at the upper end of what [METR] can measure without new tasks" — i.e. the frontier model has begun to outrun the benchmark's own ceiling.
  • Independent third-party signal. Because METR sits outside the labs, its numbers function as external corroboration of internal acceleration claims such as Anthropic's ~8× code-throughput figure (see AI Accelerating AI Development).
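
The doubling-time claim in the first bullet can be made concrete with a little arithmetic. The following is a minimal sketch, not METR's method: the function names are hypothetical, the 128-hour target is an arbitrary illustrative number, and the logistic relationship between task length and success rate in `horizon_at_reliability` is an assumption made here for illustration, since this page does not describe how METR fits its curves. The 16-hour starting value echoes the "at least 16 hours" figure above, which is a lower bound and not a stated 50% horizon.

```python
import math


def growth_factor(elapsed_months: float, doubling_months: float) -> float:
    """Multiplier on the time horizon after `elapsed_months` of steady doubling."""
    return 2.0 ** (elapsed_months / doubling_months)


def months_to_reach(start_hours: float, target_hours: float,
                    doubling_months: float) -> float:
    """Months for the horizon to grow from `start_hours` to `target_hours`."""
    return doubling_months * math.log2(target_hours / start_hours)


def horizon_at_reliability(h50_hours: float, slope: float,
                           reliability: float) -> float:
    """Task length completed with probability `reliability`.

    ASSUMPTION (illustrative only): success probability follows
    p(t) = 1 / (1 + exp(slope * (log2(t) - log2(h50_hours)))),
    so p(h50_hours) = 0.5 and longer tasks succeed less often.
    `slope` is a hypothetical steepness parameter.
    """
    log_odds_against = math.log((1.0 - reliability) / reliability)
    return h50_hours * 2.0 ** (log_odds_against / slope)


if __name__ == "__main__":
    # Yearly growth under the two doubling times quoted in the text.
    print(f"4-month doubling, 12 months: {growth_factor(12, 4):.2f}x")
    print(f"7-month doubling, 12 months: {growth_factor(12, 7):.2f}x")

    # Hypothetical: time to go from a 16-hour to a 128-hour horizon.
    print(f"16h -> 128h at 4-month doubling: {months_to_reach(16, 128, 4):.1f} months")

    # Under the assumed logistic curve, the 80% horizon is shorter than the 50% one.
    print(f"80% horizon given a 16h 50% horizon: {horizon_at_reliability(16, 1.0, 0.8):.2f}h")
```

Under these assumptions, a four-month doubling time compounds to 8× per year, against roughly 3.3× per year at seven months. The last function also shows why a stricter reliability threshold yields a shorter horizon without necessarily changing the growth rate: in this toy model the 80% horizon is a fixed fraction of the 50% horizon as long as the slope stays constant.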


Sources

  • When AI builds itself — cites METR time horizons and METR's Mythos Preview "16 hours / upper end of what we can measure" assessment
About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.
