Howardism vol. 03 · quiet corner of the web

METR

Published: June 7, 2026
Filed: Entity
Domain: Entities
Tags: Entity Org, AI Evaluation, Benchmarks
Reading: 2 min
Source: AI-synthesised

Independent AI-evaluation organization behind the 'time horizons' benchmark, which measures the task length a model can complete reliably on its own. It is the source of the doubling-every-~4-months trendline and of the 'upper end of what we can measure' verdict on Mythos Preview.

Summary

METR (Model Evaluation & Threat Research) is an independent organization that evaluates frontier-AI capabilities, best known for its time-horizons measurement: the length of task a model can complete reliably on its own. Its data is the external-benchmark backbone of the Anthropic Institute's When AI builds itself essay and anchors this wiki's Task Time-Horizon Scaling page.

What it does

Time horizons. METR reports the task duration at which a model succeeds 50% of the time across a basket of tasks; the trend also holds at an 80% success threshold. Its headline finding is that this horizon is doubling roughly every four months, up from an earlier doubling time of about seven months. That shortening is the quantitative case that capability is accelerating, not merely improving.
Long-task measurement at the frontier. METR found Claude Mythos Preview could work for "at least" 16 hours and was "at the upper end of what [METR] can measure without new tasks" — i.e. the frontier model has begun to outrun the benchmark's own ceiling.
Independent third-party signal. Because METR sits outside the labs, its numbers function as external corroboration of internal acceleration claims like Anthropic's ~8× code-throughput figure (AI Accelerating AI Development).
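
To make the idea of a time horizon at a given reliability level concrete, here is a minimal sketch in Python. It is not METR's actual procedure, which this page does not describe. It assumes success rates have already been aggregated per task length, and it simply interpolates in log task length to find where the success rate crosses the chosen threshold. The function name `horizon_at` and all of the data are hypothetical.

```python
import math


def horizon_at(buckets, threshold=0.5):
    """Estimate the task length (in minutes) at which success falls to `threshold`.

    `buckets` is a list of (task_minutes, success_rate) pairs. The estimate
    interpolates linearly in log2(task length) between the two neighbouring
    buckets that bracket the threshold. Returns None if the success rate
    never crosses the threshold within the data.
    """
    points = sorted(buckets)
    for (t0, p0), (t1, p1) in zip(points, points[1:]):
        if p0 >= threshold > p1:
            # Fraction of the way from the shorter bucket to the longer one.
            frac = (p0 - threshold) / (p0 - p1)
            log_t = math.log2(t0) + frac * (math.log2(t1) - math.log2(t0))
            return 2 ** log_t
    return None


# Hypothetical success rates by task length; illustrative only, not METR data.
buckets = [(15, 0.9), (60, 0.7), (240, 0.3), (960, 0.1)]

print(horizon_at(buckets, 0.5))  # 50%-reliability horizon, in minutes
print(horizon_at(buckets, 0.8))  # 80%-reliability horizon, in minutes
```

With these made-up numbers the 50% horizon comes out near 120 minutes and the 80% horizon near 30 minutes. A stricter reliability threshold gives a shorter horizon, which is why the threshold has to be stated alongside any horizon figure.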

Connections

Task Time-Horizon Scaling — the concept page built on METR's time-horizons metric
AI Accelerating AI Development — METR's external trendline corroborates Anthropic's internal-throughput evidence
Recursive Self-Improvement — the doubling curve, extrapolated, is the quantitative case for RSI arriving sooner than expected
Mythos Model — the model METR rated at "at least 16 hours," beyond its current measurement ceiling

Open questions

What new tasks will METR build to measure days- and weeks-long horizons once current baskets saturate?
METR also runs the research showing developer self-estimates of AI uplift are overstated — how does it reconcile that skepticism with its own steep time-horizon curve?
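
On the first question, the doubling arithmetic gives a rough sense of how quickly days- and weeks-long horizons could arrive. The sketch below is a back-of-the-envelope calculation, not a forecast. It assumes the trend stays exactly exponential, and it treats the "at least" 16 hours reported for Mythos Preview as a starting horizon, even though that figure is a lower bound and may not correspond to the 50%-reliability horizon. The targets of 24 and 168 hours are hypothetical stand-ins for "a day" and "a week".

```python
import math


def annual_growth(doubling_months):
    """Growth factor over 12 months for a given doubling time in months."""
    return 2 ** (12 / doubling_months)


def months_to_reach(start_hours, target_hours, doubling_months):
    """Months for a horizon to grow from start to target at a fixed doubling time."""
    return doubling_months * math.log2(target_hours / start_hours)


FAST_DOUBLING = 4   # months, the current trend cited on this page
SLOW_DOUBLING = 7   # months, the earlier trend cited on this page
START_HOURS = 16    # assumption: use the "at least 16 hours" lower bound as a start

for label, target in [("one day (24 h)", 24), ("one week (168 h)", 168)]:
    fast = months_to_reach(START_HOURS, target, FAST_DOUBLING)
    slow = months_to_reach(START_HOURS, target, SLOW_DOUBLING)
    print(f"{label}: {fast:.1f} months at 4-month doubling, "
          f"{slow:.1f} months at 7-month doubling")

print(f"yearly growth: {annual_growth(FAST_DOUBLING):.1f}x vs "
      f"{annual_growth(SLOW_DOUBLING):.1f}x")
```

A four-month doubling compounds to 8x per year, against roughly 3.3x per year for a seven-month doubling, so under these assumptions the gap between the two trends widens quickly.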

Sources

When AI builds itself — cites METR time horizons and METR's Mythos Preview "16 hours / upper end of what we can measure" assessment

About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

Cited by 5

Anthropic
AI safety company / vendor of Claude; mission-as-tiebreaker culture; ~30–40 PMs across teams; Mike Krieger leads Labs r…
Entities — People, Orgs, Tools & Projects
Map of Content for all 32 entity pages. See Home for concept domains.
Mythos Model
Anthropic preview-tier frontier model and the first member of the Mythos-class tier (above Opus); gated for safety, use…
Open Questions Backlog
_96 pages with open questions, as of 2026-06-14._
Task Time-Horizon Scaling
METR's measure of the task length AI can complete reliably on its own, doubling roughly every 4 months (up from every 7…

Related articles

AI R&D Autonomy Evaluation (AECI)
How Anthropic measures whether a model can automate or dramatically accelerate AI research — the capability that drives…
AI Accelerating AI Development
The empirical core of *When AI builds itself*: measured evidence AI already speeds AI R&D at Anthropic — >80% of merged…
Anthropic Institute
Anthropic's policy/governance research arm; published *When AI builds itself* (Favaro & Clark, 2026) on recursive self-…
Claude Fable 5
Anthropic's first generally-available Mythos-class model (June 2026) — state-of-the-art on nearly all benchmarks; the s…
Claude Opus 4.8
Anthropic's most capable general-access model (May 2026); upgrade on Opus 4.7 in SWE/agentic/knowledge work; does not a…
