Howardismvol. 03 · quiet corner of the web

Plate IIEntities機器翻譯 · machine-translatedENHOWARDISM

METR

PublishedJune 7, 2026FiledEntityDomainEntitiesTagsEntityOrgAI EvaluationBenchmarksReading2 minSourceAI-synthesised

獨立的 AI 評估機構，'time horizons' 基準背後的推手——衡量模型能可靠獨立完成的任務時長；其每約 4 個月翻倍的趨勢線，以及對 Mythos Preview 做出的「處於我們能測量範圍上限」的評斷

METR 的插圖

資料來源#

When AI builds itself

摘要#

METR（Model Evaluation & Threat Research）是一個獨立組織，專門評估 frontier AI 的能力，最為人所知的是其 time-horizons 衡量指標：模型能可靠獨立完成的任務時長。它的數據是 Anthropic Institute 那篇 When AI builds itself 文章的外部基準支柱，也是本 wiki 中 Task Time-Horizon Scaling 頁面的基石。

它做什麼#

Time horizons。 報告在一籃子任務中模型達到 50% 可靠度的任務時長（此趨勢在 80% 可靠度下同樣成立）。METR 的核心發現是：這個時長大約每四個月就翻倍一次，比先前約七個月翻倍一次的速度更快——這是能力正在加速、而不僅僅是改善的量化證據。
frontier 上的長任務量測。 METR 發現 Claude Mythos Preview 能夠「至少」連續工作 16 小時，並且「處於 [METR] 在不增添新任務的情況下所能測量範圍的上限」——也就是說，這個 frontier 模型已經開始超出基準本身的天花板。
獨立第三方訊號。 由於 METR 處於各大實驗室之外，它的數字可作為對內部加速主張的外部佐證，例如 Anthropic 那個約 8 倍程式碼產出量的數字（AI Accelerating AI Development）。

相關連結#

Task Time-Horizon Scaling —— 建立在 METR 的 time-horizons 指標之上的概念頁面
AI Accelerating AI Development —— METR 的外部趨勢線佐證了 Anthropic 的內部產出量證據
Recursive Self-Improvement —— 這條翻倍曲線一旦外推，便構成 RSI 比預期更早到來的量化論據
Mythos Model —— METR 評定為「至少 16 小時」、超出其目前測量天花板的那個模型

未決問題#

一旦現有的任務籃飽和，METR 會打造哪些新任務，來測量以天、以週計的時長？
METR 同時也進行了一項研究，顯示開發者對 AI 提升幅度的自我估計被高估了——它要如何把這份懷疑，與自家那條陡峭的 time-horizon 曲線調和起來？

資料來源#

When AI builds itself —— 引用了 METR 的 time horizons，以及 METR 對 Mythos Preview 做出的「16 小時／我們能測量範圍的上限」評估

§ end

About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

Cited by 5

Anthropic
AI safety company / vendor of Claude; mission-as-tiebreaker culture; ~30–40 PMs across teams; Mike Krieger leads Labs r…
Entities — People, Orgs, Tools & Projects
Map of Content for all 32 entity pages. See Home for concept domains.
Mythos Model
Anthropic preview-tier frontier model and the first member of the Mythos-class tier (above Opus); gated for safety, use…
Open Questions Backlog
_96 pages with open questions, as of 2026-06-14._
Task Time-Horizon Scaling
METR's measure of the task length AI can complete reliably on its own, doubling roughly every 4 months (up from every 7…

Related articles

AI R&D Autonomy Evaluation (AECI)
How Anthropic measures whether a model can automate or dramatically accelerate AI research — the capability that drives…
AI Accelerating AI Development
The empirical core of *When AI builds itself*: measured evidence AI already speeds AI R&D at Anthropic — >80% of merged…
Anthropic Institute
Anthropic's policy/governance research arm; published *When AI builds itself* (Favaro & Clark, 2026) on recursive self-…
Claude Fable 5
Anthropic's first generally-available Mythos-class model (June 2026) — state-of-the-art on nearly all benchmarks; the s…
Claude Opus 4.8
Anthropic's most capable general-access model (May 2026); upgrade on Opus 4.7 in SWE/agentic/knowledge work; does not a…

Related articles

AI R&D Autonomy Evaluation (AECI)
How Anthropic measures whether a model can automate or dramatically accelerate AI research — the capability that drives…
AI Accelerating AI Development
The empirical core of *When AI builds itself*: measured evidence AI already speeds AI R&D at Anthropic — >80% of merged…
Anthropic Institute
Anthropic's policy/governance research arm; published *When AI builds itself* (Favaro & Clark, 2026) on recursive self-…
Claude Fable 5
Anthropic's first generally-available Mythos-class model (June 2026) — state-of-the-art on nearly all benchmarks; the s…
Claude Opus 4.8
Anthropic's most capable general-access model (May 2026); upgrade on Opus 4.7 in SWE/agentic/knowledge work; does not a…

Cited by 5

Anthropic
AI safety company / vendor of Claude; mission-as-tiebreaker culture; ~30–40 PMs across teams; Mike Krieger leads Labs r…
Entities — People, Orgs, Tools & Projects
Map of Content for all 32 entity pages. See Home for concept domains.
Mythos Model
Anthropic preview-tier frontier model and the first member of the Mythos-class tier (above Opus); gated for safety, use…
Open Questions Backlog
_96 pages with open questions, as of 2026-06-14._
Task Time-Horizon Scaling
METR's measure of the task length AI can complete reliably on its own, doubling roughly every 4 months (up from every 7…