Howardism

METR

Published: June 7, 2026 · Filed: Entity · Domain: Entities · Tags: Entity, Org, AI Evaluation, Benchmarks · Reading: 2 min · Source: AI-synthesised

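An independent AI evaluation organization and the driving force behind the "time horizons" benchmark, which measures the length of task a model can reliably complete on its own. It is known for a trend line that doubles roughly every 4 months, and for its assessment that Mythos Preview is "at the upper end of what we can measure".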



Summary

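METR (Model Evaluation & Threat Research) is an independent organization that evaluates the capabilities of frontier AI. It is best known for its time-horizons measure: the length of task a model can reliably complete on its own. Its data is the external benchmark underpinning the Anthropic Institute article When AI builds itself, and it is the foundation of this wiki's Task Time-Horizon Scaling page.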

What it does

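  • Time horizons. Reports the task length at which a model reaches 50% reliability across a basket of tasks (the trend also holds at 80% reliability). METR's core finding is that this length now doubles roughly every four months, faster than the earlier rate of roughly every seven months. That is quantitative evidence that capability is accelerating, not merely improving.
  • Long-task measurement at the frontier. METR found that Claude Mythos Preview can work continuously for "at least" 16 hours and is "at the upper end of what [METR] is able to measure without adding new tasks". In other words, this frontier model has begun to exceed the ceiling of the benchmark itself.
  • Independent third-party signal. Because METR sits outside the major labs, its numbers can serve as external corroboration of internal acceleration claims, such as Anthropic's figure of roughly 8x code output (see AI Accelerating AI Development).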
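The difference between a four-month and a seven-month doubling time is easier to see as arithmetic. The following is a minimal sketch, assuming simple exponential growth with a constant doubling time; the function names `growth_factor` and `projected_horizon` are hypothetical, and the 12-month window is chosen only for illustration. It is not METR's methodology.

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Multiplicative growth after `months`, assuming a constant doubling time."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return 2.0 ** (months / doubling_months)


def projected_horizon(start_hours: float, months: float, doubling_months: float) -> float:
    """Project a task time horizon forward from a starting value (illustrative only)."""
    return start_hours * growth_factor(months, doubling_months)


if __name__ == "__main__":
    # Compare the two doubling times mentioned above over one year.
    for doubling in (4.0, 7.0):
        factor = growth_factor(12.0, doubling)
        print(f"doubling every {doubling:g} months -> x{factor:.2f} per year")
```

Under this assumption, a four-month doubling time compounds to 8x over a year, while a seven-month doubling time compounds to roughly 3.3x, which is why the shorter doubling time reads as acceleration.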

Related links

Open questions

Sources

  • When AI builds itself: cites METR's time horizons, as well as METR's assessment of Mythos Preview ("16 hours / upper end of what we can measure")
