
Evals as Product Spec

Published: May 18, 2026 · Filed: Concept · Domain: Product & Org · Tags: Evals, Product Management, Definition of Done, PM Skill, Measurement · Reading: 11 min · Source: AI-synthesised

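Cat Wu's framing of evals as the emerging core PM skill: ten great evals beat a hundred mediocre ones; evals encode what "done" looks like for fuzzy AI features; and they complement introspection (which supplies hypotheses) and vibe-checks (which supply direction).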


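Summary#

Cat Wu explains why writing evals is the emerging core PM skill for AI products. It is not a QA task, and it is not the ML engineer's job; it is the interface to product definition itself. An eval is a written, executable answer to the question *what does success look like for this feature?* In a world where a model can produce fluent output for almost any prompt, the bottleneck on product quality is no longer "can we ship?" but "can we tell a shipped feature that truly works from one that does not?" Evals encode that judgment so it can be re-tested cheaply whenever the model or the harness changes.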

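Core thesis (Cat Wu)#

"Just building 10 great evals matters a lot, because it helps the team quantify what the goal is, how much progress has been made toward it, and what is still missing. So I think evals are underrated, and more PMs and more engineers should be investing in them."

"The future of product management is writing evals, because an eval answers the question: what does success look like? Let me define it concretely, and then we will know."

The shift: PMs used to write PRDs ("this is what we want"). A PRD describes intent; an eval defines done. In AI products the PRD sits upstream, but the eval is the target the team ultimately converges on, and the basis for judging whether the model + harness can actually do the job today.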
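
As a minimal sketch of the PRD-versus-eval distinction, the Python below treats an eval as a small executable object: a prompt, a check that encodes what "done" means, and a note on why the check matters. Everything in it is hypothetical, including the `EvalCase` name, the `fake_model` stand-in, and the refund example; none of it comes from Cat Wu's talk or from any Anthropic tooling.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One executable answer to 'what does success look like?' (hypothetical shape)."""
    name: str
    prompt: str
    check: Callable[[str], bool]  # encodes the judgment of "done"
    rationale: str                # why this check matters to the product


def run_eval(case: EvalCase, model: Callable[[str], str]) -> bool:
    """Run the model on the case's prompt and apply the encoded judgment."""
    return case.check(model(case.prompt))


def fake_model(prompt: str) -> str:
    """Stand-in for a real model + harness call (assumption for illustration)."""
    return "You can request a refund within 30 days of purchase."


refund_case = EvalCase(
    name="refund_window_stated",
    prompt="What is the refund policy?",
    check=lambda output: "30 days" in output,
    rationale="A fluent answer that omits the window is not a finished feature.",
)

result = run_eval(refund_case, fake_model)
```

Because `run_eval` takes the model as an argument, the same case can be re-run unchanged when the model or harness is swapped, which is the cheap re-testing property the summary describes.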

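Why ten great evals beat a hundred mediocre ones#

Cat's number is specific: 10 great evals, not a hundred mediocre ones. Why?

Every eval must be interpretable. A failing eval has to tell you what broke and why, not just show a red X. Mediocre evals fail in ways that cannot be broken down and analysed.
Every eval must capture a judgment that would otherwise be argued over again and again in review. "Is this output good?" is exactly the question evals can answer at scale; bad evals only verify surface properties everyone already agrees on.
Maintenance cost is real. Hundreds of evals need infrastructure, dataset curation, and regression triage. A small, carefully chosen set of evals is what stays load-bearing.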
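
One way to picture the "interpretable" requirement is a grader that returns what broke and why, not just a boolean. The sketch below is an assumption for illustration only: the `EvalResult` fields, the `grade_summary` function, and its two criteria (required facts and a word limit) are invented here and do not describe any real eval suite.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    passed: bool
    what_broke: str  # which expectation was violated
    why: str         # evidence taken from the output


def grade_summary(output: str, required_facts: list[str], max_words: int) -> EvalResult:
    """Grade a summary and explain any failure (hypothetical criteria)."""
    missing = [fact for fact in required_facts if fact.lower() not in output.lower()]
    if missing:
        return EvalResult(False, "missing required facts", f"not found: {missing}")
    word_count = len(output.split())
    if word_count > max_words:
        return EvalResult(False, "summary too long", f"{word_count} words > limit {max_words}")
    return EvalResult(True, "", "")


result = grade_summary(
    output="Revenue grew in Q3.",
    required_facts=["revenue", "churn"],
    max_words=50,
)
```

Here `result.what_broke` names the violated expectation and `result.why` carries the evidence, so a red result can be triaged without re-reading the raw output.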

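Contrast this with Harness Shrinkage as Models Improve, where Cat argues that prompt scaffolding shrinks with every release. Evals do not shrink the same way: they encode what we want, and however capable the model becomes, it still has to be measured against that.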

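Where evals sit in Cat's debugging stack#

The full Cat Wu PM debugging stack has three parts:

(1) Ask the model to introspect (Model Introspection Feedback): when the model does something unexpected, ask it why. Its answer is a signal about gaps in the harness, not about the model itself.
(2) Get fast feedback from a small group of tastemakers: five people whose feedback is well qualified and who can articulate what makes a given model/harness combination great. Cat's vibe-check over team lunch is the canonical example.
(3) Build evals: the third tool, slow but durable. When (1) and (2) surface a hypothesis ("the model doesn't test its own work enough"), evals are how you validate that hypothesis at scale and guard against regression after the fix.

The three tools complement one another:

(1) supplies the hypothesis (the model's self-report)
(2) supplies the direction (the tastemakers' judgment)
(3) supplies the proof + regression guardrail (the eval)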
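
To make step (3) concrete, here is a hedged sketch of how the hypothesis "the model doesn't test its own work enough" might become a measurable check plus a regression guard. The transcript format, the `run: pytest` convention, the sample data, and the 0.05 tolerance are all made up for illustration; a real harness would log agent steps differently.

```python
def ran_tests(transcript: list[str]) -> bool:
    """Hypothetical check: did the agent run a test command before finishing?"""
    return any(step.startswith("run: pytest") for step in transcript)


def pass_rate(transcripts: list[list[str]]) -> float:
    """Fraction of transcripts in which the agent tested its own work."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    return sum(ran_tests(t) for t in transcripts) / len(transcripts)


def is_regression(baseline: float, candidate: float, tolerance: float = 0.05) -> bool:
    """Flag a regression when the candidate falls below baseline by more than tolerance."""
    return candidate < baseline - tolerance


# Made-up transcripts: each is the list of steps an agent took on one task.
before_fix = [
    ["edit: app.py", "done"],
    ["edit: app.py", "run: pytest", "done"],
    ["edit: util.py", "done"],
    ["edit: util.py", "done"],
]
after_fix = [
    ["edit: app.py", "run: pytest", "done"],
    ["edit: app.py", "run: pytest", "done"],
    ["edit: util.py", "run: pytest -q", "done"],
    ["edit: util.py", "done"],
]

baseline_rate = pass_rate(before_fix)   # 1 of 4 transcripts ran tests
candidate_rate = pass_rate(after_fix)   # 3 of 4 transcripts ran tests
```

The first two functions validate the hypothesis at scale; `is_regression` is the guardrail that would be re-run against later model or harness changes.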

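Memory: the canonical feature that needs evals#

Cat names memory as a feature where evals matter most:

"Features like memory benefit a lot from this."

Why memory in particular? It is the canonical case because:

Its output is "did the system remember the right thing at the right time?", which is subjective without a ground-truth dataset.
Its failure modes are easy to misjudge ("the model loves writing memories, but we're not sure those memories are high enough quality").
Without evals the fix loop is slow: you have to run a real user trial to learn whether a memory change made things better or worse.

Without evals, memory development degrades into vibes and anecdotes. With evals, the team can quantify "for the workflows we care about, is this version of memory better than the last one?"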
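
A rough sketch of what "is this version of memory better than the last one?" could look like once a ground-truth dataset exists. The dataset, the two `memory_v*` stand-ins, and the choice of recall and precision as metrics are assumptions made for this example; the source does not say how any real memory feature is evaluated.

```python
from typing import Callable

# Hypothetical ground-truth dataset: for each query, the facts that should be recalled.
DATASET: list[tuple[str, set[str]]] = [
    ("set up the project", {"uses pnpm", "node 20"}),
    ("write a commit message", {"conventional commits"}),
    ("pick a test runner", {"uses vitest"}),
]

Recall = Callable[[str], set[str]]


def score(recall_fn: Recall) -> tuple[float, float]:
    """Return (recall, precision) of a memory version across the dataset."""
    expected_total = retrieved_total = hits = 0
    for query, expected in DATASET:
        retrieved = recall_fn(query)
        hits += len(expected & retrieved)
        expected_total += len(expected)
        retrieved_total += len(retrieved)
    recall = hits / expected_total
    precision = hits / retrieved_total if retrieved_total else 0.0
    return recall, precision


def memory_v1(query: str) -> set[str]:
    """Stand-in: writes eagerly, so it recalls everything every time."""
    return {"uses pnpm", "node 20", "conventional commits", "uses vitest"}


def memory_v2(query: str) -> set[str]:
    """Stand-in: recalls only what a simple keyword table maps to the query."""
    table = {
        "project": {"uses pnpm", "node 20"},
        "commit": {"conventional commits"},
    }
    return set().union(*(facts for key, facts in table.items() if key in query))


v1_recall, v1_precision = score(memory_v1)
v2_recall, v2_precision = score(memory_v2)
```

In this toy data the eager version has perfect recall but low precision, while the selective version trades some recall for precision. That is the kind of trade-off the "loves writing memories" worry points at, and it stays invisible without a dataset to score against.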

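What makes someone "good at evals"#

Cat names two reference cases:

Amanda, the person at Anthropic who shapes Claude's character. "It's a really hard role, because the task is so ambiguous. Even coding is easier, because you can verify whether it worked; shaping character requires very strong conviction about who Claude should be and what Claude should be like." The skill is stating a fuzzy goal precisely enough that you can measure progress against it.
The Claude Code team's lunchtime vibe-check. Feedback like "this model doesn't test its own work enough" gets turned into "OK, what data do we need to look at to confirm this is a pattern?" and then into "OK, what eval would confirm or refute this hypothesis?"

The pattern is the same in both: a strong opinion about what good looks like + the ability to translate that opinion into a measurable artifact. That is taste turned into a function call.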

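Connection to Matt Pocock's verification (Design Concept Grilling)#

Matt Pocock does not use the word "evals"; his teaching frame is "verification" and "feedback loops". The underlying argument is the same, though: in agentic coding workflows, the quality of the feedback loop caps the quality of the output. Pocock's deep-module pattern lists integration tests among the load-bearing harness assets, because the model needs a form of verification it can run by itself inside the loop.

Where the two meet: PM-side evals (Cat) and engineer-side integration tests (Matt) are the same primitive, an executable artifact that encodes a judgment, applied at different layers of the product.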

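Connection to the Founder's Playbook (AI-Native Startup Lifecycle)#

The adjacent concept in the playbook is the MVP-stage advice to "build your measurement framework before you launch":

"Founders who mistake early traction for product-market fit are usually the ones who start tracking data after launch. The metrics they pick are ones that evaluate what is working, not ones that reveal what isn't. The antidote is to build your measurement framework before your first user shows up."

This is the same skill one level up: not "what does success look like for this feature?" but "what does success look like for this product, in this market, with these users?" The playbook makes Claude itself a partner in eval design (by consulting Claude to "design your measurement framework before launch").

For founders who adopt both lenses: write product-level metrics (CAC, retention, Sean Ellis score) and also feature-level evals (does this feature do what we want? did the latest model make it better or worse?). The former drive the company-level go/no-go decision; the latter drive the ship/no-ship decision for each change delivered.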
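
The two decision levels can be sketched side by side. The Sean Ellis score is the share of surveyed users who say they would be "very disappointed" to lose the product; the 0.40 threshold below is the commonly cited benchmark and is used here as an assumption, as are the survey data and the simple "do not lower the pass rate" shipping rule.

```python
def sean_ellis_score(responses: list[str]) -> float:
    """Share of survey respondents who answered 'very disappointed'."""
    if not responses:
        raise ValueError("need at least one response")
    return responses.count("very disappointed") / len(responses)


def go_no_go(score: float, threshold: float = 0.40) -> bool:
    """Company-level decision; the 0.40 benchmark is an assumption, not a rule."""
    return score >= threshold


def ship_no_ship(current_pass_rate: float, candidate_pass_rate: float) -> bool:
    """Feature-level decision: ship only if the change does not lower the eval pass rate."""
    return candidate_pass_rate >= current_pass_rate


# Made-up survey of 20 users.
survey = (
    ["very disappointed"] * 9
    + ["somewhat disappointed"] * 8
    + ["not disappointed"] * 3
)
score = sean_ellis_score(survey)  # 9 of 20 respondents
```

`go_no_go` would be consulted rarely, at company level, while `ship_no_ship` would run on every change, matching the split between product-level metrics and feature-level evals described above.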

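Why this is still "underrated" in 2026#

Cat claims the skill is underrated. There are three ways to read that:

Cultural. PMs trained before 2023 do not write code, let alone evals. Writing an eval requires being comfortable with datasets, scoring functions, and probabilistic outputs, which is exactly the skill set the PM talent pipeline never selected for.
Status. "Writing tests" has historically been lower-status engineering work, and evals are just tests in new packaging. A PM writing evals is doing work that looks like QA but is actually product specification.
Tractability. Most PMs do not realise how much eval writing they could do themselves, because tooling is uneven and nobody teaches the craft. Cat's "ten great evals" is partly a permission slip: you don't need a hundred, you need ten.

This points to a near-term redefinition of the role: PMs who can write evals will ship more than PMs who cannot. Engineer PM Convergence is the frame this belongs to: engineers and PMs converge on a hybrid role, and evals are one of the activities both sides end up doing.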

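Open questions#

How do you write an eval for a taste-driven feature like character? Amanda's role is the archetype of "eval-resistant"; Cat names her as someone who is good at evals in this area but does not describe the concrete technique. Partly answered: How Do You Write Evals for Taste? Character as the Limit Case. The technique is a pipeline (conviction → failure modes sourced from dogfooding → MSM-style variant A/B measurement → about 10 interpretable evals); it has been validated on the safety/values core but remains tacit on the aesthetic surface of warmth and wit.
The 10-versus-100 figure comes with no argument. Is there a Goldilocks zone, or does it depend on the feature's surface area? The discussion of combinations in Client-Side Agent Optimization suggests evals face combinatorial explosion too.
How do evals interact with Harness Shrinkage as Models Improve? When a harness asset shrinks because the model now handles the case natively, evals built around the old harness may slide from guardrail to legacy artifact. Does Anthropic retire those evals or repurpose them?
Is there any citable non-Anthropic case of "PM as eval writer", or is this still a framing unique to Cat Wu? Matt Pocock's workshop reaches the same conclusion in different vocabulary, but no third source has been brought in yet.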
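
The combinatorial-explosion worry in the 10-versus-100 question can be put in numbers with a small sketch. The four axes and their values are invented for illustration; the point is only how quickly the grid outgrows a curated suite of ten.

```python
from itertools import product
from math import prod

# Hypothetical axes a team might want eval coverage across.
axes = {
    "model": ["small", "medium", "large"],
    "harness": ["v1", "v2"],
    "workflow": ["bugfix", "refactor", "new feature", "code review"],
    "language": ["python", "typescript", "go"],
}

# Size of the full grid: 3 * 2 * 4 * 3 cells.
total_combinations = prod(len(values) for values in axes.values())
all_cells = list(product(*axes.values()))

# A curated suite of 10 evals covers only a fraction of the full grid.
curated_suite_size = 10
coverage = curated_suite_size / total_combinations
```

With these assumed axes the grid has 72 cells, so ten evals touch under 14% of it. Whether that is acceptable likely depends on how much the cells actually differ in behaviour, which is the open question.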

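Sources#

How Anthropic's product team moves faster than anyone else | Cat Wu (Head of Product, Claude Code): primary exposition (timestamp ~55:00, "why building evals is underrated"); also referenced several times in the debugging-stack section
Full Walkthrough: Workflow for AI Coding (Matt Pocock): the verification-loop frame; a convergent argument from engineering teaching
The Founder's Playbook: Building an AI-Native Startup: "build your measurement framework before launch" as the product-level analogue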


About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

Cited by 20

AI Native Product Cadence
Cat Wu's 6mo→1mo→1day cadence at Anthropic: research-preview branding, mission-as-tiebreaker, evergreen launch room, li…
AI-Native Product Org Bottlenecks
AI-native product-org bottleneck is accountable taste at speed: dogfooding trains taste, evals encode it, and accountab…
AI-Native Startup Lifecycle
Anthropic's May 2026 reframing of Idea/MVP/Launch/Scale assuming AI infrastructure: each stage's headcount/capital/skil…
Claude Character as Product
Personality as load-bearing product surface; Amanda's role at Anthropic; lunchtime vibe-checks as eval discipline; the…
Claude Code Best Practices
Anthropic's guide to effective Claude Code usage: context management, verification-driven development, explore→plan→cod…
Client-Side Agent Optimization
AgentOpt's framing of developer-controlled agent optimization (model-per-role, budget, routing) as distinct from server…
Deep Modules for Agents
Ousterhout deep-vs-shallow modules applied to agent-friendly codebases; push-vs-pull instruction delivery; reviewer in…
Design Concept Grilling
Matt Pocock's `grill-me` skill; reach Brooks "design concept" before any plan; counter to specs-to-code; PRD as destina…
Dogfooding as Product Discipline
Product sense is built by relentless first-hand use ("ant food"); Mr. Peanut catch; cross-source (Cat Wu vibe-checks, G…
Engineer PM Convergence
Generalists across disciplines; product taste as bottleneck skill; Anthropic Claude Code team as case study; "just do t…
How Do You Write Evals for Taste? Character as the Limit Case
Taste-driven features are eval-resistant but not eval-proof: the technique is conviction → dogfood-sourced failure sign…
Harness Shrinkage as Models Improve
Prompt scaffolding shrinks each model release; Cat Wu's pruning discipline; Boris Cherny "100 lines of code a year from…
Human-in-the-Loop Boundaries
Humans belong at allocation, understanding, design-concept, risk, and accountability boundaries; they slow the system d…
Product & Organization
Map of Content for the product-org domain — 8 concepts. Curated entry point; see Home for all domains.
Model Introspection Feedback
Cat Wu's underrated technique: ask the model why it failed; treat answer as harness-debugging signal not model criticis…
Model Spec Science
Empirical study of which Model Spec features best generalize alignment; value explanations > rules alone, specific > ge…
Open Questions Backlog
_96 pages with open questions, as of 2026-06-14._
Orchestration vs Employee Framing: Reconciling the Founder's Playbook with HBR's Accountability Evidence
Reconciles the Founder's Playbook orchestration framings with HBR Kropp et al.'s accountability evidence; "orchestratio…
The Verifiability Thesis
LLMs automate what you can *verify* as computers automate what you can *specify*; RL verification rewards → jagged peak…
Verification as the New Bottleneck
Fiona Fung: coding is no longer the bottleneck — verification, review, maintenance are; shift-left; TDD loses its tax;…

Claude Code
Anthropic's agentic coding product; created by Boris Cherny late 2024; TypeScript/React; CLI/desktop/web/mobile/IDE sur…
Cat Wu
Head of Product for Claude Code and Cowork at Anthropic; primary articulator of AI-native product cadence and engineer-…
Open Questions Backlog
_96 pages with open questions, as of 2026-06-14._
Claude Character as Product
Personality as load-bearing product surface; Amanda's role at Anthropic; lunchtime vibe-checks as eval discipline; the…
Harness Shrinkage as Models Improve
Prompt scaffolding shrinks each model release; Cat Wu's pruning discipline; Boris Cherny "100 lines of code a year from…


