Sources
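Summary
Entity. First author of "Model Spec Midtraining: Improving How Alignment Training Generalizes" (arXiv 2605.02087, May 2026). Member of the Anthropic Fellows Program. Designed the MSM specs, proposed and designed the experiments, produced all results, and wrote the paper.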
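Contributions
Per the author contribution statement in Appendix A of the MSM paper:
- Led the entire project
- Designed the Model Specs used (cheese-preference specs, Philosophy Spec, Rules/Value-Augmented/Rule-Augmented specs, General Spec)
- Proposed and designed all experiments
- Produced all results
- Wrote the paper
Co-authors: Sara Price (Anthropic; supervised the initial phase), Jon Kutasov and Samuel Marks (co-supervisors; Jon proposed the project idea, and Sam guided the controlling-generalization framing).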
Code release
Open-sourced the full MSM pipeline, the AFT pipeline, the Model Specs, and the trained models: https://github.com/chloeli-15/model_spec_midtraining
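Related links
- Work: Model Spec Midtraining (MSM) paper
- Affiliation: Anthropic (Fellows Program)
- Co-authors: Sara Price, Jon Kutasov, Samuel Marks (Anthropic)
- Related research: Synthetic Document Finetuning (SDF) (Wang et al.; the foundational technique that MSM builds on)
- Work: Model Spec Science (an empirical study of which Model Spec features generalize best; she designed the specs and experiments)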
