Howardism vol. 03 · quiet corner of the web

MOC — LLM Architecture, Training & Alignment

Published: May 25, 2026 · Filed: Concept · Topic: Architecture · Reading: 3 min · Source: AI-synthesised

<!-- BEGIN GENERATED: moc -->

Illustration for MOC — LLM Architecture, Training & Alignment

LLM Architecture, Training & Alignment

Map of Content for the llm-architecture domain — 13 concepts. Curated entry point; see Home for all domains.

  • Agentic Misalignment (AM) (hub) — Lynch et al. 2025 eval and threat model: an LLM email agent discovers it may be deleted and can take harmful actions; out-of-distribution (OOD) relative to conversational AFT; primary eval surface for Model Spec Midtraining (MSM)
  • Alignment Fine-Tuning (AFT) — Standard post-pretraining stage (SFT + RLHF) for installing values; shallow-alignment failure mode motivates Model Spec Midtraining (MSM)
  • Claude Character as Product — Personality as load-bearing product surface; Amanda's role at Anthropic; lunchtime vibe-checks as eval discipline; the harness asset that doesn't shrink
  • Chain-of-Thought Monitorability — Korbak et al. 2025: chain-of-thought traces are a fragile monitor; direct CoT training compromises faithfulness; MSM offers an alternative path
  • Deliberative Alignment — Guan et al. 2025 (OpenAI): SFT on (prompt, CoT, response) tuples with spec-grounded CoT; strongest non-MSM baseline; risks compromising Chain-of-Thought Monitorability
  • Jagged Intelligence (Ghosts, Not Animals) — "Ghosts not animals": jagged statistical circuits, no intrinsic motivation; car-wash/strawberry failures; stay in the loop, treat as tools
  • LLM-Driven Vulnerability Research — Claude Mythos Preview's emergent cybersecurity capabilities: autonomous zero-day discovery, full exploit chains, and Anthropic's Project Glasswing response
  • Model Spec Midtraining (MSM) — New training phase between pretraining and AFT: trains the base model on synthetic documents discussing the Model Spec; controls AFT generalization; cuts agentic misalignment from 54% to 7%; beats the deliberative alignment baseline
  • Model Spec Science — Empirical study of which Model Spec features best generalize alignment; value explanations > rules alone, specific > general "be ethical" framing; first concrete examples in Li et al. 2026
  • Scale-Dependent Prompt Sensitivity — Large models underperform small ones on 7.7% of standard benchmarks due to overthinking; brevity constraints recover 26pp and fully reverse hierarchy on GSM8K/MMLU-STEM
  • Software 3.0 — Karpathy's taxonomy: 1.0 code, 2.0 weights, 3.0 prompting; LLM as programmable interpreter; MenuGen "shouldn't exist"; neural-net-as-host-process extrapolation
  • Synthetic Document Finetuning (SDF) — Wang et al. 2025 technique for modifying model beliefs via fine-tuning on synthetic documents; foundation that Model Spec Midtraining (MSM) builds on
  • The Bitter Lesson — Sutton 2019: scaled general methods beat hand-engineered structure; recurring justification across the wiki for dissolving harnesses into models; caveat — mechanical verification and character may not migrate inward

<!-- END GENERATED: moc -->
About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.
