Howardism · Vol. 03 · Plate II · No. 02
LLM Architecture, in order.
Notes: 13 · Domain: LLM Architecture · Open Qs: 22 · Newest: 23 May 2026 · Oldest: 10 Apr 2026
Model internals, training, scaling, alignment, and evaluation.
Map of Content for the llm-architecture domain — 13 concepts. Curated entry point; see Home for all domains.
- Agentic Misalignment (AM) (hub) — Lynch et al. 2025 eval and threat model: LLM email-agent discovers it may be deleted, can take harmful actions; OOD relative to conversational AFT; primary eval surface for Model Spec Midtraining (MSM)
- Alignment Fine-Tuning (AFT) — Standard post-pretraining stage (SFT + RLHF) for installing values; shallow-alignment failure mode motivates Model Spec Midtraining (MSM)
- Claude Character as Product — Personality as load-bearing product surface; Amanda's role at Anthropic; lunchtime vibe-checks as eval discipline; the harness asset that doesn't shrink
- Chain-of-Thought Monitorability — Korbak et al. 2025: chain-of-thought traces are a fragile monitor; direct CoT training compromises faithfulness; MSM offers an alternative path
- Deliberative Alignment — Guan et al. 2025 (OpenAI): SFT on (prompt, CoT, response) tuples with spec-grounded CoT; strongest non-MSM baseline; risks compromising Chain-of-Thought Monitorability
- Jagged Intelligence (Ghosts, Not Animals) — "Ghosts not animals": jagged statistical circuits, no intrinsic motivation; car-wash/strawberry failures; stay in the loop, treat as tools
- LLM-Driven Vulnerability Research — Claude Mythos Preview's emergent cybersecurity capabilities: autonomous zero-day discovery, full exploit chains, and Anthropic's Project Glasswing response
- Model Spec Midtraining (MSM) — New training phase between pretrain and AFT: train base model on synthetic docs discussing the Model Spec; controls AFT generalization; cuts agentic misalignment 54%→7%; beats deliberative alignment baseline
- Model Spec Science — Empirical study of which Model Spec features best generalize alignment; value explanations > rules alone, specific > general "be ethical" framing; first concrete examples in Li et al. 2026
- Scale-Dependent Prompt Sensitivity — Large models underperform small ones on 7.7% of standard benchmarks due to overthinking; brevity constraints recover 26pp and fully reverse hierarchy on GSM8K/MMLU-STEM
- Software 3.0 — Karpathy's taxonomy: 1.0 code, 2.0 weights, 3.0 prompting; LLM as programmable interpreter; MenuGen "shouldn't exist"; neural-net-as-host-process extrapolation
- Synthetic Document Finetuning (SDF) — Wang et al. 2025 technique for modifying model beliefs via fine-tuning on synthetic documents; foundation that Model Spec Midtraining (MSM) builds on
- The Bitter Lesson — Sutton 2019: scaled general methods beat hand-engineered structure; recurring justification across the wiki for dissolving harnesses into models; caveat — mechanical verification and character may not migrate inward
Open questions (22 open)
- Claude Character as Product
  - How is character versioned across model releases? Public commentary doesn't show change-logs at the character level.
  - Could character be reproduced by competitors via fine-tuning, or is it path-dependent on Anthropic's internal practice?
  - For non-coding products like [[cowork]], does the same character work, or does Cowork need its own character tuning?
- Jagged Intelligence (Ghosts, Not Animals)
  - Karpathy concedes the framing may not have "real power." Is "ghost vs. animal" load-bearing, or a useful intuition pump that doesn't change concrete decisions?
  - If taste/aesthetics/simplicity entered the RL mix, would jaggedness in *those* dimensions smooth out, or are they too unverifiable to reward cleanly (cf. [[verifiability-thesis]])?
- LLM-Driven Vulnerability Research
  - How do these capabilities transfer to non-memory-safety bug classes (logic bugs, protocol-level flaws, supply chain attacks)?
  - What's the ceiling for autonomous exploit complexity? The N-day examples are remarkably sophisticated; is there a qualitative limit?
  - How will the security industry's equilibrium shift when multiple labs have Mythos-class models?
  - Can defensive scaffolds (continuous fuzzing + model-driven triage + auto-patching) close the attacker-defender gap during the transition?
  - What safeguards are effective against Mythos-class outputs without crippling legitimate security research?
- Model Spec Science
  - Does Model Spec science transfer across base models or families? The paper tests only Qwen.
  - Does it survive RL post-training pressure?
  - Can a sufficiently rich General Spec match a Specific Spec? The authors think so, but there is no demonstration yet.
  - Interaction with situational awareness: if models learn the spec is being used to train them, does that change how MSM-installed values express?
  - How does this interact with [[claude-character-as-product|Claude character]]? Is the warm/curious personality also subject to spec-science optimization?
- Scale-Dependent Prompt Sensitivity
  - Does the RLHF length-bias hypothesis replicate when tested against base (non-instruct) model variants directly? If verbose generation were primarily pretrained, base-model verbosity differences should match instruct-model differences.
  - What problem characteristics predict prompt sensitivity? An automated classifier would make scale-specific prompting deployable.
  - How does the overthinking effect interact with tool-using agents? If brevity helps large models but tools require structured reasoning, the optimal prompt is not uniformly brief.
  - Do reasoning models (o1, DeepSeek-R1 style) exhibit different overthinking dynamics than instruct models? They are explicitly trained to generate long CoT, so does a brevity intervention hurt them?
  - Is BoolQ's functional-elaboration exception a clean taxonomy boundary, or does every task type have a context-dependent optimal length?
- Software 3.0
  - Where is the line between "the app shouldn't exist" (MenuGen) and apps that *should*? That is, when is deterministic 1.0/2.0 scaffolding still the right call rather than spurious?
  - The neural-net-as-host-process flip is presented as plausible-but-TBD. What would the first production system that genuinely inverts the CPU/NN relationship look like?