Howardism · vol. 03 · quiet corner of the web
Plate II · No. 02

Alignment, in order.

Notes: 6 · Topic: Alignment · Oldest: 10 Apr 2026 · Newest: 8 May 2026

RLHF, character, interpretability, and safety case studies.

Alignment articles, sorted by date, newest first.
| Title | Summary | Date |
| --- | --- | --- |
| Agentic Misalignment (AM) | Lynch et al. 2025 eval and threat model: an LLM email agent discovers it may be deleted and can take harmful actions; OOD relative to conversational AFT; primary eval surface for Model Spec Midtraining | |
| Alignment Fine-Tuning (AFT) | Standard post-pretraining stage (SFT + RLHF) for installing values; its shallow-alignment failure mode motivates Model Spec Midtraining | |
| Chain-of-Thought Monitorability | Korbak et al. 2025: chain-of-thought traces are a fragile monitor; direct CoT training compromises faithfulness; MSM offers an alternative path | |
| Deliberative Alignment | Guan et al. 2025 (OpenAI): SFT on (prompt, CoT, response) tuples with spec-grounded CoT; strongest non-MSM baseline; risks compromising CoT monitorability | |
| Synthetic Document Finetuning (SDF) | Wang et al. 2025 technique for modifying model beliefs via fine-tuning on synthetic documents; the foundation that Model Spec Midtraining builds on | |
| LLM-Driven Vulnerability Research | Claude Mythos Preview's emergent cybersecurity capabilities: autonomous zero-day discovery, full exploit chains, and Anthropic's Project Glasswing response | |