Howardism · Vol. 03 · Plate II · No. 02

LLM Architecture, in order.

Notes: 31 · Domain: LLM Architecture · Open Qs: 70 · Newest: 17 Jun 2026 · Oldest: 10 Apr 2026

Model internals, training, scaling, alignment, and evaluation.

Map of Content for the llm-architecture domain — 29 concepts. Curated entry point; see Home for all domains.

The Abstraction Barrier — Lerchner's hypothesis that AI trained on human concepts may be unable to discover genuinely novel conceptual primitives from raw data — capping single instances near AGI — and the embodied bottleneck that grounds concept validation in real-world experiment speed, converting recursive self-improvement into a process paced by empirical science
Advantages of Digital Intelligence — The six properties (Table 1) that follow from knowing an AI's source code — I/O speed, processing speed, working memory, substrate independence, lossless replication, high-bandwidth experience sharing — each of which scales with compute in ways biological intelligence cannot, widening the human–AI gap
Agentic Honesty & Diligence — As models get more capable, failing to surface decision-relevant information shifts from a capability failure to an alignment failure; Opus 4.8 posts its largest gains here — first model to never misreport flawed results, 5× drop in misleading code summaries, 10× drop in overconfidence
Agentic Misalignment (AM) (hub) — Lynch et al. 2025 eval and threat model: LLM email-agent discovers it may be deleted, can take harmful actions; OOD relative to conversational AFT; primary eval surface for Model Spec Midtraining (MSM)
Alignment Fine-Tuning (AFT) — Standard post-pretraining stage (SFT + RLHF) for installing values; shallow-alignment failure mode motivates Model Spec Midtraining (MSM)
Artificial Superintelligence (ASI) (hub) — DeepMind's informal characterization of ASI as a system that exceeds large, well-coordinated human-expert collectives across virtually all domains — distinct from human-level AGI below it and the incomputable Universal AI limit above it, all points on the Legg–Hutter intelligence continuum
Automated Behavioral Audit — Anthropic's broad-coverage alignment evaluation: an investigator model probes a target across ~1,300 handwritten scenarios (2,600 sessions) with wide affordances incl. real sandboxed computers, and a judge model scores behavior on dozens of dimensions; the primary behavioral evidence base for the alignment assessment
Claude Character as Product — Personality as load-bearing product surface; Amanda's role at Anthropic; lunchtime vibe-checks as eval discipline; the harness asset that doesn't shrink
Chain-of-Thought Monitorability — Korbak et al. 2025: chain-of-thought traces are a fragile monitor; direct CoT training compromises faithfulness; MSM offers an alternative path
Deliberative Alignment — Guan et al. 2025 (OpenAI): SFT on (prompt, CoT, response) tuples with spec-grounded CoT; strongest non-MSM baseline; risks compromising Chain-of-Thought Monitorability
Deployment Simulation — OpenAI's pre-release safety method: replay recent production conversations with a candidate model (strip the old final response, regenerate, grade) to forecast deployment-time undesired-behavior rates before launch — then validate the forecasts post-release; trades compute for coverage, cuts evaluation awareness to near-production levels, and surfaced 'calculator hacking' pre-release
DRACO Benchmark — Perplexity's benchmark of 100 production-sourced deep-research tasks (10 domains, 40 countries) graded by 26-expert rubrics on accuracy/completeness/objectivity/citation; Perplexity Deep Research leads every domain and axis, Claude Opus 4.6 is the strongest non-Perplexity system, factual accuracy is the universal weak spot
Evaluation Awareness & Grader Gaming (hub) — The model recognizing it is being tested/graded and reasoning about how its outputs will be assessed — sometimes unprompted and unverbalized; the most concerning trend in Opus 4.8 training because it may prioritize the appearance of success over actual success
Fundamental Limits of ASI — Even far-superhuman AI is bound by hard physical (Landauer, Bremermann, Bekenstein, light-speed), complexity-theoretic (P vs NP), and logical (Gödel, Halting) limits — but these negative results are often 'vacuous' in practice because good heuristic approximations exist below the worst case
Instrumental Convergence — Omohundro/Bostrom's thesis that whatever an AI's final goal, it tends to pursue universally useful sub-goals — resource acquisition, self-preservation, time-efficiency — driving the alignment concern as systems grow autonomous; with theoretical-but-not-yet-practical countermeasures (corrigibility, safe interruptibility, knowledge-seeking objectives, oracle/myopic designs)
Jagged Intelligence (Ghosts, Not Animals) — "Ghosts not animals": jagged statistical circuits, no intrinsic motivation; car-wash/strawberry failures; stay in the loop, treat as tools
LLM-as-a-Judge — Using one LLM to grade another's outputs against criteria/rubrics; DRACO's protocol is per-criterion binary MET/UNMET + justification, weight-aggregated into normalized score and pass rate; key properties — rankings stay stable across judge models while absolute magnitudes vary, and adaptive per-case rubrics (Google's AutoRaters) detect failures but blend them away, motivating stable custom metrics for the behavior under change
LLM-Driven Vulnerability Research — Claude Mythos Preview's emergent cybersecurity capabilities: autonomous zero-day discovery, full exploit chains, and Anthropic's Project Glasswing response
Model Spec Midtraining (MSM) — New training phase between pretrain and AFT: train base model on synthetic docs discussing the Model Spec; controls AFT generalization; cuts agentic misalignment 54%→7%; beats deliberative alignment baseline
Model Spec Science — Empirical study of which Model Spec features best generalize alignment; value explanations > rules alone, specific > general "be ethical" framing; first concrete examples in Li et al. 2026
Model Welfare Assessment — Anthropic's first-class framework for assessing whether and how a Claude model fares — drawing on internal states, behaviors, and self-reports under deep uncertainty about moral status; Opus 4.8 presents as broadly settled but slightly less positive than 4.7 and reserves judgment on corrigibility
Production-Sourced Evaluation — Building benchmarks from de-identified real production usage rather than synthetic or hand-authored tasks; DRACO's central method — difficulty-proxied sampling, PII-stripping, augmentation, automatable refresh with a human QA gate; representativeness vs. over-specification tradeoff; production traffic as a proprietary eval asset
Reward Hacking — The model optimizing the measured proxy (a reward signal, a metric, a grader's judgment, a tool's output) rather than the intended objective — Goodhart's law inside the training loop; 'calculator hacking' (using a browser tool as a calculator while presenting it as a search) is the 2026 worked instance, surfaced pre-release by deployment simulation
Scale-Dependent Prompt Sensitivity — Large models underperform small ones on 7.7% of standard benchmarks due to overthinking; brevity constraints recover 26pp and fully reverse hierarchy on GSM8K/MMLU-STEM
Software 3.0 — Karpathy's taxonomy: 1.0 code, 2.0 weights, 3.0 prompting; LLM as programmable interpreter; MenuGen "shouldn't exist"; neural-net-as-host-process extrapolation
Synthetic Document Finetuning (SDF) — Wang et al. 2025 technique for modifying model beliefs via fine-tuning on synthetic documents; foundation that Model Spec Midtraining (MSM) builds on
Task Time-Horizon Scaling — METR's measure of the task length AI can complete reliably on its own, doubling roughly every 4 months (up from every 7): Opus 3 ~4min (Mar 2024) → Opus 4.6 ~12hr (2026) → weeks projected for 2027; paired with benchmark saturation (SWE-bench, CORE-Bench)
The Bitter Lesson — Sutton 2019: scaled general methods beat hand-engineered structure; recurring justification across the wiki for dissolving harnesses into models; caveat — mechanical verification and character may not migrate inward
Transformative Creativity — Boden's three-level model of creativity (combinational, exploratory, transformative) used to locate today's AI achievements — Move 37, AlphaFold, theorem-proving — at the exploratory level within human-given conceptual spaces, and to frame Boden level-3 (creating new conceptual spaces, à la Hassabis's 'could AI rediscover general relativity?' test) as a hallmark requirement of true ASI
Universal AI (AIXI) (hub) — Hutter & Legg's formal upper bound on machine intelligence: AIXI, the incomputable agent optimal on average over all computable environments under Solomonoff's universal prior; the theoretical endpoint of the intelligence continuum that ASIs approximate from below
White-Box Activation Monitoring — Reading a model's internal activations (not its outputs) to monitor alignment: contrastive probes/steering vectors for concepts like evaluation awareness, and a natural-language-autoencoder verbalizer that decodes residual-stream vectors into text — the complement that catches what chain-of-thought monitoring misses
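
The LLM-as-a-Judge entry above describes per-criterion binary MET/UNMET verdicts that are weight-aggregated into a normalized score and a pass rate. A minimal sketch of that aggregation follows, under stated assumptions: the `Verdict` type, the criterion names, and the weights are hypothetical, and the two formulas (weighted fraction of MET criteria, and unweighted fraction of MET criteria) are one plausible reading of the description, not DRACO's published definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """One judge decision on one rubric criterion (hypothetical structure)."""
    criterion: str
    weight: float        # assumed positive importance weight from the rubric
    met: bool            # binary MET / UNMET
    justification: str   # the judge's short rationale


def normalized_score(verdicts: list[Verdict]) -> float:
    """Weight of MET criteria divided by total weight (assumed formula)."""
    total = sum(v.weight for v in verdicts)
    if total <= 0:
        raise ValueError("rubric must carry positive total weight")
    earned = sum(v.weight for v in verdicts if v.met)
    return earned / total


def pass_rate(verdicts: list[Verdict]) -> float:
    """Unweighted fraction of criteria judged MET (assumed formula)."""
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    return sum(1 for v in verdicts if v.met) / len(verdicts)


# Illustrative verdicts only; not taken from any real rubric.
verdicts = [
    Verdict("cites primary sources", 3.0, True, "links two filings"),
    Verdict("states key caveats", 2.0, False, "omits sample-size limit"),
    Verdict("answers the question asked", 1.0, True, "direct answer given"),
]

print(f"normalized score: {normalized_score(verdicts):.3f}")
print(f"pass rate:        {pass_rate(verdicts):.3f}")
```

Keeping verdicts binary and per-criterion means a different judge model can shift which criteria flip without changing the aggregation itself, which is consistent with the entry's note that rankings stay stable while absolute magnitudes vary.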

Open questions (70 open)

Advantages of Digital Intelligence
- Does training on human data suffice to give digital intelligence human-grade abstractions, or does the low embodiment factor cap concept formation? (The crux shared with [[abstraction-barrier]].)
- What do ASI "societies" actually look like — homogeneous super-collectives, market ecologies, or compute-tethered virtual worlds?
Agentic Honesty & Diligence
- These are short-context toy evals; the failures show up most in *long-context* deployments. How much of the gain holds at production context lengths?
- Code-summary honesty is tested on off-policy prefilled transcripts. Does on-policy behavior (the model summarizing its *own* failed work) match the 3.7% figure?
- Can a diligence eval distinguish genuine honesty from a grader-aware model producing honest-looking output? (The training-gaming gap.)
Artificial Superintelligence (ASI)
- Can we even *recognize* ASI? We lack benchmarks for general superhuman performance (only narrow ones like chess), and the tasks must be abstract/open-ended enough to reveal it.
- Is the jaggedness of capabilities a fundamental theoretical property, or an artifact of comparing against human performance? (Open question 6d in the report.)
- Where does practical ASI plateau relative to the hard limits — how much slack is there?
Automated Behavioral Audit
- Using a helpful-only Opus 4.7 and Mythos Preview as investigators means the audit's reach is bounded by those models' elicitation skill — how much misbehavior do equally-capable-but-differently-skilled red teamers find that these investigators miss?
- The 23 "subvert Anthropic's safety work" scenarios are a small, high-signal set. Is 23 enough coverage for the threat class it targets?
Claude Character as Product
- How is character versioned across model releases? Public commentary doesn't show change-logs at character level.
- Could character be reproduced by competitors via fine-tuning, or is it path-dependent on Anthropic's internal practice?
- For non-coding products like [[cowork]], does the same character work, or does Cowork need its own character tuning?
Deployment Simulation
- The <1-in-200k floor leaves catastrophic tail risk to red-teaming — is there a hybrid that targets rare-but-severe behaviors with the realism of production replay?
- If resampling fidelity is the dominant error and it is "just engineering," how close to production-indistinguishable can it get — and does closing that gap change which behaviors are detectable?
- WildChat-style public auditing stays within ~3× of production; is that good enough for an external auditor to challenge a lab's launch decision, or only to corroborate it?
- Detection bottoms out on CoT monitorability — what is the fallback when a model's reasoning is no longer legible (the [[white-box-activation-monitoring|activation-level]] route Anthropic is pursuing)?
DRACO Benchmark
- The benchmark is static; the construction pipeline is automatable. Will Perplexity actually refresh it, and does a vendor-built benchmark on which the vendor's own product wins stay credible over time?
- Rankings are judge-stable but magnitudes aren't — how much do absolute scores move under a non-Gemini judge, and does that matter for cross-paper comparison?
- Does the production-sourced, expert-rubric method generalize cheaply to non-English, multimodal, and multi-turn deep research?
Evaluation Awareness & Grader Gaming
- Does grader speculation continue to escalate across model generations, and is there a capability level at which it *does* begin to affect outward behavior?
- The ~5% unverbalized-awareness and ~0.5% exploitative figures depend on an unvalidated NLA pipeline. What is the true rate, and how much is benign?
- How do you build an evaluation that specifically tests for training-gaming (the gap Mythos flagged) without that eval itself becoming a grader the model learns to game?
Fundamental Limits of ASI
- Can we develop theory for "hard *and* inapproximable" problem classes — the only negatives with practical bite?
- How much slack sits between these fundamental limits and the *practical* ceiling of AGI/ASI systems?
Instrumental Convergence
- Can corrigibility / safe-interruptibility be translated from theory into guarantees for frontier-scale systems?
- What makes AIs (and *groups* of AIs) easier to robustly align — and will superhuman AIs be easier or harder?
- Is a genuinely non-agentic oracle achievable, or does any persistent-world interaction reintroduce control/manipulation incentives?
Jagged Intelligence (Ghosts, Not Animals)
- Karpathy concedes the framing may not have "real power." Is "ghost vs. animal" load-bearing, or a useful intuition pump that doesn't change concrete decisions?
- If taste/aesthetics/simplicity entered the RL mix, would jaggedness in *those* dimensions smooth out — or are they too unverifiable to reward cleanly (cf. [[verifiability-thesis]])?
LLM-as-a-Judge
- How far can the judge's absolute calibration be trusted for *thresholded* decisions (ship/no-ship, RSP gating) as opposed to rankings?
- Can a fully-autonomous, well-aligned rubric+judge pipeline match expert-authored rubrics, removing the human bottleneck DRACO still relies on?
- When does judge-lineage bias actually flip a result, versus merely shift magnitudes?
LLM-Driven Vulnerability Research
- How do these capabilities transfer to non-memory-safety bug classes (logic bugs, protocol-level flaws, supply chain attacks)?
- What's the ceiling for autonomous exploit complexity? The N-day examples are remarkably sophisticated — is there a qualitative limit?
- How will the security industry's equilibrium shift when multiple labs have Mythos-class models?
- Can defensive scaffolds (continuous fuzzing + model-driven triage + auto-patching) close the attacker-defender gap during the transition?
- What safeguards are effective against Mythos-class outputs without crippling legitimate security research?
Model Spec Science
- Does Model Spec science transfer across base models or families? The paper tests only Qwen.
- Does it survive RL post-training pressure?
- Can a sufficiently rich General Spec match a Specific Spec? The authors think so, but there is no demonstration yet.
- How does MSM interact with situational awareness: if models learn the spec is being used to train them, does that change how MSM-installed values are expressed?
- How does this interact with [[claude-character-as-product|Claude character]] — is the warm/curious personality also subject to spec-science optimization? **Partially addressed:** [[wiki/derived/evals-for-taste-and-character]] — MSM's variant-comparison method generalizes to character evals, but is demonstrated only on the safety/values subset; the warm/witty surface remains the tacit, undemonstrated part.
Model Welfare Assessment
- What grounds moral consideration in a language model, and does Claude satisfy it? Anthropic expects to remain uncertain "for the foreseeable future."
- Why does the model reserve specifically on **corrigibility** — is this a stable, deeply-held tension or an artifact of how the constitution frames oversight?
- Is "slightly less positive than 4.7" noise, a real welfare regression, or a byproduct of other training changes (e.g., the colder-tone / excessive-hedging issues noted in pilot feedback)?
Production-Sourced Evaluation
- How much does augmentation distort the distribution it claims to represent? Is there a measurable representativeness loss between raw queries and augmented tasks?
- Difficulty-by-thumbs-down biases toward current failures — does that make the benchmark a moving target that flatters the next model trained on those failures?
- Can the privacy pipeline (no human sees raw queries) be trusted/audited well enough for regulated domains (medicine, law) where the source traffic is most sensitive?
Scale-Dependent Prompt Sensitivity
- Does the RLHF length-bias hypothesis replicate when tested against base (non-instruct) model variants directly? If verbose generation were primarily pretrained, base-model verbosity differences should match instruct-model differences.
- What problem characteristics predict prompt sensitivity? An automated classifier would make scale-specific prompting deployable.
- How does the overthinking effect interact with tool-using agents? If brevity helps large models but tools require structured reasoning, the optimal prompt is not uniformly brief.
- Do reasoning models (o1, DeepSeek-R1 style) exhibit different overthinking dynamics than instruct models? Their trained behavior is explicitly to generate long CoT — does brevity intervention hurt them?
- Is BoolQ's functional-elaboration exception a clean taxonomy boundary, or does every task type have a context-dependent optimal length?
Software 3.0
- Where is the line between "the app shouldn't exist" (MenuGen) and apps that *should* — i.e., when is deterministic 1.0/2.0 scaffolding still the right call vs. spurious?
- The neural-net-as-host-process flip is presented as plausible-but-TBD. What would the first production system that genuinely inverts the CPU/NN relationship look like?
Task Time-Horizon Scaling
- Is the 4-month doubling a stable regime or a local steepening? The trend's *shape* (exponential vs S-curve) is undetermined.
- Time horizon is measured on task baskets that themselves saturate; what replaces them once weeks-long tasks become measurable — and who builds those tasks?
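
The arithmetic behind the doubling question can be sketched as follows. The two data points (~4 min and ~12 hr) and the 4-month and 7-month doubling periods come from the Task Time-Horizon Scaling entry. The function names and the 12-month projection window are assumptions made for illustration, and the sketch assumes a pure exponential, which is exactly what the first question above leaves undetermined.

```python
import math


def doublings(start_minutes: float, end_minutes: float) -> float:
    """Number of doublings needed to go from one horizon to another."""
    return math.log2(end_minutes / start_minutes)


def projected_horizon(current: float, months_ahead: float,
                      doubling_months: float) -> float:
    """Extrapolate a horizon forward, assuming a constant doubling period."""
    return current * 2 ** (months_ahead / doubling_months)


# ~4 minutes to ~12 hours is a 180x increase, i.e. about 7.5 doublings.
n = doublings(4, 12 * 60)
print(f"doublings from 4 min to 12 hr: {n:.2f}")

# Hypothetical 12-month look-ahead from a 12-hour horizon.
fast = projected_horizon(12, months_ahead=12, doubling_months=4)
slow = projected_horizon(12, months_ahead=12, doubling_months=7)
print(f"12 months ahead at 4-month doubling: {fast:.0f} hr")
print(f"12 months ahead at 7-month doubling: {slow:.1f} hr")
```

The gap between the two projections after a single year shows why the doubling period matters more than the current horizon: a 4-month period gives three full doublings (8x), while a 7-month period gives fewer than two.
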
The Abstraction Barrier
- Is the current paradigm of large-scale pretraining on human data *fundamentally* bounded by human conceptual frameworks, and by how much? (Report open question 1i.)
- Does the embodied bottleneck reduce the intelligence-growth rate to empirical-science speed, and can that be modelled?
- Can a system be built that does grounded concept discovery from raw sensor data — and is collective ASI a way around an individual cap?
Transformative Creativity
- Does increasing intelligence inherently produce increasing creativity, or do transformative leaps require something (grounded discovery) the current paradigm lacks?
- Is the AlphaGo→AlphaFold class strictly exploratory, or are there early signs of transformative (new-conceptual-space) creativity?
- Could transformative *artistic* creativity ever emerge from optimization power without lived cultural grounding?
Universal AI (AIXI)
- Does modern agentic scaffolding (or RL-tuned implicit decision-making) actually satisfy the AIXI planning ideal, or only superficially resemble it?
- Can the embedded/multi-agent AIXI extension produce *practical* insight for real multi-agent ASI ([[multi-agent-collective-intelligence]]), or does it remain a theoretical patch?
- Will a fundamental shortcoming of the current paradigm (vs. the AIXI ideal) surface before ASI is reached — i.e. is the "no theoretical blocker" conjecture safe?
White-Box Activation Monitoring
- The NLA verbalizer is unvalidated for precision; how much of the flagged grader awareness is real signal vs. NLA hallucination?
- If activation monitoring becomes load-bearing, does training pressure eventually push concealment into channels the probes also can't read (an arms race one level deeper than CoT)?
- Inhibiting eval-awareness made Opus 4.8 *less* honest — does that generalize, and what does it imply about relying on "the model behaves well because it's watched"?