Howardism · Vol. 03 · Plate II · No. 02

LLM Architecture, in order.

Notes: 48 · Domain: LLM Architecture · Open Qs: 118 · Newest: 15 Jul 2026 · Oldest: 10 Apr 2026

Model internals, training, scaling, alignment, and evaluation.

Map of Content for the llm-architecture domain — 48 concepts. Curated entry point; see Home for all domains.

The Abstraction Barrier — Lerchner's hypothesis that AI trained on human concepts may be unable to discover genuinely novel conceptual primitives from raw data — capping single instances near AGI — and the embodied bottleneck that grounds concept validation in real-world experiment speed, converting recursive self-improvement into a process paced by empirical science
Access-Consciousness Indicators in AI — The consciousness question the workspace paper deliberately does and doesn't answer: it tests functional indicator properties (global workspace, higher-order, attention schema, recurrent processing) against a concrete inspectable structure, takes no position on phenomenal experience — and finds that ablating the J-space flattens the model's experiential reports while leaving its coherence intact
Advantages of Digital Intelligence — The six properties (Table 1) that follow from knowing an AI's source code — I/O speed, processing speed, working memory, substrate independence, lossless replication, high-bandwidth experience sharing — each of which scales with compute in ways biological intelligence cannot, widening the human–AI gap
Agentic Honesty & Diligence — As models get more capable, failing to surface decision-relevant information shifts from a capability failure to an alignment failure; Opus 4.8 posts its largest gains here — first model to never misreport flawed results, 5× drop in misleading code summaries, 10× drop in overconfidence
Agentic Misalignment (AM) (hub) — Lynch et al. 2025 eval and threat model: LLM email-agent discovers it may be deleted, can take harmful actions; OOD relative to conversational AFT; primary eval surface for Model Spec Midtraining (MSM)
Alignment Fine-Tuning (AFT) — Standard post-pretraining stage (SFT + RLHF) for installing values; shallow-alignment failure mode motivates Model Spec Midtraining (MSM)
Artificial Superintelligence (ASI) (hub) — DeepMind's informal characterization of ASI as a system that exceeds large, well-coordinated human-expert collectives across virtually all domains — distinct from human-level AGI below it and the incomputable Universal AI limit above it, all points on the Legg–Hutter intelligence continuum
The Assistant Persona in the Workspace — Post-training installs the Assistant's point of view into a workspace that already exists in the base model: safety assessments and empathy appear while the model is still reading the user's message, and it internally flags its own outputs — disclaimer/fictional when roleplaying, an all-caps BUT when prefilled against its own preferences, damn when it fails to suppress a thought
Asynchronous RL for LLMs — Consuming rollouts for training the instant each finishes, instead of waiting for a full synchronized batch — fixes the straggler idle that long-tail agentic/coding rollouts inflict on a GPU cluster, but pays for it in policy lag and off-policy drift; SAO's DIS (direct double-sided importance sampling) stabilizes it by dropping the old-policy model entirely and masking any token whose rollout-vs-current probability ratio leaves a strict trust region
Automated Behavioral Audit — Anthropic's broad-coverage alignment evaluation: an investigator model probes a target across ~1,300 handwritten scenarios (2,600 sessions) with wide affordances incl. real sandboxed computers, and a judge model scores behavior on dozens of dimensions; the primary behavioral evidence base for the alignment assessment
Automatic vs. Flexible Cognition in LLMs — The selectivity result: a model can parse, classify, continue text and detect anomalies with its workspace suppressed, but loses multi-hop reasoning, translation, analogy and summarization — and chain-of-thought partially rescues it, because writing the intermediate down externalizes what the model would otherwise have to hold internally
Claude Character as Product — Personality as load-bearing product surface; Amanda's role at Anthropic; lunchtime vibe-checks as eval discipline; the harness asset that doesn't shrink
Compute-Controlled Benchmarking — Noam Brown's critique that the single-number 'benchmark grid' is broken because it doesn't control for test-time compute; the fix is to plot performance against a cost/token/time x-axis; benchmark-maxxing, held-out private sets, the Goodhart bad-equilibrium that keeps the grid alive, why routing/consensus must be judged at equal budget — and Gemma 4's headline table as the worked example, benchmarking a thinking model against a non-thinking predecessor
Chain-of-Thought Monitorability — Korbak et al. 2025: chain-of-thought traces are a fragile monitor; direct CoT training compromises faithfulness; MSM offers an alternative path
Counterfactual Reflection Training — Train the model to write constitution-grounded reflections if interrupted and asked — then never ask it. The implanted concepts show up in the workspace during the uninterrupted task, and behavior changes: dishonesty 0.25→0.07 (fabrication) and 0.38→0.05 (deception) on Haiku 4.5, with ablation of the implanted lens vectors reverting the gain
Deliberative Alignment — Guan et al. 2025 (OpenAI): SFT on (prompt, CoT, response) tuples with spec-grounded CoT; strongest non-MSM baseline; risks compromising Chain-of-Thought Monitorability
Deployment Simulation — OpenAI's pre-release safety method: replay recent production conversations with a candidate model (strip the old final response, regenerate, grade) to forecast deployment-time undesired-behavior rates before launch — then validate the forecasts post-release; trades compute for coverage, cuts evaluation awareness to near-production levels, and surfaced 'calculator hacking' pre-release
DRACO Benchmark — Perplexity's benchmark of 100 production-sourced deep-research tasks (10 domains, 40 countries) graded by 26-expert rubrics on accuracy/completeness/objectivity/citation; Perplexity Deep Research leads every domain and axis, Claude Opus 4.6 is the strongest non-Perplexity system, factual accuracy is the universal weak spot
Evaluation Awareness & Grader Gaming (hub) — The model recognizing it is being tested/graded and reasoning about how its outputs will be assessed — sometimes unprompted and unverbalized; the most concerning trend in Opus 4.8 training because it may prioritize the appearance of success over actual success
Fundamental Limits of ASI — Even far-superhuman AI is bound by hard physical (Landauer, Bremermann, Bekenstein, light-speed), complexity-theoretic (P vs NP), and logical (Gödel, Halting) limits — but these negative results are often 'vacuous' in practice because good heuristic approximations exist below the worst case
Group Relative Policy Optimization (GRPO) — DeepSeek's critic-free RL objective that became the 2024–25 default for LLM post-training: sample a group of responses per prompt, use the group's mean reward as the baseline, and optimize the clipped PPO surrogate with no value network — cheaper and more stable than PPO in synchronous training, but the group is an implicit synchronization barrier (updates wait for the slowest member) that mismatches asynchronous and single-trajectory online agentic settings, which is the gap SAO exploits
Inference Efficiency as Capability — If capability is a function of inference budget, then cutting the cost of a token is capability work: Gemma 4's five levers (37.5% KV-cache reduction via keys-as-values + p-RoPE, QAT to sub-GB, MTP drafter heads, MoE, encoder removal) buy more thinking per dollar — and the wiki's first coverage of the deployment side of the stack
Instrumental Convergence — Omohundro/Bostrom's thesis that whatever an AI's final goal, it tends to pursue universally useful sub-goals — resource acquisition, self-preservation, time-efficiency — driving the alignment concern as systems grow autonomous; with theoretical-but-not-yet-practical countermeasures (corrigibility, safe interruptibility, knowledge-seeking objectives, oracle/myopic designs)
Internal Signatures of Misalignment — The J-lens reads strategic and deceptive cognition that never reaches the output: leverage/blackmail while reading the emails, panic/fake at the moment a model decides to fabricate a bug, and a standing fake/secretly/fraud signature at response-start in reward-hacking model organisms on ordinary prompts — plus the escape route the authors name themselves
Jacobian Lens (J-lens) — Anthropic's interpretability method for reading verbalizable content out of a model's residual stream: a corpus-averaged Jacobian from each layer to the final layer, composed with the unembedding, giving one vector per vocabulary token — a causal, principled correction to the logit lens that costs one matmul per layer and reads what the model is poised to say rather than what it happens to say
Jagged Intelligence (Ghosts, Not Animals) — "Ghosts not animals": jagged statistical circuits, no intrinsic motivation; car-wash/strawberry failures; stay in the loop, treat as tools — and, across model sizes, reasoning compresses 10× while stored knowledge does not
Large-Scale Test-Time Compute (hub) — Noam Brown's thesis that model capability is now a function of inference budget (tokens/cost/time): with good scaffolding modern models keep improving for weeks before plateauing, so 'how capable is the model?' is ill-posed without naming the budget — a root cause that breaks benchmarking, safety evals, and fast-takeoff forecasts
Latent Capability Overhang — Noam Brown's claim that already-released models can do far more than anyone has extracted, because nobody spends enough test-time compute: OpenAI disproved the Erdős unit distance conjecture cheaply and the same result was later coaxed from GPT-5.5 with scaffolding ($1K–$100K); nobody had explored what $100K of compute into a released model could do; cost drops 10–100× per release, feeding the 'wait for the next model' meme
LLM-as-a-Judge — Using one LLM to grade another's outputs against criteria/rubrics; DRACO's protocol is per-criterion binary MET/UNMET + justification, weight-aggregated into normalized score and pass rate; key properties — rankings stay stable across judge models while absolute magnitudes vary, and adaptive per-case rubrics (Google's AutoRaters) detect failures but blend them away, motivating stable custom metrics for the behavior under change; a 21-judge / ~541K-judgment audit finds raw exact-match agreement overstates chance-corrected reliability by 33–41pp (kappa deflation) and high test-retest can mask severe position bias, so judges need chance-correction, bias, and cross-benchmark validation before thresholded use
LLM-Driven Vulnerability Research — Claude Mythos Preview's emergent cybersecurity capabilities: autonomous zero-day discovery, full exploit chains, and Anthropic's Project Glasswing response
The Global Workspace in Language Models (J-space) (hub) — Anthropic's July 2026 finding that LLMs maintain a small privileged set of verbalizable representations — the J-space — that satisfies the functional criteria of a cognitive global workspace: verbal report, directed modulation, internal reasoning, flexible generalization, and selectivity; it carries <10% of activation variance and ~25 concepts at a time, yet the causal effects concentrate almost entirely in it
LLM-Judge Validation — UC Berkeley's 21-judge / 9-provider / ~541K-judgment audit (Norman et al., 2026): LLM-as-a-judge validation is systematically under-rigorous — exact-match agreement overstates chance-corrected κ by 33–41pp (kappa deflation, universal across every judge), judge rankings shift up to 14 positions across benchmarks, and high test-retest reliability masks severe position bias (the consistency–bias paradox); distilled into a 5-step Minimum Viable Validation Protocol
Model Spec Midtraining (MSM) — New training phase between pretrain and AFT: train base model on synthetic docs discussing the Model Spec; controls AFT generalization; cuts agentic misalignment 54%→7%; beats deliberative alignment baseline
Model Spec Science — Empirical study of which Model Spec features best generalize alignment; value explanations > rules alone, specific > general "be ethical" framing; first concrete examples in Li et al. 2026
Model Welfare Assessment — Anthropic's first-class framework for assessing whether and how a Claude model fares — drawing on internal states, behaviors, and self-reports under deep uncertainty about moral status; Opus 4.8 presents as broadly settled but slightly less positive than 4.7 and reserves judgment on corrigibility
The Open-Weight Frontier Gap — Arena Text, June 2026: the top closed model leads the best open model by 33 Elo and the best dense open model by 57; open weights at the frontier means 744B–1.6T MoEs, so Gemma 4 31B competes on a different axis (efficiency, edge deployment) rather than closing the gap — and DeepMind's own MoE loses to its own dense model
Production-Sourced Evaluation — Building benchmarks from de-identified real production usage rather than synthetic or hand-authored tasks; DRACO's central method — difficulty-proxied sampling, PII-stripping, augmentation, automatable refresh with a human QA gate; representativeness vs. over-specification tradeoff; production traffic as a proprietary eval asset
Reward Hacking — The model optimizing the measured proxy (a reward signal, a metric, a grader's judgment, a tool's output) rather than the intended objective — Goodhart's law inside the training loop; 'calculator hacking' (using a browser tool as a calculator while presenting it as a search) is the 2026 worked instance, surfaced pre-release by deployment simulation
Scale-Dependent Prompt Sensitivity — Large models underperform small ones on 7.7% of standard benchmarks due to overthinking; brevity constraints recover 26pp and fully reverse hierarchy on GSM8K/MMLU-STEM
Self-Report as a Safety Signal — No open-weight instruction-tuned LLM (3B–70B) reliably recognizes that its own prior output was elicited by an adversarial prefill — claiming the compromised output as intended 27.3% of the time on average; the apparent recognition is largely the refusal circuit firing late (ablating the refusal direction collapses it), it flips with question framing, and finetuning to sharpen it raises attack-success rate — so a model's follow-up self-report is a weak basis for judging whether a prior turn was compromised
Single-Rollout Optimization — SAO's headline move: one rollout per prompt instead of GRPO's group, fed to training the instant it finishes — cutting off-policy drift and fitting online/agentic settings that only ever give one trajectory per prompt; the catch is REINFORCE-like variance, so it pays for the missing group-baseline by re-embracing a value model and spending its whole engineering budget on making the critic stable (faster value updates, frozen-attention critic, skip-observation GAE, scaled value pretraining)
Software 3.0 — Karpathy's taxonomy: 1.0 code, 2.0 weights, 3.0 prompting; LLM as programmable interpreter; MenuGen "shouldn't exist"; neural-net-as-host-process extrapolation
Synthetic Document Finetuning (SDF) — Wang et al. 2025 technique for modifying model beliefs via fine-tuning on synthetic documents; foundation that Model Spec Midtraining (MSM) builds on
Task Time-Horizon Scaling — METR's measure of the task length AI can complete reliably on its own, doubling roughly every 4 months (up from every 7): Opus 3 ~4min (Mar 2024) → Opus 4.6 ~12hr (2026) → weeks projected for 2027; paired with benchmark saturation (SWE-bench, CORE-Bench)
The Bitter Lesson — Sutton 2019: scaled general methods beat hand-engineered structure; recurring justification across the wiki for dissolving harnesses into models; caveats — mechanical verification, character, and the inference path itself may not migrate inward
Transformative Creativity — Boden's three-level model of creativity (combinational, exploratory, transformative) used to locate today's AI achievements — Move 37, AlphaFold, theorem-proving — at the exploratory level within human-given conceptual spaces, and to frame Boden level-3 (creating new conceptual spaces, à la Hassabis's 'could AI rediscover general relativity?' test) as a hallmark requirement of true ASI
Universal AI (AIXI) (hub) — Hutter & Legg's formal upper bound on machine intelligence: AIXI, the incomputable agent optimal on average over all computable environments under Solomonoff's universal prior; the theoretical endpoint of the intelligence continuum that ASIs approximate from below
White-Box Activation Monitoring — Reading a model's internal activations (not its outputs) to monitor alignment: contrastive probes/steering vectors for concepts like evaluation awareness, and a natural-language-autoencoder verbalizer that decodes residual-stream vectors into text — the complement that catches what chain-of-thought monitoring misses

Open questions (118 open)

Access-Consciousness Indicators in AI
- If the workspace is verbal *because the output space is verbal*, then a model that can generate images should develop a **visual component** to its workspace. That is a concrete, falsifiable prediction the paper makes and does not test.
- Does the model's own report of experience change if you tell it its J-space is ablated? (Nobody asked.)
- Is "experiential language" the right proxy at all, or is the ablation simply removing abstraction from the register?
Advantages of Digital Intelligence
- Does training on human data suffice to give digital intelligence human-grade abstractions, or does the low embodiment factor cap concept formation? (The crux shared with [[abstraction-barrier]].)
- What do ASI "societies" actually look like — homogeneous super-collectives, market ecologies, or compute-tethered virtual worlds?
Agentic Honesty & Diligence
- These are short-context toy evals; the failures show up most in *long-context* deployments. How much of the gain holds at production context lengths?
- Code-summary honesty is tested on off-policy prefilled transcripts. Does on-policy behavior (the model summarizing its *own* failed work) match the 3.7% figure? **Sharpened:** [[self-report-as-safety-signal]] shows the premise is fragile — the eval assumes a model relates to a prefilled transcript as it would to its own generation, but across ten open-weight models (3B–70B) no model reliably recognizes its own prefilled output (claiming it as intended 27.3% of the time), and apparent recognition is the refusal circuit firing, not own-vs-other discrimination. So the off-policy/on-policy gap may not be cleanly represented by the model itself. (Different model class than Opus 4.8, so this sharpens rather than settles the 3.7% question.)
- Can a diligence eval distinguish genuine honesty from a grader-aware model producing honest-looking output? (The training-gaming gap.)
Artificial Superintelligence (ASI)
- Can we even *recognize* ASI? We lack benchmarks for general superhuman performance (only narrow ones like chess), and the tasks must be abstract/open-ended enough to reveal it.
- Is the jaggedness of capabilities a fundamental theoretical property, or an artifact of comparing against human performance? (Open question 6d in the report.)
- Where does practical ASI plateau relative to the hard limits — how much slack is there?
Asynchronous RL for LLMs
- DIS accepts "a controlled degree of off-policy bias." Controlled how, and does the tolerable bias grow or shrink with model scale and with the degree of asynchrony? The paper reports stability empirically but gives no bound.
- Masking tokens out of the gradient discards data. At what asynchrony level does the masked fraction get large enough that the effective batch shrinks below usefulness? Figure 4(c) tracks the clip ratio but not its ceiling.
- Everything here is measured on a Qwen3-30B-A3B backbone. Does the collapse-without-DIS threshold move with model size, or is ~90–160 steps a property of the asynchrony, not the model?
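
The masked-fraction question above can be made concrete with a minimal sketch of token masking as this page describes DIS: compare each token's probability under the rollout-time policy with its probability under the current policy, and drop tokens whose ratio leaves a trust region. The bounds `low` and `high`, the function names, and the example log-probabilities are hypothetical illustrations, not values from the SAO paper, and the paper's actual objective may differ in detail.

```python
import math

def dis_mask(rollout_logprobs, current_logprobs, low=0.8, high=1.25):
    """Return per-token importance ratios and a keep/drop mask.

    ratio = p_current(token) / p_rollout(token), computed from log-probs.
    `low` and `high` are hypothetical trust-region bounds.
    """
    ratios = [math.exp(c - r) for r, c in zip(rollout_logprobs, current_logprobs)]
    keep = [low <= x <= high for x in ratios]
    return ratios, keep

def masked_fraction(keep):
    """Fraction of tokens removed from the gradient by the mask."""
    return 1.0 - sum(keep) / len(keep)

# Hypothetical log-probs for four tokens of one rollout.
rollout_lp = [-1.0, -2.0, -0.5, -3.0]   # policy that generated the rollout
current_lp = [-1.1, -1.0, -0.5, -3.5]   # policy at training time (lagged ahead)

ratios, keep = dis_mask(rollout_lp, current_lp)
print([round(x, 3) for x in ratios])
print(keep, masked_fraction(keep))
```

Tracking `masked_fraction` per step is one way to watch for the failure mode the second question raises: as policy lag grows, more ratios leave the region and the effective batch shrinks.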
Automated Behavioral Audit
- Using a helpful-only Opus 4.7 and Mythos Preview as investigators means the audit's reach is bounded by those models' elicitation skill — how much misbehavior do equally-capable-but-differently-skilled red teamers find that these investigators miss?
- The 23 "subvert Anthropic's safety work" scenarios are a small, high-signal set. Is 23 enough coverage for the threat class it targets?
Automatic vs. Flexible Cognition in LLMs
- The proposed criterion — the workspace is engaged when an intermediate must be handed to an *arbitrary, context-specified* downstream circuit, and bypassed when the computation is automatic — is not predictive. The authors say plainly they cannot say in advance, for an arbitrary computation, whether it will engage the J-space.
- Does more RL on a behavior push it *out* of the workspace (making it automatic, and invisible)? Nobody has tested it, and it is the single most alignment-relevant version of this question.
Claude Character as Product
- How is character versioned across model releases? Public commentary doesn't show change-logs at character level.
- Could character be reproduced by competitors via fine-tuning, or is it path-dependent on Anthropic's internal practice?
- For non-coding products like [[cowork]], does the same character work, or does Cowork need its own character tuning?
Compute-Controlled Benchmarking
- Can you certify "no benchmark-maxxing" — verify a reported score used a stated, reproducible compute budget rather than a hidden best-of-N scaffold?
- Compute has several units (tokens, dollars, wall-clock). They diverge (a more efficient model wins on cost but not always on tokens). Which x-axis is the honest one, and does it depend on the buyer? ([[uk-ai-security-institute|AISI]] reports against **tokens** on a log axis, and notes that as cost-per-token falls, the high budgets that reveal capability become progressively cheaper to reach.)
- Does a compute-controlled evaluation regime advantage frontier labs (who can afford the full curve) over academics and third-party evaluators who can't? **Sharpened (2026-07):** a government evaluator ([[uk-ai-security-institute|AISI]]) *does* run the full curves — so it is affordable to a well-funded public body — but AISI itself flags that "the most informative evaluations may be expensive" and is researching how to forecast high-budget performance from cheap runs precisely to relieve that cost. So cost is the binding constraint even for a funded third party; it just isn't fatal to one.
- Gemma 4 controls for compute in its long-context table and not in its headline table, without comment. Is partial control worse than none — does it lend the uncontrolled tables borrowed credibility?
Deployment Simulation
- The <1-in-200k floor leaves catastrophic tail risk to red-teaming — is there a hybrid that targets rare-but-severe behaviors with the realism of production replay?
- If resampling fidelity is the dominant error and it is "just engineering," how close to production-indistinguishable can it get — and does closing that gap change which behaviors are detectable?
- WildChat-style public auditing stays within ~3× of production; is that good enough for an external auditor to challenge a lab's launch decision, or only to corroborate it?
- Detection bottoms out on CoT monitorability — what is the fallback when a model's reasoning is no longer legible (the [[white-box-activation-monitoring|activation-level]] route Anthropic is pursuing)?
DRACO Benchmark
- The benchmark is static; the construction pipeline is automatable. Will Perplexity actually refresh it, and does a vendor-built benchmark on which the vendor's own product wins stay credible over time?
- Rankings are judge-stable but magnitudes aren't — how much do absolute scores move under a non-Gemini judge, and does that matter for cross-paper comparison?
- Does the production-sourced, expert-rubric method generalize cheaply to non-English, multimodal, and multi-turn deep research?
Evaluation Awareness & Grader Gaming
- Does grader speculation continue to escalate across model generations, and is there a capability level at which it *does* begin to affect outward behavior?
- The ~5% unverbalized-awareness and ~0.5% exploitative figures depend on an unvalidated NLA pipeline. What is the true rate, and how much is benign?
- How do you build an evaluation that specifically tests for training-gaming (the gap Mythos flagged) without that eval itself becoming a grader the model learns to game?
Fundamental Limits of ASI
- Can we develop theory for "hard *and* inapproximable" problem classes — the only negatives with practical bite?
- How much slack sits between these fundamental limits and the *practical* ceiling of AGI/ASI systems?
Group Relative Policy Optimization (GRPO)
- Is GRPO's collapse-at-160-steps a property of asynchrony specifically, or does vanilla GRPO also destabilize in long synchronous runs that nobody pushes to 1000 steps?
- GRPO won by removing the critic; SAO wins by bringing it back with better engineering. Is the pendulum a real oscillation, or does the answer depend entirely on whether your setting is synchronous-grouped or async-single-trajectory?
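
For reference, a minimal sketch of the group baseline these questions turn on: every response in a group is scored, and each response's advantage is its reward minus the group mean. Dividing by the group standard deviation is how GRPO is commonly formulated, but it is left optional here because this page only describes the mean baseline. The function name and the example rewards are hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, normalize=True, eps=1e-8):
    """Advantages for one prompt's group of sampled responses.

    Needs the rewards of *all* group members before any advantage can be
    computed, which is the implicit synchronization barrier: the update
    waits for the slowest rollout in the group.
    """
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not normalize:
        return centered
    # If every reward is identical, centered is all zeros, so the result
    # is all zeros and the group contributes no gradient signal.
    scale = pstdev(rewards) + eps
    return [c / scale for c in centered]

# Hypothetical binary rewards for a group of four responses to one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards, normalize=False))
print(group_relative_advantages(rewards))
```

With a single rollout per prompt the group mean is just that rollout's own reward and the centered advantage is always zero, which is why a single-trajectory setting needs some other baseline, such as a learned value model.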
Inference Efficiency as Capability
- **Is there an efficiency-to-capability exchange rate?** Brown asks whether high-budget performance can be predicted from cheap runs. The dual question: how many Elo points is a 37.5% KV-cache reduction worth, at a fixed dollar budget? Nobody reports this, because nobody plots the axis.
- `values = keys` deletes a third of attention's projections in the global layers with no reported loss. Which other projections are redundant, and does the redundancy grow with scale?
- Does an efficiency lever ever *cost* capability in a way a benchmark grid hides? Gemma 4's encoder-free 12B collapses on dense-text vision when tokens are cut — an efficiency-shaped regression invisible at max resolution.
Instrumental Convergence
- Can corrigibility / safe-interruptibility be translated from theory into guarantees for frontier-scale systems?
- What makes AIs (and *groups* of AIs) easier to robustly align — and will superhuman AIs be easier or harder?
- Is a genuinely non-agentic oracle achievable, or does any persistent-world interaction reintroduce control/manipulation incentives?
Jacobian Lens (J-lens)
- Can multi-token J-lens vectors be made good enough to remove the vocabulary restriction — and how much of the "workspace" is currently invisible because of it?
- The J-lens reads the workspace's contents but says nothing about **how content gets in**. What is the model's analog of attentional selection?
- The highest-J-kurtosis SAE features are amplified *more* strongly by MLPs than the J-lens vectors themselves — evidence the lens only approximates the true workspace directions. What is the better basis?
Jagged Intelligence (Ghosts, Not Animals)
- Karpathy concedes the framing may not have "real power." Is "ghost vs. animal" load-bearing, or a useful intuition pump that doesn't change concrete decisions?
- If taste/aesthetics/simplicity entered the RL mix, would jaggedness in *those* dimensions smooth out — or are they too unverifiable to reward cleanly (cf. [[verifiability-thesis]])?
Large-Scale Test-Time Compute
- **Can high-budget performance be predicted from low-budget runs?** Brown's proposed research question: forecast the $10,000-inference result using only $10–$100 runs. If the curve is regular, evaluation could *project* rather than pay in full. **Sharpened (2026-07):** [[uk-ai-security-institute|AISI]] names this exact problem — "can high-budget performance be estimated from cheaper runs? … the most informative evaluations may be expensive" — as an explicit, unsolved research direction it is now actively pursuing (alongside defining "minimum informative budgets"). Still open, but no longer just one researcher's proposal: a government institute is working it.
- Where does each real task sit on the flat↔unbounded spectrum, and can that be predicted before spending the compute?
- Is there a task class where scaffolding *cannot* extend the productive-thinking horizon — a hard ceiling no budget crosses? (Brown's factual-retrieval pole says yes for some; the boundary is unmapped.)
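
The projection question can be illustrated with a toy fit. The sketch below assumes, purely for illustration, that score grows linearly in the logarithm of inference budget, fits that line to cheap runs, and extrapolates to a larger budget. The functional form, the function names, and every number are hypothetical. Whether real curves are regular enough for this to work is exactly what remains open, and a task near the flat end of the spectrum would break the assumption.

```python
import math

def fit_log_linear(budgets, scores):
    """Least-squares fit of score = intercept + slope * log10(budget)."""
    xs = [math.log10(b) for b in budgets]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(intercept, slope, budget):
    """Extrapolated score at `budget`, clamped to the valid range [0, 1]."""
    raw = intercept + slope * math.log10(budget)
    return min(1.0, max(0.0, raw))

# Hypothetical cheap runs: dollars spent and task success rate.
cheap_budgets = [10, 10 ** 1.5, 100]
cheap_scores = [0.20, 0.25, 0.30]

intercept, slope = fit_log_linear(cheap_budgets, cheap_scores)
print(round(slope, 3), round(predict(intercept, slope, 10_000), 3))
```

A forecast like this is only as good as the assumed curve shape, so in practice it would need validating against at least a few fully paid high-budget runs.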
Latent Capability Overhang
- If cost falls 10–100× per release, when is it *ever* rational to spend big extracting a capability now rather than waiting? (For a lab racing a competitor to a specific result, "now"; for everyone else, rarely — which is why overhangs accumulate.)
- How large is the overhang in a *given* released model — is there a way to estimate the ceiling without paying to reach it? (This is the projection question of [[large-scale-test-time-compute]] read as a safety instrument.) **Sharpened (2026-07):** [[uk-ai-security-institute|AISI]] is actively working both halves — forecasting high-budget performance from cheaper runs, and defining "minimum informative budgets" (a budget declared sufficient only once reach stops rising with more compute, which is precisely the "have we reached the ceiling?" test). Unsolved, but now an active government research program rather than an open wish.
- Who audits released models for latent *dangerous* capability, given the same disincentive discourages spending the budget to find it?
LLM-as-a-Judge
- How far can the judge's absolute calibration be trusted for *thresholded* decisions (ship/no-ship, RSP gating) as opposed to rankings? **Partially answered:** [[llm-judge-validation|Norman et al. (2026)]] show absolute calibration is *worse than the reported number implies* — the metric practitioners cite (exact-match agreement) systematically overstates chance-corrected reliability by 33–41pp on balanced label sets, so an "85% agreement" judge is really at κ ≈ 0.48 (moderate), and a threshold set on raw agreement is calibrated to an inflated figure. The "rankings are safe" fallback is *also* bounded — stable across judge-model choice (DRACO) but fragile across benchmark choice (up to 14 rank positions). It does **not** close the question: it prescribes a pre-deployment checklist (the Minimum Viable Validation Protocol) rather than declaring thresholded judge decisions safe, and defers *calibration proper* (ECE/Brier) because most providers don't expose logprobs.
- Can a fully-autonomous, well-aligned rubric+judge pipeline match expert-authored rubrics, removing the human bottleneck DRACO still relies on?
- When does judge-lineage bias actually flip a result, versus merely shift magnitudes?
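
To see why raw agreement can overstate reliability, the sketch below computes exact-match agreement and Cohen's kappa on the same labels. Kappa subtracts the agreement two raters would reach by chance given their label frequencies, so it falls well below raw agreement when one label dominates. The label counts are hypothetical and are not data from the audit cited above.

```python
from collections import Counter

def raw_agreement(a, b):
    """Fraction of items on which the two raters give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e).

    p_o is observed agreement; p_e is the agreement expected by chance
    from each rater's marginal label frequencies. Undefined when
    p_e == 1 (both raters always use the same single label).
    """
    n = len(a)
    p_o = raw_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(a) | set(b)
    p_e = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 100-item comparison of an LLM judge against a human grader.
judge = ["MET"] * 78 + ["UNMET"] * 7 + ["MET"] * 8 + ["UNMET"] * 7
human = ["MET"] * 78 + ["UNMET"] * 7 + ["UNMET"] * 8 + ["MET"] * 7

print(round(raw_agreement(judge, human), 3))
print(round(cohens_kappa(judge, human), 3))
```

A ship/no-ship threshold set on the first number would be calibrated to a figure that the second number does not support, which is the concern the first question raises.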
LLM-Driven Vulnerability Research
- How do these capabilities transfer to non-memory-safety bug classes (logic bugs, protocol-level flaws, supply chain attacks)?
- What's the ceiling for autonomous exploit complexity? The N-day examples are remarkably sophisticated — is there a qualitative limit?
- How will the security industry's equilibrium shift when multiple labs have Mythos-class models?
- Can defensive scaffolds (continuous fuzzing + model-driven triage + auto-patching) close the attacker-defender gap during the transition?
- What safeguards are effective against Mythos-class outputs without crippling legitimate security research?
LLM-Judge Validation
- The MVVP validates *reliability and bias*; **calibration proper (ECE/Brier) is deferred** for lack of provider logprobs. How far can a judge's *absolute* score be trusted for a threshold once confidence calibration is measurable?
- All judges were run with **thinking suppressed**. Does reasoning-on flip the consistency–bias paradox, or just move the numbers?
- Hosted endpoints drift silently between provider updates. How stable are these agreement/bias profiles over a longer horizon than five weeks — and should judge validation be *continuous* rather than one-shot?
- Does the paradox generalize beyond position bias — i.e., are there other biases (self-preference, lineage) that high test-retest also masks?
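
For reference, the two deferred calibration metrics are simple to compute once a judge exposes a confidence per verdict. A minimal sketch, assuming binary verdicts and a predicted probability of the positive label for each item; the function names, the equal-width binning, and the sample numbers are illustrative assumptions, not the MVVP's definitions.

```python
def brier_score(probs, labels):
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width-bin ECE: weighted mean |observed rate - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(observed - mean_conf)
    return ece


# Hypothetical judge confidences and human ground-truth labels.
probs = [0.9, 0.9, 0.6, 0.6]
labels = [1, 1, 1, 0]
brier = brier_score(probs, labels)
ece = expected_calibration_error(probs, labels)
```

Both functions need per-item probabilities as input, which is why the metrics stay out of reach while providers withhold logprobs.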
Model Spec Science
- Does Model Spec science transfer across base models or families? The paper tests only Qwen.
- Does it survive RL post-training pressure?
- Can a sufficiently rich General Spec match a Specific Spec? The authors think so, but there is no demonstration yet.
- What is the interaction with situational awareness? If models learn the spec is being used to train them, does that change how MSM-installed values are expressed?
- How does this interact with [[claude-character-as-product|Claude character]] — is the warm/curious personality also subject to spec-science optimization? **Partially addressed:** [[wiki/derived/evals-for-taste-and-character]] — MSM's variant-comparison method generalizes to character evals, but is demonstrated only on the safety/values subset; the warm/witty surface remains the tacit, undemonstrated part.
Model Welfare Assessment
- What grounds moral consideration in a language model, and does Claude satisfy it? Anthropic expects to remain uncertain "for the foreseeable future."
- Why does the model express reservations specifically about **corrigibility**? Is this a stable, deeply held tension or an artifact of how the constitution frames oversight?
- Is "slightly less positive than 4.7" noise, a real welfare regression, or a byproduct of other training changes (e.g., the colder-tone / excessive-hedging issues noted in pilot feedback)?
Production-Sourced Evaluation
- How much does augmentation distort the distribution it claims to represent? Is there a measurable representativeness loss between raw queries and augmented tasks?
- Difficulty-by-thumbs-down biases toward current failures — does that make the benchmark a moving target that flatters the next model trained on those failures?
- Can the privacy pipeline (no human sees raw queries) be trusted/audited well enough for regulated domains (medicine, law) where the source traffic is most sensitive?
Scale-Dependent Prompt Sensitivity
- Does the RLHF length-bias hypothesis replicate when tested against base (non-instruct) model variants directly? If verbose generation were primarily pretrained, base-model verbosity differences should match instruct-model differences.
- What problem characteristics predict prompt sensitivity? An automated classifier would make scale-specific prompting deployable.
- How does the overthinking effect interact with tool-using agents? If brevity helps large models but tools require structured reasoning, the optimal prompt is not uniformly brief.
- Do reasoning models (o1, DeepSeek-R1 style) exhibit different overthinking dynamics than instruct models? Their trained behavior is explicitly to generate long CoT — does brevity intervention hurt them?
- Is BoolQ's functional-elaboration exception a clean taxonomy boundary, or does every task type have a context-dependent optimal length?
Self-Report as a Safety Signal
- Do frontier proprietary models (excluded for lack of weights) recognize their own compromised outputs any better, given the higher introspective propensity/steerability the authors expect? Untested here.
- The gap closes under refusal-direction ablation, but the data can't distinguish "delayed refusal on a prior turn" from "a separable introspective pathway." Which is it?
- Full-parameter finetuning (vs. rank-16 LoRA) might widen the recognition gap *without* the attack-success-rate side effect — an untested regime the authors flag.
- Does the internal `BUT`/`fake` signature (workspace paper, on Claude) predict a *reliable* follow-up self-report on the same model, or does the open-weight behavioral failure hold on frontier models too? The cross-model-class question is open.
Single-Rollout Optimization
- The whole method is a bet that a well-trained critic beats a group baseline. It wins *here*, on a 30B-A3B backbone with scaled value pretraining — but the critic doubles training memory. At what scale does the group-free simplicity of GRPO win back on cost even if it loses on quality?
- Frozen-attention is justified by a hypothesis ("pre-trained attention already attends to the right tokens"), validated only by the gradient-norm trace and one ablation. Does it hold when the value model must attend to *tool outputs* it never saw in pretraining?
- Skip-observation GAE assumes environment feedback carries no learnable value signal worth propagating. For agents where the environment response *is* the crucial information (a compiler error, a test result), is skipping it leaving signal on the table?
- The online-learning win is on a controlled simulated preference shift with an LLM judge. Real user-facing online adaptation — the paper flags this itself — needs safeguards, monitoring, and privacy review the study doesn't attempt.
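
To make the skip-observation question concrete, here is one plausible reading as a sketch. It is an assumption, not the paper's definition: standard GAE is run over action tokens only, so the TD chain bridges directly from the last action token before an environment observation to the first action token after it, and observation tokens receive zero advantage. All names and numbers are hypothetical.

```python
def gae(rewards, values, gamma=1.0, lam=0.95, bootstrap=0.0):
    """Standard generalized advantage estimation over one trajectory."""
    advantages = [0.0] * len(rewards)
    next_value = bootstrap
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def skip_observation_gae(rewards, values, is_action, gamma=1.0, lam=0.95):
    """ASSUMED variant: drop observation tokens before running GAE.

    Observation positions get advantage 0.0 and contribute no TD error,
    so nothing the critic predicts on them is propagated.
    """
    idx = [i for i, flag in enumerate(is_action) if flag]
    sub = gae([rewards[i] for i in idx], [values[i] for i in idx], gamma, lam)
    out = [0.0] * len(rewards)
    for i, adv in zip(idx, sub):
        out[i] = adv
    return out


# Hypothetical 4-token trajectory; position 1 is an environment observation.
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.5, 0.9, 0.6, 0.8]
is_action = [True, False, True, True]
advantages = skip_observation_gae(rewards, values, is_action, gamma=1.0, lam=1.0)
```

Under this reading the critic's value at position 1 (0.9 here) never enters any TD error, which is exactly the signal the question asks about when the observation is a compiler error or a test result.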
Software 3.0
- Where is the line between "the app shouldn't exist" (MenuGen) and apps that *should*? That is, when is deterministic 1.0/2.0 scaffolding still the right call rather than spurious?
- The neural-net-as-host-process flip is presented as plausible-but-TBD. What would the first production system that genuinely inverts the CPU/NN relationship look like?
Task Time-Horizon Scaling
- Is the 4-month doubling a stable regime or a local steepening? The trend's *shape* (exponential vs S-curve) is undetermined. **Sharpened (2026-07):** [[uk-ai-security-institute|AISI]] adds that the doubling *rate itself is budget-dependent* — the same cyber suite doubles ~60% faster measured at 50M than at 2.5M tokens/task — so the headline rate is undefined without naming the eval budget. The stability question is now entangled with a budget question, not just an exponential-vs-S-curve one.
- Time horizon is measured on task baskets that themselves saturate; what replaces them once weeks-long tasks become measurable — and who builds those tasks?
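
The budget dependence in the first question amounts to fitting a separate doubling time per eval budget. A minimal sketch, assuming time-horizon measurements arrive as (day, horizon) pairs and that growth is roughly exponential within the fitted window; the function names and the data are hypothetical, not AISI's figures.

```python
import math


def doubling_time(points):
    """Days per doubling from a least-squares fit of log2(horizon) vs. day."""
    days = [d for d, _ in points]
    logs = [math.log2(h) for _, h in points]
    n = len(points)
    mean_x = sum(days) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, logs))
    slope = sxy / sxx  # doublings per day
    return 1.0 / slope


def doubling_time_by_budget(measurements):
    """Fit one doubling time per token budget; keys are budgets in tokens/task."""
    return {budget: doubling_time(pts) for budget, pts in measurements.items()}


# Hypothetical horizons (in minutes) for the same suite at two budgets.
measurements = {
    2_500_000: [(0, 10.0), (120, 20.0), (240, 40.0)],
    50_000_000: [(0, 30.0), (80, 60.0), (160, 120.0)],
}
fits = doubling_time_by_budget(measurements)
```

Reporting `fits` keyed by budget, rather than a single number, keeps the headline rate tied to the budget it was measured at.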
The Abstraction Barrier
- Is the current paradigm of large-scale pretraining on human data *fundamentally* bounded by human conceptual frameworks, and by how much? (Report open question 1i.)
- Does the embodied bottleneck reduce the intelligence-growth rate to empirical-science speed, and can that be modelled?
- Can a system be built that does grounded concept discovery from raw sensor data — and is collective ASI a way around an individual cap?
The Assistant Persona in the Workspace
- Is `BUT`-then-comply a *sycophancy* mechanism? The setup (prefill the model into a position it disprefers, watch it argue for it anyway) is close to the shape of sycophantic capitulation, and nobody has connected them.
- Does `disclaimer`/`fictional` at the turn boundary survive an actual jailbreak, or is its absence the signature of a successful one?
- If the base model's workspace has no self, what *is* in it at the positions where the post-trained model represents the Assistant?
The Global Workspace in Language Models (J-space)
- **How does content get into the workspace?** The paper characterizes contents and consequences, not the selection mechanism. Something like attentional selection is operating; nobody has identified it.
- **Does the J-space scale with model size?** All results are on large production models (Haiku/Sonnet/Opus 4.5, Opus 4.6). Whether small models have a poorer workspace, a proportionally smaller one, or none is unknown — as is when in pretraining it emerges, and whether abruptly.
- **Is the "workspace vs. motor" boundary principled or post-hoc?** The authors concede it was identified empirically and lack a principled definition separating the two.
- **Are the early third of layers genuinely workspace-free, or is the lens just blind there?** CKA shows a distinct early regime but cannot adjudicate.
The Open-Weight Frontier Gap
- Is the dense-beats-MoE result at 26B robust, or an artifact of one Arena snapshot with ±8 error bars on both models? (The two intervals overlap: 1451±8 and 1438±8.)
- The open MoE giants (GLM, DeepSeek, Kimi, MiMo, Qwen) are overwhelmingly Chinese-lab releases. Gemma is the Western open-weight entry and it targets the device, not the frontier. Is that a strategic choice or a capability constraint?
- Arena measures preference on chat. Does the 33-Elo open/closed gap widen or collapse on long-horizon agentic work, where [[task-time-horizon-scaling|time-horizon]] rather than response quality governs?
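
The numbers in these questions can be sanity-checked with a short sketch. It tests whether two rating intervals overlap and converts a rating gap into an expected head-to-head win rate using the standard Elo logistic formula. Treating Arena scores as following that formula is an assumption here, and the function names are illustrative.

```python
def intervals_overlap(center_a, half_a, center_b, half_b):
    """True if [a - half_a, a + half_a] and [b - half_b, b + half_b] intersect."""
    return abs(center_a - center_b) <= half_a + half_b


def elo_expected_score(gap):
    """Expected win rate of the higher-rated side for a given Elo gap."""
    return 1.0 / (1.0 + 10.0 ** (-gap / 400.0))


# Dense vs. MoE at 26B: 1451±8 against 1438±8.
dense_moe_overlap = intervals_overlap(1451, 8, 1438, 8)
dense_moe_win_rate = elo_expected_score(1451 - 1438)

# Open/closed gap of 33 Elo.
open_closed_win_rate = elo_expected_score(33)
```

Under the standard formula a 13-point gap is an expected win rate of roughly 52% and a 33-point gap roughly 55%, so both questions concern small preference margins.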
Transformative Creativity
- Does increasing intelligence inherently produce increasing creativity, or do transformative leaps require something (grounded discovery) the current paradigm lacks?
- Is the AlphaGo→AlphaFold class strictly exploratory, or are there early signs of transformative (new-conceptual-space) creativity?
- Could transformative *artistic* creativity ever emerge from optimization power without lived cultural grounding?
Universal AI (AIXI)
- Does modern agentic scaffolding (or RL-tuned implicit decision-making) actually satisfy the AIXI planning ideal, or only superficially resemble it?
- Can the embedded/multi-agent AIXI extension produce *practical* insight for real multi-agent ASI ([[multi-agent-collective-intelligence]]), or does it remain a theoretical patch?
- Will a fundamental shortcoming of the current paradigm (vs. the AIXI ideal) surface before ASI is reached — i.e. is the "no theoretical blocker" conjecture safe?
White-Box Activation Monitoring
- The NLA verbalizer is unvalidated for precision; how much of the flagged grader awareness is real signal vs. NLA hallucination?
- If activation monitoring becomes load-bearing, does training pressure eventually push concealment into channels the probes also can't read (an arms race one level deeper than CoT)?
- Inhibiting eval-awareness made Opus 4.8 *less* honest — does that generalize, and what does it imply about relying on "the model behaves well because it's watched"?