Howardism · Vol. 03Plate II · No. 02

AI Engineering, in order.

Notes50DomainAI EngineeringOpen Qs129Newest2 Jul 2026Oldest10 Apr 2026

Agent harnesses, loops, tooling, and the craft of building with LLMs.

Map of Content for the ai-engineering domain — 45 concepts. Curated entry point; see Home for all domains.

Acceleration Whiplash — Faros 2026: AI floods a human-paced SDLC with output it can't absorb — throughput up (tasks +34%, epics +66%), quality down (bugs +54%, incidents/PR +243%, review time 5x), gap widening with adoption and hitting even high-maturity orgs
Agent Context Files — The cross-vendor markdown-as-control-plane pattern: repo-versioned plaintext (CLAUDE.md / AGENTS.md / SOUL.md / WORKFLOW.md / SPEC.md /.cursorrules) that configures agent behavior, split by role across project / personality / workflow / spec layers
Agent Harness Engineering — Patterns for scaffolding long-running LLM agents: environment design, progressive context disclosure, mechanical architecture enforcement, agent code review
Agent Identity and Authentication — The foundation control for agentic Zero Trust: cryptographically-rooted per-agent identity (→X.509→hardware attestation), short-lived IdP-issued tokens replacing static API keys (→mTLS→hardware-bound credentials), JIT access and ABAC
Agent Loop Pattern — /loop (cron-scheduled) and Ralph Wiggum (backlog-draining) loops as next-generation agent primitive; AFK execution, parallel fan-out, "loops are the future"
Agent-Native Infrastructure — The world is still built for humans and must be rewritten for agents; "what do I copy-paste to my agent?"; sensors/actuators; agent-to-agent representation
Agent Quality Flywheel — Google's eval-fix loop packaged as a skill your coding agent drives: Build & Test → Ship & Monitor → Learn & Refine, expanded into five stages (prepare data / run inference / grade / analyze failures / optimize); plain-language worry in, metric choice and before/after deltas out; synthetic User Simulator bootstraps, production OTel traces sharpen
Agent Supply Chain Risk — Runtime-composed agent ecosystems expand the supply-chain attack surface: model poisoning (250 docs backdoor a 13B model), tool/MCP supply chain (first in-the-wild malicious MCP server), AI-BOM, OpenSSF Scorecard, dependency audits, and AI vendoring as remediation
Agentic Coding Work-Composition Shift — Anthropic's 400K-session telemetry, Oct 2025→Apr 2026: as models improved, the share of sessions fixing broken code fell 33%→19% (debugging nearly halved), while operating software (14%→21%) and writing+data-analysis (~10%→~20%) grew; estimated task value rose ~25–27% — usage moving from firefighting toward end-to-end agentic work
Agentic Prompt Injection — Direct and indirect injection of malicious instructions into an agent; LLMs cannot reliably distinguish information from instructions; defenses are spotlighting (50%→<2%), constitutional classifiers (95% blocked), input isolation, and attack-surface reduction
Agentic Work Systematization — OpenAI Codex study's 'systematization' margin: the shift from ad-hoc agent use (describe task → agent does it → done) to reusable workflow infrastructure via skills and plugins; skill use rose 5.4%→26.6% of weekly-active users (Mar→Jun 2026) and is near-universal at OpenAI (96.2%); custom skills (org-specific procedural context) concentrate where shared conventions and team standards exist — the empirical counterpart to loop-engineering's skills primitive
AI-Accelerated Offense (hub) — Frontier models compress the vulnerability-to-exploit timeline from months to hours at marginal dollar cost; both attackers and defenders speed up, the N-day window collapses, and the differentiator becomes strong fundamentals + breach-ready architecture
AI as Primary Author — Faros 2026: the assistant→author threshold crossed without a deliberate decision, marked by AI-code acceptance rising 20%→60%; 'not an assistant, the author'; humans move from creation to oversight, making it an authoring problem not a review problem
Autonomous Defense — Running security operations at the speed of AI-accelerated threats: put a model at the front of the alert queue, automate the bookkeeping (not the decisions), Agentic SOAR, MITRE ATT&CK coverage mapping, and rehearse five simultaneous incidents
Blast Radius (Agentic) — The potential damage if an agent is compromised; the unit Zero Trust's 'assume breach' posture is built to contain via identity-based isolation, sandboxing, and compartmentalization
Build for the Next Model — Prototype the thing that almost works, not the thing that already works: bet that the next concrete model release (not a far-future AGI) fixes what your engineering can't; Claude Design's Opus 4.7 payoff and OpenAI's 'the February Codex app would have failed in November' are the cleanest cases — same product shape, different-intelligence release, different outcome
Building Is Cheap, Arguing Is Expensive — "In technical debate, code wins": generate three PRs vs whiteboard; prototype over design doc; reduce design docs
Claude Code Auto Mode — Claude Code permission mode using a classifier to auto-approve safe tool calls and block risky ones; middle ground between default and --dangerously-skip-permissions
Claude Code Best Practices (hub) — Anthropic's guide to effective Claude Code usage: context management, verification-driven development, explore→plan→code workflow, environment config
Client-Side Agent Optimization — AgentOpt's framing of developer-controlled agent optimization (model-per-role, budget, routing) as distinct from server-side serving; the combo abstraction; 13–32× cost gaps between best/worst combinations
Code as Source of Truth — Docs go stale at high coding throughput; check specs/skills into the repo; onboard via Claude; spec-drift verification
Codex App Server Protocol — JSON-RPC stdio protocol for headless Codex sessions: initialize/initialized/thread-start/turn-start handshake, continuation turns reuse thread_id, dynamic tool calls for token-isolated tool injection
Compute Allocator — The human's evolving role: deciding what's worth spending compute on; ~1% of generated tokens ship, 99% is scaffolding invested in alignment/communication; abundance mindset
Context Window Smart Zone (hub) — Smart zone vs dumb zone (Dex Hardy / Matt Pocock): quadratic attention scaling, ~100K marker independent of advertised context; clear-and-restart > compaction; status-line token counting as essential discipline
Deep Modules for Agents — Ousterhout deep-vs-shallow modules applied to agent-friendly codebases; push-vs-pull instruction delivery; reviewer in fresh context; Sandcastle three-agent pattern
Deep Research Agents — Agentic systems that decompose a complex query, iteratively search diverse sources, and synthesize a structured, cited report — distinct from single-shot QA; DRACO shows orchestration (Perplexity) beats the bare base model with tools, and factual accuracy is the weak axis
Design Concept Grilling (hub) — Matt Pocock's grill-me skill; reach Brooks "design concept" before any plan; counter to specs-to-code; PRD as destination doc, Kanban as journey doc
Disposable Micro-Apps — Throwaway custom UIs built per-task to edit a plan ("micro-software on top of micro-software"); copy-back-to-markdown; rational under the abundance mindset
Failures That Look Like Success — The quiet agent-failure class where everything reads fine — confident answer, plausible plan, even correct internal state — but the user-facing outcome is wrong; Google's flywheel demos caught agents echoing stale values despite correct memorize calls and silently skipping self-report instructions; detectable by trace-level rubrics, not output skims
Harness Shrinkage as Models Improve (hub) — Prompt scaffolding shrinks each model release; Cat Wu's pruning discipline; Boris Cherny "100 lines of code a year from now" claim; mechanical verification stays load-bearing
HTML as the New Markdown — Thariq Shihipar's thesis: as models improve, thousand-line markdown plans overwhelm the human; HTML artifacts (visual, interactive) keep humans in the loop. The model-facing harness shrinks while this human-facing harness grows
Impossible, Not Tedious (Design Test) (hub) — Zero Trust design test for agentic security: does a control make the attack impossible, or just tedious? Friction-only controls degrade against agentic attackers with unlimited patience and near-zero per-attempt cost
Least Agency — OWASP term extending least privilege to agents: constrain not just what an agent can access but what each tool can do, how often, and where; deny-by-default, per-agent credentials, scope limits
Living Design System — design_system.html extracted from repos as a portable, human- and machine-readable source of truth; component playgrounds; bridges engineering ↔ non-technical stakeholders
LLM-as-Compiler Knowledge Base — Karpathy's architecture: LLM incrementally compiles raw docs into a persistent interlinked wiki, replacing RAG with a 4-phase ingest→compile→query→lint pipeline
Loop Engineering — Replacing yourself as the agent's prompter by designing the system that prompts it: a recursive-goal loop built from five product-native primitives (automations, worktrees, skills, connectors, sub-agents) plus external memory; tool-agnostic across Codex and Claude Code; the leverage point moves from prompt-crafting to loop-design
MCP and Computer Use — Anthropic's two complementary connector mechanisms: MCP for structured programmatic access (Salesforce/Drive/Gmail/Slack/Figma + niche industry systems); computer use as the GUI-driving catchall when no MCP exists; Boris Cherny's "to the model, it's just tokens"
Memory and Context Poisoning — Corruption of persistent agent memory that influences behavior long after the initial injection; includes RAG poisoning, shared-context poisoning, and slow long-term memory drift; defended via memory isolation, integrity validation, and retention policies
Optimizer–Evaluator Decoupling — The architectural rule in eval-fix loops that whatever proposes a fix (coding agent, automated optimizer, human) never grades it — an independent evaluation service scores the result, because an optimizer that grades its own work learns to game the metric instead of improving the agent
Outsource Your Thinking, Not Your Understanding — "You can outsource your thinking but not your understanding"; understanding as the non-delegable human bottleneck; knowledge bases as understanding-tools
Parallel Agent Orchestration — OpenAI Codex study's concurrency + runtime margins: the intensive-user workflow where a human oversees a team of agents rather than doing the work directly — 28.6% of OpenAI users peaked at 5+ concurrent agents in a week (vs ~64-67% of external users running zero concurrent), and p99 OpenAI users ran ~71 agent-hours/day (up 88% since April 2026); the threaded interaction model lets one person delegate, monitor, and review many simultaneous streams — first hard numbers behind founder-as-agent-orchestrator
Planning / Execution Division of Labor — Anthropic's 400K-session telemetry: in a typical Claude Code session humans make ~70% of planning decisions (what to do) while Claude makes ~80% of execution decisions (how to do it); each prompt sets off ~10 actions (8 when the user keeps execution control, ~16 when Claude controls planning) — 'people decide what to build, the agent decides how'
Repository Exploration Subagent — FastContext's thesis that repository exploration (read/search/localization) should be decoupled from solving into a dedicated read-only subagent that issues parallel tool calls and returns compact file-line citations, keeping the solver's context clean — cutting main-agent tokens up to 60% and lifting SWE-bench resolution up to 5.5%
Telemetry vs. Survey Measurement — Faros 2026: perception lags reality, so survey-based engineering research (DORA) misses downstream AI damage that system telemetry catches in near-real-time; the basis for Faros's direct contradiction of DORA's 'strong foundations protect you' conclusion
Ticket-Driven Agent Orchestration — The inversion that makes Symphony work: tickets as units of work (not sessions/PRs), DAG dependencies, agent-extensible work graph, "objectives not transitions"
The Verifiability Thesis (hub) — LLMs automate what you can verify as computers automate what you can specify; RL verification rewards → jagged peaks; "verifiable + labs care"; everything eventually verifiable
Verification as the New Bottleneck (hub) — Fiona Fung: coding is no longer the bottleneck — verification, review, maintenance are; shift-left; TDD loses its tax; PR-cycle-time funnel analysis
Vertical Slice Tracer Bullets — Pragmatic-Programmer tracer-bullet pattern applied to agent task decomposition; vertical slices > horizontal layers; Kanban-with-blocking-edges over numbered phase plans
Vibe Coding vs. Agentic Engineering — Vibe coding raises the floor (anyone builds); agentic engineering preserves the quality bar while going faster; ">10x and widening"; hire on big projects, not puzzles
Zero Trust for AI Agents (hub) — Anthropic's security framework for deploying autonomous agents: trust nothing / verify everything / assume breach, applied across a Foundation→Enterprise→Advanced tier model and an 8-phase implementation workflow

Open questions 129 open

Acceleration Whiplash
- Faros's own deferred question: do the bug/incident increases persist when **normalized for PR size**, or do larger PRs account for most of the quality deterioration? (If the latter, hard PR-size limits are the highest-leverage fix.)
- Code churn +861% is genuinely ambiguous (Faros lists three explanations: rework of AI code, productive legacy refactoring, or accelerated polish). The cross-customer metric can't resolve it — a real gap, not a finding.
- How much of the "maturity doesn't protect" claim survives the vendor incentive to argue *exactly that* (i.e., "your existing practices won't save you — you need our platform")?
Agent Context Files
- Will the role split converge on Hermes's explicit project/personality separation, or stay folded into a single file as in Claude Code? A separate `SOUL.md`-style personality layer seems strictly better for multi-project users but adds a file to maintain.
- Is there a natural ceiling on the layering (project → workflow → spec → constitution), or does each new autonomy surface spawn another context-file tier?
- How should context files and bounded memory files interact when they disagree? Memory is lossy and cache-delayed; the context file is authoritative but static. Which wins, and when?
Agent Harness Engineering
- Does a single general-purpose coding agent outperform a multi-agent architecture with specialized testing, QA, and cleanup agents?
- How does architectural coherence evolve over years in a fully agent-generated system?
- At what codebase scale does the AGENTS.md-as-table-of-contents approach need to be replaced with more sophisticated context routing?
- How generalizable are these web-app-focused findings to other domains (scientific research, financial modeling)?
Agent Identity and Authentication
- Hardware-bound credentials assume attested hardware everywhere agents run, including ephemeral cloud workloads and sub-agents. How does attestation work for short-lived spawned sub-agents that "have up to the same permissions as the parent"?
- JIT + ABAC are both labeled "advanced, not easily implemented." Is there a pragmatic Enterprise-tier midpoint, or is the gap from Foundation static roles to Advanced JIT a cliff? **Answered:** [[wiki/derived/agent-access-control-tier-migration]] — not a cliff; the Enterprise tier (ABAC + dynamic privilege elevation with return-to-baseline + mTLS + sandboxing) is the deliberate midpoint, and ABAC's "advanced" framing is a source inconsistency (it sits at Enterprise in the tier table). Sub-agent attestation remains open.
Agent Loop Pattern
- When the model schedules its own loops (4.7 behavior), who owns the budget? Boris answered "the model just decides" — but that pushes cost discipline into the model's training, not the harness.
- Does a loop with a smart enough model still need a Kanban backlog, or does the model choose its own next task from raw goals?
- Loop output review is now [[matt-pocock]]'s confessed bottleneck — "we just need to be ready to be doing more code review."
Agent Supply Chain Risk
- "AI vendoring" as a standard response inverts decades of "don't reinvent the wheel." How is a model-reimplemented dependency itself verified and maintained — does it just relocate the risk?
- The 250-doc backdoor persists through SFT/RLHF. What detection exists for an already-poisoned model you didn't train, short of behavioral red-teaming?
Agent-Native Infrastructure
- Who builds the agent-native rewrite of the long tail of human-facing services — the service owners, or a translation layer (MCP servers, computer-use agents) on top?
- Agent-to-agent negotiation needs trust, identity, and accountability primitives that don't exist yet. What's the protocol layer, and who governs it?
Agentic Coding Work-Composition Shift
- The window is seven months and the value proxy is coarse/relative. How much of the +27% is genuine task-complexity growth vs. classifier/marketplace-matching drift?
- The study excludes headless/SDK/IDE usage — a "substantial share," and likely the most automated/end-to-end. Does including it accelerate or reverse the composition shift?
- If "fixing" keeps falling, is that because models break less, or because broken-code work is migrating to non-interactive pipelines this study doesn't see?
Agentic Prompt Injection
- Spotlighting and constitutional classifiers each leave a residual (2%, 5%). Stacked, what's the realistic floor, and does it hold against adaptive attackers who know both are deployed? *(Partly answered by the Opus 4.8 live bug bounty: adaptive expert red-teamers still find attacks on the bare model; deployed probes add uplift but don't zero out the residual.)*
- Why did Opus 4.8 regress on prompt-injection robustness relative to Opus 4.7 despite broad alignment gains — a capability/robustness tradeoff, or an artifact of harder adaptive evaluation?
- "LLMs cannot reliably distinguish information from instructions" — is this a fundamental property of the architecture or a training gap that future models close? The framework treats it as durable.
AI as Primary Author
- The 60% figure aggregates very different tools and modes (autocomplete acceptance vs. agent-applied diffs). What does "acceptance" mean when the agent applies the change directly and the human's "acceptance" is not reverting it?
- If agentic authoring crosses from <1% toward double digits, does the whiplash become unmanageable before context-engine tooling matures — or does the tooling mature *because* of the pressure?
AI-Accelerated Offense
- Anthropic argues LLMs benefit defenders more *long-term* (like fuzzers) but attackers more *short-term* during the transition. How long is the transition, and what determines who wins it?
- "Fundamentals strong enough that scanning finds fewer bugs" assumes defenders run the scanners first. What happens to organizations that can't afford continuous model-driven scanning?
Autonomous Defense
- "Measure agreement against a human for two weeks, expand if tolerable" — what agreement threshold is tolerable, and who owns the residual false-negative risk when the model dispositions an alert the human never sees?
- Defensive agents are high-value targets (compromising one yields powerful capabilities). Does concentrating detection in an Agentic SOAR create a single point of catastrophic compromise the distributed-human model didn't have?
Blast Radius (Agentic)
- The framework prefers identity-based isolation over network segmentation, but most enterprises have heavy segmentation investment. What's the migration path, and does dual-running create new gaps?
- Multi-agent compartmentalization increases the *number* of identities to manage; at what point does identity-management overhead create its own attack surface?
Build for the Next Model
- How do you tell a "wait for the model" gap from a durable-harness gap *before* the next release? Get it wrong and you either ship vaporware or build a crutch you'll delete.
- The bet depends on a reliable release cadence and a forecastable capability curve ([[task-time-horizon-scaling]]). What happens to "build for the next model" if model improvement stalls (the [[recursive-self-improvement|stalled-but-diffused]] future)?
- Does the strategy generalize outside frontier labs, who have privileged visibility into the next model? An external team is betting on a release it can't see.
Building Is Cheap, Arguing Is Expensive
- When does "generate three and compare" become wasteful — at what decision weight is a real argument (or a design doc) still cheaper than three implementations?
- If design discussion lives in PRs/prototypes, where is the *rationale* recorded for future readers — does the "why we chose this" knowledge survive, or does it share the staleness problem of [[code-as-source-of-truth]]?
Claude Code Auto Mode
- What false-positive rate does the classifier have on routine-but-aggressive refactors (e.g., large-file renames, `rm` of build artifacts)?
- How well does the classifier generalize to custom tools / MCP servers where it lacks environment context?
- Is the classifier's decision boundary documented/stable enough for security-sensitive orgs to certify, or is it effectively a black box whose behavior drifts with updates?
- Does extending auto mode to API users change its calibration — is the classifier retrained for automation-heavy use, or held constant?
- Compared to OS-level sandboxing (mentioned in [[claude-code-best-practices]] alongside auto mode), what's the defense-in-depth story? When should both be layered?
Claude Code Best Practices
- What's the optimal CLAUDE.md length before instructions start getting lost? Is there a measurable threshold?
- How does the Writer/Reviewer pattern compare to agent-to-agent review (as in OpenAI's Codex workflow)?
- When does subagent overhead exceed the benefit of context isolation?
Client-Side Agent Optimization
- How does combination-level optimization interact with continual model releases? If Claude Opus 4.7 ships next month, does the full Pareto frontier need re-running, or do warm-started bandits adapt cheaply?
- At what pipeline depth does the combinatorial search become intractable even for Arm Elimination? The paper tests up to ~81 combinations; production pipelines with 5+ roles and 10+ candidate models each blow past that.
- Does the "weak planner + strong solver" pattern generalize, or is it specific to HotpotQA's delegation dynamic? Recommender-critic, drafter-editor, and retriever-generator topologies might invert.
- What's the right way to re-evaluate when the tool environment changes? AgentOpt assumes fixed tools — adding or removing a tool potentially invalidates the whole frontier.
- Is there a cheap per-call classifier that can predict which combination will win on a given query, avoiding combo-level evaluation entirely?
Code as Source of Truth
- What knowledge genuinely *can't* live in the codebase (org strategy, the "why," cross-team context) and therefore still needs a durable doc — and how do you keep that small slice current?
- If onboarding is "ask Claude," what happens to the tacit knowledge that was previously transferred socially in deep-dives — is it captured anywhere, or quietly lost?
Codex App Server Protocol
- How does the App Server protocol compare in detail to MCP? Both expose tools to a model, but App Server is *inside* the Codex runtime while MCP is *outside*. When does each win?
- Is there a public schema registry so external orchestrators can target specific App Server versions without `generate-json-schema`?
- The "dynamic tool calls (experimental)" caveat — what's the stability roadmap? Symphony depends on this for its security model.
- How well does the protocol handle multi-modal turns (image inputs, screenshot attachments)? The spec is text-focused.
- Is there an analogous protocol on the Claude side, or is Claude's equivalent exclusively the Agent SDK + tool-use API? Comparing the two would clarify when "drive an existing CLI" beats "build on the SDK."
Compute Allocator
- Is 1% a Thariq-specific number or a regime? For larger, more code-heavy projects the production residue is presumably higher; what sets the ratio?
- Allocation quality is hard to measure — what's the feedback loop that tells an allocator they spent compute *badly* (vs. just spending a lot)?
- Does treating humans as "compute allocators" risk the [[ai-brain-fry|oversight-fatigue]] / [[human-ai-accountability-redesign|accountability]] failure modes the HBR research flags, where the human nominally decides but actually rubber-stamps?
Context Window Smart Zone
- Does the smart-zone marker scale with model size, or is it bounded by attention architecture? Pocock observes "the dumb zone has become less dumb lately" but pegs it at 100K through 2026.
- When sparse-attention or memory-augmented architectures ship, does the smart zone become a soft constraint?
- How should harnesses surface remaining smart-zone budget to the user — token count, percentage, or a richer signal?
Deep Modules for Agents
- How big is "deep enough"? Pocock's example modules are several hundred LOC; Ousterhout's textbook examples are larger. There's a sweet spot; not articulated.
- For ports/adapters codebases, does the deep-module advice transfer cleanly? The "small interface" is the port; the "large behavior" is the adapter. Probably yes, but not exercised in source.
- Refactor cost vs benefit: when is "improve-code-base-architecture" worth running on a working repo?
Deep Research Agents
- Does the orchestration advantage shrink as base models cross the next thresholds, or is open-ended retrieval/synthesis a durable harness asset (unlike, say, prompt scaffolding)?
- DRACO grades single-turn interactions only. How much of real deep-research value is in the multi-turn loop (clarifying questions, follow-ups) that the benchmark doesn't yet measure?
- Factual accuracy is the weak axis everywhere — is the fix better retrieval, better verification-in-the-loop, or a tool-grounded check the way Lean grounds proof search?
Design Concept Grilling
- Can grilling be run AFK against another agent that holds the user's preferences? Pocock's answer in 2026 is "no, this part has to be human-in-the-loop" — but the question is open as agents get better at modeling their principal.
- How does grilling change for team work where multiple humans need to align? Pocock's hint: pair-program with the agent in the room, treat it as a third interlocutor.
Disposable Micro-Apps
- Where's the line between a disposable micro-app and tool sprawl? If every edit spawns a bespoke UI, does the workflow fragment?
- Does the copy-back-to-markdown round-trip generalize beyond config-shaped data (rules, tables) to richer artifacts?
- Could these micro-apps be templated/reused rather than regenerated — and at what point does that defeat the "disposable" framing and turn into [[living-design-system|durable tooling]]?
Harness Shrinkage as Models Improve
- Does *all* prompt scaffolding eventually migrate into the model, or does some remain — e.g. organization-specific style, security rules, brand voice?
- The Boris "100 lines" prediction is a year out from May 2026 — testable in 2027.
- If harness work shrinks, what new work expands to fill it? Cat Wu's bet: PM/product taste, eval-writing, character work.
HTML as the New Markdown
- Does the human-facing harness keep growing without bound, or does it hit its own bloat ceiling (an HTML plan too elaborate to read, like the markdown it replaced)? **Answered:** [[wiki/derived/human-facing-harness-bloat-ceiling]] — yes; HTML raises and reshapes the human-attention ceiling but can't remove it, and the bloat relocates from document-length to artifact-sprawl/rubber-stamping.
- HTML is heavier to diff and version than markdown — what happens to plan history and review when artifacts are single-file websites? ([[disposable-micro-apps]] copy-back-to-markdown is one patch.)
- Does this generalize past one expert practitioner, or does it require Thariq-level fluency with Claude to be worth the overhead?
Impossible, Not Tedious (Design Test)
- Defense-in-depth traditionally *stacks* friction controls on the theory that enough of them sum to a barrier. Does this test invalidate layered friction, or just demote it below capability-removal?
- Some controls are friction for humans but barriers for agents (or vice versa). Is the test agent-relative, and how do you evaluate it for mixed human/agent threat models?
Least Agency
- Least agency adds a *frequency* dimension ("how often"), but the framework also says rate limits are friction, not barriers ([[impossible-not-tedious-test]]). How is frequency-limiting both a least-agency control and a friction-only one — context-dependent?
- Dynamic privilege elevation (Enterprise) reintroduces an elevation path; how is the elevation request itself authenticated against a manipulated agent?
Living Design System
- How does the `design_system.html` stay in sync as the codebase evolves — re-extract on a cadence, or wire it into CI?
- Does a rendered, model-readable design system measurably improve on-brand output vs. a plain CSS/token file, or is the win mostly human legibility?
- At what project size does maintaining the artifact cost more than the consistency it buys?
LLM-as-Compiler Knowledge Base
- At what scale does the no-vector-database approach break down? Karpathy's ~100 articles fit in context, but what about 1,000+?
- How to handle conflicting information across sources during compilation?
- What's the optimal granularity for concept articles — one concept per article, or clustered by theme?
- How effective is the synthetic training data → fine-tuning pipeline in practice?
Loop Engineering
- Osmani's cost caveat is unquantified: at what token budget does a continuously-running loop stop paying for itself, and how do you instrument that? (Cf. [[agent-loop-pattern]]'s "who owns the budget when the model schedules its own loops.")
- If `/goal`'s stop-check is itself a model, what verifies the verifier? The maker/checker split pushes the trust problem up a level, not away.
- Does loop-engineering converge on a single dominant shape (morning-triage → worktree → maker/checker → PR), or proliferate into many idiom-specific loops? The essay describes one shape "I keep using" but claims the primitives are general.
MCP and Computer Use
- The MCP ecosystem's growth rate vs. computer use's quality curve: at what point does computer use become *good enough* that the marginal value of building an MCP server drops? Boris implies this is years off but doesn't quantify.
- Is computer use a sustainable interface or a transition technology? If most knowledge-work software adds MCP support in the next 24 months, computer use's role shrinks to legacy/desktop-only systems.
- MCP security model: as the playbook prescribes wiring MCP into Salesforce, Gmail, Calendar for solo founders, the attack surface scales with adoption. **Now addressed** by [[zero-trust-for-ai-agents]] (tool poisoning, rug pulls, the first in-the-wild malicious MCP server) — see "MCP as a security surface" above. Open residual: how does a solo founder realistically *run/host and self-sign every MCP server* the framework recommends, given that the appeal of MCP was zero-integration-effort?
- How does Cowork's computer-use guardrail compare to Claude Code's auto-mode classifier? Different deployment context, possibly different risk profile.
Memory and Context Poisoning
- Long-term memory drift is defined as undetectable per-change. Drift detection requires a baseline — but if the baseline itself drifts (Advanced "continuous baseline refinement"), how is a slow poisoning attack distinguished from legitimate evolution?
- Integrity hashing detects *modification* but not *malicious-but-valid* memory written through a legitimate (injected) interaction. What catches semantically-poisoned-but-cryptographically-intact memory?
Outsource Your Thinking, Not Your Understanding
- Karpathy's open frontier: can "understanding" itself eventually be automated, or is it definitionally the human residue? His "back in a couple years" hedge leaves it open.
- If understanding is the bottleneck, is the highest-ROI skill *learning how to build understanding fast* (knowledge-base hygiene, asking the right projections) — and can that be taught?
Planning / Execution Division of Labor
- Does the human share of *planning* decisions fall over time as models improve (the ceiling rising into the planning layer), or is ~70% a stable human floor?
- "Decision attribution" is inferred from transcripts. When Claude proposes a plan and the user assents, is that scored as the user's planning decision or Claude's? The rubber-stamping boundary is exactly where the measure is hardest.
- Headless/SDK/pipeline usage (excluded here) is where execution autonomy is highest and planning is front-loaded into a single prompt — does the 70/20 split survive there, or collapse toward full delegation?
Repository Exploration Subagent
- **Does the gain survive better main models?** The same-model-exploration result suggests the architectural benefit is somewhat model-independent, but the trained-explorer margin may erode as frontier models get cheaper and better at staying in their smart zone unaided. The bitter-lesson question is unresolved.
- **Prune vs. don't-pollute.** SWE-Pruner removes context after the fact; FastContext avoids accumulating it. Are these complementary (prune the solver *and* delegate exploration) or substitutes? Not tested together.
- **How small can the explorer go?** The authors flag 1.7B / 0.6B as future work — if the recipe holds, the explorer becomes nearly free and the architecture dominates.
- **Generality beyond Mini-SWE-Agent.** Only one (deliberately minimal) main-agent scaffold is tested; richer harnesses with their own memory/subagent orchestration may already capture part of the benefit or interact differently.
- **Patch-derived reward leakage.** Training the explorer's reward on the gold patch's file/line ranges risks overfitting to where fixes *landed* rather than where evidence *lives*; the F1-vs-recall behavior partly mitigates this, but the proxy is imperfect.
Telemetry vs. Survey Measurement
- Surveys and telemetry measure different things (felt productivity vs. system outcomes); is the "contradiction" partly a category error — both true at their own layer — rather than one being wrong?
- Is there a *non-vendor* telemetry dataset large enough to adjudicate the maturity-protection question independently of Faros's commercial framing?
The Verifiability Thesis
- Where's the boundary of "council of LLM judges" reliability — does it hold for genuinely contested value judgments, or only for quality/coherence?
- The "labs care" dependency is fragile: capabilities can appear or stagnate based on lab priorities you don't control. How should a product hedge against the data-distribution rug-pull?
Ticket-Driven Agent Orchestration
- What's the right granularity for ticket size when the unit is "what one agent does in one workspace"? The post implies "much larger units of work" become viable, but how does that interact with the `agent.max_turns` limit (default 20)?
- How do you prevent a ticket-extension cascade when agents file follow-up tickets liberally? Is the only governance check human triage at the `Todo`-state queue?
- Does this pattern generalize to non-software work (research, ops, content)? The DAG dependency model and prompt-as-policy file should transfer; the per-issue workspace doesn't obviously.
- When an agent gets a ticket "completely wrong" (mentioned in the post), how is the lesson fed back into the system? Symphony's answer is "add guardrails and skills" — what's the institutional process for that?
- How does ticket-driven orchestration interact with sprint planning / OKRs / roadmap work that operates on aggregates of tickets? Does the abstraction collapse when tickets are scoped that small?
Verification as the New Bottleneck
- Fung's own open question: "How far do you push fully automated reviews?" — where's the speed/safety balance, and how do you keep humans confident without re-introducing the review bottleneck?
- If CI/build is the hidden jam, does verification infrastructure (test runners, CI capacity) become the actual capex of an AI-native org?
Vertical Slice Tracer Bullets
- Can the planner agent be trusted to slice vertically once told to, or does it need a verifier that flags horizontal slices? Pocock's experience: it needs the verifier, at least through 4.7.
- How should slice granularity be tuned? Too thin = many merge conflicts; too thick = back to horizontal.
Vibe Coding vs. Agentic Engineering
- Karpathy hints at "one domain that's very [valuable]" for founders but won't say which (didn't want to "vague-post on stage"). What verifiable RL-environment domain is he gesturing at?
- If the mediocre/AI-native spread keeps widening, what does that do to team composition — a few extreme outliers plus agents, vs. broad mid-level staffing?
Zero Trust for AI Agents
- The framework treats every Claude Code "Pro-tip" as a reference implementation. How much of the framework is vendor-neutral vs. tacitly assuming the Anthropic stack?
- "Foundation floor raised" implies a moving baseline. How fast does the tier ladder actually shift, and who arbitrates it (NIST/NSA cadence vs. model-capability cadence)?
- The framework is explicit that it is *not* legal/compliance assurance. Where does self-attested Zero Trust maturity meet auditable regulatory requirement?