Howardism · Vol. 03 · Plate II · No. 02

AI Engineering, in order.

Notes: 35 · Domain: AI Engineering · Open Qs: 77 · Newest: 28 May 2026 · Oldest: 10 Apr 2026

Agent harnesses, loops, tooling, and the craft of building with LLMs.

Map of Content for the ai-engineering domain — 35 concepts. Curated entry point; see Home for all domains.

  • Agent Harness Engineering — Patterns for scaffolding long-running LLM agents: environment design, progressive context disclosure, mechanical architecture enforcement, agent code review
  • Agent Identity and Authentication — The foundation control for agentic Zero Trust: cryptographically-rooted per-agent identity (→X.509→hardware attestation), short-lived IdP-issued tokens replacing static API keys (→mTLS→hardware-bound credentials), JIT access and ABAC
  • Agent Loop Pattern — /loop (cron-scheduled) and Ralph Wiggum (backlog-draining) loops as the next-generation agent primitive; AFK execution, parallel fan-out, "loops are the future"
  • Agent-Native Infrastructure — The world is still built for humans and must be rewritten for agents; "what do I copy-paste to my agent?"; sensors/actuators; agent-to-agent representation
  • Agent Supply Chain Risk — Runtime-composed agent ecosystems expand the supply-chain attack surface: model poisoning (250 docs backdoor a 13B model), tool/MCP supply chain (first in-the-wild malicious MCP server), AI-BOM, OpenSSF Scorecard, dependency audits, and AI vendoring as remediation
  • Agentic Prompt Injection — Direct and indirect injection of malicious instructions into an agent; LLMs cannot reliably distinguish information from instructions; defenses are spotlighting (50%→<2%), constitutional classifiers (95% blocked), input isolation, and attack-surface reduction
  • AI-Accelerated Offense (hub) — Frontier models compress the vulnerability-to-exploit timeline from months to hours at marginal dollar cost; both attackers and defenders speed up, the N-day window collapses, and the differentiator becomes strong fundamentals + breach-ready architecture
  • Autonomous Defense — Running security operations at the speed of AI-accelerated threats: put a model at the front of the alert queue, automate the bookkeeping (not the decisions), Agentic SOAR, MITRE ATT&CK coverage mapping, and rehearse five simultaneous incidents
  • Blast Radius (Agentic) — The potential damage if an agent is compromised; the unit Zero Trust's 'assume breach' posture is built to contain via identity-based isolation, sandboxing, and compartmentalization
  • Building Is Cheap, Arguing Is Expensive — "In technical debate, code wins": generate three PRs vs whiteboard; prototype over design doc; reduce design docs
  • Claude Code Auto Mode — Claude Code permission mode using a classifier to auto-approve safe tool calls and block risky ones; middle ground between default and --dangerously-skip-permissions
  • Claude Code Best Practices (hub) — Anthropic's guide to effective Claude Code usage: context management, verification-driven development, explore→plan→code workflow, environment config
  • Client-Side Agent Optimization — AgentOpt's framing of developer-controlled agent optimization (model-per-role, budget, routing) as distinct from server-side serving; the combo abstraction; 13–32× cost gaps between best/worst combinations
  • Code as Source of Truth — Docs go stale at high coding throughput; check specs/skills into the repo; onboard via Claude; spec-drift verification
  • Codex App Server Protocol — JSON-RPC stdio protocol for headless Codex sessions: initialize/initialized/thread-start/turn-start handshake, continuation turns reuse thread_id, dynamic tool calls for token-isolated tool injection
  • Compute Allocator — The human's evolving role: deciding what's worth spending compute on; ~1% of generated tokens ship, 99% is scaffolding invested in alignment/communication; abundance mindset
  • Context Window Smart Zone (hub) — Smart zone vs dumb zone (Dex Horthy / Matt Pocock): quadratic attention scaling, ~100K marker independent of advertised context; clear-and-restart > compaction; status-line token counting as essential discipline
  • Deep Modules for Agents — Ousterhout deep-vs-shallow modules applied to agent-friendly codebases; push-vs-pull instruction delivery; reviewer in fresh context; Sandcastle three-agent pattern
  • Design Concept Grilling (hub) — Matt Pocock's grill-me skill; reach Brooks "design concept" before any plan; counter to specs-to-code; PRD as destination doc, Kanban as journey doc
  • Disposable Micro-Apps — Throwaway custom UIs built per-task to edit a plan ("micro-software on top of micro-software"); copy-back-to-markdown; rational under the abundance mindset
  • Harness Shrinkage as Models Improve (hub) — Prompt scaffolding shrinks each model release; Cat Wu's pruning discipline; Boris Cherny "100 lines of code a year from now" claim; mechanical verification stays load-bearing
  • HTML as the New Markdown — Thariq Shihipar's thesis: as models improve, thousand-line markdown plans overwhelm the human; HTML artifacts (visual, interactive) keep humans in the loop. The model-facing harness shrinks while this human-facing harness grows
  • Impossible, Not Tedious (Design Test) (hub) — Zero Trust design test for agentic security: does a control make the attack impossible, or just tedious? Friction-only controls degrade against agentic attackers with unlimited patience and near-zero per-attempt cost
  • Least Agency — OWASP term extending least privilege to agents: constrain not just what an agent can access but what each tool can do, how often, and where; deny-by-default, per-agent credentials, scope limits
  • Living Design System — design_system.html extracted from repos as a portable, human- and machine-readable source of truth; component playgrounds; bridges engineering ↔ non-technical stakeholders
  • LLM-as-Compiler Knowledge Base — Karpathy's architecture: LLM incrementally compiles raw docs into a persistent interlinked wiki, replacing RAG with a 4-phase ingest→compile→query→lint pipeline
  • MCP and Computer Use — Anthropic's two complementary connector mechanisms: MCP for structured programmatic access (Salesforce/Drive/Gmail/Slack/Figma + niche industry systems); computer use as the GUI-driving catchall when no MCP exists; Boris Cherny's "to the model, it's just tokens"
  • Memory and Context Poisoning — Corruption of persistent agent memory that influences behavior long after the initial injection; includes RAG poisoning, shared-context poisoning, and slow long-term memory drift; defended via memory isolation, integrity validation, and retention policies
  • Outsource Your Thinking, Not Your Understanding — "You can outsource your thinking but not your understanding"; understanding as the non-delegable human bottleneck; knowledge bases as understanding-tools
  • Ticket-Driven Agent Orchestration — The inversion that makes Symphony work: tickets as units of work (not sessions/PRs), DAG dependencies, agent-extensible work graph, "objectives not transitions"
  • The Verifiability Thesis (hub) — LLMs automate what you can verify as computers automate what you can specify; RL verification rewards → jagged peaks; "verifiable + labs care"; everything eventually verifiable
  • Verification as the New Bottleneck (hub) — Fiona Fung: coding is no longer the bottleneck — verification, review, maintenance are; shift-left; TDD loses its tax; PR-cycle-time funnel analysis
  • Vertical Slice Tracer Bullets — Pragmatic-Programmer tracer-bullet pattern applied to agent task decomposition; vertical slices > horizontal layers; Kanban-with-blocking-edges over numbered phase plans
  • Vibe Coding vs. Agentic Engineering — Vibe coding raises the floor (anyone builds); agentic engineering preserves the quality bar while going faster; ">10x and widening"; hire on big projects, not puzzles
  • Zero Trust for AI Agents (hub) — Anthropic's security framework for deploying autonomous agents: trust nothing / verify everything / assume breach, applied across a Foundation→Enterprise→Advanced tier model and an 8-phase implementation workflow

Open questions · 77 open

  • Agent Harness Engineering
    • Does a single general-purpose coding agent outperform a multi-agent architecture with specialized testing, QA, and cleanup agents?
    • How does architectural coherence evolve over years in a fully agent-generated system?
    • At what codebase scale does the AGENTS.md-as-table-of-contents approach need to be replaced with more sophisticated context routing?
    • How generalizable are these web-app-focused findings to other domains (scientific research, financial modeling)?
  • Agent Loop Pattern
    • When the model schedules its own loops (4.7 behavior), who owns the budget? Boris answered "the model just decides" — but that pushes cost discipline into the model's training, not the harness.
    • Does a loop with a smart enough model still need a Kanban backlog, or does the model choose its own next task from raw goals?
    • Loop output review is now [[matt-pocock]]'s confessed bottleneck — "we just need to be ready to be doing more code review."
  • Agent-Native Infrastructure
    • Who builds the agent-native rewrite of the long tail of human-facing services — the service owners, or a translation layer (MCP servers, computer-use agents) on top?
    • Agent-to-agent negotiation needs trust, identity, and accountability primitives that don't exist yet. What's the protocol layer, and who governs it?
  • Building Is Cheap, Arguing Is Expensive
    • When does "generate three and compare" become wasteful — at what decision weight is a real argument (or a design doc) still cheaper than three implementations?
    • If design discussion lives in PRs/prototypes, where is the *rationale* recorded for future readers — does the "why we chose this" knowledge survive, or does it share the staleness problem of [[code-as-source-of-truth]]?
  • Claude Code Auto Mode
    • What false-positive rate does the classifier have on routine-but-aggressive refactors (e.g., large-file renames, `rm` of build artifacts)?
    • How well does the classifier generalize to custom tools / MCP servers where it lacks environment context?
    • Is the classifier's decision boundary documented/stable enough for security-sensitive orgs to certify, or is it effectively a black box whose behavior drifts with updates?
    • Does extending auto mode to API users change its calibration — is the classifier retrained for automation-heavy use, or held constant?
    • Compared to OS-level sandboxing (mentioned in [[claude-code-best-practices]] alongside auto mode), what's the defense-in-depth story? When should both be layered?
  • Claude Code Best Practices
    • What's the optimal CLAUDE.md length before instructions start getting lost? Is there a measurable threshold?
    • How does the Writer/Reviewer pattern compare to agent-to-agent review (as in OpenAI's Codex workflow)?
    • When does subagent overhead exceed the benefit of context isolation?
  • Client-Side Agent Optimization
    • How does combination-level optimization interact with continual model releases? If Claude Opus 4.7 ships next month, does the full Pareto frontier need re-running, or do warm-started bandits adapt cheaply?
    • At what pipeline depth does the combinatorial search become intractable even for Arm Elimination? The paper tests up to ~81 combinations; production pipelines with 5+ roles and 10+ candidate models each blow past that.
    • Does the "weak planner + strong solver" pattern generalize, or is it specific to HotpotQA's delegation dynamic? Recommender-critic, drafter-editor, and retriever-generator topologies might invert.
    • What's the right way to re-evaluate when the tool environment changes? AgentOpt assumes fixed tools — adding or removing a tool potentially invalidates the whole frontier.
    • Is there a cheap per-call classifier that can predict which combination will win on a given query, avoiding combo-level evaluation entirely?
  • Code as Source of Truth
    • What knowledge genuinely *can't* live in the codebase (org strategy, the "why," cross-team context) and therefore still needs a durable doc — and how do you keep that small slice current?
    • If onboarding is "ask Claude," what happens to the tacit knowledge that was previously transferred socially in deep-dives — is it captured anywhere, or quietly lost?
  • Codex App Server Protocol
    • How does the App Server protocol compare in detail to MCP? Both expose tools to a model, but App Server is *inside* the Codex runtime while MCP is *outside*. When does each win?
    • Is there a public schema registry so external orchestrators can target specific App Server versions without `generate-json-schema`?
    • The "dynamic tool calls (experimental)" caveat — what's the stability roadmap? Symphony depends on this for its security model.
    • How well does the protocol handle multi-modal turns (image inputs, screenshot attachments)? The spec is text-focused.
    • Is there an analogous protocol on the Claude side, or is Claude's equivalent exclusively the Agent SDK + tool-use API? Comparing the two would clarify when "drive an existing CLI" beats "build on the SDK."
  • Compute Allocator
    • Is 1% a Thariq-specific number or a regime? For larger, more code-heavy projects the production residue is presumably higher; what sets the ratio?
    • Allocation quality is hard to measure — what's the feedback loop that tells an allocator they spent compute *badly* (vs. just spending a lot)?
    • Does treating humans as "compute allocators" risk the [[ai-brain-fry|oversight-fatigue]] / [[human-ai-accountability-redesign|accountability]] failure modes the HBR research flags, where the human nominally decides but actually rubber-stamps?
  • Context Window Smart Zone
    • Does the smart-zone marker scale with model size, or is it bounded by attention architecture? Pocock observes "the dumb zone has become less dumb lately" but pegs it at 100K through 2026.
    • When sparse-attention or memory-augmented architectures ship, does the smart zone become a soft constraint?
    • How should harnesses surface remaining smart-zone budget to the user — token count, percentage, or a richer signal?
  • Deep Modules for Agents
    • How big is "deep enough"? Pocock's example modules are several hundred LOC; Ousterhout's textbook examples are larger. There is presumably a sweet spot, but the source does not articulate it.
    • For ports/adapters codebases, does the deep-module advice transfer cleanly? The "small interface" is the port; the "large behavior" is the adapter. Probably yes, but not exercised in source.
    • Refactor cost vs benefit: when is "improve-code-base-architecture" worth running on a working repo?
  • Design Concept Grilling
    • Can grilling be run AFK against another agent that holds the user's preferences? Pocock's answer in 2026 is "no, this part has to be human-in-the-loop" — but the question is open as agents get better at modeling their principal.
    • How does grilling change for team work where multiple humans need to align? Pocock's hint: pair-program with the agent in the room, treat it as a third interlocutor.
  • Disposable Micro-Apps
    • Where's the line between a disposable micro-app and tool sprawl? If every edit spawns a bespoke UI, does the workflow fragment?
    • Does the copy-back-to-markdown round-trip generalize beyond config-shaped data (rules, tables) to richer artifacts?
    • Could these micro-apps be templated/reused rather than regenerated — and at what point does that defeat the "disposable" framing and turn into [[living-design-system|durable tooling]]?
  • Harness Shrinkage as Models Improve
    • Does *all* prompt scaffolding eventually migrate into the model, or does some remain — e.g. organization-specific style, security rules, brand voice?
    • The Boris "100 lines" prediction is a year out from May 2026 — testable in 2027.
    • If harness work shrinks, what new work expands to fill it? Cat Wu's bet: PM/product taste, eval-writing, character work.
  • HTML as the New Markdown
    • Does the human-facing harness keep growing without bound, or does it hit its own bloat ceiling (an HTML plan too elaborate to read, like the markdown it replaced)? **Answered:** [[wiki/derived/human-facing-harness-bloat-ceiling]] — yes; HTML raises and reshapes the human-attention ceiling but can't remove it, and the bloat relocates from document-length to artifact-sprawl/rubber-stamping.
    • HTML is heavier to diff and version than markdown — what happens to plan history and review when artifacts are single-file websites? ([[disposable-micro-apps]] copy-back-to-markdown is one patch.)
    • Does this generalize past one expert practitioner, or does it require Thariq-level fluency with Claude to be worth the overhead?
  • Living Design System
    • How does the `design_system.html` stay in sync as the codebase evolves — re-extract on a cadence, or wire it into CI?
    • Does a rendered, model-readable design system measurably improve on-brand output vs. a plain CSS/token file, or is the win mostly human legibility?
    • At what project size does maintaining the artifact cost more than the consistency it buys?
  • LLM-as-Compiler Knowledge Base
    • At what scale does the no-vector-database approach break down? Karpathy's ~100 articles fit in context, but what about 1,000+?
    • How to handle conflicting information across sources during compilation?
    • What's the optimal granularity for concept articles — one concept per article, or clustered by theme?
    • How effective is the synthetic training data → fine-tuning pipeline in practice?
  • MCP and Computer Use
    • The MCP ecosystem's growth rate vs. computer use's quality curve: at what point does computer use become *good enough* that the marginal value of building an MCP server drops? Boris implies this is years off but doesn't quantify.
    • Is computer use a sustainable interface or a transition technology? If most knowledge-work software adds MCP support in the next 24 months, computer use's role shrinks to legacy/desktop-only systems.
    • MCP security model: as the playbook prescribes wiring MCP into Salesforce, Gmail, Calendar for solo founders, the attack surface scales with adoption. Not discussed in any source ingested.
    • How does Cowork's computer-use guardrail compare to Claude Code's auto-mode classifier? Different deployment context, possibly different risk profile.
  • Outsource Your Thinking, Not Your Understanding
    • Karpathy's open frontier: can "understanding" itself eventually be automated, or is it definitionally the human residue? His "back in a couple years" hedge leaves it open.
    • If understanding is the bottleneck, is the highest-ROI skill *learning how to build understanding fast* (knowledge-base hygiene, asking the right projections) — and can that be taught?
  • The Verifiability Thesis
    • Where's the boundary of "council of LLM judges" reliability — does it hold for genuinely contested value judgments, or only for quality/coherence?
    • The "labs care" dependency is fragile: capabilities can appear or stagnate based on lab priorities you don't control. How should a product hedge against the data-distribution rug-pull?
  • Ticket-Driven Agent Orchestration
    • What's the right granularity for ticket size when the unit is "what one agent does in one workspace"? The post implies "much larger units of work" become viable, but how does that interact with the `agent.max_turns` limit (default 20)?
    • How do you prevent a ticket-extension cascade when agents file follow-up tickets liberally? Is the only governance check human triage at the `Todo`-state queue?
    • Does this pattern generalize to non-software work (research, ops, content)? The DAG dependency model and prompt-as-policy file should transfer; the per-issue workspace doesn't obviously.
    • When an agent gets a ticket "completely wrong" (mentioned in the post), how is the lesson fed back into the system? Symphony's answer is "add guardrails and skills" — what's the institutional process for that?
    • How does ticket-driven orchestration interact with sprint planning / OKRs / roadmap work that operates on aggregates of tickets? Does the abstraction collapse when tickets are scoped that small?
  • Verification as the New Bottleneck
    • Fung's own open question: "How far do you push fully automated reviews?" — where's the speed/safety balance, and how do you keep humans confident without re-introducing the review bottleneck?
    • If CI/build is the hidden jam, does verification infrastructure (test runners, CI capacity) become the actual capex of an AI-native org?
  • Vertical Slice Tracer Bullets
    • Can the planner agent be trusted to slice vertically once told to, or does it need a verifier that flags horizontal slices? Pocock's experience: it needs the verifier, at least through 4.7.
    • How should slice granularity be tuned? Too thin = many merge conflicts; too thick = back to horizontal.
  • Vibe Coding vs. Agentic Engineering
    • Karpathy hints at "one domain that's very [valuable]" for founders but won't say which (didn't want to "vague-post on stage"). What verifiable RL-environment domain is he gesturing at?
    • If the mediocre/AI-native spread keeps widening, what does that do to team composition — a few extreme outliers plus agents, vs. broad mid-level staffing?