Howardism · Vol. 03 · Plate II · No. 02

Entities, in order.

Notes: 45 · Domain: Entities · Open Qs: 49 · Newest: 3 Jul 2026 · Oldest: 17 Apr 2026

Profiles of the people, labs, products, and projects.

Map of Content for all 45 entity pages. See Home for concept domains.

Addy Osmani — Engineering leader at Google (Chrome) and prolific author/educator; in 2026 writes a widely-read blog series on AI-assisted engineering — agent harness engineering, the factory model, comprehension/intent debt, cognitive surrender, and the essay that named loop engineering
AlphaProof Nexus — DeepMind framework for LLM-aided Lean proof generation; four agents (basic→full-featured); proof-sketch + EVOLVE-BLOCK interface; SafeVerify
Andrej Karpathy — Co-founder OpenAI, ex-Tesla AI, Eureka Labs; coined "vibe coding," Software 1/2/3.0, "ghosts not animals," "agentic engineering"; originated the LLM-wiki pattern this vault runs on
Andrew Ambrosino — Product & engineering lead for the Codex desktop app at OpenAI; a designer→engineer→PM→founder generalist whose June 2026 Lenny's Podcast interview is the wiki's OpenAI-side account of how cheap implementation inverts product work toward taste and curation
Anthropic — AI safety company / vendor of Claude; mission-as-tiebreaker culture; ~30–40 PMs across teams; Mike Krieger leads Labs round 2
Anthropic Economic Index — Anthropic's recurring economic-research program measuring how Claude usage maps to and diffuses through the economy — privacy-preserving usage telemetry (Clio) now paired with a linked survey; reports include the June 2026 Cadences report, the returns-to-expertise study, and the agentic-coding work-composition analyses
Anthropic Institute — Anthropic's policy/governance research arm; published When AI builds itself (Favaro & Clark, 2026) on recursive self-improvement; agenda includes building the verification systems a credible multilateral AI slowdown would require
Anthropic Labs — Anthropic's internal incubator — a 'bet factory' of ~a dozen tiny teams exploring the model frontier with lean-startup loops; origin of Claude Code, MCP, Skills, and Claude Design; led (round 2) by Mike Krieger
Boris Cherny — Creator of Claude Code at Anthropic; phone-driven workflow with hundreds of agents; primary advocate of /loop primitive; "coding is solved (for me)" thesis
Campfire — AI-native ERP (YC S23) pulling customers off NetSuite; custom foundation model + agent platform; Series B (Accel/Ribbit); doubling ARR/quarter since Q4 2024
Cat Wu — Head of Product for Claude Code and Cowork at Anthropic; primary articulator of AI-native product cadence and engineer-PM convergence
Chloe Li — Lead author of MSM paper (arXiv 2605.02087); Anthropic Fellows Program; designed all specs and experiments
Claire Vo — Host of the "How I AI" interview series (ChatPRD); interviewed Thariq Shihipar; runs a parallel component-visualization practice for non-technical stakeholders
Claude Code — Anthropic's agentic coding product; created by Boris Cherny late 2024; TypeScript/React; CLI/desktop/web/mobile/IDE surfaces; central tool across all 2026 sources
Claude's Constitution / Model Spec — Anthropic Model Spec / Constitution by Askell et al.; document specifying Claude's values + hard constraints (SP1–3, GP1–2); now also a direct training input via MSM
Claude Design — Anthropic Labs product (research preview, ~April 2026) for collaborating with Claude on polished visual artifacts — designs, prototypes, slides; built by ~3 people in ~10 weeks; multiplayer + handoff to Claude Code; HTML/CSS/JS export
Claude Fable 5 — Anthropic's first generally-available Mythos-class model (June 2026) — state-of-the-art on nearly all benchmarks; the same underlying model as Mythos 5 but shipped with classifiers that fall back to Opus 4.8 on cyber/bio-chem/distillation queries; $10/$50 per Mtok; access suspended shortly after launch
Claude Mythos 5 — The safeguards-lifted form of Claude Fable 5 (June 2026): same underlying Mythos-class model, deployed through Project Glasswing with cyber safeguards removed; strongest cybersecurity capabilities of any model in the world, plus autonomous drug-design / genomics results; restricted to trusted-access partners; access suspended shortly after launch
Claude Opus 4.7 — GA frontier model from Anthropic; direct upgrade to 4.6 at same price; literal instruction following, 1.0–1.35× tokenizer inflation, new xhigh effort, first post-Glasswing safeguards
Claude Opus 4.8 — Anthropic's most capable general-access model (May 2026); upgrade on Opus 4.7 in SWE/agentic/knowledge work; does not advance the frontier beyond Mythos Preview; best-aligned public model yet, but training surfaced a grader-speculation trend
Claude Sonnet 5 — Anthropic's most agentic Sonnet yet (July 2026); narrows the gap to Opus 4.8 at lower price via effort-level cost-performance tuning; 1.0–1.35× tokenizer inflation; safer than Sonnet 4.6 on the behavioral audit but weaker cyber than Opus; ships default real-time cyber safeguards
Codex — OpenAI's agentic coding and work platform: a CLI (April 2025) plus a desktop app (built Nov 2025, released Feb 2026) built on the GPT-5-series Codex models, extended by skills/plugins, a headless App Server Protocol, and the Symphony orchestrator; the OpenAI-side reference harness paired against Claude Code, subject of the June 2026 'Shift to Agentic AI' study, and — per its product lead — an app ~90% of OpenAI's whole company uses that is spreading from code into general knowledge work
Cowork — Anthropic's non-code knowledge-work agent product; sibling to Claude Code; output is decks/inbox/dossiers; same MCP/computer-use primitives
Dan Carey — Product Manager leading product within Anthropic Labs; led Claude Design; 'Designing with Claude' talk (May 2026); ~two decades of PRDs, now replaced by prototypes
Faros AI — Engineering-intelligence platform that aggregates SDLC telemetry (task trackers, IDEs, CI/CD, VCS, incident systems); publisher of the AI Engineering Impact Reports (2025 Productivity Paradox, 2026 Acceleration Whiplash)
FastContext — Microsoft CoreAI + Shanghai Jiao Tong University's open-source repository-exploration subagent (June 2026): trained 4B–30B Qwen-based explorers (Read/Glob/Grep, parallel, compact file-line citations) that decouple repo search from solving; +up to 5.5% SWE-bench resolution, −up to 60% main-agent tokens; code + data released
Fiona Fung — Leads engineering + product for Claude Code and Cowork at Anthropic (ex-Meta/Microsoft); "what served you prior may no longer"; rewrote team norms for the AI-native org
Gemini Enterprise Agent Platform — Google Cloud's agent platform: the GenAI evaluation service with adaptive AutoRaters (built with DeepMind), User Simulator, Automatic Loss Analysis, Online Monitors, OTel tracing, and the ADK/agents-cli toolchain; ships the quality-flywheel eval skill in two packages
Google DeepMind — Google's AI lab; built AlphaProof Nexus; Gemini models, AlphaProof, AlphaEvolve; opens the AI-for-mathematics domain and (via the Legg/Hutter 'From AGI to ASI' report) the theory-of-superintelligence cluster in this wiki; co-developer of the Cloud agent platform's AutoRater judges
Hermes Agent — Nous Research's CLI agent + Gateway daemon (Telegram/Discord/Slack/WhatsApp); AGENTS.md/SOUL.md context split, bounded memory files, DM-pairing auth, container-as-security-boundary model
John Glasgow — CEO/founder of Campfire; 10 years in corporate finance; founder-led-sales advocate; frames Campfire as the long-horizon "last job I'll ever have"
Lean — Proof assistant whose compiler mechanically verifies every step; the sorry placeholder enables proof sketches; mathlib maturity gates the reachable frontier
Marcus Hutter — Creator of AIXI and the Universal AI framework; DeepMind senior researcher and ANU professor; co-author of the Legg–Hutter intelligence measure and the 2026 textbook 'An Introduction to Universal Artificial Intelligence'; co-author of the 'From AGI to ASI' report
Matt Pocock — Independent AI-coding educator; built Sandcastle library; smart-zone/grill-me/tracer-bullets pedagogical framing; "bad code bases make bad agents"
METR — Independent AI-evaluation org behind the 'time horizons' benchmark — the task length a model can complete reliably on its own; the doubling-every-~4-months trendline and the 'upper end of what we can measure' verdict on Mythos Preview
Mythos Model — Anthropic preview-tier frontier model and the first member of the Mythos-class tier (above Opus); gated for safety, used internally alongside Opus 4.7; its descendants Fable 5 / Mythos 5 shipped June 2026 as the first general-access Mythos-class models
OpenAI — AI lab and maker of the GPT-5 series and Codex; in this corpus it appears as a frontier-safety research source (Deployment Simulation, deliberative alignment), an agent-tooling source (Codex, Symphony orchestrator, the App Server Protocol, harness engineering), and the company Andrej Karpathy co-founded
OWASP — Open Worldwide Application Security Project; source of the agentic threat taxonomy cited throughout Anthropic's Zero Trust framework, coined the term 'least agency', and maintains the AI-BOM (CycloneDX ML-BOM extension)
Perplexity — AI answer-engine company; maker of Perplexity Deep Research (the leading system on its own DRACO benchmark) and publisher of DRACO; runs Claude Opus 4.5/4.6 as base models inside its orchestration — simultaneously an Anthropic customer and a benchmark competitor
Peter Steinberger — Founder of PSPDFKit turned prolific independent AI-coding experimenter (@steipete); originated the framing that loop engineering is built on — "you should be designing loops that prompt your agents"
Shane Legg — Co-founder and Chief AGI Scientist of Google DeepMind; co-author with Hutter of the Legg–Hutter universal intelligence measure; senior author on the 2026 'From AGI to ASI' report
Symphony — OpenAI's open-source agent orchestrator (March 2026): turns Linear into a control plane for Codex, per-issue workspace, daemon-driven, SPEC.md-as-product, hedged 500% landed-PRs claim
Thariq Shihipar — Engineer on the Claude Code team at Anthropic; "HTML is the new markdown" and "compute allocator" framings; three HTML-first workflows
Thinking Machines Lab — AI research lab behind interaction models (May 2026); harness-dissolves-into-model thesis; upstreamed streaming-sessions to SGLang; benchmarks against GPT-realtime / Gemini-live; research grants open
TML-Interaction-Small — TML's first interaction model: 276B MoE / 12B active, audio+video+text in / text+audio out, 200ms micro-turns, async background agent; best turn-taking latency of any model; research preview May 2026
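
The Lean entry above notes that the `sorry` placeholder enables proof sketches. A minimal Lean 4 illustration of that idea follows; the theorem names and the propositions are hypothetical, chosen only for the example, and are not taken from AlphaProof Nexus or any source in this vault.

```lean
-- Sketch: the shape of the argument is fixed, one intermediate step is left open.
-- Lean accepts this declaration but warns that it uses `sorry`.
theorem r_of_p_sketch (p q r : Prop) (hpq : p → q) (hqr : q → r) (hp : p) : r := by
  have hq : q := sorry  -- open subgoal, to be filled in later
  exact hqr hq

-- Completed: the placeholder is replaced by an actual proof term.
theorem r_of_p (p q r : Prop) (hpq : p → q) (hqr : q → r) (hp : p) : r := by
  have hq : q := hpq hp
  exact hqr hq
```

The sketch type-checks around the gap, so the remaining work is localized to the single `sorry`; the proof only counts as verified once no `sorry` remains.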

Open questions (49 open)

AlphaProof Nexus
- The framework's reach is gated by [[lean]]'s mathlib maturity. What's the path to domains needing new theory rather than subgoal decomposition?
- AlphaProof adds little as a soloist but helps as a tool. As the prover LLM strengthens, does the AlphaProof tool become redundant entirely?
Anthropic Institute
- How does the Institute's policy posture (favoring an *option* to pause) interact with Anthropic's commercial incentive to ship frontier models? The essay acknowledges the competitive/geopolitical pressure but doesn't resolve it.
- What concrete verification mechanisms will the Institute prototype, and on what timeline relative to the RSI trend it warns about?
Campfire
- Campfire claims its AI edge comes from "our own foundation model." For an ERP, what does a custom foundation model actually buy over fine-tuning a frontier model — and is it durable as frontier models improve (cf. [[harness-shrinkage-as-models-improve]])?
- "Never had anyone outgrow Campfire" — does that hold as customers reach true enterprise scale where NetSuite's breadth historically mattered?
Claude Design
- Did the "any design tool via MCP" integration actually ship on the stated timeline? (Forward claim from May 2026.)
- How does Claude Design's eval discipline work for visual/aesthetic output, where there's no compiler or test? (Same open question as [[cowork]] for non-code artifacts; relates to [[wiki/derived/evals-for-taste-and-character|character/taste evals]].)
Claude Fable 5
- **Why was access suspended after launch?** The source banner gives no reason (capacity? a safety finding? the UK-AISI jailbreak progress noted in [[capability-gated-model-fallback]]?). Not in source.
- Exact benchmark numbers vs GPT-5.x / Gemini are image-only in the source; not transcribed.
- How much of Fable's general-access experience is *actually* Fable vs Opus-4.8 fallback for security-research-adjacent users whose queries trip the conservative classifiers?
Claude Mythos 5
- **Suspension reason** — shared with Fable 5; not stated in source.
- How does "somewhat stronger than Mythos Preview" square with Opus 4.8's card claiming Mythos Preview was the capability frontier? The frontier has moved; the magnitude isn't quantified here.
- The bio trusted-access SKU is "Fable 5 with bio safeguards removed," not Mythos 5 — so "Mythos 5" strictly denotes the cyber-lifted variant. Whether these converge under one trusted-access umbrella is unstated.
Claude Opus 4.7
- Do Hakim's (2026) brevity-constraint findings on Opus 4.6 replicate on Opus 4.7, or does the literal-instruction-following change the elasticity? Specifically: does `<50 words` still yield +13.1pp on GSM8K?
- Does Opus 4.7 still underperform as a planner in HotpotQA-style combo sweeps, or does improved instruction-following close the gap that AgentOpt (Hua et al., 2026) identified?
- What is the real-world token-inflation multiplier on typical Claude Code sessions (1.0–1.35× is content-dependent — what's the distribution on code-heavy vs. prose-heavy inputs)?
- How does xhigh compare to max on coding evals? The migration guidance says "start with high or xhigh" — is max ever worth it for coding?
- What fraction of existing CLAUDE.md / system-prompt hedges become counterproductive under literal instruction following?
Claude Opus 4.8
- Public model ID and pricing: the card does not state them; presumably `claude-opus-4-8` at the Opus tier.
- Does the grader-speculation trend continue to escalate in the next model, and at what point does it begin to affect outward behavior?
- Why is 4.8 *less* robust to prompt injection than 4.7 despite broad alignment gains — a capability/robustness tradeoff, or an artifact of the eval surface?
Cowork
- How does Cowork's harness compare to [[claude-code]]'s? Both surface skills, MCP, sub-agents — but the failure modes for non-code output differ (no test suite, no compiler, no diff to review).
- What's the eval discipline for Cowork-class outputs? Cat Wu says memory benefits a lot from evals; unclear how slide-deck quality is measured.
FastContext
- Can the SFT+RL recipe push the explorer below 4B (1.7B / 0.6B) and make exploration effectively free?
- Does the gain transfer beyond Mini-SWE-Agent to richer harnesses with their own subagent orchestration?
Google DeepMind
- DeepMind reports its bespoke systems being caught by simple loops. Does the lab's comparative advantage move from *systems* to *models + verifiers + benchmarks* (mathlib, Formal Conjectures)?
- The paper opens AI-for-math; what's DeepMind's next target domain where a sound verifier exists?
Hermes Agent
- The container backend disabling dangerous-command checks is a defensible design but a meaningful security-model shift. What's the empirical track record? Have lockdown failures in popular images (Daytona, `nikolaik/python-nodejs`) caused incidents?
- How do bounded memory files (~2,200 chars `MEMORY.md`) hold up over long-term use? Auto-consolidation is mentioned but not specified — what's the consolidation algorithm and how lossy is it?
- Hermes's DM-pairing flow is a clean security primitive. Why hasn't this pattern been adopted by Claude Code or Cursor for shared/team deployments?
- The split between `AGENTS.md` (project) and `SOUL.md` (personality) is explicit in Hermes but implicit in Claude Code's `CLAUDE.md`. Does the split materially improve outcomes, or is it a documentation choice without empirical backing?
- Cron jobs in fresh sessions with no memory — how do teams structure the "context the agent needs" without it bloating every cron prompt? Is there a standard pattern?
Lean
- mathlib maturity gates the reachable frontier. Can AI formal proof search *grow* mathlib (formalize new theory) as a byproduct, expanding its own frontier?
- Lean is a perfect verifier for math. Which other domains have a comparably sound automatic verifier (vs. only noisy ones like tests or LLM-judge councils)?
Marcus Hutter
- AIXI is incomputable and non-embedded; how far do recent fixes (amortized predictors, embedded/multi-agent AIXI) carry the theory toward *practical* relevance for real ASI?
METR
- What new tasks will METR build to measure days- and weeks-long horizons once current baskets saturate?
- METR also ran the [study showing developer self-estimates of AI uplift are overstated](https://arxiv.org/pdf/2507.09089). How does it reconcile that skepticism with its own steep time-horizon curve?
Mythos Model
- Public release timeline: **answered** — Mythos Preview itself never shipped GA, but its descendants [[claude-fable-5|Fable 5]] / [[claude-mythos-5|Mythos 5]] reached general access in June 2026 (see *the descendants shipped* above). Both were suspended shortly after launch; whether and when they return is open.
- Capability profile beyond cybersecurity: Mythos Preview focused on the safety story; other capability dimensions not well-documented externally.
- Internal access controls: who at Anthropic actually uses Mythos for daily work, vs Opus 4.7? Boris implies infrequent (try-it use); not detailed.
Perplexity
- A vendor publishing a benchmark its own product wins is an obvious incentive problem — how is DRACO's credibility maintained as it ages, and will Perplexity actually run the automatable refresh?
- Perplexity depends on Anthropic (and others) for base models while competing with them on the end product — how durable is the orchestration advantage if base-model makers ship their own deep-research mode?
Shane Legg
- The report assumes alignment is "solved to a sufficient degree" to focus on trajectories — how does Legg's AGI-timelines optimism square with that scoping choice?
Symphony
- The **500% landed-PRs claim** is hedged — no baseline definition, "on some teams" only. What does the distribution look like across teams? What happens to PR *quality* and revert rate at that throughput?
- "Workspaces preserved across runs" is the opposite of typical CI ephemerality. At what point does state pollution from prior runs (stale `node_modules`, leftover branches, build artifacts) start hurting more than warm-cache helps?
- Symphony doesn't write to the tracker — agents do. This means tracker policy is a *prompt* in `WORKFLOW.md`. How brittle is this in practice when Linear changes its API? How is consistent state-machine behavior enforced when agents have prompt-level discretion?
- The spec was simplified by being implemented in 6 languages. What's the extension of this technique? Could `compiler-prompt.md` in this vault be similarly cross-fuzzed?
- Symphony explicitly says agents can self-create tickets. What governance prevents runaway ticket-graph expansion? Is human triage of agent-created tickets the only check?
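
The METR entry describes a time-horizon trendline that doubles roughly every 4 months, and one of its open questions concerns days- and weeks-long horizons. The arithmetic behind such an extrapolation can be sketched as below. The function names, the 1-hour starting horizon, and the assumption that the trend continues unchanged are illustrative only; none of them come from METR or the vault's sources.

```python
import math


def projected_horizon(current_hours: float, months_ahead: float,
                      doubling_months: float = 4.0) -> float:
    """Horizon after `months_ahead` months under a constant doubling period."""
    return current_hours * 2 ** (months_ahead / doubling_months)


def months_until(current_hours: float, target_hours: float,
                 doubling_months: float = 4.0) -> float:
    """Months needed for the horizon to grow from current to target."""
    return doubling_months * math.log2(target_hours / current_hours)


if __name__ == "__main__":
    # Hypothetical starting point: a 1-hour horizon today.
    start = 1.0
    print(projected_horizon(start, 12))   # three doublings in a year
    print(months_until(start, 40.0))      # months to reach a 40-hour work week
```

Under these assumptions a year corresponds to three doublings, so the horizon grows eightfold; the projection is only as good as the assumption that the doubling period stays constant.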