Summary
Andrew Ambrosino (OpenAI Codex) answers a question the wiki keeps circling — why is "this looks like AI design" still a putdown while AI writes production code? — with four reasons frontier models trail at visual/product design, two practical (and fading) and two harder. It's a sharp, first-hand articulation of where the verifiable-reward frontier stops: code has a clean grader ("does it compile, does it do what it's supposed to"); design's grader is human taste, which is expensive to put in a training loop.
Evidence note: practitioner-opinion. An OpenAI product leader's read, explicitly hedged ("I'm not in our research… I'll get yelled at for saying this"), not a research claim.
The four reasons
1. Design is hard to grade (the load-bearing one). "Creating a loop where you can train the model on what's good design and what's bad design is more tedious and onerous than 'does the code compile.'" Code carries its own verifier; design's verifier is the human aspect of taste, "part of the feedback mechanism you need." This is the verifiability thesis stated from the design side: capability advances fastest where reward is cheap and objective, and design's reward is neither.
2. It sat outside the AI-research flywheel. "Labs historically invest in making their models good at things that accelerate AI research." In the early coding-model era it was obvious that a model writing correct code would accelerate research; "you can't really make the same case for design." So design got less deliberate investment — not because it doesn't matter, but because it isn't in the self-improvement loop. (Practical; Ambrosino expects it to fade — "these models will get pretty good at design.")
3. Design rewards novelty; code rewards known patterns. "In software engineering you almost want it to over-index on known patterns." In design you don't: "there's an element of randomness and novelty." His example: for a year every new website was a copy of Linear's — "if a model outputs Linear's website every time, that's not the challenge here." Regression-to-the-mean is a feature for code and a failure for design. (Cf. Transformative Creativity / the abstraction-barrier critique: novel-concept generation may be a real ceiling, not just the next capability to fall.)
4. The design↔code abstraction layer (the deep one). Being a better visual designer isn't sufficient; there's "an interplay between the software design and the code being written" that is "visual design but significantly deeper — it's about the abstractions." His rebrand thought-experiment makes it concrete:
- Shallow version: "we have to update 263 components one by one."
- Deep version: understanding that "these two things look different but they're both in lists that have this style that conveys this interaction pattern to the user" — the semantic relationships between elements, not their pixels.
"That is still feeling a little bit out of reach with the current technology." This is the design-system-as-semantic-layer problem: real design competence lives in the maintainable abstraction between look and code, exactly the layer models are weakest on.
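The shallow and deep versions of the rebrand can be contrasted in a minimal sketch. Everything here is hypothetical and for illustration only: the role names, component names, colors, and the `rebrand` helper are assumptions, not anything Ambrosino or Codex describes. The point is only that when components refer to a shared semantic role rather than carrying their own pixel values, a rebrand becomes one edit to the role instead of one edit per component.

```python
from dataclasses import dataclass

# Hypothetical semantic roles: each names an interaction pattern, not a look.
SEMANTIC_TOKENS = {
    "list.item.selectable": {"radius": 4, "accent": "#3355FF"},
    "action.primary": {"radius": 6, "accent": "#3355FF"},
}


@dataclass(frozen=True)
class Component:
    name: str
    role: str  # key into a token table; the component stores no pixel values

    def style(self, tokens):
        # Look is resolved through the role at render time.
        return tokens[self.role]


# Two components that look different on screen but share one interaction role.
COMPONENTS = [
    Component("InboxRow", "list.item.selectable"),
    Component("SettingsMenuEntry", "list.item.selectable"),
    Component("SaveButton", "action.primary"),
]


def rebrand(tokens, role, **changes):
    """Return a new token table with one role restyled.

    No component is edited; the original table is left unmodified.
    """
    updated = dict(tokens)
    updated[role] = {**tokens[role], **changes}
    return updated


new_tokens = rebrand(SEMANTIC_TOKENS, "list.item.selectable", accent="#0A7F5A")
for component in COMPONENTS:
    print(component.name, component.style(new_tokens))
```

The hard part the source points at is not this lookup but the step before it: recognizing that `InboxRow` and `SettingsMenuEntry` belong to the same role in the first place.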
Why it matters
If reasons 1–2 are practical and fading but 3–4 are structural, then design will stay a pocket of human taste for longer than code did: the human "feedback mechanism" is not just labeling data, it is the reward function itself. It is also a caution against reading model-produced polish as competence. A model can emit a production-looking surface (reason 3's mean-reversion to "good-looking") while missing the semantic abstraction (reason 4) that makes the design actually maintainable.
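The asymmetry between the two graders can be made concrete with a toy sketch. This is an illustration under stated assumptions, not a description of how any lab trains models: `code_reward`, `design_reward`, the `solve` entry point, and the `ask_human` callback are all hypothetical names. The code grader runs with no person involved; the design grader cannot return a score until someone supplies a judgment.

```python
from typing import Callable


def code_reward(candidate_source: str, tests: list[tuple[int, int]]) -> float:
    """Automatic grader: fraction of (input, expected) tests the candidate passes.

    Assumes the candidate defines a one-argument function named `solve`
    (a hypothetical convention for this sketch).
    """
    if not tests:
        return 0.0
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # "does it compile"
        solve = namespace["solve"]
    except Exception:
        return 0.0
    passed = 0
    for arg, expected in tests:
        try:
            if solve(arg) == expected:  # "does it do what it's supposed to"
                passed += 1
        except Exception:
            pass
    return passed / len(tests)


def design_reward(
    candidate_a: str,
    candidate_b: str,
    ask_human: Callable[[str, str], str],
) -> tuple[float, float]:
    """Pairwise preference grader: one human judgment per comparison.

    `ask_human` is a hypothetical callback that returns the preferred candidate.
    """
    winner = ask_human(candidate_a, candidate_b)
    return (1.0, 0.0) if winner == candidate_a else (0.0, 1.0)


print(code_reward("def solve(x):\n    return x * 2", [(1, 2), (3, 6)]))
print(design_reward("layout A", "layout B", lambda a, b: b))
```

Every call to `design_reward` costs a person's attention, which is the sense in which the source calls taste "part of the feedback mechanism you need."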
Connections
- Andrew Ambrosino — articulates the four reasons
- The Verifiability Thesis — reason 1 is this thesis from the design side: design has no cheap objective grader
- Verification as the New Bottleneck — the general shape: capability races ahead where verification is cheap, stalls where it isn't
- Research Taste as the Human Bottleneck — design taste as a durable human residue; the human is the reward function, not just the labeler
- Jagged Intelligence (Ghosts, Not Animals) — design as a current valley of the jagged frontier (a thing AI fails at, per the optimistic read, until it doesn't)
- The Bitter Lesson / Build for the Next Model — "these models will get good at design" is the bitter-lesson bet; reasons 1–2 are the wait-for-the-model gaps, 3–4 the maybe-durable ones
- Living Design System — reason 4's abstraction layer is the design-system problem: semantics between components, not the components themselves
- Transformative Creativity — reason 3 (novelty premium) borders the harder claim that new-concept generation is a real ceiling
- Claude Design — the counter-effort: tooling aimed squarely at closing the model's design gap
Open Questions
- Are reasons 3–4 (novelty, the abstraction layer) genuine ceilings, or — like reasons 1–2 — just under-invested capabilities that fall once a lab builds the grader?
- Can design be made gradable without a human in the loop (learned taste models, preference data at scale), or does the "human aspect of taste" resist automation the way research taste might?
- Does the design↔code abstraction layer improve with better code-understanding models even if pure visual design stalls — i.e. is reason 4 a coding-capability problem in disguise?
Sources
- OpenAI Codex lead on the new shape of product work — Ambrosino's four-part answer on why frontier models lag at design
