Why AI Lags at Design

Sources#

OpenAI Codex lead on the new shape of product work

Summary#

Andrew Ambrosino (OpenAI Codex) answers a question the wiki keeps circling: why is "this looks like AI design" still a putdown while AI writes production code? He gives four reasons frontier models trail at visual and product design, two practical (and fading) and two harder. It's a sharp, first-hand account of where the verifiable-reward frontier stops: code has a clean grader ("does it compile, does it do what it's supposed to"), while design's grader is human taste, which is expensive to put in a training loop.

Evidence note: practitioner-opinion. This is an OpenAI product leader's read, explicitly hedged ("I'm not in our research… I'll get yelled at for saying this"), not a research claim.

The four reasons#

1. Design is hard to grade (the load-bearing one). "Creating a loop where you can train the model on what's good design and what's bad design is more tedious and onerous than 'does the code compile.'" Code carries its own verifier; design's verifier is the human aspect of taste, "part of the feedback mechanism you need." This is the verifiability thesis stated from the design side: capability advances fastest where reward is cheap and objective, and design's reward is neither.

2. It sat outside the AI-research flywheel. "Labs historically invest in making their models good at things that accelerate AI research." In the early coding-model era it was obvious that a model writing correct code would accelerate research; "you can't really make the same case for design." So design got less deliberate investment — not because it doesn't matter, but because it isn't in the self-improvement loop. (Practical; Ambrosino expects it to fade — "these models will get pretty good at design.")

3. Design rewards novelty; code rewards known patterns. "In software engineering you almost want it to over-index on known patterns." In design you don't: "there's an element of randomness and novelty." His example: for a year every new website was a copy of Linear's — "if a model outputs Linear's website every time, that's not the challenge here." Regression-to-the-mean is a feature for code and a failure for design. (Cf. Transformative Creativity / the abstraction-barrier critique: novel-concept generation may be a real ceiling, not just the next capability to fall.)

4. The design↔code abstraction layer (the deep one). Being a better visual designer isn't sufficient; there's "an interplay between the software design and the code being written" that is "visual design but significantly deeper — it's about the abstractions." His rebrand thought-experiment makes it concrete:

Shallow version: "we have to update 263 components one by one."
Deep version: understanding that "these two things look different but they're both in lists that have this style that conveys this interaction pattern to the user" — the semantic relationships between elements, not their pixels.

"That is still feeling a little bit out of reach with the current technology." This is the design-system-as-semantic-layer problem: real design competence lives in the maintainable abstraction between look and code, exactly the layer models are weakest on.

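The rebrand thought-experiment in reason 4 can be sketched in code. This is a minimal illustration under stated assumptions, not Ambrosino's method or any real design system's API: the token names, the `Component` class, and the `resolve` and `rebrand` helpers are all hypothetical. It shows the deep version, where two components that look unrelated share one semantic role, so a rebrand edits that role's tokens once instead of touching each component.

```python
from dataclasses import dataclass

# Hypothetical semantic layer: each token names a role, not a pixel value.
TOKENS = {
    "list.row.interactive.background": "#F5F5F7",
    "list.row.interactive.radius": 8,
}


@dataclass(frozen=True)
class Component:
    name: str
    role: str  # the semantic role this component plays for the user


def resolve(component: Component, tokens: dict) -> dict:
    """Collect the style values that belong to the component's role."""
    prefix = component.role + "."
    return {
        key[len(prefix):]: value
        for key, value in tokens.items()
        if key.startswith(prefix)
    }


def rebrand(tokens: dict, overrides: dict) -> dict:
    """Return a new token table; no component definition is edited."""
    return {**tokens, **overrides}


# Two components that look different on screen but share one role.
COMPONENTS = [
    Component("InboxRow", "list.row.interactive"),
    Component("SettingsItem", "list.row.interactive"),
]

new_tokens = rebrand(TOKENS, {"list.row.interactive.radius": 12})
styles = [resolve(c, new_tokens) for c in COMPONENTS]
```

The hard part the source points at is not this lookup but assigning the `role` in the first place: recognizing that two visually different elements convey the same interaction pattern.
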
Why it matters#

If reasons 1–2 are practical and fading but 3–4 are structural, then design will stay a pocket of human taste for longer than code did: the human "feedback mechanism" is not just labeling data, it is the reward function itself. It is also a caution against reading model-produced polish as competence: a model can emit a production-looking surface (reason 3's reversion to the "good-looking" mean) while missing the semantic abstraction (reason 4) that makes the design maintainable.
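
The asymmetry between the two graders can be made concrete with a small sketch. Everything here is an assumption for illustration, not a description of how any lab trains models: `code_reward`, `design_reward`, and the `ask_human` callback are hypothetical names. The point is only that the code grader runs unattended, while the design grader cannot return a score until a person has been asked.

```python
from typing import Callable


def code_reward(source: str, check: Callable[[dict], bool]) -> float:
    """Automatic grader: does the candidate compile and do what it should?

    Uses exec for brevity; a real setup would sandbox untrusted code.
    """
    namespace: dict = {}
    try:
        exec(compile(source, "<candidate>", "exec"), namespace)
        return 1.0 if check(namespace) else 0.0
    except Exception:
        # Syntax errors, runtime errors, and missing names all score zero.
        return 0.0


def design_reward(
    candidate_a: str,
    candidate_b: str,
    ask_human: Callable[[str, str], str],
) -> float:
    """Taste grader: the score exists only after a person picks a winner."""
    return 1.0 if ask_human(candidate_a, candidate_b) == candidate_a else 0.0


def adds_correctly(ns: dict) -> bool:
    return ns["add"](2, 3) == 5


good = "def add(a, b):\n    return a + b\n"
broken = "def add(a, b) return a + b\n"

good_score = code_reward(good, adds_correctly)
broken_score = code_reward(broken, adds_correctly)

# Stand-in for a human rater; in practice this call is the expensive step.
taste_score = design_reward("layout-a", "layout-b", lambda a, b: b)
```
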

Connections#

Andrew Ambrosino — articulates the four reasons
The Verifiability Thesis — reason 1 is this thesis from the design side: design has no cheap objective grader
Verification as the New Bottleneck — the general shape: capability races ahead where verification is cheap, stalls where it isn't
Research Taste as the Human Bottleneck — design taste as a durable human residue; the human is the reward function, not just the labeler
Jagged Intelligence (Ghosts, Not Animals) — design as a current valley of the jagged frontier (a thing AI fails at, per the optimistic read, until it doesn't)
The Bitter Lesson / Build for the Next Model — "these models will get good at design" is the bitter-lesson bet; reasons 1–2 are the wait-for-the-model gaps, 3–4 the maybe-durable ones
Living Design System — reason 4's abstraction layer is the design-system problem: semantics between components, not the components themselves
Transformative Creativity — reason 3 (novelty premium) borders the harder claim that new-concept generation is a real ceiling
Claude Design — the counter-effort: tooling aimed squarely at closing the model's design gap

Open Questions#

Are reasons 3–4 (novelty, the abstraction layer) genuine ceilings, or — like reasons 1–2 — just under-invested capabilities that fall once a lab builds the grader?
Can design be made gradable without a human in the loop (learned taste models, preference data at scale), or does the "human aspect of taste" resist automation the way research taste might?
Does the design↔code abstraction layer improve with better code-understanding models even if pure visual design stalls — i.e. is reason 4 a coding-capability problem in disguise?
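
For the second question, one standard way to turn pairwise preference data into a scalar score is a Bradley-Terry model. The sketch below is an assumption about what a "learned taste model" could look like at its simplest; the source does not say any lab grades design this way, and `fit_bradley_terry` and the toy comparison data are made up for illustration.

```python
import math


def fit_bradley_terry(n_items, comparisons, steps=500, lr=0.1):
    """Fit one score per item from (winner, loser) index pairs.

    Plain gradient ascent on the Bradley-Terry log-likelihood, where
    P(winner beats loser) = sigmoid(score[winner] - score[loser]).
    """
    scores = [0.0] * n_items
    for _ in range(steps):
        grads = [0.0] * n_items
        for winner, loser in comparisons:
            p_win = 1.0 / (1.0 + math.exp(scores[loser] - scores[winner]))
            grads[winner] += 1.0 - p_win
            grads[loser] -= 1.0 - p_win
        scores = [s + lr * g for s, g in zip(scores, grads)]
    return scores


# Toy data: three design candidates judged in pairs by hypothetical raters.
comparisons = [(0, 1), (0, 1), (1, 0), (0, 2), (0, 2), (1, 2), (2, 1)]
scores = fit_bradley_terry(3, comparisons)
ranking = sorted(range(3), key=lambda i: scores[i], reverse=True)
```

A model like this only compresses judgments people have already made, so the open question stands: whether such a score generalizes to novel designs or just encodes the existing mean.
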

Sources#

OpenAI Codex lead on the new shape of product work — Ambrosino's four-part answer on why frontier models lag at design