Plate IIGovernance & WorkforceHOWARDISM

Returns to Expertise in Agentic Coding

PublishedJune 17, 2026FiledConceptDomainGovernance & WorkforceTagsGovernance Workforce Human AI Collaboration AI Coding Workflow Empirical AnthropicReading11 minSourceAI-synthesised

Anthropic's 400K-session study: domain expertise (not coding skill) is what amplifies an agent — experts get 2× the actions and 5× the output per prompt, reach verified success ~2× as often, and abandon stuck sessions far less; every occupation lands within 7pp of software engineers; gains are concentrated novice→intermediate, with mastery adding little

Illustration for Returns to Expertise in Agentic Coding

Sources#

Agentic coding and persistent returns to expertise

Summary#

The headline finding of Anthropic's economic-research report Agentic coding and persistent returns to expertise (Hitzig, Massenkoff, Lyubich, Heller, McCrory, June 2026): what amplifies an AI coding agent is the user's domain expertise, not their coding proficiency. Across ~400,000 Claude Code sessions, the more a person understands the problem they are solving, the more work the agent does per instruction, the more often the session succeeds, and the more readily it recovers from trouble. Coding background, by contrast, barely matters: in code-producing sessions every major occupation lands within seven percentage points of software engineers. The report's one-line thesis — "coding agents are not substituting for domain expertise; the more understanding a worker brings to an agent, the more quality work the agent is able to do" — is the empirical confirmation of "you can outsource your thinking but not your understanding".

Evidence note. empirical — measured from a privacy-preserving (Clio) analysis of ~400,000 interactive sessions from ~235,000 people, Oct 2025–Apr 2026, with classifiers (Claude Sonnet 4.6) validated against automatic telemetry and regressions with controls and confidence intervals. Two honest caveats keep it short of a clean external benchmark: it is first-party — Anthropic measuring its own product, on its own telemetry, with its own classifiers (not independently reproducible) — and it excludes headless (claude -p), SDK, and third-party-IDE usage, a "substantial share" of real activity. Outcomes are transcript-inferred proxies, not observed real-world results.

Expertise is task-specific, not a résumé#

The expertise rating (a five-point novice→expert scale) is not job title or general ability. "A senior engineer asking their first Rust question is a beginner at Rust. An accountant who has never used Python, but tells Claude exactly which reconciliation rules a script must enforce and catches the edge case it mishandles at month-end close, is an expert at that task." The classifier reads three signals:

Precision of framing — how specifically the user directs the work.
What they ask Claude to verify — experts specify the checks that define "done."
Who corrects whom — does the user correct Claude, or does Claude correct the user.

This is exactly the understanding residue made measurable: the signals are the surface evidence of an internal model good enough to direct, judge, and verify.

The amplification: experts get more agent per prompt#

Expertise scales how much autonomous work each human prompt sets off (the action chain):

User expertise	Actions per prompt	Output per prompt
Novice	~5	~600 words
Expert	~12	~3,200 words

More than twice the actions, five times the output. The gap holds within every work mode and every task-value band, and survives a regression controlling for work mode, task value, month, occupation, and model family: +9% actions and +13% output per expertise level (p < 0.001 at each adjacent step). An expert isn't just luckier — each instruction they give safely unlocks a longer leash.

The success gradient#

The more expertise a session exhibits, the more likely it succeeds, on every measure. Success is defined two ways: judged success (a classifier reads the transcript and decides whether the person got what they set out to do) and the stricter verified success (judged-successful and at least one hard external signal — a matching commit/PR, passing tests, or explicit user affirmation).

Expertise	Verified success	At least partial success
Novice	15%	77%
Intermediate–Expert	28–33%	91–92%

The curve is concave: most of the gain is novice→intermediate; intermediate→expert is modest. A working grasp of the domain captures most of the benefit; deep mastery adds only a little. (These are adjusted rates — comparing sessions of the same work mode, value band, month, subject, and occupation type.)

Two corollaries about recovery, which is where expertise earns its keep:

Recovering from trouble. Among sessions that "hit trouble" (verified failure signals — errors, failed tests, repeated attempts, user frustration), verified success rises from 4% (novice) to 15% (expert); partial success from 60% to 80–81%. Part of the value of expertise is the ability to steer a struggling agent back on course. (Caveat: experts hit trouble less often, so their troubled sessions are on harder problems — estimated task value of a troubled session roughly doubles from novice to expert — so some of the recovery gap reflects novices stuck on routine problems vs. experts stuck on genuinely hard ones.)
Abandonment. A troubled session is abandoned when it is judged failed and zero lines of code were written. 19% of novice sessions end abandoned, against 5–7% for everyone else. The least experienced give up when stuck.

Occupation matters less than expertise#

The complementary half of the story, and the strongest evidence for software democratization: a coding background is becoming less relevant to coding success. Occupation is inferred (the classifier is explicitly told not to treat the act of coding as evidence of a coding job — a lawyer scripting contract-clause checks is mapped to Legal, not Software).

Software-related occupations reach verified success ~30% of sessions overall; other professions ~26%. In code-producing sessions, 34% vs 29% — and partial success 89% vs 88%.
Every one of the ten largest occupations lands within seven points of software engineers on verified success in code-producing sessions; the five-point software/non-software gap has neither widened nor narrowed over seven months.
Management occupations edge out software engineers on verified success. The report's reading: management skills (delegating, specifying, confirming) transfer to directing an agent — "perhaps acting like a manager confers greater success." (Measurement caveat: verified success partly rests on explicit in-transcript confirmation, and managers may simply say when they got what they asked for.) This is the constructive counterpart to HBR's accountability critique — the skill of bounded delegation helps; the org-chart framing of agents-as-employees is what backfires.

What it means for the labor market#

The report frames itself as an early read on knowledge-work transitions. Two readings, in tension only superficially:

Substituting for coding skill. Implementation-heavy work that used to require a coding background is being absorbed; "a coding background [is becoming] less relevant to successful programming." This is the floor rising — "a person with command of a domain, in any field, may now be able to do technical work they previously could not."
Rewarding domain understanding. Simultaneously, the gains accrue to whoever brings the firmer grasp of the problem. "A person without any such expertise will get far less from the same tool." This is the residual human comparative advantage showing up in usage data: not coding, but knowing what to build and being able to verify it.

The report names the metric to watch: if the returns to expertise begin to decrease over time, that signals models are starting to supply the judgment users currently bring — i.e., taste becoming "just another capability". As of this data, the returns are persistent.

Connections#

Implementation Abundance Inverts Product Work — the product-process face of "judgment outlasts cheap execution": as implementation cheapens, curation/taste becomes the expensive step
Role Averaging, Not Role Elimination — the empirical backbone of "specialties don't disappear": domain expertise still decides success even as roles average
Outsource Your Thinking, Not Your Understanding — this is the empirical proof of Karpathy's thesis: success tracks understanding of the problem, not the ability to type code; the non-delegable residue, now measured
Printing Press Software Democratization — Cherny's "the best person to write accounting software is a good accountant, because coding is the easy part" is exactly the every-occupation-within-7pp finding; this is the hard data the analogy was waiting for
Vibe Coding vs. Agentic Engineering — "floor up, ceiling held": occupation-doesn't-matter is the floor rising; expertise-still-decides is the bar that stays
Research Taste as the Human Bottleneck — the "if returns to expertise decrease, the model is supplying judgment" test is the labor-data version of "is taste a durable moat or the next jagged valley?"
Planning / Execution Division of Labor — the mechanism of amplification: expert framing safely lengthens the action chain Claude runs per prompt (5→12 actions)
Agentic Coding Work-Composition Shift — the companion finding from the same study: what the work is and how it shifts over the seven months
AI Employee Framing — managers' edge here (delegation skill transfers) is the constructive flip side of HBR's warning (employee framing diffuses accountability); skill helps, org-chart symbolism hurts
Verification as the New Bottleneck — "what they ask Claude to verify" is one of the three expertise signals; the expert is the one who can specify and check, which is exactly the bottleneck role
Engineer PM Convergence — domain/product understanding as the bottleneck skill, seen in the success data
Jagged Intelligence (Ghosts, Not Animals) — experts recover from the agent's spiky failures; novices abandon — staying in the loop pays measurable dividends
Claude Code — the product the entire study measures
METR — the report cites METR's time-horizon ceiling as the capability frontier this usage sits below
Conversation-to-Delegation Shift — OpenAI's Codex study cites this report (Hitzig et al. 2026) and reaches the same conclusion from usage data: as work becomes delegation, the binding skill is domain understanding + supervision, not execution
Organizational Complements to AI — the cited Hitzig et al. argument restated as economics: supervision/verification/coordination and domain expertise are the binding complements that gate AI's value
Exposure Taxonomy: Observed, Theoretical, Reported, Anticipated — the AEI Cadences survey confirms this from the worker's mouth: 15+-year workers report AI can do ~10pp less of their work, naming judgment, context, and relational work as what AI can't touch — tacit expertise as the residual
The Automation–Optimism Link — the mirror gradient: experienced workers are more skeptical of AI's reach, yet heavy delegators are the most optimistic — expertise and enthusiasm pull in opposite directions
Conversation Artifacts — "the human stays involved in high-value work" (more turns and more Claude output as wages rise) is the augmentation reading of expertise amplifying the agent
AI Usage Cadences — off-hours Claude work skewing to higher-wage occupations is consistent with expertise-heavy work being where AI lands first
Anthropic Economic Index — the research program this study belongs to; Cadences is its next report

Open questions#

The forward test the report itself names: do the returns to expertise persist, narrow, or invert as models improve? A decrease would mean models are absorbing the judgment users currently supply.
Outcomes are transcript-inferred (verified success leans on git activity + explicit affirmation). How much of the management edge — and the whole success gradient — is real outcome vs. who-narrates-success-in-the-transcript?
The study excludes headless / SDK / IDE usage (a "substantial share"). Does the returns-to-expertise pattern hold in non-interactive and pipeline use, where there is no human steering mid-session at all?
Is "intermediate captures most of the benefit" stable, or an artifact of current model capability — i.e., will the concave curve flatten further (everyone converges) or steepen (mastery starts to separate again) as models get better?

Derived#

Learning to Co-Work with AI: A Software Engineer's Field Guide — the field guide's "domain depth + taste" skill thesis is what this study measures directly

Sources#

Agentic coding and persistent returns to expertise — Anthropic Economic Research, June 2026; §"The returns to expertise", §"Occupation may matter less than expertise", §"Looking ahead"

§ end

About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

Cited by 23

Agentic Coding Work-Composition Shift
Anthropic's 400K-session telemetry, Oct 2025→Apr 2026: as models improved, the share of sessions fixing broken code fel…
AI Employee Framing
Kropp et al. (HBR May 2026, n=1,261): framing AI agents as "employees" vs "tools" cuts personal accountability −9pp, in…
AI Usage Cadences
AEI Cadences report: continuous hourly telemetry reveals AI usage carries the rhythms of daily life — personal use spik…
Anthropic
AI safety company / vendor of Claude; mission-as-tiebreaker culture; ~30–40 PMs across teams; Mike Krieger leads Labs r…
Anthropic Economic Index
Anthropic's recurring economic-research program measuring how Claude usage maps to and diffuses through the economy — p…
The Automation–Optimism Link
AEI Cadences survey finding: people who use Claude in more automated ways are MORE optimistic across all six job-qualit…
Claude Code
Anthropic's agentic coding product; created by Boris Cherny late 2024; TypeScript/React; CLI/desktop/web/mobile/IDE sur…
Conversation Artifacts
AEI Cadences report: the 'artifact' (the primary output a user takes away) as a new unit of economic analysis — 93% of…
Conversation-to-Delegation Shift
OpenAI's Codex usage study (June 2026): the move from conversational AI ('asking') to agentic AI ('delegated production…
Engineer PM Convergence
Generalists across disciplines; product taste as bottleneck skill; Anthropic Claude Code team as case study; "just do t…
Exposure Taxonomy: Observed, Theoretical, Reported, Anticipated
Four distinct ways to measure AI's reach into an occupation — observed exposure (tasks seen done with Claude), theoreti…
Implementation Abundance Inverts Product Work
Andrew Ambrosino's inversion thesis: when talking to a frontier model can stand up any feature from scratch, implementa…
Jagged Intelligence (Ghosts, Not Animals)
"Ghosts not animals": jagged statistical circuits, no intrinsic motivation; car-wash/strawberry failures; stay in the l…
Governance & Workforce
Map of Content for the governance-workforce domain — 16 concepts. Curated entry point; see Home for all domains.
Open Questions Backlog
_124 pages with open questions, as of 2026-06-19._
OpenAI
AI lab and maker of the GPT-5 series and Codex; in this corpus it appears as a frontier-safety research source (Deploym…
Organizational Complements to AI
The general-purpose-technology argument that AI's productivity gains depend on complementary workflow/skill/org-design…
Outsource Your Thinking, Not Your Understanding
"You can outsource your thinking but not your understanding"; understanding as the non-delegable human bottleneck; know…
Planning / Execution Division of Labor
Anthropic's 400K-session telemetry: in a typical Claude Code session humans make ~70% of planning decisions (what to do…
Printing Press Software Democratization
Boris Cherny's analogy: 1400s literacy expansion → AI software-writing expansion; domain knowledge displaces coding ski…
Research Taste as the Human Bottleneck
The narrowing human role as AI absorbs execution: choosing which problems matter, which results to trust, and when an a…
Role Averaging, Not Role Elimination
Andrew Ambrosino's nuanced OpenAI-side take on role collapse: your role is 'the average of what you spend your time on'…
Vibe Coding vs. Agentic Engineering
Vibe coding raises the floor (anyone builds); agentic engineering preserves the quality bar while going faster; ">10x a…

Conversation-to-Delegation Shift
OpenAI's Codex usage study (June 2026): the move from conversational AI ('asking') to agentic AI ('delegated production…
Engineer PM Convergence
Generalists across disciplines; product taste as bottleneck skill; Anthropic Claude Code team as case study; "just do t…
Compute Allocator
The human's evolving role: deciding what's worth spending compute on; ~1% of generated tokens ship, 99% is scaffolding…
Claude Code
Anthropic's agentic coding product; created by Boris Cherny late 2024; TypeScript/React; CLI/desktop/web/mobile/IDE sur…
Harness Shrinkage as Models Improve
Prompt scaffolding shrinks each model release; Cat Wu's pruning discipline; Boris Cherny "100 lines of code a year from…

Conversation-to-Delegation Shift
OpenAI's Codex usage study (June 2026): the move from conversational AI ('asking') to agentic AI ('delegated production…
Engineer PM Convergence
Generalists across disciplines; product taste as bottleneck skill; Anthropic Claude Code team as case study; "just do t…
Compute Allocator
The human's evolving role: deciding what's worth spending compute on; ~1% of generated tokens ship, 99% is scaffolding…
Claude Code
Anthropic's agentic coding product; created by Boris Cherny late 2024; TypeScript/React; CLI/desktop/web/mobile/IDE sur…
Harness Shrinkage as Models Improve
Prompt scaffolding shrinks each model release; Cat Wu's pruning discipline; Boris Cherny "100 lines of code a year from…

Cited by 23

Agentic Coding Work-Composition Shift
Anthropic's 400K-session telemetry, Oct 2025→Apr 2026: as models improved, the share of sessions fixing broken code fel…
AI Employee Framing
Kropp et al. (HBR May 2026, n=1,261): framing AI agents as "employees" vs "tools" cuts personal accountability −9pp, in…
AI Usage Cadences
AEI Cadences report: continuous hourly telemetry reveals AI usage carries the rhythms of daily life — personal use spik…
Anthropic
AI safety company / vendor of Claude; mission-as-tiebreaker culture; ~30–40 PMs across teams; Mike Krieger leads Labs r…
Anthropic Economic Index
Anthropic's recurring economic-research program measuring how Claude usage maps to and diffuses through the economy — p…
The Automation–Optimism Link
AEI Cadences survey finding: people who use Claude in more automated ways are MORE optimistic across all six job-qualit…
Claude Code
Anthropic's agentic coding product; created by Boris Cherny late 2024; TypeScript/React; CLI/desktop/web/mobile/IDE sur…
Conversation Artifacts
AEI Cadences report: the 'artifact' (the primary output a user takes away) as a new unit of economic analysis — 93% of…
Conversation-to-Delegation Shift
OpenAI's Codex usage study (June 2026): the move from conversational AI ('asking') to agentic AI ('delegated production…
Engineer PM Convergence
Generalists across disciplines; product taste as bottleneck skill; Anthropic Claude Code team as case study; "just do t…
Exposure Taxonomy: Observed, Theoretical, Reported, Anticipated
Four distinct ways to measure AI's reach into an occupation — observed exposure (tasks seen done with Claude), theoreti…
Implementation Abundance Inverts Product Work
Andrew Ambrosino's inversion thesis: when talking to a frontier model can stand up any feature from scratch, implementa…
Jagged Intelligence (Ghosts, Not Animals)
"Ghosts not animals": jagged statistical circuits, no intrinsic motivation; car-wash/strawberry failures; stay in the l…
Governance & Workforce
Map of Content for the governance-workforce domain — 16 concepts. Curated entry point; see Home for all domains.
Open Questions Backlog
_124 pages with open questions, as of 2026-06-19._
OpenAI
AI lab and maker of the GPT-5 series and Codex; in this corpus it appears as a frontier-safety research source (Deploym…
Organizational Complements to AI
The general-purpose-technology argument that AI's productivity gains depend on complementary workflow/skill/org-design…
Outsource Your Thinking, Not Your Understanding
"You can outsource your thinking but not your understanding"; understanding as the non-delegable human bottleneck; know…
Planning / Execution Division of Labor
Anthropic's 400K-session telemetry: in a typical Claude Code session humans make ~70% of planning decisions (what to do…
Printing Press Software Democratization
Boris Cherny's analogy: 1400s literacy expansion → AI software-writing expansion; domain knowledge displaces coding ski…
Research Taste as the Human Bottleneck
The narrowing human role as AI absorbs execution: choosing which problems matter, which results to trust, and when an a…
Role Averaging, Not Role Elimination
Andrew Ambrosino's nuanced OpenAI-side take on role collapse: your role is 'the average of what you spend your time on'…
Vibe Coding vs. Agentic Engineering
Vibe coding raises the floor (anyone builds); agentic engineering preserves the quality bar while going faster; ">10x a…