Howardism

Returns to Expertise in Agentic Coding

Published: June 17, 2026 · Filed: Concept · Domain: Governance & Workforce · Tags: Governance, Workforce, Human AI Collaboration, AI Coding Workflow, Empirical, Anthropic · Reading: 11 min · Source: AI-synthesised

Anthropic's 400K-session study: domain expertise (not coding skill) is what amplifies an agent — experts get 2× the actions and 5× the output per prompt, reach verified success ~2× as often, and abandon stuck sessions far less; every occupation lands within 7pp of software engineers; gains are concentrated novice→intermediate, with mastery adding little

Summary#

The headline finding of Anthropic's economic-research report Agentic coding and persistent returns to expertise (Hitzig, Massenkoff, Lyubich, Heller, McCrory, June 2026): what amplifies an AI coding agent is the user's domain expertise, not their coding proficiency. Across ~400,000 Claude Code sessions, the more a person understands the problem they are solving, the more work the agent does per instruction, the more often the session succeeds, and the more readily it recovers from trouble. Coding background, by contrast, barely matters: in code-producing sessions every major occupation lands within seven percentage points of software engineers. The report's one-line thesis — "coding agents are not substituting for domain expertise; the more understanding a worker brings to an agent, the more quality work the agent is able to do" — is the empirical confirmation of "you can outsource your thinking but not your understanding".

Evidence note. empirical — measured from a privacy-preserving (Clio) analysis of ~400,000 interactive sessions from ~235,000 people, Oct 2025–Apr 2026, with classifiers (Claude Sonnet 4.6) validated against automatic telemetry and regressions with controls and confidence intervals. Two honest caveats keep it short of a clean external benchmark: it is first-party — Anthropic measuring its own product, on its own telemetry, with its own classifiers (not independently reproducible) — and it excludes headless (claude -p), SDK, and third-party-IDE usage, a "substantial share" of real activity. Outcomes are transcript-inferred proxies, not observed real-world results.

Expertise is task-specific, not a résumé#

The expertise rating (a five-point novice→expert scale) is not job title or general ability. "A senior engineer asking their first Rust question is a beginner at Rust. An accountant who has never used Python, but tells Claude exactly which reconciliation rules a script must enforce and catches the edge case it mishandles at month-end close, is an expert at that task." The classifier reads three signals:

  1. Precision of framing — how specifically the user directs the work.
  2. What they ask Claude to verify — experts specify the checks that define "done."
  3. Who corrects whom — does the user correct Claude, or does Claude correct the user.

This is exactly the understanding residue made measurable: the signals are the surface evidence of an internal model good enough to direct, judge, and verify.

The amplification: experts get more agent per prompt#

Expertise scales how much autonomous work each human prompt sets off (the action chain):

| User expertise | Actions per prompt | Output per prompt |
| --- | --- | --- |
| Novice | ~5 | ~600 words |
| Expert | ~12 | ~3,200 words |

More than twice the actions, five times the output. The gap holds within every work mode and every task-value band, and survives a regression controlling for work mode, task value, month, occupation, and model family: +9% actions and +13% output per expertise level (p < 0.001 at each adjacent step). An expert isn't just luckier — each instruction they give safely unlocks a longer leash.
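The two sets of figures above (raw table values and adjusted per-level effects) can be compared with a little arithmetic. The sketch below is illustrative only. It assumes, and this is an assumption rather than something the report states here, that the per-level regression effect compounds multiplicatively across the four steps of the five-point scale.

```python
# Raw novice-to-expert ratios, taken from the approximate table figures.
novice = {"actions": 5, "words": 600}
expert = {"actions": 12, "words": 3200}

raw_action_ratio = expert["actions"] / novice["actions"]  # 12 / 5
raw_output_ratio = expert["words"] / novice["words"]      # 3200 / 600

# Adjusted per-level effects quoted from the regression: +9% actions, +13% output.
# ASSUMPTION: effects compound multiplicatively over the 4 steps between
# level 1 (novice) and level 5 (expert). The exact model form is not given.
STEPS = 4
adj_action_ratio = 1.09 ** STEPS
adj_output_ratio = 1.13 ** STEPS

print(f"raw:      {raw_action_ratio:.2f}x actions, {raw_output_ratio:.2f}x output")
print(f"adjusted: {adj_action_ratio:.2f}x actions, {adj_output_ratio:.2f}x output")
```

Under that assumption the adjusted novice-to-expert gap is roughly 1.4× actions and 1.6× output, smaller than the raw 2.4× and 5.3×. That would be consistent with part of the raw gap reflecting the controlled-for factors (work mode, task value, occupation, and so on), though the report's own specification should be checked before leaning on these numbers.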

The success gradient#

The more expertise a session exhibits, the more likely it succeeds, on every measure. Success is defined two ways: judged success (a classifier reads the transcript and decides whether the person got what they set out to do) and the stricter verified success (judged-successful and at least one hard external signal — a matching commit/PR, passing tests, or explicit user affirmation).

| Expertise | Verified success | At least partial success |
| --- | --- | --- |
| Novice | 15% | 77% |
| Intermediate–Expert | 28–33% | 91–92% |
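The two success definitions can be written as simple predicates. This is a minimal sketch, not the study's pipeline: the `Session` record and its field names are hypothetical, chosen only to mirror the hard signals listed above.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """Hypothetical per-session record; field names are illustrative."""
    judged_success: bool                 # classifier's reading of the transcript
    matching_commit_or_pr: bool = False  # hard external signal
    tests_passed: bool = False           # hard external signal
    user_affirmed: bool = False          # explicit affirmation in the transcript


def has_hard_signal(s: Session) -> bool:
    """True if at least one hard external signal is present."""
    return s.matching_commit_or_pr or s.tests_passed or s.user_affirmed


def verified_success(s: Session) -> bool:
    """Stricter measure: judged successful AND at least one hard signal."""
    return s.judged_success and has_hard_signal(s)


print(verified_success(Session(judged_success=True, tests_passed=True)))
print(verified_success(Session(judged_success=True)))
```

The second call returns `False`: a session can be judged successful without counting as verified, which is why verified rates sit well below partial-success rates in the table.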

The curve is concave: most of the gain is novice→intermediate; intermediate→expert is modest. A working grasp of the domain captures most of the benefit; deep mastery adds only a little. (These are adjusted rates — comparing sessions of the same work mode, value band, month, subject, and occupation type.)
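A rough arithmetic check of the concavity claim, using the verified-success figures from the table. It assumes the ends of the "28–33%" range correspond to intermediate and expert respectively. The table only gives the range, so that mapping is an assumption made for illustration.

```python
# Verified-success rates in percentage points.
# ASSUMPTION: 28 is the intermediate rate and 33 the expert rate;
# the source table only reports the combined range "28-33%".
novice, intermediate, expert = 15, 28, 33

first_step = intermediate - novice   # gain from novice to intermediate
second_step = expert - intermediate  # gain from intermediate to expert
share_first = first_step / (expert - novice)

print(f"novice->intermediate: +{first_step}pp")
print(f"intermediate->expert: +{second_step}pp")
print(f"share of total gain in the first step: {share_first:.0%}")
```

On these assumed endpoints, about 72% of the total gain arrives by the intermediate level, which matches the "most of the gain is novice→intermediate" reading.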

Two corollaries about recovery, which is where expertise earns its keep:

  • Recovering from trouble. Among sessions that "hit trouble" (verified failure signals — errors, failed tests, repeated attempts, user frustration), verified success rises from 4% (novice) to 15% (expert); partial success from 60% to 80–81%. Part of the value of expertise is the ability to steer a struggling agent back on course. (Caveat: experts hit trouble less often, so their troubled sessions are on harder problems — estimated task value of a troubled session roughly doubles from novice to expert — so some of the recovery gap reflects novices stuck on routine problems vs. experts stuck on genuinely hard ones.)
  • Abandonment. A troubled session is abandoned when it is judged failed and zero lines of code were written. 19% of novice sessions end abandoned, against 5–7% for everyone else. The least experienced give up when stuck.
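The abandonment rule in the second bullet can be sketched as a predicate plus a rate over troubled sessions. The record type, field names, and sample data are hypothetical and exist only to make the definition concrete.

```python
from dataclasses import dataclass


@dataclass
class TroubledSession:
    """Hypothetical record for a session that hit trouble."""
    judged_failed: bool
    lines_of_code_written: int


def is_abandoned(s: TroubledSession) -> bool:
    """Abandoned = judged failed AND zero lines of code were written."""
    return s.judged_failed and s.lines_of_code_written == 0


def abandonment_rate(sessions: list[TroubledSession]) -> float:
    """Fraction of the given sessions that count as abandoned."""
    if not sessions:
        return 0.0
    return sum(is_abandoned(s) for s in sessions) / len(sessions)


# Made-up sample, purely illustrative (not study data).
sample = [
    TroubledSession(judged_failed=True, lines_of_code_written=0),   # abandoned
    TroubledSession(judged_failed=True, lines_of_code_written=40),  # failed, but code was written
    TroubledSession(judged_failed=False, lines_of_code_written=0),  # not judged failed
    TroubledSession(judged_failed=False, lines_of_code_written=120),
]
print(abandonment_rate(sample))
```

Note that a failed session in which some code was written does not count as abandoned under this definition, so the measure captures giving up early rather than failing after an attempt.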

Occupation matters less than expertise#

The complementary half of the story, and the strongest evidence for software democratization: a coding background is becoming less relevant to coding success. Occupation is inferred (the classifier is explicitly told not to treat the act of coding as evidence of a coding job — a lawyer scripting contract-clause checks is mapped to Legal, not Software).

  • Software-related occupations reach verified success in ~30% of sessions overall; other professions in ~26%. In code-producing sessions the figures are 34% vs 29%, with partial success at 89% vs 88%.
  • Every one of the ten largest occupations lands within seven points of software engineers on verified success in code-producing sessions; the five-point software/non-software gap has neither widened nor narrowed over seven months.
  • Management occupations edge out software engineers on verified success. The report's reading: management skills (delegating, specifying, confirming) transfer to directing an agent — "perhaps acting like a manager confers greater success." (Measurement caveat: verified success partly rests on explicit in-transcript confirmation, and managers may simply say when they got what they asked for.) This is the constructive counterpart to HBR's accountability critique — the skill of bounded delegation helps; the org-chart framing of agents-as-employees is what backfires.

What it means for the labor market#

The report frames itself as an early read on knowledge-work transitions. Two readings, in tension only superficially:

  • Substituting for coding skill. Implementation-heavy work that used to require a coding background is being absorbed; "a coding background [is becoming] less relevant to successful programming." This is the floor rising — "a person with command of a domain, in any field, may now be able to do technical work they previously could not."
  • Rewarding domain understanding. Simultaneously, the gains accrue to whoever brings the firmer grasp of the problem. "A person without any such expertise will get far less from the same tool." This is the residual human comparative advantage showing up in usage data: not coding, but knowing what to build and being able to verify it.

The report names the metric to watch: if the returns to expertise begin to decrease over time, that signals models are starting to supply the judgment users currently bring — i.e., taste becoming "just another capability". As of this data, the returns are persistent.

Connections#

  • Implementation Abundance Inverts Product Work — the product-process face of "judgment outlasts cheap execution": as implementation cheapens, curation/taste becomes the expensive step
  • Role Averaging, Not Role Elimination — the empirical backbone of "specialties don't disappear": domain expertise still decides success even as roles average
  • Outsource Your Thinking, Not Your Understanding — this is the empirical proof of Karpathy's thesis: success tracks understanding of the problem, not the ability to type code; the non-delegable residue, now measured
  • Printing Press Software Democratization: Cherny's "the best person to write accounting software is a good accountant, because coding is the easy part" is exactly the every-occupation-within-7pp finding; this is the hard data the analogy was waiting for
  • Vibe Coding vs. Agentic Engineering — "floor up, ceiling held": occupation-doesn't-matter is the floor rising; expertise-still-decides is the bar that stays
  • Research Taste as the Human Bottleneck — the "if returns to expertise decrease, the model is supplying judgment" test is the labor-data version of "is taste a durable moat or the next jagged valley?"
  • Planning / Execution Division of Labor — the mechanism of amplification: expert framing safely lengthens the action chain Claude runs per prompt (5→12 actions)
  • Agentic Coding Work-Composition Shift — the companion finding from the same study: what the work is and how it shifts over the seven months
  • AI Employee Framing — managers' edge here (delegation skill transfers) is the constructive flip side of HBR's warning (employee framing diffuses accountability); skill helps, org-chart symbolism hurts
  • Verification as the New Bottleneck — "what they ask Claude to verify" is one of the three expertise signals; the expert is the one who can specify and check, which is exactly the bottleneck role
  • Engineer PM Convergence — domain/product understanding as the bottleneck skill, seen in the success data
  • Jagged Intelligence (Ghosts, Not Animals) — experts recover from the agent's spiky failures; novices abandon — staying in the loop pays measurable dividends
  • Claude Code — the product the entire study measures
  • METR — the report cites METR's time-horizon ceiling as the capability frontier this usage sits below
  • Conversation-to-Delegation Shift — OpenAI's Codex study cites this report (Hitzig et al. 2026) and reaches the same conclusion from usage data: as work becomes delegation, the binding skill is domain understanding + supervision, not execution
  • Organizational Complements to AI — the cited Hitzig et al. argument restated as economics: supervision/verification/coordination and domain expertise are the binding complements that gate AI's value
  • Exposure Taxonomy: Observed, Theoretical, Reported, Anticipated — the AEI Cadences survey confirms this from the worker's mouth: 15+-year workers report AI can do ~10pp less of their work, naming judgment, context, and relational work as what AI can't touch — tacit expertise as the residual
  • The Automation–Optimism Link — the mirror gradient: experienced workers are more skeptical of AI's reach, yet heavy delegators are the most optimistic — expertise and enthusiasm pull in opposite directions
  • Conversation Artifacts — "the human stays involved in high-value work" (more turns and more Claude output as wages rise) is the augmentation reading of expertise amplifying the agent
  • AI Usage Cadences — off-hours Claude work skewing to higher-wage occupations is consistent with expertise-heavy work being where AI lands first
  • Anthropic Economic Index — the research program this study belongs to; Cadences is its next report

Open questions#

  • The forward test the report itself names: do the returns to expertise persist, narrow, or invert as models improve? A decrease would mean models are absorbing the judgment users currently supply.
  • Outcomes are transcript-inferred (verified success leans on git activity + explicit affirmation). How much of the management edge — and the whole success gradient — is real outcome vs. who-narrates-success-in-the-transcript?
  • The study excludes headless / SDK / IDE usage (a "substantial share"). Does the returns-to-expertise pattern hold in non-interactive and pipeline use, where there is no human steering mid-session at all?
  • Is "intermediate captures most of the benefit" stable, or an artifact of current model capability — i.e., will the concave curve flatten further (everyone converges) or steepen (mastery starts to separate again) as models get better?

About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.
