Conversation-to-Delegation Shift

Published: June 26, 2026
Filed: Concept
Domain: Governance & Workforce
Tags: Governance Workforce AI Coding Workflow Engineering Metrics Empirical Openai
Reading: 10 min
Source: AI-synthesised

OpenAI's Codex usage study (June 2026): the move from conversational AI ('asking') to agentic AI ('delegated production'), measured by Codex's share of output tokens across three populations — 99.8% OpenAI / 63.3% organizational / 16.5% individual — with adoption spreading beyond developers; standard usage metrics (active users, chats) become less informative as the unit shifts from a conversation to a delegated workflow

Summary

The central thesis of OpenAI's The Shift to Agentic AI: Evidence from Codex (Johnston, Holtz, Richmond, Ong, Tambe, Chatterji, June 2026): agentic AI is not a more capable chatbot — it is a different mode of use. Conversational systems (ChatGPT) are used to exchange information: you ask, the model answers. Agentic systems (Codex) are used to carry out work: you delegate a task, the system inspects context, uses tools, and modifies artifacts on your behalf. The paper measures this transition across three populations — individual users, organizational users, and OpenAI's own workers — and finds it is rapid, uneven, and best seen through output tokens, not user counts. This is the Codex-side, cross-population complement to Anthropic's within-Claude-Code composition shift: the same "from asking toward doing" movement, measured by a second frontier lab on a second tool.

Evidence note. Empirical: drawn from a privacy-preserving, automated-classifier pipeline over Codex usage logs (researchers never read raw messages), covering three populations. Growth figures are indexed, and the analyses run on samples (3–4% of users; 0.1% for the complexity analysis). Two honest caveats: it is first-party (OpenAI measuring its own product and workers), and OpenAI-internal usage is explicitly not representative of the typical organization; the authors frame it as a frontier preview of low-friction adoption, not a population estimate. Outcomes are classifier-inferred.

The intensive margin: output tokens, not active users

The paper's methodological pivot is that counting users understates the shift. Many people still open ChatGPT; the ones who adopt Codex use it far more intensively. So the headline measure is Codex's share of output tokens generated across Codex and ChatGPT in the trailing 28 days (as of June 11, 2026):

Population	Codex token share	28-day Codex adoption (extensive margin)
Individual users	16.5%	<1% of active users
Organizational users	63.3%	17.3% of users
OpenAI workers	99.8%	near-universal

The extensive/intensive gap is the story: fewer than 1% of individual active users touched Codex in 28 days, yet they account for 16.5% of individual output tokens — the adopters are unusually heavy. Within OpenAI, Codex has largely replaced ChatGPT as the interface for work. Aggregate weekly-active Codex usage grew more than fivefold between Jan 1 and Jun 1, 2026, and the fastest growth is among non-developers, not the original software-developer base.
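The gap between the two margins can be illustrated with a minimal sketch. The record layout (`UserUsage`) and the toy population are assumptions made for illustration, not the paper's schema or data; only the two definitions (share of active users touching Codex, and Codex's share of output tokens across Codex and ChatGPT) come from the text above.

```python
from dataclasses import dataclass


@dataclass
class UserUsage:
    """Hypothetical 28-day usage record for one user (illustrative schema)."""
    user_id: str
    chatgpt_tokens: int  # output tokens generated in ChatGPT
    codex_tokens: int    # output tokens generated in Codex


def adoption_rate(users):
    """Extensive margin: fraction of active users with any Codex output."""
    active = [u for u in users if u.chatgpt_tokens + u.codex_tokens > 0]
    if not active:
        return 0.0
    return sum(u.codex_tokens > 0 for u in active) / len(active)


def codex_token_share(users):
    """Codex output tokens divided by all output tokens (Codex + ChatGPT)."""
    codex = sum(u.codex_tokens for u in users)
    total = codex + sum(u.chatgpt_tokens for u in users)
    return codex / total if total else 0.0


# Toy population: 99 chat-only users and one heavy Codex adopter.
users = [UserUsage(f"u{i}", chatgpt_tokens=1_000, codex_tokens=0) for i in range(99)]
users.append(UserUsage("u99", chatgpt_tokens=1_000, codex_tokens=20_000))

print(f"adoption: {adoption_rate(users):.1%}")          # 1 of 100 users
print(f"token share: {codex_token_share(users):.1%}")   # 20,000 of 120,000 tokens
```

In this toy case one adopter in a hundred is enough to push the token share far above the adoption rate, which is the shape of the individual-user row in the table.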

The same extensive/intensive split recurs by job function. The average organizational engineer generates 26.8% of their tokens on Codex, but Codex accounts for 88.3% of total tokens generated by engineers who adopt it. For legal users the figures are 1.9% of the average user's tokens and 17.6% of the function's total. Adoption is shallow but, where it lands, deep.
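One way a per-user average and a function-level total can diverge this much is the difference between a mean of ratios and a ratio of sums. The sketch below assumes a hypothetical list of `(codex_tokens, chatgpt_tokens)` pairs per user; the numbers are invented and are not the paper's.

```python
def mean_user_share(users):
    """Average of each user's own Codex share (every user weighted equally)."""
    shares = [codex / (codex + chat) for codex, chat in users if codex + chat > 0]
    return sum(shares) / len(shares) if shares else 0.0


def pooled_share(users):
    """Codex share of the group's total tokens (heavy users weigh more)."""
    codex = sum(c for c, _ in users)
    total = sum(c + g for c, g in users)
    return codex / total if total else 0.0


# Toy job function: nine chat-only users and one heavy Codex user.
# Each tuple is (codex_tokens, chatgpt_tokens); values are hypothetical.
team = [(0, 500)] * 9 + [(4_500, 500)]

print(f"mean user share: {mean_user_share(team):.1%}")  # 0.9 averaged over 10 users
print(f"pooled share:    {pooled_share(team):.1%}")     # 4,500 of 9,500 tokens
```

The pooled figure is dominated by whoever generates the most tokens, so a few intensive adopters can lift it well above the typical user's share.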

Asking vs. doing

The conceptual hinge is drawn explicitly against Chatterji et al. (2025), How People Use ChatGPT (Chatterji is also a co-author of this paper). In conversational use, "asking" (information, advice, learning) was nearly half of all prompts. Codex usage is "doing": users delegate debugging, refactoring, validation, configuration, document drafting, and data analysis. The paper frames these as production, not consultation: "users are asking Codex to do work, not only to provide advice." In an agentic interface, one unit of usage corresponds to a delegated workflow, not a single conversational exchange. (Tool use is the rough proxy: in the week before June 11, 2026, 60.3% of Codex turns vs 21.9% of ChatGPT turns invoked an external tool.)
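The tool-use proxy is a simple per-turn rate. A minimal sketch, assuming each turn is represented by a count of external tool calls (a hypothetical representation; the paper's classifier pipeline is not described here):

```python
def tool_use_rate(turns):
    """Share of turns in which at least one external tool was invoked."""
    if not turns:
        return 0.0
    return sum(1 for tool_calls in turns if tool_calls > 0) / len(turns)


# Each entry is the number of external tool calls in one turn (toy data).
codex_turns = [3, 0, 1, 5, 0]
chatgpt_turns = [0, 0, 1, 0, 0]

print(f"Codex:   {tool_use_rate(codex_turns):.1%}")
print(f"ChatGPT: {tool_use_rate(chatgpt_turns):.1%}")
```

The rate counts a turn once however many tools it invokes, so it measures how often work is delegated to tools, not how much tool activity a single turn sets off.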

Measurement obsolescence: what to count when usage is delegation

The paper's forward-looking claim, and its connection to the wiki's measurement thread: standard measures of AI use — active users, chats, message volume — "may become less informative as agentic systems diffuse." When a single instruction sets off hours of autonomous, tool-using, multi-step work, message counts stop tracking output. The authors name the replacements to track:

delegated task complexity (estimated human-hours per task — see Task Time-Horizon Scaling)
runtime (how long agents work per day — see Parallel Agent Orchestration)
workflow reuse (skills/plugins — see Agentic Work Systematization)
concurrency (parallel agents — see Parallel Agent Orchestration)
production output (tokens generated on the user's behalf)

This is the same "measure what the system actually did, not the proxy" instinct as Telemetry vs. Survey Measurement and Production-Sourced Evaluation, pushed one step further: not just telemetry over survey, but output-and-delegation metrics over interaction-count metrics.
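To make the replacement metrics concrete, here is a minimal sketch that derives them from a task log instead of from message counts. The `DelegatedTask` schema, the field names, and the `"release-notes"` skill are all hypothetical; the paper names the metrics but this is not its pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DelegatedTask:
    """Hypothetical log entry for one delegated task (illustrative schema)."""
    start_h: float          # start time, hours since midnight
    end_h: float            # end time, hours since midnight
    est_human_hours: float  # estimated human time the task would take
    output_tokens: int      # tokens generated on the user's behalf
    skill: Optional[str]    # reusable skill/plugin invoked, if any


def peak_concurrency(tasks):
    """Maximum number of tasks running at the same moment (sweep line)."""
    events = []
    for t in tasks:
        events.append((t.start_h, 1))
        events.append((t.end_h, -1))
    # At equal times, process ends (-1) before starts (+1) so that
    # back-to-back tasks are not counted as overlapping.
    events.sort()
    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak


def summarize(tasks):
    """Delegation-oriented metrics for one user-day of tasks."""
    if not tasks:
        raise ValueError("need at least one task")
    n = len(tasks)
    return {
        "mean_est_human_hours": sum(t.est_human_hours for t in tasks) / n,
        "agent_runtime_hours": sum(t.end_h - t.start_h for t in tasks),
        "skill_reuse_share": sum(t.skill is not None for t in tasks) / n,
        "peak_concurrency": peak_concurrency(tasks),
        "output_tokens": sum(t.output_tokens for t in tasks),
    }


# Toy day: four delegated tasks, three of them overlapping mid-morning.
day = [
    DelegatedTask(9.0, 11.0, 6.0, 40_000, "release-notes"),
    DelegatedTask(9.5, 10.5, 2.0, 15_000, None),
    DelegatedTask(10.0, 13.0, 10.0, 90_000, "release-notes"),
    DelegatedTask(13.0, 14.0, 1.0, 5_000, None),
]

for name, value in summarize(day).items():
    print(f"{name}: {value}")
```

A message-count view of this day would record four instructions; the summary instead records seven agent-hours, three concurrent agents at peak, and 150,000 generated tokens.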

Three populations as a natural experiment

Because the same model is available to all three groups, the large gaps between them are evidence that adoption depends on context, not capability — the organizational-complements argument. OpenAI workers (cheap marginal usage, high buy-in, training, proximity to the systems) are the frontier preview: what agentic use looks like when adoption frictions are near zero. Their pattern — usage spread across research, planning, communication, recruiting, sales; output tokens up ≥10× in every job function Nov 2025→Jun 2026 (50×+ for researchers) — is the paper's picture of the destination, not the average.

Product, not just model: autonomy by surface

Anthropic's AEI Cadences report (June 2026) supplies the surface-comparison from the Claude side. Rating AI autonomy on a 1–5 scale, Claude Code shows higher autonomy than chat/Cowork on 26 of 31 output types (average gap +0.37; scripts and code snippets +0.53). Two-thirds of the gap is the same task executed with more delegation — a blog post takes a median 13 rounds of back-and-forth on chat/Cowork but a single prompt on Claude Code — and one-third is the different output mix. Crucially, the gap persists controlling for model: among Sonnet-served conversations, Claude Code still shows +0.26 more autonomy, even though Code runs on Opus far more often (54% vs 10% of chat/Cowork). The report's conclusion — "the product used is likely more important than the underlying model" — is the surface-design counterpart to this page's "agentic AI is a different mode of use, not a better chatbot," and to Harness Shrinkage as Models Improve: the harness/surface, not just the model, sets how much users delegate.
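The split of the gap into a same-task part and an output-mix part can be illustrated with a standard shift-share decomposition. This is an assumption about the general technique, not the report's documented method, and the output types, weights, and autonomy scores below are invented toy values.

```python
def decompose_gap(mix_a, score_a, mix_b, score_b):
    """Split mean(A) - mean(B) into within-task and output-mix components.

    mix_*   : share of each output type on a surface (sums to 1)
    score_* : mean autonomy score of each output type on that surface
    within  : same tasks, different autonomy, weighted by B's mix
    mix     : different task mix, valued at A's scores
    """
    types = mix_a.keys()
    within = sum(mix_b[k] * (score_a[k] - score_b[k]) for k in types)
    mix = sum((mix_a[k] - mix_b[k]) * score_a[k] for k in types)
    return within, mix


# Hypothetical surfaces: A is the agentic product, B is chat.
mix_a = {"blog post": 0.2, "script": 0.8}
score_a = {"blog post": 3.5, "script": 4.0}
mix_b = {"blog post": 0.6, "script": 0.4}
score_b = {"blog post": 3.0, "script": 3.8}

within, mix = decompose_gap(mix_a, score_a, mix_b, score_b)
total = within + mix
print(f"within-task: {within:.2f} ({within / total:.0%} of gap)")
print(f"output mix:  {mix:.2f} ({mix / total:.0%} of gap)")
```

The two components sum exactly to the difference in surface-level means, which is what allows a statement of the form "two-thirds within task, one-third mix"; other weighting choices (for example, weighting the within term by A's mix) give a different but equally valid split.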

Connections

Agentic Coding Work-Composition Shift — the Anthropic/Claude-Code telemetry of the same "asking→doing" move; this is the OpenAI/Codex, cross-population complement (two labs, two tools, one shift)
Conversation Artifacts — the artifact companion from the same Cadences report; autonomy and compute co-move across artifacts (r = 0.68)
AI Usage Cadences — the temporal-rhythm chapter of that report; the extensive rhythm of this diffusing delegation
The Automation–Optimism Link — the perceptions chapter: automation share (this page's delegation, survey-measured) predicts worker optimism
Exposure Taxonomy: Observed, Theoretical, Reported, Anticipated — the survey's belief-side measures, complementing this behavior-side shift
Anthropic Economic Index — the sibling research program; its returns-to-expertise report is cited by this OpenAI study
Returns to Expertise in Agentic Coding — the companion finding from the report this one cites (Hitzig et al. 2026): as work becomes delegation, the binding human skill is domain understanding + supervision, not execution
Organizational Complements to AI — why the three populations differ so much under one model: value depends on workflow/permission/skill complements, not capability alone
Agentic Work Systematization — one of the three "how" margins: ad-hoc delegation hardening into reusable skill/plugin routines
Parallel Agent Orchestration — the other two "how" margins: concurrency and long-running runtime, the intensive-user workflow
Task Time-Horizon Scaling — delegated task complexity (the >8h-task share rising 2.1%→25.6%) is the usage-side reading of the rising reliable-task-length ceiling
Planning / Execution Division of Labor — delegation is the realized half: how much autonomous work each prompt sets off, and how much users actually hand over
Telemetry vs. Survey Measurement — the measurement-obsolescence point extends this: not only telemetry-over-survey, but delegation-metrics over interaction-counts
Verification as the New Bottleneck — as use shifts from asking to delegating, the human role moves toward review, supervision, and coordination — the bottleneck named here
Engineer PM Convergence — the intensive user "manages a portfolio of agentic work," shifting toward delegation/supervision/integration — the role convergence seen in usage data
AI as Primary Author — delegated production is authorship moving to the agent; the token-share numbers are how far it has moved
Harness Shrinkage as Models Improve — rising delegation is what a shrinking harness enables on the usage side: less hand-holding per task, more handed off
OpenAI — the lab whose Codex telemetry this is, and whose internal usage is the frontier preview
Codex — the agentic tool whose adoption this measures

Open questions

The token-share metric rewards verbose agentic output. How much of the 99.8% / 63.3% / 16.5% spread is a genuine work shift vs. agentic tools simply emitting more tokens per unit of human intent?
OpenAI-internal is a frontier preview by assumption. Does the external organizational curve actually trace the OpenAI path (the paper's implicit claim), or does it plateau where adoption frictions don't vanish?
"Asking is half of ChatGPT, doing is most of Codex" — but the two tools self-select different work. How much of the asking→doing contrast is the shift itself vs. routing pre-existing "doing" tasks to the tool built for them?

Sources

The Shift to Agentic AI: Evidence from Codex — Abstract; §1 Introduction (four stylized facts); §3 Who Uses Agentic AI; §6 Conclusion
Anthropic Economic Index report: Cadences — Anthropic Economic Index Cadences (June 2026), Ch. 2 §"How much autonomy does Claude have to decide on its own?" (autonomy-by-surface, model-controlled)
Chatterji et al. (2025), How People Use ChatGPT (NBER WP 34255) — the conversational-AI "asking ≈ half" baseline this contrasts against

About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

Cited by 21

Agentic Coding Work-Composition Shift
Anthropic's 400K-session telemetry, Oct 2025→Apr 2026: as models improved, the share of sessions fixing broken code fel…
Agentic Work Systematization
OpenAI Codex study's 'systematization' margin: the shift from ad-hoc agent use (describe task → agent does it → done) t…
AI as Primary Author
Faros 2026: the assistant→author threshold crossed without a deliberate decision, marked by AI-code acceptance rising 2…
AI Usage Cadences
AEI Cadences report: continuous hourly telemetry reveals AI usage carries the rhythms of daily life — personal use spik…
Anthropic Economic Index
Anthropic's recurring economic-research program measuring how Claude usage maps to and diffuses through the economy — p…
The Automation–Optimism Link
AEI Cadences survey finding: people who use Claude in more automated ways are MORE optimistic across all six job-qualit…
Claude Code
Anthropic's agentic coding product; created by Boris Cherny late 2024; TypeScript/React; CLI/desktop/web/mobile/IDE sur…
Codex
OpenAI's agentic coding and work platform: a CLI (April 2025) plus a desktop app (built Nov 2025, released Feb 2026) bu…
Conversation Artifacts
AEI Cadences report: the 'artifact' (the primary output a user takes away) as a new unit of economic analysis — 93% of…
Engineer PM Convergence
Generalists across disciplines; product taste as bottleneck skill; Anthropic Claude Code team as case study; "just do t…
Exposure Taxonomy: Observed, Theoretical, Reported, Anticipated
Four distinct ways to measure AI's reach into an occupation — observed exposure (tasks seen done with Claude), theoreti…
Harness Shrinkage as Models Improve
Prompt scaffolding shrinks each model release; Cat Wu's pruning discipline; Boris Cherny "100 lines of code a year from…
Governance & Workforce
Map of Content for the governance-workforce domain — 16 concepts. Curated entry point; see Home for all domains.
OpenAI
AI lab and maker of the GPT-5 series and Codex; in this corpus it appears as a frontier-safety research source (Deploym…
Organizational Complements to AI
The general-purpose-technology argument that AI's productivity gains depend on complementary workflow/skill/org-design…
Parallel Agent Orchestration
OpenAI Codex study's concurrency + runtime margins: the intensive-user workflow where a human oversees a team of agents…
Planning / Execution Division of Labor
Anthropic's 400K-session telemetry: in a typical Claude Code session humans make ~70% of planning decisions (what to do…
Production-Sourced Evaluation
Building benchmarks from de-identified real production usage rather than synthetic or hand-authored tasks; DRACO's cent…
Returns to Expertise in Agentic Coding
Anthropic's 400K-session study: domain expertise (not coding skill) is what amplifies an agent — experts get 2× the act…
Task Time-Horizon Scaling
METR's measure of the task length AI can complete reliably on its own, doubling roughly every 4 months (up from every 7…
Telemetry vs. Survey Measurement
Faros 2026: perception lags reality, so survey-based engineering research (DORA) misses downstream AI damage that syste…
