Summary
Google Cloud's methodology for engineering agent quality instead of vibe-checking it, shipped (June 2026) as an installable skill that your coding agent drives. The problem it names is the daily reality of agent development: you tweak a prompt, it looks better on three examples, and you have no idea whether you broke ten others — "moved the metric or just moved the vibe." The flywheel is a three-phase loop — Build & Test → Ship & Monitor → Learn & Refine — with the Build & Test phase expanded into five concrete stages, run once in order and then looped (stages 2–5) until quality targets are met. Google states the methodology and its AutoRaters are the same ones it uses on its own models and first-party agents, developed with Google DeepMind. Source is a first-party product blog (vendor-claim): the mechanism descriptions are concrete and the demo cycles are worked in detail, but the results are Google's own demos, not independent measurement.
The five stages
- Prepare Data — build an eval dataset from existing OTel traces, hand-crafted cases, or synthesized scenarios.
- Run Inference — execute the agent over the dataset to produce traces (skipped if traces already exist, e.g. production sessions).
- Grade — score traces with adaptive AutoRaters (model-based judges that grade a trace and explain why) or custom metrics. The only stage that always runs.
- Analyze Failures — read rubric verdicts to understand why a case failed; cluster with Automatic Loss Analysis when failures number ten or more.
- Optimize & Iterate — apply a targeted fix, re-run 2–4, compare against the previous baseline.
The skill encodes the discipline that most failing cases take several iterations before metrics actually move — and the architectural rule that the optimizer never grades its own work: whatever proposes a fix (coding agent, automated optimizer, human), an independent evaluation service scores it.
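The loop and the decoupling rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the skill's or the evaluation service's API: `run_cycle`, `flywheel`, `propose_fix`, and the dict shapes for cases and traces are all hypothetical names invented here. The point it shows is structural: the proposer returns a new agent, and only the separately supplied `grade` function decides whether a trace passes.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical types for illustration only.
Agent = Callable[[str], str]
Grader = Callable[[dict], bool]


@dataclass
class CycleResult:
    pass_rate: float
    failures: list = field(default_factory=list)


def run_cycle(cases: list[dict], agent: Agent, grade: Grader) -> CycleResult:
    """Stages 2-4 in miniature: run inference, grade, collect failures."""
    failures = []
    for case in cases:
        # Run Inference is skipped when the case already carries a trace
        # (for example a production session).
        trace = case.get("trace") or {
            "case_id": case["case_id"],
            "output": agent(case["input"]),
        }
        # Grade always runs, and always through the independent grader.
        if not grade(trace):
            failures.append(trace)
    return CycleResult(pass_rate=1 - len(failures) / len(cases), failures=failures)


def flywheel(cases, agent, grade, propose_fix, target=0.9, max_iters=5):
    """Loop until the target is met; the proposer never scores its own fix."""
    history = [run_cycle(cases, agent, grade)]  # first full pass is the baseline
    for _ in range(max_iters):
        if history[-1].pass_rate >= target:
            break
        # Stage 5: a targeted fix from whoever proposes (agent, optimizer, human).
        agent = propose_fix(agent, history[-1].failures)
        # Re-run the earlier stages and keep the result for comparison.
        history.append(run_cycle(cases, agent, grade))
    return history
```

Because `history` keeps every cycle, each new run can be compared against the previous baseline rather than judged in isolation. The sketch assumes a non-empty case list.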
The interface: describe a worry, approve a plan
The developer never touches the eval CLI and never names a metric. The whole interface is a plain-language concern — "I'm worried about whether travel-concierge honors mid-conversation changes… figure out how to test it and propose a plan" — and the skill's job is to translate that goal into the right evaluation: it reads the agent's code, picks metrics, synthesizes scenarios, runs grading, and reports before/after. This is eval-writing itself being automated: the judgment call moves up a level, from authoring the eval to stating the worry and approving the plan.
In Google's worked demo (an ADK multi-agent trip planner), the skill bootstrapped 25 scenarios with the User Simulator across five revision types, graded with two built-in multi-turn AutoRaters plus a purpose-built categorical rubric, found 21% of revisions IGNORED, located the failure precisely, and — after a three-sentence instruction fix the human approved — re-ran the same evaluation to show 21%→5%.
Promote one concern to a stable metric
The demo's most transferable lesson. Adaptive AutoRaters regenerate a rubric per case per run, so a specific failure lands as one criterion among several, folded into a blended score — real, named, and invisible: in one case the built-in task-success metric scored a comfortable 0.80 while the user's revision was dropped, because four of five generated criteria passed. Detection is not the problem; isolation is. The move is to promote the one concern to its own stable custom metric — here revision_honored with categorical verdicts (HONORED / IGNORED / PARTIAL / NO_REVISION) — that you can count, gate on ("act if >20% come back IGNORED"), and trend cycle over cycle. The working division of labor: adaptive built-ins as the broad-health signal, one stable measure for the behavior you're changing. (See LLM-as-a-Judge for the adaptive-rubric variant this extends, and Failures That Look Like Success for the failure class the blended score hid.)
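The arithmetic behind "real, named, and invisible" is easy to reproduce. The sketch below is an illustration, not the AutoRater implementation: the criterion names, `blended_score`, `ignored_rate`, and `should_act` are hypothetical, and the choice to exclude NO_REVISION cases from the denominator is an assumption (the source does not say how the rate is computed). Only the four verdict labels and the 20% gate come from the text above.

```python
from collections import Counter

VERDICTS = {"HONORED", "IGNORED", "PARTIAL", "NO_REVISION"}


def blended_score(criteria: dict[str, bool]) -> float:
    """Fraction of generated rubric criteria that passed for one case."""
    return sum(criteria.values()) / len(criteria)


def ignored_rate(verdicts: list[str]) -> float:
    """Share of cases with a revision whose verdict is IGNORED.

    Assumption: NO_REVISION cases are left out of the denominator.
    """
    unknown = set(verdicts) - VERDICTS
    if unknown:
        raise ValueError(f"unexpected verdicts: {sorted(unknown)}")
    counts = Counter(verdicts)
    revised = len(verdicts) - counts["NO_REVISION"]
    return counts["IGNORED"] / revised if revised else 0.0


def should_act(verdicts: list[str], threshold: float = 0.20) -> bool:
    """Gate: act when more than `threshold` of revisions come back IGNORED."""
    return ignored_rate(verdicts) > threshold


# One case: four of five criteria pass, yet the revision was dropped.
case = {
    "itinerary_complete": True,
    "budget_respected": True,
    "tone_appropriate": True,
    "tools_used_correctly": True,
    "revision_applied": False,
}
print(blended_score(case))  # 0.8
```

The blended score cannot be gated on for this concern, because the failing criterion may be worded differently, or absent, on the next run. The categorical verdict is the same question every time, so its rate can be counted and compared across cycles.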
It works without a hypothesis too
Pointed cold at a bug-triage agent with just "find a real failure and fix it," the skill ran broad — varied synthetic scenarios, built-in multi-turn metrics — and surfaced a dominant cluster on its own: in 14 of 15 cases the agent did the work correctly but never told the user which tools it had called (its own instruction requested this; the model had quietly treated it as optional). A one-paragraph fix took tool-disclosure from 0% to 96% in one cycle, per Google's demo. "Here's my goal" and "find me a problem" both land.
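Surfacing a dominant cluster can be approximated with plain counting. This is a deliberately naive stand-in, not Automatic Loss Analysis, whose method the source does not describe: `dominant_failure` and the failure labels are hypothetical, and it assumes each graded case already carries a list of failed-criterion labels. The ten-failure threshold mirrors the Analyze Failures stage.

```python
from collections import Counter

# Stage 4 says to cluster once failures number ten or more.
MIN_FAILURES_TO_CLUSTER = 10


def dominant_failure(failed_labels_per_case: list[list[str]]):
    """Return (label, failing_cases_with_label, total_cases), or None.

    Each inner list holds the failed-criterion labels for one graded case;
    an empty list means the case passed.
    """
    failing = [labels for labels in failed_labels_per_case if labels]
    if len(failing) < MIN_FAILURES_TO_CLUSTER:
        return None  # few enough to read the verdicts one by one
    # Count each label at most once per case.
    counts = Counter(label for labels in failing for label in set(labels))
    label, n = counts.most_common(1)[0]
    return label, n, len(failed_labels_per_case)
```

With 15 graded cases of which 14 share one label, the function returns that label with counts 14 and 15, the shape of the tool-disclosure finding described above.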
Two cadences: dev loop and production loop
- On-demand (dev): no real usage yet → the User Simulator synthesizes scenarios. Explicitly a cold-start bootstrap: "synthetic scenarios get you moving; production data is what makes the loop sharp."
- Continuous (production): the same skill points at real OTel traces — already-complete traces skip Run Inference entirely and are graded in place with the same raters. Online Monitors continuously evaluate live traffic and write quality scores to Cloud Monitoring; when scores drift, failing traces feed the same eval-fix loop. Each production failure is a ready-made test case for the next cycle — Production-Sourced Evaluation operationalized as a product loop rather than a benchmark.
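The production cadence reduces to two small steps: notice that scores have drifted, then turn the failing traces into cases for the next cycle. The sketch below assumes scores are already available as a list of floats and traces as dicts with `trace_id` and `score` keys. It does not call Cloud Monitoring or the Online Monitors, and the function names, window, tolerance, and pass threshold are illustrative choices, not values from the source.

```python
from statistics import mean


def drifted(scores: list[float], baseline: float,
            window: int = 50, tolerance: float = 0.05) -> bool:
    """True when the mean of the most recent `window` scores falls more
    than `tolerance` below the baseline."""
    recent = scores[-window:]
    return bool(recent) and mean(recent) < baseline - tolerance


def failing_traces_to_cases(traces: list[dict],
                            pass_threshold: float = 0.5) -> list[dict]:
    """Wrap each low-scoring production trace as a ready-made eval case.

    The trace is attached to the case, so Run Inference can be skipped
    and the trace graded in place.
    """
    return [
        {"case_id": t["trace_id"], "trace": t}
        for t in traces
        if t["score"] < pass_threshold
    ]
```

A drift check like this only says when to look. The failing traces it hands over are what feed the same eval-fix loop used in development.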
Google's stated direction is to let the skill drive more of the outer loop itself — watching monitors, surfacing regressions, proposing fixes — but today it proposes and a human approves; the shipped version is deliberately not autonomous.
What it is and isn't
- Is: methodology plus orchestration inside your coding agent: metric selection, eval-service invocation, verdict reading, fix proposals, before/after comparison. Distributed as skills (npx skills add …, two packages against the same evaluation service): methodology shipping in the same unit as org context, the systematization format crossing vendor lines.
- Isn't: autonomous (human-in-the-loop); a source of ground truth (AutoRaters are sophisticated but model-based, so treat scores as directional and trust deltas between runs more than any absolute number); a substitute for real traffic.
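Trusting deltas over absolute numbers implies comparing two runs case by case on the same dataset. The helper below is a hypothetical sketch of that comparison, not part of the skill: `compare_runs` and the dict-of-booleans result shape are assumptions made for illustration. It reports which cases a change fixed and which it broke, the question the opening problem statement says a three-example spot check cannot answer.

```python
def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare pass/fail results for the cases present in both runs."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("runs share no cases; a delta would be meaningless")

    def pass_rate(run: dict[str, bool]) -> float:
        return sum(run[c] for c in shared) / len(shared)

    return {
        # Change in pass rate on the shared cases only.
        "delta": pass_rate(candidate) - pass_rate(baseline),
        # Failed before, pass now.
        "fixed": [c for c in shared if not baseline[c] and candidate[c]],
        # Passed before, fail now: the regressions a spot check would miss.
        "regressed": [c for c in shared if baseline[c] and not candidate[c]],
    }
```

Restricting the comparison to shared cases keeps the delta from being distorted when the dataset changes between cycles. Any case added or dropped shows up as missing from the comparison instead of silently shifting the rate.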
Connections
- Optimizer–Evaluator Decoupling — the flywheel's load-bearing architectural rule; proposer and grader stay separate
- Failures That Look Like Success — the failure class both demo cycles surfaced, and the reason trace-level rubric grading beats output skims
- LLM-as-a-Judge — AutoRaters are the adaptive-rubric variant of the judge primitive; the flywheel adds the stable-metric-promotion discipline on top
- Production-Sourced Evaluation — the production cadence: real traces as eval input, synthetic simulation as explicit cold-start bootstrap
- Evals as Product Spec — the PM skill this automates one level up: the human states the worry and approves the plan; the skill authors the eval
- Loop Engineering — an eval-fix loop packaged as a product-native skill; Google joining the Codex/Claude Code convergence on shipped loop primitives
- Agentic Work Systematization — skills as the distribution unit, here carrying vendor methodology rather than org-specific context
- Compounding Loop Optimization — the same instrument-every-recurring-step discipline, productized for the eval-fix step of agent development
- Verification as the New Bottleneck — the bottleneck this tooling attacks: grading, failure analysis, and regression comparison made cheap enough to loop
- Gemini Enterprise Agent Platform — the platform whose evaluation service, User Simulator, Online Monitors, and AutoRaters the skill orchestrates
- Google DeepMind — co-developer of the AutoRaters
Open questions
- Both demo cycles fixed agents with instruction-level bugs and showed large one-cycle gains. What does the loop look like on failures that need tool, memory, or architecture changes — does "several iterations before metrics move" dominate in practice?
- The custom rubric is authored by the same coding agent that will later propose fixes. Metric choice is upstream of grading — does decoupling need to extend to who defines the metric, not just who scores it?
- Synthetic User Simulator scenarios bootstrapped the whole first cycle. How much of the 21%→5% delta survives on real-traffic distributions (the representativeness gap Production-Sourced Evaluation names)?
Sources
- Driving the Agent Quality Flywheel from Your Coding Agent. Melnyk & Dai, Google Developers Blog, 2026-06-30 (vendor-claim); builds on the Cloud Next '26 agent-quality talk
