Summary#
Google Cloud's platform for building, running, and evaluating agents — in this corpus, the infrastructure underneath the Agent Quality Flywheel. Its evaluation stack is the notable part: a GenAI evaluation service whose AutoRaters (developed with Google DeepMind, and per Google the same ones used on its own models and first-party agents) are adaptive model-based judges — for a multi-turn agent they extract user intent from the conversation, generate per-case rubrics, validate the trace against each criterion, and majority-vote across samples.
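The final step in that pipeline, majority-voting across samples, can be illustrated with a minimal sketch. This is not the evaluation service's API: `majority_vote`, `grade_trace`, and `keyword_judge` are hypothetical names, the keyword judge is a deterministic stand-in for a model-based AutoRater, and the sample count of 5 is an assumption rather than a documented default.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical judge signature: (criterion, trace) -> True if the criterion is met.
Judge = Callable[[str, str], bool]


def majority_vote(verdicts: List[bool]) -> bool:
    """Return True only if a strict majority of sampled verdicts are True."""
    counts = Counter(verdicts)
    return counts[True] > counts[False]


def grade_trace(trace: str, rubric: List[str], judge: Judge, samples: int = 5) -> Dict[str, bool]:
    """Sample the judge several times per rubric criterion and keep the majority verdict."""
    return {
        criterion: majority_vote([judge(criterion, trace) for _ in range(samples)])
        for criterion in rubric
    }


def keyword_judge(criterion: str, trace: str) -> bool:
    """Toy stand-in for a model-based judge: the criterion is met if its text appears in the trace."""
    return criterion.lower() in trace.lower()


if __name__ == "__main__":
    trace = "User asked for a refund; agent issued the refund and sent a confirmation."
    print(grade_trace(trace, ["refund", "apology"], keyword_judge))
```

With a real model-based judge the sampled verdicts can disagree, which is why the vote is taken; in this sketch a tie counts as a failure, which is one possible design choice among several.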
Components referenced#
- GenAI evaluation service: the independent grader in the flywheel's optimizer/evaluator split; predefined multi-turn AutoRaters (`multi_turn_task_success`, `multi_turn_trajectory_quality`) plus custom rubric metrics.
- User Simulator: synthesizes multi-turn scenarios for cold-start evaluation before real traffic exists.
- Automatic Loss Analysis: clusters failure verdicts when failures number ten or more.
- Online Monitors: continuously evaluate live production traffic and write quality scores to Cloud Monitoring.
- OTel tracing: agents emit OpenTelemetry traces (ADK does by default); production traces double as eval datasets.
- ADK (Agent Development Kit) + agents-cli: Google's agent framework and CLI toolchain; the `adk-samples` agents are the flywheel's demo subjects.
- The two skill packages: `google-agents-cli-eval` (ADK/agents-cli) and `agent-platform-eval-flywheel` (Evaluation SDK, any framework), installed via `npx skills add …` from skills.sh.
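The threshold behaviour described for Automatic Loss Analysis (clustering only once failures number ten or more) can be sketched as follows. This is an illustration under stated assumptions, not the platform's implementation: the verdict fields `passed`, `reason`, and `case_id` are hypothetical, and grouping by an exact `reason` string stands in for whatever clustering the real service performs.

```python
from collections import defaultdict
from typing import Dict, List, Optional

MIN_FAILURES = 10  # from the text: clustering applies when failures number ten or more


def cluster_failures(verdicts: List[dict]) -> Optional[Dict[str, List[str]]]:
    """Group failed cases by reason, or return None when there are too few failures."""
    failures = [v for v in verdicts if not v["passed"]]
    if len(failures) < MIN_FAILURES:
        return None  # below the threshold: nothing to cluster yet
    clusters: Dict[str, List[str]] = defaultdict(list)
    for verdict in failures:
        # Hypothetical stand-in for real clustering: exact match on the reason label.
        clusters[verdict["reason"]].append(verdict["case_id"])
    return dict(clusters)


if __name__ == "__main__":
    verdicts = [
        {"case_id": f"case-{i}", "passed": False, "reason": "wrong_tool" if i % 2 else "missed_intent"}
        for i in range(10)
    ]
    print(cluster_failures(verdicts))
```

Returning `None` below the threshold keeps "not enough data" distinct from "no clusters found", which a caller may want to report differently.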
Position in the corpus#
The Google-side counterpart to Claude Code's and Codex's agent stacks — but where those entries anchor building with agents, this platform's corpus role is measuring them: it packages evaluation (judges, simulators, monitors) as the product surface. That the delivery mechanism is a skill driven by whatever coding agent you already use is itself evidence for the skills-as-distribution-unit pattern (Agentic Work Systematization).
Connections#
- Agent Quality Flywheel — the methodology this platform ships and executes
- Google DeepMind — co-developer of the AutoRaters at the evaluation stack's core
- Optimizer–Evaluator Decoupling — the evaluation service is the independent grader the rule requires
Sources#
- Driving the Agent Quality Flywheel from Your Coding Agent - Google Developers Blog, Melnyk & Dai, 2026-06-30 (`vendor-claim`)
