DRACO Benchmark

Published: June 15, 2026 | Filed: Concept | Domain: LLM Architecture | Tags: Benchmarks, Capability Evaluation, Deep Research, LLM As A Judge | Reading: 6 min | Source: AI-synthesised

Perplexity's benchmark of 100 production-sourced deep-research tasks (10 domains, 40 countries), graded against rubrics written with 26 domain experts and covering accuracy, completeness, objectivity, and citation quality. Perplexity Deep Research leads every domain and axis; Claude Opus 4.6 is the strongest non-Perplexity system; factual accuracy is the universal weak spot.

Summary#

DRACO (Deep Research Accuracy, Completeness, and Objectivity) is a benchmark of 100 complex, open-ended deep-research tasks spanning 10 domains and requiring information from 40 countries, published by Perplexity (with a Harvard co-author) in February 2026 (arXiv:2602.11685). Its distinguishing feature: the tasks are drawn from real, de-identified production usage of Perplexity Deep Research (Production-Sourced Evaluation) rather than synthetic or hand-authored prompts, then paired with task-specific expert rubrics and graded by an LLM-as-a-judge. It is a benchmark of systems / products, not base models — which is what makes its headline finding (orchestration beats the bare model) legible.

Why it's different (Table 1)#

DRACO is positioned as the first deep-research benchmark to be simultaneously: production-sourced, human-authored, general-domain (not just specialized/technical), and expert-rubric-graded. Prior open-ended benchmarks each miss at least one — DeepResearchEval, ReportBench, DeepScholar-Bench, and DRBench rely on synthetic task generation; others are hand-authored but narrow or lack expert rubrics. None draws directly from a widely-available production deep-research system.

Task construction (5 stages)#

Tasks are sourced from production Perplexity Deep Research queries, then reformulated, augmented, and filtered so that they are anonymous, well-specified, bounded, challenging, and representative:

  1. Sampling — 1,000 high-difficulty English queries (Sep–Oct 2025), difficulty proxied by subsequent negative sentiment or a thumbs-down on the prior response.
  2. Pre-processing — LLM reformulation to strip PII and reduce ambiguity; fully automated, no raw query ever seen by a human analyst (privacy by design).
  3. Augmentation — systematic expansion along two axes: context (persona, output format, source specificity) and scope (temporal, cross-entity comparison, geography). Turns ambiguous queries into well-defined tasks reflecting implicit user intent.
  4. Filtering — LLM keeps only tasks that are objective (experts converge on what's good), tractable (bounded), and difficult (needs nontrivial multi-step gathering/synthesis).
  5. Curation — 100 tasks sampled to match the real domain distribution, then manually reviewed by in-house domain experts.
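
As a rough illustration of stage 5, the sketch below allocates a fixed number of benchmark slots across domains in proportion to how often each domain appears in a candidate pool. The source does not say how the sampling was implemented; the largest-remainder rule, the function name `allocate_quota`, and the toy counts are all assumptions made for this example and are not DRACO's actual data.

```python
from math import floor


def allocate_quota(pool_counts: dict[str, int], total: int) -> dict[str, int]:
    """Split `total` slots across domains in proportion to `pool_counts`.

    Uses largest-remainder rounding so the quotas always sum to `total`.
    This is an assumed method, not the procedure described in the paper.
    """
    pool_size = sum(pool_counts.values())
    # Exact (fractional) share of slots each domain would get.
    exact = {d: total * n / pool_size for d, n in pool_counts.items()}
    quota = {d: floor(x) for d, x in exact.items()}
    leftover = total - sum(quota.values())
    # Hand the remaining slots to the domains with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda d: exact[d] - quota[d], reverse=True)
    for d in by_remainder[:leftover]:
        quota[d] += 1
    return quota


# Toy pool with made-up counts (hypothetical, for illustration only).
toy_pool = {"Finance": 230, "Law": 95, "Medicine": 175}
quota = allocate_quota(toy_pool, total=10)
print(quota)
```

Once the quotas are fixed, tasks would be drawn at random within each domain and then handed to reviewers, which corresponds to the manual review step described in stage 5.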

The 10 domains: Finance, Shopping/Product Comparison, Academic, Technology, General Knowledge, UX Design, Law, Medicine, Needle in a Haystack, Personalized Assistant.

Rubric design and grading#

Rubrics were built with 26 recruited domain experts (doctors, attorneys, financial analysts, engineers, designers) over a 4-stage pipeline with LLM assistance, including a saturation test — if the leading system already scored >90% on a task, it was sent back for hardening (~45% of tasks were). Each task carries ~39.3 weighted criteria across four axes; about half target factual accuracy. Criteria are positive (desirable properties) or negative (pitfalls), with the harshest penalties reserved for harmful medical content (down to −500).

Axis                         Weight range    ~Criteria/task
Factual Accuracy             −500 to +20     20.5
Breadth & Depth of Analysis  −100 to +10     8.6
Presentation Quality         −50 to +20      5.6
Citation Quality             −150 to +10     4.8

Grading uses an open-source LLM-as-a-judge protocol: the judge returns a binary MET/UNMET verdict for each criterion, and the verdicts are aggregated into a weighted normalized score (0–100%) and a pass rate. The judge is Gemini-3-Pro (chosen via an internal human–LLM alignment study); GPT-5.2 and Sonnet-4.5 serve as corroborating judges. Rankings are stable across judges; absolute magnitudes vary.
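
To make the aggregation step concrete, here is a minimal sketch of how per-criterion MET/UNMET verdicts could be turned into the two reported numbers. The exact formulas are not given in this article, so the definitions below are assumptions: the normalized score is taken as earned weight divided by the total positive weight, clipped to 0–100%, and a criterion "passes" when a positive criterion is MET or a negative (pitfall) criterion is UNMET. The names `Criterion`, `normalized_score`, and `pass_rate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    axis: str
    weight: float  # positive = desirable property, negative = pitfall penalty
    met: bool      # judge verdict: True for MET, False for UNMET


def normalized_score(criteria: list[Criterion]) -> float:
    """Assumed formula: weight earned from MET criteria over max positive weight."""
    earned = sum(c.weight for c in criteria if c.met)
    max_positive = sum(c.weight for c in criteria if c.weight > 0)
    if max_positive == 0:
        return 0.0
    # Clip so heavy penalties cannot push the score below 0%.
    return 100.0 * min(1.0, max(0.0, earned / max_positive))


def pass_rate(criteria: list[Criterion]) -> float:
    """Assumed formula: share of criteria with the desired verdict."""
    if not criteria:
        return 0.0
    # Positive criteria pass when MET; negative criteria pass when UNMET.
    passed = sum(1 for c in criteria if c.met == (c.weight > 0))
    return 100.0 * passed / len(criteria)


# Hypothetical verdicts for one task (weights chosen within the rubric ranges).
demo = [
    Criterion("Factual Accuracy", 20, met=True),
    Criterion("Factual Accuracy", 5, met=False),
    Criterion("Presentation Quality", 15, met=True),
    Criterion("Factual Accuracy", -50, met=False),  # pitfall avoided
]
print(normalized_score(demo), pass_rate(demo))
```

Under these assumptions a single triggered pitfall with a large negative weight can wipe out a task's score, which is consistent with the rubric reserving its harshest penalties for harmful content.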

Headline results#

Perplexity Deep Research leads every domain and every rubric axis. Scores for the deep-research systems and the bare-model baselines (both columns in %):

System                                Normalized (%)  Pass rate (%)
Perplexity Deep Research (Opus 4.6)   70.5            72.8
Perplexity Deep Research (Opus 4.5)   67.2            70.9
Gemini Deep Research                  59.0            62.7
OpenAI Deep Research (o3)             52.1            56.9
OpenAI Deep Research (o4-mini)        41.9            48.0
Claude Opus 4.6 (bare + tools)        59.8            63.1
Claude Opus 4.5 (bare + tools)        46.7            50.2

Three findings that matter for this wiki:

  1. Orchestration > base model. Perplexity (Opus 4.6 base) beats bare Opus 4.6-with-tools by ~10pp — see Deep Research Agents. A live counter-datapoint to Harness Shrinkage as Models Improve.
  2. Claude Opus 4.6 is the strongest non-Perplexity system (59.8% / 63.1%), ahead of Gemini Deep Research and both OpenAI configs. Opus 4.6 ranks second (non-Perplexity) in 5 of 10 domains.
  3. Factual accuracy / citation are the universal weak axes; presentation is strongest everywhere. The Perplexity-vs-second gap is largest in Finance (21.6pp) and smallest in Law (1.6pp).
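
The orchestration gap in finding 1 can be checked directly from the headline scores. The short sketch below only re-does that arithmetic; the figures are copied from the results table, while the dictionary layout and the helper name `gap` are choices made for this example.

```python
# (normalized score %, pass rate %) copied from the headline results table.
results = {
    "Perplexity Deep Research (Opus 4.6)": (70.5, 72.8),
    "Perplexity Deep Research (Opus 4.5)": (67.2, 70.9),
    "Gemini Deep Research": (59.0, 62.7),
    "OpenAI Deep Research (o3)": (52.1, 56.9),
    "OpenAI Deep Research (o4-mini)": (41.9, 48.0),
    "Claude Opus 4.6 (bare + tools)": (59.8, 63.1),
    "Claude Opus 4.5 (bare + tools)": (46.7, 50.2),
}


def gap(a: str, b: str) -> tuple[float, float]:
    """Difference a - b in percentage points for (normalized, pass rate)."""
    return tuple(round(x - y, 1) for x, y in zip(results[a], results[b]))


# Same base model, with and without Perplexity's orchestration.
orchestration_gap = gap(
    "Perplexity Deep Research (Opus 4.6)", "Claude Opus 4.6 (bare + tools)"
)
print(orchestration_gap)

# Ranking by normalized score, highest first.
ranking = sorted(results, key=lambda name: results[name][0], reverse=True)
print(ranking[:3])
```

The two differences bracket the "~10pp" figure quoted above, and the ranking places bare Claude Opus 4.6 directly behind the two Perplexity configurations.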

Limitations (the paper's own)#

The benchmark is single-turn only (no clarifying-question or multi-turn capability is tested); it is a static snapshot despite an automatable refresh pipeline; it is text-only (no multimodal tasks) and English-only; augmentation risks over-specifying away natural query variability; rubric creation still needs heavy human-expert involvement; and absolute scores depend on the LLM judge (though rankings don't). The evaluation is also system-level (black-box), so it gives no component-level attribution of retrieval vs. planning vs. synthesis.

Connections#

  • Deep Research Agents — the system class DRACO evaluates; home of the orchestration / verification / efficiency findings
  • Production-Sourced Evaluation — DRACO's central methodological contribution: tasks built from real de-identified production traffic
  • LLM-as-a-Judge — the rubric-based binary-verdict grading protocol DRACO uses
  • Task Time-Horizon Scaling — sibling capability benchmark; where METR measures task length a model sustains, DRACO measures research-report quality of agentic systems, and both note benchmark-saturation pressure (DRACO's saturation test discards >90%-solved tasks)
  • Harness Shrinkage as Models Improve — DRACO's orchestration-beats-bare-model result is a counter-datapoint to the shrinking-harness thesis
  • Verification as the New Bottleneck — factual-accuracy weakness across all systems is verification surfacing inside the research product
  • Evals as Product Spec — DRACO is the externalized, large-scale form of "evals as the definition of done," with rubrics standing in for the eval set
  • Perplexity / Anthropic / Google DeepMind — benchmark author; makers of evaluated systems and the judge model

Open questions#

  • The benchmark is static; the construction pipeline is automatable. Will Perplexity actually refresh it, and does a vendor-built benchmark on which the vendor's own product wins stay credible over time?
  • Rankings are judge-stable but magnitudes aren't — how much do absolute scores move under a non-Gemini judge, and does that matter for cross-paper comparison?
  • Does the production-sourced, expert-rubric method generalize cheaply to non-English, multimodal, and multi-turn deep research?
