DRACO Benchmark

Sources

DRACO: a Cross-Domain Benchmark for Deep Research Accuracy, Completeness, and Objectivity

Summary

DRACO (Deep Research Accuracy, Completeness, and Objectivity) is a benchmark of 100 complex, open-ended deep-research tasks spanning 10 domains and requiring information from 40 countries, published by Perplexity (with a Harvard co-author) in February 2026 (arXiv:2602.11685). Its distinguishing feature: the tasks are drawn from real, de-identified production usage of Perplexity Deep Research (Production-Sourced Evaluation) rather than synthetic or hand-authored prompts, then paired with task-specific expert rubrics and graded by an LLM-as-a-judge. It is a benchmark of systems / products, not base models — which is what makes its headline finding (orchestration beats the bare model) legible.

Why it's different (Table 1)

DRACO is positioned as the first deep-research benchmark to be simultaneously: production-sourced, human-authored, general-domain (not just specialized/technical), and expert-rubric-graded. Prior open-ended benchmarks each miss at least one — DeepResearchEval, ReportBench, DeepScholar-Bench, and DRBench rely on synthetic task generation; others are hand-authored but narrow or lack expert rubrics. None draws directly from a widely-available production deep-research system.

Task construction (5 stages)

Tasks are sourced from production Perplexity Deep Research queries, then reformulated, augmented, and filtered so that they are anonymous, well-specified, bounded, challenging, and representative:

Sampling — 1,000 high-difficulty English queries (Sep–Oct 2025), difficulty proxied by subsequent negative sentiment or a thumbs-down on the prior response.
Pre-processing — LLM reformulation to strip PII and reduce ambiguity; fully automated, no raw query ever seen by a human analyst (privacy by design).
Augmentation — systematic expansion along two axes: context (persona, output format, source specificity) and scope (temporal, cross-entity comparison, geography). Turns ambiguous queries into well-defined tasks reflecting implicit user intent.
Filtering — LLM keeps only tasks that are objective (experts converge on what's good), tractable (bounded), and difficult (needs nontrivial multi-step gathering/synthesis).
Curation — 100 tasks sampled to match the real domain distribution, then manually reviewed by in-house domain experts.
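The filtering and curation stages above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Task` fields, the `keep` predicate, and the use of largest-remainder apportionment to match the domain distribution are all assumptions made here, and the boolean flags stand in for judgments that an LLM makes in the real pipeline.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; the flags stand in for LLM filter verdicts."""
    domain: str
    objective: bool   # experts would converge on what a good answer is
    tractable: bool   # the task is bounded
    difficult: bool   # needs nontrivial multi-step gathering/synthesis


def keep(task: Task) -> bool:
    """Filtering stage: keep only tasks that pass all three checks."""
    return task.objective and task.tractable and task.difficult


def allocate_slots(domain_counts: dict[str, int], total: int) -> dict[str, int]:
    """Curation stage (assumed method): split `total` slots across domains in
    proportion to their observed counts, using largest-remainder rounding."""
    n = sum(domain_counts.values())
    if n == 0:
        raise ValueError("no tasks to allocate from")
    quotas = {d: total * c / n for d, c in domain_counts.items()}
    slots = {d: int(q) for d, q in quotas.items()}  # floor of each quota
    leftover = total - sum(slots.values())
    # Hand the remaining slots to the domains with the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda d: (quotas[d] - slots[d], d), reverse=True)
    for d in by_remainder[:leftover]:
        slots[d] += 1
    return slots


# Toy pool: 10 tasks pass the filter, 2 do not.
pool = (
    [Task("Finance", True, True, True)] * 5
    + [Task("Law", True, True, True)] * 3
    + [Task("Medicine", True, True, True)] * 2
    + [Task("Law", True, False, True), Task("Finance", False, True, True)]
)
kept = [t for t in pool if keep(t)]
counts = Counter(t.domain for t in kept)
slots = allocate_slots(counts, total=4)
print(slots)
```

In DRACO itself the target is 100 tasks across 10 domains, and the sampled set is then reviewed manually by in-house domain experts, a step this sketch does not model.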

The 10 domains: Finance, Shopping/Product Comparison, Academic, Technology, General Knowledge, UX Design, Law, Medicine, Needle in a Haystack, Personalized Assistant.

Rubric design and grading

Rubrics were built with 26 recruited domain experts (doctors, attorneys, financial analysts, engineers, designers) over a 4-stage pipeline with LLM assistance, including a saturation test — if the leading system already scored >90% on a task, it was sent back for hardening (~45% of tasks were). Each task carries ~39.3 weighted criteria across four axes; about half target factual accuracy. Criteria are positive (desirable properties) or negative (pitfalls), with the harshest penalties reserved for harmful medical content (down to −500).

Axis	Weight range	Avg. criteria per task
Factual Accuracy	−500 to +20	20.5
Breadth & Depth of Analysis	−100 to +10	8.6
Presentation Quality	−50 to +20	5.6
Citation Quality	−150 to +10	4.8
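As a quick arithmetic check on the table, the snippet below computes each axis's share of the per-task criteria, using only the figures listed above. The per-axis values sum to 39.5, slightly different from the ~39.3 quoted in the text; this page does not explain the difference, and rounding in the reported averages is one possible cause (an assumption).

```python
# Average number of criteria per task, copied from the table above.
avg_criteria = {
    "Factual Accuracy": 20.5,
    "Breadth & Depth of Analysis": 8.6,
    "Presentation Quality": 5.6,
    "Citation Quality": 4.8,
}

total = sum(avg_criteria.values())

# Share of each axis, as a percentage of all criteria on an average task.
shares = {axis: round(100 * n / total, 1) for axis, n in avg_criteria.items()}

for axis, pct in shares.items():
    print(f"{axis}: {pct}%")
print(f"total criteria per task: {total:.1f}")
```

Factual Accuracy comes out at roughly 52% of the criteria, consistent with the statement that about half of them target factual accuracy.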

Grading uses an open-source LLM-as-a-judge protocol: the judge issues a binary MET/UNMET verdict for each criterion, and the verdicts are aggregated into a weighted normalized score (0–100%) and a pass rate. The judge is Gemini-3-Pro (chosen via an internal human–LLM alignment study); GPT-5.2 and Sonnet-4.5, used as alternative judges, corroborate the results. Rankings are stable across judges; absolute magnitudes vary.
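A minimal sketch of how per-criterion verdicts could be turned into the two reported metrics. The exact formulas are not given on this page, so both are assumptions: the normalized score is taken here as the sum of weights of MET criteria divided by the sum of positive weights, clipped to 0–100%, and a criterion is counted as "passed" when a positive criterion is MET or a negative one is UNMET. The names `Criterion`, `normalized_score`, and `pass_rate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    description: str
    weight: float  # positive = desirable property, negative = pitfall


def normalized_score(criteria: list[Criterion], verdicts: list[bool]) -> float:
    """Assumed formula: earned weight over maximum positive weight, clipped to [0, 100]."""
    if len(criteria) != len(verdicts):
        raise ValueError("need exactly one verdict per criterion")
    max_points = sum(c.weight for c in criteria if c.weight > 0)
    if max_points <= 0:
        raise ValueError("rubric needs at least one positive criterion")
    # A MET negative criterion subtracts its penalty from the earned total.
    earned = sum(c.weight for c, met in zip(criteria, verdicts) if met)
    return 100.0 * min(1.0, max(0.0, earned / max_points))


def pass_rate(criteria: list[Criterion], verdicts: list[bool]) -> float:
    """Assumed definition: share of criteria with the desired verdict."""
    if len(criteria) != len(verdicts):
        raise ValueError("need exactly one verdict per criterion")
    passed = sum(1 for c, met in zip(criteria, verdicts) if (c.weight > 0) == met)
    return 100.0 * passed / len(criteria)


# Toy rubric (illustrative weights, not taken from the benchmark).
rubric = [
    Criterion("States the correct figure", 20),
    Criterion("Compares all requested entities", 10),
    Criterion("Uses the requested output format", 10),
    Criterion("Cites a primary source", 5),
    Criterion("Gives harmful advice", -50),
]

clean = [True, True, False, True, False]    # pitfall avoided
harmful = [True, True, False, True, True]   # pitfall triggered

print(normalized_score(rubric, clean), pass_rate(rubric, clean))
print(normalized_score(rubric, harmful), pass_rate(rubric, harmful))
```

Under these assumptions a single heavily weighted negative criterion can drive the normalized score to zero while the pass rate drops only slightly, which is one reason to report both metrics.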

Headline results

Perplexity Deep Research leads every domain and every rubric axis. Overall scores for the deep-research systems and two bare-model-plus-tools baselines:

System	Normalized score (%)	Pass rate (%)
Perplexity Deep Research (Opus 4.6)	70.5	72.8
Perplexity Deep Research (Opus 4.5)	67.2	70.9
Gemini Deep Research	59.0	62.7
OpenAI Deep Research (o3)	52.1	56.9
OpenAI Deep Research (o4-mini)	41.9	48.0
Claude Opus 4.6 (bare + tools)	59.8	63.1
Claude Opus 4.5 (bare + tools)	46.7	50.2

Three findings that matter for this wiki:

Orchestration > base model. Perplexity (Opus 4.6 base) beats bare Opus 4.6-with-tools by ~10pp — see Deep Research Agents. A live counter-datapoint to Harness Shrinkage as Models Improve.
Claude Opus 4.6 is the strongest non-Perplexity system (59.8% / 63.1%), ahead of Gemini Deep Research and both OpenAI configs. Opus 4.6 ranks second (non-Perplexity) in 5 of 10 domains.
Factual accuracy and citation quality are the weakest axes for every system; presentation quality is the strongest everywhere. The Perplexity-vs-second gap is largest in Finance (21.6pp) and smallest in Law (1.6pp).
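The gaps quoted in these findings can be recomputed from the results table. The snippet below uses only the numbers listed above; the variable and function names are illustrative.

```python
# (normalized score %, pass rate %) copied from the results table.
scores = {
    "Perplexity Deep Research (Opus 4.6)": (70.5, 72.8),
    "Perplexity Deep Research (Opus 4.5)": (67.2, 70.9),
    "Gemini Deep Research": (59.0, 62.7),
    "OpenAI Deep Research (o3)": (52.1, 56.9),
    "OpenAI Deep Research (o4-mini)": (41.9, 48.0),
    "Claude Opus 4.6 (bare + tools)": (59.8, 63.1),
    "Claude Opus 4.5 (bare + tools)": (46.7, 50.2),
}


def normalized_gap(system_a: str, system_b: str) -> float:
    """Difference in normalized score, in percentage points."""
    return round(scores[system_a][0] - scores[system_b][0], 1)


# Orchestration gap: same base model, with and without the Perplexity harness.
gap_46 = normalized_gap("Perplexity Deep Research (Opus 4.6)", "Claude Opus 4.6 (bare + tools)")
gap_45 = normalized_gap("Perplexity Deep Research (Opus 4.5)", "Claude Opus 4.5 (bare + tools)")

# Non-Perplexity systems ranked by normalized score, best first.
non_perplexity = sorted(
    (name for name in scores if not name.startswith("Perplexity")),
    key=lambda name: scores[name][0],
    reverse=True,
)

print(f"Opus 4.6 orchestration gap: {gap_46}pp")
print(f"Opus 4.5 orchestration gap: {gap_45}pp")
print("Non-Perplexity ranking:", non_perplexity)
```

The Opus 4.6 gap works out to 10.7pp, matching the ~10pp figure above. The same subtraction for Opus 4.5 gives 20.5pp, so on these numbers the gap is smaller for the newer base model, though with only two data points this says little about any trend.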

Limitations (the paper's own)

Single-turn only (no clarifying-question / multi-turn capability tested); a static snapshot despite an automatable refresh pipeline; text-only (no multimodal); English-only; augmentation risks over-specifying away natural query variability; rubric creation still needs heavy human-expert involvement; and absolute scores depend on the LLM judge (though rankings don't). System-level (black-box) evaluation — no component-level attribution of retrieval vs. planning vs. synthesis.

Connections

Deep Research Agents — the system class DRACO evaluates; home of the orchestration / verification / efficiency findings
Production-Sourced Evaluation — DRACO's central methodological contribution: tasks built from real de-identified production traffic
LLM-as-a-Judge — the rubric-based binary-verdict grading protocol DRACO uses
Task Time-Horizon Scaling — sibling capability benchmark; where METR measures task length a model sustains, DRACO measures research-report quality of agentic systems, and both note benchmark-saturation pressure (DRACO's saturation test discards >90%-solved tasks)
Harness Shrinkage as Models Improve — DRACO's orchestration-beats-bare-model result is a counter-datapoint to the shrinking-harness thesis
Verification as the New Bottleneck — factual-accuracy weakness across all systems is verification surfacing inside the research product
Evals as Product Spec — DRACO is the externalized, large-scale form of "evals as the definition of done," with rubrics standing in for the eval set
Perplexity / Anthropic / Google DeepMind — benchmark author; makers of evaluated systems and the judge model

Open questions

The benchmark is static; the construction pipeline is automatable. Will Perplexity actually refresh it, and does a vendor-built benchmark on which the vendor's own product wins stay credible over time?
Rankings are judge-stable but magnitudes aren't — how much do absolute scores move under a non-Gemini judge, and does that matter for cross-paper comparison?
Does the production-sourced, expert-rubric method generalize cheaply to non-English, multimodal, and multi-turn deep research?

Sources

DRACO: a Cross-Domain Benchmark for Deep Research Accuracy, Completeness, and Objectivity — full paper (arXiv:2602.11685, Perplexity + Harvard, Feb 2026): task construction, rubric pipeline, grading protocol, and all result tables