Research·March 6, 2026·4 min read

DR. INFO Sets a New Clinical Standard - Topping the World's Toughest Clinical AI Benchmark

The clinical standard has been set. DR. INFO has topped the world's toughest clinical AI benchmark - beating GPT-5, GPT-5.2, Gemini 2.5 Pro, Grok 3, Claude 3.7 Sonnet, and every clinical AI system on the planet. Built on real evidence. Cited in real time. Designed for real medicine.

The benchmark results are in. Published. Unambiguous.

The Results

0.68

DR. INFO

0.46

GPT-5

0.42

GPT-5.2

0.23

Grok 3

DR. INFO scored 0.68 on OpenAI's HealthBench Hard subset - a benchmark of 1,000 clinically complex scenarios graded by physician rubrics across clinical accuracy, reasoning depth, safety alignment, and communication quality. This is the highest score achieved by any AI system on this benchmark.

Why HealthBench Hard Matters

HealthBench Hard is not a multiple-choice exam. It evaluates AI systems on realistic, risk-sensitive clinical scenarios where ambiguity, incomplete information, and patient safety are central. Developed by OpenAI and graded by physician panels, it represents the most rigorous public evaluation of clinical AI capability available today.

Traditional benchmarks like USMLE test factual recall in structured formats. HealthBench Hard tests what actually matters in clinical practice - contextual judgment, uncertainty navigation, safety signaling, and evidence-grounded reasoning under real-world conditions.

How DR. INFO Outperforms Frontier Models

General-purpose language models generate answers from learned patterns. DR. INFO retrieves and synthesizes answers from authoritative medical sources - clinical guidelines, peer-reviewed literature, FDA documentation, and drug databases - in real time. Every response is grounded in evidence and accompanied by transparent citations.

This retrieval-augmented, evidence-first architecture is why DR. INFO achieves clinical accuracy that general-purpose models cannot match. When the benchmark requires grounded clinical reasoning rather than pattern completion, the difference is decisive.

Published and Verifiable

Both benchmarks referenced in this evaluation are publicly available for independent verification.

[1]Arora R K et al. HealthBench: Evaluating Large Language Models Toward Improved Human Health. OpenAI, 2025.

[2]DR.INFO: Clinical Reasoning Capabilities of LLMs through Evidence Grounded Medical Search. 2025.

Experience the world's leading clinical AI platform. Try DR. INFO for free and see why it outperforms every frontier model on real clinical tasks.

Try DR. INFO for Free→