General-purpose large language models outperform specialized clinical AI tools on medical benchmarks, Nature Medicine study finds

A recent Nature Medicine study found that general-purpose large language models such as GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6 outperformed specialized clinical AI tools like OpenEvidence and UpToDate Expert AI across multiple medical benchmarks: MedQA (medical knowledge), HealthBench (clinician alignment), and a real clinical queries benchmark (RCQ) built from physician questions in live settings.

In the RCQ evaluation, 12 US clinicians conducted blinded reviews of 1,800 model-question annotations and found that frontier general-purpose LLMs were superior to specialized clinical AI tools in both accuracy and clinical reasoning.

The study authors concluded that, despite their institutional legitimacy and perceived safety, specialized clinical AI tools currently do not outperform state-of-the-art general-purpose LLMs on knowledge, communication, or clinical alignment. This underscores the need for independent real-world evaluation before such tools are deployed in clinical environments.
