General-purpose large language models outperform specialized clinical AI tools on medical benchmarks, Nature Medicine study finds
A recent Nature Medicine study found that general-purpose large language models such as GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6 outperformed specialized clinical AI tools such as OpenEvidence and UpToDate Expert AI across multiple medical benchmarks: MedQA (medical knowledge), HealthBench (clinician alignment), and RCQ, a real clinical queries benchmark built from questions physicians asked in live settings.
In the RCQ evaluation, 12 US clinicians conducted blinded reviews of 1,800 model‑question annotations and rated frontier general-purpose LLMs above specialized clinical AI tools on both accuracy and clinical reasoning.
The study authors concluded that specialized clinical AI tools, despite their institutional legitimacy and perceived safety, do not currently outperform state-of-the-art general-purpose LLMs on knowledge, communication, or clinical alignment. This, they concluded, underscores the need for independent real-world evaluation before such tools are deployed in clinical environments.