Overview
- HealthBench, OpenAI's first major independent healthcare initiative, includes 5,000 physician-designed dialogues to test AI models' medical question-answering capabilities.
- Developed with input from 262 doctors across 60 countries, the dataset covers 49 languages and 26 medical specialties, and it includes 1,000 challenging cases on which AI models struggled.
- OpenAI's o3 model achieved the highest score at 60%, followed by Grok at 54% and Gemini 2.5 Pro at 52%; the models showed notable weaknesses in context awareness and completeness.
- Experts raised concerns about the use of AI-based grading, warning that it may mask errors shared by the models being tested and the AI graders, and called for expanded human oversight and subgroup testing.
- HealthBench aims to standardize AI evaluation in healthcare, but experts emphasize the need for further validation to ensure safety and reliability in diverse global contexts.