MIRA and AMIE: AI systems rival doctors in Nature studies
Two Nature papers show MIRA and Google’s AMIE match or beat physicians on simulated cases.
TL;DR
- Two Nature papers show MIRA and Google’s AMIE match or beat physicians on simulated cases.
- MIRA and Google's AMIE matched or outperformed physicians in two Nature studies published Jun 18, 2026.
- Both papers ran structured simulations rather than live clinics and used public or actor-provided inputs to measure decisions.
MIRA and Google's AMIE matched or outperformed physicians in two Nature studies published Jun 18, 2026. MIRA reached the correct diagnosis in 88.9 percent of more than 500 emergency department cases from the MIMIC-IV dataset, spanning eight disease categories. AMIE's first-visit plans were rated appropriate in 95 percent of 100 multi-visit cases, versus 72 percent for 21 primary care physicians.
What did the studies test and find?
Both papers ran structured simulations rather than live clinics and used public or actor-provided inputs to measure decisions. MIRA was tested on more than 500 real emergency cases from MIMIC-IV and reached the correct diagnosis in 88.9 percent of them across eight disease categories. In a head-to-head subset of 311 cases, MIRA scored 87.8 percent, compared with 78.1 percent for four experienced specialists and 71.1 percent for a mixed team of residents and specialists. AMIE faced 100 multi-visit cases played by actors. Independent reviewers rated AMIE's overall first-visit plan appropriate in 95 percent of cases, versus 72 percent for the 21 primary care physicians in the study.
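The reported scores are easier to compare as percentage-point gaps. The following is a minimal Python sketch that uses only the figures quoted in this article; the names `mira_subset`, `amie_first_visit`, and `gaps_vs` are illustrative and do not come from either paper.

```python
# Accuracy figures as reported in the article (percent).
mira_subset = {"MIRA": 87.8, "specialists": 78.1, "mixed team": 71.1}
amie_first_visit = {"AMIE": 95.0, "primary care physicians": 72.0}


def gaps_vs(system: str, scores: dict[str, float]) -> dict[str, float]:
    """Percentage-point lead of `system` over every other entry in `scores`."""
    base = scores[system]
    return {
        name: round(base - value, 1)
        for name, value in scores.items()
        if name != system
    }


mira_gaps = gaps_vs("MIRA", mira_subset)
amie_gaps = gaps_vs("AMIE", amie_first_visit)
print(mira_gaps)  # {'specialists': 9.7, 'mixed team': 16.7}
print(amie_gaps)  # {'primary care physicians': 23.0}
```

The two studies measure different things (diagnostic accuracy for MIRA, plan appropriateness for AMIE), so the gaps describe each study separately and should not be ranked against each other.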
Both systems showed strengths and weaknesses by condition. MIRA scored 98.6 percent on appendicitis and 92.3 percent on pancreatitis. Both MIRA and the clinicians performed worse on pneumonia and urinary tract infections, where MIRA scored 72.4 percent and 77.6 percent respectively. Reviewers found no dangerous drug interactions or incorrect renal dosing in MIRA's recommendations, and MIRA did not miss any cases that required hospital admission.
How do MIRA and AMIE work?
MIRA runs as an autonomous agent inside a sealed virtual electronic health record, able to choose from more than 85,000 options across eleven tools to take histories, order labs and imaging, interpret results, generate differential diagnoses, and write treatment plans. The researchers tested MIRA with a second AI agent acting as the patient and published the source code on GitHub.
AMIE uses a two-agent design: a conversational front end for patient interaction and a background agent that cross-references cases against clinical guidelines. Google's team benchmarked AMIE against the UK's NICE Guidance and BMJ Best Practice, and also created a drug-knowledge benchmark called RxQA, based on two national formularies and verified by licensed pharmacists. AMIE in the study ran on Google's Gemini 1.5 Flash, while MIRA used OpenAI's GPT-4o and o1-preview as base models.
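The two-agent pattern described for AMIE can be illustrated with a toy example. This is a sketch of the general idea only, not Google's implementation: the function names, the `Plan` type, and the `GUIDELINES` table are hypothetical, the guideline entries are placeholders rather than real clinical guidance, and the language-model call is replaced by a fixed stub.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for a guideline source; the entries are placeholders,
# not real clinical guidance.
GUIDELINES = {
    "condition-a": {"first-line-step", "follow-up-check"},
}


@dataclass
class Plan:
    condition: str
    steps: set[str]
    missing: set[str] = field(default_factory=set)


def dialogue_agent(patient_message: str) -> Plan:
    """Front end: turns the conversation into a draft plan (stubbed here)."""
    # A real system would call a language model; this stub returns a fixed draft.
    return Plan(condition="condition-a", steps={"first-line-step"})


def guideline_agent(plan: Plan) -> Plan:
    """Background check: flags guideline steps the draft plan leaves out."""
    required = GUIDELINES.get(plan.condition, set())
    plan.missing = required - plan.steps
    return plan


reviewed = guideline_agent(dialogue_agent("placeholder patient message"))
print(sorted(reviewed.missing))  # ['follow-up-check']
```

Keeping the guideline check in a separate step means the draft plan can be audited against a fixed reference regardless of which model produced it.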
Why does the scaffolding around models matter?
The AMIE paper shows the specialized architecture delivered the biggest gains when paired with an older base model. With Gemini 1.5 Flash the two-agent, guideline-checking setup gave large advantages. When the researchers replaced the base model with Gemini 2.5 Flash, the AMIE system's advantage nearly vanished. The authors note that newer general-purpose models such as Gemini 2.5 Pro, o3, and GPT-5 already score "largely comparable" to the full AMIE system on the RxQA drug test.
Those findings imply the scaffolding compensates for weaknesses in older models by forcing structured reasoning and guideline citation, but the scaffolding becomes less valuable as base models improve. MIRA includes parts meant to connect AI to hospital clinical systems, a feature the paper says would not become obsolete with stronger models.
What are the limits and cautions?
Both teams warned against overinterpreting results. The MIRA authors said the system recommended "care that deviated from best practices" for a "small but non-zero" share of patients, and that simulated patient answers may have been "more structured than real speech of patients in emergency departments." The study cannot rule out that MIMIC-IV was included in training data, which would inflate measured performance. Google's AMIE team called the work a "milestone" but said the case selection and text-only conversations do not reflect a real clinic and that the system is "not ready for real-world translation" because of potential "latent reasoning errors." Jakob Kather, a MIRA co-developer, said, "We are getting a preview of how AI could transform medicine."
Independent experts emphasized the simulation gap. One called the work "some remove from the messy, complex, human world of everyday healthcare," and another argued much of the advantage reflected the precision and completeness of structured plans rather than unequivocal clinical correctness.
What to watch
Look for replication in live clinical workflows and prospective trials that use unstructured patient interactions rather than actor text chats. Also watch whether teams update their systems to newer base models or publish comparisons showing how performance shifts when models like Gemini 2.5 Flash, Gemini 2.5 Pro, o3, or GPT-5 are used instead of older checkpoints.
| Item | MIRA | AMIE | Physician comparators |
|---|---|---|---|
| Dataset / cases tested | More than 500 ED cases from MIMIC-IV | 100 multi-visit cases played by actors | 21 primary care physicians across 100 cases; specialists in 311-case head-to-head |
| Primary accuracy/performance | 88.9% correct diagnosis across eight disease categories | 95% of first-visit plans rated appropriate | 72% of first-visit plans rated appropriate (21 PCPs); 78.1% specialists in 311-case head-to-head |
| Head-to-head subset | 311-case subset: 87.8% correct | n/a | Four experienced specialists: 78.1%; mixed team: 71.1% (311-case subset) |
| Notable condition results | Appendicitis 98.6%; Pancreatitis 92.3%; Pneumonia 72.4%; UTI 77.6% | Outscored physicians on plan accuracy and guideline adherence | Lower accuracy on some conditions; preferred less often than AMIE in actor and reviewer ratings |
| Base models and tooling | Uses GPT-4o and o1-preview; agent inside sealed EHR; 85,000+ options across eleven tools; source on GitHub | Two-agent design; ran on Gemini 1.5 Flash; RxQA benchmark used | Human clinicians operating within standard practice; compared to guideline benchmarks |
| Known limits | "Small but non-zero" deviations from best practice; possible MIMIC-IV overlap with training | "Not ready for real-world translation"; latent reasoning errors; gains shrink with Gemini 2.5 Flash | Study context differs from real clinical complexity; structured tests favor complete plans |
Written by The Brieftide · Source: The Decoder