
MIRA and AMIE: AI systems rival doctors in Nature studies

Two Nature papers show MIRA and Google’s AMIE match or beat physicians on simulated cases.

The Brieftide

TL;DR

  • Two Nature papers show MIRA and Google’s AMIE match or beat physicians on simulated cases.
  • MIRA and Google's AMIE matched or outperformed physicians in two Nature studies published Jun 18, 2026.
  • Both papers ran structured simulations rather than live clinics and used public or actor-provided inputs to measure decisions.

MIRA and Google's AMIE matched or outperformed physicians in two Nature studies published Jun 18, 2026. MIRA reached the correct diagnosis in 88.9 percent of more than 500 emergency department cases from the MIMIC-IV dataset, spanning eight disease categories. AMIE's first-visit plans were rated appropriate in 95 percent of 100 multi-visit cases, versus 72 percent for 21 primary care physicians.

What did the studies test and find?

Both papers ran structured simulations rather than live clinics and used public or actor-provided inputs to measure decisions. MIRA was tested on more than 500 real emergency cases from MIMIC-IV and reached the correct diagnosis in 88.9 percent of them across eight disease categories. In a head-to-head subset of 311 cases, MIRA scored 87.8 percent, compared with 78.1 percent for four experienced specialists and 71.1 percent for a mixed team of residents and specialists. AMIE faced 100 multi-visit cases played by actors. Independent reviewers rated AMIE's overall first-visit plan appropriate in 95 percent of cases, versus 72 percent for the 21 primary care physicians in the study.
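The reported gaps are easier to compare as percentage points. The following is a minimal Python sketch that uses only the percentages quoted above. The variable names and the helper `gap_in_points` are illustrative assumptions, not anything from either paper.

```python
# Percentages as reported in the article; all names here are illustrative.
MIRA_SUBSET = 87.8        # MIRA, 311-case head-to-head subset
SPECIALISTS = 78.1        # four experienced specialists, same subset
MIXED_TEAM = 71.1         # residents and specialists, same subset
AMIE_FIRST_VISIT = 95.0   # AMIE first-visit plans rated appropriate
PCP_FIRST_VISIT = 72.0    # 21 primary care physicians, same measure


def gap_in_points(ai_pct: float, human_pct: float) -> float:
    """Return the AI-minus-human difference in percentage points."""
    return round(ai_pct - human_pct, 1)


gaps = {
    "MIRA vs specialists": gap_in_points(MIRA_SUBSET, SPECIALISTS),
    "MIRA vs mixed team": gap_in_points(MIRA_SUBSET, MIXED_TEAM),
    "AMIE vs primary care physicians": gap_in_points(AMIE_FIRST_VISIT, PCP_FIRST_VISIT),
}

for label, points in gaps.items():
    print(f"{label}: {points:+.1f} percentage points")
```

The MIRA figures measure diagnostic accuracy and the AMIE figures measure how often plans were rated appropriate, so the gaps are only meaningful within each study, not across them.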

Performance varied by condition. MIRA scored 98.6 percent on appendicitis and 92.3 percent on pancreatitis, while both AI and clinicians performed worse on pneumonia and urinary tract infections, where MIRA scored 72.4 percent and 77.6 percent respectively. Reviewers found no dangerous drug interactions or incorrect renal dosing in MIRA's recommendations, and MIRA did not miss any cases that required hospital admission.

How do MIRA and AMIE work?

MIRA runs as an autonomous agent inside a sealed virtual electronic health record, able to choose from more than 85,000 options across eleven tools to take histories, order labs and imaging, interpret results, generate differential diagnoses, and write treatment plans. The researchers tested MIRA with a second AI agent acting as the patient and published the source code on GitHub.

AMIE uses a two-agent design: a conversational front end for patient interaction and a background agent that cross-references cases against clinical guidelines. Google's team benchmarked AMIE against the UK's NICE Guidance and BMJ Best Practice, and also created a drug-knowledge benchmark called RxQA, based on two national formularies and verified by licensed pharmacists. AMIE in the study ran on Google's Gemini 1.5 Flash, while MIRA used OpenAI's GPT-4o and o1-preview as base models.
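The two-agent pattern described for AMIE can be illustrated with a toy Python sketch. Everything in it is a placeholder assumption: the guideline table, the keyword-based front end, and all names are hypothetical and do not reflect AMIE's actual prompts, models, or guideline retrieval. It shows only the general shape of a front-end agent drafting a plan and a background agent checking that plan against a reference.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for a guideline source. The real system was benchmarked
# against NICE Guidance and BMJ Best Practice, which are not modeled here.
TOY_GUIDELINES = {
    "condition_a": {"treatment_x"},
    "condition_b": {"treatment_y", "treatment_z"},
}


@dataclass
class DraftPlan:
    condition: str
    treatments: list[str]
    flags: list[str] = field(default_factory=list)


def front_end_agent(patient_message: str) -> DraftPlan:
    """Toy conversational agent: maps a message to a draft plan by keyword."""
    if "symptom_a" in patient_message:
        return DraftPlan("condition_a", ["treatment_x", "treatment_q"])
    return DraftPlan("condition_b", ["treatment_y"])


def background_agent(plan: DraftPlan) -> DraftPlan:
    """Toy checker: flags treatments that the toy guideline table does not list."""
    allowed = TOY_GUIDELINES.get(plan.condition, set())
    plan.flags = [t for t in plan.treatments if t not in allowed]
    return plan


def consult(patient_message: str) -> DraftPlan:
    """Run the front end, then pass its draft through the background check."""
    return background_agent(front_end_agent(patient_message))


reviewed = consult("patient reports symptom_a")
print(reviewed.condition, reviewed.treatments, reviewed.flags)
```

Separating the drafting step from the checking step means the reference source can change without touching the conversational side, which is one plausible reason for such a split.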

Why does the scaffolding around models matter?

The AMIE paper shows the specialized architecture delivered the biggest gains when paired with an older base model. With Gemini 1.5 Flash the two-agent, guideline-checking setup gave large advantages. When the researchers replaced the base model with Gemini 2.5 Flash, the AMIE system's advantage nearly vanished. The authors note that newer general-purpose models such as Gemini 2.5 Pro, o3, and GPT-5 already score "largely comparable" to the full AMIE system on the RxQA drug test.

Those findings imply that the scaffolding compensates for weaknesses in older models by forcing structured reasoning and guideline citation, and that it becomes less valuable as base models improve. MIRA also includes components meant to connect AI to hospital clinical systems, a feature the paper says would not become obsolete with stronger models.

What are the limits and cautions?

Both teams warned against overinterpreting results. The MIRA authors said the system recommended "care that deviated from best practices" for a "small but non-zero" share of patients, and that simulated patient answers may have been "more structured than real speech of patients in emergency departments." The MIRA study also cannot rule out that MIMIC-IV data was included in the base models' training data, which would inflate measured performance. Google's AMIE team called the work a "milestone" but said the case selection and text-only conversations do not reflect a real clinic and that the system is "not ready for real-world translation" because of potential "latent reasoning errors." Jakob Kather, a MIRA co-developer, said, "We are getting a preview of how AI could transform medicine."

Independent experts emphasized the simulation gap. One called the work "some remove from the messy, complex, human world of everyday healthcare," and another argued much of the advantage reflected the precision and completeness of structured plans rather than unequivocal clinical correctness.

What to watch

Look for replication in live clinical workflows and prospective trials that use unstructured patient interactions rather than actor text chats. Also watch whether teams update their systems to newer base models or publish comparisons showing how performance shifts when models like Gemini 2.5 Flash, Gemini 2.5 Pro, o3, or GPT-5 are used instead of older checkpoints.

Study comparison: MIRA, AMIE and physician baselines

| Item | MIRA | AMIE | Physician baselines |
| --- | --- | --- | --- |
| Dataset / cases tested | More than 500 ED cases from MIMIC-IV | 100 multi-visit cases played by actors | 21 primary care physicians across 100 cases; specialists in 311-case head-to-head |
| Primary accuracy/performance | 88.9% correct diagnosis across eight disease categories | 95% of first-visit plans rated appropriate | 72% of first-visit plans rated appropriate (21 PCPs); 78.1% specialists in 311-case head-to-head |
| Head-to-head subset | 311-case subset: 87.8% correct | n/a | Four experienced specialists: 78.1%; mixed team: 71.1% (311-case subset) |
| Notable condition results | Appendicitis 98.6%; Pancreatitis 92.3%; Pneumonia 72.4%; UTI 77.6% | Outscored physicians on plan accuracy and guideline adherence | Lower accuracy on some conditions; preferred less often than AMIE in actor and reviewer ratings |
| Base models and tooling | Uses GPT-4o and o1-preview; agent inside sealed EHR; 85,000+ options across eleven tools; source on GitHub | Two-agent design; ran on Gemini 1.5 Flash; RxQA benchmark used | Human clinicians operating within standard practice; compared to guideline benchmarks |
| Known limits | "Small but non-zero" deviations from best practice; possible MIMIC-IV overlap with training | "Not ready for real-world translation"; latent reasoning errors; gains shrink with Gemini 2.5 Flash | Study context differs from real clinical complexity; structured tests favor complete plans |

Written by The Brieftide · Source: The Decoder
