AI Safety · June 18, 2026 · 4 min read

MIRA and AMIE: AI systems rival doctors in Nature studies

Two Nature papers show MIRA and Google’s AMIE match or beat physicians on simulated cases.

The Brieftide · June 18, 2026

TL;DR

01 Two Nature papers show MIRA and Google's AMIE match or beat physicians on simulated cases.
02 MIRA and Google's AMIE matched or outperformed physicians in two Nature studies published June 18, 2026.
03 Both papers ran structured simulations rather than live clinics and used public or actor-provided inputs to measure decisions.

MIRA and Google's AMIE matched or outperformed physicians in two Nature studies published June 18, 2026. MIRA reached the correct diagnosis in 88.9 percent of more than 500 emergency department cases from the MIMIC-IV dataset, spanning eight disease categories. AMIE's first-visit plans were rated appropriate in 95 percent of 100 multi-visit cases, versus 72 percent for 21 primary care physicians.

What did the studies test and find?

Both papers ran structured simulations rather than live clinics and used public or actor-provided inputs to measure decisions. MIRA was tested on more than 500 real emergency cases from MIMIC-IV and reached the correct diagnosis in 88.9 percent of them across eight disease categories. In a head-to-head subset of 311 cases, MIRA scored 87.8 percent, compared with 78.1 percent for four experienced specialists and 71.1 percent for a mixed team of residents and specialists. AMIE faced 100 multi-visit cases played by actors. Independent reviewers rated its overall plan at the first visit appropriate in 95 percent of cases, versus 72 percent for the 21 primary care physicians in the study.
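The head-to-head results are easier to compare as percentage-point gaps. The short Python sketch below recomputes those gaps from the figures quoted above; the dictionary and function names are illustrative and do not come from either paper.

```python
# Reported rates, in percent, as quoted in this brief.
# The variable and function names here are illustrative assumptions.
HEAD_TO_HEAD = {"MIRA": 87.8, "specialists": 78.1, "mixed team": 71.1}
FIRST_VISIT = {"AMIE": 95.0, "primary care physicians": 72.0}


def gap_in_points(system: float, baseline: float) -> float:
    """Difference between two rates, in percentage points."""
    return round(system - baseline, 1)


def gaps(scores: dict, system: str) -> dict:
    """Gap between `system` and every other entry in `scores`."""
    return {
        name: gap_in_points(scores[system], rate)
        for name, rate in scores.items()
        if name != system
    }


if __name__ == "__main__":
    print(gaps(HEAD_TO_HEAD, "MIRA"))
    print(gaps(FIRST_VISIT, "AMIE"))
```

Run as a script, this gives gaps of 9.7 and 16.7 points for MIRA over the specialists and the mixed team, and 23.0 points for AMIE over the primary care physicians. These are differences in reported rates on simulated cases, not estimates of real-world benefit.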

Performance varied by condition. MIRA scored 98.6 percent on appendicitis and 92.3 percent on pancreatitis, but only 72.4 percent on pneumonia and 77.6 percent on urinary tract infections, two conditions on which clinicians also performed worse. Reviewers found no dangerous drug interactions or incorrect renal dosing in MIRA's recommendations, and MIRA did not miss any cases that required hospital admission.

How do MIRA and AMIE work?

MIRA runs as an autonomous agent inside a sealed virtual electronic health record, able to choose from more than 85,000 options across eleven tools to take histories, order labs and imaging, interpret results, generate differential diagnoses, and write treatment plans. The researchers tested MIRA with a second AI agent acting as the patient and published the source code on GitHub.
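To make the idea of an agent choosing among tools concrete, here is a minimal Python sketch of a tool-dispatch loop. It is not MIRA's code, which the researchers published separately. The three tool names, the placeholder arguments, and the scripted policy that stands in for the language model are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical tools. MIRA's real tool set (eleven tools) is not reproduced here.
def take_history(detail: str) -> str:
    return f"history noted: {detail}"

def order_lab(test: str) -> str:
    return f"lab ordered: {test}"

def write_plan(summary: str) -> str:
    return f"plan written: {summary}"

TOOLS: Dict[str, Callable[[str], str]] = {
    "take_history": take_history,
    "order_lab": order_lab,
    "write_plan": write_plan,
}

Action = Optional[Tuple[str, str]]  # (tool name, argument), or None to stop
Policy = Callable[[List[str]], Action]


def run_episode(policy: Policy, max_steps: int = 10) -> List[str]:
    """Ask the policy for an action, dispatch it, and record the result."""
    transcript: List[str] = []
    for _ in range(max_steps):
        action = policy(transcript)
        if action is None:
            break
        name, argument = action
        if name not in TOOLS:
            transcript.append(f"error: unknown tool {name}")
            continue
        transcript.append(TOOLS[name](argument))
    return transcript


def scripted_policy(transcript: List[str]) -> Action:
    """Stand-in for the model: replays a fixed list of placeholder actions."""
    script = [
        ("take_history", "chief complaint"),
        ("order_lab", "test-1"),
        ("write_plan", "draft"),
    ]
    return script[len(transcript)] if len(transcript) < len(script) else None


if __name__ == "__main__":
    for line in run_episode(scripted_policy):
        print(line)
```

In a real agent the policy would be a model call that reads the transcript and picks the next tool and argument. The step cap and the unknown-tool branch are the kind of guardrails such a loop typically needs.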

AMIE uses a two-agent design: a conversational front end for patient interaction and a background agent that cross-references cases against clinical guidelines. Google's team benchmarked AMIE against the UK's NICE Guidance and BMJ Best Practice, and also created a drug-knowledge benchmark called RxQA, based on two national formularies and verified by licensed pharmacists. AMIE in the study ran on Google's Gemini 1.5 Flash, while MIRA used OpenAI's GPT-4o and o1-preview as base models.
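The two-agent pattern described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Google's implementation: the guideline table holds placeholder entries rather than real clinical guidance, and the names `draft_plan`, `check_against_guidelines`, and `consult` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

# Placeholder guideline table. The entries are NOT real clinical guidance.
GUIDELINES: Dict[str, Set[str]] = {
    "condition-a": {"step-1", "step-2"},
}


@dataclass
class Draft:
    condition: str
    steps: List[str]


def draft_plan(condition: str) -> Draft:
    """Stand-in for the conversational front end: returns an incomplete draft."""
    return Draft(condition=condition, steps=["step-1"])


def check_against_guidelines(draft: Draft) -> List[str]:
    """Stand-in for the background agent: list required steps the draft omits."""
    required = GUIDELINES.get(draft.condition, set())
    return sorted(required - set(draft.steps))


def consult(condition: str) -> Draft:
    """Draft a plan, then add whatever the guideline check flags as missing."""
    draft = draft_plan(condition)
    missing = check_against_guidelines(draft)
    return Draft(condition=draft.condition, steps=draft.steps + missing)


if __name__ == "__main__":
    print(consult("condition-a"))
```

Separating drafting from checking means the guideline table can be updated or audited without touching the conversational component, which is one plausible reason to split the work across two agents.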

Why does the scaffolding around models matter?

The AMIE paper shows the specialized architecture delivered the biggest gains when paired with an older base model. With Gemini 1.5 Flash the two-agent, guideline-checking setup gave large advantages. When the researchers replaced the base model with Gemini 2.5 Flash, the AMIE system's advantage nearly vanished. The authors note that newer general-purpose models such as Gemini 2.5 Pro, o3, and GPT-5 already score "largely comparable" to the full AMIE system on the RxQA drug test.

Those findings imply that the scaffolding compensates for weaknesses in older models by forcing structured reasoning and guideline citation, and that it becomes less valuable as base models improve. MIRA, by contrast, includes components meant to connect AI to hospital clinical systems, a feature the paper says would not become obsolete with stronger models.

What are the limits and cautions?

Both teams warned against overinterpreting results. The MIRA authors said the system recommended "care that deviated from best practices" for a "small but non-zero" share of patients, and that simulated patient answers may have been "more structured than real speech of patients in emergency departments." The study cannot rule out that MIMIC-IV was included in training data, which would inflate measured performance. Google's AMIE team called the work a "milestone" but said the case selection and text-only conversations do not reflect a real clinic and that the system is "not ready for real-world translation" because of potential "latent reasoning errors." Jakob Kather, a MIRA co-developer, said, "We are getting a preview of how AI could transform medicine."

Independent experts emphasized the simulation gap. One called the work "some remove from the messy, complex, human world of everyday healthcare," and another argued much of the advantage reflected the precision and completeness of structured plans rather than unequivocal clinical correctness.

What to watch

Look for replication in live clinical workflows and prospective trials that use unstructured patient interactions rather than actor text chats. Also watch whether teams update their systems to newer base models or publish comparisons showing how performance shifts when models like Gemini 2.5 Flash, Gemini 2.5 Pro, o3, or GPT-5 are used instead of older checkpoints.

Study comparison: MIRA, AMIE and physician baselines

Item	MIRA	AMIE	Physician baselines
Dataset / cases tested	More than 500 ED cases from MIMIC-IV	100 multi-visit cases played by actors	21 primary care physicians across 100 cases; specialists in 311-case head-to-head
Primary accuracy/performance	88.9% correct diagnosis across eight disease categories	95% of first-visit plans rated appropriate	72% of first-visit plans rated appropriate (21 PCPs); 78.1% specialists in 311-case head-to-head
Head-to-head subset	311-case subset: 87.8% correct	n/a	Four experienced specialists: 78.1%; mixed team: 71.1% (311-case subset)
Notable condition results	Appendicitis 98.6%; Pancreatitis 92.3%; Pneumonia 72.4%; UTI 77.6%	Outscored physicians on plan accuracy and guideline adherence	Lower accuracy on some conditions; preferred less often than AMIE in actor and reviewer ratings
Base models and tooling	Uses GPT-4o and o1-preview; agent inside sealed EHR; 85,000+ options across eleven tools; source on GitHub	Two-agent design; ran on Gemini 1.5 Flash; RxQA benchmark used	Human clinicians operating within standard practice; compared to guideline benchmarks
Known limits	"Small but non-zero" deviations from best practice; possible MIMIC-IV overlap with training	"Not ready for real-world translation"; latent reasoning errors; gains shrink with Gemini 2.5 Flash	Study context differs from real clinical complexity; structured tests favor complete plans

Written by The Brieftide · Source: The Decoder

