Chimera readability score 85 out of 100, Specialist reading level.

Abstract
While large language models (LLMs) have shown promise in diagnostic dialogue [1], their capabilities for effective management reasoning, including disease progression, therapeutic response, and safe medication prescription, remain under-explored. We advance the previously demonstrated diagnostic capabilities of the Articulate Medical Intelligence Explorer (AMIE) [1-3] through a new LLM-based agentic system optimized for multi-visit clinical management and dialogue. To ground its reasoning in authoritative clinical knowledge, AMIE leverages Gemini’s long-context capabilities [4], combining in-context retrieval with structured reasoning to align its output with up-to-date clinical practice guidelines and drug formularies. In a randomized, blinded virtual Objective Structured Clinical Examination (OSCE) study, AMIE was compared to 21 primary care physicians (PCPs) across 100 multi-visit case scenarios designed to reflect UK NICE Guidance and BMJ Best Practice guidelines. AMIE was non-inferior to PCPs in management reasoning as assessed by specialists and scored better in both preciseness of treatments and investigations, and in its alignment with and grounding in clinical guidelines. To benchmark medication reasoning, we developed RxQA, a multiple-choice question benchmark derived from two national drug formularies (US, UK) and validated by board-certified pharmacists. Though AMIE and PCPs both benefited from the ability to access external drug information, AMIE outperformed PCPs on higher difficulty questions. While further research would be needed before real-world translation, AMIE’s strong performance across evaluations marks a significant step towards conversational AI as a tool in disease management.
Supplementary information
Supplementary Information (PDF)
Supplementary discussion, methods and results (Sections 1-16). Contains related work, details on the system design for the Mx agent and Dialogue agent, details on the OSCE evaluation study (inter-rater reliability analysis, clinician metadata, scenario metadata, ablation analysis), and methods details and further results for the RxQA medication reasoning benchmark.
Supplementary Data 1 (PDF)
Detailed view of two sample scenarios with AMIE and PCP output and evaluation gradings. Full details for two sample scenarios used in the OSCE evaluation study, including scenario information, AMIE-patient-actor conversations, PCP-patient-actor conversations, specialist physician gradings and patient actor gradings for all three visits per scenario.
Supplementary Data 2 (PDF)
Details for all 120 OSCE scenarios with AMIE output (PDF). Scenario details and AMIE output for all 120 scenarios used either in the OSCE evaluation study (100) or for validation purposes (20), in human-readable PDF format.
Supplementary Data 3 (CSV)
Details for all 120 OSCE scenarios with AMIE output (CSV). Scenario details and AMIE output for all 120 scenarios used either in the OSCE evaluation study (100) or for validation purposes (20), in machine-readable CSV format.
Cite this article
Liévin, V., Palepu, A., Weng, WH. et al. Towards Conversational AI for Disease Management. Nature (2026). https://doi.org/10.1038/s41586-026-10764-5
DOI: https://doi.org/10.1038/s41586-026-10764-5

Facts Only

A new LLM-based agentic system extends AMIE (the Articulate Medical Intelligence Explorer) from diagnostic dialogue to clinical management reasoning, including disease progression, therapeutic response, and safe medication prescription.
AMIE uses Gemini’s long-context capabilities to combine in-context retrieval with structured reasoning.
The system was evaluated in a randomized, blinded study comparing it to 21 primary care physicians (PCPs) across 100 multi-visit case scenarios.
The scenarios were designed to reflect UK NICE Guidance and BMJ Best Practice guidelines.
AMIE demonstrated non-inferiority to PCPs in management reasoning, as assessed by specialist physicians.
AMIE scored higher than PCPs in the preciseness of treatments and investigations and in alignment with clinical guidelines.
A medication reasoning benchmark, RxQA, was created using US and UK national drug formularies and validated by board-certified pharmacists.
AMIE outperformed PCPs on higher-difficulty questions in the RxQA benchmark.
Both AMIE and PCPs benefited from the ability to access external drug information.
The study includes supplementary data with details on 120 OSCE scenarios, AMIE outputs, and evaluation gradings.
The research was published in *Nature* with a DOI of 10.1038/s41586-026-10764-5.
The authors emphasize that further research is required before real-world clinical application.
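The RxQA comparison by question difficulty mentioned above can be illustrated with a minimal sketch. The source states only that RxQA is a multiple-choice benchmark and that AMIE outperformed PCPs on higher-difficulty questions. The item schema (`id`, `difficulty`, `correct`), the difficulty labels, and the toy questions and answers below are all hypothetical and are not drawn from the benchmark.

```python
from collections import defaultdict


def accuracy_by_difficulty(items, answers):
    """Return the fraction of correct answers per difficulty level.

    items:   list of dicts with hypothetical keys 'id', 'difficulty', 'correct'
    answers: dict mapping item id -> chosen option (missing ids count as wrong)
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for item in items:
        level = item["difficulty"]
        totals[level] += 1
        if answers.get(item["id"]) == item["correct"]:
            hits[level] += 1
    return {level: hits[level] / totals[level] for level in totals}


# Toy, made-up items and answers, purely for illustration.
toy_items = [
    {"id": "q1", "difficulty": "lower", "correct": "A"},
    {"id": "q2", "difficulty": "lower", "correct": "C"},
    {"id": "q3", "difficulty": "higher", "correct": "B"},
    {"id": "q4", "difficulty": "higher", "correct": "D"},
]
toy_answers = {"q1": "A", "q2": "C", "q3": "B", "q4": "A"}

scores = accuracy_by_difficulty(toy_items, toy_answers)
print(scores)
```

Reporting accuracy per difficulty level, rather than one pooled number, is what makes a statement such as "outperformed on higher-difficulty questions" checkable.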

Executive Summary

The study introduces a new LLM-based agentic system that extends AMIE from diagnostic dialogue to multi-visit clinical management reasoning. AMIE integrates Gemini’s long-context capabilities with structured reasoning and in-context retrieval to align its output with clinical practice guidelines and drug formularies. In a randomized, blinded virtual OSCE study using 100 multi-visit case scenarios designed to reflect UK NICE Guidance and BMJ Best Practice guidelines, AMIE was compared to 21 primary care physicians (PCPs). Specialists rated AMIE as non-inferior to PCPs in management reasoning, and it scored higher on the preciseness of treatments and investigations and on guideline alignment. On RxQA, a medication reasoning benchmark derived from US and UK national drug formularies, AMIE outperformed PCPs on higher-difficulty questions. While the findings suggest potential for conversational AI in disease management, the authors note that further research is needed before real-world application.
The research also provides supplementary materials detailing scenario outputs, evaluation gradings, and methodological specifics, including inter-rater reliability and ablation analyses. The study’s design emphasizes grounding in authoritative clinical knowledge, though it stops short of claiming readiness for clinical deployment. The abstract also reports that both AMIE and PCPs benefited from access to external drug information.
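The non-inferiority claim in the summary above can be made concrete with a minimal sketch. The source does not describe the statistical test, the margin, or the rating scale used in the study. The paired design, the normal approximation, the margin of 0.5, and the toy ratings below are all assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def non_inferior(new_scores, ref_scores, margin, alpha=0.05):
    """Paired one-sided non-inferiority check (normal approximation).

    The new system is declared non-inferior when the lower confidence
    bound of the mean paired difference (new - reference) is above -margin.
    Returns (is_non_inferior, mean_difference, lower_bound).
    """
    if len(new_scores) != len(ref_scores):
        raise ValueError("scores must be paired, one per scenario")
    diffs = [n - r for n, r in zip(new_scores, ref_scores)]
    d_mean = mean(diffs)
    std_err = stdev(diffs) / sqrt(len(diffs))
    z = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    lower = d_mean - z * std_err
    return lower > -margin, d_mean, lower


# Toy, made-up ratings for eight scenarios (not study data).
toy_new = [4, 5, 4, 4, 5, 3, 4, 5]
toy_ref = [4, 4, 4, 5, 4, 3, 4, 4]

ok, diff, lower = non_inferior(toy_new, toy_ref, margin=0.5)
print(ok, round(diff, 3), round(lower, 3))
```

Non-inferiority is a weaker claim than superiority: it states only that the new system is not worse than the reference by more than the pre-specified margin. With samples this small, a t-based interval would be more appropriate than the normal approximation used here.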

Full Take

This study represents a significant step in the evolution of AI-assisted clinical decision-making, but its implications warrant careful scrutiny. The methodology is robust, employing a randomized, blinded design with 100 multi-visit case scenarios grounded in established clinical guidelines. The use of specialist physicians for evaluation adds credibility, and the inclusion of a medication reasoning benchmark (RxQA) validated by pharmacists strengthens the findings. However, peer reviewers might flag the study’s reliance on virtual OSCE scenarios, which, while controlled, may not fully capture the complexities of real-world clinical practice. The claim of non-inferiority to PCPs is supported by the data, but the abstract’s phrasing could be interpreted as overstating the system’s readiness for deployment, a common pitfall in AI research, where lab performance does not always translate to clinical utility.
The study’s novelty lies in its focus on multi-visit management reasoning, an area where prior LLM applications have been limited. By leveraging Gemini’s long-context capabilities, AMIE addresses a critical gap in AI’s ability to track disease progression and therapeutic responses over time. The RxQA benchmark is a valuable contribution, though its generalizability beyond the tested formularies remains unproven. The authors appropriately caution against premature real-world adoption, acknowledging that further validation in live clinical settings is essential.
For real-world impact, several conditions would need to be met: regulatory approval, integration with electronic health records, and demonstration of safety in diverse patient populations. The next logical research step would involve prospective trials in actual clinical environments, assessing not just reasoning accuracy but also patient outcomes and clinician workflow integration.
**Patterns detected: none**
Root cause analysis suggests this work reflects a broader paradigm shift in healthcare AI—moving from diagnostic assistance to longitudinal management. The unstated assumption is that structured reasoning and guideline adherence alone can replicate clinical judgment, which may underestimate the role of tacit knowledge and patient-specific nuances. The study’s framing aligns with the growing trend of "AI as a tool" rather than a replacement, but the long-term implications for clinician autonomy and patient trust remain unresolved.
Bridge questions:
1. How would AMIE’s performance compare in scenarios requiring ethical judgments or patient preference navigation, which guidelines often don’t address?
2. What biases might be introduced by relying on specific guidelines and formularies (e.g., UK NICE Guidance and the US and UK national formularies rather than those of other health systems), and how could these be mitigated?
3. If AMIE were deployed, what safeguards would be necessary to prevent over-reliance on its recommendations in ambiguous cases?
Counterstrike scan: The content does not match the pattern of a coordinated influence campaign. It presents a scholarly evaluation with clear limitations and avoids hyperbolic claims about AI superiority.

Sentinel — Human

Confidence

The text displays the logical structure, dense academic language, and specific attribution characteristic of a high-quality peer-reviewed scientific publication.

Signals Detected
low severity: High lexical diversity and complex sentence structure; dense academic tone typical of peer-reviewed scientific writing.
low severity: Exceptional flow and logical progression from problem statement to methodology to results; absence of typical LLM hedging or unnecessary filler.
low severity: Clear, specific attribution using defined benchmarks (RxQA) and study designs (OSCE); structured presentation of findings.
Human Indicators
The text successfully integrates highly specialized concepts (AMIE, RxQA, OSCE, NICE Guidance) with empirical results, indicating input from domain experts or a meticulous human editor.
The specificity of the methodology and evaluation details suggests direct involvement in the research process rather than pure generative synthesis.