Can AI help doctors avoid missed diagnoses? A new study suggests yes

Humans still have important roles to play in medicine, experts stress
In some of medicine’s toughest cases, the hardest part isn’t choosing the right diagnosis. It’s thinking of it at all. Artificial intelligence may now be better at that than doctors, a new study suggests.
“We’re witnessing a really profound change in technology that will reshape medicine,” Harvard University biomedical data scientist Arjun Manrai said in an April 28 news conference.
That change is driven by advances in large language models, the same technology OpenAI’s ChatGPT is built on. New versions, called reasoning models, can work through complex problems step by step. As of 2025, 1 in 5 doctors and nurses worldwide used AI for a second opinion on complex cases, and over half want to use it for this purpose, according to a survey of more than 2,000 clinicians. But how well the technology works in a medical setting has been debated.
Manrai and colleagues tested OpenAI’s o1-preview model on a range of medical cases, including classic sets of symptoms used in medical training as well as real-world data directly from the charts of 76 patients who visited an emergency room in Boston. Across those clinical reasoning tests, the AI model was more likely than physicians to include the correct diagnosis, or something very close to it, among its possible answers, the researchers report April 30 in Science.
Not all researchers are convinced that this means we should trust AI with our diagnoses, arguing that AI reasoning is still far from what human doctors can do. “When we say clinical reasoning, it doesn’t mean the same thing as moral reasoning,” says Arya Rao, a researcher at Harvard Medical School, who was not involved in the study. “These models have been optimized to do this kind of sequential thought that we call reasoning, but it’s not at all the same thing as how we teach medical students to reason.”
Manrai is not opposed to the critique, noting AI technology should assist rather than replace people in medical roles. “Ultimately, I think humans want humans to guide them … through challenging treatment decisions,” he said.
Is AI better at medical diagnoses?
An AI model outperforms doctors on identifying correct diagnoses
Researchers looked at three methods for diagnosing patient cases: AI models built on large language models, specialized software for determining a diagnosis and human clinicians. The AI reasoning model o1-preview outperformed them all, including the correct diagnosis in its response almost 80 percent of the time. Some of these data came from prior studies, so not all of the systems were looking at the exact same cases. But all of the systems examined some subset of a long-running series of challenging real-world patient cases published in the New England Journal of Medicine.
Still, the results show that this type of AI “works for making diagnoses in the real world,” coauthor Adam Rodman, a doctor at Beth Israel Deaconess Medical Center in Boston, said at the news conference.
He described a patient who came into the emergency room with what seemed like routine respiratory symptoms but who had recently undergone an organ transplant and was immunosuppressed. The patient turned out to have a dangerous flesh-eating infection requiring surgery. “The model actually was suspicious of this [infection] from the very beginning, probably 12 to 24 hours before the human physician would have become suspicious of this,” Rodman said.
Rao applauds the team for presenting [AI] “as an extension of a physician, not a replacement.” She calls the study “rigorous and thoughtful.” However, she does not think there’s enough evidence to say that AI models have aced clinical reasoning.
Her team released a study April 13 that tested 21 AI models at each step of the process toward reaching a diagnosis. Reasoning models got the highest scores overall. But when Rao’s team drilled down to identify which parts of the diagnostic process were trickiest for AI, the researchers found a weak point that persisted from the oldest models to the newest. That’s the process of considering several different uncertain diagnoses.
AI models based on LLMs tend to jump to conclusions. “Their reasoning is brittle precisely where uncertainty and nuance matter most,” Rao and her team wrote in their paper. Their conclusion was that LLMs are not yet ready to make decisions in medical settings.
These two studies evaluated different AI models in different ways. Yet, the results aren’t as opposed as they may seem on the surface, both teams say. They agree that the next step should be more research.
Manrai’s team is planning clinical trials to help answer the question: “How do we safely and thoughtfully integrate [AI] into care?” Rao likes that approach. So many people “don’t have enough access to care,” she says. Someday, she notes, “I think AI can be a great equalizer.”

Facts Only

A study published April 30 in *Science* tested OpenAI’s o1-preview AI model on medical diagnosis tasks.
The study was led by Arjun Manrai, a biomedical data scientist at Harvard University.
The AI was evaluated using classic medical training cases and real-world data from 76 patients at an emergency room in Boston.
The AI model included the correct diagnosis or a close variant in nearly 80% of cases, outperforming human clinicians.
A separate survey found that by 2025, 1 in 5 doctors and nurses worldwide used AI for second opinions on complex cases.
Arya Rao, a researcher at Harvard Medical School not involved in the study, published a paper on April 13 testing 21 AI models on diagnostic reasoning steps.
Rao’s study identified a weakness in AI models’ ability to handle multiple uncertain diagnoses.
According to coauthor Adam Rodman, the AI model in Manrai’s study flagged a flesh-eating infection in one patient probably 12 to 24 hours before a human physician would have become suspicious.
Both research teams agree that more studies are needed to assess AI’s role in clinical settings.
Manrai’s team plans clinical trials to explore safe integration of AI into medical care.

Executive Summary

A new study published in *Science* suggests that AI models, specifically OpenAI’s o1-preview, may outperform human physicians in identifying correct diagnoses in complex medical cases. Researchers tested the AI on both classic medical training scenarios and real-world patient data from an emergency room in Boston, finding that the AI included the correct diagnosis or a close variant in nearly 80% of cases—higher than human clinicians. The study highlights the potential of AI as a diagnostic aid, particularly in cases where rare or overlooked conditions might be missed. However, other researchers, like Arya Rao of Harvard Medical School, caution that AI reasoning lacks the nuance and uncertainty management of human clinical judgment, particularly in weighing multiple uncertain diagnoses. While the AI showed promise in specific cases, such as flagging a flesh-eating infection earlier than a human physician, concerns remain about overreliance on models that may "jump to conclusions." Both sides agree that further research, including clinical trials, is needed to determine how AI can be safely integrated into medical practice without replacing human expertise.

Full Take

This study lands in a fraught but rapidly evolving debate about AI’s role in medicine. The strongest version of the narrative—AI as a diagnostic equalizer—rests on compelling data: the model outperformed humans in controlled tests and even caught a life-threatening infection earlier than physicians. The methodology appears rigorous, using both standardized cases and real-world data, though the sample size (76 patients) and the lack of direct comparison across all systems leave room for skepticism. The counterpoint from Rao’s team is equally valid: AI’s "brittle reasoning" under uncertainty is a critical flaw, especially in medicine, where nuance and moral judgment matter as much as pattern recognition.
Patterns detected: none. The article avoids emotional exploitation or distortion, presenting both the promise and limitations of AI in medicine. However, the root cause of the tension lies in an unstated assumption: that diagnostic accuracy is the sole metric of success. In reality, medicine requires trust, empathy, and adaptive reasoning—qualities AI lacks. The implications are profound: AI could democratize expertise in underserved areas, but overreliance risks eroding patient-doctor relationships and introducing new biases from training data.
Key questions remain: How do we measure AI’s diagnostic value beyond accuracy? What safeguards are needed to prevent overconfidence in AI suggestions? And how do we ensure AI augments, rather than replaces, human judgment? The next step should be longitudinal studies in real clinical settings, not just controlled tests, to assess AI’s impact on patient outcomes and physician workflows.
Counterstrike scan: A coordinated influence campaign might exaggerate AI’s readiness to replace doctors or downplay its limitations. This article does neither, presenting a balanced view with clear acknowledgment of unresolved challenges. No structural alignment with manipulation tactics is detected.

Sentinel — Human

Confidence

This analysis exhibits characteristics of high-quality human-authored journalism, effectively synthesizing complex academic debate while maintaining specific, traceable citations.

Signals Detected

Moderate sentence length variance and varied rhetorical phrasing, suggesting human editorial influence.

The text successfully navigates complex, competing academic viewpoints, demonstrating contextual nuance beyond simple summation.

Specific, verifiable references to researchers (Manrai, Rao, Rodman) and precise study outcomes (dates, specific model names) indicate deep grounding.

No immediate signs of LLM confabulation; all claims are linked back to attributed research or named experts.

Human Indicators

The interplay between opposing expert opinions (e.g., Manrai's optimism vs. Rao's skepticism) is rich and nuanced.

The article successfully bridges technical AI concepts with ethical and clinical reasoning debates, which requires careful human synthesis.

The specific citations and cross-referencing of multiple studies (e.g., the April 13 study) suggest rigorous, non-generic sourcing.