Humans still have important roles to play in medicine, experts stress
In some of medicine’s toughest cases, the hardest part isn’t choosing the right diagnosis. It’s thinking of it at all. Artificial intelligence may now be better at that than doctors, a new study suggests.
“We’re witnessing a really profound change in technology that will reshape medicine,” Harvard University biomedical data scientist Arjun Manrai said in an April 28 news conference.
That change is driven by advances in large language models, the same technology OpenAI’s ChatGPT is built on. New versions, called reasoning models, can work through complex problems step by step. As of 2025, 1 in 5 doctors and nurses worldwide used AI for a second opinion on complex cases, and over half want to use it for this purpose, according to a survey of more than 2,000 clinicians. But how well the technology works in a medical setting has been debated.
Manrai and colleagues tested OpenAI’s o1-preview model on a range of medical cases, including classic sets of symptoms used in medical training as well as real-world data taken directly from the charts of 76 patients who visited an emergency room in Boston. Across those clinical reasoning tests, the AI model was more likely than physicians to include the correct diagnosis, or something very close to it, among its possible answers, the researchers report April 30 in Science.
Not all researchers are convinced that this means we should trust AI with our diagnoses, arguing that AI reasoning is still far from what human doctors can do. “When we say clinical reasoning, it doesn’t mean the same thing as moral reasoning,” says Arya Rao, a researcher at Harvard Medical School, who was not involved in the study. “These models have been optimized to do this kind of sequential thought that we call reasoning, but it’s not at all the same thing as how we teach medical students to reason.”
Manrai is not opposed to the critique, noting AI technology should assist rather than replace people in medical roles. “Ultimately, I think humans want humans to guide them … through challenging treatment decisions,” he said.
Is AI better at medical diagnoses?
An AI model outperforms doctors on identifying correct diagnoses
Researchers looked at three methods for diagnosing patient cases: AI models built on large language models (dark blue), specialized software for determining a diagnosis (light blue) and human clinicians (brown). The AI reasoning model o1-preview outperformed them all, including the correct diagnosis in its response almost 80 percent of the time. Some of these data came from prior studies, so not all of the systems were looking at the exact same cases. But all of the systems examined some subset of a long-running series of challenging real-world patient cases published in the New England Journal of Medicine.
Still, the results show that this type of AI “works for making diagnoses in the real world,” coauthor Adam Rodman, a doctor at Beth Israel Deaconess Medical Center in Boston, said at the news conference.
He described a patient who came into the emergency room with what seemed like routine respiratory symptoms but who had recently undergone an organ transplant and was immunosuppressed. The patient turned out to have a dangerous flesh-eating infection requiring surgery. “The model actually was suspicious of this [infection] from the very beginning, probably 12 to 24 hours before the human physician would have become suspicious of this,” Rodman said.
Rao applauds the team for presenting [AI] “as an extension of a physician, not a replacement.” She calls the study “rigorous and thoughtful.” However, she does not think there’s enough evidence to say that AI models have aced clinical reasoning.
Her team released a study April 13 that tested 21 AI models at each step of the process toward reaching a diagnosis. Reasoning models got the highest scores overall. But when Rao’s team drilled down to identify which parts of the diagnostic process were trickiest for AI, the researchers found a weak point that persisted from the oldest models to the newest. That’s the process of considering several different uncertain diagnoses.
AI models based on LLMs tend to jump to conclusions. “Their reasoning is brittle precisely where uncertainty and nuance matter most,” Rao and her team wrote in their paper. Their conclusion was that LLMs are not yet ready to make decisions in medical settings.
These two studies evaluated different AI models in different ways. Yet, the results aren’t as opposed as they may seem on the surface, both teams say. They agree that the next step should be more research.
Manrai’s team is planning clinical trials to help answer the question: “How do we safely and thoughtfully integrate [AI] into care?” Rao likes that approach. So many people “don’t have enough access to care,” she says. Someday, she notes, “I think AI can be a great equalizer.”
Facts Only
A study published April 30 in *Science* tested OpenAI’s o1-preview AI model on medical diagnosis tasks.
The study was led by Arjun Manrai, a biomedical data scientist at Harvard University.
The AI was evaluated using classic medical training cases and real-world data from 76 patients at an emergency room in Boston.
The AI model included the correct diagnosis or a close variant in nearly 80% of cases, outperforming human clinicians.
A separate survey found that by 2025, 1 in 5 doctors and nurses worldwide used AI for second opinions on complex cases.
Arya Rao, a researcher at Harvard Medical School not involved in the study, published a paper on April 13 testing 21 AI models on diagnostic reasoning steps.
Rao’s study identified a weakness in AI models’ ability to handle multiple uncertain diagnoses.
The AI model in Manrai’s study flagged a flesh-eating infection in one patient probably 12 to 24 hours before a human physician would have become suspicious, according to coauthor Adam Rodman.
Both research teams agree that more studies are needed to assess AI’s role in clinical settings.
Manrai’s team plans clinical trials to explore safe integration of AI into medical care.
Full Take
This study lands in a fraught but rapidly evolving debate about AI’s role in medicine. The strongest version of the narrative—AI as a diagnostic equalizer—rests on compelling data: the model outperformed humans in controlled tests and even caught a life-threatening infection earlier than physicians. The methodology appears rigorous, using both standardized cases and real-world data, though the sample size (76 patients) and the lack of direct comparison across all systems leave room for skepticism. The counterpoint from Rao’s team is equally valid: AI’s "brittle reasoning" under uncertainty is a critical flaw, especially in medicine, where nuance and moral judgment matter as much as pattern recognition.
Patterns detected: none. The article avoids emotional exploitation or distortion, presenting both the promise and limitations of AI in medicine. However, the root cause of the tension lies in an unstated assumption: that diagnostic accuracy is the sole metric of success. In reality, medicine requires trust, empathy, and adaptive reasoning—qualities AI lacks. The implications are profound: AI could democratize expertise in underserved areas, but overreliance risks eroding patient-doctor relationships and introducing new biases from training data.
Key questions remain: How do we measure AI’s diagnostic value beyond accuracy? What safeguards are needed to prevent overconfidence in AI suggestions? And how do we ensure AI augments, rather than replaces, human judgment? The next step should be longitudinal studies in real clinical settings, not just controlled tests, to assess AI’s impact on patient outcomes and physician workflows.
Counterstrike scan: A coordinated influence campaign might exaggerate AI’s readiness to replace doctors or downplay its limitations. This article does neither, presenting a balanced view with clear acknowledgment of unresolved challenges. No structural alignment with manipulation tactics is detected.
