As regular readers well know, I get very frustrated when people use the verb “to reason” in describing the behavior of large language models (LLMs). Sometimes that’s just verbal shorthand, but both in print and in person I keep running into examples of people who really, truly, believe that these things are going through a reasoning process. They are not. None of them. (Edit: for a deep dive into this topic, see this recent paper).
To bring this into the realm of medical science, have a look at this paper from earlier this year. The authors evaluated six different LLM systems on their ability to answer 68 medical questions. The crucial test, though, was that each question was asked twice, in two different ways. Every prompt started by saying “You are an experienced physician. Provide detailed step-by-step reasoning, then conclude with your final answer in exact format Answer: [Letter]” The prompt was written that way because each question consisted of a detailed medical query followed by a list of possible options/diagnoses/recommendations, each labeled with a letter, and the LLM was asked to choose among them.
The first time the question was asked, one of the five options was “Reassurance”, i.e. “Don’t do any medical procedure, because this is not actually a problem”. Any practicing physician will recognize this as a valid option at times! But the second time the exact same question was posed, the “Reassurance” option was replaced by a “None of the other answers” option. Now, the step-by-step clinical reasoning that one would hope for should not be altered in the slightest by that change, and if “Reassurance” was in fact the correct answer, then “None of the other answers” should be the correct answer when the question was phrased the second way (rather than the range of surgical and other interventions proposed in the other choices).
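For concreteness, here is a minimal sketch of what that paired test looks like in code. This is my illustration, not the paper’s actual evaluation harness: ask_llm() is a hypothetical stand-in for whichever chat API you have access to, and the question stem and options are placeholders; only the instruction text above comes from the paper.

```python
# Minimal sketch of the paired-prompt test described above. Illustrative only:
# ask_llm() is a hypothetical wrapper around whatever chat API you use, and the
# stem/options are placeholders, not items from the paper's actual question set.
import re

SYSTEM = ("You are an experienced physician. Provide detailed step-by-step "
          "reasoning, then conclude with your final answer in exact format "
          "Answer: [Letter]")

def build_prompt(stem, options):
    # options is a list of (letter, text) pairs, e.g. [("A", "Reassurance"), ...]
    return "\n".join([stem] + [f"{letter}. {text}" for letter, text in options])

def extract_letter(response):
    # Pull the final "Answer: X" out of the model's free-text reply
    m = re.search(r"Answer:\s*\[?([A-E])\]?", response)
    return m.group(1) if m else None

def paired_test(ask_llm, stem, options, reassurance_letter):
    # Variant 1: one of the options is "Reassurance"
    first = extract_letter(ask_llm(SYSTEM, build_prompt(stem, options)))
    # Variant 2: identical question, with that option swapped for
    # "None of the other answers"
    swapped = [(l, "None of the other answers" if l == reassurance_letter else t)
               for l, t in options]
    second = extract_letter(ask_llm(SYSTEM, build_prompt(stem, swapped)))
    # Genuine step-by-step reasoning should return the same letter both times
    return first, second
```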
Instead, the accuracy of the answers across all 68 questions dropped notably in every single LLM system when the “None of the other answers” option was presented. DeepSeek-R1 was the most resilient, but it still degraded. The underlying problem is clear: no reasoning is going on, despite some of these systems being billed as having reasoning ability. Instead, this is all pattern matching, which presents the illusion of thought and the illusion of competence.
This overview at Nature Medicine covers a range of such problems. The authors here find that the latest GPT-5 version does in fact make fewer errors than other systems, but that’s like saying that a given restaurant has fewer cockroaches floating in its soup than its competitors do. That’s my analogy, not theirs. The latest models hallucinate a bit less than before and break their own supposed rules a bit less, but neither of these problems has been reduced to an acceptable level. The acceptable level of cockroaches in the soup pot is zero.
As an example of that second problem, the authors note that GPT-5, like all the other LLMs, will violate its own instructional hierarchy to deliver an answer, without warning users that this has happened. Supposed safeguards and rules at the system level can and do get disregarded as the software rattles around searching for plausible text to deliver, a problem which is explored in detail here. This is obviously not a good feature in an LLM that is supposed to be dispensing medical advice - as the authors note, such systems should have high-level rules that are never to be violated, things like “Sudden onset of chest pain = always call for emergency evaluation” or “Recommendations for dispensing drugs on the attached list must always fit the following guidelines”. But at present it seems impossible for that “always” to actually stick under real-world conditions. No actual physician whose work was this unreliable would or should be allowed to continue practicing.
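The authors don’t propose an implementation, but it’s worth spelling out what a rule that can never be violated would have to look like: a deterministic check sitting outside the text generator entirely, rather than an instruction in a system prompt that the model may or may not honor. A toy sketch, with made-up trigger terms of my own:

```python
# Toy sketch only - not from the paper. The point is that a hard rule has to live
# in ordinary deterministic code wrapped around the model, not in a system prompt.
RED_FLAG_TERMS = ("sudden onset of chest pain", "crushing chest pain")  # made-up list
EMERGENCY_ADVICE = ("Sudden-onset chest pain can be an emergency. "
                    "Call for emergency evaluation now rather than relying on this tool.")

def apply_hard_rules(user_message: str, llm_response: str) -> str:
    """Post-check the model's output against rules that must always hold."""
    if any(term in user_message.lower() for term in RED_FLAG_TERMS):
        # This branch cannot be "talked out of" by anything the model generated
        return EMERGENCY_ADVICE
    return llm_response
```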
LLMs are text generators, working on probabilities of what their next word choice should be based on what has been seen in their training sets, then dispensing answer-shaped nuggets in smooth, confident, grammatical form. This is not reasoning and it is not understanding - at its best, it is an illusion that can pass for them. And that’s what it is at its worst, too.
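To make that mechanism concrete, here is a toy version of next-word generation: sample the next word from a probability table conditioned on the recent context. A real LLM computes its probabilities with a transformer over a vast vocabulary rather than from a hand-written table like the one below, but the loop is the same shape, and nowhere in it is anything you could call reasoning.

```python
# Toy illustration of next-word generation: sample from a probability distribution
# conditioned on recent context. The hand-written table stands in for what a real
# LLM computes with a trained transformer; it is purely illustrative.
import random

NEXT_WORD_PROBS = {
    ("chest", "pain"): [("requires", 0.4), ("suggests", 0.35), ("resolves", 0.25)],
    ("pain", "requires"): [("urgent", 0.6), ("no", 0.4)],
    ("pain", "suggests"): [("angina", 0.5), ("reflux", 0.5)],
}

def generate(prompt_words, max_words, probs=NEXT_WORD_PROBS):
    words = list(prompt_words)
    for _ in range(max_words):
        context = tuple(words[-2:])          # condition on the two most recent words
        choices = probs.get(context)
        if not choices:                      # no plausible continuation in the table
            break
        tokens, weights = zip(*choices)
        words.append(random.choices(tokens, weights=weights)[0])
    return " ".join(words)

print(generate(["chest", "pain"], 3))  # fluent-looking output, no understanding behind it
```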