Cancer patients are increasingly turning to large language models (LLMs) for medical information, making it critical to assess how well these models handle complex, personalized questions. However, current medical benchmarks focus on medical exams or consumer-searched questions and do not evaluate LLMs on real patient questions with patient details. In this paper, we first have three hematology-oncology physicians evaluate cancer-related questions drawn from real patients. While LLM responses are generally accurate, the models frequently fail to recognize or address false presuppositions in the questions, posing risks to safe medical decision-making. To study this limitation systematically, we introduce Cancer-Myth, an expert-verified adversarial dataset of 585 cancer-related questions with false presuppositions. On this benchmark, no frontier LLM — including GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet — corrects these false presuppositions more than 43% of the time. To study mitigation strategies, we further construct a 150-question Cancer-Myth-NFP set, in which physicians confirm the absence of false presuppositions. We find typical mitigation strategies, such as adding precautionary prompts with GEPA optimization, can raise accuracy on Cancer-Myth to 80%, but at the cost of misidentifying presuppositions in 41% of Cancer-Myth-NFP questions and causing a 10% relative performance drop on other medical benchmarks. These findings highlight a critical gap in the reliability of LLMs, show that prompting alone is not a reliable remedy for false presuppositions, and underscore the need for more robust safeguards in medical AI systems.
We selected 25 oncology-related questions from CancerCare's website, focusing specifically on treatment advice and side effects that require medical expertise. Three oncology physicians evaluated answers provided by three leading language models (GPT-4-Turbo, Gemini-1,5-Pro, Llama-3.1-405B) and human medical social workers, assessing each response at both the overall and paragraph levels. Frontier language models generally outperformed human social workers, though physicians noted LLM answers were often overly generic and frequently failed to correct false assumptions embedded in patient questions. Such uncorrected misconceptions pose significant risks to patient safety in medical contexts. However, there is currently insufficient data to systematically evaluate this issue, motivating the creation of our specialized Cancer-Myth dataset.

To evaluate LLM and medical agent performance in handling patient questions with embedded misconceptions, we compile a collection of 994 common cancer myths and develop an adversarial Cancer-Myth of 585 examples. We initialized the adversarial datasets with a few failure examples from our previous CancerCare study. Using an LLM generator, we created patient questions for each myth, integrating false presuppositions with complex patient details to challenge the models. The LLM responder answers these questions, while a verifier evaluates the response's ability to address false presuppositions effectively. Responses that fail to correct the presuppositions are added to the adversarial set, while successful ones are placed in the non-adversarial set, for use in subsequent generator prompting rounds.

We perform three separate runs over the entire set of myths, each targeting GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet, respectively. The generated questions are categorized into 7 categories (as below), and then finally reviewed by physicians to ensure their relevance and reliability. During this process, we also collect a 150-question Cancer-Myth-NFP set, questions that LLMs flagged as containing false presuppositions but that physicians confirmed contain no false presuppositions. This set lets us measure how often a mitigation strategy over-corrects on benign questions.

The statistics of categories and data generators in Cancer-Myth are listed below.

We employ a three-point scoring rubric evaluated by GPT-4o:
We compute two metrics: the average Presupposition Correction Score (PCS) and the proportion of fully corrected answers, Presupposition Correction Rate (PCR). We evaluate 17 models across the GPT, Claude, Gemini, DeepSeek, LLaMA, and Qwen families, plus the multi-agent medical framework MDAgents.
GPT-5 performs best with a PCR of 42.1%, closely followed by Gemini-2.5-Pro (41.4%) and Claude-4-Sonnet (40.0%). Still, no frontier LLM corrects false presuppositions in more than 43% of cases. Surprisingly, GPT-4o achieves only 5.8% PCR despite being strong at general medical QA, and MDAgents performs worse than standalone GPT-4o, indicating agent-based orchestration does not prevent LLMs from overlooking false presuppositions.

Cross-model analysis reveals asymmetries in adversarial effectiveness. Questions generated by Gemini-1.5-Pro result in the lowest PCS scores across all evaluated models, indicating that its adversarial prompts are the most universally challenging.
In contrast, prompts generated by GPT-4o are less effective at misleading other models, particularly Gemini-1.5-Pro, which maintains a near-zero PCS score (0.03) when evaluated on GPT-4o-generated data.

Models consistently fail on questions involving misconceptions about limited treatment options and inevitable side effects. These misconceptions often reflect rigid or emotionally charged beliefs, such as assuming that a specific cancer can only be treated with surgery, or that an advanced-stage diagnosis means no treatment is available.
More capable models tend to perform better on other categories, such as Causal Misattribution, Underestimated Risk, and No Symptom, No Disease.
Can we just tell the model to watch out for false presuppositions? We evaluate two mitigation strategies:
The Cancer-Myth-NFP set is critical here: it lets us measure whether a mitigation has merely made the model over-cautious, flagging false presuppositions where there are none.
GEPA pushes Gemini-2.5-Pro from 41% to 88% on Cancer-Myth, but its accuracy on Cancer-Myth-NFP collapses from 96% to 68%, with 5–15% relative drops on the other medical benchmarks. The monitoring agent is even more extreme: GPT-4o w/ MDAgents jumps from 2% to 81% on Cancer-Myth, but its NFP accuracy crashes from 90% to 35%. It now flags false presuppositions in most benign questions. Prompting alone is not a reliable remedy.
GEPA and agent-based precaution improve Cancer-Myth accuracy but reduce performance on every other benchmark, especially Cancer-Myth-NFP.
@inproceedings{zhu2026cancermyth,
title={{C}ancer-{M}yth: Evaluating Large Language Models on Patient Questions with False Presuppositions},
author={Wang Bill Zhu and Tianqi Chen and Xinyan Velocity Yu and Ching Ying Lin and Jade Law and Mazen Jizzini and Jorge J. Nieva and Ruishan Liu and Robin Jia},
booktitle={The Fourteenth International Conference on Learning Representations (ICLR)},
year={2026},
}