Language-Specific Differences in Large Language Model Diagnostic Reasoning: A Translation-Controlled Clinical Vignette Study.
PubMed2026-05-25
Background: Large language models (LLMs) are increasingly being evaluated for clinically relevant diagnostic tasks, yet their performance may vary across languages. We aimed to determine whether input language influences LLM diagnostic reasoning in vignette-based clinical tasks and to inform multilingual predeployment evaluation for non-English healthcare systems. Methods: In this translation-controlled in silico study, 30 real-patient clinical vignettes were presented in paired English- and Polish-language conditions using back-translated prompts and cases. Six LLMs were evaluated with a structured reflection framework adapted from medical education. The study included 720 rater-level evaluations and 360 unique model-language-vignette responses. Responses were independently scored by 2 physician raters, with major discrepancies adjudicated by a third physician. The primary outcome was total rubric score. Secondary outcomes included differential diagnosis quality, justification, appropriateness of additional examinations, final diagnosis, and triage accuracy. Exploratory analyses assessed the number and cost of recommended examinations. Results: The effect of language differed significantly by model. Qwen2.5, Llama3.3, Meditron3, and OpenBioLLM performed significantly better in English, with the largest gap observed for Qwen2.5. GPT-5 and Bielik showed no statistically detectable English-Polish difference in overall score in this sample. Language-related differences were most evident in differential diagnosis quality, justification, and examination planning rather than in final diagnosis alone. Exploratory economic analyses suggested model- and language-dependent differences in testing burden, with broader suggested workups generally associated with higher diagnostic costs. Language robustness was not a consistent property of clinically evaluated LLMs.
Performance differences were concentrated in reasoning and workup domains that are safety-relevant if these systems are used clinically. Conclusions: Multilingual clinical performance of LLMs is strongly model dependent. Language-specific evaluation should be considered before deployment in non-English healthcare systems.
Journal of clinical medicine
Show Your Work: Verbatim Evidence Requirements and Automated Assessment of Large Language Models for Biomedical Text Processing of Trial Eligibility Criteria.
PubMed2026-05-01
Introduction Large language models (LLMs) are used for biomedical text processing, but decisions are often hard to audit. We evaluated whether enforcing a mechanically checkable quote affects performance for trial eligibility-scope classification from abstracts. Methods We used 200 randomized controlled trials and provided models with the title and abstract. Trials were labeled with whether they allowed for the inclusion of patients with localized and/or metastatic disease. Flagship models from three vendors (OpenAI, Google, and Anthropic) were queried in two conditions: Label-only and label plus a verbatim supporting quote. Models could abstain if they deemed the abstract to not contain sufficient information. Each condition was repeated three times per abstract. Quotes were mechanically validated as exact substrings, and a separate judge step used an LLM to rate whether each quote supported the assigned label. Results Evidence requirements modestly reduced coverage, i.e., non-invalid non-abstained outputs (GPT-5.2 86.2% to 84.3%, Gemini 3 flash preview 98.3% to 92.8%, Claude Opus 4.5 96.0% to 94.5%) by increasing abstentions and, for Gemini, invalid outputs. Macro-F1 remained high but changed by model (slight gains for GPT-5.2 and Gemini, decrease for Claude). Labels were stable across repetitions (Fleiss' kappa 0.829 to 0.969). Mechanically valid quotes occurred in 83.3% to 91.2% of runs, yet only 48.0% to 78.8% of evidence-bearing predictions were judged semantically supported. Restricting to supported predictions increased macro-F1 at the cost of lower coverage. Conclusion Substring-verifiable quotes provide an automated audit trail and enable selective, higher-trust automation when applying LLMs to biomedical text processing. However, this approach introduces new failure modes and trades coverage for verifiability in a model-dependent way.
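The mechanical quote check described above can be sketched in a few lines. This is a minimal illustration, assuming a strict, case-sensitive match with no whitespace normalization; the function name `quote_is_verbatim` and the sample abstract are hypothetical and not taken from the study.

```python
def quote_is_verbatim(quote: str, abstract: str) -> bool:
    """Return True only if the quote is a non-empty exact substring of the abstract.

    Assumption: no case folding or whitespace normalization is applied,
    so any paraphrase or re-capitalization fails the check.
    """
    return bool(quote) and quote in abstract


# Hypothetical abstract text, used only to exercise the check.
abstract = "Patients with metastatic disease were randomized 1:1."

print(quote_is_verbatim("metastatic disease", abstract))  # exact span
print(quote_is_verbatim("Metastatic disease", abstract))  # capitalization differs
print(quote_is_verbatim("", abstract))                    # empty quote is rejected
```

A check like this only establishes that the quote exists in the source text. Whether the quote actually supports the label is a separate question, which is why the study adds a judge step.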
Evaluating Large Language Models for Automated Evidence Synthesis in Neuroimaging AI: A Multi-Model Benchmark.
PubMed2026-05-30
Background: Data extraction for systematic reviews is highly resource-intensive. This study evaluated four frontier large language models (LLMs) on complex structured metadata extraction from specialized neuroimaging artificial intelligence (AI) literature to determine their performance in automated evidence synthesis. Methods: We compared Google Gemini 3 Pro Preview, Anthropic Claude Opus 4.5, Perplexity Sonar Pro, and OpenAI GPT 5.2. Using a standardized prompt, each model extracted 22 variables from 91 peer-reviewed neuroimaging AI articles. The variables were stratified into low-, medium-, and high-complexity tiers. The performance was measured via the exact-match accuracy against a consensus-based expert ground truth. Results: The overall exact-match accuracy was moderate. Gemini 3 Pro Preview achieved the highest overall rate (56.4%), followed by Sonar Pro (52.1%), Claude Opus 4.5 (51.3%), and GPT 5.2 (46.5%). Gemini significantly outperformed all other models (p < 0.001). The performance declined dramatically as the variable complexity increased. Across models, the accuracy was 88.9-92.9% for low-complexity categorical fields, 47.0-63.3% for medium-complexity text extraction, and 2.7-15.5% for high-complexity variables requiring clinical judgment or multi-section synthesis. The most common type of error was misclassification. All four models scored 0% on the main performance metric, but this reflected a representational mismatch with the ground truth rather than extraction failure, indicating that the exact-match accuracy underestimates the true semantic performance. Conclusions: Frontier LLMs can effectively automate the retrieval of simple categorical data, but have serious difficulties with methodological variables that are complex. Although extraction can be fully automated for low-complexity fields, human review remains essential for context-dependent variables that require clinical judgment.
Phenotyping Prostate Cancer in a National Health System Using Large Language Models.
PubMed2026-04-01
Large language models (LLMs) may improve extraction of prognostic variables in prostate cancer from unstructured clinical text compared with traditional, rule-based natural language processing.
We used iterative prompt engineering with few-shot examples to develop LLM prompts for 30 phenotypes from prostate biopsy, radical prostatectomy (RP), and transurethral resection of the prostate (TURP) pathology reports, as well as magnetic resonance imaging (MRI) pelvis, computed tomography (CT) abdomen/pelvis, Tc-99m bone scan, and prostate-specific membrane antigen (PSMA) PET/CT reports. Data were drawn from >130 Veterans Affairs facilities (1999-2025). Inference was performed with Llama 3.3 70B or GPT-4o depending on the task. Performance was evaluated on independent test sets with metrics including overall accuracy, sensitivity, positive predictive value (PPV), negative predictive value (NPV), and macro-F1.
Pathology extraction tasks achieved near-perfect accuracy. For prostate biopsy reports, exact extraction of total cores and involved cores was highly accurate (total cores: accuracy 98.0% [95% CI, 93.0 to 99.5]; involved cores: accuracy 95.0% [95% CI, 88.8 to 98.5]). Performance was similarly strong for RP and TURP reports. On MRI pelvis, extraction of PIRADS scores (accuracy 98.0% [95% CI, 93.0 to 99.5]), lesion locations (accuracy 100% [95% CI, 96.3 to 100]), and lesion dimensions (accuracy 100% [95% CI, 96.3 to 100]) was excellent. For PSMA PET/CT, PPVs were 100% (95% CI, 93.5 to 100) for nodal metastases and 97.9% (95% CI, 89.9 to 99.6) for bone metastases; Tc-99m bone scan performance was comparable. Lower PPVs were observed for nodal and bone metastases on pelvic MRI (84.2%-86.4%) and CT (88.0%-90.3%), largely due to ambiguous language in radiology report texts.
LLMs can reliably extract key prostate cancer phenotypes from unstructured text across multiple pathology and radiology report types. Ambiguous or indeterminate language remains the principal challenge for optimal performance.
Multi-Hardware Benchmarking of Open-Source Large Language Models with Retrieval-Augmented Generation for Mitsubishi FX-Series PLC Instruction List Code Generation.
PubMed2026-06-05
Smart manufacturing relies on programmable logic controllers (PLCs) that translate sensor inputs into actuator commands. Generating PLC programs in legacy textual languages such as Mitsubishi FX-series Instruction List (IL) remains an expert-only task, and IL's deprecation in IEC 61131-3 Edition 3.0 leaves it under-represented in the corpora that train modern large language models (LLMs). We benchmark ten open-source LLMs (five vendors, 7B-122B parameters) in both LLM-only and Retrieval-Augmented Generation (RAG) configurations on a frozen 285-question dataset; the pipeline uses ChromaDB with all-MiniLM-L6-v2 embeddings and Maximal Marginal Relevance (MMR) retrieval (k=3, λ=0.5). To move beyond lexical similarity we introduce a three-tier static syntax checker (Lexical/Syntactic/Semantic) calibrated against a 93.3% ground-truth pass rate. RAG raises the syntactic pass rate by +6.7 to +61.1 percentage points across all ten models; the best configuration, qwen3.5:122b with RAG, reaches 95.8%, exceeding the ground-truth baseline. Two outliers (llama3.3:70b at +6.7 pp, gpt-oss:120b at +25.6 pp) are reported rather than excluded. The results indicate that for deprecated-but-deployed industrial languages a curated dialect corpus paired with a locally-hosted open-source LLM is more effective than scaling raw model size, supporting reproducible, on-premise industrial-monitoring and code-generation tooling for sustainable smart manufacturing.
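For readers unfamiliar with Maximal Marginal Relevance, the selection rule behind the reported settings (k=3, λ=0.5) can be sketched as follows. This is a plain-Python illustration of the standard MMR formulation, not the pipeline's actual ChromaDB code; the helper names and the toy two-dimensional vectors are hypothetical stand-ins for real all-MiniLM-L6-v2 embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query, docs, k=3, lam=0.5):
    """Return indices of up to k documents chosen greedily by MMR.

    Each step picks the candidate maximizing
    lam * sim(query, doc) - (1 - lam) * max sim(doc, already selected).
    """
    selected = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query, docs[i])
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy vectors (hypothetical): documents 0 and 1 are near-duplicates, 2 is distinct.
query = [1.0, 0.2]
docs = [[1.0, 0.0], [1.0, 0.01], [0.6, 0.8]]

print(mmr_select(query, docs, k=2, lam=0.5))  # balances relevance and diversity
print(mmr_select(query, docs, k=2, lam=1.0))  # lam=1.0 reduces to pure relevance ranking
```

With λ=0.5 the second pick skips the near-duplicate in favor of the distinct document, which is the behavior that makes MMR useful when a small k must cover several relevant instruction-set passages.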
Large language models for optimizing clinical trial recruitment in ICUs: application to ventilator-induced diaphragm dysfunction.
PubMed2026-06-11
Ventilator-induced diaphragm dysfunction (VIDD) is a frequent and under-recognized consequence of prolonged mechanical ventilation in intensive-care unit (ICU) patients. Identifying eligible candidates for clinical trials targeting VIDD remains a major operational challenge. This study evaluates the use of large language models (LLMs) to automate patient prescreening from ICU discharge summaries and estimate recruitment capacity for a future phase 2 trial.
We developed an LLM-based prescreening pipeline to assess trial eligibility criteria from ICU discharge summaries, which was deployed to screen all 2024 ICU stays. Stays that were flagged as potentially eligible underwent expert adjudication. An enriched set of 50 ICU stays was independently annotated by six clinicians to define a reference standard, which was used to evaluate criterion-level model performances using F1-scores.
The best-performing model was GPT-OSS:120B with a criterion-level F1-score of 0.82. When applied to 1,342 consecutive ICU stays from Montpellier University Hospital in 2024, the selected model identified 532 patients with ≥ 3 days of mechanical ventilation. After applying exclusion criteria, 185 patients remained potentially eligible. Expert review confirmed 133 patients as eligible, resulting in a positive predictive value of 72% (95% CI 65-78). The LLM-assisted workflow resulted in an estimated 86% reduction in clinician review time. The LLM-based prescreening pipelines achieved criterion-level F1 scores ranging from 0.73 to 0.82, with GPT-OSS:120B demonstrating the highest performance.
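The positive predictive value above follows directly from the reported counts (133 confirmed out of 185 flagged). The sketch below recomputes it with a Wilson score interval. The abstract does not state which interval method was used, so the Wilson choice is an assumption, and `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


confirmed, flagged = 133, 185  # counts reported in the abstract
ppv = confirmed / flagged
low, high = wilson_interval(confirmed, flagged)
print(f"PPV = {ppv:.0%}, 95% CI {low:.0%} to {high:.0%}")
```

Under this assumption the interval rounds to the same 65-78 range given in the abstract.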
LLM-based prescreening offers a promising approach for identifying trial candidates in critical care, prioritizing candidates for clinician review. Future deployments should include targeted expert validation and ongoing monitoring to ensure safety and generalizability.
Performance of Multimodal Large Language Models in Detection and Position Assessment of Thoracic Devices on Chest Radiographs.
PubMed2026-05-23
Background: Accurate identification and positioning of thoracic devices on chest radiographs is critical for patient safety in intensive care. Multimodal large language models (LLMs) offer potentially generalizable automated evaluation, but their performance in this domain is underexplored. Methods: Three multimodal LLMs (GPT-4o, gpt-4o-2024-08-06; Gemini 3.1 Flash Lite Preview; Claude Sonnet 4.6) were evaluated on 4813 chest radiographs from the RANZCR CLiP dataset for device presence and positioning of ETT, NGT, CVC, and Swan-Ganz catheters. Performance was quantified with 95% Wilson confidence intervals, balanced accuracy, MCC, Cochran's Q, Bonferroni-corrected McNemar, and Cohen's/Fleiss' kappa. Six additional analyses were performed: a blinded paired reader study (n = 377; two board-certified radiologists, blinded to ground truth and to all LLM outputs), external validation on PadChest (n = 200, device-presence detection only-PadChest lacks granular position labels), three-variant prompt-sensitivity analysis (n = 103), repeat-inference stability across three runs (n = 50), systematic error taxonomy, and a failure-case analysis. Results: Device-presence performance varied widely across models; abnormal-position sensitivity was uniformly poor (MCC ≤ 0.028; balanced accuracy 0.41-0.53). Inter-model agreement was poor to slight (Fleiss' κ: 0.005-0.383 for presence; -0.280 to -0.025 for classification). Radiologists numerically outperformed all three LLMs in 42/42 paired comparisons; the superiority was statistically significant after Bonferroni correction in 33/42 (32/42 at p < 0.001). PadChest replicated the negative finding for device-presence detection (malposition not externally validated). Prompts and inference stochasticity introduced 2-3× sensitivity swings and run-to-run κ from 0.20 to 0.85. Case failures concentrated systematically in multi-device cases (p < 0.0001) but not in abnormal-position cases (p = 0.14). 
Conclusions: Current general-purpose multimodal LLMs are not yet reliable for autonomous thoracic-device assessment; their failure patterns are structurally characterizable across models, prompts, and case types and support, at most, a circumscribed role as adjunct device-presence screening tools. The findings do not generalize to purpose-built, regulator-approved clinical AI systems.
SCRIPT: Stratified clinical risk prediction from pathology reports using large language models.
PubMed2026-08-01
Accurate risk stratification in oncology is essential for guiding treatment decisions, yet current algorithms rely on a narrow set of structured variables, and hence potentially ignore the rich signal in narrative pathology reports. These reports contain nuanced morphological descriptions and expert clinical judgment. This narrative information remains largely unused in clinical decision-making as it gets lost in "prose" text-based reports. We hypothesized that large language models (LLMs) could extract prognostic information from complete free-text pathology reports and convert it into a binary survival biomarker.
We used the open-weight LLaMA 3.3 70B model to generate risk scores directly from publicly available pathology reports across three gastrointestinal cancer types. The model was prompted to synthesize the complete narrative reports into a binary prognostic score. We evaluated associations between the LLM-generated scores and survival outcomes, including overall survival, progression-free survival, and disease-specific survival.
In colorectal cancer, LLM-generated risk scores demonstrated significant prognostic value for overall survival (Hazard ratio (HR) = 2.77, 95% confidence interval (CI) = 1.92-3.97, p < 0.001), progression-free survival (HR = 2.93, 95% CI = 2.11-4.08, p < 0.001), and disease-specific survival (HR = 5.85, 95% CI = 3.66-9.36, p < 0.001). Multivariate analysis confirmed the LLM-generated risk score as an independent prognostic factor for progression-free survival.
LLMs can turn narrative pathology reports into a single, independent survival biomarker. This approach leverages routinely available free-text documentation without requiring additional tissue analysis or pathologist workload, providing a deployable method to enhance risk stratification for treatment decision-making.
Knowledge-Augmented Large Language Model for Multimodal Electronic Health Record-Based Risk Prediction: Development and Validation Study.
PubMed2026-06-12
Accurate clinical outcome prediction using electronic health records (EHRs) is crucial for patient care and resource allocation. EHRs include both structured data and rich, unstructured clinical notes. However, prior machine learning methods struggle with the multimodality, long context of notes, and severe class imbalance in clinical tasks.
This study aimed to introduce and evaluate KAMELEON (Knowledge-Augmented Multimodal EHR Learning for Outcome Prediction), a unified, 2-stage hybrid framework that integrates diverse EHR modalities and external biomedical knowledge to enhance clinical risk prediction.
This study used the publicly available, deidentified Medical Information Mart for Intensive Care-III dataset, which includes structured and unstructured data for over 40,000 intensive care unit patients. The 2 tasks studied were 30-day readmission (403/10,031, 4% positive rate) and in-hospital mortality prediction (2423/17,903, 13% positive rate). Train-test splits were patient-disjoint (80:20). Performance was evaluated against general and medical large language models (LLMs) and structured baselines. Key metrics included the area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), and macro F1-score.
The KAMELEON framework consistently outperformed all existing baselines. 30-day readmission: the KAMELEON-balanced random forests model achieved an AUROC of 0.85 and a sensitivity (recall) of 0.79. Ablation analysis shows the critical role of the LLM-generated reasoning, with its removal causing the AUROC to drop from 0.85 to 0.7 and sensitivity to fall by over 80%. In-hospital mortality: the KAMELEON-extreme gradient boosting model achieved an AUROC of 0.92 and an AUPRC of 0.650. Unstructured-only models showed limited ability to discern mortality, with AUROC values near chance (around 0.51-0.53).
To our knowledge, KAMELEON represents one of the first systematic frameworks to enhance LLMs for health care prediction through graph-guided knowledge retrieval combined with structured machine learning. The framework demonstrates superior performance across both prediction tasks, highlighting the synergistic value of combining diverse data modalities and LLM reasoning for robust clinical risk estimation.
Large Language Models Utility for Rapid On-Site Evaluation in Interventional Pulmonology.
PubMed2026-05-28
Background/Objectives: Rapid on-site evaluation (ROSE) is a valuable technique in interventional procedures to immediately assess the adequacy and quality of biopsy specimens at the time they are obtained. The integration of artificial intelligence (AI) into ROSE workflows has demonstrated diagnostic accuracy comparable to that of experienced cytologists. However, clinical implementation of AI-based ROSE models is limited by complex and expensive development. In contrast, the use of free or near-free global large language models (LLMs) offers a significant advantage, making diagnostic support more accessible. We aimed to assess the diagnostic accuracy of the LLMs ChatGPT and Gemini in evaluating cytological smears during interventional pulmonology procedures. Methods: Retrospective evaluation of the efficacy of LLMs for assessment of cytological smears obtained from adult patients who underwent interventional bronchoscopic and ultrasound-guided biopsies between 2020 and 2025. Images of ROSE-prepared samples were analyzed by ChatGPT-4o, ChatGPT-5, ChatGPT-5 "thinking", and Gemini 2.5 models. Results: Forty-eight procedures in 47 patients (mean age 65 years) were analyzed; 79% of biopsies were malignant. Using the final histopathology report as reference, cytologists achieved balanced accuracy of 0.75 (Gwet's AC1 = 0.53, sensitivity 0.71, specificity 0.78). ChatGPT-5 "thinking" showed high concordance (accuracy 0.65, Gwet's AC1 = 0.81, sensitivity 1.00, specificity 0.30). Gemini reached an accuracy of 0.59 (Gwet's AC1 = 0.76, sensitivity 0.97, specificity 0.20). Conclusions: To our knowledge, this study is the first to evaluate LLM-assisted ROSE in interventional pulmonology. The results suggest the potential feasibility of integrating this AI technology into the workflow within the pulmonary division. Larger prospective studies are needed to confirm effects on diagnostic yield.
A Large Language Model for Extracting Post-marketing Adverse Drug Events from Clinical Notes in the Electronic Health Record.
PubMed2026-06-12
Clinical notes contain a vast amount of potentially useful information about adverse drug event (ADE) signals that never reach pharmacovigilance databases. Traditional rule-based or sentence-level models often miss subtle causal cues and generate excess false positives.
To build a large language model (LLM) pipeline that reads entire electronic health record (EHR) notes, identifies drug-event pairs with a "reasonable possibility" of causation, and infers other important properties of each ADR, such as whether the event is "serious" or "unlabeled".
We adopted a two-pass workflow using model "OpenAI o1": Pass 1 screens each note for ADEs; Pass 2 adds 20 structured fields. A diverse sample of 372 deidentified notes from physicians and pharmacists at the University of California, San Francisco (UCSF), from 31 specialty/setting cells, yielded 191 ADEs. One medical expert reviewed each ADE for validity, seriousness, and label status. Another expert created a manually curated gold-standard ADE set on 100 of the 372 notes to give us a percent "recall" estimate. A third expert met with the first expert to arrive at a consensus on the validity of LLM ADEs validated by the first expert but not found in the gold standard ADEs, giving us an estimate of "accuracy".
Of 191 ADEs, 180 were true positives (94.2% precision) with 84.1% recall (F1 = 88.9%). Seriousness was correct in 100% and label status in 93.9% of cases. Medical Dictionary for Regulatory Activities (MedDRA) lowest level term (LLT) coding was correct in 92.5% of valid ADEs; errors were mostly non-existent LLTs. Of all valid ADEs, 12.2% met FDA "serious" criteria, 15.0% were unlabeled, and 8.9% were "failure of efficacy." On the first pass, 84.9% of notes contained no ADEs, keeping inference costs to USD $0.18 per note and $0.35 per validated ADE. The model inferred some ADEs not mentioned by physicians, e.g., tacrolimus-associated hypomagnesemia.
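The headline precision and F1 figures can be checked from the numbers given. The sketch below takes precision from the reported counts (180 of 191) and uses the reported recall of 84.1% as an input, since the counts behind the recall estimate are not given in the abstract. The function name `f1_score` is a hypothetical helper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


true_positives, llm_ades = 180, 191   # counts reported in the abstract
precision = true_positives / llm_ades
recall = 0.841                        # reported value, taken as given
f1 = f1_score(precision, recall)

print(f"precision = {precision:.1%}, recall = {recall:.1%}, F1 = {f1:.1%}")
```

The recomputed values agree with the reported 94.2% precision and 88.9% F1.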
While the LLM evaluated in this study is not perfect, it can transform free-text EHR notes into ADEs with 94% accuracy, and such data, when statistically analyzed in aggregate, can lead to new safety signals of potential drug side effects. Integrated with platforms like Sentinel in the USA, or Darwin EU, in the European Union, this approach could rapidly surface rare, serious, and unlabeled ADEs for further regulatory analysis.
Reasoning or reciting? A temporal contamination audit of large language models in clinical medicine.
PubMed2026-06-11
To evaluate whether large language models reason or simply regurgitate training data in clinical diagnosis.
We audited 2000 clinical case reports from PubMed Central: 1000 from 2021 to 2022 (within training data) and 1000 from 2025 (after training cutoffs). Five frontier LLMs generated diagnoses evaluated by an independent AI judge validated against physician consensus (n = 10 000 evaluations).
Diagnostic accuracy was virtually identical across temporal cohorts (66.8% contaminated vs 66.9% clean), directly contradicting the memorization hypothesis. Lexical similarity was uniformly low (mean ROUGE-L 0.057), and semantic similarity measured by BERTScore showed no memorization signal (F1 0.8182 contaminated vs 0.8195 clean, Δ = +0.0013), confirming that models generate novel reasoning rather than regurgitating training data.
This large-scale audit, using both lexical and semantic similarity metrics, provides compelling evidence that LLMs engage in genuine clinical reasoning rather than regurgitating memorized training data.
Models demonstrated equivalent accuracy on cases they could not have seen during training, suggesting they have internalized generalizable medical knowledge rather than memorizing specific cases.
Journal of the American Medical Informatics Association : JAMIA
Performance of large language models in interpreting evidence-based clinical guidelines for lumbar disc herniation with radiculopathy.
PubMed2026-06-12
Large language models (LLMs) are increasingly used as clinical information tools; however, their ability to accurately interpret evidence-based spine guidelines remains unclear. This study compared the performance of ChatGPT-5.1, Gemini, and Perplexity in interpreting the North American Spine Society (NASS) guideline for lumbar disc herniation with radiculopathy.
Nineteen open-ended clinical questions derived from the NASS guideline were submitted to each LLM under standardized conditions. Responses were evaluated by two blinded clinicians using validated Likert scales for clinical accuracy (1-5), reliability, and usability (1-7). Semantic similarity to guideline-based answers was assessed using the Universal Sentence Encoder, surface-level textual similarity using ROUGE-L F1 scores, and readability using multiple established readability indices. Reference reliability was analyzed using the Reference Hallucination Score.
Perplexity demonstrated significantly higher clinical accuracy (3.95 ± 0.70) compared with ChatGPT-5.1 (3.45 ± 0.68) and Gemini (3.50 ± 0.65) (p = 0.018). Reliability and usability scores were also highest for Perplexity (4.85 ± 1.05 and 4.75 ± 0.95, respectively; both p < 0.01). Semantic similarity scores were greater for Perplexity (0.71 ± 0.06) than for ChatGPT-5.1 (0.64 ± 0.07) (p < 0.001), whereas Gemini achieved the highest ROUGE-L F1 scores (0.14 ± 0.04; p < 0.001). Readability indices were comparable across models, indicating similar levels of textual complexity. ChatGPT-5.1 exhibited the highest reference hallucination (8.10 ± 2.85), while Perplexity showed the lowest (4.15 ± 2.70) (p < 0.001).
LLMs show significant variability in guideline-based clinical reasoning. Although none should be used as independent decision-making tools, reference-oriented models may provide more reliable adjunctive support for evidence-based spine practice.
Don't stop the heart: a performance analysis of large language models and potassium dosing.
PubMed2026-06-04
Electrolyte replacement is ubiquitous in the acute care setting, but its familiarity cannot belie that even small dosing errors with potassium can cause lethal cardiac arrhythmias. Recently, MedAgentBench offered a benchmark for agentic artificial intelligence (AI) including the ability to correctly dose potassium based on a single rule; however, this does not adequately reflect the clinical complexity or safety concerns of an agent that has been used for lethal injection. The purpose of this analysis was to probe the capabilities of a leaderboard large language model (LLM) to follow basic dosing rules to safely replace potassium in a series of clinician-annotated cases.
Using a clinician panel, we developed a series of dosing principles and 20 clinical cases reflective of the complexity of potassium replacement. External clinicians were surveyed to assess practice variability and agreement to clinician panel answers. We tested GPT-5-chat with each case in triplicate, with and without the clinician curated dosing principles, and prompted the model to answer six questions involving potassium goals, dosing, route, lab frequency, concurrent interventions, and the model's perceived level of confidence for the output and complexity of the case. The primary outcome was the rate of appropriate recommendations in comparison to clinician answers.
A total of 54 clinicians reviewed the 20 hypokalemia cases and hypokalemia dosing guideline. Clinicians expressed "highly agree" or "somewhat agree" for 66.8% of the cases evaluated when asked if they agree with the guideline-recommended management. When given the potassium dosing guideline, total errors dropped from 165 to 104, and average accuracy improved from 45% to 65% with GPT-5-Chat. GPT-5-Chat conveyed a high level of confidence for 100% of responses, while labeling 80% and 76% of cases as highly complex with and without the criteria, respectively. Potential harm scores were considerable in both groups; however, a notable reduction in severity scores occurred with the dosing guidance document. Recommendations on concurrent interventions and dosing had the highest rate of errors in both groups.
Benchmarks must appropriately reflect clinical complexity to be considered valuable for the deployment of agentic artificial intelligence tools in the healthcare domain. GPT-5-Chat assessment on a comprehensive medication management task for potassium replacement showed improvement with dosing guidance, yet unfit benchmarking performance.
Performance of Large Language Models in Answering Healthcare Delivery Questions: A Quantitative Cross-Sectional Study.
PubMed2026-06-01
The use of Large Language Model (LLM)-based chatbots across various fields has yielded positive outcomes. Understanding the health service delivery system offers numerous benefits. This study aimed to analyze the performance of LLMs in answering healthcare delivery questions.
A validated questionnaire relevant to the research context was administered to a sample of LLM-based chatbots. The chatbots evaluated in this study included GPT-4.1-mini, Gemini 2.5, Copilot 2025, and Perplexity. A written prompt was provided to facilitate response generation by the chatbots. To analyze and compare the performance of the AI models in addressing the research questions, confusion matrices were constructed, and key metrics (sensitivity, specificity, positive predictive value, negative predictive value, and overall accuracy) were calculated.
The initial assessment of the chatbots showed perfect sensitivity (1.00), accurately identifying all true positives without false negatives. Specificity varied, with ChatGPT and Perplexity at 0.50, Gemini at 0.43, and Copilot at 0.33. Positive predictive values (PPV) ranged from 0.67 (Gemini) to 0.75 (ChatGPT and Perplexity), while negative predictive values (NPV) were uniformly perfect (1.00). Overall accuracy was highest for ChatGPT and Perplexity (0.80), with Gemini and Copilot at 0.73. In the second round, sensitivity remained perfect for all chatbots. Gemini achieved the highest specificity (0.80), followed by ChatGPT (0.67), Perplexity (0.60), and Copilot (0.50). PPVs improved, ranging from 0.75 (Copilot) to 0.91 (Gemini). NPVs remained perfect (1.00) across all models. Overall accuracy was led by Gemini (0.93), with ChatGPT and Perplexity both at 0.87, and Copilot at 0.80.
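The five metrics reported above all derive from the four cells of a confusion matrix. The sketch below shows the arithmetic. The abstract does not report cell counts, so the counts used here (9 true positives, 3 false positives, 3 true negatives, 0 false negatives) are hypothetical values chosen only because they are consistent with the first-round figures given for ChatGPT and Perplexity; `diagnostic_metrics` is likewise an illustrative name.

```python
def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard confusion-matrix metrics; NaN when a denominator is zero."""
    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "ppv": ratio(tp, tp + fp),           # positive predictive value
        "npv": ratio(tn, tn + fn),           # negative predictive value
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Hypothetical counts (not reported in the abstract).
metrics = diagnostic_metrics(tp=9, fp=3, tn=3, fn=0)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With no false negatives, sensitivity and NPV are both 1.00 by construction, so the differences between chatbots in this study come entirely from false positives, which drive specificity, PPV, and accuracy.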
ChatGPT and Perplexity showed the highest initial performance, while the second round revealed improvements in most chatbots, especially in specificity and accuracy, with Gemini performing best. Further research is needed for deeper insights.
Note-Level Phenotyping of Multiple-Sclerosis Notes by a Large Language Model Achieves near Human-Level Agreement.
PubMed2026-05-25
Background/Objectives: Clinical phenotyping from narrative electronic health records (EHRs) often relies on multi-stage pipelines involving span-level extraction, ontology mapping, and aggregation. Large language models (LLMs) may enable direct document-level abstraction of clinically meaningful phenotype features from complete notes. We evaluated whether GPT-5.2 could approximate human annotation for note-level multiple sclerosis (MS) phenotyping and compared its performance with human annotators, a locally run open-source LLM, HPO-based extraction tools, and a supervised clinical transformer encoder. Methods: We analyzed 100 de-identified MS neurology progress notes from a single academic medical center. Each note was annotated for the presence or absence of 17 predefined neurological phenotype categories. Two human annotators independently labeled all notes using a multi-label note-level framework in Prodigy, and disagreements were adjudicated to create a reference annotation set. GPT-5.2 was evaluated in a zero-shot setting using structured JSON output. Comparator methods included Llama-3.1 8B, Doc2Hpo, ClinPhen, PhenoSnap, and BioClinical ModernBERT. Performance was assessed using agreement, precision, recall, F1, Matthews correlation coefficient, and false-positive and false-negative assignments per note. Results: Human-human agreement was generally high, although lower for rare or ambiguously documented features. GPT-5.2 achieved the strongest automated performance, with macro-precision 0.734, macro-recall 0.921, macro-F1 0.801, and macro-averaged MCC 0.777, approaching human annotator performance. GPT-5.2 showed the lowest false-negative count per note but more false-positive assignments than either human annotator, reflecting a sensitive but more inclusive annotation profile. 
Llama-3.1 8B performed competitively among automated methods, whereas HPO-based extraction tools and BioClinical ModernBERT showed lower performance on this low-resource note-level task. Secondary review of GPT-5.2 discordant assignments found no clear hallucinations and suggested that some apparent false positives reflected phenotype evidence missed in the human-derived reference set. Conclusions: GPT-5.2 achieved near-human performance for document-level recognition of MS phenotype categories from narrative neurology notes. Direct note-level abstraction may provide a scalable approach for research and population-health phenotyping of large EHR note corpora.
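The reported macro-F1 (0.801) is lower than the harmonic mean of the reported macro-precision and macro-recall, which is expected if macro-F1 is the unweighted mean of per-category F1 scores. The following is a minimal sketch of that calculation and of MCC from per-category confusion counts. The counts in `toy_counts` are hypothetical and are not taken from the study, and the function names are illustrative only.

```python
import math

def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for one phenotype category."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient for one category."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def macro_average(counts):
    """Unweighted mean of per-category metrics.

    counts: list of (tp, fp, fn, tn) tuples, one per category.
    """
    k = len(counts)
    per_cat = [precision_recall_f1(tp, fp, fn) for tp, fp, fn, _ in counts]
    return {
        "precision": sum(p for p, _, _ in per_cat) / k,
        "recall": sum(r for _, r, _ in per_cat) / k,
        "f1": sum(f for _, _, f in per_cat) / k,
        "mcc": sum(mcc(*c) for c in counts) / k,
    }

# Hypothetical counts for two categories over 100 notes each (not study data).
toy_counts = [(40, 10, 2, 48), (5, 5, 1, 89)]
scores = macro_average(toy_counts)

# F1 computed from the macro-averaged precision and recall, for comparison.
f1_of_macros = (2 * scores["precision"] * scores["recall"]
                / (scores["precision"] + scores["recall"]))
print(scores, f1_of_macros)
```

In this toy case the two F1 figures differ, which illustrates why a macro-F1 need not equal the F1 of the macro-averaged precision and recall.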
Consensus-Level and Cluster-Adjusted Evaluation of a Large Language Model for Diagnostic Extraction from Musculoskeletal Radiology Reports.
PubMed2026-05-22
Purpose: Large language models (LLMs) may reduce administrative workload in radiology by automating structured diagnostic extraction from text reports. This study evaluates the accuracy of ChatGPT-4.0 when extracting correct diagnoses from musculoskeletal (MSK) radiology text reports, and compares its performance with that of experienced human readers, using cluster-adjusted and consensus-level analyses. Materials and Methods: Twenty-three multimodal MSK imaging cases (X-ray, ultrasound, CT, and MRI) were analysed. Ten human readers and ChatGPT-4.0 (10 independent iterations) provided primary (1st) and secondary (2nd) diagnoses from six predefined options. We analysed data at the individual-reader level using cluster-adjusted generalised estimating equations (GEE) and at the case level using majority consensus with exact McNemar testing. Within-case (α_case) and within-reader (α_reader) correlations and design effects were calculated to assess clustering and implications for sample size. Results: For 1st diagnoses, AI accuracy was 0.957 (95%-CI 0.922-0.976) versus 0.865 (95%-CI 0.815-0.903) for human readers (absolute difference -0.091; OR 3.43, 95%-CI 1.07-11.02; p = 0.038). Within-case correlation (α_case = 0.247) exceeded within-reader correlation (α_reader ≈ 0); this resulted in a design effect of 5.7 and an effective sample size of 80.7. At the consensus level, discordance occurred in 2/23 cases (8.7%), with no significant difference between methods (McNemar p = 1.00). When 1st and 2nd diagnoses were combined, both systems achieved 23/23 correct consensus classifications. Interrater reliability between AI and human classifications was almost perfect (Gwet's AC1 = 0.836-0.927). Conclusions and Key points: In this structured MSK text-report setting, ChatGPT-4.0 achieved diagnostic accuracy comparable to that of experienced radiologists, with modest individual-reader advantages that disappeared under consensus aggregation. 
Clustering analysis indicates that variability is primarily case-driven, suggesting that future validation studies will benefit more from expanding case numbers than reader numbers. Our data suggest that large performance divergences between AI and human consensus are unlikely in similar structured diagnostic contexts.
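The design effect and effective sample size reported above are consistent with the standard Kish formula, deff = 1 + (m - 1) × ρ, if each of the 23 cases is read 20 times (10 human readers plus 10 ChatGPT-4.0 iterations). That cluster size is an assumption inferred from the abstract, not a figure the authors state. The sketch below also includes a two-sided exact McNemar test. The split of the two discordant cases is not reported, so both possible splits are evaluated, and the function names are illustrative.

```python
from math import comb

def design_effect(cluster_size, icc):
    """Kish design effect for equal-sized clusters."""
    return 1.0 + (cluster_size - 1) * icc

def effective_sample_size(n_observations, deff):
    """Number of independent observations the clustered sample is worth."""
    return n_observations / deff

def exact_mcnemar_p(b, c):
    """Two-sided exact McNemar p-value from the discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

cases = 23
readings_per_case = 20   # assumption: 10 human readers + 10 AI iterations
alpha_case = 0.247       # within-case correlation reported in the abstract

deff = design_effect(readings_per_case, alpha_case)
n_eff = effective_sample_size(cases * readings_per_case, deff)
print(f"design effect ~ {deff:.2f}, effective sample size ~ {n_eff:.1f}")

# Two discordant cases: the split (1 vs 1, or 2 vs 0) is not reported.
print(exact_mcnemar_p(1, 1), exact_mcnemar_p(2, 0))
```

Under these assumptions the design effect rounds to the reported 5.7. The effective sample size comes out slightly above 80.7 because the unrounded design effect is used.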
Bridging the outcome documentation gap in epilepsy surgery: Validating large language model agents for automated Engel and International League Against Epilepsy scoring from clinical notes.
PubMed2026-06-12
Timely and accurate classification of outcomes after epilepsy surgery using the Engel and International League Against Epilepsy (ILAE) scales is essential for clinical follow-up, yet electronic health record documentation often lacks the structured detail needed for reliable scoring. This study aimed to validate large language model (LLM) agents for autonomous extraction of standardized postsurgical outcomes from unstructured follow-up notes.
We performed a retrospective validation study of deidentified postoperative epilepsy follow-up notes from patients who underwent epilepsy-related surgery or neuromodulation between 2000 and 2025 (n = 170). Each note was processed once with two fixed GPT-4-turbo prompt configurations: a concise definition-based prompt and a context-aware prompt incorporating temporal, causal, and adherence logic. Human-adjudicated consensus served as the reference standard. Prespecified metrics included exact score agreement, clinically adjacent agreement, ordinal distance, Wilson 95% confidence intervals (CIs), and paired tests comparing prompt configurations.
Valid follow-up intervals were available for 170 cases; the median time from surgery to analyzed note was 32.7 months (interquartile range = 9.6-97.9). Human reviewers achieved 91.2% raw agreement for Engel major class (Cohen kappa = .86, 95% bootstrap CI = .79-.92) and 83.5% raw agreement for ILAE category (quadratic weighted kappa = .93, 95% CI = .89-.96). The definition-based prompt achieved 56.5% exact Engel subclass agreement (95% CI = 49.0-63.7) and 60.6% exact ILAE agreement (95% CI = 53.1-67.6). The context-aware prompt improved exact agreement to 94.7% for Engel (95% CI = 90.2-97.2) and 93.5% for ILAE (95% CI = 88.8-96.3), with lower ordinal distance for both scales (paired sign tests p < .001).
The meaningful finding is not that a general LLM can recite outcome definitions, but that a context-aware LLM agent can apply seizure-outcome logic to heterogeneous real-world notes with high agreement against adjudicated human consensus. Definition-only prompting remained unreliable in nuanced categories, supporting the need for explicit clinical reasoning structure, auditability, and privacy-preserving deployment.
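The Wilson 95% confidence intervals reported for exact agreement can be checked with a short calculation. The sketch below assumes agreement counts back-calculated from the reported percentages and n = 170 (161/170 ≈ 94.7% and 96/170 ≈ 56.5%). These counts are inferred and are not given in the abstract.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    z defaults to 1.96, the usual value for a 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    z2 = z * z
    denom = 1.0 + z2 / n
    center = (p + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    return center - half_width, center + half_width

# Counts inferred from the reported percentages (assumption, see lead-in).
for label, agree in [("context-aware Engel", 161), ("definition-based Engel", 96)]:
    low, high = wilson_interval(agree, 170)
    print(f"{label}: {agree / 170:.1%} (95% CI {low:.1%} to {high:.1%})")
```

With these inferred counts, the intervals agree with the reported 90.2-97.2 and 49.0-63.7 to one decimal place.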
Quantifying Evidence for Competing Biomedical Hypotheses using Large Language Models and Bayesian Analysis.
PubMed2026-06-07
Science fundamentally depends on the generation and testing of hypotheses, many of them controversial. An explosion in scientific literature has made evaluating hypotheses even within a domain a problem of scale, and risks slowing an already extensive consensus-building process. While this challenge has prompted interest in automated hypothesis evaluation tools, existing methods have not yet proven effective for comparing hypotheses. Here, we introduce KM-GPT-DCH, an algorithm that combines co-occurrence methods with large language models (LLMs) to develop a transparent and reproducible literature-based algorithm to compare controversial hypotheses using a structured scoring approach with Bayesian methods to estimate confidence. When testing the algorithm on historical controversial hypotheses previously decided, KM-GPT-DCH chooses the correct hypothesis with high confidence several years before the scientific community or public do so. We further apply the algorithm to compare twenty unresolved controversial hypothesis pairs, providing guidance for future research. The method can help researchers and the public to evaluate biomedical hypotheses such as "Is it more likely that monoamine deficiency or inflammation causes depression?" It can also be used to assess and visualize historical trends in the scientific literature. A web-based implementation of the algorithm is freely available at https://skim.morgridge.org.
bioRxiv : the preprint server for biology
Research on fine-tuning algorithms for Large Language Models integrating Uncertainty Modeling and External Memory Augmentation.
PubMed2026-01-01
This paper proposes a parameter-efficient fine-tuning framework that integrates uncertainty modeling with external memory augmentation, aiming to improve robustness, confidence calibration, and contextual completeness in downstream natural language processing tasks. From the methodological perspective, the uncertainty modeling module explicitly characterizes uncertainty in inputs and intermediate representations through feature-level estimation, cross-layer propagation, and confidence calibration, thereby enhancing training stability and reducing the influence of noisy signals. Meanwhile, the external memory augmentation module employs key-value retrieval and gated fusion mechanisms to provide reusable contextual support, alleviating information loss caused by limited contextual summarization and improving representation quality under heterogeneous evaluation settings. Extensive experiments and ablation studies were conducted on text classification and named entity recognition tasks across multiple public benchmark datasets, using GPT-2 Small, GPT-2 Medium, and LLaMA3-8B as backbone models. The results demonstrate that the proposed framework consistently outperforms several mainstream fine-tuning methods in terms of accuracy, F1 score, and robustness, while also showing stable behavior under learning-rate sensitivity and missing-information settings. Overall, this study provides a novel perspective for efficient and interpretable fine-tuning paradigms, achieving a favorable balance among performance improvement, parameter efficiency, and deployment feasibility, and offering a practical basis for future extensions to more complex downstream scenarios.