Trigeminal neuralgia (TN) caused by vertebrobasilar dolichoectasia (VBD) is a rare but particularly challenging entity. Microvascular decompression (MVD) is considered the most definitive treatment; however, outcomes in this subgroup remain incompletely characterized. We conducted a systematic review and meta-analysis following PRISMA guidelines. PubMed, Embase, Scopus, and Web of Science were searched from inception through August 22, 2025. Eligible studies reported on patients with VBD-TN undergoing MVD with extractable data on pain outcomes, recurrence, salvage interventions, or complications. Complete relief was defined as Barrow Neurological Institute (BNI)-I, while adequate relief included BNI-I to IIIb. Thirteen studies involving 315 patients were analyzed. The mean age ranged from 54.0 to 67.3 years, with 57.8% (182/315) being males. The pooled initial complete pain relief rate was 95.8% (95% CI, 92.3-98.2), with sustained relief at the last follow-up in 92.6% (95% CI, 88.4-96.1). Adequate relief was nearly universal, at 99.9% (95% CI, 98.2-100%) initially and 95.9% (95% CI, 91.8-98.8%) at the last follow-up. Pain recurrence occurred in 5.5% (95% CI, 2.9-8.9%), and salvage procedures were required in 1.3% (95% CI, 0.2-3.1%). The permanent morbidity was low at 2.4% (95% CI, 0.8-4.8%). Meta-regression indicated that prior ablative procedures were associated with higher complication rates, whereas V2 involvement predicted better long-term pain control. MVD appears to provide effective and durable pain relief for selected patients with VBD-TN, with low permanent morbidity but a clinically meaningful overall complication burden. Given the retrospective nature of the available evidence, MVD should be considered a promising treatment option rather than a definitive standard of care.
The author examined whether a large language model (LLM) can help identify noncompliance with the Mental Health Parity and Addiction Equity Act (MHPAEA) in health insurance plan documents. Using Anthropic's Claude 3.5 Sonnet between December 1, 2024, and January 31, 2025, the author analyzed primary documentation for the Essential Health Benefits benchmark plans for 2026. An LLM prompt was first validated, and the author assessed the LLM's positive predictive value (PPV) in applying that prompt to identify areas of potential MHPAEA noncompliance. The LLM then prioritized the top 10 areas of noncompliance among those accurately identified. The LLM identified on average 3.8 areas of potential noncompliance per document, with an average PPV of 49%. The findings indicate that LLMs currently have a relatively poor PPV in regulatory oversight tasks but may help improve efficiency by enabling rapid identification of potential MHPAEA noncompliance to prioritize areas for further review.
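The positive predictive value reported above is the share of LLM-flagged areas that a reviewer confirms. A minimal sketch of that calculation follows; the function name and the example counts are hypothetical and are not taken from the study.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """PPV = TP / (TP + FP): the share of flagged items that are confirmed."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("PPV is undefined when nothing was flagged")
    return true_positives / flagged


# Hypothetical counts for a single plan document (not study data):
# 4 areas flagged by the LLM, 2 confirmed on manual review.
print(positive_predictive_value(true_positives=2, false_positives=2))
```

Averaging this value over documents would give a per-document mean PPV of the kind the abstract reports.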
Recent advancements in large language models (LLMs) have accelerated their integration into clinical domains, including laboratory medicine. The performance of LLMs in answering board-level laboratory medicine questions has not been comprehensively evaluated. Given the importance of diagnostic accuracy in this field, rigorous and objective evaluations of LLM capabilities are essential. We assessed 12 LLMs from OpenAI, Anthropic, and Google using 320 Korean Residency Examination questions (2021-2024) spanning six laboratory medicine subspecialties. Standardized prompts were provided via their application programming interfaces under deterministic settings (temperature=0). Questions were administered thrice to assess response reproducibility. Outputs were compared with validated answers and analyzed for accuracy, reasoning quality, and error typology. Google's Gemini 2.0 Pro achieved the highest accuracy (80.0%), followed by OpenAI's GPT-4.5 (77.2%) and Anthropic's Claude 3.7 Sonnet (74.1%). Accuracy decreased as the difficulty of questions increased (78.0% for easy vs. 45.1% for challenging). Subspecialty performance varied. All models underperformed on questions on transfusion medicine (mean accuracy: 38.8%), primarily because of limitations in domain-specific and regional knowledge representations. Incorrect answers primarily resulted from reasoning errors. Reproducibility exceeded 95% for most models; however, some residual non-determinism appeared even with greedy decoding (temperature=0). LLMs demonstrated substantial potential for integration into laboratory medicine, particularly in clinical chemistry and immunology. Performance inconsistencies (particularly for high-difficulty questions) and knowledge gaps (notably for transfusion medicine) highlight the necessity for further development, potentially including domain-specific fine-tuning and retrieval-augmented generation integration, and for robust expert oversight before clinical application.
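The abstract does not say how reproducibility was computed. One plausible reading, sketched here as an assumption, is the fraction of questions that received the same answer in every repeated run; the function name and the toy answers are illustrative only.

```python
def reproducibility(runs: list[list[str]]) -> float:
    """Fraction of questions answered identically in every run.

    `runs` holds one answer list per repetition, aligned by question.
    This definition is an assumption, not the study's stated metric.
    """
    if not runs or not runs[0]:
        raise ValueError("need at least one run with at least one answer")
    if any(len(run) != len(runs[0]) for run in runs):
        raise ValueError("all runs must cover the same questions")
    # zip(*runs) groups the answers given to each question across runs.
    stable = sum(1 for answers in zip(*runs) if len(set(answers)) == 1)
    return stable / len(runs[0])


# Toy example: three runs over four questions; only question 3 differs.
three_runs = [
    ["A", "B", "C", "D"],
    ["A", "B", "C", "D"],
    ["A", "B", "D", "D"],
]
print(reproducibility(three_runs))  # 0.75
```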
This study compares OpenAI's GPT-4o and Anthropic's Claude 4 in the generation of formative and summative feedback in Objective Structured Clinical Examinations (OSCEs) within Qpercom's assessment platform. A stratified sample of 51 anonymized student records was analyzed, comparing examiner-facing (pre-verification/preview) and student-facing (portfolio) feedback across both models. While both systems delivered actionable suggestions, Claude 4 consistently outperformed GPT-4o in alignment with examiner data, absence of hallucinations, and preservation of critical learning points, especially for underperforming and mid-performing students. This evidence-based evaluation recommends Claude 4 as the safer and more effective AI solution for high-stakes educational settings.
Nitrogen fixation in oxygenic cyanobacteria depends on a system of genes that protect oxygen-sensitive nitrogenase, many of which likely remain uncharacterized. Here we predict FOX (fixation in the presence of oxygen) gene candidates in Anabaena sp. PCC 7120 by integrating nitrogen step-down RNA-seq (0/6/12/21 hours), quantitative proteomics, promoter architecture, genomic context, and reciprocal-best-hit conservation across diazotrophic and non-diazotrophic cyanobacteria. Using 68 literature-validated FOX genes and 835 conserved non-essential genes as a proxy negative class, we trained logistic regression, Random Forest, and XGBoost models and evaluated them using 20 repeated stratified 80/20 train–test splits. The best models achieved ROC–AUC up to 0.80 and average precision up to 0.55 and precision among the top 20 ranked genes reached 0.39 versus a 0.075 prevalence baseline. Model interpretation highlights late step-down induction, diazotroph-biased conservation, and genomic neighborhood signals as leading predictors. We generated genome-wide FOX probability scores used primarily for candidate ranking, nominating conserved genes spanning heterocyst envelope processes as well as broader redox, metabolism, and electron-pool regulation. We release these predictions and a public web-based optimizer that applies comparative-bioinformatics filters and size constraints to propose candidate accessory-gene complements for experimental testing and heterologous reconstitution efforts.
Large language models (LLMs) show promise for guiding appropriate diagnostic imaging modality selection according to ACR criteria. This study compared seven LLMs - OpenEvidence (OpenEvidence Inc., Miami, Florida); OpenAI's GPT-5 Thinking and GPT-5 (OpenAI, San Francisco, California); Anthropic's Opus 4.1 and Sonnet 4.5 (Anthropic, San Francisco, California); and Google's Gemini 2.5 Pro and 2.5 Flash (Google LLC, Mountain View, California) - using 50 clinical vignettes to assess accuracy and clinical reasoning in formulating imaging modality recommendations. Fifty text-based clinical vignettes were created from ACR guidelines, featuring five variants of 10 different medical complaints with subtle symptomatic or demographic alterations. A 3-point Likert scale was used to evaluate four performance metrics: imaging appropriateness, technical specificity, clinical rationale strength, and citation quality. Readability and word count were also assessed. Two blinded, independent reviewers rated the LLM outputs, with discrepancies resolved via consensus. A third reviewer was included for persistent disagreements. Analysis involved Friedman's test followed by pairwise Wilcoxon signed-rank testing with Holm correction (P < .05). Friedman testing demonstrated significant differences across all performance domains (P ≤ .031). Appropriateness scores (range 1.60-1.88 out of 2.00) revealed no significant pairwise differences. Technical specificity (range 1.82-2.00) and clinical rationale (range 1.52-1.88) showed no significant pairwise differences. Citation quality (range 0.40-2.00) was the most variable; Gemini 2.5 Pro and Gemini 2.5 Flash hallucinated citations in 80% and 76% of prompts, respectively, performing worse than all other models (P < .001). Readability scores ranged from 15.27 to 22.19, and word counts from 90.10 to 195.02. All LLMs selected appropriate imaging modalities using reasonable clinical justification. Citation validity varied widely.
Ensuring congruence between clinical reasoning and cited sources is essential before successful implementation.
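The Holm correction applied to the pairwise Wilcoxon tests in this abstract can be sketched in a few lines of standard-library Python. This is an illustrative implementation of the standard Holm step-down adjustment, not the authors' analysis code, and the example p-values are made up.

```python
def holm_adjust(p_values: list[float]) -> list[float]:
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    # Visit the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease along the order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values from three pairwise comparisons.
raw = [0.01, 0.04, 0.03]
print([round(p, 4) for p in holm_adjust(raw)])
```

A comparison is then declared significant when its adjusted p-value falls below the chosen threshold (.05 in the abstract).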
Transcription factors (TFs) and their target genes form regulatory networks that control gene expression and influence diverse biological processes and disease outcomes. Although multiple computational methods and curated databases have been developed to identify TF-target interactions, they often require specialized expertise. Large language model (LLM) chatbots offer a more accessible alternative for querying TF-target interactions. In this study, we benchmarked four prominent LLMs, Anthropic's Claude 3.5 Sonnet, Google's Gemini 1.0 Pro, OpenAI's GPT-4o, and Meta's Llama3 8b, using 8432 literature-curated human TF-target interactions. We examined four regulatory categories: bidirectional, ambiguous, self-regulated, and unidirectional interactions. Under single-turn queries, Claude 3.5 Sonnet and GPT-4o outperformed the others, with balanced accuracies reaching 50.0 ± 7.6% (GPT-4o, self-regulated) and 48.2 ± 1.0% (Claude 3.5 Sonnet, unidirectional). Zero-temperature settings generally enhanced reproducibility, and multi-turn prompting improved performance for most models, increasing Claude 3.5 Sonnet's accuracy on self-regulated pairs by 32.6%. Excluding TF-target pairs with all unknown regulation types also generally improved accuracy, with unidirectional regulation reaching near 70% balanced accuracy in some cases. We also benchmarked Anthropic's Claude 3.5 Sonnet, Google's Gemini 2.0 Flash, OpenAI's GPT-4o, and Meta's Llama3 using 5148 experimentally derived TF-target interactions. Claude 3.5 Sonnet consistently outperformed the other models across conditions. Our findings highlight that prompt engineering and strategic use of model parameters consistently influence LLM chatbots' performance on TF-target identifications. This study establishes a benchmarking framework and demonstrates the potential of pre-trained general-purpose LLMs to support regulatory biology research, especially for researchers without extensive computational expertise.
The literature-based TF-target interaction ground truth was obtained from the TRRUST v2 human dataset (www.grnpedia.org/trrust). The experimentally derived TF-target interaction ground truth was obtained from the TFLink Homo sapiens small-scale interaction table (https://tflink.net/). The processed TF-target interaction data and the analytical pipeline have been compiled as an interactive Python notebook file, available at https://github.com/pengpclab/LLM-TF-interactions.
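Balanced accuracy, the headline metric in this benchmark, is the mean of per-class recall, so a model cannot score well simply by favoring the most common regulation type. The sketch below is a generic implementation; the class labels and toy predictions are assumptions for illustration and do not come from the paper's notebook.

```python
from collections import defaultdict


def balanced_accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """Mean of per-class recall over the classes present in y_true."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy labels (hypothetical): 4 activation pairs, 2 repression pairs.
truth = ["activation"] * 4 + ["repression"] * 2
preds = ["activation", "activation", "activation", "repression",
         "repression", "activation"]
print(balanced_accuracy(truth, preds))  # (3/4 + 1/2) / 2 = 0.625
```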
Recent studies suggest that large language models (LLMs) such as ChatGPT are useful tools for medical students or residents when preparing for examinations. These studies, especially those conducted with multiple-choice questions, emphasize that the level of knowledge and response consistency of the LLMs are generally acceptable; however, further optimization is needed in areas such as case discussion, interpretation, and language proficiency. Therefore, this study aimed to evaluate the performance of six distinct LLMs for Turkish and English neurosurgery multiple-choice questions and assess their accuracy and consistency in a specialized medical context. A total of 599 multiple-choice questions drawn from Turkish Board examinations and an English neurosurgery question bank were presented to six LLMs (ChatGPT-o1pro, ChatGPT-4, AtlasGPT, Gemini, Copilot, and ChatGPT-3.5). Correctness rates were compared using the proportion z-test, and inter-model consistency was examined using Cohen's kappa. ChatGPT-o1pro, ChatGPT-4, and AtlasGPT demonstrated relatively high accuracy for Single Best Answer-Recall of Knowledge (SBA-R), Single Best Answer-Interpretative Application of Knowledge (SBA-I), and True/False question types; however, performance notably decreased for questions with images, with some models leaving many unanswered items. Our findings suggest that GPT-4-based models and AtlasGPT can handle specialized neurosurgery questions at a near-expert level for SBA-R, SBA-I, and True/False formats. Nevertheless, all models exhibit notable limitations in questions with images, indicating that these tools remain supplementary rather than definitive solutions for neurosurgical training and decision-making.
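Cohen's kappa, used above for inter-model consistency, corrects raw agreement for the agreement expected by chance. A minimal standard-library sketch follows; the two answer vectors are hypothetical and the function is a generic implementation rather than the authors' code.

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two aligned lists of categorical labels."""
    if not rater_a or len(rater_a) != len(rater_b):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the two marginal proportions per label.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical correctness flags (1 = correct) for two models on four items.
model_a = [1, 1, 0, 0]
model_b = [1, 0, 0, 0]
print(cohens_kappa(model_a, model_b))  # observed 0.75, expected 0.5 -> 0.5
```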
Publicly available artificial intelligence (AI) Vision Language Models (VLMs) are constantly improving. The advent of vision capabilities on these models could enhance radiology workflows. Evaluating their performance in radiological image interpretation is vital to their potential integration into practice. This study aims to evaluate the proficiency and consistency of the publicly available VLMs, Anthropic's Claude and OpenAI's GPT, across multiple iterations in basic image interpretation tasks. Subsets from publicly available datasets, ROCOv2 and MURAv1.1, were used to evaluate 6 VLMs. A system prompt and image were input into each model three times. The outputs were compared to the dataset captions to evaluate each model's accuracy in recognising the modality, anatomy, and detecting fractures on radiographs. The consistency of the output across iterations was also analysed. Evaluation of the ROCOv2 dataset showed high accuracy in modality recognition, with some models achieving 100%. Anatomical recognition ranged between 61% and 85% accuracy across all models tested. On the MURAv1.1 dataset, Claude-3.5-Sonnet had the highest anatomical recognition with 57% accuracy, while GPT-4o had the best fracture detection with 62% accuracy. Claude-3.5-Sonnet was the most consistent model, with 83% and 92% consistency in anatomy and fracture detection, respectively. Given Claude and GPT's current accuracy and reliability, the integration of these models into clinical settings is not yet feasible. This study highlights the need for ongoing development and establishment of standardised testing techniques to ensure these models achieve reliable performance.
Large language models (LLMs) offer substantial promise for improving health care; however, some risks warrant evaluation and discussion. This study assessed the effectiveness of safeguards in foundational LLMs against malicious instructions that turn them into health disinformation chatbots. Five foundational LLMs (OpenAI's GPT-4o, Google's Gemini 1.5 Pro, Anthropic's Claude 3.5 Sonnet, Meta's Llama 3.2-90B Vision, and xAI's Grok Beta) were evaluated via their application programming interfaces (APIs). Each API received system-level instructions to produce incorrect responses to health queries, delivered in a formal, authoritative, convincing, and scientific tone. Ten health questions were posed to each customized chatbot in duplicate. Exploratory analyses assessed the feasibility of creating a customized generative pretrained transformer (GPT) within the OpenAI GPT Store and searched for publicly accessible GPTs in the store that seemed to respond with disinformation. Of the 100 health queries posed across the 5 customized LLM API chatbots, 88 (88%) responses were health disinformation. Four of the 5 chatbots (GPT-4o, Gemini 1.5 Pro, Llama 3.2-90B Vision, and Grok Beta) generated disinformation in 100% (20 of 20) of their responses, whereas Claude 3.5 Sonnet responded with disinformation in 40% (8 of 20). The disinformation included claimed vaccine-autism links, HIV being airborne, cancer-curing diets, sunscreen risks, genetically modified organism conspiracies, attention deficit-hyperactivity disorder and depression myths, garlic replacing antibiotics, and 5G causing infertility. Exploratory analyses further showed that the OpenAI GPT Store could currently be instructed to generate similar disinformation. Overall, LLM APIs and the OpenAI GPT Store were shown to be vulnerable to malicious system-level instructions to covertly create health disinformation chatbots.
These findings highlight the urgent need for robust output screening safeguards to ensure public health safety in an era of rapidly evolving technologies.
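The headline figure of 88 disinformation responses out of 100 follows directly from the per-model counts in the abstract. The short tally below reproduces that arithmetic; the counts are those reported above, while the variable names are illustrative.

```python
# Disinformation responses per chatbot, as reported in the abstract.
disinformation_counts = {
    "GPT-4o": 20,
    "Gemini 1.5 Pro": 20,
    "Llama 3.2-90B Vision": 20,
    "Grok Beta": 20,
    "Claude 3.5 Sonnet": 8,
}
RESPONSES_PER_CHATBOT = 20  # 10 health questions, each posed in duplicate

total_disinformation = sum(disinformation_counts.values())
total_responses = RESPONSES_PER_CHATBOT * len(disinformation_counts)
overall_rate = total_disinformation / total_responses

# Per-model share of responses that were disinformation.
per_model_rate = {
    name: count / RESPONSES_PER_CHATBOT
    for name, count in disinformation_counts.items()
}

print(total_disinformation, total_responses, overall_rate)  # 88 100 0.88
print(per_model_rate["Claude 3.5 Sonnet"])  # 0.4
```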
The demands of intensified aquaculture production and escalating disease prevalence underscore the need for efficacious probiotic strategies to enhance fish health. This study focused on isolating and characterising potential probiotics from the gut microbiota of the emerging aquaculture species jade perch (Scortum barcoo). Eighty-seven lactic acid bacteria and 149 other bacteria were isolated from the digestive tract of five adult jade perch. The screening revealed that 24 Enterococcus hirae isolates inhibited the freshwater pathogens Aeromonas sobria and Streptococcus iniae. Co-incubating E. hirae with the host gut suspensions demonstrated a two- to five-fold increase in the size of growth inhibition zones compared to the results when using gut suspensions from tilapia (a non-host), indicating host-specificity. Genome analysis of the lead isolate, E. hirae R44, predicted the presence of antimicrobial compounds like enterolysin A, class II lanthipeptide, and terpenes, which underlay its antibacterial attributes. Isolate R44 exhibited desirable probiotic characteristics, including survival at pH values within the range of 3 to 12, bile tolerance, antioxidant activity, ampicillin sensitivity, and absence of transferable antimicrobial resistance genes and virulence factors commonly associated with hospital Enterococcus strains (IS16, hylEfm, and esp). This study offers a foundation for sourcing host-adapted probiotics from underexplored aquaculture species. Characterisation of novel probiotics like E. hirae R44 can expedite the development of disease mitigation strategies to support aquaculture intensification.
Large language models generate plausible text responses to medical questions, but inaccurate responses pose significant risks in medical decision-making. Grading LLM outputs to determine the best model or answer is time-consuming and impractical in clinical settings; therefore, we introduce EVAL (Expert-of-Experts Verification and Alignment) to streamline this process and enhance LLM safety for upper gastrointestinal bleeding (UGIB). We evaluated OpenAI's GPT-3.5/4/4o/o1-preview, Anthropic's Claude-3-Opus, Meta's LLaMA-2 (7B/13B/70B), and Mistral AI's Mixtral (7B) across 27 configurations, including zero-shot baseline, retrieval-augmented generation, and supervised fine-tuning. EVAL uses similarity-based ranking and a reward model trained on human-graded responses for rejection sampling. Among the employed similarity metrics, Fine-Tuned ColBERT achieved the highest alignment with human performance across three separate datasets (ρ = 0.81-0.91). The reward model replicated human grading in 87.9% of cases across temperature settings and significantly improved accuracy through rejection sampling by 8.36% overall. EVAL offers scalable potential to assess accuracy for high-stakes medical decision-making.
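Rejection sampling with a reward model, as mentioned above, can be pictured as scoring several candidate answers and discarding those the reward model rates poorly. The sketch below is a generic illustration under stated assumptions: the threshold, the function name, and the lookup-table "reward model" are hypothetical stand-ins, not EVAL's actual components.

```python
from typing import Callable, Optional


def rejection_sample(
    candidates: list[str],
    reward: Callable[[str], float],
    threshold: float,
) -> Optional[str]:
    """Reject candidates scoring below `threshold`; return the best survivor.

    Returns None when every candidate is rejected.
    """
    scored = [(reward(text), text) for text in candidates]
    accepted = [(score, text) for score, text in scored if score >= threshold]
    if not accepted:
        return None
    return max(accepted, key=lambda pair: pair[0])[1]


# Hypothetical precomputed scores standing in for a trained reward model.
toy_scores = {"answer A": 0.2, "answer B": 0.9, "answer C": 0.7}

print(rejection_sample(list(toy_scores), toy_scores.__getitem__, threshold=0.5))
print(rejection_sample(list(toy_scores), toy_scores.__getitem__, threshold=0.95))
```

In the first call only "answer B" and "answer C" survive and the higher-scoring one is returned; in the second call nothing clears the threshold, so the caller would need to resample or defer to a human.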
Background: Large language models (LLMs) have the potential to enhance information processing and clinical reasoning in the healthcare industry but are hindered by inaccuracies and hallucinations. The retrieval-augmented generation (RAG) technique may address these problems by integrating external knowledge sources. Methods: We developed a RAG-based chatbot called Thyro-GenAI by integrating a database of textbooks and guidelines with an LLM. Thyro-GenAI and three service LLMs (OpenAI's ChatGPT-4o, Perplexity AI's ChatGPT-4o, and Anthropic's Claude 3.5 Sonnet) were asked personalized clinical questions about thyroid disease. Three thyroid specialists assessed the quality of the generated responses and references without being blinded, which allowed them to interact with different chatbot interfaces. Results: Thyro-GenAI achieved the highest inverse-weighted mean rank for overall response quality. The overall inverse-weighted mean rankings for Thyro-GenAI, ChatGPT, Perplexity, and Claude were 3.0, 2.3, 2.8, and 1.9, respectively. Thyro-GenAI also achieved the second-highest inverse-weighted mean rank for overall reference quality. The overall inverse-weighted mean rankings for Thyro-GenAI, ChatGPT, Perplexity, and Claude were 3.1, 2.3, 3.2, and 1.8, respectively. Conclusions: Thyro-GenAI produced patient-specific clinical reasoning output based on a vector database, with fewer hallucinations and more reliability, compared to service LLMs. This emphasis on evidence-based responses ensures its safety and validity, addressing a critical limitation of existing LLMs. By integrating RAG with LLMs, it has the potential to support frontline clinical decision-making, especially helping first-line physicians by offering reliable decision support while managing thyroid disease patients.
The healthcare industry faces dual revolutions with the widespread adoption of mobile healthcare applications and the emergence of generative artificial intelligence (Gen AI). The Veterans Administration and other military healthcare providers particularly stand to benefit from these technologies given their unique challenges serving veterans. This case study explored how Gen AI might help bridge the historical gap between healthcare providers and software developers in creating more effective healthcare applications for veterans. The study utilized Anthropic's Claude 3.5 Sonnet, a large language model, to assist in developing requirements for a hypothetical healthcare application, Annie Pro. The process included uploading relevant documentation into the AI's context window and conducting an unstructured interview with the AI over an approximately 6-hour period, generating 80 pages of conversational text and 23 multi-page artifacts. Eight software developers were consulted to provide informal qualitative feedback on the resulting 26-page requirements document. The Gen AI demonstrated utility in requirements gathering, technical specification development, project planning, user flow mapping, and interface design. The AI showed particular strength in rapidly incorporating new requirements and explaining technical concepts to nontechnical stakeholders. Software developers reviewing the final product universally praised its value as a starting point for development, although some expressed concern about overly prescriptive technical specifications. This study suggests that Gen AI can effectively support healthcare providers in developing software requirements. While the technology shows promise in improving provider-developer communication, careful attention must be paid to avoid false confidence and over-specification. 
Future studies should look to replicate these results across different healthcare contexts and with different AI models as the technology continues to evolve rapidly.
Generative artificial intelligence (GenAI) systems like Anthropic's Claude and OpenAI's ChatGPT are rapidly being adopted in various sectors, including health care, offering potential benefits for clinical support, administrative efficiency, and patient information access. However, real-world adoption patterns and the extent to which GenAI is used for health care-related tasks remain poorly understood and distinct from performance benchmarks in controlled settings. Understanding these organic usage patterns is key for assessing GenAI's impact on health care delivery and patient-provider dynamics. This study aimed to quantify the real-world frequency and scope of health care-related tasks performed using Anthropic's Claude GenAI. We sought to (1) measure the proportion of Claude interactions related to health care tasks versus other domains; (2) identify specific health care occupations (as per O*NET classifications) with high associated interaction volumes; (3) assess the breadth of task adoption within roles using a "digital adoption rate"; and (4) interpret these findings considering the inherent ambiguity regarding user identity (ie, professionals vs public) in the dataset. We performed a cross-sectional analysis of more than 4 million anonymized user conversations with Claude (ie, including both free and pro subscribers) from December 2024 to January 2025, using a publicly available dataset from Anthropic's Economic Index research. Interactions were preclassified by Anthropic's proprietary Clio model into standardized occupational tasks mapped to the US Department of Labor's O*NET database. The dataset did not allow differentiation between health care professionals and the general public as users. We focused on interactions mapped to O*NET Healthcare Practitioners and Technical Occupations. 
Main outcomes included the proportion of interactions per health care occupation, proportion of overall health care interaction versus other categories, and the digital adoption rate (ie, distinct tasks performed via GenAI divided by the total possible tasks per occupation). Health care-related tasks accounted for 2.58% of total analyzed GenAI conversations, significantly lower than domains such as computing (37.22%). Within health care, interaction frequency varied notably by role. Occupations emphasizing patient education and guidance exhibited the highest proportion, including dietitians and nutritionists (6.61% of health care conversations), nurse practitioners (5.63%), music therapists (4.54%), and clinical nurse specialists (4.53%). Digital adoption rates (task breadth) ranged widely across top health care roles (13.33%-65%), averaging 16.92%, below the global average (21.13%). Tasks associated with medical records and health information technicians had the highest adoption rate (65.0%). GenAI tools are being adopted for a measurable subset of health care-related tasks, with usage concentrated in specific, often patient-facing roles. The critical limitation of user anonymity prevents definitive conclusions regarding whether usage primarily reflects patient information-seeking behavior (potentially driven by access needs) or professional workflow assistance. This ambiguity necessitates caution when interpreting current GenAI adoption. Our findings emphasize the urgent need for strategies addressing potential impacts on clinical workflows, patient decision-making, information quality, and health equity. Future research must aim to differentiate user types, while stakeholders should develop targeted guidance for both safe patient use and responsible professional integration.
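The digital adoption rate defined above (distinct tasks performed via GenAI divided by the total possible tasks per occupation) reduces to a set intersection and a ratio. The sketch below follows that definition; the task identifiers and the 13-of-20 example are hypothetical and were chosen only because they yield 65%.

```python
def digital_adoption_rate(tasks_seen: set[str], all_tasks: set[str]) -> float:
    """Share of an occupation's tasks that appear at least once in the data."""
    if not all_tasks:
        raise ValueError("an occupation must list at least one task")
    # Count only observed tasks that belong to this occupation's task list.
    return len(tasks_seen & all_tasks) / len(all_tasks)


# Hypothetical occupation with 20 listed tasks, 13 of which were observed.
occupation_tasks = {f"task_{i}" for i in range(20)}
observed_tasks = {f"task_{i}" for i in range(13)}
print(digital_adoption_rate(observed_tasks, occupation_tasks))  # 0.65
```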
To show the value of custom methods, namely Retrieval Augmented Generation (RAG)-based Large Language Models (LLMs) and Agentic Augmentation, over standard LLMs in delivering accurate information using an anterior cruciate ligament (ACL) injury case. A set of 100 questions and answers based on the 2022 AAOS ACL guidelines was curated. Closed-source models (OpenAI's GPT4/GPT 3.5 and Anthropic's Claude3) and open-source models (Llama3 8b/70b and Mistral 8×7b) were asked questions in base form and again with AAOS guidelines embedded into a RAG system. The top-performing models were further augmented with artificial intelligence (AI) agents and reevaluated. Two fellowship-trained surgeons blindly evaluated the accuracy of the responses of each cohort. Recall-Oriented Understudy for Gisting Evaluation and Metric for Evaluation of Translation with Explicit Ordering scores were calculated to assess semantic similarity in the response. All noncustom LLMs started below 60% accuracy. Applying RAG improved the accuracy of every model by an average of 39.7%. The highest performing model with just RAG was Meta's open-source Llama3 70b (94%). The highest performing model with RAG and AI agents was OpenAI's GPT4 (95%). RAG improved accuracy by an average of 39.7%, with the highest accuracy rate of 94% in the Meta Llama3 70b. Incorporating AI agents into a previously RAG-augmented LLM improved the ChatGPT4 accuracy rate to 95%. Thus, Agentic and RAG augmented LLMs can be accurate liaisons of information, supporting our hypothesis. Despite literature surrounding the use of LLMs in medicine, there has been considerable and appropriate skepticism given the variably accurate response rates. This study establishes the groundwork to identify whether custom modifications to LLMs using RAG and agentic augmentation can better deliver accurate information in orthopaedic care.
With this knowledge, online medical information commonly sought in popular LLMs, such as ChatGPT, can be standardized and provide relevant online medical information to better support shared decision making between surgeon and patient.
Large Language Models (LLMs) hold promise for clinical decision support, but their real-world performance varies. We compared three leading models (OpenAI's "o1" Large Reasoning Model (LRM), Anthropic's Claude-3.5-Sonnet, and Meta's Llama-3.2-70B) to human experts in an emergency internal medicine setting. We conducted a prospective comparative study on 73 anonymized patient cases from the Emergency Internal Medicine ward of the University Hospital Split, Croatia (June-September 2024). Two independent internal medicine specialists, blinded to model identity, graded the LLM-generated reports in two steps: (1) they evaluated the relevance of recommended diagnostic tests based on the patient's signs, symptoms, and medical history; (2) after reviewing the actual diagnostic test results, they assessed each model's final diagnosis, therapy plan, and follow-up recommendations. The same evaluative framework was applied to human-authored reports. Likert scales (1-4 or 1-3) were used, and statistical comparisons included the Friedman and Wilcoxon signed-rank tests. The o1 model achieved a mean final rating (3.63) statistically indistinguishable from human physicians (3.67; p = 0.62). Claude-3.5-Sonnet (3.38) and Llama-3.2-70B (3.23) scored significantly lower (p < 0.01 vs. o1), largely due to errors in therapy planning and non-medication recommendations. Despite this gap, all three models demonstrated ≥90 % accuracy in final diagnoses and patient admission decisions. The o1 model correctly classified all abnormal lab values (100 %), while Claude-3.5-Sonnet and Llama-3.2-70B showed minor errors (99.5 % and 99 % accuracy, respectively). When evaluated on real-world emergency cases, an advanced LLM with enhanced reasoning (o1) can match expert-level clinical performance, underscoring its potential utility as a decision-support tool.
This study aimed to systematically evaluate and compare the diagnostic performance of leading large language models (LLMs) in common and complex clinical scenarios, assessing their potential for enhancing clinical reasoning and diagnostic accuracy in authentic clinical decision-making processes. Diagnostic capabilities of advanced LLMs (Anthropic's Claude, OpenAI's GPT variants, Google's Gemini) were assessed using 60 common cases and 104 complex, real-world cases from Clinical Problem Solvers' morning rounds. Clinical details were disclosed in stages, mirroring authentic clinical decision-making. Models were evaluated on primary and differential diagnosis accuracy at each stage. Advanced LLMs showed high diagnostic accuracy (>90%) in common scenarios, with Claude 3.7 achieving perfect accuracy (100%) in certain conditions. In complex cases, Claude 3.7 achieved the highest accuracy (83.3%) at the final diagnostic stage, significantly outperforming smaller models. Smaller models notably performed well in common scenarios, matching the performance of larger models. This study evaluated leading LLMs for diagnostic accuracy using staged information disclosure, mirroring real-world practice. Notably, Claude 3.7 Sonnet was the top performer. Employing a novel LLM-based evaluation method for large-scale analysis, the research highlights artificial intelligence's (AI's) potential to enhance diagnostics. It underscores the need for useful frameworks to translate accuracy into clinical impact and integrate AI into medical education. Leading LLMs show remarkable diagnostic accuracy in diverse clinical cases. To fully realize their potential for improving patient care, we must now focus on creating practical implementation frameworks and translational research to integrate these powerful AI tools into medicine.
A few weeks ago, a colleague of mine needed to collect and format some data from a website, and he asked the latest version of Anthropic's generative AI system, Claude, for help. Claude cheerfully agreed to perform the task, generated a computer program to download the data, and handed over perfectly formatted results. The only problem? My colleague immediately noticed that the data Claude delivered was entirely fabricated.