The aim of this study was to assess the quality and readability of ChatGPT and Gemini's responses to frequently asked questions about early intervention for individuals with at-risk infants. Ten frequently asked questions about early intervention were selected by three researchers (a child development specialist, a physiotherapist, and a midwife) from a list generated by ChatGPT and Gemini. Questions were sent to ChatGPT version 4.0 and Gemini 1.5, and initial responses were recorded without follow-up queries. Ten independent experts (two special education specialists, two child development specialists, two physiotherapists, two midwives, and two pediatricians) assessed the quality of ChatGPT and Gemini's responses using a four-grade rating system. Readability levels were analyzed using the Flesch-Kincaid Grade Level through WordCalc software. One of the answers given by ChatGPT was of higher quality than Gemini's (p = 0.025), while one answer given by Gemini was of higher quality than ChatGPT's (p = 0.033). The answers to the other questions were of similar quality, with Gemini having a lower level. This study compares the quality and readability of the answers given by artificial intelligence-based language models to demonstrate their potential to appeal to different user groups. While the models generally provided answers of similar quality, quantitative differences in readability were observed, suggesting potential suitability for different audiences. These findings contribute to understanding the role of AI tools in health communication.
Large language models (LLMs) have demonstrated expert-level performance on medical licensing examinations, but most benchmarks focus on final accuracy, obscuring model-specific behaviors. Critical gaps remain in understanding model efficiency (latency), the efficacy of tiered "rescue" protocols for error correction, and the systematic correlation between performance and human-rated question difficulty. The German M2 exam, paired with the AMBOSS platform's user-data-driven difficulty ratings, provides a unique opportunity to map AI performance directly against human cognitive load. This study aimed to move beyond singular accuracy scores by (1) evaluating and comparing the baseline (Tier 1) accuracy and response latency of next-generation rapid-response LLMs; (2) analyzing the efficacy of a two-tiered rescue (Tier 2) protocol in correcting initial errors; and (3) correlating model performance with the user-data-driven AMBOSS difficulty rating. We evaluated four LLMs (Gemini 2.5 Flash/Pro and ChatGPT 5 Instant/Thinking) on the complete 316-item German M2 (Fall 2024) medical exam, including all multimodal (image-based) questions. A zero-shot copy-paste prompting strategy was utilized, and outputs were evaluated against ground-truth answers using a strict exact-match criterion. A two-tiered protocol was used: Tier 1 (Flash/Instant) provided baseline responses. If a Tier 1 response was incorrect, a Tier 2 (Pro/Thinking) model was deployed as a "rescue." Performance was analyzed using McNemar's test, Wilcoxon signed-rank test, Fisher's exact test, and logistic regression. Baseline (Tier 1) accuracy was identical at 91.46% (95% CI 87.85-94.06; n = 289/316) for both Gemini 2.5 Flash and ChatGPT 5 Instant, with 27 errors each. However, Gemini Flash (mean = 1.57 s) was significantly faster than ChatGPT Instant (mean = 2.07 s; P < .001).
Additionally, ChatGPT Instant expended significantly more time on incorrect answers compared to correct ones (P = .002), whereas Gemini Flash showed no such hesitation (P = .814). The Tier 2 rescue rate for ChatGPT 5 Thinking (48.15%, 13/27; 95% CI 30.74-66.01) was higher, though not statistically significant (P = .406), than for Gemini 2.5 Pro (33.33%, 9/27; 95% CI 18.64-52.18). This rescue protocol elevated final accuracy to 94.30% (95% CI 91.18-96.37) for the Gemini system and 95.57% (95% CI 92.70-97.34) for the ChatGPT system (P = .481). A strong, inverse relationship with difficulty was found: for every one-point difficulty increase, the odds of a correct Tier 1 response decreased by 42.1% (OR 0.579, 95% CI 0.425-0.788; P < .001) for Gemini Flash and 47.7% (OR 0.523, 95% CI 0.379-0.720; P < .001) for ChatGPT Instant. This negative correlation persisted even after the rescue (P = .013 and P = .006, respectively). Expert-level LLM performance on the German M2 exam masks a critical, systematic vulnerability: a significant decrease in accuracy directly correlated with increased question difficulty. A two-tiered "rescue" system is an effective strategy to mitigate these difficulty-based failures and achieve >95% accuracy, rivaling the best-performing, full-capacity models. We conclude that a simple reliance on a single model is insufficient; hierarchical systems that manage query difficulty are essential for safe and effective integration into medical education.
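The headline figures in this abstract can be re-derived from the reported counts. The sketch below is illustrative only: the function names are hypothetical, and the use of a Wilson score interval is an assumption (the abstract does not name its interval method, although the Wilson formula appears to reproduce the reported bounds).

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a proportion (assumed method, not stated in the abstract)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def final_accuracy(tier1_correct, rescued, n_items):
    """Accuracy after Tier 2 has corrected some of the Tier 1 errors."""
    return (tier1_correct + rescued) / n_items

def odds_decrease_pct(odds_ratio):
    """Percent decrease in odds per one-point difficulty increase."""
    return (1 - odds_ratio) * 100

N_ITEMS = 316        # exam items, from the abstract
TIER1_CORRECT = 289  # correct Tier 1 answers for both systems
RESCUED = {"Gemini": 9, "ChatGPT": 13}  # Tier 2 rescues out of 27 errors

low, high = wilson_ci(TIER1_CORRECT, N_ITEMS)
print(f"Tier 1: {TIER1_CORRECT / N_ITEMS:.2%} (95% CI {low:.2%}-{high:.2%})")

for system, rescued in RESCUED.items():
    acc = final_accuracy(TIER1_CORRECT, rescued, N_ITEMS)
    print(f"{system} final accuracy: {acc:.2%}")

for model, odds_ratio in {"Gemini Flash": 0.579, "ChatGPT Instant": 0.523}.items():
    print(f"{model}: odds fall by {odds_decrease_pct(odds_ratio):.1f}% per difficulty point")
```

The percent decrease follows directly from the odds ratio: an OR of 0.579 means the odds are multiplied by 0.579 per difficulty point, which is a 42.1% reduction.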
This study compared the performance of three large language models, ChatGPT-5 Plus, Gemini 2.5 Pro, and SuperGrok 4, in identifying anatomical structures on radiographic images using standardized anatomical terminology. Thirty radiographs from different body regions were selected from an open-access atlas and analyzed by the models in Normal and Thinking modes using standardized prompts based on Terminologia Anatomica (version 2.07). Responses were evaluated independently by two anatomists using a 0-2 scoring system. Overall accuracy across both modes and models ranged from 47.4% to 85.7%. Data were analyzed using Friedman and Wilcoxon signed-rank tests. Temporal response consistency was assessed with weighted kappa coefficients. Gemini 2.5 Pro and ChatGPT-5 Plus significantly outperformed SuperGrok 4 in both modes. In Normal mode, Gemini 2.5 Pro achieved the highest overall accuracy (82.7%), significantly exceeding ChatGPT-5 Plus (60.7%, p = 0.001) and SuperGrok 4 (47.4%, p < 0.001). In Thinking mode, accuracies were 85.7% for Gemini 2.5 Pro, 77.6% for ChatGPT-5 Plus, and 49.5% for SuperGrok 4. Gemini 2.5 Pro demonstrated a significant advantage over ChatGPT-5 Plus only in Normal mode (p = 0.001), whereas Thinking mode significantly improved performance only for ChatGPT-5 Plus (p = 0.01). Temporal stability analysis showed high response consistency for Gemini 2.5 Pro and SuperGrok 4 across all modes (r > 0.94, p < 0.001). Conversely, ChatGPT-5 Plus' stability decreased from substantial agreement in normal mode (r = 0.697, p < 0.001) to moderate agreement in Thinking mode (r = 0.539, p < 0.001). Despite their educational potential, these models need refinement to reliably identify anatomical structures on radiographic images.
Scar is an inevitable pathological product of tissue injury repair, and pathological scars often occur in exposed areas, bringing severe psychological burden and economic losses to patients. With the popularization of digital healthcare, patients increasingly rely on artificial intelligence (AI) for self-consultation, but the core capabilities of free generative AI in scar management have not been systematically evaluated. This study compared and evaluated the comprehensive performance of ChatGPT-5.4 mini and Gemini 3 Flash in answering clinical and psychological questions of scar patients, investigated multi-dimensional differences, and provided support for the application of AI in patient education. Fifteen core questions from scar patients were extracted and input into ChatGPT-5.4 mini and Gemini 3 Flash, respectively. The DISCERN-AI scale and Global Quality Scale (GQS) were used for evaluation, while multiple standardized tools were applied to quantify text readability and complexity. All data were subjected to a normality test and difference analysis using SPSS software. Both models demonstrated high clinical reliability, with no significant difference in target topic clarity (P=0.806). ChatGPT had better overall quality, with a GQS score of 4.8 (4.5, 4.9), which was significantly higher than Gemini's 4.6 (4.4, 4.7) (P=0.033). ChatGPT was also more rigorous in stating medical limitations and uncertain treatment options (5.0 versus 4.5, P<0.05). In contrast, Gemini performed better in patient demand relevance and empathy (4.5 versus 4.0, P=0.026). Both models achieved moderate scores in shared decision-making support. Readability analysis showed that the reading thresholds of both models were excessively high, far exceeding the internationally recommended 6th- to 8th-grade standard for patient education materials. 
ChatGPT-5.4 mini and Gemini 3 Flash have complementary advantages and potential as auxiliary tools for digital health education in scar patients, but both have a serious readability gap. For future large-scale applications, readability prompt intervention should be introduced, and it should be clearly stated that AI cannot replace professional diagnosis and treatment to ensure the inclusiveness and safety of digital medical information.
Large language models are used for meal planning, but their suitability for restrictive diets remains unclear. We hypothesized that ChatGPT would generate gluten-free (GF) menus with lower energy and micronutrient adequacy than Gemini. In this exploratory prompt-based comparison, ChatGPT (GPT-4o) and Gemini (2.5 Flash) each generated six daily 2000 kcal GF menus for a hypothetical 30-year-old woman with celiac disease using a standardized prompt. Generation was repeated after 1 wk for test-retest and sensitivity analyses (12 menus/model). The nutrient composition was analyzed using BeBiS v9.0. Micronutrient adequacy was summarized using nutrient adequacy ratios (NARs; truncated at 1.00) and the mean adequacy ratio. Diet quality was assessed with the Diet Quality Index-International, and carbon footprint was estimated using sustainability assessment of food and diets factors. Between-model differences were assessed using t tests or exact Mann-Whitney U tests with multiplicity adjustment; short-term output variability was characterized using intraclass correlation coefficients and Bland-Altman plots. In the initial dataset, ChatGPT menus provided less energy than Gemini menus (-379.8 kcal/d; Padj = 0.002) and showed lower mean adequacy ratio (-0.05; Padj = 0.002), mainly reflecting calcium and iron shortfalls. Diet Quality Index-International scores were similar between models, and absolute carbon footprint was numerically lower in ChatGPT menus but not statistically significant (Padj = 0.089). 1-wk output variability was observed across outcomes. In this exploratory prompt-based study, ChatGPT and Gemini generated GF menus with different nutrient adequacy profiles under the tested conditions, whereas diet quality and carbon footprint were comparable. Given the limited prompt sample and observed output variability, these findings should be interpreted as hypothesis-generating. 
AI-generated GF menu outputs require standardized prompting, independent nutrient verification, GF safety assessment, and dietitian review before clinical use.
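The adequacy metrics described above (nutrient adequacy ratios truncated at 1.00, averaged into a mean adequacy ratio) can be sketched as follows. The nutrient names, intakes, and requirement values are hypothetical placeholders chosen for illustration; they are not data from the study and not official reference intakes.

```python
def nutrient_adequacy_ratio(intake, requirement):
    """NAR = intake / requirement, truncated at 1.00 so surpluses cannot offset deficits."""
    return min(intake / requirement, 1.0)

def mean_adequacy_ratio(intakes, requirements):
    """MAR = arithmetic mean of the truncated NARs across all nutrients considered."""
    nars = [nutrient_adequacy_ratio(intakes[name], requirements[name])
            for name in requirements]
    return sum(nars) / len(nars)

# Hypothetical example menu (illustrative numbers only).
requirements = {"calcium_mg": 1000, "iron_mg": 18, "vitamin_c_mg": 75}
intakes = {"calcium_mg": 800, "iron_mg": 18, "vitamin_c_mg": 150}

for name in requirements:
    nar = nutrient_adequacy_ratio(intakes[name], requirements[name])
    print(f"{name}: NAR = {nar:.2f}")
print(f"MAR = {mean_adequacy_ratio(intakes, requirements):.2f}")
```

Truncation is the key design choice: in this example the vitamin C surplus is capped at 1.00, so the calcium shortfall still lowers the mean.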
Machine-learning-based assessment of session dialogue is gradually becoming an effective way to measure the therapeutic alliance, which is an important factor in successful psychotherapy. However, most existing models assume clean and pre-structured dialogue transcripts, whereas real-world counseling documentation often contains heterogeneous case reports. This gap limits the applicability of current automated assessment models in realistic documentation scenarios. In this work, we propose a framework for automated working alliance assessment from complex, multilingual reports. First, language-specific BERT models are fine-tuned to process case reports across different languages, enabling accurate speaker role delineation and dialogue structuring. Second, Gemini-2.5-Flash is leveraged to annotate the dialogues with working alliance ratings. Third, a hybrid feature representation strategy is developed to jointly capture linguistic style and semantic content from the counseling dialogues. Furthermore, an entropy-based mutual information analysis is conducted to identify the most informative linguistic features. Finally, the extracted hybrid features serve as inputs to XGBoost for alliance assessment. In experiments, the proposed framework outperforms state-of-the-art (SOTA) methods and shows better generalization ability.
This study aims to compare emergency physicians' performance with that of general-purpose large language models (LLMs), such as ChatGPT and Gemini, for pneumothorax (PTX) detection on chest radiographs (CXRs). This single-center, retrospective study of adults was conducted between January 2015 and February 2025 and included 265 PTX cases and 267 non-PTX controls. Exclusions included diagnoses made only by computed tomography, absence of CXR, initial treatment at another center, or incomplete data. Thirteen emergency physicians independently and blindly reviewed CXRs and recorded a binary decision. ChatGPT and Gemini evaluated the same images with a standardized yes/no prompt, with memory cleared between cases to prevent carryover. The primary outcome was LLM diagnostic performance for PTX, while the secondary outcome compared LLMs with physicians. ChatGPT and Gemini exhibited distinct diagnostic performance profiles for PTX detection on CXRs. Gemini demonstrated a sensitivity of 52.5%, whereas ChatGPT demonstrated a sensitivity of 44.5%. Conversely, ChatGPT achieved a specificity of 95.5% and an overall accuracy of 70.1%, while Gemini demonstrated a specificity of 79.0% and an accuracy of 65.8%. Agreement with the reference standard was moderate for ChatGPT, with a kappa value of 0.401, and fair for Gemini, with a kappa value of 0.315. Increasing case difficulty was associated with a reduction in diagnostic accuracy for both models, with correlation coefficients of -0.438 for ChatGPT and -0.274 for Gemini. For contextual clinical comparison, emergency physicians demonstrated a sensitivity of 64.5%, a specificity of 99.6%, and an overall accuracy of 82.1%. This study demonstrates model-specific differences in PTX detection by general-purpose AI systems, with Gemini showing higher sensitivity and ChatGPT showing superior specificity and accuracy, both declining with increasing case difficulty. Physician performance remained higher but was a secondary outcome, reported for context.
Despite their accessibility and low cost, these models should be considered only adjunctive tools until task-specific optimization and clinical validation are achieved.
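The reported sensitivity, specificity, accuracy, and kappa values all follow from a 2x2 confusion matrix. In the sketch below, the cell counts are not given in the abstract: they are back-calculated from the reported percentages and the 265/267 case split, so they should be read as an assumption. The helper name is hypothetical.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy, and Cohen's kappa from 2x2 counts."""
    n = tp + fn + tn + fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / n
    # Chance agreement expected from the marginal totals.
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (accuracy - p_expected) / (1 - p_expected)
    return sensitivity, specificity, accuracy, kappa

# Counts reconstructed from the reported rates (assumption, not source data):
# 265 PTX cases and 267 non-PTX controls.
reconstructed = {
    "ChatGPT": dict(tp=118, fn=147, tn=255, fp=12),
    "Gemini": dict(tp=139, fn=126, tn=211, fp=56),
}

for model, counts in reconstructed.items():
    sens, spec, acc, kappa = diagnostic_metrics(**counts)
    print(f"{model}: sensitivity {sens:.1%}, specificity {spec:.1%}, "
          f"accuracy {acc:.1%}, kappa {kappa:.3f}")
```

Because kappa subtracts chance agreement, a model with 70.1% accuracy on a roughly balanced sample still reaches only moderate agreement with the reference standard.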
Objective: Online patient education materials (OPEMs) are important resources for patients seeking health information. While the National Institutes of Health (NIH) and American Medical Association (AMA) recommend a sixth-grade readability level for OPEMs, commonly available material often exceeds such criteria. Large language models (LLMs), such as ChatGPT and Gemini, have emerged as tools for health education with potential applications in simplification of health material. This study assesses the utility of ChatGPT and Gemini in enhancing the readability of OPEMs for peripheral nerve surgeries. Methods: Eleven common peripheral nerve surgeries were used as online search terms. The first 20 unique search results were assessed; results were excluded if they did not include patient-facing material. ChatGPT and Gemini were instructed to rewrite the text of the OPEM at or below a sixth-grade reading level. Readability metrics were calculated for original OPEMs, alongside ChatGPT and Gemini rewrites. LLM responses were reviewed for accuracy/quality (five-point scale) and comprehensiveness (three-point scale) using predefined criteria. Results: A total of 220 websites were assessed. In total, 155 OPEMs met the inclusion criteria; 65 websites were excluded because they were academic journal articles or other provider-facing materials. The average Flesch-Kincaid grade level (FKGL) of OPEMs was 11.3, significantly greater than the NIH/AMA-sixth grade recommendations (p < 0.001). The average FKGL of ChatGPT rewrites was significantly lower than that of OPEMs (11.3 vs. 7.5, p < 0.001), as was the average FKGL of Gemini rewrites (11.3 vs. 5.6, p < 0.001). ChatGPT rewrites were of higher accuracy/quality (4.5/5.0 vs. 4.0/5.0, p < 0.001) and comprehensiveness (2.0/3.0 vs. 1.0/3.0, p < 0.001) relative to Gemini rewrites. Conclusions: The readability of online patient education materials for peripheral nerve surgery significantly exceeded NIH/AMA recommendations. 
ChatGPT and Gemini were able to significantly simplify the reading level of these OPEMs. LLMs may serve as tools to improve the readability of peripheral nerve surgery OPEMs.
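For readers unfamiliar with the metric, the Flesch-Kincaid Grade Level is computed from average sentence length and average syllables per word. The sketch below uses the standard published coefficients; the syllable counter is a crude vowel-group heuristic introduced here for illustration, so its scores will differ from those of dedicated readability tools such as the ones used in the study.

```python
import re

def fkgl_from_counts(words, sentences, syllables):
    """Flesch-Kincaid Grade Level from raw counts (standard coefficients)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def count_syllables(word):
    """Rough heuristic: one syllable per run of vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fkgl(text):
    """Approximate FKGL of a text using the heuristic syllable counter above."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return fkgl_from_counts(len(words), len(sentences), syllables)

# 100 words in 5 sentences with 150 syllables:
# 0.39 * 20 + 11.8 * 1.5 - 15.59 = 9.91
print(round(fkgl_from_counts(100, 5, 150), 2))

# Short sentences of one-syllable words score very low (can be negative).
print(round(fkgl("The cat sat. The dog ran."), 2))
```

The formula makes clear why rewrites lower the grade level: shorter sentences reduce the first term and simpler vocabulary reduces the second.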
Social media platforms such as X (formerly Twitter) are increasingly used by journals, authors, and institutions to promote newly published research. Well-designed posts can enhance visibility, accelerate knowledge translation, and increase altmetric attention. However, creating accurate and policy-compliant content is time-intensive. Large language models (LLMs) offer a potential solution, yet systematic evaluations of their performance in post-publication promotion remain limited. We conducted a blinded, crossed, offline evaluation of four LLMs: GPT-5 (OpenAI), Gemini 2.5 Pro (Google DeepMind), Grok-3 (xAI), and Perplexity Pro (Perplexity AI), tasked with generating X-style posts (≤ 260 characters) for 36 open access articles from The Lancet Public Health, The Lancet Planetary Health, and Annual Review of Public Health. Posts were generated using a standardized system and user prompt. A single blinded rater scored outputs using a five-domain rubric (factual accuracy, clarity, policy compliance, call-to-action quality, structure/metadata; maximum score 10). Secondary measures included character count, hashtag use, and readability (Flesch-Kincaid Grade Level). General linear models with Bonferroni-adjusted post hoc tests and non-parametric analyses were applied. All four models achieved perfect factual accuracy and no policy violations. Mean total quality scores differed significantly by model, P < 0.001. GPT-5 (9.60) and Perplexity Pro (9.60) performed best, followed by Gemini 2.5 Pro (9.47), while Grok-3 scored lower (8.80). Domain analyses showed Grok-3 underperformed in call-to-action quality (1.40 vs. ≥1.97 in other models, P < 0.001) and produced significantly shorter posts (median 194 characters, P < 0.001). Perplexity Pro scored highest for policy compliance, while GPT-5 and Gemini 2.5 Pro achieved superior structural scores. 
Readability varied: GPT-5 8.9 (7.3-9.2) and Perplexity Pro 7.3 (6.5-8.8) generated more complex outputs, whereas Gemini 2.5 Pro 5.1 (4.8-6.5) and Grok-3 4.5 (3.6-6.3) produced more accessible posts. LLMs can reliably generate accurate and policy-compliant social media posts for research promotion, with differences in style and readability that may inform audience targeting. GPT-5, Gemini 2.5 Pro, and Perplexity Pro produced high-quality outputs, while Grok-3 underperformed across several domains. These findings highlight the potential of LLMs as scalable first-draft tools for post-publication promotion, capable of improving the reach and accessibility of scientific research. Careful model selection, tailored to audience and communication goals, together with human oversight, remains essential.
Background: Large language models (LLMs) are increasingly consulted for clinical guidance, yet their reliability in protocol-sensitive domains remains insufficiently characterized. This study evaluated the ability of widely accessible LLMs to reproduce guideline-defined decision thresholds in vital pulp therapy (VPT), with emphasis on guideline-concordance accuracy, professional-role prompting, short-term response stability, and decision-level error directionality. Methods: Twenty-six binary yes/no questions were derived from an internationally recognized evidence-based guideline for VPT. Four LLMs-GPT-5, GPT-4o, DeepSeek-V3, and Gemini 2.5 Flash-were queried under non-prompted and professional-role-prompted conditions by two independent operators across three daily sessions over three consecutive days. Descriptive analyses were complemented by mixed-effects logistic regression in R to account for repeated responses clustered within guideline-derived questions. Results: Overall guideline-concordance accuracy was high across models. Gemini showed the highest observed accuracy under non-prompted conditions; DeepSeek showed the highest under prompted conditions. In the mixed-effects model, Gemini demonstrated significantly higher odds of guideline-concordant responses than GPT-5 under non-prompted conditions, whereas DeepSeek outperformed GPT-5 and GPT-4o under prompted conditions. The model × prompt interaction showed a trend toward significance but did not reach the conventional threshold. Day and within-day time point were not significantly associated with accuracy, supporting short-term response stability. Error-direction analysis revealed model-specific patterns: Gemini showed consistently low false-positive rates but increased false-negative responses under prompted conditions; DeepSeek showed reduced false-positive and no false-negative responses under prompted conditions. 
Conclusions: Average accuracy alone is insufficient to characterize the reliability of LLM-generated clinical guidance. Evaluation in protocol-sensitive domains should incorporate guideline-concordance, prompt responsiveness, short-term stability, and decision-level error directionality.
This study aimed to build a visual question answering (VQA) dataset for fine-tuning and evaluating vision-language models (VLMs) in myopic maculopathy (MM). This was a cross-sectional study. Colour fundus photographs (CFPs) from two publicly available datasets were graded using the META-PM classification system. GPT-5 was used to generate clinical captions and true/false (TFQ) and open-ended (OEQ) question-answer pairs, all of which were manually verified. InternVL3-8B was fine-tuned on this dataset and evaluated against Gemini 3 Pro, Claude Sonnet 4.5, Qwen3-VL-30B-A3B-Instruct, and pre-trained InternVL3-8B. OEQ responses were evaluated by GPT-5 using a three-level scoring system (0, completely incorrect; 0.5, partially correct; 1, fully correct) and summarized as weighted accuracy. Overall accuracy was defined as the arithmetic mean of the TFQ and OEQ accuracies. MM-VQA comprises 2,591 CFPs and 19,648 question-answer pairs. The fine-tuned InternVL3-8B model achieved an overall accuracy of 0.746, surpassing Claude Sonnet 4.5 (0.596), Qwen3-VL-30B-A3B-Instruct (0.566), and pre-trained InternVL3-8B (0.428) (all P < 0.001), while showing no significant difference compared with Gemini 3 Pro (0.724, P = 0.642). For TFQ, the fine-tuned model reached an accuracy of 0.919, outperforming Gemini 3 Pro (0.881), Qwen3-VL-30B-A3B-Instruct (0.834), Claude Sonnet 4.5 (0.796), and the pre-trained model (0.696) (all P < 0.001). On OEQ, it also ranked highest (0.572), outperforming Gemini 3 Pro (0.567, P = 0.044), Claude Sonnet 4.5 (0.395, P < 0.001), Qwen3-VL-30B-A3B-Instruct (0.297, P < 0.001) and the pre-trained model (0.160, P < 0.001). This study provides a valuable VQA dataset for MM, supporting the development of disease-specialised VLMs in ophthalmology.
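The two accuracy definitions in this abstract can be sketched in a few lines. The function names are hypothetical, the per-item score list is an invented example, and "weighted accuracy" is assumed here to mean the mean of the 0 / 0.5 / 1 item scores; the TFQ and OEQ figures are the ones reported above.

```python
def oeq_weighted_accuracy(scores):
    """Mean of per-item scores, where each score is 0, 0.5, or 1 (assumed definition)."""
    if any(s not in (0, 0.5, 1) for s in scores):
        raise ValueError("scores must be 0, 0.5, or 1")
    return sum(scores) / len(scores)

def overall_accuracy(tfq_accuracy, oeq_accuracy):
    """Arithmetic mean of TFQ and OEQ accuracy, as defined in the abstract."""
    return (tfq_accuracy + oeq_accuracy) / 2

# Invented per-item grades for illustration: 2 correct, 1 partial, 1 wrong.
print(oeq_weighted_accuracy([1, 1, 0.5, 0]))  # 0.625

# Reported (TFQ, OEQ) accuracies from the abstract.
reported = {
    "Fine-tuned InternVL3-8B": (0.919, 0.572),
    "Gemini 3 Pro": (0.881, 0.567),
    "Claude Sonnet 4.5": (0.796, 0.395),
    "Qwen3-VL-30B-A3B-Instruct": (0.834, 0.297),
    "Pre-trained InternVL3-8B": (0.696, 0.160),
}
for model, (tfq, oeq) in reported.items():
    print(f"{model}: overall = {overall_accuracy(tfq, oeq):.4f}")
```

Because the overall score is an unweighted mean, the much lower OEQ accuracies pull every model's overall figure well below its TFQ accuracy.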
This study aimed to evaluate the validity and reliability of responses generated by GPT-4o, Microsoft Copilot, Google Gemini, and DeepSeek to 20 frequently asked patient questions about tooth whitening. Twenty common questions about tooth whitening were selected based on clinical experience and AI-generated suggestions. Each question was submitted three times to each chatbot through its official web interface. The responses were evaluated by two professors and four specialists in restorative dentistry using a five-point Likert scale based on a modified Global Quality Score. Validity was analyzed considering low-threshold and high-threshold criteria. Reliability was tested using Cronbach's alpha coefficient, whereas inter-rater reliability was calculated utilizing the intraclass correlation coefficient. In the low-threshold validity analysis, GPT-4o and DeepSeek yielded the highest validity rate by providing valid responses to all 20 questions. Microsoft Copilot and Google Gemini showed lower validity rates. No significant difference was found among the chatbots in low-threshold validity rates. In the high-threshold validity analysis, GPT-4o and DeepSeek showed the highest valid response rates, whereas Google Gemini and Microsoft Copilot showed lower rates. No significant difference was found among the chatbots in high-threshold validity rates. In the reliability analysis, the highest internal consistency was observed for DeepSeek, followed by Microsoft Copilot, Google Gemini, and GPT-4o. The evaluated chatbots showed different performance levels in terms of the validity and reliability of their responses to frequently asked patient questions about tooth whitening. GPT-4o and DeepSeek yielded the highest rates in the low-threshold and high-threshold validity analyses, whereas DeepSeek showed the highest internal consistency. 
This study indicated that the evaluated AI chatbots generated generally valid but variable responses to frequently asked patient questions about tooth whitening. The findings support the professionally supervised use of chatbot-generated information as supplementary patient education material in dentistry.
The present study evaluated the diagnostic performance of three general-purpose multimodal large language models (MLLMs)-Claude 4.5 Sonnet, GPT-5.2 Thinking, and Gemini 3.0 Pro-in detecting apical periodontitis on periapical radiographs. One hundred twenty periapical radiographs were included (60 apical periodontitis-positive and 60 healthy), retrieved from routine clinical records at a university dental clinic. Teeth with root canal fillings and radiographs showing metallic artifacts were excluded to minimize superimposition in the apical region. Two experienced clinicians independently assessed all images to establish the reference standard based on two-dimensional periapical radiographic interpretation without CBCT confirmation. Disagreements were resolved by consensus, and interobserver agreement was high (Cohen's κ = 0.88). The same coded images were subsequently evaluated by the three MLLMs in separate sessions using a standardized prompt, without any example or training images. Model outputs were restricted to binary "present/absent" responses. Accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated. Differences among models were analyzed using Cochran's Q test (α = 0.05), followed by Bonferroni-adjusted McNemar tests. Gemini 3.0 Pro achieved the highest overall accuracy (71.67%). This performance was significantly superior to that of Claude 4.5 Sonnet (Bonferroni-adjusted p = 0.0015). Claude 4.5 Sonnet demonstrated very high sensitivity (95.00%) but low specificity (13.33%), reflecting a strong false-positive tendency. In contrast, GPT-5.2 Thinking showed high specificity (98.33%) with markedly reduced sensitivity (20.00%). Gemini 3.0 Pro maintained high sensitivity (95.00%) while achieving moderate specificity (48.33%). Positive and negative predictive values (PPV/NPV) were 52.29%/72.73% for Claude, 92.31%/55.14% for GPT-5.2 Thinking, and 64.77%/90.62% for Gemini 3.0 Pro. 
Under zero-shot conditions, the evaluated MLLMs did not demonstrate sufficient reliability to replace clinical judgment in the detection of apical periodontitis on periapical radiographs. Although diagnostic performance varied among models, none achieved a level of diagnostic performance sufficient for independent clinical use. At present, these systems appear more suitable as clinician-supervised decision-support tools. Further research is needed to improve diagnostic reliability and validate performance across diverse clinical settings.
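The predictive values reported above can be recovered from sensitivity, specificity, and the balanced 60/60 sample. The sketch below rounds the implied cell counts to whole radiographs; those counts are derived here and are not stated in the abstract, and the helper names are hypothetical.

```python
def counts_from_rates(sensitivity, specificity, n_positive, n_negative):
    """Recover integer 2x2 counts from reported rates (derived, not source data)."""
    tp = round(sensitivity * n_positive)
    tn = round(specificity * n_negative)
    return tp, n_positive - tp, tn, n_negative - tn  # tp, fn, tn, fp

def predictive_values(tp, fn, tn, fp):
    """Positive and negative predictive value plus overall accuracy."""
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return ppv, npv, accuracy

# Reported sensitivity and specificity; 60 positive and 60 healthy radiographs.
reported = {
    "Claude 4.5 Sonnet": (0.9500, 0.1333),
    "GPT-5.2 Thinking": (0.2000, 0.9833),
    "Gemini 3.0 Pro": (0.9500, 0.4833),
}

for model, (sens, spec) in reported.items():
    tp, fn, tn, fp = counts_from_rates(sens, spec, 60, 60)
    ppv, npv, acc = predictive_values(tp, fn, tn, fp)
    print(f"{model}: PPV {ppv:.2%}, NPV {npv:.2%}, accuracy {acc:.2%}")
```

These predictive values hold only for the 50% prevalence of this sample; in a clinical population with lower prevalence, the PPV of a low-specificity model would fall further.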
Recent advancements in vision language models (VLMs) have opened new avenues for analyzing complex visual data. Models such as ChatGPT, Gemini, Llama and LLaVA have gained prominence for their ability to process both visual and textual data, excelling in tasks like natural image captioning, visual question answering (VQA), and reasoning. Similarly, the Segment Anything Model (SAM) by Meta has demonstrated remarkable segmentation capabilities. Given the importance of microscopy images in fields like biology, medicine, and materials science-where visual data is often analyzed alongside textual information from captions, reports, or literature-it is critical to evaluate the effectiveness of these models on such data. This study assesses the capabilities of ChatGPT-5, Gemini-2.5Pro, Llama-3.2V, LLaVA-1.5 and SAM-2 on classification, segmentation, counting, and VQA tasks using microscopy images. ChatGPT and Gemini excelled in comprehending microscopy images, while SAM performed well in object isolation. Although their performance falls short of domain expert accuracy, particularly when faced with complexities such as impurities, overlaps, and irrelevant artifacts, these models show clear gains compared to prior versions. These findings highlight the promise of VLMs in scientific image analysis and the need for further advancements to meet the demands of expert-level tasks.
We present the results of cognitive tests on conceptual combinations, performed using specific Large Language Models (LLMs) as test subjects. In the first test, performed with ChatGPT (GPT-5.5 Thinking) and Google Gemini Advanced (Gemini 1.5 Pro), we show that Bell's inequalities are significantly violated, which indicates the presence of a 'non-classical probability model' with probabilities that do not satisfy Kolmogorov's axioms. In the second test, also performed using ChatGPT and Gemini, we identify the presence of 'Bose-Einstein statistics', rather than the intuitively expected 'Maxwell-Boltzmann statistics', in the distribution of the words contained in large-size texts. Interestingly, these findings mirror the results previously obtained in both cognitive tests with human participants and information retrieval tests on large corpora. Taken together, they point to the 'systematic emergence of non-classical quantum-like structures in conceptual-linguistic domains', regardless of whether the cognitive agent is human or artificial. Although LLMs are classified as neural networks for historical reasons, we believe that a more essential form of knowledge organization takes place in the distributive semantic structure of vector spaces built on top of the neural network. It is this meaning-bearing structure that lends itself to a phenomenon of evolutionary convergence between human cognition and language, slowly established through biological evolution, and LLM cognition and language, emerging much more rapidly as a result of self-learning and training. We analyze various aspects and examples that contain evidence supporting the above hypothesis. We also advance a unifying framework that explains the pervasive quantum organization of meaning that we identify.
Large language models (LLMs) are increasingly used by patients seeking cardiovascular health information through digital platforms. However, their accuracy and suitability for providing guidance on heterogeneous diseases such as cardiomyopathies and heart failure remain inadequately evaluated. This study systematically benchmarked state-of-the-art LLMs on patient-oriented heart failure and cardiomyopathy queries regarding clinical appropriateness and comprehensibility. Six prominent LLM Chatbots were tested on 50 expert-curated questions covering disease understanding and lifestyle advice. A web-based evaluation platform randomized and blinded responses for assessment by twelve reviewers (cardiologists, medical students, AI auto-graders). Responses were rated on a 1-5 Likert scale across nine domains, including appropriateness, readability, and empathy. Reviewers also chose their preferred model per question. Linguistic complexity and output length varied substantially. Gemini provided the most readable responses (Flesch-Kincaid Grade 11.3 ± 1.9) but was among the most verbose (668.7 ± 116.1 words). Across 2,700 ratings, Gemini received the highest composite mean rating (4.41 ± 0.77), excelling in completeness and factual reliability, followed by Grok (4.23 ± 0.76). Confabulation avoidance scored consistently high across all models (4.49 ± 0.02), while conciseness scored lowest (3.81 ± 0.05). Consistently, evaluators selected Gemini as their preferred information source in 43.7%, followed by Grok (30.3%). Rating tendencies varied by evaluator group: Auto-graders gave the highest average scores (mean 4.58 ± 0.60), followed by students (4.10 ± 0.88), while experts were more conservative (3.79 ± 0.93). All LLMs showed good accuracy avoiding medical misinformation, though variability exists in readability and comprehensiveness. While major factual errors or hallucinations were rare in our blinded evaluation, they were not entirely absent.
Background/Objectives: Lung cancer is a leading cause of cancer-related mortality worldwide. As patients increasingly utilize large language models (LLMs) for health information, evaluating the readability and patient-centeredness of these tools is critical. This study aims to compare the performance of ChatGPT-4o mini, Microsoft Copilot, and Google Gemini in providing lung cancer information, focusing on their utility for individuals with limited health literacy. Methods: In this cross-sectional study (March 2026), 30 responses to ten standardized lung cancer-related queries were analyzed. Outputs were assessed using JAMA benchmarks and mDISCERN for quality, the SMOG index for readability, and PEMAT-P for understandability and actionability. Inter-rater reliability was analyzed using intraclass correlation coefficients (ICCs). Results: ChatGPT-4o mini demonstrated superior readability, achieving a sixth-grade level (SMOG: 6.23 ± 0.72, p < 0.001). Gemini achieved higher JAMA scores, indicating stronger academic rigour. While PEMAT-P scores were highest for ChatGPT (63.7%), all models exhibited moderate mDISCERN quality. Inter-rater reliability was excellent for JAMA (ICC = 1.000) and PEMAT-P (ICC = 0.883), though moderate for mDISCERN (ICC = 0.365), reflecting inherent interpretative subjectivity in qualitative assessment. No hallucinations were observed. Conclusions: Current LLMs exhibit a trade-off between accessibility and academic rigour: ChatGPT favours patient-friendly readability, while Gemini emphasizes structured content. The observed inter-rater variability in mDISCERN underscores the complexity of standardizing qualitative AI evaluation. These findings suggest that LLMs function best as complementary aids rather than substitutes for physician-led communication.
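For reference, the SMOG grade used above as the readability measure can be computed from two counts: the number of words with three or more syllables and the number of sentences. The following is a minimal sketch of the standard SMOG formula, assuming those counts have already been obtained; the function name and the example counts are illustrative and are not taken from the study.

```python
import math


def smog_grade(polysyllable_count: int, sentence_count: int) -> float:
    """Return the SMOG grade for a text.

    polysyllable_count: number of words with three or more syllables.
    sentence_count: number of sentences in the analyzed text.
    """
    if sentence_count <= 0:
        raise ValueError("sentence_count must be positive")
    if polysyllable_count < 0:
        raise ValueError("polysyllable_count must be non-negative")
    # The formula normalizes the polysyllable count to a 30-sentence sample.
    normalized = polysyllable_count * 30 / sentence_count
    return 1.0430 * math.sqrt(normalized) + 3.1291


# Hypothetical counts for illustration only (not data from the study).
print(round(smog_grade(polysyllable_count=9, sentence_count=30), 2))
```

Because the grade grows with the square root of the polysyllable density, a response reaches a roughly sixth-grade level only when long words are sparse, on the order of nine polysyllabic words per 30 sentences.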
Background/Objectives: Diabetes mellitus and hypertension are major chronic conditions that markedly affect patients' health and quality of life worldwide. With the rapid development of technology, there has been a growing interest in exploring the potential role of artificial intelligence (AI) in the management of such diseases. This study aims to assess the accuracy and reliability of AI tools in providing information for diabetes mellitus and hypertension management. Methods: This study assessed the accuracy and reliability of the information provided by major AI tools such as ChatGPT, Gemini, Poe, Claude, Consensus, and Perplexity. Twenty questions essential to the management of diabetes mellitus and hypertension were constructed based on the chapters of the respective guidelines and were fed to the AI tools. The outcomes were compared with evidence-based treatment guidelines, such as those from the American Diabetes Association (ADA), the American Heart Association (AHA), the European Society of Cardiology (ESC), and the National Institute for Health and Care Excellence (NICE). Answers were classified as "accurate", "inaccurate", or "accurate with missing information". Three rounds of evaluation were conducted at six-week intervals to assess accuracy and reliability and, by comparing answers across the rounds, to evaluate data updates. Results: In round 1, ChatGPT and Poe showed the highest accuracy, both at 65% (95% CI: 41.0-83.7), followed by Claude at 60% (95% CI: 41.0-83.7). ChatGPT had the lowest inaccuracy rate at 5% (95% CI: 1.75-33.1), while Claude demonstrated the smallest percentage of responses with missing information at only 6% (95% CI: 12.8-54.3). In round 2, Claude markedly outperformed all other tools, achieving an accuracy rate of 95% (95% CI: 73.0-99.7) and no responses with missing information (0%).
In round 3, ChatGPT came second with 70% (95% CI: 45.7-87.2) accuracy and maintained the lowest inaccuracy rate of 5% (95% CI: 0.26-26.9). Consensus had the largest inaccuracy rate at 40% (95% CI: 20.0-63.6) and the lowest accuracy rate at 40% (95% CI: 20.0-63.6). Overall, pairwise comparisons showed that Claude in the second round had significantly higher accuracy than Poe (p = 0.0154), Gemini (p = 0.0421), Consensus (p = 0.0035), and Perplexity (p = 0.0302). In the assessment of performance shift from round 1 to round 2, Claude achieved the greatest improvement in accuracy at 40%. In the assessment of performance shift from round 2 to round 3, Poe improved the most with an accuracy increase of 25%, while ChatGPT followed with 20%. McNemar's test comparing unprompted and guideline-prompted questions across all AI tools did not reveal a statistically significant difference in the proportion of accurate responses (p > 0.05). Conclusions: Throughout the three rounds, ChatGPT maintained the best performance, with the fewest missing data. Claude and Poe followed, showing high accuracy with relatively low inaccuracy rates. On the other hand, Perplexity and Gemini performed moderately, while Consensus had the lowest accuracy.
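With only twenty questions per tool, each answer shifts an accuracy rate by five percentage points, which is why the reported confidence intervals are so wide. The abstract does not state which interval method was used, so the following is only a sketch that assumes a Wilson score interval; the function name is illustrative, and its output need not match the reported bounds exactly.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (default z gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    # Center of the interval is pulled toward 0.5 for small samples.
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# 13 accurate answers out of 20 questions corresponds to 65% accuracy.
low, high = wilson_interval(13, 20)
print(f"65% accuracy, n=20: {low:.1%} to {high:.1%}")
```

Under this assumption the interval for 13 of 20 spans roughly forty percentage points, so differences of one or two answers between tools should be read cautiously.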
Introduction Patients' utilization of the internet as a resource for obtaining medical information continues to expand, with increased prevalence and access to educational materials. One method of obtaining medical information online is artificial intelligence (AI)-generated patient education materials (PEMs). As such, the medical community has a fundamental obligation to assess the accuracy, quality, and readability of AI-generated PEMs as patient resources, a critical step in promoting health literacy, combating misinformation, and, ultimately, empowering patients. Given that the perceived severity of patellar tendon ruptures (PTR) can vary, providing clear information is important to support informed decision-making. This study aimed to evaluate and compare the readability and quality of AI-generated responses to patient questions about patellar tendon repair, using four different AI chatbots: ChatGPT 3.5, ChatGPT 4, Gemini 1.0, and Perplexity. Methods There were no significant differences in readability among the four different chatbots, and they all provided responses that were better than the average American reading level. The mean DISCERN scores were as follows: Perplexity (64.2±9.2), ChatGPT 3.5 (49±7.97), Gemini 1.0 (59.2±7.43), and ChatGPT 4 (52±6.28). Even though Perplexity demonstrated the highest mean DISCERN scores among the evaluated AI models, no statistically significant differences in readability were observed among the four chatbots, although results approached significance (p = 0.075). Question 15 of the DISCERN criteria, regarding shared decision-making, was consistently rated at a high level across each AI tool, with an average rating of 4.2 out of 5. Results There were no significant differences in readability among the four different chatbots, and they all provided responses that averaged above the average American reading level.
The mean DISCERN scores were as follows: Perplexity (64.2±9.2), ChatGPT 3.5 (49±7.97), Gemini 1.0 (59.2±7.43), and ChatGPT 4 (52±6.28). Perplexity's score was significantly higher than that of ChatGPT 3.5, indicating that Perplexity's responses were more accurate and reliable than those of ChatGPT 3.5. Question 15 of the DISCERN criteria, regarding shared decision-making, was consistently rated at a high level across each AI tool, with an average rating of 4.2 out of 5. Conclusion This study found that readability remains consistent across various AI tools, while the quality of the information may vary. Perplexity outperformed ChatGPT 3.5 in providing accurate information on patellar tendon ruptures. AI tools demonstrated variability in informational quality scores, although these differences were not statistically significant, highlighting the importance of carefully evaluating AI-generated content before using it as a patient education resource.
Objectives: Multimodal large language models (MLLMs) have shown potential for medical image classification. We evaluated four optimization strategies in two MLLMs, GPT-4o (gpt-4o-2024-08-06) and Gemini 2.5 Flash-Lite, for ultrasound-based thyroid nodule malignancy classification using two public datasets and a clinical cohort of nodules with atypia of undetermined significance (AUS) cytology. Methods: Text prompting, few-shot learning, fine-tuning, and a hybrid strategy combining fine-tuning with few-shot learning were evaluated for each model. Performance was assessed using the Digital Database of Thyroid Images (DDTI; n = 80), a 1000-image test subset of TN5000, and an institutional AUS cohort with surgical pathology (n = 84). In the AUS cohort, the best-performing strategy was compared with the consensus classification of three endocrinologists and the American Thyroid Association (ATA) ultrasound risk stratification. Results: For GPT-4o, the hybrid strategy achieved the highest area under the receiver operating characteristic curve (AUC) in DDTI (0.866), TN5000 (0.689), and the AUS cohort (0.836). In the AUS cohort, its specificity was higher than that of endocrinologist consensus and ATA risk stratification when only high-suspicion nodules were classified as malignant (95.1% vs. 70.7% and 70.7%; p = 0.002 and p = 0.001, respectively), while sensitivity did not differ significantly (72.1% vs. 74.4% and 79.1%, respectively; both p > 0.05). However, the hybrid model misclassified 12 of 43 malignant nodules, corresponding to a false-negative rate of 27.9%. When high- and intermediate-suspicion ATA categories were classified as malignant, ATA sensitivity increased to 83.7% and specificity decreased to 56.1%; the hybrid model had a higher AUC than ATA risk stratification (0.836 vs. 0.749; p = 0.017). For Gemini 2.5 Flash-Lite, few-shot learning, fine-tuning, and the hybrid strategy did not improve AUC relative to text prompting in any dataset.
Conclusions: The hybrid strategy produced the most consistent performance gains for GPT-4o across the three datasets but did not improve Gemini 2.5 Flash-Lite. The optimized GPT-4o model achieved high specificity in the diagnostically challenging AUS cohort, although its false-negative rate limits its use as a stand-alone diagnostic tool. Further validation in larger, prospective multicenter cohorts is required before clinical use.
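The sensitivity, specificity, and false-negative rate quoted for the AUS cohort all derive from a single 2x2 confusion table. The sketch below shows that arithmetic. The false-negative and true-positive counts follow from the reported 12 of 43 malignant nodules; the benign counts (39 true negatives, 2 false positives) are an inference from n = 84 and the reported 95.1% specificity, not figures stated in the abstract, and the class name is illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # malignant nodules classified as malignant
    fn: int  # malignant nodules classified as benign
    tn: int  # benign nodules classified as benign
    fp: int  # benign nodules classified as malignant

    @property
    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    @property
    def false_negative_rate(self) -> float:
        # Complement of sensitivity: share of malignant nodules that were missed.
        return self.fn / (self.tp + self.fn)


# fn=12 and tp=31 come from "12 of 43 malignant nodules".
# tn=39 and fp=2 are ASSUMED: 84 - 43 = 41 benign nodules, and 39/41 = 95.1%.
hybrid = ConfusionCounts(tp=31, fn=12, tn=39, fp=2)

print(f"sensitivity:         {hybrid.sensitivity:.1%}")
print(f"specificity:         {hybrid.specificity:.1%}")
print(f"false-negative rate: {hybrid.false_negative_rate:.1%}")
```

Seen this way, the high specificity and the 27.9% false-negative rate are two sides of the same decision threshold: the model rarely flags benign nodules, but it also misses more than a quarter of the malignant ones.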