Urban environments are shaped by intricate interactions among water, soil, air, and infrastructure, where traditional models often fail to capture nonlinear, non-Euclidean dynamics. Spatiotemporal graph learning (STGL) has emerged as a powerful framework to represent such complexity, enabling accurate forecasting and real-time decision support from urban districts to national and even global scales. This review provides the first comprehensive synthesis of STGL tailored to urban environments. We summarize advances in graph construction, spatial and temporal modeling, and fusion strategies, and examine applications across urban water systems, soil and agriculture, air quality, and urban risk. Landmark case studies, including Microsoft's Aurora, NVIDIA's Earth-2, and Google's GraphCast/GenCast, demonstrate STGL's potential as a foundation model for environmental intelligence. We conclude by identifying key limitations and outlining future directions, emphasizing federated learning, machine unlearning, and meta-learning to enhance next-generation STGL frameworks that ultimately support resilient and adaptive urban environments.
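To make the "spatial modeling followed by temporal modeling" pattern concrete, here is a minimal, generic STGL block in PyTorch, a sketch of the general idea rather than any specific model from the reviewed literature: a shared graph convolution is applied at each time step, then a GRU runs over each node's time series. All shapes and data are illustrative.

```python
import torch
import torch.nn as nn

class STGLBlock(nn.Module):
    """Minimal spatial-then-temporal block: graph convolution per time step, GRU over time."""
    def __init__(self, in_dim: int, hidden_dim: int):
        super().__init__()
        self.spatial = nn.Linear(in_dim, hidden_dim)              # node-wise feature projection
        self.temporal = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, nodes, features); adj: (nodes, nodes) row-normalized adjacency
        h = torch.relu(adj @ self.spatial(x))                     # aggregate neighbor features
        b, t, n, d = h.shape
        h = h.permute(0, 2, 1, 3).reshape(b * n, t, d)            # one temporal sequence per node
        out, _ = self.temporal(h)
        return out.reshape(b, n, t, d).permute(0, 2, 1, 3)        # (batch, time, nodes, hidden)

# Example: 12 hourly readings for 50 sensors with 8 features each (synthetic data)
x = torch.randn(4, 12, 50, 8)
adj = torch.eye(50)                                               # placeholder adjacency
print(STGLBlock(8, 16)(x, adj).shape)                             # torch.Size([4, 12, 50, 16])
```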
General-domain large language models (LLMs) have emerged as valuable tools in healthcare; however, their ability to understand and perform tasks based on data stored in tabular form has not been explored in ophthalmology. We aimed to assess the performance of OpenAI's Generative Pre-trained Transformer 4o (GPT-4o) on real emergency department (ED) eye-related encounters extracted from electronic medical records (EMR) in tabular format. We input an Excel spreadsheet containing data on 1,419 unique eye-related ED encounters into GPT-4o via Microsoft's Azure OpenAI Service using chain-of-thought (CoT) prompting, with each encounter presented as (1) chief complaint (CC), history of present illness (HPI), and eye examination; (2) CC and eye examination; or (3) eye examination only, and evaluated the diagnosis and assessment performance of the LLM on the presented data. GPT-4o answers were reviewed by board-certified ophthalmologists and classified as (1) GPT-4o provided a correct diagnosis and assessment; (2) GPT-4o provided an incorrect diagnosis and assessment; (3) GPT-4o was unable to provide a correct diagnosis because the encounter documentation was incorrect; or (4) GPT-4o was unable to provide a correct diagnosis because ancillary tests were required. A sample of encounters was reviewed by a second board-certified ophthalmologist to assess inter-grader agreement. Average accuracy rates were used to evaluate performance and compare statistical significance across scenarios. A second round of CoT prompting was performed after providing the LLM with the final encounter diagnosis to evaluate disagreements and inconsistencies between the presented documentation and the reported diagnosis. GPT-4o (CoT) overall accuracy was 0.76 (95% confidence interval [CI], 0.74-0.79); no significant difference in accuracy was found when GPT-4o was presented with CC, HPI, and eye findings vs. CC and eye findings vs. eye findings only (P = 0.675). The inter-grader agreement kappa was 0.841 (P < 0.001). GPT-4o identified that 6.6% of all encounters did not have EMR documentation that supported the final encounter diagnosis. When encounters with incorrect EMR documentation and encounters requiring ancillary tests (5.2%) were excluded, GPT-4o accuracy was 0.87 (95% CI, 0.85-0.89). GPT-4o could accurately synthesize tabular data and provide assessments and diagnoses in real-world ophthalmology encounters, in addition to identifying encounters with documentation that did not support the final ED encounter diagnosis. This capability has the potential to support the clinician's diagnosis. What is known General-domain large language models (LLMs) have emerged as valuable tools in healthcare. In ophthalmology, prior LLM studies have focused on text-based inquiries of a limited number of sample cases. What is new OpenAI's Generative Pre-trained Transformer 4o (GPT-4o) could accurately synthesize and provide diagnoses for real-world ophthalmology scenarios presented as tabular data. Additionally, GPT-4o was able to detect incorrect documentation and flag inconsistencies, contradictions, or incomplete information, which can help ensure that the EMR documentation supports the clinician's diagnosis.
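A minimal sketch of the kind of chain-of-thought call described above, using the Azure OpenAI Python SDK. The endpoint, deployment name, prompt wording, and encounter text are illustrative placeholders, not the study's actual configuration.

```python
from openai import AzureOpenAI  # openai>=1.0 Python SDK

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder endpoint
    api_key="<api-key>",
    api_version="2024-02-01",
)

# Illustrative encounter row; the study used real tabular EMR fields (CC, HPI, exam).
encounter = "CC: eye pain. HPI: 2 days of photophobia. Exam: ciliary injection, 2+ cell and flare."

response = client.chat.completions.create(
    model="<gpt-4o-deployment-name>",  # Azure deployment name, not the base model ID
    temperature=0,
    messages=[
        {"role": "system", "content": "You are an ophthalmology assistant."},
        {"role": "user", "content": (
            "Think step by step about the findings below, then state the single most likely "
            "diagnosis and a brief assessment.\n\n" + encounter)},
    ],
)
print(response.choices[0].message.content)
```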
Here, we summarize the work that Microsoft's philanthropic Artificial Intelligence (AI) for Good Lab has completed in the realm of promoting public and population health. In particular, after providing examples of how the AI for Good Lab has articulated the value of using AI to improve public and population health, we provide examples and references of work demonstrating how the Lab has: applied AI to improve maternal, fetal, and infant health; leveraged large language models to improve population health; and applied AI to improve rural health and healthcare. We also summarize what we have learned through our work, finding that getting the question right and ensuring that the limitations of any analysis are understood is important; that collaboration across public, private, and educational institutions with subject matter experts will be the most effective and efficient way to harness this new technology; and that focusing on metrics that reflect health, and not just the accuracy of the model, is the most impactful way to improve the health of populations worldwide.
The use of large language models (LLMs) in generative artificial intelligence (AI) is rapidly increasing in dentistry. However, their reliability has yet to be fully established. This study aims to evaluate the diagnostic accuracy, clinical applicability, and patient education potential of LLMs in paediatric dentistry by evaluating the responses of six LLMs: Google AI's Gemini and Gemini Advanced, OpenAI's ChatGPT-3.5, -4o, and -4, and Microsoft's Copilot. Ten open-type clinical questions relevant to paediatric dentistry were posed to the LLMs. The responses were graded by two independent evaluators from 0 to 10 using a detailed rubric. After 4 weeks, answers were reevaluated to assess intra-evaluator reliability. Statistical comparisons used Friedman's, Wilcoxon's, and Kruskal-Wallis tests to identify the model that provided the most comprehensive, accurate, explicit, and relevant answers. Variation across models was observed. ChatGPT-4 answers were scored the best (average score 8.08), followed by the answers of Gemini Advanced (8.06), ChatGPT-4o (8.01), ChatGPT-3.5 (7.61), Gemini (7.32), and Copilot (5.41). Statistical analysis revealed that ChatGPT-4 outperformed all other LLMs, and the difference was statistically significant. Despite variations and different responses to the same queries, remarkable similarities were observed. Except for Copilot, all chatbots achieved a score above 6.5 on all queries. This study demonstrates the potential of LLMs to support evidence-based paediatric dentistry. Nevertheless, they cannot be regarded as completely trustworthy. Dental professionals should use AI models critically as supportive tools and not as a substitute for overall scientific knowledge and critical thinking.
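A brief sketch of how the non-parametric comparisons named above could be run in Python with SciPy; the study does not specify its statistical software, and the variable names and example scores are placeholders.

```python
from scipy.stats import friedmanchisquare, wilcoxon, kruskal

# Placeholder rubric scores (0-10) per question for each model, same question order.
scores = {
    "chatgpt4":   [9, 8, 8, 7, 9, 8, 8, 9, 7, 8],
    "gemini_adv": [8, 8, 9, 7, 8, 8, 8, 8, 8, 8],
    "copilot":    [6, 5, 6, 4, 5, 6, 5, 6, 5, 6],
}

# Omnibus test across related samples (same questions rated for every model)
stat, p = friedmanchisquare(*scores.values())

# Pairwise follow-up between two models on the same questions
w_stat, w_p = wilcoxon(scores["chatgpt4"], scores["gemini_adv"])

# Kruskal-Wallis treats the groups as independent samples
k_stat, k_p = kruskal(*scores.values())
print(p, w_p, k_p)
```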
Spatial Transcriptomics is a groundbreaking technology that integrates histology images with spatially resolved gene expression profiles. Among the various Spatial Transcriptomics techniques available, Visium has emerged as the most widely adopted. However, its accessibility is limited by high costs, the need for specialized expertise, and slow clinical integration. Additionally, gene capture inefficiencies lead to significant dropout, corrupting acquired data. To address these challenges, the deep learning community has explored the gene expression prediction task directly from histology images. Yet, inconsistencies in datasets, preprocessing, and training protocols hinder fair comparisons between models. To bridge this gap, we introduce SpaRED, a systematically curated database comprising 26 public datasets, providing a standardized resource for model evaluation. We further propose SpaCKLE, a state-of-the-art transformer-based gene expression completion model that reduces mean squared error by over 82.5% compared to existing approaches. Finally, we establish the SpaRED benchmark, evaluating eight state-of-the-art prediction models on both raw and SpaCKLE-completed data, demonstrating that SpaCKLE substantially improves the results across all the gene expression prediction models. Altogether, our contributions constitute the most comprehensive benchmark of gene expression prediction from histology images to date and a stepping stone for future research on Spatial Transcriptomics.
The purpose of this article is to explore what cognitive research can reveal about the way in which the neural system processes information. To that end, a comprehensive review of cognitive/behavioral and neuroscience models and findings is presented along with ideas as to how the human neural system has evolved. The representation of information in short-term memory (STM) is ascribed to stable oscillatory patterns across hierarchically structured functional networks of neocortical areas. These oscillatory patterns are primarily shaped by information in long-term memory (LTM) that is stored in the synaptic connections between neurons and, consequently, between neural areas. It is argued for the first time that the non-sensory and non-motor information processing stages revealed by behavioral research involve the change of potentially brain-wide oscillatory patterns that follow the reconfiguring of temporary neural networks. These network configurations can be governed by hub areas in the perceptual cortices (serving stimulus identification), the hippocampus (declarative memory), and the basal ganglia and prefrontal cortex (motor behavior, STM, and information processing). These ideas are integrated into a tentative neural Three-Level Systems (TLS) architecture comprising evolutionarily older perceptual and motor systems that are linked by a flexible central processing system located in the evolutionarily more recent association cortex.
Introduction Artificial intelligence (AI) is becoming more integrated in different research assignments, and this ongoing development opens opportunities to optimize resources, e.g., using AI in resource-intensive and time-consuming tasks like qualitative analysis of interview data. We aimed to test whether Microsoft's Copilot could perform a content analysis on interview data using Graneheim and Lundman's method comparable to human analysis. Methodology We used a company-protected version of Microsoft's AI-powered assistant Copilot, which is based on large language models. The company-protected Copilot version ensured data security. A manual analysis of six interviews was conducted before this study using Graneheim and Lundman's method of content analysis. We conducted four analyses using Copilot and compared the results with those obtained through manual analysis. Copilot was prompted to use Graneheim and Lundman's method, and we also tried providing it with an objective and a context. Results When prompted to use Graneheim and Lundman's method, Copilot was able to perform content analyses with high resemblance to the manual one, especially in selecting meaningful units and coding them, which falls within the descriptive analysis. It could also create subthemes and overarching themes resembling the manual ones; however, the interpretive analysis lacked nuance compared to the manual one. Copilot produced more accurate manifest content when only given Graneheim and Lundman's method. When given the objective, the analysis was shorter with fewer meaningful units. When given the context of the interviews, Copilot over-interpreted, and the analysis was mainly descriptive. Conclusions Copilot was able to perform a content analysis very similar to the manual one regarding the descriptive analyses of the manifest content using Graneheim and Lundman's method. However, its interpretation of latent content lacked nuance, a limitation Copilot itself acknowledged. Copilot performed best when guided by the methodological framework alone, rather than the study's objective or context. While content analysis remains a co-creative process requiring manual input, especially during interpretation, Copilot shows promising potential in supporting the early stages of analysis focused on manifest content.
Multiple choice questions (MCQs) are an important and integral component of ophthalmology residency training evaluation and board certification; however, high-quality questions are difficult and time-consuming to draft. To evaluate whether general-domain large language models (LLMs), particularly OpenAI's Generative Pre-trained Transformer 4 (GPT-4), can reliably generate high-quality, novel, and readable MCQs comparable to those of a committee of experienced examination writers. This survey study, conducted from September 2024 to April 2025, assesses LLM performance in generating MCQs based on the American Academy of Ophthalmology (AAO) Basic and Clinical Science Course (BCSC) compared with a committee of human experts. Ten expert ophthalmologists, who were masked to the generation source, independently evaluated MCQs using a 10-point Likert scale (1 = extremely poor; 10 = criterion standard quality) across 5 criteria: appropriateness, clarity and specificity, relevance, discriminative power, and suitability for trainees. Relevant BCSC content and AAO question-writing guidelines were input into GPT-4o via Microsoft's Azure OpenAI Service, and structured prompts were used to generate MCQs. The primary outcomes were median scores with statistical comparisons using the bootstrapping method; string similarity scores based on Levenshtein distance (0-100, with 100 indicating identical content) between LLM-generated MCQs and the entire BCSC question bank; the Flesch Reading Ease metric for readability; and the intraclass correlation coefficient (ICC) for inter-rater agreement. The 10 graders had between 1 and 28 years of clinical experience in ophthalmology (median [IQR] experience, 6 years [3-15 years]). Questions generated by GPT-4 and a committee of experts received median scores of 9 and 9 in combined scores, appropriateness, clarity and specificity, and relevance (difference, 0; 95% CI, 0-0; P > .99); 8 and 9 in discriminative power (difference, 1; 95% CI, -1 to 1; P = .52); and 8 and 8 in suitability for trainees (difference, 0; 95% CI, -1 to 0; P > .99), respectively. Nearly 95% of LLM-generated MCQs had similarity scores less than 60, indicating most had limited or no resemblance to existing content. Inter-rater reliability was moderate (ICC, 0.63; P < .001), and mean (SD) readability scores were similar across sources (37.14 [22.54] vs 42.60 [22.84]; P > .99). In this survey study, results indicate that an LLM could be used to develop ophthalmology board-style MCQs and expand examination banks to further support ophthalmology residency training. Despite most questions having low similarity scores, the quality, novelty, and readability of LLM-generated questions need to be further assessed.
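One plausible way to compute the Levenshtein-based 0-100 similarity score described above; normalizing by the longer string's length is an assumption, since the abstract only states the scale's endpoints.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity_0_100(a: str, b: str) -> float:
    """100 = identical, 0 = completely different (one plausible normalization)."""
    if not a and not b:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / max(len(a), len(b)))

print(similarity_0_100("Which finding suggests acute angle closure?",
                       "Which finding suggests angle-closure glaucoma?"))
```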
This study evaluates the accuracy of single camera markerless motion capture (SCMoCap) using Microsoft's Azure Kinect, enhanced with inverse kinematics (IK) via OpenSim, for upper limb movement analysis. Twelve healthy adults performed ten upper-limb tasks, recorded simultaneously by OptiTrack (marker-based) and Azure Kinect (markerless) from frontal and sagittal views. Joint angles were calculated using two methods: (1) direct kinematics based on body coordinate frames and (2) inverse kinematics using OpenSim's IK tool with anatomical keypoints. Accuracy was evaluated using root mean square error (RMSE) and Bland-Altman analysis. Results indicated that the IK method slightly improved joint angle agreement with OptiTrack for simpler movements, with an average RMSE of 8° for shoulder elevation in the sagittal plane compared to 9° with the coordinate frame method. However, both methods had higher RMSEs for rotational measurements, with IK and coordinate frame methods at 21° for shoulder rotation in the sagittal plane. Forearm pronation-supination measurements were unreliable due to tracking limitations. These findings suggest that Kinect with IK improves accuracy for simpler movements but struggles with rotational joint mechanics. Future research should focus on enhancing markerless tracking algorithms to fully realise the benefits of IK.
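A small sketch of the two agreement measures used above, RMSE and Bland-Altman limits of agreement, for paired joint-angle series from the two systems; the example angle values are placeholders.

```python
import numpy as np

def rmse(reference, markerless):
    """Root mean square error between two paired joint-angle series (degrees)."""
    reference, markerless = np.asarray(reference, float), np.asarray(markerless, float)
    return float(np.sqrt(np.mean((markerless - reference) ** 2)))

def bland_altman(reference, markerless):
    """Bias (mean difference) and 95% limits of agreement."""
    d = np.asarray(markerless, float) - np.asarray(reference, float)
    bias = d.mean()
    half_width = 1.96 * d.std(ddof=1)
    return bias, bias - half_width, bias + half_width

# Placeholder shoulder-elevation angles (degrees) from OptiTrack vs Azure Kinect
opti = [30.1, 45.2, 60.4, 75.8, 90.0]
kinect = [28.5, 44.0, 63.1, 74.2, 95.3]
print(rmse(opti, kinect), bland_altman(opti, kinect))
```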
The expansion of the Wildland-Urban Interface (WUI) demands precise mapping to effectively mitigate wildfire risk. However, the absence of national building footprint databases presents a significant challenge. This study, focused on mainland Portugal, proposes a semi-automated, multi-criteria filtering framework to refine global open-source building datasets, specifically Microsoft's Global Building Footprints. The method integrates regional adaptability and spatial metrics such as area thresholds and proximity analyses, using Portugal's official Geographic Buildings Location Database as a reference. The framework prioritizes residential structures by excluding anomalies (such as industrial facilities, photovoltaic arrays, and transmission lines) through dynamically adjusted thresholds at various administrative levels (e.g., municipal and NUTS-2). The filtering process reduced the number of building footprints from approximately 5.6 million to around 3.0 million. We mapped the WUI across Portugal using both the original dataset (WUI_MSB) and the filtered dataset (WUI_MSB_F) to compare outcomes. The WUI was classified into Intermix and Interface types. Buildings that did not meet the minimum criteria to be considered part of the WUI were categorized based on their density: very low, low, medium, or high. The original WUI_MSB covered a total area of 13,177 km², representing approximately 15% of mainland Portugal. After applying the filtering framework, the WUI_MSB_F area was reduced by 49%, totaling 8,327 km². The workflow, implemented using Python scripting and ArcGIS Pro, is scalable for national-level applications. These experimental results highlight the importance of region-specific adjustments and demonstrate how this methodology can support policymakers in identifying and prioritizing context-specific exposed communities. By enhancing the reliability of open datasets, this approach offers a reproducible tool for wildfire resilience planning, particularly in data-scarce regions.
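A minimal sketch of the kind of area-threshold and proximity filtering described above, assuming a geopandas workflow rather than the ArcGIS Pro toolchain used in the study; the file names, the CRS (EPSG:3763 for mainland Portugal), and the area/distance thresholds are illustrative placeholders, not the paper's calibrated values.

```python
import geopandas as gpd

# Placeholder file names; reproject both layers to a metric CRS for area/distance calculations.
buildings = gpd.read_file("ms_global_building_footprints_pt.gpkg").to_crs(epsg=3763)
reference = gpd.read_file("official_buildings_reference.gpkg").to_crs(epsg=3763)

# Area filter: drop very small sheds and very large industrial-scale footprints (example thresholds).
buildings["area_m2"] = buildings.geometry.area
candidates = buildings[buildings["area_m2"].between(30, 1000)]

# Proximity filter: keep footprints within 50 m of an official reference building (example threshold).
near = gpd.sjoin_nearest(candidates, reference[["geometry"]],
                         how="inner", max_distance=50, distance_col="dist_m")
filtered = near.drop(columns=["index_right", "dist_m"])
filtered.to_file("filtered_footprints.gpkg", driver="GPKG")
```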
Large language models (LLMs) have acquired a remarkable level of knowledge through their initial training. However, they lack expertise in particular domains such as genomics. Variant annotation data, an important component of genomics, is crucial for interpreting and prioritizing disease-related variants among the millions of variants identified by genetic sequencing. In our project, we aimed to improve LLM performance in genomics by adding variant annotation data to LLMs through retrieval-augmented generation (RAG) and fine-tuning techniques. Using RAG, we successfully integrated 190 million highly accurate variant annotations, curated from five major annotation datasets and tools, into GPT-4o. This integration empowers users to query specific variants and receive accurate variant annotations and interpretations supported by the advanced reasoning and language understanding capabilities of LLMs. Additionally, fine-tuning GPT-4 on variant annotation data also improved model performance in some annotation fields, although the accuracy across more fields remains suboptimal. Our model significantly improved the accessibility and efficiency of the variant interpretation process by leveraging LLM capabilities. Our project also revealed that RAG outperforms fine-tuning for factual knowledge injection in terms of data volume, accuracy, and cost-effectiveness. As a pioneering study of adding genomics knowledge to LLMs, our work paves the way for developing more comprehensive and informative genomics AI systems to support clinical diagnosis and research projects, and it demonstrates the potential of LLMs in specialized domains. We used publicly available datasets as detailed in the paper, which can be provided upon request.
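A toy illustration of the retrieval-augmented flow described above: look up an annotation for the queried variant and prepend it to the prompt sent to the model. The store, variant ID, and annotation text are fabricated placeholders, not records from the study's 190-million-annotation corpus.

```python
# Toy in-memory annotation store; the study's RAG setup indexes ~190M curated records.
ANNOTATIONS = {
    "chr1:g.12345A>G": "example gene; example consequence; example population frequency (placeholder)",
}

def retrieve(variant_id: str) -> str:
    """Fetch the annotation for a variant, or a fallback string if none is indexed."""
    return ANNOTATIONS.get(variant_id, "No annotation found for this variant.")

def build_prompt(question: str, variant_id: str) -> str:
    """Compose a retrieval-augmented prompt: retrieved context first, then the user question."""
    context = retrieve(variant_id)
    return (
        "You are a genomics assistant. Answer using ONLY the annotation below.\n"
        f"Annotation for {variant_id}: {context}\n\n"
        f"Question: {question}\n"
    )

prompt = build_prompt("How should this variant be interpreted?", "chr1:g.12345A>G")
# `prompt` would then be sent to the LLM (GPT-4o in the study) for the final answer.
```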
The increasing utilization of large language models (LLMs) in generative artificial intelligence across various medical and dental fields, and specifically orthodontics, raises questions about their accuracy. This study aimed to assess and compare the answers offered by four LLMs: Google's Bard, OpenAI's ChatGPT-3.5 and ChatGPT-4, and Microsoft's Bing, in response to clinically relevant questions within the field of orthodontics. Ten open-type clinical orthodontics-related questions were posed to the LLMs. The responses provided by the LLMs were assessed on a scale ranging from 0 (minimum) to 10 (maximum) points, benchmarked against robust scientific evidence, including consensus statements and systematic reviews, using a predefined rubric. After a 4-week interval from the initial evaluation, the answers were reevaluated to gauge intra-evaluator reliability. Statistical comparisons were conducted on the scores using Friedman's and Wilcoxon's tests to identify the model providing the answers with the most comprehensiveness, scientific accuracy, clarity, and relevance. Overall, no statistically significant differences were detected between the scores given by the two evaluators on both scoring occasions, so an average score for every LLM was computed. The highest-scoring answers were those of Microsoft Bing Chat (average score = 7.1), followed by ChatGPT-4 (average score = 4.7), Google Bard (average score = 4.6), and finally ChatGPT-3.5 (average score = 3.8). While Microsoft Bing Chat statistically outperformed ChatGPT-3.5 (P = 0.017) and Google Bard (P = 0.029), and ChatGPT-4 outperformed ChatGPT-3.5 (P = 0.011), all models occasionally produced answers lacking comprehensiveness, scientific accuracy, clarity, and relevance. The questions asked were indicative and did not cover the entire field of orthodontics. LLMs show great potential in supporting evidence-based orthodontics. However, their current limitations pose a potential risk of incorrect healthcare decisions if they are used without careful consideration. Consequently, these tools cannot serve as a substitute for the orthodontist's essential critical thinking and comprehensive subject knowledge. For effective integration into practice, further research, clinical validation, and enhancements to the models are essential. Clinicians must be mindful of the limitations of LLMs, as their imprudent utilization could have adverse effects on patient care.
Artificial intelligence (AI) is quickly transforming healthcare by improving patient and clinician access to and understanding of medical information. Generative AI models answer healthcare queries and provide tailored, quick responses. This research evaluates the readability and quality of bladder cancer (BC) patient information in 10 popular AI-enabled chatbots. We used the latest versions of ten popular chatbots: OpenAI's GPT-4o, Microsoft's Copilot Pro, Claude-3.5 Haiku, Sonar Large, Grok 2, Gemini Advanced 1.5 Pro, Mistral Large, Google Palm 2 (Google Bard), Meta's Llama 3.3, and Meta AI v2. Prompts were developed to elicit texts about BC, non-muscle-invasive BC, muscle-invasive BC, and metastatic BC. The modified Ensuring Quality Information for Patients (mEQIP) tool, the Quality Evaluating Scoring Tool (QUEST), and DISCERN were used to assess quality. The Average Reading Level Consensus (ARLC), Flesch Reading Ease (FKRE), and Flesch-Kincaid Grade Level (FKGL) were used to evaluate readability. The ten chatbots exhibited statistically significant differences in mean mEQIP, DISCERN, and QUEST scores (p = 0.048, p = 0.025, and p = 0.021, respectively). Meta AI scored lowest on average mEQIP, DISCERN, and QUEST, while Llama attained the highest. Statistically significant differences were also seen in the chatbots' average ARLC, FKGL, and FKRE scores (p = 0.002, p = 0.001, and p = 0.002, respectively): Google Palm produced the easiest-to-read texts, while Llama's were the most difficult to understand. AI chatbots can produce information on BC that is of moderate quality and readability, but there is significant variability among platforms. Results should be evaluated with caution due to the single-query approach and the continuously advancing AI models. Clinicians can support safe implementation by delivering structured feedback and incorporating content review stages into patient education processes. Continuous collaboration between healthcare practitioners and AI developers is crucial to maintain the accuracy, currency, and clarity of AI-generated content.
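For reference, the two Flesch readability formulas used in studies like this one can be computed directly. The syllable counter below is a rough vowel-group heuristic (an assumption); published readability tools use more careful counting, so exact scores will differ.

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels (minimum of one per word).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    w, s = len(words), sentences
    fre  = 206.835 - 1.015 * (w / s) - 84.6 * (syllables / w)   # Flesch Reading Ease
    fkgl = 0.39 * (w / s) + 11.8 * (syllables / w) - 15.59      # Flesch-Kincaid Grade Level
    return fre, fkgl

print(readability("Bladder cancer starts in the cells lining the bladder. Early detection improves outcomes."))
```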
Disputes over elusive Majorana particles, the hoped-for key to robust quantum chips, continue to divide the field.
Internal emails from 2021 reveal tension among researchers hunting for the elusive Majorana particle.
Telerehabilitation is a cost-effective alternative to in-clinic rehabilitation. Although convenient, it lacks immersive and free-viewpoint patient visualization. Current research explores two solutions to this issue. Mesh-based methods use 3D models and motion capture for AR visualization. However, they are labor-intensive and less photorealistic than 2D images. Microsoft's Holoportation generates photorealistic 3D models with eight RGBD cameras in real time. However, it requires complex setups, high GPU power, and high-speed communication infrastructure, making deployment challenging. This article presents a Real-Time Free-Viewpoint Holographic Patient Rendering (RT-FVHP) system for telerehabilitation. Unlike traditional methods that require manually crafted assets such as 3D meshes, texture maps, and skeletal rigging, our data-driven approach eliminates the need for explicit asset definitions. Inspired by the HumanNeRF framework, we retarget dynamic human poses to a canonical pose and leverage 3D Gaussian Splatting to train a neural network in canonical space for patient representation. The trained model generates 2D RGBσ outputs via Gaussian Splatting rasterization, guided by camera parameters and human pose inputs. Compatible with HoloLens 2 and web-based platforms, RT-FVHP operates effectively under real-world conditions, including handling occlusions caused by treadmills. Occlusion handling is accomplished using our Shape-Enforced Gaussian Density Control (SGDC), which initializes and densifies 3D Gaussians in occluded regions using estimated SMPL human body priors. This approach minimizes manual intervention while ensuring complete body reconstruction. With efficient Gaussian rasterization, the model delivers real-time performance of up to 400 FPS at 1080p resolution on a dedicated RTX6000 GPU.
In recent years, Transformer-based large language models (LLMs) have significantly improved their text generation capability. Mental health is a serious concern that can be addressed using LLM-based automated mental health counselors. These systems can provide empathetic responses to individuals in need while considering the negative beliefs, stigma, and taboos associated with mental health issues. The large size of these LLMs makes it difficult to deploy such automated counselors on low-cost, low-resource devices such as edge devices. Therefore, the motivation of the present study is to analyze the effectiveness of lightweight LLMs in the development of automated mental health counseling systems. In this study, lightweight open-source LLMs, namely Google's T5 (small variant), BART (base variant), FLAN-T5 (small variant), and Microsoft's GODEL (base variant), were fine-tuned for the automated mental health counseling task using a diverse set of publicly available datasets. The experimental results reveal that BART's base variant outperformed the other models across all key metrics (ROUGE-1, ROUGE-2, ROUGE-L, and BLEU), with scores of 0.4727, 0.2665, 0.3554, and 25.3993, respectively. In comparison to the other models, the BART-base model generated empathetic and emotionally supportive responses. These findings highlight the potential of lightweight (small-size) LLMs in advancing LLM-based mental health counseling solutions and underscore the need for further exploration of lightweight LLMs for this use case. The code for this work is available at the following link: https://github.com/diviitmg03/Comparative-analysis-of-LLMs-.git .
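A short sketch of how ROUGE and BLEU scores of this kind could be computed; the Hugging Face evaluate package is an assumption (the abstract does not name an evaluation toolkit), and the response/reference strings are placeholders.

```python
import evaluate  # Hugging Face `evaluate` package (one common choice of toolkit)

rouge = evaluate.load("rouge")
sacrebleu = evaluate.load("sacrebleu")  # reports BLEU on a 0-100 scale

# Placeholder model output vs. reference counselor response
preds = ["I'm sorry you're feeling this way; it can help to talk to someone you trust."]
refs  = ["I'm sorry you feel this way. Talking to someone you trust can really help."]

print(rouge.compute(predictions=preds, references=refs))                              # rouge1/rouge2/rougeL
print(sacrebleu.compute(predictions=preds, references=[[r] for r in refs])["score"])  # BLEU, 0-100
```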
Chronic wounds affect approximately 2.5% of the US population and can cause severe complications if not identified and treated promptly. Artificial intelligence tools such as Microsoft's Copilot have the potential to expedite diagnosis, but their clinical diagnostic accuracy remains underexplored. Ten chronic wound cases were selected from the publicly available database of the Silesian University of Technology. Images and demographic data were entered into Copilot, which generated the top 3 differential diagnoses for each case. Diagnostic accuracy was evaluated using a predefined scoring system. Statistical analysis included descriptive statistics, the Wilcoxon signed-rank test, bootstrapping, the Fisher-Pitman permutation test, Cohen kappa, and Fisher exact test. Copilot correctly identified the primary diagnosis in 30% of cases and included the correct diagnosis within its top 3 differentials in 70% of cases. The mean diagnostic score was 1.7 (median: 2, SD: 1.25, variance: 1.57). The Wilcoxon test indicated no significant deviation from the median reference value (P = 0.6364), whereas bootstrapping yielded a 95% confidence interval of 1-4. The permutation test demonstrated a significant difference from the null hypothesis (P = 0.017), and the Cohen kappa revealed perfect agreement (kappa = 1, P = 0.00157). The Fisher exact test showed no significant association between primary and top 3 diagnostic accuracy (P = 0.20). Microsoft Copilot demonstrated limited diagnostic accuracy in chronic wound assessment, underscoring the need for cautious integration into clinical workflows. Broader datasets and more rigorous validation are crucial for enhancing artificial intelligence-supported diagnostics in wound care.
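A minimal sketch of the bootstrapped 95% confidence interval mentioned above, applied to per-case diagnostic scores; the score values and resampling settings are placeholders, not the study's data.

```python
import numpy as np

def bootstrap_mean_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-case scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, float)
    means = np.array([rng.choice(scores, size=scores.size, replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), lo, hi

# Placeholder per-case scores from the predefined scoring system (illustrative only)
case_scores = [0, 1, 2, 2, 3, 1, 0, 4, 2, 2]
print(bootstrap_mean_ci(case_scores))
```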
Background: Conversational artificial intelligence agents, or chatbots, are a transformational technology understudied in end-of-life care. Methods: OpenAI's ChatGPT, Google's Bard, and Microsoft's Bing were asked to define "terminally ill," "end of life," "transitions of care," "actively dying," and provide three references. Outputs were scored by six physicians on a scale of 0-10 for accuracy, comprehensiveness, and credibility. Flesch-Kincaid Grade Level and Flesch Reading Ease (FRE) were used to calculate readability. Results: Mean (standard deviation) scores for accuracy were 9 (1.9) for ChatGPT, 7.5 (2.4) for Bard, and 8.3 (2.4) for Bing. Comprehensiveness scores averaged 8.5 (1.7) for ChatGPT, 7.3 (2.1) for Bard, and 6.5 (2.3) for Bing. Credibility was low with a mean score of 3 (1.8). The mean FRE score was 41.7, and the mean grade level was 14.1, indicating low readability. Conclusion: Chatbot outputs had important deficiencies that necessitated clinician oversight to prevent misinformation.
Thyroid nodules are common, with ultrasound imaging as the primary modality for their assessment. Risk stratification systems like the American College of Radiology Thyroid Imaging Reporting and Data System (ACR TI-RADS) have been developed but suffer from interobserver variability and low specificity. Artificial intelligence, particularly large language models (LLMs) with multimodal capabilities, presents opportunities for efficient end-to-end diagnostic processes. However, their clinical utility remains uncertain. This study evaluates the accuracy and consistency of multimodal LLMs for thyroid nodule risk stratification using the ACR TI-RADS system, examining the effects of model fine-tuning, image annotation, and prompt engineering, and comparing open-source versus commercial models. In total, 3 multimodal vision-language models were evaluated: Microsoft's open-source Large Language and Vision Assistant (LLaVA) model, its medically fine-tuned variant (Large Language and Vision Assistant for bioMedicine [LLaVA-Med]), and OpenAI's commercial o3 model. A total of 192 thyroid nodules from publicly available ultrasound image datasets were assessed. Each model was evaluated using 2 prompts (basic and modified) and 2 image scenarios (unlabeled vs radiologist-annotated), yielding 6912 responses. Model outputs were compared with expert ratings for accuracy and consistency. Statistical comparisons included Chi-square tests, Mann-Whitney U tests, and Fleiss' kappa for interrater reliability. Overall, 88.4% (6110/6912) of responses were valid, with the o3 model producing the highest validity rate (2273/2304, 98.6%), followed by LLaVA (2108/2304, 91.5%) and LLaVA-Med (1729/2304, 75%; P<.001). The o3 model demonstrated the highest accuracy overall, achieving up to 57.3% accuracy in Thyroid Imaging Reporting and Data System (TI-RADS) classification, although still remaining suboptimal. Labeled images improved accuracy marginally in nodule margin assessment only when evaluating LLaVA models (407/768, 53% to 447/768, 58.2%; P=.04). Prompt engineering improved accuracy for composition (649/1152, 56.3% vs 483/1152, 41.9%; P<.001), but significantly reduced accuracy for shape, margins, and overall classification. Consistency was the highest with the o3 model (up to 85.4%), but was comparable for LLaVA and significantly improved with image labeling and modified prompts across multiple TI-RADS categories (P<.001). Subgroup analysis for o3 alone showed prompt engineering did not affect accuracy significantly but markedly improved consistency across all TI-RADS categories (up to 97.1% for shape, P<.001). Interrater reliability was consistently poor across all combinations (Fleiss' kappa<0.60). The study demonstrates the comparative advantages and limitations of multimodal LLMs for thyroid nodule risk stratification. While the commercial model (o3) consistently outperformed open-source models in accuracy and consistency, even the best-performing model outputs remained suboptimal for direct clinical deployment. Prompt engineering significantly enhanced output consistency, particularly in the commercial model. These findings underline the importance of strategic model optimization techniques and highlight areas requiring further development before multimodal LLMs can be reliably used in clinical thyroid imaging workflows.
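A small sketch of the Fleiss' kappa interrater computation referenced above, using statsmodels; the rating matrix is a placeholder standing in for repeated model outputs per nodule, not the study's data.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Placeholder: rows are nodules, columns are repeated model runs, values are TI-RADS levels (1-5)
ratings = np.array([
    [3, 3, 4],
    [5, 5, 5],
    [2, 3, 2],
    [4, 4, 4],
])
table, _ = aggregate_raters(ratings)         # per-nodule counts in each TI-RADS category
print(fleiss_kappa(table, method="fleiss"))  # values < 0.60 would indicate poor agreement, as reported
```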