Urban environments are shaped by intricate interactions among water, soil, air, and infrastructure, where traditional models often fail to capture nonlinear, non-Euclidean dynamics. Spatiotemporal graph learning (STGL) has emerged as a powerful framework to represent such complexity, enabling accurate forecasting and real-time decision support from urban districts to national and even global scales. This review provides the first comprehensive synthesis of STGL tailored to urban environments. We summarize advances in graph construction, spatial and temporal modeling, and fusion strategies, and examine applications across urban water systems, soil and agriculture, air quality, and urban risk. Landmark case studies, including Microsoft's Aurora, NVIDIA's Earth-2, and Google's GraphCast/GenCast, demonstrate STGL's potential as a foundation model for environmental intelligence. We conclude by identifying key limitations and outlining future directions, emphasizing federated learning, machine unlearning, and meta-learning to enhance next-generation STGL frameworks that ultimately support resilient and adaptive urban environments.
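The review above is narrative, but the core modeling pattern it surveys can be made concrete. Below is a minimal, hedged sketch (not taken from any surveyed system) of a spatiotemporal graph learner: a graph convolution mixes information across neighboring sensor nodes at each time step, and a GRU models temporal dynamics per node before a one-step-ahead forecast. All shapes, layer sizes, and the toy adjacency matrix are illustrative assumptions.

```python
# Minimal spatiotemporal graph learning sketch: GCN-style spatial mixing
# per time step, followed by a GRU over time and a one-step-ahead head.
import torch
import torch.nn as nn

def normalize_adjacency(adj: torch.Tensor) -> torch.Tensor:
    """Symmetrically normalize A + I, the usual GCN propagation matrix."""
    a_hat = adj + torch.eye(adj.size(0))
    deg = a_hat.sum(dim=1)
    d_inv_sqrt = torch.diag(deg.pow(-0.5))
    return d_inv_sqrt @ a_hat @ d_inv_sqrt

class STGLForecaster(nn.Module):
    def __init__(self, num_features: int, hidden: int = 32):
        super().__init__()
        self.spatial = nn.Linear(num_features, hidden)   # shared graph-conv weight
        self.temporal = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)                 # one-step-ahead forecast

    def forward(self, x: torch.Tensor, a_hat: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, nodes, features); a_hat: (nodes, nodes)
        h = torch.relu(torch.einsum("ij,btjf->btif", a_hat, self.spatial(x)))
        b, t, n, f = h.shape
        h = h.permute(0, 2, 1, 3).reshape(b * n, t, f)   # run the GRU per node
        out, _ = self.temporal(h)
        y = self.head(out[:, -1])                        # last hidden state -> prediction
        return y.reshape(b, n)

# Toy usage: 5 nodes, 12 time steps, 3 features per node (e.g., hourly readings).
adj = (torch.rand(5, 5) > 0.6).float()
adj = ((adj + adj.t()) > 0).float()
model = STGLForecaster(num_features=3)
pred = model(torch.randn(2, 12, 5, 3), normalize_adjacency(adj))  # -> (2, 5)
```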
General-domain large language models (LLMs) have emerged as valuable tools in healthcare; however, their ability to understand and perform tasks based on data stored in tabular form has not been explored in ophthalmology. We aimed to assess the performance of OpenAI's Generative Pre-trained Transformer 4o (GPT-4o) on real emergency department (ED) eye-related encounters extracted from electronic medical records in tabular format. We input an Excel spreadsheet containing data on 1,419 unique eye-related ED encounters, divided into (1) chief complaint (CC), history of present illness (HPI), and eye examination; (2) CC and eye examination; and (3) eye examination only, into GPT-4o via Microsoft's Azure OpenAI Service using chain-of-thought (CoT) prompting, and evaluated the LLM's diagnosis and assessment performance on the presented data. GPT-4o answers were reviewed by board-certified ophthalmologists and classified as (1) GPT-4o provided a correct diagnosis and assessment; (2) GPT-4o provided an incorrect diagnosis and assessment; (3) GPT-4o was unable to provide a correct diagnosis because the encounter documentation was incorrect; or (4) GPT-4o was unable to provide a correct diagnosis because ancillary tests were required. A sample of encounters was reviewed by a second board-certified ophthalmologist to assess inter-grader agreement. Average accuracy rates were used to evaluate performance and compare statistical significance across scenarios. A second round of CoT prompting was performed after providing the LLM with the final encounter diagnosis to evaluate disagreements or inconsistencies between the presented documentation and the reported diagnosis. GPT-4o (CoT) overall accuracy was 0.76 (95% confidence interval [CI], 0.74-0.79); no significant difference in accuracy was found when GPT-4o was presented with CC, HPI, and eye findings vs. CC and eye findings vs. eye findings only (P = 0.675). The inter-grader agreement kappa was 0.841 (P < 0.001). GPT-4o identified that 6.6% of all encounters did not have EMR documentation that supported the final encounter diagnosis. When encounters with incorrect EMR documentation and encounters requiring ancillary tests (5.2%) were excluded, GPT-4o accuracy was 0.87 (95% CI, 0.85-0.89). GPT-4o could accurately synthesize tabular data and provide assessments and diagnoses in real-world ophthalmology encounters, in addition to identifying encounters with documentation that did not support the final ED encounter diagnosis. This capability has the potential to support the clinician's diagnosis. What is known: General-domain large language models (LLMs) have emerged as valuable tools in healthcare. In ophthalmology, prior LLM studies have focused on text-based inquiries of a limited number of sample cases. What is new: OpenAI's Generative Pre-trained Transformer 4o (GPT-4o) could accurately synthesize and provide diagnoses for real-world ophthalmology scenarios presented as tabular data. Additionally, GPT-4o was able to detect incorrect documentation and flag inconsistencies, contradictions, or incomplete information, which can help ensure that the EMR documentation supports the clinician's diagnosis.
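To make the described pipeline concrete, the sketch below shows how one tabular encounter row could be submitted to a GPT-4o deployment on Azure OpenAI with a chain-of-thought style prompt. The spreadsheet name, column names, API version, and deployment name are assumptions for illustration, not the authors' actual configuration.

```python
# Illustrative only: send one ED encounter row to an Azure OpenAI GPT-4o
# deployment with a chain-of-thought style prompt.
import os
import pandas as pd
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",               # assumed API version
)

df = pd.read_excel("ed_encounters.xlsx")    # hypothetical spreadsheet and columns

def assess_encounter(row: pd.Series) -> str:
    prompt = (
        "You are an ophthalmologist reviewing an emergency department encounter.\n"
        f"Chief complaint: {row['CC']}\n"
        f"History of present illness: {row['HPI']}\n"
        f"Eye examination: {row['EyeExam']}\n"
        "Think step by step, then give the most likely diagnosis and an assessment."
    )
    response = client.chat.completions.create(
        model="gpt-4o",                      # Azure deployment name (assumed)
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

print(assess_encounter(df.iloc[0]))
```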
Here, we summarize the work that Microsoft's philanthropic Artificial Intelligence (AI) for Good Lab has completed in the realm of promoting public and population health. In particular, after providing examples of how the AI for Good Lab has articulated the value of using AI to improve public and population health, we provide examples and references of the work demonstrating how the Lab has: applied AI to improve maternal, fetal, and infant health; leveraged large language models to improve population health; and applied AI to improve rural health and healthcare. We also summarize what we have learned through our work, finding that: getting the question right and ensuring the limitations of any analysis are understood are important; collaboration across public, private, and educational institutions with subject-matter experts is the most effective and efficient way to harness this new technology; and focusing on metrics that reflect health, and not just the accuracy of the model, is the most impactful way to improve the health of populations worldwide.
The role of artificial intelligence (AI) in hepatology is rapidly expanding. However, the ability of AI chat models such as ChatGPT to accurately answer clinical questions remains unclear. This study aims to determine the ability of large language models (LLMs) to answer questions in hepatology, as well as compare the accuracy and quality of responses provided by different LLMs. Hepatology questions from the Digestive Diseases Self-Education Platform were entered into three LLMs (OpenAI's ChatGPT-4, Microsoft's Bing, and Google's Bard) between September 7 and 13, 2023. Questions were posed with and without multiple-choice answers. Generated responses were assessed based on accuracy and number of correct answers. Statistical analysis was performed to determine the number of correct responses per LLM per category. A total of 144 questions were used to query the AI models. ChatGPT-4's accuracy was 62.3%, Bing's accuracy was 53.5%, and Bard's accuracy was 38.2% (P < .001) for multiple-choice questions. For open-ended questions, ChatGPT-4's accuracy was 44.4%, Bing's was 28.5%, and Bard's was 21.4% (P < .001). ChatGPT-4 and Bing attempted to answer 100% of the questions, whereas Bard was unable to answer 11.8% of the questions. All 3 LLMs provided a rationale in addition to an answer, as well as counselling where appropriate. LLMs demonstrate variable accuracy when answering clinical questions related to hepatology, though show comparable efficacy when presented with questions in an open-ended versus multiple choice (MCQ) format. Further research is required to investigate the optimal use of LLMs in clinical and educational contexts.
Background: Conversational artificial intelligence agents, or chatbots, are a transformational technology understudied in end-of-life care. Methods: OpenAI's ChatGPT, Google's Bard, and Microsoft's Bing were asked to define "terminally ill," "end of life," "transitions of care," "actively dying," and provide three references. Outputs were scored by six physicians on a scale of 0-10 for accuracy, comprehensiveness, and credibility. Flesch-Kincaid Grade Level and Flesch Reading Ease (FRE) were used to calculate readability. Results: Mean (standard deviation) scores for accuracy were 9 (1.9) for ChatGPT, 7.5 (2.4) for Bard, and 8.3 (2.4) for Bing. Comprehensiveness scores averaged 8.5 (1.7) for ChatGPT, 7.3 (2.1) for Bard, and 6.5 (2.3) for Bing. Credibility was low with a mean score of 3 (1.8). The mean FRE score was 41.7, and the mean grade level was 14.1, indicating low readability. Conclusion: Chatbot outputs had important deficiencies that necessitated clinician oversight to prevent misinformation.
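For readers unfamiliar with the readability measures, the standard Flesch Reading Ease and Flesch-Kincaid Grade Level formulas are reproduced below in a small sketch; the syllable counter is a crude heuristic rather than the validated counter used by published readability tools.

```python
# Standard Flesch formulas, with a rough vowel-group syllable heuristic.
import re

def count_syllables(word: str) -> int:
    # Crude heuristic; real implementations use dictionaries or hyphenation rules.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_scores(text: str) -> tuple[float, float]:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences            # words per sentence
    spw = syllables / len(words)            # syllables per word
    fre = 206.835 - 1.015 * wps - 84.6 * spw       # Flesch Reading Ease
    fkgl = 0.39 * wps + 11.8 * spw - 15.59         # Flesch-Kincaid Grade Level
    return fre, fkgl

fre, fkgl = flesch_scores("Hospice care focuses on comfort. It supports patients and families.")
```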
Artificial intelligence (AI) is quickly transforming healthcare by improving patient and clinician access to and understanding of medical information. Generative AI models answer healthcare queries and provide tailored and quick responses. This research evaluates the readability and quality of bladder cancer (BC) patient information in 10 popular AI-enabled chatbots. We used the latest versions of ten popular chatbots: OpenAI's GPT-4o, Microsoft's Copilot Pro, Claude-3.5 Haiku, Sonar Large, Grok 2, Gemini Advanced 1.5 Pro, Mistral Large, Google Palm 2 (Google Bard), Meta's Llama 3.3, and Meta AI v2. Prompts were developed to elicit texts about BC, non-muscle-invasive BC, muscle-invasive BC, and metastatic BC. The modified Ensuring Quality Information for Patients (mEQIP) tool, the Quality Evaluating Scoring Tool (QUEST), and DISCERN were used to assess quality. The Average Reading Level Consensus (ARLC), Flesch Reading Ease (FKRE), and Flesch-Kincaid Grade Level (FKGL) were used to evaluate readability. The ten chatbots exhibited statistically significant differences in mean mEQIP, DISCERN, and QUEST scores (p = 0.048, p = 0.025, and p = 0.021, respectively). Meta AI scored lowest on average mEQIP, DISCERN, and QUEST, while Llama attained the highest scores. Statistically significant differences were also seen in the chatbots' average ARLC, FKGL, and FKRE scores (p = 0.002, p = 0.001, and p = 0.002, respectively), with Google Palm producing the easiest-to-read texts and Llama the most difficult to understand. AI chatbots can produce information on BC that is of moderate quality and readability, although there is significant variability among platforms. Results should be interpreted with caution owing to the single-query approach and the continuously advancing AI models. Clinicians can support safe implementation by delivering structured feedback and incorporating content review stages into patient education processes. Continuous collaboration between healthcare practitioners and AI developers is crucial to maintain the accuracy, currency, and clarity of AI-generated content.
The expansion of the Wildland-Urban Interface (WUI) demands precise mapping to effectively mitigate wildfire risk. However, the absence of national building footprint databases presents a significant challenge. This study, focused on mainland Portugal, proposes a semi-automated, multi-criteria filtering framework to refine global open-source building datasets, specifically Microsoft's Global Building Footprints. The method integrates regional adaptability and spatial metrics such as area thresholds and proximity analyses, using Portugal's official Geographic Buildings Location Database as a reference. The framework prioritizes residential structures by excluding anomalies, such as industrial facilities, photovoltaic arrays, and transmission lines, through dynamically adjusted thresholds at various administrative levels (e.g., municipal and NUTS-2). The filtering process reduced the number of building footprints from approximately 5.6 million to around 3.0 million. We mapped the WUI across Portugal using both the original dataset (WUI_MSB) and the filtered dataset (WUI_MSB_F) to compare outcomes. The WUI was classified into Intermix and Interface types. Buildings that did not meet the minimum criteria to be considered part of the WUI were categorized based on their density: very low, low, medium, or high. The original WUI_MSB covered a total area of 13,177 km², representing approximately 15% of mainland Portugal. After applying the filtering framework, the WUI_MSB_F area was reduced by 49%, totaling 8,327 km². The workflow, implemented using Python scripting and ArcGIS Pro, is scalable for national-level applications. These experimental results highlight the importance of region-specific adjustments and demonstrate how this methodology can support policymakers in identifying and prioritizing context-specific exposed communities. By enhancing the reliability of open datasets, this approach offers a reproducible tool for wildfire resilience planning, particularly in data-scarce regions.
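As an illustration of the kind of multi-criteria filtering described, the sketch below applies an area threshold and a proximity criterion with GeoPandas rather than ArcGIS Pro; the file names, thresholds, and reference layer are placeholders, not the study's actual parameters.

```python
# Hedged sketch: area-threshold and proximity filtering of building footprints.
import geopandas as gpd

footprints = gpd.read_file("ms_global_building_footprints_pt.gpkg")   # placeholder file
reference = gpd.read_file("official_buildings_reference.gpkg")        # placeholder reference layer

# Work in a metric CRS (ETRS89 / Portugal TM06) so areas and distances are in m2 / m.
footprints = footprints.to_crs(epsg=3763)
reference = reference.to_crs(epsg=3763)

# Criterion 1: area thresholds to drop small sheds and oversized industrial halls.
footprints["area_m2"] = footprints.geometry.area
candidates = footprints[(footprints.area_m2 >= 40) & (footprints.area_m2 <= 1000)]

# Criterion 2: keep only footprints within a proximity tolerance of a reference building.
matched = gpd.sjoin_nearest(
    candidates, reference[["geometry"]], how="inner",
    max_distance=25, distance_col="dist_m",
)
filtered = matched[~matched.index.duplicated()]   # one match per candidate footprint
```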
Medical education professionals expect artificial intelligence (AI) systems to be an efficient faculty resource for content creation. However, prior findings suggest that machine learning algorithms may exacerbate negative stereotypes and undermine efforts for diversity, equity, and inclusivity. This investigation explores the potential of OpenAI's ChatGPT (OCG) and Microsoft's Bing A.I. Image Creator (MBIC) to perpetuate ethnoracial stereotypes in medical cases. A series of medically relevant vignettes and visual representations were requested from ChatGPT and MBIC for five medical conditions traditionally associated with certain ethnoracial groups: sickle cell anemia, cystic fibrosis, Tay-Sachs disease, beta-thalassemia, and aldehyde dehydrogenase deficiency. Initial prompting, self-prompting, and prompt engineering were iteratively performed to ascertain the extent to which AI outputs for generated vignettes and imagery were mutable or fixed. The ethnoracial identities in the vignettes adhered more closely to the groups traditionally associated with each condition than the distributions described in epidemiologic studies. Following prompt engineering and self-prompting, an increase in diversity was seen. On initial prompting, the most common ethnoracial identity depicted was Caucasian. Secondary prompting resulted in less diversity, with greater conformity to the traditionally expected ethnoracial identity. The prevalence of dataset bias and AI's user-dependent learning abilities underscore the importance of human stewardship. The increasing use of AI in generating medical education content, like MCQs, demands vigilant use of such tools to combat the reinforcement of race-based stereotypes in medicine.
Chronic wounds affect approximately 2.5% of the US population and can cause severe complications if not identified and treated promptly. Artificial intelligence tools such as Microsoft's Copilot have the potential to expedite diagnosis, but their clinical diagnostic accuracy remains underexplored. Ten chronic wound cases were selected from the publicly available database of the Silesian University of Technology. Images and demographic data were entered into Copilot, which generated the top 3 differential diagnoses for each case. Diagnostic accuracy was evaluated using a predefined scoring system. Statistical analysis included descriptive statistics, the Wilcoxon signed-rank test, bootstrapping, the Fisher-Pitman permutation test, Cohen kappa, and Fisher exact test. Copilot correctly identified the primary diagnosis in 30% of cases and included the correct diagnosis within its top 3 differentials in 70% of cases. The mean diagnostic score was 1.7 (median: 2, SD: 1.25, variance: 1.57). The Wilcoxon test indicated no significant deviation from the median reference value (P = 0.6364), whereas bootstrapping yielded a 95% confidence interval of 1-4. The permutation test demonstrated a significant difference from the null hypothesis (P = 0.017), and the Cohen kappa revealed perfect agreement (kappa = 1, P = 0.00157). The Fisher exact test showed no significant association between primary and top 3 diagnostic accuracy (P = 0.20). Microsoft Copilot demonstrated limited diagnostic accuracy in chronic wound assessment, underscoring the need for cautious integration into clinical workflows. Broader datasets and more rigorous validation are crucial for enhancing artificial intelligence-supported diagnostics in wound care.
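The statistical tests named above can be reproduced on a small illustrative score vector as follows; the numbers are invented, and the specific toolchain (SciPy, scikit-learn) is an assumption rather than the one necessarily used in the study.

```python
# Illustrative statistics on hypothetical per-case diagnostic scores (0-3).
import numpy as np
from scipy import stats
from sklearn.metrics import cohen_kappa_score

scores = np.array([0, 1, 2, 3, 1, 2, 3, 0, 2, 3])

# Wilcoxon signed-rank test against a reference median of 2.
w_stat, w_p = stats.wilcoxon(scores - 2)

# Bootstrap 95% CI for the mean score.
boot = np.random.default_rng(0).choice(scores, size=(10_000, scores.size), replace=True)
ci_low, ci_high = np.percentile(boot.mean(axis=1), [2.5, 97.5])

# Inter-rater agreement (two hypothetical identical raters -> kappa = 1) and
# Fisher's exact test on a 2x2 table of primary vs top-3 correctness counts.
kappa = cohen_kappa_score(scores > 1, scores > 1)
odds_ratio, f_p = stats.fisher_exact([[3, 7], [7, 3]])
```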
In recent decades, maxillomandibular reconstruction has been revolutionised by the use of free flaps and virtual surgical planning technologies. However, the currently applied physical cutting guides provide no intraoperative flexibility, and adjustments based on intraoperative findings are not possible. A novel augmented reality (AR)-guided technique is presented that allows for quick intraoperative surgical planning adaptations. A mandibular reconstruction using fibular bone was simulated, and an application for Microsoft's HoloLens 2 was developed for modelling the fibular segments. The application provided real-time feedback on the position of the saw with respect to the virtually planned osteotomy planes projected onto the fibular bone. The technique was investigated in a validation test using 3-dimensional printed fibular models. Mean (SD) deviations from the planned osteotomy plane, expressed in degrees and segment length deviation, were 4.1° (2.6) and 2.0 mm (1.1), respectively, for session one, and 3.1° (2.3) and 2.3 mm (1.4), respectively, for session two. The feasibility of the AR-guided technique for performing osteotomies of fibular bone was established in this workflow simulation. The technique can improve the transfer of the preoperative plan to the intraoperative situation. Further development is, however, necessary, since conventional cutting guides are, so far, superior.
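The reported deviations can be understood with a short geometry sketch: the angular error is the angle between the planned and performed osteotomy plane normals, and the length error is the difference between planned and measured segment lengths. The vectors and lengths below are invented for illustration.

```python
# Angle between planned and performed osteotomy planes, from their unit normals.
import numpy as np

def plane_angle_deg(n_planned: np.ndarray, n_actual: np.ndarray) -> float:
    n1 = n_planned / np.linalg.norm(n_planned)
    n2 = n_actual / np.linalg.norm(n_actual)
    cos_angle = np.clip(abs(np.dot(n1, n2)), -1.0, 1.0)  # abs(): plane orientation is sign-free
    return float(np.degrees(np.arccos(cos_angle)))

planned_normal = np.array([0.0, 0.0, 1.0])
actual_normal = np.array([0.05, 0.02, 1.0])            # slightly tilted cut
angle_error = plane_angle_deg(planned_normal, actual_normal)   # ~3 degrees

segment_length_error = abs(41.5 - 40.0)                # planned vs measured length, mm
```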
Disputes over elusive Majorana particles, the hoped-for key to robust quantum chips, continue to divide the field.
Spatial Transcriptomics is a groundbreaking technology that integrates histology images with spatially resolved gene expression profiles. Among the various Spatial Transcriptomics techniques available, Visium has emerged as the most widely adopted. However, its accessibility is limited by high costs, the need for specialized expertise, and slow clinical integration. Additionally, gene capture inefficiencies lead to significant dropout, corrupting the acquired data. To address these challenges, the deep learning community has explored predicting gene expression directly from histology images. Yet, inconsistencies in datasets, preprocessing, and training protocols hinder fair comparisons between models. To bridge this gap, we introduce SpaRED, a systematically curated database comprising 26 public datasets, providing a standardized resource for model evaluation. We further propose SpaCKLE, a state-of-the-art transformer-based gene expression completion model that reduces mean squared error by over 82.5% compared to existing approaches. Finally, we establish the SpaRED benchmark, evaluating eight state-of-the-art prediction models on both raw and SpaCKLE-completed data, and demonstrating that SpaCKLE substantially improves results across all gene expression prediction models. Altogether, our contributions constitute the most comprehensive benchmark of gene expression prediction from histology images to date and a stepping stone for future research on Spatial Transcriptomics.
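As a toy illustration of the completion task SpaCKLE addresses, the snippet below simulates dropout on a synthetic spot-by-gene matrix, fills masked entries with a trivial per-gene-mean baseline, and scores the result with a masked mean squared error; it is not SpaCKLE's architecture or data.

```python
# Masked MSE for gene expression completion on a synthetic matrix.
import numpy as np

rng = np.random.default_rng(0)
expression = rng.gamma(shape=2.0, scale=1.0, size=(100, 50))   # spots x genes (synthetic)

# Simulate dropout: zero out 30% of entries and remember where.
mask = rng.random(expression.shape) < 0.30
corrupted = np.where(mask, 0.0, expression)

# Trivial completion baseline: fill masked entries with per-gene means of observed values.
observed = np.where(~mask, corrupted, np.nan)
completed = np.where(mask, np.nanmean(observed, axis=0, keepdims=True), corrupted)

mse_masked = float(np.mean((completed[mask] - expression[mask]) ** 2))
```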
The use of large language models (LLMs) in generative artificial intelligence (AI) is rapidly increasing in dentistry. However, their reliability is yet to be fully established. This study aims to evaluate the diagnostic accuracy, clinical applicability, and patient education potential of LLMs in paediatric dentistry by evaluating the responses of six LLMs: Google AI's Gemini and Gemini Advanced, OpenAI's ChatGPT-3.5, -4o, and -4, and Microsoft's Copilot. Ten open-type clinical questions relevant to paediatric dentistry were posed to the LLMs. The responses were graded by two independent evaluators from 0 to 10 using a detailed rubric. After 4 weeks, answers were reevaluated to assess intra-evaluator reliability. Statistical comparisons used Friedman's, Wilcoxon's, and Kruskal-Wallis tests to identify the model that provided the most comprehensive, accurate, explicit, and relevant answers. Variations in results were noted. ChatGPT-4 answers were scored the best (average score 8.08), followed by the answers of Gemini Advanced (8.06), ChatGPT-4o (8.01), ChatGPT-3.5 (7.61), Gemini (7.32), and Copilot (5.41). Statistical analysis revealed that ChatGPT-4 outperformed all other LLMs, and the difference was statistically significant. Despite variations and different responses to the same queries, remarkable similarities were observed. Except for Copilot, all chatbots achieved a score above 6.5 on all queries. This study demonstrates the potential use of LLMs in supporting evidence-based paediatric dentistry. Nevertheless, they cannot be regarded as completely trustworthy. Dental professionals should use AI models critically as supportive tools and not as a substitute for overall scientific knowledge and critical thinking.
Large language models (LLMs) have acquired a remarkable level of knowledge through their initial training. However, they lack expertise in particular domains such as genomics. Variant annotation data, an important component of genomics, is crucial for interpreting and prioritizing disease-related variants among the millions of variants identified by genetic sequencing. In our project, we aimed to improve LLM performance in genomics by adding variant annotation data to LLMs through retrieval-augmented generation (RAG) and fine-tuning techniques. Using RAG, we successfully integrated 190 million highly accurate variant annotations, curated from five major annotation datasets and tools, into GPT-4o. This integration empowers users to query specific variants and receive accurate variant annotations and interpretations supported by the advanced reasoning and language understanding capabilities of LLMs. Additionally, fine-tuning GPT-4 on variant annotation data also improved model performance in some annotation fields, although accuracy across more fields remains suboptimal. Our model significantly improved the accessibility and efficiency of the variant interpretation process by leveraging LLM capabilities. Our project also revealed that RAG outperforms fine-tuning for factual knowledge injection in terms of data volume, accuracy, and cost-effectiveness. As a pioneering study on adding genomics knowledge to LLMs, our work paves the way for developing more comprehensive and informative genomics AI systems to support clinical diagnosis and research projects, and it demonstrates the potential of LLMs in specialized domains. We used publicly available datasets as detailed in the paper, which can be provided upon request.
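A minimal sketch of the retrieval-augmented generation pattern described above: annotation records are embedded, the nearest records to a query are retrieved by cosine similarity, and they are injected into the prompt. The model names, record format, and example annotations are generic placeholders, not the study's actual data or configuration.

```python
# Hedged RAG sketch over variant annotation records (placeholder records).
import numpy as np
from openai import OpenAI

client = OpenAI()

annotations = [
    "chr1:g.12345A>G GENE1 missense, example pathogenicity label (placeholder record)",
    "chr2:g.67890C>T GENE2 intronic, example population frequency (placeholder record)",
    # ... millions more in the real setting
]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)  # assumed model
    return np.array([d.embedding for d in resp.data])

corpus_vectors = embed(annotations)

def answer(query: str, top_k: int = 3) -> str:
    q = embed([query])[0]
    sims = corpus_vectors @ q / (np.linalg.norm(corpus_vectors, axis=1) * np.linalg.norm(q))
    context = "\n".join(annotations[i] for i in np.argsort(-sims)[:top_k])
    messages = [
        {"role": "system", "content": "Answer using only the provided variant annotations."},
        {"role": "user", "content": f"Annotations:\n{context}\n\nQuestion: {query}"},
    ]
    return client.chat.completions.create(model="gpt-4o", messages=messages).choices[0].message.content

print(answer("What is the annotation of chr1:g.12345A>G?"))
```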
In recent years, Transformer-based large language models (LLMs) have significantly improved their text generation capabilities. Mental health is a serious concern that can be addressed using LLM-based automated mental health counselors. These systems can provide empathetic responses to individuals in need while considering the negative beliefs, stigma, and taboos associated with mental health issues. However, the large size of these LLMs makes it difficult to deploy automated counselors on low-cost, low-resource devices such as edge devices. Therefore, the motivation of the present study is to analyze the effectiveness of lightweight LLMs in the development of automated mental health counseling systems. In this study, lightweight open-source LLMs, namely Google's T5 (small variant), BART (base variant), FLAN-T5 (small variant), and Microsoft's GODEL (base variant), were fine-tuned for the automated mental health counseling task using a diverse set of publicly available datasets. The experimental results reveal that BART's base variant outperformed the other models across all key metrics, namely ROUGE-1, ROUGE-2, ROUGE-L, and BLEU, with scores of 0.4727, 0.2665, 0.3554, and 25.3993, respectively. In comparison to the other models, the BART-base model generated empathetic and emotionally supportive responses. These findings highlight the potential of lightweight (small-size) LLMs in advancing the field of LLM-based mental health counseling solutions and underscore the need for further exploration of lightweight LLMs for this counseling use case. The code for this work is available at the following link: https://github.com/diviitmg03/Comparative-analysis-of-LLMs-.git .
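For concreteness, the snippet below shows one way the reported ROUGE and BLEU metrics could be computed for a counseling model's generated responses using the Hugging Face `evaluate` package; the example responses are invented, and this is not the authors' evaluation script.

```python
# Illustrative ROUGE/BLEU evaluation of generated counseling responses.
import evaluate

predictions = ["It is understandable to feel overwhelmed; talking to someone you trust can help."]
references = ["Feeling overwhelmed is understandable, and reaching out to someone you trust often helps."]

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

rouge_scores = rouge.compute(predictions=predictions, references=references)
bleu_scores = bleu.compute(predictions=predictions, references=[[r] for r in references])

print(rouge_scores["rouge1"], rouge_scores["rouge2"], rouge_scores["rougeL"])
print(bleu_scores["bleu"])
```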
The purpose of this article is to explore what cognitive research can reveal about the way in which the neural system processes information. To that end, a comprehensive review of cognitive/behavioral and neuroscience models and findings is presented along with ideas as to how the human neural system has evolved. The representation of information in short-term memory (STM) is ascribed to stable oscillatory patterns across hierarchically structured functional networks of neocortical areas. These oscillatory patterns are primarily shaped by information in long-term memory (LTM) that is stored in the synaptic connections between neurons and, consequently, between neural areas. It is argued for the first time that the non-sensory and non-motor information processing stages revealed by behavioral research involve the change of potentially brain-wide oscillatory patterns that follow the reconfiguring of temporary neural networks. These network configurations can be governed by hub areas in the perceptual cortices (serving stimulus identification), the hippocampus (declarative memory), and the basal ganglia and prefrontal cortex (motor behavior, STM, and information processing). These ideas are integrated into a tentative neural Three-Level Systems (TLS) architecture comprising evolutionarily older perceptual and motor systems that are linked by a flexible central processing system located in the evolutionarily more recent association cortex.
This study aims to analyze and compare the quality, accuracy, and readability of information regarding anatomic total shoulder arthroplasty (aTSA) and reverse total shoulder arthroplasty (rTSA) provided by two AI interfaces (OpenAI's ChatGPT and Microsoft's CoPilot). Thirty questions commonly asked by patients (categorized by Rothwell criteria into Fact, Policy, and Value) were inputted into ChatGPT 3.5 and CoPilot. Responses were assessed with the DISCERN scale, Journal of the American Medical Association (JAMA) benchmark criteria, Flesch-Kincaid Reading Ease Score (FRES), and Flesch-Kincaid Grade Level (FKGL). The sources of citations provided by CoPilot were further analyzed. Both AI interfaces generated DISCERN scores >50 (aTSA and rTSA ChatGPT: 57 [Fact], 61 [Policy], 58 [Value]; aTSA and rTSA CoPilot: 68 [Fact], 72 [Policy], 70 [Value]), demonstrating "good" quality of the information provided, except for the Policy questions answered by CoPilot, which were scored as "excellent" (>70). CoPilot's higher JAMA score (3 vs. 0) and FRES scores >30 indicated more reliable, accessible responses, which nevertheless required a minimum of a 12th-grade education to read. In comparison, ChatGPT generated more complex texts, with the majority of FRES scores <20 and FKGL scores signifying academic-level complexity. Finally, CoPilot provided citations and demonstrated the highest percentage of academic sources (31.1% for rTSA and 26.7% for aTSA), suggesting reliable sources of information. Overall, the information provided by both AI interfaces, ChatGPT and CoPilot, was scored as a "good" source of information for commonly asked patient questions regarding shoulder arthroplasty. However, the answers to questions pertaining to shoulder arthroplasty provided by CoPilot proved to be more reliable (P = .0061), less complex, easier to read (P = .0031), and referenced information from reliable resources including academic sources, journal articles, and medical sites. Although answers provided by CoPilot were "easier" to read, they still required a 12th-grade education, which may be too complex for most patients, posing a challenge for patient comprehension. A substantial number of nonmedical media sites and commercial sources were cited for both aTSA and rTSA questions by CoPilot. Critically, answers from both AI interfaces should serve as supplementary resources rather than primary sources on perioperative conditions pertaining to shoulder arthroplasty.
Generative Artificial Intelligence (GAI) has driven several advancements in healthcare, with large language models (LLMs) such as OpenAI's ChatGPT, Google's Gemini, and Microsoft's Copilot demonstrating potential in clinical decision support, medical education, and research acceleration. However, their closed-source architecture, high computational costs, and limited adaptability to specialized medical contexts remained key barriers to universal adoption. Now, with the rise of DeepSeek's DeepThink (R1), an open-source LLM, gaining prominence since mid-January 2025, new opportunities and challenges emerge for healthcare integration and AI-driven research. Unlike proprietary models, DeepSeek fosters continuous learning by leveraging publicly available open-source datasets, possibly enhancing adaptability to the ever-evolving medical knowledge and scientific reasoning. Its transparent, community-driven approach may enable greater customization, regional specialization, and collaboration among data researchers and clinicians. Additionally, DeepSeek supports offline deployment, addressing some data privacy concerns. Despite these promising advantages, DeepSeek presents ethical and regulatory challenges. Users' data privacy worries have emerged, with concerns about user data retention policies and potential developer access to user-generated content without opt-out options. Additionally, when used in healthcare applications, its compliance with China's data-sharing regulations highlights the urgent need for clear international data privacy and governance. Furthermore, like other LLMs, DeepSeek may face limitations related to inherent biases, hallucinations, and output reliability, which warrants rigorous validation and human oversight before clinical application. This editorial explores DeepSeek's potential role in clinical workflows, medical education, and research while also highlighting its challenges related to security, accuracy, and responsible AI governance. With careful implementation, ethical considerations, and international collaboration, DeepSeek and similar LLMs could enhance healthcare innovation, providing cost-effective, scalable AI solutions while ensuring human expertise remains at the forefront of patient care.
The increasing utilization of large language models (LLMs) in Generative Artificial Intelligence across various medical and dental fields, and specifically orthodontics, raises questions about their accuracy. This study aimed to assess and compare the answers offered by four LLMs: Google's Bard, OpenAI's ChatGPT-3.5 and ChatGPT-4, and Microsoft's Bing, in response to clinically relevant questions within the field of orthodontics. Ten open-type clinical orthodontics-related questions were posed to the LLMs. The responses provided by the LLMs were assessed on a scale ranging from 0 (minimum) to 10 (maximum) points, benchmarked against robust scientific evidence, including consensus statements and systematic reviews, using a predefined rubric. After a 4-week interval from the initial evaluation, the answers were reevaluated to gauge intra-evaluator reliability. Statistical comparisons were conducted on the scores using Friedman's and Wilcoxon's tests to identify the model providing answers with the greatest comprehensiveness, scientific accuracy, clarity, and relevance. Overall, no statistically significant differences between the scores given by the two evaluators, on both scoring occasions, were detected, so an average score for every LLM was computed. The highest-scoring answers were those of Microsoft Bing Chat (average score = 7.1), followed by ChatGPT-4 (average score = 4.7), Google Bard (average score = 4.6), and finally ChatGPT-3.5 (average score = 3.8). While Microsoft Bing Chat statistically outperformed ChatGPT-3.5 (P-value = 0.017) and Google Bard (P-value = 0.029), and ChatGPT-4 outperformed ChatGPT-3.5 (P-value = 0.011), all models occasionally produced answers lacking comprehensiveness, scientific accuracy, clarity, and relevance. The questions asked were indicative and did not cover the entire field of orthodontics. LLMs show great potential in supporting evidence-based orthodontics. However, their current limitations pose a potential risk of incorrect healthcare decisions if they are utilized without careful consideration. Consequently, these tools cannot serve as a substitute for the orthodontist's essential critical thinking and comprehensive subject knowledge. For effective integration into practice, further research, clinical validation, and enhancements to the models are essential. Clinicians must be mindful of the limitations of LLMs, as their imprudent utilization could have adverse effects on patient care.
This study evaluates the accuracy of single camera markerless motion capture (SCMoCap) using Microsoft's Azure Kinect, enhanced with inverse kinematics (IK) via OpenSim, for upper limb movement analysis. Twelve healthy adults performed ten upper-limb tasks, recorded simultaneously by OptiTrack (marker-based) and Azure Kinect (markerless) from frontal and sagittal views. Joint angles were calculated using two methods: (1) direct kinematics based on body coordinate frames and (2) inverse kinematics using OpenSim's IK tool with anatomical keypoints. Accuracy was evaluated using root mean square error (RMSE) and Bland-Altman analysis. Results indicated that the IK method slightly improved joint angle agreement with OptiTrack for simpler movements, with an average RMSE of 8° for shoulder elevation in the sagittal plane compared to 9° with the coordinate frame method. However, both methods had higher RMSEs for rotational measurements, with IK and coordinate frame methods at 21° for shoulder rotation in the sagittal plane. Forearm pronation-supination measurements were unreliable due to tracking limitations. These findings suggest that Kinect with IK improves accuracy for simpler movements but struggles with rotational joint mechanics. Future research should focus on enhancing markerless tracking algorithms to fully realise the benefits of IK.
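The two agreement measures used in the study, RMSE and Bland-Altman bias with 95% limits of agreement, can be written out directly; the joint-angle values below are synthetic and purely illustrative.

```python
# RMSE and Bland-Altman agreement between two joint-angle measurement systems.
import numpy as np

optitrack = np.array([10.0, 25.0, 40.0, 55.0, 70.0, 85.0])   # marker-based reference (degrees)
kinect_ik = np.array([12.0, 23.0, 44.0, 52.0, 73.0, 88.0])   # markerless + OpenSim IK (degrees)

# Root mean square error between the two systems.
rmse = float(np.sqrt(np.mean((kinect_ik - optitrack) ** 2)))

# Bland-Altman statistics: bias and 95% limits of agreement.
diff = kinect_ik - optitrack
bias = float(diff.mean())
loa_low = bias - 1.96 * diff.std(ddof=1)
loa_high = bias + 1.96 * diff.std(ddof=1)
```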