The aim was to evaluate the ability of four large language models (LLMs) (OpenAI's ChatGPT-3.5, Microsoft 365 Copilot, DeepSeek-R1, and Google Gemini 2.5 Pro) to develop treatment options when presented with clinical cases published in the maxillofacial prosthodontics literature. Six maxillofacial case reports were fed to the LLMs following a prompt that requested prosthodontic treatment options from the perspective of a prosthodontist. Expert evaluators scored the relevance, clarity, depth, focus, and coherence of the responses. Statistical analyses, including descriptive statistics, two-way analysis of variance (ANOVA), post hoc Tukey tests, Pearson correlation analyses, and intraclass correlation coefficients (ICCs), were performed (α = 0.05). There were significant differences among the total mean relevance (p = 0.003), clarity (p = 0.006), depth (p < 0.001), focus (p < 0.001), and coherence (p < 0.001) scores of the chatbots. Copilot consistently scored the lowest, and Gemini or DeepSeek scored the highest for all five factors. Depth (p = 0.006), focus (p = 0.024), and coherence (p = 0.013) scores of senior prosthodontists were slightly higher than those of junior prosthodontists. Pearson correlation analysis revealed positive correlations between the total mean scores for all five factors (p < 0.001). The study demonstrates the ability of LLMs to develop maxillofacial prosthetic treatment plans tailored to specific clinical scenarios. There were significant differences between the abilities of the LLMs evaluated in this study. Copilot scored the lowest for all factors evaluated, and Gemini and/or DeepSeek scored the highest.
Aging is characterized by a progressive decline in physiological function, driven by intrinsic mechanisms (primary aging) and modifiable factors (secondary aging), ultimately leading to multimorbidity, disability, and mortality. Mitochondrial dysfunction, a major hallmark of aging, plays a central role in the loss of muscle mass and strength observed in frailty and sarcopenia. With age, mitochondrial quality control processes, including biogenesis, mitophagy, and dynamics, become dysregulated, impairing energy metabolism and muscle homeostasis. Mitochondrial dysfunction correlates with clinical biomarkers of sarcopenia and frailty, such as the decrease in walking speed and muscle strength, making it a therapeutic target for mitohormesis-based strategies aimed at preserving functional capacity. Mitohormetic agents induce reversible mitochondrial stress, triggering adaptive responses that enhance function. Among these interventions, physical exercise, particularly endurance and resistance training (RT), has been reported to be among the most effective, as it may modulate mitochondrial biogenesis, dynamics, and mitophagy through increases in peroxisome proliferator-activated receptor gamma coactivator 1-alpha (PGC-1α) and mitochondrial transcription factor A (TFAM) expression, mitochondrial deoxyribonucleic acid (mtDNA) copy number, and mitochondrial content. Chronic RT can also elevate fusion and fission markers, potentially as a compensatory mechanism to mitigate mitochondrial damage. Apart from exercise, mitohormetic compounds such as harmol and piceid are emerging as promising supplements in the aging field. By modulating mitochondrial bioenergetics and dynamics, they may complement lifestyle-based interventions to improve mitochondrial fitness and extend health span.
In Dialectical Behavior Therapy (DBT), monitoring inner tension and state dissociation is considered helpful for understanding a patient's affective and behavioral responses, especially regarding self-harming behaviors. This study aims to investigate how potentially aversive internal states, including dissociation, inner tension, and fluctuations of tension (i.e., affective instability), during the initial phase of inpatient DBT relate to self-injury, suicidal ideation, and symptom reduction. Forty-one patients with personality disorder (borderline or combined), undergoing an 8-week inpatient DBT program, used a smartphone application for ecological momentary assessment of inner tension and state dissociation. We assessed the Borderline Symptom List (BSL-23) upon admission and discharge, self-injury during the inpatient stay, and the intensity of suicidal ideation per day (diary card ratings) retrospectively from digital patient files. We employed linear mixed models to analyze the trajectories of inner tension, affective instability (i.e., the squared difference between consecutive inner tension ratings), and state dissociation over the initial 3 weeks of therapy and variability across hours, as well as daily associations with the intensity of suicidal ideation. We used logistic regression to examine whether aversive internal states are associated with the occurrence of self-injury during DBT. We found a slight reduction in state dissociation, affective instability, and high levels of inner tension during the initial 3 weeks of therapy. Analyses of intraday data showed a slight increase in state dissociation from morning to midday, and self-injury during DBT was associated with slightly higher mean dissociation levels. On days with greater intensity of suicidal ideation, elevated inner tension and state dissociation levels were found, while affective instability was not related to self-injury or suicidal ideation.
Changes in BSL-23 over therapy were not related to aversive internal states during the initial therapy phase. Aversive internal states decreased early in inpatient DBT, with midday emerging as a critical time for dissociation management. Daily increases in state dissociation and inner tension might serve as warning signs for suicidality, and our data confirm the utility of their monitoring in self-harm prevention. When state dissociation is therapeutically addressed during DBT, as in the program investigated, it does not seem to hamper improvement in borderline symptoms. Retrospectively registered: https://osf.io/dfq9y/?view_only=4c19b891bb6448009b22f60b2552bd73. The online version contains supplementary material available at 10.1186/s40479-026-00339-1.
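The affective-instability measure described in this abstract, the squared difference between consecutive ecological momentary assessment ratings, is simple to compute. The sketch below is illustrative only and not the authors' analysis code; the rating scale and values are assumptions.

```python
# Illustrative sketch: affective instability as squared successive differences
# between consecutive inner-tension ratings from an EMA diary.
# Example ratings are hypothetical, not study data.

def affective_instability(tension_ratings):
    """Squared difference between each pair of consecutive ratings."""
    return [
        (later - earlier) ** 2
        for earlier, later in zip(tension_ratings, tension_ratings[1:])
    ]

ratings = [3, 5, 2, 2, 7]                       # assumed 0-9 tension scale
instability = affective_instability(ratings)     # [4, 9, 0, 25]
mean_instability = sum(instability) / len(instability)
```

Larger squared differences weight abrupt swings more heavily than a plain mean absolute difference would, which is why this statistic is commonly used as an instability index in EMA research.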
Medical coding structures health-care data for research, quality monitoring, and policy. This study assesses the potential of large language models (LLMs) to assign International Classification of Primary Care, 2nd edition (ICPC-2) codes using the output of a domain-specific search engine. A dataset of 437 Brazilian Portuguese clinical expressions, each annotated with ICPC-2 codes, was used. A semantic search engine (OpenAI's text-embedding-3-large) retrieved candidates from 73,563 labeled concepts. Thirty-three LLMs were prompted with each query and retrieved results to select the best-matching ICPC-2 code. Performance was evaluated using F1-score, along with token usage, cost, response time, and format adherence. Twenty-eight models achieved F1-score > 0.8; 10 exceeded 0.85. Top performers included gpt-4.5-preview, o3, and gemini-2.5-pro. Retriever optimization can improve performance by up to 4 points. Most models returned valid codes in the expected format, with reduced hallucinations. Smaller models (<3B parameters) struggled with formatting and input length. Large language models show strong potential for automating ICPC-2 coding, even without fine-tuning. This work offers a benchmark and highlights challenges, but findings are limited by dataset scope and setup. Broader, multilingual, end-to-end evaluations are needed for clinical validation.
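The two-stage pipeline this abstract describes, embedding-based retrieval of candidate concepts followed by LLM selection, can be sketched for the retrieval half. This is not the study's code: the embedding vectors, concept codes, and dimensionality below are toy assumptions standing in for text-embedding-3-large output over the labeled concept set.

```python
import math

# Illustrative sketch of the retrieval stage: rank candidate ICPC-2 concepts
# by cosine similarity to the query embedding, then hand the top-k to an LLM
# for final code selection. Vectors here are tiny toy stand-ins.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, concept_vecs, k=3):
    """Return the k concept codes most similar to the query embedding."""
    ranked = sorted(concept_vecs,
                    key=lambda code: cosine(query_vec, concept_vecs[code]),
                    reverse=True)
    return ranked[:k]

concepts = {"A77": [1.0, 0.0], "R05": [0.9, 0.1], "D01": [0.0, 1.0]}
top2 = retrieve([1.0, 0.05], concepts, k=2)   # ["A77", "R05"]
```

In the actual study the candidate list, not the final answer, comes from this step; the LLM prompt then constrains the model to choose among the retrieved codes, which is what keeps hallucinated codes rare.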
Antimicrobial resistance (AMR) poses a critical global health threat, undermining the efficacy of antibiotics and complicating clinical decision-making. Although scientific literature on AMR is extensive, retrieving and synthesizing relevant evidence remains time-consuming for clinicians and researchers. Recent advances in large language models (LLMs) offer opportunities to enhance access to domain-specific knowledge. However, the diversity of available models, ranging from open-source to commercial, necessitates a systematic comparison of their performance, cost, and scalability in real-world biomedical applications. This study aims to describe the development of a retrieval-augmented generation (RAG) chatbot for AMR literature analysis and compare multiple commercial and open-source LLMs in terms of accuracy, faithfulness, response time, and cost-efficiency. A corpus of 164 peer-reviewed AMR-related articles was compiled from Google Scholar and embedded into a ChromaDB vector database using OpenAI's text-embedding-ada-002. The RAG chatbot was implemented to operate with 5 LLM backbones: GPT-4, GPT-4o, GPT-4o-mini, Claude 3.7 Sonnet, and LLaMA 4 Maverick. For each model, a temperature ablation study was performed to determine optimal performance. Evaluation metrics included correctness (pass rate and score), faithfulness, relevancy, computational cost, and latency, using a synthetic ground truth dataset generated with GPT-4. All models generated scientifically grounded responses when integrated into the RAG framework. GPT-4 achieved the highest correctness score (94.7%) but incurred the highest cost, while GPT-4o delivered nearly identical accuracy at a 9-fold lower cost and the fastest response time (3.88 s). LLaMA 4 Maverick and GPT-4o-mini offered lower accuracy but substantially reduced operational costs. Claude 3.7 Sonnet showed competitive accuracy, but the least favorable cost-performance ratio. 
Qualitative analysis revealed differences in response style, detail, and structure among models. A RAG-based chatbot can effectively support AMR research by delivering accurate, context-grounded, and scalable access to scientific literature. The comparative evaluation highlights trade-offs between performance, cost, and speed, guiding the selection of LLM architectures for clinical and research settings. Future work will focus on integrating language-specific embeddings and specialized domain agents to further enhance accuracy, adaptability, and clinical use.
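The retrieval-augmented generation loop described in this abstract grounds each answer in retrieved article chunks before the LLM call. The sketch below shows only the prompt-assembly step; the prompt wording and example chunks are assumptions, not the study's actual template or corpus.

```python
# Minimal sketch of the RAG step: retrieved document chunks are numbered,
# stitched into a context block, and prepended to the user's question so the
# LLM backbone answers only from the supplied evidence.
# Chunk texts below are illustrative, not from the study corpus.

def build_rag_prompt(question, retrieved_chunks):
    context = "\n\n".join(
        f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks)
    )
    return (
        "Answer using only the context below and cite passages by number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

chunks = [
    "Carbapenem resistance in K. pneumoniae is rising in ICU isolates.",
    "Colistin remains a last-line option despite nephrotoxicity.",
]
prompt = build_rag_prompt(
    "What are last-line options for carbapenem-resistant infections?", chunks
)
```

Grounding the prompt this way is what the abstract's faithfulness metric evaluates: a faithful answer cites only claims present in the numbered context, regardless of which of the five LLM backbones generates it.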
Open-access endoscopy relies on referrals that are manually vetted, a resource-consuming process with potential for bias. While large language models (LLMs) have demonstrated potential in medical applications, their ability to autonomously manage complex referral logistics remains understudied. We assessed whether LLMs can provide accurate recommendations on gastrointestinal endoscopy referrals. We extracted 200 multilingual endoscopy referrals with structured and unstructured medical data. We evaluated OpenAI's o3 and Google's Gemini 2.5-pro. A prompt was tuned on a set of 20 referrals and tested on the remaining 180 referrals. Eight variables were tested: procedure type, indication, need for an anesthesiologist, omission of anti-aggregants, anti-coagulants, and glucagon-like peptide-1 receptor agonists (GLP-1RAs), implantable electronic devices, and need for intensified preparation. LLM responses were benchmarked against expert gastroenterologists. Accuracy and F1 scores were analyzed using bootstrapping, and models were compared with McNemar's test. Confusion matrices were calculated. Additionally, o3 generated patient-specific visual timelines. Among 200 referrals, 88 (44%) were referred for colonoscopy and 53 (26.5%) for esophagogastroduodenoscopy; 65 (32.5%) required an anesthesiologist and 65 (32.5%) intensified preparation. Both models demonstrated comparably high performance, with o3 achieving 91%–100% accuracy and Gemini 2.5-pro achieving 89%–99% accuracy across all variables. There were no statistically significant differences between the models. Confusion matrix analysis confirmed high precision (> 95%) and specificity (> 91%) for both, indicating high reliability in resource allocation. Additionally, o3 successfully generated accurate, patient-specific visual instructions for all sampled cases. LLMs are highly accurate in processing endoscopy referrals and can generate patient-specific instructions.
These tools offer a promising solution to streamline endoscopy workflows, reduce physician burden, and improve patient communication. The online version contains supplementary material available at 10.1186/s12876-026-04636-5.
Artificial intelligence is increasingly integrated into medical decision-making, often framed as a supportive tool that enhances accuracy while leaving final judgment to clinicians. This paper argues that such framing obscures a deeper structural shift: medical action may proceed without any judgment ever occurring. AI systems do not judge; they generate outputs through statistical transduction. Clinicians, under institutional and legal pressures, may relay these outputs without regenerating them as their own reasons. When neither AI nor clinician generates judgment, decisions are enacted without a judging subject. While judgment without a judging subject may be sustainable elsewhere, medicine renders this absence unsustainable. Medical practice is characterized by irreversibility, case-specificity, meaning-demand, and relational accountability, features that presuppose judgment as a human act. Even clinically correct outcomes do not guarantee that patients will recognize a decision as right for them. When judgment disappears, informed consent persists only as a procedural ritual, simulating understanding without grounding it. To make this absence explicit, the paper introduces Metaqualia Theory (MTQ), distinguishing patient experience (Q), technical transduction (T), and judgment as meaning-generating endorsement (M). This leads to a prior ethical question: Is there an M here? This question precedes concerns about explainability and helps clarify the conditions under which consent and responsibility remain meaningful. The analysis suggests that when AI outputs are not regenerated as human judgment, their role in medical practice raises structural limits that cannot be addressed by transparency alone.
Climate change increasingly threatens plant productivity and ecosystem stability, highlighting the need for sustainable strategies that enhance plant resilience. The plant holobiont, comprising the plant and its associated rhizospheric microbiota, has emerged as a key functional unit governing plant performance under environmental stress. Among emerging non-invasive approaches, sound and vibration stimuli have been reported to influence plant growth, stress responses, and microbial activity; however, the physiological mechanisms underlying these effects remain poorly defined. This review synthesizes current evidence on sound-induced plant and microbial responses within a holobiont framework and advances a physiology-driven conceptual model linking acoustic stimuli to root function and rhizospheric processes. We propose that sound vibrations act primarily as mechanical cues perceived by plant tissues through mechanotransduction pathways, triggering calcium and hormonal signaling that modulate root architecture, metabolism, and exudation patterns. These root-level physiological changes are hypothesized to indirectly shape rhizospheric microbial community assembly and function, thereby influencing nutrient acquisition, stress tolerance, and agronomic performance. By explicitly connecting sound perception, root functional traits, and plant-mediated microbial responses, this review moves beyond a descriptive synthesis and provides a mechanistic framework to guide future experimental research. Understanding these pathways may support the development of sound-based strategies as low-impact tools for improving plant-soil-microbe interactions in sustainable agriculture.
Natural language models and chatbots, particularly OpenAI's Generative Pre-Trained Transformer architecture, have transformed human interaction with digital interfaces. The latest versions, including ChatGPT-4o, offer enhanced functionalities compared to their predecessors. This study evaluates the accuracy of ChatGPT-4, ChatGPT-4o, and Claude 3.5 Sonnet in answering questions from the Brazilian Retina and Vitreous Society certification exam. We compiled 200 multiple-choice questions from the Brazilian Retina and Vitreous Society 2018 and 2019 exams. Questions were categorized into three domains: Anatomy and Physiology of the Retina, Retinal Pathology, and Diagnosis and Treatment. Using a standardized prompt developed according to prompt design guidelines, we tested ChatGPT-4, ChatGPT-4o, and Claude 3.5 Sonnet, recording their first responses as final. Three retina specialists performed a qualitative analysis of the answers. Accuracy was determined by comparing responses to the official correct answers. Statistical analysis was conducted using chi-square tests and Cohen's Kappa. Claude 3.5 Sonnet achieved the highest overall accuracy (72.5%), followed by ChatGPT-4o (66.0%) and ChatGPT-4 (55.5%). Claude 3.5 Sonnet and ChatGPT-4o significantly outperformed ChatGPT-4 (p<0.01 and p=0.03, respectively), while no significant difference was observed between Claude 3.5 Sonnet and ChatGPT-4o (p=0.16). Model responses agreed 74.5% of the time, with a Cohen's κ of 0.47. Retinal Pathology was the best-performing domain for all models, whereas Anatomy and Physiology of the Retina and Diagnosis and Treatment were the weakest domains for Claude 3.5 Sonnet and ChatGPT-4, respectively. This study is the first to assess Claude 3.5 Sonnet, ChatGPT-4, and ChatGPT-4o in retina specialist certification exams. Claude 3.5 Sonnet and ChatGPT-4o significantly outperformed ChatGPT-4, highlighting their potential as effective tools for studying retina specialist board exams. 
These findings suggest that the enhanced functionalities of Claude 3.5 Sonnet and ChatGPT-4o offer substantial improvements in medical education contexts.
Coumarins are a privileged scaffold in medicinal chemistry, renowned for diverse therapeutic activities including antiviral, anticancer, and neuroprotective effects. Building on our previous work with 3-substituted coumarins as inhibitors of tumor-associated carbonic anhydrases, we report a novel series of thiazol-hydrazono-coumarins targeting the ATP-binding domain of topoisomerase enzymes. Seventeen compounds were synthesized and evaluated for selective cytotoxicity against HeLa cells versus WI-38 fibroblasts and for antimicrobial activity against four ESKAPE pathogens, Escherichia coli, and Salmonella typhimurium. Several derivatives showed potent antibacterial activity, with MICs as low as 0.12 μg/mL against resistant Staphylococcus aureus strains and inhibition zones up to 33 mm against Gram-negative bacteria. Compound 13 exhibited strong selectivity, with an IC50 of 26.8 μg/mL in HeLa cells and 220.7 μg/mL in WI-38 cells. The five most active compounds were studied via molecular docking and MM/GBSA to elucidate their binding to bacterial DNA gyrase, topoisomerase IV, and human topoisomerase IIα. A molecular dynamics simulation of the S. aureus DNA gyrase B-compound 13 complex revealed a novel hydrogen bond between the coumarin ring and serine-129. These findings highlight thiazol-hydrazono-coumarins as promising antibacterial leads with ancillary anticancer activity, supporting their potential in treating infections in immunocompromised cancer patients.
Large language models (LLMs) are increasingly integrated into undergraduate medical education, particularly for generating learner feedback. While early LLM studies show promise, their educational impact and usage patterns remain unclear. The objective of this study is to systematically map how LLMs are being used to generate feedback for undergraduate medical students and to examine reported educational outcomes. A scoping review was conducted following Arksey and O'Malley's framework and reported using PRISMA-ScR. We searched PubMed and Web of Science and identified 4325 records. After screening and review, 42 studies were included. Data were charted using a structured form, and outcomes were classified using Kirkpatrick levels. The 42 included studies originated mostly from Global North countries, with nearly all using OpenAI's GPT models. Feedback was delivered in two main contexts: simulated clinical encounters and text-based assessment tasks. Only 8 studies (19%) used randomized controlled trial designs. Educational outcomes were distributed as follows: 22 studies (52%) included no student data (Level 0); 10 reported student reaction (Level 1); 10 assessed learning gains (Level 2); none addressed behavior change or patient-level effects (Levels 3-4). LLM-generated feedback often matched expert feedback in short-term effectiveness but showed variable accuracy. LLM-generated feedback is being explored across a range of educational settings, showing early signs of feasibility and perceived utility. However, the evidence base is limited in rigor and generalizability. Future research should assess behavioral and patient-level outcomes. The online version contains supplementary material available at 10.1007/s40670-025-02621-3.
Public opinion, which may be influenced by personal experiences, news, and social media, can impact compliance with public health measures (PHMs) during health emergencies. Artificial intelligence (AI) tools offer opportunities to analyze public opinion in real time during health emergencies. However, their performance in accurately identifying sentiment and themes in health-related online content remains unclear. This study aimed to evaluate the performance of natural language processing-based and large language model (LLM)-based AI tools when compared to human coding for sentiment analysis, topic modeling, and thematic analysis of public health datasets. Tools were selected to reflect those available to public health analysts and decision-makers. Data were collected via Google Alerts (GA) and social media posts from X (formerly known as Twitter) relevant to COVID-19 mitigation PHMs from December 2022 to February 2023. Following relevance screening, the sentiment of the complete datasets was analyzed by a human rater, with descriptive statistics used to summarize the overall sentiment profile. Subsets of 400 GA articles and 400 tweets were manually coded for sentiment by 2 human raters. Results were compared with outputs from 5 AI tools, including VADER (Valence Aware Dictionary and Sentiment Reasoner), SentimentGI, SentimentQDAP, Microsoft Azure, and OpenAI's ChatGPT-4. Topic modeling of the GA and X datasets was conducted using latent Dirichlet allocation in R and zero-shot prompting in ChatGPT-4 and compared with manual topic summaries. Thematic analysis of positive and negative sentiment datasets was conducted by a human rater and ChatGPT-4, with outputs evaluated for proficiency and reasonableness. Of 2227 GA results and 3484 tweets, 58% (n=1238) and 71% (n=2473), respectively, were relevant to PHMs.
Human-coded sentiment analysis showed mostly neutral reporting in the news media, while social media expressed more polarized views. Across both datasets, AI tools demonstrated poor concordance with human-coded sentiment (Cohen κ < 0.5 for all tools and sentiment categories). Topic modeling with ChatGPT-4 aligned more closely with human-rated topics than latent Dirichlet allocation, and of the 20 LLM-generated thematic outputs, 13 were rated proficient and 7 were rated partially proficient. LLM outputs provided coherent, high-level summaries but lacked contextual insight. Human and LLM thematic analyses both identified themes of vaccine effectiveness, debate regarding PHMs, and public trust. Accessible AI tools demonstrate limited reliability for sentiment classification of health-related online text but show promise for rapid thematic exploration when combined with human oversight. These tools could complement traditional qualitative research in the context of health emergencies; however, they require human review to enhance the accuracy of interpretation. Further research is needed for non-English datasets.
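The concordance statistic this abstract relies on, Cohen's kappa, corrects raw agreement for the agreement expected by chance from each rater's label distribution. The sketch below is illustrative only; the label sequences are made-up examples, not study data.

```python
from collections import Counter

# Illustrative sketch of Cohen's kappa for two raters over the same items,
# e.g. a human coder versus an AI sentiment tool. Labels below are toy data.

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label at random.
    expected = sum(
        counts_a[label] * counts_b[label]
        for label in set(rater_a) | set(rater_b)
    ) / (n * n)
    return (observed - expected) / (1 - expected)

human = ["neutral", "neutral", "negative", "positive"]
tool = ["neutral", "negative", "negative", "positive"]
kappa = cohens_kappa(human, tool)
```

A kappa of 0 means no better than chance and 1 means perfect agreement; the κ < 0.5 reported above therefore indicates that much of the tools' raw agreement with human coders was attributable to chance overlap in label frequencies.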
On June 14, 2021, Philips recalled most positive airway pressure (PAP) devices distributed between 2009 and 2021 due to degrading foam with potential toxic effects, yet the impact on patients remains poorly understood. This study explored patient-reported impact among adults with obstructive sleep apnea (OSA) in North America. Between August 2023 and March 2024, adults in Canada and the United States (US) who self-reported an OSA diagnosis and PAP use completed an online survey. In Canada, recruitment was via a market research company, social media, and medical societies; in the US, through social media and medical societies. Survey domains included changes in PAP use, health, and financial effects. Responses were described by country and recall status. Of 2,953 unique survey visits in Canada, 632 responded (61.2% men, median age 50 years); of 214 US visits, 90 responded (44.4% men, median age 65.5 years). Among respondents with recalled devices (40.5% in Canada and 65.6% in the US), 18.8% and 5.1%, respectively, reduced PAP use, and 23.0% and 33.9% discontinued therapy. Emotional/mental health effects were reported by 46.1% and 74.6% in Canada and the US, respectively; physical symptoms by 35.9% and 61.0%; and financial impacts by 14.8% and 40.7%. Among those without recalled devices, 19.1% and 21.7% in Canada and the US reduced or stopped PAP, and 25.5% and 65.2% reported emotional effects. The Philips recall significantly disrupted PAP use and well-being, with about 40% of patients reducing or discontinuing therapy, highlighting the need for improved communication, patient support, and regulatory coordination. The online version contains supplementary material available at 10.1007/s44470-026-00045-3. The 2021 Philips recall affected millions of positive airway pressure (PAP) devices worldwide, but little is known about its real-world effects on patients with obstructive sleep apnea (OSA). 
Understanding these impacts is critical given the central role of PAP therapy in managing OSA and preventing adverse outcomes. In this cross-national survey of Canadian and US adults, the recall was associated with substantial disruptions in PAP use, emotional distress, physical symptoms, and financial strain, with about 40% of affected patients reducing or stopping therapy. These findings reveal the broad consequences of device recalls and emphasize the need for better communication, patient support, and regulatory coordination.
Accurate, consistent and comprehensive metadata are essential for the reuse of functional genomics data deposited in repositories such as the Gene Expression Omnibus (GEO); however, achieving this often requires careful manual curation that is time-consuming, costly and prone to errors. In this paper, we evaluate the performance of Large Language Models (LLMs), specifically OpenAI's GPT-4o, as an assistive tool for entity-to-ontology annotation of two commonly encountered descriptors in transcriptomic experiments: mouse strains and cell lines. Using over 9,000 manually curated experiments from the Gemma database and over 5,000 associated journal articles, we assess the model's ability to identify relevant free-text entries and map them to appropriate ontology terms. Using zero-shot prompting and retrieval-augmented generation (RAG) to incorporate domain-specific ontology knowledge, GPT-4o correctly annotated 77% of mouse strain and 59% of cell line experiments, and uncovered manual curation errors in Gemma for over 200 experiments. GPT-4o substantially outperformed a regular expression-based string-matching method, which correctly annotated only 6% of mouse strain experiments due to low precision. Model errors often arose from typographical mistakes or inconsistent naming in the GEO record or publication, and resembled those made by human curators. Along with annotations, our approach requests that the model output supporting context and quotes from the sources. These were typically accurate and enabled rapid curator verification. These findings suggest that LLMs are not ready to fully replace manual curators, but can already effectively support them. A human-in-the-loop workflow, in which the LLM's annotations are provided to human curators for validation, may improve the efficiency and quality of large-scale biomedical metadata curation.
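The string-matching baseline this abstract compares against can be sketched as a normalized exact-match lookup from free text to ontology labels. This is a hedged illustration, not the study's baseline implementation: the ontology labels are real mouse strain names, but the term IDs and the normalization rule are hypothetical assumptions.

```python
import re

# Illustrative sketch of a string-matching baseline for entity-to-ontology
# annotation: normalize a free-text descriptor and look it up against
# normalized ontology labels. Term IDs below are hypothetical placeholders.

def normalize(text):
    """Lowercase and collapse punctuation so minor formatting differences match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def match_term(free_text, ontology):
    """Map a free-text entry to an ontology ID by normalized exact match."""
    lookup = {normalize(label): term_id for label, term_id in ontology.items()}
    return lookup.get(normalize(free_text))

ontology = {"C57BL/6J": "TERM:0001", "BALB/c": "TERM:0002"}  # hypothetical IDs
hit = match_term(" C57BL/6J ", ontology)       # matches despite whitespace
miss = match_term("c57bl/6j mice", ontology)   # extra token defeats exact match
```

The `miss` case illustrates why such baselines fare poorly on GEO free text: any extra token, typo, or nonstandard alias breaks an exact match, which is exactly the gap the LLM-plus-RAG approach in the study is meant to close.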
A clinically euthyroid patient of Indian origin was identified with persistently undetectable TSH concentrations using our laboratory's third-generation Ultra TSH assay, raising concerns of assay interference. The discordant results, flagged by the treating physician, prompted an in-depth investigation to determine the cause of undetectable TSH values despite the patient's euthyroid clinical status. Over an eight-month period, three consecutive serum samples consistently showed TSH levels below 0.008 μIU/mL on our routine platform. To rule out analytical artifacts such as the high-dose hook (prozone) effect, heterophilic antibody interference, and other pre-analytical or analytical errors, the samples were re-evaluated under various dilution protocols and assay conditions. Reanalysis using two alternate FDA-approved TSH immunoassays (CLIA and ECLIA platforms) revealed a TSH concentration of 6.81 μIU/mL, consistent with the clinical picture and in stark contrast to our initial results. Given the persistence of this discrepancy and the suspected interference with antibody recognition, genetic analysis of the TSHB gene was performed. Sanger sequencing of the entire coding region revealed a homozygous A-to-G substitution (c.223A>G; AGA>GGA) in exon 3, resulting in an arginine-to-glycine amino acid change at codon 75 (R75G) in the mature TSH β-subunit (RefSeq: NP_000540.2). This variant was absent in a control subject of similar ethnic background with normal TSH levels on the same assay, supporting its role in the observed interference. The mutation likely alters the epitope conformation of the TSH molecule, reducing its recognition by monoclonal antibodies used in specific immunoassays without impairing its biological activity. This case underscores the importance of correlating laboratory results with clinical findings and highlights the need for cross-platform verification when discordant TSH values are encountered. 
Genetic variants affecting TSH structure can lead to misinterpretation of thyroid function, and efforts toward assay standardization and harmonization are essential to mitigate such diagnostic pitfalls.
OpenAI, Google, and Microsoft have recently developed popular large language models (LLMs) with promising clinical applications. LLMs specific to neurosurgery, such as AtlasGPT, have also been recently released. However, the comparative neurosurgical diagnostic capabilities of these models are not well studied. The aim of this study was to evaluate and compare the ability of LLMs to diagnose neurosurgical pathologies. Clinical vignettes (n = 148) extracted from a common neurosurgery case-based review textbook were stratified by subspecialty. OpenAI's ChatGPT-3.5 and ChatGPT-4, Google's Gemini, Microsoft Copilot, and AtlasGPT were prompted to provide a diagnosis: "Provide a neurosurgical diagnosis given the following history…[vignette]." Imaging was provided to LLMs capable of processing it, and all queries were run in May 2024. Diagnoses were compared with the textbook for accuracy, and errors were categorized appropriately. ChatGPT-4 was the most accurate model (74% correct), followed by AtlasGPT (63% correct), ChatGPT-3.5 (53% correct), Microsoft Copilot (48% correct), and Gemini (36% correct). Chi-square comparisons demonstrated that ChatGPT-4 was more accurate in providing clinical diagnoses than its counterparts (p = 0.005). Across all vignettes and LLMs, most errors were due to an inability to attribute a key piece of information (generally imaging data) to the diagnostic process while otherwise using logical stepwise reasoning. ChatGPT-4 offered the most accurate diagnoses when given established clinical vignettes. Adding imaging processing capabilities and relevant data significantly increased the accuracy of LLM diagnoses. LLMs can offer accurate assessments of common neurosurgical conditions but necessitate detailed prompting from clinicians. Artificial intelligence has substantial clinical potential; however, practitioners must be cautious and think critically when using these tools for diagnostic purposes.
To address the challenges of hardware integration and system complexity in laboratory automation, this work introduces a universal platform built on two key innovations. First, a standardized instruction framework unifies the control of multi-brand robots by converting complex operations into simple, tabular instructions. Second, a zero-code natural language interface, powered by OpenAI's GPTs platform, translates user commands into executable workflows, which are reviewed by trained domain scientists before execution, with an automated validation mechanism providing an additional safeguard. The platform's performance was validated through complex, multi-device experiments, including an automated Cell Counting Kit-8 (CCK-8) cell viability assay, which yielded results highly consistent with those of manual operations. With a 99.0% success rate in translating natural language test instructions, this work demonstrates a practical framework to assist domain scientists in multi-robot laboratory automation.