20 results found
Critical appraisal of the studies included in a systematic review is essential to ensure that results of the review are properly interpreted. Critical appraisal is also one of the most difficult steps in research reviews. Structured risk of bias (ROB) tools can facilitate critical appraisal, but these tools vary in content and structure, and there are unresolved issues in applications of these tools. Assessment of risk of reporting biases, such as outcome reporting bias (ORB) and analysis reporting bias (ARB), is especially difficult, given the lack of availability of the raw materials (such as prospectively registered protocols or analysis plans) needed to properly assess the risk of selective reporting and selective non-reporting of outcomes and analyses. To identify methods used in recent Campbell systematic reviews of intervention effects to assess the risk of selective reporting biases in included studies. We searched the Campbell Library website, using a structured online form developed for this purpose, with filters for publication dates (all dates in 2020 through April 2023) and type of document (completed reviews only). We included systematic reviews (SRs) of primary studies of intervention effects published in Campbell Systematic Reviews between 1 January 2020 and 30 April 2023. Of the 59 SRs published from 2020 through early 2023, 51 were eligible for our review. Forty-nine of these reviews included relevant studies of intervention effects. From these 49 reviews, we extracted data on methods used to assess risk of reporting biases (ORB and ARB), broader risk of bias (ROB) or study quality assessments, and adherence to 12 mandatory methodological standards. Data extraction and coding were performed in duplicate, by pairs of team members who worked independently, and any discrepancies were resolved by coders or by the review team. Results were compiled in a spreadsheet, which was used to generate tables, graphics, and a narrative summary. 
Reporting biases were defined and assessed in diverse and sometimes idiosyncratic ways in recent Campbell systematic reviews of intervention effects. Most (40 of 49) reviews conducted some structured assessment of reporting biases, but many did not report results of these assessments. Explanation and documentation of ORB and ARB assessments was missing in more than half (28) of the reviews. Only 12 reviews provided full documentation for their ORB/ARB assessments. Overall, we found that reviewers' descriptions of their assessments of reporting biases were often incomplete and inconsistent across studies. In many cases, these assessment practices did not reflect current understanding of the prevalence of selective reporting and ways in which these biases can undermine the validity of and confidence in results of research reviews. This observation is consistent with the fact that most reviews did not consider the potential impacts of risks of bias on the credibility of their results. None of the recent reviews appeared to meet all (12) of the mandatory methodological standards we assessed. On average, these reviews failed to meet 4.9 of these standards (SD = 2.3); almost three-quarters (35) of the reviews failed to meet four or more standards. Recent Campbell reviews did not consistently appraise or document risks of reporting biases in the studies they included. Assessment of risk of reporting biases is difficult, given the lack of availability of prospective, public protocols or analysis plans for most studies. Reviewers' failure to adhere to Campbell's mandatory methodological standards and editors' apparent inability to enforce these standards can be understood as functions of the contexts in which systematic reviews are highly desirable, highly cited, and under-resourced. We provide a decision tree to guide reviewers' assessments of reporting bias, along with nine recommendations for improving these practices in systematic reviews of intervention effects.
Our recommendations include more deliberate use of eligibility criteria to eliminate studies that cannot provide valid answers to review questions, thorough documentation of reviewers' assessment processes and ROB ratings, and explicit use of ROB ratings in interpretation of results.

Campbell systematic reviews often lack clear assessment of selective reporting bias

The review in brief: Many recent Campbell systematic reviews do not clearly or consistently assess or report selective reporting bias, which limits confidence in review findings.

What is this review about?
Systematic reviews synthesize evidence from multiple studies to inform policy, practice, and future research. The credibility of these reviews depends in part on whether included studies report results fully and transparently. Selective reporting bias occurs when researchers report some outcomes or analyses but not others, often favoring statistically significant or positive results. This includes outcome reporting bias, where some measured outcomes are not reported, and analysis reporting bias, where only selected analyses are reported. These practices can distort the evidence base and may lead to biased conclusions in systematic reviews. This review examines how recent Campbell systematic reviews of intervention effects assess the risk of selective reporting bias in included studies. It also examines whether these reviews adhere to Campbell's mandatory methodological standards related to risk of bias.

What is the aim of this review?
This Campbell systematic review examines methods used to assess selective reporting bias in Campbell systematic reviews of intervention effects. The review summarizes evidence from 51 Campbell systematic reviews published between January 2020 and April 2023, including 49 reviews that included studies of intervention effects.

What are the main findings of this review?

What studies are included?
The review includes Campbell systematic reviews from several coordinating groups, including crime and justice, social welfare, education, and international development. Most reviews include both randomized and non-randomized studies. Reporting of methods and results varies considerably across reviews.

Do Campbell reviews assess selective reporting bias?
Most reviews include some assessment of selective reporting bias. However, approaches vary widely. About one in five reviews do not assess selective reporting bias at all. When selective reporting is assessed, fewer than one-third of reviews provide complete documentation to support judgments.

How well is selective reporting bias assessed and documented?
Descriptions of how selective reporting bias is assessed are often incomplete or unclear. Many reviews do not explain how judgments are made, do not clearly distinguish selective reporting from other sources of bias, or do not use study protocols or analysis plans to inform assessments. In some cases, a lack of evidence of selective reporting is treated as evidence that selective reporting is unlikely.

Do reviews meet Campbell methodological standards?
None of the reviews meet all mandatory Campbell methodological standards examined. On average, reviews fail to meet nearly five of 12 required standards related to assessment of reporting biases. Common shortcomings include limited documentation of risk-of-bias judgments, use of overall quality scores rather than domain-specific assessments, and limited or no consideration of how risk of bias may affect review findings.

What do the findings of this review mean?
Inconsistent assessment and reporting of selective reporting bias reduce confidence in the findings of many systematic reviews. Clearer methods, better documentation, and more consistent use of study protocols could strengthen assessments of selective reporting bias. The review identifies examples of good practice and provides guidance to support more transparent and rigorous assessments in future systematic reviews.

How up-to-date is this review?
The review authors searched for studies published up to April 2023. This Campbell Systematic Review was published in 2026.

Note: the first draft of this summary was generated by ChatGPT (version GPT 5.2 Instant, January 20, 2026, OpenAI, https://chat.openai.com) then edited by the authors.
Large language models (LLMs) and, more recently, large reasoning models (LRMs) have rapidly garnered significant interest for application in psychiatry and behavioral health. However, recent studies have identified significant shortcomings and potential risks in the performance of LLM-based systems, complicating their application to psychiatric diagnosis. Two promising approaches to addressing these challenges and improving the efficacy of these models are simulated reasoning (SR) and self-verification (SV), in which additional "reasoning tokens" are used to guide model output, either during or after inference. We aimed to explore how the use of SR (via LRMs) and SV (via supplemental prompting) affects the psychiatric diagnostic performance of LLMs. A total of 106 case vignettes and associated diagnoses were extracted from the DSM-5-TR (Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition, Text Revision) Clinical Cases book, with permission. One LLM and one LRM were selected from the latest available model generation for each of the two vendors studied (OpenAI and Google). Two inference approaches were developed: a Basic approach that directly prompted models to provide diagnoses and an SV approach that augmented the Basic approach with additional prompts. All case vignettes were processed by the two LLMs, two LRMs, and two inference approaches, and diagnostic performance was evaluated using the sensitivity and positive predictive value (PPV). Binomial generalized linear mixed models were used to test for significant differences between the model vendors (OpenAI, Google), type (LLM, LRM), and the addition of an SV prompt. All vignettes were successfully processed by each model and inference approach. Sensitivity ranged from 0.732 to 0.817, and PPV ranged from 0.534 to 0.779. The best overall performance was found in the o3-pro LRM using SV, with a sensitivity of 0.782 and a PPV of 0.779. No statistically significant fixed effects were found for sensitivity.
For PPV, statistically significant effects were found for prompt type (SV, P=.007) and model type (LRM, P=.009). No significant interaction effects were identified. We found that both SR and SV yielded statistically significant improvements in the PPV, without significant differences in the sensitivity. The addition of the manually specified SV prompt improved the PPV even when simulated reasoning was used. This suggests that future efforts to apply language models in behavioral health could benefit from both manually crafted reasoning prompts and automated SR.
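The two metrics used above can be illustrated with a minimal Python sketch. It assumes that each vignette yields a set of predicted diagnoses and a set of reference diagnoses, and that counts are pooled across vignettes; the abstract does not describe the study's matching rules, so the function name, the pooling, and the example diagnoses are all illustrative assumptions.

```python
def sensitivity_and_ppv(cases):
    """Pool true/false positives and false negatives across vignettes.

    cases: iterable of (predicted, reference) pairs, each a collection of
    diagnosis labels. Matching by exact label equality is an assumption.
    """
    tp = fp = fn = 0
    for predicted, reference in cases:
        predicted, reference = set(predicted), set(reference)
        tp += len(predicted & reference)   # diagnoses correctly predicted
        fp += len(predicted - reference)   # predicted but not in the reference
        fn += len(reference - predicted)   # in the reference but missed
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv


# Hypothetical toy data, not taken from the study.
example_cases = [
    ({"major depressive disorder", "generalized anxiety disorder"},
     {"major depressive disorder"}),
    ({"schizophrenia"}, {"schizophrenia", "cannabis use disorder"}),
]
sens, ppv = sensitivity_and_ppv(example_cases)
```

In this framing, an extra diagnosis lowers PPV without touching sensitivity, which is one way a self-verification step that prunes unsupported diagnoses could improve PPV alone.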
No abstract available.
Large language models (LLMs) are increasingly consulted for information about cleft lip and palate (CLP), yet the reliability of their outputs across clinical domains has not been evaluated. This study aimed to compare the quality of CLP-related information generated by GPT-4o and Gemini 2.5 Pro across multiple thematic domains using a validated quality instrument and a reliability-first analytic framework. Fifty-four standardized CLP questions across six domains were submitted to GPT-4o (OpenAI) and Gemini 2.5 Pro (Google DeepMind) on 25 September 2024 via their public interfaces, using new, history-free sessions and default settings, yielding 108 responses. Three independent, CLP-experienced raters scored each response using the Global Quality Score (GQS; 1-5 scale assessing accuracy, completeness, and clinical usefulness). Before comparing models, we applied a reliability-first filter: only domains where all three raters showed substantial agreement (Fleiss' kappa [κ] ≥ 0.60) were included in statistical comparisons. Domains that failed this threshold were analyzed qualitatively to identify the source of disagreement. A descriptive taxonomy of errors was developed for low-scoring responses. Three domains met the reliability threshold (General Care Information, General Cleft Information, and Pre-Treatment Information; 30 paired questions). Both models performed at a high and practically equivalent level: GPT-4o median GQS 4.33 (IQR 4.00-5.00) versus Gemini 2.5 Pro 5.00 (IQR 4.00-5.00); the difference was not statistically significant (Wilcoxon V = 139.00, p = 0.691; Hodges-Lehmann median difference 0.00, 95% CI -0.33 to 0.67). Three domains were excluded because rater agreement was insufficient; qualitative review showed this reflected genuine clinical practice variation rather than clear model errors. The most common inaccuracies were overgeneralization of outcomes, outdated surgical timing, and omission of multidisciplinary team roles. 
Both models provided high-quality CLP information in domains supported by clinical consensus, indicating they may serve as useful adjuncts for general patient and family counseling. Clinicians should, however, verify any treatment-specific content against current institutional protocols before relaying it to patients. Future research should assess readability, alignment with health literacy, and patient comprehension of AI-generated CLP information.
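The reliability-first filter described above can be sketched in Python as follows. This is a minimal illustration, assuming each response receives one categorical score per rater and that Fleiss' kappa is computed per domain; the function names and the data layout are hypothetical, and the authors' actual software is not stated in the abstract.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list with one label per rater."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")
    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Proportion of agreeing rater pairs for this item.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    p_observed = agreement_sum / n_items
    total = n_items * n_raters
    p_expected = sum((t / total) ** 2 for t in category_totals.values())
    if p_expected == 1.0:  # only one category was ever used
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


def reliable_domains(domain_ratings, threshold=0.60):
    """Keep only domains whose kappa meets the threshold (0.60 in the abstract)."""
    return [d for d, r in domain_ratings.items() if fleiss_kappa(r) >= threshold]
```

Domains that fall below the threshold are then set aside for qualitative review instead of being entered into the statistical comparison.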
Background: Data extraction for systematic reviews is highly resource-intensive. This study evaluated four frontier large language models (LLMs) on complex structured metadata extraction from specialized neuroimaging artificial intelligence (AI) literature to determine their performance in automated evidence synthesis. Methods: We compared Google Gemini 3 Pro Preview, Anthropic Claude Opus 4.5, Perplexity Sonar Pro, and OpenAI GPT 5.2. Using a standardized prompt, each model extracted 22 variables from 91 peer-reviewed neuroimaging AI articles. The variables were stratified into low-, medium-, and high-complexity tiers. Performance was measured as exact-match accuracy against a consensus-based expert ground truth. Results: The overall exact-match accuracy was moderate. Gemini 3 Pro Preview achieved the highest overall rate (56.4%), followed by Sonar Pro (52.1%), Claude Opus 4.5 (51.3%), and GPT 5.2 (46.5%). Gemini significantly outperformed all other models (p < 0.001). Performance declined sharply as variable complexity increased. Across models, the accuracy was 88.9-92.9% for low-complexity categorical fields, 47.0-63.3% for medium-complexity text extraction, and 2.7-15.5% for high-complexity variables requiring clinical judgment or multi-section synthesis. The most common type of error was misclassification. All four models scored 0% on the "main performance metric" variable, but this reflected a representational mismatch with the ground truth rather than extraction failure, indicating that exact-match accuracy underestimates the true semantic performance. Conclusions: Frontier LLMs can effectively automate the retrieval of simple categorical data but struggle with complex methodological variables. Although extraction can be fully automated for low-complexity fields, human review remains essential for context-dependent variables that require clinical judgment.
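To make the scoring concrete, here is a minimal Python sketch of exact-match accuracy grouped by complexity tier. The normalization step (lowercasing and collapsing whitespace) and the record layout are assumptions for illustration; the abstract does not specify how strings were compared.

```python
from collections import defaultdict


def normalize(value):
    """Lowercase and collapse whitespace before comparing (an assumed rule)."""
    return " ".join(str(value).strip().lower().split())


def exact_match_by_tier(records):
    """records: iterable of (tier, extracted_value, ground_truth_value)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for tier, extracted, truth in records:
        totals[tier] += 1
        hits[tier] += int(normalize(extracted) == normalize(truth))
    return {tier: hits[tier] / totals[tier] for tier in totals}


# Hypothetical records, not data from the study.
example_records = [
    ("low", "MRI", "mri"),
    ("low", "CT", "MRI"),
    ("high", "AUC = 0.91", "0.91"),  # same information, different representation
]
accuracy = exact_match_by_tier(example_records)
```

The last record shows how a correct extraction can still count as a miss under exact matching, which is the kind of representational mismatch the abstract describes.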
There is growing concern that artificial intelligence (AI) may diminish the quality of human relationships. However, in a context of widespread social importance (empathetic conversations between doctors and patients), AI can actually improve human conversational skills, potentially enhancing professional relationships. Recent advances in AI allow for realistically role-prompted counterparts for practicing professional conversations, enabling relational learning without the need for human counterparts. This study aimed to show the effectiveness of AI chatbots for learning professional communicative skills in medical education. Specifically, we hypothesized that a single conversation with an AI chatbot improves communication skills in medical students across 4 different conversational competencies. We conducted a quasi-experimental intervention study involving 4 distinct role-prompted scenarios (ie, shared decision-making, motivational interviewing, sexually transmitted diseases, and breaking bad news), each designed to elicit in-depth empathic conversational skills aligned with key learning objectives in medical curricula. Students rated their competence for the 4 scenarios before and after a conversation with GPT-4o (OpenAI) using default settings, without fine-tuning. We expected higher perceived communication competence (PCC) in their conversation topic after the interaction compared with before the interaction in a 2-sided paired t test. Participants received AI-generated feedback, which they rated regarding adequacy. Post hoc analyses addressed gender and case effects, feedback adequacy, and baseline PCC values. This study shows that a role-prompted GPT chatbot improves PCC in 162 medical students after a single conversation with a mean of 13 (SD 4.8; 95% CI 12-14) prompt-response pairs.
We found an increase in PCC with a mean difference of 0.94 (SD 1.64; 95% CI 0.69-1.20; Cohen d=0.58) from 5.89 (95% CI 5.55-6.23; scale 0-10) before the conversation to 6.83 (95% CI 6.55-7.12) after the conversation across 4 different patient role prompts. Furthermore, we found participants rating AI feedback of their conversation to be useful (mean 7.92, SD 1.61; 95% CI 7.67-8.17; scale 0-10), but feedback adequacy did not correspond to PCC increase (r=0.08; P=.32). Our results demonstrate how role-prompted GPT increases self-assessed communication competencies, introducing a novel tool for teaching relational learning. Our results present a starting point for using AI in education, particularly teaching communication in professional roles. On the basis of our findings in medical education, we anticipate further studies to investigate conversational training between lawyers and clients, marketers and customers, or managers and employees. Our research thus has implications for any field with a need for conversational training and relational learning.
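The reported paired effect can be approximately reconstructed from the summary statistics. The sketch below is a rough check in Python, assuming the standardized effect is the mean difference divided by the SD of the differences and that a normal critical value of 1.96 is acceptable for n = 162; the authors may have used a t critical value or a different variant of Cohen d, and the inputs are rounded, so small deviations from the published figures are expected.

```python
import math

n = 162            # number of students (from the abstract)
mean_diff = 0.94   # mean pre-post difference in PCC (from the abstract)
sd_diff = 1.64     # SD of the differences (from the abstract)

# Standardized paired effect: mean difference over SD of differences (assumed variant).
d_paired = mean_diff / sd_diff

# Approximate 95% CI for the mean difference using a normal critical value.
standard_error = sd_diff / math.sqrt(n)
ci_low = mean_diff - 1.96 * standard_error
ci_high = mean_diff + 1.96 * standard_error

print(f"d = {d_paired:.2f}, 95% CI {ci_low:.2f} to {ci_high:.2f}")
```

The result lands close to the reported d=0.58 and 95% CI 0.69-1.20, which suggests the published interval describes the mean difference on the 0-10 scale.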
We demonstrate how Large Language Models (LLMs) accelerate biomedical data harmonization through automated Common Data Element (CDE) generation. We processed 31 datasets, including clinical taxonomies and research data dictionaries, through OpenAI's Generative Pre-trained Transformer 4 (GPT-4; API model gpt-4-0613), generating comprehensive metadata for each element using a template-based system. Subject-matter experts validated the outputs, finding that 94% of generated metadata fields required no revision overall, with an unweighted accuracy of 83.8% for semi-structured sources. The process was dramatically faster than manual approaches. Our system uses ElasticSearch with weighted field matching to identify semantic equivalences between variables, avoiding duplicate CDEs while building a standardized repository. Testing with Alzheimer's Disease Neuroimaging Initiative (ADNI) and Global Parkinson's Genetics Program (GP2) datasets showed that 32.4% of previously unseen headers were successfully mapped to our CDEs, with interoperability scores averaging 53.8/100 based on matching, completeness, and compliance metrics. This approach automates the most tedious aspects of data integration, reducing barriers to cross-study collaboration in biomedical research.
Health care systems are increasingly considering large language model (LLM)-based chatbots for vaccine communication, but evidence that they improve durable, behaviorally relevant outcomes beyond existing health materials is limited. To examine whether brief, multiturn interactions with an LLM chatbot increase parental intention to vaccinate children against human papillomavirus (HPV) compared with no intervention and government public health materials and to assess whether any effects persist. This randomized clinical trial was conducted online among individuals in the US, Canada, and the UK from March 3 to May 25, 2025, with follow-up at 15 and 45 days. Eligible participants were parents 18 years or older with at least 1 HPV vaccine-eligible child (aged 11-17 years in the US and Canada and 12-17 years in the UK) who had not received HPV vaccination or whose vaccination status was unknown. Participants were randomized to (1) no-message control, (2) government public health materials matched to country (minimum 3-minute exposure), or (3) a 3-minute interaction with an LLM chatbot (OpenAI's GPT-4o) prompted to encourage HPV vaccination using a default response style or a shorter conversational style. The primary outcome was self-reported likelihood of vaccinating the child against HPV within 12 months (0- to 100-point scale, with 0 indicating extremely unlikely and 100 indicating extremely likely) measured immediately after intervention. Prespecified follow-ups included vaccination intent and self-reported vaccination at 15 and 45 days. A total of 1297 participants (mean [SD] age, 42.84 [6.93] years; 935 [72.1%] female) were randomized. Compared with no intervention, public health materials increased immediate vaccination intent (Cohen d = 0.53; 95% CI, 0.36-0.70), as did the default chatbot (d = 0.48; 95% CI, 0.30-0.65) and conversational chatbot (d = 0.33; 95% CI, 0.17-0.49). 
At 45 days, neither chatbot increased intent relative to controls, whereas public health materials maintained modest effects. No intervention increased self-reported vaccination uptake. The findings of this randomized clinical trial suggest that well-designed public health messaging may match or exceed the impact of short chatbot conversations for HPV vaccine promotion. ClinicalTrials.gov Identifier: NCT07132125.
The Human Phenotype Ontology (HPO) provides a unified framework cataloguing over 17,500 phenotypic abnormalities across more than 8,600 rare diseases, defining hierarchical relationships between them. For example, it classifies both missing arms and missing legs as abnormalities of the limb. This structure enables phenome-wide analyses, including the prioritisation of phenotypes as candidates for gene therapy. However, the HPO currently lacks sufficient metadata describing the clinical severity of these phenotypes. Manual expert curation at this scale would be prohibitively labour-intensive, creating a need for automated approaches to systematically annotate phenotypic severity. GPT-4, a large language model (LLM) developed by OpenAI, was employed to annotate the severity of all phenotypic abnormalities catalogued in the HPO. Severity was operationalised using nine clinical characteristics: congenital onset, reduced fertility, sensory impairments, impaired mobility, immunodeficiency, physical malformations, cancer, intellectual disability, and death. Each characteristic was further qualified by frequency of occurrence across four levels: never, rarely, often, and always. To assess annotation quality, GPT-4's outputs were benchmarked against ground-truth labels embedded within the HPO itself. For instance, phenotypes residing in the "Cancer" HPO branch were expected to be annotated as cancer-causing. A novel severity scoring system was then developed that integrates both the nature of each clinical characteristic and its frequency of occurrence. Benchmarking demonstrated strong performance across all clinical characteristics, with true positive recall rates ranging from 89% to 100% (mean = 97%). This indicates that GPT-4 can replicate expert-level curation with high fidelity.
The resulting severity scoring system produced quantitative severity metrics for phenotypic abnormalities across the HPO, incorporating both the type and frequency of associated clinical characteristics. These findings demonstrate that LLMs can automate the large-scale curation of clinical metadata with a high degree of accuracy, substantially reducing the burden of manual expert annotation. The severity metrics generated here provide a foundation for systematically ranking human phenotypes by their impact on health and quality of life, enabling more principled prioritisation of targets for therapeutic intervention, particularly in the context of rare diseases where evidence is sparse and resources for curation are limited. Future work may extend this framework to incorporate additional clinical dimensions or validate annotations against independent clinical datasets.
Large language models can synthesize biomedical knowledge, parse vast amounts of data, and generate code, positioning them as promising tools for biomarker discovery from high-throughput omics data. Here, we benchmark six models from OpenAI, Anthropic, and Google on plasma cell-free RNA datasets spanning three clinical cohorts: Kawasaki disease versus multisystem inflammatory syndrome in children, active tuberculosis versus symptomatic respiratory controls, and myalgic encephalomyelitis/chronic fatigue syndrome versus sedentary controls. We evaluate literature-guided nomination of diagnostic gene panels for downstream machine learning and autonomous construction of end-to-end classifiers from raw count matrices to held-out test predictions. Despite prompt adherence issues, model-nominated panels recapitulate canonical immune pathways and outperform random panels across cohorts, even matching differential gene expression baselines in the tuberculosis cohort. End-to-end automation proves feasible but is model- and task-dependent. One model approaches conventional performance for Kawasaki disease versus multisystem inflammatory syndrome in children, whereas performance decreases for tuberculosis and myalgic encephalomyelitis/chronic fatigue syndrome cohorts. These findings delineate current capabilities and limitations of large language models in diagnostics and open a path for their future use in biomarker discovery.
Background/Objectives: Psoriasis is a chronic immune-mediated inflammatory disease increasingly recognized as a systemic disorder associated with significant metabolic and cardiovascular comorbidities. Among these, obesity (defined as BMI ≥ 30 kg/m²) plays a pivotal role, acting both as a risk factor for psoriasis development and as a modifier of disease severity, clinical phenotype, and therapeutic response. The relationship between psoriasis and obesity is bidirectional and sustained by shared inflammatory and metabolic pathways. This review aims to provide a comprehensive and updated synthesis of the epidemiological association between psoriasis and obesity, to elucidate the underlying pathophysiological mechanisms, and to discuss the clinical and therapeutic implications of excess body weight in psoriasis management. Methods: A narrative review of the literature was conducted, including epidemiological studies, mechanistic research, clinical trials, and real-world evidence addressing the interplay between psoriasis and obesity. Relevant data were identified from peer-reviewed publications focusing on inflammatory pathways, metabolic dysfunction, cardiovascular risk, and treatment outcomes in obese patients with psoriasis. The graphical figures included in this manuscript were created with the assistance of a large language model-based image-generation tool, ChatGPT-5 by OpenAI, using author-defined prompts. The prompts requested schematic medical illustrations summarizing the pathophysiological links between obesity and psoriasis, including adipose tissue dysfunction, adipokine imbalance, systemic inflammation, and activation of the IL-23/Th17 axis. For the therapeutic algorithm, the prompt requested a stepwise clinical flowchart for obese patients with psoriasis, including BMI assessment, comorbidity screening, universal weight-management measures, psoriasis severity stratification, obesity-adapted biologic selection, and management of suboptimal response.
The generated images were subsequently reviewed, edited, and approved by the authors to ensure scientific accuracy, clarity, and consistency with the manuscript content. Results: Epidemiological evidence consistently demonstrates a higher prevalence of obesity among patients with psoriasis, with obesity independently associated with increased disease severity. Shared mechanisms include adipose tissue-driven cytokine production, dysregulated adipokine secretion, insulin resistance, endothelial dysfunction, and activation of the IL-23/Th17 axis, collectively contributing to systemic inflammation and accelerated atherogenesis. Obesity negatively impacts the efficacy, pharmacokinetics, and long-term drug survival of conventional systemic agents and biologic therapies, leading to suboptimal clinical outcomes. Conclusions: Obesity is a key determinant of psoriasis burden, influencing disease expression, comorbidities, and therapeutic response. Integrating weight reduction strategies into personalized psoriasis management may improve both dermatological outcomes and overall cardiometabolic health, supporting a holistic approach to patient care.
Introduction: Large language models (LLMs) are used for biomedical text processing, but decisions are often hard to audit. We evaluated whether enforcing a mechanically checkable quote affects performance for trial eligibility-scope classification from abstracts. Methods: We used 200 randomized controlled trials and provided models with the title and abstract. Trials were labeled according to whether they allowed the inclusion of patients with localized and/or metastatic disease. Flagship models from three vendors (OpenAI, Google, and Anthropic) were queried in two conditions: label only, and label plus a verbatim supporting quote. Models could abstain if they deemed that the abstract did not contain sufficient information. Each condition was repeated three times per abstract. Quotes were mechanically validated as exact substrings, and a separate judge step used an LLM to rate whether each quote supported the assigned label. Results: Evidence requirements modestly reduced coverage, i.e., the share of valid, non-abstained outputs (GPT-5.2 86.2% to 84.3%, Gemini 3 Flash Preview 98.3% to 92.8%, Claude Opus 4.5 96.0% to 94.5%), by increasing abstentions and, for Gemini, invalid outputs. Macro-F1 remained high but changed by model (slight gains for GPT-5.2 and Gemini, decrease for Claude). Labels were stable across repetitions (Fleiss' kappa 0.829 to 0.969). Mechanically valid quotes occurred in 83.3% to 91.2% of runs, yet only 48.0% to 78.8% of evidence-bearing predictions were judged semantically supported. Restricting to supported predictions increased macro-F1 at the cost of lower coverage. Conclusion: Substring-verifiable quotes provide an automated audit trail and enable selective, higher-trust automation when applying LLMs to biomedical text processing. However, this approach introduces new failure modes and trades coverage for verifiability in a model-dependent way.
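The mechanical quote check described above amounts to an exact substring test. A minimal Python sketch follows, assuming the model returns a label and a quote and that the check runs against the concatenated title and abstract; the function names, the output dictionary layout, and the example text are hypothetical, not the authors' pipeline.

```python
def quote_is_exact_substring(quote, source_text):
    """True only if the non-empty quote appears verbatim in the source text."""
    return bool(quote) and quote in source_text


def audit_prediction(prediction, source_text):
    """Attach a mechanical validity flag to one model output (hypothetical layout)."""
    quote = prediction.get("quote", "")
    return {
        "label": prediction.get("label"),
        "quote": quote,
        "quote_valid": quote_is_exact_substring(quote, source_text),
    }


# Invented example text for illustration only.
abstract = "Patients with metastatic disease were randomized to drug A or placebo."
ok = audit_prediction(
    {"label": "metastatic", "quote": "Patients with metastatic disease"}, abstract
)
bad = audit_prediction(
    {"label": "metastatic", "quote": "patients with metastatic cancer"}, abstract
)
```

An exact test is deliberately strict: differences in case, whitespace, or wording fail the check. It also cannot tell whether a valid quote actually supports the label, which is why the study adds a separate judging step.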
Clinical notes contain a vast amount of potentially useful information about adverse drug event (ADE) signals that never reach pharmacovigilance databases. Traditional rule-based or sentence-level models often miss subtle causal cues and generate excess false positives. To build a large language model (LLM) pipeline that reads entire electronic health record (EHR) notes, identifies drug-event pairs with a "reasonable possibility" of causation, and infers other important properties of each ADE, such as whether the event is "serious" or "unlabeled". We adopted a two-pass workflow using the model "OpenAI o1": Pass 1 screens each note for ADEs; Pass 2 adds 20 structured fields. A diverse sample of 372 deidentified notes from physicians and pharmacists at the University of California, San Francisco (UCSF), from 31 specialty/setting cells, yielded 191 ADEs. One medical expert reviewed each ADE for validity, seriousness, and label status. Another expert manually curated a gold-standard ADE set for 100 of the 372 notes to provide an estimate of "recall". A third expert met with the first expert to arrive at a consensus on the validity of LLM ADEs validated by the first expert but not found in the gold-standard ADEs, giving us an estimate of "accuracy". Of 191 ADEs, 180 were true positives (94.2% precision) with 84.1% recall (F1 = 88.9%). Seriousness was correct in 100% and label status in 93.9% of cases. Medical Dictionary for Regulatory Activities (MedDRA) lowest level term (LLT) coding was correct in 92.5% of valid ADEs; errors were mostly non-existent LLTs. Of all valid ADEs, 12.2% met FDA "serious" criteria, 15.0% were unlabeled, and 8.9% were "failure of efficacy." On the first pass, 84.9% of notes contained no ADEs, keeping inference costs to US $0.18 per note and US $0.35 per validated ADE. The model inferred some ADEs not mentioned by physicians, e.g., tacrolimus-associated hypomagnesemia.
While the LLM evaluated in this study is not perfect, it can transform free-text EHR notes into ADEs with 94% accuracy, and such data, when statistically analyzed in aggregate, can lead to new safety signals of potential drug side effects. Integrated with platforms such as Sentinel in the USA or Darwin EU in the European Union, this approach could rapidly surface rare, serious, and unlabeled ADEs for further regulatory analysis.
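The reported precision and F1 can be checked from the figures in the abstract. The sketch below uses the standard definitions (precision = true positives / all predicted positives, F1 = harmonic mean of precision and recall); the function name is illustrative, and recall is taken as reported because the abstract does not give its underlying counts.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


true_positives = 180   # validated ADEs, from the abstract
predicted = 191        # all ADEs proposed by the model, from the abstract
recall = 0.841         # reported recall; underlying counts are not given

precision = true_positives / predicted
f1 = f1_score(precision, recall)

print(f"precision = {precision:.1%}, F1 = {f1:.1%}")
```

With these inputs the computed values round to the 94.2% precision and 88.9% F1 stated in the abstract.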
Objective: To develop and evaluate an edge-hosted Large Language Model (LLM)-assisted system for automated Neonatal Intensive Care Unit (NICU) discharge summary generation using an evidence-grounded, field-level evaluation framework. Methods: This implementation and evaluation study was conducted in a Level III NICU in India. Longitudinal patient records were constructed from integrated bedside physiologic data (ARCHITECT) and a structured electronic medical record (EMR) platform. Although an embedded audio-video module was present, it was not used in this study. Automated discharge summaries were generated by MORPHEUS, an edge-hosted orchestration pipeline running on NVIDIA Jetson AGX Orin hardware with JetPack 6.2. Local orchestration, preprocessing, and workflow execution were performed on the edge device, while language generation inference was performed using the OpenAI gpt-4o-mini API. Documentation quality was assessed with an LLM-based evaluator guided by a clinician-defined rubric comprising 72 fields organized across 14 section contexts and scored on five dimensions: clinical accuracy, completeness, actionability, coherence, and non-hallucination. Paired, field-level comparisons were performed against clinician-authored summaries. Of 549 NICU admissions screened between 1 October 2024 and 3 November 2025, 401 met the inclusion criteria for evaluation. Prompt refinement was performed iteratively using omission-derived feedback without model weight updates. Results: Across 401 evaluated admissions, MORPHEUS-generated summaries demonstrated higher rubric-based scores and lower omission burden than clinician-authored summaries within the structured evaluation framework used in this study, with mean scores of 0.93 versus 0.75 for accuracy, 0.91 versus 0.67 for completeness, 0.93 versus 0.72 for actionability, 0.94 versus 0.74 for coherence, and 0.95 versus 0.78 for non-hallucination, with the largest absolute advantage observed for completeness.
Error taxonomy analysis demonstrated fewer omissions, unsupported assertions, and contradictions in AI-generated summaries than in clinician-authored summaries. Iterative prompt refinement was associated with directional improvement across quality dimensions and reduced omission burden, with omission rate per patient decreasing from 2.484 to 1.807 in the later iteration. Conclusions: An edge-hosted LLM-assisted pipeline can generate NICU discharge summaries that meet or exceed clinician-authored documentation quality under a reproducible, clinician-grounded evaluation framework. These findings support the feasibility of deploying edge-orchestrated generative AI systems for high-stakes neonatal clinical documentation using a clinician-grounded field-level evaluation framework.
Approximately one billion people worldwide live with a mental disorder, yet access to minimally adequate treatment remains low. Against this backdrop, general-purpose generative artificial intelligence (GenAI) systems have rapidly expanded into informal mental health support. OpenAI disclosed that roughly 1.2 million ChatGPT users weekly display indicators of suicidal planning or intent, and meta-analytic evidence supports modest efficacy for AI chatbots in reducing common mental health symptoms. Nevertheless, these tools may pose substantial clinical risks for vulnerable individuals. Two interlocking mechanisms, algorithmic sycophancy and anthropomorphic projection, converge to produce self-reinforcing engagement-validation loops capable of entrenching maladaptive beliefs and contributing to clinical risk. Structural investment in mental-health workforce capacity must remain the foundation of future responses to the global treatment gap, with GenAI deployed as a supervised adjunct to clinicians within hybrid stepped-care frameworks subject to independent safety evaluation, transparent disclosure obligations, and regulatory oversight proportional to clinical risk.
Failure Mode and Effects Analysis (FMEA) is widely used in radiation oncology to proactively identify and mitigate risks, but it is time-consuming and depends heavily on expert experience. This study evaluated whether large language models (LLMs) can supplement traditional expert-driven FMEA by identifying novel failure modes within the Radiation Planning Assistant (RPA) workflow. A multidisciplinary team of board-certified medical physicists, quality assurance engineers, and software developers independently used 4 LLMs (ChatGPT-4, Gemini 2.5 Pro, phi4-reasoning-14B, and OpenAI/oss-120B) to generate potential failure modes across the RPA contouring and planning workflow. Team members used diverse prompting strategies, including supplementary materials such as RPA user guides, as context. Each failure mode was first rated for severity, occurrence, and detectability by the LLMs, then independently rescored by experts using the TG-100 framework to enable comparison. The highest-risk modes, based on expert scoring, were subsequently reviewed with 2 clinical user groups in South Africa. The 4 LLMs collectively generated 190 candidate failure modes. After review for relevance and duplication, 79 unique and interpretable modes were retained for analysis. Among these, 3 exceeded the risk priority number (RPN) threshold of 125 from a prior study, all related to staff accountability and role ambiguity. On average, LLMs assigned higher severity scores (7.3 vs 4.1), similar occurrence scores (2.8 vs 3.3), and higher detectability scores (5.4 vs 2.8; on the TG-100 scale a higher score means a failure is less likely to be detected), producing higher mean RPNs (110 vs 36). Clinical users from 2 centers in South Africa confirmed that several artificial intelligence-identified risks were plausible, particularly those tied to workflow accountability. LLMs can broaden risk discovery in FMEA by surfacing contextually relevant and previously unrecognized failure modes. However, expert oversight remains essential for validating and prioritizing risks.
Artificial intelligence should be viewed as a complementary tool that enhances, rather than replaces, human judgment in radiation therapy safety assessments.
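The risk priority number used for ranking can be sketched as the product of the severity, occurrence, and detectability scores, which is the standard FMEA definition. The failure-mode descriptions and scores below are hypothetical and chosen only to illustrate the threshold of 125 mentioned in the abstract; they are not taken from the study.

```python
from dataclasses import dataclass

RPN_THRESHOLD = 125  # threshold cited in the abstract (from a prior study)


@dataclass
class FailureMode:
    description: str
    severity: int       # 1-10
    occurrence: int     # 1-10
    detectability: int  # 1-10, higher = harder to detect

    @property
    def rpn(self) -> int:
        """Risk priority number: severity x occurrence x detectability."""
        return self.severity * self.occurrence * self.detectability


# Hypothetical failure modes and scores, for illustration only.
modes = [
    FailureMode("Unclear who approves the final plan", 8, 4, 5),
    FailureMode("Contour exported to the wrong patient record", 9, 2, 3),
]

high_risk = [m for m in modes if m.rpn > RPN_THRESHOLD]
for m in modes:
    print(f"{m.description}: RPN = {m.rpn}")
```

Multiplying the reported mean scores gives about 110 for the LLMs (7.3 × 2.8 × 5.4) and about 38 for the experts (4.1 × 3.3 × 2.8). The latter differs slightly from the reported mean RPN of 36, which is expected because a mean of per-mode products is generally not equal to the product of the means.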
To evaluate the diagnostic performance, clinical impact, and cost-effectiveness of a workflow combining EUCAST rapid antimicrobial susceptibility testing (RAST) with lateral flow immunochromatographic assays (LFIA) for resistance detection directly from positive blood cultures (BCs). A single-center, retrospective study including 179 monomicrobial bloodstream infection (BSI) episodes was conducted. RAST results were compared with broth microdilution (BMD) as the reference method. LFIA performance was assessed for the detection of extended-spectrum β-lactamases (ESBLs) and carbapenemases. The impact on antimicrobial therapy was analyzed at predefined time points. A cost-effectiveness analysis using a decision tree model compared three strategies: matrix-assisted laser desorption/ionization-time-of-flight mass spectrometry (MALDI-TOF MS) identification plus RAST, MALDI-TOF MS plus LFIA and RAST, and the FilmArray BCID2 molecular panel. Overall categorical agreement between RAST and definitive AST was 98.2%, with a low rate of very major errors (n = 8). LFIA showed excellent performance, with 100% concordance for ESBL and carbapenemase detection. Antimicrobial therapy was modified in 54.2% of patients, with most changes occurring after direct identification and LFIA results. In 97% of cases with therapeutic modification, treatment was appropriate according to definitive AST. In the economic analysis, MALDI-TOF MS plus LFIA and RAST showed a favorable balance between cost (€34.32 per patient) and effectiveness (16.51 h gained), compared with MALDI-TOF MS plus RAST alone (€10.94; 15.11 h) and BCID2 (€160.60; 20.43 h). The incremental cost-effectiveness ratio for adding LFIA was €16.59 per hour gained. The combination of RAST and LFIA directly from positive BCs provides rapid, accurate, and cost-effective microbiological information, supporting early optimization of antimicrobial therapy and representing a practical alternative to molecular syndromic panels for BSIs.
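The incremental cost-effectiveness ratio (ICER) for adding LFIA can be approximated from the rounded figures in the abstract using the usual definition, incremental cost divided by incremental effect. The function and variable names are illustrative; the decision tree model itself is not reproduced here.

```python
def icer(cost_new: float, cost_ref: float,
         effect_new: float, effect_ref: float) -> float:
    """Incremental cost per additional unit of effect (here: euros per hour gained)."""
    delta_effect = effect_new - effect_ref
    if delta_effect == 0:
        raise ValueError("ICER is undefined when the strategies have equal effect")
    return (cost_new - cost_ref) / delta_effect


# Rounded per-patient values reported in the abstract.
cost_rast, hours_rast = 10.94, 15.11   # MALDI-TOF MS plus RAST
cost_lfia, hours_lfia = 34.32, 16.51   # MALDI-TOF MS plus LFIA and RAST

icer_lfia = icer(cost_lfia, cost_rast, hours_lfia, hours_rast)
print(f"ICER for adding LFIA: about {icer_lfia:.2f} euros per hour gained")
```

With the rounded inputs this gives roughly €16.70 per hour gained, close to the reported €16.59; the small gap may reflect unrounded values in the original model.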
Lung chronic graft-versus-host disease (cGVHD) after allogeneic hematopoietic cell transplantation (HCT) comprises heterogeneous pulmonary phenotypes known as lung complications after transplantation (LCAT). While bronchiolitis obliterans syndrome (BOS) is well recognized and is associated with poor survival, restrictive phenotypes, including HCT-associated organizing pneumonia (HCT-OP) and truncal sclerosis (TS), remain poorly defined. Prior studies often grouped restrictive phenotypes, potentially obscuring phenotype-specific risk profiles and introducing survival bias by not taking into account the variable timing of LCAT onset. Direct comparison between specific LCAT phenotypes and patients with cGVHD without lung involvement is limited, leaving uncertainty regarding the relative prognostic impact of individual LCAT phenotypes. We hypothesized that LCAT phenotypes represent clinically and prognostically distinct entities and tested this using dynamic time-dependent Cox regression. We conducted a longitudinal retrospective cohort study of adults with cGVHD. LCAT phenotypes were adjudicated as BOS, HCT-OP, or TS based on clinical, physiologic, and radiographic criteria. Patients with cGVHD without lung involvement served as the control group. Baseline for all survival analyses was defined as the date of cGVHD diagnosis. Multivariable Cox proportional hazards models were fit with LCAT phenotype as a time-varying exposure to account for variable onset timing and mitigate immortal-time bias; adjusted survival curves were derived using dynamic standardization. Non-relapse mortality (NRM) was analyzed using cause-specific Cox models with relapse occurring after cGVHD onset as a competing event. Of 895 patients who met inclusion criteria, LCAT was identified in 183 patients: BOS (n=85), HCT-OP (n=42), TS (n=32), and mixed phenotypes (n=24).
Median time from cGVHD diagnosis to LCAT onset differed significantly: 3.3 months for HCT-OP, 9.4 months for BOS, and 20.5 months for TS (p<0.001). Pulmonary function tests showed distinct patterns, with severe airflow obstruction in BOS, predominant diffusion impairment in HCT-OP, and restriction in TS. In multivariable time-dependent Cox models, BOS was independently associated with inferior OS (HR 2.14; 95% CI 1.51-3.02; p<0.001) and elevated NRM (HR 3.04; 95% CI 2.08-4.43; p<0.001) relative to controls. TS was independently associated with increased NRM (HR 2.33; 95% CI 1.12-4.86; p=0.024) but not with inferior OS (HR 1.53; 95% CI 0.78-3.05; p=0.217). HCT-OP was not significantly associated with inferior OS or increased NRM relative to controls, a finding confirmed by sensitivity landmark analyses. In the LCAT burden model, mixed LCAT (≥2 phenotypes) conferred the greatest risk for both OS (HR 3.22; 95% CI 1.78-5.83; p<0.001) and NRM (HR 4.95; 95% CI 2.55-9.62; p<0.001) compared with controls, and also carried higher NRM than single LCAT (HR 2.04; 95% CI 1.03-4.02; p=0.041). LCAT phenotypes represent clinically distinct entities with divergent prognostic profiles. Accurate and granular phenotyping is essential to inform prognosis, guide surveillance strategies, and support therapeutic decision-making in patients with cGVHD.
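Treating LCAT as a time-varying exposure usually means restructuring each patient's follow-up into start-stop intervals, so that time before LCAT onset counts as unexposed rather than being credited to the LCAT group (the source of immortal-time bias). The sketch below shows one common way to build such rows; the function name, tuple layout, and patient values are hypothetical, and the abstract does not describe the authors' actual data preparation.

```python
from typing import List, Optional, Tuple

# (start_month, stop_month, lcat_exposed, death_event)
Interval = Tuple[float, float, int, int]


def split_follow_up(follow_up: float, died: int,
                    lcat_onset: Optional[float] = None) -> List[Interval]:
    """Split follow-up from cGVHD diagnosis (time 0) at LCAT onset.

    Time before onset is unexposed; the death event, if any, is attached
    only to the final interval.
    """
    if lcat_onset is None or lcat_onset >= follow_up:
        return [(0.0, follow_up, 0, died)]      # never exposed during follow-up
    if lcat_onset <= 0:
        return [(0.0, follow_up, 1, died)]      # exposed from baseline
    return [
        (0.0, lcat_onset, 0, 0),                # unexposed person-time, no event
        (lcat_onset, follow_up, 1, died),       # exposed person-time
    ]


# Hypothetical patients, for illustration only.
control = split_follow_up(follow_up=36.0, died=0)
lcat_patient = split_follow_up(follow_up=30.0, died=1, lcat_onset=9.4)
print(control)
print(lcat_patient)
```

Rows in this form can be passed to any Cox regression routine that accepts start-stop (counting-process) data, with the exposure indicator entered as a covariate.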
Lithium-sulfur (Li-S) batteries offer high theoretical energy density but face critical challenges due to sulfur's poor conductivity and the polysulfide shuttle effect. Here we report a novel cathode design utilizing a hybrid carbon matrix derived from buckwheat biomass and single-walled carbon nanotubes (SWCNT) to overcome these issues. The buckwheat-derived hard carbon (HC), obtained at 1000 °C, provides hierarchical porosity to anchor polysulfides and buffer sulfur expansion, while SWCNT (optimized at 6%) creates a conductive network. As a result, the S@SP/HC/SWCNT cathode delivers an initial discharge capacity of ~1250 mAh g⁻¹ at 0.1 C and retains ~60% of this capacity after 100 cycles with ~100% Coulombic efficiency. Overall, this sustainable, low-cost, and scalable carbon-sulfur cathode architecture effectively addresses key Li-S challenges and demonstrates strong promise for practical high-energy, long-life Li-S batteries.