The integration of artificial intelligence-generated content (AIGC) tools into academic research offers transformative potential for enhancing productivity and innovation. However, within the highly regulated and ethically sensitive medical context, the use of AIGC is accompanied by significant challenges. Medical postgraduates, as the future vanguard of medical science, play a crucial role in the advancement of digital health, and their intention to use AIGC tools will significantly influence the use of these emerging technologies in medical research. Despite the growing popularity of AIGC tools, there remains a paucity of in-depth understanding of the factors driving or hindering medical postgraduates' intention to use these tools in academic research. A clear comprehension of these influencing factors is essential to foster the responsible, effective, and sustainable integration of AIGC into medical research. This study aimed to systematically explore the key factors influencing medical postgraduates' intention to use AIGC tools in academic research, with the goal of informing strategies to promote their ethical use and enhance scholarly research capabilities. We used a qualitative research design based on grounded theory. Semistructured interviews were conducted with 30 medical postgraduates across diverse specialties, all of whom had prior research experience and familiarity with AIGC tools. Participants were recruited purposively to ensure diverse perspectives. Data analysis followed a systematic coding process to inductively develop a conceptual model, which was further structured and interpreted through the theoretical lens of the Unified Theory of Acceptance and Use of Technology. Our analysis identified 7 core factors directly shaping usage intention: performance expectancy, effort expectancy, social influence, facilitating conditions, individual characteristics, task characteristics, and technology characteristics. 
Further analysis revealed that performance expectancy acted as a mediating variable in the relationships between both task characteristics and technology characteristics and usage intention. Additionally, social influence moderated the relationship between task characteristics and performance expectancy. The research findings underscore that, while AIGC tools are valued for assisting daily research tasks, medical postgraduates' intention to use them in academic research is influenced by technical deficiencies, high cognitive load, and the ethical risks and strict data governance requirements of the medical field. This study constructs a conceptual model aimed at elucidating the factors influencing medical postgraduates' intention to use AIGC in academic research. Recommendations derived from the findings include (1) fostering artificial intelligence literacy and critical competency among medical postgraduates; (2) optimizing AIGC tools to better address domain-specific needs, accuracy, and security concerns prevalent in health research; and (3) establishing clear academic supervision and ethical governance mechanisms to ensure responsible use. These measures are essential to harness the potential of AIGC while safeguarding the rigor and integrity of medical academic research.
To identify research topics in medical chatbots and analyze their temporal trends, geographic distributions, and journal preferences. Latent Dirichlet Allocation (LDA) topic modeling was applied to 9,650 publications (1986-2024), extracting eight core topics integrated with time-series analysis, geographic statistics, and journal associations. Eight topics were identified. Temporal trends revealed three phases: technology incubation (2000-2015), rapid breakthrough (2015-2020), and application consolidation (2020-2024). Geographically, the United States, China, and the United Kingdom dominated research output (46.0%). Journal analysis highlighted Journal of Medical Internet Research (JMIR) (7.1%), Journal of the American Medical Informatics Association (JAMIA) (5.4%), and IEEE Journal of Biomedical and Health Informatics (IEEE JBHI) (3.2%) as top contributors, with JMIR and JAMIA reinforcing clinical informatics and digital therapeutics. Research on medical chatbots needs to balance technical feasibility and clinical value. Future research should focus on three directions: developing validation frameworks for leveraging large language models (LLMs) in clinical applications, establishing transnational data-sharing infrastructure, and creating ethical governance mechanisms that ensure responsible innovation while maintaining health equity.
Data quality is the degree to which data are fit for their intended purpose and is described using quality dimensions. The increased use of medical data in clinical research and medical artificial intelligence development has rendered data quality assessment essential. Despite existing data quality definitions, frameworks, and tools, data quality assessment in real-world settings faces multiple challenges. This stems from a lack of understanding of how to assess real-world data quality and interpret the results. Therefore, practical approaches to data quality assessment are needed that are appropriate for diverse data environments, intended use, quality dimensions, and requirements. This study proposes a practical approach for assessing the completeness of electronic health records for medical research. This approach integrates structural completeness, rule-based assessment, and descriptive analyses of completeness and data diversity to clarify how data quality can be measured and meaningfully interpreted in practice. Completeness of a large-scale electronic health record (EHR) dataset from Gachon University Gil Medical Center was evaluated, covering January 2005 to December 2023. Completeness was assessed using a three-part approach comprising (1) structural completeness assessment, (2) rule-based assessment, and (3) descriptive analyses of completeness and data diversity. Assessments were conducted using clinical data quality assessment tools. This practical approach was used to assess EHR completeness for medical research from 1,798,153 patient records. In the structural assessment, 7 out of 39 tables were empty or unavailable, indicating limited capturing of clinician free-text data. The rule-based assessment identified substantial missingness in vocabulary fields (30.6%) and missing or special-characteristic values in relation to observation (23.8%), measurement (4.0%), care site (1.6%), and death (0.3%). 
Descriptive analyses demonstrated a balanced gender distribution (49.3% male and 50.7% female) and a predominantly Korean racial distribution (96.75%). Collectively, these findings illustrate how a multi-perspective completeness assessment characterizes EHR data quality for medical research. This study demonstrates how data quality dimensions can be measured in practice through a real-world completeness assessment. This practical approach enables evaluation of EHR completeness and provides insight into data quality. Its findings have implications for researchers conducting data quality assessments and applying quality dimensions in medical research.
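A minimal version of the rule-based completeness check described above can be expressed in pandas. The table, field names, and placeholder values below are invented for illustration and do not reflect the study's actual schema:

```python
import numpy as np
import pandas as pd

# Toy EHR-style table with deliberate gaps (invented; not the study's data)
records = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "vocabulary_code": ["LOINC", None, "?", "SNOMED"],  # "?" = special character
    "measurement_value": [5.4, np.nan, 6.1, 5.9],
})

def completeness(series: pd.Series, special=("?", "NA", "")) -> float:
    """Share of entries that are present and not a special-character placeholder."""
    valid = series.notna() & ~series.astype(str).isin(special)
    return float(valid.mean())

# Field-level completeness rates, analogous to the missingness percentages reported
rates = {col: completeness(records[col]) for col in records.columns}
```

Structural completeness (empty or unavailable tables) would be checked one level up, before any field-level rule like this is applied.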
Graduate health informatics programs in the United States differ widely in cost, curriculum, and program design. However, it is unclear how these differences influence affordability, accreditation signaling, and preparation for a data-driven workforce. This study aimed to evaluate the value (tuition and affordability), structure (delivery format, credit load, culminating experience, and accreditation), and curriculum (technology content emphasis) of US graduate health informatics programs. It examined how accreditation and modality relate to program design, and whether tuition-normalized curriculum breadth differed by accreditation status. A cross-sectional study of 107 US graduate health informatics programs was conducted using publicly available data collected between January and May 2025. Tuition was standardized to cost per credit. Curricular content was coded for technology density and mapped to the Commission on Accreditation for Health Informatics and Information Management Education domains. Comparative statistics, regression models, and exploratory cluster analyses were used to assess relationships between tuition, credit requirements, accreditation, delivery format, and curriculum characteristics. Programs varied by delivery format, with 37 of 107 (34.6%) online, 32 of 107 (29.9%) hybrid, 23 of 107 (21.5%) in person, and 15 of 107 (14.0%) flexible. Credit requirements most commonly fell between 31 and 39 credits. Culminating experiences included capstone (54/107, 50.5%), internships (21/107, 19.6%), and thesis (7/107, 6.5%). Required credit hours showed modest variation by delivery format but not by accreditation status. Accreditation was not associated with differences in the tuition-normalized curriculum breadth structural proxy in this program-level analysis. Programs requiring internships had significantly higher mean credit loads than programs without internships (39.0 vs 31.3 credits; P=.005). 
Cluster analysis revealed 4 descriptive program configurations differentiated by cost, modality, credit requirements, and culminating experiences. In this program-level descriptive analysis, accreditation status was not associated with differences in the tuition-normalized curriculum breadth structural proxy. Instead, delivery format and internship requirements were descriptively associated with variation in credit load and cost. Improving transparency in tuition models and aligning program structure with curricular scope may support efforts to enhance equity and value in graduate health informatics education.
Nursing informatics is essential for digital health transformation; however, the technology acceptance of undergraduate nursing students in Saudi Arabia remains underexplored. This study examined factors influencing nursing students' intention to use informatics technologies using the technology acceptance model. A cross-sectional survey was conducted with 132 undergraduate nursing students. Data were analyzed using descriptive, correlational, and hierarchical regression analyses. Perceived usefulness (mean 3.68, SD 1.22) and perceived ease of use (mean 3.64, SD 1.32) were the strongest predictors of acceptance, together explaining 87% of the variance (R²=0.87; β=0.323 for usefulness, P<.001; β=0.195 for ease of use, P=.032). Only 25.8% (n=34) of the students often used electronic health records, while 31.8% (n=42) had no electronic health record experience, indicating a clear gap in practical informatics exposure. Nursing students' acceptance of informatics is primarily driven by its perceived usefulness and perceived ease of use. These findings highlight the urgent need to integrate practical, user-centered informatics training and clinical simulation into undergraduate nursing curricula to better prepare students for technology-based practice.
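The hierarchical regression strategy reported above, entering perceived usefulness first and then adding perceived ease of use while inspecting the gain in explained variance, can be sketched on simulated data. The sample size matches the survey, but the coefficients and noise level below are arbitrary, not the study's estimates:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 132  # matches the survey's sample size; the data themselves are simulated
usefulness = rng.normal(3.7, 1.2, n)
ease_of_use = rng.normal(3.6, 1.3, n)
intention = 0.3 * usefulness + 0.2 * ease_of_use + rng.normal(0.0, 0.3, n)

# Step 1: perceived usefulness alone
X1 = usefulness.reshape(-1, 1)
r2_step1 = LinearRegression().fit(X1, intention).score(X1, intention)

# Step 2: add perceived ease of use and report the incremental R-squared
X2 = np.column_stack([usefulness, ease_of_use])
r2_step2 = LinearRegression().fit(X2, intention).score(X2, intention)
delta_r2 = r2_step2 - r2_step1  # variance uniquely explained by ease of use
```

For in-sample ordinary least squares, `r2_step2` can never fall below `r2_step1`, which is why hierarchical designs report the increment rather than the raw step-2 fit.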
Studies suggest that the introduction of electronic health records (EHRs) has decreased the efficiency of clinical practice and increased clinician workload for US-based physicians. Most studies involve clinicians in primary care settings. Less is known about other health care settings, subspecialist clinicians, or whether markers of efficiency and workload change over time. This study aimed to describe 2 common metrics of after-hours use of the EHR (pajama time and time outside scheduled hours [TOSH]) among diverse specialists and track these parameters longitudinally in a Canadian setting. In this longitudinal descriptive study, medical and surgical specialists were observed starting from the introduction of a system-wide EHR in 2019 to 2022 at a large quaternary teaching hospital in Edmonton, Alberta. Pajama time and TOSH were extracted from the EHR on an Epic system platform and monitored over time. Clinicians were stratified according to clinical group (medical and surgical) and workload (clinical full-time equivalent). A total of 71 medical and surgical specialists participated in this study, spending approximately 24 to 40 minutes per day on pajama time and 32 to 55 minutes per day on TOSH depending on clinician grouping. Both metrics increased over the observation period, as reflected in the longitudinal plots and the higher values observed at the end vs the beginning of follow-up. After-hours EHR use in this Canadian cohort of medical and surgical specialists was similar to what is reported in the US literature, although the drivers may be different. Perhaps surprisingly, these markers increased over time despite presumed improved familiarity with the EHR. The extent to which this affects clinician well-being and work-life integration cannot be determined from these results, although there may be cause for concern.
The occurrence of sepsis in patients with heart failure (HF) has received comparatively little research attention, yet it poses a significant clinical challenge due to the complex interplay between chronic cardiac dysfunction and acute systemic inflammation. The stress hyperglycemia ratio (SHR) has emerged as an independent risk factor in various cardiovascular diseases and patients with sepsis, but its role in predicting sepsis risk in patients with HF remains underexplored. This study investigates the association between SHR and sepsis occurrence in patients with HF and explores the potential mediating role of inflammatory indicators. This retrospective cohort study used data from the Medical Information Mart for Intensive Care-IV (version 3.0) database, encompassing patients with HF from critical care units. SHR was calculated based on initial blood glucose and glycated hemoglobin A1c levels. The analysis population was divided into 4 groups based on the quartiles of the SHR. The primary end point was 7-day sepsis incidence, which was diagnosed following Sepsis-3 criteria. Within the 1205-patient cohort (male: 764/1205, 63.4%; median age 71.51 years, IQR 62.45-79.47), a total of 162 (13.4%) patients with HF experienced sepsis within 7 days. In the fully adjusted model, a per-unit SHR increase was linked to an 18% higher sepsis risk (hazard ratio 1.18, 95% CI 1.01-1.38; P=.04). Restricted cubic spline analysis showed a nonlinear saturation effect association (P for nonlinearity=.02), which was consistent in the diabetic subgroup (P for nonlinearity=.01). After adjusting for 7-day mortality as a competing event using the Fine-Gray model, SHR was independently associated with an increased risk of sepsis (P=.01). The association between SHR and sepsis was significantly modified by diabetes mellitus, BMI, and insulin use (all P for interaction<.05).
Furthermore, mediation analysis indicated that several inflammatory indices, including the systemic immune-inflammation index, neutrophil-to-lymphocyte ratio, platelet-to-neutrophil ratio, systemic inflammation response index, and monocyte-neutrophil-to-lymphocyte ratio, significantly mediated the association in critically ill patients with HF. In critically ill patients with HF, an elevated SHR was associated with heightened 7-day sepsis risk, especially among those with comorbid diabetes. Furthermore, systemic inflammatory indices partially mediated this association in the overall population, implicating inflammation as a potential mechanistic link between SHR and sepsis.
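The abstract says only that SHR was computed from initial glucose and HbA1c; it does not give the exact formula. A widely cited formulation divides admission glucose by the HbA1c-derived estimated average glucose, and the sketch below assumes that version rather than claiming it is the study's definition:

```python
def stress_hyperglycemia_ratio(glucose_mg_dl: float, hba1c_pct: float) -> float:
    """SHR = admission glucose / estimated average glucose.

    Uses the commonly cited conversion eAG (mg/dL) = 28.7 * HbA1c (%) - 46.7.
    The abstract does not specify the study's exact formula, so treat this
    as one standard formulation, not the paper's definition.
    """
    estimated_avg_glucose = 28.7 * hba1c_pct - 46.7
    return glucose_mg_dl / estimated_avg_glucose

# Example: admission glucose 180 mg/dL with HbA1c 6.0% gives eAG 125.5 mg/dL
shr = stress_hyperglycemia_ratio(180.0, 6.0)
```

Values above 1 indicate glucose elevated beyond the patient's chronic baseline, which is the "stress" component the ratio is meant to isolate.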
Despite the increasing use of machine learning (ML) in clinical research, the early stages of data preparation, especially for structured clinical data, often receive limited methodological scrutiny. These datasets typically contain missing values, complex categorical variables, and imbalanced class distributions, all of which complicate downstream model development and interpretation. This study introduces a structured preprocessing framework designed to address common challenges in medical tabular data and to assess how preprocessing choices affect the stability and portability of predictive models across settings. We constructed a modular workflow comprising 3 components. First, preprocessing strategies included imputation for missing data, 3 types of categorical encoding (one-hot, frequency, and target), and resampling approaches for class imbalance (Synthetic Minority Over-sampling Technique [SMOTE] and Random Over Sampling Example [ROSE]). Second, 6 classification algorithms were used to evaluate performance patterns, including logistic regression (LGR), decision tree (DT), random forest, XGBoost (XGB), CatBoost (CAT), and light gradient-boosting machine (LightGBM). Third, we assessed cross-dataset portability using 2 datasets with distinct data-generating mechanisms: a registry for patients with end-stage renal disease (ESRD; n=412) and the population-based Behavioral Risk Factor Surveillance System (BRFSS) 2015 survey. For each dataset, we independently cleaned, standardized, encoded, tuned, and evaluated models using the same predefined hyperparameter search space, without cross-dataset feature matching or pooled area under the ROC curve (AUC) calculations; the complete pipeline was then rerun on BRFSS as an external replication. One-hot encoding in combination with ROSE yielded the most consistent performance improvements in terms of AUC (0.940) and accuracy (0.932), particularly for classifiers sensitive to class distribution.
Notably, ROSE enhanced sensitivity without substantially distorting the original data structure. Feature importance rankings also contributed to model interpretability, and performance trends were largely reproducible in cross-context application. Our findings suggest that preprocessing decisions, often treated as ancillary, play a central role in shaping model outcomes, especially in high-variance clinical datasets. The proposed framework offers a reproducible and adaptable tool for aligning data preparation with the unique demands of health care prediction tasks and may serve as a foundation for future efforts to standardize preprocessing in clinical ML workflows.
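The encoding-plus-resampling step at the core of the framework can be sketched as follows. Plain random duplication of minority rows stands in for SMOTE/ROSE (which synthesize new minority samples rather than copying existing ones), and the toy table with its column names is invented:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy imbalanced clinical table (column names and values are invented)
df = pd.DataFrame({
    "age": rng.integers(40, 85, 200),
    "stage": rng.choice(["I", "II", "III"], 200),
    "outcome": np.repeat([0, 1], [180, 20]),  # 10% positive class
})

# One-hot encode the categorical variable; keep the numeric one as-is
X = pd.get_dummies(df[["age", "stage"]], columns=["stage"])
y = df["outcome"]

# Random duplication of minority rows until classes balance (a crude
# stand-in for SMOTE/ROSE, which generate synthetic minority samples)
minority_idx = y.index[y == 1]
extra = rng.choice(minority_idx, size=(y == 0).sum() - (y == 1).sum(), replace=True)
X_bal = pd.concat([X, X.loc[extra]])
y_bal = pd.concat([y, y.loc[extra]])

clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
```

In a leakage-safe pipeline the resampling step is applied only to training folds, never to the held-out evaluation data.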
The rapid integration of large language models into electronic medical record systems introduces a critical theoretical vulnerability. Drawing on foundational computer science proofs of "model collapse," this viewpoint introduces the concept of "Clinical Model Autophagy," a systemic degradation of diagnostic integrity that occurs when clinical artificial intelligence (AI) models are recursively trained on unverified, AI-generated synthetic data. As these recursive models may progressively regress toward statistical means, they undergo "Interpretative Drift," a clinically concerning phenomenon where rare pathological variances are systematically erased and complex diseases are homogenized into benign averages. To prevent the irreversible contamination of health care data ecosystems, the author urgently proposes the Data Purity Standard (DPS). The DPS mandates the cryptographic watermarking of all AI-assisted clinical entries for provenance tracking, alongside the establishment of "Human Vaults." These physically segregated repositories of physician-verified heritage data will serve as immutable biological anchors to safely guide future AI training, ensuring the long-term reliability of digital health infrastructure.
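The DPS notion of marking AI-assisted entries for provenance could be sketched, very loosely, with a keyed hash. The key, field names, and tagging scheme below are illustrative assumptions; an HMAC only proves the tagger held the key, so a real deployment would need asymmetric signatures and audited key custody:

```python
import hashlib
import hmac
import json

SECRET_KEY = b"demo-signing-key"  # hypothetical; real systems need key management

def tag_entry(entry: dict, source: str) -> dict:
    """Attach a provenance tag (source label + keyed hash) to a clinical note.

    A loose sketch of the DPS idea of watermarking AI-assisted entries;
    not a robust watermark, since removing the tag removes the evidence.
    """
    payload = json.dumps(entry, sort_keys=True).encode() + source.encode()
    mac = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return {**entry, "provenance": {"source": source, "mac": mac}}

def is_ai_assisted(tagged: dict) -> bool:
    """Filter rule: exclude AI-assisted entries from future training corpora."""
    return tagged["provenance"]["source"] == "ai-assisted"

note = tag_entry({"text": "Patient stable overnight."}, "ai-assisted")
```

A "Human Vault" in this sketch would simply be the subset of entries whose provenance tag verifies against a physician-held key.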
[This corrects the article DOI: 10.2196/38150.].
Artificial intelligence (AI) is rapidly transforming clinical practice, yet empirical evidence on Chinese physicians' acceptance of AI medical tools remains scarce at the national level. This study aimed to evaluate the current acceptance of AI medical tools among Chinese physicians, identify key determinants, and elucidate underlying mechanisms using an extended Unified Theory of Acceptance and Use of Technology (UTAUT) and explainable machine learning. A nationwide cross-sectional survey was conducted from January to April 2024, recruiting 4024 in-service physicians across 29 provincial-level administrative units in China via stratified random sampling. The questionnaire incorporated 5 UTAUT constructs: performance expectancy, effort expectancy, social influence, facilitating conditions (FC), and a newly introduced "positive impact" dimension. Psychometric properties were validated through exploratory and confirmatory factor analyses. Structural equation modeling assessed direct and moderated effects, with hospital level, professional title, AI familiarity, and future optimism as moderators. Six classification models were compared for predictive performance; balanced random forest was selected, and model interpretability was evaluated using Shapley Additive Explanations (SHAP). Overall acceptance exceeded 90% across subgroups. Structural equation modeling showed that performance expectancy, social influence, FC, and positive impact significantly and positively predicted physicians' behavioral intention to use AI medical tools. Six negative moderation effects were identified. The random forest achieved 85.6% accuracy and an area under the receiver operating characteristic curve of 0.836; SHAP analysis identified organizational support (FC_HospPromoteAI) as the feature with the highest mean absolute SHAP value, though all effect sizes were modest.
Chinese physicians exhibit high acceptance of AI medical tools, mainly driven by organizational support and perceived clinical benefits. The combined use of extended UTAUT and explainable AI provides actionable insights for targeted AI implementation strategies in health care.
Predicting mortality among people living with HIV enables clinicians to implement timely, targeted, and preventive interventions at the start of antiretroviral therapy (ART). However, prognostic models must rely strictly on baseline predictors to avoid look-ahead bias and ensure scientific validity. This study evaluates machine learning (ML) algorithms for baseline mortality prediction using routine electronic medical record data. This study aims to predict mortality among people living with HIV receiving ART using baseline clinical and sociodemographic characteristics through ML models in public health facilities of Gondar City Administration, Northwest Ethiopia. The retrospective cohort study was conducted using electronic medical record data from 12,871 people living with HIV on ART (2005-2024). Seven base classifiers were evaluated using stratified 10-fold cross-validation. Synthetic minority oversampling technique (SMOTE)-balanced variants were used only for sensitivity analysis. SMOTE oversampling was applied only to training folds; the final evaluation used the original imbalanced test data. Shapley Additive Explanations (SHAP) analysis identified key baseline predictors. Gradient boosting on the original data achieved superior performance (accuracy 87.0%, F1-score 0.619, area under the receiver operating characteristic curve 0.859), outperforming extreme gradient boosting (F1-score 0.609, area under the receiver operating characteristic curve 0.835) and SMOTE variants. The SHAP analysis identified education level, specifically lack of formal education (+0.84), and a low baseline cluster of differentiation 4 (CD4; a type of immune cell) count of 140 cells/mm³ (+0.54) as substantially increasing predicted mortality risk. Urban residence (-0.35) and working functional status (-0.12) showed protective effects, whereas age (45 y; -0.02) had minimal influence in the illustrated case.
Globally, lower CD4 counts and the absence of formal education were consistently associated with increased mortality risk. Ensemble ML models demonstrated moderate-to-strong discrimination for predicting mortality among people living with HIV using strictly baseline routine electronic medical record data. SHAP-based interpretability confirmed that educational attainment and baseline CD4 count were the dominant determinants of predicted mortality risk, underscoring the combined influence of socioeconomic vulnerability and immunological status at ART initiation. These findings support the potential utility of interpretable ML models for early risk stratification and targeted clinical decision-making in resource-limited settings; however, external validation is required before routine clinical implementation.
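The tree-ensemble-plus-attribution workflow described above can be sketched as follows. Built-in impurity importances stand in for SHAP values (the `shap` package is not assumed to be available), and the data are simulated to echo, not reproduce, the reported CD4 and education effects:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(42)
n = 500
# Simulated baseline predictors (illustrative only; not the Gondar cohort)
cd4 = rng.normal(350, 150, n)
no_formal_education = rng.integers(0, 2, n).astype(float)
urban = rng.integers(0, 2, n).astype(float)
X = np.column_stack([cd4, no_formal_education, urban])

# Let mortality depend mainly on CD4 and education, echoing the abstract
logit = -0.01 * cd4 + 1.5 * no_formal_education - 0.5 * urban + 1.0
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Impurity-based importances as a crude global stand-in for SHAP attributions;
# SHAP additionally gives signed, per-patient contributions like those reported
importance = dict(zip(["cd4", "no_formal_education", "urban"],
                      model.feature_importances_))
```

The signed per-patient values in the abstract (e.g., +0.84 for lack of formal education) are exactly what this global ranking cannot show, which is why the study used SHAP rather than importances alone.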
In the evolving landscape of health care, data use plays an ever-increasing role in health care IT. However, data are often siloed, stored as uncoded free text, and distributed across several IT systems. This paper introduces a health knowledge management platform, designed to integrate, harmonize, and enable reuse of health care and medical research data. The platform aims to bridge the gap between research and patient care, showcased through real-world scenarios, emphasizing data harmonization and knowledge management within a health care institution. The study is based at the University Hospital Schleswig-Holstein. The main objective of this project is to design, implement, and evaluate a knowledge management platform that integrates health care and biomedical research to support use cases in both domains. The study describes the "health knowledge management platform" designed to access and gain knowledge from health care and medical research data. We performed several rounds of focus groups with stakeholders to elicit the platform requirements. In the process, we identified key aspects of the platform. From the functional requirements, we designed an architectural concept. The platform evaluation follows the Framework for Evaluation in Design Science Research and the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) 25010 standard with a focus on the key aspects identified and real-world scenarios. Two application scenarios, cardiology and radiology, were selected for a requirement-based, qualitative evaluation. We show that our health knowledge management platform is capable of integrating diverse data formats like Health Level 7 Version 2 messages, CSV exports, and Digital Imaging and Communications in Medicine. It currently integrates over 46 million admit, discharge, and transfer messages, 38 million imaging studies, and structured clinical data for approximately 1.5 million patients.
The platform supports different scenarios based on its 5-layer architecture, including a clinical data repository and services like Master Patient Index and Consent Management. The evaluation against 39 predefined functional requirements showed our platform's capability in certain real-world scenarios of cardiology and radiology. Our evaluation demonstrates that the platform covers the majority of the identified requirements to support knowledge management in health care institutions. Our requirement-based evaluation of the health knowledge management platform at University Hospital Schleswig-Holstein reveals its capabilities, which may lead to better knowledge transfer between patient care and research. The platform's architecture and standardized data improve the quality of data and facilitate access to knowledge. Ongoing development and potential quantitative measures will further enhance its applicability in dynamic health care landscapes.
Predicting enterocutaneous fistula (ECF)-associated sepsis and mortality poses significant challenges in digital health care due to the disease's complexity and heterogeneous clinical manifestations. Current approaches that rely on single-modal data or traditional scoring systems often fail to capture the intricate immune-inflammatory dynamics and multisystem involvement in patients with ECF. This study aims to develop an artificial intelligence (AI)-driven multimodal fusion model integrating clinical, imaging, and transcriptomic data for early prediction of ECF-associated sepsis and 28-day mortality, addressing the limitations of conventional single-dimensional models. This study leveraged publicly available datasets (Medical Information Mart for Intensive Care III [MIMIC-III], electronic Intensive Care Unit [eICU], and The Cancer Genome Atlas) to construct a multimodal framework. Clinical parameters were processed using Extreme Gradient Boosting, abdominal imaging features were extracted via convolutional neural networks, and transcriptomic profiles were analyzed with variational autoencoders. A Transformer-based fusion network was employed for joint prediction and validated through cross-validation and external testing. Key features were identified using Shapley Additive Explanations and Local Interpretable Model-Agnostic Explanations interpretability algorithms, while immune regulatory mechanisms were explored via weighted gene co-expression network analysis. The multimodal model achieved an area under the curve (AUC) of 0.89 for predicting sepsis and 28-day mortality, outperforming unimodal models (clinical-only model, AUC 0.72, and imaging-only model, AUC 0.78). Critical predictors included Sequential Organ Failure Assessment score, lactate levels, intra-abdominal free fluid on imaging, and immunoregulatory genes (programmed death-ligand 1 [PD-L1] and indoleamine 2,3-dioxygenase 1 [IDO1]). 
Mechanistic analysis revealed distinct immune reprogramming in patients with sepsis, characterized by increased regulatory T cells and M2 macrophages, along with downregulated cluster of differentiation 8+ (CD8+) T cells. This multimodal AI model offers an innovative digital solution in medical informatics, enabling precise early risk stratification for ECF-associated sepsis. By integrating multisource data and providing interpretable insights into immune-inflammatory pathways, the model enhances health care quality for patients with ECF and paves the way for personalized intervention strategies.
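A minimal stand-in for the multimodal fusion idea above is simple feature concatenation across modalities. The study's Transformer fusion network, convolutional image encoder, and variational autoencoder are not reproduced here; all feature matrices below are simulated placeholders for their outputs:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 300
clinical = rng.normal(size=(n, 5))  # e.g., severity scores, labs (simulated)
imaging = rng.normal(size=(n, 8))   # CNN-style image embedding (simulated)
omics = rng.normal(size=(n, 6))     # autoencoder latent features (simulated)

# Outcome driven by one informative feature per modality
y = (clinical[:, 0] + imaging[:, 0] + omics[:, 0]
     + rng.normal(size=n) > 0).astype(int)

# Late fusion by concatenation: a single classifier sees all modalities,
# the simplest baseline a learned fusion network would have to beat
fused = np.concatenate([clinical, imaging, omics], axis=1)
clf = LogisticRegression(max_iter=1000).fit(fused, y)
in_sample_accuracy = clf.score(fused, y)
```

A Transformer-based fusion layer replaces this flat concatenation with learned cross-modality attention, which is what lets it model interactions such as imaging findings modifying the weight given to lactate.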
Autosomal dominant nonsyndromic hearing loss (ADNSHL) is highly heterogeneous, with more than 64 genes implicated in its etiology. This complexity limits the diagnostic power of clinical examinations and audiometry alone, while existing computational approaches have achieved only moderate accuracy and often lack interpretability. As precision medicine increasingly emphasizes genotype-phenotype correlations, there is a recognized need for diagnostic tools that provide clinicians with transparent, interpretable outputs. This study aimed to develop and evaluate the AudioGene Translational Dashboard, an interpretable clinical informatics tool that integrates machine learning models and interactive visualizations to enhance genotype-phenotype correlations and support diagnostic decision-making in ADNSHL. We developed the AudioGene Translational Dashboard, integrating 2 machine learning models (AudioGene version 4 and AudioGene version 9.1) with 6 interactive visualization tools. AudioGene version 4 uses a multi-instance support vector machine classifier for patients with multiple audiograms, while AudioGene version 9.1 combines adaptive boosting, k-nearest neighbors, random forest models, and logistic regression for patients with a single audiogram. Visualizations include audiometric profile plots, audioprofile surfaces, clustering analyses, and data distribution charts designed to facilitate clinical interpretation. The AudioGene Translational Dashboard was developed to address the "70/30" phenomenon, indicating a 74% likelihood that the causative gene is among the top 3 predicted genes, thereby providing clinicians with a clear confidence indicator ("green flag") or a caution alert ("red flag") during diagnosis. While this level of performance is well suited for hypothesis generation, the remaining uncertainty underscores the need for interpretive context in clinical decision-making. 
Visualization tools enhanced clinicians' ability to interpret and correlate phenotypic data with predicted genetic outcomes, improving diagnostic confidence and interpretability. The AudioGene Translational Dashboard advances clinical informatics in genetic diagnosis of ADNSHL by integrating explainable artificial intelligence with interactive visualizations, enhancing clinical interpretability and diagnostic accuracy. This approach facilitates informed clinical decision-making, highlights the translational potential of genotype-phenotype computational models, and supports precision medicine in hearing loss diagnostics. Future enhancements will target improving class balance and incorporating additional user-customizable features to further optimize clinical applicability.
MedlinePlus, developed by the National Library of Medicine (NLM) in the United States, is one of the most widely used, authoritative, consumer-grade health information resources on the web. Although extensively used and discussed in scholarly work for health literacy and patient education, it is unclear how MedlinePlus has been integrated into clinical care or embedded within health informatics applications. This study aimed to understand how MedlinePlus has supported patients and caregivers by increasing access to health information for clinical care and illness management. Insights from this review will inform the design and development of patient-facing digital health intervention tools for improved health communication, decision engagement, informed decision-making, and health outcomes. We conducted a systematic literature review following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. First, we developed a comprehensive literature search strategy, searched 9 citation databases, and aggregated and deduplicated search results before importing them into Covidence for manual screening using predefined inclusion and exclusion criteria. Second, reviewers independently assessed all studies at the title-abstract and full-text levels, resolving discrepancies through ongoing discussions. Third, we applied the PICO (problem/population, intervention, comparison, and outcome) framework and the Collaborative Chronic Care Model as guiding frameworks for data extraction and analysis. All included studies underwent quality assessment using the Mixed Methods Appraisal Tool. In total, 28 studies reported in 27 sources met our inclusion criteria. We categorized the extracted data into 4 areas. First, regarding bibliometrics, the studies were reported between 2004 and 2024, with 2010 having the highest number of studies. Of these studies, 25 were conducted in the United States, 2 were conducted in Iran, and 1 was conducted in Argentina.
Health informatics journals and conference proceedings, as well as library science journals, were prominent publishing venues. The NLM funded half of the studies. Second, regarding participants, most studies focused on outpatients. Other participant roles included physicians, nurses, hospital staff, pharmacists, and librarians. Fewer than half of the studies addressed the social determinants of health. Third, regarding intervention, most studies implemented MedlinePlus information interventions within clinical settings. Other interventions occurred in community pharmacies, community organizations, libraries, online health platforms, or patient portals. Fourth, regarding outcome, only 4 studies assessed clinical outcomes, and the findings were mixed and inconsistent. However, 24 of 28 studies reported positive nonclinical outcomes, including improved attitudes toward and satisfaction with MedlinePlus and enhancements in patients' information-seeking behaviors, confidence, and willingness to engage in decision-making, physician-patient communication, self-management, and self-efficacy. This systematic literature review is the first comprehensive examination of how MedlinePlus has been integrated into clinical care, supporting patients and caregivers with enhanced access to health information. Our findings offer evidence and insights through the Collaborative Chronic Care Model lens and can guide the development of digital health interventions to improve patient health.
The exponential growth of digital information has led to an unprecedented expansion in the volume of unstructured text data. Efficient classification of these data is critical for timely evidence synthesis and informed decision-making in health care. Machine learning techniques have shown considerable promise for text classification tasks. However, multiclass classification of papers by study publication type has been largely overlooked compared to binary or multilabel classification. Addressing this gap could significantly enhance knowledge translation workflows and support systematic review processes. This study aimed to fine-tune and evaluate domain-specific transformer-based language models on a gold-standard dataset for multiclass classification of clinical literature into mutually exclusive categories: original studies, reviews, evidence-based guidelines, and nonexperimental studies. The titles and abstracts of McMaster's Premium Literature Service (PLUS) dataset, comprising 162,380 papers, were used for fine-tuning 7 domain-specific transformers. Clinical experts classified the papers into 4 mutually exclusive publication types. PLUS data were split in an 80:10:10 ratio into training, validation, and testing sets, with the Clinical Hedges dataset used for external validation. A grid search evaluated the impact of class weight (CW) adjustments, learning rate (LR), batch size (BS), warmup ratio, and weight decay (WD), totaling 1890 configurations. Models were assessed using 10 metrics, including the area under the receiver operating characteristic curve (AUROC), the F1-score (harmonic mean of precision and recall), and the Matthews correlation coefficient (MCC). The performance of individual classes was assessed using a one-vs-rest approach, and overall performance was assessed using the macro average. Optimal models identified from validation results were further tested on both PLUS and Clinical Hedges, with calibration assessed visually.
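The evaluation described above, macro AUROC with each class treated one-vs-rest, F1-score, and Matthews correlation coefficient, can be sketched with scikit-learn. The labels and probabilities below are hypothetical placeholders, not the paper's actual predictions:

```python
import numpy as np
from sklearn.metrics import f1_score, matthews_corrcoef, roc_auc_score

# Hypothetical 4-class labels (e.g., 0=original, 1=review,
# 2=guideline, 3=nonexperimental) and predicted probabilities
y_true = np.array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1])
rng = np.random.default_rng(0)
y_proba = rng.dirichlet(np.ones(4), size=len(y_true))  # rows sum to 1
y_pred = y_proba.argmax(axis=1)

# One-vs-rest AUROC per class, then macro-averaged across classes
macro_auroc = roc_auc_score(y_true, y_proba,
                            multi_class="ovr", average="macro")
macro_f1 = f1_score(y_true, y_pred, average="macro")
mcc = matthews_corrcoef(y_true, y_pred)  # ranges from -1 to 1
```

Macro averaging weights every class equally regardless of its frequency, which is why it is informative alongside MCC when classes such as guidelines are rare.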
The 10 best-performing models achieved macro AUROC ≥0.99, F1-score ≥0.89, and MCC ≥0.88 on the validation and testing sets. Performance declined on Clinical Hedges. Models were consistently better at classifying original studies and reviews. Models based on Biomedical Bidirectional Encoder Representations from Transformers (BioBERT; fine-tuned on biomedical text) had superior calibration performance, especially for original studies and reviews. Optimal configurations from the search included lower LRs (1 × 10^-5 and 3 × 10^-5), midrange BSs (32-128), and lower WD (0.005-0.010). CW adjustments improved recall but generally reduced performance on other metrics. Models generally struggled with accurately classifying nonexperimental and guideline studies, potentially due to class imbalance and content heterogeneity. This study used a comprehensive hyperparameter search to highlight the effectiveness of fine-tuned transformer models, notably BioBERT variants, for multiclass clinical literature classification. Although class weighting generally decreased overall performance, addressing class imbalance through alternative methods, such as hierarchical classification or targeted resampling, warrants future exploration. Hyperparameter configurations were crucial for robust performance, aligning with the previous literature. These findings support future modeling research and practical deployment in human-in-the-loop systems for knowledge synthesis and translation workflows.
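Class-weight adjustments of the kind evaluated in the grid search are commonly derived as weights inversely proportional to class frequency. A small sketch using scikit-learn's "balanced" heuristic, with hypothetical class counts rather than the actual PLUS distribution:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced label distribution across 4 publication types
y = np.array([0] * 70 + [1] * 20 + [2] * 7 + [3] * 3)

# "balanced" heuristic: weight_c = n_samples / (n_classes * count_c)
weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(y), y=y)
# Rare classes receive larger weights, pushing the training loss toward
# higher recall on them, often at some cost on other metrics for the
# majority classes (consistent with the trade-off reported above).
```

Here the rarest class (3 of 100 samples) gets weight 100/(4 × 3) ≈ 8.33, while the majority class gets 100/(4 × 70) ≈ 0.36.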
Cochrane plain language summaries (PLSs) aim to make systematic review findings more accessible to the general public. However, inconsistencies in how conclusions are presented may impact comprehension and decision-making. Classifying PLSs based on conclusiveness can improve clarity and facilitate informed health decisions. This study aimed to develop and evaluate deep learning language models for the classification of PLSs according to 3 levels of conclusiveness (conclusive, inconclusive, and unclear) and to compare their performance with a general-purpose large language model (GPT-4o). We used a publicly available dataset containing 4405 Cochrane PLSs of systematic reviews published until 2019, already classified by humans according to 9 categories of conclusiveness regarding the intervention's effectiveness or safety. We merged these categories into 3 classes based on the strength of conclusiveness: conclusive, inconclusive, and unclear. For the fine-tuning, we used Scientific Bidirectional Encoder Representations from Transformers (SciBERT), a pretrained language model trained on 1.14 million papers primarily from the health sciences, and Longformer, a transformer model designed specifically to process long documents. The script was developed using the Python programming language and the PyTorch framework. We computed evaluation metrics using the scikit-learn machine learning library and determined the area under the receiver operating characteristic curve (AUCROC) to measure the model performance in balancing sensitivity and specificity. We also analyzed a separate set of 213 PLSs and compared the predictions of our fine-tuned models with both manual verification and outputs generated by ChatGPT. The model based on SciBERT achieved a balanced accuracy of 56.6%. The AUCROC was 0.91 for "conclusive," 0.67 for "inconclusive," and 0.75 for "unclear" conclusiveness classes.
The Longformer-based model had a balanced accuracy of 60.9%, with AUCROCs of 0.86 for "conclusive," 0.67 for "inconclusive," and 0.72 for "unclear" conclusiveness classes. Both models underperformed compared with ChatGPT, which demonstrated higher accuracy (74.2%), better precision and recall, and a higher Cohen κ (0.57). Fine-tuning 2 transformer-based language models showed mixed results in classifying Cochrane PLSs by conclusiveness, likely due to semantic overlap and subtle linguistic differences. Despite satisfactory internal test metrics, the fine-tuned models failed to generalize to newly published PLSs, where performance dropped to near-chance levels. These findings suggest that general-purpose large language models like GPT-4o may currently offer more reliable results for practical classification tasks in biomedical applications.
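The comparison metrics reported above, balanced accuracy and Cohen κ, are both available in scikit-learn. A toy sketch with hypothetical labels (not the PLS data):

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, cohen_kappa_score

# 0 = conclusive, 1 = inconclusive, 2 = unclear (hypothetical predictions)
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 0])

# Balanced accuracy: mean of per-class recall, robust to class imbalance
bal_acc = balanced_accuracy_score(y_true, y_pred)  # 2/3 here

# Cohen kappa: agreement between predictions and labels beyond what
# chance alone would produce (0 = chance level, 1 = perfect)
kappa = cohen_kappa_score(y_true, y_pred)  # 0.5 here
```

Balanced accuracy is the relevant headline number when the 3 conclusiveness classes are unevenly represented, and κ near 0 on new data is what "performance dropped to near-chance levels" refers to.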