Charles Friedman's Fundamental Theorem of Biomedical Informatics holds that a person working in partnership with an information resource outperforms that same person unassisted. Since its publication, advances in artificial intelligence (AI), adaptive learning systems, and large-scale data infrastructures have transformed the biomedical ecosystem, extending informatics beyond clinical care into domains such as public health, consumer health, translational science, and the broader life sciences. Such expansion has further underscored the importance of the Fundamental Theorem while also elucidating ways it can be expanded to meet current needs. We aim to reassess and extend the Fundamental Theorem for the AI era in a manner that preserves its conceptual strength while broadening its applicability across an evolved and more complex biomedical ecosystem. This Viewpoint synthesizes empirical evidence and sociotechnical theory related to human-AI collaboration, learning health systems (LHS), learning public health systems (LPHS), AI governance, and systems science to contextualize the Fundamental Theorem within such contemporary frameworks. We argue that the unit of analysis of the Fundamental Theorem should shift from individuals and tools to adaptive sociotechnical systems spanning clinical care, public health, translational research, consumer engagement, and life sciences innovation. We propose an expanded theorem: A learning biomedical ecosystem that continuously optimizes human-AI collaboration will outperform humans or AI alone. This evolution builds directly upon Friedman's original theorem, reaffirming its human-centered foundation, while incorporating AI-enabled computation, adaptive learning, and systems-level integration across the modern biomedical enterprise.
To leverage the evaluation of the environmental impact of the Direct AP-HP/Lorah e-referral service offered in Paris (France) hospitals to create recommendations for evaluating the impact of digital health services. We review the tools and methods currently available to measure the carbon footprint and electricity consumption of digital services in the context of Life Cycle Assessment (LCA) for a comprehensive evaluation. We use the recent deployment of a telemedicine communication service at a major French hospital as a case study to understand the practical implications of conducting an impact study. Three deployment scenarios are considered: current usage, double usage, and maximum capacity. In all scenarios considered, the bulk of the carbon footprint of the Direct AP-HP/Lorah service is due to servers rather than network and user terminals. Computing hardware production was instrumental in the overall impact assessment, as embodied impacts represent 45% of the carbon footprint and most of the metallic resource depletion. Recommendations for further studies notably include adequate anticipation of service usage and data collection. The environmental impact of the new telemedicine service could be assessed in sufficient detail to provide decision makers with an adequate comparison of the service with alternative email communication. The recommendations derived from this use case should facilitate adequate impact data collection for future studies.
To assess prevalence, characteristics, and institutional predictors of clinical informatics (CI) education and student organizations in US allopathic (MD) and osteopathic (DO) medical schools. We reviewed 222 US medical schools from the 2024 Medical School Admission Requirements (MSAR) and American Association of Colleges of Osteopathic Medicine (AACOM) databases. Using predefined criteria, we identified CI-related courses and student groups and abstracted institutional characteristics including affiliated CI fellowships. Bivariate and multivariable logistic regression identified predictors. Of 222 schools, 30.2% offered at least one CI course and 23.0% had a student group. In bivariate analyses, MD programs and institutions with CI fellowships were significantly more likely to offer courses (both P < 0.001). In multivariable analyses, MD program type was the strongest predictor (adjusted odds ratio [aOR]=6.58, 95% confidence interval [CI] 2.22-22.41), followed by CI fellowship presence (aOR = 3.36), private school status (aOR = 2.08), and class size (aOR = 1.01). All 51 student groups were at MD institutions, and urban setting was associated with group presence (P = 0.034). The association with CI fellowships suggests a relationship between graduate and undergraduate medical education institutions. The association with MD programs could be influenced by differing curricular demands and flexibility. The association with urban settings may reflect the role of local innovation ecosystems. CI educational opportunities vary across US medical schools, concentrated at MD programs and institutions with GME-level infrastructure. These findings establish a baseline for CI opportunities and highlight the need to understand whether institutional differences translate to measurable competency gaps.
To identify research topics in medical chatbots and analyze their temporal trends, geographic distributions, and journal preferences. Latent Dirichlet Allocation (LDA) topic modeling was applied to 9,650 publications (1986-2024), extracting eight core topics integrated with time-series analysis, geographic statistics, and journal associations. Eight topics were identified. Temporal trends revealed three phases: technology incubation (2000-2015), rapid breakthrough (2015-2020), and application consolidation (2020-2024). Geographically, the United States, China, and the United Kingdom dominated research output (46.0%). Journal analysis highlighted Journal of Medical Internet Research (JMIR) (7.1%), Journal of the American Medical Informatics Association (JAMIA) (5.4%), and IEEE Journal of Biomedical and Health Informatics (IEEE JBHI) (3.2%) as top contributors, with JMIR and JAMIA reinforcing clinical informatics and digital therapeutics. Research on medical chatbots needs to balance technical feasibility and clinical value. Future research should focus on three directions: developing validation frameworks for leveraging large language models (LLMs) in clinical applications, establishing transnational data-sharing infrastructure, and creating ethical governance mechanisms that ensure responsible innovation while maintaining health equity.
The automation of medical report generation using large language models (LLMs) could significantly reduce physicians' documentation burden while enhancing healthcare efficiency. However, the misuse of generative artificial intelligence in medical reporting can lead to important safety risks for patients. We addressed 2 questions: (1) What is the quality of medical reports generated by LLMs in English and French? and (2) Can we distinguish between human-written and LLM-generated medical reports? We evaluated the quality of reports generated by several multilingual, open-weight LLMs using text similarity metrics on 4212 medical reports in English and French across multiple specialties. A bilingual expert panel of certified physicians (n = 4) and medical residents (n = 5) scored accuracy, fluency, and completeness of generated reports using a 1-5 Likert scale. Experts also completed a Turing-like test, blindly identifying reports as human or machine-generated. Phi-4 achieved the best overall performance (ROUGE-1: 0.70, BERTScore: 0.83). Expert evaluation confirmed high-quality reports in both languages (overall 4.6/5.0). Medical experts performed better than chance but struggled to differentiate human versus machine reports (accuracy: 0.60). Automatic classifiers showed strong performance (accuracy: 0.98). The high quality of LLM-generated reports supports their potential to enhance healthcare efficiency in multilingual settings. However, the discrepancy between human detection difficulty and automated detection success reveals inherent limitations in relying solely on human oversight for quality assurance and misuse prevention. Deployment of LLMs for medical reporting requires combining automated detection tools with human expertise to ensure patient safety. Dataset and code: https://github.com/ds4dh/medical_report_generation.
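To make the ROUGE-1 figure above more concrete, the following is a minimal sketch of a unigram-overlap F1 score. It assumes lowercase whitespace tokenization with no stemming, which is a simplification; the function name `rouge1_f` and the example sentences are hypothetical, and the study's actual metric implementation is not specified in the abstract.

```python
from collections import Counter


def rouge1_f(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between two texts.

    Assumption: lowercase whitespace tokens, no stemming or
    stopword handling (standard toolkits may differ).
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each token counts at most as often as it
    # appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Illustrative (invented) sentences: 4 of 5 unigrams are shared.
human = "patient admitted with chest pain"
generated = "patient admitted for chest pain"
score = rouge1_f(human, generated)
```

With these two invented sentences, precision and recall are both 4/5, so the score is 0.8.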
We report on findings from a meeting convened by the American College of Medical Informatics (ACMI) to characterize aspects of the patient experience that could be improved using informatics. The American College of Medical Informatics fellows were invited to share their experiences as patients and suggest informatics approaches that may improve the patient experience. We identified 4 themes: (1) getting the right care, (2) data sharing and data interoperability, (3) guiding low-cost evaluations, and (4) predictive analytics. Despite widespread adoption of health IT, patient experiences remain far from optimal. The American College of Medical Informatics fellows identified informatics approaches, applications, and research areas that have the potential to improve patient experiences with health care systems.
Despite the significant potential for Clinical Decision Support Systems (CDSSs) to improve care processes and health outcomes, several barriers hinder their widespread implementation in healthcare. While numerous systematic reviews have summarized potential barriers and facilitators for CDSS implementation, a comprehensive framework to guide and evaluate the implementation of CDSSs in healthcare is lacking. This overview of reviews aims to establish a framework, GUIDE-CDSS, for guiding and evaluating the implementation of CDSSs in healthcare. An overview of systematic and scoping reviews was conducted by searching 6 databases. Systematic reviews or scoping reviews that used qualitative research methods to describe implementation determinants for CDSSs in the healthcare domain were included. The AMSTAR 2 tool was used to assess the methodological quality. Results were collated into the GUIDE-CDSS framework. This framework describes implementation determinants and elements within those determinants found to impact implementation of CDSSs in healthcare. Twenty-three reviews were included in the analysis. All reviews had at least 2 critical weaknesses, showing a limited methodological quality of included reviews. Eight determinants and 38 elements for implementation of CDSSs in healthcare were described in the GUIDE-CDSS framework: perceived relevance, perceived effect, trustworthiness, ease of use, workflow, training and skills, resources, and implementation strategy. This overview provides a comprehensive synthesis of the determinants influencing the implementation of CDSSs in healthcare, collated in the GUIDE-CDSS framework. The findings underscore that successful CDSS development, implementation, and evaluation are multifactorial. This study was registered in PROSPERO (No. CRD42024512455).
To clarify the disciplinary foundations and internal structure of biomedical informatics (BMI). We analyze BMI's emergence at disciplinary intersections and map its internal structure across 4 domains: theory and practice of knowledge discovery, knowledge representation and reasoning, knowledge architecture, and knowledge-driven transformation. We compare BMI with mathematics, computer science, biostatistics, and biomedical engineering, and illustrate emergent characteristics through a precision medicine example. BMI's distinctive contribution, elucidating the structure of biomedical knowledge and developing methods to discover, preserve, and make knowledge actionable, requires strength across all 4 domains. BMI developed these domains pragmatically: building systems, extracting principles, and formalizing theories. The discipline must now complement empirical approaches with rigorous theoretical work: assessing adequacy of existing theories, identifying gaps, and orchestrating collaborative development. BMI creates emergent capabilities across disciplines. As biomedicine becomes increasingly complex, BMI must strengthen its theoretical foundations while demonstrating transformative potential of knowledge spanning biological scales and time.
Dr. Kevin B. Johnson delivered this address on May 16, 2026, at the Commencement Ceremony of The D. Bradley McWilliams School of Biomedical Informatics, UTHealth Houston, to the graduating class of 2026. The address uses the concept of "The Big Mo" (compounding momentum) as a frame for understanding the current inflection point in AI and medicine. Drawing on his own career arc from paper-based clinical practice at Johns Hopkins through early adoption of health informatics to the present era of AI in healthcare, Johnson argues that the fears graduates hold about technological obsolescence and institutional instability are real but misdirected. He reframes both: biomedical informatics professionals are not targets of AI but its essential architects, and the external environment has always been uncertain for those doing important work. His charge to graduates is singular: stay on the wave.
To enhance the accuracy, interpretability, and robustness of large language models (LLMs) in medical question answering (MedQA). We designed a multi-agent peer-reviewed reasoning method in which multiple LLM agents independently generate chain-of-thought (CoT) reasoning with candidate answers, then act as peer reviewers to evaluate each other's reasoning for factual correctness and logical soundness. The highest-rated reasoning chain is selected to produce the final answer. Experiments were conducted with 5 state-of-the-art LLMs (Llama-3.1-8B, Qwen2.5-7B, Phi-4, DeepSeek-LLM-7B, and GPT-oss-20B) on 3 benchmark datasets: HeadQA, MedQA-USMLE, and PubMedQA. Performance was compared against single-model CoT reasoning and CoT-based majority voting. Peer-reviewed reasoning consistently outperformed both baselines. The best model combination achieved an average accuracy of 0.820 across datasets, exceeding the strongest single model (0.777) and majority voting ensembles (up to 0.789). The method also scaled effectively with more participating models, while peer assessments reliably distinguished high- from low-quality reasoning chains. The proposed multi-agent peer-reviewed reasoning method enables LLMs to act as both solvers and evaluators, yielding superior performance in MedQA. By emphasizing reasoning quality rather than answer agreement alone, this approach improves accuracy, interpretability, and robustness, offering a promising direction for trustworthy biomedical AI systems.
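The solve, review, and select procedure described above can be sketched as follows. This is an illustrative outline only: the `Solver` and `Reviewer` callables stand in for LLM calls, the exclusion of self-review and the use of a mean score are assumptions, and none of the names come from the study.

```python
from statistics import mean
from typing import Callable, Dict, Tuple

# Hypothetical interfaces (assumptions, not the paper's API):
#   Solver:   question -> (reasoning chain, candidate answer)
#   Reviewer: (question, reasoning chain) -> numeric quality score
Solver = Callable[[str], Tuple[str, str]]
Reviewer = Callable[[str, str], float]


def peer_reviewed_answer(
    question: str,
    solvers: Dict[str, Solver],
    reviewers: Dict[str, Reviewer],
) -> Tuple[str, str, float]:
    """Return (winning agent, its answer, its mean peer score)."""
    # Step 1: every agent independently produces a chain and an answer.
    proposals = {name: solve(question) for name, solve in solvers.items()}

    # Step 2: each chain is scored by the other agents.
    # Assumption: an agent does not review its own chain.
    mean_scores = {}
    for author, (chain, _answer) in proposals.items():
        scores = [
            review(question, chain)
            for reviewer_name, review in reviewers.items()
            if reviewer_name != author
        ]
        mean_scores[author] = mean(scores)

    # Step 3: the highest-rated chain determines the final answer.
    best = max(mean_scores, key=mean_scores.get)
    return best, proposals[best][1], mean_scores[best]
```

Unlike majority voting, the selection here depends on the score given to each reasoning chain rather than on how many agents agree on an answer, so a minority answer with a well-rated chain can win. At least two agents are needed so that every chain receives at least one peer score.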
Artificial intelligence (AI) is increasingly prevalent. Patients and clinicians may use AI-based tools in many different languages. We investigated AI translation tools for descriptions of genetic conditions and how AI identification of genetic conditions is affected by translations. We used neural machine translation (NMT) and large language model (LLM) translation to translate descriptions of 40 genetic conditions into 191 and 93 languages, respectively. Excluding translations retaining English medical terms verbatim, we focused on 139 and 70 languages, respectively. After assessing translations, we assessed the ability of 3 proprietary and 3 open-weight general LLMs to identify conditions in the translations. We analyzed how accuracy was affected by the conditions' prevalence in the literature and by attributes of the languages (the script, language family, and prevalence of the language in training sources). We also investigated adaptive translation for select languages. We found significant differences in condition identification based on the translation method, condition, language, and prediction model. The accuracy of some models was more affected than others by factors like the conditions' literature prevalence, language script, family, and language prevalence. Adaptive translation for select languages did not improve translations or diagnostic accuracy with the 3 tested LLMs. However, further analysis with 1 language showed that this approach was more effective with smaller LLMs. AI-based translation has variable performance, which can affect the ability of AI models to recognize genetic conditions. These findings should inform safe medical AI use to support consistent performance in different languages.
Phase II of MVP-CHAMPION, a federal collaboration between the Veterans Affairs Healthcare System (VA) and the Department of Energy (DoE), leveraged large-scale clinical, geo-spatial, and genetic data with state-of-the-art artificial intelligence (AI) and high-performance computing (HPC) to improve value in healthcare. Eight clinical priority projects for which AI was a critical missing capability were initiated to address: lung cancer screening (MVP 061), suicide risk screening (MVP 062), cardiovascular risk in obstructive sleep apnea (MVP 063), checkpoint inhibitor toxicity (MVP 064), heart failure (MVP 065), renal complications in diabetes (MVP 066), post COVID-19 sequelae (MVP 067), and antipsychotic medication toxicity (MVP 068). Building on a strong regulatory and administrative foundation, we developed multimorbidity-aware analytic frameworks, reusable computational tools, and analytic pipelines. These greatly facilitated identification of novel risk factors, including genetic variants, and specification of more discriminating prediction models. Novel genetic risk factors are informing development and repurposing of medications, and discriminating prediction models promise to improve healthcare value. The research foundation developed in Phase I and extended in Phase II of MVP-CHAMPION has supported an unprecedented federal collaboration and yielded significant scientific advances. Our clinical findings are poised for near-term application, while advances in machine learning and high-performance computing may accelerate the broader adoption of artificial intelligence in healthcare. This maturing VA-DoE federal collaboration is poised to transform the future of Veterans' healthcare and the broader national landscape of precision health.
This address was delivered by Eric Horvitz, MD, PhD, at the 2026 graduation ceremony of Columbia University School of Nursing on May 19, 2026, where he received the Second Century Award for Excellence in Health Care. The address considers the responsibilities of clinicians in shaping the future of artificial intelligence in medicine. It frames health care as an "open world," where information is incomplete, time is limited, and decisions are made under uncertainty. As AI transforms biomedicine and clinical care, the address emphasizes the importance of clinician engagement in guiding how these technologies are developed and used, and calls for systems that strengthen clinical judgment, support care teams, and advance human health, dignity, connection, and trust.
To evaluate the clinical applications and translation readiness of model-based synthetic tabular data in healthcare, and identify gaps in governance reporting that may hinder translation. We systematically searched Ovid MEDLINE and Embase (2010-August 2025; PROSPERO: CRD42025635514) for studies that generated and applied model-based synthetic tabular data in clinical contexts. Screening used a "human-in-the-loop" large language model workflow alongside independent manual review, achieving 100% sensitivity for included studies. Unlike prior reviews focused primarily on evaluation methodology, we mapped use-cases and deployment paradigms, and audited translation-readiness reporting using a predefined governance framework (validation depth, privacy, fairness, regulatory alignment). Thirty-seven studies (2019-2025) were included. GANs predominated; other approaches included VAEs, diffusion models, LLM-based synthesis, and Bayesian networks. Dataset augmentation was the primary application, often improving downstream model performance for rare outcomes. Emerging applications included synthetic control cohorts and algorithmic bias mitigation. Translation-readiness reporting was limited: 34/37 studies (92%) relied solely on internal validation, 9/37 (24%) used formal privacy models, 6/37 (16%) reported explicit fairness evaluations, and 6/37 (16%) addressed regulatory alignment. Few studies distinguished "no-release" from "delayed-release" paradigms. A systemic gap exists between methodological innovation and deployment-readiness reporting. Model-based synthetic data show clear value for augmentation and class balancing, but inconsistent reporting of validation, privacy, fairness, and regulatory considerations limits confidence in clinical deployment. 
We propose TRUST-SD (Transparency and Reporting for Utility, Safety, and Translation of Synthetic Data), an author-derived, preliminary, evidence-informed reporting checklist spanning 7 domains, as a starting point for community refinement and consensus-building.
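The translation-readiness proportions reported for the 37 included studies can be rechecked with a few lines of arithmetic. The sketch below uses only the counts given in the abstract; the dictionary keys and the helper name `pct` are illustrative labels.

```python
TOTAL_STUDIES = 37

# Counts as reported in the abstract.
reported_counts = {
    "internal validation only": 34,
    "formal privacy model": 9,
    "explicit fairness evaluation": 6,
    "regulatory alignment addressed": 6,
}


def pct(count: int, total: int = TOTAL_STUDIES) -> int:
    """Percentage of studies, rounded to the nearest whole number."""
    return round(100 * count / total)


# Maps each reporting item to its rounded percentage of the 37 studies.
percentages = {item: pct(n) for item, n in reported_counts.items()}
```

The rounded values (92%, 24%, 16%, and 16%) match the percentages stated in the abstract.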
To evaluate and compare the performance of large language models (LLMs) in identifying contributing factors (CFs) underlying patient safety incident investigations. Four open-source, lightweight LLMs (BERT, LLaMA2, GPT2, and Phi-2) were applied to classify CFs across 6 sociotechnical system-levels encompassing 12 categories (eg, person, task, and organizational factors). Reports of real-world patient safety investigations from public health systems were extracted and labelled by domain experts (300 reports containing 1338 CFs). Data were split into training (n = 852), validation (n = 98), and test sets (n = 388). Performance was evaluated using specificity, precision, recall, and F1 scores. The fine-tuned encoder-based BERT model achieved the highest performance, with a micro-averaged F1 score of 63.6%, outperforming all decoder-based models. Among the decoder models, Phi-2 demonstrated the strongest performance (F1 = 54.9%), exceeding both LLaMA2 and GPT2. BERT performed consistently across 6 system-levels but often misclassified "organization" as "person". LLMs hold promise for automating the extraction of CFs from complex safety narratives, particularly for frequently reported system-levels such as "person" and "tasks". Such automation may substantially reduce the manual effort required to analyse reports of patient safety investigations while supporting more consistent analysis across large incident datasets. Applying LLMs to analyse the underlying causes of patient safety incidents depends on developing high-quality, domain-specific datasets that enhance the representation of patient safety knowledge and improve model understanding of incident causation. Improving data coverage for rare system-levels is essential to address the current limitations of LLMs in capturing nuanced patient safety concepts and domain-specific reasoning.
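For readers less familiar with the headline metric, here is a minimal sketch of a micro-averaged F1 score, which pools true positives, false positives, and false negatives over all items and labels before computing precision and recall. The abstract does not state whether classification was single-label or multi-label, so the set-based form below is an assumption that covers both cases; the function name and example labels are illustrative.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 over per-item label sets."""
    tp = fp = fn = 0
    for gold_labels, pred_labels in zip(gold, pred):
        tp += len(gold_labels & pred_labels)  # correctly predicted labels
        fp += len(pred_labels - gold_labels)  # predicted but not in gold
        fn += len(gold_labels - pred_labels)  # in gold but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented example: the third item shows an "organization" factor
# predicted as "person", the confusion pattern noted in the abstract.
gold = [{"person"}, {"task"}, {"organization"}, {"person"}]
pred = [{"person"}, {"task"}, {"person"}, {"person"}]
score = micro_f1(gold, pred)
```

In this invented example there are 3 true positives, 1 false positive, and 1 false negative, giving a micro-averaged F1 of 0.75. Because counts are pooled, frequent labels dominate the score, which is one reason performance on rare system-levels can be masked.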
Alert fatigue is defined as alert dismissals due to excessive or irrelevant alerts and is frequently cited as a barrier to clinical decision support system use and impact. However, the criteria for determining the presence or absence of alert fatigue are poorly defined. The objective of this systematic review of systematic reviews was to identify operationalized definitions and measures of alert fatigue or alert-related metrics. Systematic reviews reporting at least one alert-related metric or measure/operationalization of alert fatigue for physician-directed electronic alerts were included. The Cochrane Library, Embase, and PubMed were searched from database start to 2024. The Revised Assessment of Multiple Systematic Reviews was used to assess study quality and risk of bias. Data were synthesized narratively and with descriptive statistics. A total of 22 studies were included in the review. Studies reported between 1 and 11 alert metrics. Studies were most often of medium quality. Reporting of primary study characteristics was frequently judged to be insufficient. Only one article reported an operational definition of alert fatigue. The most common alert metrics were quantity, override rate, and acceptance rate. Alert fatigue measurement methods are not clearly or consistently defined in systematic reviews related to alert fatigue in clinical decision support. Reporting of other primary study characteristics is often limited. We recommend that future efforts use a significant, sustained decrease in appropriate alert response rates from an established baseline as a measure of alert fatigue.
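The recommended measure, a sustained decrease in appropriate alert response rates from an established baseline, could be operationalized along the following lines. This is a rough sketch under stated assumptions: the fixed `min_drop` threshold stands in for a formal test of statistical significance, `min_periods` is an arbitrary choice for "sustained", and the function names are hypothetical.

```python
from typing import Sequence


def response_rate(appropriate_responses: int, alerts_fired: int) -> float:
    """Share of alerts that received an appropriate response."""
    return appropriate_responses / alerts_fired if alerts_fired else 0.0


def sustained_decrease(
    baseline_rate: float,
    period_rates: Sequence[float],
    min_drop: float = 0.10,   # assumed threshold, not from the review
    min_periods: int = 3,     # assumed length of a "sustained" run
) -> bool:
    """True if the rate stays at least `min_drop` below baseline
    for `min_periods` consecutive periods."""
    run = 0
    for rate in period_rates:
        if baseline_rate - rate >= min_drop:
            run += 1
            if run >= min_periods:
                return True
        else:
            run = 0  # recovery breaks the run
    return False


# Invented monthly rates against a baseline of 0.60.
flagged = sustained_decrease(0.60, [0.58, 0.49, 0.48, 0.45])
transient = sustained_decrease(0.60, [0.49, 0.58, 0.48, 0.45])
```

In the first invented series the rate stays well below baseline for three consecutive periods, so it is flagged; in the second, a recovery in the second period resets the run and the series is not flagged. A real analysis would replace the fixed threshold with an appropriate statistical test.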
Stigma impacts outcomes across stigmatizing conditions, including substance use disorders (SUDs). Recent policy changes give patients rapid access to clinical notes in the electronic health record (EHR), which may include stigmatizing language. The objective of this study was to assess the perspectives of women with a history of pregnancy and SUDs on typical language used in clinical notes. Women with a history of pregnancy and SUD were recruited through an online crowd-sourcing platform. Respondents viewed examples of clinical language and answered survey questions about perceived stigma. An inductive approach was used to analyze open-text responses, and themes were developed. Three hundred seventy survey respondents wrote a response to at least one open-text question. Thematic analysis yielded 4 major themes: (1) anticipation of future stigma facilitated by EHR documentation can affect patients' care decisions for themselves and their babies; (2) documented SUD history could have short- and long-term effects on patients' experience of stigma and discrimination, especially in labor and delivery; (3) phrases using "denies" and quotes within quotation marks could be perceived as stigmatizing and decrease trust in providers; (4) nonstigmatizing language and acknowledgement of recovery in notes can facilitate positive experiences for patients, but patients want more acknowledgement of recovery and positive language. Electronic health record documentation can modulate stigma experiences for women during and after pregnancy through stigmatizing language in clinical notes and facilitating discrimination, decreasing trust in providers and negatively impacting health outcomes. Raising providers' awareness of nonstigmatizing and positive language or implementing technology to prompt nonstigmatizing terminology could contribute to positive experiences among women with a history of pregnancy and SUD.
Health systems undertaking electronic health record (EHR) transitions often struggle to prepare and support clinicians in learning and using the new system. We evaluated a national peer coaching program-the National EHRM Supplemental Staffing Unit (NESSU)-designed to support clinicians during the U.S. Department of Veterans Affairs' (VA's) transition to a new EHR. Our goal was to assess NESSU's reach, perceived usefulness, and association with key EHR user outcomes, and to characterize how NESSU achieved its observed impacts. Using a convergent mixed-methods design, we surveyed EHR users at the most recent VA facility to implement the new EHR. Descriptive statistics summarized program reach and perceived helpfulness. Regression models assessed associations between NESSU participation and 3 outcomes: burnout, EHR-related stress, and EHR confidence. Qualitative data included 62 interviews with users and open-ended survey responses. We used structured coding and thematic analysis to identify themes. Among 385 respondents, 58.4% reported receiving NESSU support and 83.6% of those rated it as helpful. NESSU participation was associated with lower rates of burnout (29% vs 41%, P = 0.016) but not with differences in EHR confidence or EHR-related stress. Qualitative analysis yielded 4 themes describing how NESSU functioned (filling education gaps, providing responsive support, offering expert guidance, and drawing upon notable interpersonal skills) and one theme describing its overall impact. Findings demonstrate that peer coaching can address important support needs during EHR transitions. Scalable, clinician-led peer coaching may represent an essential component of large-scale EHR transitions, supporting both implementation and clinician well-being.
To evaluate if a single-subject study (S3) design, utilizing paired transcriptome samples from the same patient (eg, "sepsis" vs "recovered"), can replicate transcriptomic signatures from small case-control studies, addressing challenges in patient accrual for rare or sub-stratified diseases. We generated a sepsis gene signature (SGS) comprising 300 differentially expressed genes (DEGs; FDR < 5%) from a human sepsis case-control cohort using general linear models (GLMs). Reproducibility of SGS was assessed through three approaches applied to sub-sampled independent datasets: single-subject analyses (N-of-1-MixEnrich), anticipated to perform better; conventional paired-sample GLM analyses; and a traditional case-control GLM analysis. SGS reproducibility in GLM analyses was inconsistent at smaller cohort sizes (∼80% reproducibility; n = 5) but stabilized at cohort sizes >6. Remarkably, the single-subject-study approach consistently reproduced SGS in each of the 18 subjects individually (100% reproducibility; n = 1). Conventional GLMs are not designed for single-subject or small cohort analyses due to their dependence on larger samples to mitigate variable dispersion and human heterogeneity. In contrast, S3 methods enhance statistical power by: reducing multiple testing through gene set aggregation, emphasizing concordant changes in pathway activity rather than exact molecular consistency, and exploiting paired samples from the same individual. This proof-of-concept demonstrates that S3 designs effectively validate gene expression signatures derived from case-control studies, highlighting their potential in research or clinical trials constrained by small sample sizes. However, further validation and computational simulation are needed to demonstrate scalability to other conditions and sensitivity to validation subject variations from the "average subject" of discovery cohorts.
Generative information extraction using large language models (LLMs), particularly through prompting combined with few-shot learning, has become a popular method. In many ways such prompts with examples resemble the annotation guidelines long used for manual labeling of data for information extraction, and indeed studies have demonstrated the direct use of these guidelines as effective prompts. However, constructing annotation guidelines is both labor- and knowledge-intensive. Instead, this paper proposes to leverage LLMs' impressive ability to automatically create such annotation guidelines. Specifically, we propose a zero-shot hierarchical prompt engineering method that harvests the knowledge summarization and text generation capacity of LLMs to synthesize annotation guidelines to improve downstream LLMs while requiring minimal human input. Zero-shot clinical named entity recognition benchmarks (2012 i2b2 EVENT, 2012 i2b2 TIMEX, 2014 i2b2, and 2018 n2c2) showed improvements of 0.2% to 25.86% for Llama 3.1 and 5.82% to 16.13% for GPT-OSS in strict F1 scores from the no-guideline baseline. The LLM-synthesized guidelines showed equivalent or better performance compared to human-written guidelines by 0.23% to 10.00% in most tasks. LLMs generate high-quality annotation guidelines following a consistent pattern (eg, title, entity types, examples) without human guidance, indicating that a representation of such a concept has been encoded during the pre-training. Nuances in definitions, however, still require adjustment by researchers to align with the project. This study proposes a novel hierarchical prompt engineering method that requires minimal knowledge transfer from a human expert and is applicable to multiple biomedical domains.