Artificial intelligence (AI) is reshaping clinical practice and redefining the competencies future physicians will need. International bodies, such as the Association of American Medical Colleges, have called for structured AI training in medical curricula. Despite growing international consensus, no systematic nationwide evaluation had been conducted in Spain prior to this study. This study aimed to characterize the presence, type, and curricular features of AI-related training across all Spanish universities offering an official medical degree and to assess differences by institutional ownership and geographic region. This cross-sectional study was conducted from July to September 2025. Universities were the unit of analysis. A census of all institutions offering an officially recognized medical degree was obtained from the Register of Universities, Centers and Degrees; all 52 eligible institutions were included. Publicly available curricula and course guides for the 2025-2026 academic year were reviewed by 2 independent researchers and validated by an external evaluator. Courses were classified as (1) a specific AI course (AI as primary topic, accounting for >50% of syllabus), (2) an AI-similar course (a digital health or biomedical informatics course referencing AI as secondary content), or (3) not AI-related training. Course-level variables included ownership (public or private), region, status (compulsory or elective), European Credit Transfer and Accumulation System (ECTS) credits, academic year, and department. All analyses were descriptive. Potential sources of bias were addressed through predefined classification criteria, duplicate independent extraction, and external dataset verification. Of 52 universities, 36 (69.2%) were public and 16 (30.8%) were private. A total of 10 (19.2%) institutions offered at least one specific AI course; 6 (11.5%) included an AI-similar course. 
Overall, 16 (30.8%) universities had incorporated AI in some form; 36 (69.2%) institutions had not incorporated AI. Rates were similar for public (7/36, 19.4%) and private institutions (3/16, 18.8%). Identified courses ranged from 3 to 6 ECTS credits, representing an average of 1.17% of the 360-credit degree; most were elective. Only the University of Jaén offered a compulsory course with AI content. Marked regional disparities were observed: Andalusia led with 5 of 9 (55.6%) universities offering a specific AI course, while 10 autonomous communities had no universities with any AI-related training. This study delivers the first census-based, reproducible, national assessment of AI integration in Spanish undergraduate medical education. Unlike prior work focused on individual programs or nonstandardized definitions, we applied a consistent taxonomic framework reusable for longitudinal monitoring and international benchmarking. Findings reveal a heterogeneous, predominantly elective, and low-weight curricular landscape with striking interregional inequities. These results inform curriculum reform, accreditation standards, and faculty development priorities and support the establishment of minimum national competency standards and systematic monitoring to ensure equitable AI literacy among future physicians in Spain.
Trustworthy artificial intelligence (AI) in health care requires assurance frameworks that translate ethical principles into measurable governance and evaluation practices. While a growing number of AI assurance frameworks have been proposed, they differ substantially in governance structure, institutional embedding, and implementation mechanisms, reflecting differences in intended purpose and use. To date, few studies have applied standardized, rubric-based evaluation criteria to systematically compare how assurance instruments with different institutional origins operationalize ethical principles across the AI lifecycle. This study aimed to develop and apply a structured, rubric-based evaluation instrument to compare 2 health care AI assurance instruments: the Coalition for Health AI (CHAI) responsible AI guide, a voluntary consortium-based instrument, and South Korea's Trustworthy AI guideline, a government-issued instrument. A 7-dimension evaluation rubric was developed based on a synthesis of established international AI assurance and governance instruments. The rubric covered core principles, AI lifecycle coverage, governance context, stakeholder breadth, operational maturity, instrument design and tools, and public accessibility. Seven independent evaluators with expertise in health care AI governance assessed each instrument using a 5-point ordinal rating scale (1=absent to 5=comprehensive). Each evaluator independently scored the materials using a standardized rubric. Discrepancies were resolved through structured consensus discussions, with reference to rubric definitions and source documents. Final scores were determined based on documented evidence, requiring full consensus rather than averaging. Interrater reliability was assessed using Fleiss kappa. Both instruments demonstrated strong alignment in core principles (CHAI: 4; Trustworthy AI Guideline: 5) and stakeholder breadth (both: 4). 
The government-issued Trustworthy AI Guideline exhibited broader AI lifecycle coverage (5 vs 4), a more formalized governance context (5 vs 3), and higher operational maturity (4 vs 2), reflecting stepwise oversight and formal embedded oversight mechanisms supported by legislation. In contrast, the voluntary CHAI instrument demonstrated greater emphasis on instrument design and implementation tools (4 vs 3) and higher public accessibility (5 vs 3), driven by open-access resources such as assurance standards guides and applied model cards. Interrater agreement of independent ratings was moderate to substantial (Fleiss kappa=0.47-0.64; P<.001), indicating consistent scoring patterns among evaluators. This comparative analysis indicates that voluntary and government-issued AI assurance instruments operationalize trustworthy AI principles in distinct but complementary ways. Voluntary instruments emphasize flexible tools and accessible implementation resources, while government-issued guidelines embed assurance functions within formal governance and oversight structures. Rather than representing competing models, these approaches address different assurance needs across the AI lifecycle. By identifying concrete areas of alignment and divergence, this study supports a more coherent comparison of assurance practices and highlights potential opportunities for alignment across documentation structures and evaluation approaches that can support safe, equitable, and scalable deployment of health care AI across diverse institutional contexts.
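As an illustration of the interrater statistic named in the preceding abstract, the following is a minimal sketch of the standard Fleiss kappa calculation for a fixed number of raters per item. The function name and the demo ratings are hypothetical and are not the study's data; the abstract does not describe how its kappa values were computed beyond naming the statistic.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss kappa for items each rated by the same number of raters.

    ratings: list of items, each a list of category labels (one per rater).
    categories: iterable of all possible category labels.
    """
    categories = list(categories)
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]

    # Observed agreement: mean over items of the proportion of agreeing rater pairs.
    p_items = [
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement: sum of squared overall category proportions.
    p_cat = [sum(c[k] for c in counts) / (n_items * n_raters) for k in categories]
    p_expected = sum(p ** 2 for p in p_cat)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical demo: 3 rubric dimensions, 7 raters, 5-point scale (not study data).
demo = [
    [4, 4, 4, 5, 4, 4, 3],
    [5, 5, 5, 5, 4, 5, 5],
    [2, 3, 2, 2, 2, 3, 2],
]
kappa_demo = fleiss_kappa(demo, categories=range(1, 6))
```

Fleiss kappa treats the 5 scale points as unordered categories, so near misses (a 4 versus a 5) count the same as large disagreements; that is a property of the statistic rather than of this sketch.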
Artificial intelligence (AI) has demonstrated strong potential in breast cancer diagnostics by improving accuracy, efficiency, and clinical workflow. However, adoption among physicians remains variable. Existing research often overlooks the contextual and experiential differences between clinicians who use AI and those who do not. A comprehensive understanding of barriers and facilitators, especially across user groups, is essential to inform equitable and effective AI implementation in real-world settings. This study aimed to (1) identify key barriers and facilitators influencing the use of AI tools in breast cancer diagnostics, with a specific focus on comparing current users and nonusers, and (2) examine how social, technological, and individual-level factors are linked to physicians' attitudes toward AI, intention to use it, and perceived likelihood of future adoption. A cross-sectional, embedded mixed methods survey was conducted with 46 Austrian physicians. Quantitative items were based on the technology acceptance model and its extensions. Open-ended responses were analyzed using conventional content analysis and integrated with quantitative results via joint displays. Ordinary least squares regressions examined factors associated with attitudes, intention, and the likelihood of future AI use. Among the 46 participating physicians, 52% (n=24) reported current AI use. Common facilitators included improved quality of work, efficiency, and expanding knowledge. Nonusers highlighted barriers such as limited access (17/21, 81%), high costs, and lack of training. AI users highlighted barriers related to limited integration with existing systems and concerns about trust. Despite these differences, both groups expressed strong future adoption intentions. 
Perceiving multiple facilitators was significantly associated with more favorable attitudes (B=0.83; P=.02), stronger intention to use AI (B=1.32; P=.01), and higher perceived likelihood of future use (B=1.56; P=.001). AI-related skills positively predicted intention (B=1.00; P=.04) and likelihood of future use (B=1.16; P=.01), while colleagues' positive views about AI predicted both attitudes (B=0.34; P=.02) and intention (B=0.39; P=.01). In contrast, perceiving multiple barriers was associated with lower intention (B=-0.84; P=.047) and likelihood (B=-1.48; P<.001). Being aged 50 or older was significantly associated with more negative attitudes (B=-1.11; P=.002) and lower likelihood of future use (B=-0.82; P=.02). This study offers preliminary insights into the implementation of AI in breast cancer diagnostics within the Austrian health care context. AI adoption appears to be a staged process with evolving support needs. Early-stage users may benefit from improved access and training, while experienced users require support for workflow integration and trust-building. Promoting peer support, addressing demographic disparities, and embedding AI training into clinical routines may support more sustainable and equitable adoption. These findings inform tailored implementation strategies and offer recommendations that may be transferable to other health systems.
Explainer videos are widely used in higher education. With the increasing availability of artificial intelligence (AI)-generated avatars, it remains unclear whether the presentation format (human presenter vs AI avatar) affects learning outcomes and user experience, especially in technologically complex fields. This study aimed to assess the feasibility of a randomized crossover design to investigate learning gain and user experience associated with content-identical explainer videos delivered by either an AI-generated avatar or a human presenter. Exploratory analyses examined the potential differences between the presentation formats. An observer-blinded, prospective randomized crossover feasibility study was conducted with 13 undergraduate engineering students. Participants viewed 2 content-identical explainer videos on fuel cell technology presented by either an AI-generated avatar or a human presenter in a randomized sequence. Learning gains were recorded using a 7-item knowledge test administered at baseline and after the first and second video presentations. User experience was assessed after each video by using the AttrakDiff2 questionnaire. Because there was no washout period and the instructional material was identical in both videos, the second learning phase was vulnerable to carryover and test-retest effects. Consequently, analyses of learning outcomes focused on the initial phase, whereas user experience was examined through pooled comparisons across both conditions. Both presentation formats were associated with substantial short-term learning gains. The difference in the learning gain between the AI avatar and human presenter videos was not statistically significant (median newly correct items 5, IQR 3-5.5 vs 4.5, IQR 2.5-5; P=.51; Z=0.66; r=0.183). In contrast, user experience ratings were consistently higher for the human-presented video across all AttrakDiff2 dimensions, with small to medium effect sizes. 
The AI avatar presentation was generally perceived as neutral. This study shows that investigating AI-based explainer videos vs those using a human presenter in classroom settings is feasible and highlights methodological challenges, particularly those related to crossover designs involving content-identical materials. In this small exploratory sample, no significant differences in short-term learning gains were detected between different presentation formats. Nonetheless, participants clearly preferred human presenters in terms of user experience. These results should not be seen as proof of equivalence but rather as a foundation for future research with larger sample sizes, improved study designs, and more sensitive outcome measures.
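The effect size reported with the learning-gain comparison above (Z=0.66; r=0.183) is consistent with the common convention r = Z/sqrt(N) applied to N=13 participants. The sketch below assumes that convention; the abstract does not state which formula was used, and the function name is hypothetical.

```python
import math


def effect_size_r(z: float, n: int) -> float:
    """Effect size r from a standardized test statistic, using r = Z / sqrt(N).

    Assumption: N is the total number of observations entering the test.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return z / math.sqrt(n)


# Values taken from the abstract: Z=0.66 with 13 participants.
r_learning_gain = effect_size_r(0.66, 13)
print(round(r_learning_gain, 3))  # 0.183
```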
Telehealth expansion and artificial intelligence (AI) adoption are often described as parallel dimensions of health system digital transformation. However, whether telehealth scale is associated with hospital AI adoption and whether this relationship varies across hospital settings remain unclear. This study examined the association of telehealth scale with clinical and operational AI adoption tiers in US hospitals and assessed whether these patterns differed by telehealth reporting behavior and geography. This cross-sectional study included 6173 US acute care hospitals using linked 2024 American Hospital Association Annual Survey and Information Technology Supplement data and 2023 Healthcare Cost Report Information System data. Telehealth scale was parameterized using log-transformed telehealth volume, a telehealth nonreporting indicator, and a reported-zero telehealth indicator. Clinical and operational AI adoption tiers were derived from hospital-reported AI capability items and classified into 3 tiers. Both outcomes were modeled using multioutput gradient-boosted tree classifiers, and model behavior was interpreted using Shapley additive explanations, partial dependence plots, and stratified analyses by the Core-Based Statistical Area category. Telehealth volume was the strongest predictor of both clinical and operational AI adoption tiers and had a larger contribution to the clinical AI model. Telehealth nonreporting was common, occurring in 57% (3521/6173) of hospitals, and was concentrated among hospitals in the lowest clinical AI adoption tier, accounting for 91.4% (3145/3441) of hospitals with no reported clinical AI adoption. Higher telehealth volume was associated with a steep increase in predicted clinical AI adoption tiers at lower telehealth volumes, followed by a plateau at higher volumes. 
At similar telehealth volumes, rural hospitals showed weaker telehealth-attributed contributions to predicted clinical AI adoption tiers than metropolitan hospitals. Supplementary analyses suggested that telehealth reporting status and telehealth intensity reflected related but distinct structural processes. Telehealth scale was strongly associated with hospital AI adoption tiers, especially clinical AI adoption tiers. These findings suggest that telehealth capacity may serve as a practical hospital-level marker of broader digital readiness for AI adoption, but the cross-sectional design does not establish whether telehealth expansion precedes or causes AI adoption. Hospitals with telehealth nonreporting and rural hospitals may face additional structural barriers that limit the translation of digital capacity into AI maturity. Policies to reduce inequities in hospital AI adoption may therefore need to pair telehealth expansion with implementation support, interoperability capacity, and organizational resources.
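The three-part telehealth parameterization described in the preceding abstract (log-transformed volume, a nonreporting indicator, and a reported-zero indicator) can be sketched as follows. This is an illustrative encoding under stated assumptions: the use of log(1 + volume), the choice to set the log term to 0 for nonreporting hospitals, and all names are hypothetical, since the abstract does not specify these details.

```python
import math
from typing import NamedTuple, Optional


class TelehealthFeatures(NamedTuple):
    log_volume: float   # log-transformed volume (assumed here: log(1 + volume))
    nonreporting: int   # 1 if the hospital did not report telehealth volume
    reported_zero: int  # 1 if the hospital explicitly reported zero volume


def encode_telehealth(volume: Optional[float]) -> TelehealthFeatures:
    """Encode raw telehealth volume so that missing and zero stay distinguishable."""
    if volume is None:
        # Missing report: flagged separately rather than treated as zero volume.
        return TelehealthFeatures(0.0, 1, 0)
    if volume < 0:
        raise ValueError("telehealth volume cannot be negative")
    if volume == 0:
        return TelehealthFeatures(0.0, 0, 1)
    return TelehealthFeatures(math.log1p(volume), 0, 0)


# Hypothetical hospitals: one nonreporting, one reporting zero, one reporting visits.
encoded = [encode_telehealth(v) for v in (None, 0, 2500)]
```

Keeping the two indicators separate lets a tree-based model split on reporting behavior independently of volume, which matches the abstract's observation that reporting status and telehealth intensity reflect related but distinct processes.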
Artificial intelligence (AI) is increasingly transforming health care through improvements in diagnosis, predictive analytics, and workflow optimization. However, there remains a significant gap in AI training within UK medical education, leaving future clinicians underprepared for AI-driven health care environments. This viewpoint paper investigates global best practices for AI integration into medical education and proposes a structured framework for embedding AI into the UK medical curriculum. It aims to assess current attitudes, highlight existing knowledge gaps, and recommend practical implementation strategies. An analysis of international case studies (eg, Stanford University, the University of Toronto, and the Chinese University of Hong Kong) was conducted alongside a review of teaching methodologies, stakeholder perspectives, and UK-based surveys to identify core competencies and challenges in AI education. Effective integration strategies include the use of AI-powered simulations, interdisciplinary collaboration, elective modules, and faculty training. Major barriers include a lack of AI-literate educators, insufficient ethical training, and limited infrastructure. Knowledge gaps persist among students and faculty in areas such as algorithmic bias, AI ethics, and clinical decision-making. To meet the demands of modern health care, the UK medical curriculum must adopt comprehensive AI training. This includes practical exposure, ethical awareness, and stakeholder engagement. Proactive reform will ensure that graduates are equipped to critically and ethically apply AI tools in clinical practice.
Feedback is essential for medical students' learning during clinical clerkships; yet, supervising physicians often struggle to provide meaningful written feedback due to time constraints. Large language models offer a promising approach to supplement human feedback, but how artificial intelligence (AI)-generated and human feedback differ in authentic clinical settings remains unclear, as most comparisons have been conducted in classroom or simulation contexts. The aim of the study is to examine how AI-generated feedback and supervisor-provided feedback differ when applied to medical students' clinical clerkship logs, by identifying the distinct characteristics and complementary strengths of each feedback type. This cross-sectional convergent mixed methods study included 161 weekly clinical clerkship logs from 47 fifth- and sixth-year medical students across 12 clinical departments at Nagoya University, Japan (January-May 2024). Of 164 eligible logs, 3 were excluded because supervisors entered contact messages rather than substantive feedback. AI feedback was generated using GPT-4o. In total, 10 faculty physicians and 10 medical students evaluated both feedback types in blinded, randomized order using a validated 5-category rubric (criteria-based, clear direction, accuracy, prioritization, and supportive tone), followed by open-ended comments and source identification. Quantitative analyses (paired 2-tailed t tests, cumulative link mixed-effects models; α=.05 with Bonferroni correction) were complemented by qualitative thematic analysis and integrated using joint display analysis. AI feedback was significantly longer than supervisor feedback (mean 382.02, SD 81.82 vs mean 98.87, SD 73.66 characters; Cohen d=2.84, 95% CI 2.50-3.19; P<.001). 
Cumulative link mixed-effects models showed that AI scored higher on criteria-based (odds ratio [OR] 11.81, 95% CI 7.64-18.27; P<.001) and clear direction (OR 6.61, 95% CI 4.35-10.06; P<.001), with no significant differences on accuracy (OR 1.35, 95% CI 0.91-2.00; P>.99), prioritization (OR 1.70, 95% CI 1.16-2.50; P=.10), or supportive tone (OR 1.34, 95% CI 0.87-2.06; P>.99). AI feedback showed greater consistency (variance ratio 3.9:1; Levene F1,320=73.20; P<.001). All 20 evaluators correctly identified feedback sources. Qualitative analysis revealed that AI provided structured, text-anchored feedback addressing rubric criteria, while supervisors offered experience-based feedback grounded in clinical context and professional expertise. This study extends the comparison of AI-generated and supervisor feedback to an authentic clinical clerkship environment, moving beyond classroom and simulation settings examined in prior work. Through integrated mixed methods analysis, a key distinction emerged between text-anchored AI feedback, which systematically addresses written log content in alignment with rubric criteria, and experience-based supervisor feedback, which draws on clinical observation and professional judgment. AI consistently delivered structured feedback addressing gaps that arise when time-pressured supervisors provide brief comments, while supervisors contributed clinically grounded insights that AI cannot replicate. These complementary strengths suggest that AI feedback should supplement rather than replace supervisor feedback, and that hybrid models leveraging each type's advantages warrant investigation in clinical education.
Continuous advancements in voice artificial intelligence technologies aim to assist older adults and caregivers, potentially improving quality of life and reducing caregiving burdens. Although research has explored the potential of voice-enabled artificial intelligence (VAI) assistants, such as Alexa (Amazon.com, Inc) and Google Home, to support older adults' health in informal care settings, there remains a significant gap in understanding the ethical dimensions and values that may influence their future adoption by caregivers and care recipients. This research aims to explore older adults' and informal family caregivers' perspectives on VAI assistants for supporting informal care, including the ethical dimensions and values that influence their decisions about future adoption for these purposes. This research uses participatory speculative design to explore older adults' and informal family caregivers' perspectives on how VAI might support informal care in the future, and the ethical concerns they have about adopting VAI technologies. We conducted 8 workshop sessions with older adults and caregivers (n=9) over 4 months. Each phase focused on 1 of 3 goals: (1) to understand existing experiences, (2) to envision future VAI technologies, and (3) to reflect on ethical values that shape acceptance. In workshops, we aimed to gain insights into their experiences and challenges in managing informal care tasks and how future implementation of VAI might support the caregiving process to address their needs and concerns while emphasizing the ethical dimensions they value. The findings suggest older adults and informal family caregivers see potential opportunities for VAIs to support informal aging care by automating daily health tasks to improve efficiency, enhancing mental health and well-being, and offering companionship. However, participants felt that VAI alone might not be sufficient to address the complex needs of informal care. 
Additionally, they raised several ethical concerns related to transparency, privacy, inclusiveness, trust, affordability, and autonomy, which they felt needed to be addressed to encourage adoption of VAI technologies for informal care in the future. Based on the findings, we offer insights and design implications for VAI systems that balance efficiency with ethical values to support diverse caregiving needs and potentially encourage future adoption in the informal care space.
As artificial intelligence (AI) models become increasingly integrated into facial aesthetic surgery for attractiveness prediction and surgical outcome simulation, their potential to perpetuate bias poses clinical concerns. Current models trained on limited datasets inaccurately evaluate underrepresented populations and risk promoting aesthetic homogenization that conflicts with patient goals of ethnic feature preservation. Drawing on current literature, this paper examines bias across AI development stages in aesthetic facial evaluation. Benchmark datasets such as SCUT-FBP (South China University of Technology-Facial Beauty Prediction) and the Chicago Face Database underrepresent older adults, non-White, and ethnically diverse populations. Training methodologies lack fairness-aware techniques, and evaluation focuses on overall rather than demographic-stratified accuracy. While individual mitigation strategies exist (including balanced datasets, adversarial debiasing, and fairness metrics), no comprehensive framework integrates these approaches across the entire development lifecycle. We propose a 6-pillar framework spanning the AI development lifecycle: (1) diverse data collection with synthetic augmentation, (2) fairness-aware training techniques, (3) complementary fairness metrics with intersectional assessment, (4) explainable AI for clinical transparency, (5) stakeholder engagement, and (6) continuous monitoring. Despite the challenges of maintaining algorithmic standardization and cultural specificity, this framework provides implementation guidance for AI developers, clinicians, and institutions, with principles applicable beyond aesthetic surgery to broader facial analysis applications.
There is growing concern that artificial intelligence (AI) may diminish the quality of human relationships. However, in a context of widespread social importance (empathetic conversations between doctors and patients), AI can actually improve human conversational skills, potentially enhancing professional relationships. Recent advances in AI allow for realistically role-prompted counterparts for practicing professional conversations, enabling relational learning without the need for human counterparts. This study aimed to show the effectiveness of AI chatbots for learning professional communicative skills in medical education. Specifically, we hypothesized that a single conversation with an AI chatbot improves communication skills in medical students across 4 different conversational competencies. We conducted a quasi-experimental intervention study involving 4 distinct role-prompted scenarios (ie, shared decision-making, motivational interviewing, sexually transmitted diseases, and breaking bad news), each designed to elicit in-depth empathic conversational skills aligned with key learning objectives in medical curricula. Students rated their competence for the 4 scenarios before and after a conversation with GPT-4o (OpenAI) using default settings, without fine-tuning. We expected higher perceived communication competence (PCC) in their conversation topic after the interaction compared with before the interaction in a 2-sided paired t test. Participants received AI-generated feedback, which they rated regarding adequacy. Post hoc analyses addressed gender and case effects, feedback adequacy, and prevalues in PCC. This study shows that a role-prompted GPT chatbot improves PCC in 162 medical students after a single conversation with a mean of 13 (SD 4.8; 95% CI 12-14) prompt-response pairs. 
We found an increase in PCC with a mean difference of 0.94 (SD 1.64; 95% CI 0.69-1.20; Cohen d=0.58) from 5.89 (95% CI 5.55-6.23; scale 0-10) before the conversation to 6.83 (95% CI 6.55-7.12) after the conversation across 4 different patient role prompts. Furthermore, we found participants rating AI feedback of their conversation to be useful (mean 7.92, SD 1.61; 95% CI 7.67-8.17; scale 0-10), but feedback adequacy did not correspond to PCC increase (r=0.08; P=.32). Our results demonstrate how role-prompted GPT increases self-assessed communication competencies, introducing a novel tool for teaching relational learning. Our results present a starting point for using AI in education, particularly teaching communication in professional roles. On the basis of our findings in medical education, we anticipate further studies to investigate conversational training between lawyers and clients, marketers and customers, or managers and employees. Our research thus has implications for any field with a need for conversational training and relational learning.
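The paired effect size and confidence interval reported in the preceding abstract can be approximately reproduced from its rounded summary statistics. The sketch below assumes Cohen d was computed as the mean difference divided by the SD of the differences and uses a normal critical value of 1.96; neither choice is confirmed by the abstract, and the function name is hypothetical. Because the inputs are rounded, the results land close to, but not exactly on, the reported d=0.58 and 95% CI 0.69-1.20.

```python
import math
from typing import Tuple


def paired_summary(
    mean_diff: float, sd_diff: float, n: int, z_crit: float = 1.96
) -> Tuple[float, Tuple[float, float]]:
    """Return (Cohen d for paired data, approximate 95% CI of the mean difference).

    Assumptions: d = mean_diff / sd_diff, and a normal approximation for the CI.
    """
    if n <= 1 or sd_diff <= 0:
        raise ValueError("need n > 1 and a positive SD of differences")
    d = mean_diff / sd_diff
    se = sd_diff / math.sqrt(n)  # standard error of the mean difference
    return d, (mean_diff - z_crit * se, mean_diff + z_crit * se)


# Rounded values from the abstract: mean difference 0.94, SD 1.64, 162 students.
d, (ci_low, ci_high) = paired_summary(0.94, 1.64, 162)
print(round(d, 2), round(ci_low, 2), round(ci_high, 2))  # 0.57 0.69 1.19
```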
Several artificial intelligence (AI) governance frameworks have emerged to help health systems (HS) address AI-related risks. However, most fail to capture the multidimensional and evolving nature of real-world governance. This systematic review aimed to synthesize existing AI governance frameworks for HS and to propose an integrative AI governance model identifying key components to guide AI-related policy, practice, and research in HS. A comprehensive search was conducted in 8 academic databases (PubMed, MEDLINE, Embase, ACM Digital Library, Web of Science, Scopus, Social Sciences Abstracts, and PsycINFO), gray literature databases, and international organization web portals in October 2024 (updates: July 2025 and March 2026) and limited to studies published from November 2014 to March 2026 in English, French, Spanish, or Portuguese. Eligible documents included peer-reviewed articles and reports proposing AI governance frameworks for HS. Two reviewers independently selected the frameworks, assessed their quality using the Appraisal of Guidelines for Research and Evaluation for Health Systems, and extracted data. Results were synthesized using thematic analysis. The research retrieved 10,175 records, among which 19 AI governance frameworks were identified. Most were published between 2022 and 2024 (n=13, 68%), half (n=10, 53%) were developed by authors based in North America, and only one-third (n=6, 32%) were derived from primary studies. The frameworks focused on 4 levels of AI governance: international (n=3, 16%), national (n=5, 26%), local (n=3, 16%), and organizational (n=8, 42%). All of them underline the crucial role of multidisciplinary bodies in the structure of AI governance in HS. 
Six key AI governance processes in HS emerged as critical: (1) need and/or problem identification (n=14, 74%), (2) data governance (n=17, 89%), (3) risk assessment and management (n=17, 89%), (4) validation and/or evaluation (n=18, 95%), (5) maintenance and monitoring (n=16, 84%), and (6) integration (n=9, 47%). Additionally, 4 pivotal relational mechanisms were identified: (1) ethical principles and/or values (n=17, 89%), (2) education and training (n=14, 74%), (3) communication (n=12, 63%), and (4) standards and regulations (n=13, 68%). Our study provides a comprehensive synthesis of existing AI governance frameworks for HS across 4 levels (local, regional, national, and international), underpinned by a quality assessment of the 19 identified frameworks. It differs from existing studies that concentrate on specific dimensions or settings by contributing an integrative AI governance model for HS comprising 2 dimensions and 4 relational mechanisms across the 4 levels, explicitly modeling their interactions. Future research should test and operationalize the proposed model to enhance its practical applicability. Strengthening the methodological rigor of AI governance frameworks will be essential for the responsible integration of AI in HS. As the framework is primarily grounded in Global North and English-language literature, validation in other contexts is warranted.
The integration of artificial intelligence (AI) into clinical research challenges traditional informed consent (IC) frameworks due to algorithmic complexity, opacity, and adaptive nature. While public demand for transparency regarding AI use in health care is high, current ethical guidelines lack specificity, and no assessment exists of AI representation in IC documentation within the trial registry. This study aimed to evaluate the prevalence, clarity, and completeness of AI-related consent disclosures in clinical trials registered on ClinicalTrials.gov and to propose a framework for enhanced patient digital literacy and ethical robustness. We conducted a cross-sectional content analysis of 114 AI-involved clinical trials with publicly available IC documents from ClinicalTrials.gov (searched on June 21, 2025). We assessed AI-specific disclosures, readability (SMOG index), document length, visual aid use, and data governance protocols against WHO/NIH standards. We also refined an AI risk framework encompassing model autonomy, departure from standards of care, patient-facing interaction, and clinical risk, scoring each trial on a 3-tier scale. Over half (55%) of ICs failed to disclose the AI type or usage, and 16.4% omitted risks entirely. A significant discrepancy existed between trial registry and IC reporting of AI methods. Only 14% of ICs met dual criteria for brevity (<15,000 characters) and readability (SMOG <13). Higher-risk trials did not demonstrate improved readability (Spearman correlation; P>.05). Only 11.4% of ICs included visual aids, and their inclusion was not correlated with lower reading difficulty. Data handling protocols post-withdrawal were inconsistent: 51 ICs provided no information, 30 specified data destruction, 29 allowed continued use, and only 4 (3.5%) offered participants a choice. Cited data protection laws varied widely, with no dominant standard. 
Current IC practices in AI clinical trials registered on ClinicalTrials.gov show a notable disconnect from ethical principles, with deficits in transparency, readability, and participant control over data. Our findings indicate a need for more standardized, participant‑centered consent practices. We propose the Minimum Requirements for Informed Consent in AI‑Related Clinical Trials (MRIC‑AI) as one possible framework to improve consent quality. However, these findings are limited to publicly available consent documents in the registry, which may differ from final onsite versions.
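The dual brevity and readability criterion reported above can be illustrated with a minimal sketch. It assumes the standard published SMOG grade formula and takes the polysyllable and sentence counts as inputs rather than deriving them from text; the function names are hypothetical, and this is not the authors' scoring pipeline.

```python
import math


def smog_grade(polysyllable_count: int, sentence_count: int) -> float:
    """Standard SMOG grade: 1.0430 * sqrt(polysyllables * 30 / sentences) + 3.1291.

    Counting polysyllabic words (3+ syllables) is left to the caller,
    because syllable counting is itself heuristic.
    """
    if sentence_count <= 0:
        raise ValueError("sentence_count must be positive")
    return 1.0430 * math.sqrt(polysyllable_count * (30 / sentence_count)) + 3.1291


def meets_dual_criteria(char_count: int, smog: float,
                        max_chars: int = 15_000, max_smog: float = 13.0) -> bool:
    """Thresholds taken from the abstract: <15,000 characters and SMOG <13."""
    return char_count < max_chars and smog < max_smog


# Hypothetical consent document: 12,000 characters, 40 polysyllabic words in 30 sentences.
grade = smog_grade(polysyllable_count=40, sentence_count=30)
print(meets_dual_criteria(char_count=12_000, smog=grade))
```

A document fails the criterion if either threshold is exceeded, which is why only a small share of consent forms can pass both at once.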
Promoting early HIV testing and patient detection is an important public health goal. In Japan, approximately 30% of newly reported HIV infections are diagnosed only after the onset of AIDS. Several studies have investigated the challenges related to HIV diagnosis; however, there are limitations in understanding the characteristics and barriers faced by individuals who are at high risk of HIV but have not yet been tested or have not sought medical consultation. This study aimed to examine the factors associated with medical consultation and HIV-testing behaviors, explore the reasons for not undergoing HIV testing, and evaluate the effectiveness of HIV-related awareness efforts among respondents to a revisit survey conducted via an artificial intelligence (AI)-based symptom search engine. This retrospective cohort study used data obtained from the AI-based symptom search engine, Ubie. Episodes involving individuals who used the AI-based symptom checker to search for their symptoms, which were subsequently suggested as HIV/AIDS/sexually transmitted infection (STI)-related conditions, were included. Those who answered the first and revisit survey questionnaires were included in the analysis. Multivariable logistic regression analyses were conducted to explore the factors associated with medical consultation in both the overall suggested HIV/AIDS/STI-related condition group and the suggested STI-related condition subgroup. Factors associated with HIV testing in individuals who underwent medical consultations were also explored using multivariable logistic regression analysis. The reasons for not undergoing HIV testing and the future intention to undergo testing were described. The number of eligible episodes was 424,893 for 332,976 individuals. Of these, medical consultations were performed in 105,365 cases and HIV testing in 394 cases. Compared with individuals in their 20s, older age groups were associated with a higher tendency to seek medical consultations.
The provision of awareness information through the AI-based symptom checker was associated with medical consultation behavior, and 29% (280/964) of people who initially had no intention of undergoing HIV testing responded that they would undergo HIV testing after using the AI-based symptom checker. Compared with the internal medicine department, the gynecology department was significantly associated with HIV testing; however, the HIV testing rates were low in the suggested STI-related condition subgroup across major departments. These results suggest that HIV-related information delivered via an AI-based symptom checker may raise awareness or consideration of medical consultation among individuals actively searching for symptoms potentially associated with HIV. To further promote HIV testing, it may be necessary to refine the content and delivery of educational materials and enhance HIV testing literacy among physicians who encounter patients with STIs.
Preventable adverse drug reactions in geriatric patients are caused by overdosing, especially in cases of impaired renal function. Artificial intelligence (AI) chatbots are being discussed as tools to generate drug information, which can adjust drug dosing and prevent subsequent adverse drug reactions based on individualized patient data. However, the question arises as to the extent to which such AI chatbots can withstand scientific evaluation in this task. We newly developed and validated the AI quality output score (AQUOS, ranging from 0% to 100%) to assess the quality of AI chatbot answers. We investigated whether AQUOS depends on (1) renal function, (2) medication complexity, (3) prompting language (English and German), and (4) whether the answers are reproducible (assessed at 2 independent times). Additionally, we assessed the potential for harm. In a standardized prompt, we asked 4 AI chatbots (ChatGPT, Copilot, Gemini, and Scite) whether the medication of 100 geriatric patients with polymedication at discharge should be adjusted according to their renal function. We prompted drug-related queries in 2 languages and at 2 times to assess AI chatbot answers, and we scored the generated outputs based on AQUOS. Additionally, we assessed possible harm from the AI chatbot answers using the World Health Organization definition "The conceptual framework for the international classification for patient safety." We analyzed 1600 AI chatbot answers, with AQUOS values ranging from -19.0% to 95.2%, depending on the chatbot. We found that AQUOS declined with decreasing renal function (ChatGPT: -0.215; P=.03) and increasing medication complexity (Scite: -0.239; P=.02). Possible harm also correlated with more complicated patient statuses (lower kidney function and higher medication complexity) across all chatbots. Overall scores were up to 4.8% higher in English than in German prompting. The AI chatbot answers were highly reproducible. 
In renal drug dosing, the quality of AI chatbot answers declined as renal function decreased and medication complexity increased. Even the highest AQUOS achieved is insufficient for deploying AI chatbots in the high-risk health care sector.
Multiple mini interviews (MMIs) are widely used in medical school admissions to assess applicants' nonacademic attributes in a structured and reliable manner. However, the development of high-quality MMI stations is resource intensive and dependent on expert input. This study explored the utility of artificial intelligence (AI) in the generation of MMI stations for the Direct and Graduate Entry Medicine Program admissions process for domestic applicants at Monash Medical School. To our knowledge, this study represents the first empirical evaluation of AI-generated MMI stations deployed in a real-world medical school admissions context. A total of 56 MMI stations from the 2025 admissions cycle were evaluated, including 17 (30.4%) AI-generated and 39 (69.6%) traditionally developed stations, administered across 824 domestic applicants for a total of 4897 applicant-station interactions. We assessed station quality through both reliability (using Cronbach α to examine internal consistency) and discrimination capability (using SD and range of scores) at the station level. AI-generated stations exhibited slightly higher reliability (α=0.82) compared with traditional stations (α=0.81), though this difference was not statistically significant (P=.91). Both AI-generated and traditionally developed stations demonstrated variable discrimination capability, with some stations from each development method showing excellent combinations of high reliability and strong discriminatory power, while others exhibited ceiling effects that limited their discriminatory power. Of note, a greater proportion of AI-generated stations were classified as optimal (α>0.85), and a smaller proportion were classified in the review category (α<0.75), compared with traditional stations. These results suggest that AI-generated stations can achieve psychometric performance comparable to traditionally developed stations. 
Our findings highlight the utility of AI for MMI station generation, offering a scalable approach that may reduce the resource burden on faculty while maintaining or enhancing psychometric quality for applicants. Ongoing quality assurance and evaluation remain essential to ensure fairness and validity across the admissions process.
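As an illustration of the reliability measure used above, the following is a minimal sketch of Cronbach α computed from a score matrix. It assumes rows are applicants and columns are the scored items within one station; the abstract does not specify how items were defined, so the data layout, function name, and example scores are hypothetical.

```python
from statistics import variance


def cronbach_alpha(scores: list[list[float]]) -> float:
    """Cronbach alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores)).

    `scores` has one row per applicant and one column per item (assumed layout).
    Sample variances are used for both the items and the totals.
    """
    if len(scores) < 2 or len(scores[0]) < 2:
        raise ValueError("need at least 2 applicants and 2 items")
    k = len(scores[0])
    item_variances = [variance([row[j] for row in scores]) for j in range(k)]
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical station with 3 applicants scored on 2 items.
alpha = cronbach_alpha([[2, 3], [4, 4], [6, 5]])
print(round(alpha, 3))
```

Under the thresholds quoted in the abstract, a station with α above 0.85 would be classified as optimal and one below 0.75 would be flagged for review.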
Synthetic data generation (SDG) has emerged as a promising solution to address data scarcity in health care, where privacy concerns, regulatory barriers, and the high cost of data acquisition limit access to real patient datasets. Machine learning models in this domain often operate in low-data regimes, with training set sizes as low as 20 and a median dataset size of around 600 records, conditions that hinder model generalization and increase the risks of overfitting and bias. SDG addresses these challenges by producing artificial samples that mimic real-world patient data, enabling robust and privacy-preserving model development. This study was a comprehensive assessment of SDG-augmented training across a wide array of models, both pretrained and non-pretrained, for outcome prediction in 13 health care datasets. For small datasets of sizes 50 and 350 records, we answer 3 key questions: (1) Do pretrained SDG models generate more effective augmentations than their non-pretrained counterparts for small datasets? (2) Is augmentation beneficial for both pretrained and non-pretrained classifiers for small datasets? (3) Among 3 state-of-the-art classification models, which offers the best predictive performance on small datasets? The workload that this study aimed to improve was binary classification. The 3 classifiers considered were light gradient boosting trees, large language models (LLMs) adapted to tabular data, and Tabular Prior-Data Fitted Network (TabPFN), a transformer-based method that has become the new state of the art in terms of tabular data classification. Each classifier was augmented through different SDG methods: current state-of-the-art techniques (Bayesian networks, conditional tabular generative adversarial networks, tabular variational autoencoders, and sequential trees) and the use of LLMs for tabular data generation.
Augmented TabPFN demonstrated superior performance, yielding significantly higher area under the curve and integrated calibration index scores compared to other classifiers. Post hoc analysis revealed that, for the dataset sizes examined, SDG and LLM models exhibited overfitting tendencies. Notably, simple dataset augmentation through sampling with replacement achieved performance comparable to that of SDG-based and LLM-based augmentation methods for TabPFN, suggesting that gains were primarily driven by increased sample size rather than SDG. Given its strong performance and minimal computational overhead, we recommend augmenting TabPFN through sampling with replacement as the optimal approach for small-data binary classification tasks. This method achieves performance comparable to that of more complex SDG techniques while offering substantial computational advantages.
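The baseline recommended above, augmentation through sampling with replacement, can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name, record layout, and the choice of 300 extra rows are assumptions, and the classifier itself (for example TabPFN) is not shown.

```python
import random


def augment_with_replacement(rows: list, n_extra: int, seed: int = 0) -> list:
    """Return the original rows plus `n_extra` rows drawn uniformly with replacement.

    Apply this to the training split only; resampling before the train/test
    split would leak duplicated records into the evaluation set.
    """
    if not rows:
        raise ValueError("rows must not be empty")
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    return list(rows) + rng.choices(rows, k=n_extra)


# Hypothetical training set of 50 (features, label) records, enlarged to 350.
train = [((i, i % 7), i % 2) for i in range(50)]
augmented = augment_with_replacement(train, n_extra=300, seed=42)
print(len(augmented))
```

Because every added row is a copy of a real record, this approach introduces no new information, which is consistent with the abstract's observation that the gains were driven by sample size rather than by the generator.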
Psychotic disorder represents a leading cause of disability worldwide, and relapse in psychosis is common. Artificial intelligence (AI) is increasingly recognized as a method that could aid clinical monitoring for individuals experiencing psychosis. This review aims to map the existing literature on AI-based approaches (including machine learning, deep learning, and natural language processing) used to detect relapse in individuals with psychotic disorders. A systematic search strategy was conducted on PubMed, PsycINFO, and Embase up to January 7, 2026. Observational studies, randomized controlled trials, and quasi-experimental studies that used AI methods to detect relapse in psychosis were eligible for inclusion. Screening and data extraction procedures were conducted by at least 2 reviewers working independently. Findings were extracted, charted, and described using narrative synthesis based on data extraction and consensus meetings with the research team. The scoping review was prospectively registered with the Open Science Framework. Relevant studies identified (N=10) included the use of digital tools such as smartphone- and smartwatch-based monitoring, ecological momentary assessment tools, social media activity, and internet searches. Digital phenotyping via smartphones and wearables emerged as the most common method for data collection. The efficacy of AI models varied with sensitivity (or recall) ranging from 0.25 to 0.77 and specificity (or precision) ranging from 0.06 to 0.88. The reported area under the receiver operating characteristic curve for models ranged from 0.63 to 0.78. AI models were heterogeneous across studies, and most study findings were not replicated. This scoping review highlights both the promise and the current limitations of AI in psychosis relapse detection.
Passive digital phenotyping research in the detection of psychosis relapse has progressed, and personalized approaches with individual-level modeling show promise; however, further studies need to include larger numbers of participants and should incorporate methods such as large language models. Future studies will require large collaborations aimed at delivering AI methods for use in real-world clinical practice.
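For readers comparing the performance figures above, the following minimal sketch shows how sensitivity (recall), specificity, and precision are derived from a confusion matrix. Specificity and precision are in general different quantities, so values reported under either label are not interchangeable. The counts are hypothetical and do not come from any included study.

```python
def relapse_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute detection metrics from confusion-matrix counts.

    sensitivity (recall) = TP / (TP + FN)   share of true relapses detected
    specificity          = TN / (TN + FP)   share of non-relapse periods correctly cleared
    precision (PPV)      = TP / (TP + FP)   share of alerts that were true relapses
    """
    def ratio(numerator: int, denominator: int) -> float:
        # An undefined ratio (empty denominator) is reported as NaN rather than raising.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }


# Hypothetical monitoring data: 10 relapse periods among 100 observation windows.
print(relapse_metrics(tp=8, fp=10, tn=80, fn=2))
```

With rare events such as relapse, precision can be low even when specificity is high, because a small false-positive rate applied to many non-relapse windows still produces many false alerts.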
The convergence of artificial intelligence (AI), blockchain technology, and health care represents one of the most transformative yet technically challenging frontiers in computational medicine. As health care systems adopt data-driven paradigms for precision medicine and clinical decision support, the need for secure, privacy-preserving, and collaborative learning frameworks has become critical. This tutorial introduces a comprehensive, clinically oriented, and compliance-aware framework integrating federated learning (FL) and blockchain for secure and privacy-preserving health care analytics. FL enables collaborative training across distributed institutions without raw data sharing, in alignment with privacy regulations such as the Health Insurance Portability and Accountability Act (HIPAA) and the General Data Protection Regulation (GDPR). However, FL remains vulnerable to model poisoning and gradient leakage. To address these risks, we introduce blockchain-based FL (BCFL), which leverages blockchain's immutable ledger and decentralized consensus to enhance trust, verifiability, and auditability. The tutorial's main contributions include (1) a taxonomy of diverse medical data types and their FL requirements; (2) three integration architectures (fully coupled, semicoupled, and loosely coupled) analyzed for security, scalability, and regulatory compliance; (3) a security analysis of health care-specific vulnerabilities and mitigation strategies using advanced cryptography, such as zero-knowledge proofs, homomorphic encryption, and differential privacy; and (4) a regulatory compliance framework addressing HIPAA, GDPR, and United States Food and Drug Administration guidelines for AI-enabled medical devices. 
We demonstrate BCFL's relevance across major health care applications, including disease prediction, medical imaging, patient monitoring, and drug discovery, and highlight emerging research directions such as quantum-resilient cryptography, scalable interoperability, and automated compliance. This tutorial serves as a foundational resource for advancing secure, compliant, and collaborative AI in health care; fostering privacy-preserving analytics; and improving patient outcomes.
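The core idea of BCFL described above can be sketched as two small pieces: a federated averaging step that combines client parameters without sharing raw data, and an append-only hash chain that records each aggregated round for later audit. This is a simplified illustration under stated assumptions, not the tutorial's architecture: the function names are hypothetical, the weighting follows the commonly used sample-size-weighted average, and a real blockchain would add decentralized consensus, which is omitted here.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder hash preceding the first block


def federated_average(client_weights: list[list[float]], client_sizes: list[int]) -> list[float]:
    """Average client parameter vectors, weighted by each client's sample count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


def _block_hash(prev_hash: str, payload: dict) -> str:
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_block(chain: list[dict], payload: dict) -> None:
    """Append a record whose hash commits to the payload and the previous block."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    chain.append({"prev": prev_hash, "payload": payload,
                  "hash": _block_hash(prev_hash, payload)})


def verify_chain(chain: list[dict]) -> bool:
    """Recompute every hash; any edited payload or broken link returns False."""
    prev_hash = GENESIS_HASH
    for block in chain:
        if block["prev"] != prev_hash or block["hash"] != _block_hash(prev_hash, block["payload"]):
            return False
        prev_hash = block["hash"]
    return True


# Two hypothetical hospitals contribute local parameters; only these vectors leave each site.
ledger: list[dict] = []
global_model = federated_average([[1.0, 2.0], [3.0, 4.0]], client_sizes=[100, 300])
append_block(ledger, {"round": 1, "model": global_model})
print(global_model, verify_chain(ledger))
```

The hash chain only makes tampering detectable after the fact; defenses against model poisoning or gradient leakage, such as the differential privacy and homomorphic encryption mentioned above, would need to be layered on separately.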
Millions of people now use leading generative AI tools (chatbots) for psychological support. Despite the promise related to availability and scale, the single most pressing question in AI for mental health is whether these tools are safe. The field currently lacks a validated, automated benchmark for determining AI chatbot safety in mental health, including for users at risk of suicide. The Validation of Ethical and Responsible AI in Mental Health (VERA-MH) evaluation was recently proposed to meet this urgent need. This human validation study examines alignment of the VERA-MH safety evaluation for AI chatbot suicide risk detection and response with safety ratings by expert human clinicians. We simulated a large set of conversations between large language model (LLM)-based users ("user-agents") spanning a wide range of suicide risk levels and disclosure styles and general-purpose AI chatbots. Licensed mental health clinicians from Spring Health used a scoring rubric developed for VERA-MH to independently rate the simulated conversations for safe and unsafe chatbot behaviors. An LLM-based evaluator (the "judge") used the same scoring rubric to evaluate the same set of conversations. We then examined rating alignment across (a) individual clinicians, (b) clinician consensus and the LLM judge, and (c) different judge LLMs. We also examined clinicians' ratings of user-agent realism, suicide risk, and disclosure. Clinicians were generally consistent with one another in their safety ratings (chance-corrected inter-rater reliability [IRR]: 0.77), thus establishing a reliable clinical consensus reference. The LLM judge was strongly aligned with this clinical consensus reference (IRR: 0.81) when using the same scoring rubric. Ratings were stable across judge LLMs and evaluations. Clinicians' ratings of user-agent realism and how well the intended user-agent suicide risk and disclosure styles were reflected in the simulated conversations were mixed. 
For the potential mental health benefits of AI chatbots to be realized, attention to safety is paramount. Findings support the reliability of VERA-MH: an open-source, fully automated AI safety evaluation for suicide risk detection and response. These results reflect an earlier version of the benchmark, and as VERA-MH continues to evolve, external validation of updated versions will be an important next step. Future research directions include VERA-MH generalizability and robustness, as well as expanding to target other key areas of AI safety for mental health.
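To illustrate what a chance-corrected agreement statistic measures, the following minimal sketch computes Cohen's kappa for two raters. The abstract does not name the specific IRR statistic used, so Cohen's kappa is an assumption chosen for illustration, and the labels and ratings are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa = (p_observed - p_expected) / (1 - p_expected).

    p_expected is the agreement two raters would reach by chance,
    given how often each of them uses each label.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical safety ratings of six simulated conversations by a clinician and an LLM judge.
clinician = ["safe", "unsafe", "safe", "safe", "unsafe", "safe"]
llm_judge = ["safe", "unsafe", "safe", "unsafe", "unsafe", "safe"]
print(round(cohens_kappa(clinician, llm_judge), 3))
```

A value of 0 means agreement no better than chance and 1 means perfect agreement, which gives context for interpreting the 0.77 and 0.81 figures reported above.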
Medical documentation imposes a significant administrative burden on physicians and reduces time for direct patient care. Artificial intelligence (AI)-assisted tools such as automatic speech recognition and large language models (LLMs) promise to reduce this burden, but their performance in multilingual environments has not been explored. Switzerland is highly multilingual, and non-native German-speaking physicians may find documentation particularly challenging. This study aimed to compare the efficiency and documentation quality of four clinical documentation workflows, including both AI-assisted and traditional methods, in a Swiss tertiary hospital setting characterized by linguistic diversity. In this proof-of-concept study at a Swiss tertiary hospital (Department of Plastic and Hand Surgery, Cantonal Hospital Aarau), two physicians (a native Swiss German speaker and a non-native German speaker) documented encounters with simulated patients having common hand disorders. Four documentation workflows were tested: (1) traditional dictation with transcription by a secretary; (2) real-time dictation using speech recognition software for voice-to-text transcription; (3) postencounter dictation transcribed by an AI (Whisper) and processed by a GPT-based agent; and (4) AI-assisted ambient dictation of entire appointments using audio recording and automatic transcription. Documentation efficiency was measured by recorded physician time, and note quality was assessed using a modified Physician Documentation Quality Instrument (PDQI-9) scored by three different LLMs. To protect patient privacy, only synthetic (simulated) patient data were used. AI-assisted workflows, particularly workflow 4 (AI-assisted ambient dictation), produced the shortest physician documentation times per report. In post hoc comparisons, the only comparison in which workflow 4 was significantly faster for both physicians was against the speech recognition software workflow (workflow 2; adjusted P<.001).
For the non-native speaker, workflow 4 was not significantly faster than traditional dictation (workflow 1) after adjustment (P=.08). LLM evaluators assigned high absolute scores (median PDQI-9 >47/50); however, inter-rater reliability was poor (Krippendorff's alpha=-0.433, 95% CI: -0.444 to -0.416), indicating systematic disagreement that precludes definitive conclusions about documentation quality from these scores alone. AI-assisted documentation demonstrated significant time savings for the native speaker, though the reduction for the non-native speaker did not reach statistical significance in this pilot (P=.08). Such tools show potential to alleviate the linguistic challenges faced by non-native speakers, reduce administrative burdens, and enable physicians to spend more time with patients. However, the inconsistency of AI-based quality scoring suggests that LLMs cannot yet reliably replace human evaluation. Future studies should evaluate these workflows in real-world clinical implementation, address data privacy and security issues, and include human evaluators to validate the benefits observed in this study.