With the shortage of physicians and surgeons and the increase in demand worldwide due to situations such as the COVID-19 pandemic, there is growing interest in finding solutions to help address the problem. One proposed solution is to use neurotechnology to provide them with augmented cognition, senses and action for optimal diagnosis and treatment. However, doing so can negatively impact them and others. We argue that applying neurotechnology for human enhancement in physicians and surgeons can cause injustices, as well as harm to them and to patients. In this paper, we first describe the augmentations and the neurotechnologies that can be used to achieve them for physicians and surgeons. We then review selected ethical concerns discussed in the literature, discuss the neuroengineering behind using neurotechnology for augmentation purposes, and conclude with an analysis of the outcomes and ethical issues of implementing human augmentation via neurotechnology in medical and surgical practice.
Physicians are--and feel--ethically, professionally, and legally responsible for patient outcomes, buffering patients from harmful AI determinations from medical AI systems. Many have called for explainable AI (XAI) systems to help physicians incorporate medical AI recommendations into their workflows in a way that reduces the potential of harms to patients. While prior work has demonstrated how physicians' legal concerns impact their medical decision making, little work has explored how XAI systems should be designed in light of these concerns. In this study, we conducted interviews with 10 physicians to understand where and how they anticipate errors that may occur with a medical AI system and how these anticipated errors connect to their legal concerns. In our study, physicians anticipated risks associated with using an AI system for patient care, but voiced unknowns around how their legal risk mitigation strategies may change given a new technical system. Based on these findings, we describe the implications for designing XAI systems that can address physicians' legal concerns. Specifically, we identify the need to provide AI recommendations alongside contextual information tha
We report on the final electroweak measurements performed with data taken at the Z resonance by the experiments operating at the electron-positron colliders SLC and LEP. The data consist of 17 million Z decays accumulated by the ALEPH, DELPHI, L3 and OPAL experiments at LEP, and 600 thousand Z decays by the SLD experiment using a polarised beam at SLC. The measurements include cross-sections, forward-backward asymmetries and polarised asymmetries. The mass and width of the Z boson, $m_{\mathrm{Z}}$ and $\Gamma_{\mathrm{Z}}$, and its couplings to fermions, for example the $\rho$ parameter and the effective electroweak mixing angle, are precisely measured. The number of light neutrino species is determined to be $2.9840 \pm 0.0082$. The results are compared to the predictions of the Standard Model. Electroweak radiative corrections beyond the running of the QED and QCD coupling constants are observed with a significance of five standard deviations, and in agreement with the Standard Model. Of the many Z-pole measurements, the forward-backward asymmetry in b-quark production shows the largest difference with respect to its Standard Model expectation, at the level of 2.8 standard deviations. Through radiative correctio
Large Language Models (LLMs) have demonstrated remarkable performance on various medical question-answering (QA) benchmarks, including standardized medical exams. However, correct answers alone do not ensure correct logic, and models may reach accurate conclusions through flawed processes. In this study, we introduce the MedPAIR (Medical Dataset Comparing Physicians and AI Relevance Estimation and Question Answering) dataset to evaluate how physician trainees and LLMs prioritize relevant information when answering QA questions. We obtain annotations on 1,300 QA pairs from 36 physician trainees, labeling each sentence within the question components for relevance. We compare these relevance estimates to those for LLMs, and further evaluate the impact of these "relevant" subsets on downstream task performance for both physician trainees and LLMs. We find that LLMs are frequently not aligned with the content relevance estimates of physician trainees. After filtering out physician trainee-labeled irrelevant sentences, accuracy improves for both the trainees and the LLMs. All LLM and physician trainee-labeled data are available at: http://medpair.csail.mit.edu/.
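The filtering step described above can be sketched in a few lines. This is a minimal illustration, not the MedPAIR pipeline: the majority-vote rule for combining annotator labels, the function names, and the example data are all assumptions introduced here, since the abstract does not state how per-sentence labels are aggregated.

```python
def majority_relevant(votes):
    """Return True when strictly more than half of the votes mark the sentence relevant.

    Majority voting is an assumed aggregation rule, not one taken from the abstract.
    """
    return 2 * sum(votes) > len(votes)


def filter_question(sentences, annotations):
    """Keep only the sentences that a majority of annotators labeled relevant.

    sentences:   list of sentence strings making up one question
    annotations: list of vote lists (True = relevant), one list per sentence
    """
    if len(sentences) != len(annotations):
        raise ValueError("each sentence needs its own list of votes")
    return [s for s, votes in zip(sentences, annotations) if majority_relevant(votes)]


def accuracy(predictions, gold):
    """Fraction of predicted answers that match the gold answers."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold answers must align")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Hypothetical question with three sentences and three annotators each.
question = [
    "A 54-year-old man presents with chest pain.",
    "He recently returned from a holiday.",
    "His ECG shows ST elevation.",
]
votes = [
    [True, True, True],
    [False, False, True],
    [True, True, False],
]
filtered = filter_question(question, votes)
```

Comparing `accuracy` on answers produced from the full question against answers produced from `filtered` gives the kind of before-and-after comparison the abstract reports.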
Objective To develop an LLM-based real-time compound diagnostic medical AI interface and to conduct a clinical trial comparing this interface with physicians on common internal medicine cases based on United States Medical Licensing Examination (USMLE) Step 2 Clinical Skills (CS) style exams. Methods A nonrandomized clinical trial was conducted on August 20, 2024. We recruited one general physician, two internal medicine residents (2nd and 3rd year), and five simulated patients. The clinical vignettes were adapted from the USMLE Step 2 CS style exams. We developed 10 representative internal medicine cases based on actual patients and included information available on initial diagnostic evaluation. The primary outcome was the accuracy of the first differential diagnosis. Repeatability was evaluated based on the proportion of agreement. Results The accuracy of the physicians' first differential diagnosis ranged from 50% to 70%, whereas the real-time compound diagnostic medical AI interface achieved an accuracy of 80%. The proportion of agreement for the first differential diagnosis was 0.7. The accuracy of the first and second differential diagnoses ranged from 70% to 90% for physicians, whereas
We explore the potential of Large Language Models (LLMs) to assist and potentially correct physicians in medical decision-making tasks. We evaluate several LLMs, including Meditron, Llama2, and Mistral, to analyze the ability of these models to interact effectively with physicians across different scenarios. We consider questions from PubMedQA and several tasks, ranging from binary (yes/no) responses to long answer generation, where the answer of the model is produced after an interaction with a physician. Our findings suggest that prompt design significantly influences the downstream accuracy of LLMs and that LLMs can provide valuable feedback to physicians, challenging incorrect diagnoses and contributing to more accurate decision-making. For example, when the physician is accurate 38% of the time, Mistral can produce the correct answer, improving accuracy up to 74% depending on the prompt being used, while Llama2 and Meditron models exhibit greater sensitivity to prompt choice. Our analysis also uncovers the challenges of ensuring that LLM-generated suggestions are pertinent and useful, emphasizing the need for further research in this area.
Physician-physician discussions of patient cases represent a rich source of clinical knowledge and reasoning that could feed AI agents to enrich and even participate in subsequent interactions. However, privacy regulations and ethical considerations severely restrict access to such data. While synthetic data generation using Large Language Models offers a promising alternative, existing approaches primarily focus on patient-physician interactions or structured medical records, leaving a significant gap in physician-to-physician communication synthesis. We present SynDocDis, a novel framework that combines structured prompting techniques with privacy-preserving de-identified case metadata to generate clinically accurate physician-to-physician dialogues. Evaluation by five practicing physicians in nine oncology and hepatology scenarios demonstrated exceptional communication effectiveness (mean 4.4/5) and strong medical content quality (mean 4.1/5), with substantial interrater reliability (kappa = 0.70, 95% CI: 0.67-0.73). The framework achieved 91% clinical relevance ratings while maintaining doctors' and patients' privacy. These results place SynDocDis as a promising framework for a
The advancement and maturity of large language models (LLMs) and robotics have unlocked vast potential for human-computer interaction, particularly in the field of robotic ultrasound. While existing research primarily focuses on either patient-robot or physician-robot interaction, the role of an intelligent virtual sonographer (IVS) bridging physician-robot-patient communication remains underexplored. This work introduces a conversational virtual agent in Extended Reality (XR) that facilitates real-time interaction between physicians, a robotic ultrasound system (RUS), and patients. The IVS agent communicates with physicians in a professional manner while offering empathetic explanations and reassurance to patients. Furthermore, it actively controls the RUS by executing physician commands and transparently relays these actions to the patient. By integrating LLM-powered dialogue with speech-to-text, text-to-speech, and robotic control, our system enhances the efficiency, clarity, and accessibility of robotic ultrasound acquisition. This work constitutes a first step toward understanding how IVS can bridge communication gaps in physician-robot-patient interaction, providing more contr
Medicine is an empirical discipline refined through long-term observation and the messy, high-variance reality of clinical practice. Physicians build diagnostic and therapeutic competence through repeated cycles of application, reflection, and improvement, forming individualized methodologies. Yet outcomes vary widely, and master physicians' knowledge systems are slow to develop and hard to transmit at scale, contributing to the scarcity of high-quality clinical expertise. To address this, we propose Med-Shicheng, a general framework that enables large language models to systematically learn and transfer distinguished physicians' diagnostic-and-therapeutic philosophy and case-dependent adaptation rules in a standardized way. Built on Tianyi, Med-Shicheng consists of five stages. We target five National Masters of Chinese Medicine or distinguished TCM physicians, curate multi-source materials, and train a single model to internalize all five knowledge systems across seven tasks, including etiology-pathogenesis analysis, syndrome diagnosis, treatment principle selection, prescription generation, prescription explanation, symptom evolution with regimen adjustment, and clinical advice.
A seminal paper published by Ledley and Lusted in 1959 introduced complex clinical diagnostic reasoning cases as the gold standard for the evaluation of expert medical computing systems, a standard that has held ever since. Here, we report the results of a physician evaluation of a large language model (LLM) on challenging clinical cases against a baseline of hundreds of physicians. We conduct five experiments to measure clinical reasoning across differential diagnosis generation, display of diagnostic reasoning, triage differential diagnosis, probabilistic reasoning, and management reasoning, all adjudicated by physician experts with validated psychometrics. We then report a real-world study comparing human expert and AI second opinions in randomly-selected patients in the emergency room of a major tertiary academic medical center in Boston, MA. We compared LLMs and board-certified physicians at three predefined diagnostic touchpoints: triage in the emergency room, initial evaluation by a physician, and admission to the hospital or intensive care unit. In all experiments--both vignettes and emergency room second opinions--the LLM displayed superhuman diagnostic and reasoning abili
In a doctor-patient dialogue, the primary objective of physicians is to diagnose patients and propose a treatment plan. Medical doctors guide these conversations through targeted questioning to efficiently gather the information required to provide the best possible outcomes for patients. To the best of our knowledge, this is the first work that studies physician intent trajectories in doctor-patient dialogues. We use the `Ambient Clinical Intelligence Benchmark' (Aci-bench) dataset for our study. We collaborate with medical professionals to develop a fine-grained taxonomy of physician intents based on the SOAP framework (Subjective, Objective, Assessment, and Plan). We then conduct a large-scale annotation effort to label over 5000 doctor-patient turns with the help of a large number of medical experts recruited using Prolific, a popular crowd-sourcing platform. This large labeled dataset is an important resource contribution that we use for benchmarking the state-of-the-art generative and encoder models for medical intent classification tasks. Our findings show that our models understand the general structure of medical dialogues with high accuracy, but often fail to identify tra
Physician rostering in hospitals is complex due to varying shift structures, qualifications, and department- or hospital-specific regulations. Most existing optimization models are highly tailored to a single hospital or department and rarely see practical use. We present a general framework and a corresponding mixed-integer programming (MIP) model for physician rostering that accommodates a wide variety of roster structures and constraints. The model is integrated into a web application with an advanced graphical user interface (GUI), allowing physicians to specify preferences and hospital staff to configure the MIP model to their roster requirements without any mathematical or technical background. This approach enables easy adaptation to different hospitals or departments and straightforward updates in response to structural changes, such as new duties or modified qualifications. The applicability and effectiveness of the framework are demonstrated using real-world data from three departments in different hospitals specializing in internal medicine, cardiology, and orthopedics/trauma surgery. In one department, the system is already in everyday use, while in the other two, our m
Electronic health records (EHRs) have improved data accessibility but have also introduced cognitive burden for physicians, given the sheer volume and complexity of the data involved. Advances in large language models (LLMs) create new opportunities to rethink how clinicians interact with medical data through dynamic, adaptive interfaces. In this position paper, we explore how generative AI can support physicians' information needs by enabling more dynamic interactions with patient data. Through semi-structured interviews with internal physicians at Microsoft, we identify key challenges in data navigation and synthesis, and characterize clinicians' information needs during diagnostic workflows. We further examine how physicians conceptualize the ways AI can help their work process and how these mental models shape expectations for interaction and trust. Based on these insights, we discuss design considerations for generative user interfaces that support clinician-centered workflows.
Continuous Medical Education (CME) plays a vital role in physicians' ongoing professional development. Beyond immediate diagnoses, physicians utilize multimodal diagnostic data for retrospective learning, engaging in self-directed analysis and collaborative discussions with peers. However, learning from such data effectively poses challenges for novice physicians, including screening and identifying valuable research cases, achieving fine-grained alignment and representation of multimodal data at the semantic level, and conducting comprehensive contextual analysis aided by reference data. To tackle these challenges, we introduce Medillustrator, a visual analytics system crafted to facilitate novice physicians' retrospective learning. Our structured approach enables novice physicians to explore and review research cases at an overview level and analyze specific cases with consistent alignment of multimodal and reference data. Furthermore, physicians can record and review analyzed results to facilitate further retrospection. The efficacy of Medillustrator in enhancing physicians' retrospective learning processes is demonstrated through a comprehensive case study and a controlled in-l
Intra-physician prescribing variability, the probability that one physician issues discordant decisions for two patients deemed comparable on observed covariates, has a substantial impact on quality of care, safety, and cost. However, there are no known validated measurement methods. Here, we benchmark eight methods (Euclidean, Mahalanobis, Learned-Weights, Genetic Mahalanobis, Random Forest proximity, Mutual-Information-weighted, Latent Profile Analysis and Bayesian binomial generalized linear mixed model) against a synthetic ground truth across 94 experimental conditions. Learned-Weights matching achieves the lowest mean absolute error (0.027), followed by Mutual-Information-weighted matching (0.028) and RF Proximity (0.034). All eight discordance-analysis methods preserve the physician rank ordering with high fidelity (Spearman > 0.89 versus the ground truth on the SCORE2 experiment), as long as the physician variability groups are well separated. Under a continuous-heterogeneity physician model, rank preservation degrades substantially for unsupervised methods (Spearman = [0.28, 0.35]) but is retained by supervised feature-weighted methods and the GLMM (Spearman = [0.62, 0.68]). Th
Purpose: This study aims to enhance the clinical use of automated sleep-scoring algorithms by incorporating an uncertainty estimation approach to efficiently assist clinicians in the manual review of predicted hypnograms, a necessity due to the notable inter-scorer variability inherent in polysomnography (PSG) databases. Our efforts target the extent of review required to achieve predefined agreement levels, examining both in-domain (ID) and out-of-domain (OOD) data, and considering subjects' diagnoses. Patients and methods: A total of 19578 PSGs from 13 open-access databases were used to train U-Sleep, a state-of-the-art sleep-scoring algorithm. We leveraged a comprehensive clinical database of an additional 8832 PSGs, covering a full spectrum of ages and sleep disorders, to refine U-Sleep and to evaluate different uncertainty-quantification approaches, including our novel confidence network. The ID data consisted of PSGs scored by over 50 physicians, and the two OOD sets comprised recordings each scored by a unique senior physician. Results: U-Sleep demonstrated robust performance, with Cohen's kappa (K) at 76.2% on ID and 73.8-78.8% on OOD data. The confidence network excelled at identifyin
Large Language Models (LLMs) have exhibited remarkable capabilities in clinical scenarios. Despite their potential, existing works face challenges when applying LLMs to medical settings. Strategies relying on training with medical datasets are highly cost-intensive and may suffer from outdated training data. Leveraging external knowledge bases is a suitable alternative, yet it faces obstacles such as limited retrieval precision and poor effectiveness in answer extraction. These issues collectively prevent LLMs from demonstrating the expected level of proficiency in mastering medical expertise. To address these challenges, we introduce Med-R^2, a novel LLM physician framework that adheres to the Evidence-Based Medicine (EBM) process, efficiently integrating retrieval mechanisms as well as the selection and reasoning processes of evidence, thereby enhancing the problem-solving capabilities of LLMs in healthcare scenarios and fostering a trustworthy LLM physician. Our comprehensive experiments indicate that Med-R^2 achieves a 13.27\% improvement over vanilla RAG methods and even a 4.55\% enhancement compared to fine-tuning strategies, without incurring additional training costs. Furth
This paper studies how gifts - monetary or in-kind payments - from drug firms to physicians in the US affect prescriptions and drug costs. We estimate heterogeneous treatment effects by combining physician-level data on antidiabetic prescriptions and payments with causal inference and machine learning methods. We find that payments cause physicians to prescribe more brand drugs, resulting in a cost increase of $30 per dollar received. Responses differ widely across physicians, and are primarily explained by variation in patients' out-of-pocket costs. A gift ban is estimated to decrease drug costs by 3-4%. Taken together, these novel findings reveal how payments shape prescription choices and drive up costs.
We present \textbf{EGPF} (Equilibrium-Guided Personalization Framework), a mathematically rigorous architecture unifying Bayesian game theory, category theory, information theory, and generative AI for hyper-personalized physician engagement in the pharmaceutical domain. Our framework models the pharma--physician interaction as an incomplete-information Bayesian game where physician behavioral types are inferred via functorial mappings from observational categories, equilibrium strategies guide content generation through large language models (LLMs), and information-theoretic feedback loops ensure adaptive recalibration. We formalize behavior composition through category-theoretic functors, natural transformations, and monoidal structures, enabling modular, composable physician archetypes that respect structural invariants under domain shift. We introduce a novel \textit{Rate-Distortion Equilibrium} (RDE) criterion that bounds the personalization--privacy tradeoff, an \textit{Evolutionary Game Dynamics} layer for population-level behavior modeling, a \textit{Mechanism Design} module for incentive-compatible engagement, and a \textit{Sheaf-Theoretic} extension for multi-scale behavi
The use of large language models (LLMs) to augment clinical decision support systems is a topic of rapidly growing interest, but current shortcomings such as hallucinations and lack of clear source citations make them unreliable for use in the clinical environment. This study evaluates Ask Avo, an LLM-derived software by AvoMD that incorporates a proprietary Language Model Augmented Retrieval (LMAR) system, built-in visual citation cues, and prompt engineering designed for interactions with physicians, against ChatGPT-4 in end-user experience for physicians in a simulated clinical scenario environment. Eight clinical questions derived from medical guideline documents in various specialties were prompted to both models by 62 study participants, with each response rated on trustworthiness, actionability, relevancy, comprehensiveness, and friendly format from 1 to 5. Ask Avo significantly outperformed ChatGPT-4 in all criteria: trustworthiness (4.52 vs. 3.34, p<0.001), actionability (4.41 vs. 3.19, p<0.001), relevancy (4.55 vs. 3.49, p<0.001), comprehensiveness (4.50 vs. 3.37, p<0.001), and friendly format (4.52 vs. 3.60, p<0.001). Our findings suggest that specialize