Movement directly reflects neurological and musculoskeletal health, yet objective biomechanical assessment is rarely available in routine care. We introduce Portable Biomechanics Laboratory (PBL), a secure platform for fitting biomechanical models to video collected with a handheld, moving, smartphone. We validate this approach on over 15 hours of data synchronized to ground truth motion capture, finding mean joint-angle errors < 3$°$ and pelvis-translation errors of a few centimeters across patients with neurological-injury, lower-limb prosthesis users, pediatric in-patients, and controls. In > 5 hours of prospective deployments to neurosurgery and sports-medicine clinics, PBL was easy to setup, yielded highly reliable gait metrics (ICC > 0.9), and detected clinically relevant differences. For cervical-myelopathy patients, its measurement of gait quality correlated with modified Japanese Orthopedic Association (mJOA) scores and were responsive to clinical intervention. Handheld smartphone video can therefore deliver accurate, scalable, and low-burden biomechanical measurement, enabling greatly increased monitoring of movement impairments. We release the first clinically-v
Video-based Clinical Gait Analysis often suffers from poor generalization as models overfit environmental biases instead of capturing pathological motion. To address this, we propose BioGait-VLM, a tri-modal Vision-Language-Biomechanics framework for interpretable clinical gait assessment. Unlike standard video encoders, our architecture incorporates a Temporal Evidence Distillation branch to capture rhythmic dynamics and a Biomechanical Tokenization branch that projects 3D skeleton sequences into language-aligned semantic tokens. This enables the model to explicitly reason about joint mechanics independent of visual shortcuts. To ensure rigorous benchmarking, we augment the public GAVD dataset with a high-fidelity Degenerative Cervical Myelopathy (DCM) cohort to form a unified 8-class taxonomy, establishing a strict subject-disjoint protocol to prevent data leakage. Under this setting, BioGait-VLM achieves state-of-the-art recognition accuracy. Furthermore, a blinded expert study confirms that biomechanical tokens significantly improve clinical plausibility and evidence grounding, offering a path toward transparent, privacy-enhanced gait assessment.
Clinical NLP increasingly relies on electronic health record (EHR) data to detect suicidal behaviors, treating clinical documentation as more reliable ground truth than social media. We argue that this framing obscures how EHR-based suicidality datasets encode a particular operationalization of suicidality, shaped by who authors the data, how episodes are bounded, and how ambiguity is resolved. We ground this argument in a case study of the ScAN dataset, built over MIMIC-III clinical notes. We show how governance constraints, ICD-based cohort selection, single-annotator labeling, and hospital-stay-level aggregation produce labels that reflect clinician-documented judgments, treat suicidality as a bounded episode, and assume that intent can be reliably inferred from documentation. A linguistic analysis demonstrates that identical labels subsume heterogeneous clinical framings differing in temporality, negation, and uncertainty. We argue that clinical NLP should examine the assumptions embedded in suicidality datasets before interpreting their labels as ground truth.
The competency of any intelligent agent is bounded by its formal account of the world in which it operates. Clinical AI lacks such an account. Existing frameworks address evaluation, regulation, or system design in isolation, without a shared model of the clinical world to connect them. We introduce the Clinical World Model, a framework that formalizes care as a tripartite interaction among Patient, Provider, and Ecosystem. To formalize how any agent, whether human or artificial, transforms information into clinical action, we develop parallel decision-making architectures for providers, patients, and AI agents, grounded in validated principles of clinical cognition. The Clinical AI Skill-Mix operationalizes competency through eight dimensions. Five define the clinical competency space (condition, phase, care setting, provider role, and task) and three specify how AI engages human reasoning (assigned authority, agent facing, and anchoring layer). The combinatorial product of these dimensions yields a space of billions of distinct competency coordinates. A central structural implication is that validation within one coordinate provides minimal evidence for performance in another, re
Developing AI models that are useful in clinical practice, requires efficient collaboration between clinicians and AI developers. This poses a practical challenge: clinicians must repeatedly communicate and refine their requirements with AI developers before those requirements can be translated into executable model development. This iterative process is time-consuming, and even after repeated discussion, misalignment may still exist because the two sides do not fully share each other's expertise. Coding agents may help close this gap. They can write and refine code on their own, and they carry working knowledge of both medicine and AI to understand commands formulated by both medical experts and developers. We present a prototype that lets clinicians drive AI development directly. A clinician describes the task in plain language, and the system turns the description into a working pipeline, refines it through repeated experiments together with the clinician, and returns a model that meets the stated clinical objective. Across five clinical tasks, the system reliably produces models that matched the clinician's request and reached competitive performance. Most notably, on chest rad
We introduce Clinical ModernBERT, a transformer based encoder pretrained on large scale biomedical literature, clinical notes, and medical ontologies, incorporating PubMed abstracts, MIMIC IV clinical data, and medical codes with their textual descriptions. Building on ModernBERT the current state of the art natural language text encoder featuring architectural upgrades such as rotary positional embeddings (RoPE), Flash Attention, and extended context length up to 8,192 tokens our model adapts these innovations specifically for biomedical and clinical domains. Clinical ModernBERT excels at producing semantically rich representations tailored for long context tasks. We validate this both by analyzing its pretrained weights and through empirical evaluation on a comprehensive suite of clinical NLP benchmarks.
Empiric antibiotic prescribing in high-risk clinical contexts often requires decision making under conditions of incomplete information, where inappropriate coverage or unjustified escalation may compromise safety and antimicrobial stewardship. While clinical decision-support systems have been proposed to assist in this process, many approaches lack explicit governance and evaluation mechanisms defining scope, abstention conditions, recommendation permissibility, and expected system behavior. This work specifies a governance and evaluation framework for deterministic clinical decision-support systems operating under explicitly constrained scope. Deterministic behavior is adopted to ensure that identical inputs yield identical outputs, supporting transparency, auditability, and conservative decision support in high-risk prescribing contexts. The framework treats governance as a first-class design component, separating clinical decision logic from rule-based mechanisms that determine whether a recommendation may be issued. Explicit abstention, deterministic stewardship constraints, and exclusion rules are formalized as core constructs. The framework defines an evaluation methodology
We introduce SoftTiger, a clinical large language model (CLaM) designed as a foundation model for healthcare workflows. The narrative and unstructured nature of clinical notes is a major obstacle for healthcare intelligentization. We address a critical problem of structuring clinical notes into clinical data, according to international interoperability standards. We collect and annotate data for three subtasks, namely, international patient summary, clinical impression and medical encounter. We then supervised fine-tuned a state-of-the-art LLM using public and credentialed clinical data. The training is orchestrated in a way that the target model can first support basic clinical tasks such as abbreviation expansion and temporal information extraction, and then learn to perform more complex downstream clinical tasks. Moreover, we address several modeling challenges in the healthcare context, e.g., extra long context window. Our blind pairwise evaluation shows that SoftTiger outperforms other popular open-source models and GPT-3.5, comparable to Gemini-pro, with a mild gap from GPT-4. We believe that LLMs may become a step-stone towards healthcare digitalization and democratization.
Introduction: Semantic search, which retrieves documents based on conceptual similarity rather than keyword matching, offers substantial advantages for retrieval of clinical information. However, deploying semantic search across entire health systems, comprising hundreds of millions of clinical notes, presents formidable engineering, cost, and governance challenges that have prevented adoption. Methods: We deployed a semantic search system at a large children's hospital indexing 166 million clinical notes (484 million vectors) from 1.68 million patients. The system uses instruction-tuned qwen3-embedding-0.6B embeddings, stores vectors in a managed database with storage-optimized indexing, maintains full-text metadata in a low-latency key-value store, and operates within a HIPAA-compliant governance framework. We evaluated the system through three experiments: optimization of embedding model and chunking strategy using a physician-authored benchmark dataset, characterization of full-scale performance (cost, latency, retrieval quality), and clinical utility assessment via comparison of chart abstraction efficiency across three tasks. Results: The system delivers sub-second query late
Advances in markerless motion capture are expanding access to biomechanical movement analysis, making it feasible to obtain high-quality movement data from outpatient clinics, inpatient hospitals, therapy, and even home. Expanding access to movement data in these diverse contexts makes the challenge of performing downstream analytics all the more acute. Creating separate bespoke analysis code for all the tasks end users might want is both intractable and does not take advantage of the common features of human movement underlying them all. Recent studies have shown that fine-tuning language models to accept tokenized movement as an additional modality enables successful descriptive captioning of movement. Here, we explore whether such a multimodal motion-language model can answer detailed, clinically meaningful questions about movement. We collected over 30 hours of biomechanics from nearly 500 participants, many with movement impairments from a variety of etiologies, performing a range of movements used in clinical outcomes assessments. After tokenizing these movement trajectories, we created a multimodal dataset of motion-related questions and answers spanning a range of tasks. We
Markerless biomechanics increasingly relies on 3D skeletal keypoints extracted from video, yet downstream biomechanical mappings typically treat these estimates as deterministic, providing no principled mechanism for frame-wise quality control. In this work, we investigate predictive uncertainty as a quantitative measure of confidence for mapping 3D pose keypoints to 3D anatomical landmarks, a critical step preceding inverse kinematics and musculoskeletal analysis. Within a temporal learning framework, we model both uncertainty arising from observation noise and uncertainty related to model limitations. Using synchronized motion capture ground truth on AMASS, we evaluate uncertainty at frame and joint level through error--uncertainty rank correlation, risk--coverage analysis, and catastrophic outlier detection. Across experiments, uncertainty estimates, particularly those associated with model uncertainty, exhibit a strong monotonic association with landmark error (Spearman $ρ\approx 0.63$), enabling selective retention of reliable frames (error reduced to $\approx 16.8$ mm at 10% coverage) and accurate detection of severe failures (ROC-AUC $\approx 0.92$ for errors $>50$ mm). R
We evaluate the impact of large language model-based clinical decision support in live care. In partnership with Penda Health, a network of primary care clinics in Nairobi, Kenya, we studied AI Consult, a tool that serves as a safety net for clinicians by identifying potential documentation and clinical decision-making errors. AI Consult integrates into clinician workflows, activating only when needed and preserving clinician autonomy. We conducted a quality improvement study, comparing outcomes for 39,849 patient visits performed by clinicians with or without access to AI Consult across 15 clinics. Visits were rated by independent physicians to identify clinical errors. Clinicians with access to AI Consult made relatively fewer errors: 16% fewer diagnostic errors and 13% fewer treatment errors. In absolute terms, the introduction of AI Consult would avert diagnostic errors in 22,000 visits and treatment errors in 29,000 visits annually at Penda alone. In a survey of clinicians with AI Consult, all clinicians said that AI Consult improved the quality of care they delivered, with 75% saying the effect was "substantial". These results required a clinical workflow-aligned AI Consult i
Digital Twins hold great potential to personalize clinical patient care, provided the concept is translated to meet specific requirements emerging from established clinical workflows. We present a general and unspecialized Digital Twin design combining knowledge graphs and ensemble learning to reflect the entire patient's clinical journey and assist clinicians in their decision-making. Such a design is predictive, modular, evolving, informed, interpretable and explainable, thus opening broad clinical applications.
Bioinformatics platforms have significantly changed clinical diagnostics by facilitating the analysis of genomic data, thereby advancing personalized medicine and improving patient care. This study examines the integration, usage patterns, challenges, and impact of the Galaxy platform within clinical diagnostics laboratories. We employed a convergent parallel mixed-methods design, collecting quantitative survey data and qualitative insights from structured interviews with fifteen participants across various clinical roles. The findings indicate a wide adoption of Galaxy, with participants expressing high satisfaction due to its user-friendly interface and notable improvements in workflow efficiency and diagnostic accuracy. Challenges such as data security and training needs were also identified, highlighting the platform's role in simplifying complex data analysis tasks. This study contributes to understanding the transformative potential of Galaxy in clinical practice and offers recommendations for optimizing its integration and functionality. These insights are crucial for advancing clinical diagnostics and enhancing patient outcomes.
Clinical decision-making relies on the integrated analysis of medical images and the associated clinical reports. While Vision-Language Models (VLMs) can offer a unified framework for such tasks, they can exhibit strong biases toward one modality, frequently overlooking critical visual cues in favor of textual information. In this work, we introduce Selective Modality Shifting (SMS), a perturbation-based approach to quantify a model's reliance on each modality in binary classification tasks. By systematically swapping images or text between samples with opposing labels, we expose modality-specific biases. We assess six open-source VLMs-four generalist models and two fine-tuned for medical data-on two medical imaging datasets with distinct modalities: MIMIC-CXR (chest X-ray) and FairVLMed (scanning laser ophthalmoscopy). By assessing model performance and the calibration of every model in both unperturbed and perturbed settings, we reveal a marked dependency on text input, which persists despite the presence of complementary visual information. We also perform a qualitative attention-based analysis which further confirms that image content is often overshadowed by text details. Our
This paper is dedicated to the design and evaluation of the first AMR parser tailored for clinical notes. Our objective was to facilitate the precise transformation of the clinical notes into structured AMR expressions, thereby enhancing the interpretability and usability of clinical text data at scale. Leveraging the colon cancer dataset from the Temporal Histories of Your Medical Events (THYME) corpus, we adapted a state-of-the-art AMR parser utilizing continuous training. Our approach incorporates data augmentation techniques to enhance the accuracy of AMR structure predictions. Notably, through this learning strategy, our parser achieved an impressive F1 score of 88% on the THYME corpus's colon cancer dataset. Moreover, our research delved into the efficacy of data required for domain adaptation within the realm of clinical notes, presenting domain adaptation data requirements for AMR parsing. This exploration not only underscores the parser's robust performance but also highlights its potential in facilitating a deeper understanding of clinical narratives through structured semantic representations.
Specialised pre-trained language models are becoming more frequent in NLP since they can potentially outperform models trained on generic texts. BioBERT and BioClinicalBERT are two examples of such models that have shown promise in medical NLP tasks. Many of these models are overparametrised and resource-intensive, but thanks to techniques like Knowledge Distillation (KD), it is possible to create smaller versions that perform almost as well as their larger counterparts. In this work, we specifically focus on development of compact language models for processing clinical texts (i.e. progress notes, discharge summaries etc). We developed a number of efficient lightweight clinical transformers using knowledge distillation and continual learning, with the number of parameters ranging from 15 million to 65 million. These models performed comparably to larger models such as BioBERT and ClinicalBioBERT and significantly outperformed other compact models trained on general or biomedical data. Our extensive evaluation was done across several standard datasets and covered a wide range of clinical text-mining tasks, including Natural Language Inference, Relation Extraction, Named Entity Reco
Despite the plethora of AI-based algorithms developed for anomaly detection in radiology, subsequent integration into clinical setting is rarely evaluated. In this work, we assess the applicability and utility of an AI-based model for brain aneurysm detection comparing the performance of two readers with different levels of experience (2 and 13 years). We aim to answer the following questions: 1) Do the readers improve their performance when assisted by the AI algorithm? 2) How much does the AI algorithm impact routine clinical workflow? We reuse and enlarge our open-access, Time-Of-Flight Magnetic Resonance Angiography dataset (N=460). We use 360 subjects for training/validating our algorithm and 100 as unseen test set for the reading session. Even though our model reaches state-of-the-art results on the test set (sensitivity=74%, false positive rate=1.6), we show that neither the junior nor the senior reader significantly increase their sensitivity (p=0.59, p=1, respectively). In addition, we find that reading time for both readers is significantly higher in the "AI-assisted" setting than in the "Unassisted" (+15 seconds, on average; p=3x10^(-4) junior, p=3x10^(-5) senior). The c
Like other fields of Traditional Medicines, Unani Medicines have been found as an effective medical practice for ages. It is still widely used in the subcontinent, particularly in Pakistan and India. However, Unani Medicines Practitioners are lacking modern IT applications in their everyday clinical practices. An Online Clinical Decision Support System may address this challenge to assist apprentice Unani Medicines practitioners in their diagnostic processes. The proposed system provides a web-based interface to enter the patient's symptoms, which are then automatically analyzed by our system to generate a list of probable diseases. The system allows practitioners to choose the most likely disease and inform patients about the associated treatment options remotely. The system consists of three modules: an Online Clinical Decision Support System, an Artificial Intelligence Inference Engine, and a comprehensive Unani Medicines Database. The system employs advanced AI techniques such as Decision Trees, Deep Learning, and Natural Language Processing. For system development, the project team used a technology stack that includes React, FastAPI, and MySQL. Data and functionality of the a
Automatic Speech Recognition (ASR) in medical contexts has the potential to save time, cut costs, increase report accuracy, and reduce physician burnout. However, the healthcare industry has been slower to adopt this technology, in part due to the importance of avoiding medically-relevant transcription mistakes. In this work, we present the Clinical BERTScore (CBERTScore), an ASR metric that penalizes clinically-relevant mistakes more than others. We demonstrate that this metric more closely aligns with clinician preferences on medical sentences as compared to other metrics (WER, BLUE, METEOR, etc), sometimes by wide margins. We collect a benchmark of 18 clinician preferences on 149 realistic medical sentences called the Clinician Transcript Preference benchmark (CTP) and make it publicly available for the community to further develop clinically-aware ASR metrics. To our knowledge, this is the first public dataset of its kind. We demonstrate that CBERTScore more closely matches what clinicians prefer.