Accurate prediction of laser energy absorption and the corresponding thermal spread is essential for safe and effective outcomes in magnetic resonance-guided laser interstitial thermal therapy (MRgLITT), as it enables clinicians to ensure complete ablation of the epileptogenic focus while minimizing collateral damage. However, current planning tools rely on simplified models that neglect patient-specific anatomy and the cooling effects of cerebrospinal fluid (CSF), often resulting in incomplete or asymmetric ablations. To address these limitations, this work sought to improve preoperative planning for MRgLITT by integrating the bioheat transfer equation (BHTE) with U-Net-based deep learning to predict patient-specific temperature maps. We leveraged a large dataset of paired presurgical MR and intraoperative temperature maps from 340 MRgLITT procedures for mesial temporal lobe epilepsy. We proposed a hybrid model, the physics-assisted U-Net (PA-U-Net), which used planning MRIs, tissue segmentations, and a physics prior to predict intraoperative temperature maps. We compared its performance with two baseline models: a physics-only model (BHTE) and a deep learning model (U-Net). All three models were evaluated against ground-truth magnetic resonance thermometry (MRT). We evaluated the spatial and morphological accuracy of the ablation zone, delineated by a 42 °C temperature threshold, using the Dice similarity coefficient and a roundedness (4πA/P²) metric, as well as pixel-wise root-mean-square error (RMSE, °C) and the percentage of well-predicted voxels (|prediction - ground truth| ≤ 5 °C). A generalized linear model analysis was performed to evaluate how local tissue composition, particularly CSF proximity, influenced ablation geometry and model performance. The models were trained on 313 MRgLITT cases and evaluated on 27 test acquisitions. The PA-U-Net achieved the highest overall spatial agreement with the ground-truth MRT thermal distribution, with a Dice score of 0.74, significantly higher than that of the U-Net (p < 0.001). Although PA-U-Net and BHTE showed similar overall Dice scores, threshold-wise analysis revealed that PA-U-Net consistently maintained superiority at higher isotherm thresholds, indicating improved localization of hotter ablation cores. Per-case analysis showed that PA-U-Net outperformed BHTE in 18 of 27 test cases, demonstrating stronger and more consistent performance across subjects. Roundedness analysis showed that both U-Net and BHTE differed significantly from the ground truth (p < 0.001), whereas PA-U-Net showed no significant difference (p > 0.05), indicating the closest alignment with the true ablation geometry. Ground-truth roundedness showed no significant dependence on CSF proximity across the full dataset (p = 0.23), but within the subset of cases where PA-U-Net outperformed both BHTE and U-Net, the relationship exhibited a marginal negative trend (p = 0.06), suggesting that higher CSF content around the laser tip was associated with lower roundedness. Pixel-wise analyses showed that PA-U-Net achieved an RMSE of 2.68 ± 0.47 °C and a well-predicted voxel percentage of 86%. By embedding a physics-based heat prior into the U-Net, PA-U-Net achieved both anatomical adaptability and physical consistency, accurately reproducing the temperature magnitude and spatial shape characteristics of true thermal distributions.
This hybrid framework outperformed data-driven and physics-only models, particularly in complex anatomical regions where traditional physics-based methods fail. These results demonstrate that physics-assisted deep learning can substantially enhance MRgLITT treatment planning and lay the groundwork for future AI-assisted, patient-specific surgical planning tools.
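As an illustration of the evaluation metrics named in the abstract above (Dice similarity, roundedness 4πA/P², RMSE, and the ≤ 5 °C well-predicted fraction), the following minimal Python sketch computes them on placeholder temperature maps thresholded at 42 °C; the arrays, perimeter estimate, and pixel size are illustrative assumptions, not the study's code or data.

```python
import numpy as np

def dice_coefficient(pred_mask, gt_mask):
    """Dice similarity coefficient between two binary masks."""
    intersection = np.logical_and(pred_mask, gt_mask).sum()
    total = pred_mask.sum() + gt_mask.sum()
    return 2.0 * intersection / total if total > 0 else 1.0

def roundedness(mask, pixel_size_mm=1.0):
    """Roundedness 4*pi*A/P**2 of a binary region (1.0 for a perfect circle).

    The area A counts foreground pixels; the perimeter P is approximated by
    counting foreground pixels with at least one background 4-neighbour,
    which is a crude but simple boundary estimate.
    """
    m = np.pad(mask.astype(bool), 1, constant_values=False)
    core = m[1:-1, 1:-1]
    interior = m[:-2, 1:-1] & m[2:, 1:-1] & m[1:-1, :-2] & m[1:-1, 2:]
    boundary = core & ~interior
    area = core.sum() * pixel_size_mm ** 2
    perimeter = boundary.sum() * pixel_size_mm
    return 4.0 * np.pi * area / perimeter ** 2 if perimeter > 0 else 0.0

# Placeholder temperature maps (°C); real inputs would be predicted and MRT maps.
rng = np.random.default_rng(0)
pred_temp = rng.uniform(30.0, 60.0, (128, 128))
gt_temp = rng.uniform(30.0, 60.0, (128, 128))

pred_zone, gt_zone = pred_temp >= 42.0, gt_temp >= 42.0       # 42 °C isotherm
rmse = float(np.sqrt(np.mean((pred_temp - gt_temp) ** 2)))
pct_within_5c = 100.0 * np.mean(np.abs(pred_temp - gt_temp) <= 5.0)
print(dice_coefficient(pred_zone, gt_zone), roundedness(gt_zone), rmse, pct_within_5c)
```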
Transoral robotic surgery (TORS) has become a well-established surgical technique for the treatment of oropharyngeal cancer, but the significant learning curve and lack of standardized credentialing have resulted in wide variability in surgical outcomes. This study aims to define procedure-specific competence standards for TORS and test whether a hierarchical task analysis (HTA)-derived procedure-based assessment (PBA) distinguishes experience levels. We examined the ability of PBA and Global Evaluative Assessment of Robotic Skills (GEARS) scores to discriminate between novice and experienced surgeons and assessed their association with operative efficiency and margin quality. We built an HTA by deconstructing the TORS lateral oropharyngectomy into tasks and subtasks. PBA metrics for mucosal incision and deep dissection were then developed. Two independent raters scored 40 porcine tongue TORS videos (20 novice, 20 experienced) using PBA and GEARS and recorded global and phase times. Experienced surgeons scored higher on total PBA (39.98 vs 35.35, p = 0.0055) and GEARS (22.60 vs 19.63, p = 0.0009) and showed less score variability. The largest gaps were in lateral tasks: lateral mucosal incision 4.65 vs 3.60 (p = 0.0015) and lateral deep dissection 4.58 vs 3.85 (p = 0.0115). Margin scores were higher in experienced surgeons (4.38 vs 3.80, p = 0.0149). Experienced surgeons were also faster overall (298.47 vs 466.43 s, p = 0.0003), with shorter mucosal incision and deep dissection times. An HTA-derived PBA reliably differentiates TORS expertise, aligns with speed and margin quality, and identifies lateral tasks as high-yield training targets. These metrics support standardized training, assessment, and integration into VR simulation for competency-based credentialing.
Accurate detection and 6D pose estimation of surgical instruments are crucial for many computer-assisted interventions. However, supervised methods lack flexibility for new or unseen tools and require extensive annotated data. This work introduces a training-free pipeline for accurate multi-view 6D pose estimation of unseen surgical instruments, which only requires a textured CAD model as prior knowledge. Our pipeline consists of two main stages. First, for detection, we generate object mask proposals in each view and score their similarity to rendered templates using a pre-trained feature extractor. Detections are matched across views, triangulated into 3D instance candidates, and filtered using multi-view geometric consistency. Second, for pose estimation, a set of pose hypotheses is iteratively refined and scored using feature-metric scores with cross-view attention. The best hypothesis undergoes a final refinement using a novel multi-view, occlusion-aware contour registration, which minimizes reprojection errors of unoccluded contour points. The proposed method was rigorously evaluated on real-world surgical data from the MVPSP dataset. The method achieves millimeter-accurate pose estimates that are on par with supervised methods under controlled conditions, while maintaining full generalization to unseen instruments. These results demonstrate the feasibility of training-free, marker-less detection and tracking in surgical scenes, and highlight the unique challenges in surgical environments. We present a novel and flexible pipeline that effectively combines state-of-the-art foundational models, multi-view geometry, and contour-based refinement for high-accuracy 6D pose estimation of surgical instruments without any task-specific training. This approach enables robust instrument tracking and scene understanding in dynamic clinical environments.
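To make the multi-view geometric-consistency step concrete, here is a hedged sketch of two standard ingredients it relies on: DLT triangulation of a matched detection from two calibrated views and the reprojection error used for filtering. The camera matrices, baseline, and point are synthetic placeholders; this is not the authors' pipeline.

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point from two calibrated views.

    P1, P2: 3x4 projection matrices; x1, x2: 2D pixel coordinates.
    Returns the 3D point in the common world frame.
    """
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def reprojection_error(P, X, x):
    """Pixel distance between the projected 3D point and the 2D observation."""
    proj = P @ np.append(X, 1.0)
    return np.linalg.norm(proj[:2] / proj[2] - x)

# Toy example: two synthetic cameras (5 cm baseline) observing a point at (0.1, 0, 0.5) m
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.05], [0.0], [0.0]])])
X_true = np.array([0.1, 0.0, 0.5])
x1 = P1 @ np.append(X_true, 1.0); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X_true, 1.0); x2 = x2[:2] / x2[2]
X_hat = triangulate_dlt(P1, P2, x1, x2)
print(X_hat, reprojection_error(P1, X_hat, x1))   # X_hat ≈ X_true, error ≈ 0 pixels
```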
Image-guided robot-assisted partial nephrectomy (RAPN) that incorporates three-dimensional (3D) models has improved both oncological and functional outcomes. However, registration between the physical endoscopic image and the 3D virtual model is often performed manually at the kidney level, limiting accuracy and usability. This study aimed to develop a deep learning-based framework for renal artery segmentation and angle estimation, enabling autonomous registration in image-guided RAPN. A total of 75 images were extracted from 50 RAPN videos. The renal arteries in these images were manually annotated and used to evaluate a U-Net segmentation model. Masks produced by the segmentation model were approximated as ellipsoids to calculate renal artery angles. The model was evaluated using sevenfold cross-validation and further tested on endoscopic images from four unseen clinical cases, verifying its adaptability. The predicted angles were then applied to rotate the corresponding 3D kidney models, demonstrating the feasibility of autonomous registration. The segmentation model achieved an average Dice similarity coefficient of 0.817 in sevenfold cross-validation. The prediction errors for the renal artery angles of the four RAPN cases were RAPN1 = 0.57°, RAPN2 = 3.4°, RAPN3 = 4.9°, and RAPN4 = 4.1°. The required 3D model rotations computed from the predicted angles were RAPN1 = +19.7°, RAPN2 = -19.2°, RAPN3 = -28.4°, and RAPN4 = -13.3°. These findings demonstrate that deep learning-based segmentation and angle estimation of the renal artery can be performed accurately, providing a foundation for autonomous registration in image-guided RAPN.
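A minimal sketch of the angle-estimation idea, assuming a 2D binary artery mask: the principal axis of the mask pixels (a PCA stand-in for the ellipse/ellipsoid fit described above) gives an in-plane orientation, and the difference to a hypothetical model angle gives the rotation to apply to the 3D kidney model. All values are illustrative.

```python
import numpy as np

def mask_orientation_deg(mask):
    """In-plane orientation (degrees, modulo 180) of a binary mask via PCA.

    The principal axis of the foreground pixel coordinates is used as a
    simple stand-in for an ellipse-fit orientation.
    """
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    pts -= pts.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(pts, rowvar=False))
    major = eigvecs[:, np.argmax(eigvals)]            # principal axis direction
    return float(np.degrees(np.arctan2(major[1], major[0])) % 180.0)

# Toy elongated "artery" mask tilted roughly 30 degrees
mask = np.zeros((200, 200), dtype=bool)
t = np.linspace(-60.0, 60.0, 400)
mask[(100 + t * np.sin(np.radians(30.0))).astype(int),
     (100 + t * np.cos(np.radians(30.0))).astype(int)] = True

angle = mask_orientation_deg(mask)        # ≈ 30°
model_angle = 10.0                        # hypothetical artery angle in the current 3D model view
rotation_deg = angle - model_angle        # rotation to apply to the 3D model
print(round(angle, 1), round(rotation_deg, 1))
```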
Robotic-assisted surgery (RAS) is becoming increasingly popular, with improved ergonomics being one of its praised benefits. While noise exposure has been assessed for other types of interventions, such as open or laparoscopic surgery, and ergonomic evaluations are based on those findings, there are no studies analyzing the changed environment of tele-operated RAS. Twelve robotic interventions with an average length of 3.5 h, including (hemi-)colectomies, anterior rectum resections, and sigmoidectomies, were assessed using a sound level meter. Recordings were monitored by a medical expert, who manually annotated individual phases of the intervention from the patient's entry into the operating room until exit. Measured sound data were then mapped to one of six corresponding surgical phases. The data showed an average noise level of 60.5 (±3.35) dB(A) and an L(A)eq of 63.49 dB(A), with the robotic phase being significantly, but only 0.15 dB(A), louder than the non-robotic surgical part (60.4 (±3.14) dB(A)). Overall, the peri-surgical phase was louder than the surgery itself, with the post-surgical phase being the loudest of all phases at 61.2 (±4.82) dB(A). Regarding sound levels, tele-operated RAS appears comparable with open and laparoscopic surgery, despite the robotic phase being slightly louder in this evaluation. Nevertheless, sound levels appear high, and further reduction should be considered in order to improve ergonomics and patient safety.
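The gap between the reported arithmetic mean (60.5 dB(A)) and the L(A)eq (63.49 dB(A)) reflects energetic averaging, in which louder events dominate. A minimal sketch of the standard Leq computation on hypothetical A-weighted samples:

```python
import numpy as np

def leq_db(levels_db):
    """Equivalent continuous sound level from equally spaced A-weighted samples.

    Levels are energetically averaged: Leq = 10 * log10(mean(10 ** (L_i / 10))).
    """
    levels_db = np.asarray(levels_db, dtype=float)
    return 10.0 * np.log10(np.mean(10.0 ** (levels_db / 10.0)))

# Hypothetical one-second A-weighted levels from two surgical phases
robotic_phase = np.array([60.2, 61.0, 60.8, 59.9])
post_surgical = np.array([58.0, 66.0, 60.5, 62.0])   # occasional loud events dominate Leq
print(leq_db(robotic_phase), leq_db(post_surgical))
```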
Developing embodied AI for intelligent surgical systems requires safe, controllable environments for continual learning and evaluation. However, safety regulations and operational constraints in operating rooms (ORs) limit agents from freely perceiving and interacting in realistic settings. Digital twins provide high-fidelity, risk-free environments for exploration and training. How we may create photorealistic and dynamic digital representations of ORs that capture relevant spatial, visual, and behavioral complexity remains an open challenge. We introduce TWINOR, a real-to-sim infrastructure for constructing photorealistic and dynamic digital twins of ORs for embodied AI research. The system reconstructs static geometry from pre-scan videos and continuously models human and equipment motion through multi-view perception of OR activities. The static and dynamic components are fused into an immersive 3D environment that supports controllable simulation and facilitates future embodied exploration. The proposed framework reconstructs complete OR geometry with centimeter-level accuracy while preserving dynamic interaction across surgical workflows, enabling realistic renderings and a virtual playground for embodied perception benchmarks. In our experiments, TWINOR synthesizes stereo and monocular RGB streams as well as depth observations for geometry understanding and visual localization tasks. Models such as FoundationStereo and ORB-SLAM3 evaluated on TWINOR-synthesized data achieve performance within their reported accuracy ranges on real-world indoor datasets, demonstrating that TWINOR provides sensor-level realism sufficient for emulating real-world perception and localization challenges in dynamic OR scenes. By establishing a perception-grounded real-to-sim pipeline, TWINOR enables the automatic construction of dynamic, photorealistic digital twins of ORs. As a safe and scalable environment for experimentation and benchmarking, TWINOR opens new opportunities for translating embodied intelligence from simulation to real-world clinical environments, and sets the stage for future research on interaction, autonomy, and human-robot collaboration in the OR.
Ultrasound-guided radiofrequency ablation (RFA) of benign thyroid nodules is an effective, minimally invasive alternative to surgery but has a steep learning curve and limited formal training options. Toward addressing this gap, we developed a mixed reality simulator for thyroid nodule RFA. We implemented a real-time, voxel-based heat-transfer model of a thyroid nodule that computes temperature, thermal damage, and temperature-dependent impedance within a mixed reality simulator. The model was calibrated and verified with published RFA data from a thermal property-matched thyroid phantom and validated against published ex vivo lesion volumes. The simulator provides configurable nodule size and location, renders RFA ultrasound artifacts and lesion visualization, computes quantitative ablation metrics, and includes an interactive virtual RFA generator interface. Simulated temperature-time curves matched phantom sensor readings with a root mean square error of 1.4 °C. Simulated lesion volumes were within -7.3% to +0.9% of the ex vivo reference across 1.0-0.125 mm³ voxel volumes, and lesion aspect ratios were lower by 4.7-10.5%. In a post-use survey, a single expert clinician rated visual realism, feedback fidelity, and training utility favorably. The simulator closely reproduced phantom temperature profiles and ex vivo lesion sizes. Its architecture is configurable and extensible to other organs and thermal ablation modalities. Formal educational studies are warranted to evaluate training effectiveness of the simulator.
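As a rough illustration of what a voxel-based heat-transfer model with thermal-damage tracking involves (not the simulator's calibrated implementation), the sketch below advances a Pennes-type bioheat equation with an explicit finite-difference update and accumulates an Arrhenius damage integral; all tissue and source parameters are generic placeholders.

```python
import numpy as np

dx = 1.0e-3           # voxel edge length (m)
dt = 0.05             # time step (s), well below the explicit stability limit
rho, c, k = 1050.0, 3600.0, 0.5                           # generic soft-tissue properties
w_b, rho_b, c_b, T_art = 5.0e-3, 1050.0, 3600.0, 37.0     # perfusion sink term
A_freq, E_a, R_gas = 7.39e39, 2.577e5, 8.314              # Arrhenius constants (liver-like, illustrative)

shape = (21, 21, 21)
T = np.full(shape, 37.0)          # temperature field (°C)
omega = np.zeros(shape)           # accumulated thermal damage
Q = np.zeros(shape)
Q[10, 10, 10] = 5.0e7             # localized RF power density (W/m^3), illustrative only

def laplacian(T):
    lap = -6.0 * T
    for axis in range(3):
        lap += np.roll(T, 1, axis) + np.roll(T, -1, axis)
    return lap / dx ** 2

for _ in range(int(60.0 / dt)):   # simulate 60 s of heating
    dTdt = (k * laplacian(T) + Q - w_b * rho_b * c_b * (T - T_art)) / (rho * c)
    T = T + dt * dTdt
    # fixed-temperature boundary (body temperature)
    T[0, :, :] = T[-1, :, :] = T[:, 0, :] = T[:, -1, :] = T[:, :, 0] = T[:, :, -1] = 37.0
    omega += dt * A_freq * np.exp(-E_a / (R_gas * (T + 273.15)))

lesion_volume_mm3 = (omega >= 1.0).sum() * (dx * 1e3) ** 3
print(T.max(), lesion_volume_mm3)
```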
Incorrectly formed medical device ensembles can cause therapy to be performed at the wrong site or display safety-critical data from the wrong patient. This early, exploratory study investigates which wireless ensemble-creation techniques minimize risk while remaining safe and usable in intraoperative workflows. The goal is to identify the most promising methods among 5G, NFC, and pop-up pairing. Eight techniques (5G proximity/location, pop-up pairing, QR code, Near-Field Communication (NFC) tag and reader, device-ID copying, manual selection, device interaction, and Service-oriented Device Connectivity (SDC) context lookup) and five clinical situations (patient bedside, transport, prep room, operating room (OR), post OR) were examined. Five interoperability and clinical experts rated the importance of risk- and usability-related criteria for each situation, as well as how well each method fulfills the criteria. The criteria weights and the method-specific ratings were combined in a quantitative decision matrix to identify the three most promising approaches, which were implemented in a click-dummy prototype and evaluated through formative testing with two clinicians. The analysis rated 5G, NFC, and pop-up pairing the highest when combining all categories. Subsequent formative tests with clinicians from the University Hospital Bonn support technical feasibility and highlight clear trade-offs. In this small sample (n = 2), NFC was perceived as the safest and most preferred technique. 5G offered speed and automation potential but raised critical concerns regarding accuracy. Pop-up pairing was considered flexible yet error-prone when multiple devices entered pairing mode simultaneously. This is a systematic comparison of wireless SDC-ensemble techniques. Early clinical feedback favors NFC for its perceived safety, while 5G and pop-up pairing approaches require additional refinements to mitigate residual risks. As one participant pointed out: "I prefer a reliable and slower over an automated but error-prone method".
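A hypothetical miniature of the weighted decision-matrix step: expert criteria weights for one clinical situation are combined with method-specific ratings to rank the techniques. Criteria names, weights, and ratings below are invented for illustration and do not reproduce the study's matrix.

```python
import numpy as np

# Illustrative excerpt: three criteria for one situation (OR) and 1-5 ratings
# of how well each pairing technique fulfills them.
criteria = ["wrong-device risk", "time to pair", "hands-free operation"]
weights = np.array([0.5, 0.3, 0.2])                  # normalized importance per situation

ratings = {                                          # rows follow `criteria`
    "5G proximity/location": np.array([3, 5, 5]),
    "NFC tag and reader":    np.array([5, 4, 2]),
    "pop-up pairing":        np.array([3, 4, 4]),
}

scores = {name: float(weights @ r) for name, r in ratings.items()}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.2f}")
```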
Markerless inside-out surgical instrument tracking using a tool-mounted camera offers a promising solution to the limited clinical adoption of existing navigation systems, which primarily rely on outside-in optical tracking and are constrained by line-of-sight issues. However, its performance in the surgical environment, with its unique challenges, remains largely unexplored. This work benchmarks state-of-the-art inside-out methods, namely visual Simultaneous Localization and Mapping (vSLAM) approaches. To this end, we collected a first-of-its-kind dataset in spine endoscopy, providing ground-truth tool poses. We recorded endoscopic spine surgeries performed on a high-fidelity training model in a real operating room environment, containing synchronized stereo images from tool-mounted cameras, sub-millimetric ground-truth pose data from a commercial optical tracking system, and the endoscopic feed. Using this dataset, the instrument tracking accuracy of a selected number of vSLAM algorithms was compared. The best-performing approach achieved a root mean squared absolute trajectory error of 2.0 mm and 1.47 degrees, reaching accuracies of around 1 mm and 1 degree on selected sequences. However, it showed degraded performance in the presence of challenges such as occlusions and scene-object dynamics. Markerless inside-out tracking using vSLAM demonstrated high accuracy, indicating potential feasibility for navigated endoscopic spine applications. However, our evaluation also revealed that current algorithms remain insufficiently robust for routine clinical use. The presented study and dataset establish a foundation for future research toward reliable, real-time inside-out navigation in minimally invasive surgery.
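The trajectory-error figures quoted above follow the usual absolute trajectory error (ATE) protocol. The sketch below shows one common way to compute a translational ATE RMSE after rigid (Umeyama) alignment on synthetic trajectories; it is illustrative rather than the benchmark's evaluation code.

```python
import numpy as np

def align_umeyama(est, gt):
    """Least-squares rigid alignment (rotation + translation) of estimated
    to ground-truth trajectory positions (N x 3 each), following Umeyama."""
    mu_e, mu_g = est.mean(0), gt.mean(0)
    cov = (gt - mu_g).T @ (est - mu_e) / len(est)
    U, _, Vt = np.linalg.svd(cov)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ S @ Vt
    t = mu_g - R @ mu_e
    return R, t

def ate_rmse(est, gt):
    """Root mean squared absolute trajectory error after rigid alignment."""
    R, t = align_umeyama(est, gt)
    aligned = est @ R.T + t
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))

# Toy check: a rotated and translated copy of a trajectory plus 1 mm per-axis noise
rng = np.random.default_rng(0)
gt = rng.uniform(-0.1, 0.1, (200, 3))                       # metres
theta = np.radians(10.0)
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
est = (gt - gt.mean(0)) @ Rz.T + np.array([0.3, -0.2, 0.1]) + rng.normal(0, 1e-3, gt.shape)
print(ate_rmse(est, gt))                                    # ≈ 0.0017 m, i.e. the injected noise level
```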
Pituitary adenoma resection via the endoscopic transsphenoidal approach is technically demanding, with outcomes influenced by surgical skill. However, the association between technique and outcomes remains poorly defined. Existing workflow analyses focus on broad procedural steps and phases, but a more detailed, action-level approach is needed to capture skill-related variation. While AI shows promise in automating workflow analysis, its use at the action level is limited. This study develops and validates a reproducible action-level classification ontology for endoscopic pituitary adenoma resection, establishing the structured annotation foundation required for future AI-based workflow and skill analysis. Endoscopic videos of primary pituitary adenoma resections were collected from two high-volume international pituitary centres. A multi-disciplinary panel of neurosurgeons and data scientists iteratively reviewed and annotated surgical actions to establish a standardized classification system. Actions were categorized into triplets (instrument, target, verb), with additional temporal annotations. To evaluate framework reliability, an independent annotator followed a structured annotation guide, and inter-annotator agreement was measured using Cohen's Kappa. A consensus-based classification ontology was developed, comprising 9 verbs, 12 instruments, and 7 targets from the review of 18 endoscopic pituitary adenoma resections (9 microadenomas, 9 macroadenomas). Action distribution differed between micro- and macroadenomas, with grasping being the predominant action in microadenomas (72% of right-hand frames) and blunt dissection and traction dominating macroadenomas. The left hand primarily performed non-meaningful movements (88% of macroadenoma frames, 55% of microadenoma frames), while the right hand was responsible for more deliberate tool-tissue interactions. Inter-rater reliability analysis demonstrated substantial to near-perfect agreement (κ = 0.69-0.95), confirming the reproducibility of the annotation system. While acknowledging that conclusions remain limited by dataset size and validation stability, this study establishes a robust and interpretable action classification ontology for pituitary adenoma resection. The ontology enables high-quality, standardized labelling for future computer-vision AI works, and lays the groundwork for evaluating whether action-level annotation improves surgical outcome prediction and automated skill assessment.
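For readers unfamiliar with the agreement statistic used above, a minimal Cohen's kappa computation on hypothetical verb labels from two annotators is sketched below; the labels are invented for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' categorical labels on the same frames."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in set(labels_a) | set(labels_b)) / n ** 2
    return (observed - expected) / (1.0 - expected)

# Hypothetical verb labels for ten frames from two annotators
rater1 = ["grasp", "grasp", "blunt dissection", "traction", "grasp",
          "blunt dissection", "grasp", "traction", "grasp", "grasp"]
rater2 = ["grasp", "grasp", "blunt dissection", "traction", "grasp",
          "traction", "grasp", "traction", "grasp", "grasp"]
print(cohens_kappa(rater1, rater2))   # 1.0 would be perfect agreement
```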
This study compares two augmented reality (AR)-guided imaging workflows, one based on ultrasound shape completion and the other on cone-beam computed tomography (CBCT), for planning and executing lumbar needle interventions. The aim is to assess how imaging modality influences user performance, usability, and trust during AR-assisted spinal procedures. Both imaging systems were integrated into an AR framework, enabling in situ visualization and trajectory guidance. The ultrasound-based workflow combined AR-guided robotic scanning, probabilistic shape completion, and AR visualization. The CBCT-based workflow used AR-assisted scan volume planning, CBCT acquisition, and AR visualization. A between-subjects user study was conducted and evaluated in two phases: (1) planning and image acquisition, and (2) needle insertion. Planning time was significantly shorter with the CBCT-based workflow, while System Usability Scale (SUS), Single Ease Question (SEQ), and NASA Task Load Index (NASA-TLX) ratings were comparable between modalities. In the needle insertion phase, the CBCT-based workflow yielded marginally faster insertion times, significantly lower overall placement error, and better subjective ratings, with higher trust. The ultrasound-based workflow achieved adequate accuracy for facet joint insertion but showed larger errors for lumbar puncture, where reconstructions depended more heavily on shape completion. The findings indicate that both AR-guided imaging pipelines are viable for spinal intervention support. CBCT-based AR offers advantages in efficiency, precision, usability, and user confidence during insertion, whereas ultrasound-based AR provides adaptive, radiation-free imaging but is limited by shape completion in deeper spinal regions. These complementary characteristics motivate hybrid AR guidance that uses CBCT for global anatomical context and planning, augmented by ultrasound for adaptive intraoperative updates.
Objective assessment of robotic surgical skills is particularly important in pediatric surgery, where limited case volume restricts training opportunities. This study presents a virtual reality (VR)-based framework for automated evaluation of robotic suturing skills in a neonatal surgical scenario and investigates its agreement with expert video-based assessment. A real-time VR simulator was developed to emulate neonatal robotic suturing with the SmartArm system. An automated skills assessment module was implemented using an 11-point subset of a validated 29-point suturing checklist. Each checklist item was reformulated into quantitative geometric and kinematic criteria directly extracted from the simulation. Ten suturing trials were recorded and independently evaluated by an expert pediatric surgeon using video review. Automated scores were compared with expert scores using accuracy, precision, recall, and F1-score. The simulator enabled stable real-time execution of robotic suturing tasks and deterministic extraction of performance metrics. The automated assessment achieved an accuracy of 67.3%, with a precision of 0.933, recall of 0.560, and F1-score of 0.700 relative to expert evaluation. Higher agreement was observed for clearly defined metrics, while discrepancies were primarily associated with criteria dependent on visual judgment in 2D video assessment. VR-based automated assessment of robotic pediatric suturing is feasible and provides objective, repeatable evaluation of performance. By translating clinically defined checklist items into measurable simulation-derived parameters, the proposed framework offers a scalable alternative to manual video-based skills assessment in robotic surgery training.
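A small sketch of how automated versus expert checklist decisions can be compared with the metrics reported above, treating the expert's per-item pass/fail as ground truth; the 11 item outcomes below are hypothetical.

```python
def agreement_metrics(auto_items, expert_items):
    """Accuracy, precision, recall and F1 of automated checklist scoring,
    treating the expert's pass/fail decision per item as ground truth."""
    tp = sum(a and e for a, e in zip(auto_items, expert_items))
    fp = sum(a and not e for a, e in zip(auto_items, expert_items))
    fn = sum(not a and e for a, e in zip(auto_items, expert_items))
    tn = sum(not a and not e for a, e in zip(auto_items, expert_items))
    accuracy = (tp + tn) / len(auto_items)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Hypothetical outcomes for 11 checklist items in one trial (True = item passed)
auto   = [True, True, False, True, True,  False, True, False, True, False, False]
expert = [True, True, True,  True, False, True,  True, False, True, True,  False]
print(agreement_metrics(auto, expert))
```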
Accurate segmentation of lung parenchyma in dynamic pulmonary magnetic resonance imaging (MRI) is required for clinical diagnosis and treatment planning. However, supervised deep learning algorithms rely on annotated datasets, which are scarce for pulmonary MRI. This study aims to leverage existing annotated computed tomography (CT) data to enable unsupervised segmentation of pulmonary MRI. A new framework was proposed for unsupervised segmentation of pulmonary MRI. First, a masked autoencoder is pretrained to learn modality-invariant features. Next, an initial segmenter is trained using labeled CT images, combined with a temporal consistency loss on 4D MR images. The initial segmenter generates predictions for MR images, which are further processed through a select-and-refine pipeline to produce high-quality pseudolabels. Finally, a final segmenter is trained using the pseudolabeled MRI, combined with the temporal consistency constraint. The model was trained using 31 unlabeled 4D MR images and 30 labeled CT images, and evaluated on 20 and 12 4D MR images acquired from two different centers. The proposed method achieves accurate and robust segmentation of lung parenchyma and outperforms state-of-the-art cross-modality methods, with Dice scores of 97.75 ± 0.57% and 97.72 ± 0.55%, and average surface distances of 1.80 ± 1.40 mm and 1.34 ± 0.69 mm across the two test sets. The proposed method effectively transfers segmentation knowledge from CT to MRI, enabling accurate segmentation of lung parenchyma. By eliminating the dependency on MRI annotations, our technique offers a practical and promising solution for segmentation of dynamic pulmonary MRI.
3D reconstruction in minimally invasive surgery (MIS) enables enhanced surgical guidance through improved visualisation, tool tracking, and augmented reality. However, traditional RGB-based keypoint detection and matching pipelines struggle with surgical challenges, such as poor texture and complex illumination. We investigate whether snapshot hyperspectral imaging (HSI) can improve keypoint detection and matching in surgical scenes. We developed HyKey, a hyperspectral keypoint detection and description model comprising a hybrid 3D-2D convolutional neural network that jointly extracts spatial-spectral features from HSI. The model was trained using synthetic homographic augmentation and epipolar geometry constraints on a robotically acquired dual-camera RGB-HSI laparoscopic dataset of ex vivo organs with calibrated camera poses. We benchmarked performance against established RGB-based methods, including SuperPoint and ALIKE. Our HSI-based model outperformed RGB baselines on registered RGB frames, achieving 96.62% mean matching accuracy and 67.18% mean average accuracy at 10° on pose estimation, demonstrating consistent improvements across multiple evaluation metrics. Integrating spectral information from an HSI cube offers a promising approach for robust monocular 3D reconstruction in MIS, addressing limitations of texture-poor surgical environments through enhanced spectral-spatial feature discrimination. Our model and dataset are available at https://github.com/alexsaikia/HyKey-Hyperspectral-Keypoint-Detection.
Invasive intracranial pressure (ICP) monitoring is associated with up to a 17% complication rate. There are no clinically accepted non-invasive, continuous ICP monitors. Optic nerve sheath diameter (ONSD) has a high diagnostic accuracy for invasive ICPs; however, it only provides information at a point in time. We aimed to build a system for non-invasive automated ONSD image acquisition as a proxy for semi-continuous ICP monitoring. We built a frame-based system attachable to a patient's head for robotic ONSD acquisition, termed ICP goggles. Phantom optic nerve sheaths filled with gel were constructed in 5-, 8-, and 11-mm diameters to be interchanged in a skull model orbit. Twenty trials of ICP goggle ONSD measurements were performed for each phantom nerve sheath size and compared to manual measurements using Pearson's correlation coefficient. Intra-rater reliability of the automated system was evaluated using an intra-class correlation coefficient. A one-way analysis of variance (ANOVA) test was used to evaluate the classification accuracy of the ICP goggle measurements for each phantom size. Sixty total trials were performed across the three phantom optic nerve sheath sizes. Mean ICP goggle-measured ONSDs correlated with manual measurements (R² = 0.99) across the three phantom sizes. Intra-rater reliability of the automated system was high (intra-class correlation coefficient 0.995). ICP goggle ONSD measurements were able to differentiate each of the three phantom nerve sheaths without overlapping measurements across nerve sheath sizes (p < 0.001, ANOVA). We demonstrate a proof-of-concept model with validation data for automated ONSD image acquisition. These results validate mechanical performance and measurement consistency, but do not establish clinical-grade interpretation or robustness to real-world artifacts (e.g., motion, eyelids, and speckle) or anatomically accurate localization. Future work will focus on realistic globe-orbit phantoms, improved segmentation, coupling strategy, and staged human feasibility testing.
Simultaneous high-resolution impedance manometry (HRIM) and videofluoroscopic swallow studies (VFSS) can address important limitations of VFSS. However, clinicians must analyze each modality independently, doubling workload and leaving the challenge of manometric region delineation unresolved. We hypothesize that spatially registering HRIM and VFSS would allow manometric regions to be defined directly on VFSS frames, reducing clinician burden. Achieving this requires reliable detection of the HRIM catheter in VFSS images. We introduce a template-free, knowledge-based algorithm that automatically localizes the catheter centreline in VFSS frames. The method identifies the main visible portion of the catheter in each frame and then recursively adds catheter segments based on proximity and directional alignment. This approach defines a region-of-interest containing the individual HRIM sensors. The algorithm was validated on two datasets comprising frames from 122 single-swallow VFSS videos of head and neck cancer patients. The segmentation module achieved 93.8% precision, 83.8% recall, and an F1-score of 88.5% for a 1.77 mm tolerance. The framework demonstrated robust performance across diverse anatomies and imaging conditions, outperforming existing knowledge-based methods. By relying on geometric and directional priors rather than pixel intensities, it delivers consistent, interpretable predictions without requiring large annotated datasets. This algorithm lays the groundwork for manometric region delineation directly on imaging data and could likely be extended to other clinical applications involving thin radiopaque structures, provided that the pre-processing pipeline is adjusted and the hyperparameters are appropriately fine-tuned.
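A simplified, hypothetical sketch of the recursive growth rule described above: starting from the longest visible catheter segment, nearby segments are appended when they are close to the current tip and directionally aligned. The thresholds and the greedy control flow are illustrative assumptions, not the published algorithm or its hyperparameters.

```python
import numpy as np

def grow_centreline(segments, max_gap=15.0, max_angle_deg=30.0):
    """Greedy sketch of a proximity/direction growth rule for a catheter centreline.

    Starting from the longest candidate segment, neighbouring segments are
    appended when their start point lies within `max_gap` pixels of the current
    tip and their direction deviates less than `max_angle_deg` from the current
    growth direction. Each segment is an (N x 2) array of ordered points.
    """
    segments = sorted(segments, key=len, reverse=True)
    chain, remaining = list(segments[0]), segments[1:]
    grew = True
    while grew and remaining:
        grew = False
        tip = np.asarray(chain[-1], float)
        direction = tip - np.asarray(chain[-2], float)
        direction /= np.linalg.norm(direction)
        for i, seg in enumerate(remaining):
            for pts in (seg, seg[::-1]):                       # try both orientations
                head = np.asarray(pts[0], float)
                step = np.asarray(pts[-1], float) - head
                step /= np.linalg.norm(step)
                close = np.linalg.norm(head - tip) <= max_gap
                aligned = np.degrees(np.arccos(np.clip(direction @ step, -1.0, 1.0))) <= max_angle_deg
                if close and aligned:
                    chain.extend(list(pts))
                    remaining.pop(i)
                    grew = True
                    break
            if grew:
                break
    return np.asarray(chain)

# Toy example: a long visible segment plus a slightly displaced continuation
main = np.stack([np.arange(0, 50), np.full(50, 100)], axis=1)
cont = np.stack([np.arange(55, 80), np.full(25, 102)], axis=1)
print(grow_centreline([main, cont]).shape)                     # (75, 2): both segments chained
```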
Up to 4% of adults will develop strabismus in their lifetime. The most common surgical intervention involves adjusting the length of one or more extraocular muscles to correct the angular deviation. This correction depends on surgical expertise and statistical reference tables, which often fail to yield optimal results for patients with atypical eye morphology. Our work proposes a physics-based modeling approach to personalized surgical planning, accounting for patient-specific eye anatomy. We built a physics-based simulator of the eye and its muscles, incorporating patient-specific geometry and Hill-type muscle biomechanics. We solve an optimization problem to find the surgical dosage that minimizes angular deviation. The model is implemented as a fully differentiable simulation, enabling efficient optimization. We validated the framework by comparing its predictions with standard surgical tables for emmetropic eyes before applying it to anatomically atypical virtual patients. Our model's predictions for emmetropic eyes were first validated, demonstrating a strong fit with standard surgical tables. More importantly, for high-myopia models, the framework computed a clinically significant increase in the required surgical dosage compared to standard eyes. This computed recession difference is highly relevant as surgical plans are adjusted in 0.5 mm increments. Our results show that our model provides a calibrated surgical plan that, unlike standard tables, also accounts for pathologies involving atypical eye shapes. This patient-specific model represents a step toward personalized surgical planning, with the potential to improve dosage accuracy and surgical outcomes for atypical cases.
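As a toy illustration of dosage optimization through a differentiable model (not the paper's biomechanical eye simulation), the sketch below models the residual deviation as a smooth function of the recession d and runs gradient descent on the squared residual; the gain and saturation constants are invented.

```python
import numpy as np

def residual_deviation_deg(d, pre_op_deviation=20.0, gain=3.2, saturation=12.0):
    """Hypothetical smooth model of the remaining deviation after recessing one muscle by d mm."""
    return pre_op_deviation - gain * saturation * np.tanh(d / saturation)

def d_residual_dd(d, gain=3.2, saturation=12.0):
    """Analytic derivative of the residual deviation with respect to the dosage d."""
    return -gain / np.cosh(d / saturation) ** 2

d = 1.0                                   # initial guess for the recession (mm)
lr = 0.05
for _ in range(200):                      # gradient descent on 0.5 * residual**2
    r = residual_deviation_deg(d)
    d -= lr * r * d_residual_dd(d)
    d = float(np.clip(d, 0.0, 10.0))      # keep the plan within a plausible range

print(round(d * 2) / 2, residual_deviation_deg(d))   # dosage rounded to 0.5 mm steps
```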
Medical images acquired using different scanners and protocols can differ substantially in their appearance. This phenomenon, scanner domain shift, can result in a drop in the performance of deep neural networks that are trained on data acquired by one scanner and tested on another. This significant practical issue is well-acknowledged; however, no systematic study of it is available across different modalities and diagnostic tasks. In this paper, we present a broad experimental study evaluating the impact of scanner domain shift on convolutional neural network performance for different automated diagnostic tasks. We evaluate this phenomenon in common radiological modalities, including X-ray, CT, and MRI. We find that network performance on data from a different scanner is almost always worse than on same-scanner data, and we quantify the degree of performance drop across different datasets. Notably, we find that this drop is most severe for MRI and X-ray, yet small for CT, on average, which we attribute to the standardized nature of CT acquisition systems that is not present in MRI or X-ray. We also study how injecting varying amounts of target-domain data into the training set, as well as adding noise to the training data, helps only insufficiently with generalization, showing a need for more powerful domain adaptation methods. Our results provide extensive experimental evidence and quantification of the extent of the performance drop caused by scanner domain shift in deep learning across different modalities, with the goal of guiding the future development of robust deep learning models for medical image analysis.
Segmentation neural networks have demonstrated promising results for interventional needle localization on MRI. However, these networks require large training datasets with tedious annotation processes and additional post-processing steps that may introduce variability in the final localization results. This study aimed to develop keypoint detection networks for direct localization of the needle entry point and tip on intra-procedural liver MRI with a more efficient annotation process and more robust performance. 2D and 3D keypoint detection networks were developed by enhancing the stacked hourglass model and incorporating multi-task learning of predicting part affinity fields that connect the keypoints. The proposed networks were evaluated on intra-procedural single-slice controlled-breathing (SS-CB) and multislice controlled-breathing (MS-CB) images acquired from pre-clinical MRI-guided percutaneous liver intervention in thirteen in vivo pig subjects and compared with the results of segmentation-based UNet and Swin Transformer networks and human intra-reader variation. The 2D and 3D keypoint detection networks achieved median needle tip and axis localization errors of 1.56 mm (1 pixel) and 1.1° for the SS-CB datasets, and 2.21 mm (~ 1.5 pixel) and 1.45° for the MS-CB datasets, respectively. Average computational times were 10 ms (2D) and 30 ms (3D). The needle localization accuracy of the keypoint networks was significantly (Wilcoxon signed-rank tests p < 0.001) higher than the UNet and Swin Transformer segmentation-based results and comparable to human intra-reader variation. The proposed keypoint detection networks achieved rapid pixel-level needle localization on single-slice and multislice intra-procedural liver MRI with higher accuracy and a more efficient annotation process compared to segmentation-based models.
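A minimal sketch of direct keypoint localization from predicted heatmaps, assuming one channel per keypoint (entry, tip): the argmax of each channel gives the keypoint, from which a needle axis angle and a tip error in millimetres can be derived. The Gaussian heatmaps and the 1.56 mm pixel spacing (quoted in the abstract) are used purely for illustration; this is not the proposed network.

```python
import numpy as np

PIXEL_SPACING_MM = 1.56   # in-plane resolution quoted above (1 pixel ≈ 1.56 mm)

def keypoints_from_heatmaps(heatmaps):
    """Extract (row, col) keypoint locations as the argmax of each heatmap channel."""
    return np.stack([np.array(np.unravel_index(np.argmax(hm), hm.shape), dtype=float)
                     for hm in heatmaps])

def needle_axis_angle_deg(entry, tip):
    """In-plane angle of the entry-to-tip direction (one possible convention)."""
    d = tip - entry
    return float(np.degrees(np.arctan2(d[0], d[1])))

# Toy Gaussian heatmaps peaking at a known entry (40, 30) and tip (80, 90)
yy, xx = np.mgrid[0:128, 0:128]
def gaussian(cy, cx, s=3.0):
    return np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2.0 * s ** 2))

heatmaps = np.stack([gaussian(40, 30), gaussian(80, 90)])
entry, tip = keypoints_from_heatmaps(heatmaps)
angle = needle_axis_angle_deg(entry, tip)
tip_error_mm = np.linalg.norm(tip - np.array([80.0, 90.0])) * PIXEL_SPACING_MM
print(entry, tip, round(angle, 1), tip_error_mm)
```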
Gaze-following, the task of inferring where individuals are looking, has been widely studied in computer vision, advancing research in visual attention modeling, social scene understanding, and human-robot interaction. However, gaze-following has never been explored in the operating room (OR), a complex, high-stakes environment where visual attention plays an important role in surgical workflow analysis. In this work, we introduce the concept of gaze-following to the surgical domain and demonstrate its great potential for understanding clinical roles, surgical phases, and team communications in the OR. We extend the 4D-OR dataset with gaze-following annotations, and extend the Team-OR dataset with gaze-following annotations and new team communication activity annotations. We then propose novel approaches to address clinical role prediction, surgical phase recognition, and team communication detection using a gaze-following model. For role and phase recognition, we propose a gaze heatmap-based approach that uses gaze predictions alone; for team communication detection, we train a spatiotemporal model in a self-supervised way to encode gaze-based clip features, which are then fed into a temporal activity detection model. Experimental results on the 4D-OR and Team-OR datasets demonstrate that our approach achieves state-of-the-art performance on all downstream tasks. Quantitatively, our approach obtains F1 scores of 0.92 for clinical role prediction and 0.95 for surgical phase recognition. Furthermore, it significantly outperforms existing baselines in team communication detection, improving previous best performances by over 30%. We introduce gaze-following in the OR as a novel research direction in surgical data science, highlighting its great potential to advance surgical workflow analysis in computer-assisted interventions. Although limited to monocular 2D gaze prediction relying on manual annotations, our research clearly demonstrates the clinical value of gaze analysis from ceiling-mounted cameras. Future work will explore semantic understanding, multi-view learning, and few-shot approaches to further improve scalability and robustness.