Accurate 3D reconstruction of hands and instruments is critical for vision-based analysis of ophthalmic microsurgery, yet progress has been hampered by the lack of realistic, large-scale datasets and reliable annotation tools. In this work, we introduce OphNet-3D, the first extensive RGB-D dynamic 3D reconstruction dataset for ophthalmic surgery, comprising 41 sequences from 40 surgeons and totaling 7.1 million frames, with fine-grained annotations of 12 surgical phases, 10 instrument categories, dense MANO hand meshes, and full 6-DoF instrument poses. To scalably produce high-fidelity labels, we design a multi-stage automatic annotation pipeline that integrates multi-view data observation, a data-driven motion prior with cross-view geometric consistency and biomechanical constraints, along with a combination of collision-aware interaction constraints for instrument interactions. Building upon OphNet-3D, we establish two challenging benchmarks, bimanual hand pose estimation and hand-instrument interaction reconstruction, and propose two dedicated architectures: H-Net for dual-hand mesh recovery and OH-Net for joint reconstruction of two-hand-two-instrument interactions. These models le
Purpose: The integration of multimodal imaging into operating rooms paves the way for comprehensive surgical scene understanding. In ophthalmic surgery, by now, two complementary imaging modalities are available: operating microscope (OPMI) imaging and real-time intraoperative optical coherence tomography (iOCT). This first work toward temporal OPMI and iOCT feature fusion demonstrates the potential of multimodal image processing for multi-head prediction through the example of precise instrument tracking in vitreoretinal surgery. Methods: We propose a multimodal, temporal, real-time capable network architecture to perform joint instrument detection, keypoint localization, and tool-tissue distance estimation. Our network design integrates a cross-attention fusion module to merge OPMI and iOCT image features, which are efficiently extracted via a YoloNAS and a CNN encoder, respectively. Furthermore, a region-based recurrent module leverages temporal coherence. Results: Our experiments demonstrate reliable instrument localization and keypoint detection (95.79% mAP50) and show that the incorporation of iOCT significantly improves tool-tissue distance estimation, while achieving real-t
Surgical scene perception via videos is critical for advancing robotic surgery, telesurgery, and AI-assisted surgery, particularly in ophthalmology. However, the scarcity of diverse and richly annotated video datasets has hindered the development of intelligent systems for surgical workflow analysis. Existing datasets face challenges such as small scale, lack of diversity in surgery and phase categories, and absence of time-localized annotations. These limitations impede action understanding and model generalization validation in complex and diverse real-world surgical scenarios. To address this gap, we introduce OphNet, a large-scale, expert-annotated video benchmark for ophthalmic surgical workflow understanding. OphNet features: 1) A diverse collection of 2,278 surgical videos spanning 66 types of cataract, glaucoma, and corneal surgeries, with detailed annotations for 102 unique surgical phases and 150 fine-grained operations. 2) Sequential and hierarchical annotations for each surgery, phase, and operation, enabling comprehensive understanding and improved interpretability. 3) Time-localized annotations, facilitating temporal localization and prediction tasks within surgical w
Ophthalmic surgical robots offer superior stability and precision by reducing the natural hand tremors of human surgeons, enabling delicate operations in confined surgical spaces. Despite advances in vision- and force-based control methods for surgical robots, preoperative navigation remains heavily reliant on manual operation, which limits consistency and increases uncertainty. Existing eye gaze estimation techniques in surgery, whether traditional or deep learning-based, face challenges including dependence on additional sensors, occlusion issues in surgical environments, and the requirement for facial detection. To address these limitations, this study proposes an innovative eye localization and tracking method that combines machine learning with traditional algorithms, eliminating the requirement for landmarks and maintaining stable iris detection and gaze estimation under varying lighting and shadow conditions. Extensive real-world experimental results show that our proposed method has an average estimation error of 0.58 degrees for eye orientation estimation and an average control error of 2.08 degrees for the robotic arm's movement based on the calculated
MLLMs (Multimodal Large Language Models) have showcased remarkable capabilities, but their performance in high-stakes, domain-specific scenarios such as surgical settings remains largely under-explored. To address this gap, we develop \textbf{EyePCR}, a large-scale benchmark for ophthalmic surgery analysis, grounded in structured clinical knowledge to evaluate cognition across \textit{Perception}, \textit{Comprehension} and \textit{Reasoning}. EyePCR offers a richly annotated corpus with more than 210k VQAs, which cover 1048 fine-grained attributes for multi-view perception, a medical knowledge graph of more than 25k triplets for comprehension, and four clinically grounded reasoning tasks. The rich annotations facilitate in-depth cognitive analysis, simulating how surgeons perceive visual cues and combine them with domain knowledge to make decisions, thus greatly improving models' cognitive ability. In particular, \textbf{EyePCR-MLLM}, a domain-adapted variant of Qwen2.5-VL-7B, achieves the highest accuracy on MCQs for \textit{Perception} among compared models and outperforms open-source models in \textit{Comprehension} and \textit{Reasoning}, rivalling commercial models like GPT-4.1.
Medical Large Language Models (MLLMs) play a crucial role in ophthalmic diagnosis, holding significant potential to address vision-threatening diseases. However, their accuracy is constrained by hallucinations stemming from limited ophthalmic knowledge, insufficient visual localization and reasoning capabilities, and a scarcity of multimodal ophthalmic data, which collectively impede precise lesion detection and disease diagnosis. Furthermore, existing medical benchmarks fail to effectively evaluate various types of hallucinations or provide actionable solutions to mitigate them. To address the above challenges, we introduce EH-Benchmark, a novel ophthalmology benchmark designed to evaluate hallucinations in MLLMs. We categorize MLLMs' hallucinations based on specific tasks and error types into two primary classes: Visual Understanding and Logical Composition, each comprising multiple subclasses. Given that MLLMs predominantly rely on language-based reasoning rather than visual processing, we propose an agent-centric, three-phase framework, including the Knowledge-Level Retrieval stage, the Task-Level Case Studies stage, and the Result-Level Validation stage. Experimental results s
This paper presents an innovative approach to intraoperative Optical Coherence Tomography (iOCT) image segmentation in ophthalmic surgery, leveraging statistical analysis of speckle patterns to incorporate pathology-specific statistical prior knowledge. Our findings indicate statistically different speckle patterns within the retina and between retinal layers and surgical tools, facilitating the segmentation of previously unseen data without the need for manual labeling. The research involves fitting various statistical distributions to iOCT data, enabling the differentiation of ocular structures and surgical tools. The proposed segmentation model aims to refine the statistical findings based on prior tissue understanding to leverage statistical and biological knowledge. Incorporating statistical parameters, physical analysis of light-tissue interaction, and deep learning informed by biological structures enhances segmentation accuracy, offering potential benefits to real-time applications in ophthalmic surgical procedures. The study demonstrates the adaptability and precision of using Gamma distribution parameters and the derived binary maps as sole inputs for segmen
Cataract surgery remains one of the most widely performed and effective procedures for vision restoration. Effective surgical planning requires integrating diverse clinical examinations for patient assessment, intraocular lens (IOL) selection, and risk evaluation. Large language models (LLMs) have shown promise in supporting clinical decision-making. However, existing LLMs often lack the domain-specific expertise to interpret heterogeneous ophthalmic data and provide actionable surgical plans. To enhance the model's ability to interpret heterogeneous ophthalmic reports, we propose a knowledge-driven Multi-Agent System (MAS), where each agent simulates the reasoning process of specialist ophthalmologists, converting raw clinical inputs into structured, actionable summaries in both training and deployment stages. Building on MAS, we introduce CataractSurg-80K, the first large-scale benchmark for cataract surgery planning that incorporates structured clinical reasoning. Each case is annotated with diagnostic questions, expert reasoning chains, and structured surgical recommendations. We further introduce Qwen-CSP, a domain-specialized model built on Qwen-4B, fine-tuned through a multi
This review article discusses current technological advances in biomedical devices, emphasizing diagnostic, monitoring, and prosthetic instruments and systems for cardiovascular and ophthalmic applications. The scope encompasses implantable retinal prosthetic devices, a portable device for carotid stiffness measurement, automatic identification algorithms for arteries, cuffless evaluation of carotid pulse pressure, wearable neural recording systems, and arterial compliance probes. Additionally, the paper explores advancements in pulse wave velocity measurement, real-time heart rate estimation from wrist-type signals, and the clinical significance of non-invasive pulse wave velocity measurement in assessing arterial stiffness. The synthesis of these studies provides insights into the evolving landscape of biomedical devices, their validation, reproducibility, and potential clinical implications, emphasizing their role in enhancing diagnostics and therapeutic interventions in cardiovascular and ophthalmic domains.
Foundation models (FMs) have shown great promise in medical image analysis by improving generalization across diverse downstream tasks. In ophthalmology, several FMs have recently emerged, but there is still no clear answer to fundamental questions: Which FM performs the best? Are they equally good across different tasks? What if we combine all FMs together? To our knowledge, this is the first study to systematically evaluate both single and fused ophthalmic FMs. To address these questions, we propose FusionFM, a comprehensive evaluation suite, along with two fusion approaches to integrate different ophthalmic FMs. Our framework covers both ophthalmic disease detection (glaucoma, diabetic retinopathy, and age-related macular degeneration) and systemic disease prediction (diabetes and hypertension) based on retinal imaging. We benchmarked four state-of-the-art FMs (RETFound, VisionFM, RetiZero, and DINORET) using standardized datasets from multiple countries and evaluated their performance using AUC and F1 metrics. Our results show that DINORET and RetiZero achieve superior performance in both ophthalmic and systemic disease tasks, with RetiZero exhibiting stronger generalization on
Robot-assisted surgical systems have demonstrated significant potential in enhancing surgical precision and minimizing human errors. However, existing systems cannot accommodate individual surgeons' unique preferences and requirements. Additionally, they primarily focus on general surgeries (e.g., laparoscopy) and are unsuitable for highly precise microsurgeries, such as ophthalmic procedures. Thus, we propose an image-guided approach for surgeon-centered autonomous agents that can adapt to the individual surgeon's skill level and preferred surgical techniques during ophthalmic cataract surgery. Our approach trains reinforcement and imitation learning agents simultaneously using curriculum learning approaches guided by image data to perform all tasks of the incision phase of cataract surgery. By integrating the surgeon's actions and preferences into the training process, our approach enables the robot to implicitly learn and adapt to the individual surgeon's unique techniques through surgeon-in-the-loop demonstrations. This results in a more intuitive and personalized surgical experience for the surgeon while ensuring consistent performance for the autonomous robotic apprentice. We
This paper introduces a new Finite Element biomechanical model of the human face, developed for integration into a simulator for plastic and maxillo-facial surgery. The aim is to predict, from an aesthetic and functional point of view, the deformations of a patient's face resulting from repositioning of the maxillary and mandibular bone structures. This work will complete the simulator for bone-repositioning diagnosis that has been developed by the laboratory. After a description of our research project context, each step of the modeling is precisely described: the continuous and elastic structure of the skin tissues, the orthotropic muscular fibers and their insertion points, and the functional model of force generation. First results of face deformations due to muscle activations are presented. They are qualitatively compared to the functional studies in the literature on the roles and actions of facial muscles.
Temporary plastic film barriers are widely used to separate occupied rooms from exterior renovation zones, yet their effect on indoor particulate exposure is poorly quantified. We monitored PM$_{2.5}$ in a Tampa, Florida, apartment for 48 days with a low-cost optical sensor (Temtop LKC-1000S+), spanning pre-barrier, barrier-on, and post-barrier periods. A quadratic baseline was fitted to "background" minutes devoid of identifiable indoor sources, allowing excess concentrations ($Δ$PM) to be partitioned into facade work, cooking, and passive accumulation without outdoor co-monitoring. The barrier prevented large construction spikes indoors but curtailed natural ventilation, doubling the mean baseline from 1.9 to 3.9 $μ$g m$^{-3}$. During this stage, passive build-up accounted for $45\,\%$ of the daily excess dose, with facade work and cooking contributing $31\,\%$ and $24\,\%$, respectively. Once the new window was installed and evening airing resumed, the baseline fell to 0.8 $μ$g m$^{-3}$, the lowest of the campaign. Our findings highlight the trade-off between dust shielding and background elevation and demonstrate that simple polynomial fitting bolsters low-cost IAQ diagnostics
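The baseline-and-excess idea described in this abstract can be illustrated with a minimal sketch. It assumes that the baseline is an ordinary least-squares quadratic in time fitted only to samples flagged as background, and that negative excess is clamped to zero; neither detail is confirmed by the abstract. The function names `fit_quadratic_baseline` and `excess_over_baseline` are hypothetical, and the code uses only the Python standard library.

```python
def fit_quadratic_baseline(ts, ys, is_background):
    """Least-squares fit y ~ a + b*t + c*t^2 on background samples only.

    Assumption: ordinary least squares via the 3x3 normal equations.
    Needs at least three background samples with distinct times.
    """
    bt = [t for t, flag in zip(ts, is_background) if flag]
    by = [y for y, flag in zip(ys, is_background) if flag]
    # Power sums for the normal equations.
    s = [sum(t ** k for t in bt) for k in range(5)]
    r = [sum(y * t ** k for t, y in zip(bt, by)) for k in range(3)]
    # Augmented matrix [A | r] with A[i][j] = sum of t^(i+j).
    m = [[float(s[i + j]) for j in range(3)] + [float(r[i])] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        piv = max(range(col, 3), key=lambda i: abs(m[i][col]))
        m[col], m[piv] = m[piv], m[col]
        for row in range(3):
            if row != col:
                f = m[row][col] / m[col][col]
                m[row] = [x - f * p for x, p in zip(m[row], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))  # (a, b, c)


def excess_over_baseline(ts, ys, coeffs):
    """Excess concentration above the fitted baseline.

    Assumption: negative differences are clamped to zero.
    """
    a, b, c = coeffs
    return [max(0.0, y - (a + b * t + c * t * t)) for t, y in zip(ts, ys)]


if __name__ == "__main__":
    # Synthetic data: a smooth baseline plus one source event at t = 5.
    ts = list(range(10))
    ys = [2 + 0.1 * t + 0.01 * t * t for t in ts]
    ys[5] += 10.0
    background = [t != 5 for t in ts]
    coeffs = fit_quadratic_baseline(ts, ys, background)
    print(excess_over_baseline(ts, ys, coeffs))
```

Attributing the excess to individual sources (facade work, cooking, passive accumulation) would additionally need event labels per time window, which this sketch does not model.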
Although modern face verification systems are accessible and accurate, they are not always robust to pose variance and occlusions. Moreover, accurate models require a large amount of data to train. We structure our experiments to operate on small amounts of data obtained from an NGO that funds ophthalmic surgeries. We set up our face verification task as that of verifying pre-operation and post-operation images of a patient that undergoes ophthalmic surgery, and as such the post-operation images have occlusions like an eye patch. In this paper, we present a system that performs the face verification task using one-shot learning. To this end, our paper uses deep convolutional networks and compares different model architectures and loss functions. Our best model achieves 85% test accuracy. During inference time, we also attempt to detect image forgeries in addition to performing face verification. To achieve this, we use Error Level Analysis. Finally, we propose an inference pipeline that demonstrates how these techniques can be used to implement an automated face verification and forgery detection system.
Multimodal ophthalmic imaging-based diagnosis integrates color fundus images with optical coherence tomography (OCT) to provide a comprehensive view of ocular pathologies. However, because healthcare resources are unevenly distributed worldwide, real-world clinical scenarios often involve incomplete multimodal data, which significantly compromises diagnostic accuracy. Commonly used pipelines, such as modality imputation and distillation methods, face notable limitations: 1) Imputation methods struggle to accurately reconstruct key lesion features, since OCT lesions are localized, while fundus images vary in style. 2) Distillation methods rely heavily on fully paired multimodal training data. To address these challenges, we propose a novel multimodal alignment and fusion framework capable of robustly handling missing modalities in the task of ophthalmic diagnostics. By considering the distinctive feature characteristics of OCT and fundus images, we emphasize the alignment of semantic features within the same category and explicitly learn soft matching between modalities, allowing the missing modality to utilize existing modality information, achieving robust cross-m
Although all members of the ophthalmic community agree that distortion is an aberration affecting the geometry of an image produced by the periphery of an ophthalmic lens, there are several approaches for analyzing and quantifying this aberration. Various concepts have been introduced: ordinary distortion, stationary distortion and central static distortion are associated with a fixed eye behind the ophthalmic lens, whereas rotatory distortion, peripheral distortion, lateral static distortion, and dynamic distortion require a secondary position of gaze behind the lens. Furthermore, concept definitions vary from one author to another. The goal of this paper is to review the various concepts, analyze their effects on lens design and determine their ability to predict the deformation of an image as perceived by the lens wearer. These entities can be classified within 3 categories: the concepts associated with an ocular rotation, the concepts resulting from an optical approach, and the concepts using a perceptual approach. Among the various concepts reviewed, it appears that the Le Grand-Fry approach for analyzing and displaying distortion is preferable to others and allows modeling of
Deep neural networks power most recent successes of artificial intelligence, spanning from self-driving cars to computer-aided diagnosis in radiology and pathology. The high-stakes, data-intensive process of surgery could benefit greatly from such computational methods. However, surgeons and computer scientists should partner to develop and assess deep learning applications of value to patients and healthcare systems. This chapter and the accompanying hands-on material were designed for surgeons willing to understand the intuitions behind neural networks, become familiar with deep learning concepts and tasks, grasp what implementing a deep learning model in surgery means, and finally appreciate the specific challenges and limitations of deep neural networks in surgery. For the associated hands-on material, please see https://github.com/CAMMA-public/ai4surgery.
Capsular contracture is a pathological response to implant-based reconstructive breast surgery, where the ``capsule'' (tissue surrounding an implant) painfully thickens, contracts and deforms. It is known to affect breast-cancer survivors at higher rates than healthy women opting for cosmetic breast augmentation with implants. We model the early stages of capsular contracture based on stress-dependent recruitment of contractile and mechanosensitive cells to the implant site. We derive a one-dimensional continuum spatial model for the spatio-temporal evolution of cell and collagen densities away from the implant surface. Various mechanistic assumptions are investigated for linear versus saturating mechanical cell responses and cell traction forces. Our results point to specific risk factors for capsular contracture, and indicate how physiological parameters, as well as initial states (such as inflammation after surgery), contribute to patient susceptibility.
Background: Analyzing kinematic and video data can help identify potentially erroneous motions that lead to sub-optimal surgeon performance and safety-critical events in robot-assisted surgery. Methods: We develop a rubric for identifying task and gesture-specific Executional and Procedural errors and evaluate dry-lab demonstrations of Suturing and Needle Passing tasks from the JIGSAWS dataset. We characterize erroneous parts of demonstrations by labeling video data, and use distribution similarity analysis and trajectory averaging on kinematic data to identify parameters that distinguish erroneous gestures. Results: Executional error frequency varies by task and gesture, and correlates with skill level. Some predominant error modes in each gesture are distinguishable by analyzing error-specific kinematic parameters. Procedural errors could lead to lower performance scores and increased demonstration times but also depend on surgical style. Conclusions: This study provides insights into context-dependent errors that can be used to design automated error detection mechanisms and improve training and skill assessment.
Plastic surgery and disguise variations are two of the most challenging covariates of face recognition. State-of-the-art deep learning models are not sufficiently successful because only limited training samples are available. In this paper, a novel framework is proposed which transfers fundamental visual features learnt from a generic image dataset to supplement a supervised face recognition model. The proposed algorithm combines an off-the-shelf supervised classifier and a generic, task-independent network which encodes information related to basic visual cues such as color, shape, and texture. Experiments are performed on the IIITD plastic surgery face dataset and the Disguised Faces in the Wild (DFW) dataset. Results show that the proposed algorithm achieves state-of-the-art results on both datasets. Specifically, on the DFW database, the proposed algorithm yields over 87% verification accuracy at 1% false accept rate, which is 53.8% better than baseline results computed using VGGFace.
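A figure such as "verification accuracy at 1% false accept rate" is typically obtained by choosing a score threshold on impostor pairs and then measuring how many genuine pairs pass it. The sketch below illustrates that computation under stated assumptions: higher scores mean greater similarity, "verification accuracy" is read as the genuine accept rate, and the threshold is the strictest one that keeps the false accept rate at or below the target. The function name `genuine_accept_rate_at_far` is hypothetical and the scores are synthetic; this is not the evaluation protocol of the DFW benchmark.

```python
import math


def genuine_accept_rate_at_far(genuine, impostor, target_far=0.01):
    """Return (genuine accept rate, achieved false accept rate).

    Assumption: a pair is accepted when its similarity score is strictly
    greater than a cutoff taken from the sorted impostor scores.
    """
    imp = sorted(impostor, reverse=True)
    # Largest number of impostor pairs that may be accepted.
    # The small epsilon guards against float rounding in target_far * len(imp).
    allowed = math.floor(target_far * len(imp) + 1e-9)
    if allowed >= len(imp):
        cutoff = -math.inf  # every pair is accepted
    else:
        cutoff = imp[allowed]  # the (allowed + 1)-th highest impostor score
    far = sum(s > cutoff for s in imp) / len(imp)
    gar = sum(s > cutoff for s in genuine) / len(genuine)
    return gar, far


if __name__ == "__main__":
    impostor_scores = [i / 100 for i in range(100)]  # synthetic scores
    genuine_scores = [0.5, 0.985, 0.99, 1.0]         # synthetic scores
    print(genuine_accept_rate_at_far(genuine_scores, impostor_scores))
```

With tied impostor scores at the cutoff, the achieved false accept rate can fall below the target, so it is worth reporting alongside the accept rate.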