Limited-angle computed tomography (LACT) improves temporal resolution and reduces radiation dose, but suffers from severe artifacts due to missing projections. Clinical workflows record abundant patient- and acquisition-level metadata, yet such information remains underutilized in image reconstruction. To tackle the ill-posed LACT inverse problem, we propose a metadata-guided two-stage diffusion framework that leverages structured clinical contexts as semantic priors for robust reconstruction. In Stage-I, we learn a metadata-to-anatomy generative prior by conditioning a transformer-based diffusion model on clinical metadata (acquisition parameters, patient demographics, and diagnostic impressions), and sampling a coarse anatomical estimate from Gaussian noise. In Stage-II, a second conditional diffusion model performs coarse-to-fine refinement, using the Stage-I estimate as an image prior while re-injecting the same metadata to recover full-resolution anatomy. To preserve anatomical fidelity and suppress hallucinations, projection-domain data consistency is enforced periodically after denoising updates via an ADMM-based solver. Experiments on the public multimodal CTRATE dataset demonstrate that the proposed framework outperforms iterative, CNN-based, and diffusion-based baselines, with the greatest gains under severe truncation, e.g., up to 5.23%/11.21% higher SSIM/PSNR than the strongest metadata-free diffusion competitor at 90°. On real clinical cardiac CT, it yields coronary artery calcium scores closer to full-view references, indicating improved clinical utility. Furthermore, the proposed method generalizes to out-of-distribution angular ranges and projection geometries, and ablation results confirm complementary contributions from different metadata types under limited-angle conditions. Our results highlight clinical metadata as actionable semantic priors that synergize with physics-informed constraints to improve both reconstruction fidelity and clinical quantification in LACT.
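To make the periodic projection-domain data-consistency step more concrete, the sketch below shows one plausible form of an ADMM refinement applied to the denoised diffusion estimate. The operator handles (`forward_op`, `adjoint_op`), the inner gradient solver, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch, assuming `forward_op`/`adjoint_op` stand in for the limited-angle
# projection operator and its adjoint. Every few sampling steps, the denoised estimate
# x0_hat is pulled toward the measured sinogram y by a few ADMM iterations on
#   min_x 0.5*||A x - y||^2 + (rho/2)*||x - z||^2  with split variable z.
import torch

def admm_data_consistency(x0_hat, y, forward_op, adjoint_op,
                          rho=1.0, n_admm=5, n_inner=10, lr=1e-3):
    """Refine the diffusion estimate x0_hat toward measurement consistency."""
    x = x0_hat.clone()
    z = x0_hat.clone()          # split variable anchored to the diffusion prior
    u = torch.zeros_like(x)     # scaled dual variable
    for _ in range(n_admm):
        # x-update: a few gradient steps on 0.5||Ax - y||^2 + rho/2 ||x - z + u||^2
        # (step size lr must be matched to the operator norm of A)
        for _ in range(n_inner):
            grad = adjoint_op(forward_op(x) - y) + rho * (x - z + u)
            x = x - lr * grad
        # z-update: keep z close to both x + u and the prior estimate x0_hat
        z = (rho * (x + u) + x0_hat) / (rho + 1.0)
        # dual update
        u = u + x - z
    return x
```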
Metadata-guided cross-modality 3D MRI synthesis aims to generate target-contrast volumes from source-modality data conditioned on clinically available metadata, which is important for enhancing clinical imaging flexibility. However, existing methods still suffer from two main limitations: 1) They neglect spatial dependencies within volumetric representations, yielding structurally ambiguous features that blur anatomical boundaries and hinder precise semantic integration. 2) They rely on conventional cross-attention between visual and textual features, limiting the precision of visual-semantic alignment, which reduces robustness across challenging conditions. To address these issues, we propose RTFSyn, a metadata-guided 3D MRI synthesis framework that achieves effective vision-language collaboration through a refine-then-fuse paradigm. The proposed RTFSyn offers several merits. First, we design an axis-aware visual refinement module that captures directional dependencies within volumetric features, enabling redundancy suppression and improved structural representation before fusion. Second, we propose a cross-modal adaptive fusion module that leverages pixel packing-recovery to realize efficient cross-attention for improved alignment, while text-conditioned dynamic convolution enables fine-grained semantic injection, together enhancing vision-language collaboration. Lastly, an implicit neural decoder reconstructs the target modality as a continuous function, enabling flexible high-fidelity synthesis. Under this synergistic paradigm, RTFSyn seamlessly unites robust spatial refinement with adaptive feature fusion to achieve highly precise cross-modal alignment. Extensive experiments across four multi-center datasets demonstrate that RTFSyn not only surpasses state-of-the-art methods quantitatively, but also exhibits robust performance under diverse imaging artifacts, zero-shot evaluations, and multi-dimensional clinical validations, all with favorable computational efficiency. The high fidelity, robustness, and efficiency of RTFSyn demonstrate its great potential for clinical applications.
3-D medical imaging modalities, including CT and MRI, provide high-resolution views essential for precision medicine. However, the increasing volume and complexity of 3-D medical images challenge manual analysis, particularly in classification and segmentation tasks. Although deep learning has shown considerable promise, it struggles to characterize small-scale, low-contrast anatomical structures, generalize across imaging domains, and mitigate annotation scarcity. Self-supervised learning (SSL) has emerged as an effective and annotation-efficient solution, yet existing methods, largely adapted from natural images, often fail to capture the anatomical heterogeneity and complex semantic dependencies inherent in 3-D medical data. To address these limitations, we propose AG-SSD (Anatomy-Guided Self-Supervised Distillation), a framework that explicitly incorporates anatomical priors into SSL. AG-SSD comprises three complementary modules: 1) cross-view anatomical consistency (CVAC), which generates multi-scale, anatomically consistent positive pairs via overlap-aware cropping; 2) edge-aware adaptive masking (EAAM), which prioritizes anatomy-sensitive, high-edge regions to enhance local feature learning and robust global representation; and 3) cross-view attention alignment (CVAA), which leverages attention-based fusion to achieve semantic compensation and alignment across views, mitigating semantic drift to stabilize distillation. These modules are optimized using a unified objective that combines intra-view patch distillation, inter-view [CLS] token distillation, and masked patch reconstruction. Extensive experiments on CT and MRI datasets demonstrate that AG-SSD consistently outperforms state-of-the-art SSL methods in both classification and segmentation under annotation-scarce scenarios, highlighting its potential as a scalable, label-efficient paradigm for 3-D medical image analysis and clinical applications.
Ultrasound molecular imaging (USMI) is an imaging approach that utilizes targeted microbubbles (MBs) to highlight biomarkers of disease. While differential targeted enhancement (DTE) is the current state-of-the-art for USMI, its reliance on destructive pulses hinders real-time clinical application. We have developed a neural network-based nondestructive USMI approach, validated in vivo using a transgenic mouse model of spontaneous breast cancer. To enhance training efficacy despite a limited animal number (N=14), we utilized several augmentation strategies, including the use of multiple targeted MB types for each animal to generate independent image and texture patterns, alternative DTE approaches (sham and injection DTE), and random patch selection, overall resulting in a total of 15,350 patches to train the network. The resulting nondestructive USMI produces a pixelwise map of the classification score for the presence of targeted MBs. Our nondestructive USMI achieved a correlation coefficient of 0.954 with DTE, a continuous dice coefficient of 0.863 for a molecular signal coverage of the lesion over 20%, and a higher AUC than DTE (0.954 vs. 0.845) compared to the reference image developed from the contrast-enhanced ultrasound (CEUS) image and manual lesion contour. Nondestructive imaging during continuous motion of the transducer under elevation sweeps yielded fewer artifacts and higher AUC than DTE (0.953 vs. 0.892), compared to the reference image. This demonstrates the potential of free-hand and real-time nondestructive imaging. Overall, nondestructive imaging showed comparable performance to DTE under stationary conditions and superior performance to DTE under transducer motion, indicating its clinical imaging potential.
Assessing the quality of automatic image segmentation is crucial in clinical practice, but often very challenging due to the limited availability of ground truth annotations. Reverse Classification Accuracy (RCA) is an approach that estimates the quality of new predictions on unseen samples by training a segmenter on those predictions, and then evaluating it against existing annotated images. In this work we introduce ConfIC-RCA (Conformal In-Context RCA), a novel method for automatically estimating segmentation quality with statistical guarantees in the absence of ground-truth annotations, which consists of two main innovations. First, we propose In-Context RCA, which leverages recent in-context learning models for image segmentation and incorporates retrieval-augmentation techniques to select the most relevant reference images. This approach enables efficient quality estimation with minimal reference data while avoiding the need to train additional models. Second, we introduce Conformal RCA, which extends both the original RCA framework and In-Context RCA to go beyond point estimation. Using tools from split conformal prediction, Conformal RCA produces prediction intervals for segmentation quality, providing statistical guarantees that the true score lies within the estimated interval with a user-specified probability. Validated across 10 different medical imaging tasks in various organs and modalities, our methods demonstrate robust performance and computational efficiency, offering a promising solution for automated quality control in clinical workflows, where fast and reliable segmentation assessment is essential. The code is available at https://github.com/mcosarinsky/Conformal-In-Context-RCA.
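Since Conformal RCA builds on split conformal prediction, the generic recipe for turning a calibration set of (estimated, true) quality scores into an interval with the stated coverage guarantee can be sketched as follows; the absolute-error nonconformity score and quality metric (e.g., Dice) are assumptions for illustration, not necessarily the paper's exact choices.

```python
# Sketch of a split-conformal interval for a new quality estimate (e.g., predicted Dice).
import numpy as np

def conformal_quality_interval(est_cal, true_cal, est_new, alpha=0.1):
    """Return an interval containing the true score with probability >= 1 - alpha."""
    scores = np.abs(np.asarray(true_cal) - np.asarray(est_cal))   # nonconformity scores
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))                       # finite-sample corrected rank
    if k > n:                                                     # calibration set too small
        return -np.inf, np.inf
    q = np.sort(scores)[k - 1]                                    # conformal quantile
    return est_new - q, est_new + q
```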
Fully supervised polyp segmentation relies on costly pixel-level annotations. Although semi- and weakly supervised methods reduce annotation requirements, they still depend on partial mask supervision. Text-supervised segmentation is a promising alternative; however, for polyps, the key challenge is to ground instance-specific phrases to the correct lesion region under cluttered backgrounds and large appearance variations. Existing approaches often rely on coarse text-image alignment, limiting precise region-level semantic correspondence. In this paper, we propose Text-Image Co-Alignment (TICoA), a text-supervised framework for polyp segmentation. TICoA leverages structured clinical descriptions generated by large language models (LLMs) as weak supervision and formulates segmentation as a fine-grained phrase-region co-alignment problem. Through contrastive learning, TICoA explicitly associates query phrases with corresponding image regions to achieve robust semantic grounding under weak supervision. Architecturally, we adopt a State-Space Model (Mamba) to efficiently model long-range dependencies with linear computational complexity. To support effective cross-modal interaction, we further design a dedicated Mamba Fusion module with a Bi-Dimension Fusion (BiDF) strategy, which progressively propagates information along spatial and channel dimensions. Experiments on polyp datasets, with additional validation on skin lesion segmentation, demonstrate that TICoA is competitive with state-of-the-art weakly supervised methods. Our code and data are available at https://github.com/silentyuchen/TICoA.
Conventional registration approaches frequently underperform when applied to sparse feature alignment (e.g., retinal vessels and filamentous collagen fibers in second-harmonic generation (SHG) and bright-field (BF) images), as these tasks demand simultaneous handling of global affine registration and local deformation correction. End-to-end learning-based approaches struggle with the minimal effective gradients that such sparse features yield during loss back-propagation, while descriptor matching methods, though helpful, lack a fidelity loss and fail to adapt to local deformation. To address these issues, we propose Neural Affine Optimization (NeOn), which implicitly approximates discrete optimization using a few neural network layers, combined with a sampling-regression layer to handle affine transformations. NeOn allows iterative refinement with a fidelity loss and provides a flexible transition between a purely affine configuration and a linear weighted blend of affine and deformation fields. NeOn's performance was validated on four public datasets. In multi-modal SHG-BF microscopy registration, NeOn achieved top rankings on the validation leaderboard for Task 3 of the Learn2Reg Challenge 2024. For retinal image registration, NeOn outperformed existing methods on both mono-modal and multi-modal datasets, reducing target registration error from 6.3 to 2.1 pixels in mono-modal and from 2.6 to 1.8 pixels in multi-modal registration. Furthermore, NeOn demonstrates strong generalization and can be effectively extended to 3D multi-modality image registration scenarios.
To be adopted in safety-critical domains like medical image analysis, AI systems must provide human-interpretable decisions. Variational Information Pursuit (V-IP) offers an interpretable-by-design framework by sequentially querying input images for human-understandable concepts, using their presence or absence to make predictions. However, existing V-IP methods overlook sample-specific uncertainty in concept predictions, which can arise from ambiguous features or model limitations, leading to suboptimal query selection and reduced robustness. In this paper, we propose an interpretable and uncertainty-aware framework for medical imaging that addresses these limitations by accounting for upstream uncertainties in concept-based, interpretable-by-design models. Specifically, we introduce two uncertainty-aware models, EUAV-IP and IUAV-IP, that integrate uncertainty estimates into the V-IP querying process to prioritize more reliable concepts per sample. EUAV-IP skips uncertain concepts via masking, while IUAV-IP incorporates uncertainty into query selection implicitly for more informed and clinically aligned decisions. Our approach allows models to make reliable decisions based on a subset of concepts tailored to each individual sample, without human intervention, while maintaining overall interpretability. We evaluate our methods on five medical imaging datasets across four modalities: dermoscopy, X-ray, ultrasound, and blood cell imaging. The proposed IUAV-IP model achieves state-of-the-art accuracy among interpretable-by-design approaches on four of the five datasets, and generates more concise explanations by selecting fewer yet more informative concepts. These advances enable more reliable and clinically meaningful outcomes, enhancing model trustworthiness and supporting safer AI deployment in healthcare. Our code and models are available at: https://github.com/Nahiduzzaman09/UAV-IP.
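The explicit variant (EUAV-IP) is described as skipping uncertain concepts via masking before query selection. A minimal sketch of that idea is shown below; the informativeness scores produced by the querier and the uncertainty threshold `tau` are placeholders rather than the paper's exact quantities.

```python
# Sketch of uncertainty-masked query selection for a single sample.
import torch

def select_next_query(informativeness, concept_uncertainty, asked_mask, tau=0.5):
    """Pick the next concept to query, skipping already-asked and unreliable concepts."""
    scores = informativeness.clone()
    scores[asked_mask] = float("-inf")                  # never re-ask a concept
    scores[concept_uncertainty > tau] = float("-inf")   # mask uncertain concepts for this sample
    return int(torch.argmax(scores))
```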
Glaucoma is a leading cause of irreversible blindness worldwide, with asymptomatic early stages often delaying diagnosis and treatment. Early and accurate diagnosis requires integrating complementary information from multiple ocular imaging modalities. However, most existing studies rely on single- or dual-modality imaging, such as fundus and optical coherence tomography (OCT), for coarse binary classification, thereby restricting the exploitation of complementary information and hindering both early diagnosis and stage-specific treatment. To address these limitations, we propose glaucoma lesion evaluation and analysis with multimodal imaging (GLEAM), the first publicly available tri-modal glaucoma dataset comprising scanning laser ophthalmoscopy fundus images, circumpapillary OCT images, and visual field (VF) pattern deviation maps, annotated with four disease stages, enabling effective exploitation of multimodal complementary information and facilitating accurate diagnosis and treatment across disease stages. To effectively integrate cross-modal information, we propose hierarchical attentive masked modeling (HAMM) for multimodal glaucoma classification. Our framework employs hierarchical attentive encoders and light decoders to focus cross-modal representation learning on the encoder. The attention module, named multimodal-channel graph attention (MCGA), boosts glaucoma classification performance by emulating two key clinical reasoning steps: first, it uses a multi-head modality gating mechanism to replicate ophthalmologists' confidence scoring of fundus, OCT, and VF modalities; then, MCGA leverages a relational graph attention network to cross-examine structural-functional consistencies of weighted modalities. The experiments on GLEAM demonstrate that tri-modal fusion significantly outperforms single-modal and dual-modal configurations. Moreover, our proposed HAMM achieves superior performance compared with state-of-the-art multimodal learning methods. The dataset and code are publicly available via https://github.com/microewing/HAMM.
Sparse-View CT (SVCT) reconstruction improves temporal resolution and reduces radiation dose, yet its clinical use is hindered by artifacts due to view reduction and domain shifts from scanner, protocol, or anatomical variations, leading to performance degradation in out-of-distribution (OOD) scenarios. We propose a Cross-Distribution Diffusion Priors-Driven Iterative Reconstruction (CDPIR) framework to tackle the OOD problem in SVCT. CDPIR integrates cross-distribution diffusion priors, derived from a Scalable Interpolant Transformer (SiT), with model-based iterative reconstruction methods. Specifically, we train a SiT backbone, an extension of the Diffusion Transformer (DiT) architecture, to establish a unified stochastic interpolant framework, leveraging Classifier-Free Guidance (CFG) across multiple datasets. By randomly dropping the conditioning with a null embedding during training, the model learns a more transferable cross-distribution prior that encourages domain-invariant anatomical structures while allowing domain-specific appearance modulation. During sampling, the globally sensitive transformer-based diffusion model exploits the cross-distribution prior within the unified stochastic interpolant framework, enabling flexible and stable control over multi-distribution-to-noise interpolation paths and decoupled sampling strategies, thereby improving adaptation to OOD reconstruction. By alternating between data fidelity and sampling updates, our model achieves state-of-the-art performance with superior detail preservation in SVCT reconstructions. Extensive experimental results demonstrate that CDPIR significantly outperforms existing approaches, particularly under OOD conditions, highlighting its robustness and potential clinical value in challenging imaging scenarios. The code is available at https://github.com/Graeme-Lee/CDPIR.
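The classifier-free guidance mechanism mentioned above (randomly replacing the dataset/domain condition with a null embedding during training, then combining conditional and unconditional predictions at sampling time) follows a standard recipe, sketched below; the shape of the condition embedding and the guidance scale are assumptions for illustration.

```python
# Minimal sketch of classifier-free guidance with null-embedding dropout.
import torch

def cfg_training_step(model, x_noisy, t, cond_emb, null_emb, p_drop=0.1):
    """Randomly replace the condition with a learned null embedding during training."""
    drop = torch.rand(cond_emb.shape[0], device=cond_emb.device) < p_drop
    cond = torch.where(drop[:, None], null_emb.expand_as(cond_emb), cond_emb)
    return model(x_noisy, t, cond)                       # predicted noise / velocity

def cfg_predict(model, x_noisy, t, cond_emb, null_emb, guidance_scale=2.0):
    """Guided prediction at sampling time: uncond + w * (cond - uncond)."""
    pred_cond = model(x_noisy, t, cond_emb)
    pred_uncond = model(x_noisy, t, null_emb.expand_as(cond_emb))
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)
```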
Coronary computed tomography angiography (CCTA) is a pivotal non-invasive imaging modality for diagnosing cardiac disease. However, due to temporal resolution limitations, cardiac structures, specifically coronary arteries, may suffer from motion artifacts when CCTA is applied to patients with arrhythmias or high heart rates. Limited-angle CT (LA-CT) emerges as a promising alternative by significantly reducing the acquisition time, thereby mitigating the motion artifacts. Yet, LA-CT unavoidably leads to severe wedge artifacts, posing a significant challenge. Therefore, to reconstruct motion-free cardiac CT images while suppressing the wedge artifacts, we propose a Segmentation-Guided Accelerating Diffusion Model (SGADM) tailored for LA-CT imaging. While diffusion models have demonstrated exceptional performance in medical imaging, their extensive sampling procedures impose high computational costs, hindering clinical applicability. To address this issue, SGADM employs an accelerated diffusion model that directly generates high-quality CT images. Moreover, SGADM adopts a diffusion perceptual loss to ensure data distribution consistency between two successive sampling steps. As a result, SGADM can provide satisfactory results in fewer than 10 steps. Additionally, SGADM incorporates segmentation guidance to explicitly enhance the spatial-positional accuracy of generated coronary arteries. Both quantitative and qualitative evaluations on simulated and real datasets reveal that SGADM effectively restores high-quality CCTA images with minimal motion artifacts, highlighting its potential for clinical applications.
Deep learning (DL) methods can reconstruct highly accelerated magnetic resonance imaging (MRI) scans, but they rely on application-specific large training datasets and often generalize poorly to out-of-distribution data. Self-supervised deep learning algorithms perform scan-specific reconstructions, but still require complicated hyperparameter tuning based on the acquisition and often offer limited acceleration. This work develops a bilevel-optimized implicit neural representation (INR) approach for scan-specific MRI reconstruction. The method explicitly formulates the undersampled MRI reconstruction problem as a bilevel optimization problem and automatically optimizes the multidimensional hyperparameters of the reconstruction method for a given acquisition protocol, enabling a tailored reconstruction without training data. The proposed algorithm uses Gaussian process regression to optimize INR hyperparameters, accommodating various acquisitions. The INR includes a trainable positional encoder for high-dimensional feature embedding and a small multilayer perceptron for decoding. The bilevel optimization is computationally efficient, requiring only a few minutes per typical 2D Cartesian scan. On the scanner hardware, the subsequent scan-specific reconstruction, using offline-optimized hyperparameters, is completed within seconds, while achieving comparable or improved image quality compared to previous model-based and self-supervised learning methods.
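As an illustration of how the outer (hyperparameter) level of such a bilevel scheme could be driven by Gaussian process regression, the sketch below uses scikit-optimize's `gp_minimize` as a stand-in. The hyperparameter space and the helper `train_and_eval_inr`, which would train the scan-specific INR on a retrospectively undersampled calibration scan and return a validation error, are hypothetical and chosen only to convey the structure.

```python
# Sketch of GP-based hyperparameter search over an INR reconstruction pipeline.
from skopt import gp_minimize
from skopt.space import Real, Integer

space = [
    Real(1e-4, 1e-2, prior="log-uniform", name="lr"),          # INR learning rate
    Real(1e-6, 1e-1, prior="log-uniform", name="reg_weight"),  # regularization weight
    Integer(4, 10, name="encoder_levels"),                     # positional-encoder size
]

def objective(hparams):
    lr, reg_weight, encoder_levels = hparams
    # Inner level: train the INR with these hyperparameters and return a
    # validation error on held-out k-space (hypothetical helper, lower is better).
    return train_and_eval_inr(lr=lr, reg_weight=reg_weight,
                              encoder_levels=int(encoder_levels))

result = gp_minimize(objective, space, n_calls=25, random_state=0)
best_hparams = result.x   # reused at scan time for the same acquisition protocol
```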
Contrast-enhanced CT (CECT) is essential for clinical evaluation of vessel structures and function. However, high contrast agent dose increases the risk of renal injury. Reducing contrast agent dose decreases the contrast between vessels and surrounding tissues, which complicates diagnosis. Despite their potential in CECT synthesis, existing methods often suffer from edge blurring, contrast anomalies, and texture distortion, limiting their clinical applicability. This paper proposes a novel Multi-granularity Adversarial Generation Integrated Consistency Representation (MAGIC) for high-quality synthesis from low-contrast-enhanced CT to clinically usable CECT. MAGIC addresses current problems through four innovations: 1) Multi-Granularity Refined Booster (MRB) introduces contextual refinement and cross-granularity boosting mechanisms for mining multi-granularity contextual information to enhance feature representation capability, thus improving tissue edge clarity. 2) Supervised Contrast Enhancement Module (SCEM) imbues MAGIC with the ability to enhance tissue contrast, leveraging supervised images to adaptively adjust the contrast information of soft tissue structures and vessels, effectively overcoming the challenge of contrast anomalies. 3) Hierarchical Harmonized Consistency Representation (HHCR) utilizes domain consistency to construct a novel auxiliary loss for harmonizing the semantic and content relationships of multi-level hierarchical features to improve tissue texture performance, ensuring accurate restoration of real textures. 4) Dual-path Dynamic Collaborative Discriminator (DDCD) is designed with complementary strategies and injects content priors to dynamically coordinate the discrimination process, thereby comprehensively evaluating the fidelity of the synthesized results. Qualitative and quantitative results demonstrate that MAGIC significantly outperforms existing methods in edge clarity, image contrast, and texture restoration, underscoring its substantial clinical potential.
Atrial fibrillation, characterized by high prevalence and poor prognosis, presents a significant global health burden. Accurate segmentation and measurement of left ventricular and left atrial appendage morphology and function are essential for reliable risk assessment. However, these tasks are hindered by ambiguous boundaries, complex cardiac motion, and sparse annotations. To address these challenges, we propose a Keypoint-Guided Medical Video Segmentation Model with Spatiotemporal Feature Fusion (KG-STS). First, we propose a shape-constrained point encoder that explicitly encodes boundary points to improve the representation of ambiguous boundaries. Next, we introduce a motion-aware alignment module that models cardiac motion by forming coherent motion information across frames. Building on these two modules, we develop a keypoint-guided spatiotemporal feature fusion module that integrates spatial boundary representations with temporal motion cues to enhance decoding features under sparse annotations, enabling temporally consistent segmentation and supporting morphological measurement. We evaluate the segmentation and measurement performance of our method on a self-constructed multi-view transesophageal echocardiography dataset and two publicly available transthoracic echocardiography datasets. The results demonstrate that KG-STS achieves superior temporal consistency in segmentation and higher accuracy in morphological measurements compared to competing methods.
Aggregating features of tens of thousands of patches into Whole Slide Images (WSIs) representations via aggregators is a crucial step in computational pathology. However, existing aggregation strategies overlook the morphological variability of tissue regions in WSIs stemming from differences in clinical procedures and tumor characteristics, leading to two critical limitations: 1) attention collapse in long sequences caused by significant variation in patch numbers across WSIs (ranging from thousands to tens of thousands per WSI); 2) attention misallocation due to under-trained positional embeddings resulting from the non-uniform spatial coordinates introduced by irregular patch distributions. Consequently, current attention-based methods struggle to generalize across this morphological variability, resulting in inconsistent aggregation performance and compromised model reliability in clinical settings. To address these issues, we propose an Entropy-Stabilized Attention-based Multiple Instance Learning (StableMIL) framework, which incorporates an entropy-stabilized attention mechanism to ensure consistent aggregation across WSIs with varying patch numbers and a Randomly Projected 2D rotary position embedding to enhance spatial representation robustness across irregular patch distributions. Extensive theoretical and experimental analyses on nine WSI datasets spanning diverse cancer types, across both classification and survival prediction tasks, demonstrate that StableMIL effectively overcomes the challenges of handling long instance sequences and out-of-distribution spatial coordinates. Our framework consistently outperforms representative baselines, particularly in survival prediction, with stable improvements observed across all evaluated cancer types and morphological scenarios, highlighting its potential for real-world clinical applications. Our source code is available at https://github.com/theeeqi/stableMIL.
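The abstract does not spell out the form of the entropy-stabilized attention. One widely used recipe for keeping attention from flattening (entropy growing) as the number of patches increases is to scale the logits by a log-length factor, sketched below purely as an illustration of the general idea rather than StableMIL's actual mechanism; `base_len` is an assumed reference sequence length.

```python
# Length-scaled attention sketch: logits scaled by log(n)/log(base_len) so that
# attention entropy stays roughly constant as the bag size n grows.
import math
import torch.nn.functional as F

def length_scaled_attention(q, k, v, base_len=4096):
    """Single-head attention whose sharpness adapts to the number of instances."""
    n, d = k.shape[-2], q.shape[-1]
    scale = math.log(max(n, 2)) / (math.log(base_len) * math.sqrt(d))
    attn = F.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v
```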
Spatial Transcriptomics (ST) technology detects gene expression from tissue biopsies, playing an emerging role in cancer diagnosis and precision medicine. However, the high cost of ST technology limits its broader application. Recently, deep learning approaches have provided insight into predicting gene expression based on H&E-stained histopathology images. Nevertheless, the relationship between morphological features and gene expression is highly complex. To address these challenges, we propose DiffBulk, a novel two-stage framework that leverages conditional diffusion models to learn expressive image representations enriched with gene expression information. In the first stage, we introduce a gene-to-image conditional diffusion model equipped with a permutation-invariant open-embedding gene encoder, which enables unified training across diverse gene panels. In the second stage, diffusion-derived features are fused with representations from a pathology foundation model, effectively bridging the domain gap and improving downstream gene expression prediction. We evaluate DiffBulk on high-quality Xenium ST data curated from the HEST dataset and the CrunchDAO challenge, constructing tile-level pseudo-bulk datasets for training and evaluation. Extensive experiments demonstrate that DiffBulk consistently outperforms state-of-the-art baselines across all metrics for gene expression prediction. These findings highlight the potential of diffusion-based gene-image representation learning and suggest promising directions for future research.
Longitudinal brain analysis is essential for understanding healthy aging and identifying pathological deviations. Longitudinal registration of sequential brain MRI underpins such analyses. However, existing methods are limited by reliance on densely sampled time series, a trade-off between accuracy and temporal smoothness, and an inability to prospectively forecast future brain states. To overcome these challenges, we introduce TimeFlow, a learning-based framework for longitudinal brain MRI registration. TimeFlow uses a U-Net backbone with temporal conditioning to model neuroanatomy as a continuous function of age. Given only two scans from an individual, TimeFlow estimates accurate and temporally coherent deformation fields, enabling non-linear extrapolation to predict future brain states. This is achieved by our proposed inter-/extrapolation consistency constraints applied to both the deformation fields and deformed images. Remarkably, these constraints preserve temporal consistency and continuity without requiring explicit smoothness regularizers or densely sampled sequential data. Extensive experiments demonstrate that TimeFlow outperforms state-of-the-art methods in terms of both future timepoint forecasting and registration accuracy. Moreover, TimeFlow supports novel biological brain aging analyses by differentiating neurodegenerative trajectories from normal aging without requiring segmentation, thereby eliminating the need for labor-intensive annotations and mitigating segmentation inconsistency. TimeFlow offers an accurate, data-efficient, and annotation-free framework for longitudinal analysis of brain aging and chronic diseases, capable of forecasting brain changes beyond the observed study period.
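To convey what an image-level interpolation consistency constraint can look like, the 2D sketch below penalizes disagreement between the direct warp from age t0 to t1 and a two-hop warp through an intermediate age. The `model(img, t_src, t_tgt)` signature returning a pixel-unit displacement field is an assumption for illustration; the paper applies its constraints to both the deformation fields and the deformed images, and also to extrapolated ages.

```python
# Sketch of an interpolation-consistency loss for age-conditioned deformation fields.
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Warp an image (B, C, H, W) with a displacement field (B, 2, H, W) in pixels."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(img).unsqueeze(0)   # identity grid (1, 2, H, W)
    coords = base + flow                                               # sampled pixel locations
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0                            # normalize to [-1, 1]
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    return F.grid_sample(img, torch.stack((gx, gy), dim=-1), align_corners=True)

def interp_consistency_loss(model, img0, t0, t_mid, t1):
    """Warping through an intermediate age should agree with the direct warp t0 -> t1."""
    direct = warp(img0, model(img0, t0, t1))
    hop = warp(img0, model(img0, t0, t_mid))
    two_hop = warp(hop, model(hop, t_mid, t1))
    return F.l1_loss(two_hop, direct)
```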
Colonoscopy video generation delivers dynamic, information-rich data critical for diagnosing intestinal diseases, particularly in data-scarce scenarios. High-quality video generation demands temporal consistency and precise control over clinical attributes, but faces challenges from irregular intestinal structures, diverse disease representations, and various imaging modalities. To this end, we propose ColoDiff, a diffusion-based framework that generates dynamically consistent and content-aware colonoscopy videos, aiming to alleviate data shortage and assist clinical analysis. At the inter-frame level, our TimeStream module decouples temporal dependency from video sequences through a cross-frame tokenization mechanism, enabling intricate dynamic modeling despite irregular intestinal structures. At the intra-frame level, our Content-Aware module incorporates noise-injected embeddings and learnable prototypes to realize precise control over clinical attributes, going beyond the coarse guidance of standard diffusion models. Additionally, ColoDiff employs a non-Markovian sampling strategy that cuts steps by over 90% for real-time generation. ColoDiff is evaluated across three public datasets and one hospital database, based on both generation metrics and downstream tasks including disease diagnosis, modality discrimination, bowel preparation scoring, and lesion segmentation. Extensive experiments show ColoDiff generates videos with smooth transitions and rich dynamics. ColoDiff also produces customized contents tailored for diverse tasks, e.g., colitis, polyps, and adenomas for diagnosis. Incorporating synthetic videos into training promotes discriminative representation learning and improves diagnosis accuracy by 7.1%. ColoDiff represents a step toward controllable colonoscopy video generation, revealing the potential of synthetic videos in complementing authentic data and mitigating data scarcity in clinical settings.
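Non-Markovian sampling strategies that cut the number of denoising steps are typified by DDIM-style samplers, which jump along a sparse subset of timesteps with a deterministic update. The sketch below shows that generic update only as an illustration of the family of samplers, not ColoDiff's exact procedure; the `model`/`cond` signature and noise schedule are assumptions.

```python
# Generic DDIM-style (non-Markovian, deterministic) sampling over a sparse timestep subset.
import torch

@torch.no_grad()
def ddim_like_sampling(model, shape, alphas_cumprod, n_steps=20, cond=None):
    """Sample with n_steps kept timesteps instead of the full training schedule."""
    T = len(alphas_cumprod)
    timesteps = torch.linspace(T - 1, 0, n_steps).long()            # e.g., 20 of 1000 steps
    x = torch.randn(shape)
    for i, t in enumerate(timesteps):
        t_batch = torch.full((shape[0],), int(t), dtype=torch.long)
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < n_steps else torch.tensor(1.0)
        eps = model(x, t_batch, cond)                                # predicted noise
        x0_hat = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()           # clean-frame estimate
        x = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps       # jump to next kept step
    return x
```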
Whole Slide Images (WSIs) have been widely used in computational pathology (CPath) for various tasks. However, obtaining high-quality annotations remains a major bottleneck. Task-aware unsupervised anomaly detection models offer a promising alternative, as they are trained solely on task-specific normal data and can be adapted to clinically defined objectives, such as cancer detection, depending on the problem formulation. Despite this potential, anomaly detection models have not been thoroughly explored in the context of WSIs. Existing approaches often directly adopt techniques from other domains, leading to suboptimal performance due to domain discrepancies and the unique characteristics of WSIs. Given that feature reconstruction-based methods have become popular in anomaly detection research, this study first analyzes the designs of such models in the context of conditional reconstruction, revealing potential directions to adapt and further improve them. Based on our analysis, we revisit and refine these models to better accommodate the distinct properties of WSIs. Moreover, we propose an Explicit Conditional Reconstruction framework, termed ECR4AD, which significantly enhances model performance. Our method is comprehensively evaluated on four datasets covering breast and prostate cancer metastasis detection, as well as Gleason grading of prostate cancer, all conducted at the tile level on images extracted from WSIs. The experimental results show that ECR4AD consistently achieves substantial improvements in AUROC across all datasets, demonstrating its effectiveness for tile-level task-aware unsupervised anomaly detection in CPath. The code can be found at https://github.com/uobinxiao/wsi_anomaly_detection.git.
Universal medical image registration through a single model handling various registration tasks has attracted increasing interest. However, existing deep learning-based methods face two major challenges in adapting to universal registration tasks: 1) they lack generalizable feature representation capabilities for cross-task registration; 2) they rely solely on model architectures with fixed parameters, which limits their flexibility to dynamically adapt to different registration tasks and inherently compromises their generalization capability for zero-shot performance on unseen tasks. To address these limitations, we propose CIM-VTP, a novel two-stage universal registration framework. In the first stage, our proposed Correlation-guided Image Modeling (CIM)-based pretraining strategy leverages cross-image correlation to guide the masked modeling process, which facilitates capturing the spatial correspondences essential for registration and provides universal representation capabilities as a foundation for registration learning. In the second stage, we introduce a registration task classifier to identify the type of a given input task, which explicitly quantifies the similarity between current inputs and previously seen tasks. The obtained task similarity scores are then fed as prior information into our carefully designed multi-resolution Visual-Textual Task Prompt (VTP) modules, which integrate task-relevant knowledge through prompt learning to adaptively adjust decoder parameters for different input domains. Extensive experiments across six different registration tasks demonstrate that the proposed CIM-VTP exhibits superior universal image registration performance. The code will be released at https://github.com/xiehousheng/CIM-VTP.