Found 20 results
Stable Diffusion has demonstrated a strong ability to synthesize images from given text descriptions, suggesting that it contains strong semantic clues for grouping objects. Researchers have therefore explored employing Stable Diffusion for training-free segmentation. Most existing approaches refine the cross-attention map with the self-attention map only once, demonstrating that the self-attention map contains useful semantic information for improving segmentation. To fully utilize the self-attention map, we present a deep experimental analysis of iteratively refining the cross-attention map with the self-attention map, and propose an effective iterative refinement framework for training-free segmentation, named iSeg. iSeg introduces an entropy-reduced self-attention module that uses a gradient descent scheme to reduce the entropy of the self-attention map, thereby suppressing the weak responses corresponding to irrelevant global information. Leveraging the entropy-reduced self-attention module, iSeg stably improves the cross-attention map through iterative refinement. Further, we design a category-enhanced cross-attention module to generate an accurate cross-attention map, providing a better initial input for iterative refinement. Extensive experiments across different datasets and diverse segmentation tasks (weakly-supervised semantic segmentation, open-vocabulary semantic segmentation, unsupervised segmentation, and mask generation on a synthetic dataset) reveal the merits of the proposed contributions, leading to promising performance. For unsupervised semantic segmentation on Cityscapes, iSeg achieves an absolute gain of $3.8\%$ in mIoU over the best existing training-free approach in the literature. Moreover, iSeg supports segmentation with different kinds of images and interactions, and can also be used as a post-processing step, or within other frameworks, to improve training-free segmentation. The project is available at https://linsun449.github.io/iSeg.
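The iterative refinement described in the abstract above can be illustrated with a small sketch. It is not the iSeg implementation: it assumes each self-attention row is re-parameterized as a softmax over logits, takes plain gradient descent steps on the row entropy, and then repeatedly propagates a toy cross-attention vector through the sharpened rows. The function names, step count, and learning rate are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def reduce_entropy(row, steps=10, lr=1.0):
    # Assumption: the row is treated as softmax(logits); this is one
    # possible parameterization, not necessarily the one used by iSeg.
    logits = [math.log(max(p, 1e-12)) for p in row]
    for _ in range(steps):
        probs = softmax(logits)
        h = entropy(probs)
        # dH/dz_j = -p_j * (log p_j + H), so a descent step adds
        # lr * p_j * (log p_j + H) to each logit.
        logits = [z + lr * p * (math.log(max(p, 1e-12)) + h)
                  for z, p in zip(logits, probs)]
    return softmax(logits)

def refine(cross, self_attn, iters=3):
    # Sharpen every self-attention row, then propagate the cross-attention
    # vector through the sharpened matrix several times.
    sharp = [reduce_entropy(row) for row in self_attn]
    c = list(cross)
    for _ in range(iters):
        c = [sum(w * v for w, v in zip(row, c)) for row in sharp]
        top = max(c)
        if top > 0.0:
            c = [v / top for v in c]  # rescale so the peak response is 1
    return c

# Toy 3-token example (hypothetical values).
self_attn = [[0.5, 0.3, 0.2], [0.3, 0.5, 0.2], [0.1, 0.2, 0.7]]
cross = [1.0, 0.4, 0.0]
sharp_row = reduce_entropy(self_attn[0])
refined = refine(cross, self_attn)
```

Each descent step moves probability mass toward the already dominant entries of a row, which is one way to suppress weak responses before the map is reused across several iterations.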
Pre-trained Vision-Language Models (VLMs) have demonstrated strong zero-shot generalization capabilities. Despite their effectiveness on various downstream tasks, they remain vulnerable to adversarial samples. Existing methods fine-tune VLMs to improve their robust performance by performing adversarial training on a certain dataset. However, this can lead to model overfitting and is not a true zero-shot scenario. In this paper, we propose a truly zero-shot and training-free approach that can improve the zero-shot adversarial robustness of VLMs on the evaluated benchmarks. Specifically, we first discover that simply adding Gaussian noise can enhance the VLM's zero-shot robustness. Then, we treat the adversarial examples with added Gaussian noise as anchors and strive to find a path in the embedding space that leads from the adversarial examples to the cleaner samples. Furthermore, to avoid the overfitting issue caused by fixed hyperparameters, we propose an adaptive parameter adjustment method based on the distance between the anchors and adversarial samples in the embedding space. We largely preserve the original VLMs' zero-shot generalization abilities in a truly zero-shot and training-free manner on the evaluated benchmarks compared to previous methods. Extensive experiments on 16 datasets demonstrate that our method can achieve stronger zero-shot robust performance, improving the top-1 robust accuracy by an average of 10.83%.
Accurate detection and 6D pose estimation of surgical instruments are crucial for many computer-assisted interventions. However, supervised methods lack flexibility for new or unseen tools and require extensive annotated data. This work introduces a training-free pipeline for accurate multi-view 6D pose estimation of unseen surgical instruments, which only requires a textured CAD model as prior knowledge. Our pipeline consists of two main stages. First, for detection, we generate object mask proposals in each view and score their similarity to rendered templates using a pre-trained feature extractor. Detections are matched across views, triangulated into 3D instance candidates, and filtered using multi-view geometric consistency. Second, for pose estimation, a set of pose hypotheses is iteratively refined and scored using feature-metric scores with cross-view attention. The best hypothesis undergoes a final refinement using a novel multi-view, occlusion-aware contour registration, which minimizes reprojection errors of unoccluded contour points. The proposed method was rigorously evaluated on real-world surgical data from the MVPSP dataset. The method achieves millimeter-accurate pose estimates that are on par with supervised methods under controlled conditions, while maintaining full generalization to unseen instruments. These results demonstrate the feasibility of training-free, marker-less detection and tracking in surgical scenes, and highlight the unique challenges in surgical environments. We present a novel and flexible pipeline that effectively combines state-of-the-art foundation models, multi-view geometry, and contour-based refinement for high-accuracy 6D pose estimation of surgical instruments without any task-specific training. This approach enables robust instrument tracking and scene understanding in dynamic clinical environments.
3D object detection from vision inputs powers autonomous driving and embodied AI but remains compute- and energy-intensive at inference. While neuromorphic (spike-driven) computation promises event-driven sparsity and efficiency, prior training-free conversions have largely focused on 2D tasks. They are not directly applicable to modern 3D detectors due to normalization-dependent inconsistencies. We present VISTA-3D, a training-free unfolding method for vision-based 3D object detection. The approach replaces all LayerNorm blocks with a calibrated Exponential Normalization (ExpNorm) and emits incremental temporal updates whose sum matches the one-shot ANN output, producing a temporally unfolded representation suitable for spike-style execution. On the KITTI benchmark, VISTA-3D preserves the original detector's accuracy, achieving the same 3D Average Precision under the standard R40 evaluation protocol for both the monocular and depth-augmented variants. Experiments on nuScenes further show that the unfolding mechanism generalizes to larger transformer-based detectors, maintaining competitive accuracy while enabling sparse spike-style computation. VISTA-3D reduces analytic latency and achieves a normalized SOP-based energy proxy of approximately 0.17, without introducing new parameters. A controlled ablation confirms that directly enabling test-time spiking without replacing LayerNorm severely degrades performance, whereas our calibrated unfolding preserves accuracy robustly. VISTA-3D provides a principled, plug-and-play route toward neuromorphic-ready 3D perception, offering a stable temporal representation that may serve as a basis for future fully spiking 3D detectors.
Morning glory syndrome (MGS) is a rare congenital disease. Approximately 50% of MGS patients present with retinal detachment. Widespread screening for MGS significantly aids early detection, but it places a considerable burden on healthcare professionals. Recently, AI-assisted diagnostic methods have made significant strides and achieved satisfactory accuracy. However, current AI-assisted methods rely heavily on large datasets to promote feature learning, and the scarcity of MGS data makes it challenging to optimize model parameters. To address this limitation, we propose a training-free method named TF-VSF, which leverages prior knowledge from foundation models and MGS-specific pathological structures to generate low-dimensional, refined feature representations for the diagnostic grading task. Specifically, the channel-based visual recalibration (CVR) module introduces pretrained prior knowledge from SAM to generate a coarse segmentation mask, which is then refined by a pyramid calibration module that filters the high-dimensional semantic structures in a parameter-free manner. Then, the semantic-based location perception (SLP) module uses the prior knowledge of pretrained contrastive language-image pretraining (CLIP) to generate a semantic implicit feature representation with edge energy control, which is then fused with the refined features from the CVR module. Finally, the grading results are obtained through independent component analysis (ICA) feature reduction and density constraint clustering. We developed a dataset of 1016 MGS fundus images. TF-VSF achieves 95.87% accuracy and a 93.50% F1-score, surpassing comparable methods from the general image domain as well as self-supervised, fully trained, and training-free methods from the medical image domain. TF-VSF represents a novel framework that bridges the gap in AI-assisted diagnostic technology for rare diseases.
Training-free conditional generation based on flow matching aims to leverage pre-trained unconditional flow matching models to perform conditional generation without retraining. Recently, a successful training-free conditional generation approach incorporates conditions via posterior sampling, which relies on the availability of a score function in the unconditional diffusion model. However, flow matching models lack an explicit score function, rendering this strategy inapplicable. Approximate posterior sampling for flow matching has been explored, but it is limited to linear inverse problems. In this paper, we propose Flow Matching-based Posterior Sampling (FMPS) to broaden its scope of application. We introduce a correction term by steering the velocity field. This correction term can be reformulated to incorporate a surrogate score function, thereby bridging the gap between flow matching models and score-based posterior sampling. Hence, FMPS enables posterior sampling to be adjusted within the flow-matching framework. Furthermore, we propose two practical implementations of the correction mechanism: one to improve generation quality and the other to enhance computational efficiency. Experimental results on diverse conditional generation tasks demonstrate that our method achieves superior generation quality compared to existing state-of-the-art approaches, validating the effectiveness and generality of FMPS.
Although large models drive unprecedented growth in data and model parameters, many real-world problems prioritize interpretability and generality and lack sufficient training data. For instance, in Compressed Sensing (CS), where sparse reconstruction solves underdetermined systems, traditional iterative methods remain the practical choice due to their interpretability and out-of-the-box applicability to arbitrary conditions, but they suffer from poor quality and inefficiency at low sampling rates. To address this, we propose Coefficients Learning (CL), a novel training-free framework for sparse reconstruction. CL employs ultra-small neural models with only $n$ trainable parameters for a length-$n$ signal. It retains the interpretability and generality of traditional iterative methods by adopting their residual-based solving process, while enhancing efficiency and accuracy by replacing closed-form solutions with automatic differentiation and embedding prior knowledge into the model losses. We evaluate CL extensively on synthetic and real one-dimensional and two-dimensional signals. A detailed analysis is first conducted using an implemented CLOMP. To demonstrate general applicability, CL is also implemented on three types of classic iterative CS reconstruction methods. Results show that CL maintains the generality of iterative methods while significantly boosting accuracy. Although it adds minor overhead for convex optimization or message-passing methods, it achieves efficiency gains of 100 to 1000 times for greedy algorithms. On the nine diverse image datasets tested, CL improves median reconstruction accuracy by approximately 163%, 78%, and 35% at sampling rates of 0.04, 0.25, and 0.5, respectively, compared to classic iterative methods. This training-free CS reconstruction method can empower the many industrial and medical machines that rely on sparse solutions.
Given a text-to-image diffusion model pretrained on large-scale text-image pairs, can we align the model with human preferences without further fine-tuning? In this paper, we analyze the effect of alignment tuning in diffusion models by comparing the diffusion denoising trajectory between base and aligned models. Our findings reveal that alignment tuning primarily affects superficial stylistic aspects during denoising, rather than fundamental content, suggesting superficial alignment behaviors. Based on this discovery, we introduce a novel, training-free alignment approach (RSTFA) that leverages rejection sampling at specific stylistic timesteps, ensuring human preference alignment without fine-tuning or heavy inference overhead. We provide a theoretical analysis and derive a bias bound for our rejection-sampling alignment scheme. Empirically, we show that RSTFA better preserves sample diversity than reinforcement-learning-based tuning methods. Extensive experiments on Pick-a-Pic, COCO, HPD V2, and PartiPrompts show that our method not only achieves superior alignment with human preferences compared to state-of-the-art methods, but also reduces computational demands, establishing efficient, human-centered diffusion model alignment.
Maritime Domain Awareness (MDA) relies heavily on data acquired from high-resolution optical spaceborne sensors; however, processing this massive quantity of sensor data via traditional supervised deep learning is severely bottlenecked by its dependency on exhaustively annotated datasets. Under extreme data scarcity, conventional architectures suffer severe performance degradation, rendering them impractical for time-critical, zero-day deployments. To overcome this barrier, we propose a training-free inference paradigm that leverages the extensive pre-trained knowledge of Large Vision-Language Models (VLMs). Specifically, we introduce a Domain Knowledge-Enhanced In-Context Learning (DK-ICL) framework coupled with a Macro-Topological Chain-of-Thought (MT-CoT) strategy. This approach bridges the perspective gap between natural images and top-down optical sensor imagery by translating expert remote sensing heuristics into a strict, step-by-step reasoning pipeline. Extensive evaluations demonstrate the substantial efficacy of this framework. Armed with merely 4 visual exemplars per category as in-context triggers, our MT-CoT augmented VLMs outperform traditional models trained under identical scarcity by over 38% in F1-score. Crucially, real-world case studies confirm that this zero-gradient approach maintains robust generalization on unannotated, out-of-distribution coastal clutters, achieving performance parity with data-heavy networks trained on 50 times the data volume. By substituting massive human annotation and GPU optimization with scalable logical deduction, this paradigm establishes a resource-efficient foundation for next-generation intelligent maritime sensing networks.
Designing noise-robust parameterized quantum circuits (PQCs) is a central challenge in the noisy intermediate-scale quantum (NISQ) regime. Existing quantum architecture search methods rely on training large SuperCircuits and evaluating SubCircuits under noisy execution, resulting in high computational cost and architecture assessments that depend on task-specific optimization and device noise. In this work, we propose a training-free quantum architecture search framework based on information-theoretic expressibility measures rather than performance-based estimators. We empirically show that noise-free KL-divergence-based expressibility exhibits a consistent monotonic association with noisy task loss across diverse circuit architectures and realistic hardware noise models. Leveraging this relationship, we introduce an expressibility-guided evolutionary search that requires neither SuperCircuit training nor noisy execution during the search phase. Since expressibility is evaluated independently of hardware noise, the method is inherently device-agnostic, enabling architectures to be reused across multiple quantum devices without re-running the search. Experiments using IBM-derived Qiskit noise models demonstrate that the proposed approach achieves competitive performance compared to SuperCircuit-based baselines, while substantially reducing computational cost. These results establish expressibility as an effective information-theoretic surrogate for ranking PQC architectures under realistic noise.
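As a hedged illustration of a KL-divergence-based expressibility score like the one in the abstract above, the sketch below follows a commonly used definition: histogram the pairwise state fidelities sampled from a circuit and compare them with the Haar-random fidelity density $P_{\mathrm{Haar}}(F) = (N-1)(1-F)^{N-2}$, where $N = 2^{n}$ for $n$ qubits. Whether the paper uses exactly this estimator is an assumption; the function names and bin count are hypothetical, and fidelities are taken as given rather than simulated.

```python
import math

def haar_bin_prob(lo, hi, dim):
    # Integral of P_Haar(F) = (dim - 1) * (1 - F)**(dim - 2) over [lo, hi].
    return (1.0 - lo) ** (dim - 1) - (1.0 - hi) ** (dim - 1)

def expressibility(fidelities, n_qubits, bins=75):
    """KL(empirical fidelity histogram || Haar histogram); lower means closer to Haar."""
    dim = 2 ** n_qubits
    counts = [0] * bins
    for f in fidelities:
        # Fidelities are assumed to lie in [0, 1]; F = 1 goes to the last bin.
        counts[min(int(f * bins), bins - 1)] += 1
    total = len(fidelities)
    kl = 0.0
    for i, c in enumerate(counts):
        if c == 0:
            continue  # empty bins contribute nothing to the sum
        p = c / total
        q = haar_bin_prob(i / bins, (i + 1) / bins, dim)
        kl += p * math.log(p / q)
    return kl

# A circuit that always prepares the same state has fidelity 1 for every pair.
idle_score = expressibility([1.0] * 100, n_qubits=1)
# Uniformly spread fidelities match the single-qubit Haar density exactly.
uniform_score = expressibility([(i + 0.5) / 1000 for i in range(1000)],
                               n_qubits=1, bins=10)
```

Because the score depends only on noise-free fidelities, it can be computed once per architecture, which is consistent with the device-agnostic ranking the abstract describes.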
Diffusion models have advanced controllable text-to-image generation with structure control. However, two limitations remain: (1) existing methods couple the text prompt with structure control, so pixel-level structure control inevitably dominates the generation process and the output becomes misaligned with the text prompt; (2) they suffer from poor structure consistency because they typically focus only on spatial-domain features while neglecting essential frequency-domain representations, including texture detail. To alleviate these issues, we propose FreeMD, a novel training-free multi-domain text-to-image generation method that simultaneously achieves better semantic alignment with the text prompt and excellent structure consistency with the structure control. Specifically, we design two independent guidance branches to decouple the text prompt and structure control: an appearance guidance branch and a structure guidance branch. The former uses principal component supervision of elaborate appearance representations to transfer the appearance described in the text to the generated image, which facilitates semantic alignment with the text prompt during generation. The latter adopts a multi-domain guidance strategy that combines the spatial and frequency domains through comprehensive supervision of structure representations, thus improving structure consistency with the control. Thanks to the decoupled architecture and multi-domain guidance strategy, FreeMD accurately aligns with the text prompt while remaining structurally coherent with the control signals. Moreover, FreeMD can be used plug-and-play with various pre-trained generative models to accomplish common downstream tasks. Extensive experiments demonstrate that FreeMD outperforms existing methods in controllability and generation quality.
Recent advances in text-to-image (T2I) generation have led to impressive visual results. However, these models still face significant challenges when handling complex prompts, particularly those involving multiple subjects with distinct attributes. Inspired by the human drawing process, which first outlines the composition and then incrementally adds details, we propose Detail++, a training-free framework that introduces a novel Progressive Detail Injection (PDI) strategy to address this limitation. Specifically, we decompose a complex prompt into a sequence of simplified sub-prompts, guiding the generation process in stages. This staged generation leverages the inherent layout-controlling capacity of self-attention to first ensure global composition, followed by precise refinement. To achieve accurate binding between attributes and corresponding subjects, we exploit cross-attention mechanisms and further introduce a Centroid Alignment Loss at test time to reduce binding noise and enhance attribute consistency. Extensive experiments on T2I-CompBench and a newly constructed style composition benchmark demonstrate that Detail++ significantly outperforms existing methods, particularly in scenarios involving multiple objects and complex stylistic conditions.
Automated optical inspection (AOI) for printed circuit boards (PCBs) requires localizing small, sparse defects under illumination drift and minor placement misalignment, while supporting fast, auditable pass/fail decisions. This paper presents a training-free, reference-based digital image processing framework with no learning/training stage that compares each defective query image with a small library of defect-free reference templates (for the same PCB layout/revision) using a small set of interpretable control parameters. A reference is selected by coarse-to-fine matching (fast pre-screening followed by SSIM refinement on a central region), and an optional global alignment is applied only when it increases SSIM to limit defect-driven over-correction. Defects are highlighted by a defect-likelihood field that fuses an SSIM-derived structural dissimilarity map with a normalized absolute-difference map, followed by connected-component extraction to produce confidence-ranked bounding boxes. The method achieves Precision = 0.9663, Recall = 0.9987, and F1 = 0.9822 at the best-F1 operating point (0.149 false positives per image). Under the adopted box-matching protocol, average precision reaches 0.984. Precision-recall and FROC curves are reported to support threshold selection under different false-alarm budgets.
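The fusion and box-extraction steps in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: the SSIM-derived dissimilarity map is taken as a precomputed input, the two maps are blended with a fixed weight, components are 4-connected, and each box is scored by its peak likelihood. The weight, threshold, and function names are hypothetical.

```python
from collections import deque

def fuse_maps(dissim, absdiff, weight=0.5):
    """Blend a dissimilarity map with a max-normalized absolute-difference map."""
    peak = max(max(row) for row in absdiff) or 1.0  # avoid dividing by zero
    return [[weight * d + (1.0 - weight) * (a / peak)
             for d, a in zip(drow, arow)]
            for drow, arow in zip(dissim, absdiff)]

def extract_boxes(field, threshold=0.5):
    """4-connected components above threshold, as (score, top, left, bottom, right)."""
    h, w = len(field), len(field[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if seen[r][c] or field[r][c] < threshold:
                continue
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right, score = r, c, r, c, field[r][c]
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                score = max(score, field[y][x])
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w and not seen[ny][nx]
                            and field[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((score, top, left, bottom, right))
    return sorted(boxes, reverse=True)  # highest-confidence box first

# Toy 4x4 maps with two defect regions (hypothetical values).
dissim = [[0, 0, 0, 0], [0, 0.9, 0.8, 0], [0, 0, 0, 0], [0, 0, 0, 0.6]]
absdiff = [[0, 0, 0, 0], [0, 200, 180, 0], [0, 0, 0, 0], [0, 0, 0, 100]]
boxes = extract_boxes(fuse_maps(dissim, absdiff))
```

Sorting by score gives the confidence ranking needed to trace precision-recall or FROC curves by sweeping a cut-off over the returned boxes.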
Large autoregressive models can generate high-quality, high-resolution images but suffer from slow generation speed, because these models require hundreds to thousands of sequential forward passes for next-token prediction during inference. To accelerate autoregressive text-to-image generation, we propose Speculative Jacobi Decoding++ (SJD++), a training-free probabilistic parallel decoding algorithm. Unlike traditional next-token prediction, SJD++ performs multi-token prediction in each forward pass, drastically reducing generation steps. Specifically, it integrates the iterative multi-token prediction mechanism from Jacobi decoding, with the probabilistic drafting-and-verification mechanism from speculative sampling. More importantly, for further acceleration, SJD++ reuses high-confidence draft tokens after each verification phase instead of resampling them all. We conduct extensive experiments on several representative autoregressive text-to-image generation models and demonstrate that SJD++ achieves $2\times$ to $3\times$ inference latency reduction and $2\times$ to $7\times$ step compression, while preserving visual quality with no observable degradation.
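To make the Jacobi decoding component in the abstract above concrete, here is a minimal sketch of the deterministic (greedy) variant only. It does not model the probabilistic drafting-and-verification or the draft-token reuse that SJD++ adds. `toy_next_token` is a hypothetical stand-in for a greedy autoregressive model; in a real model, all positions of one iteration would be computed in a single forward pass.

```python
def toy_next_token(prefix):
    """Hypothetical deterministic stand-in for a greedy autoregressive model."""
    return (3 * sum(prefix) + len(prefix)) % 7

def sequential_decode(prompt, n):
    # Standard next-token decoding: n sequential model calls.
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(toy_next_token(tokens))
    return tokens[len(prompt):]

def jacobi_decode(prompt, n, init_token=0):
    # Start from an arbitrary draft and update all positions in parallel
    # until the draft stops changing (a fixed point of the greedy model).
    draft = [init_token] * n
    for passes in range(1, n + 1):
        # One "parallel pass": every position is re-predicted from the
        # previous iterate's prefix.
        new = [toy_next_token(list(prompt) + draft[:i]) for i in range(n)]
        if new == draft:
            return draft, passes
        draft = new
    # After n passes the first n tokens are guaranteed to be correct.
    return draft, n

prompt = [1, 2]
reference = sequential_decode(prompt, 8)
decoded, passes = jacobi_decode(prompt, 8)
```

After pass k at least the first k tokens agree with sequential decoding, so the iteration never needs more than n passes; any speedup comes from drafts that converge sooner.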
Random telegraph signal (RTS) analysis is increasingly important for characterizing meaningful temporal fluctuations in physical, chemical, and biological systems. The simplest RTS arises from discrete stochastic switching events between two binary states, quantified by their transition amplitude and dwell times in each state. Quantitative analysis of RTSs provides valuable insights into microscopic processes such as charge trapping in semiconductors. However, analyzing RTSs becomes considerably more complex when signals exhibit multi-level structures or are corrupted by background white or pink noise. To address these challenges and support high-throughput RTS characterization, we propose a modular, training-free signal processing pipeline that integrates adaptive dual-tree complex wavelet transform (DTCWT) denoising with a lightweight Bayesian digitization strategy. The adaptive DTCWT denoiser incorporates autonomous parameter selection rules for its decomposition level and thresholds, optimizing white noise suppression without manual tuning. Complementing this stage, our Bayesian digitizer formulates RTS level assignment as a probabilistic latent-state inference problem incorporating temporal regularization without iterative optimization, effectively resolving binary trap states even under residual background pink noise. Quantitative benchmarking on large synthetic datasets with known ground truth demonstrates improved RTS reconstruction accuracy, trap-state resolution, and dwell-time estimation across diverse noise regimes and multi-trap scenarios, while achieving up to 83× speedups over classical and neural baselines. Qualitative validation on experimental RTS data, where no ground truth is available, illustrates practical usability and flexibility for real-time or large-scale analysis in real measurement settings.
Together, the proposed framework establishes a scalable and reproducible foundation for autonomous RTS analysis and systematic benchmarking, with potential to support future extensions toward more complex and device-specific RTS studies.
With the prevalence of pre-trained vision-language models like CLIP, leveraging the generic knowledge embedded in CLIP for domain adaptation has proved to be a promising direction. However, most existing CLIP-based methods are limited to closed-set settings. This is primarily because CLIP needs the semantic labels of unknown classes for inference, thus making it not applicable to Open-Set Domain Adaptation (OSDA). To utilize the complementary roles of CLIP and the source model, our paper proposes a novel Semantic-guided Target Adaptation (SemTA) framework for OSDA in a training-free manner. Specifically, we introduce an unknown semantic discovery module. It uses the cluster centroids of the target data to obtain the semantic labels of unknown classes from the worldwide corpus. Then, the semantic-based inference can be performed with CLIP. Additionally, the dual sample attention mechanism is implemented to output sample-based inference. Representative features from both the source model and CLIP serve as the key to improve task specificity. Compared to previous OSDA methods which reject unknown data by confidence threshold, the proposed approach is more practical and offers better interpretability. Comprehensive evaluations on four benchmarks reveal our method sets a new state-of-the-art even without training. Our code will be publicly available soon.
Tabular data are among the most common data types, and TabPFN has recently emerged as a powerful foundation model offering fast, training-free predictions. However, its applicability to high-class-count classification remains limited, as fine-tuning or retraining incurs heavy computational costs. We address this gap by framing multiclass prediction within the Error-Correcting Output Codes (ECOC) paradigm, a training-free approach whose effectiveness depends critically on codebook design and decoding. We present the first systematic study of ECOC-based extensions for TabPFN and introduce MultiTabPFN, a modular framework with Classwise Principal components-based Indexing (CPI), a novel codebook method that encodes class-level geometry into compact binary codes. Compared to conventional ECOC constructions, CPI explicitly balances separability and redundancy in the code space, thereby providing a principled path for scaling tabular foundation models to many-class settings. Combined with confidence-aware decoding, MultiTabPFN consistently outperforms standard ECOC baselines across synthetic tasks and 36 real-world benchmarks, establishing a practical and training-free extension of TabPFN to high-class-count tabular classification.
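The ECOC mechanism referred to in the abstract above can be sketched as follows, with hedges. The codebook here is hand-written rather than produced by CPI, the per-bit probabilities are supplied directly instead of coming from TabPFN, and decoding uses a plain L1 distance rather than the paper's confidence-aware scheme. All names are hypothetical.

```python
def binary_targets(labels, codebook, bit):
    """Relabel a multiclass problem as the binary subproblem for one code bit."""
    return [codebook[label][bit] for label in labels]

def ecoc_decode(bit_probs, codebook):
    """Return the class whose codeword is closest (L1) to the predicted bit probabilities."""
    best_class, best_dist = None, float("inf")
    for label, code in codebook.items():
        dist = sum(abs(p - b) for p, b in zip(bit_probs, code))
        if dist < best_dist:
            best_class, best_dist = label, dist
    return best_class

# Hand-written codebook with minimum Hamming distance 3, so a single
# wrong bit can still be corrected.
codebook = {
    "A": (0, 0, 0, 0, 0),
    "B": (0, 1, 1, 0, 1),
    "C": (1, 0, 1, 1, 0),
    "D": (1, 1, 0, 1, 1),
}

targets_bit0 = binary_targets(["A", "C", "D", "B"], codebook, bit=0)
soft_prediction = ecoc_decode([0.1, 0.9, 0.8, 0.2, 0.4], codebook)
flipped_prediction = ecoc_decode([0, 1, 1, 0, 0], codebook)  # "B" with last bit flipped
```

Each bit defines a subproblem small enough for a base classifier with a limited class count, and the redundancy in the codewords is what absorbs individual bit errors at decoding time.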
Accurate morphometric measurements are crucial for musculoskeletal radiography, but they remain labor-intensive and prone to inter-reader variability. Current artificial intelligence-based solutions often require large annotated training datasets and are limited to narrow applications. We present and validate a training-free artificial intelligence framework that automatically derives morphometric measurements across multiple anatomies and radiographic views using universal landmark matching. In this retrospective study, 600 standard radiographs of the foot, knee, and shoulder were analyzed. Additionally, a cohort of 240 challenging radiographs containing orthopedic implants was constructed to stress-test the approach. Landmarks from reference radiographs were transferred to unseen radiographs using a pre-trained generalist dense-matching method and were then used to derive measurements in a post-processing step. The resulting measurements were compared with manual annotations and measurements by two radiologists. Mean landmark matching error was 2.68 ± 2.70 mm using a single reference radiograph and improved to 2.15 ± 2.38 mm with 40 reference radiographs. Measurement accuracy ranged from 1.81° (I-II metatarsal angle) to 8.65° (congruence angle). Increasing the number of reference images improved measurement accuracy, which mostly approached inter-reader agreement. Performance was mixed on the challenging cohort, demonstrating the limitations and strengths of the approach. This anatomy-agnostic framework enables training-free morphometry across multiple regions, with measurement-dependent performance often comparable to inter-reader agreement. Challenging cases highlight specific limitations, motivating the use of quality control and reference-set tuning for deployment. Its minimal setup enables rapid adaptation to new anatomies and measurements, although clinically practical runtimes require GPU inference.
Question: Can a generalist artificial intelligence framework be used to accurately and automatically perform morphometric measurements across different musculoskeletal radiographs without anatomy-specific training? Findings: The training-free approach achieved performance that approaches expert-level agreement for most measurements, while highlighting measurement-specific limitations in challenging cases. Multiple reference radiographs improved results. Clinical relevance: This approach automates repetitive morphometric measurements that are prone to inter-reader variability, reducing manual workload while providing reproducible results that can approach expert radiologist performance. Its adaptability and minimal setup enable integration into routine workflows.
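The post-processing step mentioned in the abstract above, which turns transferred landmarks into measurements, might look like the following sketch. It assumes landmarks are 2D pixel coordinates, that an angular measurement is the angle between two landmark-defined axes, and that a known pixel spacing converts distances to millimeters. The function names and coordinates are hypothetical and do not reflect the study's actual measurement definitions.

```python
import math

def axis_angle_deg(p1, p2, q1, q2):
    """Angle in degrees (0 to 180) between direction p1->p2 and direction q1->q2."""
    ux, uy = p2[0] - p1[0], p2[1] - p1[1]
    vx, vy = q2[0] - q1[0], q2[1] - q1[1]
    cross = ux * vy - uy * vx
    dot = ux * vx + uy * vy
    # atan2 of |cross| and dot is numerically stable for small and large angles.
    return math.degrees(math.atan2(abs(cross), dot))

def landmark_error_mm(predicted, reference, pixel_spacing_mm):
    """Euclidean distance between two landmarks, converted from pixels to mm."""
    return math.hypot(predicted[0] - reference[0],
                      predicted[1] - reference[1]) * pixel_spacing_mm

# Toy landmark coordinates in pixels (hypothetical values).
right_angle = axis_angle_deg((0, 0), (1, 0), (0, 0), (0, 1))
diagonal_angle = axis_angle_deg((0, 0), (1, 0), (0, 0), (1, 1))
error_mm = landmark_error_mm((3, 4), (0, 0), pixel_spacing_mm=0.5)
```

Because every measurement is a simple function of a few landmarks, a new measurement only needs new landmark definitions on the reference radiographs, which fits the minimal-setup claim.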
Industrial image anomaly detection (IAD) is a pivotal topic of great practical value. Due to the nature of anomalies, real anomalies in a specific modern industrial domain (i.e., domain-specific anomalies) are usually too rare to collect, which severely hinders IAD. Thus, zero-shot anomaly synthesis (ZSAS), which synthesizes pseudo anomaly images without any domain-specific anomaly, has emerged as a vital technique for IAD. However, existing solutions are either unable to synthesize authentic pseudo anomalies or require cumbersome training. We therefore focus on ZSAS and propose a new paradigm that realizes ZSAS that is both authentic and training-free. It is based on a long-ignored fact: although domain-specific anomalies are rare, real anomalies from other domains (i.e., cross-domain anomalies) are actually abundant and directly applicable to ZSAS. Specifically, our new ZSAS paradigm makes three contributions. First, we propose a novel method named Cross-domain Anomaly Injection (CAI), which directly exploits cross-domain anomalies to enable highly authentic ZSAS in a training-free manner. Second, to supply CAI with sufficient cross-domain anomalies, we build what is, to the best of our knowledge, the first Domain-agnostic Anomaly Dataset (DAAD), which provides ZSAS with abundant real anomaly patterns. Third, we propose a CAI-guided Diffusion Mechanism, which further breaks the quantity limit of real anomalies and enables unlimited anomaly synthesis. Our head-to-head comparison with existing ZSAS solutions confirms the superior performance of our paradigm for IAD and demonstrates that it is an effective and pragmatic ZSAS solution.
Camouflaged Object Detection (COD) is pivotal for segmenting objects that seamlessly blend into their surroundings. While prior endeavors demonstrate impressive performance through training on predefined labels, they heavily rely on labor-intensive data annotation and struggle to adapt to open-world scenarios. In this light, we propose RA-COD, a training-free paradigm that enables COD by retrieving the most similar samples from the prototype repository. The efficacy of RA-COD hinges on 1) capturing the nuanced resemblance between objects and their environments and 2) excelling in dense prediction tasks. To achieve (1), the crux lies in ensuring diversity and discriminability within the prototype repository. In this context, we propose GenPro, an automated pipeline for crafting Generative Prototypes. GenPro integrates a range of foundation models, including the Diffusion Model, Vision-Language Model, Segment Anything Model (SAM), and DINOv2, in a complementary manner that synergistically generates diverse and distinguishable prototype samples. To achieve (2), we propose C2F to retrieve camouflaged objects in a Coarse-to-Fine regime. We commence with pixel-level retrieval in the feature space, which generates a coarse mask that effectively captures class discrimination and object localization. Further refinement is achieved by extracting bounding boxes from this coarse mask to prompt SAM in generating mask proposals for region-level retrieval. Evaluations on four benchmarks showcase that RA-COD achieves state-of-the-art performance compared to existing training-free methods.