Existing video inpainting methods typically combine optical flow propagation with Transformer architectures, achieving promising inpainting results. However, they lack adaptive inpainting strategy optimization in diverse scenarios, and struggle to capture high-level temporal semantics, causing temporal inconsistencies and quality degradation. To address these challenges, we make one of the first attempts to introduce reinforcement learning into the video inpainting domain, establishing a closed-loop framework named CLIP-RL that enables adaptive strategy optimization. Specifically, video inpainting is reformulated as an agent-environment interaction, where the inpainting module functions as the agent's execution component, and a pre-trained inpainting detection module provides real-time quality feedback. Guided by a policy network and a composite reward function that incorporates a weighted temporal alignment loss, the agent dynamically selects actions to adjust the inpainting strategy and iteratively refines the inpainting results. Compared to ProPainter, CLIP-RL improves PSNR from 34.43 to 34.67 and SSIM from 0.974 to 0.986 on the YouTube-VOS dataset. Qualitative analysis demonstrates that CLIP-RL excels in detail preservation and artifact suppression, validating its superiority in video inpainting tasks.
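For readers less familiar with the PSNR figures quoted above, the following is a minimal sketch of how PSNR is conventionally computed from the mean squared error between a reference frame and an inpainted frame. It assumes 8-bit intensities (peak value 255) and flat sequences of pixel values. The function name `psnr` is illustrative and is not taken from CLIP-RL, whose exact evaluation protocol is not described in the abstract.

```python
import math


def psnr(reference, restored, peak=255.0):
    """Peak signal-to-noise ratio (dB) between two equal-length pixel sequences."""
    if len(reference) != len(restored) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(peak ** 2 / mse)
```

Because PSNR is logarithmic, a gain of 0.24 dB (34.43 to 34.67) would correspond to roughly 5% lower MSE if both values were computed from a single pooled MSE; per-frame averaging, as is common in practice, makes this only an approximation.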
Video inpainting modifies local regions in video while ensuring spatial and temporal coherence. However, existing methods, both traditional and recent diffusion-based ones, face key limitations: they lack unified support for both insertion and completion, and are restricted to single-object inpainting, making it difficult to handle multi-object scenarios involving grounding and interaction. In this paper, we propose MultiPaint, a unified framework for multi-task, multi-object, and multi-condition video inpainting. First, we introduce dual-branch adapters to unify the insertion and completion tasks within a single model. Moreover, we propose a test-time scheduled feature composition strategy that enables multi-object inpainting with user-specified locations while better preserving interactions among objects, a setting that has been insufficiently addressed in prior work. Additionally, we introduce a multi-condition inpainting scheme that integrates text-guided, image-guided, and keyframe-guided modes via dynamic frame masking, providing more controllability in appearance customization. Extensive experiments show that MultiPaint achieves state-of-the-art performance on object insertion and scene completion among recent works. We further demonstrate its versatility in downstream tasks including grounded video generation, object editing, object removal, image-guided inpainting, and long video inpainting.
Historical artefacts such as pottery, sculptures, paintings, and manuscripts often suffer damage, erosion, or loss of detail due to weather, ageing, environmental factors, or improper handling. Traditional restoration is labour-intensive, slow, and prone to human error, while digital restoration enables reversible, non-invasive, and precise reconstruction of cultural heritage objects. Thus, a significant focus in computer vision (CV) has shifted to inpainting of historical artefacts, which repairs and restores damaged or missing sections to preserve the artwork’s original integrity. Conventional image inpainting methods, whether pixel-diffusion-based or patch-based, have limitations. While modern digital methods have improved the effectiveness of inpainting, they often struggle to maintain the original work’s aesthetic and unique qualities, which makes it difficult to restore the work’s authentic look and feel completely and accurately. With the constant advances in deep learning (DL), DL-based image inpainting techniques have achieved remarkable results. Unlike conventional image inpainting techniques, Generative Adversarial Network (GAN)-driven approaches offer greater efficiency and generality. With this motivation, this study develops a new hybrid deep learning-enabled image inpainting model for smart historical artefact restoration, named the HDLIP-SHAR technique. The HDLIP-SHAR technique aims to train a DL model to identify and reconstruct missing or damaged portions of artefact images. Initially, adaptive median filtering (AMF) and contrast enhancement are applied to improve the image quality. Furthermore, a hybrid SqueezeNet CNN model is utilised to extract deep semantic features from historical artefact images to identify cracks, missing parts, and faded textures. Moreover, the U-Net model is applied for image segmentation and localisation of damaged regions.
Finally, a transformer-based GAN model is used to restore and inpaint the missing areas of the image. The comparison analysis of the HDLIP-SHAR model demonstrated superior performance with an average PSNR of 64.59 dB, SSIM of 0.945, and LPIPS of 0.0401, outperforming other methods under the MuralDH dataset.
Focal brain lesions from Acquired Brain Injuries (ABIs) present as regions of abnormal signal intensity on T1-weighted Magnetic Resonance Imaging (MRI) scans. These can disrupt automated neuroimaging processing algorithms traditionally developed on and for healthy brains. Lesion filling (or inpainting) can replace lesioned image voxels with signal intensities approximating healthy tissue. This creates a 'lesion-free' brain to use as input to the image processing algorithms, thus aiming to reduce lesion-induced errors. This scoping review provides a detailed overview of the available inpainting tools for use in neuroimaging analysis of patients with ABI. First, we define lesion inpainting and highlight its importance for pre-processing of MRI scans. Next, we classify the papers resulting from our search (24 in total) into: (a) Traditional Methods (Local Diffusion, Global Diffusion, Search Patch-Based, a priori Patch-Based, or Low Rank Sparse Decomposition) and (b) Deep Learning methods (Convolutional Neural Networks, Generative Adversarial Networks, or Denoising Diffusion Models). We then discuss the strengths and limitations of each inpainting method. Finally, we provide recommendations for both the use and the development of inpainting tools, to increase the adoption of lesion inpainting across ABI studies.
With the growing demand for face editing applications, face inpainting has become an increasingly important subfield within image inpainting research. While many existing methods use semantic segmentation guidance, they apply uniform weighting across all regions of the map. Since the missing areas of the image lack meaningful features, this uniform treatment provides insufficient guidance in missing areas, leading to unrealistic or structurally incoherent results. Moreover, these methods generally lack the capacity to adapt inpainting results to individual user preferences, thereby limiting their effectiveness in personalized face editing. To address these limitations, we propose SegPainter, a Mamba-based architecture for user-controllable face inpainting that enables customized restoration guided by user-defined semantic segmentation maps generated using the Image Segmentation Annotation Tool (ISAT) integrated with Meta's Segment Anything Model (SAM). Specifically, we propose Hard Mask Soft One-Hot Encoding (HMSOE) to adaptively weight regions in the segmentation map based on whether they correspond to known or missing areas of the masked image. This strategy amplifies semantic guidance in missing regions while attenuating it in known regions to avoid over-constraining existing content. We further introduce the Semantic-Guided State Space Model (SG SSM) to dynamically modulate the Mamba layer with semantic features, adapting guidance to the masked image. To enhance the quality of inpainting results, we also propose Tri-Scan Inspection (TSI), a scanning mechanism designed to capture both global and local dependencies while preserving spatial continuity and facial structure. Extensive experiments on the CelebAMask-HQ and FFHQ datasets demonstrate that our framework outperforms state-of-the-art methods, producing sharper and more semantically consistent face inpainting results. Code is available at https://github.com/langka9/segpainter.git.
Image editing has advanced significantly with the development of diffusion models using both inversion-based and instruction-based methods. However, current inversion-based approaches struggle with large modifications (e.g., adding or removing objects) due to the structured nature of inversion noise, which hinders substantial changes. Meanwhile, instruction-based methods often constrain users to black-box operations, limiting direct interaction for specifying editing regions and intensity. To address these limitations, we propose BrushEdit, a novel inpainting-based instruction-guided image editing paradigm, which leverages multimodal large language models (MLLMs) and image inpainting models to enable autonomous, user-friendly, and interactive free-form instruction editing. Specifically, we devise a system enabling free-form instruction editing by integrating MLLMs and a dual-branch image inpainting model in an agent-cooperative framework to perform editing category classification, main object identification, mask acquisition, and editing area inpainting. Extensive experiments show that our framework effectively combines MLLMs and inpainting models, achieving superior performance across seven metrics including mask region preservation and editing effect coherence.
In this work, we focus on a novel and practical task, i.e., Time-vAriant iMage inPainting (TAMP). The aim of TAMP is to restore a damaged target image by leveraging the complementary information from a reference image, where both images capture the same scene but with a significant time gap in between, i.e., time-variant images. Different from conventional reference-guided image inpainting, the reference image under the TAMP setup differs significantly in content from the target image and may itself be damaged. Such a situation arises frequently in daily life when a damaged image is restored by referring to another image whose source and quality cannot be guaranteed. In particular, our study finds that even SOTA reference-guided image inpainting methods fail to achieve plausible results due to the chaotic image complementation. To address such an ill-posed problem, we propose a novel Interactive Distribution Transition Estimation (InDiTE) module which interactively complements the time-variant images with appropriate semantics, thus facilitating the restoration of damaged regions. To further boost the performance, we propose our TAMP solution, namely Interactive Distribution Transition Estimation-driven Diffusion (InDiTE-Diff), which integrates InDiTE with a SOTA diffusion model and conducts latent cross-reference during sampling. Moreover, considering the lack of benchmarks for the TAMP task, we assemble a new dataset, i.e., TAMP-Street, based on existing image and mask datasets. We conduct experiments on the TAMP-Street dataset under two different time-variant image inpainting settings, which show that our method consistently outperforms SOTA reference-guided image inpainting methods for solving TAMP.
Objective. Photoplethysmography (PPG) and remote PPG (rPPG) are widely used non-invasive techniques for monitoring cardiovascular parameters. However, signal artifacts from motion, lighting variations, and environmental noise pose significant challenges for accurate physiological measurement, particularly in non-contact rPPG systems. To address these issues, we propose a novel generative inpainting framework designed to restore corrupted segments of PPG and rPPG signals. Approach. Our method leverages a large-scale synthetic dataset that spans a broad range of heart rates (30-180 bpm) and incorporates diverse artifact profiles to simulate real-world conditions. The inpainting model is built upon a custom Wasserstein GAN architecture, using a gradient penalty to ensure stable adversarial training. This selective reconstruction approach targets only the corrupted segments while preserving the integrity of high-quality signal portions. Main results. Results show that the proposed framework improves signal quality compared to corrupted signals. For synthetic datasets spanning heart rates from 30 to 180 bpm, signal-to-noise ratio increases from approximately -0.65 to -0.20 dB to 3.27-4.16 dB after inpainting, while the mean absolute error decreases from 0.08-0.09 to 0.05-0.06. Feature-level similarity also improves, with Fréchet Encoder Distance reduced from 0.12 to 0.03 for real PPG and from 0.07 to 0.01 for real rPPG, and consistent reductions observed across all synthetic heart-rate ranges (from 0.23-0.47 to 0.01-0.04). Heart-rate estimates derived from the reconstructed signals are statistically equivalent to those obtained from clean references. Significance. The proposed generative inpainting framework effectively restores degraded PPG and rPPG signals and preserves heart-rate estimates, supporting its use in non-critical physiological monitoring applications such as wellness monitoring and automotive contexts.
Validation on real data was limited to relatively clean, resting-state recordings; further studies are required to assess performance under high-motion and real-world conditions.
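The selective reconstruction idea described in this abstract (replace only the corrupted segments, leave clean samples untouched) can be sketched as a simple masked blend. This is an illustrative assumption about the final compositing step, not the authors' implementation: the function name `selective_inpaint` and the boolean per-sample corruption mask are hypothetical, and the generator that produces `generated` is out of scope here.

```python
def selective_inpaint(signal, generated, corrupted):
    """Return a copy of `signal` where only samples flagged in `corrupted`
    are replaced by the corresponding samples from `generated`."""
    if not (len(signal) == len(generated) == len(corrupted)):
        raise ValueError("all inputs must have the same length")
    # Keep clean samples as-is; substitute generator output elsewhere.
    return [g if bad else s for s, g, bad in zip(signal, generated, corrupted)]
```

A design consequence of this kind of blend is that any error metric computed over the clean portion of the signal is unchanged by inpainting, so improvements come only from the flagged segments.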
The degraded regions of ancient murals often contain intricate textures and structural curves, presenting major challenges for traditional mural restoration. Although digital image inpainting offers a viable approach, Convolutional Neural Network (CNN)-based methods are commonly constrained by these complex scenarios and struggle with global consistency due to their limited receptive fields. To address these limitations, a CNN-Mamba hybrid inpainting architecture is proposed based on the two-stage task decomposition paradigm. This framework employs Structure-Guided Fusion Blocks (SGFBs) to adaptively fuse structural priors from the edge inpainting stage across multi-scale levels. To enhance holistic consistency, the proposed Multi-Way Mamba Process Blocks (MMPBs) are integrated into the bottleneck, specifically adapting State Space Models (SSMs) to capture global relations in 2D murals with linear complexity. Comprehensive evaluations on mural and landscape painting datasets show that the proposed method properly restores global styles, fills in coherent details, and achieves competitive performance compared to well-established methods.
We introduce a simple yet effective technique for estimating lighting from a single low-dynamic-range (LDR) image by reframing the task as a chrome ball inpainting problem. This approach leverages a pre-trained diffusion model, Stable Diffusion XL, to overcome the generalization failures of existing methods that rely on limited HDR panorama datasets. While conceptually simple, the task remains challenging because diffusion models often insert incorrect or inconsistent content and cannot readily generate chrome balls in HDR format. Our analysis reveals that the inpainting process is highly sensitive to the initial noise in the diffusion process, occasionally resulting in unrealistic outputs. To address this, we first introduce DiffusionLight (Phongthawee et al. 2024), which uses iterative inpainting to compute a median chrome ball from multiple outputs to serve as a stable, low-frequency lighting prior that guides the generation of a high-quality final result. To generate high-dynamic-range (HDR) light probes, an Exposure LoRA is fine-tuned to create LDR images at multiple exposure values, which are then merged. While effective, DiffusionLight is time-intensive, requiring approximately 30 minutes per estimation. To reduce this overhead, we introduce DiffusionLight-Turbo, which reduces the runtime to about 30 seconds with minimal quality loss. This 60x speedup is achieved by training a Turbo LoRA to directly predict the averaged chrome balls from the iterative process. Inference is further streamlined into a single denoising pass using a LoRA swapping technique. Experimental results show that our method produces convincing light estimates across diverse settings and demonstrates superior generalization to in-the-wild scenarios.
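The "median chrome ball from multiple outputs" step can be illustrated with a per-pixel median over several inpainted samples. The sketch below is an assumption about how such a median might be taken: it works on small single-channel images stored as nested lists, whereas a real pipeline would operate per colour channel on full-resolution arrays. The name `median_image` is hypothetical.

```python
from statistics import median


def median_image(samples):
    """Per-pixel median over a list of equally sized 2D single-channel images."""
    if not samples:
        raise ValueError("at least one sample is required")
    height, width = len(samples[0]), len(samples[0][0])
    # For each pixel position, take the median across all samples.
    return [
        [median(img[y][x] for img in samples) for x in range(width)]
        for y in range(height)
    ]
```

The median is robust to the occasional unrealistic sample that the abstract attributes to sensitivity to initial noise: a single outlier value at a pixel does not move the result, which is one plausible reason to prefer it over a mean for building a stable prior.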
Atomistic resolution is essential for understanding biomolecular structure and function, yet coarse-grained (CG) models remain indispensable for simulating large and dynamic systems. Reconstructing accurate all-atom structures from CG representations, particularly across varied CG schemes and biomolecular types, remains a fundamental challenge. Moreover, structure modeling often fails for flexible or disordered regions, making the inpainting of missing regions another challenging task. Here, we present StruCloze, a deep learning framework for reconstructing atomistic structures from CG models and inpainting missing regions for both proteins and nucleic acids. StruCloze generalizes across various CG levels and biomolecule types from a single pretraining run, with fine-tuning required for optimal performance on specific representations. It achieves state-of-the-art accuracy in reconstructing both protein and nucleic acid structures and demonstrates superior transferability and speed compared to existing methods. Leveraging a masked learning strategy, StruCloze also excels at inpainting missing regions in structures, offering a practical tool for structural refinement and integrative modeling. Our framework provides a general solution for bridging reduced or incomplete representations with full atomistic detail of biomolecular structures, enabling rapid local structure prediction and further analysis of system dynamics.
Passive detection and recognition capabilities of Unmanned Underwater Vehicles (UUVs) are significantly degraded by propulsion system self-noise, characterized by pronounced modulation interference and low signal-to-noise ratios. Existing denoising methods commonly produce spectral holes because of over-suppression and insufficiently mitigate modulation interference. To overcome these limitations, this paper proposes a two-stage denoising-inpainting framework. In the first stage, a mask-based denoising network rapidly attenuates prominent self-noise to obtain a preliminarily enhanced signal. In the second stage, the Spectrum Inpainting Network (SINet) is introduced to precisely reconstruct the target spectrogram. To restore spectral holes and suppress modulation interference, SINet integrates a Modulation-Hole Restoration module to better capture modulation and contextual information. Furthermore, the framework incorporates a Shaft-Frequency Suppression Loss to guide the network's focus toward residual components within the shaft-frequency band in the Detection of Envelope Modulation on Noise (DEMON) spectrum. Extensive experiments on the ShipsEar dataset and collected UUV self-noise data demonstrate that the proposed framework can effectively suppress modulation interference and enhance target signal fidelity. The interference shaft-frequency peak-to-average ratio and spectral mean squared error are reduced by 75% and 22%, respectively, leading to a notable 6.87% improvement in target recognition accuracy.
Accurate rectal tumor segmentation using magnetic resonance imaging (MRI) is paramount for effective treatment planning. It allows for volumetric and other quantitative tumor assessments, potentially aiding in prognostication and treatment response evaluation. Manual delineation of rectal tumors and surrounding structures is time-consuming and labor-intensive. Over the past few years, deep learning has shown strong results in automated tumor segmentation in MRI. Current studies on automated rectal tumor segmentation, however, focus solely on tumoral regions without considering the rectal anatomical entities and often lack a solid multicenter external validation. In this study, we improved rectal tumor segmentation by incorporating anomaly maps derived from anatomical inpainting. The inpainting model, a U-Net-based network, was trained to reconstruct a healthy rectum and mesorectum from prostate T2-weighted images (T2WI). The rectal anomaly maps were generated from the difference between the original rectal and reconstructed pseudo-healthy slices. The derived anomaly maps were used in the downstream tumor segmentation tasks by fusing them as an additional input channel (AAnnUNet). Alternative methods for integrating rectal anatomical knowledge were evaluated as baselines, including Multi-Target nnUNet (MTnnUNet), which added rectum and mesorectum segmentation as auxiliary tasks, and Multi-Channel nnUNet (MCnnUNet), which utilized rectum and mesorectum masks as additional input channels. As part of this study, we benchmarked nine models for rectal tumor segmentation on a large multicenter (n = 705) dataset of preoperative T2WI, and nnUNet outperformed the other eight models on the external test.
Compared to nnUNet, the MTnnUNet demonstrated improvements in both the fully-supervised setting and the mixed-supervised setting, in which human-annotated tumor masks and AI-generated rectum and mesorectum masks were used, while the MCnnUNet showed benefits only in the mixed-supervised setting. Importantly, anomaly maps were strongly associated with tumoral regions, and their integration within AAnnUNet led to the best tumor segmentation results across both settings. The effectiveness of AAnnUNet demonstrated the value of the anomaly maps, indicating a promising direction for improving rectal tumor segmentation and model robustness for multicenter data.
Serial section electron microscopy (ssEM) is essential for studying biological cell structures at nanometer resolution. However, supporting film folding (SFF) degradation frequently occurs during sample preparation, causing structural distortions and information loss that severely impair downstream analyses such as 3D reconstruction and neuron segmentation. We propose RegInpaint, a novel recovery framework that jointly addresses deformation correction and missing-information restoration caused by SFF degradation. RegInpaint formulates SFF recovery as a joint problem of 3D elastic registration and image inpainting, providing a generalizable solution for ssEM restoration. Experiments on four EM datasets show that RegInpaint consistently outperforms existing methods in image restoration quality and significantly improves neuron segmentation accuracy. Source code is freely available at https://github.com/zhangzhenbang2021/RegInpaint.git.
The operational effectiveness of Unmanned Surface Vehicles (USVs) in modern naval scenarios depends on robust situational awareness. While LiDAR sensors are integral to 3D perception, their performance is frequently affected by incomplete data resulting from long-range sparsity and target occlusion. This study investigates a framework to restore incomplete point clouds to support improved surface vessel classification. The framework first estimates the target's heading angle using a 2D area projection technique, combined with a descriptor to address orientation ambiguity. Subsequently, the 3D point cloud is converted into a 2D multi-channel image representation to leverage a deep learning-based image inpainting algorithm for data restoration. Finally, a high-density keypoint extraction method is applied to the completed point cloud to generate features for classification. This image-based approach is designed to prioritize computational efficiency and inference speed, facilitating deployment on resource-constrained maritime platforms. Experiments conducted on a simulator dataset reveal that the classification of restored point clouds yields higher accuracy compared to using the original, incomplete LiDAR data, particularly at extended distances (>70 m) and challenging aspect angles (0° and 180°). The results suggest the framework's potential to address perception failures in sparse data scenarios, thereby supporting the operational envelope of USVs in contested environments.
Detecting subtle focal liver lesions on abdominal computed tomography (CT) is challenging in routine clinical practice, especially for small, low-contrast, or morphologically heterogeneous tumors acquired under variable protocols. While fully supervised liver tumor segmentation can achieve high accuracy, it requires pixel-level annotations that limit scalability and generalizability. Reconstruction-based anomaly detectors trained without hepatic anatomical constraints reduce label burden but are sensitive to textural variability, contrast-phase differences, and produce noisy, unstable boundaries. We introduce an anatomically constrained, four-stage pipeline for liver CT anomaly detection: (1) a denoising diffusion probabilistic model (DDPM) trained on unremarkable axial slices to learn a healthy prior; (2) diffusion-based inpainting within an automatically segmented whole-liver mask to generate pseudo-normal liver appearance; (3) a compact encoder-decoder trained with a liver-masked, mean squared error loss to reconstruct healthy liver tissue from paired original and inpainted inputs; and (4) a liver-scoped difference map between the original and reconstructed healthy CT slices as the final anomaly score for localization. Trained exclusively on > 13,000 healthy CT slices and evaluated on 1,000 abnormal CT slices from 109 Liver Tumor Segmentation (LiTS) benchmark patients, the method achieves Dice 0.596, intersection-over-union 0.482, area under the receiver operating characteristic curve 0.861, and 95th percentile Hausdorff distance 80.5 pixels (px). Performance improves with lesion size, with a Dice score of 0.796 for the largest quartile. Anchoring anomaly detection to hepatic anatomy with a stable healthy prior yields data-efficient liver lesion localization suitable for CT triage and prioritization.
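Stage (4) of the pipeline above, together with the Dice metric used to evaluate it, can be sketched in a few lines. This is a minimal illustration under stated assumptions: images are small nested lists, the anomaly score is taken as the absolute intensity difference inside the liver mask, and a fixed threshold converts scores to a binary prediction. The function names and the thresholding step are hypothetical; the abstract does not specify how the difference map is binarised.

```python
def liver_scoped_difference(original, reconstructed, liver_mask):
    """Absolute difference inside the liver mask; zero outside it."""
    return [
        [abs(o - r) if m else 0.0 for o, r, m in zip(o_row, r_row, m_row)]
        for o_row, r_row, m_row in zip(original, reconstructed, liver_mask)
    ]


def threshold(score_map, tau):
    """Binarise an anomaly score map: 1 where score > tau, else 0."""
    return [[1 if v > tau else 0 for v in row] for row in score_map]


def dice(pred, truth):
    """Dice coefficient between two binary masks (2D lists of 0/1)."""
    p = [v for row in pred for v in row]
    t = [v for row in truth for v in row]
    intersection = sum(1 for a, b in zip(p, t) if a and b)
    total = sum(1 for a in p if a) + sum(1 for b in t if b)
    # Two empty masks are treated as a perfect match by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

Restricting the difference to the liver mask is what keeps reconstruction errors elsewhere in the slice (ribs, bowel, table) from counting as anomalies, which matches the abstract's emphasis on anatomical constraints.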
Traditional Chinese ink paintings on paper or silk are highly susceptible to degradation. Over time, physical decay such as creases not only damages the surface but also obscures the original brushwork. Virtual restoration, as a non-contact digital intervention, has emerged as a vital tool for heritage preservation. Yet, generic generative models, most notably GANs and diffusion models, often struggle with the dense, layered textures of the Jinling School, particularly the Jimofa technique. GANs tend to "hallucinate" details that clash with traditional brushwork logic, while latent-based models can drift toward a modern aesthetic that feels disconnected from the original archaic spirit. To address these discrepancies, we propose a coarse-to-fine framework specifically calibrated for Jinling School landscapes. This coarse-to-fine architecture mirrors the traditional 'Bone-first, Ink-second' painting methodology of the Jinling School. By decoupling structural recovery (Skeleton) from texture deposition (Flesh), our computational process aligns physically with the artifact's original creation logic. Initially, a deep convolutional network restores macroscopic structural continuity, effectively smoothing creases and reclaiming the mountain's geometric silhouette. This is followed by a texture-aware refinement module that uses manifold texture grafting to inject high-frequency details into otherwise blurred regions. Experimental results indicate that, beyond restoring overall continuity, the framework appears able to recover aspects of the high-frequency "ink noise" and deep tonal peaks traditionally associated with the Jimofa technique. Crucially, comparative analysis confirms that the framework significantly reduces the risk of 'semantic hallucination' (e.g., the fabrication of non-existent objects) prevalent in large-scale generative models, ensuring distinct historical fidelity.
Quantitative assessments, specifically average gradient and edge density, show a measurable improvement over baseline models, all while the system maintains a strict adherence to the principle of "minimal intervention" in undamaged areas. By mitigating the over-smoothing typical of conventional deep learning, this work suggests a path for the digital restoration of rare, small-sample artworks that seeks to balance visual plausibility with historical rigor.
Background and Objectives: Artificial intelligence (AI) is increasingly impacting medicine by improving healthcare delivery and simplifying diagnostic and therapeutic processes. Text-guided inpainting is a promising tool in orthofacial surgery for generating ideal, patient-specific facial profiles. Materials and Methods: A total of 89 patients with dentofacial deformities (DFDs) were evaluated. The DALL-E2 platform was used to generate profilometric transformations based on textual prompts. The resulting images were assessed by three groups: patients, expert surgeons, and the general population. Results: A total of 94% of surgeons, 85% of the general population, and 79% of patients rated the AI-modified profiles as more aesthetically pleasing than the originals. The prompt inspired by runway models had the highest agreement across groups. Conclusions: Generative AI and text-guided inpainting show potential for enhancing aesthetic planning in orthofacial surgery, offering personalized treatment paths and aiding virtual surgical planning.
Objective. Magnetic particle imaging reconstructs tracer distributions using a system matrix (SM) obtained through time-consuming, noise-prone calibration measurements. Methods for addressing imperfections in measured system matrices increasingly rely on deep neural networks, yet curated training data remain scarce. This study evaluates whether physics-based simulated system matrices can be used to train deep learning (DL) models for different SM restoration tasks, i.e. denoising, accelerated calibration, upsampling, and inpainting, that generalize to measured data. Approach. A large dataset of system matrices was generated using an equilibrium magnetization model extended with uniaxial anisotropy. The dataset spans particle, scanner, and calibration parameters for 2D and 3D trajectories, and includes background noise injected from empty-frame measurements. For each restoration task, DL models were compared with classical non-learning baseline methods. Quantitative performance was evaluated on simulated data using peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM). For measured data, performance was assessed qualitatively by visual comparison of system matrices and the resulting reconstructions. Main results. The models trained solely on simulated system matrices generalized to measured data across all tasks: for denoising, DnCNN/RDN/SwinIR outperformed discrete cosine transform and soft thresholding baseline by >10 dB PSNR and up to +0.1 SSIM on simulations and led to perceptually better reconstructions of real data; for 2D upsampling, SMRnet exceeded bicubic by ∼ 20 dB PSNR and ∼ 0.08 SSIM at ×2-×4 but these gains did not transfer qualitatively to real measurements.
For 3D accelerated calibration, SMRnet matched tricubic in noiseless cases and was more robust under noise, and for 3D inpainting, biharmonic inpainting was superior when noise-free but degraded with noise, while a PConvUNet maintained quality and yielded less blurry reconstructions. Significance. The demonstrated transferability of DL models trained on simulations to real measurements mitigates the data-scarcity problem, which intensifies with model scale. This enables the development of new methods beyond current measurement capabilities and supports pre-training of large models on simulated data.
In aerial sensor systems, detecting helicopters against diverse backgrounds remains challenging due to environmental camouflage. This paper proposes an end-to-end framework for generating adaptive camouflage patterns to evade YOLO-based object detection. Starting with synthetic sensor imagery (background + transparent helicopter overlay), we employ a fine-tuned YOLOv8m for precise VTOL mask extraction, followed by KMeans clustering with Gaussian blur for dominant color extraction from the background. These colors guide Stable Diffusion inpainting to synthesize full-screen camouflage textures, which are then masked and overlapped onto the helicopter region. Evaluated on a 920-image dataset across multiple backgrounds, our method achieves a 97.6% reduction in mAP@0.5 (from 0.8175 to 0.0196) on 751 camouflaged images against a fine-tuned YOLOv8m model, with recall dropping by 95.9%. Even against a helicopter-specialized Defence model, mAP@0.5 drops by 89.6% (from 0.1178 to 0.0123). Ablation studies confirm the synergy of YOLO masking and color-guided inpainting. This sensor-fusion approach enhances stealth in unmanned aerial surveillance, with implications for civilian aviation safety.
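The percentage reductions quoted in this abstract are relative drops in mAP@0.5, and they can be checked directly from the reported before and after values. The helper below is a small arithmetic sketch; the name `relative_reduction` is illustrative only.

```python
def relative_reduction(before, after):
    """Fractional reduction from `before` to `after` (e.g. 0.976 for 97.6%)."""
    if before <= 0:
        raise ValueError("`before` must be positive")
    return (before - after) / before


# Values reported in the abstract (mAP@0.5 before and after camouflage).
fine_tuned = relative_reduction(0.8175, 0.0196)   # fine-tuned YOLOv8m
specialized = relative_reduction(0.1178, 0.0123)  # helicopter-specialized model
```

Both reported figures are consistent with this definition (about 97.6% and 89.6%). Note that the second model's absolute drop is much smaller (about 0.106 mAP) because its baseline mAP@0.5 on these images is already low at 0.1178.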