Aerospace embodied intelligence aims to empower unmanned aerial vehicles (UAVs) and other aerospace platforms to achieve autonomous perception, cognition, and action, as well as egocentric active interaction with humans and the environment. The aerospace embodied foundation model serves as an effective means to realize the autonomous intelligence of UAVs and represents a necessary pathway toward aerospace embodied intelligence. [Background] However, existing embodied foundation models primarily focus on ground-level intelligent agents in indoor scenarios, while research on UAV intelligent agents remains unexplored, lacking systematic and standardized benchmark suites. [Aim] To address this gap, this study aims to construct a comprehensive benchmark suite, AeroVerse, to facilitate the simulation, pre-training, finetuning, and evaluation of aerospace embodied foundation models. [Innovations] We develop AeroSimulator, a simulation platform that encompasses four realistic urban scenes for UAV flight simulation. Additionally, we construct the first large-scale real-world image-text pre-training dataset from a first-person UAV perspective, AerialAgent-Ego15k, and create a virtual image-text-pose alignment dataset, CyberAgent-Ego500k, to facilitate the pre-training of the aerospace embodied foundation model. We clearly define five downstream tasks for the first time, i.e., aerospace embodied scene awareness, spatial reasoning, navigational exploration, task planning, and motion decision, and have constructed corresponding instruction datasets for fine-tuning. We also develop SkyAgent-Eval, a downstream task evaluation system based on GPT-4. Furthermore, we propose SkyAgent, the first UAV-agent large model integrating "perception-reasoning-navigating-planning", which incorporates an aerospace embodied chain-of-thought mechanism and a multitask curriculum learning strategy. 
[Results] By benchmarking ten mainstream models, our results reveal the significant limitations of existing 2D/3D visual-language models in complex aerospace embodied tasks and demonstrate the superior performance of SkyAgent, which outperforms existing methods by an average of 8.52% across four core tasks, underscoring the necessity and contribution of our work. The AeroVerse benchmark suite will be released to the community to promote exploration and development of aerospace embodied intelligence.
WiFi signals, in contrast to cameras, offer privacy protection and occlusion resilience for some practical scenarios such as smart homes, elderly care, and virtual reality. Recent years have seen remarkable progress in the estimation of single-person 2D pose, single-person 3D pose, multi-person 2D pose, and single-person mesh. This paper takes a step forward by introducing Person-in-WiFi 3D, a pioneering WiFi system that accomplishes multi-person 3D human perception. Person-in-WiFi 3D has two main updates. Firstly, compared to previous systems that used only one WiFi transmitter and one WiFi receiver, it has a greater number of WiFi devices, which enhances its capability for capturing 3D spatial reflections from multiple individuals. This allows it to respond more flexibly to complicated 3D perception scenarios, achieving a breakthrough from 2D to 3D. Secondly, it utilizes a DETR-like architecture based on Hungarian matching to achieve end-to-end estimation. It employs a coarse-to-fine hierarchical refinement strategy, effectively utilizing features at the global level, human-instance level, and keypoint level, thereby achieving finer-grained estimation and reconstruction. On the basis of these designs, Person-in-WiFi 3D is storage-efficient, accurate, and fast compared to its predecessor. We deployed a proof-of-concept system in 4m × 3.5m areas and collected a multi-person human perception dataset Wiception3D of over 97K frames with seven volunteers, encompassing diverse scenes and multi-person scenarios. Wiception3D includes nearly 98k training samples and nearly 8k test samples, annotated for 3D pose estimation and 3D mesh reconstruction tasks. Person-in-WiFi 3D achieves a keypoint localization error of 93mm in 3D pose estimation and 41mm in mesh reconstruction, comparable to cameras and millimeter-wave radars. The project page is at https://aiotgroup.github.io/Person-in-WiFi-3D.
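The set-prediction matching that DETR-like models rely on can be illustrated with a small sketch. It assigns predicted poses to ground-truth poses so that the total matching cost is minimal. This is an illustration under stated assumptions, not the Person-in-WiFi 3D implementation: the cost (mean per-keypoint Euclidean distance) and the names `pose_cost` and `match_poses` are hypothetical, and exhaustive search over permutations stands in for the Hungarian algorithm, which finds the same optimum far more efficiently on larger sets.

```python
from itertools import permutations
from math import dist


def pose_cost(pred, gt):
    """Mean Euclidean distance between corresponding 3D keypoints (assumed cost)."""
    return sum(dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def match_poses(preds, gts):
    """Return (assignment, total_cost).

    assignment[i] is the index of the prediction matched to gts[i].
    Assumes len(preds) >= len(gts). Brute force is used here for clarity;
    the Hungarian algorithm gives the same optimal assignment.
    """
    best_assignment, best_cost = None, float("inf")
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(pose_cost(preds[j], gts[i]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_assignment, best_cost = perm, cost
    return best_assignment, best_cost


# Two ground-truth people and three predictions, two keypoints each (toy data).
gts = [
    [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)],
    [(2.0, 0.0, 0.0), (2.0, 0.0, 1.0)],
]
preds = [
    [(2.1, 0.0, 0.0), (2.0, 0.1, 1.0)],  # close to person 1
    [(5.0, 5.0, 0.0), (5.0, 5.0, 1.0)],  # spurious prediction
    [(0.0, 0.1, 0.0), (0.1, 0.0, 1.0)],  # close to person 0
]
assignment, total = match_poses(preds, gts)
```

In a set-prediction loss, only the matched predictions are regressed toward their ground-truth poses, while unmatched ones (the spurious prediction here) are typically trained toward a "no person" class.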
The convergence of embodied intelligence and world models has catalyzed growing interest in integrating physical laws into AI systems. While prior surveys have examined world models and embodied intelligence separately, we focus on the progression that connects these capabilities as a unified developmental pathway from passive observation to active physical comprehension. This survey provides a systematic framework revealing how physical AI advances through four interconnected stages: perception transforms sensory data into structured physical representations, reasoning derives explanations from observed phenomena, modeling enables predictive simulation grounded in physical principles, and embodied interaction closes the loop through physical manipulation and environmental feedback. Each stage enables and enhances the next: perceptual grounding supports causal reasoning, reasoning unlocks predictive capabilities, and robust models drive genuine physical interaction. Through analysis of developments spanning architectural innovations, training methodologies, causal inference, and embodied systems, we synthesize how physical understanding emerges through cumulative integration across this progression. Our framework reveals the evolution from isolated, task-specific solutions toward integrated architectures that advance from pattern recognition toward causal reasoning and counterfactual prediction. This perspective provides foundations for next-generation physical AI systems with direct implications for safe, generalizable, and interpretable deployment across robotics, scientific discovery, and autonomous systems. We maintain a continuously updated taxonomy repository at https://github.com/AI4Phys/Awesome-AI-for-Physics.
Small Object Detection (SOD) is fundamentally constrained by the inherent scarcity of visual cues in size-limited instances. This low-entropy nature frequently induces ambiguity and collapse in the learned feature space, critically undermining the efficacy of downstream tasks. Restoration-based methods offer a promising, albeit flawed, solution to this representational bottleneck. On one hand, they excel at recovering fine-grained details; on the other, their effectiveness is compromised by a reliance on synthetic corruptions that generalize poorly at inference, a problem compounded by the inherent conflict between pixel-level fidelity and semantic abstraction. To overcome these limitations, we introduce Detection-Oriented RectificAtion (DORA), a unified framework built upon a novel degradation-then-rectification paradigm. The central insight lies in the principle: knowing what degrades, knowing how to rectify. DORA first explicitly learns to deconstruct complex visual corruptions into a versatile, learnable degradation basis set, providing a structured understanding of the inherent degradation of small instances. This encoded knowledge then forms the dynamic degradation-conditioned prompt, initiating a task-oriented rectification and effectively mitigating the distribution shift at inference. Furthermore, on the foundation of a preceding entity reconstruction task, we devise a synergistic contrastive function to alleviate the task conflict by cyclically aligning rectified entity embeddings with detection-friendly exemplars, thereby robustly bridging the granularity gap between detection and rectification, ultimately facilitating a harmonious optimization of the entire framework. As a paradigm-agnostic solution, DORA can be seamlessly integrated with a wide range of detectors. 
Comprehensive experiments on five challenging SOD datasets showcase the consistent and substantial performance gains across diverse architectures, underscoring the efficacy and broad potential of our task-oriented rectification strategy.
Robust model fitting aims to estimate model parameters from data contaminated by noise and outliers in computer vision. Traditional RANSAC-based methods suffer from model hypothesis ambiguity and inefficiency due to the problems of neglecting data preference distributions and employing iterative hypothesis sampling. Learning-based methods enhance traditional methods through deep features. However, their reliance on static coordinate representations inherently lacks motion cues, hindering the analysis of complex dynamic scenes. Furthermore, the local receptive fields of CNNs inadequately capture global context. To address these issues, we propose MPCFormer, a motion-aware Transformer method via multi-channel preference filtering and multi-scale consensus smoothing for robust model fitting. It reformulates robust model fitting as a joint optimization of point classification and model estimation by integrating correspondence learning and embedding spatiotemporal motion cues, eliminating iterative hypothesis sampling. Specifically, we design a motion preference filter to explore multi-channel motion information by residual-connected Transformer layers. It explicitly encodes data preference distributions for models via multi-head preference attention, generating confidence scores to adaptively suppress outlier interference and enhance model robustness. Additionally, we present a pyramid consensus smoother with multi-scale Transformer encoding. It hierarchically captures local-to-global motion consistency through a sparse feature pyramid, effectively resolving motion ambiguity from spatial discontinuities. This module enables precise inlier identification and reliable model estimation through multi-head consensus attention. Extensive experiments demonstrate that MPCFormer outperforms state-of-the-art baselines by 4.68% mAP@5°, 1.89% AUC@3 pixel, and 1.52% F-score, even at extreme outlier ratios (up to 95%).
Out-of-Distribution (OoD) detection is vital for the reliability of deep neural networks, the key to which lies in effectively characterizing the disparities between OoD and In-Distribution (InD) data. In this work, such disparities are exploited through a fresh perspective of non-linear feature subspaces. That is, a discriminative non-linear subspace is learned from InD features to capture representative patterns of InD, while informative patterns of OoD features cannot be well captured in such a subspace due to their different distribution. Grounded on this perspective, we exploit the deviations of InD and OoD features in such a non-linear subspace for effective OoD detection. To be specific, we leverage the framework of Kernel Principal Component Analysis (KPCA) to attain the discriminative non-linear subspace and deploy the reconstruction error on such a subspace to distinguish InD and OoD data. Two challenges emerge: (i) the learning of an effective non-linear subspace, i.e., the selection of the kernel function in KPCA, and (ii) the computation of the kernel matrix with large-scale InD data. For the former, we reveal two vital non-linear patterns that closely relate to the InD-OoD disparity, leading to the establishment of a Cosine-Gaussian kernel for constructing the subspace. For the latter, we introduce two techniques to approximate the Cosine-Gaussian kernel with significantly cheaper computation. In particular, our approximation is further tailored by incorporating the InD data confidence, which is demonstrated to promote the learning of discriminative subspaces for OoD data. Our study presents new insights into the non-linear feature subspace for OoD detection and contributes practical explorations on the associated kernel design and efficient computations, yielding a KPCA detection framework with distinctively improved efficacy and efficiency.
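One plausible reading of a Cosine-Gaussian kernel is a Gaussian (RBF) kernel evaluated on l2-normalized features. The sketch below follows that reading. It is an assumption rather than the paper's definition: the function names and the bandwidth parameter `gamma` are hypothetical, and the kernel approximations and the KPCA reconstruction error mentioned in the abstract are not covered.

```python
from math import exp, sqrt


def l2_normalize(x):
    """Scale a non-zero vector to unit Euclidean norm."""
    norm = sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


def cosine_gaussian_kernel(x, y, gamma=1.0):
    """Gaussian kernel on l2-normalized inputs (assumed form, gamma is hypothetical).

    Because inputs are normalized first, the value depends only on the
    angle between x and y, not on their magnitudes.
    """
    xn, yn = l2_normalize(x), l2_normalize(y)
    sq_dist = sum((a - b) ** 2 for a, b in zip(xn, yn))
    return exp(-gamma * sq_dist)


same_direction = cosine_gaussian_kernel([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_gaussian_kernel([1.0, 0.0], [0.0, 1.0])
opposite = cosine_gaussian_kernel([1.0, 0.0], [-1.0, 0.0])
```

Under this form, vectors pointing the same way get a kernel value of 1 regardless of scale, and the value decays as the angle between them grows.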
Inspired by the impressive success of image-text foundation models, recent works have proposed to adapt these foundation models to video data, leading to efficient and effective video models for open-vocabulary action recognition. However, through a comprehensive evaluation, our work finds that state-of-the-art open-vocabulary action recognition models still struggle with generalization to video domains that they have not encountered. To address this limitation, we introduce generalizable open-vocabulary action recognition, which aims to develop action recognition models capable of generalizing to both novel action categories and unseen video domains. Our work contributes a novel model named XOV-Action to overcome two critical challenges: (1) understanding novel action concepts of open-set categories, and (2) mitigating the scenario discrepancy between training and test datasets. Specifically, XOV-Action first proposes to capture diverse action-related concepts by learning diversified elaboration representations, which enables better generalization to open-set action categories. Second, XOV-Action learns scene-agnostic video representations to overcome the scene bias, which improves generalization to unseen video domains. Additionally, to evaluate models in generalizable open-vocabulary action recognition, we contribute a new cross-domain action benchmark named XOVABench, which covers multiple video domains with varying degrees of gaps and consists of both closed-set and open-set action categories. Extensive quantitative and qualitative experiments demonstrate that our proposed XOV-Action can effectively improve the action recognition performance for both closed-set and open-set categories across video domains.
Diffusion models have demonstrated remarkable performance on vision generation tasks. However, their high computational complexity hinders wide application on edge devices. Quantization has emerged as a promising technique for inference acceleration and memory reduction. However, existing quantization methods do not generalize well under extremely low-bit (2-4 bit) quantization. Directly applying these methods will cause severe performance degradation. We identify that the existing quantization framework suffers from an outlier-unfriendly quantizer design, suboptimal initialization, and a suboptimal optimization strategy. We present MPQ-DMv2, an improved Mixed Precision Quantization framework for extremely low-bit Diffusion Models. From the quantization perspective, the imbalanced distribution caused by salient outliers is quantization-unfriendly for a uniform quantizer. We propose Flexible Z-Order Residual Mixed Quantization that utilizes an efficient binary residual branch for flexible quantization steps to handle salient error. For the optimization framework, we theoretically analyze the convergence and optimality of the LoRA module and propose Object-Oriented Low-Rank Initialization to use prior quantization error for informative initialization. We then propose Memory-based Temporal Relation Distillation to construct an online time-aware pixel queue for long-term denoising temporal information distillation, which ensures the overall temporal consistency between the quantized and full-precision models. Comprehensive experiments on various generation tasks show that our MPQ-DMv2 surpasses current SOTA methods by a great margin on different architectures, especially under extremely low-bit widths.
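The difficulty that outliers pose for a uniform quantizer, and the general idea of correcting the error with a binary residual branch, can be sketched as follows. This is a generic illustration under stated assumptions, not the Flexible Z-Order Residual Mixed Quantization described above: the min-max uniform quantizer, the 1-bit residual of the form `alpha * sign(r)` with `alpha` set to the mean absolute residual, and all function names are hypothetical choices.

```python
def uniform_quantize(values, bits):
    """Min-max uniform quantization onto 2**bits evenly spaced levels."""
    lo, hi = min(values), max(values)
    levels = 2 ** bits - 1
    step = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((v - lo) / step) * step for v in values]


def binary_residual(residuals):
    """1-bit approximation of the residuals: alpha * sign(r), alpha = mean(|r|)."""
    alpha = sum(abs(r) for r in residuals) / len(residuals)
    return [alpha if r >= 0 else -alpha for r in residuals]


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy weights: seven small values and one salient outlier.
weights = [-0.31, -0.12, -0.05, 0.02, 0.07, 0.11, 0.18, 4.0]

# With 2 bits, the outlier stretches the range so that all small
# weights collapse onto the same quantization level.
base = uniform_quantize(weights, bits=2)

# A binary residual branch adds one sign bit per weight plus one shared scale.
residual = [w - q for w, q in zip(weights, base)]
refined = [q + r for q, r in zip(base, binary_residual(residual))]

error_base = mse(weights, base)
error_refined = mse(weights, refined)
```

With this choice of `alpha`, the residual branch never increases the squared error, because the summed squared error drops by exactly `len(weights) * alpha ** 2`.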
In the past decade, many successful networks have been built on novel architectures, which almost exclusively use the same type of neuron. Recently, more and more deep learning studies have been inspired by the idea of NeuroAI and the neuronal diversity observed in human brains, leading to the proposal of novel artificial neuron designs. Designing well-performing neurons represents a new dimension relative to designing well-performing neural architectures. Biologically, the brain does not rely on a single type of neuron that universally functions in all aspects. Instead, in our brain, neurons are often task-based. In this study, we address the following question: since the human brain is a task-based neuron user, can artificial network design go from task-based architecture design to task-based neuron design? Since methodologically there are no one-size-fits-all neurons, given the same structure, task-based neurons can enhance the feature representation ability relative to the existing universal neurons due to the intrinsic inductive bias for the task. Specifically, we propose a two-step framework for prototyping task-based neurons. First, symbolic regression is used to identify optimal formulas that fit input data by utilizing base functions such as polynomials. We introduce vectorized symbolic regression (VSR), which stacks all variables in a vector and regularizes each input variable to perform the same computation, which can increase the regression speed, facilitate efficacy in high dimensions, and enable parallel computation. Second, we parameterize the acquired elementary formula to make parameters learnable, which serves as the aggregation function of the neuron. The activation functions such as ReLU and the sigmoidal functions remain the same because they have proven to be good. As the initial step, we evaluate the proposed framework using polynomials as base functions. 
Empirically, systematic experimental results on synthetic data, classic benchmarks, and real-world applications show that the proposed task-based neuron design is not only feasible but also delivers competitive performance compared with other state-of-the-art models. Our code is available at https://github.com/NewT123-WM/Task_based_neurons.
Although artificial neural network (ANN) based speech enhancement (SE) methods demonstrate excellent performance, their high computational complexity and high energy consumption hinder their deployment in practical front-end processing tasks. Spiking neural networks (SNNs) have recently shown potential in reducing power consumption. However, the discrete binary activation and complex spatio-temporal dynamics of SNNs often result in information loss. The current challenge is therefore how to maintain performance while reducing computational complexity. To address this issue, this work proposes a Dual-Branch Hybrid Neural (DBHN) Network. 1) In terms of network architecture: a dual-branch network integrating ANN and SNN was designed, where the SNN branch reduces power consumption while the ANN branch addresses information loss; the BandSplit and Time-Frequency (TF)-Mamba modules were developed to simultaneously reduce energy consumption and enhance model performance; Spiking Feature Extraction Group (SFEG) and Information Transformation Block (ITB) components were implemented with residual connections to mitigate information loss while further refining feature representations. 2) To facilitate inter-branch information fusion: an Interaction module was designed to promote information exchange at various stages of the dual-branch network; a TF-Cross Attention-Fusion module was designed to perform time-frequency domain fusion of dual-branch information while data-adaptively guiding the SNN branch to retain more critical information. Results show that the proposed model maintains superior performance across three public datasets while achieving an average 7.5-fold reduction in computational complexity compared to baseline models.
Domain shift, characterized by degraded model performance during the transfer from labeled source domains to unlabeled target domains, poses a persistent challenge for deploying deep learning systems. Current unsupervised domain adaptation (UDA) methods predominantly rely on fine-tuning feature extractors, an approach limited by high computational cost, reduced interpretability, and poor scalability to modern architectures. Our analysis reveals that models pre-trained on large-scale data exhibit domain-invariant geometric patterns in their feature space, characterized by intra-class clustering and inter-class separation, thereby preserving transferable discriminative structures. These findings suggest that cross-domain performance degradation is often associated with decision-boundary misalignment, and that correcting such misalignment can serve as an effective alternative to feature adaptation, particularly when pre-trained representations are sufficiently strong. Unlike fine-tuning entire pre-trained models, which risks introducing unpredictable feature distortions, we propose the Feature-space Planes Searcher (FPS): a novel domain adaptation framework that optimizes decision boundaries by leveraging these geometric patterns while keeping the feature encoder frozen. This streamlined approach enables interpretable analysis of adaptation while substantially reducing memory and computational costs through offline feature extraction, permitting full-dataset optimization in a single training cycle. Moreover, we introduce an Intra-Class Distance Metric (ICDM) that enables fully unsupervised hyperparameter selection without requiring target-domain labels. Evaluations on public benchmarks show that FPS achieves competitive performance, with notable gains in several settings and tasks. 
FPS scales efficiently with large multimodal models and shows versatility across diverse domains including protein structure prediction, remote sensing classification, and earthquake detection. We anticipate FPS will provide a simple, effective, and generalizable framework for domain adaptation tasks.
Diffusion-based text-to-image (T2I) models such as Stable Diffusion (SD) and DALL·E 2 enable versatile image generation but raise significant safety concerns due to their ability to produce harmful or not-safe-for-work (NSFW) content (e.g., nudity). Existing safety strategies, including prompt filtering and machine unlearning, remain limited, as they are vulnerable to biased data, model openness, and adversarial prompt attacks. Achieving safe alignment during reinforcement learning (RL) fine-tuning is thus essential, yet faces two significant challenges: alignment fragility, where models easily lose control after optimization, and the safety-quality paradox, where improving safety often degrades visual quality. To address these issues, we propose S-TRPO, a Safety-constrained Trust-Region Policy Optimization framework that enables safe and reliable alignment of diffusion models (DMs) within the manifold policy space. S-TRPO introduces a dynamic safety-control mechanism that combines danger-region perception with trust-region constraints to maintain both safety and generation fidelity. Specifically, a KL-based safety region and a static risk model jointly evaluate harmful prompt risk and restrict unsafe deviations in policy updates. Furthermore, a Lagrangian dual-control scheme balances safety constraints with image-quality optimization. Extensive experiments on real-world adversarial benchmarks demonstrate that, under white-box UnlearnDiffAtk evaluation, S-TRPO with full malicious fine-tuning reduces the attack success rate by 51.7% relative to DPOK, while maintaining comparable image-text alignment quality. These results highlight the effectiveness of S-TRPO in mitigating risky behaviors and enhancing the reliability of T2I diffusion systems.
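How a Lagrangian dual scheme trades a quality objective against a safety constraint can be seen on a one-dimensional toy problem. The sketch below is a generic primal-dual gradient method, not S-TRPO itself: the quality function q(theta) = -(theta - 2)^2, the safety cost c(theta) = theta, the budget of 1, and the step sizes are all hypothetical choices made for illustration.

```python
def quality_grad(theta):
    """Derivative of the toy quality objective q(theta) = -(theta - 2)^2."""
    return -2.0 * (theta - 2.0)


def safety_cost(theta):
    """Toy safety cost; the constraint is safety_cost(theta) <= budget."""
    return theta


def primal_dual(budget=1.0, lr=0.05, dual_lr=0.05, steps=2000):
    """Gradient ascent on theta and projected dual ascent on the multiplier.

    Lagrangian: L(theta, lam) = q(theta) - lam * (safety_cost(theta) - budget).
    """
    theta, lam = 0.0, 0.0
    for _ in range(steps):
        # Primal step: improve quality, penalized by the current multiplier
        # (the derivative of the toy cost with respect to theta is 1).
        theta += lr * (quality_grad(theta) - lam * 1.0)
        # Dual step: raise the multiplier while the constraint is violated,
        # lower it otherwise, and keep it non-negative.
        lam = max(0.0, lam + dual_lr * (safety_cost(theta) - budget))
    return theta, lam


theta, lam = primal_dual()
```

Without the constraint the quality optimum is theta = 2. With the budget of 1, the constrained optimum of this toy problem is theta = 1 with multiplier 2, and the iteration settles there: the multiplier grows only as long as the safety budget is exceeded, which is what lets it balance the two objectives automatically.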
In the context of novel view synthesis, 3D Gaussian Splatting (3DGS) has recently emerged as an efficient and competitive counterpart to Neural Radiance Field (NeRF), enabling high-fidelity photorealistic rendering in real time. Beyond novel view synthesis, the explicit and compact nature of 3DGS enables a wide range of downstream applications that require geometric and semantic understanding. This survey provides a comprehensive overview of recent progress in 3DGS applications. It first reviews the reconstruction preliminaries of 3DGS, followed by the problem formulation, 2D foundation models, and related NeRF-based research areas that inform downstream 3DGS applications. We then categorize 3DGS applications into three foundational tasks: segmentation, editing, and generation, alongside additional functional applications built upon or tightly coupled with these foundational capabilities. For each, we summarize representative methods, supervision strategies, and learning paradigms, highlighting shared design principles and emerging trends. Commonly used datasets and evaluation protocols are also summarized, along with a comparative analysis of recent methods across public benchmarks. To support ongoing research and development, a continually updated repository of papers, code, and resources is maintained at https://github.com/heshuting555/Awesome-3DGS-Applications.
Object Concept Learning (OCL) aims to recognize high-level attributes and affordances of objects and to infer the causal relationships between them. The key is to accurately model the many-to-many mapping between objects and concepts: While an object may possess multiple concepts, a concept can also belong to multiple objects. Existing methods primarily rely on attention mechanisms to capture label correlations, which limits their ability to comprehend high-level concepts and to perform effective causal reasoning. Inspired by the human cognitive process of progressive understanding, a Hierarchical Cross-Modal Relational Reasoning (CORE) framework is proposed to enhance the understanding of object concepts through hierarchical interaction and reasoning between visual and textual modalities. Specifically, a coarse-to-fine relational reasoning module is developed, in which multi-step learnable prompts are employed to progressively localize the conceptual information of objects, thereby improving the accuracy of object-concept mapping. Subsequently, to facilitate the modeling of causal relationships between object attributes and affordances, a counterfactual reasoning mechanism is introduced. By constructing counterfactual samples and distinguishing the predictive outputs of factual and counterfactual parts, the model's ability to capture causality among concepts is enhanced. Significant performance gains and extensive visualization analysis demonstrate the superiority of our method.
Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have led to significant progress in 2D body pose estimation. However, achieving a good balance between accuracy, efficiency, and robustness remains a challenge. For instance, CNNs are computationally efficient but struggle with long-range dependencies, while ViTs excel in capturing such dependencies but suffer from quadratic computational complexity. This paper proposes two ViT-based models for accurate, efficient, and robust 2D pose estimation. The first one, EViTPose, operates in a computationally efficient manner without sacrificing accuracy by utilizing learnable joint tokens to select and process a subset of the most important body patches, enabling us to control the trade-off between accuracy and efficiency by changing the number of patches to be processed. The second one, UniTransPose, while not allowing for the same level of direct control over the trade-off, efficiently handles multiple scales by combining (1) an efficient multi-scale transformer encoder that uses both local and global attention with (2) an efficient sub-pixel CNN decoder for better speed and accuracy. Moreover, by incorporating all joints from different benchmarks into a unified skeletal representation, we train robust methods that learn from multiple datasets simultaneously and perform well across a range of scenarios, including pose variations, lighting conditions, and occlusions. Experiments on six benchmarks demonstrate that the proposed methods significantly outperform state-of-the-art methods while improving computational efficiency. EViTPose exhibits a significant decrease in computational complexity (30% to 44% less in GFLOPs) with a minimal drop of accuracy (0% to 3.5% less), and UniTransPose achieves accuracy improvements ranging from 0.9% to 43.8% across these benchmarks.
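The idea of using joint tokens to keep only the most relevant body patches can be sketched as a top-k selection over attention scores. This is a hypothetical illustration, since the abstract does not specify EViTPose's selection rule: scoring each patch by the total softmax attention it receives from all joint tokens, and the names `select_patches` and `keep`, are assumptions.

```python
from math import exp


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def dot(a, b):
    """Dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_patches(joint_tokens, patch_tokens, keep):
    """Indices (in ascending order) of the `keep` patches that receive the
    most total attention from the joint tokens."""
    importance = [0.0] * len(patch_tokens)
    for joint in joint_tokens:
        attention = softmax([dot(joint, patch) for patch in patch_tokens])
        for i, a in enumerate(attention):
            importance[i] += a
    ranked = sorted(range(len(patch_tokens)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:keep])


# Two joint tokens and four patch tokens (toy 2-D embeddings).
joint_tokens = [[1.0, 0.0], [0.0, 1.0]]
patch_tokens = [[3.0, 0.0], [0.0, 0.0], [0.0, 3.0], [0.1, 0.1]]
kept = select_patches(joint_tokens, patch_tokens, keep=2)
```

Changing `keep` is the knob that trades accuracy for computation in this sketch: later layers only process the kept patches, so fewer patches means fewer operations.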
Robust principal component analysis (RPCA) decomposes an observation matrix into low-rank background and sparse object components. This capability has enabled its application in tasks ranging from image restoration to segmentation. However, traditional RPCA models suffer from computational burdens caused by matrix operations, reliance on finely tuned hyperparameters, and rigid priors that limit adaptability in dynamic scenarios. To overcome these limitations, we propose RPCANet++, a sparse object segmentation framework that fuses the interpretability of RPCA with efficient deep architectures. By explicitly mapping the iterative optimization steps to network layers, we bridge the gap between physical interpretability and deep representation learning. Our approach unfolds a relaxed RPCA model into a structured network comprising a Background Approximation Module (BAM), an Object Extraction Module (OEM), and an Image Restoration Module (IRM). To mitigate inter-stage transmission loss in BAM, we introduce a Memory-Augmented Module (MAM) to enhance background feature preservation, while a Deep Contrast Prior Module (DCPM) leverages saliency cues to expedite object extraction. Extensive experiments on diverse datasets demonstrate that RPCANet++ achieves state-of-the-art performance on Infrared Small Target Detection (IRSTD) and Vessel Segmentation (VS) tasks, while maintaining competitive results on Defect Detection (DD) tasks involving larger targets. We further improve interpretability via visual and numerical low-rankness and sparsity measurements. By combining the theoretical strengths of RPCA with the efficiency of deep networks, our approach sets a new baseline for reliable and interpretable sparse object segmentation. Codes are available at our Project Webpage.
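For context, classical RPCA solvers alternate between a low-rank update for the background and a sparse update for the objects, and deep unfolding maps such iterative steps to network layers. The sketch below shows only the classical sparse update, element-wise soft-thresholding (the proximal operator of the l1 norm), on a toy matrix. It is not RPCANet++: the names `soft_threshold`, `sparse_update`, and the threshold `tau` are hypothetical, the background estimate is assumed to be given, and the low-rank update (which needs a singular value decomposition) is omitted.

```python
def soft_threshold(x, tau):
    """Shrink x toward zero by tau; values within [-tau, tau] become 0."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def sparse_update(observation, background, tau):
    """Sparse-component update: soft-threshold the residual D - B element-wise."""
    return [
        [soft_threshold(d - b, tau) for d, b in zip(d_row, b_row)]
        for d_row, b_row in zip(observation, background)
    ]


# Toy data: a constant (rank-1) background with one bright sparse target.
background = [[1.0] * 4 for _ in range(3)]
observation = [row[:] for row in background]
observation[1][2] += 5.0

objects = sparse_update(observation, background, tau=0.5)
```

Only the entry holding the target survives the threshold, which is the behavior the sparsity prior is meant to enforce. In an unfolded network, a learned module takes the place of this fixed operator and its hand-tuned threshold.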
Remote sensing change detection based on a map reference and an up-to-date image boosts timely observation of the Earth's surface when earlier images are lacking for comparison. However, the semantic gap between high-level map categories and low-level image details hinders the extraction of homogeneous features for robust temporal association in change detection. Unlike conventional approaches that either compare pixel-level visual similarity or propagate segmentation errors, we propose a novel framework, Language-VIsion Discriminator for dEtecting changes (LaVIDE), which bridges the semantic gap between high-level map categories and low-level image details using language as an intermediary. Specifically, we introduce restricted prompt learning to generate context-aware textual prompts that align map semantics with image content, and an object-aware embedding enhancement strategy to integrate object-level attributes (e.g., shape, boundary) into map representations. These components enable robust cross-modal alignment within a unified language-vision feature space. Extensive experiments on four benchmarks, DynamicEarthNet, HRSCD, BANDON, and SECOND, demonstrate that LaVIDE outperforms state-of-the-art methods by significant margins, achieving 18.4% and 5.2% improvements in IoU on multi-class and single-class change detection tasks, respectively. Our framework not only advances the accuracy of map-image change detection but also provides a practical solution for rapid map updating with minimal human intervention, promising broad impacts in urban planning, disaster assessment, and ecological conservation. Code and datasets are available at: https://github.com/ShuGuoJ/LAVIDE.git.
In this paper, we study a practical but underexplored problem in Vision-Language Models (VLMs), i.e., negation understanding. Specifically, many real-world applications require models to explicitly identify what is false or non-existent, e.g., radiologists may search for images that exclude specific conditions. Despite the impressive transferability of VLMs through large-scale training, they suffer from a critical limitation: they fail to handle negation. To address this challenge, existing methods attribute its root cause to the scarcity of negation training data and propose to fine-tune VLMs on massive data containing explicit negation. Such data-centric solutions demand substantial data and computational resources, limiting their sustainable widespread adoption. To tackle negation in a low-carbon manner, we empirically observe that the key obstacle lies in the dual-concept shifts between the affirmation and negation distributions. Therefore, we propose a Negation-Aware Test-Time Adaptation (NEAT) method to efficiently adjust distribution-related parameters during inference. In brief, NEAT can reduce distribution shift in consistent semantics while eliminating false distributional consistency in unrelated semantics. Extensive experiments on various negation understanding tasks verify the effectiveness of the proposed method. Remarkably, with less than 0.01% of trainable parameters, NEAT achieves comparable or superior performance to state-of-the-art post-training approaches. Our code is available at https://github.com/hhc1997/NEAT.
Semantic mismatch remains a key challenge in conventional knowledge distillation, where representational features are typically regressed from the teacher to the student in a one-to-one spatial matching fashion. In this paper, we address semantic mismatch by examining architectural differences between teacher and student networks. Specifically, due to the variations in network width and depth, the teacher network has a larger receptive field than the student, enabling it to integrate a broader spatial context. In contrast, the student model captures more localized features. This disparity exacerbates semantic misalignment. To alleviate this issue, we propose a novel one-to-all spatial matching knowledge distillation approach, wherein each pixel of the teacher's feature is distilled to all spatial locations of the student's feature map, weighted by a similarity map produced by a Target-aware Transformer (TaT). To further enhance TaT, we reduce its quadratic computational complexity and prevent incorrect spatial alignment, such as distilling background regions from the teacher to foreground regions in the student, and vice versa. In addition, we introduce the "looking broader" strategy, which rearranges the distilled representations of the student and teacher to align their receptive fields. This strategy is motivated by the observation that while individual pixels in student features typically have smaller receptive fields, aggregating multiple pixels can effectively bridge this gap. Therefore, we propose integrating feature pixels from multiple spatial positions using an efficient matrix multiplication. We validate our method through extensive experiments and demonstrate its superior performance and broad generalization capability across various backbone networks and vision tasks, including image classification, semantic segmentation, and object detection.
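The one-to-all spatial matching idea described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the learned projections of the Target-aware Transformer are omitted, the similarity map is a plain softmax over dot products, and the function names `softmax` and `one_to_all_distill_loss` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def one_to_all_distill_loss(teacher, student):
    """Similarity-weighted one-to-all feature matching (simplified).

    teacher, student: arrays of shape (N, C), where N = H * W flattened
    spatial positions and C is the channel dimension. Assumption: both
    feature maps were already projected to the same shape.
    """
    # sim[i, j]: how strongly teacher pixel i is matched to student pixel j.
    # Each row is a distribution over ALL student positions.
    sim = softmax(teacher @ student.T, axis=1)      # shape (N, N)
    # For every teacher pixel, aggregate all student pixels by similarity.
    reconfigured = sim @ student                    # shape (N, C)
    # Regress the aggregated student feature onto the teacher feature.
    loss = float(np.mean((reconfigured - teacher) ** 2))
    return loss, sim
```

The `(N, N)` similarity map is what makes this quadratic in the number of spatial positions, which is the cost the abstract says the method reduces.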
Custom diffusion models (CDMs) have garnered significant interest owing to their remarkable capacity for generating personalized concepts. However, the majority of CDMs unrealistically presume that the user's collection of personalized concepts is static and incapable of incremental growth over time. Furthermore, they exhibit significant catastrophic forgetting and concept neglect of previously learned concepts when incrementally learning a sequence of new ones. To resolve the above challenges, we develop a novel Continually Customizable Diffusion Model (CCDM), enabling users to perform concept-incremental versatile customization. Specifically, we design an attribute-decoupled LoRA (AD-LoRA) module and a relevance-guided AD-LoRA aggregation strategy to mitigate catastrophic forgetting. They can preserve concept-specific attributes of each task and leverage beneficial inter-task correlations to enhance the continual learning of new customization tasks. Additionally, to address the challenge of concept neglect, we propose a controllable regional context synthesis strategy that performs multi-concept composition in alignment with user-provided conditions. This strategy enhances the overall consistency in multi-concept synthesis by guaranteeing semantic independence between user-defined regions and their smooth boundary transitions. Experiments show our CCDM exhibits significant improvements over baseline methods.
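As a rough illustration of what a relevance-guided aggregation of LoRA modules could look like, the sketch below merges several low-rank updates into a base weight matrix, weighting each by a softmax over per-task relevance scores. Everything here is an assumption made for illustration: the abstract does not specify how relevance is computed or how AD-LoRA decouples attributes, and the function name `aggregate_lora` is hypothetical.

```python
import numpy as np

def aggregate_lora(base_weight, lora_pairs, relevance):
    """Merge low-rank updates into a base weight, weighted by relevance.

    base_weight: array of shape (d_out, d_in).
    lora_pairs:  list of (B, A) with B of shape (d_out, r) and A of shape
                 (r, d_in), one pair per previously learned task.
    relevance:   one score per task (assumed to be given; how it is
                 derived is not specified in the source).
    """
    relevance = np.asarray(relevance, dtype=float)
    # Softmax so the aggregation weights are positive and sum to 1.
    weights = np.exp(relevance - relevance.max())
    weights /= weights.sum()
    merged = np.array(base_weight, dtype=float)  # copy; input is left untouched
    for w, (B, A) in zip(weights, lora_pairs):
        merged += w * (B @ A)  # add the scaled low-rank update B A
    return merged, weights
```

Keeping each task's `(B, A)` pair separate and combining them only at merge time is one way to preserve earlier concepts while still reusing correlated ones.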