The goal of Open-Vocabulary Compositional Zero-Shot Learning (OV-CZSL) is to recognize attribute-object compositions in the open-vocabulary setting, where compositions of both seen and unseen attributes and objects are evaluated. Recently, prompt tuning methods have demonstrated strong generalization capabilities in the closed setting, where only compositions of seen attributes and objects are evaluated, i.e., Compositional Zero-Shot Learning (CZSL). However, directly applying these methods to OV-CZSL may not be sufficient to generalize to unseen attributes, objects, and their compositions, as prompt tuning is limited to seen attributes and objects. When faced with unseen concepts, humans typically draw analogies with seen concepts that have similar semantics, thereby inferring their meaning (e.g., "wet" and "damp", "shirt" and "jacket"). In this paper, we experimentally show that semantically related attributes or objects tend to form consistent local structures in the embedding space. Based on these structures, we propose the Structure-aware Prompt Adaptation (SPA) method, which enables models to generalize from seen to unseen attributes and objects.
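The local-structure intuition above lends itself to a simple nearest-neighbor picture. Below is a minimal sketch, assuming cosine similarity and a convex blend toward the neighborhood centroid; the function name, the averaging rule, and all dimensions are illustrative assumptions, not the SPA method itself.

```python
# Hypothetical sketch: adapt an unseen attribute embedding using the local
# neighborhood of semantically similar seen attributes. Not the SPA method.
import numpy as np

def adapt_unseen_embedding(unseen_emb, seen_embs, k=3, alpha=0.5):
    """Pull an unseen embedding toward the mean of its k nearest seen neighbors."""
    # Cosine similarity between the unseen embedding and every seen embedding.
    seen = seen_embs / np.linalg.norm(seen_embs, axis=1, keepdims=True)
    query = unseen_emb / np.linalg.norm(unseen_emb)
    sims = seen @ query
    neighbors = seen_embs[np.argsort(-sims)[:k]]   # k most similar seen attributes
    local_structure = neighbors.mean(axis=0)       # centroid of the local neighborhood
    return (1 - alpha) * unseen_emb + alpha * local_structure

rng = np.random.default_rng(0)
seen_attr_embs = rng.normal(size=(50, 512))        # e.g., embeddings of "wet", "dry", ...
damp = rng.normal(size=512)                        # hypothetical unseen attribute "damp"
print(adapt_unseen_embedding(damp, seen_attr_embs).shape)  # (512,)
```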
Open-vocabulary learning requires modeling the data distribution in open environments, which consists of both seen-class and unseen-class data. Existing methods estimate this distribution using seen-class data alone, and the absence of unseen classes makes the estimation error inherently unidentifiable. Intuitively, learning beyond the seen classes is crucial if the estimation error is to be bounded. We theoretically demonstrate that the distribution can be effectively estimated by generating unseen-class data, through which the estimation error is upper-bounded. Building on this theoretical insight, we propose a novel open-vocabulary learning method that generates unseen-class data for estimating the distribution in open environments. The method consists of a class-domain-wise data generation pipeline and a distribution alignment algorithm. The data generation pipeline generates unseen-class data under the guidance of a hierarchical semantic tree and domain information inferred from the seen-class data, facilitating accurate distribution estimation. With the generated data, the distribution alignment algorithm estimates and maximizes the posterior.
We present the evaluation methodology, datasets, and results of the BOP Challenge 2023, the fifth in a series of public competitions organized to capture the state of the art in model-based 6D object pose estimation from an RGB/RGB-D image and related tasks. Besides the three tasks from 2022 (model-based 2D detection, 2D segmentation, and 6D localization of objects seen during training), the 2023 challenge introduced new variants of these tasks focused on objects unseen during training. In the new tasks, methods were required to learn new objects during a short onboarding stage (max 5 minutes, 1 GPU) from provided 3D object models. The best 2023 method for 6D localization of unseen objects (GenFlow) notably reached the accuracy of the best 2020 method for seen objects (CosyPose), although it is noticeably slower. The best 2023 method for seen objects (GPose) achieved a moderate accuracy improvement but a significant 43% run-time improvement compared to the best 2022 counterpart (GDRNPP). Since 2017, the accuracy of 6D localization of seen objects has improved by more than 50% (from 56.9 to 85.6 AR_C). The online evaluation system remains open and is available at: http://bop.felk.cvut.cz.
Few-shot and zero-shot text classification aim to recognize samples from novel classes with limited labeled samples or no labeled samples at all. While prevailing methods have shown promising performance via transferring knowledge from seen classes to unseen classes, they are still limited by two issues: (1) inherent dissimilarities among classes make the transfer of features learned from seen classes to unseen classes both difficult and inefficient; (2) scarce labeled novel samples usually cannot provide enough supervision signals to enable the model to adjust from the source distribution to the target distribution, especially in complicated scenarios. To alleviate these issues, we propose a simple and effective strategy for few-shot and zero-shot text classification. We aim to liberate the model from the confines of seen classes, thereby enabling it to predict unseen categories without training on seen classes. Specifically, to mine more relevant unseen-category knowledge, we utilize a large pre-trained language model to generate pseudo novel samples and select the most representative ones as category anchors. After that, we reformulate the multi-class classification problem around these category anchors.
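A toy sketch of how category anchors could work follows, under stated assumptions: pseudo samples for an unseen category are embedded, the one closest to the category centroid is kept as the anchor, and test inputs are labeled by nearest anchor. The embedding step and selection rule are stand-ins, not the paper's exact strategy.

```python
# Hypothetical anchor selection and nearest-anchor prediction.
import numpy as np

def pick_anchor(pseudo_embs):
    """Choose the pseudo sample nearest to the category centroid as the anchor."""
    centroid = pseudo_embs.mean(axis=0)
    dists = np.linalg.norm(pseudo_embs - centroid, axis=1)
    return pseudo_embs[dists.argmin()]

def classify(x, anchors):
    """Nearest-anchor prediction over the unseen categories (cosine similarity)."""
    sims = anchors @ x / (np.linalg.norm(anchors, axis=1) * np.linalg.norm(x))
    return int(sims.argmax())

rng = np.random.default_rng(1)
# Three hypothetical unseen categories, 20 generated pseudo samples each.
anchors = np.stack([pick_anchor(rng.normal(size=(20, 256)) + c) for c in range(3)])
print(classify(rng.normal(size=256) + 2, anchors))
```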
The rapid advancement of Large Language Models (LLMs), particularly those trained on multilingual corpora, has intensified the need for a deeper understanding of their performance across a diverse range of languages and model sizes. Our research addresses this need by studying the performance and scaling behavior of multilingual LLMs on text classification and machine translation tasks across 204 languages. We systematically examine both seen and unseen languages across three model families of varying sizes in zero-shot and few-shot settings. Our findings show significant differences in scaling behavior between zero-shot and two-shot scenarios, with striking disparities in performance between seen and unseen languages. Model scale has little effect on zero-shot performance, which remains mostly flat. In two-shot settings, however, larger models show clear linear improvements in multilingual text classification. For translation tasks, only the instruction-tuned model showed clear benefits from scaling. Our analysis also suggests that overall resource levels, not just the proportions of pretraining languages, are better predictors of model performance, shedding light on what drives multilingual capability.
Video outpainting aims to expand the visible content of a video beyond the original frame boundaries while preserving spatial fidelity and temporal coherence across frames. Existing methods primarily rely on large-scale generative models, such as diffusion models. However, generation-based approaches suffer from implicit temporal modeling and limited spatial context. These limitations lead to intra-frame and inter-frame inconsistencies, which become particularly pronounced in dynamic scenes and large outpainting scenarios. To overcome these challenges, we propose Seen-to-Scene, a novel framework that unifies propagation-based and generation-based paradigms for video outpainting. Specifically, Seen-to-Scene leverages flow-based propagation with a flow completion network pre-trained for video inpainting, which is fine-tuned in an end-to-end manner to bridge the domain gap and reconstruct coherent motion fields. To further improve the efficiency and reliability of propagation, we introduce a reference-guided latent propagation scheme that effectively propagates source content across frames. Extensive experiments demonstrate that our method achieves superior temporal coherence and visual realism.
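To make the propagation primitive concrete, here is a minimal sketch of flow-based backward warping, assuming a dense flow field from the target frame back to a source frame. It illustrates the basic operation only, not the Seen-to-Scene pipeline or its networks.

```python
# Hypothetical backward-warp: pull source pixels into the target frame via flow.
import torch
import torch.nn.functional as F

def backward_warp(source, flow):
    """Warp a source frame (N,C,H,W) using flow (N,2,H,W) in pixel units."""
    n, _, h, w = source.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow  # sampling positions
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    grid_x = 2.0 * grid[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)                     # (N,H,W,2)
    return F.grid_sample(source, grid, mode="bilinear", align_corners=True)

frame = torch.rand(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)   # zero flow: output should equal input
print(torch.allclose(backward_warp(frame, flow), frame, atol=1e-5))
```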
Zero-shot learning (ZSL) addresses the unseen-class recognition problem by leveraging semantic information to transfer knowledge from seen classes to unseen classes. Generative models synthesize unseen visual features and thereby convert ZSL into a classical supervised learning problem. These generative models are trained on the seen classes and are expected to implicitly transfer knowledge from seen to unseen classes. However, their performance is stymied by overfitting, which leads to substandard performance on Generalized Zero-Shot Learning (GZSL). To address this concern, we propose LsrGAN, a novel generative model that Leverages the Semantic Relationship between seen and unseen categories and explicitly performs knowledge transfer through a novel Semantic Regularized Loss (SR-Loss). The SR-Loss guides LsrGAN to generate visual features that mirror the semantic relationships between seen and unseen classes. Experiments on seven benchmark datasets, including the challenging Wikipedia text-based CUB and NABirds splits and the attribute-based AWA, CUB, and SUN, demonstrate the superiority of LsrGAN over previous state-of-the-art approaches under both ZSL and GZSL settings.
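An illustrative sketch of a semantic-relationship regularizer in the spirit of the SR-Loss above: class-wise similarities of generated visual features are pushed toward the corresponding semantic similarities. The exact form used by LsrGAN may differ; this is an assumption-labeled approximation.

```python
# Hypothetical regularizer: match the cosine-similarity structure of generated
# class features to the similarity structure of class semantic vectors.
import torch
import torch.nn.functional as F

def semantic_regularized_loss(gen_feats, semantic_embs):
    """gen_feats: (num_classes, feat_dim) mean generated feature per class.
    semantic_embs: (num_classes, sem_dim) class semantic vectors."""
    vis_sim = F.cosine_similarity(gen_feats.unsqueeze(1), gen_feats.unsqueeze(0), dim=-1)
    sem_sim = F.cosine_similarity(semantic_embs.unsqueeze(1), semantic_embs.unsqueeze(0), dim=-1)
    return F.mse_loss(vis_sim, sem_sim)   # penalize mismatch between the two structures

loss = semantic_regularized_loss(torch.randn(10, 2048), torch.randn(10, 300))
print(loss.item())
```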
Separation provision and collision avoidance to avoid other air traffic are fundamental components of the layered conflict management system that ensures safe and efficient operations. Pilots have visual separation responsibilities, to see and be seen, in order to maintain separation between aircraft. To safely integrate into the airspace, drones should be required to have a minimum level of performance based on the safety achieved by see-and-be-seen interactions between crewed aircraft. Drone interactions with crewed aircraft should not be more hazardous than interactions between traditional aviation aircraft. Accordingly, there is a need for a methodology to design and evaluate detect-and-avoid systems, to be carried by drones to mitigate the risk of a midair collision, where the methodology explicitly addresses, both semantically and mathematically, the appropriate operating rules associated with see and be seen. In response, we simulated how onboard pilots safely operate through see-and-be-seen interactions using an updated visual acquisition model originally developed by J.W. Andrews decades ago. The Monte Carlo simulations were representative of two aircraft flying under visual flight rules.
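A toy Monte Carlo in the spirit of the visual acquisition modeling above, assuming an Andrews-style instantaneous acquisition rate lam = beta * A / r^2 (beta: empirical pilot constant, A: target visual area, r: range) integrated as a nonhomogeneous Poisson process. All numbers and the head-on geometry are illustrative assumptions, not the study's validated parameters.

```python
# Hypothetical two-aircraft visual acquisition Monte Carlo.
import numpy as np

def p_acquire(ranges_m, dt_s, beta=1.5e4, area_m2=10.0):
    """P(visual acquisition before closest approach) for one encounter."""
    lam = beta * area_m2 / ranges_m**2           # acquisition rate over time
    return 1.0 - np.exp(-np.sum(lam * dt_s))     # nonhomogeneous Poisson model

rng = np.random.default_rng(2)
dt = 1.0
probs = []
for _ in range(1000):
    closing_speed = rng.uniform(50.0, 150.0)     # m/s, head-on closure (assumed)
    r0 = 10_000.0                                # start 10 km apart (assumed)
    t = np.arange(0.0, r0 / closing_speed, dt)
    ranges = np.maximum(r0 - closing_speed * t, 50.0)
    probs.append(p_acquire(ranges, dt))
print(f"mean acquisition probability: {np.mean(probs):.3f}")
```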
The Fermi Gamma-ray Space Telescope is currently celebrating its 15th anniversary of operation. Since its launch, the Fermi Large Area Telescope (LAT), the main instrument onboard the Fermi satellite, has remarkably unveiled the sky at GeV energies, providing outstanding results in time-domain gamma-ray astrophysics. In particular, the LAT has observed some of the most powerful transient phenomena in the Universe (such as gamma-ray bursts, blazar flares, and magnetar flares), making it possible to test our current understanding of the laws of physics in extreme conditions. In this paper I review some of the main recent results, with a focus on the transient phenomena seen by the LAT with a multi-wavelength and multi-messenger connection.
Generalized zero-shot learning recognizes inputs from both seen and unseen classes. Yet, existing methods tend to be biased towards the classes seen during training. In this paper, we strive to mitigate this bias. We propose a bias-aware learner to map inputs to a semantic embedding space for generalized zero-shot learning. During training, the model learns to regress to real-valued class prototypes in the embedding space with temperature scaling, while a margin-based bidirectional entropy term regularizes seen and unseen probabilities. Relying on a real-valued semantic embedding space provides a versatile approach, as the model can operate on different types of semantic information for both seen and unseen classes. Experiments are carried out on four benchmarks for generalized zero-shot learning and demonstrate the benefits of the proposed bias-aware classifier, both as a stand-alone method and in combination with generated features.
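A hedged sketch of the bias-mitigation idea described above: logits are temperature-scaled similarities to real-valued class prototypes, and a margin-based term balances probability mass between seen and unseen classes. In place of the paper's bidirectional entropy term, this sketch uses a simpler probability-mass balance; it illustrates the idea, not the exact regularizer.

```python
# Hypothetical bias-aware loss with prototype regression and a seen/unseen
# mass-balance regularizer (a stand-in for the paper's entropy term).
import torch
import torch.nn.functional as F

def bias_aware_loss(embeddings, prototypes, labels, seen_mask, temp=0.04, margin=0.1):
    logits = embeddings @ prototypes.t() / temp       # temperature-scaled similarities
    ce = F.cross_entropy(logits, labels)              # regression toward the true prototype
    probs = logits.softmax(dim=-1)
    p_seen = probs[:, seen_mask].sum(dim=-1)          # total mass on seen classes
    p_unseen = 1.0 - p_seen
    # Penalize mass imbalance beyond a margin, in both directions.
    reg = F.relu((p_seen - p_unseen).abs() - margin).mean()
    return ce + reg

protos = F.normalize(torch.randn(20, 128), dim=-1)    # 20 class prototypes (assumed)
emb = F.normalize(torch.randn(8, 128), dim=-1)
mask = torch.arange(20) < 15                          # first 15 classes seen (assumed)
print(bias_aware_loss(emb, protos, torch.randint(0, 15, (8,)), mask).item())
```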
Emotion classification of speech and assessment of emotion strength are required in applications such as emotional text-to-speech and voice conversion. An emotion attribute ranking function based on a Support Vector Machine (SVM) was previously proposed to predict emotion strength for emotional speech corpora. However, the trained ranking function does not generalize to new domains, which limits the scope of applications, especially for out-of-domain or unseen speech. In this paper, we propose a data-driven deep learning model, i.e., StrengthNet, to improve the generalization of emotion strength assessment for seen and unseen speech. This is achieved by fusing emotional data from various domains. We follow a multi-task learning network architecture that includes an acoustic encoder, a strength predictor, and an auxiliary emotion predictor. Experiments show that the emotion strength predicted by the proposed StrengthNet is highly correlated with ground truth scores for both seen and unseen speech. We release the source code at: https://github.com/ttslr/StrengthNet.
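A minimal sketch of the multi-task layout described above: a shared acoustic encoder feeding a strength regressor and an auxiliary emotion classifier. Layer sizes and the BiLSTM choice are illustrative assumptions; see the linked repository for the actual architecture.

```python
# Hypothetical multi-task network: shared encoder, two task heads.
import torch
import torch.nn as nn

class StrengthNetSketch(nn.Module):
    def __init__(self, n_mels=80, hidden=128, n_emotions=5):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, batch_first=True, bidirectional=True)
        self.strength_head = nn.Linear(2 * hidden, 1)           # scalar strength score
        self.emotion_head = nn.Linear(2 * hidden, n_emotions)   # auxiliary classifier

    def forward(self, mels):                                    # mels: (batch, frames, n_mels)
        feats, _ = self.encoder(mels)
        pooled = feats.mean(dim=1)                              # average over time
        return self.strength_head(pooled).squeeze(-1), self.emotion_head(pooled)

strength, emotion_logits = StrengthNetSketch()(torch.randn(4, 200, 80))
print(strength.shape, emotion_logits.shape)                     # (4,) (4, 5)
```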
We present a meta-learning based generative model for zero-shot learning (ZSL) in a challenging setting where the number of training examples from each \emph{seen} class is very small. This setup contrasts with conventional ZSL approaches, where training typically assumes the availability of a sufficiently large number of training examples from each seen class. The proposed approach leverages meta-learning to train a deep generative model that integrates a variational autoencoder and generative adversarial networks. We propose a novel task distribution in which meta-train and meta-validation classes are disjoint, simulating the ZSL behaviour during training. Once trained, the model can generate synthetic examples from both seen and unseen classes. The synthesized samples can then be used to train a ZSL framework in a supervised manner. The meta-learner enables our model to generate high-fidelity samples using only a small number of training examples from seen classes. We conduct extensive experiments and ablation studies on four benchmark ZSL datasets and observe that the proposed model outperforms state-of-the-art approaches by a significant margin when the number of examples per seen class is small.
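The disjoint task distribution described above can be made concrete with a small sampling sketch: each meta-task draws its meta-train and meta-validation classes from non-overlapping pools, mimicking the seen/unseen split during training. Pool sizes and task shape are illustrative assumptions.

```python
# Hypothetical meta-task sampler with disjoint class splits.
import random

def sample_task(classes, n_train=5, n_val=5, seed=None):
    """Split sampled classes so meta-train and meta-validation never overlap."""
    rng = random.Random(seed)
    picked = rng.sample(classes, n_train + n_val)
    return picked[:n_train], picked[n_train:]   # disjoint by construction

all_seen_classes = [f"class_{i}" for i in range(50)]
meta_train, meta_val = sample_task(all_seen_classes, seed=0)
assert not set(meta_train) & set(meta_val)      # ZSL-style disjointness holds
print(meta_train, meta_val)
```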
We consider a square expanding at constant speed, as seen from an observer moving away with constant acceleration, and study the distribution of angles between rays from the observer towards the lattice points in the square. We prove the existence of the gap distribution as time tends to infinity and provide explicit formulas for the corresponding density function.
We perform a calculation of the interaction of the $ D \bar{D} $, $ D_{s} \bar{D}_{s} $ coupled channels and find two bound states, one coupling to $ D \bar{D} $ and another one at higher energies coupling mostly to $D_{s}^{+} D_{s}^{-}$. We identify the latter state with the $X_{0}(3930)$ seen in the $D^{+} D^{-}$ mass distribution in the $B^+ \to D^{+} D^{-} K^{+} $ decay, and also show that it produces an enhancement of the $D_{s}^{+} D_{s}^{-}$ mass distribution close to threshold, which is compatible with the recent LHCb observation in the $B^+ \to D_{s}^{+} D_{s}^{-} K^{+} $ decay that has been identified as a new state, the $X_{0}(3960)$.
Explaining the predictions obtained from graph neural networks (GNNs) is critical for the credible use of GNN models in real-world problems. Owing to the rapid growth of GNN applications, recent progress in explaining GNN predictions, such as sensitivity analysis, perturbation methods, and attribution methods, has shown great opportunities for explaining GNN predictions. In this study, we propose SEEN, a method that improves the explanation quality of node classification tasks in a post hoc manner by aggregating auxiliary explanations from important neighboring nodes. Applying SEEN does not require modification of the graph and can be used with diverse explainability techniques owing to its independent mechanism. Experiments on matching motif-participating nodes in a given graph show improvements in explanation accuracy of up to 12.71% and demonstrate the correlation between the auxiliary explanations and the enhanced explanation accuracy. SEEN provides a simple but effective way to enhance the explanation quality of GNN model outputs and is applicable in combination with diverse explainability methods.
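An illustrative sketch of the aggregation idea above: a node's explanation vector is combined with auxiliary explanations from important neighbors, weighted by an importance score. The scoring and blending rules are assumptions for illustration, not SEEN's exact mechanism.

```python
# Hypothetical aggregation of a node's explanation with neighbor explanations.
import numpy as np

def aggregate_explanations(own_expl, neighbor_expls, importances, gamma=0.5):
    """Blend a node's own attribution with importance-weighted neighbor attributions."""
    w = np.asarray(importances, dtype=float)
    w = w / w.sum()                                   # normalize neighbor importance
    neighbor_term = (w[:, None] * np.asarray(neighbor_expls)).sum(axis=0)
    return (1 - gamma) * own_expl + gamma * neighbor_term

own = np.array([0.6, 0.1, 0.3])                       # attribution over 3 features
neighbors = [[0.5, 0.2, 0.3], [0.1, 0.8, 0.1]]        # auxiliary neighbor explanations
print(aggregate_explanations(own, neighbors, importances=[0.9, 0.3]))
```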
Our aim is to determine the plasma properties of a coronal bright point (BP) and compare its magnetic topology, extrapolated from magnetogram data, with its appearance in X-ray images. We analyse spectroscopic data obtained with EIS/Hinode, Ca II H and G-band images from SOT/Hinode, UV images from TRACE, X-ray images from XRT/Hinode, and high-resolution/high-cadence magnetogram data from MDI/SoHO. The BP comprises several coronal loops as seen in the X-ray images, while the chromospheric structure consists of tens of small bright points as seen in Ca II H. An excellent correlation exists between the Ca II BPs and increases in the magnetic field, implying that the Ca II H passband is a good indicator of the concentration of magnetic flux. Doppler velocities between 6 and 15 km/s are derived from the Fe XII and Fe XIII lines for the BP region, while for Fe XIV and Si VII they are in the range from -15 to +15 km/s. The coronal electron density is 3.7x10^9 cm^-3. An excellent correlation is also found between the positive magnetic flux and the X-ray light curves. Remarkable agreement is found between the extrapolated magnetic field configuration and some of the loops composing the BP as seen in the X-ray images.
Image caption generation is one of the most challenging problems at the intersection of the vision and language domains. In this work, we propose a realistic captioning task where input scenes may contain visual objects with no corresponding visual or textual training examples. For this problem, we propose a detection-driven approach consisting of a single-stage generalized zero-shot detection model that recognizes and localizes instances of both seen and unseen classes, and a template-based captioning model that transforms detections into sentences. To improve the generalized zero-shot detection model, which provides essential information for captioning, we define effective class representations in terms of class-to-class semantic similarities and leverage their special structure to construct an effective unseen/seen class confidence score calibration mechanism. We also propose a novel evaluation metric that provides additional insight into the captioning outputs by separately measuring the visual and non-visual content of generated sentences. Our experiments highlight the importance of studying captioning in the proposed zero-shot setting and verify the effectiveness of the proposed approach.
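A hedged sketch of score calibration in the spirit of the mechanism above: unseen-class confidences are formed from class-to-class semantic similarities with seen-class scores, then rescaled against the seen scores. The similarity transfer and the scaling constant are illustrative assumptions, not the paper's calibration rule.

```python
# Hypothetical unseen/seen confidence calibration via semantic similarity.
import numpy as np

def calibrate_scores(seen_scores, sem_sim, gamma=1.3):
    """seen_scores: (num_seen,) detector confidences for seen classes.
    sem_sim: (num_unseen, num_seen) class-to-class semantic similarities."""
    unseen_scores = sem_sim @ seen_scores / sem_sim.sum(axis=1)   # similarity-weighted transfer
    return seen_scores, gamma * unseen_scores                     # gamma counteracts seen bias

seen = np.array([0.7, 0.2, 0.1])
sim = np.array([[0.9, 0.05, 0.05], [0.2, 0.6, 0.2]])              # 2 unseen vs 3 seen classes
print(calibrate_scores(seen, sim))
```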
Physically unassociated background or foreground objects seen towards submillimetre sources are potential contaminants both in studies of young stellar objects embedded in Galactic dust clumps and in searches for multiwavelength counterparts of submillimetre galaxies (SMGs). We employed near-infrared and mid-infrared data from the Wide-field Infrared Survey Explorer (WISE) and submillimetre data from the Planck satellite, and uncovered a source, WISE J044232.92+322734.9, whose WISE infrared colours suggest that it is a star-forming galaxy (SFG) and which is seen in projection towards the Planck-detected dust clump PGCC G169.20-8.96. We used the MAGPHYS+photo-$z$ spectral energy distribution code to derive the photometric redshift and physical properties of J044232.92. The redshift was derived to be $z_{\rm phot}=1.132^{+0.280}_{-0.165}$, while, for example, the stellar mass, IR (8-1000 $\mu$m) luminosity, and star formation rate were derived to be $M_{\star}=4.6^{+4.7}_{-2.5}\times10^{11}$ M$_{\odot}$, $L_{\rm IR}=2.8^{+5.7}_{-1.5}\times10^{12}$ L$_{\odot}$, and ${\rm SFR}=191^{+580}_{-146}$ ${\rm M}_{\odot}$ yr$^{-1}$. The derived value of $L_{\rm IR}$ suggests that J044232.92 is an ultraluminous infrared galaxy.
We present the first linear-polarization mosaicked observations performed by the Atacama Large Millimeter/submillimeter Array (ALMA). We mapped the Orion Kleinmann-Low (Orion-KL) nebula using super-sampled mosaics at 3.1 and 1.3 mm as part of the ALMA Extension and Optimization of Capabilities (EOC) program. We derive the magnetic field morphology in the plane of the sky by assuming that dust grains are aligned with respect to the ambient magnetic field. At the center of the nebula, we find a quasi-radial magnetic field pattern that is aligned with the explosive CO outflow out to a radius of approximately 12 arcseconds (~5000 au), beyond which the pattern smoothly transitions into a quasi-hourglass shape resembling the morphology seen in larger-scale observations by the James Clerk Maxwell Telescope (JCMT). We estimate an average magnetic field strength $\langle B\rangle = 9.4$ mG and a total magnetic energy of $2\times10^{45}$ erg, which is three orders of magnitude less than the energy in the explosive CO outflow. We conclude that the field has been overwhelmed by the outflow and that a shock is propagating from the center of the nebula, where the shock front is seen in the magnetic field morphology.
Using the Hubble Space Telescope and WFPC2, we have imaged the central 20 pc of the giant H II region 30 Doradus in three different emission lines. The images allow us to study the nebula with a physical resolution that is within a factor of two of that typical of ground-based observations of Galactic H II regions. Most of the emission within 30 Dor is confined to a thin zone located between the hot interior of the nebula and the surrounding dense molecular material. This zone appears to be directly analogous to the photoionized photoevaporative flows that dominate the emission from small, nearby H II regions. The dynamical effects of the photoevaporative flow can be seen. The ram pressure in the photoevaporative flow, derived from the thermal pressure at the surface of the ionization front, is found to balance the pressure in the interior of the nebula derived from previous X-ray observations. By analogy with the comparison of ground-based and HST images of M16, we infer that the same sharply stratified structure seen in HST images of M16 almost certainly underlies the observed structure in 30 Dor. 30 Doradus is a crucial case because it allows us to bridge the gap between nearby H II regions and their giant extragalactic counterparts.