Video understanding requires identifying and reasoning over semantically discriminative visual objects across frames, yet existing object-agnostic solutions struggle to effectively handle substantial object variations over time. To address this, we introduce Chain-of-Glimpse, a search-guided progressive object-grounded reasoning framework that explicitly anchors each reasoning step to specific visual evidence regions, enabling compositional and multi-step decision-making. Formally, Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, thereby mitigating over-reliance on saliency-driven cues. Specifically, Chain-of-Glimpse features a search-guided controller, optimized via reinforcement learning with a format reward that significantly incentivizes grounding capability, to iteratively ground visual evidence regions and form reliable reasoning trajectories, yielding accurate and interpretable multi-step decisions. Extensive evaluations on both in-domain NExTQA and out-of-domain Video-Holmes, CG-Bench Reasoning, and VRBench benchmarks demonstrate consistent performance gains, robust
Active Visual Exploration (AVE) is a task that involves dynamically selecting observations (glimpses), which is critical to facilitate comprehension and navigation within an environment. While modern AVE methods have demonstrated impressive performance, they are constrained to fixed-scale glimpses from rigid grids. In contrast, existing mobile platforms equipped with optical zoom capabilities can capture glimpses of arbitrary positions and scales. To address this gap between software and hardware capabilities, we introduce AdaGlimpse. It uses Soft Actor-Critic, a reinforcement learning algorithm tailored for exploration tasks, to select glimpses of arbitrary position and scale. This approach enables our model to rapidly establish a general awareness of the environment before zooming in for detailed analysis. Experimental results demonstrate that AdaGlimpse surpasses previous methods across various visual tasks while maintaining greater applicability in realistic AVE scenarios.
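To make the idea of a glimpse with arbitrary position and scale concrete, here is a minimal sketch in plain Python. It is not the AdaGlimpse implementation: the function name `extract_glimpse`, the normalized `(cx, cy, scale)` parameterization, and the nearest-neighbor resampling are all assumptions made for illustration, and the Soft Actor-Critic policy that would choose these values is omitted.

```python
import math

def extract_glimpse(image, cx, cy, scale, out_size=8):
    """Sample an out_size x out_size glimpse from a 2D image (list of rows).

    cx, cy : glimpse center in normalized [0, 1] image coordinates (assumed convention)
    scale  : glimpse side as a fraction of the image side (1.0 = whole image)
    Nearest-neighbor sampling; indices are clamped at the image border.
    """
    h, w = len(image), len(image[0])
    side_y, side_x = scale * h, scale * w
    top, left = cy * h - side_y / 2, cx * w - side_x / 2

    def sample_index(start, side, limit, i):
        # Center of the i-th output cell, mapped back to a source pixel index.
        pos = start + (i + 0.5) * side / out_size
        return min(max(int(math.floor(pos)), 0), limit - 1)

    rows = [sample_index(top, side_y, h, i) for i in range(out_size)]
    cols = [sample_index(left, side_x, w, j) for j in range(out_size)]
    return [[image[r][c] for c in cols] for r in rows]

# Toy 16x16 "image" whose pixel value encodes its position.
image = [[r * 16 + c for c in range(16)] for r in range(16)]
coarse = extract_glimpse(image, 0.5, 0.5, 1.0, out_size=4)   # whole scene, low detail
zoom = extract_glimpse(image, 0.25, 0.25, 0.25, out_size=4)  # small region, full detail
```

Both glimpses have the same pixel budget. The coarse one covers the whole scene at reduced detail, while the zoomed one covers a 4x4 region at native resolution, which mirrors the "general awareness first, then zoom in" behavior described above.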
We report the discovery of four extremely faint ($m_{\mathrm{F444W}}\gtrsim29$) red point sources in recent ultra-deep JWST/NIRCam images of the strong lensing galaxy cluster Abell S1063. All four sources sit in lensed arcs, on the symmetry points very close to the critical curves for their host-galaxies' redshifts ($z\sim1-4$). Remarkably, these point sources appear in most arcs that are sufficiently faint close to the critical curve's position ($<21\,\mathrm{nJy}\,\mathrm{arcsec}^{-2}$ in F115W). This suggests that -- unlike previous caustic-crossing events or lensed stars -- thanks to the unprecedented depth of the GLIMPSE observations paired with the extreme lensing magnification (up to $μ\sim10^4$) we might be resolving the lower-mass ($M\sim1-11\,\mathrm{M}_{\odot}$) red stellar population. Concretely, we detect three likely extremely magnified asymptotic giant branch (AGB) stars ($T_{\mathrm{eff}}\sim3200-3750$ K), and one yellow super-giant star ($T_{\mathrm{eff}}\sim6750$ K) -- possibly a yellow hyper-giant or a Cepheid. In addition to offering the first glimpse at low-mass extremely magnified stars, these detections open a possible window into stellar populations, evol
We report the measurement of the R3=[O III]5008/Hb ratios for 54 galaxies in the GLIMPSE-D survey. Thanks to gravitational lensing, our sample includes galaxies with -20 < Muv < -14 at z=6-9. We derive oxygen abundances using calibrated relationships. We observe a significant decline in R3 values at magnitudes fainter than Muv = -18, which we interpret as evidence of decreasing metallicities in fainter regimes. We explore four prescription models of the evolution of R3 with UV emission based on the new measurements and results from previous surveys. Applying these models to the GLIMPSE [O III]+Hb luminosity functions, we measure and extrapolate the ionising photon production rate $\dot{N}_{ion}$ of galaxies down to very faint limits SFR(Ha) > 5e-3 Msun/yr. Our results support the dominant contribution of star-forming galaxies to reionisation, and are consistent with the recent discovery of ultra-faint metal-poor galaxies. Our measurements of the relative contribution of each luminosity bin show that galaxies with L(Ha)~1e41 to 1e42 erg/s dominate at 8<z<9, but the relative contributions become more uniform at $7<z<8$. Extreme models either under- or over-estimate the ionising ph
We present an overview of the JWST GLIMPSE program, highlighting its survey design, primary science goals, gravitational lensing models, and first results. GLIMPSE provides ultra-deep JWST/NIRCam imaging across seven broadband filters (F090W, F115W, F150W, F200W, F277W, F356W, F444W) and two medium-band filters (F410M, F480M), with exposure times ranging from 20 to 40 hours per filter. This yields a 5$σ$ limiting magnitude of 30.9 AB (measured in a 0.2 arcsec diameter aperture). The field is supported by extensive ancillary data, including deep HST imaging from the Hubble Frontier Fields program, VLT/MUSE spectroscopy, and deep JWST/NIRSpec medium-resolution multi-object spectroscopy. Exploiting the strong gravitational lensing of the galaxy cluster Abell S1063, GLIMPSE probes intrinsic depths beyond 33 AB magnitudes and covers an effective source-plane area of approximately 4.4 arcmin$^2$ at $z \sim 6$. The program's central aim is to constrain the abundance of the faintest galaxies from $z \sim 6$ up to the highest redshifts, providing crucial benchmarks for galaxy formation models, which have so far been tested primarily on relatively bright systems. We present an initial sample of $\s
As observations have yet to constrain the ionizing properties of the faintest (M$_{\rm UV}$ > -16) galaxies, their contribution to cosmic reionization remains unclear. The rest-frame ultraviolet (UV) continuum slope ($β$) is a powerful diagnostic of stellar populations and one of the few feasible indicators of the escape fraction of ionizing photons (f$_{\rm esc}$) for such faint galaxies at high redshift. Leveraging ultra-deep JWST/NIRCam GLIMPSE imaging of the strong-lensing field Abell S1063, we estimate UV continuum slopes of 555 galaxies at z $>$ 6 with absolute magnitudes down to M$_{\rm UV}$ $\simeq -$12.5. We find a modest evolution of $β$ with redshift and a flattening in the $β$-M$_{\rm UV}$ relation such that galaxies fainter than M$_{\rm UV}$ $\sim -$16.5 no longer exhibit the bluest UV slopes. The 138 ultra-faint galaxies with M$_{\rm UV}$ $> -$16 are a diverse population encompassing dusty (30\%), old (15\%), and low-mass (50\%) galaxies. We apply the empirical $β$-f$_{\rm esc}$ relation from local Lyman continuum leakers, finding the mean f$_{\rm esc}$ peaks at $\sim 20\%$ at M$_{\rm UV}=-$16.5 and declines towards fainter galaxies, while remaining consistent wi
Visual token compression is critical for Large Vision-Language Models (LVLMs) to efficiently process high-resolution inputs. Existing methods that typically adopt fixed compression ratios cannot adapt to scenes of varying complexity, often causing imprecise pruning that discards informative visual tokens and results in degraded model performance. To address this issue, we introduce a dynamic pruning framework, GlimpsePrune, inspired by human cognition. It takes a data-driven "glimpse" and prunes irrelevant visual tokens in a single forward pass before answer generation. This approach prunes 92.6% of visual tokens while on average fully retaining the baseline performance on free-form VQA tasks. The reduced computational cost also enables more effective fine-tuning: an enhanced GlimpsePrune+ achieves 110% of the baseline performance while maintaining a similarly high pruning rate. Our work paves a new way for building more powerful and efficient LVLMs.
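The contrast between a fixed compression ratio and input-dependent pruning can be illustrated with a short sketch. It is not GlimpsePrune itself: the abstract does not say how token relevance is computed or thresholded, so the per-token `scores`, the relative-threshold rule, and both function names are hypothetical.

```python
def prune_fixed_ratio(scores, keep_ratio=0.25):
    """Keep the top keep_ratio fraction of tokens, whatever the scene looks like."""
    k = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def prune_dynamic(scores, rel_threshold=0.5):
    """Keep every token whose score reaches rel_threshold * max(scores).

    Assumes non-negative relevance scores. The number of kept tokens now
    depends on the input instead of being fixed in advance.
    """
    if not scores:
        return []
    cutoff = rel_threshold * max(scores)
    return [i for i, s in enumerate(scores) if s >= cutoff]

# Hypothetical relevance scores for two inputs with four visual tokens each.
simple_scene = [0.90, 0.05, 0.02, 0.01]   # one informative token
complex_scene = [0.90, 0.80, 0.70, 0.10]  # three informative tokens

kept_simple = prune_dynamic(simple_scene)
kept_complex = prune_dynamic(complex_scene)
kept_fixed = prune_fixed_ratio(complex_scene)
```

With these toy scores the fixed-ratio rule keeps a single token for the complex scene and discards two informative ones, while the dynamic rule keeps one token for the simple scene and three for the complex one.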
The competition between metal synthesis and feedback from massive stars establishes the mass-metallicity relation (MZR) at low redshifts. Examining this relation at higher redshifts, particularly at the low-mass end $\lesssim10^{8}\,{\rm M_\odot}$, is essential for understanding chemical enrichment and stellar feedback. In this study, we utilize the deep ($\sim30\,$hrs) JWST/NIRSpec G395M GLIMPSE-D survey of the lensed field Abell S1063 to explore the low-mass end of the MZR at high redshift ($z\sim6-8$). We identify eight [OIII]$λ$4364 emitters, enabling the most reliable "direct" metallicity measurements in galaxies down to stellar masses of $\sim10^{6-8}\,{\rm M_\odot}$. By combining our sample with galaxies with [OIII]$λ$4364 detections from the literature, we calculate direct metallicities for 21 galaxies. We compare our direct metallicities to those derived from strong-line diagnostics, and find them to be consistent with previous calibrations. We fit the MZR at $10^{6.7-9}\,M_{\odot}$ with $\sim0.3-0.5$ dex lower metallicity than local galaxies at similar stellar mass. We find the slope to be $0.25\pm0.10$, comparable to the local MZR, and the MZR exhibits a scatter of $\sim
Existing video benchmarks often resemble image-based benchmarks, with question types like "What actions does the person perform throughout the video?" or "What color is the woman's dress in the video?" For these, models can often answer by scanning just a few key frames, without deep temporal reasoning. This limits our ability to assess whether large vision-language models (LVLMs) can truly think with videos rather than perform superficial frame-level analysis. To address this, we introduce GLIMPSE, a benchmark specifically designed to evaluate whether LVLMs can genuinely think with videos. Unlike prior benchmarks, GLIMPSE emphasizes comprehensive video understanding beyond static image cues. It consists of 3,269 videos and over 4,342 highly visual-centric questions across 11 categories, including Trajectory Analysis, Temporal Reasoning, and Forensics Detection. All questions are carefully crafted by human annotators and require watching the entire video and reasoning over full video context; this is what we mean by thinking with video. These questions cannot be answered by scanning selected frames or relying on text alone. In human evaluations, GLIMPSE achieves 94.82% accuracy, but
Recent large vision-language models (LVLMs) have advanced capabilities in visual question answering (VQA). However, interpreting where LVLMs direct their visual attention remains a significant challenge, yet is essential for understanding model behavior. We introduce GLIMPSE (Gradient-Layer Importance Mapping for Prompted Visual Saliency Explanation), a lightweight, model-agnostic framework that jointly attributes LVLM outputs to the most relevant visual evidence and textual signals that support open-ended generation. GLIMPSE fuses gradient-weighted attention, adaptive layer propagation, and relevance-weighted token aggregation to produce holistic response-level heat maps for interpreting cross-modal reasoning, outperforming prior methods in faithfulness and pushing the state-of-the-art in human-attention alignment. We demonstrate an analytic approach to uncover fine-grained insights into LVLM cross-modal attribution, trace reasoning dynamics, analyze systematic misalignment, diagnose hallucination and bias, and ensure transparency.
Many online action prediction models observe complete frames to locate and attend to informative subregions of the frames, called glimpses, and recognize an ongoing action based on global and local information. However, in applications with constrained resources, an agent may not be able to observe the complete frame, yet must still locate useful glimpses to predict an incomplete action based on local information only. In this paper, we develop Glimpse Transformers (GliTr), which observe only narrow glimpses at all times, thus predicting an ongoing action and the following most informative glimpse location based on the partial spatiotemporal information collected so far. In the absence of a ground truth for the optimal glimpse locations for action recognition, we train GliTr using a novel spatiotemporal consistency objective: we require GliTr to attend to the glimpses with features similar to the corresponding complete frames (i.e., spatial consistency) and the resultant class logits at time $t$ to be equivalent to the ones predicted using whole frames up to $t$ (i.e., temporal consistency). Inclusion of our proposed consistency objective yields ~10% higher accuracy on the Something-Somethin
In this essay it is proved that, in a self-consistent semiclassical theory of gravity, the asymptotically measured orbital periods of test particles around central compact objects are fundamentally bounded from below by the compact universal relation $T_{\infty}\geq\frac{2\pi e\hbar}{\sqrt{G}c^{2}m_{e}^{2}}$ [here $m_e$ and $e$ are, respectively, the proper mass and the electric charge of the electron, the lightest charged particle]. The explicit dependence of the lower bound on the fundamental constants $\{G,c,\hbar\}$ of gravity, special relativity, and quantum theory suggests that it provides a rare glimpse into the yet unknown quantum theory of gravity.
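As a quick sanity check on the scale of this bound, the right-hand side can be evaluated numerically. The sketch below assumes Gaussian (CGS) units, in which the expression is dimensionally a time. The abstract does not state the unit system, so this is an assumption, and the constants are rounded reference values.

```python
import math

# Rounded constants in Gaussian (CGS) units. The unit system is an assumption:
# the abstract gives only the symbolic bound.
G = 6.674e-8           # gravitational constant, cm^3 g^-1 s^-2
C = 2.99792458e10      # speed of light, cm s^-1
HBAR = 1.0546e-27      # reduced Planck constant, erg s
M_E = 9.1094e-28       # electron mass, g
E_CHARGE = 4.8032e-10  # elementary charge, statC (g^1/2 cm^3/2 s^-1)

def min_orbital_period():
    """Evaluate 2*pi*e*hbar / (sqrt(G) * c^2 * m_e^2) in seconds."""
    return 2 * math.pi * E_CHARGE * HBAR / (math.sqrt(G) * C**2 * M_E**2)

T_MIN = min_orbital_period()
print(f"lower bound on the orbital period: {T_MIN:.1f} s")
```

Under these assumptions the bound comes out at roughly 16.5 seconds, a macroscopic time scale even though the expression contains only fundamental constants.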
Deep learning has become the state-of-the-art approach to medical tomographic imaging. A common approach is to feed the result of a simple inversion, for example the backprojection, to a multiscale convolutional neural network (CNN) which computes the final reconstruction. Despite good results on in-distribution test data, this often results in overfitting to certain large-scale structures and poor generalization on out-of-distribution (OOD) samples. Moreover, the memory and computational complexity of multiscale CNNs scale unfavorably with image resolution, making them impractical for application at realistic clinical resolutions. In this paper, we introduce Glimpse, a local coordinate-based neural network for computed tomography which reconstructs a pixel value by processing only the measurements associated with the neighborhood of the pixel. Glimpse significantly outperforms successful CNNs on OOD samples, while achieving comparable or better performance on in-distribution test data and maintaining a memory footprint almost independent of image resolution; 5 GB of memory suffices to train on 1024x1024 images, which is orders of magnitude less than CNNs require. Glimpse is fully differentiable a
Advanced large language models (LLMs) can generate text almost indistinguishable from human-written text, highlighting the importance of LLM-generated text detection. However, current zero-shot techniques face challenges as white-box methods are restricted to using weaker open-source LLMs, and black-box methods are limited by partial observation from stronger proprietary LLMs. It seems impossible to enable white-box methods to use proprietary models because API-level access to the models neither provides full predictive distributions nor inner embeddings. To traverse the divide, we propose **Glimpse**, a probability distribution estimation approach, predicting the full distributions from partial observations. Despite the simplicity of Glimpse, we successfully extend white-box methods like Entropy, Rank, Log-Rank, and Fast-DetectGPT to the latest proprietary models. Experiments show that Glimpse with Fast-DetectGPT and GPT-3.5 achieves an average AUROC of about 0.95 across five of the latest source models, improving the score by 51% relative to the remaining space of the open-source baseline. It demonstrates that the latest LLMs can effectively detect their own outputs, suggesting that advanced LLMs
Human capabilities in understanding visual relations are far superior to those of AI systems, especially for previously unseen objects. For example, while AI systems struggle to determine whether two such objects are visually the same or different, humans can do so with ease. Active vision theories postulate that the learning of visual relations is grounded in actions that we take to fixate objects and their parts by moving our eyes. In particular, the low-dimensional spatial information about the corresponding eye movements is hypothesized to facilitate the representation of relations between different image parts. Inspired by these theories, we develop a system equipped with a novel Glimpse-based Active Perception (GAP) that sequentially glimpses at the most salient regions of the input image and processes them at high resolution. Importantly, our system leverages the locations stemming from the glimpsing actions, along with the visual content around them, to represent relations between different parts of the image. The results suggest that the GAP is essential for extracting visual relations that go beyond the immediate visual content. Our approach reaches state-of-the-art perfo
We report the discovery of two galaxy candidates in the redshift range $15.7<z<16.4$ in JWST observations from the GLIMPSE survey. These robust sources were identified using a combination of Lyman-break selection and photometric redshift estimates. The ultra-deep NIRCam imaging from GLIMPSE, combined with the strong gravitational lensing of the Abell S1063 galaxy cluster, allows us to probe an intrinsically fainter population (down to $M_{\rm UV} =-17.0$ mag) than previously achievable. These galaxies have absolute magnitudes ranging from $M_{\rm UV} = -17.0$ to $-17.2$ mag, with blue ($β\simeq -2.87$) UV continuum slopes, consistent with young, dust-free stellar populations. The number density of these objects, log$_{\rm 10}$($φ$/[Mpc$^{-3}$ mag$^{-1}$])=$-3.47^{+0.13}_{-0.10}$ at $M_{\rm UV}=-17$ is in clear tension with pre-JWST theoretical predictions, extending the over-abundance of galaxies from $z\sim10$ to $z\sim 17$. These results, together with the scarcity of brighter galaxies in other public surveys, suggest a steep decline in the bright end of the UV luminosity function at $z \sim 16$, implying efficient star formation and possibly a close connection to the halo m
We use the ultra-deep GLIMPSE JWST/NIRCam survey to constrain the faint-end of the H$β$+[OIII]$λλ$4960,5008 luminosity function (LF) down to $10^{39}$ erg/s at z=7-9 behind the lensed Hubble Frontier Field Abell S1063. We perform SED fitting on a Lyman-Break Galaxy sample, measuring combined H$β$+[OIII] fluxes to construct the emission-line LF. The resulting LF ($α$=-1.55 to -1.78) is flatter than the UV LF ($α<-2$), indicating a lower number density of low H$β$+[OIII] emitters at fixed MUV. We explore three explanations: (i) bursty star formation histories reducing the H$β$+[OIII]-to-UV ratio, (ii) metallicity effects on [OIII]/H$β$, or (iii) a faint-end turnover in the UV LF. Assuming an evolving [OIII]/H$β$ ratio, we derive a flatter [OIII]$λ$5008 LF ($α$=-1.45 to -1.66) and a steeper H$β$ LF ($α$=-1.68 to -1.95). The combination of decreasing metallicity and bursty star formation can reconcile the UV and H$β$+[OIII] LF differences. Converting the LF to the ionising photon production rate, we find that galaxies with H$α$ flux $>10^{39}$ erg/s (SFR(H$α$)$>5\times10^{-3} M_\odot$/yr) contribute 31%-90% and 46%-156% of the ionising photon budget at 7<z<8 and 8<z<
We propose a method for human activity recognition from RGB data that does not rely on any pose information during test time and does not explicitly calculate pose information internally. Instead, a visual attention module learns to predict glimpse sequences in each frame. These glimpses correspond to interest points in the scene that are relevant to the classified activities. No spatial coherence is forced on the glimpse locations, which gives the module liberty to explore different points at each frame and better optimize the process of scrutinizing visual information. Tracking and sequentially integrating this kind of unstructured data is a challenge, which we address by separating the set of glimpses from a set of recurrent tracking/recognition workers. These workers receive glimpses, jointly performing subsequent motion tracking and activity prediction. The glimpses are soft-assigned to the workers, optimizing coherence of the assignments in space, time and feature space using an external memory module. No hard decisions are taken, i.e. each glimpse point is assigned to all existing workers, albeit with different importance. Our methods outperform state-of-the-art methods on t
Scientific peer review is essential for the quality of academic publications. However, the increasing number of paper submissions to conferences has strained the reviewing process. This surge poses a burden on area chairs who have to carefully read an ever-growing volume of reviews and discern each reviewer's main arguments as part of their decision process. In this paper, we introduce \sys, a summarization method designed to offer a concise yet comprehensive overview of scholarly reviews. Unlike traditional consensus-based methods, \sys extracts both common and unique opinions from the reviews. We introduce novel uniqueness scores based on the Rational Speech Act framework to identify relevant sentences in the reviews. Our method aims to provide a pragmatic glimpse into all reviews, offering a balanced perspective on their opinions. Our experimental results with both automatic metrics and human evaluation show that \sys generates more discriminative summaries than baseline methods in terms of human evaluation while achieving comparable performance with these methods in terms of automatic metrics.
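For readers unfamiliar with the Rational Speech Act (RSA) framework, the following sketch shows the basic computation on which such scores can build. It is a textbook-style RSA recursion with a uniform prior and no utterance cost, not the uniqueness score defined in the paper. The likelihood matrix `lik` and both function names are hypothetical.

```python
def literal_listener(lik):
    """L0(review | sentence): normalize each sentence column over reviews.

    lik[r][s] is an assumed likelihood that sentence s describes review r.
    A uniform prior over reviews is assumed.
    """
    n_reviews, n_sentences = len(lik), len(lik[0])
    l0 = [[0.0] * n_sentences for _ in range(n_reviews)]
    for s in range(n_sentences):
        col_sum = sum(lik[r][s] for r in range(n_reviews))
        for r in range(n_reviews):
            l0[r][s] = lik[r][s] / col_sum if col_sum > 0 else 0.0
    return l0

def pragmatic_speaker(lik, alpha=1.0):
    """S1(sentence | review), proportional to L0(review | sentence) ** alpha."""
    s1 = []
    for row in literal_listener(lik):
        weights = [p ** alpha for p in row]
        total = sum(weights)
        s1.append([w / total if total > 0 else 0.0 for w in weights])
    return s1

# Two reviews, two candidate sentences: sentence 0 fits both reviews,
# sentence 1 fits only review 0.
lik = [[0.5, 0.5],
       [1.0, 0.0]]
speaker = pragmatic_speaker(lik)
```

For review 0 the pragmatic speaker prefers sentence 1 (probability 0.75 against 0.25), because that sentence singles out review 0 while sentence 0 is shared with review 1. This is the sense in which RSA-style scores can surface opinions unique to one review.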
We present near-infrared spectroscopic observations of massive stars in three stellar clusters located in the direction of the inner Galaxy. One of them, the Quartet, is a new discovery while the other two were previously reported as candidate clusters identified on mid-infrared Spitzer images (GLIMPSE20 and GLIMPSE13). Using medium-resolution (R=900-1320) H and K spectroscopy, we firmly establish the nature of the brightest stars in these clusters, yielding new identifications of an early WC and two Ofpe/WN9 stars in the Quartet and an early WC star in GLIMPSE20. We combine this information with the available photometric measurements from 2MASS to estimate cluster masses, ages, and distances. The presence of several massive stars places the Quartet and GLIMPSE20 among the small sample of known Galactic stellar clusters with masses of a few 10^3 Msun, and ages from 3 to 8 Myr. We estimate a distance of about 3.5 kpc for GLIMPSE20, and 6.0 kpc for the Quartet. The large number of giant stars identified in GLIMPSE13 indicates that it is another massive (~ 6500 Msun) cluster, but older, with an age between 30 and 100 Myr, at a distance of about 3 kpc.