Recovering clean and accurate geometry from images is essential for robotics and augmented reality. However, existing geometry foundation models still suffer severely from flying pixels and the loss of fine details. In this paper, we present pixel-perfect visual geometry models that can predict high-quality, flying-pixel-free point clouds by leveraging generative modeling in the pixel space. We first introduce Pixel-Perfect Depth (PPD), a monocular depth foundation model built upon pixel-space diffusion transformers (DiT). To address the high computational complexity associated with pixel-space diffusion, we propose two key designs: 1) Semantics-Prompted DiT, which incorporates semantic representations from vision foundation models to prompt the diffusion process, preserving global semantics while enhancing fine-grained visual details; and 2) Cascade DiT architecture that progressively increases the number of image tokens, improving both efficiency and accuracy. To further extend PPD to video (PPVD), we introduce a new Semantics-Consistent DiT, which extracts temporally consistent semantics from a multi-view geometry foundation model. We then perform reference-guided token propagat
Visual perception connects high-level semantic understanding to pixel-level perception, but most existing settings assume that the decisive evidence for identifying a target is already in the image or frozen model knowledge. We study a more practical yet harder open-world case where a visible object must first be resolved from external facts, recent events, long-tail entities, or multi-hop relations before it can be localized. We formalize this challenge as Perception Deep Research and introduce WebEye, an object-anchored benchmark with verifiable evidence, knowledge-intensive queries, precise box/mask annotations, and three task views: Search-based Grounding, Search-based Segmentation, and Search-based VQA. WebEyes contains 120 images, 473 annotated object instances, 645 unique QA pairs, and 1,927 task samples. We further propose Pixel-Searcher, an agentic search-to-pixel workflow that resolves hidden target identities and binds them to boxes, masks, or grounded answers. Experiments show that Pixel-Searcher achieves the strongest open-source performance across all three task views, while failures mainly arise from evidence acquisition, identity resolution, and visual instance bind
This paper presents Pixel-Perfect Depth, a monocular depth estimation model based on pixel-space diffusion generation that produces high-quality, flying-pixel-free point clouds from estimated depth maps. Current generative depth estimation models fine-tune Stable Diffusion and achieve impressive performance. However, they require a VAE to compress depth maps into latent space, which inevitably introduces \textit{flying pixels} at edges and details. Our model addresses this challenge by directly performing diffusion generation in the pixel space, avoiding VAE-induced artifacts. To overcome the high complexity associated with pixel-space generation, we introduce two novel designs: 1) Semantics-Prompted Diffusion Transformers (SP-DiT), which incorporate semantic representations from vision foundation models into DiT to prompt the diffusion process, thereby preserving global semantic consistency while enhancing fine-grained visual details; and 2) Cascade DiT Design that progressively increases the number of tokens to further enhance efficiency and accuracy. Our model achieves the best performance among all published generative models across five benchmarks, and significantly outperform
Chain-of-thought reasoning has significantly improved the performance of Large Language Models (LLMs) across various domains. However, this reasoning process has been confined exclusively to textual space, limiting its effectiveness in visually intensive tasks. To address this limitation, we introduce the concept of reasoning in pixel space. Within this novel framework, Vision-Language Models (VLMs) are equipped with a suite of visual reasoning operations, such as zoom-in and select-frame. These operations enable VLMs to directly inspect, interrogate, and infer from visual evidence, thereby enhancing reasoning fidelity for visual tasks. Cultivating such pixel-space reasoning capabilities in VLMs presents notable challenges, including the model's initially imbalanced competence and its reluctance to adopt the newly introduced pixel-space operations. We address these challenges through a two-phase training approach. The first phase employs instruction tuning on synthesized reasoning traces to familiarize the model with the novel visual operations. Following this, a reinforcement learning (RL) phase leverages a curiosity-driven reward scheme to balance exploration between pixel-s
Universal medical image segmentation seeks to use a single foundational model to handle diverse tasks across multiple imaging modalities. However, existing approaches often rely heavily on manual visual prompts or retrieved reference images, which limits their automation and robustness. In addition, naive joint training across modalities often fails to address large domain shifts. To address these limitations, we propose Concept-to-Pixel (C2P), a novel prompt-free universal segmentation framework. C2P explicitly separates anatomical knowledge into two components: Geometric and Semantic representations. It leverages Multimodal Large Language Models (MLLMs) to distill abstract, high-level medical concepts into learnable Semantic Tokens and introduces explicitly supervised Geometric Tokens to enforce universal physical and structural constraints. These disentangled tokens interact deeply with image features to generate input-specific dynamic kernels for precise mask prediction. Furthermore, we introduce a Geometry-Aware Inference Consensus mechanism, which utilizes the model's predicted geometric constraints to assess prediction reliability and suppress outliers. Extensive experiments
Crop mapping involves identifying and classifying crop types using spatial data, primarily derived from remote sensing imagery. This study presents the first comprehensive review of large-scale, pixel-wise crop mapping workflows, encompassing both conventional supervised methods and emerging transfer learning approaches. To identify the optimal time-series generation approaches and supervised crop mapping models, we conducted systematic experiments, comparing six widely adopted satellite image-based preprocessing methods, alongside eleven supervised pixel-wise classification models. Additionally, we assessed the synergistic impact of varied training sample sizes and variable combinations. Moreover, we identified optimal transfer learning techniques for different magnitudes of domain shift. The evaluation of optimal methods was conducted across five diverse agricultural sites. Landsat 8 served as the primary satellite data source. Labels come from CDL trusted pixels and field surveys. Our findings reveal three key insights. First, fine-scale interval preprocessing paired with Transformer models consistently delivered optimal performance for both supervised and transferable workflows
CERN's strategic R&D programme on technologies for future experiments recently started investigating the TPSCo 65 nm ISC CMOS imaging process for monolithic active pixel sensors for application in high energy physics. In collaboration with the ALICE experiment and other institutes, several prototypes demonstrated excellent performance, qualifying the technology. The Hybrid-to-Monolithic (H2M), a new test chip produced in the same process but with a larger pixel pitch than previous prototypes, exhibits an unexpected asymmetric efficiency pattern. This contribution describes a simulation procedure combining TCAD, Monte Carlo and circuit simulations to model and understand this effect. The procedure reproduces the measurement results and attributes the asymmetric efficiency drop to slow charge collection caused by low-amplitude potential wells, which are created by the circuitry layout and reduce efficiency via ballistic deficit.
The current success of diffusion transformers heavily depends on the compressed latent space shaped by the pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address these problems, researchers have returned to pixel space at the cost of complicated cascade pipelines and increased token complexity. In contrast, we propose to model patch-wise decoding with a neural field and present a single-scale, single-stage, efficient, end-to-end solution, coined pixel neural field diffusion~(PixNerd). Thanks to the efficient neural field representation in PixNerd, we directly achieve 2.15 FID on ImageNet $256\times256$ and 2.84 FID on ImageNet $512\times512$ without any complex cascade pipeline or VAE. We also extend our PixNerd framework to text-to-image applications. Our PixNerd-XXL/16 achieves a competitive 0.73 overall score on the GenEval benchmark and an 80.9 overall score on the DPG benchmark.
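The "token complexity" mentioned above can be made concrete with a small calculation. The sketch below assumes that the "/16" suffix in PixNerd-XXL/16 denotes a 16-pixel patch size, which is the usual DiT naming convention but is not stated in the abstract; the function name `num_patch_tokens` is hypothetical.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of tokens for a square image split into non-overlapping square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


if __name__ == "__main__":
    # Assumed patch size of 16 (an assumption, see the lead-in).
    for size in (256, 512):
        print(size, num_patch_tokens(size, 16))
```

Under this assumption, doubling the resolution quadruples the token count, which is why pixel-space models are sensitive to the choice of patch size.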
For robotic agents operating in dynamic environments, learning visual state representations from streaming video observations is essential for sequential decision making. Recent self-supervised learning methods have shown strong transferability across vision tasks, but they do not explicitly address what a good visual state should encode. We argue that effective visual states must capture what-is-where by jointly encoding the semantic identities of scene elements and their spatial locations, enabling reliable detection of subtle dynamics across observations. To this end, we propose CroBo, a visual state representation learning framework based on a global-to-local reconstruction objective. Given a reference observation compressed into a compact bottleneck token, CroBo learns to reconstruct heavily masked patches in a local target crop from sparse visible cues, using the global bottleneck token as context. This learning objective encourages the bottleneck token to encode a fine-grained representation of scene-wide semantic entities, including their identities, spatial locations, and configurations. As a result, the learned visual states reveal how scene elements move and interact ove
MALTA2 is a Depleted Monolithic Active Pixel Sensor designed to meet the challenging requirements of future collider experiments, in particular extreme radiation tolerance and high hit rate. The sensor is fabricated in a modified Tower 180 nm CMOS imaging technology to mitigate performance degradation caused by 100 MRad of Total Ionising Dose and greater than $10^{15}$ 1 MeV n$_{eq}$/cm$^2$ of Non-Ionising Energy Loss. MALTA2 samples have been tested during the CERN SPS test beam campaign in 2023-2024, before and after irradiation at a fluence of $1 \times 10^{15}$ 1 MeV n$_{eq}$/cm$^2$. The sensors were positioned at various inclinations relative to the beam, covering grazing angles from 0 to 60 degrees. This contribution presents measurements of detection efficiency and cluster size as functions of these angles, along with an estimation of the active depth of the depleted region based on the test beam results.
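One common way to relate cluster size to active depth is a purely geometric model, sketched here as an illustration rather than as the method used in the contribution. It assumes the angle is measured from normal incidence and that the mean cluster size along the tilt direction grows linearly as `c0 + (depth / pitch) * tan(angle)`. The function name, the pitch value in the usage example, and the synthetic data are all hypothetical.

```python
import math


def estimate_active_depth(angles_deg, cluster_sizes, pitch_um):
    """Least-squares fit of cluster_size = c0 + (depth / pitch) * tan(angle).

    Returns (depth_um, c0). Assumes angles are measured from normal incidence.
    """
    xs = [math.tan(math.radians(a)) for a in angles_deg]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(cluster_sizes) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cluster_sizes))
    slope = sxy / sxx                     # slope = depth / pitch
    intercept = mean_y - slope * mean_x   # cluster size at normal incidence
    return slope * pitch_um, intercept


if __name__ == "__main__":
    pitch = 36.4  # illustrative pitch in micrometres (assumed value)
    angles = [0, 15, 30, 45, 60]
    # Synthetic cluster sizes generated from the model itself, not measurements.
    sizes = [1.2 + (30.0 / pitch) * math.tan(math.radians(a)) for a in angles]
    depth, c0 = estimate_active_depth(angles, sizes, pitch)
    print(f"depth = {depth:.2f} um, c0 = {c0:.2f}")
```

Real data would deviate from this straight line because of charge sharing, thresholds, and diffusion, so the fit gives only a first-order estimate.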
Caribou is a versatile data acquisition system used in multiple collaborative frameworks (CERN EP R&D, DRD3, AIDAinnova, Tangerine) for laboratory and test-beam qualification of novel silicon pixel detector prototypes. The system is built around a common hardware, firmware and software stack shared across different projects, thereby drastically reducing the development effort and cost. It consists of a custom Control and Readout (CaR) board and a commercial Xilinx Zynq System-on-Chip (SoC) platform. The SoC platform runs a full Yocto distribution integrating the custom software framework (Peary) and a custom FPGA firmware built within a common firmware infrastructure (Boreal). The CaR board provides a hardware environment featuring various services such as powering, slow-control, and high-speed data links for the target detector prototype. Boreal and Peary, in turn, offer firmware and software architectures that enable seamless integration of control and readout for new devices. While the first version of the system used a SoC platform based on the ZC706 evaluation board, migration to a Zynq UltraScale+ architecture is progressing towards the support of the ZCU102 board and th
We have been developing X-ray SOIPIXs for next-generation satellites for X-ray astronomy. Their high time resolution ($\sim10~\mu$s) and event-trigger-output function enable us to read out without pile-ups and to use anti-coincidence systems. Their performance in imaging spectroscopy is comparable to that of CCDs. A problem in our previous model was degradation of charge-collection efficiency (CCE) at pixel borders. We measured the response on the sub-pixel scale, using X-ray beams finely collimated to a diameter of $10~\mu$m at SPring-8, and investigated the non-uniformity of the CCE within a pixel. We found that the X-ray detection efficiency and CCE degrade in the sensor region under the pixel circuitry placed outside the buried p-wells (BPW). A 2D simulation of the electric fields shows that the isolated pixel circuitry outside the BPW creates local minima in the electric potential at the interface between the sensor and buried oxide layers. Thus, part of the signal charge is trapped there and is not collected by the BPW. Based on this result, we modified the placement of the in-pixel circuitry so that the electric fields would converge toward the BPW. We confirmed that the CCE at pixel b
Multi-object tracking (MOT) requires detecting and associating objects across frames. Unlike tracking via detected bounding boxes or tracking objects as points, we propose tracking objects as pixel-wise distributions. We instantiate this idea on a transformer-based architecture, P3AFormer, with pixel-wise propagation, prediction, and association. P3AFormer propagates pixel-wise features guided by flow information to pass messages between frames. Furthermore, P3AFormer adopts a meta-architecture to produce multi-scale object feature maps. During inference, a pixel-wise association procedure is proposed to recover object connections across frames based on the pixel-wise prediction. P3AFormer yields 81.2\% in terms of MOTA on the MOT17 benchmark -- the first among all transformer networks to reach 80\% MOTA in the literature. P3AFormer also outperforms state-of-the-art methods on the MOT20 and KITTI benchmarks.
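For readers unfamiliar with the metric quoted above, MOTA in the standard CLEAR MOT definition is one minus the ratio of accumulated errors (misses, false positives, and identity switches) to the number of ground-truth objects, summed over all frames. The sketch below is illustrative only: the function name and the example counts are hypothetical and are not taken from the paper.

```python
def mota(false_negatives: int, false_positives: int,
         id_switches: int, num_ground_truth: int) -> float:
    """CLEAR MOT accuracy: 1 - (FN + FP + IDSW) / GT, with counts summed over frames."""
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_ground_truth


if __name__ == "__main__":
    # Hypothetical counts chosen only to illustrate the formula.
    print(f"MOTA = {mota(100, 50, 10, 1000):.3f}")
```

Because the error terms are summed before normalising, MOTA can become negative when a tracker makes more errors than there are ground-truth objects.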
Variations in pose, illumination, and expression keep face recognition a challenging problem. As a pre-processing step in holistic approaches, faces are usually aligned by the eyes. The proposed method performs a pixel alignment rather than an eye alignment by mapping the geometry of each face to a reference face while keeping its own texture. The proposed geometry alignment not only creates a meaningful correspondence among the pixels of all faces, but also removes expression and pose variations effectively. The geometry alignment is performed pixel-wise, i.e., every pixel of the face is mapped to a pixel of the reference face. In the proposed method, the intensity and geometry information of faces is separated, used to train separate classifiers, and finally fused to recognize human faces. Experimental results show a large improvement for the proposed method over eye-aligned recognition. For instance, at a false acceptance rate (FAR) of 0.001, the recognition rates improve by 24% and 33% on the Yale and AT&T datasets, respectively. On the LFW dataset, a large and challenging dataset, the improvement is 20% at a FAR of 0.1.
Liver tumor segmentation and classification are important tasks in computer aided diagnosis. We aim to address three problems: liver tumor screening and preliminary diagnosis in non-contrast computed tomography (CT), and differential diagnosis in dynamic contrast-enhanced CT. A novel framework named Pixel-Lesion-pAtient Network (PLAN) is proposed. It uses a mask transformer to jointly segment and classify each lesion with improved anchor queries and a foreground-enhanced sampling loss. It also has an image-wise classifier to effectively aggregate global information and predict patient-level diagnosis. A large-scale multi-phase dataset is collected containing 939 tumor patients and 810 normal subjects. 4010 tumor instances of eight types are extensively annotated. On the non-contrast tumor screening task, PLAN achieves 95% and 96% in patient-level sensitivity and specificity. On contrast-enhanced CT, our lesion-level detection precision, recall, and classification accuracy are 92%, 89%, and 86%, outperforming widely used CNN and transformers for lesion segmentation. We also conduct a reader study on a holdout set of 250 cases. PLAN is on par with a senior human radiologist, showing
We demonstrate, using the large-scale pixel detector of ATLAS as an example, that Serial Powering of pixel modules is a viable alternative. A Serial Powering scheme has been devised and implemented for ATLAS pixel modules using dedicated on-chip voltage regulators and modified flex hybrid circuits. The equivalent of a pixel ladder, consisting of six serially powered pixel modules with about 0.3 Mpixels, has been built, and its performance with respect to noise, threshold stability, and operation failures has been studied. We believe that Serial Powering in general will be necessary for future large-scale tracking detectors.
To cope with the higher occupancy and radiation damage at the HL-LHC, the LHC experiments will also be upgraded. The ATLAS Planar Pixel Sensor R&D Project (PPS) is an international collaboration of 17 institutions and more than 80 scientists, exploring the feasibility of employing planar pixel sensors for this scenario. Depending on the radius, different pixel concepts are investigated using laboratory and beam test measurements. At small radii the extreme radiation environment and strong space constraints are addressed with very thin pixel sensors, with active thicknesses in the range of (75-150) μm, and the development of slim as well as active edges. At larger radii the main challenge is the cost reduction needed to instrument the large area of (7-10) m^2. To reach this goal the pixel productions are being transferred to 6 inch production lines, and more cost-efficient and industrialised interconnection techniques are investigated. Additionally, the n-in-p technology is employed, which requires fewer production steps since it relies on a single-sided process. Recent accomplishments obtained within the PPS are presented. The performance in terms of charge collection and efficienc
The R&D activity presented is focused on the development of new modules for the upgrade of the ATLAS pixel system at the High Luminosity LHC (HL-LHC). The performance after irradiation of n-in-p pixel sensors of different active thicknesses is studied, together with an investigation of a novel interconnection technique offered by the Fraunhofer Institute EMFT in Munich, the Solid-Liquid-InterDiffusion (SLID), which is an alternative to the standard solder bump-bonding. The pixel modules are based on thin n-in-p sensors, with an active thickness of 75 μm or 150 μm, produced at the MPI Semiconductor Laboratory (MPI HLL), and on 100 μm thick sensors with active edges, fabricated at VTT, Finland. Hit efficiencies are derived from beam test data for thin devices irradiated up to a fluence of 4e15 neq/cm^2. For the active edge devices, the charge collection properties of the edge pixels before irradiation are discussed in detail, with respect to the inner ones, using measurements with radioactive sources. Beyond the active edge sensors, an additional ingredient needed to design four-side buttable modules is the possibility of moving the wire bonding area from the chip surface facing th
The SOI (Silicon-On-Insulator) pixel sensor is a promising technology for developing detectors with high position resolution by monolithically integrating small pixels and circuits. An event-driven (trigger mode) SOI-based pixel sensor has also been developed for X-ray astronomy, with the purpose of reducing noise using anti-coincidence events. This trigger-mode SOI pixel sensor, which operates at rates in the kHz range, is also a promising scatter detector for advanced Compton imaging, where it tracks the Compton-recoil electrons.
The CMS pixel barrel system will consist of three layers built of about 800 modules. One module contains 66560 readout channels and the full pixel barrel system about 48 million channels. It is mandatory to test each channel for functionality, noise level, trimming mechanism, and bump bonding quality. Different methods to determine the bump bonding yield with electrical measurements have been developed. Measurements of several operational parameters are also included in the qualification procedure. Among them are pixel noise, gains and pedestals. Test and qualification procedures of the pixel barrel modules are described and some results are presented.
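The channel counts quoted above can be cross-checked with simple arithmetic. The sketch below assumes a full module carries 16 readout chips of 52 × 80 pixels each; this layout is not stated in the abstract and is used here only as an assumption, and the function names are hypothetical.

```python
def channels_per_module(readout_chips: int, columns: int, rows: int) -> int:
    """Readout channels on one module: chips x columns x rows."""
    return readout_chips * columns * rows


def full_module_equivalents(total_channels: int, per_module: int) -> float:
    """How many full modules would be needed to reach the total channel count."""
    return total_channels / per_module


if __name__ == "__main__":
    per_module = channels_per_module(16, 52, 80)  # assumed chip layout
    print(per_module)  # matches the 66560 channels quoted in the abstract
    print(round(full_module_equivalents(48_000_000, per_module), 1))
```

The result, roughly 720 full-module equivalents for 48 million channels, is below the quoted "about 800 modules". One possible explanation is that some modules carry fewer readout chips, but the abstract does not say so.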