Diffusion Transformers (DiT) have become the dominant methods in image and video generation, yet they still incur substantial computational costs. As an effective approach to DiT acceleration, feature caching methods cache the features of DiT from previous timesteps and reuse them in subsequent timesteps, allowing the computation in those timesteps to be skipped. Among them, token-wise feature caching applies different caching ratios to different tokens in DiTs, aiming to skip the computation for unimportant tokens while still computing the important ones. In this paper, we carefully examine the effectiveness of token-wise feature caching through two questions: 1) Is it really necessary to compute the so-called "important" tokens in each step? 2) Are the so-called important tokens really important? Surprisingly, this paper gives some counter-intuitive answers: consistently computing the selected "important tokens" in all steps is not necessary, and the selection of the so-called "important tokens" is often ineffective, sometimes even performing worse than random selection. Based on these observations, this paper introduces dual feature caching, referred to as DuCa, which applies an aggressive caching strategy and a conservative caching strategy iteratively and randomly selects the tokens to compute. Extensive experimental results demonstrate the effectiveness of our method on DiT, PixArt, FLUX, and OpenSora, showing significant improvements over previous token-wise feature caching methods.
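The dual caching idea in the abstract above can be illustrated with a toy sketch. This is not the authors' implementation: `toy_block`, `cached_step`, the three mode names, and the `ratio` parameter are hypothetical, and reading "iteratively" as switching between a step that reuses every cached token and a step that recomputes a random subset is an assumption.

```python
import math
import random


def toy_block(token):
    """Stand-in for an expensive DiT block (hypothetical): elementwise nonlinearity."""
    return [math.tanh(v) + 0.1 * v for v in token]


def cached_step(tokens, cache, mode, ratio, rng):
    """Run one timestep and return (output features, number of tokens computed).

    mode == "full":         compute every token (also used when no cache exists).
    mode == "aggressive":   reuse the cached features for all tokens.
    mode == "conservative": recompute a random subset of tokens, reuse the rest.
    The returned output is what the caller would store as the next cache.
    """
    n = len(tokens)
    if cache is None or mode == "full":
        return [toy_block(t) for t in tokens], n
    if mode == "aggressive":
        return [row[:] for row in cache], 0
    if mode != "conservative":
        raise ValueError(f"unknown mode: {mode}")
    k = max(1, round(ratio * n))
    # Random selection instead of an importance score, as the abstract suggests.
    chosen = set(rng.sample(range(n), k))
    out = [toy_block(tokens[i]) if i in chosen else cache[i][:] for i in range(n)]
    return out, k
```

A caller would run a "full" step first, then switch between "aggressive" and "conservative" steps, passing each output back in as the cache. The returned count makes the saved computation explicit.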
Tokenized pose estimation (TPE) has demonstrated remarkable performance in lightweight human pose estimation (HPE) models. However, existing TPE methods typically initialize keypoint tokens randomly, without explicitly incorporating human structure priors. These priors play a vital role in HPE by effectively mitigating common challenges such as occlusion and ambiguity. To this end, we propose a Structure-Aware Keypoint Position Embedding (SAKPE). This embedding explicitly encodes inherent structural properties of the human body, such as symmetry and order, into the positional coordinates of keypoint tokens. It also employs learnable scale and offset factors to adapt to diverse human poses, thereby fully exploiting the geometric constraints among keypoints. Furthermore, to better leverage the positional relationships among patch tokens, we introduce a Layer-adaptive Hybrid Patch Position Embedding (LHPPE). It dynamically fuses absolute and relative position embeddings of patch tokens based on attention distributions across Transformer layers, enabling the model to learn both absolute and relative positional information adaptively. Taking the two together, we propose a novel position embedding method for pose estimation, named Human-structure-aware Token Position Embedding (HTPE). It significantly improves the performance of various TPE models. Extensive experiments on COCO, CrowdPose, and OCHuman show that HTPE achieves state-of-the-art (SOTA) performance among lightweight methods, with a negligible increase in parameters and FLOPs. Notably, it demonstrates consistent improvements under occlusion, achieving up to 3.3 AP gains. The source code can be found at https://github.com/guzejungithub/HTPE.
Gastric intestinal metaplasia (GIM) is often visually inconspicuous on routine endoscopy, while many artificial intelligence systems rely on dense supervision, lack calibrated probabilities, or provide limited evidence of transfer across datasets and devices. We developed a single-frame, four-class endoscopic classifier that jointly models anatomic context and metaplasia status. We propose a dual-stream architecture that combines an RGB Swin Transformer backbone with SHAP-guided lesion-aware multi-scale auxiliary tokens. The two streams are fused through class-token attention to obtain a compact and interpretable representation without relying on video context or pixel-level masks. To address class imbalance, training combines class-balanced focal loss, balanced-softmax/logit adjustment, class-aware sampling, and validation-tuned per-class thresholds. The model was evaluated on an internal four-class cohort of 666 endoscopic still frames and externally assessed, without retuning, on an unseen public endoscopy dataset recast as a binary normal-versus-abnormal task. On the internal cohort, the proposed model achieved macro-AUROC 0.950, macro-AUPRC 0.920, macro-F1 0.926, and accuracy 0.954, with expected calibration error 0.034 and inference latency of approximately 182 ms per 224 × 224 frame. On the unseen external dataset, the model retained AUROC 0.940, AUPRC 0.900, and F1 0.890 using frozen operating thresholds. Comparative and ablation analyses indicated that lesion-aware tokenization and token fusion contributed more strongly to performance gains than backbone choice alone, while calibration quality also improved. A dual-stream, single-frame token-fusion model can provide accurate, calibrated, and interpretable classification of gastric intestinal metaplasia while remaining compatible with low-latency edge-oriented inference. 
Although broader multicenter validation is still required, the results support the feasibility of deployment-oriented AI assistance for endoscopic GIM triage.
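The expected calibration error reported in the abstract above is commonly computed by binning predictions by confidence. The following is a minimal sketch of that common equal-width-bin form. The abstract does not state the binning scheme or bin count, so `n_bins=10` and the function name are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin weight) * |accuracy - mean confidence|.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct:     booleans, True where the prediction matched the label.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        mean_conf = sum(conf for conf, _ in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

A value near 0 means stated confidence tracks observed accuracy. For example, ten predictions at confidence 0.9 with nine correct give an ECE of approximately 0.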
Silent gaps between sound tokens are a prominent feature of natural acoustic sequences, yet their role in shaping cortical selectivity for temporal structure remains unclear. Here we tested how varying the silent inter-token interval (ITI) influences neural discrimination of periodic vs aperiodic sound sequences across primary auditory cortex (A1) and non-primary auditory cortices (AuV and AuD) using single-unit electrophysiology in anesthetized mice. Spiking activity in non-primary cortices can discriminate periodic from aperiodic sequences via selective subpopulations, but including silence does not systematically enhance this selectivity; an ITI-dependent selectivity gain is unique to A1 spikes. Pre-subsequent period activity builds up across the sequence and is stronger for periodic than aperiodic sequences. This buildup is specific to A1: it is prominently expressed in A1 spikes and in the gamma-band component of the A1 LFP, while being absent in non-primary auditory cortices.
Human pose estimation (HPE) is a fundamental challenge in computer vision, aiming to detect anatomical keypoints in images. Traditional methods rely on CNN models, but recent advancements in Vision Transformer (ViT) models have shown superior performance. However, ViTs often require substantial computational resources. This paper introduces SPTPose, a method that employs self-distillation and token pruning to reduce computational costs while maintaining high performance. Our SPTPose-B achieves a mAP of 74.8% on the MSCOCO validation set with only 13.2 million parameters and 4.7 GFLOPs. The source code is available at https://github.com/duduxx123/SPTPose.
Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have led to significant progress in 2D body pose estimation. However, achieving a good balance between accuracy, efficiency, and robustness remains a challenge. For instance, CNNs are computationally efficient but struggle with long-range dependencies, while ViTs excel in capturing such dependencies but suffer from quadratic computational complexity. This paper proposes two ViT-based models for accurate, efficient, and robust 2D pose estimation. The first one, EViTPose, operates in a computationally efficient manner without sacrificing accuracy by utilizing learnable joint tokens to select and process a subset of the most important body patches, enabling us to control the trade-off between accuracy and efficiency by changing the number of patches to be processed. The second one, UniTransPose, while not allowing for the same level of direct control over the trade-off, efficiently handles multiple scales by combining (1) an efficient multi-scale transformer encoder that uses both local and global attention with (2) an efficient sub-pixel CNN decoder for better speed and accuracy. Moreover, by incorporating all joints from different benchmarks into a unified skeletal representation, we train robust methods that learn from multiple datasets simultaneously and perform well across a range of scenarios, including pose variations, lighting conditions, and occlusions. Experiments on six benchmarks demonstrate that the proposed methods significantly outperform state-of-the-art methods while improving computational efficiency. EViTPose exhibits a significant decrease in computational complexity (30% to 44% less in GFLOPs) with a minimal drop of accuracy (0% to 3.5% less), and UniTransPose achieves accuracy improvements ranging from 0.9% to 43.8% across these benchmarks.
The ultra-dense vehicle scenarios envisioned in 6G place stringent requirements on ultra-low latency, secure cooperation, and efficient task offloading decisions. Existing systems usually optimize latency or energy independently but ignore joint privacy problems and long-term trust sustainability. In this work, we propose a distributed intelligence architecture that combines federated learning (FL) with blockchain-based trust management for vehicle-to-vehicle (V2V) edge computing. The proposed architecture enables collaborative prediction and decentralized incentive enforcement in a privacy-preserving manner without revealing raw vehicle data. Task allocation is formulated as a multi-objective optimization problem that jointly considers latency, energy consumption, communication stability, and privacy exposure. The resulting problem is solved by a learning-coupled primal-dual optimization, in which the federated prediction drives the offloading decisions and the dual update enforces the system constraints. A lightweight distributed ledger layer ensures secure coordination, automatic incentive allocation, and reliable detection of fraudulent nodes. Extensive simulations in integrated traffic-network-blockchain environments show that the proposed method outperforms state-of-the-art baselines, achieving up to 30-40% reduction in service latency, an approximately 25% improvement in task completion rate, enhanced privacy preservation through gradient-based learning, and up to 95% accuracy in detecting malicious nodes. These results validate the efficacy of the proposed framework for attaining scalable, privacy-aware, and trustworthy distributed intelligence for next-generation 6G vehicular edge networks.
The transition toward sustainable cities requires integrated energy planning frameworks that coordinate multiple technologies, policy instruments, and social considerations. This study proposes a robust optimization framework for rich-renewables eco-sustainable urban communities, where multi-energy hubs including electricity, thermal, cooling, and hydrogen systems are jointly managed under uncertainty. A scenario-independent static robust model is developed to ensure reliable operation under renewable intermittency, supported by sensitivity analyses. The framework introduces hydrogen chemistry consortium processes, integrating electrolyzers, methanation, fuel cells, and carbon capture, utilization, and storage to enhance renewable utilization and reduce emissions. Both stationary storage systems and electric public transportation fleets are incorporated to provide distributed and mobile energy flexibility. Demand-side management and policy mechanisms, including carbon taxation and cap-and-trade, are embedded to align operations with environmental targets. A digital-social welfare layer evaluates affordability and equitable access. Simulation results across multiple scenarios demonstrate that the proposed framework reduces operational costs by over 45%, improves grid independence by more than 35%, and achieves emission reductions exceeding 90%. Welfare indicators also show significant improvement, confirming the effectiveness of the integrated approach.
Deep learning, particularly encoder-only transformer architectures, has demonstrated excellent performance in biomedical literature classification, facilitating evidence-based medicine, and knowledge synthesis. However, the opacity of these models' decision-making processes limits their clinical interpretability, trustworthiness, and widespread adoption. Traditional explainable artificial intelligence methods, such as Shapley Additive Explanations (SHAP) and integrated gradients (IG), address this issue but often incur substantial computational overhead for text classification. Generative large language models may offer a novel approach to generating interpretable, context-aware explanations as autonomous agents. As a proof-of-concept, the study aimed to investigate the effectiveness of GPT-4o as a standalone, end-to-end perturbation-based explainer for a BioLinkBERT text classifier. We compared its explanations against the SHAP partition explainer and IG as established baselines in terms of explanation faithfulness and semantic alignment. A stratified sample of 200 studies from the McMaster Premium Literature Service (PLUS) and Clinical Hedges databases was classified by a fine-tuned BioLinkBERT model for methodological rigor. The sampling specifically over-represented difficult, low-confidence predictions to rigorously test the explainers, with an equal number of studies sampled from each probability decile predicted by BioLinkBERT. GPT-4o, SHAP, and IG generated token-level feature attributions across a robust feature space of 80,901 tokens. GPT-based explanations were derived through a sophisticated, iterative masking perturbation workflow under 2 prompting schemes (token indices vs explicit subword tokens). Explanations were evaluated using a rank-based, modified area over the perturbation curve (AOPC), pairwise correlation analyses, and qualitative assessment of feature importance. 
Among the 200 studies, 80,901 tokens were included, and feature attributions were generated by the 4 explainers (6369 unique tokens). SHAP (AOPC 0.222, 95% CI 0.200-0.244) and IG (AOPC 0.225, 95% CI 0.202-0.247) provided consistent explanations, effectively identifying tokens relevant to study rigor (eg, "randomized" and "blind"). In contrast, despite evaluating a larger perturbation space, the GPT-4o prompting schemes did not achieve comparable faithfulness (AOPC 0.025-0.029) and produced divergent token attributions. Correlation analysis demonstrated moderate alignment between SHAP and IG (Pearson r=0.367), whereas GPT-4o exhibited limited correlation (Pearson r≤0.032) with the established baselines. Sensitivity analyses isolating only correctly classified instances yielded similar trends. Additionally, the iterative application programming interface calls required for GPT made it significantly more computationally intensive and costly to execute, whereas IG was the most temporally efficient. Despite their advanced contextual capabilities, current generative large language models are limited when deployed as standalone perturbation explainers. The findings reveal that GPT-4o struggles to accurately synthesize mathematical feature importance through iterative masking, lacking the reliability of traditional explainable artificial intelligence frameworks. Future research could build upon this work and investigate specialized prompt engineering, whole-word recombination strategies, and hybrid frameworks.
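The faithfulness metric used in the abstract above, the area over the perturbation curve (AOPC), can be sketched in its generic form: mask tokens in decreasing order of attribution and average the resulting drop in the classifier score. The study uses a rank-based, modified AOPC whose details are not given here, so this sketch shows only the generic variant. The `predict` callable, the `[MASK]` placeholder, and `max_steps` are assumptions.

```python
def aopc(predict, tokens, attributions, max_steps, mask_token="[MASK]"):
    """Generic AOPC: mean score drop after cumulatively masking the top-ranked tokens.

    predict:      callable mapping a token list to a scalar class score.
    attributions: one importance value per token (higher means more important).
    """
    if len(tokens) != len(attributions):
        raise ValueError("tokens and attributions must have the same length")
    order = sorted(range(len(tokens)), key=lambda i: attributions[i], reverse=True)
    base_score = predict(tokens)
    perturbed = list(tokens)
    drops = []
    for index in order[:max_steps]:
        perturbed[index] = mask_token  # masking is cumulative across steps
        drops.append(base_score - predict(perturbed))
    return sum(drops) / len(drops) if drops else 0.0


def toy_predict(tokens):
    """Hypothetical classifier score that depends only on two cue words."""
    return 0.1 + 0.4 * ("randomized" in tokens) + 0.4 * ("blind" in tokens)
```

With the toy classifier, attributions that rank "randomized" and "blind" first produce a large AOPC, while attributions that rank irrelevant tokens first produce an AOPC of 0. This is the sense in which a higher AOPC indicates a more faithful explanation.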
Disorganized speech is a core clinical feature of schizophrenia, reflecting disruptions in contextual integration. To objectively quantify the unfolding nature of natural speech, this study investigated the dynamic temporal trajectory of contextual coherence breakdown using sequential surprisal derived from autoregressive language models. We analyzed transcripts from 249 patients with schizophrenia and 159 matched healthy controls across eight diverse speech tasks. Using two Korean small language models (Polyglot-ko-5.8b and Kanana-1.5-8b-base), we calculated token-by-token surprisal, a measure of lexical predictability, from 100-token utterances. To account for the inherent volatility of autoregressive models at speech onset, temporal divergence of surprisal trajectories between groups was analyzed from tokens 11 to 100 using generalized additive mixed models. Patients exhibited progressively divergent surprisal trajectories as discourse unfolded, reflecting a breakdown in contextual predictability. Emerging early in the discourse, these deviations consistently intensified by tokens 50 to 70, indicating a rapid deterioration in the capacity to sustain global contextual constraints even within a short utterance. Both models also detected significantly higher overall mean surprisal in patients. Furthermore, this temporal divergence was most pronounced during unstructured narrative tasks, which lack the external visual cues of projective tasks. These findings provide quantitative evidence that language anomalies in schizophrenia are dynamic phenomena rapidly unfolding within a short discourse. By capturing precise temporal intervals of contextual disruption and paralleling clinical descriptions of disorganized discourse, this research highlights the utility of temporal natural language processing metrics in elucidating the psychopathology of schizophrenia.
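Token-by-token surprisal, as used in the abstract above, is the negative log-probability a language model assigns to the token that actually occurred. The sketch below computes it from raw next-token logits. The choice of bits (log base 2) and the function names are assumptions, and the default `skip=10` mirrors the abstract's analysis window starting at token 11.

```python
import math


def surprisal_bits(logits, target_index):
    """Surprisal -log2 p(target | context), computed from unnormalized logits."""
    peak = max(logits)
    # Log-sum-exp with the maximum subtracted for numerical stability.
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    log_prob = logits[target_index] - log_norm
    return -log_prob / math.log(2)


def surprisal_trajectory(step_logits, token_ids, skip=10):
    """Per-token surprisal for one utterance, dropping the first `skip` tokens."""
    values = [surprisal_bits(lg, tok) for lg, tok in zip(step_logits, token_ids)]
    return values[skip:]
```

For a uniform distribution over four candidate tokens, every token carries 2 bits of surprisal. Higher values mean the token was less predictable given its context.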
Three-dimensional (3D) procedural plant architecture models have emerged as an important tool for simulation-based studies of plant structure and function, for extracting plant architectural parameters from field measurements, and for generating realistic plants in computer graphics. However, measuring the architectural parameters for these models at the field and population scales remains prohibitively labor-intensive. We present a novel algorithm that generates 3D plant architecture from an image, creating a functional-structural plant model that reflects organ-level geometric and topological parameters and provides a more comprehensive representation of the plant's architecture. Instead of using 3D sensors or processing multi-view images with computer vision to obtain the 3D structure of plants, we propose a method that generates token sequences containing a procedural definition of the plant architecture. This work uses only synthetic images for training and testing, for which the "exact" architectural parameters are known, allowing us to test the hypothesis that organ-level architectural parameters can be extracted from imagery data using a vision language model (VLM). A synthetic dataset of cowpea plant images was generated using the Helios 3D plant simulator, with the detailed plant architecture encoded in XML files. We developed a plant architecture tokenizer that converts the XML file defining the plant architecture into a token sequence that a language model can predict. Then, a VLM was trained to predict plant architecture token sequences from images. Our results demonstrate that the model can predict plant architecture tokens with an F1 score of 0.73 under teacher forcing. The model was also evaluated through autoregressive generation, achieving a BLEU-4 score of 94.00% and a ROUGE-L score of 0.5182.
Our model achieves lower MAPE than feature regression-based methods in estimating bulk plant-level traits that require understanding of the occluded 3D structure of the plant, such as leaf count and leaf area. We conclude that generating plant architecture and parameter extraction from synthetic imagery are feasible using a VLM approach, supporting future extension to real imagery.
Large autoregressive models can generate high-quality, high-resolution images but suffer from slow generation speed, because these models require hundreds to thousands of sequential forward passes for next-token prediction during inference. To accelerate autoregressive text-to-image generation, we propose Speculative Jacobi Decoding++ (SJD++), a training-free probabilistic parallel decoding algorithm. Unlike traditional next-token prediction, SJD++ performs multi-token prediction in each forward pass, drastically reducing generation steps. Specifically, it integrates the iterative multi-token prediction mechanism from Jacobi decoding, with the probabilistic drafting-and-verification mechanism from speculative sampling. More importantly, for further acceleration, SJD++ reuses high-confidence draft tokens after each verification phase instead of resampling them all. We conduct extensive experiments on several representative autoregressive text-to-image generation models and demonstrate that SJD++ achieves $2\times$ to $3\times$ inference latency reduction and $2\times$ to $7\times$ step compression, while preserving visual quality with no observable degradation.
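The drafting-and-verification mechanism that the abstract above borrows from speculative sampling can be sketched for a single token position. This shows only the standard acceptance rule from speculative sampling, not the full SJD++ algorithm, whose Jacobi iteration and draft-reuse steps are not specified here. The function name and argument layout are assumptions.

```python
import random


def verify_draft(draft_token, draft_probs, target_probs, rng):
    """Accept or replace one drafted token so the result follows target_probs.

    draft_probs:  distribution the draft token was sampled from.
    target_probs: distribution the final output should follow.
    Returns (token, accepted).
    """
    ratio = target_probs[draft_token] / draft_probs[draft_token]
    if rng.random() < min(1.0, ratio):
        return draft_token, True
    # On rejection, resample from the normalized positive part of (target - draft).
    residual = [max(0.0, p - q) for p, q in zip(target_probs, draft_probs)]
    token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return token, False
```

When the two distributions are identical, every draft is accepted. When they differ, the rejection branch corrects the mismatch so the emitted token is still distributed according to `target_probs`, which is why this style of verification need not change output quality.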
Recent applications of sequence models, such as Transformers and Mamba, in the vision domain have demonstrated promising performance. However, the processes of patch token extraction and token mixing inevitably result in the irreversible loss of spatial information. Existing sequence models for vision tasks rely heavily on residual connections to mitigate this issue, yet they overlook the limitations of residual connections in maintaining frequency stability. To address these challenges, we propose Multiple Wavelet Patch Partition (MWPP), a method that extracts patch tokens while preserving the spatial information within each patch. In addition, we introduce a frequency-aware Selective Wavelet Connection (SWC) to augment residual connections, thereby enhancing frequency stability and compensating for the information loss caused by token mixing. Building on MWPP and SWC, we design FracNeXt, a scalable fractal architecture that integrates both convolution and self-attention as token mixers. Under comparable experimental settings, FracNeXt achieves top-1 accuracies of 76.8% on ImageNet and 81.2% on CIFAR-100. Moreover, it delivers state-of-the-art performance across a variety of tasks, including object detection, optical character recognition, and time-series classification on diverse benchmarks. Furthermore, MWPP improves the F1 score of existing sequence models by up to 3.8%, while the proposed fractal architecture with SWC demonstrates superior robustness with respect to model depth.
Large Reasoning Models (LRMs) significantly improve the reasoning ability of Large Language Models (LLMs) by learning to reason, exhibiting promising performance in solving complex tasks. However, their deliberative reasoning process leads to inefficiencies in token usage, memory consumption, and inference time. This survey therefore reviews efficient inference methods designed specifically for LRMs, focusing on mitigating token inefficiency while preserving reasoning quality. The overall structure of this paper is shown in Figure 1. First, we introduce a taxonomy that groups recent methods into two main categories: (a) explicit compact Chain-of-Thought (CoT), which reduces tokens while keeping the explicit reasoning structure, and (b) implicit latent CoT, which encodes reasoning steps within hidden representations instead of explicit tokens, and we discuss the strengths and weaknesses of each. Then, we conduct empirical analyses of existing methods in terms of reasoning scenarios, objective functions, and performance and efficiency. We also present open challenges in this field, including human-centric controllable reasoning, the trade-off between interpretability and efficiency of reasoning, ensuring the safety of efficient reasoning, and broader applications of efficient reasoning. In addition, we highlight key insights for enhancing LRMs' inference efficiency via techniques such as model merging, new architectures, and agent routers. We hope this work serves as a valuable guide, helping researchers overcome challenges in this vibrant field. A collection of efficient reasoning methods for LRMs (papers and code) is provided at this link: https://github.com/yueliu1999/Awesome-Efficient-Inference-for-LRMs.
Remote Photoplethysmography (rPPG) provides a non-contact alternative to traditional heart rate monitoring. Estimating physiological signals from facial videos has recently attracted significant research interest. However, rPPG performance is sensitive to illumination variation and environmental interference, which can distort the extracted physiological signal. Since the background and face are affected by similar conditions, the effect of these conditions can be extracted from the background and isolated from the result. This paper proposes the Triple-Head Spatio-Temporal Transformer (TH-STT). TH-STT is a multi-task architecture designed to separate rPPG signals from environmental interference. In addition to facial tokens, a background anchor token is used as an environmental reference. Facial tokens and background anchor are processed using a shared transformer backbone. The proposed architecture has two auxiliary tasks to help purify the resulting rPPG. The Reaction-Driven Gating (RDG) mechanism was introduced, which tracks facial muscular activity. Furthermore, a Dynamic Anchor Locking (DAL) strategy is proposed to cancel environmental illumination interference. Experimental results on three benchmark datasets demonstrate improved and stable performance, with the TH-STT achieving a Mean Absolute Error (MAE) of 0.42 bpm on UBFC-rPPG and 1.08 on COHFACE.
Untargeted liquid chromatography-high-resolution mass spectrometry (LC-HRMS) detects thousands of molecular features per sample, yet only 2-20% receive confident structural annotations. A root cause of this "dark metabolome" is that tandem MS/MS acquisition is reactive: instruments select precursors only after ions appear, blind to what elutes next. We reframe chromatographic elution as an autoregressive sequence prediction task. Because reversed-phase elution order is governed by hydrophobicity, successive features form a physically constrained sequence, like tokens in language. We discretize the mass-to-charge (m/z) axis into 110 bins and train long short-term memory (LSTM) and Transformer models to predict the next eluting m/z bin from five annotation-free per-token features: m/z bin, mass defect, retention-time gap, polarity, and intensity rank. Trained on 15,242 features from four clinical lipidomics cohorts (342 plasma samples; SCIEX TripleTOF 6600+, Waters CSH C18), the LSTM reaches 98.4% top-1 accuracy (99.99% top-5; mean absolute error 3.6 Da) and the Transformer 98.0%. Ablation shows autoregressive context accounts for 55.5 percentage points while no single feature contributes more than 0.2 pp: the sequential pattern, not molecular properties, drives prediction. Models transfer across instruments sharing the method (r=0.999 on an independent Agilent 6530 dataset) but fail under a different column chemistry (5.1% top-1) or polarity mode (2.6%), confirming method- and mode-specificity. Fine-tuning on as few as two to five quality-control injections recovers held-out accuracy from 2.6% to nearly 50%, so cross-condition deployment needs minimal calibration. These results establish that elution sequences are highly predictable and lay the groundwork for predictive MS/MS acquisition to improve annotation coverage in untargeted metabolomics.
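The discretization and scoring steps described in the abstract above can be sketched as follows. The abstract gives 110 m/z bins but not the bin edges, so the 100 to 1200 range (10 Da per bin) used here is a hypothetical choice, as are the function names. Top-k accuracy is computed in its usual form.

```python
def mz_to_bin(mz, mz_min=100.0, mz_max=1200.0, n_bins=110):
    """Map an m/z value to an equal-width bin index (range is an assumption)."""
    width = (mz_max - mz_min) / n_bins
    index = int((mz - mz_min) // width)
    # Clamp out-of-range values into the first or last bin.
    return min(max(index, 0), n_bins - 1)


def top_k_accuracy(scores_per_step, true_bins, k):
    """Fraction of steps whose true bin is among the k highest-scoring bins."""
    if len(scores_per_step) != len(true_bins) or not true_bins:
        raise ValueError("need one non-empty score vector per true bin")
    hits = 0
    for scores, target in zip(scores_per_step, true_bins):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        hits += target in ranked[:k]
    return hits / len(true_bins)
```

Under these assumed edges, a feature at m/z 760.5 lands in bin 66. A sequence model would then be trained to predict the bin index of the next eluting feature, and scored with `top_k_accuracy` for k = 1 and k = 5.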
Effective patient education is essential in neurosurgery, but many materials exceed recommended readability levels, which can limit comprehension and informed consent. Simplification can also alter tone, potentially introducing bias. Recent studies have used large language models such as Chat Generative Pre-trained Transformer (ChatGPT) to simplify neurosurgical patient education materials (PEMs), but the impact of this process on sentiment and emotional tone remains unclear. Our objective was to assess the sentiment and emotional tone of neurosurgical PEMs before and after conversion to a lower reading level by ChatGPT. A total of 336 neurosurgical PEMs covering stroke, spinal stenosis, hydrocephalus, epilepsy, and pituitary brain tumors were analyzed for readability, sentiment, and emotion. Each was then simplified to a seventh grade level using GPT-4.0. Readability was evaluated using Flesch-Kincaid Grade, Flesch Reading Ease, Gunning Fog Index, Automated Readability Index, Coleman-Liau Index, and Simple Measure of Gobbledygook. Sentiment and emotional tone were described using the Valence Aware Dictionary and sEntiment Reasoner (VADER) algorithm and National Research Council Canada Emotion Lexicon. Paired statistical t-tests assessed the significance of changes. Simplification produced substantial improvements in readability across all 6 indices and all neurosurgical topics (P < .001). Sentiment shifted toward increased positivity, reflected by higher VADER compound scores, more positive tokens, and fewer neutral tokens. Disgust decreased significantly across every topic, whereas sadness, surprise, and joy increased modestly; fear and anger showed no significant change. Topic-level analyses mirrored global patterns, demonstrating consistent directional effects. Overall, simplification achieved large readability gains while introducing small but measurable alterations in emotional tone. 
The decrease in neutral and negative sentiment suggests a shift toward more persuasive language. Modest but consistent shifts in sentiment and emotional tone accompanying artificial intelligence-assisted simplification highlight the potential for unintended affective shifts during artificial intelligence simplification and warrant monitoring when deploying large language models for patient-facing materials. Current PEMs pose a communication barrier between patient and provider, but providers must be careful.
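Two of the readability indices named in the abstract above have simple closed forms. The sketch below applies the standard Flesch Reading Ease and Flesch-Kincaid Grade formulas to pre-computed counts. How the study counted words, sentences, and syllables is not stated, so the counts are taken as inputs rather than estimated, and the function names are assumptions.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores indicate easier text."""
    _check_counts(words, sentences)
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade: approximate US school grade level."""
    _check_counts(words, sentences)
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def _check_counts(words, sentences):
    """Reject counts that would cause a division by zero."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
```

For a passage with 100 words, 5 sentences, and 150 syllables, these give a reading ease of about 59.6 and a grade level of about 9.9. Shorter sentences and fewer syllables per word move a text toward the seventh grade target described in the abstract.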
Text-to-Image Person Re-Identification (TI-ReID) aims to retrieve target pedestrians from large-scale image galleries using natural language descriptions. Despite recent progress achieved by dual-tower architectures based on vision-language pre-training, these methods remain susceptible to semantic misalignment and noise induced by occlusions, background clutter, and fine-grained attribute distractions. To mitigate these issues, we propose a Global Collaborative Discriminative Denoising Network (GCDD), a dual-tower fine-tuning framework built upon a CLIP visual encoder and a BERT text encoder. Specifically, GCDD introduces three complementary branches for robust feature enhancement. First, Discriminative Token Selection (DTS) performs adaptive hard filtering to suppress low-informative tokens. Second, Global-Guided Feature Adaptation (GFA) leverages modality-specific global semantics to recalibrate local features. Third, Query-Driven Aggregation (QDA) constructs more discriminative global representations via attentive pooling, where the backbone global feature serves as the query. The outputs of the three branches are fused through a parameter-free averaging strategy to produce the final representation. Extensive experiments on three standard TI-ReID benchmarks demonstrate that GCDD achieves strong competitive performance, validating the effectiveness of the proposed feature enhancement framework.
The synergistic interpretation of anatomical information from computed tomography (CT) and metabolic information from positron emission tomography (PET) is important to oncologic imaging. However, existing deep learning methods for PET/CT remain largely task-specific, are often trained on single-center cohorts, or adopt dual-branch fusion schemes that delay cross-modal interaction and underutilize early spatial correspondence between PET and CT. To address these limitations, we present an open-source, multi-center, whole-body FDG PET/CT foundation model utilizing 4,997 harmonized scans from four public datasets. Our framework employs hierarchical UNet-shaped backbones with early channel-wise concatenation, enabling anatomical and metabolic features to interact from the first embedding layer onward. We further introduce a masked autoencoding objective based on zero-mean imputation, combined with a weighted global reconstruction loss. This design avoids non-physical intensity discontinuities at masked-region boundaries that arise from learnable mask tokens. On downstream AutoPET lesion segmentation, the proposed models demonstrate strong label efficiency: with only 10% of the labeled training data, they achieve performance comparable to models trained from scratch on the full dataset. Under extreme 5-shot linear probing, joint PET/CT pretraining also achieves higher Dice scores than separated-modality pretraining. This multi-center foundation model demonstrates label efficiency and cross-modality representation learning for PET/CT tumor segmentation. It provides a robust, open-source basis for advancing automated oncologic imaging, significantly reducing the need for large-scale manual annotations in clinical practice.
Recent Vision-Language-Action (VLA) models have rapidly emerged as general-purpose robotic policies that integrate language understanding, visual perception, and robot control. However, prior studies and surveys have primarily emphasized backbone architectures, action decoders, training recipes, and benchmark performance, whereas relatively limited systematic attention has been given to sensor modality selection, heterogeneous signal alignment and fusion, and their connection to action generation, all of which are critical to the performance and safety of real-world robotic manipulation. This survey addresses this gap by reinterpreting VLA within the framework of a sensor-fusion-action pipeline. This study first presents a systematic taxonomy of major sensor modalities, including RGB, depth, tactile sensing, force/torque, proprioception and inertial measurement unit, multi-spectral/thermal, and event-based vision, and compares them in terms of the physical information they provide, their characteristic failure modes, and their deployment constraints. This survey further reviews teleoperation-, human video-, and simulation-based data collection pipelines, together with representative dataset configurations, and analyzes the multi-modal design space from a sensor-centric perspective, including early and late fusion, cross-attention, token-level fusion, adapters, mixture of experts, and multi-rate action representations. In addition, this study identifies a strong bias in existing benchmarks toward RGB-centric inputs and single success-rate metrics and emphasizes the need for a multidimensional evaluation framework incorporating robustness, worst-case performance, safety, latency, and efficiency. By shifting the focus away from a model-centric narrative and explicitly accounting for real-world sensor complexity, this survey seeks to establish a sensor-centered foundation for the next generation of Physical AI.