Large Vision-Language Models (LVLMs) suffer from the high computational cost of the attention mechanism caused by the large number of visual tokens. Token reduction has emerged as a promising approach to reduce this complexity by eliminating redundant visual tokens. However, existing token reduction methods struggle to preserve task-relevant tokens and eliminate irrelevant ones. This is due to the attention biases of LVLMs, where tokens with high attention scores are not always the critical ones. Such biases force existing methods into a dilemma: they face either high performance degradation or limited inference acceleration. This issue becomes more severe in fine-grained perception tasks, which rely heavily on the fine-grained information stored in specific visual tokens. To address this issue, we propose an unbiased fine-grained token reduction method named FinePruner, which explores the attention patterns of LVLMs at the attention-head level to mitigate the interference of attention biases. Concretely, we first conduct comparative studies to validate the impact of tokens corresponding to visual objects on final task performance, establishing that these tokens should be preserved while others can be pruned. A series of visualizations then reveals how LVLMs' attention biases change across layers and attention heads. Based on these patterns, the FinePruner pipeline is divided into two stages. The first stage, Instruction-Agnostic Clustering, clusters visual tokens into groups according to their embeddings to exclude the attention biases. The second stage, Attention-Refined Pruning, selects less-biased attention heads via a divergence measure and uses them to identify the tokens to preserve. Experiments on VQA benchmarks and fine-grained benchmarks demonstrate that our FinePruner achieves better accuracy-efficiency tradeoffs than state-of-the-art methods.
The code is available at https://github.com/PKU-ICST-MIPL/FinePruner_TIP2026.
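The Instruction-Agnostic Clustering stage described above can be sketched with ordinary k-means over token embeddings. This is a minimal pure-Python illustration, not FinePruner's actual implementation: the function names, the farthest-point initialization (chosen here for determinism), and the iteration count are all assumptions.

```python
def _dist2(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cluster_tokens(tokens, k, iters=10):
    """Toy instruction-agnostic clustering of visual-token embeddings.

    A k-means sketch: tokens are grouped purely by embedding
    similarity, so no attention scores (and hence no attention
    biases) enter the grouping decision.
    """
    # Farthest-point initialization: deterministic, spread-out centers.
    centers = [list(tokens[0])]
    while len(centers) < k:
        nxt = max(tokens, key=lambda t: min(_dist2(t, c) for c in centers))
        centers.append(list(nxt))
    assign = [0] * len(tokens)
    for _ in range(iters):
        # Assign each token to its nearest center.
        for i, t in enumerate(tokens):
            assign[i] = min(range(k), key=lambda c: _dist2(t, centers[c]))
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [t for i, t in enumerate(tokens) if assign[i] == c]
            if members:
                centers[c] = [sum(d) / len(members) for d in zip(*members)]
    return assign

# Two well-separated groups of 2-D "token embeddings".
labels = cluster_tokens([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]], k=2)
```

Because the grouping looks only at embeddings, a subsequent pruning stage can then reason per group instead of per raw attention score.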
Recent progress has been made in human-human interaction generation. However, directly generating complex two-person interactive motions remains a significant challenge. Moreover, existing models typically employ two independent timelines when generating motions for interactive scenarios involving two individuals. This design overlooks the temporal dependencies between motions at each timestep and fails to account for the roles of active and reactive participants during generation, often resulting in unrealistic and unnatural motions. In this work, we propose HiTMM, a novel framework for Human interaction generation based on Temporal Masked Modeling. HiTMM first decomposes the human interaction into two separate single-person motions. Individual motions within the interaction belong to the same type, enabling them to be mapped to a shared latent space through a coarse-to-fine approach that produces multi-layer discrete tokens. We then arrange all tokens of the two interacting individuals along a shared timeline. Subsequently, we employ a masked transformer and a residual transformer to model the base-layer and rest-layer motion tokens. Both the base-layer and rest-layer motion tokens are arranged along a single timeline, allowing the model to explicitly capture the temporal order and initiating role embedded in the sequence, where the first individual's motion initiates the interaction. Notably, our model uses a shared temporal representation, making it capable of performing temporal editing on specific regions within human interaction sequences. Experimental results show that our model achieves an FID of 5.017 on the InterHuman dataset, surpassing the current state-of-the-art model (vs. 5.154 for InterMask), and an FID of 0.373 on the InterX dataset (vs. 0.399 for InterMask). Project URL: https://jiaozicheng.github.io/HiTMM/.
The Transformer architecture widely adopted in large language models (LLMs) suffers from limited inference efficiency due to the inherently sequential nature of autoregressive token generation. To address this issue, speculative decoding (SD) has been proposed to accelerate LLM inference by employing small speculative models (SSMs) to generate candidate tokens that are subsequently verified by the target LLM. However, SD methods are often constrained by a key challenge: the low acceptance rate of tokens predicted by SSMs. To overcome this limitation, this paper proposes a Dual-Stream Network Architecture (DSNA), which introduces two parallel processing streams that simultaneously model word sequences and feature sequences. The outputs of these two streams are progressively fused in subsequent stages to enhance the quality of candidate predictions. Furthermore, a dynamic multi-path decoding (DMPD) mechanism is introduced to leverage the enriched representations produced by the dual-stream architecture. This mechanism allows multiple candidate token paths to be evaluated simultaneously, enabling the model to accept multiple tokens within a single forward propagation step during inference. Extensive experiments show that our proposed method consistently outperforms state-of-the-art SD approaches, achieving significant improvements in both inference throughput and generation accuracy across multiple benchmarks.
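The draft-and-verify loop underlying speculative decoding can be illustrated in a few lines. The sketch below shows only the simplest greedy-verification variant, not the paper's DSNA/DMPD mechanism; `target_next` is a hypothetical stand-in for one target-LLM forward pass that returns the greedy next token for a prefix.

```python
def speculative_verify(draft_tokens, target_next):
    """Greedy speculative-decoding verification sketch.

    The target model checks every drafted token in one pass: the
    prefix that matches its own greedy choices is accepted, plus one
    corrected token at the first mismatch. Accepting k tokens per
    forward pass is where the speed-up comes from.
    """
    accepted = []
    for tok in draft_tokens:
        want = target_next(accepted)
        if tok == want:
            accepted.append(tok)
        else:
            accepted.append(want)  # target's correction ends the round
            break
    return accepted

def target(prefix):
    """Toy target model: always continues the pattern A B A B ..."""
    return "A" if len(prefix) % 2 == 0 else "B"

out = speculative_verify(["A", "B", "B"], target)  # third draft rejected
```

The acceptance rate of the drafted prefix is exactly the quantity DSNA aims to raise by improving candidate quality.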
Fine-grained bird image classification (FBIC) is crucial for ecological monitoring and biodiversity conservation, yet it remains challenging under camouflaged appearances, body occlusions, and arbitrary postures. To address these issues, we propose PteFBIC, which enhances fine-grained discriminability by modeling interregional relationships among pteryla-related appearance cues, including the regional organization of texture and color patterns as well as their cross-region transitions and complementarities. Specifically, we design a pteryla token construction module to generate pteryla-related tokens from an orientation-enhanced feature representation for subsequent relationship modeling. Furthermore, a pteryla relationship mining (PRM) module fuses global visual tokens with pteryla-related tokens to explicitly capture dependencies such as orientation-consistent texture organization, cross-region texture transitions, and complementary appearance variations. In addition, a key cue extraction (KCE) module is introduced to aggregate multiscale discriminative evidence, thereby improving robustness to pose variations and local occlusions. Experiments on CUB-200-2011 and NABirds demonstrate that PteFBIC consistently outperforms a wide range of state-of-the-art (SOTA) methods. The code of PteFBIC is available at https://github.com/she3333/PteFBIC.
Prompt tuning methods use learnable tokens for parameter-efficient downstream adaptation on large pre-trained models. However, for dual-modal visual-language pre-trained models (VLPMs), existing prompt tuning methods overlook the preservation of pre-trained text-image alignment during fine-tuning. To address this issue, we propose Synchronous Dual Prompt Tuning (SDPT). SDPT initializes a single set of learnable unified prototype tokens in the established modal aligning space to represent the aligned semantics of text and image modalities for downstream tasks. Furthermore, SDPT establishes inverse linear projections, whose projection matrices need no training, to embed the information of learnable unified prototype tokens into the input space of different modalities. The inverse linear projections allow the unified prototype token to synchronously represent the two modalities and enable SDPT to share the unified semantics of text and image for downstream tasks across different modal prompts. Experimental results demonstrate that SDPT assists VLPMs to achieve superior outcomes with only 0.04% of model parameters for training across various scenarios, outperforming other single- or dual-modal methods. The code is released at github/SDPT.
Proportional, fair scheduling in OFDM-based vehicle-to-everything (V2X) uplink causes the resource-block allocation of each vehicle to vary from slot to slot, yet conventional semantic encoders produce a fixed number of output tokens regardless of the instantaneous channel capacity. When the encoder output exceeds the slot budget, transmitted features are truncated and the resulting federated learning gradient is corrupted, a problem that affected 23% of training rounds for non-line-of-sight vehicles in our experiments. The difficulty is worsened by a spatial pattern common in urban deployments: vehicles at congested intersections suffer the poorest propagation conditions while carrying the training data most relevant to safety, and throughput-driven client selection excludes them in favor of vehicles with strong channels but uninformative scenes. We address both issues within a single framework for OFDM-based V2X federated learning. On the transmission side, a Sensing-Guided Adaptive Modulation (SGAM) module derives a per-slot token budget from the current resource-block allocation and selects tokens through differentiable Gumbel-TopK pruning with a hard capacity clip, so the transmitted token count stays within the slot budget. On the scheduling side, a Channel-Decoupled Federated Learning (CDFL) module partitions clients independently by channel quality and data complexity, selects diverse representatives per partition via facility location optimization, and corrects for partition-size imbalance through inverse propensity weighting during model aggregation. Experiments on NuScenes with 20 non-IID vehicular clients under realistic OFDM channel simulation demonstrate a Macro-F1 of 0.710 (+8.7 points over the Oort-adapted baseline), zero budget violations throughout training, and a 75% reduction in training variance; the worst-class F1 more than doubles relative to FedAvg.
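The hard forward pass of Gumbel-TopK pruning with a capacity clip can be sketched as follows. This is an illustration of the general perturb-and-top-k mechanism, not SGAM's implementation; the function name, temperature default, and seeding are assumptions, and the training-time softmax relaxation that makes selection differentiable is omitted.

```python
import math
import random

def gumbel_topk_prune(scores, budget, tau=1.0, seed=0):
    """Pick at most `budget` token indices by perturbed top-k.

    Each token's importance score is perturbed with Gumbel(0, 1)
    noise, the top-`budget` indices survive, and the hard clip
    guarantees the per-slot token budget is never exceeded.
    """
    rng = random.Random(seed)
    # Gumbel(0, 1) noise: -log(-log(U)) for U ~ Uniform(0, 1).
    keys = [
        s / tau - math.log(-math.log(max(rng.random(), 1e-12)))
        for s in scores
    ]
    order = sorted(range(len(scores)), key=keys.__getitem__, reverse=True)
    return sorted(order[:budget])  # hard capacity clip

kept = gumbel_topk_prune([0.9, 0.1, 0.8, 0.2, 0.7], budget=3)
```

Because the clip is applied after ranking, the transmitted token count can never exceed the slot budget, which is exactly the zero-budget-violation property reported above.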
Background/Objectives: Breast cancer continues to be one of the most serious and common diseases affecting women around the globe. Although ultrasound imaging is an effective method for detecting abnormalities in dense breast tissue, it has several drawbacks, including the subjectivity and variability of interpretation, which depend on the cognitive biases and experience of the interpreting expert. These factors drive the growing need for AI-driven models for diagnostic analysis. In this research, we provide a hybrid deep learning framework for cancer classification on the breast ultrasound image dataset (the BUSI dataset). Methods: The proposed architecture combines a lightweight ViT encoder with an EfficientNetV2-RW-S feature extractor, leveraging the complementary strengths of convolutional neural networks (CNNs) and transformers: EfficientNetV2 captures the fine-grained morphological components of lesions, edges, and echogenic variations of the tissue, whereas the transformer models the long-range dependencies between lesions and the surrounding tissue. Results: The proposed hybrid model achieves an enhanced classification accuracy of 97.95%, surpassing the standalone ViT (89%) and CNN (80%) frameworks. Furthermore, the hybrid architecture reduces the self-attention computational complexity by shrinking the token count from 197 to 10, which leads to a substantial decrease in the memory and compute expended during the attention process. Conclusions: Overall, this study delivers improved diagnostic accuracy at reduced computational cost, suggesting the proposed architecture as a potential framework for contemporary clinical environments.
With the rapid advancement of vision-language models (VLMs) in general-purpose settings, their application to cross-modal retrieval and semantic understanding of large-scale multimodal remote sensing (RS) data is emerging as a key enabler for urban governance, environmental monitoring, and disaster response. However, the pervasive issue of semantic shift in RS imagery poses a significant challenge to the transferability of pre-trained VLMs. To address this limitation, we propose ReCoTR, an enhanced CLIP-based cross-modal retrieval framework tailored for remote sensing applications. ReCoTR tackles region-level granularity bias and contextual semantic drift through a Dual Consensus Token Evaluation (DCTE) module, which leverages a mixture-of-experts strategy to fuse inter-modal semantic consensus with intra-modal structural consistency, enabling fine-grained estimation of semantic confidence for visual tokens. Moreover, to mitigate representational contamination caused by background noise, we introduce the Semantic Confidence Token Compression (SCTC) module. This module selectively filters and aggregates tokens with high semantic relevance, thus reducing redundancy and alleviating the noise amplification inherent in CLIP's average pooling. Experimental results on three benchmark RS cross-modal retrieval datasets demonstrate that ReCoTR consistently outperforms existing methods on bidirectional image-text retrieval tasks, validating its effectiveness and robustness in remote sensing semantic alignment scenarios. Our source codes are available at: https://github.com/Jerry710/ReCoTR.git.
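The filter-then-aggregate idea behind confidence-guided token compression can be sketched in a few lines. This is a generic illustration, not the SCTC module itself: the function name, the keep ratio, and the plain mean-pooling are all assumptions.

```python
def confidence_pool(tokens, conf, keep_ratio=0.5):
    """Confidence-guided token compression sketch.

    Keep only the highest-confidence visual tokens and mean-pool
    those, instead of average-pooling every token (which lets
    low-confidence background tokens contaminate the representation).
    """
    k = max(1, int(len(tokens) * keep_ratio))
    # Rank token indices by semantic confidence, keep the top k.
    idx = sorted(range(len(tokens)), key=lambda i: conf[i], reverse=True)[:k]
    kept = [tokens[i] for i in idx]
    # Mean-pool only the surviving tokens, dimension by dimension.
    return [sum(d) / len(kept) for d in zip(*kept)]

# Four 2-D "visual tokens"; the noisy ones carry low confidence.
pooled = confidence_pool(
    tokens=[[1.0, 1.0], [0.0, 0.0], [3.0, 3.0], [2.0, 2.0]],
    conf=[0.9, 0.1, 0.8, 0.2],
)
```

With the confidences above, only the first and third tokens survive, so the pooled vector reflects the high-confidence content rather than the background.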
Sleep stage classification based on electroencephalography (EEG) is fundamental for assessing sleep quality and diagnosing sleep-related disorders. However, most traditional machine learning methods rely heavily on prior knowledge and handcrafted features, while existing deep learning models still struggle to jointly capture fine-grained time-frequency patterns and achieve clinical interpretability. Recently, vision-language models (VLMs) have made significant progress in the medical domain, yet their performance remains constrained when applied to physiological waveform data, especially EEG signals, due to their limited visual understanding and insufficient reasoning capability. To address these challenges, we propose EEG-VLM, a hierarchical vision-language framework that integrates multi-level feature alignment with visually enhanced language-guided reasoning for interpretable EEG-based sleep stage classification. Specifically, a specialized visual enhancement module constructs high-level visual tokens from intermediate-layer features to extract rich semantic representations of EEG images. These tokens are further aligned with low-level CLIP features through a multi-level alignment mechanism, enhancing the VLM's image-processing capability. In addition, a Chain-of-Thought (CoT) reasoning strategy decomposes complex medical inference into interpretable logical steps, effectively simulating expert-like decision-making. Experimental results demonstrate that the proposed method significantly improves both the accuracy and interpretability of VLMs in EEG-based sleep stage classification, showing promising potential for automated and explainable EEG analysis in clinical settings.
This study examined the impacts of Attachment and Biobehavioral Catch-up (ABC) on expressive language of Latine toddlers living with their biological parents (N = 173), randomized to home-based Early Head Start supplemented by ABC, or a control group. Mothers' mean age was 30.9 years (SD = 6.5); toddlers' (49.7% male) mean age was 13.0 months (SD = 4.1). Compared to controls, children in the ABC group produced significantly higher numbers of utterances, morphemes, types, and tokens, as well as greater mean length of utterances in words and morphemes. An indirect effect of ABC through dyadic mutuality was documented for number of utterances, types, tokens, and morphemes. Findings are considered in the context of the literature on the developmental impacts of parenting interventions. This study evaluated whether a parenting program (i.e., Attachment and Biobehavioral Catch-up), implemented as part of an early childhood prevention program (i.e., Early Head Start), improved the spoken language skills of toddlers from Latine families experiencing poverty. We found that the parenting program did improve these toddlers' language skills in many ways, including the numbers and types of words they used. We also found that the quality of the interaction between mothers and toddlers facilitated their language use. Our findings suggest that children enrolled in early childhood preventative programs would benefit from these more targeted parenting programs, particularly regarding language development.
Deoxyribonucleic acid provides unmatched information density and longevity for data storage, yet its easy amplification by polymerase chain reaction enables unauthorized replication at negligible cost. We introduce ZAT-DNA, which encodes information in patterns of canonical adenine and noncanonical 2-aminoadenine. As DNA polymerases cannot distinguish adenine from 2-aminoadenine, polymerase-based amplification erases these patterns, enforcing molecular-layer non-replicability intrinsic to the base-pairing ambiguity. We validate ZAT-DNA for secure key storage, demonstrating error-free encoding, storage, and high-fidelity nanopore retrieval of 32-bit and 64-bit cryptographic keys. ZAT-DNA blocks polymerase-based copying and protects non-fungible tokens by preventing functional duplication. For larger datasets, we present a hybrid "Babel-DNA" architecture: multiple encrypted images are co-encoded in a single regular DNA pool, with each selectively decryptable only via its cognate, non-replicable ZAT-DNA key. This provides a practical framework for molecular access control, secure DNA-encoded databases, and scarce molecular tokens.
Despite recent advances in medical informatics, extracting tumor information from pathology reports remains a challenge in modern cancer registry and surveillance workflows. These documents often have an unstructured format, complex medical content, and a considerably lengthy context, creating significant challenges for automated phenotypic information extraction. Although some recent language models such as BERT, GatorTron, and GPT-4 have demonstrated efficacy in medical applications, they are constrained either by sequence-length limitations or by cloud-based processing that conflicts with the handling of protected health information. We introduce OncoPT, two oncology-optimized transformer models based on the Longformer and BigBird architectures and trained on real-world pathology reports. OncoPT efficiently processes reports up to 4,096 tokens, making it suitable for onsite deployment in hospitals with limited resources. We apply OncoPT to a common malignancy (exemplified by breast cancer) and a rare malignancy (exemplified by gastric cancer), across five key tumor phenotypes: Subsite, Histology, Grade, Stage, and Laterality. The results demonstrate that OncoPT achieves state-of-the-art weighted F1 scores on a private pathology dataset and surpasses commercial chatbots (ChatGPT 4o and o1) on the public CORAL dataset (up to 30% improvement). These findings highlight the robustness of OncoPT models with the added benefit of preserving the privacy of patient health information.
The growing interest in the application of Large Language Models (LLMs) for healthcare comes with a demand for better open-source LLMs, and stronger reassurances regarding their performance. To advance in this direction, this work conducts a thorough and transparent study of LLM model training and benchmarking in healthcare, releasing as open assets all resources needed for reproducing the Aloe models and its results (weights, data and code). This includes details on optimized data preprocessing and training, combining curated public data with synthetic samples for a total of 1.8B training tokens; enhanced safety, induced through Direct Preference Optimization (DPO), aligning Aloe models for ethical robustness and against jailbreaking attacks; and finally model performance, evaluated thoroughly through close-ended, open-ended, safety, and human assessments. To boost inference efficacy and test the upper bounds of open LLM performance, Aloe models are integrated with a Retrieval-Augmented Generation (RAG) system. The resultant models deliver competitive performance across healthcare benchmarks and medical fields while significantly improving safety and bias resilience. Model weights are released for research-only purposes, together with training and evaluation datasets, and RAG inference code. To enable the responsible release of such technology, this work is supported by a detailed healthcare-specific risk assessment. Building on top of base models like Llama 3.1 and Qwen 2.5, the Aloe models and their development recipe set a high standard for open-source medical LLMs, balancing top-tier performance with high ethical requirements.
Large language models have demonstrated remarkable progress in mathematical reasoning, leveraging chain-of-thought and reinforcement learning. However, many open questions remain regarding the interplay between reasoning token usage and accuracy gains. In particular, when comparing models across generations, it is unclear whether improved performance results from longer reasoning chains or more effective reasoning. We systematically analyze reasoning chain length across o1-mini and o3-mini variants on the Omni-MATH benchmark, finding that o3-mini (m) achieves superior accuracy without requiring longer reasoning chains than o1-mini. Moreover, we show that accuracy generally declines as reasoning chains grow across all models and compute settings, even when controlling for difficulty of the questions. This accuracy drop is significantly smaller in more proficient models, suggesting that new generations of reasoning models use test-time compute more effectively. Finally, we highlight that while o3-mini (h) achieves a marginal accuracy gain over o3-mini (m), it does so by allocating substantially more reasoning tokens across all problems, even the ones that o3-mini (m) can already solve. These findings provide new insights into the relationship between model capability and reasoning length, with implications for efficiency, scaling, and evaluation methodologies.
This study investigates public perception of the metaverse through a large-scale computational analysis of 52,874 English-language tweets. Leveraging sentiment analysis tools (VADER and RoBERTa) and unsupervised topic modeling (BERTopic), we categorize discourse into four thematic domains: general metaverse discussion, Meta's Horizon Worlds, metaverse-related cryptocurrency tokens, and virtual social events. Our findings reveal that 43.0% of tweets express positive sentiment, driven by enthusiasm for immersive innovation and digital transformation, while 23.6% convey skepticism, primarily concerning platform reliability, corporate dominance, and privacy. Sentiment surrounding Horizon Worlds reflects a paradox: underlying optimism is overshadowed by user frustration, with negative tweets generating disproportionately high engagement. Analysis of metaverse token discourse indicates robust investor interest, tempered by persistent concerns over market volatility and fraudulent schemes. Topic modeling further uncovers a notable narrative shift from speculative price-focused discussions toward utility-driven use cases. Virtual events (e.g., digital weddings, concerts) elicit the most positive sentiment (51.3%), with users frequently expressing emotional resonance and communal belonging, as visually reinforced by word cloud analysis. This research contributes to the literature on digital adoption and emerging technologies by mapping the evolving social discourse of the metaverse. It offers actionable insights for platform developers, investors, and educators seeking to align innovation with user expectations and provides a predictive lens for forecasting public readiness for the next generation of digital interaction.
Dynamic State Tracking (DST) is pivotal for personalized recommender systems and user modeling, aiming to estimate users' evolving latent states from sequential interactions. However, existing deep sequence paradigms predominantly treat interaction entities, such as semantic tokens and items, as isolated deterministic vectors. This approach often overlooks latent structural dependencies among entities, including knowledge graph topologies, and remains limited in quantifying the epistemic uncertainty inherent in stochastic user behaviors caused by random interactions and aleatoric noise. To address these dual challenges, we propose Adaptive G-UKT (Adaptive Graph-Enhanced Uncertainty-aware Knowledge Tracing), a unified probabilistic framework for temporal sequence modeling. Unlike traditional point-estimation models, we map hidden user states into Gaussian distributions, enabling the simultaneous tracking of semantic activation levels and estimation confidence through diagonal covariance. To mitigate data sparsity, we design an Adaptive Graph Learner that autonomously infers latent semantic correlations from raw data, coupled with an Adaptive Gaussian-HGNN that propagates uncertainty information across the dynamically learned topology. Furthermore, we introduce a Wasserstein attention mechanism to perform distribution-aware sequence retrieval and an uncertainty-guided contrastive learning strategy to enhance model robustness against noisy interactions. Extensive experiments on four large-scale real-world sequential datasets, namely ASSISTments2009, Bridge2Algebra2006, Algebra2005, and NIPS34, demonstrate that Adaptive G-UKT achieves competitive performance against state-of-the-art baselines, showing particularly significant gains in sparse data regimes. 
Crucially, visualization analysis confirms the model's capacity to autonomously uncover intrinsic structural topologies, bridging the critical gap between high-precision deep sequence learning and interpretable knowledge graph reasoning.
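The Wasserstein attention described above needs a distance between Gaussian user states. For diagonal covariances the squared 2-Wasserstein distance has a closed form, sketched below; the function name and the exp(-W2²) similarity suggestion are illustrative, not the paper's exact formulation.

```python
import math

def w2_diag_gauss(mu1, var1, mu2, var2):
    """Squared 2-Wasserstein distance between two diagonal Gaussians.

    For commuting (here: diagonal) covariances this reduces to
        ||mu1 - mu2||^2 + sum_d (sqrt(var1_d) - sqrt(var2_d))^2,
    a quantity a Wasserstein attention mechanism can turn into a
    similarity score (e.g. via exp(-W2^2)) between distributional
    hidden states, so that both the state estimate and its
    uncertainty influence retrieval.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum(
        (math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var1, var2)
    )
    return mean_term + cov_term

# Identical covariances: the distance reduces to the squared mean gap.
d = w2_diag_gauss([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0])
```

Unlike a dot product between point estimates, this distance grows when either state is uncertain in a dimension where the other is confident, which is what makes the retrieval distribution-aware.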
From work emerging through the middle of the 20th century, the essence of meaning has become widely accepted as being described by the three orthogonal dimensions of valence, arousal, and dominance. These essential dimensions have become the cornerstone of sentiment analysis across many fields. By reexamining first types and then tokens for the English language, and through the use of automatically annotated histograms ("ousiograms"), we find here that the essence of meaning conveyed by words is instead best described by a goodness-power-aggression-danger-structure (GPADS) circumplex framework; that large-scale English language corpora reveal a systematic bias toward safe, low-danger words; and that the power-danger-structure framework is the minimal framework that represents essential meaning. We find remarkable congruences between the GPADS framework and other spaces including mental states and fictional archetypes, and we construct and demonstrate a prototype ousiometer.
Recent advancements in large language models (LLMs) have showcased remarkable text generation capabilities. However, due to the inherent ambiguity of natural language and the unstructured nature of the text modality, LLMs still struggle to integrate structured information (e.g., graphs) effectively. This hinders their ability to leverage high-quality structured data in specialized domains. Thus, recent research has explored various methods to integrate graph structures into LLMs to improve generation. However, existing methods typically compress the graph's structural information into only a single token, which is concatenated with detailed text tokens for LLMs, restricting their ability to capture deep semantic and structural information. To overcome these limitations, we propose GRaph-Augmented Fine-grained Fusion (GRAFF), a novel method that integrates fine-grained node-level structural information with corresponding text entities into LLMs via a lightweight structure-adapter module. Specifically, we introduce a dual-channel graph input mechanism to separate structural and semantic components for graph encoding, producing more expressive graph representations. We then incorporate a graph attention (GAT) module into LLMs' intermediate decoder layers to process structural information, enhancing the model's capability in graph-based question answering. Extensive experiments show that GRAFF significantly improves LLMs' graph-understanding ability in question answering, outperforming baselines by an average of 10.14% across four datasets. The official code for this work is available at https://github.com/hcpv/GRAFF.
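A GAT module of the kind mentioned above computes per-neighbor attention coefficients. The single-head sketch below follows the standard GAT formulation (LeakyReLU-scored concatenation, softmax over neighbors); it is illustrative of the building block, not GRAFF's adapter, and all shapes and names are assumptions.

```python
import math

def gat_attention(h, adj, W, a, slope=0.2):
    """Single-head graph-attention coefficients in the GAT style.

    h:   list of node feature vectors
    adj: adj[i] = neighbor indices of node i (include i for self-loops)
    W:   shared weight matrix as a list of rows
    a:   attention vector scored against the concatenation [Wh_i || Wh_j]
    """
    def matvec(M, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in M]

    z = [matvec(W, hi) for hi in h]
    alphas = {}
    for i, nbrs in enumerate(adj):
        e = []
        for j in nbrs:
            s = sum(ak * vk for ak, vk in zip(a, z[i] + z[j]))
            e.append(s if s > 0 else slope * s)  # LeakyReLU
        m = max(e)  # subtract max for a numerically stable softmax
        ex = [math.exp(v - m) for v in e]
        tot = sum(ex)
        alphas[i] = {j: v / tot for j, v in zip(nbrs, ex)}
    return alphas

# Tiny 3-node graph, identity W, uniform attention vector.
coeffs = gat_attention(
    h=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    adj=[[0, 1, 2], [1, 0], [2]],
    W=[[1.0, 0.0], [0.0, 1.0]],
    a=[0.25, 0.25, 0.25, 0.25],
)
```

Each node then aggregates its neighbors' transformed features weighted by these coefficients, which is how node-level structure can flow into a decoder layer.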
Despite making significant advancements, deep learning is still confronted with inherent challenges due to its black-box nature. Existing approaches usually employ explicit data distribution learning methods to enhance interpretability. However, such methods tend to overlook the relational constraints among tokens during interpretation and fall short in natural language understanding tasks. In this paper, we propose a textual white-box transformer for natural language understanding, named TWT, which optimizes a sparse-rate-reduction objective under token relational constraints. Specifically, the low-rank sparse embedding strategy (LSES) and the label-interacted mapping mechanism (LMM) in the preprocessing layer exploit the characteristics of natural language. The multi-head subspace self-attention (MSSA) and the token-conditioned iterative shrinkage-thresholding algorithm (T-ISTA) in the transformer layer are employed to maximize rate reduction and sparsify feature representations. Extensive experimental results on four widely used text classification datasets demonstrate that our proposed method performs better than the state-of-the-art baselines and shows consistent performance while maintaining simplicity and interpretability. Beyond downstream supervised classification, we further investigate a self-supervised pretraining setting for TWT, in which structured textual embeddings are learned without explicit labels, complementing standard transformer architectures for interpretable representation learning in natural language understanding.
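ISTA-style layers like the T-ISTA mentioned above are built around the soft-thresholding (shrinkage) operator, the proximal map of the l1 norm. The standalone sketch below shows only that generic operator; the token-conditioned variant in TWT is not reproduced here.

```python
def soft_threshold(x, lam):
    """Elementwise soft-thresholding: soft(v) = sign(v) * max(|v| - lam, 0).

    Values with magnitude below `lam` are zeroed out, which is what
    sparsifies the feature representation in each iterative
    shrinkage-thresholding step.
    """
    out = []
    for v in x:
        mag = max(abs(v) - lam, 0.0)
        out.append(mag if v >= 0 else -mag)
    return out

# Small entries are driven exactly to zero; large ones shrink toward zero.
shrunk = soft_threshold([1.5, -0.2, 0.6], lam=0.5)
```

Applying this operator inside each layer is what lets the network's forward pass be read as an explicit sparse-coding optimization rather than an opaque transformation.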
Reasoning video object segmentation (ReaVOS) aims to segment referred objects in video sequences based on implicit and complex linguistic queries. Existing methods typically compress limited video frames into pooled representations and prompt multimodal large language models (MLLMs) to generate a single global segmentation token. However, this strategy lacks explicit contextual guidance and causes substantial loss of spatial details, limiting capability and segmentation consistency. To overcome these limitations, we introduce Context-infused Consistent Video Segmentor (CiCVS), a novel framework leveraging contextual information to guide generation of temporally coherent and accurate mask trajectories. CiCVS incorporates a Hierarchical Frame Sampling (HFS) module, which globally samples support frames across the entire video to ensure broad temporal coverage, and then uniformly selects target frames within the support set. It also employs a Contextual Token Prompting (CTP) module, which utilizes contextual cues from support frames to guide the MLLM in generating specialized tokens for various target frames, enabling the model to capture intricate temporal patterns and ensure consistency across long-range sequences. At the core of CTP is the Multimodal Injection Compressor (MIC) block, which efficiently integrates support frame features and textual semantic information into a compact set of latent queries, enhancing temporal-level object perception. To further advance the ReaVOS field, we introduce the CoCoRVOS benchmark, which features more temporally intricate reasoning instructions and a diverse set of video scenarios. Extensive experiments demonstrate that CiCVS establishes a new state-of-the-art on multiple benchmarks, achieving significant improvements in J&F scores, including +2.7 on CoCoRVOS, +1.4 on ReVOS, and +7.0 on ReasonVOS, underscoring its superior contextual reasoning and segmentation capabilities.