Large Vision-Language Models (LVLMs) suffer from the high computational cost of the attention mechanism caused by the large number of visual tokens. Token reduction has emerged as a promising approach to reduce this complexity by eliminating redundant visual tokens. However, existing token reduction methods struggle to preserve task-relevant tokens and eliminate irrelevant ones. This is due to the attention biases of LVLMs, where tokens with high attention scores are not always the critical ones. Such biases force existing methods into a dilemma: they face either high performance degradation or limited inference acceleration. This issue becomes more severe in fine-grained perception tasks, which rely heavily on the fine-grained information stored in specific visual tokens. To address this issue, we propose an unbiased fine-grained token reduction method named FinePruner, which explores the attention patterns of LVLMs at the attention-head level to mitigate the interference of attention biases. Concretely, we first conduct comparative studies to validate the impact of tokens corresponding to visual objects on final task performance, establishing that these tokens should be preserved while others can be pruned. A series of visualizations then reveals how LVLMs' attention biases change across layers and attention heads. Based on these patterns, the FinePruner pipeline is divided into two stages. The first stage, Instruction-Agnostic Clustering, clusters visual tokens into groups according to their embeddings to exclude the attention biases. The second stage, Attention-Refined Pruning, selects less-biased attention heads via a divergence measure and uses them to identify the tokens to preserve. Experiments on VQA benchmarks and fine-grained benchmarks demonstrate that our FinePruner achieves better accuracy-efficiency tradeoffs than state-of-the-art methods.
The code is available at https://github.com/PKU-ICST-MIPL/FinePruner_TIP2026.
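The Instruction-Agnostic Clustering stage described above can be sketched with ordinary k-means over token embeddings. This is a minimal pure-Python illustration, not FinePruner's actual implementation: the function names, the farthest-point initialization (chosen here for determinism), and the iteration count are all assumptions.

```python
def _dist2(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cluster_tokens(tokens, k, iters=10):
    """Toy instruction-agnostic clustering of visual-token embeddings.

    A k-means sketch: tokens are grouped purely by embedding
    similarity, so no attention scores (and hence no attention
    biases) enter the grouping decision.
    """
    # Farthest-point initialization: deterministic, spread-out centers.
    centers = [list(tokens[0])]
    while len(centers) < k:
        nxt = max(tokens, key=lambda t: min(_dist2(t, c) for c in centers))
        centers.append(list(nxt))
    assign = [0] * len(tokens)
    for _ in range(iters):
        # Assign each token to its nearest center.
        for i, t in enumerate(tokens):
            assign[i] = min(range(k), key=lambda c: _dist2(t, centers[c]))
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [t for i, t in enumerate(tokens) if assign[i] == c]
            if members:
                centers[c] = [sum(d) / len(members) for d in zip(*members)]
    return assign

# Two well-separated groups of 2-D "token embeddings".
labels = cluster_tokens([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]], k=2)
```

Because the grouping looks only at embeddings, a subsequent pruning stage can then reason per group instead of per raw attention score.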
Recent progress has been made in human-human interaction generation. However, directly generating complex two-person interactive motions remains a significant challenge. Moreover, existing models typically employ two independent timelines when generating motions for interactive scenarios involving two individuals. This design overlooks the temporal dependencies between motions at each timestep and fails to account for the roles of active and reactive participants during generation, often resulting in unrealistic and unnatural motions. In this work, we propose HiTMM, a novel framework for Human interaction generation based on Temporal Masked Modeling. HiTMM first decomposes the human interaction into two separate single-person motions. Individual motions within the interaction belong to the same type, enabling them to be mapped to a shared latent space through a coarse-to-fine approach that produces multi-layer discrete tokens. We then arrange all tokens of the two interacting individuals along a shared timeline. Subsequently, we employ a masked transformer and a residual transformer to model the base-layer and rest-layer motion tokens. Both the base-layer and rest-layer motion tokens are arranged along a single timeline, allowing the model to explicitly capture the temporal order and initiating role embedded in the sequence, where the first individual's motion initiates the interaction. Notably, our model uses a shared temporal representation, making it capable of performing temporal editing on specific regions within human interaction sequences. Experimental results show that our model achieves an FID of 5.017 on the InterHuman dataset, surpassing the current state-of-the-art model (vs. 5.154 for InterMask), and an FID of 0.373 on the InterX dataset (vs. 0.399 for InterMask). Project URL: https://jiaozicheng.github.io/HiTMM/.
The Transformer architecture widely adopted in large language models (LLMs) suffers from limited inference efficiency due to the inherently sequential nature of autoregressive token generation. To address this issue, speculative decoding (SD) has been proposed to accelerate LLM inference by employing small speculative models (SSMs) to generate candidate tokens that are subsequently verified by the target LLM. However, SD methods are often constrained by a key challenge: the low acceptance rate of tokens predicted by SSMs. To overcome this limitation, this paper proposes a Dual-Stream Network Architecture (DSNA), which introduces two parallel processing streams that simultaneously model word sequences and feature sequences. The outputs of these two streams are progressively fused in subsequent stages to enhance the quality of candidate predictions. Furthermore, a dynamic multi-path decoding (DMPD) mechanism is introduced to leverage the enriched representations produced by the dual-stream architecture. This mechanism allows multiple candidate token paths to be evaluated simultaneously, enabling the model to accept multiple tokens within a single forward propagation step during inference. Extensive experiments show that our proposed method consistently outperforms state-of-the-art SD approaches, achieving significant improvements in both inference throughput and generation accuracy across multiple benchmarks.
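The draft-and-verify loop underlying speculative decoding can be illustrated in a few lines. The sketch below shows only the simplest greedy-verification variant, not the paper's DSNA/DMPD mechanism; `target_next` is a hypothetical stand-in for one target-LLM forward pass that returns the greedy next token for a prefix.

```python
def speculative_verify(draft_tokens, target_next):
    """Greedy speculative-decoding verification sketch.

    The target model checks every drafted token in one pass: the
    prefix that matches its own greedy choices is accepted, plus one
    corrected token at the first mismatch. Accepting k tokens per
    forward pass is where the speed-up comes from.
    """
    accepted = []
    for tok in draft_tokens:
        want = target_next(accepted)
        if tok == want:
            accepted.append(tok)
        else:
            accepted.append(want)  # target's correction ends the round
            break
    return accepted

def target(prefix):
    """Toy target model: always continues the pattern A B A B ..."""
    return "A" if len(prefix) % 2 == 0 else "B"

out = speculative_verify(["A", "B", "B"], target)  # third draft rejected
```

The acceptance rate of the drafted prefix is exactly the quantity DSNA aims to raise by improving candidate quality.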
Fine-grained bird image classification (FBIC) is crucial for ecological monitoring and biodiversity conservation, yet it remains challenging under camouflaged appearances, body occlusions, and arbitrary postures. To address these issues, we propose PteFBIC, which enhances fine-grained discriminability by modeling interregional relationships among pteryla-related appearance cues, including the regional organization of texture and color patterns as well as their cross-region transitions and complementarities. Specifically, we design a pteryla token construction module to generate pteryla-related tokens from an orientation-enhanced feature representation for subsequent relationship modeling. Furthermore, a pteryla relationship mining (PRM) module fuses global visual tokens with pteryla-related tokens to explicitly capture dependencies such as orientation-consistent texture organization, cross-region texture transitions, and complementary appearance variations. In addition, a key cue extraction (KCE) module is introduced to aggregate multiscale discriminative evidence, thereby improving robustness to pose variations and local occlusions. Experiments on CUB-200-2011 and NABirds demonstrate that PteFBIC consistently outperforms a wide range of state-of-the-art (SOTA) methods. The code of PteFBIC is available at https://github.com/she3333/PteFBIC.
Prompt tuning methods use learnable tokens for parameter-efficient downstream adaptation on large pre-trained models. However, for dual-modal visual-language pre-trained models (VLPMs), existing prompt tuning methods overlook the preservation of pre-trained text-image alignment during fine-tuning. To address this issue, we propose Synchronous Dual Prompt Tuning (SDPT). SDPT initializes a single set of learnable unified prototype tokens in the established modal aligning space to represent the aligned semantics of text and image modalities for downstream tasks. Furthermore, SDPT establishes inverse linear projections, whose projection matrices need no training, to embed the information of learnable unified prototype tokens into the input space of different modalities. The inverse linear projections allow the unified prototype token to synchronously represent the two modalities and enable SDPT to share the unified semantics of text and image for downstream tasks across different modal prompts. Experimental results demonstrate that SDPT assists VLPMs to achieve superior outcomes with only 0.04% of model parameters for training across various scenarios, outperforming other single- or dual-modal methods. The code is released at github/SDPT.
Proportional, fair scheduling in OFDM-based vehicle-to-everything (V2X) uplink causes the resource-block allocation of each vehicle to vary from slot to slot, yet conventional semantic encoders produce a fixed number of output tokens regardless of the instantaneous channel capacity. When the encoder output exceeds the slot budget, transmitted features are truncated and the resulting federated learning gradient is corrupted, a problem that affected 23% of training rounds for non-line-of-sight vehicles in our experiments. The difficulty is worsened by a spatial pattern common in urban deployments: vehicles at congested intersections suffer the poorest propagation conditions while carrying the training data most relevant to safety, and throughput-driven client selection excludes them in favor of vehicles with strong channels but uninformative scenes. We address both issues within a single framework for OFDM-based V2X federated learning. On the transmission side, a Sensing-Guided Adaptive Modulation (SGAM) module derives a per-slot token budget from the current resource-block allocation and selects tokens through differentiable Gumbel-TopK pruning with a hard capacity clip, so the transmitted token count stays within the slot budget. On the scheduling side, a Channel-Decoupled Federated Learning (CDFL) module partitions clients independently by channel quality and data complexity, selects diverse representatives per partition via facility location optimization, and corrects for partition-size imbalance through inverse propensity weighting during model aggregation. Experiments on NuScenes with 20 non-IID vehicular clients under realistic OFDM channel simulation demonstrate a Macro-F1 of 0.710 (+8.7 points over the Oort-adapted baseline), zero budget violations throughout training, and a 75% reduction in training variance; the worst-class F1 more than doubles relative to FedAvg.
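The hard forward pass of Gumbel-TopK pruning with a capacity clip can be sketched as follows. This is an illustration of the general perturb-and-top-k mechanism, not SGAM's implementation; the function name, temperature default, and seeding are assumptions, and the training-time softmax relaxation that makes selection differentiable is omitted.

```python
import math
import random

def gumbel_topk_prune(scores, budget, tau=1.0, seed=0):
    """Pick at most `budget` token indices by perturbed top-k.

    Each token's importance score is perturbed with Gumbel(0, 1)
    noise, the top-`budget` indices survive, and the hard clip
    guarantees the per-slot token budget is never exceeded.
    """
    rng = random.Random(seed)
    # Gumbel(0, 1) noise: -log(-log(U)) for U ~ Uniform(0, 1).
    keys = [
        s / tau - math.log(-math.log(max(rng.random(), 1e-12)))
        for s in scores
    ]
    order = sorted(range(len(scores)), key=keys.__getitem__, reverse=True)
    return sorted(order[:budget])  # hard capacity clip

kept = gumbel_topk_prune([0.9, 0.1, 0.8, 0.2, 0.7], budget=3)
```

Because the clip is applied after ranking, the transmitted token count can never exceed the slot budget, which is exactly the zero-budget-violation property reported above.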
Background/Objectives: Breast cancer continues to be one of the most serious and common diseases affecting women around the globe. Although ultrasound imaging is an effective method for detecting abnormalities in dense breast tissue, it has several drawbacks, including the subjectivity and variability of interpretation, which depend on the cognitive biases and experience of the interpreting expert. These factors drive the growing need for AI-driven models for diagnostic analysis. In this research, we provide a hybrid deep learning framework for cancer classification on the breast ultrasound image dataset (the BUSI dataset). Methods: The proposed architecture combines a lightweight ViT encoder with an EfficientNetV2-RW-S feature extractor, leveraging the complementary strengths of convolutional neural networks (CNNs) and transformers: EfficientNetV2 captures the fine-grained morphological components of lesions, edges, and echogenic variations of the tissue, whereas the transformer models the long-range dependencies between lesions and the surrounding tissue. Results: The proposed hybrid model achieves an enhanced classification accuracy of 97.95%, surpassing the standalone ViT (89%) and CNN (80%) frameworks. Furthermore, the hybrid architecture reduces the self-attention computational complexity by shrinking the token count from 197 to 10, which leads to a substantial decrease in the memory and compute expended during the attention process. Conclusions: Overall, this study delivers improved diagnostic accuracy at reduced computational cost, suggesting the proposed architecture as a potential framework for contemporary clinical environments.
With the rapid advancement of vision-language models (VLMs) in general-purpose settings, their application to cross-modal retrieval and semantic understanding of large-scale multimodal remote sensing (RS) data is emerging as a key enabler for urban governance, environmental monitoring, and disaster response. However, the pervasive issue of semantic shift in RS imagery poses a significant challenge to the transferability of pre-trained VLMs. To address this limitation, we propose ReCoTR, an enhanced CLIP-based cross-modal retrieval framework tailored for remote sensing applications. ReCoTR tackles region-level granularity bias and contextual semantic drift through a Dual Consensus Token Evaluation (DCTE) module, which leverages a mixture-of-experts strategy to fuse inter-modal semantic consensus with intra-modal structural consistency, enabling fine-grained estimation of semantic confidence for visual tokens. Moreover, to mitigate representational contamination caused by background noise, we introduce the Semantic Confidence Token Compression (SCTC) module. This module selectively filters and aggregates tokens with high semantic relevance, thus reducing redundancy and alleviating the noise amplification inherent in CLIP's average pooling. Experimental results on three benchmark RS cross-modal retrieval datasets demonstrate that ReCoTR consistently outperforms existing methods on bidirectional image-text retrieval tasks, validating its effectiveness and robustness in remote sensing semantic alignment scenarios. Our source codes are available at: https://github.com/Jerry710/ReCoTR.git.
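The filter-then-aggregate idea behind confidence-guided token compression can be sketched in a few lines. This is a generic illustration, not the SCTC module itself: the function name, the keep ratio, and the plain mean-pooling are all assumptions.

```python
def confidence_pool(tokens, conf, keep_ratio=0.5):
    """Confidence-guided token compression sketch.

    Keep only the highest-confidence visual tokens and mean-pool
    those, instead of average-pooling every token (which lets
    low-confidence background tokens contaminate the representation).
    """
    k = max(1, int(len(tokens) * keep_ratio))
    # Rank token indices by semantic confidence, keep the top k.
    idx = sorted(range(len(tokens)), key=lambda i: conf[i], reverse=True)[:k]
    kept = [tokens[i] for i in idx]
    # Mean-pool only the surviving tokens, dimension by dimension.
    return [sum(d) / len(kept) for d in zip(*kept)]

# Four 2-D "visual tokens"; the noisy ones carry low confidence.
pooled = confidence_pool(
    tokens=[[1.0, 1.0], [0.0, 0.0], [3.0, 3.0], [2.0, 2.0]],
    conf=[0.9, 0.1, 0.8, 0.2],
)
```

With the confidences above, only the first and third tokens survive, so the pooled vector reflects the high-confidence content rather than the background.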
Sleep stage classification based on electroencephalography (EEG) is fundamental for assessing sleep quality and diagnosing sleep-related disorders. However, most traditional machine learning methods rely heavily on prior knowledge and handcrafted features, while existing deep learning models still struggle to jointly capture fine-grained time-frequency patterns and achieve clinical interpretability. Recently, vision-language models (VLMs) have made significant progress in the medical domain, yet their performance remains constrained when applied to physiological waveform data, especially EEG signals, due to their limited visual understanding and insufficient reasoning capability. To address these challenges, we propose EEG-VLM, a hierarchical vision-language framework that integrates multi-level feature alignment with visually enhanced language-guided reasoning for interpretable EEG-based sleep stage classification. Specifically, a specialized visual enhancement module constructs high-level visual tokens from intermediate-layer features to extract rich semantic representations of EEG images. These tokens are further aligned with low-level CLIP features through a multi-level alignment mechanism, enhancing the VLM's image-processing capability. In addition, a Chain-of-Thought (CoT) reasoning strategy decomposes complex medical inference into interpretable logical steps, effectively simulating expert-like decision-making. Experimental results demonstrate that the proposed method significantly improves both the accuracy and interpretability of VLMs in EEG-based sleep stage classification, showing promising potential for automated and explainable EEG analysis in clinical settings.
This study examined the impacts of Attachment and Biobehavioral Catch-up (ABC) on expressive language of Latine toddlers living with their biological parents (N = 173), randomized to home-based Early Head Start supplemented by ABC, or a control group. Mothers' mean age was 30.9 years (SD = 6.5); toddlers' (49.7% male) mean age was 13.0 months (SD = 4.1). Compared to controls, children in the ABC group produced significantly higher numbers of utterances, morphemes, types, and tokens, as well as greater mean length of utterances in words and morphemes. An indirect effect of ABC through dyadic mutuality was documented for number of utterances, types, tokens, and morphemes. Findings are considered in the context of the literature on the developmental impacts of parenting interventions. This study evaluated whether a parenting program (i.e., Attachment and Biobehavioral Catch-up), implemented as part of an early childhood prevention program (i.e., Early Head Start), improved the spoken language skills of toddlers from Latine families experiencing poverty. We found that the parenting program did improve these toddlers' language skills in many ways, including the numbers and types of words they used. We also found that the quality of the interaction between mothers and toddlers facilitated their language use. Our findings suggest that children enrolled in early childhood preventative programs would benefit from these more targeted parenting programs, particularly regarding language development.
Deoxyribonucleic acid provides unmatched information density and longevity for data storage, yet its easy amplification by polymerase chain reaction enables unauthorized replication at negligible cost. We introduce ZAT-DNA, which encodes information in patterns of canonical adenine and noncanonical 2-aminoadenine. As DNA polymerases cannot distinguish adenine from 2-aminoadenine, polymerase-based amplification erases these patterns, enforcing molecular-layer non-replicability intrinsic to the base-pairing ambiguity. We validate ZAT-DNA for secure key storage, demonstrating error-free encoding, storage, and high-fidelity nanopore retrieval of 32-bit and 64-bit cryptographic keys. ZAT-DNA blocks polymerase-based copying and protects non-fungible tokens by preventing functional duplication. For larger datasets, we present a hybrid "Babel-DNA" architecture: multiple encrypted images are co-encoded in a single regular DNA pool, with each selectively decryptable only via its cognate, non-replicable ZAT-DNA key. This provides a practical framework for molecular access control, secure DNA-encoded databases, and scarce molecular tokens.
Despite recent advances in medical informatics, extracting tumor information from pathology reports remains a challenge in modern cancer registry and surveillance workflows. These documents often have an unstructured format, complex medical content, and a considerably lengthy context, creating significant challenges for automated phenotypic information extraction. Although some recent language models such as BERT, GatorTron, and GPT-4 have demonstrated efficacy in medical applications, they are constrained either by sequence-length limitations or by cloud-based processing that conflicts with the handling of protected health information. We introduce OncoPT, two oncology-optimized transformer models based on the Longformer and BigBird architectures and trained on real-world pathology reports. OncoPT efficiently processes reports up to 4,096 tokens, making it suitable for onsite deployment in hospitals with limited resources. We apply OncoPT to a common malignancy (exemplified by breast cancer) and a rare malignancy (exemplified by gastric cancer), across five key tumor phenotypes: Subsite, Histology, Grade, Stage, and Laterality. The results demonstrate that OncoPT achieves state-of-the-art weighted F1 scores on a private pathology dataset and surpasses commercial chatbots (ChatGPT 4o and o1) on the public CORAL dataset (up to 30% improvement). These findings highlight the robustness of OncoPT models with the added benefit of preserving the privacy of patient health information.
The growing interest in the application of Large Language Models (LLMs) for healthcare comes with a demand for better open-source LLMs, and stronger reassurances regarding their performance. To advance in this direction, this work conducts a thorough and transparent study of LLM model training and benchmarking in healthcare, releasing as open assets all resources needed for reproducing the Aloe models and its results (weights, data and code). This includes details on optimized data preprocessing and training, combining curated public data with synthetic samples for a total of 1.8B training tokens; enhanced safety, induced through Direct Preference Optimization (DPO), aligning Aloe models for ethical robustness and against jailbreaking attacks; and finally model performance, evaluated thoroughly through close-ended, open-ended, safety, and human assessments. To boost inference efficacy and test the upper bounds of open LLM performance, Aloe models are integrated with a Retrieval-Augmented Generation (RAG) system. The resultant models deliver competitive performance across healthcare benchmarks and medical fields while significantly improving safety and bias resilience. Model weights are released for research-only purposes, together with training and evaluation datasets, and RAG inference code. To enable the responsible release of such technology, this work is supported by a detailed healthcare-specific risk assessment. Building on top of base models like Llama 3.1 and Qwen 2.5, the Aloe models and their development recipe set a high standard for open-source medical LLMs, balancing top-tier performance with high ethical requirements.
Large language models have demonstrated remarkable progress in mathematical reasoning, leveraging chain-of-thought and reinforcement learning. However, many open questions remain regarding the interplay between reasoning token usage and accuracy gains. In particular, when comparing models across generations, it is unclear whether improved performance results from longer reasoning chains or more effective reasoning. We systematically analyze reasoning chain length across o1-mini and o3-mini variants on the Omni-MATH benchmark, finding that o3-mini (m) achieves superior accuracy without requiring longer reasoning chains than o1-mini. Moreover, we show that accuracy generally declines as reasoning chains grow across all models and compute settings, even when controlling for difficulty of the questions. This accuracy drop is significantly smaller in more proficient models, suggesting that new generations of reasoning models use test-time compute more effectively. Finally, we highlight that while o3-mini (h) achieves a marginal accuracy gain over o3-mini (m), it does so by allocating substantially more reasoning tokens across all problems, even the ones that o3-mini (m) can already solve. These findings provide new insights into the relationship between model capability and reasoning length, with implications for efficiency, scaling, and evaluation methodologies.
This study investigates public perception of the metaverse through a large-scale computational analysis of 52,874 English-language tweets. Leveraging sentiment analysis tools (VADER and RoBERTa) and unsupervised topic modeling (BERTopic), we categorize discourse into four thematic domains: general metaverse discussion, Meta's Horizon Worlds, metaverse-related cryptocurrency tokens, and virtual social events. Our findings reveal that 43.0% of tweets express positive sentiment, driven by enthusiasm for immersive innovation and digital transformation, while 23.6% convey skepticism, primarily concerning platform reliability, corporate dominance, and privacy. Sentiment surrounding Horizon Worlds reflects a paradox: underlying optimism is overshadowed by user frustration, with negative tweets generating disproportionately high engagement. Analysis of metaverse token discourse indicates robust investor interest, tempered by persistent concerns over market volatility and fraudulent schemes. Topic modeling further uncovers a notable narrative shift from speculative price-focused discussions toward utility-driven use cases. Virtual events (e.g., digital weddings, concerts) elicit the most positive sentiment (51.3%), with users frequently expressing emotional resonance and communal belonging, as visually reinforced by word cloud analysis. This research contributes to the literature on digital adoption and emerging technologies by mapping the evolving social discourse of the metaverse. It offers actionable insights for platform developers, investors, and educators seeking to align innovation with user expectations and provides a predictive lens for forecasting public readiness for the next generation of digital interaction.
Dynamic State Tracking (DST) is pivotal for personalized recommender systems and user modeling, aiming to estimate users' evolving latent states from sequential interactions. However, existing deep sequence paradigms predominantly treat interaction entities, such as semantic tokens and items, as isolated deterministic vectors. This approach often overlooks latent structural dependencies among entities, including knowledge graph topologies, and remains limited in quantifying the epistemic uncertainty inherent in stochastic user behaviors caused by random interactions and aleatoric noise. To address these dual challenges, we propose Adaptive G-UKT (Adaptive Graph-Enhanced Uncertainty-aware Knowledge Tracing), a unified probabilistic framework for temporal sequence modeling. Unlike traditional point-estimation models, we map hidden user states into Gaussian distributions, enabling the simultaneous tracking of semantic activation levels and estimation confidence through diagonal covariance. To mitigate data sparsity, we design an Adaptive Graph Learner that autonomously infers latent semantic correlations from raw data, coupled with an Adaptive Gaussian-HGNN that propagates uncertainty information across the dynamically learned topology. Furthermore, we introduce a Wasserstein attention mechanism to perform distribution-aware sequence retrieval and an uncertainty-guided contrastive learning strategy to enhance model robustness against noisy interactions. Extensive experiments on four large-scale real-world sequential datasets, namely ASSISTments2009, Bridge2Algebra2006, Algebra2005, and NIPS34, demonstrate that Adaptive G-UKT achieves competitive performance against state-of-the-art baselines, showing particularly significant gains in sparse data regimes. 
Crucially, visualization analysis confirms the model's capacity to autonomously uncover intrinsic structural topologies, bridging the critical gap between high-precision deep sequence learning and interpretable knowledge graph reasoning.
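The Wasserstein attention described above needs a distance between Gaussian user states. For diagonal covariances the squared 2-Wasserstein distance has a closed form, sketched below; the function name and the exp(-W2²) similarity suggestion are illustrative, not the paper's exact formulation.

```python
import math

def w2_diag_gauss(mu1, var1, mu2, var2):
    """Squared 2-Wasserstein distance between two diagonal Gaussians.

    For commuting (here: diagonal) covariances this reduces to
        ||mu1 - mu2||^2 + sum_d (sqrt(var1_d) - sqrt(var2_d))^2,
    a quantity a Wasserstein attention mechanism can turn into a
    similarity score (e.g. via exp(-W2^2)) between distributional
    hidden states, so that both the state estimate and its
    uncertainty influence retrieval.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum(
        (math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var1, var2)
    )
    return mean_term + cov_term

# Identical covariances: the distance reduces to the squared mean gap.
d = w2_diag_gauss([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0])
```

Unlike a dot product between point estimates, this distance grows when either state is uncertain in a dimension where the other is confident, which is what makes the retrieval distribution-aware.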
From work emerging through the middle of the 20th century, the essence of meaning has become widely accepted as being described by the three orthogonal dimensions of valence, arousal, and dominance. These essential dimensions have become the cornerstone of sentiment analysis across many fields. By reexamining first types and then tokens for the English language, and through the use of automatically annotated histograms ("ousiograms"), we find here that the essence of meaning conveyed by words is instead best described by a goodness-power-aggression-danger-structure (GPADS) circumplex framework; that large-scale English language corpora reveal a systematic bias toward safe, low-danger words; and that the power-danger-structure framework is the minimal framework that represents essential meaning. We find remarkable congruences between the GPADS framework and other spaces including mental states and fictional archetypes, and we construct and demonstrate a prototype ousiometer.
Recent advancements in large language models (LLMs) have showcased remarkable text generation capabilities. However, due to the inherent ambiguity of natural language and the unstructured nature of the text modality, LLMs still struggle to integrate structured information (e.g., graphs) effectively. This hinders their ability to leverage high-quality structured data in specialized domains. Thus, recent research has explored various methods to integrate graph structures into LLMs to improve generation. However, existing methods typically compress the graph's structural information into only a single token, which is concatenated with detailed text tokens for LLMs, restricting their ability to capture deep semantic and structural information. To overcome these limitations, we propose GRaph-Augmented Fine-grained Fusion (GRAFF), a novel method that integrates fine-grained node-level structural information with corresponding text entities into LLMs via a lightweight structure-adapter module. Specifically, we introduce a dual-channel graph input mechanism to separate structural and semantic components for graph encoding, producing more expressive graph representations. We then incorporate a graph attention (GAT) module into LLMs' intermediate decoder layers to process structural information, enhancing the model's capability in graph-based question answering. Extensive experiments show that GRAFF significantly improves LLMs' graph-understanding ability in question answering, outperforming baselines by an average of 10.14% across four datasets. The official code for this work is available at https://github.com/hcpv/GRAFF.
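A GAT module of the kind mentioned above computes per-neighbor attention coefficients. The single-head sketch below follows the standard GAT formulation (LeakyReLU-scored concatenation, softmax over neighbors); it is illustrative of the building block, not GRAFF's adapter, and all shapes and names are assumptions.

```python
import math

def gat_attention(h, adj, W, a, slope=0.2):
    """Single-head graph-attention coefficients in the GAT style.

    h:   list of node feature vectors
    adj: adj[i] = neighbor indices of node i (include i for self-loops)
    W:   shared weight matrix as a list of rows
    a:   attention vector scored against the concatenation [Wh_i || Wh_j]
    """
    def matvec(M, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in M]

    z = [matvec(W, hi) for hi in h]
    alphas = {}
    for i, nbrs in enumerate(adj):
        e = []
        for j in nbrs:
            s = sum(ak * vk for ak, vk in zip(a, z[i] + z[j]))
            e.append(s if s > 0 else slope * s)  # LeakyReLU
        m = max(e)  # subtract max for a numerically stable softmax
        ex = [math.exp(v - m) for v in e]
        tot = sum(ex)
        alphas[i] = {j: v / tot for j, v in zip(nbrs, ex)}
    return alphas

# Tiny 3-node graph, identity W, uniform attention vector.
coeffs = gat_attention(
    h=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    adj=[[0, 1, 2], [1, 0], [2]],
    W=[[1.0, 0.0], [0.0, 1.0]],
    a=[0.25, 0.25, 0.25, 0.25],
)
```

Each node then aggregates its neighbors' transformed features weighted by these coefficients, which is how node-level structure can flow into a decoder layer.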
Despite making significant advancements, deep learning is still confronted with inherent challenges due to its black-box nature. Existing approaches usually employ explicit data distribution learning methods to enhance interpretability. However, such methods tend to overlook the relational constraints among tokens during interpretation and fall short in natural language understanding tasks. In this paper, we propose a textual white-box transformer for natural language understanding, named TWT, which optimizes a sparse-rate-reduction objective under token relational constraints. Specifically, the low-rank sparse embedding strategy (LSES) and the label-interacted mapping mechanism (LMM) in the preprocessing layer exploit the characteristics of natural language. The multi-head subspace self-attention (MSSA) and the token-conditioned iterative shrinkage-thresholding algorithm (T-ISTA) in the transformer layer are employed to maximize rate reduction and sparsify feature representations. Extensive experimental results on four widely used text classification datasets demonstrate that our proposed method performs better than the state-of-the-art baselines and shows consistent performance while maintaining simplicity and interpretability. Beyond downstream supervised classification, we further investigate a self-supervised pretraining setting for TWT, in which structured textual embeddings are learned without explicit labels, complementing standard transformer architectures for interpretable representation learning in natural language understanding.
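ISTA-style layers like the T-ISTA mentioned above are built around the soft-thresholding (shrinkage) operator, the proximal map of the l1 norm. The standalone sketch below shows only that generic operator; the token-conditioned variant in TWT is not reproduced here.

```python
def soft_threshold(x, lam):
    """Elementwise soft-thresholding: soft(v) = sign(v) * max(|v| - lam, 0).

    Values with magnitude below `lam` are zeroed out, which is what
    sparsifies the feature representation in each iterative
    shrinkage-thresholding step.
    """
    out = []
    for v in x:
        mag = max(abs(v) - lam, 0.0)
        out.append(mag if v >= 0 else -mag)
    return out

# Small entries are driven exactly to zero; large ones shrink toward zero.
shrunk = soft_threshold([1.5, -0.2, 0.6], lam=0.5)
```

Applying this operator inside each layer is what lets the network's forward pass be read as an explicit sparse-coding optimization rather than an opaque transformation.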
Reasoning video object segmentation (ReaVOS) aims to segment referred objects in video sequences based on implicit and complex linguistic queries. Existing methods typically compress limited video frames into pooled representations and prompt multimodal large language models (MLLMs) to generate a single global segmentation token. However, this strategy lacks explicit contextual guidance and causes substantial loss of spatial details, limiting capability and segmentation consistency. To overcome these limitations, we introduce Context-infused Consistent Video Segmentor (CiCVS), a novel framework leveraging contextual information to guide generation of temporally coherent and accurate mask trajectories. CiCVS incorporates a Hierarchical Frame Sampling (HFS) module, which globally samples support frames across the entire video to ensure broad temporal coverage, and then uniformly selects target frames within the support set. It also employs a Contextual Token Prompting (CTP) module, which utilizes contextual cues from support frames to guide the MLLM in generating specialized tokens for various target frames, enabling the model to capture intricate temporal patterns and ensure consistency across long-range sequences. At the core of CTP is the Multimodal Injection Compressor (MIC) block, which efficiently integrates support frame features and textual semantic information into a compact set of latent queries, enhancing temporal-level object perception. To further advance the ReaVOS field, we introduce the CoCoRVOS benchmark, which features more temporally intricate reasoning instructions and a diverse set of video scenarios. Extensive experiments demonstrate that CiCVS establishes a new state-of-the-art on multiple benchmarks, achieving significant improvements in J&F scores, including +2.7 on CoCoRVOS, +1.4 on ReVOS, and +7.0 on ReasonVOS, underscoring its superior contextual reasoning and segmentation capabilities.