The rise of electronic cigarettes (e-cigarettes) has generated widespread controversy. As a key platform for public discussion, Twitter/X provides a valuable context for examining how textual stance, images, actor types, and user engagement intersect in e-cigarette discourse. This study analyzed 19,983 image-containing tweets, including 24,676 images and 21,976 replies. Image and text classifications were conducted using an AI-assisted coding approach. User engagement was measured by likes, retweets, and replies. BERTopic was used to identify major reply topics. Promotions (34.6%), Vaping Advocacy and Rights (21.2%), and Health Warnings and Infographics (16.5%) emerged as the dominant image categories. 16.78% of images categorized as Health Warnings and Infographics showed a text-image mismatch. Retailers and vaping communities were more active in image production, whereas health organizations contributed fewer images. In pro-e-cigarette textual contexts, images depicting vaping acts predicted higher levels of likes (B = 0.258, p < 0.001), replies (B = 0.093, p < 0.001), and retweets (B = 0.099, p < 0.01). By contrast, most image types in anti-e-cigarette textual contexts did not significantly predict engagement. User replies included discussions of everyday e-cigarette use, policy debates, and skepticism toward authoritative institutions. E-cigarette images on Twitter/X are not only promotional tools but also part of public debates over health risks, regulation, and vaping rights. Their meanings are shaped by textual stance, and their distribution differs across actor types. These findings suggest that health communication should strengthen its sustained visibility and narrative appeal to respond to pro-vaping narratives and related controversies.
As the consumption of cultural and creative products (CCPs) increasingly shifts to e-commerce channels, consumers rely heavily on online visual-textual displays to perceive and compare cultural value, whereas systematic design-oriented methods for assessing such perceived value remain insufficient. To address this gap, this study proposes a hybrid method for assessing the perceived cultural value of CCPs in e-commerce visual-textual presentation. First, a three-domain indicator framework covering formal representation, usage inference, and meaning construction is developed by integrating hierarchical cultural semantics with narrative-structure organization, and refined through content validity testing. Second, a hybrid DEMATEL-CRITIC-MULTIMOORA model is used to integrate indicator influence and alternative-performance variation across indicators for weight determination, followed by multi-perspective alternative ranking. Using cultural aromatherapy burners as a case, Kendall's concordance tests reveal evident divergence in expert judgments, with W = 0.128 and p = 0.797 for indicator interaction judgments, and W = 0.414 and p = 0.066 for indicator-level assessments of alternative performance. Within this context, the model outputs should be interpreted as structured perceived cultural value assessment results aggregated from heterogeneous expert judgments, rather than as objective measurements of products' intrinsic cultural quality. External questionnaire validation demonstrates relatively high rank-order consistency between the model results and user-side assessments (Spearman's ρ = 0.829; Kendall's τ = 0.733). The proposed approach transforms implicit cultural semantic features of CCPs into observable and comparable evidence, providing a decision-support reference for competitor comparison, visual-textual presentation refinement, and CCP design practice.
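As a rough illustration of the rank-order consistency check described above, the sketch below computes Spearman's ρ and Kendall's τ for two rankings without ties. The two rankings are hypothetical: the abstract reports neither the rankings nor the number of alternatives, and six items were chosen here only because one such arrangement happens to reproduce the reported values.

```python
from itertools import combinations


def spearman_rho(rank_a, rank_b):
    """Spearman's rho for two rankings of the same items (no ties assumed)."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a for two rankings of the same items (no ties assumed)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical rankings of six alternatives (not data from the study).
model_rank = [1, 2, 3, 4, 5, 6]
user_rank = [2, 3, 1, 4, 5, 6]

rho = spearman_rho(model_rank, user_rank)
tau = kendall_tau(model_rank, user_rank)
print(f"rho = {rho:.3f}, tau = {tau:.3f}")
```

With these hypothetical rankings, only two of the fifteen item pairs are ordered differently, which is the kind of agreement the reported coefficients describe.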
Digital environments have become important contexts in which consumers form sensory expectations and evaluate food quality prior to consumption. Drawing on the elaboration likelihood model and attribution theory, this study develops a theoretically grounded process model to explain how visual and textual cues in online presentations of natural foods shape food-related cognition. Specifically, we propose that perceived naturalness serves as an initial perceptual input that can trigger cognitive engagement through multiple mechanisms: directly, via credibility as a validation mechanism, via taste inference as an experiential simulation, and through a sequential chain in which credibility enables taste inference that subsequently sustains elaboration. A 2 (platform type: content-oriented vs. transaction-oriented) × 2 (image scene: lifestyle-oriented vs. nature-oriented) × 2 (text framing: consumption-oriented vs. production-oriented) between-subjects experiment (N = 320) was conducted. Partial least squares structural equation modeling was employed to test direct and indirect effects; multi-group analysis examined boundary conditions across experimental contexts; and necessary condition analysis identified minimum required levels of predictors for high engagement states. The results indicate that perceived naturalness has a significant direct effect on cognitive engagement, as well as indirect effects through credibility and taste inference independently and in sequence. The indirect pathway is more pronounced in content-oriented environments, particularly when nature-oriented images and consumption-oriented text are used. Taste inference emerged as the strongest necessary condition for high cognitive engagement, followed by credibility; perceived naturalness showed a weaker but significant necessity effect. 
These findings demonstrate how visual and textual cues jointly guide anticipatory sensory processing and cognitive engagement in digital food contexts, offering both theoretical contributions to cue-based processing research and practical implications for the design of online presentations of natural foods.
China ratified the WHO Framework Convention on Tobacco Control (FCTC) in 2005; it entered into force domestically on 9 January 2006. In the absence of comprehensive national smoke-free legislation, subnational jurisdictions have enacted dedicated tobacco control regulations heterogeneously. The textual architecture distinguishing FCTC Article 8-compliant from non-compliant drafting has not been systematically characterized at corpus scale. This study compiled 38 Chinese subnational dedicated tobacco control regulations enacted or materially amended after FCTC entry into force (9 January 2006 through 31 December 2024), totaling 100,635 substantive Chinese characters. Each was independently evaluated against the four core requirements of the FCTC Article 8 Implementation Guidelines [FCTC/COP2(7), 2007] and triangulated against an authoritative expert database. A five-layer framework examined corpus scale, compliance, e-cigarette prohibitions, enforcement features, and other FCTC complementary measures. Mann-Whitney U tests with Cliff's delta and Fisher's exact tests were used (two-sided, α = 0.05). Of 38 regulations, 10 (26.3%) met FCTC Article 8 compliance criteria; 28 (73.7%) did not. Compliant regulations had higher median character counts (3249 vs 2572; δ = 0.40, p = 0.066) and clause counts (28.0 vs 23.0; δ = 0.46, p = 0.034). E-cigarette prohibitions (60.0% vs 25.0%, p = 0.062) and cessation service requirements (90.0% vs 50.0%, p = 0.056) were more frequent in compliant regulations. The 10 compliant jurisdictions cover an estimated 121.6 million residents (8.6% of mainland China's 2020 population). Greater corpus elaboration, more complete enforcement specification, and inclusion of FCTC-aligned complementary measures are textual features systematically associated with (though not shown to cause) FCTC Article 8 compliance in Chinese subnational law.
These findings characterize legal text rather than implementation outcomes and inform drafting guidance for Chinese cities and other jurisdictions pursuing Article 8 implementation.
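The two Fisher's exact comparisons reported above can be checked with a short script. This is a minimal sketch, not the authors' analysis code: the cell counts are inferred from the reported percentages and group sizes (for example, 60.0% of 10 compliant regulations gives 6, and 25.0% of 28 non-compliant regulations gives 7), and the two-sided p-value is taken as the sum of all table probabilities no larger than that of the observed table.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins whose probability does not exceed that of the observed table.
    """
    row1 = a + b
    col1 = a + c
    n = a + b + c + d

    def weight(k):
        # Unnormalized probability of a table whose top-left cell equals k.
        return comb(col1, k) * comb(n - col1, row1 - k)

    k_min = max(0, row1 - (n - col1))
    k_max = min(row1, col1)
    observed = weight(a)
    extreme = sum(
        weight(k) for k in range(k_min, k_max + 1) if weight(k) <= observed
    )
    return extreme / comb(n, row1)


# Counts inferred from the reported percentages (rows: compliant,
# non-compliant; columns: feature present, feature absent).
p_ecig = fisher_exact_two_sided(6, 4, 7, 21)        # 60.0% vs 25.0%
p_cessation = fisher_exact_two_sided(9, 1, 14, 14)  # 90.0% vs 50.0%
print(f"e-cigarette prohibitions: p = {p_ecig:.3f}")
print(f"cessation services:       p = {p_cessation:.3f}")
```

Comparing integer weights rather than floating-point probabilities avoids tie-breaking problems when two tables are equally likely.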
Artificial intelligence (AI) tools based on natural language, such as ChatGPT 4.1 mini (OpenAI Group PBC) and Gemini 2.5 Flash (Alphabet Inc.), are used by patients as sources of medical information. This study aimed to evaluate and compare the quality and readability of the responses these tools provide, in Brazilian Portuguese, regarding rotator cuff surgery. This cross-sectional, descriptive, and comparative study combined qualitative and quantitative approaches. A total of 24 frequently asked patient questions were used, classified according to Rothwell. Each question was entered individually into both platforms, and only the first response was considered. Quality was assessed using the DISCERN instrument, developed by the University of Oxford and the British Library, and the Journal of the American Medical Association (JAMA) benchmark criteria. Readability was estimated using the Análise de Legibilidade Textual (ALT; Portuguese for "Text Readability Analysis") software, validated for Brazilian Portuguese. The statistical analyses included the Wilcoxon and Friedman tests, repeated-measures analysis of variance (ANOVA), and the Conover post hoc test with Bonferroni correction. ChatGPT achieved a mean DISCERN score of 58.7 ± 4.0 and Gemini 56.3 ± 3.5, with no significant difference (p = 0.174) but with a maximum effect size (rank-biserial correlation [rrb] = 1.0). Both models showed a mean readability corresponding to 13.3 years of schooling (p = 1.000). No response met the JAMA benchmark criteria. Value-based questions achieved the highest quality scores, whereas policy-related questions were the most complex in terms of readability. The correlation between quality and readability was moderate (ρ = 0.73; p = 0.099). ChatGPT 4.1 mini and Gemini 2.5 Flash do not yet provide adequate medical information in Brazilian Portuguese regarding editorial reliability, quality, and textual accessibility for the general public.
Visual grounding aims to localize target objects in images based on given textual descriptions, with broad applications in fields such as autonomous driving and human-robot interaction. However, existing visual grounding models still face three major challenges: (1) most prior works employ separate encoders to process images and text independently, which enlarges the semantic gap between visual and textual features; (2) the use of large language models leads to excessive parameters, making deployment on lightweight devices difficult; (3) single-level cross-modal attention mechanisms are insufficient for fully capturing interactive information across modalities. To address these issues, this paper proposes a Task-aware Liquid Cross-modal Network (TLCN), which consists of four key modules: a Feature Extraction Module (FEM), a Liquid Fusion Module (LFM), a Task-aware Cross-modal Refinement Module (TCRM), and a Multilevel Grounding Module (MGM). Specifically, the FEM uses textual features to guide the extraction of visual features, thereby reducing the feature gap. The LFM employs Liquid Neural Networks (LNNs) to capture temporal dependencies and significantly reduce model parameters. Furthermore, the TCRM deepens the textual representation via a second-level attention mechanism, while purpose-designed Conv-Trans Blocks (CTBs) are applied to image data to extract deeper visual features. Additionally, a similarity loss function based on KL divergence is introduced to optimize the cross-modal alignment. The proposed model is extensively evaluated on three widely used public benchmarks: RefCOCO, RefCOCO+, and RefCOCOg. Moreover, a specialized text localization task is designed for further evaluation. Experimental results demonstrate that the TLCN achieves superior performance across all evaluated datasets and tasks.
The superior performance of TLCN validates the effectiveness of its structural designs: text-guided visual extraction successfully bridges the semantic gap, the introduction of LNNs effectively reduces parameter counts for lightweight deployment, and the second-level attention with CTBs sufficiently captures deep cross-modal interactions. These findings suggest that TLCN provides a promising, efficient, and lightweight solution for visual grounding and related localization tasks.
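To make the idea of a KL-divergence-based similarity loss concrete, the sketch below shows one generic way such a loss can be formed: two vectors of similarity scores are turned into probability distributions with a softmax, and the KL divergence between them is used as the penalty. The abstract does not specify how TLCN builds these distributions, so the function names, the inputs, and the direction of the divergence are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q)
    )


def kl_similarity_loss(text_scores, visual_scores):
    """Hypothetical alignment loss: KL between softmaxed score vectors.

    text_scores and visual_scores stand in for per-candidate similarity
    scores produced by the textual and visual branches (an assumption).
    """
    return kl_divergence(softmax(text_scores), softmax(visual_scores))


aligned = kl_similarity_loss([2.0, 0.5, 0.1], [2.0, 0.5, 0.1])
misaligned = kl_similarity_loss([2.0, 0.5, 0.1], [0.1, 0.5, 2.0])
print(aligned, misaligned)
```

The loss is zero when both branches rank the candidates identically and grows as their distributions diverge, which is the property an alignment objective of this kind relies on.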
The rapid spread of misinformation across social media platforms, websites, and online communication channels has made fake news detection a critical task in the digital era. Although various computational approaches have been developed to identify fake news, many existing methods suffer from limitations such as biased training datasets and high rates of false positives and false negatives. To address these challenges, this study proposes a Multimodal Cross Attention Network with Taylor-based Cross Entropy Mean Bias (MMCN_TCMB) model for detecting multimodal fake news. The proposed approach utilizes multimodal inputs consisting of textual and visual content obtained from fake news datasets. The textual information in news posts is first tokenized using Bidirectional Encoder Representations from Transformers (BERT). Feature extraction is then performed using Word2Vec and Term Frequency-Inverse Gravity Moment (TF-IGM). Simultaneously, images associated with news posts undergo preprocessing through Contrast Limited Adaptive Histogram Equalization and Histogram Equalization (CLAHE-HE), followed by feature extraction using ResNet. The extracted textual and visual features are combined and processed through the MMCN framework. The learning mechanism of the network is enhanced using the Taylor-based Cross Entropy Mean Bias (TCMB) loss function to improve classification performance. Experimental results demonstrate that the proposed MMCN_TCMB model achieves superior performance in multimodal fake news detection. The model attains a recall of 97.988%, precision of 96.223%, F1-score of 97.098%, and overall accuracy of 97.436%, outperforming existing methods. The findings indicate that integrating multimodal feature extraction with cross-attention mechanisms and the TCMB loss function significantly enhances the reliability and accuracy of fake news detection. 
The proposed framework effectively captures both textual and visual inconsistencies, making it a promising approach for combating misinformation on modern digital platforms. The code is available at https://github.com/banbhrani84/MMCN_TCMB-Fake-News-.
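As a quick consistency check on the reported metrics, the F1-score can be recomputed as the harmonic mean of the reported precision and recall. This sketch uses only the rounded figures given in the abstract, so a small difference in the last decimal place is expected.

```python
def f1_score(precision, recall):
    """F1 as the harmonic mean of precision and recall (same units)."""
    return 2 * precision * recall / (precision + recall)


# Reported values, in percent.
reported_precision = 96.223
reported_recall = 97.988
reported_f1 = 97.098

f1 = f1_score(reported_precision, reported_recall)
print(f"recomputed F1 = {f1:.3f}% (reported {reported_f1}%)")
```

The recomputed value agrees with the reported F1-score to within rounding of the inputs.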
The increasing use of digital communication platforms has led individuals to express emotions and mental health concerns through text containing implicit emotional cues, informal language, and non-standard expressions. Traditional sentiment analysis systems often struggle to capture these contextual nuances, limiting their effectiveness in mental health-related text analysis. To address this challenge, this study proposes a two-layer framework that combines Azure Sentiment Analysis and Azure Custom Text Classification for sentiment and mental health-related text categorization. In the first layer, user-generated text is classified into positive, neutral, or negative sentiment categories using Azure Sentiment Analysis. Text identified as negative is subsequently analysed using Azure Custom Text Classification to categorize content into predefined mental health-related classes, including Anxiety, Depression, PTSD, Social Anxiety Disorder, and Suicidal Ideation and Behaviour. The proposed framework aims to provide a structured approach for identifying linguistic patterns associated with mental health-related discussions and supporting mental health screening and triage applications. Experimental evaluation using an 80% training and 20% testing split achieved an overall Precision, Recall, and F1-score of 96.97%. Class-level evaluation demonstrated strong performance across multiple categories, with F1-scores ranging from 0.94 to 1.000. The findings indicate that the proposed architecture can effectively classify mental health-related textual content within the evaluated dataset while providing a scalable framework for automated sentiment and text classification. The study contributes to the growing field of intelligent emotional computing and highlights the potential of cloud-based natural language processing tools for mental health-related text analytics.
The reported results are limited to the evaluated dataset and should be interpreted as a text classification and screening approach rather than a clinical diagnostic system. This manuscript presents the computational component of a broader mixed-methods study registered under CTRI/2024/06/068766, titled "Exploring Mental Health Status in a Selected Population: A Corpus Analysis Combining Forensic Linguistics and Psychology - a Mixed Method Study." The current work focuses on the development and validation of an AI-based diagnostic tool for mental health assessment using synthetic and anonymized textual data, constituting a secondary objective of the registered protocol. Registry: Clinical Trials Registry-India (CTRI). Trial registration number: CTRI/2024/06/068766. Date of registration: 12.06.2024.
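The control flow of the two-layer framework can be sketched as a small routing function: every text gets a sentiment label, and only negative texts are passed to the second-layer classifier. The sketch deliberately takes both classifiers as plain callables rather than calling any Azure SDK; `sentiment_fn`, `category_fn`, and the keyword-based stand-ins below are hypothetical placeholders, not the services used in the study.

```python
from typing import Callable, Dict, Optional

# Class names as listed in the abstract.
MENTAL_HEALTH_CLASSES = (
    "Anxiety",
    "Depression",
    "PTSD",
    "Social Anxiety Disorder",
    "Suicidal Ideation and Behaviour",
)


def two_layer_classify(
    text: str,
    sentiment_fn: Callable[[str], str],
    category_fn: Callable[[str], str],
) -> Dict[str, Optional[str]]:
    """Layer 1 labels sentiment; layer 2 runs only on negative text."""
    sentiment = sentiment_fn(text)
    if sentiment != "negative":
        return {"sentiment": sentiment, "category": None}
    category = category_fn(text)
    if category not in MENTAL_HEALTH_CLASSES:
        raise ValueError(f"unexpected category: {category!r}")
    return {"sentiment": sentiment, "category": category}


# Toy stand-ins so the sketch runs; real classifiers would replace these.
def toy_sentiment(text: str) -> str:
    return "negative" if "hopeless" in text.lower() else "neutral"


def toy_category(text: str) -> str:
    return "Depression"


print(two_layer_classify("I feel hopeless lately", toy_sentiment, toy_category))
print(two_layer_classify("Lunch was fine", toy_sentiment, toy_category))
```

Injecting the two classifiers keeps the routing logic testable without network access and makes it explicit that non-negative text never reaches the second layer.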
The performance of five popular, widely available large language models (LLMs), namely ChatGPT-4o, Gemini 2.5 Flash, Llama 4, DeepSeek-V3, and Microsoft Copilot, in operative dentistry education was evaluated using a multiple-choice question (MCQ) assessment. The assessment used a set of 150 MCQs covering endodontics, dental caries, paediatric, preventive, aesthetic and restorative dentistry, biomaterials, and periodontics. The LLMs' performance was assessed using classification metrics (accuracy, sensitivity, predictive reliability), textual similarity metrics (BLEU score, cosine similarity, Word Error Rate [WER]), and readability metrics (Flesch Reading Ease score). The highest classification accuracy was achieved by Gemini 2.5 Flash and ChatGPT-4o, which also showed high sensitivity and high overall predictive reliability. The model with the greatest textual similarity to the reference answers was ChatGPT-4o, with a BLEU score of 0.10 ± 0.0279, a cosine similarity of 0.48 ± 0.0422, a relatively low WER of 5.57 ± 0.7301, and a Flesch Reading Ease score of 13.53 ± 4.9449. In medical education, ChatGPT-4o exhibited the highest accuracy, reference textual overlap, semantic alignment, and readability, together with the fewest errors, among the five evaluated LLMs, making it a valuable assistant for dental healthcare professionals.
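For readers unfamiliar with Word Error Rate, the sketch below shows the standard definition: the word-level edit distance between a model answer and the reference, divided by the number of reference words. It is a generic illustration rather than the study's scoring code, and the example sentences are invented. It also shows why WER can exceed 1, which may be one reason values such as 5.57 arise when model answers are much longer than short reference answers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Invented examples, not items from the study.
exact = word_error_rate("the pulp is vital", "the pulp is vital")
one_deletion = word_error_rate("the pulp is vital", "the pulp vital")
long_answer = word_error_rate("a b", "a b c d e")
print(exact, one_deletion, long_answer)
```

Because the denominator is the reference length, three extra words against a two-word reference already give a WER of 1.5.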
Overweight and obesity are causing a growing public health, economic, and clinical burden, particularly within under-resourced communities. There is an urgent need to develop an in-depth understanding of experiences of weight management as well as preferences for support within under-resourced communities, with a view to developing more effective weight management interventions. Focus groups were run in under-resourced communities using storyboarding, a method used to facilitate inclusive communication (n = 37). Thematic analysis was applied to textual and visual data, and a realist 'lens' applied to provide in-depth insight into weight management experiences and needs. We believe this is the first study to use this combined methodology to explore weight management experiences and needs. Combining storyboarding with a realist lens generated four themes. Living circumstances indicated that mental health, individual needs, and the cost of weight management services were key contextual factors. Mechanisms of weight management identified emotional eating and portion control as central to individual weight management. Yo-yo dieting centered on participants' experiences of weight regain after attempting weight loss. Weight management intervention needs indicated that psychological support was perceived as severely lacking and the only route to attain sustained weight management. Offering both in-person and online support for weight management was considered important to reach more people. Moving weight management support from short to long term and incorporating more robust psychological support would better serve the needs of people living in under-resourced communities who are overweight or obese. Ideally interventions should be multicomponent and tailored to individual needs and circumstances.
Low-light Image Enhancement Algorithms (LIEAs) aim to improve the visibility and visual quality of images captured in low-light environments. However, none of the existing LIEAs can comprehensively restore all visual contents, which makes it inevitable for the Enhanced Low-light Images (ELIs) to have different degrees of distortion, thereby affecting the visual quality. Currently, there is little research focusing on the quality assessment of these ELIs, partly due to the lack of publicly available datasets. Moreover, existing quality assessment methods primarily focus on a single visual modality and fail to sufficiently exploit the structural information across multiple image attributes, consequently resulting in suboptimal prediction performance. To this end, this paper conducts a systematic study on both subjective and objective quality assessment of ELIs. Firstly, we construct the first Multi-annotated and multi-modal Low-light image Enhancement quality dataset (MLE), which contains 1,000 ELIs, along with subjective studies to obtain multiple attribute annotations, quality scores, and textual descriptions. Based on this, we further propose an Attribute-guided Vision-Language Graph Reasoning Network (AVGR-Net) for ELI quality prediction, which effectively integrates multi-attribute visual and textual information through cross-modal graph reasoning and alignment. Extensive data analysis and experimental results validate both the reliability of the MLE dataset and the superior performance of the AVGR-Net compared to state-of-the-art methods.
Explainable artificial intelligence (XAI) is increasingly important for computational pathology, where reliable and interpretable predictions are required for clinical use. Pre-trained Vision-Language Models (VLMs) offer a natural pathway to connect visual evidence with textual concepts, but adapting them to the domain of cancer pathology remains challenging due to fine-grained and heterogeneous morphological patterns. Visual prompt learning enables efficient task adaptation with minimal fine-tuning; however, existing prompting techniques face critical limitations: soft prompts often lack clinical specificity, while manually designed hard prompts reduce adaptability. To address these issues, we propose RAPT, a retrieval-augmented and text-guided visual prompting framework for explainable pathology classification. Given an input image, RAPT retrieves semantically related exemplars and leverages class-specific textual descriptions to construct disease-aware prompt tokens, injecting diagnostic cues without manual annotation. An adaptive weighting mechanism attenuates unreliable retrieval evidence, and bridge prompt tokens facilitate the integration of retrieved cues with image representations. Extensive experiments on three public cancer pathology datasets (PatchGastric, BACH, and LC25000) demonstrate that RAPT consistently outperforms prompting baselines across diverse backbone settings. Additional analyses and qualitative case studies demonstrate clinically actionable cues, robustness to imperfect retrieval, and clear failure-mode boundaries for trustworthy pathology decision support, highlighting the potential of robust and explainable prompting for computational pathology.
Zero-shot human-object interaction (HOI) detection aims to recognize both seen and unseen interaction categories while detecting humans and objects in an image. However, due to the absence of training samples for unseen categories, existing methods often overfit on seen HOIs and struggle to generalize to unseen ones. To address this issue, we introduce a novel Language-Driven Visual Data Generation (LD-VDG) approach that generates pseudo visual features from textual semantics of unseen HOIs. This provides an innovative solution enabling generalization to unseen HOIs without relying on visual samples. Specifically, we first design a text-to-vision (T-V) adapter to align HOI text and visual features, trained on seen HOIs with paired image-text data. For unseen HOIs, we guide the large language model to produce multiple fine-grained textual descriptions based on HOI labels, which are then encoded by the vision-language model and transformed into pseudo visual features via the T-V adapter. After that, these pseudo features together with real features from seen HOIs are jointly used to train a transformer-based HOI detector. In this way, our method enables effective recognition of unseen HOIs by leveraging language-driven visual representations. Experimental results on standard datasets demonstrate that the proposed LD-VDG outperforms previous methods. In particular, it achieves superior performance on unseen categories under various zero-shot settings.
Surgical vision-language foundation models learn generalizable representations from large-scale surgical data and support flexible adaptation to diverse downstream tasks by using text prompts as classifiers. However, their effectiveness is hindered by a critical semantic limitation: These models often struggle to distinguish between positive and negative textual assertions, a capability essential for fine-grained surgical tasks where the target object may occupy only a small and localized region. This limitation weakens the reliability of current state-of-the-art vision-language adaptation methods, leading to degraded performance when prompts require precise semantic discrimination. We propose Surg-NAT+, a few-shot vision-language adaptation framework designed to enhance foundation models for fine-grained surgical tasks. First, we introduce a negation-aware contrastive objective that explicitly strengthens the model's ability to differentiate between affirmative and negated textual prompts. This objective is incorporated into pretrained models through multi-level adapter fusion (MAF), enabling hierarchical semantic refinement within the text encoder. Second, we propose a fine-grained self-distillation objective that improves visual grounding by enforcing consistency between global representations of image crops and the corresponding local patch embeddings. We evaluate our approach on the Cholec80 dataset for few-shot surgical tool recognition using foundation models. Surg-NAT+ achieves state-of-the-art performance across all few-shot regimes, consistently surpassing existing baselines by a substantial margin. Qualitative analyses further demonstrate that our method yields more accurate visual grounding of target tools and enhances semantic separability in the learned feature space. 
We present Surg-NAT+, a lightweight negation-aware adaptation framework that substantially improves the fine-grained recognition capabilities of surgical vision-language models in few-shot, multi-label tool recognition settings. Our approach provides an efficient and semantically robust pathway for adapting foundation models to surgical domains. The code is available at https://github.com/Yutongc-ai/fine-tune-SurgVLP.
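To make the idea of a negation-aware contrastive objective concrete, the sketch below shows one generic way such a loss could be written: the image embedding is scored against an affirmative prompt and its negated counterpart, and a two-way softmax cross-entropy rewards the affirmative one. This is an assumed formulation for illustration only, not the loss used by Surg-NAT+; the function names, the temperature value, and the toy embeddings are hypothetical.

```python
import math

# Hypothetical sketch of a negation-aware contrastive loss.
# The exact objective in Surg-NAT+ is not given in the abstract; this is a
# generic two-way softmax over an affirmative and a negated prompt.


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def negation_aware_loss(image, affirmative, negated, temperature=0.07):
    """Cross-entropy that treats the affirmative prompt as the correct class.

    image: image embedding for a frame in which the tool is present.
    affirmative: text embedding of, e.g., "a grasper is visible".
    negated: text embedding of, e.g., "no grasper is visible".
    """
    s_pos = cosine(image, affirmative) / temperature
    s_neg = cosine(image, negated) / temperature
    # Numerically stable log-sum-exp over the two logits.
    m = max(s_pos, s_neg)
    log_z = m + math.log(math.exp(s_pos - m) + math.exp(s_neg - m))
    return log_z - s_pos


# Toy embeddings: the loss is small when the image matches the affirmative
# prompt and large when it matches the negated one.
loss_aligned = negation_aware_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
loss_confused = negation_aware_loss([0.0, 1.0], [1.0, 0.0], [0.0, 1.0])
```

When the two prompts are equally similar to the image, this loss equals log 2, which is the failure mode the abstract describes: the model cannot tell an assertion from its negation.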
YouTube is the primary global video platform, hosting both authoritative health information and vaccine-skeptic viewpoints. However, engagement dynamics remain poorly understood. The aim of this study was to investigate the temporal and textual dynamics of YouTube viewers' engagement with vaccination content, specifically content that is in favor of or against vaccination. We contextualized these dynamics in the authority signals of the posting channel and the moderation actions taken by the platform. We conducted a 6-month daily longitudinal analysis of 7213 vaccine-related YouTube videos (November 2024 to May 2025). We used zero-shot large language model classification with manual verification to classify the stance of each video toward vaccination and the stance of its comments toward the video. The engagement and disagreement dynamics were modeled using Bayesian regression. Our findings show engagement asymmetry between content supporting and questioning vaccination. Vaccine-hesitant videos in our sample receive substantially higher raw engagement (median likes: 40 [IQR 3-846] to 59 [IQR 3-1319]; median comments: 10 [IQR 0-160] to 18 [IQR 0-311] per video versus 3 [IQR 1-15] and 0 [IQR 0-4], respectively, for strongly provaccine content) and moderate normalized engagement rates (true median combined rate: 0.073 [IQR 0.028-0.121] to 0.069 [IQR 0.027-0.118] interactions per view versus 0.026 [IQR 0.007-0.060] for strongly provaccine videos, a 2.5-2.6× difference). Descriptively, vaccine-hesitant videos reach 90% of cumulative views faster (18 [IQR 8-38] days vs 32 [IQR 18-64] days; 44% faster), while negative binomial models that adjust for total engagement volume indicate that approximately 20% of this advantage reflects genuine temporal compression independent of engagement volume. Comment analysis indicated that the vaccine-hesitant videos in our sample foster echo chambers, while the provaccine content attracts battlegrounds.
Considering the sources of vaccine-related content, provaccine content tends to originate from organizations, particularly news and health institutions, while vaccine-hesitant discourse is more likely to come from individual creators, even those self-identifying as medical doctors. Moderation, on the rare occasion when it occurs (about 2% of the videos were taken down), comes after engagement saturation, limiting its effectiveness. Our analysis suggests that the vaccine-hesitant content can dominate YouTube's engagement ecosystem through rapid early-stage amplification, which has direct implications for public health intervention timing and platform governance policy.
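The "time to 90% of cumulative views" measure reported above can be computed from a daily view series as in the sketch below, which also reproduces the 44% figure from the two reported medians (18 and 32 days). The function names and the toy view counts are assumptions for illustration; the study's own pipeline is not described at this level of detail.

```python
# Hypothetical sketch: time-to-saturation from a daily view series, plus the
# relative speed-up implied by two median values.


def days_to_fraction(daily_views, fraction=0.9):
    """Return the first day (1-indexed) on which cumulative views reach
    `fraction` of the final total, or None if there are no views."""
    total = sum(daily_views)
    if total == 0:
        return None
    running = 0
    for day, views in enumerate(daily_views, start=1):
        running += views
        if running >= fraction * total:
            return day
    return len(daily_views)


def relative_speedup(fast_days, slow_days):
    """Fraction by which `fast_days` is shorter than `slow_days`."""
    return (slow_days - fast_days) / slow_days


# Toy series (illustrative numbers): 95% of views arrive by day 3.
example_day = days_to_fraction([50, 30, 15, 5])

# Medians reported in the abstract: 18 days vs 32 days.
speedup = relative_speedup(18, 32)  # (32 - 18) / 32 = 0.4375, i.e. about 44%
```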
Learning medical visual representations directly from image-report pairs has become an emerging topic in representation learning. However, the heterogeneity and complementary nature of medical reports and images make adaptive fusion challenging. We propose a Multi-modal and Multi-scale Disease-oriented fusion framework with Kolmogorov-Arnold Networks (MMDOK). This method utilizes the multi-scale features naturally present in medical images and reports and explores feature fusion between multi-view or multi-modal images and textual reports. Specifically, MMDOK designs a text encoder and two image encoders to enhance the diagnostic model by adaptively integrating text supervision and additional image information. To enhance the model's ability to extract information across different semantic scales, we introduce a Bidirectional Cross-Attention (BCA) module and a Cross-Modal Clustering (CMC) module. The proposed CMC module effectively captures disease-consistent latent features from both images and clinical reports, alleviating overfitting and improving generalization to out-of-distribution test sets. To reduce cross-modal redundancy, we introduce a Disease-Oriented Attention (DOA) module, which adaptively assigns modality-specific weights based on disease labels, enabling more effective and context-aware feature fusion. Additionally, we embed Kolmogorov-Arnold Network layers into various modules to enable efficient and accurate feature extraction through enhanced nonlinear representation capacity. We evaluate our model's performance, achieving an average accuracy of 94.27% and 83.67% with and without text supervision on a public lung disease dataset, and 87.63% and 82.1% on a private submucosal tumors dataset, outperforming state-of-the-art alternatives.
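As a rough illustration of what "modality-specific weights" can mean in a fusion step, the sketch below normalizes one relevance score per modality with a softmax and uses the result to form a weighted sum of modality features. It is not the DOA module from MMDOK: how the scores are derived from disease labels is not stated in the abstract, so the scores here are plain inputs, and the names `softmax` and `fuse_modalities` are hypothetical.

```python
import math

# Hypothetical sketch of weighted modality fusion. In the paper the weights
# depend on disease labels; here the per-modality scores are simply given.


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_modalities(modality_features, modality_scores):
    """Weighted sum of same-length feature vectors, one per modality.

    Returns the fused vector and the normalized weights.
    """
    weights = softmax(modality_scores)
    dim = len(modality_features[0])
    fused = [
        sum(w * feature[i] for w, feature in zip(weights, modality_features))
        for i in range(dim)
    ]
    return fused, weights


# Toy usage: equal scores give an even blend of the two modalities.
fused, weights = fuse_modalities([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

Raising one modality's score relative to the others shifts the fused vector toward that modality's features, which is the behaviour a label-conditioned weighting would exploit.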
We present Aitomia, an agentic framework for AI-driven atomistic and quantum chemical (QC) simulations that helps experts and nonexperts alike set up and run calculations, analyze results, and summarize them in textual and graphical forms through natural language interaction. Built on the MLatom software ecosystem, Aitomia supports AI-driven atomistic simulations as well as conventional quantum-chemical calculations, including density functional theory, semiempirical methods such as GFN2-xTB, and selected high-level wave function-based methods, through interfaces to widely used programs such as Gaussian, ORCA, PySCF, and xtb, covering tasks from ground- and excited-state calculations to geometry optimization, thermochemistry, and spectra simulations. By autonomously executing computational workflows, Aitomia can deliver infrared spectra in seconds and reaction thermochemistry in minutes, with results close to experiment or high-level theoretical references while greatly reducing the manual effort required from users. Aitomia lowers the barrier to performing atomistic simulations, thereby democratizing simulations and accelerating research and development in the relevant fields.
The opioid overdose crisis constituted one of the greatest public emergencies in US history. The overpromotion and overprescription of oxycodone and Purdue Pharma's branded formulation, OxyContin, have been implicated as key drivers of the opioid overdose crisis. This study sought to understand Purdue's motivations in developing opioid abuse-deterrent formulations, drugs designed to reduce misuse, abuse, and diversion of prescription opioids. We conducted a qualitative archival analysis of internal corporate documents archived at the UCSF-JHU Opioid Industry Documents Archive (OIDA). We gathered, coded, and analyzed over 80 internal documents spanning from 1996 to 2010 to reconstruct the timeline of events and motivations that drove Purdue's development and eventual release of an abuse-deterrent formulation of OxyContin in the US. These primary data included emails, slideshows, and other textual and visual documents. We used regulatory documents and court records as a supplementary data source to triangulate and validate our OIDA-based findings. Purdue initially proposed the development of an abuse-deterrent formulation of OxyContin as a strategy to protect its patents and prevent the introduction of a competing generic formulation of oxycodone. When the company successfully litigated against the introduction of a generic medication, it stopped attempting to develop a reformulation until the regulatory environment changed in a way that would prevent competition if Purdue released the reformulated OxyContin. Pharmaceutical companies may focus on strategies such as abuse-deterrent formulations to protect market share rather than to protect public health. Regulators should be aware of the potential risks of encouraging this approach.
Integrated Pest Management (IPM) has long reduced pesticide use while improving economic and ecological sustainability through monitoring and systems analysis. However, traditional IPM models face limitations in predictability, cost, and system specificity. Recent advances in machine learning (ML) provide flexible predictive frameworks that enable reliable short-term pest forecasting when sufficient data are available. At the same time, Internet of Things (IoT) technologies enable continuous acquisition of pest monitoring data for ML-based predictions. They also collect outcome metrics such as yield, pest resistance, and environmental impacts, supporting feedback-driven IPM optimization. Emerging multimodal modeling approaches now offer new opportunities to integrate diverse data sources, including textual information, and guide more targeted, integrated, and minimally chemical-dependent intervention strategies. Combining IoT monitoring, ML-based pest prediction, and adaptive optimization supports diverse ways of delivering actionable IPM insights to stakeholders, from large-scale enterprises to smallholder farming communities. This convergence of technologies marks the emergence of truly adaptive, high-precision IPM in the big-data era.