The rise of electronic cigarettes (e-cigarettes) has generated widespread controversy. As a key platform for public discussion, Twitter/X provides a valuable context for examining how textual stance, images, actor types, and user engagement intersect in e-cigarette discourse. This study analyzed 19,983 image-containing tweets, including 24,676 images and 21,976 replies. Image and text classifications were conducted using an AI-assisted coding approach. User engagement was measured by likes, retweets, and replies. BERTopic was used to identify major reply topics. Promotions (34.6%), Vaping Advocacy and Rights (21.2%), and Health Warnings and Infographics (16.5%) emerged as the dominant image categories. 16.78% of images categorized as Health Warnings and Infographics showed a text-image mismatch. Retailers and vaping communities were more active in image production, whereas health organizations contributed fewer images. In pro-e-cigarette textual contexts, images depicting vaping acts predicted higher levels of likes (B = 0.258, p < 0.001), replies (B = 0.093, p < 0.001), and retweets (B = 0.099, p < 0.01). By contrast, most image types in anti-e-cigarette textual contexts did not significantly predict engagement. User replies included discussions of everyday e-cigarette use, policy debates, and skepticism toward authoritative institutions. E-cigarette images on Twitter/X are not only promotional tools but also part of public debates over health risks, regulation, and vaping rights. Their meanings are shaped by textual stance, and their distribution differs across actor types. These findings suggest that health communication should strengthen its sustained visibility and narrative appeal to respond to pro-vaping narratives and related controversies.
As the consumption of cultural and creative products (CCPs) increasingly shifts to e-commerce channels, consumers rely heavily on online visual-textual displays to perceive and compare cultural value, whereas systematic design-oriented methods for assessing such perceived value remain insufficient. To address this gap, this study proposes a hybrid method for assessing the perceived cultural value of CCPs in e-commerce visual-textual presentation. First, a three-domain indicator framework covering formal representation, usage inference, and meaning construction is developed by integrating hierarchical cultural semantics with narrative-structure organization, and refined through content validity testing. Second, a hybrid DEMATEL-CRITIC-MULTIMOORA model is used to integrate indicator influence and alternative-performance variation across indicators for weight determination, followed by multi-perspective alternative ranking. Using cultural aromatherapy burners as a case, Kendall's concordance tests reveal evident divergence in expert judgments, with W = 0.128 and p = 0.797 for indicator interaction judgments, and W = 0.414 and p = 0.066 for indicator-level assessments of alternative performance. Within this context, the model outputs should be interpreted as structured perceived cultural value assessment results aggregated from heterogeneous expert judgments, rather than as objective measurements of products' intrinsic cultural quality. External questionnaire validation demonstrates relatively high rank-order consistency between the model results and user-side assessments (Spearman's ρ = 0.829; Kendall's τ = 0.733). The proposed approach transforms implicit cultural semantic features of CCPs into observable and comparable evidence, providing a decision-support reference for competitor comparison, visual-textual presentation refinement, and CCP design practice.
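The rank-order consistency statistics reported above can be illustrated with a minimal sketch. The two rankings below are hypothetical (the abstract does not list the alternatives or their ranks); they are chosen only because a six-item ranking with one displaced alternative yields coefficients of the same size as those reported. The function names are illustrative and tie-free rankings are assumed.

```python
from itertools import combinations


def spearman_rho(rank_a, rank_b):
    """Spearman's rho for two tie-free rankings of the same items."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(rank_a)
    score = 0
    for i, j in combinations(range(n), 2):
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        score += 1 if sign > 0 else -1  # no ties assumed, so sign != 0
    return score / (n * (n - 1) / 2)


# Hypothetical ranks for six alternatives (not data from the study).
model_ranks = [1, 2, 3, 4, 5, 6]
user_ranks = [2, 3, 1, 4, 5, 6]

print(round(spearman_rho(model_ranks, user_ranks), 3))  # 0.829
print(round(kendall_tau(model_ranks, user_ranks), 3))   # 0.733
```

Kendall's τ is typically smaller than Spearman's ρ for the same pair of rankings, which is why the two reported values differ even though they describe the same agreement.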
China ratified the WHO Framework Convention on Tobacco Control (FCTC) in 2005; it entered into force domestically on 9 January 2006. In the absence of comprehensive national smoke-free legislation, subnational jurisdictions have enacted dedicated tobacco control regulations heterogeneously. The textual architecture distinguishing FCTC Article 8-compliant from non-compliant drafting has not been systematically characterized at corpus scale. This study compiled 38 Chinese subnational dedicated tobacco control regulations enacted or materially amended after FCTC entry into force (9 January 2006 through 31 December 2024), totaling 100,635 substantive Chinese characters. Each was independently evaluated against the four core requirements of the FCTC Article 8 Implementation Guidelines [FCTC/COP2(7), 2007] and triangulated against an authoritative expert database. A five-layer framework examined corpus scale, compliance, e-cigarette prohibitions, enforcement features, and other FCTC complementary measures. Mann-Whitney U tests with Cliff's delta and Fisher's exact tests were used (two-sided, α = 0.05). Of 38 regulations, 10 (26.3%) met FCTC Article 8 compliance criteria; 28 (73.7%) did not. Compliant regulations had higher median character counts (3249 vs 2572; δ = 0.40, p = 0.066) and clause counts (28.0 vs 23.0; δ = 0.46, p = 0.034). E-cigarette prohibitions (60.0% vs 25.0%, p = 0.062) and cessation service requirements (90.0% vs 50.0%, p = 0.056) were more frequent in compliant regulations. The 10 compliant jurisdictions cover an estimated 121.6 million residents (8.6% of mainland China's 2020 population). Greater corpus elaboration, more complete enforcement specification, and inclusion of FCTC-aligned complementary measures are textual features systematically associated with, though not shown to cause, FCTC Article 8 compliance in Chinese subnational law.
These findings characterize legal text rather than implementation outcomes and inform drafting guidance for Chinese cities and other jurisdictions pursuing Article 8 implementation.
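The effect size paired with the Mann-Whitney U test in this abstract, Cliff's delta, can be sketched as follows. The clause counts are hypothetical values invented for illustration (the abstract reports only medians), and the function name is an assumption.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Hypothetical clause counts, not data from the study.
compliant = [30, 28, 25]
non_compliant = [23, 26, 20, 29]

print(cliffs_delta(compliant, non_compliant))  # 0.5
```

A delta of 0 means the two groups overlap completely, while +1 or -1 means every value in one group exceeds every value in the other.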
Digital environments have become important contexts in which consumers form sensory expectations and evaluate food quality prior to consumption. Drawing on the elaboration likelihood model and attribution theory, this study develops a theoretically grounded process model to explain how visual and textual cues in online presentations of natural foods shape food-related cognition. Specifically, we propose that perceived naturalness serves as an initial perceptual input that can trigger cognitive engagement through multiple mechanisms: directly, via credibility as a validation mechanism, via taste inference as an experiential simulation, and through a sequential chain in which credibility enables taste inference that subsequently sustains elaboration. A 2 (platform type: content-oriented vs. transaction-oriented) × 2 (image scene: lifestyle-oriented vs. nature-oriented) × 2 (text framing: consumption-oriented vs. production-oriented) between-subjects experiment (N = 320) was conducted. Partial least squares structural equation modeling was employed to test direct and indirect effects; multi-group analysis examined boundary conditions across experimental contexts; and necessary condition analysis identified minimum required levels of predictors for high engagement states. The results indicate that perceived naturalness has a significant direct effect on cognitive engagement, as well as indirect effects through credibility and taste inference independently and in sequence. The indirect pathway is more pronounced in content-oriented environments, particularly when nature-oriented images and consumption-oriented text are used. Taste inference emerged as the strongest necessary condition for high cognitive engagement, followed by credibility; perceived naturalness showed a weaker but significant necessity effect. 
These findings demonstrate how visual and textual cues jointly guide anticipatory sensory processing and cognitive engagement in digital food contexts, offering both theoretical contributions to cue-based processing research and practical implications for the design of online presentations of natural foods.
Visual grounding aims to localize target objects in images based on given textual descriptions, with broad applications in fields such as autonomous driving and human-robot interaction. However, existing visual grounding models still face three major challenges: (1) most prior works employ separate encoders to process images and text independently, which enlarges the semantic gap between visual and textual features; (2) the use of large language models leads to excessive parameters, making deployment on lightweight devices difficult; and (3) single-level cross-modal attention mechanisms are insufficient for fully capturing interactive information across modalities. To address these issues, this paper proposes a Task-aware Liquid Cross-modal Network (TLCN), which consists of four key modules: a Feature Extraction Module (FEM), a Liquid Fusion Module (LFM), a Task-aware Cross-modal Refinement Module (TCRM), and a Multilevel Grounding Module (MGM). Specifically, the FEM utilizes textual features to guide the extraction of visual features, thereby reducing the feature gap. The LFM employs Liquid Neural Networks (LNNs) to capture temporal dependencies and significantly reduce model parameters. Furthermore, the TCRM deepens textual representation via a second-level attention mechanism, while purpose-designed Conv-Trans Blocks (CTBs) are applied to image data to extract deeper visual features. Additionally, a similarity loss function based on KL divergence is introduced to optimize the cross-modal alignment. The proposed model is extensively evaluated on three widely used public benchmarks: RefCOCO, RefCOCO+, and RefCOCOg. Moreover, a specialized text localization task is designed for further evaluation. Experimental results demonstrate that the TLCN achieves superior performance across all evaluated datasets and tasks.
The superior performance of TLCN validates the effectiveness of its structural designs: text-guided visual extraction successfully bridges the semantic gap, the introduction of LNNs effectively reduces parameter counts for lightweight deployment, and the second-level attention with CTBs sufficiently captures deep cross-modal interactions. These findings suggest that TLCN provides a promising, efficient, and lightweight solution for visual grounding and related localization tasks.
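The abstract does not give the form of the KL-divergence-based similarity loss, so the following is only a generic sketch of the idea under stated assumptions: each modality produces a vector of similarity scores, the scores are turned into distributions with a softmax, and the loss is the KL divergence between the two distributions. The score vectors and function names are hypothetical and are not TLCN's actual formulation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Hypothetical similarity scores from the textual and visual branches.
text_scores = [2.0, 0.5, -1.0]
visual_scores = [1.5, 0.7, -0.5]

loss = kl_divergence(softmax(text_scores), softmax(visual_scores))
print(loss > 0)  # True: the two distributions differ
```

The loss is zero when both branches produce the same distribution and grows as they diverge, which is the property an alignment objective of this kind relies on.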
The rapid spread of misinformation across social media platforms, websites, and online communication channels has made fake news detection a critical task in the digital era. Although various computational approaches have been developed to identify fake news, many existing methods suffer from limitations such as biased training datasets and high rates of false positives and false negatives. To address these challenges, this study proposes a Multimodal Cross Attention Network with Taylor-based Cross Entropy Mean Bias (MMCN_TCMB) model for detecting multimodal fake news. The proposed approach utilizes multimodal inputs consisting of textual and visual content obtained from fake news datasets. The textual information in news posts is first tokenized using Bidirectional Encoder Representations from Transformers (BERT). Feature extraction is then performed using Word2Vec and Term Frequency-Inverse Gravity Moment (TF-IGM). Simultaneously, images associated with news posts undergo preprocessing through Contrast Limited Adaptive Histogram Equalization and Histogram Equalization (CLAHE-HE), followed by feature extraction using ResNet. The extracted textual and visual features are combined and processed through the MMCN framework. The learning mechanism of the network is enhanced using the Taylor-based Cross Entropy Mean Bias (TCMB) loss function to improve classification performance. Experimental results demonstrate that the proposed MMCN_TCMB model achieves superior performance in multimodal fake news detection. The model attains a recall of 97.988%, precision of 96.223%, F1-score of 97.098%, and overall accuracy of 97.436%, outperforming existing methods. The findings indicate that integrating multimodal feature extraction with cross-attention mechanisms and the TCMB loss function significantly enhances the reliability and accuracy of fake news detection. 
The proposed framework effectively captures both textual and visual inconsistencies, making it a promising approach for combating misinformation in modern digital platforms. The code is available at: https://github.com/banbhrani84/MMCN_TCMB-Fake-News-.
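As a quick consistency check on the reported metrics, the F1-score is the harmonic mean of precision and recall. The sketch below recomputes it from the precision and recall quoted in the abstract; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, in percent.
f1 = f1_score(96.223, 97.988)
print(round(f1, 2))
```

The recomputed value agrees with the reported 97.098% to within rounding of the inputs.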
Artificial intelligence (AI) tools based on natural language, such as ChatGPT 4.1 mini (OpenAI Group PBC) and Gemini 2.5 Flash (Alphabet Inc.), are used by patients as sources of medical information. The current study aimed to evaluate and compare the quality and readability of responses provided by these AIs, in Brazilian Portuguese, regarding rotator cuff surgery. The present cross-sectional, descriptive, and comparative study followed qualitative and quantitative approaches. A total of 24 frequently asked patient questions were used, classified according to Rothwell. Each question was entered individually into both platforms, and only the first response was considered. The quality assessment used the DISCERN instrument, developed by the University of Oxford and the British Library, and the Journal of the American Medical Association (JAMA) benchmark criteria. Readability was estimated using the Análise de Legibilidade Textual (ALT, Portuguese for "Text Readability Analysis") software, validated for Brazilian Portuguese. The statistical analyses included the Wilcoxon and Friedman tests, repeated-measures analysis of variance (ANOVA), and the Conover post-hoc test with Bonferroni correction. ChatGPT achieved a mean DISCERN score of 58.7 ± 4.0, and Gemini, 56.3 ± 3.5, with no significant difference (p = 0.174), but with a maximum effect size (rank-biserial correlation [rrb] = 1.0). Both models showed a mean readability corresponding to 13.3 years of schooling (p = 1.000). No response met the JAMA benchmark criteria. Value-based questions achieved the highest quality scores, whereas policy-related questions were the most complex in terms of readability. The correlation between quality and readability was moderate (ρ = 0.73; p = 0.099). ChatGPT 4.1 mini and Gemini 2.5 Flash do not yet provide adequate medical information in Brazilian Portuguese regarding editorial reliability, quality, and textual accessibility for the general public.
The performance of five popular, widely available large language models (LLMs), namely ChatGPT-4o, Gemini 2.5 Flash, Llama 4, DeepSeek-V3, and Microsoft Copilot, in operative dentistry education was evaluated using a multiple-choice question (MCQ)-based assessment. The assessment used a set of 150 MCQs covering endodontics, dental caries, paediatric, preventive, aesthetic and restorative dentistry, biomaterials, and periodontics. The LLMs' performance was assessed using classification metrics (accuracy, sensitivity, predictive reliability), textual similarity metrics (BLEU score, cosine similarity, Word Error Rate), and readability metrics (Flesch Reading Ease score). The highest classification accuracy was achieved by Gemini 2.5 Flash and ChatGPT-4o, which also showed high sensitivity and high overall predictive reliability. The model with the most textual similarity to the reference answers was ChatGPT-4o, with a BLEU score of 0.10 ± 0.0279, a high cosine similarity of 0.48 ± 0.0422, a relatively low Word Error Rate (WER) of 5.57 ± 0.7301, and a Flesch Reading Ease score of 13.53 ± 4.9449. In medical education, ChatGPT-4o exhibited the highest accuracy, reference textual overlap, semantic alignment, and readability, together with the fewest errors, among the five evaluated LLMs, making it a valuable assistant for dental healthcare professionals.
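Of the textual similarity metrics named above, Word Error Rate is the simplest to sketch: it is the word-level edit distance between a model answer and the reference answer, divided by the number of reference words. The version below is a minimal illustration with invented example sentences; the abstract does not describe the tokenization or scaling the authors used, so whitespace splitting and a non-empty reference are assumptions here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Invented sentences for illustration only.
print(word_error_rate("the pulp is vital", "the pulp is non vital"))  # 0.25
```

Because insertions count as errors, a model answer much longer than the reference can produce a WER above 1.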
The increasing use of digital communication platforms has led individuals to express emotions and mental health concerns through text containing implicit emotional cues, informal language, and non-standard expressions. Traditional sentiment analysis systems often struggle to capture these contextual nuances, limiting their effectiveness in mental health-related text analysis. To address this challenge, this study proposes a two-layer framework that combines Azure Sentiment Analysis and Azure Custom Text Classification for sentiment and mental health-related text categorization. In the first layer, user-generated text is classified into positive, neutral, or negative sentiment categories using Azure Sentiment Analysis. Text identified as negative is subsequently analysed using Azure Custom Text Classification to categorize content into predefined mental health-related classes, including Anxiety, Depression, PTSD, Social Anxiety Disorder, and Suicidal Ideation and Behaviour. The proposed framework aims to provide a structured approach for identifying linguistic patterns associated with mental health-related discussions and supporting mental health screening and triage applications. Experimental evaluation using an 80% training and 20% testing split achieved an overall Precision, Recall, and F1-score of 96.97%. Class-level evaluation demonstrated strong performance across multiple categories, with F1-scores ranging from 0.94 to 1.000. The findings indicate that the proposed architecture can effectively classify mental health-related textual content within the evaluated dataset while providing a scalable framework for automated sentiment and text classification. The study contributes to the growing field of intelligent emotional computing and highlights the potential of cloud-based natural language processing tools for mental health-related text analytics.
The reported results are limited to the evaluated dataset and should be interpreted as a text classification and screening approach rather than a clinical diagnostic system. This manuscript presents the computational component of a broader mixed-methods study registered under CTRI/2024/06/068766, titled "Exploring Mental Health Status in a Selected Population: A Corpus Analysis Combining Forensic Linguistics and Psychology - a Mixed Method Study." The current work focuses on the development and validation of an AI-based diagnostic tool for mental health assessment using synthetic and anonymized textual data, constituting a secondary objective of the registered protocol. Registry: Clinical Trials Registry - India (CTRI). Trial registration number: CTRI/2024/06/068766. Date of registration: 12 June 2024.
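The control flow of the two-layer framework described in this abstract can be sketched as below. The two classifier functions are toy placeholders written for this illustration; they stand in for the Azure services and do not reproduce any Azure API, request format, or the study's trained model.

```python
from typing import Callable, Optional, Tuple


def two_layer_classify(
    text: str,
    sentiment_fn: Callable[[str], str],
    category_fn: Callable[[str], str],
) -> Tuple[str, Optional[str]]:
    """Layer 1 assigns sentiment; only negative text reaches layer 2."""
    sentiment = sentiment_fn(text)
    if sentiment != "negative":
        return sentiment, None
    return sentiment, category_fn(text)


# Hypothetical stand-ins for the two cloud classifiers.
def toy_sentiment(text: str) -> str:
    return "negative" if "hopeless" in text.lower() else "neutral"


def toy_category(text: str) -> str:
    return "Depression"


print(two_layer_classify("I feel hopeless lately", toy_sentiment, toy_category))
print(two_layer_classify("Meeting moved to Friday", toy_sentiment, toy_category))
```

Passing the classifiers in as functions keeps the routing logic testable without any network calls; in a real deployment each placeholder would wrap a call to the corresponding service.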
Explainable artificial intelligence (XAI) is increasingly important for computational pathology, where reliable and interpretable predictions are required for clinical use. Pre-trained Vision-Language Models (VLMs) offer a natural pathway to connect visual evidence with textual concepts, but adapting them to the domain of cancer pathology remains challenging due to fine-grained and heterogeneous morphological patterns. Visual prompt learning enables efficient task adaptation with minimal fine-tuning; however, existing prompting techniques face critical limitations: soft prompts often lack clinical specificity, while manually designed hard prompts reduce adaptability. To address these issues, we propose RAPT, a retrieval-augmented and text-guided visual prompting framework for explainable pathology classification. Given an input image, RAPT retrieves semantically related exemplars and leverages class-specific textual descriptions to construct disease-aware prompt tokens, injecting diagnostic cues without manual annotation. An adaptive weighting mechanism attenuates unreliable retrieval evidence, and bridge prompt tokens facilitate the integration of retrieved cues with image representations. Extensive experiments on three public cancer pathology datasets (PatchGastric, BACH, and LC25000) demonstrate that RAPT consistently outperforms prompting baselines across diverse backbone settings. Additional analyses and qualitative case studies demonstrate clinically actionable cues, robustness to imperfect retrieval, and clear failure-mode boundaries for trustworthy pathology decision support, highlighting the potential of robust and explainable prompting for computational pathology.
Zero-shot human-object interaction (HOI) detection aims to recognize both seen and unseen interaction categories while detecting humans and objects in an image. However, due to the absence of training samples for unseen categories, existing methods often overfit on seen HOIs and struggle to generalize to unseen ones. To address this issue, we introduce a novel Language-Driven Visual Data Generation (LD-VDG) approach that generates pseudo visual features from textual semantics of unseen HOIs. This provides an innovative solution enabling generalization to unseen HOIs without relying on visual samples. Specifically, we first design a text-to-vision (T-V) adapter to align HOI text and visual features, trained on seen HOIs with paired image-text data. For unseen HOIs, we guide the large language model to produce multiple fine-grained textual descriptions based on HOI labels, which are then encoded by the vision-language model and transformed into pseudo visual features via the T-V adapter. After that, these pseudo features together with real features from seen HOIs are jointly used to train a transformer-based HOI detector. In this way, our method enables effective recognition of unseen HOIs by leveraging language-driven visual representations. Experimental results on standard datasets demonstrate that the proposed LD-VDG outperforms previous methods. In particular, it achieves superior performance on unseen categories under various zero-shot settings.
YouTube is the primary global video platform, hosting both authoritative health information and vaccine-skeptic viewpoints. However, engagement dynamics remain poorly understood. The aim of this study was to investigate the temporal and textual dynamics of YouTube viewers' engagement with vaccination content, specifically content that is in favor of or against vaccination. We contextualized these dynamics in the authority signals of the posting channel and the moderation actions taken by the platform. We conducted a 6-month daily longitudinal analysis of 7213 vaccine-related YouTube videos (November 2024 to May 2025). We used zero-shot large language model classification with manual verification to classify the stance of each video toward vaccination and the stance of its comments toward the video. The engagement and disagreement dynamics were modeled using Bayesian regression. Our findings show engagement asymmetry between content supporting and questioning vaccination. Vaccine-hesitant videos in our sample receive substantially higher raw engagement (median likes: 40 [IQR 3-846] to 59 [IQR 3-1319]; median comments: 10 [IQR 0-160] to 18 [IQR 0-311] per video versus 3 [IQR 1-15] and 0 [IQR 0-4], respectively, for strongly provaccine content) and moderate normalized engagement rates (true median combined rate: 0.073 [IQR 0.028-0.121] to 0.069 [IQR 0.027-0.118] interactions per view versus 0.026 [IQR 0.007-0.060] for strongly provaccine videos, a 2.5-2.6× difference). Descriptively, vaccine-hesitant videos reach 90% of cumulative views faster (18 [IQR 8-38] days vs 32 [IQR 18-64] days; 44% faster), while negative binomial models that adjust for total engagement volume indicate that approximately 20% of this advantage reflects genuine temporal compression independent of engagement volume. Comment analysis indicated that the vaccine-hesitant videos in our sample foster echo chambers, while the provaccine content attracts battlegrounds.
Considering the sources of vaccine-related content, provaccine content tends to originate from organizations, particularly news and health institutions, while vaccine-hesitant discourse is more likely to come from individual creators, even those self-identifying as medical doctors. Moderation, on the rare occasion when it occurs (about 2% of the videos were taken down), comes after engagement saturation, limiting its effectiveness. Our analysis suggests that the vaccine-hesitant content can dominate YouTube's engagement ecosystem through rapid early-stage amplification, which has direct implications for public health intervention timing and platform governance policy.
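The "days to 90% of cumulative views" measure used in this abstract can be sketched as follows. The daily view series is invented for illustration, and the function name and the handling of all-zero series are assumptions; only the 18-day and 32-day medians in the last line come from the abstract.

```python
def days_to_fraction(daily_views, fraction=0.9):
    """First day (1-indexed) on which cumulative views reach the given
    fraction of the series total; None if the video has no views."""
    total = sum(daily_views)
    if total == 0:
        return None
    running = 0
    for day, views in enumerate(daily_views, start=1):
        running += views
        if running >= fraction * total:
            return day
    return len(daily_views)


# Hypothetical daily view counts for one video.
print(days_to_fraction([60, 25, 10, 3, 2]))  # 3

# Relative speed-up implied by the reported medians (18 vs 32 days).
print(1 - 18 / 32)  # 0.4375, i.e. about 44% faster
```

A front-loaded series such as the one above saturates within a few days, which is the pattern the abstract links to moderation arriving too late to matter.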
Fire detection is vital in subway tunnels where confined geometries and ventilation complicate safety monitoring. Existing approaches, including classical machine learning methods, typically detect fire from multivariate correlations but often lack the contextual reasoning required for disambiguation. This limitation becomes critical when HVAC-driven airflow disrupts thermal stratification and dilutes gas concentrations, creating ambiguous patterns that mimic fire signatures. Recent studies suggest that Large Language Models (LLMs) may help address this challenge by translating structured sensor summaries into concise semantic descriptions. To examine this role in fire detection, we propose HyFiD, a hybrid framework that employs an LLM as a semantic feature extractor to augment classical classifiers. By converting momentary multi-sensor readings (temperature, smoke, O₂, CO, and CO₂) into textual assessments of environmental states, HyFiD generates semantic vectors that are fused with numerical features for supervised classifier training. Experiments on Fire Dynamics Simulator (FDS)-based subway scenarios, including ventilation-dominated HVAC events and high-energy battery fires, show that the effect of LLM-derived semantic features is strongly dependent on the downstream classifier. Within the evaluated FDS-based scenarios, the GBM-based HyFiD configuration achieved the highest Accuracy, Recall, and F1 among the tested hybrid models, numerical-only deep learning baselines, and direct prompt-only LLM baselines. The results further show that semantic augmentation should be assessed not only by classification accuracy, but also by operational alarm behavior such as detection delay and pre-alarm rate.
Retrieval-Augmented Generation (RAG) has emerged as a pivotal framework for knowledge-intensive reasoning by coupling external retrieval with generative capabilities. However, existing RAG systems suffer from a critical granularity mismatch problem: coarse-grained retrieval units (entire passages or global images) fail to align with the fine-grained reasoning requirements of generation tasks, particularly in multimodal contexts where localized visual evidence is essential. We propose Fine-Grained Retrieval-Augmented Generation (FG-RAG), a unified framework that bridges this semantic resolution gap through two key innovations. First, we enhance the CLIP architecture with patch-level contrastive supervision, enabling explicit alignment between localized image regions and corresponding textual fragments. Second, we introduce a joint retrieval-reranking optimization mechanism that unifies a dense retriever with a large language model (LLM)-based reranker through a shared relevance loss. To address the non-differentiable nature of LLM generation, we employ a Score Alignment Strategy where generative likelihoods provide structural supervision for the retriever, creating bidirectional feedback between retrieval precision and generation quality. Comprehensive evaluation on MSCOCO and Flickr30k benchmarks demonstrates that FG-RAG achieves significant retrieval gains (Recall@1 = 0.8845 on MSCOCO), outperforming state-of-the-art methods by up to 7.5% across datasets. In visual question answering, our framework achieves an F1 score of 0.4353 on MSCOCO and reduces hallucination rates by 15 percentage points compared to conventional RAG systems. Ablation studies confirm the necessity of both fine-grained modeling and joint optimization components, with their removal causing substantial performance degradation in critical metrics.
These results establish that fine-grained semantic alignment coupled with closed-loop optimization substantially enhances the factual grounding and contextual coherence of multimodal generation systems.
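For readers unfamiliar with the Recall@1 figure quoted above, the following is a minimal sketch of how Recall@K is commonly computed from a query-candidate similarity matrix. It is an illustration only, not the evaluation code of FG-RAG: it assumes exactly one correct candidate per query, located at the same index as the query, which simplifies benchmarks where an image has several reference captions. The function name `recall_at_k` and the scores are hypothetical.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose correct candidate ranks in the top k.

    similarity[i][j] is the score of query i against candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Made-up scores for three queries and three candidates.
scores = [
    [0.9, 0.1, 0.2],  # query 0: correct candidate ranked first
    [0.3, 0.2, 0.8],  # query 1: correct candidate ranked last
    [0.1, 0.4, 0.7],  # query 2: correct candidate ranked first
]
r1 = recall_at_k(scores, k=1)
```

With these toy scores, two of the three queries retrieve their match at rank one.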
Overweight and obesity are causing a growing public health, economic, and clinical burden, particularly within under-resourced communities. There is an urgent need to develop an in-depth understanding of experiences of weight management as well as preferences for support within under-resourced communities, with a view to developing more effective weight management interventions. Focus groups were run in under-resourced communities using storyboarding, a method used to facilitate inclusive communication (n = 37). Thematic analysis was applied to textual and visual data, and a realist 'lens' applied to provide in-depth insight into weight management experiences and needs. We believe this is the first study to use this combined methodology to explore weight management experiences and needs. Combining storyboarding with a realist lens generated four themes. Living circumstances indicated that mental health, individual needs, and the cost of weight management services were key contextual factors. Mechanisms of weight management identified emotional eating and portion control as central to individual weight management. Yo-yo dieting centered on participants' experiences of weight regain after attempting weight loss. Weight management intervention needs indicated that psychological support was perceived as severely lacking and the only route to attain sustained weight management. Offering both in-person and online support for weight management was considered important to reach more people. Moving weight management support from short to long term and incorporating more robust psychological support would better serve the needs of people living in under-resourced communities who are overweight or obese. Ideally interventions should be multicomponent and tailored to individual needs and circumstances.
Surgical vision-language foundation models learn generalizable representations from large-scale surgical data and support flexible adaptation to diverse downstream tasks by using text prompts as classifiers. However, their effectiveness is hindered by a critical semantic limitation: These models often struggle to distinguish between positive and negative textual assertions, a capability essential for fine-grained surgical tasks where the target object may occupy only a small and localized region. This limitation weakens the reliability of current state-of-the-art vision-language adaptation methods, leading to degraded performance when prompts require precise semantic discrimination. We propose Surg-NAT+, a few-shot vision-language adaptation framework designed to enhance foundation models for fine-grained surgical tasks. First, we introduce a negation-aware contrastive objective that explicitly strengthens the model's ability to differentiate between affirmative and negated textual prompts. This objective is incorporated into pretrained models through multi-level adapter fusion (MAF), enabling hierarchical semantic refinement within the text encoder. Second, we propose a fine-grained self-distillation objective that improves visual grounding by enforcing consistency between global representations of image crops and the corresponding local patch embeddings. We evaluate our approach on the Cholec80 dataset for few-shot surgical tool recognition using foundation models. Surg-NAT+ achieves state-of-the-art performance across all few-shot regimes, consistently surpassing existing baselines by a substantial margin. Qualitative analyses further demonstrate that our method yields more accurate visual grounding of target tools and enhances semantic separability in the learned feature space. 
We presented Surg-NAT+, a lightweight negation-aware adaptation framework that substantially improves the fine-grained recognition capabilities of surgical vision-language models in few-shot, multi-label tool recognition settings. Our approach provides an efficient and semantically robust pathway for adapting foundation models to surgical domains. The code is available at https://github.com/Yutongc-ai/fine-tune-SurgVLP .
Low-light Image Enhancement Algorithms (LIEAs) aim to improve the visibility and visual quality of images captured in low-light environments. However, none of the existing LIEAs can comprehensively restore all visual contents, which makes it inevitable for the Enhanced Low-light Images (ELIs) to have different degrees of distortion, thereby affecting the visual quality. Currently, there is little research focusing on the quality assessment of these ELIs, partly due to the lack of publicly available datasets. Moreover, existing quality assessment methods primarily focus on a single visual modality and fail to sufficiently exploit the structural information across multiple image attributes, consequently resulting in suboptimal prediction performance. To this end, this paper conducts a systematic study on both subjective and objective quality assessment of ELIs. Firstly, we construct the first Multi-annotated and multi-modal Low-light image Enhancement quality dataset (MLE), which contains 1,000 ELIs, along with subjective studies to obtain multiple attribute annotations, quality scores, and textual descriptions. Based on this, we further propose an Attribute-guided Vision-Language Graph Reasoning Network (AVGR-Net) for ELI quality prediction, which effectively integrates multi-attribute visual and textual information through cross-modal graph reasoning and alignment. Extensive data analysis and experimental results validate both the reliability of the MLE dataset and the superior performance of the AVGR-Net compared to state-of-the-art methods.
Pulmonary diseases pose significant health risks to athletes, necessitating accurate early detection and risk prediction methods. In this study, we propose a novel Multimodal Pulmonary Risk Prediction Network (MPRPN), which integrates visual data, textual data, and auxiliary physiological data through a unified deep learning framework. The model incorporates an Adaptive Modality Weighting Strategy (AMWS) to dynamically adjust modality contributions and a Hierarchical Risk Prediction Strategy (HRPS) to capture domain-specific feature structures. Experiments were conducted on multiple multimodal datasets, including the Athlete Respiratory Health Records dataset, Multimodal Pulmonary Imaging Collection, Pulmonary Risk Profiles dataset, and Early Detection Biomarker dataset, comprising diverse clinical, imaging, and physiological samples. The proposed method outperforms state-of-the-art models, with accuracy of up to 89.92%, F1-score of up to 90.23%, and AUC of up to 90.47%, demonstrating strong predictive capability and robustness. These results indicate that MPRPN effectively leverages complementary multimodal information and provides a reliable tool for early detection and personalized risk assessment of pulmonary diseases in athletes. The proposed framework has significant potential for real-world applications in sports medicine and preventive healthcare.
The exponential growth of textual data has intensified the need for reliable automated text summarization (ATS) systems that can extract and synthesize knowledge while maintaining factual accuracy. Current evaluation frameworks for large language models (LLMs) in summarization tasks lack comprehensive assessment of factual consistency, particularly in knowledge engineering contexts where information integrity is paramount. This paper presents a comprehensive evaluation framework that systematically assesses factual consistency in LLM-generated summaries through advanced prompting strategies and multi-dimensional evaluation metrics. Our framework integrates five prompting methodologies, namely Zero-shot, Few-shot, Chain-of-Thought (CoT), Structured Chain-of-Thought (SCoT), and Chain-of-Verification (CoVe), with state-of-the-art (SOTA) factuality assessment approaches, namely FActScore, LongDocFACTScore (LDFActs), and AlignScore, across eight LLMs and five diverse datasets spanning news, scientific literature, and conversational domains. Results demonstrate that Few-shot prompting achieves the best performance across most domains except scientific literature, with LLM-generated summaries consistently outperforming human-written ones. Our findings reveal trade-offs between completeness and precision, with models generating 2-10 times more atomic facts than human references while maintaining comparable or superior factual accuracy. The framework provides actionable insights for researchers developing reliable summarization systems, with an open-source implementation available for reproducibility.
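The completeness-versus-precision trade-off mentioned above can be made concrete with a small sketch in the spirit of atomic-fact metrics such as FActScore, where a summary is split into atomic facts and the score is the share of facts supported by the source. This is a simplified illustration, not the framework's implementation: it assumes the per-fact support judgments are already available as booleans, and the function name `fact_precision` and all counts below are made up.

```python
def fact_precision(supported_flags):
    """Share of atomic facts judged as supported by the source.

    `supported_flags` holds one boolean per atomic fact. How facts are
    extracted and verified is outside the scope of this sketch.
    """
    if not supported_flags:
        return 0.0
    return sum(supported_flags) / len(supported_flags)


# Hypothetical judgments: a model summary with many facts and a
# shorter human reference with few.
model_flags = [True] * 18 + [False] * 2
human_flags = [True] * 3 + [False] * 1

model_precision = fact_precision(model_flags)
human_precision = fact_precision(human_flags)
fact_ratio = len(model_flags) / len(human_flags)
```

In this toy case the model summary contains five times as many atomic facts as the reference while its share of supported facts is higher, which is the kind of pattern the abstract describes.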