Advancing orthopedic care through large language models requires both multimodal processing capabilities for medical images and open-source deployment options for secure in-house operations, yet these remain underexplored in current literature. This study aims to benchmark open-source vision-language models (VLMs) against orthopedic residents using the Orthopedic In-Training Examination (OITE), assess domain-specific performance across orthopedic subspecialties, and investigate the relationship between model parameter size and performance. Six open-source VLMs of varying sizes (Alibaba Qwen2.5-VL-72B-Instruct, Alibaba Qwen2.5-VL-32B-Instruct, Alibaba Qwen2.5-VL-7B-Instruct, Alibaba Qwen2.5-VL-3B-Instruct, Meta Llama-3.2-90B-Vision-Instruct, Meta Llama-3.2-11B-Vision-Instruct) were evaluated using the 2023 OITE (210 questions; 111 with images). Model performance was compared to resident scores from the 2023 OITE technical report. The Pearson correlation coefficient was used to assess the association between model size and performance. The 2 largest open-source models, Qwen2.5-VL-72B and Llama-3.2-90B, performed comparably to second-year orthopedic residents on the OITE. A mid-sized model, Qwen2.5-VL-32B, slightly outscored first-year residents. In contrast, small models (under 11 billion parameters) performed worse than first-year residents. Qwen2.5-VL-72B performed best in foot & ankle and sports medicine topics, while Llama-3.2-90B was strongest in basic science and hand & wrist. All models had the most difficulty with spine and pediatric questions. Overall, model accuracy increased steadily with model size up to 72 billion parameters, but larger sizes showed little additional improvement. Smaller models offer reduced accuracy in exchange for lower hardware requirements. Spine and pediatric domains consistently remain areas of underperformance across all models. 
Model selection should be based on domain-specific benchmark results to balance clinical needs with hardware limitations. While promising, open-source VLMs currently require further refinement and validation before they can be reliably applied in clinical or educational settings.
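As a minimal sketch of the size-versus-performance analysis described above, the Pearson correlation coefficient can be computed as follows. The parameter counts are those of the six models named in the abstract; the accuracy values are hypothetical placeholders (not the study's results), and the log10 transform of model size is an added assumption that the abstract does not mention.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Parameter counts (billions) of the six models named in the abstract.
params_b = [72, 32, 7, 3, 90, 11]
# HYPOTHETICAL accuracies, for illustration only (not the study's results).
accuracy = [0.60, 0.55, 0.45, 0.40, 0.61, 0.47]

r_raw = pearson_r(params_b, accuracy)
# Assumption: correlating against log10(size) as an alternative view.
r_log = pearson_r([math.log10(p) for p in params_b], accuracy)
print(f"r (raw size) = {r_raw:.3f}, r (log10 size) = {r_log:.3f}")
```

With only six models, any such coefficient carries wide uncertainty, so it is best read alongside the per-model scores rather than on its own.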
Multivariate workload prediction in cloud computing environments is a critical research problem. Effectively capturing inter-variable correlations and temporal patterns in multivariate time series is key to addressing this challenge. To this end, this paper proposes a convolutional model based on a Nonlinear Spiking Neural P System (ConvNSNP), which enhances the ability to process nonlinear data compared to conventional convolutional models. Building upon this, a hybrid forecasting model is developed by integrating ConvNSNP with a Bidirectional Long Short-Term Memory (BiLSTM) network. ConvNSNP is first employed to extract temporal and cross-variable dependencies from the multivariate time series, followed by BiLSTM to further strengthen long-term temporal modeling. Comprehensive experiments are conducted on three public cloud workload traces from Alibaba and Google. The proposed model is compared with a range of established deep learning approaches, including CNN, RNN, LSTM, and TCN, and hybrid models such as LSTNet, CNN-GRU, and CNN-LSTM. The results show that the proposed model achieves up to 9.9% improvement in RMSE and 11.6% improvement in MAE compared with the most effective baseline methods. The model also achieves favorable performance in terms of MAPE, further validating its effectiveness in multivariate workload prediction.
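The error metrics named above (RMSE, MAE, MAPE) and a relative-improvement figure of the kind reported can be sketched as follows. The function names and the sample series are hypothetical and for illustration only; the abstract does not state how the authors computed their improvement percentages, so the formula below is an assumption.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error; points with a true value of 0 are skipped."""
    terms = [abs((t - p) / t) for t, p in zip(y_true, y_pred) if t != 0]
    return 100.0 * sum(terms) / len(terms)

def relative_improvement(baseline_error, model_error):
    """Percentage reduction in error relative to a baseline (assumed definition)."""
    return 100.0 * (baseline_error - model_error) / baseline_error

# HYPOTHETICAL workload values and forecasts, for illustration only.
y_true = [10.0, 20.0, 30.0, 40.0]
baseline_pred = [12.0, 18.0, 33.0, 44.0]
model_pred = [11.0, 19.0, 31.0, 42.0]

gain = relative_improvement(rmse(y_true, baseline_pred), rmse(y_true, model_pred))
print(f"RMSE improvement over baseline: {gain:.1f}%")
```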
Pediatric heart disease (PHD), including congenital heart defects, is often incompletely captured in electronic health records, particularly when clinical significance must be inferred from unstructured echocardiogram reports. Automated methods capable of extracting clinically meaningful PHD from narrative reports could improve clinical decision support and research applications. This study aimed to evaluate the feasibility of using supervised fine-tuning of large language models (LLMs), with and without chain-of-thought (CoT) reasoning, to characterize patients with clinically significant or historical PHD from unstructured echocardiogram reports. We developed a PHD detection algorithm using fine-tuned open-source LLMs, including LLaMA (Meta) and Qwen (Alibaba), to analyze 9749 echocardiogram reports. A subset of 712 reports was adjudicated by 2 pediatric cardiac anesthesiologists, who classified 506 (71.1%) as clinically significant PHD and 206 (28.9%) as not significant. While DeepSeek R1 has shown improved performance with CoT reasoning, its application in medical contexts is underexplored. We incorporated R1-generated CoT into model prompts and fine-tuned backbone LLMs. The fine-tuned Qwen-7B-10k-overthink-CoT achieved the highest accuracy (92.4%), outperforming Qwen-7B-without-CoT (90%), LLaMA-3B-without-CoT (87.9%), Qwen-3B-without-CoT (85.6%), Qwen-3B-10k-overthink-CoT (68.5%), and LLaMA-3B-10k-overthink-CoT (46.2%). In external validation on a second dataset (n=113; 64 positive, 49 negative), Qwen-7B-10k-overthink-CoT sustained a strong, balanced performance (82.7%), compared with Qwen-7B-without-CoT (88.4%), LLaMA-3B-without-CoT (86.8%), Qwen-3B-without-CoT (84.5%), Qwen-3B-10k-overthink-CoT (58.9%), and LLaMA-3B-10k-overthink-CoT (46.2%). The fine-tuned Qwen-7B model with overthinking CoT (10,000 tokens) achieved the highest internal accuracy (92.4%), with balanced sensitivity and specificity. 
Across repeated runs, CoT-enhanced models demonstrated improved classification consistency compared to non-CoT models (Qwen-7B-without-CoT: 90%, LLaMA-3B-without-CoT: 87.9%, Qwen-3B-without-CoT: 85.6%). In external validation (n=113), non-CoT variants achieved higher accuracy (up to 88.4%), whereas the Qwen-7B CoT model demonstrated more balanced class performance (accuracy=82.7%). Supervised fine-tuning of LLMs with CoT offers an effective approach for automated PHD detection within unstructured data in the electronic medical record. While CoT-enhanced models demonstrated improved internal performance and more balanced classification, they did not consistently achieve higher accuracy in external validation, highlighting trade-offs between accuracy and class balance. These findings highlight the promise of LLM-based approaches for clinical text phenotyping while underscoring the need for larger, multicenter validation and careful calibration for real-world deployment. Continued validation and integration into the electronic medical record are essential for real-world, artificial intelligence-driven clinical decision support.
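The trade-off between overall accuracy and class balance noted above can be made concrete with a small sketch. The class sizes (64 positive, 49 negative) come from the external validation set described in the abstract; the confusion-matrix counts for the two models are hypothetical and are not the study's results. Balanced accuracy (the mean of sensitivity and specificity) is used here as one assumed way to quantify class balance.

```python
def classification_summary(tp, fn, tn, fp):
    """Return accuracy, sensitivity, specificity and balanced accuracy."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        # Absolute gap between the two class-wise rates; smaller is more balanced.
        "class_gap": abs(sensitivity - specificity),
    }

# HYPOTHETICAL counts on a 64-positive / 49-negative set, for illustration only.
model_a = classification_summary(tp=62, fn=2, tn=38, fp=11)  # higher accuracy
model_b = classification_summary(tp=53, fn=11, tn=41, fp=8)  # more balanced

for name, s in (("A", model_a), ("B", model_b)):
    print(f"model {name}: acc={s['accuracy']:.3f} "
          f"sens={s['sensitivity']:.3f} spec={s['specificity']:.3f}")
```

In this made-up case, model A is more accurate overall yet misclassifies a larger share of negatives, which is the kind of pattern that accuracy alone would hide.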
Accurately differentiating early-stage breast cancer from benign lesions on MRI is essential to reduce unnecessary biopsies. However, the limited interpretability of current deep learning models hinders their clinical trustworthiness and adoption. This study aimed to develop a clinically interpretable concept bottleneck model (CBM) that integrates radiologist-specific knowledge and automatically generates structured reports, thereby improving diagnostic accuracy and consistency in breast MRI interpretation. Preoperative breast MR images and radiological reports were retrospectively collected from five institutions (January 2016–July 2025) and allocated to internal, external and multi-reader cohorts. Lesion-related descriptors from free-text MRI reports were standardized into BI-RADS-compliant concepts. These concepts, alongside multiparametric MR sequences, were input into the CBM for classification and structured reporting of the lesions annotated by radiologists using bounding boxes. Model performance was evaluated using the area under the receiver operating characteristic curve (AUC) and compared against a black-box deep learning model. The accuracy of CBM-generated concepts was evaluated. A two-phase multi-reader study was further conducted to assess clinical utility. A total of 1,695 pathology-confirmed breast lesions (857 malignant and 838 benign) from 1,634 patients (median age 46 years, IQR 39–53) were included. The CBM achieved an AUC of 0.92 (95%CI 0.90–0.93) on the test set, comparable to the black-box model (AUC: 0.93, 95%CI 0.92–0.94). Concept accuracy ranged from 0.64 to 1.00. In the multi-reader study, the CBM matched the diagnostic accuracy of one radiologist and exceeded that of seven others (all P < 0.05). With CBM assistance, radiologists correctly downgraded 22.1% of lesions to benign. 
Diagnostic accuracy improved for three radiologists (from 0.71–0.72 to 0.82–0.91, all P < 0.05), and inter-reader agreement increased for both concept recognition and BI-RADS category (Gwet’s AC1: 0.27–1.00 to 0.46–1.00). The CBM provides a versatile framework for classifying early breast cancer and benign lesions. By employing an image-concept alignment strategy, it enhances intrinsic interpretability and offers radiologists clinically relevant, intelligible decision support that serves both diagnostic and educational needs. Moreover, this retrospective study demonstrates its potential to reduce unnecessary biopsies for benign breast lesions and to improve reporting consistency in breast MRI. The online version contains supplementary material available at 10.1186/s12916-026-04889-7.
The synergistic development of new quality productive forces (NQPF) and innovation resource allocation is critical for achieving sustainable and high-quality economic growth. Using provincial data from 2012 to 2022 in China, this study constructs an evaluation framework for NQPF and innovation resource allocation, and employs an unsupervised dual-tower multilayer perceptron (MLP) neural network model to measure the coupling coordination degree. The spatial differentiation and dynamic evolution of the coordination degree are further explored. The results demonstrate that the MLP approach offers superior performance in identifying long-term trends while remaining robust to short-term fluctuations. Despite remaining at a primary stage, the overall coordination degree exhibits a distinct upward trajectory. Spatial disparities are primarily driven by interregional differences, with the eastern region exhibiting short-term positive development cycles, the central region showing steady catch-up progress, and the western region facing challenges of marginalization. Moreover, significant spatial spillover effects highlight the influence of geographical proximity, underscoring the importance of cross-regional cooperation and innovation resource sharing.
Fibrosis is one of the major causes of cardiac allograft malfunction and is mainly driven by fibroblasts. However, the role of recipient-derived cells in generating allograft fibroblasts and the underlying mechanisms remain to be explored. We analyzed human heart allograft samples and used murine transplant models (C57BL/6J, Cd34 (cluster of differentiation 34)-CreERT2; R26-tdTomato, mRFP (cell membrane labeled with red fluorescence protein) mice, Rosa26-iDTR, Postn-CreERT2; R26-tdTomato, double-tdTomato, and immunodeficient mice with BALB/c donors). Human progenitor cells were cultivated from blood. Single-cell RNA sequencing, Western blotting, quantitative polymerase chain reaction, immunohistochemistry, whole-mount staining with 3-dimensional reconstruction, and in vivo/in vitro experiments were applied to characterize allograft cellular composition and communication. Single-cell RNA sequencing was used to delineate the allograft cell atlas of patients and mice. Y chromosome analysis showed that recipient-derived cells contributed to allograft fibroblasts in both patients and murine models. Using genetic cell lineage tracing, we found that recipient-derived CD34+ cells could give rise to activated fibroblasts. Bone marrow transplantation and parabiosis models revealed that the recipient's circulating non-bone marrow Cd34+ cells could generate allograft fibroblasts. Human CD34+ cells could differentiate into fibroblasts both in vivo and in vitro. CD34+ fibroblast progenitors were recruited by CXCL12 (C-X-C motif chemokine ligand 12)-ACKR3 (atypical chemokine receptor 3) and MIF (macrophage migration inhibitory factor)-ACKR3 interactions and differentiated via the TGFβ (transforming growth factor beta)/GFPT2 (glutamine-fructose-6-phosphate transaminase 2)/SMAD2/4 (small mother against decapentaplegic 2/4) axis. Ablation of recipient Cd34+ cells reduced activated fibroblasts and alleviated allograft fibrosis. 
We identify circulating CD34+ cells as a novel source of fibroblast progenitors that contribute to cardiac allograft fibrosis, suggesting that targeting recipient CD34+ cells could be a novel therapeutic strategy for treating cardiac fibrosis after heart transplantation.
Heavy-duty vehicles (HDVs) are major greenhouse gas emitters, and liquefied natural gas (LNG)-powered HDVs have emerged as a promising low-carbon alternative. However, their real-world emission performance and mitigation potential remain insufficiently studied, necessitating the characterization of LNG container trucks' on-road CO2 emissions via advanced sensing technologies. To characterize these emissions, real-driving emissions from China VI LNG and diesel-powered container trucks were measured using portable emissions measurement systems (PEMS). The results reveal that, on highways, high CO2 emissions predominantly occur during low- to medium-speed acceleration and at speeds above 40 km/h with an acceleration exceeding 0.3 m/s², whereas emissions on port roads are more dispersed. A third-degree polynomial in vehicle-specific power (VSP) fits the emissions well. Engine parameters mainly influence CO2 emissions for LNG trucks, while VSP and acceleration significantly impact diesel trucks. The Random Forest model achieves superior prediction accuracy, particularly in highway scenarios, and significantly better CO2 forecasting for LNG-powered trucks. These findings validate the effectiveness of PEMS-based sensing in characterizing low-carbon HDVs' real-world emissions. The integration of multi-source sensor data and machine learning also provides a reference for intelligent sensing in transportation environmental monitoring.
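The third-degree polynomial fit of emissions against VSP mentioned above can be sketched with an ordinary least-squares fit. This is a minimal illustration, not the authors' procedure: the helper names `polyfit` and `polyval` are defined here, and the VSP/CO2 values are synthetic, generated from an arbitrary cubic chosen only to check that the fit recovers it.

```python
def polyfit(xs, ys, degree=3):
    """Least-squares polynomial fit; returns coefficients [c0, c1, ..., c_degree]."""
    m = degree + 1
    # Normal equations (A^T A) c = A^T y, where A[i][j] = xs[i] ** j.
    ata = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    aty = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    aug = [row + [rhs] for row, rhs in zip(ata, aty)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, m + 1):
                aug[r][k] -= factor * aug[col][k]
    # Back substitution.
    coeffs = [0.0] * m
    for i in reversed(range(m)):
        tail = sum(aug[i][j] * coeffs[j] for j in range(i + 1, m))
        coeffs[i] = (aug[i][m] - tail) / aug[i][i]
    return coeffs

def polyval(coeffs, x):
    """Evaluate c0 + c1*x + c2*x^2 + ... at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))

def synthetic_co2(vsp):
    """ARBITRARY cubic used to generate illustrative data (not measured values)."""
    return 2.0 + 0.5 * vsp + 0.03 * vsp ** 2 + 0.001 * vsp ** 3

vsp_bins = [float(v) for v in range(-5, 11)]      # hypothetical VSP bin centres
co2_rates = [synthetic_co2(v) for v in vsp_bins]  # synthetic emission rates
coeffs = polyfit(vsp_bins, co2_rates, degree=3)
print("fitted coefficients:", [round(c, 4) for c in coeffs])
```

On real PEMS data the fit would be noisy, and the normal-equation approach used here for brevity is less numerically robust than a QR-based solver over wide VSP ranges.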
Purpose To develop and validate a deep learning-based approach, Gastric Neoplasm Detection with Artificial Intelligence (GANDA), for automated detection, diagnosis, and segmentation of gastric neoplasms at clinical routine contrast-enhanced CT. Materials and Methods In this retrospective study, GANDA, a joint segmentation and classification three-dimensional deep learning model, was developed by using CT data of 1683 patients with or without gastric neoplasms from one hospital between January 2007 and June 2019. Performance was evaluated in an internal test cohort (January 2019 to June 2019), an external test cohort (April 2015 to December 2022) from four external centers, and a real-world test cohort (March 2023 to May 2023) from one hospital. Model performance for tumor detection and diagnosis was assessed using receiver operating characteristic analysis and compared with that of 10 board-certified radiologists (median experience, 8.5 years [IQR: 5.25-14]). Model segmentation performance was assessed using the Dice coefficient. Results A total of 4606 patients were included in the study (median age, 57 years [IQR: 48-66]; 2554 male patients). In the internal test cohort (n = 266), GANDA achieved 87.3% sensitivity and 87.2% specificity for tumor detection. The model demonstrated significantly higher diagnostic accuracy (top-1 accuracy, 85.3%; 95% CI: 81.2, 89.1) compared with radiologists (mean accuracy, 74.2%; 95% CI: 70.5, 77.6; P = .002). In the external test cohort (n = 2657), GANDA distinguished between patients with gastric neoplasms and controls with 77.4% sensitivity and 89.8% specificity. The mean Dice coefficient in the internal test cohort was 0.52 for gastric cancer and 0.45 for non-gastric cancer. In the real-world test cohort (n = 7695), GANDA achieved 83.2% sensitivity and 93.1% specificity for tumor detection. Conclusion GANDA enabled the detection and segmentation of gastric neoplasms at routine clinical CT scans. 
Keywords: CT, Computed Tomography, Abdomen/GI, Stomach, Screening, Gastric Neoplasm, Deep Learning Supplemental material is available for this article. ©RSNA, 2025.
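For readers unfamiliar with the segmentation metric reported above, the Dice coefficient between a predicted and a reference mask can be sketched as follows. The masks are tiny hypothetical examples flattened to lists of 0/1 values; returning 1.0 for two empty masks is a convention assumed here, not something stated in the abstract.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice = 2|A ∩ B| / (|A| + |B|) for flat binary masks of equal length."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    total = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total

# HYPOTHETICAL six-voxel masks, for illustration only.
predicted = [1, 1, 1, 0, 0, 0]
reference = [0, 1, 1, 1, 0, 0]
print(f"Dice = {dice_coefficient(predicted, reference):.3f}")
```

Because Dice penalizes both missed and spurious voxels relative to lesion size, small lesions tend to yield lower values for the same absolute boundary error.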
Esophageal varices (EV) represent a critical complication of portal hypertension, affecting approximately 60% of cirrhosis patients with a significant bleeding risk of ∼30%. While traditionally diagnosed through invasive endoscopy, non-contrast computed tomography (NCCT) presents a potential non-invasive alternative that has yet to be fully utilized in clinical practice. We present Multi-Organ-COhesion Network++ (MOON++), a novel multimodal framework that enhances EV assessment through comprehensive analysis of NCCT scans. Inspired by clinical evidence correlating organ volumetric relationships with liver disease severity, MOON++ synthesizes imaging characteristics of the esophagus, liver, and spleen through multimodal learning. We evaluated our approach using 1631 patients; those with endoscopically confirmed EV were classified into four severity grades. Validation in 239 patient cases and independent testing in 289 cases demonstrate superior performance compared to conventional single-organ methods, achieving an AUC of 0.894 versus 0.803 for severe-grade EV classification (G3 versus <G3) and 0.921 versus 0.793 for the differentiation of moderate to severe grades (≥G2 versus <G2). We conducted a reader study involving experienced radiologists to further validate the performance of MOON++. To our knowledge, MOON++ represents the first comprehensive multi-organ NCCT analysis framework incorporating clinical knowledge priors for EV assessment, potentially offering a promising non-invasive diagnostic alternative. Code is available at https://github.com/StevenHaojc/MOON.
The intrinsic variability of solar and wind energy, compounded by their rapid expansion, has intensified power curtailment challenges [1,2]. Although spatiotemporal complementarity between these resources is widely recognized as a pathway to enhance renewable integration and reduce balancing requirements [3-16], existing assessments are largely based on hypothetical deployments [17-24]. Consequently, how solar-wind complementarity manifests under real-world infrastructure and shapes system-level integration outcomes remains unclear. Here we develop a unified national inventory to enable a data-driven assessment of solar-wind complementarity. The inventory covers 319,972 solar photovoltaic facilities and 91,609 wind turbines in 2022, identified from sub-metre satellite imagery using a deep-learning-based framework. Using this dataset, we show that solar-wind complementarity substantially reduces generation variability, with effectiveness increasing as the geographic scope of pairing expands. At the system level, nationwide inter-provincial coordination raises effective renewable penetration by 99.88 TWh in an 80% dispatchable-flexibility system, corresponding to 9.1% of total solar and wind generation, or approximately 120 h of national average load. These findings demonstrate that energy complementarity is a scalable, system-wide mechanism for advancing solar and wind penetration, offering broadly applicable insights into the role of inter-regional coordination in enhancing renewable integration in large power systems.
Interpretability has become an essential topic for artificial intelligence in some high-risk domains such as healthcare, banking, and security. For commonly used tabular data, traditional methods trained end-to-end machine learning models with numerical and categorical data only and did not leverage human-understandable knowledge such as data descriptions. Yet mining human-level knowledge from tabular data and using it for prediction remains a challenge. In this paper, we propose a novel component for tabular data, called the quantitative argumentation layer, which mines concepts from both data and data descriptions. We construct a concept and argumentation model (CAM) that embeds human-aligned reasoning processes: quantitative argumentation explicitly represents domain knowledge through human-understandable argumentation rules rather than opaque machine encodings. As a result, CAM provides decisions that are based on human-level knowledge, and the reasoning process is intrinsically interpretable. Finally, to explain the proposed interpretable model, we provide a dialogical explanation containing the dominant reasoning paths within CAM. Human-subject evaluations indicate that CAM is comprehensible to individuals, and that the explanations provide reasonable rationales and have a high level of user acceptance. We also conduct data experiments on both open-source benchmarks and real-world business datasets, which show that our interpretable approach can reach competitive results compared with state-of-the-art models.
We propose a data-driven approach to realize chaotic control. By virtue of the reservoir computing approach, we obtain an appealing model for characterizing chaotic systems with only observational data required. By applying the Grebogi-Yorke algorithm to the reservoir computing model, we show that the dynamical variables for characterizing trajectory evolution indicate successful synchronization in the considered systems. We sample data from several chaotic systems as well as real-world systems to demonstrate the effectiveness of our approach. Our work overcomes the reliance of traditional chaos control on analytical system models, thereby extending chaotic control theory to more complex industrial scenarios.
To develop a dual-phase deep learning (DPDL) model using arterial/portal-phase CT to detect incidental small (0.5-3 cm) pancreatic cystic lesions (PCLs). Contrast-enhanced CT images of 437 incidental small PCLs, including 201 subcentimeter cysts (0.5-1 cm), and 193 normal pancreases were retrospectively collected (January 2018 to December 2020) and randomly divided into training, validation and testing cohorts. Detection sensitivity, specificity, any-false-positive rate (AFPR) and accuracy of the DPDL model were compared with those of a portal single-phase deep learning (SPDL) model and of senior and junior radiologists in the testing cohort. Factors potentially affecting detection were analyzed using logistic regression analysis. In the validation cohort, the DPDL model exceeded the SPDL model in sensitivity (91.7% vs. 82.1%; P = 0.021). In the testing cohort, it surpassed the junior radiologist in sensitivity (92.7% vs. 74.0%; P < 0.001) and accuracy (86.2% vs. 69.7%; P = 0.003), while performing comparably to the senior radiologist (all P > 0.05). Subgroup analysis confirmed DPDL's superiority over the junior radiologist for subcentimeter PCLs. With DPDL assistance, the sensitivity of radiologists was significantly improved, while the detection time and AFPR of the junior radiologist were significantly reduced (all P < 0.05). None of the studied factors affected DPDL's performance, whereas SPDL and radiologists were influenced by multiple factors. Notably, the DPDL model identified all 18 incidental PCLs that progressed during follow-up (3 malignant), while the junior radiologist missed 2 in the testing cohort. The DPDL model exhibited superior and more robust detection performance for small PCLs than the SPDL model and the junior radiologist, potentially narrowing performance gaps between experience levels and improving early diagnosis of high-risk lesions.
Accurate cloud workload forecasting is critical for proactive resource provisioning, cost control, and Service Level Agreement (SLA) compliance; however, it is hindered by the scarcity and heterogeneity of labels. We present Time Series Augmentation for Multi-Scale Prediction (TSA-MSP), a self-supervised framework that achieves near-fully supervised accuracy with limited labels. Conceptually, TSA-MSP couples cloud-informed augmentations (multi-scale time warping, frequency-domain mixing, and periodic pattern injection) with a hierarchical multi-scale contrastive objective and efficient fine-tuning using lightweight adapters to capture short- and long-range workload patterns. We conducted an empirical study on real-world traces (Alibaba 2018, Google 2019, Azure 2019): self-supervised pre-training on unlabeled data followed by fine-tuning with small labeled subsets. We evaluated the Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Symmetric Mean Absolute Percentage Error (sMAPE), and a peak-oriented metric (Peak Prediction Accuracy, PPA), and performed ablations (augmentations, scales), label-efficiency tests, and cross-dataset transfer tests. Remarkably, with only 20% labels, TSA-MSP attains RMSE within ∼1.5-1.7% of fully supervised training (100% labels) across all three datasets; for example, on Alibaba, it achieves MAE/RMSE 0.241/0.335 versus 0.238/0.330 for a fully supervised Pathformer. Against the best self-supervised baseline at 20% labels, TSA-MSP reduces RMSE by ∼6.6-6.9%. In a 5%-label setting, it outperforms supervised training from scratch by ∼23-26% RMSE. Cross-dataset transfer further improves the RMSE by ∼2.4-5.2% over direct target-only pre-training. By combining domain-aware augmentations with multi-scale self-supervision and efficient adaptation, TSA-MSP delivers accurate forecasts under scarce labels, improving peak readiness while enabling faster, more cost-effective deployment for real-world cloud resource management.
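The "within ∼1.5-1.7%" claim above can be checked against the Alibaba figures quoted in the same abstract with a short worked calculation. The sketch below also includes one common definition of sMAPE; the abstract does not say which sMAPE variant the authors used, so that formula is an assumption, and the function names are hypothetical.

```python
def relative_gap_percent(value, reference):
    """Percentage by which `value` exceeds `reference`."""
    return 100.0 * (value - reference) / reference

def smape(y_true, y_pred):
    """Symmetric MAPE in percent, using (|t| + |p|) / 2 as the denominator."""
    terms = []
    for t, p in zip(y_true, y_pred):
        denom = (abs(t) + abs(p)) / 2
        terms.append(0.0 if denom == 0 else abs(t - p) / denom)
    return 100.0 * sum(terms) / len(terms)

# Alibaba figures quoted in the abstract: 20%-label TSA-MSP vs fully supervised.
rmse_gap = relative_gap_percent(0.335, 0.330)
mae_gap = relative_gap_percent(0.241, 0.238)
print(f"RMSE gap: {rmse_gap:.2f}%  MAE gap: {mae_gap:.2f}%")
```

The RMSE gap works out to roughly 1.5%, consistent with the lower end of the range stated in the abstract.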
Passive smartphone sensing shows promise for suicide prevention, but behavioral metadata (GPS, screen time, and accelerometry) often lacks the contextual information needed to detect acute psychological distress. Analyzing what people actually see, read, and type on their phones, rather than just usage patterns, may provide more proximal signals of risk. This study aimed to test whether vision-language models (VLMs) applied to passively captured smartphone screenshots can predict momentary suicidal ideation (SI). Seventy-nine adults with past-month suicidal thoughts or behaviors completed ecological momentary assessments (EMA) over 28 days while screenshots were captured every 5 seconds during active phone use. We fine-tuned open-source VLMs (Qwen2.5-VL [Alibaba Cloud], LFM2-VL [Liquid AI]) and text-only models (Qwen3 [Alibaba Cloud]) to predict SI from screenshots captured in the 2 hours preceding each EMA. We evaluated performance with temporal and subject holdouts. The analytic sample comprised 2.5 million screenshots from 70 participants. Temporal holdout models achieved strong discrimination at the EMA level (AUC=0.83; AUPRC=0.77), with image-based models outperforming text-only models (AUC=0.83 vs 0.79; 95% CI 0.003-0.07). Subject holdout generalization was near chance (AUC≈0.50), though a simple lexical screening method retained modest discrimination (AUC=0.62). Smaller models performed comparably to larger models, supporting feasible on-device deployment. Screen content predicts short-term SI with clinically meaningful accuracy when models are personalized but does not generalize across individuals. These findings support a 2-stage clinical architecture: coarse lexical screening for new patients, followed by personalized VLM-based monitoring after a calibration period. On-device inference may enable privacy-preserving deployment.
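The difference between the temporal and subject holdouts described above can be sketched as two splitting functions. This is an illustrative reading of the two designs, not the authors' code: the record layout (`subject`, `time` keys), the 20% test fraction, and the function names are all assumptions.

```python
from collections import defaultdict

def temporal_holdout(records, test_fraction=0.2):
    """Per participant, train on the earliest records and test on the latest."""
    by_subject = defaultdict(list)
    for rec in records:
        by_subject[rec["subject"]].append(rec)
    train, test = [], []
    for recs in by_subject.values():
        ordered = sorted(recs, key=lambda r: r["time"])
        cut = int(len(ordered) * (1 - test_fraction))
        train.extend(ordered[:cut])
        test.extend(ordered[cut:])
    return train, test

def subject_holdout(records, test_subjects):
    """Hold out entire participants, so test subjects are never seen in training."""
    held_out = set(test_subjects)
    train = [r for r in records if r["subject"] not in held_out]
    test = [r for r in records if r["subject"] in held_out]
    return train, test

# HYPOTHETICAL records: three participants with ten time-ordered assessments each.
records = [{"subject": s, "time": t} for s in ("a", "b", "c") for t in range(10)]
temporal_train, temporal_test = temporal_holdout(records)
subject_train, subject_test = subject_holdout(records, ["c"])
print(len(temporal_train), len(temporal_test), len(subject_train), len(subject_test))
```

Under the temporal split every test participant also appears in training, which is why it measures personalized performance, whereas the subject split measures generalization to new individuals.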
There is an increasing amount of literature evaluating the clinical knowledge and reasoning performance of large language models (LLMs) in ophthalmology, but to date, investigations into their multimodal clinical abilities, such as interpreting images and tables, have been limited. To evaluate the multimodal performance of the following 7 foundation models (FMs): GPT-4o (OpenAI), Gemini 1.5 Pro (Google), Claude 3.5 Sonnet (Anthropic), Llama-3.2-11B (Meta), DeepSeek V3 (High-Flyer), Qwen2.5-Max (Alibaba Cloud), and Qwen2.5-VL-72B (Alibaba Cloud) in answering offline Fellowship of the Royal College of Ophthalmologists part 2 written multiple-choice textual and multimodal questions, with head-to-head comparisons with physicians. This cross-sectional study was conducted between September 2024 and March 2025 using questions sourced from a textbook used as an examination preparation resource for the Fellowship of the Royal College of Ophthalmologists part 2 written examination. FM performance. The primary outcome measure was FM accuracy, defined as the proportion of answers generated by the model matching the textbook's labeled letter answer. For textual questions, Claude 3.5 Sonnet (accuracy, 77.7%) outperformed all other FMs (followed by GPT-4o [accuracy, 69.9%], Qwen2.5-Max [accuracy, 69.3%], DeepSeek V3 [accuracy, 63.2%], Gemini Advanced [accuracy, 62.6%], Qwen2.5-VL-72B [accuracy, 58.3%], and Llama-3.2-11B [accuracy, 50.7%]), ophthalmology trainees (difference, 9.0%; 95% CI, 2.4%-15.6%; P = .01), and junior physicians (difference, 35.2%; 95% CI, 28.3%-41.9%; P < .001), with performance comparable to that of expert ophthalmologists (difference, 1.3%; 95% CI, -5.1% to 7.4%; P = .72). GPT-4o (accuracy, 69.9%) outperformed GPT-4 (OpenAI; difference, 8.5%; 95% CI, 1.1%-15.8%; P = .02) and GPT-3.5 (OpenAI; difference, 21.8%; 95% CI, 14.3%-29.2%; P < .001). 
For multimodal questions, GPT-4o (accuracy, 57.5%) outperformed all other FMs (Claude 3.5 Sonnet [accuracy, 47.5%], Qwen2.5-VL-72B [accuracy, 45%], Gemini Advanced [accuracy, 35%], and Llama-3.2-11B [accuracy, 25%]) and the junior physician (difference, 15%; 95% CI, -6.7% to 36.7%; P = .18) but was weaker than expert ophthalmologists (accuracy range, 70.0%-85.0%; P = .16) and trainees (accuracy range, 62.5%-80%; P = .35). Results of this cross-sectional study suggest that for textual questions, current FMs exhibited notable improvements in ophthalmological knowledge reasoning when compared with older LLMs and ophthalmology trainees, with performance comparable with that of expert ophthalmologists. These models demonstrated potential for medical assistance for answering ophthalmological textual queries, but their multimodal abilities remain limited. Further research or fine-tuning models with diverse ophthalmic multimodal data may lead to more capable applications with multimodal functionalities.
Pancreatic masses present significant challenges in clinical management due to their diverse manifestations and inherent complexity. Dual-phase contrast-enhanced CT is essential for accurate diagnosis, yet widely adopted segmentation methods rely on image registration, which compromises both precision and efficiency. In this study, we introduce a novel architecture that utilizes a cross-attention mechanism for selective feature integration across different phases, achieving registration-free dual-phase segmentation of the pancreas and pancreatic masses. Our model incorporates a dual-path encoder with symmetrical branches specifically designed for the arterial and portal venous phases, where weight-shared cross-attention modules perform symmetrical feature selection and alignment, obviating explicit registration. We further design a progressive fusion decoder that incrementally merges features from both branches through multiple cross-attention modules, ensuring optimal utilization of information from both imaging phases throughout the decoding process. Extensive evaluations on one internal and three external datasets demonstrate that our approach not only outperforms previous registration-dependent methods in accuracy (Dice: 81.86% vs 76.68%) but also improves inference speeds (10.55s vs 130.07s per scan), setting new benchmarks in the field. Additional comparative experiments underscore the efficacy and robustness of our symmetrical fusion framework, confirming its potential as a superior alternative to conventional techniques.
Compared with unimodal knowledge distillation (KD), cross-modal KD is more challenging due to modality differences. However, how such differences affect cross-modal KD remains insufficiently understood. In this paper, we propose the Non-Target Divergence Hypothesis (NTDH), which states that modality differences mainly affect cross-modal KD through divergences in non-target class predictions, and that smaller non-target divergence leads to better student performance. We further provide a theoretical analysis based on Vapnik-Chervonenkis (VC) theory, deriving an upper bound on the cross-modal KD error that supports the proposed hypothesis. Extensive experiments on five cross-modal datasets validate the effectiveness, generality, and practical relevance of NTDH.
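To make the notion of non-target divergence more concrete, the sketch below shows one plausible way to measure it: drop the target class from the teacher's and student's softmax outputs, renormalize, and take the KL divergence between the two remaining distributions. This formalization is an assumption made for illustration; the abstract does not give the paper's exact definition, and all function names and logits here are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def non_target_distribution(probs, target):
    """Remove the target class and renormalize the remaining probabilities."""
    rest = [p for i, p in enumerate(probs) if i != target]
    total = sum(rest)
    return [p / total for p in rest]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) with a small epsilon to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def non_target_divergence(teacher_logits, student_logits, target):
    """Assumed measure: KL between teacher and student non-target distributions."""
    teacher = non_target_distribution(softmax(teacher_logits), target)
    student = non_target_distribution(softmax(student_logits), target)
    return kl_divergence(teacher, student)

# HYPOTHETICAL logits for a 3-class problem whose target class is index 0.
same_ranking = non_target_divergence([2.0, 1.0, 0.5], [3.0, 1.0, 0.5], target=0)
flipped = non_target_divergence([2.0, 1.0, 0.5], [2.0, 0.5, 1.0], target=0)
print(f"same non-target ranking: {same_ranking:.4f}, flipped: {flipped:.4f}")
```

Under this definition, changing only the target logit leaves the divergence at zero, so the quantity isolates disagreement among the non-target classes.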
Based on the theory of rational expectations equilibrium, this study examines the optimal decision-making of a fresh-produce retailer under strategic customer behavior. The paper focuses on analyzing the impact of perceived customer value on the retailer's investment in freshness preservation and inventory configuration, as well as how market size moderates this influence. The findings reveal that in the face of strategic customer behavior and an increased perception of value for the freshness of products nearing their sell-by date, the retailer tends to lower retail prices, which subsequently affects its enthusiasm for investment in freshness preservation efforts and inventory configuration. The influence of perceived customer value on preservation effort and inventory volume is differential and significantly moderated by market size. When the market size is large, the retailer is inclined to implement optimal preservation measures, but inventory volume may be negatively impacted. With a medium market size, the retailer reduces preservation efforts and inventory levels. However, in situations where perceived customer value is low, the retailer increases inventory volumes. When the market size is small, the impact of strategic customer behavior and perceived customer value on preservation efforts and inventory volumes is more pronounced.