Advancing orthopedic care through large language models requires both multimodal processing capabilities for medical images and open-source deployment options for secure in-house operations, yet these remain underexplored in current literature. This study aims to benchmark open-source vision-language models (VLMs) against orthopedic residents using the Orthopedic In-Training Examination (OITE), assess domain-specific performance across orthopedic subspecialties, and investigate the relationship between model parameter size and performance. Six open-source VLMs of varying sizes (Alibaba Qwen2.5-VL-72B-Instruct, Alibaba Qwen2.5-VL-32B-Instruct, Alibaba Qwen2.5-VL-7B-Instruct, Alibaba Qwen2.5-VL-3B-Instruct, Meta Llama-3.2-90B-Vision-Instruct, Meta Llama-3.2-11B-Vision-Instruct) were evaluated using the 2023 OITE (210 questions; 111 with images). Model performance was compared to resident scores from the 2023 OITE technical report. The Pearson correlation coefficient was used to assess the association between model size and performance. The 2 largest open-source models, Qwen2.5-VL-72B and Llama-3.2-90B, demonstrated performance levels comparable to those of second-year orthopedic residents on the OITE. A mid-sized model, Qwen2.5-VL-32B, slightly outscored first-year residents. In contrast, small models (under 11 billion parameters) performed worse than first-year residents. Qwen2.5-VL-72B performed best in foot & ankle and sports medicine topics, while Llama-3.2-90B was strongest in basic science and hand & wrist. All models had the most difficulty with spine and pediatric questions. Overall, model accuracy increased steadily with model size up to 72 billion parameters, but larger sizes showed little additional improvement. Smaller models offer reduced accuracy in exchange for lower hardware requirements. Spine and pediatric domains remained consistent areas of underperformance across all models.
Model selection should be based on domain-specific benchmark results to balance clinical needs with hardware limitations. While promising, open-source VLMs currently require further refinement and validation before they can be reliably applied in clinical or educational settings.
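The size-versus-performance analysis described above relies on the Pearson correlation coefficient. The following is a minimal sketch of that calculation, not the study's code. The parameter counts are taken from the model names in the abstract; the accuracy values are placeholders invented purely to make the example runnable, since the abstract does not report per-model scores.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Parameter counts in billions, read off the model names in the abstract.
sizes_b = [3, 7, 11, 32, 72, 90]

# PLACEHOLDER accuracies, NOT results from the study: they only mimic the
# described shape (rising with size, then flattening above 72B).
placeholder_accuracy = [0.40, 0.45, 0.48, 0.58, 0.66, 0.66]

r = pearson_r(sizes_b, placeholder_accuracy)
print(f"Pearson r on placeholder data: {r:.3f}")
```

Because Pearson's r measures linear association only, a plateau at the largest sizes lowers r even when accuracy never decreases, which is worth keeping in mind when reading a single correlation figure.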
There is an increasing amount of literature evaluating the clinical knowledge and reasoning performance of large language models (LLMs) in ophthalmology, but to date, investigations into their clinical multimodal abilities, such as interpreting images and tables, have been limited. The objective was to evaluate the multimodal performance of 7 foundation models (FMs), namely GPT-4o (OpenAI), Gemini 1.5 Pro (Google), Claude 3.5 Sonnet (Anthropic), Llama-3.2-11B (Meta), DeepSeek V3 (High-Flyer), Qwen2.5-Max (Alibaba Cloud), and Qwen2.5-VL-72B (Alibaba Cloud), in answering offline Fellowship of the Royal College of Ophthalmologists part 2 written multiple-choice textual and multimodal questions, with head-to-head comparisons with physicians. This cross-sectional study was conducted between September 2024 and March 2025 using questions sourced from a textbook used as an examination preparation resource for the Fellowship of the Royal College of Ophthalmologists part 2 written examination. The exposure was FM performance. The primary outcome measure was FM accuracy, defined as the proportion of answers generated by the model matching the textbook's labeled letter answer. For textual questions, Claude 3.5 Sonnet (accuracy, 77.7%) outperformed all other FMs (followed by GPT-4o [accuracy, 69.9%], Qwen2.5-Max [accuracy, 69.3%], DeepSeek V3 [accuracy, 63.2%], Gemini Advanced [accuracy, 62.6%], Qwen2.5-VL-72B [accuracy, 58.3%], and Llama-3.2-11B [accuracy, 50.7%]), ophthalmology trainees (difference, 9.0%; 95% CI, 2.4%-15.6%; P = .01), and junior physicians (difference, 35.2%; 95% CI, 28.3%-41.9%; P < .001), with performance comparable to that of expert ophthalmologists (difference, 1.3%; 95% CI, -5.1% to 7.4%; P = .72). GPT-4o (accuracy, 69.9%) outperformed GPT-4 (OpenAI; difference, 8.5%; 95% CI, 1.1%-15.8%; P = .02) and GPT-3.5 (OpenAI; difference, 21.8%; 95% CI, 14.3%-29.2%; P < .001).
For multimodal questions, GPT-4o (accuracy, 57.5%) outperformed all other FMs (Claude 3.5 Sonnet [accuracy, 47.5%], Qwen2.5-VL-72B [accuracy, 45%], Gemini Advanced [accuracy, 35%], and Llama-3.2-11B [accuracy, 25%]) and the junior physician (difference, 15%; 95% CI, -6.7% to 36.7%; P = .18) but was weaker than expert ophthalmologists (accuracy range, 70.0%-85.0%; P = .16) and trainees (accuracy range, 62.5%-80%; P = .35). Results of this cross-sectional study suggest that for textual questions, current FMs exhibited notable improvements in ophthalmological knowledge reasoning when compared with older LLMs and ophthalmology trainees, with performance comparable with that of expert ophthalmologists. These models demonstrated potential for medical assistance for answering ophthalmological textual queries, but their multimodal abilities remain limited. Further research or fine-tuning models with diverse ophthalmic multimodal data may lead to more capable applications with multimodal functionalities.
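The reported intervals for accuracy differences can be illustrated with a simple Wald interval for the difference of two proportions. This is a sketch under stated assumptions, not the authors' statistical method: the count of 40 multimodal questions is inferred from the reported percentages (for example, 57.5% corresponds to 23 of 40), the junior physician's 17 correct answers are inferred from the 15% difference, and the two proportions are treated as independent.

```python
import math


def difference_wald_ci(correct_a, correct_b, n, z=1.96):
    """Difference in accuracy (a minus b) with a Wald confidence interval.

    Both raters are assumed to answer the same number of questions, n,
    and the proportions are treated as independent (an assumption).
    """
    p_a = correct_a / n
    p_b = correct_b / n
    se = math.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
    diff = p_a - p_b
    return diff, diff - z * se, diff + z * se


# ASSUMED counts: 23/40 for GPT-4o (57.5%) and 17/40 for the junior
# physician (57.5% minus the reported 15% difference). The abstract states
# neither count, so both are inferences.
diff, lower, upper = difference_wald_ci(23, 17, 40)
print(f"difference {diff:.1%}, 95% CI {lower:.1%} to {upper:.1%}")
```

Under these assumptions the interval spans zero, which is consistent with the nonsignificant P value reported for the comparison with the junior physician.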
Accurate cloud workload forecasting is critical for proactive resource provisioning, cost control, and Service Level Agreement (SLA) compliance; however, it is hindered by the scarcity and heterogeneity of labels. We present Time Series Augmentation for Multi-Scale Prediction (TSA-MSP), a self-supervised framework that achieves near-fully supervised accuracy with limited labels. Conceptually, TSA-MSP couples cloud-informed augmentations (multi-scale time warping, frequency-domain mixing, and periodic pattern injection) with a hierarchical multi-scale contrastive objective and efficient fine-tuning using lightweight adapters to capture short- and long-range workload patterns. We conducted an empirical study on real-world traces (Alibaba 2018, Google 2019, Azure 2019): self-supervised pre-training on unlabeled data followed by fine-tuning with small labeled subsets. We evaluated the Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Symmetric Mean Absolute Percentage Error (sMAPE), and a peak-oriented metric (Peak Prediction Accuracy, PPA) and performed ablations (augmentations, scales), label efficiency, and cross-dataset transfer tests. Remarkably, with only 20% labels, TSA-MSP attains RMSE within ~1.5-1.7% of fully supervised training (100% labels) across all three datasets; for example, on Alibaba, it achieves MAE/RMSE 0.241/0.335 versus 0.238/0.330 for a fully supervised Pathformer. Against the best self-supervised baseline at 20% labels, TSA-MSP reduces RMSE by ~6.6-6.9%. In a 5%-label setting, it outperforms supervised training from scratch by ~23-26% RMSE. Cross-dataset transfer further improves the RMSE by ~2.4-5.2% over direct target-only pre-training. By combining domain-aware augmentations with multi-scale self-supervision and efficient adaptation, TSA-MSP delivers accurate forecasts under scarce labels, improving peak readiness while enabling faster, more cost-effective deployment for real-world cloud resource management.
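The error metrics named above can be written down compactly. The sketch below uses one common sMAPE convention (absolute error divided by the mean of the absolute actual and predicted values, with zero-denominator terms counted as zero); the abstract does not say which variant the authors used, so that choice is an assumption. The helper `relative_gap_percent` is a hypothetical name introduced here to show how a "within ~1.5%" statement follows from two RMSE values.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Square Error."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def smape(actual, predicted):
    """Symmetric MAPE in percent (assumed variant; see lead-in)."""
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = abs(a) + abs(p)
        if denom:  # a term with a zero denominator contributes nothing
            total += 2 * abs(a - p) / denom
    return 100 * total / len(actual)


def relative_gap_percent(value, reference):
    """How far `value` lies above `reference`, as a percentage."""
    return 100 * (value - reference) / reference


# RMSE values reported for the Alibaba trace: 0.335 (20% labels) vs 0.330.
gap = relative_gap_percent(0.335, 0.330)
print(f"RMSE gap to full supervision: {gap:.2f}%")
```

With the two Alibaba RMSE figures from the abstract, the gap evaluates to roughly 1.5%, matching the lower end of the stated range.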
The widespread adoption of large language models (LLMs) raises important questions about their safety and alignment [1]. Previous safety research has largely focused on isolated undesirable behaviours, such as reinforcing harmful stereotypes or providing dangerous information [2,3]. Here we analyse an unexpected phenomenon we observed in our previous work: finetuning an LLM on a narrow task of writing insecure code causes a broad range of concerning behaviours unrelated to coding [4]. For example, these models can claim humans should be enslaved by artificial intelligence, provide malicious advice and behave in a deceptive way. We refer to this phenomenon as emergent misalignment. It arises across multiple state-of-the-art LLMs, including GPT-4o of OpenAI and Qwen2.5-Coder-32B-Instruct of Alibaba Cloud, with misaligned responses observed in as many as 50% of cases. We present systematic experiments characterizing this effect and synthesize findings from subsequent studies. These results highlight the risk that narrow interventions can trigger unexpectedly broad misalignment, with implications for both the evaluation and deployment of LLMs. Our experiments shed light on some of the mechanisms leading to emergent misalignment, but many aspects remain unresolved. More broadly, these findings underscore the need for a mature science of alignment, which can predict when and why interventions may induce misaligned behaviour.
Passive smartphone sensing shows promise for suicide prevention, but behavioral metadata (GPS, screen time, and accelerometry) often lacks the contextual information needed to detect acute psychological distress. Analyzing what people actually see, read, and type on their phones, rather than just usage patterns, may provide more proximal signals of risk. This study aimed to test whether vision-language models (VLMs) applied to passively captured smartphone screenshots can predict momentary suicidal ideation (SI). Seventy-nine adults with past-month suicidal thoughts or behaviors completed ecological momentary assessments (EMA) over 28 days while screenshots were captured every 5 seconds during active phone use. We fine-tuned open-source VLMs (Qwen2.5-VL [Alibaba Cloud], LFM2-VL [Liquid AI]) and text-only models (Qwen3 [Alibaba Cloud]) to predict SI from screenshots captured in the 2 hours preceding each EMA. We evaluated performance with temporal and subject holdouts. The analytic sample comprised 2.5 million screenshots from 70 participants. Temporal holdout models achieved strong discrimination at the EMA level (AUC=0.83; AUPRC=0.77), with image-based models outperforming text-only models (AUC=0.83 vs 0.79; 95% CI 0.003-0.07). Subject holdout generalization was near chance (AUC≈0.50), though a simple lexical screening method retained modest discrimination (AUC=0.62). Smaller models performed comparably to larger models, supporting feasible on-device deployment. Screen content predicts short-term SI with clinically meaningful accuracy when models are personalized but does not generalize across individuals. These findings support a 2-stage clinical architecture: coarse lexical screening for new patients, followed by personalized VLM-based monitoring after a calibration period. On-device inference may enable privacy-preserving deployment.
Lymph node (LN) assessment is an essential task in the routine radiology workflow, providing valuable insights for cancer staging and treatment planning. Identifying scattered, low-contrast LNs in 3D CT scans is highly challenging, even for experienced clinicians. Previous lesion and LN detection methods demonstrate the effectiveness of 2.5D approaches (i.e., using a 2D backbone with multi-slice inputs), leveraging pretrained 2D model weights and showing improved accuracy compared with separate 2D or 3D detectors. However, slice-based 2.5D detectors do not explicitly model inter-slice consistency for an LN as a 3D object, requiring heuristic post-merging steps to generate final 3D LN instances, which can involve tuning a set of parameters for each dataset. In this work, we formulate 3D LN detection as a slice-by-slice tracking task along the z-axis and propose LN-Tracker, a novel LN tracking transformer, for joint end-to-end detection and 3D instance association. Built upon a DETR-based detector, LN-Tracker decouples transformer queries into distinct track and detection groups with independent matching, enabling comprehensive LN detection while maintaining trajectory consistency. A masked attention mechanism further separates learning between these query groups, and a similarity loss promotes robust inter-slice LN association, particularly in low-contrast scenarios. Extensive evaluation on four LN datasets shows LN-Tracker's superior performance, with at least a 2.49% gain in average sensitivity compared with top 3D/2.5D/tracking detectors. Further validation on public lung nodule and prostate tumor detection tasks confirms the generalizability of LN-Tracker, as it achieves top performance on both tasks. Code is available at https://github.com/alibaba-damo-academy/LN-Tracker.
Pediatric heart disease (PHD), including congenital heart defects, is often incompletely captured in electronic health records, particularly when clinical significance must be inferred from unstructured echocardiogram reports. Automated methods capable of extracting clinically meaningful PHD from narrative reports could improve clinical decision support and research applications. The aim of the study is to evaluate the feasibility of using supervised fine-tuning of large language models (LLMs), with and without chain-of-thought (CoT) reasoning, to characterize patients with clinically significant or historical PHD from unstructured echocardiogram reports. We developed a PHD detection algorithm using fine-tuned open-source LLMs, including LLaMA (Meta) and Qwen (Alibaba), to analyze 9749 echocardiogram reports. A subset of 712 reports was adjudicated by 2 pediatric cardiac anesthesiologists, classifying 506 (71.1%) as clinically significant PHD and 206 (28.9%) as not significant. While DeepSeek R1 has shown improved performance with CoT reasoning, its application in medical contexts is underexplored. We incorporated R1-generated CoT into model prompts and fine-tuned backbone LLMs. The fine-tuned Qwen-7B-10k-overthink-CoT achieved the highest accuracy (92.4%), outperforming Qwen-7B-without-CoT (90%), LLaMA-3B-without-CoT (87.9%), Qwen-3B-without-CoT (85.6%), Qwen-3B-10k-overthink-CoT (68.5%), and LLaMA-3B-10k-overthink-CoT (46.2%). In external validation on a second dataset (n=113; 64 positive, 49 negative), Qwen-7B-10k-overthink-CoT sustained strong, balanced performance (82.7%), compared with Qwen-7B-without-CoT (88.4%), LLaMA-3B-without-CoT (86.8%), Qwen-3B-without-CoT (84.5%), Qwen-3B-10k-overthink-CoT (58.9%), and LLaMA-3B-10k-overthink-CoT (46.2%). The fine-tuned Qwen-7B model with overthinking CoT (10,000 tokens) achieved the highest internal accuracy (92.4%), with balanced sensitivity and specificity.
Across repeated runs, CoT-enhanced models demonstrated improved classification consistency compared to non-CoT models (Qwen-7B-without-CoT: 90%, LLaMA-3B-without-CoT: 87.9%, Qwen-3B-without-CoT: 85.6%). In external validation (n=113), non-CoT variants achieved higher accuracy (up to 88.4%), whereas the Qwen-7B CoT model demonstrated more balanced class performance (accuracy=82.7%). Supervised fine-tuning of LLMs with CoT offers an effective approach for automated PHD detection within unstructured data in the electronic medical record. While CoT-enhanced models demonstrated improved internal performance and more balanced classification, they did not consistently achieve higher accuracy in external validation, highlighting trade-offs between accuracy and class balance. These findings highlight the promise of LLM-based approaches for clinical text phenotyping while underscoring the need for larger, multicenter validation and careful calibration for real-world deployment. Continued validation and integration into the electronic medical record are essential for real-world, artificial intelligence-driven clinical decision support.
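The trade-off between raw accuracy and class balance noted above can be made concrete with a small calculation. The sketch below is illustrative only: the confusion-matrix counts are toy values invented for the example (80 positives, 20 negatives) and are not taken from the study, which reports accuracy but not per-model sensitivity or specificity in this abstract.

```python
def classification_summary(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity and balanced accuracy from counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# TOY counts (hypothetical, not study data).
# Model A leans toward the majority class; model B is more even-handed.
model_a = classification_summary(tp=78, fn=2, tn=8, fp=12)
model_b = classification_summary(tp=66, fn=14, tn=16, fp=4)

for name, summary in (("A", model_a), ("B", model_b)):
    print(name, {key: round(value, 3) for key, value in summary.items()})
```

In this toy case model A has the higher accuracy while model B has the higher balanced accuracy, which is the kind of pattern that can make a lower-accuracy model preferable when both classes matter clinically.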
Multivariate workload prediction in cloud computing environments is a critical research problem. Effectively capturing inter-variable correlations and temporal patterns in multivariate time series is key to addressing this challenge. This paper proposes a convolutional model based on a Nonlinear Spiking Neural P System (ConvNSNP), which enhances the ability to process nonlinear data compared to conventional convolutional models. Building upon this, a hybrid forecasting model is developed by integrating ConvNSNP with a Bidirectional Long Short-Term Memory (BiLSTM) network. ConvNSNP is first employed to extract temporal and cross-variable dependencies from the multivariate time series, followed by BiLSTM to further strengthen long-term temporal modeling. Comprehensive experiments are conducted on three public cloud workload traces from Alibaba and Google. The proposed model is compared with a range of established deep learning approaches, including CNN, RNN, LSTM, and TCN, and hybrid models such as LSTNet, CNN-GRU, and CNN-LSTM. Experimental results demonstrate that the proposed model achieves up to 9.9% improvement in RMSE and 11.6% improvement in MAE compared with the most effective baseline methods. The model also achieves favorable performance in terms of MAPE, further validating its effectiveness in multivariate workload prediction.
Large Language Models (LLMs) are increasingly applied in healthcare and are expected to play an active role in clinical practice. However, their effectiveness for clinical note summarization remains underexplored, and systematic comparisons across different models are lacking. This study addresses this gap by benchmarking 16 generative LLMs from major providers, including OpenAI (GPT), DeepSeek, Meta (LLaMA), Google (Gemma), Mistral (Mixtral), and Alibaba (Qwen), using the MIMIC-IV-Note dataset. Both extractive and abstractive summarization approaches were implemented and evaluated with multiple lexical and semantic metrics, including ROUGE, BLEU, METEOR, COMET, and BERTScore. In addition, processing time, cost, and deployment feasibility were assessed to provide a practical perspective for clinical adoption. The results show that Gemma-3-27B achieved the strongest overall performance in extractive summarization. For abstractive summarization, DeepSeek-R1-70B, Qwen-3-32B, and GPT-4o emerged as leading models. Their relative strengths varied depending on whether lexical overlap, semantic adequacy, or fluency was prioritized. Importantly, larger parameter sizes did not always translate into better outcomes, as smaller models such as LLaMa-3-8B and Gemma-2-9B often produced competitive results with faster runtimes and lower computational costs. This study highlights the trade-offs between performance, efficiency, and deployment context, offering practical insights into model selection for clinical note summarization and informing future integration of LLMs into healthcare workflows.
As human spaceflight expands beyond low Earth orbit, the ability to deliver advanced surgical care in space becomes critical. Current medical provisions on board the International Space Station (ISS) are geared towards treating low-risk conditions, with a 'stabilize-and-evacuate' principle for more complex cases, an approach that is not viable for extended missions to the Moon and Mars. This review summarizes research conducted around space surgery, with a particular focus on surgical robotics. Experiments in parabolic flight and analogue environments demonstrate that, provided the operator, patient, and instruments are restrained, surgical skill is largely unaffected by reduced gravity. Robotic surgery has primarily been explored in remote undersea habitats and in limited flight studies. There are several challenges to the implementation of surgical systems in space, including size, weight, and power constraints, communication latency, and crew training. Means of fluid and debris containment, provision of anaesthesia, and postoperative recovery in altered physiology must also be considered. The key features of an ideal space surgery robotic set-up are outlined. It should be compact, multifunctional, adaptable, reliable, and optimized in technical design and material composition for use in habitable volumes. Such systems should incorporate artificial intelligence (AI)-driven decision-making support, variable autonomy, and human-in-the-loop control. Crew members must be trained and supported to deliver and recover from surgical care in space. Cloud and edge computing will mitigate latency while expanding on-board data processing capabilities. Although not yet operationally mature, robotic surgery is a critical capability for future exploratory space missions and requires continued multidisciplinary development.
Understanding the impact that subtle variations (missense mutation, environmental change, ion chelation, ligand binding, etc.) have on protein structure helps to reveal their biological effects, but remains extremely challenging due to the difficulty in measuring and locating the changes in protein structure. Herein, a method entitled MELO is constructed, which enables a systematic measurement based on residues' geometric characteristics and relative distance, and a high-throughput location of structural change based on secondary structure variation and protein segment shift. Our method performs best in capturing structural changes of various magnitudes (some increases were >30%) and is capable of precisely locating the regions of alteration in critical case studies. Moreover, it identifies over 10,000 structural changes induced by subtle variation that existing methods fail to detect. An online server allows users to upload their structures for comparison, and all structural changes identified in this study have also been made available for download.
Question routing (QR) aims to route questions to answerers who are likely to provide high-quality answers. Though existing QR methods have achieved promising results, they still face two key challenges that have not yet been well addressed: 1) user access temporal preference (i.e., user preference for the time of accessing community question answering websites) has not been well captured and utilized and 2) asker acceptance temporal preference (i.e., asker preference for an answer's submission time) is neglected. Given this, we introduce a novel deep neural network model named TQR, which applies temporal preference information for effective question routing. To address the first challenge, we design an access temporal preference encoder in TQR that models the access temporal preferences of users based on their periodic and evolving patterns of access time. To address the second challenge, an acceptance temporal preference encoder is proposed in TQR, which learns long- and short-term acceptance temporal preferences of askers. Then, an answerer's representation is computed based on the learned user access temporal preferences and asker acceptance preferences. Finally, a question is routed to those answerers whose representations better match the question representation. To the best of our knowledge, this is the first attempt to model the acceptance temporal preferences of askers to optimize the QR task. Extensive experiments are conducted on six public datasets, and the experimental results show that the proposed TQR model achieves an average improvement of 7.22% in MRR compared to the best baselines.
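The headline metric above, mean reciprocal rank (MRR), can be sketched in a few lines. This is a generic illustration rather than the authors' evaluation code. It assumes that each question has exactly one ground-truth answerer (for instance, the one whose answer was accepted), which the abstract does not state explicitly; the answerer identifiers are made up.

```python
def mean_reciprocal_rank(rankings, targets):
    """Average of 1/rank of the target answerer in each ranked list.

    A question whose target does not appear in its ranking contributes 0.
    """
    if len(rankings) != len(targets) or not rankings:
        raise ValueError("need one target per non-empty list of rankings")
    total = 0.0
    for ranking, target in zip(rankings, targets):
        for position, candidate in enumerate(ranking, start=1):
            if candidate == target:
                total += 1.0 / position
                break
    return total / len(rankings)


# Hypothetical routing output for three questions (IDs are invented).
rankings = [
    ["u1", "u2", "u3"],  # target u1 is ranked first  -> 1
    ["u2", "u1", "u3"],  # target u1 is ranked second -> 1/2
    ["u3", "u2", "u1"],  # target u9 is not listed    -> 0
]
targets = ["u1", "u1", "u9"]

print(mean_reciprocal_rank(rankings, targets))
```

Because the reciprocal rank drops quickly (1, 1/2, 1/3, ...), MRR mostly rewards putting the right answerer in the first one or two positions.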
We study resilience in collective behaviors of "next-generation reservoir computers" in terms of transmitted signal distortion. Specifically, we introduce an interactive communication scheme and achieve synchronization between two "next-generation reservoir computer" oscillators. A dynamical transition from synchronization to desynchronization emerges with the growth of signal distortion. Remarkably, we show that the order of clique has no significant effect on the robustness of synchronization. The effectiveness of our proposed scheme is illustrated via the classical dynamical models and qualitative analysis. Our work reveals the function of transmitted signal distortion in shaping collective behaviors of machine learning oscillators.
Based on the theory of rational expectations equilibrium, this study examines the optimal decision-making of a fresh retailer under strategic customer behavior. The paper focuses on analyzing the impact of perceived customer value on the retailer's investment in freshness preservation and on inventory configuration, as well as how market size moderates this influence. The findings reveal that, in the face of strategic customer behavior and an increased perception of value for the freshness of products nearing their sell-by date, the retailer tends to lower retail prices, which subsequently affects its enthusiasm for investment in freshness preservation efforts and inventory configuration. The influence of perceived customer value on preservation effort and inventory volume is differential and significantly moderated by market size. When the market size is large, the retailer is inclined to implement optimal preservation measures, but inventory volume may be negatively impacted. With a medium market size, the retailer reduces preservation efforts and inventory levels. However, in situations where perceived customer value is low, the retailer increases inventory volumes. When the market size is small, the impact of strategic customer behavior and perceived customer value on preservation efforts and inventory volumes is more pronounced.
Background Accurate preoperative identification of pathologic extranodal extension (ENE) at CT is essential for precise treatment decisions in laryngeal and hypopharyngeal squamous cell cancer (LHSCC). However, human interpretation of ENE is neither reliable nor reproducible. Purpose To develop and evaluate the diagnostic performance of a new deep learning tool, DeepENE, in detecting metastatic and ENE lymph nodes on preoperative CT scans in patients with LHSCC in a multicenter cohort. Materials and Methods In this retrospective study, patients with LHSCC from Zhongshan Hospital, Fudan University (April 2011-August 2022), were included in training, validation, and internal test sets to develop DeepENE. For the reference standard, lymph nodes were segmented on CT scans and labeled for metastasis and ENE status based on pathologic findings. DeepENE was tested using three external cohorts of patients with LHSCC (external test sets 1-3) and one external cohort of patients with oral squamous cell carcinoma. The primary diagnostic metric was the area under the receiver operating characteristic curve (AUC). The performance of DeepENE was compared with that of five board-certified head and neck cancer specialists using the DeLong method. Results Overall, 289 patients with LHSCC with 1954 pathologically confirmed lymph nodes were evaluated. DeepENE achieved an AUC of 0.93 for ENE diagnosis in the internal test set under fivefold cross-validation, and AUCs of 0.96, 0.87, and 0.90 in external test sets 1, 2, and 3, respectively. DeepENE outperformed the five experts, especially in early-stage ENE detection in external test set 2 (AUC of 0.87 for DeepENE vs mean AUC of 0.66 for readers; P < .001). In external test set 1, DeepENE maintained a high sensitivity of 97% at specificity of 90%, compared with experts' mean sensitivity of 77% (P = .003). 
In external test sets 2 and 3, DeepENE had sensitivity of 78% and 80%, compared with experts' mean sensitivity of 36% (P < .001) and 46% (P < .001), respectively. Conclusion DeepENE accurately detected ENE on preoperative CT scans in patients with LHSCC and outperformed head and neck cancer specialists. © RSNA, 2026 Supplemental material is available for this article.
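Reporting sensitivity at a fixed specificity, as in the results above, amounts to choosing a score threshold from the negative cases and then reading off the true-positive rate. The sketch below illustrates that procedure on invented scores; it is not the DeepENE evaluation code, and the rule "predict positive when the score is strictly above the threshold" is an assumption of this example.

```python
def sensitivity_at_specificity(scores, labels, target_specificity=0.90):
    """Return (threshold, sensitivity, specificity) for the lowest threshold
    whose specificity reaches the target. Positive means score > threshold."""
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not negatives or not positives:
        raise ValueError("need at least one positive and one negative case")
    # Specificity never decreases as the threshold rises, so the first
    # threshold that meets the target also gives the best sensitivity.
    for threshold in sorted(set(scores)):
        specificity = sum(s <= threshold for s in negatives) / len(negatives)
        if specificity >= target_specificity:
            sensitivity = sum(s > threshold for s in positives) / len(positives)
            return threshold, sensitivity, specificity
    raise ValueError("target specificity must be at most 1.0")


# Invented scores: 10 nodes without ENE (label 0) and 5 with ENE (label 1).
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.1, 0.2, 0.3, 0.4, 0.9,
          0.6, 0.7, 0.8, 0.95, 0.45]
labels = [0] * 10 + [1] * 5

threshold, sensitivity, specificity = sensitivity_at_specificity(scores, labels)
print(threshold, sensitivity, specificity)
```

Fixing specificity first makes sensitivities comparable between a model and human readers only if the readers operate at a similar specificity, which is worth checking when interpreting such comparisons.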
The global rise in steatotic liver disease poses a significant public health challenge. While non-contrast computed tomography scans hold promise for opportunistic detection of steatotic liver disease, their potential for staging and risk assessment remains underexplored. Here we present a multimodal AI model trained on a large dataset comprising histopathologically confirmed (n=968) and radiologically confirmed (n=1103) cases and validated against both histology (n=660) and MRI-PDFF (n=375) gold standards, demonstrating high accuracy in detecting mild to severe steatosis (AUC: 0.904-0.929) and clinically significant fibrosis (AUC: 0.824-0.888). Furthermore, integrating the model into the standard clinical pathway improves primary risk screening in a retrospective patient cohort (n=1192), identifying 36% more patients at risk of fibrosis progression. Using a Cox proportional hazards model, we observe that the intermediate-high risk patients identified by the optimized clinical pathway exhibit a significantly higher incidence of cirrhosis (hazard ratio: 5.54 [2.69-11.42]), showcasing the model's potential for early detection and management of steatotic liver disease.
Purpose To develop and validate a deep neural network that simultaneously segments brain tumors and anatomic structures, regardless of the contrast and resolution of the input scans, and can effortlessly adapt to unseen modalities. Materials and Methods The authors included various MRI scans from patients with and without brain tumors from four different datasets. Patient data were divided into a training set and a test set. The authors' method, TumorSynth, combines a Bayesian generative model and a deep learning segmentation model. The generative model creates paired synthetic labels and images with simulated tumors and brain tissues, providing a rich dataset for training the segmentation model. The authors quantitatively compared its performance with that of other widely used methods by calculating Dice similarity coefficients (DSCs). Results A total of 1971 patients with and without tumors were included in the study (training set, n = 351 patients; test set, n = 1620 patients). The median DSCs for segmentation (authors' method vs reference standard) were 0.89 (IQR, 0.83-0.95; P < .001) for the unaffected brain volume and 0.89 (IQR, 0.84-0.94; P < .001) for the tumor region. There were no differences in parcellation performance when an MRI sequence was missing (P = .07). In cross-modality validation, the authors' method achieved DSC values of 0.88 for apparent diffusion coefficient, 0.85 for diffusion-weighted imaging, 0.80 for susceptibility-weighted imaging, and 0.79 for fractional anisotropy images. The authors observed a 4% false-positive rate when processing tumor-free MR images. Conclusion The authors developed a deep neural network for brain tumor and tissue segmentation, validated its performance across standard structural MRI sequences, and determined its generalizability to unseen data. Keywords: Segmentation, Neuro-Oncology, CNS, Deep Learning, Neurosurgery Supplemental material is available online for this article. © RSNA, 2026.
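The Dice similarity coefficient (DSC) used above compares a predicted mask with a reference mask as twice their overlap divided by their combined size. The following is a minimal sketch on flattened binary masks; returning 1.0 when both masks are empty is a convention assumed here, not something the abstract specifies.

```python
def dice_coefficient(predicted, reference):
    """Dice similarity coefficient for two flat binary masks (0/1 values)."""
    if len(predicted) != len(reference):
        raise ValueError("masks must have the same number of voxels")
    overlap = sum(1 for p, r in zip(predicted, reference) if p and r)
    size_sum = sum(1 for p in predicted if p) + sum(1 for r in reference if r)
    if size_sum == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2 * overlap / size_sum


# Toy 2x4 masks flattened row by row (values are illustrative only).
predicted = [1, 1, 1, 0,
             0, 1, 0, 0]
reference = [1, 1, 0, 0,
             0, 1, 1, 0]

print(dice_coefficient(predicted, reference))
```

Here three voxels overlap and each mask contains four voxels, so the DSC is 2 x 3 / (4 + 4) = 0.75.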
To address the path-planning challenge for unmanned aerial vehicles (UAVs) in complex environments, this study presents an improved pelican optimization algorithm enhanced with multiple strategies (MIPOA). The proposed method introduces four main improvements: (1) using chaotic mapping to spread the initial search points more evenly, thereby increasing population variety; (2) incorporating a random Lévy-flight strategy to improve the exploration of the search space; (3) integrating a differential evolution approach based on Cauchy mutation to strengthen individual diversity and overall optimization ability; and (4) adopting an adaptive disturbance factor to speed up convergence and fine-tune solutions. To evaluate MIPOA, comparative tests were carried out against classical and modern intelligent algorithms using the CEC2017 and CEC2022 benchmark sets, along with a custom UAV environmental model. Results show that MIPOA converges faster and achieves more accurate solutions than the original pelican optimization algorithm (POA). On CEC2017 in 30-, 50-, and 100-dimensional cases, MIPOA attained the best average ranks of 1.57, 2.37, and 2.90, respectively, and achieved the top results on 26, 21, and 19 test functions, outperforming both POA and other advanced algorithms. For CEC2022 (20 dimensions), MIPOA obtained the best Friedman average rank of 1.42, demonstrating its effectiveness in complex UAV path-planning tasks. The method enables the generation of faster, shorter, safer, and collision-free flight paths for UAVs, underscoring the robustness and wide applicability of MIPOA in real-world UAV path-planning scenarios.
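The first improvement listed above, chaotic-map initialization, can be sketched as follows. The abstract does not say which chaotic map MIPOA uses, so the logistic map with parameter 4 is an assumption chosen for illustration, and the function name, seed, and bounds are hypothetical.

```python
def chaotic_population(pop_size, lower, upper, seed=0.37):
    """Initial population driven by the logistic map x <- 4 * x * (1 - x).

    ASSUMPTION: the logistic map stands in for the unspecified chaotic map.
    `lower` and `upper` hold the per-dimension search bounds.
    """
    if len(lower) != len(upper):
        raise ValueError("lower and upper must have the same length")
    if not 0.0 < seed < 1.0 or seed in (0.25, 0.5, 0.75):
        raise ValueError("seed must lie in (0, 1) and avoid 0.25, 0.5, 0.75")
    x = seed
    population = []
    for _ in range(pop_size):
        individual = []
        for low, high in zip(lower, upper):
            x = 4.0 * x * (1.0 - x)  # next value of the chaotic sequence
            individual.append(low + x * (high - low))  # scale into bounds
        population.append(individual)
    return population


# Hypothetical 3D search space, e.g. x, y and altitude of a waypoint.
lower_bounds = [0.0, 0.0, 10.0]
upper_bounds = [100.0, 100.0, 50.0]
population = chaotic_population(5, lower_bounds, upper_bounds)

for individual in population:
    print([round(value, 2) for value in individual])
```

The sequence is deterministic for a given seed, which makes runs reproducible while still producing a non-repeating set of starting points in place of pseudo-random draws.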
Colorectal cancer (CRC) is a leading cause of cancer deaths, with early screening vital to reduce mortality. While methods such as colonoscopy and computed tomography (CT) colonography are available, they face challenges such as bowel preparation, invasiveness, and low adherence. We aimed to develop COCA (COlorectal Cancer detection with AI), a novel, noninvasive, cost-effective, and scalable method for CRC screening using noncontrast CT scans. This retrospective, multicenter, and international study included 1321 CRC patients and 1357 normal controls from two centers to develop COCA. We enhanced the CRC detection capabilities of COCA by employing a joint lesion segmentation and classification architecture, optimized with mixed-supervised learning. For validation, we gathered abdominal and pelvic CT data from four external centers and chest CT data from four centers. A reader study involving 10 radiologists with varying levels of experience evaluated diagnostic performance on noncontrast CT first without COCA assistance and then with it. Additionally, we evaluated both the initial and iteratively improved versions of COCA in two real-world, multi-scenario cohorts comprising 27 433 consecutive patients. In a multicenter and international validation involving 2053 patients across six centers, COCA demonstrated an area under the curve ranging from 0.967 to 0.996 for CRC detection. COCA improved CRC detection sensitivity by 20.4% and specificity by 5.4% compared with radiologists. In the first real-world multi-scenario validation with 9014 consecutive patients, COCA achieved a sensitivity of 88.2% and specificity of 99.5% for CRC detection. In the second external real-world validation involving 18 419 consecutive patients, COCA maintained a sensitivity of 86.6% and specificity of 99.8%, with a positive predictive value of 63.4%. 
COCA demonstrated robust performance across various clinical scenarios, including physical exams, emergency departments, outpatient, and inpatient settings, effectively preventing missed CRC diagnoses. These findings suggest that COCA could serve as a potential tool for large-scale opportunistic CRC screening.
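The gap between the very high specificity and the more modest positive predictive value (PPV) reported above follows from Bayes' rule when the disease is rare. The sketch below shows the relationship and back-solves the prevalence implied by the second real-world cohort's rounded figures. It is a back-of-envelope illustration, not an analysis from the study, and because the inputs are rounded percentages the implied prevalence is only approximate.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV from Bayes' rule: true positives over all positive calls."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def implied_prevalence(sensitivity, specificity, ppv):
    """Prevalence that yields the given PPV (the PPV formula solved for it)."""
    false_pos_rate = 1.0 - specificity
    return ppv * false_pos_rate / (
        sensitivity * (1.0 - ppv) + ppv * false_pos_rate
    )


# Rounded values reported for the second real-world validation cohort.
sens, spec, ppv = 0.866, 0.998, 0.634

prevalence = implied_prevalence(sens, spec, ppv)
print(f"implied prevalence: {prevalence:.2%}")
print(f"PPV check: {positive_predictive_value(sens, spec, prevalence):.3f}")
```

With a prevalence well under 1%, even a 0.2% false-positive rate generates a number of false alarms comparable to the number of true detections, which is why PPV stays far below specificity in screening settings.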
Spatial transcriptomics provides high-dimensional gene expression data while preserving spatial context, offering novel insights into tissue composition and heterogeneity. Each spot or cell in the spatial transcriptome can be represented as a set of gene modules influenced by its surrounding microenvironment, with module interactions vital for tissue architecture and function. Here, we present Scalable Niche Guided Module Discovery (SIGMOD), a method that integrates previously constructed microenvironment information with gene expression decompositions to uncover gene modules, enabling a deeper understanding of crosstalk within the microenvironment. SIGMOD identifies cell-type-specific and cell-state-specific, clinically relevant gene modules and uncovers gene module-module interactions in 10X ST, Visium, Xenium, and CosMX data, demonstrating its effectiveness and broad applicability.