Large language models (LLMs) show great potential for clinical decision-making, yet most applications remain narrow, task-specific chat tools rather than systems integrated into clinical workflows [1,2]. Building physician copilots will require models that operate within the electronic health record (EHR), with governed access to patient data and the ability to initiate permitted EHR actions within defined safety constraints. However, it remains unproven whether such a system can manage patient cases with physician-level performance. Here we show that MIRA (Medical Intelligence for Reasoning and Action), an autonomous artificial intelligence agent operating in a sandboxed EHR environment, can navigate a large clinical action space to obtain patient histories; order and interpret laboratory, imaging and microbiology tests; generate differential diagnoses; and formulate treatment plans such as prescribing medications, scheduling surgical procedures and planning admissions. In simulations on real patient cases spanning multiple diagnoses, MIRA outperformed physicians in diagnostic accuracy and made guideline-concordant, medication-safe and appropriate admission decisions. Compared with previous LLM applications that addressed isolated subtasks or provided free-text advice, these results suggest that an EHR-integrated artificial intelligence agent can turn clinical intent into structured, actionable EHR operations, possibly making it a more effective decision-support partner for physicians. Further work is needed to establish generalization, safety and governance through prospective, real-world studies.
This study compares the answers given by Google Gemini and ChatGPT-4.5 to the 12 most frequently asked questions about breast augmentation surgery. Our research highlights the growing importance and differential performance of artificial intelligence models in informing patients before surgery. The 12 most asked questions about breast augmentation, ranked by user engagement metrics, were obtained from the Realself website and submitted to both Google Gemini and ChatGPT-4.5. The responses from both platforms were evaluated by ten plastic surgeons with European Board of Plastic Reconstructive and Aesthetic Surgery (EBOPRAS) certification using the Global Quality Score (GQS) scale. The surgeons reached a consensus on how to apply the GQS before the evaluation and were blinded to the source of the answers. Mean scores were compared using the Wilcoxon signed-rank test. The mean GQS was 2.842 for Google Gemini responses and 3.867 for ChatGPT-4.5 responses, and the ChatGPT-4.5 responses were statistically superior to those of Google Gemini (p = 0.003). While there is emerging research on AI in plastic surgery, studies specifically comparing ChatGPT and Gemini for breast augmentation patient education using a blinded evaluation by board-certified surgeons remain limited. We suggest that AI-powered chatbots offer significant advantages for patient education but should be used cautiously. While ethical concerns persist, this study underscores the practicality of ChatGPT in informing patients about plastic surgery procedures, emphasizing the need for careful usage and collaboration to optimize benefits while minimizing risks. This journal requires that authors assign a level of evidence to each article.
For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors at www.springer.com/00266.
Response evaluation in pleural mesothelioma is challenging because its crescent growth pattern is poorly captured by diameter-based criteria. We aimed to develop and validate artificial intelligence (AI)-assisted volumetric response criteria (ARTIMES) based on automated tumour segmentation and biologically derived thresholds. In this retrospective, multicentre study, we included 10 926 CT scans from 2080 patients from 14 cohorts. A subset totalling 1176 CT scans from routine care (Netherlands Cancer Institute-Antoni van Leeuwenhoek Hospital) and trial cohorts (INITIATE, NivoMes, PEMMELA, LUME-MESO, NVALT19, and MiST1 trials) was annotated by 12 radiologists and 1 pulmonologist, supplemented by 100 negative CT scans, to train a deep-learning segmentation model. Internal testing included 98 CT scans from independent international hospitals in LUME-MESO. External testing included data from the MEDUSA cohort (101 CT scans with radiologist-corrected segmentations) and two fully independent manual segmentation datasets from SAKK17/18 (22 CT scans) and the University of Chicago (15 CT scans). AI segmentations were evaluated through Dice similarity coefficient (DSC) and normalised surface distance (NSD) at 3 mm. Progressive disease thresholds were derived using data from patients with multiple CT scans before first-line therapy or receiving only supportive care after first-line treatment (611 CT scans), and partial response thresholds from inter-reader variability (derived from 451 CT scans). ARTIMES was validated using data from eight clinical trials (4674 CT scans; 943 patients) and compared with modified Response Evaluation Criteria in Solid Tumors (mRECIST) using time-varying Cox proportional hazards models and trial-level surrogate endpoint analysis against overall survival using R² and surrogate threshold effect. DSC was 94-95% in internal testing and 71-80% with manual segmentations. NSD was 98% and 81-93%, respectively.
ARTIMES demonstrated superior patient-level prognostic performance compared with mRECIST (concordance index 0·83 [95% CI 0·79-0·87] vs 0·73 [0·66-0·80]; p=0·023) and detected progression a median of 5 weeks earlier (124 days [95% CI 115-126] vs 162 days [138-167]; p<0·0001). At the trial level, ARTIMES-based progression-free survival showed stronger correlation with overall survival (R² 88% [95% CI 42-100]) than did mRECIST-based progression-free survival (R² 6% [0-97]) and demonstrated a surrogate threshold effect at a progression-free survival hazard ratio of less than 0·82; no threshold was observed for mRECIST. Baseline AI-derived tumour volume independently predicted overall survival and outperformed T stage and WHO performance status. ARTIMES-based progression-free survival improves prognostic stratification and shows better trial-level surrogacy for overall survival compared with mRECIST-based progression-free survival. Pending prospective validation, ARTIMES could potentially facilitate a more reliable response evaluation in pleural mesothelioma. This study was funded by the Asbestos-Related Disease Section (SAGA) of the Dutch Society of Pulmonology and Tuberculosis (NVALT), the Dutch Cancer Society, and the Dutch Ministry of Health, Welfare and Sport.
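The Dice similarity coefficient used above to score the AI segmentations can be illustrated with a minimal sketch. The function name and the toy masks are hypothetical and are not taken from the study; real CT masks would be 3-D arrays, flattened here to short 0/1 lists for readability.

```python
def dice_coefficient(pred, ref):
    """Dice similarity coefficient between two binary masks (flat 0/1 sequences)."""
    if len(pred) != len(ref):
        raise ValueError("masks must contain the same number of voxels")
    # Voxels labelled as tumour in both masks
    intersection = sum(1 for p, r in zip(pred, ref) if p and r)
    # Total number of tumour voxels across the two masks
    total = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    if total == 0:
        # Both masks empty: treated as perfect agreement (a convention assumed here)
        return 1.0
    return 2.0 * intersection / total


# Toy example with hypothetical voxels, not study data
ai_mask = [0, 1, 1, 1, 0, 0, 1, 0]
manual_mask = [0, 1, 1, 0, 0, 1, 1, 0]
print(dice_coefficient(ai_mask, manual_mask))  # 3 shared voxels, 4 + 4 in total -> 0.75
```

A DSC of 1 means the two masks coincide exactly and 0 means they share no voxels, so the reported 71-80% against manual segmentations indicates substantial but imperfect overlap.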
Disruptive technologies can reconfigure innovation trajectories and create new market opportunities, yet their early detection remains difficult because disruptive impact is uncertain and often becomes visible only years after invention. Building on the CD disruptiveness measure derived from citation-network dynamics, this study develops an interpretable machine-learning framework to screen and prioritize potentially disruptive patent candidates ex ante. Using IncoPat patent records in the artificial intelligence domain, we construct a multidimensional indicator system spanning technological, market, and legal signals. To ensure conceptual consistency with the CD framework, the main disruption label is defined as patents falling within the top 5% of the empirical CD distribution among patents with at least one backward citation. This definition allows disruptiveness to be assessed relative to identifiable prior art and reduces reliance on extreme CD values that may be sensitive to incomplete reference information. We train models on a time-split learning set covering patents filed from 2007 to 2021 and evaluate their ability to predict this stricter disruption label using early observable patent metadata. Among benchmarked classifiers, AdaBoost provides the most competitive screening performance after feature selection, reducing the feature space from 17 to 12 indicators. Under the main top-5% specification, the model achieves an accuracy of 0.914, a precision of 0.271, a recall of 0.431, and an F1 score of 0.333, indicating modest but nontrivial early-screening ability. Feature importance analysis highlights the predictive relevance of citation- and disclosure-related signals, including family citation activity, backward citations, and document length. 
The framework is best interpreted as a scalable candidate-generation tool for monitoring and expert review within the patented segment of AI innovation, rather than as a definitive classifier of realized disruptive impact.
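As a quick consistency check on the reported screening metrics, the F1 score can be recomputed from the stated precision and recall. This is a minimal sketch; the function name is illustrative and only the two input values come from the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the top-5% specification
f1 = f1_score(0.271, 0.431)
print(round(f1, 3))  # 0.333, matching the reported F1 score
```

Because the disruption label covers only a small share of patents, accuracy alone says little here: a model that flags nothing would also score high accuracy, which is why precision, recall and F1 are the more informative figures for a screening tool.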
This study examines the pivotal role of artificial intelligence (AI) in advancing the urban green transition (UGT), with a particular focus on the Yangtze River Delta (YRD) Urban Agglomeration. Drawing on panel data from 40 cities in the YRD between 2012 and 2024, the research utilizes benchmark regression, non-linear effect analysis, and spatial econometric models to investigate how AI influences UGT through production, consumption, and agglomeration channels. The main findings are as follows: (1) AI exerts a significant positive effect on UGT. (2) UGT exhibits strong spatial autocorrelation, whereas AI's impact is characterized by robust local promotion but limited spatial spillover. (3) The influence of AI displays marked heterogeneity, varying by urban hierarchy, geographic location, resource endowment, and levels of institutional and market development. (4) The effects of AI on UGT are mediated by fixed capital stock, industrial structure upgrading, and consumption levels. Additionally, digital infrastructure, green consumption awareness, and digital industry agglomeration serve as moderating factors. AI's influence also exhibits nonlinear diminishing marginal effects at specific thresholds in productive services and agricultural agglomeration. These results highlight the multifaceted mechanisms through which AI drives sustainable urban development and offer policy implications for fostering green transformation in metropolitan regions.
Pulmonary hypertension (PH) is a severe and oftentimes fatal disease with a high degree of clinical variability. Its complexity necessitates a multifaceted approach, with clinical and prognostic results significantly improved with earlier identification and initiation of treatment. Developments in artificial intelligence (AI) and machine learning (ML) allow earlier diagnosis, individualized treatments, and new therapeutic options, which have already shown real potential to improve patient outcomes; such an intricate approach was not feasible before the AI innovations of the last decade. This article outlines the current applications of AI and ML for diagnostics, treatment optimization, surveillance, and drug discovery for PH.
The use of artificial intelligence (AI) in endoscopic studies has grown in recent years. The present study evaluates the performance of AI in detecting polyps and adenomas in daily clinical practice. A cross-sectional study was conducted, in which AI-assisted colonoscopies (AIACs) performed between January 2021 and May 2024 were reviewed. Logistic regression was used to analyze adenoma detection according to adenoma characteristics. A total of 1,251 colonoscopies were reviewed. The patients in the AIAC group were older than those in the control group (59 ± 13 vs. 56 ± 12 years, P < .05). There were no differences between groups in sex, procedure indication, bowel preparation, or procedure time. Regarding the primary aim, the AIAC group had a significantly higher polyp detection rate (58 vs. 52%; P < .05) and a non-significantly higher adenoma detection rate (39 vs. 33%; P > .05), compared with the control group. In the analysis of adenoma characteristics, the identification of polypoid adenomas (OR: 1.28; 95% CI: 1.04-1.59), adenomas smaller than 10 mm (OR: 1.41; 95% CI: 1.14-1.74), and adenomas located in the proximal colon (OR: 1.31; 95% CI: 1.05-1.65) was significantly higher in the AIAC group, compared with the control group. The use of AI in colonoscopies resulted in a non-significant increase in the adenoma detection rate but a significant increase in the detection of polypoid adenomas, adenomas smaller than 10 mm, and adenomas located in the proximal colon.
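The odds ratios with 95% confidence intervals quoted above can be read with the help of a small sketch. It computes an unadjusted odds ratio and a Wald interval from a 2x2 table, which corresponds to a logistic regression with a single binary predictor. The study's actual model specification is not given in the abstract, and the counts below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval from a 2x2 table.

    a: AI-assisted, finding present    b: AI-assisted, finding absent
    c: control, finding present        d: control, finding absent
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for the Wald interval")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, not taken from the study
or_value, ci_low, ci_high = odds_ratio_wald_ci(120, 480, 100, 500)
print(or_value, ci_low, ci_high)
```

An interval that excludes 1, as in the three adenoma subgroups reported above, is what marks the difference between groups as statistically significant at the 5% level.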
Water quality prediction and management are crucial for ensuring the sustainability of water supplies. Contaminated water can harm humans and aquatic life. As the demand for seafood grows, the aquaculture industry faces several obstacles, including disease management, feeding optimization, water quality monitoring, and aquaculture area extraction. Recently, aquaculture systems have increasingly used AI techniques to successfully and sustainably handle these issues. However, traditional AI techniques such as random forest (RF) and multi-layer perceptron (MLP), among others, frequently face data scarcity and poor physical consistency. This research bridges this gap by integrating physical sciences with AI algorithms through the solution of the two coupled pollution-aeration equations to generate a high-fidelity physics-derived dataset of 50,000 observations over an extended spatial domain ranging from 0 to 4. This dataset is then used to train a novel hybrid RF-MLP algorithm to identify fish-survival zones within a polluted river at a given time, while determining the minimum allowable water velocity and the upstream dissolved oxygen level required to maintain environmentally safe conditions along the entire river reach. The proposed algorithm employs a three-stage sequential residual learning logic, combining RF's stable feature partitioning with MLP's improved non-linear error correction. The algorithm's performance was benchmarked against nine standalone AI algorithms using a comprehensive suite of metrics. The experiments demonstrated exceptional precision with a Correlation Coefficient (CC) of 0.9999999973, a Scatter Index (SI) of 0.00007326, a Willmott's Index (WI) of 0.9999999986, a Test RMSE of 0.00012966, and a 0.9999999692. Beyond accuracy, the hybrid algorithm demonstrated superior computational efficiency, training in just 22.58 s, a 24.45-fold reduction compared to BiLSTM architectures.
These results provide a robust tool for decision-makers to identify optimal river reaches for fish farms based on minimum water velocity and permissible dissolved oxygen transfer levels, bridging the gap between theoretical physics and industrial aquaculture management.
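The accuracy metrics named above (CC, SI, WI and RMSE) can be sketched as follows. The abstract does not define them, so the formulas below are the commonly used definitions and should be read as assumptions: Pearson correlation for CC, RMSE divided by the observed mean for SI, and Willmott's index of agreement for WI. The sample values are hypothetical.

```python
import math


def evaluation_metrics(observed, predicted):
    """Return RMSE, correlation coefficient, scatter index and Willmott's index.

    Assumes at least two non-constant observations with a non-zero mean.
    """
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    sq_err = sum((p - o) ** 2 for o, p in zip(observed, predicted))
    rmse = math.sqrt(sq_err / n)
    # Pearson correlation coefficient (assumed meaning of CC)
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    cc = cov / math.sqrt(var_o * var_p)
    # Scatter index: RMSE normalised by the observed mean (assumed definition)
    si = rmse / mean_o
    # Willmott's index of agreement
    potential = sum((abs(p - mean_o) + abs(o - mean_o)) ** 2
                    for o, p in zip(observed, predicted))
    wi = 1 - sq_err / potential
    return {"rmse": rmse, "cc": cc, "si": si, "wi": wi}


# Hypothetical dissolved-oxygen values, not study data
observed = [6.0, 7.0, 8.0, 9.0]
predicted = [6.1, 6.9, 8.1, 8.9]
metrics = evaluation_metrics(observed, predicted)
print(metrics)
```

Values of CC and WI this close to 1, as reported, are plausible when the model is trained and tested on noise-free data generated from the same equations, which is worth keeping in mind when comparing against field measurements.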
AI-driven narrative intelligence systems are gaining importance, and attracting growing interest, for their ability to transmit intergenerational legacies, psychological continuity, and cultural identities. The StoryWeaver Lab is an AI-driven app that replicates an interactive living tradition within a family. Assessing such systems is complex: the evaluation information is logical but ambiguous, conflicting judgments can occur, and this uncertainty is difficult to model accurately. The proposed study addresses this problem through comparative analysis and the weighted aggregated sum product assessment (WASPAS) methodology. It uses a new decision-making (DM) model based on the Frank norm and Fermatean fuzzy Z-numbers (FFZNs). The effectiveness of the methodology is assessed with a multi-attribute group DM (MAGDM) approach over five interdependent criteria: narrative coherence, personal connection, ethical trustworthiness, flexibility, and long-term sustainability. Conventional MAGDM methods tend to deteriorate when data are inaccurate, partial, or reliability-sensitive. FFZNs address these limitations by describing membership and non-membership levels together with their associated reliability. The Frank norm improves aggregation behavior by allowing the criteria to interact, capturing both reinforcing and suppressing effects. The DM model extends the WASPAS method with the Frank norm and FFZNs and is evaluated for consistency and robustness. A numerical case study of several AI storytelling system alternatives shows that all methods identify the same best alternative. Nevertheless, the proposed FFZNs-Frank model is more interpretable, handles uncertainty more reliably, and loses less information.
The results support the validity and reliability of the proposed framework and show its advantages over traditional WASPAS. Methodologically, this research offers a useful decision-support system for assessing the emotional intelligence of AI systems and for facilitating ethical storytelling, sustainable cultural preservation in uncertain DM situations, and trustworthy human-AI relationships.
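For orientation, the traditional (crisp) WASPAS baseline that the proposed model is compared against can be sketched in a few lines. This is not the FFZN-Frank variant described above: it uses plain numbers instead of Fermatean fuzzy Z-numbers, treats all five criteria as benefit-type, and relies on hypothetical scores, weights and a balance parameter `lam` chosen purely for illustration.

```python
import math


def waspas_scores(matrix, weights, lam=0.5):
    """Crisp WASPAS scores for benefit-type criteria.

    matrix:  one row of criterion scores per alternative (all values > 0)
    weights: criterion weights summing to 1
    lam:     balance between the weighted sum and weighted product models
    """
    n_criteria = len(weights)
    col_max = [max(row[j] for row in matrix) for j in range(n_criteria)]
    scores = []
    for row in matrix:
        # Linear normalisation for benefit criteria: value / column maximum
        norm = [row[j] / col_max[j] for j in range(n_criteria)]
        weighted_sum = sum(w * x for w, x in zip(weights, norm))
        weighted_product = math.prod(x ** w for w, x in zip(weights, norm))
        scores.append(lam * weighted_sum + (1 - lam) * weighted_product)
    return scores


# Hypothetical ratings of three storytelling systems on the five criteria:
# coherence, personal connection, trustworthiness, flexibility, sustainability
ratings = [
    [7, 8, 6, 7, 5],
    [9, 9, 8, 8, 7],
    [6, 5, 9, 6, 8],
]
criterion_weights = [0.25, 0.20, 0.25, 0.15, 0.15]
scores = waspas_scores(ratings, criterion_weights)
best = scores.index(max(scores))
print(scores, best)
```

The fuzzy extension replaces the arithmetic sum and product with Frank-norm aggregation over FFZNs, so the ranking logic stays the same while the inputs carry membership, non-membership and reliability information.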
Despite promising results of artificial intelligence (AI) in prostate cancer (PCa) detection, its impact on biparametric MRI (bpMRI) interpretation remains uncertain, especially for readers with limited experience. To evaluate the effect of AI software assistance on prostate bpMRI interpretation by readers with different levels of prostate MRI experience. Retrospective. Six hundred and forty-six male patients, including 297 with PCa. 3.0 T; T2-weighted imaging using fast spin echo sequence, diffusion-weighted imaging using single-shot echo-planar imaging. Two experienced readers (8 and 10 years of prostate MRI experience) and two novice-level readers (2 years of general radiology experience; 20-50 prior prostate MRI cases) assessed all examinations twice, without and with AI software (uAI, United Imaging) assistance, in counterbalanced orders with a 4-week washout interval. Lesions were scored using Prostate Imaging Reporting and Data System (PI-RADS) v2.1 at ≥ 3 and ≥ 4 thresholds. Histopathology was the reference standard. The primary analysis defined cancer as International Society of Urological Pathology (ISUP) grade group ≥ 1 (Gleason score ≥ 6); sensitivity analysis defined clinically significant cancer as ISUP grade group ≥ 2. Generalized Estimating Equations were used for clustered data. Receiver operating characteristic (ROC) analysis with the Obuchowski-Rockette model was used to compare the area under the ROC curve (AUC). Cohen's κ assessed inter-reader agreement; two-sided p < 0.05 indicated significance. For ISUP ≥ 1, uAI increased novice-level/experienced-reader AUCs (0.684-0.744; 0.757-0.794). At PI-RADS ≥ 3, novice-level sensitivity/specificity significantly improved (0.71-0.79; 0.46-0.58). Experienced-reader sensitivity gains were nonsignificant (p = 0.344/0.291). For ISUP ≥ 2 at ≥ 3, all-reader sensitivity/specificity increased (0.76-0.82; 0.47-0.57). Novice-level κ increased at ≥ 3/≥ 4 (0.582-0.700; 0.654-0.741). 
uAI assistance improved diagnostic performance, with multi-metric improvements in novice-level readers. Stage 3. This study tested whether artificial intelligence software could help doctors read prostate MRI scans more consistently and accurately. The researchers studied 646 men with 730 lesions. Two experienced doctors and two doctors with limited prostate MRI experience reviewed each case without and with artificial intelligence support. The software improved several measures of cancer detection, especially for doctors with limited experience, and increased agreement between these doctors. Additional analyses showed that doctors rarely changed correct judgments to match incorrect artificial intelligence outputs, whereas incorrect judgments were more often corrected. These findings support artificial intelligence as a decision‐support tool for prostate MRI.
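The inter-reader agreement statistic used in this study, Cohen's κ, can be illustrated with a short sketch. The two rating lists are hypothetical binary calls (for example, PI-RADS ≥ 3 or not) and do not come from the study data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same cases."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("ratings must be non-empty and of equal length")
    # Observed proportion of cases on which the raters agree
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    # Agreement expected by chance from each rater's label frequencies
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if expected == 1:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical positive (1) / negative (0) calls by two readers on ten cases
reader_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
reader_2 = [1, 0, 0, 0, 1, 0, 1, 1, 1, 0]
kappa = cohens_kappa(reader_1, reader_2)
print(round(kappa, 3))  # observed agreement 0.8, chance agreement 0.5 -> 0.6
```

Because κ discounts agreement expected by chance, the reported rise in novice-level κ from 0.582 to 0.700 at the ≥ 3 threshold reflects a genuine gain in consistency and not just a shift in how often lesions were called positive.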
In this work, a new concept called the vector dissipation of randomness (VDR) is developed and formalized. It describes the mechanism by which complex multicomponent systems transition from chaos to order through the filtering of random directions, accumulation of information in the environment, and self-organization of agents. VDR explains how individual random strategies can evolve into collective goal-directed behavior, leading to the emergence of an ordered structure without centralized control. In this framework, paraintelligence is defined as a functional, nonreflexive mode of collective cognition in which a decentralized system produces rational-like outcomes without an individual conscious subject. To test the proposed model, a numerical simulation of the "ant-beetle" system was conducted, in which agents (ants) randomly choose movement directions, but through feedback mechanisms and weak strategies, they form a single coordinated vector of the beetle's movement. VDR is a universal mechanism applicable to a wide range of self-organizing systems, including biological populations, decentralized technological networks, sociological processes, and artificial intelligence algorithms.
Microscopy techniques can uncover the physical properties and dynamic behaviours of materials, driving the discovery of emergent phenomena and guiding the design of next-generation computing hardware. As artificial intelligence becomes pervasive, the demand for high-performance materials to support sustainable information technologies is growing. This Review highlights state-of-the-art imaging from electron and X-ray to optical techniques to probe the dynamics of neuromorphic materials, including operando characterization of devices. We examine design principles for neuromorphic materials, along with obstacles that hinder their development. Emphasis is placed on spatially and temporally resolved approaches that capture state changes including phase transitions, ferroic switching and spin-wave propagation that emulate biological components such as neurons, synapses and their connectivity. We discuss challenges in operando characterization and the integration of artificial intelligence-driven analysis for feedback-guided material discovery. Finally, we outline opportunities for real-time imaging of neuromorphic systems, paving the way towards adaptive, brain-inspired hardware.
In the era of artificial intelligence, machines are demonstrating an unprecedented capacity to learn from massive amounts of real-world data to perform human-like cognitive processes, enabling them to recognize environments, objects, and conditions and make critical decisions more accurately than ever. In the medical field, the potential to generate realistic, privacy-preserving, unbiased synthetic data can be the key to unlocking the potential of artificial intelligence in medicine and overcoming the current barriers such as data privacy concerns and high data curation costs. Advanced data-driven solutions could lead towards more robust clinical decision support systems and enhanced clinical training. This Perspective critically examines current and emerging advances in synthetic data generation, and highlights its anticipated transformational effect for early and efficient prevention, diagnosis and treatment of gastrointestinal diseases. Research challenges and directions are identified for leveraging the benefits of synthetic data as well as translating and adopting them in clinical workflows.
Large-scale artificial intelligence (AI) models achieve notable performance in computer vision but require substantial computational resources, limiting their deployment on edge devices [1,2]. Optical neural networks (ONNs) promise reduced latency and energy consumption by making use of the inherent parallelism of light [3]. However, present ONNs struggle to scale and are confined to simple tasks, owing to the challenges of replicating exact algebraic operations of digital models using physical (analogue) systems. This work introduces a new paradigm that directly embeds core computer vision principles, including similarity-based recognition, attention-guided perception and detail-context fusion, into a large-scale optical metasurface. By unifying optical physics with these computer vision fundamentals, we develop a photonic-electronic engine that overcomes scalability and generality barriers, enabling high-accuracy, general-purpose computer vision at the edge. The resulting system combines a 41-million-parameter optical metasurface front end with a co-designed, ultraefficient 87,000-parameter digital back end, outperforming many digital models with tens of millions of parameters across object detection, segmentation, 3D reconstruction and video understanding. We build a deployable prototype and demonstrate real-time edge visual processing in natural scenes. This work represents a path towards practical optical computing for general vision tasks in complex natural environments, enabling a new paradigm for low-energy, low-latency, real-time on-device vision intelligence.
As artificial intelligence (AI) becomes increasingly integrated into financial decision-making, concerns about responsibility attribution in human-AI collaboration have intensified. This study examines how AI trust relates to the displacement of responsibility. Drawing on automation trust theory and moral disengagement theory, we propose a mediation model in which decision delegation links AI trust to displacement of responsibility, with perceived anthropomorphism and perceived accountability as contextual moderators. Two scenario-based experiments were conducted to test the proposed framework. The findings show that AI trust has no direct effect on the displacement of responsibility. Instead, it exerts an indirect effect by increasing users' willingness to delegate decision authority to AI systems. Furthermore, perceived anthropomorphism strengthens this indirect effect, whereas perceived accountability weakens it. These results suggest that responsibility attenuation in AI-assisted decision-making is primarily driven by behavioral delegation rather than trust itself. The study clarifies the psychological mechanism and boundary conditions linking AI trust to responsibility attribution in human-AI collaboration.
Rhinitis in the elderly represents a unique challenge because of specific clinical profiles, needs and expectations. Allergic rhinitis (AR) in the elderly is often intertwined with non-allergic rhinitis (NAR). Multimorbidity, frailty, disability and polypharmacy leading to drug-drug interactions are important considerations in the elderly. Epidemiological data remain limited. AR in the elderly often continues from an earlier onset, but in some patients symptoms first appear in old age. Rhinitis in the elderly is often underrecognized because its symptoms are those of AR, possibly combined with NAR, and are influenced by age-related structural and functional changes in the nose. These include nasal dryness, congestion without a clear trigger, and isolated rhinorrhoea. Several characteristics of AR in the elderly make its diagnosis more challenging in that age group. An assessment of control and severity is needed to optimize person-centered treatment. There is no specific guideline for the management of AR in the elderly. However, although AR in the elderly may differ from rhinitis in younger adults, and medication efficacy may be reduced, the overall management approach is the same as in younger patients. The main problem is safety, as some drugs may cause specific severe side effects. First-generation oral antihistamines should be avoided. Intranasal corticosteroids (INCS), second-generation oral antihistamines, intranasal H1-antihistamines (INAH) and the INAH-INCS fixed combination are the first-line therapeutic options. Digital health associated with artificial intelligence has a promising future for AR management.
Brain-computer interface (BCI) technology establishes a direct communication pathway between neural activity and external devices. Driven by advances in neuroscience, artificial intelligence (AI), neural signal acquisition, decoding algorithms, and implantable system design, BCIs have progressed rapidly from experimental prototypes toward clinically relevant neurotechnologies. However, the translation of these technical advances into routine clinical practice and equitable real-world access remains substantially slower than technological innovation. This review summarizes the major technological pathways of BCIs and their clinical applications, and it then examines BCI development from the perspective of clinical translation and accessibility. We focus on key barriers across the translational chain, including long-term technical stability, quality of clinical evidence, evaluation standards, reimbursement mechanisms, health-economic evidence, and the feasibility of implementation in real-world healthcare settings. We argue that the central challenge in BCI development has shifted from improving technical performance alone to building the translational infrastructure required for safe, effective, affordable, and sustainable clinical integration.
Data scarcity, inter-institutional stain variability, and privacy constraints are major challenges impeding the development of generalizable artificial intelligence models in digital pathology. Although recent generative adversarial network (GAN)-based synthesis approaches show promise, they struggle to preserve fine-grained nuclear morphology or maintain class-specific histological diversity. Moreover, existing diffusion-based studies have not sufficiently addressed class-conditional synthesis across heterogeneous, multi-institutional pathology data. We introduce the Efficient Pathology Diffusion Pipeline (EPDP), a class-conditional diffusion framework that enables the generation of clinically relevant, subtype-specific synthetic histopathology images for training and validating diagnostic AI models. By providing high-fidelity, subtype-specific synthetic datasets, EPDP lowers the barriers to developing robust AI diagnostic tools and supports the establishment of standardized evaluation frameworks in digital pathology. To achieve this, EPDP integrates a customized denoising U-Net that employs nuclear details, learnable class embeddings, and CycleGAN-based stain normalization with reference-guided alignment. This design explicitly targets the preservation of fine-grained nuclear morphology, class-specific histological diversity, and stain-invariant visual consistency across institutions. Multi-institutional hematoxylin and eosin whole-slide images are curated via a two-round pathology review to build subtype-labeled training and evaluation sets. Image fidelity is assessed using FID, 1-LPIPS, and SSIM, and compared against state-of-the-art models including PathDiff, PathLDM, and DiffInfinite. Diagnostic utility was evaluated using cross-domain classification (EfficientNetV2-L), and perceptual realism was assessed using a visual Turing test (VTT). The real-synthetic FID was lower than that of real-real FID by 11.0% (breast) and 8.6% (gastric). 
The subtype 1-LPIPS gaps were ≤ 12%, and the SSIM gaps were ≤ 0.3%. Classifiers trained only on synthetic patches matched real-trained baselines to within ~1% in F1 score on real-image validation. Pathologists performed at near-chance levels in the VTT, with accuracies ranging from 50% to 56%.