Access to care is an important component of cancer center catchment area (CA) analytics, where CAs are defined as the geographic scope of cancer center operations. Spatial access to care is one piece of the access to care continuum that is useful for quantifying population travel to health care providers. As no studies have comprehensively calculated CA spatial access to providers, we examined access to oncology, cancer care, and primary care providers for all 65 National Cancer Institute-designated cancer center CAs in the 48 contiguous US states. We used the 2024 end-of-year Centers for Medicare and Medicaid Services National Downloadable File and the enhanced two-step floating CA method to compute spatial accessibility. We stratified analyses by cancer center, census division, 2020 urban/rural status, 2023 area deprivation, and cancer center type, and produced select CA maps. Census tracts in the Montefiore Einstein Comprehensive Cancer Center CA had the highest oncology and cancer care spatial access, while the Masonic Cancer Center had the highest primary care spatial access. New Jersey, New York, and Pennsylvania CAs had the highest oncology and cancer care spatial access (P < .001), while midwestern CAs had the highest primary care spatial access (P < .001). Across area deprivation index quartiles and all provider groupings, urban tracts had higher spatial access than rural tracts (P < .001). Comprehensive cancer centers had higher spatial access to oncology and primary care than noncomprehensive cancer centers (P < .001), while noncomprehensive cancer centers had higher spatial access to cancer care providers (P < .001). We observed significant differences in CA spatial access to oncology, cancer care, and primary care by region, urban/rural status, socioeconomic position, and cancer center type.
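The enhanced two-step floating CA method named above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the stepwise travel-time weights, the 30-minute cutoff, and all names and toy numbers (`step_weight`, `e2sfca`, `clinic_a`, `tract_1`) are assumptions, since the abstract does not report them.

```python
def step_weight(minutes):
    """Illustrative stepwise distance decay; zone cutoffs and weights are assumptions."""
    if minutes <= 10:
        return 1.0
    if minutes <= 20:
        return 0.68
    if minutes <= 30:
        return 0.22
    return 0.0


def e2sfca(supply, demand, travel, weight=step_weight):
    """Return a spatial access score per tract.

    supply: {site: provider count}
    demand: {tract: population}
    travel: {(tract, site): travel time in minutes}
    """
    # Step 1: provider-to-population ratio at each site, with the
    # population discounted by travel time to that site.
    ratios = {}
    for site, capacity in supply.items():
        reachable = sum(
            pop * weight(travel[(tract, site)])
            for tract, pop in demand.items()
            if (tract, site) in travel
        )
        ratios[site] = capacity / reachable if reachable > 0 else 0.0

    # Step 2: each tract sums the ratios of the sites it can reach,
    # discounted by the same travel-time weights.
    return {
        tract: sum(
            ratio * weight(travel[(tract, site)])
            for site, ratio in ratios.items()
            if (tract, site) in travel
        )
        for tract in demand
    }


# Toy data (hypothetical): two provider sites and two census tracts.
supply = {"clinic_a": 10, "clinic_b": 5}
demand = {"tract_1": 1000, "tract_2": 2000}
travel = {
    ("tract_1", "clinic_a"): 5,
    ("tract_2", "clinic_a"): 25,
    ("tract_1", "clinic_b"): 15,
    ("tract_2", "clinic_b"): 8,
}
access = e2sfca(supply, demand, travel)
```

A useful sanity check on any such implementation is that the population-weighted sum of the access scores equals the total provider supply, as long as every site is reachable by some tract.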
The rapidly evolving breast cancer treatment landscape creates significant information synthesis challenges for clinicians. We evaluated whether small open-source large language models (LLMs) augmented with retrieval-augmented generation (RAG) could match proprietary model performance for clinical guideline queries. We developed a domain-specialized RAG pipeline using HTML-structure-preserving chunking of 1,356 ASCO breast cancer guideline documents. Five LLMs were each evaluated with and without RAG: GPT-4-turbo, GPT-3.5-turbo, Qwen2.5-14B (14 billion parameters), LLaMA3-8B, and OpenBioLLM-8B. Performance was assessed using 98 expert-curated question-answer-context triplets across seven breast cancer categories. Evaluation used both rubric-based scoring (six metrics: fluency, relevance, reliability, consistency, clarity, and clinical impact) and exhaustive pairwise ranking by GPT-4-turbo as judge. Human validation was conducted with 15 practicing oncologists on a 10-query subset. RAG-enhanced Qwen2.5-14B achieved mean rubric scores of 3.77 versus 3.96 for GPT-4-turbo and pairwise ranking performance of 0.72 versus 0.81 (normalized scale). Although absolute rubric gains were modest (0.02-0.05 on a five-point scale), relative improvements in head-to-head win rates ranged from 16% to 46%. Human expert scores confirmed RAG superiority but were consistently more conservative than LLM judge scores (mean 3.81 v 4.12 across all metrics). Optimal retrieval used top-5 contexts; performance degraded sharply at higher context volumes. Small open-source LLMs with optimized RAG can approach state-of-the-art proprietary model performance for clinical decision support. This approach enables scalable, cost-effective, privacy-preserving deployment without recurrent fine-tuning, suggesting potential for real-world clinical implementation on single-graphics processing unit infrastructure under expert supervision.
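The retrieval step of a RAG pipeline like the one described above can be illustrated as a top-k similarity search over embedded chunks. This sketch assumes cosine similarity over precomputed embedding vectors; the abstract does not state the embedding model or the similarity measure, and `cosine`, `top_k`, and the toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, chunks, k=5):
    """Return the texts of the k chunks most similar to the query.

    chunks: list of (text, embedding) pairs. k=5 mirrors the
    top-5 setting reported in the abstract.
    """
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy two-dimensional embeddings (hypothetical).
chunks = [
    ("chunk about adjuvant therapy", [1.0, 0.0]),
    ("chunk about screening", [0.0, 1.0]),
    ("chunk about both topics", [1.0, 1.0]),
]
contexts = top_k([1.0, 0.0], chunks, k=2)
```

The retrieved texts would then be placed in the prompt ahead of the clinical question; keeping k small is consistent with the reported degradation at higher context volumes.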
Hereditary cancer risk is key to guiding screening and prevention strategies. Cancer risks can vary by individual because of the presence or absence of high- and moderate-risk pathogenic variants (PVs) in cancer-associated genes, in addition to sex, age, and other risk factors. We previously developed Fam3PRO, a flexible multigene, multicancer Mendelian risk prediction model that estimates a patient's risk of carrying a PV in hereditary cancer genes and their future risk of developing several types of cancers. The Fam3PRO R package includes 22 genes with 18 associated cancers, allowing users to build customized submodels from any gene-cancer set. However, the current R package lacks a user interface (UI), limiting its practical use in clinical settings. Therefore, we aim to develop a web-based UI for broader use of the Fam3PRO functionalities. The Fam3PRO UI (F3PI), built using R Shiny, collects and formats inputs including family health history, genetic test results, and other risk factors. Pedigree data are interactively visualized and modified using pedigreejs, whereas the backend Fam3PRO model takes all the inputs to generate carrier probabilities and future cancer risks, presented through an interactive UI. F3PI streamlines the collection of patient and family history data, which is analyzed by the Fam3PRO models to provide personalized cancer risks for each proband across 18 cancers, as well as probabilities that a proband has a PV in up to 22 hereditary cancer genes. These results are returned to the user, within 1 minute on average, and are available in both interactive and downloadable formats. We have developed F3PI, an easy-to-use, interactive web application that makes cancer and genetic risk information more accessible to providers and their patients.
Cancer registry data represent an indispensable tool for researchers and community outreach and engagement (COE) professionals seeking to understand and mitigate cancer burden in cancer center catchment areas and beyond. To provide insights into the opportunities and obstacles for innovation in the population cancer data space, we contrast the needs and challenges of three groups of stakeholders with regard to cancer registry data. We convened a nationwide panel of 18 population cancer researchers, COE professionals, and central cancer registry officials. We performed qualitative analysis of individual interviews, survey responses, and meeting transcripts to identify the cancer registry data-related needs and challenges of each stakeholder group. We identified distinct functional categories related to registry data applications, and described points of convergence and divergence within each category across the three stakeholder groups. We completed 8 hours of individual panelist interviews, received 16 survey responses (88.9% response rate), and conducted three meetings of working groups. All stakeholder groups agreed on the value of accurate and representative registry data. Researchers desired granular data (case-level and aggregated by small geographic levels) with more clinically relevant data fields and linked community-level access measures. COE participants valued cancer burden and social drivers of health metrics aggregated at a subcounty level as well as user-friendly data querying and visualization tools. Cancer registry officials described an imperative to comply with mandatory reporting requirements and to protect patient privacy in a setting of resource constraint that can conflict with the data use goals of researchers and COE users. Population cancer researchers, COE professionals, and cancer registry officials understand the value of registry data, but the priorities of each are misaligned to varying degrees. Further work is needed to understand the elements of successful efforts to expand the utility and use of registry data.
Population-based cancer registries are a key data resource for catchment area informatics, but their utility for quantifying differences in cancer burden by socioeconomic status is limited. Here, we describe an approach that estimates cancer incidence along income gradients, leveraging a newly validated method called weighting by income probabilities (WIP). We estimated income-specific colorectal cancer incidence, stratified by sex and race/ethnicity, in a catchment area (Ohio) as a case study. Income-specific numerator data (number of cancer cases per income bracket) were estimated using WIP, whereas denominators (population at risk by income bracket) were derived from US Census data. In the case study of the 52,257 patients with invasive colorectal cancer diagnosed in the catchment area of Ohio between 2010 and 2019, lower income was generally associated with higher incidence rates, except in non-Hispanic (NH) White female individuals. The highest incidence was observed in NH Black male individuals at 0-149% of the Federal Poverty Level, with 113.7 cases per 100,000 (95% CI, 99.6 to 129.3) in 2010-2012, compared with 57.8 (95% CI, 54.7 to 61.2) in their NH White counterparts. Sensitivity analyses showed that income-specific incidence statistics were robust to sources of error in numerator and denominator estimation, with incidence estimates varying by no more than 1.98% from the reference estimates. The approach described here accurately estimates cancer incidence along income gradients and can be expanded to estimate income-specific survival and mortality. The case study of colorectal cancer in Ohio demonstrates important insights into the burden of cancer by income. These granular income-specific data can enhance our understanding of the relationship between cancer burden and socioeconomic status and inform cancer surveillance, prevention, and control efforts.
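The numerator and denominator logic described above can be sketched as follows. This is one plausible reading of weighting by income probabilities, not the validated method itself: it assumes each case carries a probability of belonging to each income bracket and that the expected case count per bracket is the sum of those probabilities. The function names and toy numbers are hypothetical, and the rate shown is crude rather than age-adjusted.

```python
def expected_cases_by_bracket(case_probabilities):
    """Sum per-case income-bracket probabilities into expected case counts.

    case_probabilities: list of {bracket: probability} dicts, one per case.
    Assumption: each dict sums to 1 across brackets.
    """
    totals = {}
    for probs in case_probabilities:
        for bracket, p in probs.items():
            totals[bracket] = totals.get(bracket, 0.0) + p
    return totals


def rate_per_100k(cases, population):
    """Crude incidence rate per 100,000 population at risk."""
    return 100_000 * cases / population


# Toy example (hypothetical): two cases and a made-up denominator.
cases = expected_cases_by_bracket([
    {"0-149% FPL": 0.5, "150%+ FPL": 0.5},
    {"0-149% FPL": 1.0},
])
low_income_rate = rate_per_100k(cases["0-149% FPL"], 3000)
```

In practice the denominators would come from US Census population counts for the same income brackets, as the abstract describes.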
We aimed to curate high-quality clinical and genomic data sets from patients with cancer and to predict hospital readmission using machine learning (ML) models. We extracted data from electronic health records for patients with cancer in the University of California, San Diego Health System, to curate clinicogenomic data sets for lung, breast, and colon cancers. We constructed ML models to predict the risk of hospital readmission 30, 60, and 90 days postdischarge. Standard ML models (logistic regression, random forest [RF], gradient boosting [GB], neural network) and multitask neural network models were developed to simultaneously predict all three readmission outcomes. Our results revealed that rehospitalization is most frequent in colon cancer within 30 days. For the 30-day readmission prediction, GB achieved the highest area under the precision recall curve (PR-AUC) for lung (0.415) and breast (0.470) cancers and RF achieved the overall highest PR-AUC for colon cancer (0.621). Explainability analysis revealed that health care metrics (such as the number of previous admissions and average length of stay), risk scores composed of diagnosis codes, and treatments are significant features in predicting readmission within cancer types. It also identified EGFR mutations as a potential predictor of readmission in colon cancer. The study highlights the potential of integrating clinical and genomic data for predicting adverse outcomes in patients with cancer. The standard ML approaches were able to successfully capture patterns in readmission and outperformed the more complex models. Limitations include the relatively small data set from a single institution. Ultimately, this study highlights the value of curating and maintaining clinicogenomic information at an institution level to streamline data set curation and model development.
Although cancer centers need geospatially referenced cancer surveillance systems to track disease incidence and mortality rates for the communities they serve, most do not have the tools to allow them to identify which of the neighborhoods they serve have the greatest cancer care needs. Here, we describe the latest build of SCAN360, a web-enabled tool developed by the Sylvester Comprehensive Cancer Center, which describes disease burden and risk factors for hundreds of communities across south Florida. After describing some of the geographically encoded metrics from more than a dozen data sources, including the Florida State Cancer Registry, US census, and EPA, we describe innovative applications of geospatial analytics for cancer prevention research. Using the data harmonized within SCAN360, we applied geospatial hotspot analysis to identify locations with an unusually high burden of lung and gastric cancers as targets for community engagement and intervention. To help increase the precision of such efforts, we overlaid these cancer hotspots on choropleth maps differentiating census tracts by socioeconomic disparities that are known to affect cancer control in the population. We also demonstrate how environmental data can be integrated with cancer surveillance data for assessing climate change impacts on melanoma risk. SCAN360 and the case studies presented here offer deployment-ready examples for other cancer centers to follow when developing geospatially referenced surveillance systems for catchment area monitoring, outreach, and research.
Disparities in lung cancer incidence exist in Black populations, and screening criteria underserve Black populations due to disparately elevated risk in the screening-eligible population. Prediction models that integrate clinical and imaging-based features to individualize lung cancer risk are a potential means to mitigate these disparities. This multicenter (National Lung Screening Trial [NLST]) and catchment population-based (University of Illinois Health [UIH], urban and suburban Cook County) cross-sectional study used participants at risk of lung cancer with available lung computed tomography (CT) imaging and follow-up between the years 2015 and 2024. In all, 53,452 participants in NLST and 11,654 in UIH were included on the basis of age and tobacco use-based risk factors for lung cancer. Cohorts were used for training and testing of deep and machine learning models using clinical features alone or combined with CT image features (hybrid computer vision). An optimized seven-feature clinical model achieved receiver operating characteristic (ROC)-AUC values ranging from 0.64 to 0.67 in NLST and 0.60 to 0.65 in UIH cohorts across multiple years. Incorporation of imaging features to form a hybrid computer vision model significantly improved ROC-AUC values to 0.78-0.91 in NLST, but performance was lower in UIH (ROC-AUC values of 0.68-0.80), a shortfall attributable to Black participants, in whom ROC-AUC values ranged from 0.63 to 0.72 across multiple years. Retraining the hybrid computer vision model by incorporating Black and other participants from the UIH cohort improved performance with ROC-AUC values of 0.70-0.87 in a held-out UIH test set. Hybrid computer vision predicted risk with improved accuracy compared with clinical risk models alone. However, potential biases in image training data reduced model generalizability in Black participants. Performance was improved upon retraining with a subset of the UIH cohort, suggesting that inclusive training and validation data sets can minimize racial disparities. Future studies incorporating vision models trained on representative data sets may demonstrate improved health equity upon clinical use.
Cancer recurrence is a critical outcome for patients and physicians. Retrospective cancer recurrence data can evaluate recurrence-directed treatment and generate novel interventions targeting recurrent cancers. However, large cancer databases do not provide recurrence-related information, which stymies study at scale and makes significant manual record review necessary. Automated evaluation of records may allow for the rapid generation of easily analyzed data sets, accelerating the evaluation of recurrence altogether. Patients treated with radiation therapy at one tertiary referral center from 2010 to 2018 with a verified cancer status (cancer recurrence v no cancer recurrence) were identified. Patients with recurrent disease were initially identified through manual record review, and the associated pathology report was collected. Google Automated Machine Learning with Natural Language Processing (AutoNLP) and Google Gemini 1.5 Pro were used to generate a model for binary classification, with comparison to the gold-standard manually developed data set. A total of 7,054 patients were identified. Of these, 3,431 (48.6%) were female, with a median age of 64 years. Head and neck (1,482, 21%), breast (1,480, 21%), upper GI (1,307, 18.5%), and lung/thorax (1,107, 15.7%) were the most common disease sites. Recurrence was verified for 1,546 patients (21.9%) using pathology reports, of which 1,249 positive cases were paired with 651 negative pathology reports for model development. Google Gemini 1.5 Pro consistently outperformed AutoNLP across all measurements of accuracy, generating a greater absolute difference in precision, recall, negative predictive value, and specificity, and a higher likelihood of correct classification at the individual level, rendering Gemini superior in recurrence status extraction. AutoNLP and Google Gemini 1.5 Pro are promising tools for identifying recurrence from pathology reports, with the latter demonstrating superior overall performance, making it particularly suitable for clinical translation.
The oncogenic impact of somatic driver alterations is shaped by tissue context. Classifying alterations by cancer type and evaluating their context-specific properties requires large cohorts of genomically profiled and clinically annotated tumors. Here, we define cancer type-specific patterns of driver alterations, including 164 newly identified hotspots, in 54,331 tumors from 48,179 patients spanning 448 histological cancer subtypes. One-third of all drivers arose in non-canonical contexts and exhibited distinct features, including increased subclonality, later emergence, and divergent biological properties. Within cancer types, gene fusions and other distinct patterns of co-occurring drivers are indicative of earlier age of disease onset. We also identify ancestry-specific differences in human leukocyte antigen (HLA)-restricted driver neoantigens affecting T cell receptor therapy eligibility, and demonstrate cancer-type-specific patterns of intrinsic resistance via somatic HLA loss. Our findings highlight that functional roles of driver alterations depend on the cancer types and clinical contexts in which they arise.
Cancer registries are often asked to present cancer data for small geographic areas to inform and facilitate targeted interventions and prevention programs. However, it is challenging to compute and visualize reliable cancer estimates for areas with small case counts and populations to support cancer control planning. Leveraging a user-centered design process, we developed a visual analytics platform and interactive graphics to display modeled cancer risk estimates for small areas. Development of our visual analytics platform was informed by cancer registry and public health professionals through focus groups and surveys. The reliable cancer risk estimates for small areas that we displayed on this platform were created using a Bayesian hierarchical model that borrows strength from neighboring areas and over time to produce cancer estimates for small areas. The Cancer Analytics and Maps for Small Areas tool (CAMSA) provided age-adjusted cancer incidence and mortality rates and risk probabilities for eight cancers at the county and ZIP-code tabulation area levels. It allowed the user to identify areas of high cancer incidence, including among subgroups defined by sex and race/ethnicity. Potential end users were enthusiastic about the opportunity to implement CAMSA within their practice, emphasizing the tool's potential for increased collaborative opportunities at local and state levels. Suggestions for improvement included adding map overlays such as additional cancer risk variables and incorporating functionalities such as exportable data tables. CAMSA presented cancer rate and risk estimates for small geographic areas where they may have previously been suppressed. Through our user-centered design process, we developed statistical models and data visualizations to support the needs of an array of potential end users.
Predicting recurrence of pancreatic cancer after surgery could inform clinical decision making, including adjuvant therapies and follow-up. This study aimed to develop and validate a deep learning model using digitized whole-slide images (WSI) of histopathology. Publicly available WSI of pancreatic ductal adenocarcinoma resections from three cohorts were used for training. The model consisted of a pan-cancer foundation model to generate embeddings, mean-pooling across tissue patches, and then a fully connected neural network. Model predictions were compared with human-labeled histopathologic features and genomic alterations. The model was externally validated in a meta-analysis of a single-center cohort from Princess Margaret Cancer Centre, a multicenter cohort from France, and the PRODIGE 24 trial of adjuvant chemotherapy. The deep learning model was trained on 12,594 tissue patches from 257 patients. High-risk classifications were associated with squamous morphology, reactive stroma, tumor cellularity, and necrosis, whereas low-risk classifications were associated with tubulopapillary and conventional morphologies, as well as deserted stroma. High-risk cancers were enriched for basal-like gene expression profiles and distinct oncogenic pathways. In a meta-analysis of the external cohorts, the hazard ratio (HR) for death comparing high- versus low-risk cancers was 1.49 (95% CI, 1.25 to 1.79, P < .001), whereas the HR for recurrence or death was 1.41 (95% CI, 1.19 to 1.68, P < .001). The classifications remained prognostic among moderately differentiated cancers. An open-source deep learning model using WSI from pancreatic cancer resections generated risk classifications that correlated with histopathologic and genomic features. Classifications were externally validated in a meta-analysis of three cohorts. This model could be applied to WSI to provide individualized prognostic information for patients.
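The pooling stage of the architecture described above (patch embeddings, mean-pooling, then a prediction head) can be sketched in plain Python. This is a simplified illustration under stated assumptions: the foundation model is omitted, the fully connected network is reduced to a single logistic layer, and `mean_pool`, `risk_score`, and all numbers are hypothetical.

```python
import math


def mean_pool(patch_embeddings):
    """Average patch-level embeddings into one slide-level embedding."""
    n = len(patch_embeddings)
    return [sum(column) / n for column in zip(*patch_embeddings)]


def risk_score(slide_embedding, weights, bias):
    """Single logistic layer standing in for the fully connected network."""
    z = sum(w * x for w, x in zip(weights, slide_embedding)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Toy example: two patches with two-dimensional embeddings (hypothetical).
patches = [[1.0, 2.0], [3.0, 4.0]]
slide = mean_pool(patches)  # [2.0, 3.0]
score = risk_score(slide, weights=[0.5, -0.5], bias=0.0)
```

Mean-pooling makes the slide-level representation independent of the number and order of patches, which is why slides of very different sizes can share one prediction head.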
Reviewing pathology reports requires physicians to integrate complex histopathologic, immunohistochemical, and molecular findings from multiple reports and institutions, often under time constraints that increase the risk of error and fatigue. Large language models (LLMs) offer a potential solution by generating concise, coherent summaries from complex pathology data. Patients who underwent initial consultation in a thoracic clinic between January 2019 and July 2023 were included. Original pathology reports and corresponding physician pathology summaries from consultation notes were extracted and anonymized. Six open-source LLMs (Llama 3.0, Llama 3.1, Llama 3.2, Mistral, Gemma, and DeepSeek-R1) generated pathology summaries directly from the original reports. Objective and subjective evaluations were performed using the original reports as the ground truth. LLM-generated summaries were compared with physician summaries for correctness, completeness, and conciseness. Additional subjective assessments with multiple evaluators were conducted for Llama 3.1. Ninety-four cases met the eligibility criteria. Using the original pathology reports as the ground truth, the LLM-generated summaries achieved higher scores across all objective evaluation metrics compared with physician pathology summaries (P < .0001). In the subjective evaluation, DeepSeek, Mistral, Llama 3.1, and Llama 3.2 achieved higher ratings for completeness (P = .017, P < .0001, P < .0001, and P < .0001, respectively) while maintaining comparable correctness relative to physician pathology summaries (P = 1.000). The results remained consistent in additional subjective analyses involving multiple evaluators for Llama 3.1. LLM-generated summaries demonstrated better performance in objective metrics and greater completeness in subjective evaluations compared with physician summaries. These results highlight the potential of LLMs as valuable tools for enhancing clinical documentation and workflow efficiency in oncology practice.
Multimodal machine learning offers a holistic view of a patient's status, integrating structured and unstructured data from electronic health records (EHR). We propose a framework to predict metastasis risk 1 month before diagnosis, using 6 months of clinical history from EHR data. Data from four cancer cohorts collected at Karolinska University Hospital (Stockholm, Sweden) were analyzed: breast (n = 743), colon (n = 387), lung (n = 870), and prostate (n = 1,890). The data set included demographics, comorbidities, laboratory results, medications, and clinical text. We compared traditional and deep learning (DL) classifiers across single modalities and multimodal combinations, using various fusion strategies and a transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) 2a design, with an 80-20 development-validation split to ensure a rigorous, repeatable evaluation. Performance was evaluated using AUROC, area under the precision-recall curve, F1 score, sensitivity, and specificity. We then employed a multimodal adaptation of Shapley additive explanations (SHAP) to analyze the classifiers' reasoning. Intermediate fusion achieved the highest F1 scores on breast (0.845), colon (0.786), and prostate cancer (0.845), demonstrating strong predictive performance. For lung cancer, the intermediate fusion achieved an F1 score of 0.819, while the text-only model achieved the highest, with an F1 score of 0.829. DL classifiers consistently outperformed traditional models. Colon cancer, the smallest cohort, had the lowest performance, highlighting the importance of sufficient training data. SHAP analysis showed that the relative importance of modalities varied across cancer types. Fusion strategies offer distinct strengths and weaknesses. Intermediate fusion consistently delivered the best results, but strategy choices should align with data characteristics and organizational needs.
This study assessed the feasibility of developing the University of Maryland Marlene and Stewart Greenebaum Comprehensive Cancer Center (UMGCCC)-Medicare-linked database infrastructure by integrating tumor registry, electronic health records (EHRs), and Medicare administrative claims data. The database was designed to support research identifying determinants of differences in cancer outcomes among patient populations commonly under-represented in clinical trials (based on the US population with the disease) including older adults. Patients 65 years and older who were diagnosed and/or received their first course of treatment for a primary tumor at UMGCCC from 2018 to 2021 were included in the database. A two-stage data linkage process was used to merge cancer center tumor registry data with EHR and Medicare claims data. We performed data quality and linkage quality checks. Summary statistics were calculated for patient and tumor characteristics. Of the 3,322 patients identified from the tumor registry, 3,119 patients (94%) were included in the UMGCCC-Medicare database (mean age 73.1 years, 56% male, 31% Black). Lung cancers were the most common (15%) followed by oral cancers (12%) and non-Hodgkin lymphoma (6%). The development of the UMGCCC-Medicare database serves as proof of concept for linking real-world data from different sources. The database is a valuable resource for research requiring detailed patient-level data and follow-up that may generate real-world evidence for older adults living in the United States and treated in routine oncology practice.
Effective risk stratification in cancer survivorship requires handling longitudinal data characterized by multimodal inputs, irregular follow-up, and recurrent clinical events. This study evaluated the incremental value of integrating patient-reported outcomes (PROs) with electronic health record (EHR) data and identified optimal windowing strategies for machine learning-based prediction of adverse survivorship outcomes. This study used a cohort of 25,592 cancer survivors followed for 36 months. Data from four domains were integrated: baseline measures, treatments, PROs, and health care utilization (emergency room visits and hospitalizations). Two classification models, LASSO and CATBOOST, were applied across modality combinations and five temporal representations of patient history: static early-phase (0-6 months), cumulative history, sliding windows (4- and 12-month), and a most-recent baseline. Performance was evaluated for predicting monthly health care utilization and patient-reported symptom burden using average precision (AP). SHapley Additive exPlanations (SHAP) analysis identified key predictors and characterized their evolving influence. For health care utilization, CATBOOST models trained on the full multimodal data set with time-windowed predictors achieved strong discrimination (AP = 0.207), outperforming static baselines by 27%. SHAP analyses emphasized dynamic contributions from recent utilization and treatment toxicity. For symptom burden, PRO integration was crucial, nearly doubling clinical-only performance (AP = 0.132 v 0.071), with longer historical context improving characterization of progressive functional decline and symptom severity. Flagging the top 10% of patients by predicted risk captured 51.7% of health care utilizations and 46.7% of symptom burden events. Adverse survivorship risk is dynamic and outcome-specific: acute health care utilization is best predicted by recent clinical momentum, while longitudinal patient-reported trends drive symptom burden. Implementing decoupled, dynamic windows provides a flexible framework for risk stratification and risk prediction beyond standard clinical heuristics, facilitating proactive, precision-based survivorship care.
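The "top 10% of patients by predicted risk" figure reported above is a capture rate: the share of all observed events that fall among the highest-risk patients. A minimal sketch of that calculation follows; the function name and toy data are hypothetical, and the study's exact tie-handling and per-month aggregation are not stated in the abstract.

```python
def capture_at_top_fraction(risks, events, fraction=0.10):
    """Share of all events that occur among the top `fraction` of patients by risk.

    risks:  predicted risk per patient
    events: observed event count (or 0/1 indicator) per patient
    """
    n_flagged = max(1, round(len(risks) * fraction))
    ranked = sorted(zip(risks, events), key=lambda pair: pair[0], reverse=True)
    captured = sum(event for _, event in ranked[:n_flagged])
    total = sum(events)
    return captured / total if total else 0.0


# Toy cohort of 10 patients (hypothetical); flagging 10% flags one patient.
risks = [0.9, 0.8, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.05]
events = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
share = capture_at_top_fraction(risks, events)
```

A model with no discrimination would capture roughly 10% of events at this threshold, which gives the reported 51.7% and 46.7% a natural reference point.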
Constitutional epimutations arise early in development and are present across normal tissues, including peripheral blood. Constitutional BRCA1 promoter methylation has emerged as a risk factor for BRCA1-associated cancers, such as ovarian cancer (OC), and may serve as a biomarker for OC risk. This study retrospectively evaluated the clinical relevance of constitutional BRCA1 promoter methylation in 473 patients with OC enrolled in the observational AGO-TR1 study (ClinicalTrials.gov identifier: NCT02222883). BRCA1 promoter methylation was quantified by methylation-specific real-time polymerase chain reaction using whole blood-derived DNA from 476 female controls and 473 patients with OC along with 473 corresponding tumor-derived DNA samples. Methylation levels ≥1.0% were considered methylation-positive. BRCA1 promoter methylation in blood-derived DNA was detected in 42 of 473 patients with OC and in 26 of 476 controls (8.9% v 5.5%; odds ratio [OR], 1.69 [95% CI, 1.02 to 2.80], P = .0432), with the strongest association observed with methylation levels ≥10% (OR, 6.17 [95% CI, 1.37 to 27.72], P = .018). Patients with BRCA1 promoter methylation in blood-derived DNA were diagnosed at a younger median age than those without (54.0 v 60.0 years, P = .018). Constitutional BRCA1 promoter methylation was less frequent in patients carrying pathogenic germline variants in OC predisposition genes than in noncarriers (4.1% v 10.5%; OR, 0.37 [95% CI, 0.14 to 0.96], P = .04) and showed no association with a family history of cancer or platinum-based chemotherapy before blood draw. BRCA1 promoter methylation in blood-derived DNA was correlated with tumor BRCA1 promoter methylation (P < .001). Tumor BRCA1 promoter methylation was observed in 64 of 473 samples (13.5%), half (32 of 64) of which were attributable to constitutional BRCA1 promoter methylation also detectable in the blood. Constitutional BRCA1 promoter methylation accounts for a substantial proportion of OCs and represents a robust biomarker for individual OC risk.
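The headline odds ratio above can be checked directly from the reported counts (42 of 473 patients and 26 of 476 controls). The sketch below assumes an unadjusted odds ratio with a Wald (log-scale) 95% confidence interval; the abstract does not state which interval method the authors used, and the function name is hypothetical.

```python
import math


def odds_ratio_wald(pos_cases, n_cases, pos_controls, n_controls, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval on the log scale."""
    a = pos_cases                  # methylation-positive patients
    b = n_cases - pos_cases        # methylation-negative patients
    c = pos_controls               # methylation-positive controls
    d = n_controls - pos_controls  # methylation-negative controls
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Counts taken from the abstract.
or_, lower, upper = odds_ratio_wald(42, 473, 26, 476)
print(f"{or_:.2f} ({lower:.2f} to {upper:.2f})")  # prints "1.69 (1.02 to 2.80)"
```

Under this assumption the counts reproduce the reported OR of 1.69 (95% CI, 1.02 to 2.80).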
Comprehensive genomic profiling (CGP) is a key strategy in precision medicine for lung cancer, yet its clinical implementation remains limited, partly because of the uncertainty in identifying druggable mutations in individual patients. In this study, we investigated the potential of an artificial intelligence (AI)-based tool to predict the probability of identifying druggable mutations before CGP (pretest probability). We developed an eXtreme Gradient Boosting (XGBoost) prediction model trained on pre-CGP clinical variables from 3,470 patients with lung cancer (June 2019-November 2023) to estimate the probability of identifying druggable mutations. The key predictors were identified using explainable artificial intelligence (XAI) analysis. The refined model was deployed as a web application and evaluated in a temporally independent test cohort of 1,307 patients (December 2023-November 2024), with Brier score as the primary end point. The prediction model achieved an area under the receiver operating characteristic curve (AUROC) of 0.85 (95% CI, 0.82 to 0.89) in the overall validation cohort and 0.79 (95% CI, 0.74 to 0.84) in patients for whom a driver mutation had not been identified through companion diagnostic testing. The XAI analysis identified sex, smoking history, histology, and metastatic sites as important predictors. Even among patients who underwent tissue CGP, bone (P = .011) and lung (P < .001) metastases were significantly associated with a higher druggable mutation detection rate. The deployed model achieved Brier scores of 0.19 in the overall independent test cohort and 0.16 in patients for whom a driver mutation had not been identified through companion diagnostic testing. These findings indicate that an AI-based tool using pre-CGP clinical data may support broader CGP implementation and improve access to targeted therapies.
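The Brier score used as the primary end point above is the mean squared difference between predicted probabilities and observed binary outcomes, so lower values indicate better calibrated and more accurate predictions. A minimal sketch with hypothetical toy data:

```python
def brier_score(predicted_probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(predicted_probs) != len(outcomes):
        raise ValueError("inputs must have the same length")
    n = len(outcomes)
    return sum((p - y) ** 2 for p, y in zip(predicted_probs, outcomes)) / n


# Toy example (hypothetical): two confident, correct predictions.
confident = brier_score([0.9, 0.1], [1, 0])

# Predicting 0.5 for every patient always scores 0.25.
uninformative = brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
```

The constant-0.5 benchmark of 0.25 is a common reference point when reading scores such as the 0.19 and 0.16 reported for the deployed model.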