Mendelian randomization (MR) assesses the total effect of exposure on outcome. With the rapidly increasing availability of summary statistics from genome-wide association studies (GWASs), MR leverages existing summary statistics and is widely used to study the causal effects among complex traits and diseases. The total effect in the population is a sum of indirect and direct effects. For complex disease outcomes with complicated etiologies, and/or for modifiable exposure traits, there may exist more than one pathway between exposure and outcome. The direct effect and the indirect effect via a mediator of interest could be in opposite directions, and the total effect estimates may not be informative for treatment and prevention decision-making or may even be misleading for different subgroups of patients. Causal mediation analysis delineates the indirect effect of exposure on outcome operating through the mediator and the direct effect transmitted through other mechanisms. However, causal mediation analysis often requires individual-level data measured on exposure, outcome, mediator and confounding variables, and the power of the mediation analysis is restricted by sample size. In this work, motivated by a study of the effects of atrial fibrillation (AF) on Alzheimer's dementia, we propose a framework for Integrative Mendelian randomization and Mediation Analysis (IMMA). The proposed method integrates the total effect estimates from MR analyses based on large-scale GWASs with the direct and indirect effect estimates from mediation analysis based on individual-level data of a limited sample size. We introduce a series of IMMA models, under the scenarios with or without exposure-mediator interaction and/or study heterogeneity. The proposed IMMA models improve the estimation and the power of inference on the direct and indirect effects in the population, as well as the characterization of the variation of effects.
Our analyses showed a significant positive direct effect of AF on Alzheimer's dementia risk not through the use of the oral anticoagulant treatment and a significant indirect effect of AF-induced anticoagulant treatment in reducing Alzheimer's dementia risk. The results suggested potential Alzheimer's dementia risk prediction and prevention strategies for AF patients, and paved the way for future re-evaluation of anticoagulant treatment guidelines for AF patients. A sensitivity analysis was conducted to assess the sensitivity of the conclusions to a key assumption of the IMMA approach.
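As a minimal sketch of why a total effect can mask opposing pathways, the snippet below combines a direct and an indirect effect on an additive scale. The numbers are invented for illustration and are not estimates from the study, the function names are hypothetical, and the sketch assumes no exposure-mediator interaction.

```python
# Hypothetical illustration only: the numbers below are invented, not study results.

def total_effect(direct: float, indirect: float) -> float:
    """Total effect on an additive scale, assuming no exposure-mediator interaction."""
    return direct + indirect


def indirect_from_total(total: float, direct: float) -> float:
    """Back out the indirect effect when the total and direct effects are known."""
    return total - direct


direct = 0.30     # assumed harmful direct effect
indirect = -0.25  # assumed protective effect operating through the mediator
total = total_effect(direct, indirect)

# The total effect is much smaller in magnitude than either pathway.
print(round(total, 2))
```

Under these assumed values the two pathways nearly cancel, which is the situation in which a total effect estimate alone could be uninformative.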
Changes in DNA methylation patterns exhibit a high correlation with chronological age. Epigenetic clocks, developed through statistical models that estimate epigenetic age using the methylation levels of cytosine-guanine dinucleotide (CpG) sites, have emerged as powerful tools for understanding aging and age-related diseases. Despite their popularity, the generalizability of these clocks across diverse populations remains a challenge. Some of the widely used epigenetic clocks, such as Horvath's clock (Genome Biol. 14 (2013) 1-20) and the PedBE clock (Proc. Natl. Acad. Sci. USA 117 (2020) 23329-23335), are shown to perform poorly in our target cohort. This loss of prediction accuracy raises concerns about their viability in calculating biological age in distinct demographic and ethnic groups. Technically, the feature space of existing clocks is yielded with an obsolete technique, potentially leading to systematic bias in the analysis of all target data generated by the EPIC 850K array. To address both population heterogeneity and technological advances, we adopt a transfer learning framework to calibrate existing epigenetic clocks by borrowing shared knowledge from diverse datasets. Furthermore, our transfer learning is built on kriging- and DNN-based methods for feature adaptation, to close the gap between existing clocks and our target data. We analyze data collected from 523 blood samples from a cohort of children and adolescents in the Early Life Exposure in Mexico to Environmental Toxicants (ELEMENT) study and show that our proposed transfer learning methods significantly improve prediction performance compared to existing clocks. Performance is further enhanced by using the CpG sites profiled on the higher-resolution EPIC array. More importantly, calibrated clocks produce epigenetic age accelerations that correlate better with stages of sexual maturation. 
Our methodology demonstrates the potential to bridge the gap between different DNA methylation datasets and various profiling platforms, thereby enhancing the applicability of epigenetic clocks across diverse population groups and contributing to more accurate aging research.
We develop a quantile regression decomposition (QRD) method for analyzing observed disparities (OD) between population groups in socioeconomic and health-related outcomes for complex survey data. The conventional decomposition approaches use the conditional mean regression to decompose the disparity into two parts: the part explained by differences in the distributions of the explanatory covariates, and the remaining part, which is unexplained by the covariates. Many socioeconomic and health outcomes exhibit heteroscedastic distributions, where the magnitude of observed disparities varies across different quantiles of these outcomes. Thus, differences in the explanatory covariates may account for varying differences in the OD across the quantiles of the outcome. The QRD can identify where there are greater differences in the outcome distribution, for example, at the 90th percentile, and how important the covariates are in explaining those differences. Much socioeconomic and health research relies on complex surveys, such as the National Health and Nutrition Examination Survey (NHANES), that oversample individuals from disadvantaged/minority population groups in order to provide improved precision. QRD has not been extended to the complex survey setting. We improve the QRD approach proposed in Machado and Mata (2005) to yield more reliable estimates at quantiles where the data are sparse, and extend it to the complex survey setting. We also propose a perturbation-based variance estimation method. Simulation studies indicate that the estimates of the unexplained portions of the OD across quantiles are unbiased and that the coverage of the confidence intervals is close to the nominal value. This methodology is used to study disparities in body mass index (BMI) and telomere length between racial/ethnic groups estimated from the NHANES data.
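For orientation, the conventional conditional-mean decomposition mentioned above can be sketched as follows. This is not the proposed QRD. It is a minimal unweighted two-fold mean decomposition on simulated data, with group b's coefficients used as the reference. The function name, the simulated data, and the omission of survey weights are all assumptions of the sketch.

```python
import numpy as np


def mean_decomposition(X_a, y_a, X_b, y_b):
    """Split mean(y_a) - mean(y_b) into explained and unexplained parts.

    X_a and X_b must include an intercept column. Group b's coefficients
    serve as the reference (an assumption of this sketch).
    """
    beta_a, *_ = np.linalg.lstsq(X_a, y_a, rcond=None)
    beta_b, *_ = np.linalg.lstsq(X_b, y_b, rcond=None)
    xbar_a = X_a.mean(axis=0)
    xbar_b = X_b.mean(axis=0)
    explained = (xbar_a - xbar_b) @ beta_b    # due to covariate differences
    unexplained = xbar_a @ (beta_a - beta_b)  # due to coefficient differences
    return explained, unexplained


# Simulated data (hypothetical): the groups differ in covariate mean and intercept.
rng = np.random.default_rng(0)
n = 500
x_a = rng.normal(1.0, 1.0, n)
x_b = rng.normal(0.0, 1.0, n)
y_a = 2.0 + 1.5 * x_a + rng.normal(0.0, 1.0, n)
y_b = 1.0 + 1.5 * x_b + rng.normal(0.0, 1.0, n)
X_a = np.column_stack([np.ones(n), x_a])
X_b = np.column_stack([np.ones(n), x_b])

explained, unexplained = mean_decomposition(X_a, y_a, X_b, y_b)
observed = y_a.mean() - y_b.mean()
print(explained, unexplained, observed)
```

Because each regression includes an intercept, the two parts add up to the observed mean disparity. A quantile-based decomposition asks the same question at each quantile instead of at the mean.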
Approaches for estimating genetic effects at the individual level often focus on analyzing phenotypes at a single time point, with less attention given to longitudinal phenotypes. This paper introduces a mixed modeling approach that includes both genetic and individual-specific random effects, and is designed to estimate genetic effects on both the baseline and slope for a longitudinal trajectory. The inclusion of genetic effects on both baseline and slope, combined with the crossed structure of genetic and individual-specific random effects, creates complex dependencies across repeated measurements for all subjects. These complexities necessitate the development of novel estimation procedures for parameter estimation and individual-specific predictions of genetic effects on both baseline and slope. We employ an Average Information Restricted Maximum Likelihood (AI-ReML) algorithm to estimate the variance components corresponding to genetic and individual-specific effects for the baseline levels and rates of change for a longitudinal phenotype. The algorithm is used to characterize the prostate-specific antigen (PSA) trajectories for participants who remained prostate cancer-free in the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial. Understanding genetic and individual-specific variation in this population will provide insights for determining the role of genetics in cancer screening. Our results reveal significant genetic contributions to both the initial PSA levels and their progression over time, highlighting the role of these genetic factors in the variability of PSA across unaffected individuals. We show how genetic factors can be used to identify individuals prone to high baseline PSA values and increasing PSA trajectories among individuals who are prostate cancer-free.
In turn, we can identify groups of individuals who have a high probability of falsely screening positive for prostate cancer using well established cutoffs for early detection based on the level and rate of change in this biomarker. The results demonstrate the importance of incorporating genetic factors for monitoring PSA for more accurate prostate cancer detection.
Semi-continuous data frequently arise in clinical practice. For example, while many surgical patients still suffer from varying degrees of acute postoperative pain (POP) sometime after surgery (i.e., POP score > 0), others experience none (i.e., POP score = 0), indicating the existence of two distinct data processes at play. Existing parametric or semi-parametric two-part modeling methods for this type of semi-continuous data can fail to appropriately model the two underlying data processes as such methods rely heavily on (generalized) linear additive assumptions. However, many factors may interact to jointly influence the experience of POP non-additively and non-linearly. Motivated by this challenge and inspired by the flexibility of deep neural networks (DNN) to accurately approximate complex functions universally, we derive a DNN-based two-part model by adapting the conventional DNN methods with two additional components: a bootstrapping procedure along with a filtering algorithm to boost the stability of the conventional DNN, an approach we denote as sDNN. To improve the interpretability and transparency of sDNN, we further derive a feature importance testing procedure to identify important features associated with the outcome measurements of the two data processes, denoting this approach fsDNN. We show that fsDNN not only offers a statistical inference procedure for each feature under complex association but also that using the identified features can further improve the predictive performance of sDNN. The proposed sDNN- and fsDNN-based two-part models are applied to the analysis of real data from a POP study, in which application they clearly demonstrate advantages over the existing parametric and semi-parametric two-part models. 
Further, we conduct extensive numerical studies and draw comparisons with other machine learning methods to demonstrate that sDNN and fsDNN consistently outperform the existing two-part models and frequently used machine learning methods regardless of the data complexity. An R package implementing the proposed methods is available in the Supplementary Material (Zou et al., 2025) and on GitHub (https://github.com/BZou-lab/fsDNN).
This article focuses on a multi-modal imaging data application where structural/anatomical information from gray matter (GM) and brain connectivity information in the form of a brain connectome network from functional magnetic resonance imaging (fMRI) are available for a number of subjects with different degrees of primary progressive aphasia (PPA), a neurodegenerative disorder (ND) whose severity is measured through a speech rate measure of motor speech loss. The clinical/scientific goal in this study is to identify brain regions of interest significantly related to the speech rate measure to gain insight into ND patterns. Viewing the brain connectome network and GM images as objects, we develop an integrated object response regression framework of network and GM images on the speech rate measure. A novel integrated prior formulation is proposed on network and structural image coefficients in order to exploit network information of the brain connectome while leveraging the interconnections between the two objects. The principled Bayesian framework allows the characterization of uncertainty in ascertaining whether a region is actively related to the speech rate measure. Our framework yields new insights into the relationship of brain regions associated with PPA, offering a deeper understanding of neurodegenerative patterns of PPA. The supplementary file adds details about posterior computation and additional empirical results.
Accurate estimation and forecasting of neonatal mortality rates (NMRs) in low- and middle-income countries is an urgent problem. Much of child mortality is preventable, and understanding temporal trends is of great interest when evaluating past performance and planning future policy or programming. In countries without robust vital registration, we rely on modeled estimates based on survey data to understand trends. A toolkit of compelling temporal models exists, but these methods have not been comprehensively evaluated for estimating the NMR in low- and middle-income countries using household survey data. Using Demographic and Health Surveys (DHS) and Multiple Indicator Cluster Surveys (MICS) data from 41 countries in sub-Saharan Africa, we estimate and forecast the national-level NMR for 1970-2030 separately with random walk, auto-regressive, penalized spline, natural spline, and logit-linear latent temporal models. We examine the statistical behavior of these temporal models with both an out-of-sample analysis using the DHS and MICS data and a simulation study. We find that the second-order random walk and the penalized spline have the least bias, and short-term forecasts from the penalized spline tend to have narrower intervals with better out-of-sample performance. From the analysis of the NMR in sub-Saharan Africa, we estimate that 6 or fewer of the 41 countries included are on track to achieve the Sustainable Development Goals target of 12 neonatal deaths per 1000 live births by 2030.
Cancer screening facilitates the early detection of cancer, at a stage when treatment is often most effective. However, it also brings the risk of over-diagnosis, where a diagnosis made through screening would not have led to symptoms or death during the patient's lifetime. In this paper, we tackle a significant unresolved issue in the evaluation of screening efficacy: selecting primary endpoints and inferential procedures that efficiently consider potential overdiagnosis in screening trials. This is motivated by the necessity to design and analyze a phase IV Early Detection Initiative (EDI) trial for evaluating a pancreatic cancer screening strategy. We introduce two novel approaches for assessing screening efficacy, grounded on cancer stage-shift. These methods address potential overdiagnosis by: i) borrowing information about clinical diagnosis from the control arm that hasn't undergone screening (the BR approach), and ii) performing sensitivity analysis, contingent upon a conservative bound of the overdiagnosis magnitude (the SEN-T approach). Analytical methods and extensive simulation studies underscore the superiority of our proposed methods, demonstrating enhanced efficiency in estimating and testing screening efficacy compared to existing methods. The latter either overlook overdiagnosis or adhere to a valid, yet conservative, cumulative incidence endpoint. We illustrate the practical application of these approaches using ovarian cancer data from the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial. The results affirm that our methods bolster an efficient and robust study design for cancer screening trials.
Sepsis is a life-threatening condition affecting millions of individuals in the U.S. each year. The complexity of sepsis clinical management makes individualized treatment approaches desirable. The University of Pittsburgh Medical Center (UPMC) has collected electronic health records data of sepsis patients from multiple hospitals. The goal of this study is to derive individualized decision rules (IDRs) that could be safely applied to and uniformly improve decision-making across hospitals in the UPMC Health System by only using a subset of hospitals for training. Traditional approaches assume that data are sampled from a single population of interest. With multiple hospitals that vary in patient populations, treatments, and provider teams, an IDR that is successful in one hospital may not be as effective in another, and the performance achieved by a globally optimal IDR may vary greatly across hospitals, preventing it from being safely applied to unseen hospitals. To address these challenges as well as the practical restriction of data sharing across hospitals, we introduce a new objective function and a federated learning algorithm for learning IDRs that are robust to distributional uncertainty from heterogeneous data. The proposed framework uses a conditional maximin objective to enhance individual outcomes across hospitals, ensuring robustness against hospital-level variations. Compared to the traditional approach, the proposed method enhances the survival rate by 10 percentage points among patients who may experience extreme adverse outcomes across hospitals. Additionally, it increases the overall survival rate by two to three percentage points when the learned IDR is applied to unseen hospital populations.
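The contrast between a rule that is best on average and a rule that is robust across hospitals can be sketched with a simplified, unconditional maximin selection over a small table of values. This is an illustration of the general maximin idea only, not the conditional maximin objective or the federated algorithm proposed in the paper. The value table and function names are hypothetical.

```python
import numpy as np

# Hypothetical estimated values (e.g., survival rates) of three candidate
# decision rules (rows) evaluated in three hospitals (columns).
values = np.array([
    [0.80, 0.60, 0.85],
    [0.72, 0.70, 0.74],
    [0.90, 0.50, 0.88],
])


def average_choice(values):
    """Index of the rule with the best value averaged over hospitals."""
    return int(np.argmax(values.mean(axis=1)))


def maximin_choice(values):
    """Index of the rule whose worst-case hospital value is largest."""
    worst_case = values.min(axis=1)
    return int(np.argmax(worst_case))


print(average_choice(values), maximin_choice(values))
```

In this made-up table the rule that is best on average performs worst in one hospital, while the maximin rule gives up some average value for a better worst case.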
Risk of suicide attempt varies over time. Understanding the importance of risk factors measured at a mental health visit can help clinicians evaluate future risk and provide appropriate care during the visit. In prediction settings where data are collected over time, such as in mental health care, it is often of interest to understand both the importance of variables for predicting the response at each time point and the importance summarized over the time series. Building on recent advances in estimation and inference for variable importance measures, we define summaries of variable importance trajectories and corresponding estimators. The same approaches for inference can be applied to these measures regardless of the choice of the algorithm(s) used to estimate the prediction function. We propose a nonparametric efficient estimation and inference procedure as well as a null hypothesis testing procedure that are valid even when complex machine learning tools are used for prediction. Through simulations, we demonstrate that our proposed procedures have good operating characteristics. We use these approaches to analyze electronic health records data from two large health systems to investigate the longitudinal importance of risk factors for suicide attempt to inform future suicide prevention research and clinical workflow.
In this work we study the lifetime Medicare spending patterns of patients with end-stage renal disease (ESRD). We extract the information of patients who started their ESRD services in 2007-2011 from the United States Renal Data System (USRDS). Patients are partitioned into three groups based on their kidney transplant status: 1-unwaitlisted and never transplanted, 2-waitlisted but never transplanted, and 3-waitlisted and then transplanted. To study their Medicare cost trajectories, we use a semiparametric regression model with both fixed and bivariate time-varying coefficients to compare groups 1 and 2, and a bivariate time-varying coefficient model with different starting times (time since the first ESRD service and time since the kidney transplant) to compare groups 2 and 3. In addition to demographics and other medical conditions, these regression models are conditional on the survival time, which ideally depict the lifetime Medicare spending patterns. For estimation, we extend the profile weighted least squares (PWLS) estimator to longitudinal data for the first comparison and propose a two-stage estimating method for the second comparison. We use sandwich variance estimators to construct confidence intervals and validate inference procedures through simulations. Our analysis of the Medicare claims data reveals that waitlisting is associated with a lower daily medical cost at the beginning of ESRD service among waitlisted patients which gradually increases over time. Averaging over lifespan, however, there is no difference between waitlisted and unwaitlisted groups. A kidney transplant, on the other hand, reduces the medical cost significantly after an initial spike.
Sepsis is a life-threatening condition caused by a dysregulated host response to infection. Recently, researchers have hypothesized that sepsis consists of a heterogeneous spectrum of distinct subtypes, motivating several studies to identify clusters of sepsis patients that correspond to subtypes, with the long-term goal of using these clusters to design subtype-specific treatments. Therefore, clinicians rely on clusters having a concrete medical interpretation, usually corresponding to clinically meaningful regions of the sample space that have a concrete implication to practitioners. In this article, we propose Clustering Around Meaningful Regions (CLAMR), a Bayesian clustering approach that explicitly models the medical interpretation of each cluster center. CLAMR favors clusterings that can be summarized via meaningful feature values, leading to medically significant sepsis patient clusters. We also provide details on measuring the effect of each feature on the clustering using Bayesian hypothesis tests, so one can assess what features are relevant for cluster interpretation. Our focus is on clustering sepsis patients from Moshi, Tanzania, where patients are younger and the prevalence of HIV infection is higher than in previous sepsis subtyping cohorts.
The bulk of causal inference studies rule out the presence of interference between units. However, in many real-world scenarios, units are interconnected by social, physical, or virtual ties, and the effect of the treatment can spill from one unit to other connected individuals in the network. In this paper, we develop a machine learning method that uses tree-based algorithms and a Horvitz-Thompson estimator to assess the heterogeneity of treatment and spillover effects with respect to individual, neighborhood, and network characteristics in the context of clustered networks and interference within clusters. The proposed network causal tree (NCT) algorithm has several advantages. First, it allows the investigation of the heterogeneity of the treatment effect, avoiding potential bias due to the presence of interference. Second, understanding the heterogeneity of both treatment and spillover effects can guide policymakers in scaling up interventions, designing targeting strategies, and increasing cost-effectiveness. We investigate the performance of our NCT method using a Monte Carlo simulation study and illustrate its application to assess the heterogeneous effects of information sessions on the uptake of a new weather insurance policy in rural China.
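As a minimal sketch of the Horvitz-Thompson building block referred to above, the function below estimates the mean outcome under one exposure condition by inverse-probability weighting, and an effect is then a difference of two such means. It does not reproduce the NCT algorithm, its tree-building step, or its handling of neighborhood exposure. The function name, toy data, and known probabilities are assumptions.

```python
import numpy as np


def horvitz_thompson_mean(y, in_condition, prob):
    """Horvitz-Thompson estimate of the mean outcome under one exposure condition.

    y            : observed outcomes for all N units
    in_condition : True where a unit was observed under the condition of interest
    prob         : each unit's known probability of being under that condition
    """
    y = np.asarray(y, dtype=float)
    in_condition = np.asarray(in_condition, dtype=bool)
    prob = np.asarray(prob, dtype=float)
    if np.any(prob[in_condition] <= 0):
        raise ValueError("units observed under the condition need positive probability")
    weights = np.zeros_like(y)
    weights[in_condition] = 1.0 / prob[in_condition]
    return float(np.sum(weights * y) / len(y))


# Toy data (hypothetical): four units, each treated with probability 0.5.
y = [1.0, 2.0, 3.0, 4.0]
treated = np.array([True, False, True, False])
p_treated = np.full(4, 0.5)

mean_treated = horvitz_thompson_mean(y, treated, p_treated)
mean_control = horvitz_thompson_mean(y, ~treated, 1.0 - p_treated)
effect = mean_treated - mean_control
print(mean_treated, mean_control, effect)
```

With interference, the condition of interest would be a joint individual and neighborhood exposure, and the probabilities would be those of that joint exposure.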
Questionnaires are among the oldest and most widely used instruments in practice to measure variables relevant to traits of interest that cannot be easily measured by physical devices, for example, depression. In many clinical settings, the scope of an existing questionnaire is often unfit to apply to a new study population, whose underlying characteristics are different from those of the original population used for the questionnaire's development and/or validation. Motivated by a cohort study of elderly asthma patients, we aim to examine associations between clinical outcomes and quality of life (QoL) measured by a QoL questionnaire. To increase comparability, we consider a supervised learning method to identify a subset of questions whose summary score is strongly associated with a specific clinical outcome under investigation. The resultant set of selected items gives an optimal summary metric of the questionnaire, which improves both statistical power and clinical interpretation. Our item extraction procedure is built upon the best subset algorithm implemented by a mixed integer programming, which enjoys both theoretical guarantee of selection consistency and flexibility of handling nonresponse missing data. Moreover, estimation uncertainty is analyzed by the means of noise perturbation. Our methodology is first evaluated by extensive simulation studies with comparisons to existing methods and then applied to derive tailored QoL scores adaptive to two clinical outcomes of lung function measure (FEV1) and asthma control test (ACT), respectively, among elderly people with persistent asthma.
In longitudinal studies, investigators are often interested in understanding how the time since the occurrence of an intermediate event affects a future outcome. The intermediate event is often asymptomatic such that its occurrence is only known to lie in a time interval induced by periodic examinations. We propose a linear regression model that relates the time since the occurrence of the intermediate event to a continuous response at a future time point through a rectified linear unit activation function while formulating the distribution of the time to the occurrence of the intermediate event through the Cox proportional hazards model. We consider nonparametric maximum likelihood estimation with an arbitrary sequence of examination times for each subject. We present an EM algorithm that converges stably for arbitrary datasets. The resulting estimators of regression parameters are consistent, asymptotically normal, and asymptotically efficient. We assess the performance of the proposed methods through extensive simulation studies and provide an application to the Atherosclerosis Risk in Communities Study.
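The role of the rectified linear unit in such a model can be sketched as follows: the response mean depends on the time elapsed since the intermediate event and is unaffected if the event has not yet occurred. The parameter names and values are hypothetical, and the sketch treats the event time as known, whereas in the setting described it is interval-censored.

```python
def relu(x: float) -> float:
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def mean_response(intercept: float, slope: float,
                  measure_time: float, event_time: float) -> float:
    """Mean response as a linear function of the time since the event.

    Assumes the event time is known (a simplification for illustration).
    """
    return intercept + slope * relu(measure_time - event_time)


# Event at year 6, response measured at year 10: four years of exposure.
print(mean_response(1.0, 0.5, 10.0, 6.0))
# Response measured at year 4, before the event: no contribution.
print(mean_response(1.0, 0.5, 4.0, 6.0))
```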
In recent years, longitudinal, multi-site imaging studies have emerged as key tools for investigating brain function. These studies follow a large number of participants for an extended period, offering exciting opportunities to uncover brain functional network changes over time as a function of clinical and demographic covariates. However, these studies also introduce many statistical challenges, such as site effects and the heterogeneous nature of network differences between subjects. Robust statistical methods are needed to address these issues, but to date there has been little methodological development addressing these problems in the context of data-driven brain network estimation. This work addresses this gap in the literature, introducing a general Bayesian framework, REMBRAiNDT, incorporating site and subject effects into the network decomposition, while also enabling covariate effect estimation and efficient information pooling across brain locations. We use our procedure to conduct a novel analysis of neurodevelopment among adolescents in the longitudinal, multi-site ABCD study. We find extensive evidence of increasing functional integration with age in networks associated with higher order cognitive processes. Our study is one of the first to examine neurodevelopment using blind source separation in the longitudinal ABCD study data, and the findings enrich earlier cross-sectional results on neurodevelopment.
Disease progression prediction based on patients' evolving health information is challenging when true disease states are unknown due to diagnostic capabilities or high costs. For example, the absence of gold-standard neurological diagnoses hinders distinguishing Alzheimer's disease (AD) from related conditions such as AD-related dementias (ADRDs), including Lewy body dementia (LBD). Combining temporally dependent surrogate labels and health markers may improve disease prediction. However, existing literature models informative surrogate labels and observed variables that reflect the underlying states using purely generative approaches, often posing unrealistic assumptions on the outcomes and suffering from misspecification thereof. We propose integrating the conventional hidden Markov model as a generative model with a time-varying discriminative classification model to simultaneously handle potentially misspecified surrogate labels and incorporate important markers of disease progression. We develop an adaptive forward-backward algorithm with subjective labels for estimation, and utilize the modified posterior and Viterbi algorithms to predict the progression of future states or new patients based on objective markers only. Importantly, the adaptation eliminates the need to model the marginal distribution of longitudinal markers, a requirement in traditional algorithms. Asymptotic properties are established, and significant improvements in finite samples are demonstrated via simulation studies. Analysis of the neuropathological dataset of the National Alzheimer's Coordinating Center (NACC) shows much improved accuracy in distinguishing LBD from AD.
Interval-censoring frequently occurs in studies of chronic diseases where disease status is inferred from intermittently collected biomarkers. Although many methods have been developed to analyze such data, they typically assume perfect disease diagnosis, which often does not hold in practice due to the inherent imperfect clinical diagnosis of cognitive functions or measurement errors of biomarkers such as cerebrospinal fluid. In this work, we introduce a semiparametric modeling framework using the Cox proportional hazards model to address interval-censored data in the presence of inaccurate disease diagnosis. Our model incorporates sensitivity and specificity of the diagnosis to account for uncertainty in whether the interval truly contains the disease onset. Furthermore, the framework accommodates scenarios involving a terminal event and when diagnosis is accurate, such as through postmortem analysis. We propose a nonparametric maximum likelihood estimation method for inference and develop an efficient EM algorithm to ensure computational feasibility. The regression coefficient estimators are shown to be asymptotically normal, achieving semiparametric efficiency bounds. We further validate our approach through extensive simulation studies and an application assessing Alzheimer's disease (AD) risk. We find that amyloid-beta is significantly associated with AD, but Tau is predictive of both AD and mortality.
Education is a key driver of social and economic mobility, yet disparities in attainment persist, particularly in low- and middle-income countries (LMICs). Existing indicators, such as mean years of schooling for adults aged 25 and older (MYS25) and expected years of schooling (EYS), offer a snapshot of an educational system, but lack either cohort-specific or temporal granularity. To address these limitations, we introduce the ultimate years of schooling (UYS), a birth cohort-based metric targeting the final educational attainment of any individual cohort, including those with ongoing schooling trajectories. As with many attainment indicators, we propose to estimate UYS with cross-sectional household surveys. However, for younger cohorts, naive estimation fails because individuals who are still in school are right-censored, leading to severe downward bias. To correct for this, we propose to re-frame educational attainment as a time-to-event process and deploy discrete-time survival models that explicitly account for censoring in the observations. At the national level, we estimate the parameters of the model using survey-weighted logistic regression, while for finer spatial resolutions, where sample sizes are smaller, we embed the discrete-time survival model within a Bayesian spatiotemporal framework to improve stability and precision. Applying our proposed methods to data from the 2022 Tanzania Demographic and Health Surveys, we estimate female educational trajectories corrected for censoring biases, and reveal substantial subnational disparities. By providing a dynamic, bias-corrected, and spatially disaggregated measure, our approach enhances education monitoring; it equips policymakers and researchers with a more precise tool for monitoring current progress towards education goals, and for designing future targeted policy interventions in LMICs.
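The censoring correction can be sketched with unweighted life-table hazards, which correspond to a saturated discrete-time model with one parameter per grade. This is a simplified illustration, not the survey-weighted logistic regression or the Bayesian spatiotemporal model described above. The toy data, function names, and the convention that a still-enrolled person contributes only to risk sets below their current completed grade are assumptions.

```python
import numpy as np


def discrete_hazards(years, finished, max_grade):
    """Life-table hazard of leaving school after completing exactly g years.

    years    : completed years of schooling observed at the survey
    finished : True if schooling has ended, False if still enrolled (censored)
    A censored person counts as at risk only for grades below their current one.
    """
    years = np.asarray(years)
    finished = np.asarray(finished, dtype=bool)
    hazards = np.zeros(max_grade + 1)
    for g in range(max_grade + 1):
        at_risk = np.sum((finished & (years >= g)) | (~finished & (years > g)))
        events = np.sum(finished & (years == g))
        hazards[g] = events / at_risk if at_risk > 0 else 0.0
    return hazards


def expected_years(hazards):
    """Expected final years of schooling: sum over g of P(final years > g).

    The sum is truncated at the last grade in `hazards`.
    """
    survival = np.cumprod(1.0 - hazards)
    return float(np.sum(survival))


# Toy cohort (hypothetical): the last person has completed 2 years and is still enrolled.
years = [1, 2, 2, 3, 3, 2]
finished = [True, True, True, True, True, False]

hazards = discrete_hazards(years, finished, max_grade=3)
corrected = expected_years(hazards)
naive = float(np.mean(years))
print(naive, corrected)
```

In this toy cohort the naive mean of observed years is below the censoring-adjusted estimate, which mirrors the downward bias described for younger cohorts.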
Estimating the joint effect of a multivariate, continuous exposure is crucial, particularly in environmental health where interest lies in simultaneously evaluating the impact of multiple environmental pollutants on health. We develop novel methodology that addresses two key issues for estimation of treatment effects of multivariate, continuous exposures. First, we use flexible nonparametric Bayesian methodology so that our approach can capture a wide range of data generating processes. Second, we allow the effect of the exposures to be heterogeneous with respect to covariates. Treatment effect heterogeneity has not been well explored in the causal inference literature for multivariate, continuous exposures; we therefore introduce novel estimands that summarize the nature and extent of the heterogeneity and propose estimation procedures for these estimands. We provide theoretical support for the proposed models in the form of posterior contraction rates and show that the approach works well in simulated examples both with and without heterogeneity. Our approach is motivated by a study of the health effects of simultaneous exposure to the components of PM2.5, where we find that the negative health effects of exposure to environmental pollutants are exacerbated by low socioeconomic status, race and age.