Grounding large language models (LLMs) in external knowledge sources is a promising method for faithful prediction. While existing grounding approaches work well for simple queries, many real-world information needs require synthesizing multiple pieces of evidence. We introduce "integrative grounding" -- the challenge of retrieving and verifying multiple inter-dependent pieces of evidence to support a hypothesis query. To systematically study this problem, we repurpose data from four domains for evaluating integrative grounding capabilities. Our investigation reveals two critical findings: First, in groundedness verification, while LLMs are robust to redundant evidence, they tend to rationalize using internal knowledge when information is incomplete. Second, in examining retrieval planning strategies, we find that undirected planning can degrade performance through noise introduction, while premise abduction emerges as a promising approach due to its logical constraints. Additionally, LLMs' zero-shot self-reflection capabilities consistently improve grounding quality. These insights provide valuable direction for developing more effective integrative grounding systems.
Time introduces fundamental challenges in model development and deployment: models are usually trained on historical data while deployed on future data where semantic distributions and domain knowledge may evolve. Unfortunately, existing studies either overlook temporal shifts or barely capture the rich shifting patterns of both semantics and knowledge. We develop Knowledge-driven Augmentation and Retrieval for Integrative Temporal Adaptation (KARITA) to capture diverse temporal shifts (e.g., uncertainty and feature shift), construct and integrate rich knowledge sources (e.g., medical ontologies such as MeSH), and leverage shifting insights for selecting-retrieval augmented learning. We evaluate KARITA on classification tasks over clinical, legal, and scientific corpora, demonstrating consistent improvements from temporal adaptation across these domains. Our results show that knowledge integration can be more critical and effective in temporal augmentation and learning.
Integrative modeling of macromolecular assemblies allows for structural characterization of large assemblies that are recalcitrant to direct experimental observation. A Bayesian inference approach facilitates combining data from complementary experiments along with physical principles, statistics of known structures, and prior models, for structure determination. Here, we review recent methods for integrative modeling based on statistical inference and machine learning. These methods improve over the current state-of-the-art by enhancing the data collection, optimizing coarse-grained model representations, making scoring functions more accurate, sampling more efficient, and model analysis more rigorous. We also discuss three new frontiers in integrative modeling: incorporating recent deep learning-based methods, integrative modeling with in situ data, and metamodeling.
Distributed learning offers a practical solution for the integrative analysis of multi-source datasets, especially under privacy or communication constraints. However, addressing prospective distributional heterogeneity and ensuring communication efficiency pose significant challenges for distributed statistical analysis. In this article, we focus on integrative estimation of distributed heterogeneous precision matrices, a crucial task related to joint precision matrix estimation where computation-efficient algorithms and statistical optimality theories are still underdeveloped. To tackle these challenges, we introduce a novel HEterogeneity-adjusted Aggregating and Thresholding (HEAT) approach for distributed integrative estimation. HEAT is designed to be both communication- and computation-efficient, and we demonstrate its statistical optimality by establishing the convergence rates and the corresponding minimax lower bounds under various integrative losses. To enhance the optimality of HEAT, we further propose an iterative HEAT (IteHEAT) approach. By iteratively refining the higher-order errors of HEAT estimators through multi-round communications, IteHEAT achieves geometric contr
Understanding the biological and behavioral heterogeneity underlying psychiatric disorders is critical for advancing precision diagnosis, treatment, and prevention. This paper addresses the scientific question of how multimodal data, spanning clinical, cognitive, and neuroimaging measures, can be integrated to identify biologically meaningful subtypes of mental disorders. We introduce Mixed INtegrative Data Subtyping (MINDS), a Bayesian hierarchical model designed to jointly analyze mixed-type data for simultaneous dimension reduction and clustering. Using data from the Adolescent Brain Cognitive Development (ABCD) Study, MINDS integrates clinical symptoms, cognitive performance, and brain structure measures to subtype Attention-Deficit/Hyperactivity Disorder (ADHD) and Obsessive-Compulsive Disorder (OCD). Our method leverages Polya-Gamma augmentation for computational efficiency and robust inference. Simulations demonstrate improved stability and accuracy compared to existing clustering approaches. Application to the ABCD data reveals clinically interpretable subtypes of ADHD and OCD with distinct cognitive and neurodevelopmental profiles. These findings show how integrative multi
In this study, we focus on estimating the heterogeneous treatment effect (HTE) for survival outcomes. The outcome is subject to censoring, and the covariates are high-dimensional. We utilize data from both the randomized controlled trial (RCT), considered as the gold standard, and real-world data (RWD), possibly affected by hidden confounding factors. To achieve a more efficient HTE estimate, such integrative analysis requires great insight into the data generation mechanism, particularly the accurate characterization of unmeasured confounding effects/bias. With this aim, we propose a penalized-regression-based integrative approach that allows for the simultaneous estimation of parameters, selection of variables, and identification of the existence of unmeasured confounding effects. The consistency, asymptotic normality, and efficiency gains are rigorously established for the proposed estimate. Finally, we apply the proposed method to estimate the HTE of lobar/sublobar resection on the survival of lung cancer patients. The RCT is a multicenter non-inferiority randomized phase 3 trial, and the RWD comes from a clinical oncology cancer registry in the United States. The analys
Resilience in coupled systems is increasingly critical in addressing global challenges such as climate change and pandemics. These systems show unpredictable behaviour due to dynamic complexity and deep uncertainty across spatiotemporal scales. Despite growing interest, few studies systematically integrate both concepts when assessing resilience. This paper conducts an integrative review of 102 English-language publications to identify gaps in current approaches. Findings reveal that most papers address lower levels of uncertainty and rarely consider dynamic complexity and deep uncertainty simultaneously, which limits the effectiveness of resilience strategies. To advance systems research, we propose a conceptual framework and practical tools to support researchers and decision-makers in evaluating and improving resilience. The paper also outlines future research directions for more robust, adaptive, and integrative resilience assessments.
Developing computational tools for integrative analysis across multiple types of omics data has been of immense importance in cancer molecular biology and precision medicine research. While recent advancements have yielded integrative prediction solutions for multi-omics data, these methods lack a comprehensive and cohesive understanding of the rationale behind their specific predictions. To shed light on personalized medicine and unravel previously unknown characteristics within integrative analysis of multi-omics data, we introduce a novel integrative neural network approach for cancer molecular subtype and biomedical classification applications, named Integrative Graph Convolutional Networks (IGCN). IGCN can identify which types of omics receive more emphasis for each patient to predict a certain class. Additionally, IGCN has the capability to pinpoint significant biomarkers from a range of omics data types. To demonstrate the superiority of IGCN, we compare its performance with other state-of-the-art approaches across different cancer subtype and biomedical classification tasks.
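The abstract above does not spell out IGCN's architecture, so the following is only a minimal sketch of the generic graph-convolution step that graph convolutional networks build on, written with NumPy. The propagation rule shown (symmetric normalization with self-loops followed by ReLU) is the commonly used one and is an assumption here, as are the function name `gcn_layer`, the toy sample-similarity graph, and all array shapes; none of them come from the IGCN paper.

```python
import numpy as np


def gcn_layer(adjacency, features, weights):
    """One generic graph-convolution step: ReLU(D^{-1/2} (A + I) D^{-1/2} X W).

    Illustrative only; not the IGCN architecture.
    """
    # Add self-loops so each node keeps its own features.
    a_hat = adjacency + np.eye(adjacency.shape[0])
    # Node degrees, counting the self-loop.
    degree = a_hat.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # Symmetric normalization of the adjacency matrix.
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Aggregate neighbor features, project them, and apply ReLU.
    return np.maximum(a_norm @ features @ weights, 0.0)


# Hypothetical toy input: 4 samples connected in a chain, 3 features each,
# projected to 2 hidden units.
rng = np.random.default_rng(0)
adjacency = np.array(
    [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]],
    dtype=float,
)
features = rng.normal(size=(4, 3))
weights = rng.normal(size=(3, 2))
hidden = gcn_layer(adjacency, features, weights)
```

In a multi-omics setting one might run such a layer once per omics type and then combine the per-type outputs, but how IGCN weights each omics type per patient is not described in the abstract and is not modeled here.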
Glycans are structurally diverse and flexible biomolecules that play key roles in many biological processes. Their conformational variability makes the modeling of their interactions with proteins particularly challenging. This chapter presents a step-by-step protocol for modeling protein-glycan interactions using HADDOCK3, an integrative modeling platform that supports the inclusion of experimental or predicted interaction restraints and allows for flexible refinement of the solutions. The workflow is illustrated using the interaction between a linear homopolymer glycan, 4-beta-glucopyranose, and the catalytic domain of the Humicola grisea Cel12A enzyme, for which an experimental X-ray structure is available as a reference. Detailed instructions are provided for input structure preparation, restraint definition, docking setup, execution, and result analysis. Application of the protocol starting from unbound structures yields models of acceptable to medium quality, with interface-ligand RMSD values below 3 angstroms. Although illustrated on a specific system, the protocol has been optimized and benchmarked on multiple protein-glycan complexes and is broadly applicable to similar sy
In the era of big data, secondary outcomes have become increasingly important alongside primary outcomes. These secondary outcomes, which can be derived from traditional endpoints in clinical trials, compound measures, or risk prediction scores, hold the potential to enhance the analysis of primary outcomes. Our method is motivated by the challenge of utilizing multiple secondary outcomes, such as blood biochemistry markers and urine assays, to improve the analysis of the primary outcome related to liver health. Current integration methods often fall short, as they impose strong model assumptions or require prior knowledge to construct over-identified working functions. This paper addresses these statistical challenges and potentially opens a new avenue in data integration by introducing a novel integrative learning framework that is applicable in a general setting. The proposed framework allows for the robust, data-driven integration of information from multiple secondary outcomes, promotes the development of efficient learning algorithms, and ensures optimal use of available data. Extensive simulation studies demonstrate that the proposed method significantly reduces variance in
Integrating high-dimensional, heterogeneous data from multi-site cohort studies with complex hierarchical structures poses significant feature selection and prediction challenges. We extend the Bayesian Integrative Analysis and Prediction (BIP) framework to enable simultaneous feature selection and outcome modeling in data with a nested hierarchical structure. We apply the proposed Bayesian Integrative Mixed Modeling (BIPmixed) framework to the Adolescent Brain Cognitive Development (ABCD) Study, leveraging multi-view data, including structural and functional MRI and early life adversity (ELA) metrics, to identify relevant features and predict the behavioral outcome. BIPmixed incorporates 2-level nested random effects to enhance interpretability and make predictions in hierarchical data settings. Simulation studies illustrate BIPmixed's robustness in distinct random effect settings, highlighting its use for complex study designs. Our findings suggest that BIPmixed effectively integrates multi-view data while accounting for nested sampling, making it a valuable tool for analyzing large-scale studies with hierarchical data.
This paper proposes the context-driven Critical Integrative Levels (CIL), a novel approach to lighting asset management in public libraries that aligns with the transformative vision of human-centric and integrative lighting. This approach encompasses not only the visual aspects of lighting performance but also prioritizes the physiological and psychological well-being of library users. Incorporating a newly defined metric, Mean Time of Exposure (MTOE), the approach quantifies user-light interaction, enabling tailored lighting strategies that respond to diverse activities and needs in library spaces. Case studies demonstrate how the CIL matrix can be practically applied, offering significant improvements over conventional methods by focusing on optimized user experiences from both visual impacts and non-visual effects.
With the growing demand for interpretable deep learning models, this paper introduces Integrative CAM, an advanced Class Activation Mapping (CAM) technique aimed at providing a holistic view of feature importance across Convolutional Neural Networks (CNNs). Traditional gradient-based CAM methods, such as Grad-CAM and Grad-CAM++, primarily use final layer activations to highlight regions of interest, often neglecting critical features derived from intermediate layers. Integrative CAM addresses this limitation by fusing insights across all network layers, leveraging both gradient and activation scores to adaptively weight layer contributions, thus yielding a comprehensive interpretation of the model's internal representation. Our approach includes a novel bias term in the saliency map calculation, a factor frequently omitted in existing CAM techniques, but essential for capturing a more complete feature importance landscape, as modern CNNs rely on both weighted activations and biases to make predictions. Additionally, we generalize the alpha term from Grad-CAM++ to apply to any smooth function, expanding CAM applicability across a wider range of models. Through extensive experiments
Intrinsically disordered proteins and regions are increasingly appreciated for their abundance in the proteome and the many functional roles they play in the cell. In this short review, we describe a variety of approaches used to obtain biological insight from the structural ensembles of disordered proteins, regions, and complexes and the integrative biology challenges that arise from combining diverse experiments and computational models. Importantly, we highlight findings regarding structural and dynamic characterization of disordered regions involved in binding and phase separation, as well as drug targeting of disordered regions, using a broad framework of integrative modeling approaches.
Integrative modeling enables structure determination for large macromolecular assemblies by combining experimental data from multiple sources with theoretical and computational predictions. Recent advancements in AI-based structure prediction and electron cryo-microscopy have sparked renewed enthusiasm for integrative modeling; structures from AI-based methods can be integrated with in situ maps to characterize large assemblies. This approach previously allowed us and others to determine the architectures of diverse macromolecular assemblies, such as nuclear pore complexes, chromatin remodelers, and cell-cell junctions. Experimental data spanning several scales were used in these studies, ranging from high-resolution data, such as X-ray crystallography and AlphaFold structures, to low-resolution data, such as cryo-electron tomography maps and data from co-immunoprecipitation experiments. Two recurrent modeling challenges emerged across a range of studies. First, modeling disordered regions, which constituted a significant portion of these assemblies, necessitated the development of new methods. Second, methods needed to be developed to utilize the information from cryo-electro
Artificial General Intelligence (AGI) requires comprehensive understanding and generation capabilities for a variety of tasks spanning different modalities and functionalities. Integrative AI is one important direction to approach AGI, through combining multiple models to tackle complex multimodal tasks. However, there is a lack of a flexible and composable platform to facilitate efficient and effective model composition and coordination. In this paper, we propose the i-Code Studio, a configurable and composable framework for Integrative AI. The i-Code Studio orchestrates multiple pre-trained models in a finetuning-free fashion to conduct complex multimodal tasks. Instead of simple model composition, the i-Code Studio provides an integrative, flexible, and composable setting for developers to quickly and easily compose cutting-edge services and technologies tailored to their specific requirements. The i-Code Studio achieves impressive results on a variety of zero-shot multimodal tasks, such as video-to-text retrieval, speech-to-speech translation, and visual question answering. We also demonstrate how to quickly build a multimodal agent based on the i-Code Studio that can communica
Political biases in Large Language Model (LLM)-based artificial intelligence (AI) systems, such as OpenAI's ChatGPT or Google's Gemini, have been previously reported. While several prior studies have attempted to quantify these biases using political orientation tests, such approaches are limited by the tests' potential calibration biases and constrained response formats that do not reflect real-world human-AI interactions. This study employs a multi-method approach to assess political bias in leading AI systems, integrating four complementary methodologies: (1) linguistic comparison of AI-generated text with the language used by Republican and Democratic U.S. Congress members, (2) analysis of political viewpoints embedded in AI-generated policy recommendations, (3) sentiment analysis of AI-generated text toward politically affiliated public figures, and (4) standardized political orientation testing. Results indicate a consistent left-leaning bias across most contemporary AI systems, with arguably varying degrees of intensity. However, this bias is not an inherent feature of LLMs; prior research demonstrates that fine-tuning with politically skewed data can realign these models across
In recent years, the convergence of cybersecurity, artificial intelligence (AI), and data management has emerged as a critical area of research, driven by the increasing complexity and interdependence of modern technological ecosystems. This paper provides a comprehensive review and analysis of integrative approaches that harness AI techniques to enhance cybersecurity frameworks and optimize data management practices. By exploring the synergies between these domains, we identify key trends, challenges, and future directions that hold the potential to revolutionize the way organizations protect, analyze, and leverage their data. Our findings highlight the necessity of cross-disciplinary strategies that incorporate AI-driven automation, real-time threat detection, and advanced data analytics to build more resilient and adaptive security architectures.
In this paper, we propose a novel negotiation dialogue agent designed for the online marketplace. Our agent is integrative in nature, i.e., it possesses the capability to negotiate on price as well as other factors, such as the addition or removal of items from a deal bundle, thereby offering a more flexible and comprehensive negotiation experience. We create a new dataset called Integrative Negotiation Dataset (IND) to enable this functionality. To create this dataset, we introduce a new semi-automated data creation method, which combines defining negotiation intents, actions, and intent-action simulation between users and the agent to generate potential dialogue flows. Finally, we prompt GPT-J, a state-of-the-art language model, to generate dialogues for a given intent, with a human-in-the-loop process for post-editing and refining minor errors to ensure high data quality. We employ a set of novel rewards, specifically tailored for the negotiation task, to train our Negotiation Agent, termed the Integrative Negotiation Agent (INA). These rewards incentivize the chatbot to learn effective negotiation strategies that can adapt to various contextual requirements an
The COVID-19 pandemic has highlighted the role of online social networks (OSNs) in the spread of infectious diseases. The rise in severity of the epidemic augments the need for proper guidelines, but also promotes the propagation of fake news-items. The popularity of a news-item can reshape public health behaviors and affect the epidemic processes. There is a clear inter-dependency between the epidemic process and the spreading of news-items. This work creates an integrative framework to understand this interplay. We first develop a population-dependent 'saturated branching process' to continually track the propagation of trending news-items on OSNs. A two-time-scale dynamical system is obtained by integrating the news-propagation model with the SIRS epidemic model, to analyze the holistic system. It is observed that a pattern of periodic infections emerges under a linear behavioral influence, which explains the waves of infection and reinfection that we have experienced in the pandemic. We use numerical experiments to corroborate the results and use Twitter and COVID-19 data-sets to recreate the historical infection curve using the integrative model.
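As a point of reference for the SIRS component mentioned in the abstract above, here is a minimal sketch of a plain SIRS model integrated with forward Euler steps. It assumes constant rates and works on population fractions; it does not include the news-propagation process or the behavioral influence that the abstract couples to the epidemic. The function name `simulate_sirs` and all parameter values are hypothetical, chosen only for illustration.

```python
def simulate_sirs(beta, gamma, xi, s0, i0, r0, dt, steps):
    """Forward-Euler integration of the standard SIRS equations.

    dS/dt = -beta * S * I + xi * R
    dI/dt =  beta * S * I - gamma * I
    dR/dt =  gamma * I    - xi * R

    S, I, R are population fractions; beta is the infection rate,
    gamma the recovery rate, and xi the rate of immunity loss.
    """
    s, i, r = s0, i0, r0
    trajectory = [(s, i, r)]
    for _ in range(steps):
        new_infections = beta * s * i
        recoveries = gamma * i
        immunity_loss = xi * r
        # The three flows cancel in the sum, so S + I + R stays constant.
        s += dt * (immunity_loss - new_infections)
        i += dt * (new_infections - recoveries)
        r += dt * (recoveries - immunity_loss)
        trajectory.append((s, i, r))
    return trajectory


# Hypothetical parameter values, not taken from the paper.
trajectory = simulate_sirs(
    beta=0.4, gamma=0.1, xi=0.01,
    s0=0.99, i0=0.01, r0=0.0,
    dt=0.1, steps=5000,
)
final_s, final_i, final_r = trajectory[-1]
```

Because recovered individuals return to the susceptible pool at rate `xi`, reinfection is possible in this model, which is the property that makes SIRS a natural base for studying repeated infection waves. A behavior-dependent `beta` driven by news popularity would be one way to extend the sketch, but that coupling is an assumption about the setup and is not implemented here.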