Found 20 results
Data visualization is a fundamental tool in genomics research, enabling the exploration, interpretation, and communication of complex genomic features. While machine learning models show promise for transforming data into insightful visualizations, current models lack the training foundation for domain-specific tasks. In an effort to provide a foundational resource for genomics-focused model training, we present a framework for generating a dataset that pairs abstract, low-level questions about genomics data with corresponding visualizations. Building on prior work with statistical plots, our approach adapts to the complexity of genomics data and the specialized representations used to depict them. We further incorporate multiple linked queries and visualizations, along with justifications for design choices, figure captions, and image alt-texts for each item in the dataset. We use genomics data retrieved from three distinct genomics data repositories (4DN, ENCODE, Chromoscope) to produce GQVis: a dataset consisting of 1.14 million single-query data points, 628k query pairs, and 589k query chains. The GQVis dataset and generation code are available at https://huggingface.co/dataset
Personalized health AI systems face a fundamental cold-start problem: machine learning models for physiological interpretation require weeks of individual behavioral data before they can distinguish constitutional variation from environmentally driven deviation. We propose a solution grounded in causal inference and Bayesian prior design. An individual's genomic profile serves as an exogenous genetic anchor -- a domain-informed, personalized prior that is fixed at conception, immune to reverse causation, and available before a single behavioral observation is collected. The anchor initializes a Bayesian belief state over an individual's physiological set point G-hat = mu + sum(beta_i * g_i), where beta_i are GWAS-derived effect sizes and g_i are risk-allele counts. Each incoming physiological measurement P produces a non-constitutional deviation delta = P - G-hat that separates the signal attributable to environment and state from the constitutionally fixed baseline. As behavioral data accrue, the prior decays according to G-hat_t = w(t)*G-hat_genomic + [1-w(t)]*P-bar_t, transitioning from genome-dominated to empirical-baseline-dominated inference. The same observed HRV of 55 ms ge
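The formulas in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the abstract does not specify the form of w(t), so the exponential decay with a 14-day half-life below is an assumption, and all function names and numeric inputs are hypothetical.

```python
def genomic_set_point(mu, betas, allele_counts):
    """G-hat = mu + sum(beta_i * g_i), as stated in the abstract."""
    return mu + sum(b * g for b, g in zip(betas, allele_counts))


def prior_weight(t_days, half_life_days=14.0):
    """Hypothetical w(t): decays from 1 toward 0 as data accrue.

    The exponential form and the half-life are assumptions for illustration.
    """
    return 0.5 ** (t_days / half_life_days)


def blended_set_point(g_genomic, p_bar_t, t_days, half_life_days=14.0):
    """G-hat_t = w(t) * G-hat_genomic + [1 - w(t)] * P-bar_t."""
    w = prior_weight(t_days, half_life_days)
    return w * g_genomic + (1.0 - w) * p_bar_t


def deviation(p, g_hat):
    """delta = P - G-hat: the non-constitutional part of a measurement."""
    return p - g_hat


# Illustrative numbers only (not taken from the paper).
g_hat = genomic_set_point(mu=50.0, betas=[2.0, -1.5], allele_counts=[1, 2])
delta = deviation(55.0, g_hat)
g_hat_day14 = blended_set_point(g_hat, p_bar_t=53.0, t_days=14.0)
```

With these made-up inputs, the genomic anchor dominates at t = 0 and the running empirical mean carries half the weight after one assumed half-life.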
Learning analytics has begun to use physiological signals because these have been linked with learners' cognitive and affective states. These signals, when interpreted through machine learning techniques, offer a nuanced understanding of the temporal dynamics of student learning experiences and processes. However, there is a lack of clear guidance on the optimal time window to use for analyzing physiological signals within predictive models. We conducted an empirical investigation of different time windows (ranging from 60 to 210 seconds) when analysing multichannel physiological sensor data for predicting cognitive load. Our results demonstrate a preference for longer time windows, with optimal window length typically exceeding 90 seconds. These findings challenge the conventional focus on immediate physiological responses, suggesting that a broader temporal scope could provide a more comprehensive understanding of cognitive processes. In addition, the variation in which time windows best supported prediction across classifiers underscores the complexity of integrating physiological measures. Our findings provide new insights for developing educational technologies that more accur
Objective: Accurate classification of physiological signals in real-world deployments is challenged by sensor noise, motion artifacts, and distribution shifts between training and deployment data. Inference-time augmentation (ITA), which applies augmentations during inference rather than retraining, offers a simple, model-agnostic mechanism to improve robustness. However, ITA application to physiological signals has remained narrow in scope, relying on limited augmentation methods with fixed, unoptimized parameters. This work proposes a unified ITA framework to address that gap. Approach: The framework incorporates 13 augmentation methods spanning time-domain, amplitude-domain, frequency-domain, and artifact-injection transformations, with hyperparameters optimized via Bayesian optimization. We evaluate on atrial fibrillation (AF) detection from 30-second PPG signals using GPT-PPG and ResNet across five datasets comprising more than 400 patients and approximately 9,800 hours of recording. Main results: Standard ITA consistently improved AUROC (up to 8.5% for GPT-PPG and 0.7% for ResNet) and AUPRC (up to 10.6% for GPT-PPG and 0.8% for ResNet). Selective ITA further reduced average FPR by
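The basic idea of inference-time augmentation can be sketched as follows: the model scores the original signal and several augmented views, and the scores are averaged. This is a generic illustration under stated assumptions, not the paper's framework. The three augmentations, their parameters, and the `model` callable are hypothetical, and the sketch does not cover the Bayesian hyperparameter optimization or the selective variant mentioned in the abstract.

```python
import random
from statistics import mean


def scale(signal, factor=1.1):
    """Amplitude-domain augmentation: multiply every sample by a constant."""
    return [x * factor for x in signal]


def circular_shift(signal, k=3):
    """Time-domain augmentation: rotate the signal by k samples."""
    k %= len(signal)
    return signal[-k:] + signal[:-k] if k else list(signal)


def add_noise(signal, sigma=0.01, seed=0):
    """Artifact-style augmentation: add seeded Gaussian noise."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def predict_with_ita(model, signal, augmentations):
    """Average the model's score over the original and each augmented view.

    `model` is any callable mapping a list of samples to a float score
    (a placeholder for a trained classifier).
    """
    views = [list(signal)] + [aug(signal) for aug in augmentations]
    return mean([model(v) for v in views])


# Toy stand-in for a classifier: the mean amplitude of the window.
toy_model = lambda s: sum(s) / len(s)
score = predict_with_ita(toy_model, [1.0, 2.0, 3.0, 4.0], [scale, circular_shift])
```

Because no retraining is involved, the same wrapper can be placed around any scoring function, which is the model-agnostic property the abstract highlights.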
This paper aims to initiate new conversations about the use of physiological indicators when assessing the welfare of dogs. There are significant concerns about construct validity - whether the measures used accurately reflect welfare. The goal is to provide recommendations for future inquiry and encourage debate. We acknowledge that the scientific understanding of animal welfare has evolved and bring attention to the shortcomings of commonly used biomarkers like cortisol. These indicators are frequently used in isolation and with limited salient dog descriptors, so fail to reflect the canine experience adequately. Using a systems approach, we explore various physiological systems and alternative indicators, such as heart rate variability and oxidative stress, to address this limitation. It is essential to consider factors like age, body weight, breed, and sex when interpreting these biomarkers correctly, and researchers should report on these in their studies. This discussion identifies possible indicators for both positive and negative experiences. In conclusion, we advocate for a practical, evidence-based approach to assessing indicators of canine welfare, including non-invasive
Low-cost, high-throughput DNA and RNA sequencing (HTS) data is the backbone of the life sciences. Genome sequencing is now becoming a part of Predictive, Preventive, Personalized, and Participatory (termed 'P4') medicine. All genomic data are currently processed in energy-hungry computer clusters and centers, necessitating data transfer, consuming substantial energy, and wasting valuable time. Therefore, there is a need for fast, energy-efficient, and cost-efficient technologies that enable genomics research without requiring data centers and cloud platforms. We recently launched the BioPIM Project to leverage emerging processing-in-memory (PIM) technologies to enable energy- and cost-efficient analysis of bioinformatics workloads. The BioPIM Project focuses on co-designing algorithms and data structures commonly used in genomics with several PIM architectures to achieve the highest cost, energy, and time savings.
Long-range dependencies are critical for understanding genomic structure and function, yet most conventional methods struggle with them. Widely adopted transformer-based models, while excelling at short-context tasks, are limited by the attention module's quadratic computational complexity and inability to extrapolate to sequences longer than those seen in training. In this work, we explore State Space Models (SSMs) as a promising alternative by benchmarking two SSM-inspired architectures, Caduceus and Hawk, on long-range genomics modeling tasks under conditions parallel to a 50M parameter transformer baseline. We discover that SSMs match transformer performance and exhibit impressive zero-shot extrapolation across multiple tasks, handling contexts 10 to 100 times longer than those seen during training, indicating more generalizable representations better suited for modeling the long and complex human genome. Moreover, we demonstrate that these models can efficiently process sequences of 1M tokens on a single GPU, allowing for modeling entire genomic regions at once, even in labs with limited compute. Our findings establish SSMs as efficient and scalable for long-context genomic an
Artificial Intelligence (AI) algorithms, trained on emotion data extracted from physiological signals, provide a promising approach to monitoring emotions, affect, and mental well-being. However, the field encounters challenges because there is a lack of effective methods for collecting high-quality data in everyday settings that genuinely reflect changes in emotion or affect. This paper presents a position discussion on the current technique of annotating physiological signal-based emotion data. Our discourse underscores the importance of adopting a nuanced understanding of annotation processes, paving the way for a more insightful exploration of the intricate relationship between physiological signals and human emotions.
The effective visualization of genomic data is crucial for exploring and interpreting complex relationships within and across genes and genomes. Despite advances in developing dedicated bioinformatics software, common visualization tools often fail to efficiently integrate the diverse datasets produced in comparative genomics, lack intuitive interfaces to construct complex plots and are missing functionalities to inspect the underlying data iteratively and at scale. Here, we introduce gggenomes, a versatile R package designed to overcome these challenges by extending the widely used ggplot2 framework for comparative genomics. gggenomes is available from CRAN and GitHub, accompanied by detailed and user-friendly documentation (https://thackl.github.io/gggenomes).
In recent years, Reinforcement Learning (RL) has emerged as a powerful tool for solving a wide range of problems, including decision-making and genomics. The exponential growth of raw genomic data over the past two decades has exceeded the capacity of manual analysis, leading to a growing interest in automatic data analysis and processing. RL algorithms are capable of learning from experience with minimal human supervision, making them well-suited for genomic data analysis and interpretation. One of the key benefits of using RL is the reduced cost associated with collecting labeled training data, which is required for supervised learning. While there have been numerous studies examining the applications of Machine Learning (ML) in genomics, this survey focuses exclusively on the use of RL in various genomics research fields, including gene regulatory networks (GRNs), genome assembly, and sequence alignment. We present a comprehensive technical overview of existing studies on the application of RL in genomics, highlighting the strengths and limitations of these approaches. We then discuss potential research directions that are worthy of future exploration, including the development
Physiological computing uses human physiological data as system inputs in real time. It includes, or significantly overlaps with, brain-computer interfaces, affective computing, adaptive automation, health informatics, and physiological signal based biometrics. Physiological computing increases the communication bandwidth from the user to the computer, but is also subject to various types of adversarial attacks, in which the attacker deliberately manipulates the training and/or test examples to hijack the machine learning algorithm output, leading to possible user confusion, frustration, injury, or even death. However, the vulnerability of physiological computing systems has not received enough attention, and no comprehensive review of adversarial attacks on them exists. This paper fills this gap by providing a systematic review of the main research areas of physiological computing, different types of adversarial attacks and their applications to physiological computing, and the corresponding defense strategies. We hope this review will attract more research interest in the vulnerability of physiological computing systems, and more importantly, defense strategies
As an emerging interaction paradigm, physiological computing is increasingly being used to both measure and feed back information about our internal psychophysiological states. While most applications of physiological computing are designed for individual use, recent research has explored how biofeedback can be socially shared between multiple users to augment human-human communication. Reflecting on the empirical progress in this area of study, this paper presents a systematic review of 64 studies to characterize the interaction contexts and effects of social biofeedback systems. Our findings highlight the importance of physio-temporal and social contextual factors surrounding physiological data sharing as well as how it can promote social-emotional competences on three different levels: intrapersonal, interpersonal, and task-focused. We also present the Social Biofeedback Interactions framework to articulate the current physiological-social interaction space. We use this to frame our discussion of the implications and ethical considerations for future research and design of social biofeedback interfaces.
The COVID-19 crisis has demonstrated the potential of cutting-edge genomics research. However, the privacy of this sensitive information is an area of significant concern for genomics researchers. Current security models make it difficult to create flexible and automated data sharing frameworks. These models also increase the complexity of adding or revoking access without contacting the data publisher. In this work, we investigate an automated attribute-based access control (AABAC) model for genomics data over Named Data Networking (NDN). AABAC secures the data itself rather than the storage location or transmission channel, provides automated data invalidation, and automates key retrieval and data validation while maintaining the ability to control access. We show that AABAC, when combined with NDN, provides a secure and flexible combination for genomics research.
Rare diseases are collectively common, affecting approximately one in twenty individuals worldwide. In recent years, rapid progress has been made in rare disease diagnostics due to advances in DNA sequencing, development of new computational and experimental approaches to prioritize genes and genetic variants, and increased global exchange of clinical and genetic data. However, more than half of individuals suspected to have a rare disease lack a genetic diagnosis. The Genomics Research to Elucidate the Genetics of Rare Diseases (GREGoR) Consortium was initiated to study thousands of challenging rare disease cases and families and apply, standardize, and evaluate emerging genomics technologies and analytics to accelerate their adoption in clinical practice. Further, all data generated, currently representing ~7500 individuals from ~3000 families, is rapidly made available to researchers worldwide via the Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL) to catalyze global efforts to develop approaches for genetic diagnoses in rare diseases (https://gregorconsortium.org/data). The majority of these families have undergone prior clinical genetic testing
This paper will argue that one of the biggest challenges for livestock genomics is to make whole-genome sequencing and functional genomics applicable to breeding practice. It discusses potential explanations for why it is so difficult to consistently improve the accuracy of genomic prediction by means of whole-genome sequence data, and three potential attacks on the problem.
Genome data are crucial in modern medicine, offering significant potential for diagnosis and treatment. Thanks to technological advancements, many millions of healthy and diseased genomes have already been sequenced; however, obtaining the most suitable data for a specific study, and specifically for validation studies, remains challenging with respect to scale and access. Therefore, in silico genomics sequence generators have been proposed as a possible solution. However, the current generators produce inferior data using mostly shallow (stochastic) connections, detected with limited computational complexity in the training data. This means they do not take into consideration the biological relations and constraints that originally caused the observed connections. To address this issue, we propose the cancer-inspired genomics mapper model (CGMM), which combines genetic algorithm (GA) and deep learning (DL) methods. CGMM mimics processes that generate genetic variations and mutations to transform readily available control genomes into genomes with the desired phenotypes. We demonstrate that CGMM can generate synthetic genomes of selected phenotypes
The exponential growth in the popularity of multimedia has created a need for user-centric adaptive applications that manage multimedia content more effectively. Implicit analysis, which examines users' perceptual experience of multimedia by monitoring physiological or behavioral cues, has the potential to satisfy such demands. In particular, physiological signals, categorized into cerebral physiological signals (electroencephalography, functional magnetic resonance imaging, and functional near-infrared spectroscopy) and peripheral physiological signals (heart rate, respiration, skin temperature, etc.), have recently received attention along with the notable development of wearable physiological sensors. In this paper, we review existing studies on physiological signal analysis exploring perceptual experience of multimedia. Furthermore, we discuss current trends and challenges.
Searching for similar genomic sequences is an essential and fundamental step in biomedical research and an overwhelming majority of genomic analyses. State-of-the-art computational methods performing such comparisons fail to cope with the exponential growth of genomic sequencing data. We introduce the concept of sparsified genomics where we systematically exclude a large number of bases from genomic sequences and enable much faster and more memory-efficient processing of the sparsified, shorter genomic sequences, while providing similar or even higher accuracy compared to processing non-sparsified sequences. Sparsified genomics provides significant benefits to many genomic analyses and has broad applicability. We show that sparsifying genomic sequences greatly accelerates the state-of-the-art read mapper (minimap2) by 2.57-5.38x, 1.13-2.78x, and 3.52-6.28x using real Illumina, HiFi, and ONT reads, respectively, while providing up to 2.1x smaller memory footprint, 2x smaller index size, and more truly detected small and structural variations compared to minimap2. Sparsifying genomic sequences makes containment search through very large genomes and large databases 72.7-75.88x faster
The Global Alliance for Genomics and Health (GA4GH) Beacon protocol lets researchers ask whether a genomic variant has been observed in a participating cohort and receive aggregate variant-level counts. As Beacon networks grow, two privacy risks remain: host institutions can see plaintext queries, and repeated rare-variant queries can support membership-inference attacks. We present bioETH-Beacon, a smart-contract prototype that runs the Beacon "aggregate count" query over encrypted data on a fully homomorphic Ethereum Virtual Machine (fhEVM). Hospitals upload encrypted marker-count entries, authorized researchers submit encrypted marker queries, and the contract returns an encrypted answer that is released, via an off-chain key-management service, only to the requester named in the contract's on-chain ACL. The design is organized as a 3x4 tier-by-query-family grid spanning genotype, sex, age, and phenotype queries, with tiers that trade stronger confidentiality for lower query cost. For genotype paths, the prototype can add bounded on-chain noise to mitigate probing attacks. Experiments on synthetic panels derived from a Polygenic Score (PGS) catalog show the expected scaling beha
The emergence of telomere-to-telomere (T2T) genome assemblies has opened new avenues for comparative genomics, yet effective tokenization strategies for genomic sequences remain underexplored. In this pilot study, we apply Byte Pair Encoding (BPE) to nine T2T primate genomes including three human assemblies by training independent BPE tokenizers with a fixed vocabulary of 512,000 tokens using our custom tool, dnaBPE. Our analysis reveals that only 11,569 tokens are shared across all assemblies, while nearly 991,854 tokens are unique to a single genome, indicating a rapid decline in shared vocabulary with increasing assembly comparisons. Moreover, phylogenetic trees derived from token overlap failed to recapitulate established primate relationships, a discrepancy attributed to the disproportionate influence of species-specific high-copy repetitive elements. These findings underscore the dual nature of BPE tokenization: while it effectively compresses repetitive sequences, its sensitivity to high-copy elements limits its utility as a universal tool for comparative genomics. We discuss potential hybrid strategies and repeat-masking approaches to refine genomic tokenization, emphasizin
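The comparison described in this abstract, training an independent BPE tokenizer per genome and then intersecting the vocabularies, can be illustrated with a toy sketch. This is not dnaBPE, whose interface is not described here; `train_bpe` is a hypothetical, simplified trainer that greedily merges the most frequent adjacent token pair, and the sequences are made-up examples rather than real assemblies.

```python
from collections import Counter


def train_bpe(sequence, num_merges):
    """Learn a BPE vocabulary from one sequence (simplified sketch).

    Starts from single bases and repeatedly merges the most frequent
    adjacent token pair, stopping early once no pair occurs twice.
    """
    tokens = list(sequence)
    vocab = set(tokens)
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break
        merged = a + b
        vocab.add(merged)
        # Rewrite the token stream, replacing each (a, b) with the merged token.
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return vocab


# Two toy "genomes", each with its own independently trained vocabulary.
vocab_a = train_bpe("ACACACAC", num_merges=2)
vocab_b = train_bpe("ACACACGG", num_merges=2)
shared = vocab_a & vocab_b
unique_to_b = vocab_b - vocab_a
```

In this toy case the shared tokens are exactly those built from the repeated "AC" unit, which mirrors the abstract's observation that high-copy repeats drive what a BPE vocabulary contains.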