Neuroscience studies entail the generation of massive collections of heterogeneous data (e.g. demographics, clinical records, medical images). Integration and analysis of such data in research centers is pivotal for elucidating disease mechanisms and improving clinical outcomes. However, data collection in clinics often relies on non-standardized methods, such as paper-based documentation. Moreover, diverse data types are collected in different departments, hindering efficient data organization, secure sharing and compliance with the FAIR (Findable, Accessible, Interoperable, Reusable) principles. In this manuscript, we present a specialized data management system designed to enhance research workflows in Deep Brain Stimulation (DBS), a state-of-the-art neurosurgical procedure employed to treat symptoms of movement and psychiatric disorders. The system leverages REDCap to promote accurate data capture in hospital settings and secure sharing with research institutes, the Brain Imaging Data Structure (BIDS) as the image storage standard, and a DBS-specific SQLite database as a comprehensive data store and unified interface to all data types. A self-developed Python tool automates the data flow between these three components, ensuring their full interoperability. The proposed framework has already been successfully employed for capturing and analyzing data of 107 patients from 2 medical institutions. It effectively addresses the challenges of managing, sharing and retrieving diverse data types, fostering advancements in data quality, organization, analysis, and collaboration among medical and research institutions.
The Clinical Practice Research Datalink (CPRD) is a large and widely used resource of electronic health records from the UK, linking primary care data to hospital data, death registration data, cancer registry data, deprivation data and mental health services data. Extraction and management of CPRD data is a computationally demanding process and requires a significant amount of work, in particular when using R. The rcprd package simplifies the process of extracting and processing CPRD data in order to build datasets ready for statistical analysis. Raw CPRD data is provided in thousands of .txt files, making querying this data cumbersome and inefficient. rcprd saves the relevant information into an SQLite database stored on the hard drive which can then be queried efficiently to extract required information about individuals. rcprd follows a four-stage process: 1) Definition of a cohort, 2) Read in medical/prescription data and save into an SQLite database, 3) Query this SQLite database for specific codes and tests to create variables for each individual in the cohort, 4) Combine extracted variables into a dataset ready for statistical analysis. Functions are available to extract common variable types (e.g., history of a condition, or time until an event occurs, relative to an index date), and more general functions for database queries, allowing users to define their own variables for extraction. The entire process can be done from within R, with no knowledge of SQL required. This manuscript showcases the functionality of rcprd by running through an example using simulated CPRD Aurum data. rcprd will reduce the duplication of time and effort among those using CPRD data for research, allowing more time to be focused on other aspects of research projects.
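The four-stage process described above can be illustrated with a small sketch. It is written in Python with the standard sqlite3 module rather than in R, and it does not use the rcprd API. The table layout, the column names (patid, medcodeid, obsdate), the codes and the helper history_of are all assumptions made for illustration.

```python
import sqlite3
from datetime import date

# Stage 1: define a cohort (hypothetical patient ids and index dates).
cohort = {1: date(2020, 1, 1), 2: date(2020, 1, 1), 3: date(2020, 1, 1)}

# Stage 2: read raw observation rows (standing in for the many .txt files)
# and save them into an SQLite database. An in-memory database keeps the
# sketch self-contained; a file path would be used in practice.
raw_rows = [
    (1, "A100", "2018-05-01"),
    (1, "B200", "2021-03-01"),
    (2, "A100", "2020-06-15"),
    (3, "C300", "2019-01-01"),
]
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE observation (patid INTEGER, medcodeid TEXT, obsdate TEXT)")
con.executemany("INSERT INTO observation VALUES (?, ?, ?)", raw_rows)
con.execute("CREATE INDEX idx_obs ON observation (patid, medcodeid)")
con.commit()


def history_of(con, patid, codes, index_date):
    """Stage 3: 1 if any of `codes` was recorded before the index date, else 0."""
    marks = ",".join("?" for _ in codes)
    row = con.execute(
        f"SELECT COUNT(*) FROM observation "
        f"WHERE patid = ? AND medcodeid IN ({marks}) AND obsdate < ?",
        (patid, *codes, index_date.isoformat()),
    ).fetchone()
    return int(row[0] > 0)


# Stage 4: combine the extracted variable into an analysis-ready dataset.
dataset = [
    {"patid": pid, "hist_condition": history_of(con, pid, ["A100"], idx)}
    for pid, idx in cohort.items()
]
```

Storing dates as ISO 8601 text lets the comparison against the index date work as a plain string comparison, which keeps the query simple in this sketch.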
Genomic variant data are useful in detecting and treating antibiotic-resistant bacteria. However, there are no bacterial genomic variant databases that catalogue the variations in the different genes across strains. In this work a Nextflow- and Docker-based end-to-end pipeline, BVbase, that can automate the creation of databases from raw high-throughput sequences has been created to fill this lacuna with Pseudomonas aeruginosa as a case study. Pseudomonas aeruginosa is a Gram-negative adaptable pathogen with multiple antibiotic resistances that causes various types of infections, including respiratory, urinary, and bloodstream infections. The pipeline can take multistrain genomic files, detect missense variants, and save results in a database with the help of Python and SQLite (https://github.com/bic-sastra/BVbase). Using the generated database for P. aeruginosa, a web application interface has been made using Flask and HTML that runs on a server with a MySQL backend (https://bic.sastra.edu/pavardb). The web application provides support for different types of queries to select variants by gene, geographical group, isolation country, antibiotics, and resistance phenotype. This web interface generates results as variant tables, plots, and statistics for the selected data. By enabling interactive visualizations and advanced selection, the platform supports research and clinical use through the exploration of genomic variations associated with antimicrobial resistance.
This paper proposes DB-LIO (database-driven LiDAR-inertial odometry), a simultaneous localization and mapping (SLAM) system that addresses memory scalability challenges in extended autonomous operation. Existing LiDAR-SLAM systems accumulate keyframe history in memory, leading to O(N) growth and out-of-memory failures during extended operation. To overcome this limitation, DB-LIO introduces three core design elements. First, it proposes a spatially indexed keyframe management scheme that persistently stores keyframes in SQLite with R-Tree spatial indexing, enabling O(log N + k) spatial queries that tightly couple cache eviction with factor-graph optimization requirements, a design that ensures every keyframe potentially involved in the next optimization cycle resides in cache. Second, it presents a four-level memory bounding architecture (SLAM-engine keyframe trimming with transparent on-demand reloading, a DB-level least recently used (LRU) cache with a spatial active window, Scan Context descriptor pool bounding, and iSAM2 sliding window compaction with a sparse global anchor graph) that collectively bounds the dominant memory consumers to O(C). Third, the DB-based persistent storage enables a localization mode that can reload previously built maps (including full point clouds, six-degree-of-freedom poses, timestamps, and inter-keyframe relationships) and perform pose estimation using the stored map, which is particularly valuable for agricultural robots and other autonomous systems requiring map reuse. Experiments on a custom orchard dataset demonstrate an 81.9% reduction in memory usage compared with that of the in-memory baseline (2888 MB → 524 MB), while preserving equivalent trajectory accuracy (absolute trajectory error (ATE) root mean square error (RMSE) 0.305 ± 0.001 m vs. 0.296 m). 
Validation on the KITTI odometry benchmark confirms that the proposed localization mode generalizes across different LiDAR types (Livox Mid360, Velodyne HDL-64E) and environments (orchard, urban driving).
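As a rough illustration of the spatially indexed keyframe store and the LRU cache with a spatial active window, the following Python sketch keeps keyframes in SQLite, indexes their positions with an R-Tree virtual table, and evicts only keyframes outside the current window. It is not the DB-LIO implementation: the schema, the names add_keyframe, nearby_ids and KeyframeCache, the 2D positions (instead of six-degree-of-freedom poses) and the trajectory are assumptions, and the sketch requires an SQLite build that includes the R-Tree module.

```python
import sqlite3
from collections import OrderedDict

con = sqlite3.connect(":memory:")
# Keyframe payloads live in an ordinary table; the R-Tree virtual table
# indexes each keyframe's 2D position as a degenerate bounding box.
con.execute("CREATE TABLE keyframe (id INTEGER PRIMARY KEY, cloud BLOB)")
con.execute("CREATE VIRTUAL TABLE kf_index USING rtree(id, min_x, max_x, min_y, max_y)")


def add_keyframe(kf_id, x, y, cloud):
    con.execute("INSERT INTO keyframe VALUES (?, ?)", (kf_id, cloud))
    con.execute("INSERT INTO kf_index VALUES (?, ?, ?, ?, ?)", (kf_id, x, x, y, y))


def nearby_ids(x, y, radius):
    """Ids of keyframes inside the square window centred on (x, y)."""
    rows = con.execute(
        "SELECT id FROM kf_index "
        "WHERE min_x >= ? AND max_x <= ? AND min_y >= ? AND max_y <= ? ORDER BY id",
        (x - radius, x + radius, y - radius, y + radius),
    )
    return [r[0] for r in rows]


class KeyframeCache:
    """LRU cache that never evicts keyframes in the current active window."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # least recently used entries come first

    def get(self, kf_id):
        if kf_id in self.items:
            self.items.move_to_end(kf_id)
        else:  # cache miss: reload the payload from the database on demand
            row = con.execute(
                "SELECT cloud FROM keyframe WHERE id = ?", (kf_id,)
            ).fetchone()
            self.items[kf_id] = row[0]
        return self.items[kf_id]

    def evict(self, active_ids):
        active = set(active_ids)
        for kf_id in list(self.items):
            if len(self.items) <= self.capacity:
                break
            if kf_id not in active:
                del self.items[kf_id]


# Ten keyframes along a straight line, 5 m apart (made-up trajectory).
for i in range(10):
    add_keyframe(i, 5.0 * i, 0.0, bytes([i]))

cache = KeyframeCache(capacity=3)
for i in range(10):
    cache.get(i)
active = nearby_ids(45.0, 0.0, 10.0)  # window around the latest pose
cache.evict(active)
```

Because eviction skips every id returned by the spatial query, the keyframes near the current pose stay resident while older ones remain recoverable from the database through get.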
Modern biomedical imaging workflows generate large volumes of derived images and short videos that must be reviewed, compared, curated, and reused following primary acquisition and analysis. In practice, these assets are often dispersed across nested filesystem hierarchies on local drives, external media, or network storage, limiting efficient retrieval, deduplication, and figure assembly. We present PixelDeck, an open-source, local-first browser application for organizing and interactively browsing large biomedical image and video libraries on commodity workstations. PixelDeck integrates recursive folder import, SHA-256-based duplicate detection, metadata extraction, thumbnail and preview generation, full-text search, and asynchronous export within a responsive interface, supported by a modular ingestion pipeline, managed storage layer, and interactive browsing environment optimized for high-volume media collections. The system is implemented using a Next.js and React frontend, a SQLite metadata store accessed via Prisma, managed local media storage, and a background worker that executes import and export tasks asynchronously, enabling scalable processing on standard hardware. To evaluate performance, we conducted structured benchmark imports using public histopathology images curated from PanopTILs, SICAPv2, and PanNuke datasets, where dataset-specific import behavior, duplicate detection, and ingestion metrics were recorded as reproducible outputs. Embedding-based analysis further demonstrates dataset-level separation consistent with underlying image characteristics. These results show that PixelDeck provides an efficient, scalable local curation layer for heterogeneous biomedical imaging collections, enabling streamlined dataset exploration and preparation for downstream analysis.
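The combination of recursive folder import and SHA-256-based duplicate detection can be sketched as follows. This is a minimal Python illustration using only the standard library, not PixelDeck's code (which is described as a Next.js and Prisma application); the asset table, its columns and the function names are assumptions.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large media never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def import_folder(con, root):
    """Recursively import files; return (imported, duplicates) counts."""
    imported = duplicates = 0
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # The hash is the primary key, so a second file with identical
        # content is ignored by the database instead of stored twice.
        cur = con.execute(
            "INSERT OR IGNORE INTO asset (sha256, path, size_bytes) VALUES (?, ?, ?)",
            (sha256_of(path), str(path), path.stat().st_size),
        )
        if cur.rowcount == 1:
            imported += 1
        else:
            duplicates += 1
    con.commit()
    return imported, duplicates


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE asset (sha256 TEXT PRIMARY KEY, path TEXT, size_bytes INTEGER)")

# Build a small nested folder with one byte-identical copy.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "a.png").write_bytes(b"image-one")
    (Path(root) / "nested").mkdir()
    (Path(root) / "nested" / "copy_of_a.png").write_bytes(b"image-one")
    (Path(root) / "nested" / "b.png").write_bytes(b"image-two")
    result = import_folder(con, root)
```

Content hashing detects only exact byte-level duplicates; re-encoded or resized copies of the same image would be treated as distinct files in this sketch.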
Multi-omic integration involves the management of diverse omic datasets. Conducting an effective analysis of these datasets necessitates a data management system that meets a specific set of requirements, such as rapid storage and retrieval of data with varying numbers of features and mixed data types, assurance of reliable and secure database transactions, extension of stored data row- and column-wise, and facilitation of data distribution. SQLite and DuckDB are embedded databases that fulfil these requirements. However, they rely on the structured query language (SQL), which hinders their adoption by the uninitiated user and complicates their use in repetitive tasks due to the necessity of writing SQL queries. This study offers Omilayers, a Python package that encapsulates these two databases and exposes a subset of their functionality that is geared towards frequent and repetitive analytical procedures. Synthetic data were used to demonstrate the use of Omilayers and compare the performance of SQLite and DuckDB.
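The idea of hiding SQL behind a small set of methods can be sketched with a thin wrapper around the standard sqlite3 module. The class LayerStore and its methods are hypothetical and are not the Omilayers API; the layer name, feature name and values are made up.

```python
import sqlite3


class LayerStore:
    """Tiny wrapper that hides SQL behind a few methods (hypothetical API)."""

    def __init__(self, path=":memory:"):
        self.con = sqlite3.connect(path)

    @staticmethod
    def _quote(name):
        # Quote identifiers so feature names with odd characters are safe.
        return '"' + name.replace('"', '""') + '"'

    def write_layer(self, layer, columns, rows):
        """Create a layer (one table) and store its rows."""
        cols = ", ".join(self._quote(c) for c in columns)
        marks = ", ".join("?" for _ in columns)
        with self.con:  # the inserted rows are committed together or rolled back
            self.con.execute(f"CREATE TABLE {self._quote(layer)} ({cols})")
            self.con.executemany(
                f"INSERT INTO {self._quote(layer)} VALUES ({marks})", rows
            )

    def add_column(self, layer, column, values):
        """Extend a layer column-wise; values follow the stored row order."""
        table, col = self._quote(layer), self._quote(column)
        with self.con:
            self.con.execute(f"ALTER TABLE {table} ADD COLUMN {col}")
            rowids = [
                r[0]
                for r in self.con.execute(f"SELECT rowid FROM {table} ORDER BY rowid")
            ]
            self.con.executemany(
                f"UPDATE {table} SET {col} = ? WHERE rowid = ?", zip(values, rowids)
            )

    def read(self, layer, columns):
        """Return the requested columns of a layer in stored row order."""
        cols = ", ".join(self._quote(c) for c in columns)
        return self.con.execute(
            f"SELECT {cols} FROM {self._quote(layer)} ORDER BY rowid"
        ).fetchall()


store = LayerStore()
store.write_layer("proteomics", ["sample", "P12345"], [("s1", 0.8), ("s2", 1.4)])
store.add_column("proteomics", "batch", ["A", "B"])
subset = store.read("proteomics", ["sample", "batch"])
```

A user of such a wrapper stores, extends and retrieves layers without writing any SQL, which is the kind of convenience the abstract attributes to encapsulating the embedded databases.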
Horizon scanning (HS) is a methodology that aims to capture signals and trends that highlight future opportunities and challenges. The National Institute for Health and Care Research (NIHR) Innovation Observatory routinely scans for medical technologies and therapeutics to inform policy and practice for healthcare in the United Kingdom (UK). To date, there is no standardized terminology for horizon scanning in healthcare. Here, we discuss the development of a data glossary and the IOAtlas web app. We extracted data points from 4 years' worth of NIHR Innovation Observatory HS projects and collated them by technology type and descriptive family. A source repository was established by extracting a list of all sources used in NIHR Innovation Observatory briefing notes between 2017 and 2021. The repository was validated by external HS organizations and experts, and sources were then mapped to the appropriate time horizons. The glossary and repository were converted to an SQLite database format and connected to a free web app, IOAtlas. After de-duplication and consolidation, a total of 148 data points were included in the glossary. The source repository consists of 149 sources, with 99 percent being compliant with searching for two or more technology types. The final SQLite database contained 35 tables with 36 relationships. We present a data glossary to provide globalized standardization for the terminology used in HS projects. The glossary can be accessed through the IOAtlas web app. Furthermore, we provide users with an interface to generate downloadable data extraction templates within IOAtlas.
The '2018 Marganai Forest Soil Erosion Experiment Database' is a comprehensive collection of measures taken during scientific experiment trials designed to investigate the effects of forest canopy coverage on soil erosion under intense artificial rainfall, four years after coppicing. The investigation involved the establishment of eight paired plots with and without forest canopy coverage, subjected to artificial rainfall simulation aimed to measure the amount of sediment transported by runoff. The work represents a valuable resource for researchers interested in understanding the complex implications of forest management practices on soil erosion. The paper, produced using Quarto in a Gitlab-based RStudio project, is an example of 'reproducible research' documenting that the database provides detailed information on the experimental setup as well as on the range of different measurements that have been collected. The database, produced using NFS-DataDocumentationProcedure, is stored in an SQLite file, extensively exploiting the relational properties of the engine, enhancing data accessibility, interoperability and reusability.
Mental health mobile applications are a cost-effective and scalable answer to the world's psychiatrist shortage and limited access to care in remote areas. However, there is currently no mobile application for providing mental health interventions in Ethiopia. Therefore, this project aimed to develop and test the preliminary effectiveness and acceptability of an Android-based mobile application for mental health information, psychological self-testing, and treatment recommendation during COVID-19 and beyond. We conducted a preliminary assessment to review experiences and demands associated with mental health mobile apps. Object-oriented modeling and the Agile software development methodology were employed. Android Studio's layout editor, resource management, palette, and theme editor were used. We utilized Java as the programming language for writing application code, eXtensible Markup Language (XML) to construct the overall structure of the app, and SQLite to save data locally on the user's device. To ensure quality, tests were performed on a regular basis throughout the development process. The project developed an Android-based mobile app for mental health information, psychological self-testing, and treatment recommendations for COVID-19. A preliminary assessment found no existing mobile apps for mental health care. Of participants, 94.6% believed mental health apps benefit the public, patients, and healthcare professionals. However, some individuals opposed the app due to concerns about self-treatment and medication misuse. The study indicates a high demand for a mental health mobile app, although a few participants feared self-treatment or drug abuse. Apps that support native languages are recommended, and nonpharmacological treatments should be used in conjunction with clinician consultation.
Transcriptome analysis of complex tissues remains challenging due to assembly errors, isoform diversity, and annotation bias, necessitating optimized computational pipelines. Scorpion venoms are a treasure trove of bioactive peptides with significant biomedical potential, but their complexity complicates transcriptome profiling. We present ToxIR (Toxin Identification and Recognition), an RNA-seq pipeline optimized for accurate toxin transcriptome analysis, validated in Odontobuthus doriae venom glands. ToxIR combines deep sequencing, rnaSPAdes-based de novo assembly, and a tailored annotation strategy to detect even low-abundance toxins and resolve isoforms with high accuracy. It incorporates rigorous quality control (FastQC, Trimmomatic), curated UniProt toxin homology searches, and integrated structural analyses (SignalP, TMHMM, Pfam, InterProScan) to prioritize candidates based on signal peptides, cysteine content, and toxin-specific domains. Unlike general-purpose or previous toxin pipelines, ToxIR minimizes misassemblies and annotation bias through its modular design, automated structural queries, and SQLite-backed data integration. The pipeline identified 378 putative toxin candidates, including 192 high-confidence candidates (Group A) and 23 novel, divergent toxins (Group C). These included 180 sodium channel, 111 potassium channel, and 69 chloride channel toxins. By enabling flexible cross-species use and enhancing annotation precision, ToxIR provides a robust framework that accelerates the discovery of therapeutic toxins.
Fragment-based quantum chemistry offers a means to circumvent the nonlinear computational scaling of conventional electronic structure calculations by partitioning a large calculation into smaller subsystems, then considering the many-body interactions between them. Variants of this approach have been used to parameterize classical force fields and machine learning potentials, applications that benefit from interoperability between quantum chemistry codes. However, there is a dearth of software that provides interoperability yet is purpose-built to handle the combinatorial complexity of fragment-based calculations. To fill this void we introduce "Fragme∩t", an open-source software application that provides a tool for community validation of fragment-based methods, a platform for developing new approximations, and a framework for analyzing many-body interactions. Fragme∩t includes algorithms for automatic fragment generation and structure modification and for distance- and energy-based screening of the requisite subsystems. Checkpointing, database management, and parallelization are handled internally and results are archived into a portable database format. Interfaces are provided to quantum chemistry engines including Q-Chem, PySCF, xTB, Orca, CP2K, MRCC, Psi4, NWChem, GAMESS, and MOPAC. Applications reported here demonstrate parallel efficiencies around 96% on more than 1,000 processors but also showcase that the code can handle large-scale protein fragmentation using only workstation hardware, all with a codebase that is designed to be usable by non-experts. Fragme∩t conforms to modern software development best practices and is built upon well established technologies including Python, SQLite, and Ray. The source code is available under the Apache 2.0 license.
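The combinatorial growth of subsystems, and the effect of distance-based screening on it, can be made concrete with a short Python sketch. It is not Fragme∩t's code: the fragment centroids, the cutoff, the toy energies and the function names are assumptions, and the energy sum is the generic many-body expansion (monomer energies plus pairwise corrections, and so on), truncated here at two-body terms.

```python
from itertools import combinations
from math import dist

# Hypothetical fragment centroids (in angstrom) for four monomers on a line.
centroids = {"A": (0.0, 0.0, 0.0), "B": (3.0, 0.0, 0.0),
             "C": (6.0, 0.0, 0.0), "D": (20.0, 0.0, 0.0)}


def subsystems(centroids, order, cutoff):
    """All n-body subsystems (n <= order) whose members lie within `cutoff`."""
    kept = []
    for n in range(1, order + 1):
        for combo in combinations(sorted(centroids), n):
            close = all(dist(centroids[a], centroids[b]) <= cutoff
                        for a, b in combinations(combo, 2))
            if close:
                kept.append(combo)
    return kept


def many_body_energy(energies, order):
    """Sum the n-body corrections via inclusion-exclusion over sub-fragments.

    A subsystem missing from `energies` (screened out) contributes no
    correction, which is the approximation the screening introduces.
    """
    corrections = {}
    for combo in sorted(energies, key=len):
        if len(combo) > order:
            continue
        lower = sum(corrections.get(sub, 0.0)
                    for k in range(1, len(combo))
                    for sub in combinations(combo, k))
        corrections[combo] = energies[combo] - lower
    return sum(corrections.values())


kept = subsystems(centroids, order=2, cutoff=7.0)
# Toy energies: each monomer is -1.0 and each retained dimer binds by 0.1.
energies = {c: -1.0 * len(c) - (0.1 if len(c) == 2 else 0.0) for c in kept}
total = many_body_energy(energies, order=2)
```

With four monomers there are ten one- and two-body subsystems in total; the 7 Å cutoff in this toy case retains seven of them, because every dimer involving the distant fragment D is screened out.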
Verbal autopsy (VA) has been a crucial tool in ascertaining population-level cause of death (COD) estimates, specifically in countries where medical certification of COD is relatively limited. The World Health Organization has released an updated instrument (Verbal Autopsy Instrument 2022) that supports electronic data collection methods along with analytical software for assigning COD. This questionnaire encompasses the primary signs and symptoms associated with prevalent diseases across all age groups. Traditional methods have primarily involved paper-based questionnaires and physician-coded approaches for COD assignment, which is time-consuming and resource-intensive. Although computer-coded algorithms have advanced the COD assignment process, data collection in densely populated countries like India remains a logistical challenge. This study aimed to develop an Android-based mobile app specifically tailored for streamlining VA data collection by leveraging the existing Indian public health workforce. The app has been designed to integrate real-time data collection by frontline health workers and seamless data transmission and digital reporting of COD by physicians. This process aimed to enhance the efficiency and accuracy of COD assignment through VA. The app was developed using Android Studio, the primary integrated development environment for developing Android apps using Java. The front-end interface was developed using XML, while SQLite and MySQL were employed to streamline complete data storage on the local and server databases, respectively. The communication between the app and the server was facilitated through a PHP application programming interface to synchronize data from the local to the server database. The complete prototype was specifically built to reduce manual intervention and automate VA data collection. The app was developed to align with the current Indian public health system for district-level COD estimation. 
By leveraging this mobile app, the average time from VA data collection to ascertainment of COD, which typically ranges from 6 to 8 months, is expected to decrease by approximately 80%, to about 1-2 months. Based on annual caseload projections, the smallest administrative public health unit, the health and wellness center, is anticipated to handle 35-40 VA cases annually, while medical officers at primary health centers are projected to manage 150-200 physician-certified VAs each year. The app's data collection and transmission efficiency were further improved based on feedback from users and subject-area experts. The development of a unified mobile app could streamline the VA process, enabling the generation of accurate national and subnational COD estimates. This mobile app can be further piloted and scaled to different regions to integrate the automated VA model into the existing public health system for generating comprehensive mortality statistics in India.
Understanding the dynamics and persistence of biodiversity patterns over short (contemporary) and long (thousands of years) time scales is crucial for predicting ecosystem changes under global climate and land-use changes. A key challenge is integrating currently scattered ecological data to assess complex vegetation dynamics over time. Here, we present VegVault, an interdisciplinary SQLite database that uniquely integrates paleo- and neo-ecological plot-based vegetation data on a global and millennial scale, directly linking them with functional traits, soil, and climate information. VegVault currently comprises data from BIEN, sPlotOpen, TRY, Neotoma, CHELSA, and WoSIS, providing a comprehensive and ready-to-use resource for researchers across various fields to address questions about past and contemporary biodiversity patterns and their abiotic drivers. To further support the usability of the data, VegVault is complemented by the {vaultkeepr} R package, enabling streamlined data access, extraction, and manipulation. This study introduces the structure, content, and diverse applications of VegVault, emphasizing its potential role in advancing ecological research to improve predictions of biodiversity responses to global climate change.
Head and neck cancers represent a critical global health issue, contributing to substantial morbidity and mortality. Recent research has explored the role of microRNAs (miRNAs) in these cancers by constructing miRNA-associated disease networks using bipartite graphs. Graph attention networks (GATs) have emerged as a powerful tool for predicting disease associations within such biological networks, offering enhanced accuracy in identifying potential miRNA-disease relationships. This study employs GATs to uncover and predict potential miRNA contributors to head and neck cancers. Data on miRNA-disease associations were sourced from the HMDD v4.0 database, a platform based on SQLite and Django. The head and neck neoplasms dataset included miRNA, disease, causality, category, and PubMed ID (PMID). GATs were applied to analyze the network, leveraging their ability to capture the significance and interdependencies of nodes and edges. The model used a learnable weight matrix to compute attention coefficients, normalize them, and aggregate information from neighboring nodes for edge prediction. The GAT model, integrating graph neural networks with attention mechanisms, achieved an accuracy of 83% in predicting miRNA-disease associations for head and neck neoplasms. This study highlights the potential of graph-based deep learning models, particularly GATs, in accurately predicting miRNA-disease associations. A functional enrichment analysis revealed significant involvement of miRNAs in oral cancer pathways, notably highlighting the critical roles of the TGF-beta and PI3K-Akt signaling pathways in tumor progression and cell survival. These findings offer a pathway to better understanding the molecular mechanisms underlying head and neck cancers. Future improvements in dataset size, model evaluation, and interpretability could further enhance prediction accuracy, potentially advancing diagnostic and therapeutic strategies for these cancers.
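The attention step described above (a shared weight matrix, attention coefficients, normalization, and aggregation over neighboring nodes) can be sketched in plain Python for a single attention head. This follows the standard graph attention formulation and is not the study's model: the node names, feature vectors, weight matrix W and attention vector a are toy values, nothing is trained, and the final edge-prediction step is not shown.

```python
import math


def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def gat_layer(features, neighbors, W, a):
    """One attention head: returns (updated features, attention weights)."""
    z = {n: matvec(W, h) for n, h in features.items()}  # shared linear map
    out, alpha = {}, {}
    for i, nbrs in neighbors.items():
        # Unnormalised score for each neighbour j: LeakyReLU(a . [z_i || z_j]).
        scores = [leaky_relu(sum(p * q for p, q in zip(a, z[i] + z[j])))
                  for j in nbrs]
        # Softmax over the neighbourhood (shifted by the max for stability).
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        weights = [e / sum(exps) for e in exps]
        alpha[i] = dict(zip(nbrs, weights))
        # Aggregate the transformed neighbour features, weighted by attention.
        out[i] = [sum(w * z[j][d] for w, j in zip(weights, nbrs))
                  for d in range(len(z[i]))]
    return out, alpha


# Toy bipartite graph: two miRNA nodes linked to one disease node,
# with a self-loop on every node.
features = {"mirna_1": [1.0, 0.0], "mirna_2": [0.0, 1.0], "disease_1": [1.0, 1.0]}
neighbors = {
    "mirna_1": ["disease_1", "mirna_1"],
    "mirna_2": ["disease_1", "mirna_2"],
    "disease_1": ["mirna_1", "mirna_2", "disease_1"],
}
W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.5, 0.5, 1.0, -1.0]
updated, alpha = gat_layer(features, neighbors, W, a)
```

The attention weights of each node sum to one over its neighbourhood, so they can be read as the relative importance the node assigns to each neighbour when its representation is updated.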
The tumour suppressor gene TP53 encodes the DNA binding transcription factor p53 and is one of the most mutated genes in human cancer. Tumour suppressor activity requires binding of p53 to its DNA response elements and subsequent transcriptional activation of a diverse set of target genes. Despite decades of close study, the logic underlying p53 interactions with its numerous potential genomic binding sites and target genes is not yet fully understood. Here, we present a database of DNA and chromatin-based information focused on putative p53 binding sites in the human genome to allow users to generate and test new hypotheses related to p53 activity in the genome. Users can query genomic locations based on experimentally observed p53 binding, regulatory element activity, genetic variation, evolutionary conservation, chromatin modification state, and chromatin structure. We present multiple use cases demonstrating the utility of this database for generating novel biological hypotheses, such as chromatin-based determinants of p53 binding and potential cell type-specific p53 activity. All database information is also available as a precompiled SQLite database for use in local analysis or as a Shiny web application. Database URL: https://p53motifDB.its.albany.edu.
Mass spectrometry (MS) generates large data sets that are stored in increasingly optimized and complex file types, demanding technical expertise to extract information rapidly and easily. We wondered whether a simple structured query language (SQL) database could hold raw MS data and allow for easily readable queries without incurring major penalties in the read time or disk space relative to other popular MS formats. Here, we describe a basic MS schema with intuitive database tables and fields that can outperform other formats for exploratory and interactive analysis according to six data subsets commonly extracted: single scans (both MS1 and MS2), ion chromatograms, retention time ranges, and fragmentation searches (both precursor and fragment search). Additionally, we compare SQLite, DuckDB, and Parquet implementations and find that they can perform these tasks in under a second, even when the files occupy over a gigabyte of data on the disk. We believe that this tidy data schema expands nicely to most forms of MS data and offers a way to transparently query data sets while preserving computational performance.
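A tidy layout of this kind, with one row per data point, might look like the following Python sketch using SQLite. The table and column names (ms1, ms2, rt, mz, premz, fragmz, intensity), the file name and the peak values are assumptions for illustration and may differ from the schema in the paper; the queries cover a single scan, an ion chromatogram, a retention time range and a fragment search.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE ms1 (filename TEXT, scan INTEGER, rt REAL, mz REAL, intensity REAL);
CREATE TABLE ms2 (filename TEXT, scan INTEGER, rt REAL,
                  premz REAL, fragmz REAL, intensity REAL);
CREATE INDEX idx_ms1_mz ON ms1 (mz);
CREATE INDEX idx_ms2_premz ON ms2 (premz);
""")

# Made-up peaks: one row per (scan, m/z) data point.
con.executemany("INSERT INTO ms1 VALUES (?, ?, ?, ?, ?)", [
    ("run1.mzML", 1, 1.00, 118.0865, 1.0e5),
    ("run1.mzML", 1, 1.00, 200.1000, 2.0e4),
    ("run1.mzML", 2, 1.05, 118.0866, 3.0e5),
    ("run1.mzML", 3, 1.10, 118.0864, 1.5e5),
    ("run1.mzML", 3, 1.10, 300.2000, 5.0e3),
])
con.executemany("INSERT INTO ms2 VALUES (?, ?, ?, ?, ?, ?)", [
    ("run1.mzML", 4, 1.12, 118.0865, 58.0660, 4.0e4),
    ("run1.mzML", 4, 1.12, 118.0865, 59.0730, 1.0e4),
])


def mz_window(mz, ppm):
    """Lower and upper m/z bounds for a tolerance given in parts per million."""
    delta = mz * ppm / 1e6
    return mz - delta, mz + delta


# Single MS1 scan.
scan = con.execute(
    "SELECT mz, intensity FROM ms1 WHERE scan = ? ORDER BY mz", (1,)).fetchall()

# Extracted ion chromatogram: intensity over time within a ppm window.
lo, hi = mz_window(118.0865, ppm=5)
chromatogram = con.execute(
    "SELECT rt, SUM(intensity) FROM ms1 WHERE mz BETWEEN ? AND ? "
    "GROUP BY rt ORDER BY rt", (lo, hi)).fetchall()

# Retention time range: number of data points between two time bounds.
rt_slice = con.execute(
    "SELECT COUNT(*) FROM ms1 WHERE rt BETWEEN ? AND ?", (1.04, 1.11)).fetchone()[0]

# Fragment search: which precursors produce a fragment near m/z 58.066?
flo, fhi = mz_window(58.0660, ppm=10)
precursors = con.execute(
    "SELECT DISTINCT premz FROM ms2 WHERE fragmz BETWEEN ? AND ?",
    (flo, fhi)).fetchall()
```

Each extraction is a single readable SELECT statement, and the indexes on the m/z columns are what would let range queries avoid scanning every row in a larger file.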
High-throughput genomic data analysis consists of the inexorably intertwined inputs and outputs of a vast array of bioinformatic analysis tools. To guarantee streamlined and reproducible analyses, the often complex data analysis pipelines need to be run using workflow management tools. Nextflow is one popular tool commonly used to automate such pipelines. Nextflow records key pipeline data, such as the submission time, start time, completion time, CPU usage, memory usage, and disk usage for each task run. These data are stored in log files, often scattered across a file system. Therefore, aggregating information about resource usage critical for the optimization of Nextflow pipelines and improving reproducibility, as well as parsing and managing such log data, can quickly become cumbersome. Here, we present a web-based tool, Nextpie, which provides both a database and a reporting tool for Nextflow pipelines. Nextpie stores comprehensive resource usage information in a relational database, thus facilitating and accelerating the performance of a variety of data analyses and interactive visualizations, providing an easily comprehensible overview of a pipeline's resource usage. The Nextpie source code, user documentation, an SQLite database with test data, and a Nextflow example pipeline are available at GitHub (https://github.com/bishwaG/Nextpie).
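The benefit of moving per-task resource records from scattered log files into a relational database can be sketched in Python as follows. This is not Nextpie's code or schema: the task table, the function load_trace and the run name are hypothetical, and the input is a simplified stand-in for a Nextflow trace file that is assumed to contain raw numeric values.

```python
import csv
import io
import sqlite3

# A tiny stand-in for a Nextflow trace file with raw numeric values
# (realtime in milliseconds, peak_rss in bytes); the rows are made up.
TRACE = (
    "task_id\tname\tstatus\trealtime\t%cpu\tpeak_rss\n"
    "1\tFASTQC (s1)\tCOMPLETED\t60000\t95.0\t500000000\n"
    "2\tFASTQC (s2)\tCOMPLETED\t90000\t97.0\t700000000\n"
    "3\tALIGN (s1)\tCOMPLETED\t600000\t380.0\t8000000000\n"
)

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE task (
    run_name TEXT, task_id INTEGER, process TEXT, status TEXT,
    realtime_ms INTEGER, cpu_pct REAL, peak_rss_bytes INTEGER)""")


def load_trace(con, run_name, handle):
    """Parse a tab-separated trace and store one row per task."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = [
        # The process name is the task name without its "(sample)" tag.
        (run_name, int(r["task_id"]), r["name"].split(" ")[0], r["status"],
         int(r["realtime"]), float(r["%cpu"]), int(r["peak_rss"]))
        for r in reader
    ]
    con.executemany("INSERT INTO task VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    con.commit()
    return len(rows)


loaded = load_trace(con, "run_2024_01", io.StringIO(TRACE))

# Per-process summary: task count, mean runtime (s), peak memory (GB).
summary = con.execute("""
    SELECT process, COUNT(*), AVG(realtime_ms) / 1000.0, MAX(peak_rss_bytes) / 1e9
    FROM task WHERE run_name = ? GROUP BY process ORDER BY process
""", ("run_2024_01",)).fetchall()
```

Once several runs are loaded under different run names, the same table supports comparisons across runs with a single GROUP BY query, which is difficult to do against log files spread over a file system.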
This article explores the performance optimizations of an embedded database memory management system to ensure high responsiveness of real-time healthcare data frameworks. SQLite is a popular embedded database engine extensively used in medical and healthcare data storage systems. However, SQLite is essentially built around lightweight applications in mobile devices, and its performance deteriorates significantly when a large transaction is issued, such as one involving high-resolution medical images or a massive health dataset, which is unlikely to occur in embedded systems but is quite common in other systems. Such transactions do not fit in the in-memory buffer of SQLite, and SQLite enforces memory reclamation as they are processed. The problem is that the current SQLite buffer management scheme does not effectively manage these cases, and the naïve reclamation scheme used significantly increases the user-perceived latency. Motivated by this limitation, this paper identifies the causes of high latency during processing of a large transaction, and overcomes the limitation via proactive and coarse-grained memory cleaning in SQLite. The proposed memory reclamation scheme was implemented in SQLite 3.29, and measurement studies with a prototype implementation demonstrated that the SQLite operation latency decreases by 13% on average and by up to 17.3% with our memory reclamation scheme as compared to that of the original version.
Implementation of technical hardware in hospitals is often delayed due to unclear responsibilities. In an interdisciplinary approach we developed a Technical Eligibility Items (TEI) list that maps technical requirements to responsible vendor and hospital departments. The TEI list is machine-readable and stored in an SQLite database, streamlining the implementation process and enabling future automated solutions.
The American Heart Association's Get With The Guidelines (GWTG) has emerged as a vital resource in advancing the standards and practices of inpatient care across stroke, heart failure, coronary artery disease, atrial fibrillation, and resuscitation focus areas. The GWTG registry data have also created new opportunities for secondary use of real-world clinical data in biomedical research. Our goal was to implement a scalable database with an integrated user interface (UI) to improve GWTG data management and accessibility. The curation of registry data begins by going through a data processing and quality control pipeline programmed in Python. This pipeline includes data cleaning and record exclusion, variable derivation and unit harmonization, limited data set preparation, and documentation generation of the registry data. The database was built using PostgreSQL, and integrations between the database and the UI were built using the Django Web Framework in Python. Smaller subsets of data were created using SQLite database files for distribution purposes. Use cases of these tools are provided in the article. We implemented an automated data curation pipeline, centralized database, and UI application for the American Heart Association GWTG registry data. The database and the UI are accessible through a Precision Medicine Platform workspace. As of March 2024, the database contains over 13.2 million cleaned GWTG patient records. The SQLite subsets benefit researchers by optimizing data extraction and manipulation using Structured Query Language. The UI improves accessibility for nontechnical researchers by presenting data in a user-friendly tabular format with intuitive filtering options. With the implementation of the GWTG database and UI application, we addressed data management and accessibility concerns despite its growing scale. 
We have launched tools that provide streamlined access to GWTG registry data for all researchers, regardless of familiarity or experience in coding.