OMEGA V2 is open-source software for GPU-accelerated image reconstruction in positron emission tomography (PET), single photon emission computed tomography (SPECT), and computed tomography (CT). The software offers flexible GPU-accelerated image reconstruction methods and tools for imaging algorithm development, which are accessible from Python, MATLAB, and GNU Octave. This paper presents the software architecture, projector models, and algorithms, and demonstrates the software's performance with realistic high-resolution 3D examples from PET, SPECT, and cone-beam CT (CBCT).

Approach:
OMEGA V2 is based on OpenCL and CUDA, allowing wide GPU support. The software provides modular forward/backprojector operators and a broad suite of built-in iterative algorithms and regularization models. It supports features such as time-of-flight imaging, list-mode reconstruction, multi-resolution approaches, and various physical corrections including attenuation, scatter, and normalization.

Main results:
OMEGA V2 provides a unified open-source reconstruction framework for PET, SPECT, and CT, including hybrid workflows such as PET/CT and SPECT/CT. It provides cross-vendor GPU acceleration via OpenCL, supporting AMD and Intel devices alongside CUDA-capable GPUs, and introduces a new Python interface that complements and mirrors the existing MATLAB/GNU Octave workflow. The software substantially extends prior OMEGA releases with SPECT functionality, extensive CT functionality, and a Python implementation that interoperates with libraries such as PyTorch. High-resolution 3D examples in PET, SPECT, and CBCT demonstrate high-quality reconstructions and fast runtimes on modern consumer GPUs.

Significance:
The combination of PET, SPECT, and CT in an open-source, GPU-optimized framework with broad algorithmic and projector coverage offers a unified suite for computational imaging, method development, and translation of methods in CT, PET, and SPECT, and their hybrid combinations (PET/CT, SPECT/CT). Its open-source nature, extensive algorithm library, and flexible programming interfaces enable users to develop custom reconstruction methods with access to GPU-accelerated projectors.
Volumetric ultrafast ultrasound produces massive datasets with high frame rates, dense reconstruction grids, and large channel counts. Beamforming computational demands limit research throughput and prevent real-time applications in emerging modalities such as elastography, functional neuroimaging, and microscopy. We developed mach, an open-source, GPU-accelerated beamformer with a highly optimized delay-and-sum CUDA kernel and an accessible Python interface. mach uses a hybrid delay computation strategy that substantially reduces memory overhead compared with fully precomputed approaches. The CUDA implementation optimizes memory layout for coalesced access and reuses delay computations across frames via shared memory. We benchmarked mach on the PyMUST rotating disk dataset and validated numerical accuracy against existing open-source beamformers. mach processes 1.1 trillion points per second on a consumer-grade GPU, achieving >10× faster performance than existing open-source GPU beamformers. On the PyMUST rotating disk benchmark, mach completes reconstruction in 0.23 ms, 6× faster than the acoustic round-trip time to the imaging depth. Validation against other beamformers confirms numerical accuracy with errors below −60 dB for Power Doppler and −120 dB for B-mode. mach achieves 1.1 trillion points per second throughput, enabling real-time 3D ultrafast ultrasound reconstruction for the first time on consumer-grade hardware. By eliminating the beamforming bottleneck, mach enables real-time applications such as 3D functional neuroimaging, intraoperative guidance, and ultrasound localization microscopy. mach is freely available at https://github.com/Forest-Neurotech/mach.
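
To make the delay-and-sum operation concrete, the following is a minimal CPU sketch in plain Python. It is not mach's kernel or API: the function name `delay_and_sum`, the 0-degree plane-wave transmit model, the nearest-sample lookup, and all numeric values are assumptions chosen for illustration. Delays here are computed on the fly for every point and element, which is the simplest of the strategies the abstract contrasts.

```python
import math

def delay_and_sum(rf, element_x, fs, c, points):
    """Beamform one 0-degree plane-wave acquisition by nearest-sample delay-and-sum.

    rf[e][n]  : sample n of receive channel e (t = 0 at transmit)
    element_x : lateral element positions in metres (elements lie at z = 0)
    fs, c     : sampling rate in Hz and speed of sound in m/s
    points    : list of (x, z) image points in metres
    """
    n_samples = len(rf[0])
    out = []
    for x, z in points:
        acc = 0.0
        for e, xe in enumerate(element_x):
            # Transmit path: the plane wave reaches depth z after z / c.
            # Receive path: the echo travels from (x, z) back to (xe, 0).
            t = (z + math.hypot(x - xe, z)) / c
            n = round(t * fs)
            if 0 <= n < n_samples:
                acc += rf[e][n]
        out.append(acc)
    return out

# Synthetic check: one point scatterer at (0, 20 mm) seen by three elements.
fs, c = 20e6, 1540.0                      # assumed sampling rate and sound speed
element_x = [-0.01, 0.0, 0.01]
scatterer = (0.0, 0.02)
rf = [[0.0] * 1024 for _ in element_x]
for e, xe in enumerate(element_x):
    t = (scatterer[1] + math.hypot(scatterer[0] - xe, scatterer[1])) / c
    rf[e][round(t * fs)] = 1.0            # unit echo at the expected arrival sample

# Beamform at the scatterer position and at an empty location.
image = delay_and_sum(rf, element_x, fs, c, [scatterer, (0.0, 0.01)])
```

In this toy case the three echoes add coherently at the scatterer position and nothing is picked up at the empty location. The two nested loops are independent per image point, which is what makes the operation a natural fit for one GPU thread per point.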
The continually increasing volume of sequence data results in a growing demand for fast implementations of core algorithms. Computation of pairwise alignments based on dynamic programming is an important part of many bioinformatics pipelines and a major contributor to overall runtime due to the associated quadratic time complexity. This motivates the need for a library of efficient implementations on modern GPUs for a variety of alignment algorithms for different types of sequence data including DNA, RNA, and proteins. Accelign is a library of accelerated pairwise sequence alignment algorithms for CUDA-enabled GPUs. Its parallelization strategy is based on a common wavefront design that can be adapted to support a variety of dynamic programming algorithms: local, global, and semi-global alignment of genomic and protein sequences with a variety of commonly used scoring schemes supporting one-to-one, one-to-many, or all-to-all pairwise sequence alignments. This leads to peak performances of 16.1 TCUPS and 9.1 TCUPS for computing optimal global alignment scores with linear and affine gap penalties, respectively, on a single RTX PRO 6000 Blackwell GPU. In addition, our library demonstrates significant speedups in several real-world case studies over prior CPU-based (SeqAn, Parasail, BSalign, EdLib, KSW2, WFA2, A*PA2) and GPU-based libraries (ADEPT, GASAL2), and can even outperform highly customized algorithms (WFA-GPU, CUDASW++4.0). Furthermore, the performance of our approach scales linearly with the number of employed GPUs, which makes it feasible to exploit multi-GPU nodes for increased processing speeds. Accelign provides significant speedups for commonly used pairwise alignment algorithms compared to prior implementations. It is freely available at https://github.com/fkallen/Accelign.
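
The wavefront idea can be illustrated on the CPU with a small Python sketch of global alignment scoring with a linear gap penalty. This is not Accelign's implementation or API: the function name `global_score_wavefront` and the default scores are assumptions. The point is the traversal order: every cell on an anti-diagonal depends only on the two preceding anti-diagonals, so all cells of one anti-diagonal could be updated in parallel.

```python
def global_score_wavefront(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score (linear gaps), filled anti-diagonal by anti-diagonal."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        H[i][0] = i * gap                 # prefix of a aligned against gaps only
    for j in range(1, m + 1):
        H[0][j] = j * gap                 # prefix of b aligned against gaps only
    for d in range(2, n + m + 1):         # anti-diagonal index d = i + j
        # Cells on this anti-diagonal are mutually independent.
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
    return H[n][m]
```

For example, `global_score_wavefront("ACGT", "ACT")` aligns three matching characters with one gap. Each inner-loop iteration is one cell update, the unit counted by the CUPS metric quoted above.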
We evaluate stereo-differentiable rendering-based pose estimation for marker-free, real-time surgical robot tracking as an alternative to occlusion-prone marker-based tracking in cluttered surgical environments, potentially improving safety, reducing setup times, and enabling intelligent multi-robot interaction. This work extends the differentiable rendering-based markerless robot pose estimation framework roboreg for online real-time dynamic tracking in two ways. (i) Sequential optimisation propagates pose estimates across consecutive frames, with motion-adaptive hyperparameter tuning balancing convergence and precision during estimation. (ii) CUDA stream parallelisation of the segmentation and optimisation steps is combined with CUDA-graph accelerated segmentation. We collect 38 displacement video sequence datasets with an unobstructed robot and 5 occluded-robot datasets, with static start/end ground-truth pose calibrations and dynamic marker-based reference tracking in between, for accuracy evaluation under different scenarios. Real-time localisation at 30 fps for 1080p video sequences is achieved, up from 14 fps in vanilla roboreg, thereby matching the camera frame rate. Near-1 cm accuracy is demonstrated, with 1.7 cm translational and 0.6° rotational error against static ground-truth pose calibration, and with 1.2 cm average 3D error across 27,460 frames against a marker-based reference standard (1.53 cm in over 1242 frames in the occlusion evaluation). Our method outperforms FoundationPose by 11% (63% on the occlusion dataset) in dynamic estimation and 250% in static estimation, while achieving 6× faster inference. We demonstrate real-time high-resolution marker-free tracking of surgical robots through stereo-differentiable rendering. Localisation accuracy was on par with marker-based approaches and improved upon foundational baselines.
The Richardson-Lucy deconvolution (RLD) algorithm is widely used in fluorescence microscopy to enhance image sharpness, yet its high computational complexity limits scalability for large three-dimensional (3D) datasets and impedes real-time volumetric visualization. Here, we introduce an accelerated RLD approach using a Wiener-Butterworth unmatched backprojector, termed WB-ARL, which flattens the spectral product between the forward and backprojectors while effectively suppressing high-frequency noise beyond the diffraction limit. WB-ARL reduces the number of iterations required by more than 10-fold compared with conventional RLD while maintaining high-fidelity reconstruction. CUDA acceleration further increases the speed of both methods by 40-fold while preserving our method's iterative advantage, for up to a 400× speedup over non-CUDA-accelerated matched backprojectors. We further analyze its robustness to noise and optical aberrations and validate its performance through 3D reconstructions of both wide-field mouse kidney tissue and confocal cell phantoms. Our results demonstrate that WB-ARL enables high-resolution, high-fidelity 3D imaging with significantly reduced computational cost, offering a scalable solution for high-throughput fluorescence microscopy.
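
For readers unfamiliar with where the backprojector enters the iteration, here is a minimal 1D sketch of conventional Richardson-Lucy deconvolution in plain Python. It is an illustration under stated assumptions, not the WB-ARL code: circular boundaries, an odd-length PSF, and the helper names `circ_conv` and `richardson_lucy` are all hypothetical. The `back` argument defaults to the matched backprojector (the flipped PSF); an unmatched kernel such as a Wiener-Butterworth filter would be passed there instead, and its construction is not reproduced here.

```python
def circ_conv(x, k):
    """Circular convolution of x with an odd-length kernel k centred at len(k) // 2."""
    n, c = len(x), len(k) // 2
    return [sum(k[j] * x[(i - (j - c)) % n] for j in range(len(k)))
            for i in range(n)]

def richardson_lucy(y, psf, iterations, back=None, eps=1e-12):
    """Multiplicative RL update: x <- x * back (*) (y / (psf (*) x))."""
    if back is None:
        back = psf[::-1]                  # matched backprojector: flipped PSF
    x = [sum(y) / len(y)] * len(y)        # flat, positive initial estimate
    for _ in range(iterations):
        blurred = circ_conv(x, psf)       # forward projection
        ratio = [yi / max(bi, eps) for yi, bi in zip(y, blurred)]
        correction = circ_conv(ratio, back)   # backprojection of the ratio
        x = [xi * ci for xi, ci in zip(x, correction)]
    return x

# Toy example: a single point source blurred by a 3-tap PSF.
truth = [0.0] * 9
truth[4] = 10.0
psf = [0.25, 0.5, 0.25]
observed = circ_conv(truth, psf)
estimate = richardson_lucy(observed, psf, iterations=50)
```

With the matched backprojector and a PSF that sums to one, the update preserves the total intensity while concentrating it back toward the source position. The iteration count is the quantity an unmatched backprojector aims to reduce.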
Sorting can be approached in two main ways: sequentially and in parallel. In sequential sorting, data is processed in a single-threaded manner, which can be slow for large datasets. In contrast, parallel sorting divides the task across multiple processing units, enabling faster results by processing data simultaneously. Furthermore, Compute Unified Device Architecture (CUDA) technology enables developers to leverage GPU power for general-purpose parallel computing, significantly accelerating tasks like sorting. This paper investigates the GPU-based parallelization of merge sort (MS), quick sort (QS), bubble sort (BS), radix top-k selection sort (RS), and slow sort (SS), presenting optimized algorithms designed for efficient sorting of large datasets using modern GPUs. The primary objective is to evaluate the performance of these algorithms on GPUs utilizing CUDA, with a focus on analyzing both parallel time complexity and space complexity across various data types. Experiments are conducted on four dataset scenarios: randomly generated data, reverse-sorted data, already-sorted data, and nearly-sorted data. In addition, the performance of the GPU-accelerated implementations is compared with that of their sequential counterparts to assess improvements in computational efficiency and scalability. Earlier generations of GPU-based implementations of this type typically achieved acceleration rates between 2× and 9× over scalar CPU code. With newer GPU enhancements, including parallel-aware primitives and radix- or merge-optimized operations, acceleration rates have improved significantly. Our experiments indicate that GPU-based Radix Sort achieves a speedup of approximately 50× (sequential: 240.8 ms, parallel: 4.83 ms) on 10 million random elements. Quick Sort and Merge Sort achieve 97× and 103× speedups, respectively (Quick: 1461.97 ms vs. 15.1 ms; Merge: 2212.33 ms vs. 21.4 ms).
Bubble Sort, while significantly faster in parallel (123,321.9 ms to 7377.8 ms, an ≈17× improvement), remains considerably slower overall. Slow Sort demonstrates a moderate but consistent acceleration, reducing execution time from 74.07 ms in the sequential version to 3.99 ms on the GPU, yielding an ≈18.6× speedup. These experimental findings confirm that the new single-GPU implementations achieve speedups ranging from 17× to over 100×, surpassing the typical gains reported for previous generations and matching or exceeding the acceleration rates reported for cutting-edge parallel sorting algorithms in recent studies.
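
The quoted speedups follow directly from the reported timings as the ratio of sequential to parallel time. The short Python check below recomputes them; the dictionary name `timings_ms` is just a label for this illustration, and the numbers are copied from the abstract.

```python
# (sequential ms, GPU-parallel ms) as reported in the abstract above.
timings_ms = {
    "Radix Sort": (240.8, 4.83),
    "Quick Sort": (1461.97, 15.1),
    "Merge Sort": (2212.33, 21.4),
    "Bubble Sort": (123321.9, 7377.8),
    "Slow Sort": (74.07, 3.99),
}

# Speedup = sequential time / parallel time.
speedups = {name: seq / par for name, (seq, par) in timings_ms.items()}

for name, s in speedups.items():
    print(f"{name}: {s:.1f}x")
```

Note that a large speedup does not imply a fast algorithm: Bubble Sort gains about 17× yet its GPU time is still far longer than the sequential time of any other method listed.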
High-quality, densely annotated data serve as a crucial foundation for developing robust X-ray angiography segmentation models. However, obtaining per-object pixel-level annotations in the medical domain is both expensive and time-consuming, often requiring close collaboration between clinical experts and developers. This paper aims to reduce the annotation costs of X-ray angiography videos by leveraging few-shot video object segmentation (FSVOS), which separates target objects from the background using only a single annotated frame during inference. We introduce a novel FSVOS model that employs a local matching strategy to restrict the search space to the most relevant neighboring pixels. Rather than relying on inefficient standard im2col-like implementations (e.g., spatial convolutions, depthwise convolutions and feature-shifting mechanisms) or hardware-specific CUDA kernels (e.g., deformable and neighborhood attention), which often suffer from limited portability across non-CUDA devices, we reorganize the local sampling process through a direction-based sampling perspective. Specifically, we implement a non-parametric sampling mechanism that enables dynamically varying sampling regions. This approach provides the flexibility to adapt to diverse spatial structures without the computational costs of parametric layers and the need for model retraining. To further enhance feature coherence across frames, we design a supervised spatio-temporal contrastive learning scheme that enforces consistency in feature representations. In addition, we introduce a publicly available benchmark dataset for multi-object segmentation in X-ray angiography videos (MOSXAV), featuring detailed, manually labeled segmentation ground truth. 
Extensive experiments on the CADICA, XACV, and MOSXAV datasets show that our proposed FSVOS method outperforms current state-of-the-art video segmentation methods in terms of segmentation accuracy and generalization capability (i.e., seen and unseen categories). This work offers enhanced flexibility and potential for a wide range of clinical applications. Code is available at: https://github.com/xilin-x/XRAVOS.
Minibeam Radiation Therapy (MBRT) is a form of spatially fractionated radiation therapy (SFRT) with electrons, protons, ions or kilovoltage photon beams, aiming to improve the therapeutic window compared to broad-beam irradiation. MBRT with X-rays uses an orthovoltage source and an array of narrow beamlets produced by collimator slits with a typical width of 0.2 mm to 0.7 mm and a center-to-center spacing of 1 to 4 mm. We developed an updated version of gCTD, a fast Monte Carlo (MC) code, initially designed for cone-beam CT imaging, to implement a treatment planning system for photon MBRT in mice within a virtual imaging and dosimetry tool (VIT-MBRT). We compared gCTD, running on a single GPU platform (CUDA environment), with TOPAS, a well-validated CPU-based simulation code. We simulated irradiation of a water phantom using a multi-slit tungsten collimator with different aperture widths and thicknesses, producing absorbed dose volume data. An example treatment plan was implemented with the gCTD code, using a voxelized mouse phantom derived from a CT scan and evaluating organ dose-volume histograms. Comparison with TOPAS simulations showed good agreement in terms of dose values and peak-to-valley dose ratio (maximum absolute discrepancy of 13% at the phantom entrance surface), with differences of a few percent at depth. Simulations with gCTD achieved a 300-fold reduction in computational time with respect to corresponding TOPAS simulations. We realized and validated a fast GPU-based MC simulation code for minibeam radiotherapy, as the basis of the MC platform VIT-MBRT for kilovoltage MBRT preclinical treatments.
Accurate description of light propagation in tissue is a primary requirement in most optical imaging techniques and makes it possible to quantify, for example, oxygen saturation. The Monte Carlo (MC) method, employing the Henyey-Greenstein (HG) phase function, is a classical numerical approach to simulating the path of photons in tissue. However, it loses accuracy when describing short light propagation distances (<10 mm), where scattering is anisotropic. The aim of this work was to develop and test different approaches to mitigate this deficiency, employing the modified Henyey-Greenstein (MHG) and Gegenbauer (GB) phase functions. The updated scattering angle probability was implemented in the MC toolbox MCXLAB, written in CUDA and callable from MATLAB. Simulations were performed at source-detector distances from 1.5 to 5 mm to test the behavior of the new solutions. We observed higher adaptability of the simulated curve with the MHG and GB phase functions than with the conventional HG, owing to the additional γ parameter in the equation, which allows the anisotropy to be adjusted.
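
As background for the phase-function discussion, the sketch below shows the standard inverse-CDF sampling of the scattering-angle cosine for the conventional HG phase function, written in plain Python. It is not MCXLAB code and does not cover the MHG or GB variants; the function name `sample_hg_cos_theta` is an assumption. In a photon-transport loop, `xi` would be a uniform random number in [0, 1); evenly spaced values are used here only so the check is deterministic.

```python
def sample_hg_cos_theta(g, xi):
    """Map a uniform variate xi in [0, 1] to cos(theta) under Henyey-Greenstein.

    g is the anisotropy factor (mean cosine of the scattering angle), -1 < g < 1.
    """
    if abs(g) < 1e-6:
        return 2.0 * xi - 1.0             # isotropic limit
    frac = (1.0 - g * g) / (1.0 - g + 2.0 * g * xi)
    return (1.0 + g * g - frac * frac) / (2.0 * g)

# Deterministic check of the defining property: the mean cosine equals g.
g = 0.9                                   # typical forward-peaked tissue value
n = 100_000
mean_cos = sum(sample_hg_cos_theta(g, (k + 0.5) / n) for k in range(n)) / n
```

The HG function has the single shape parameter g. The phase functions studied in the abstract add a further parameter, which is what gives the extra freedom at short source-detector distances.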
Relative binding free energy (RBFE) calculations, widely used to predict the potencies of congeneric small molecules binding to a protein receptor, can greatly increase the efficiency of the hit-to-lead and lead optimization stages of the drug discovery process. Traditional RBFE methods, however, cannot be easily applied to small molecules lacking a common core or binding mode, precluding their use in a challenging but crucial component of many drug discovery campaigns. In principle, an absolute binding free energy (ABFE) method can be applied to such molecules, but ABFE often suffers from high computational cost and poor statistical convergence due to the large amount of additional sampling required when compared to RBFE. Here, we introduce core-hopping binding free energy (CBFE) calculations, a computationally efficient framework for the accurate determination of relative binding free energies between small molecules with different cores, leveraging several recently developed techniques such as Alchemical Enhanced Sampling (ACES) with optimized transformation pathways and flexible λ-spacing, as well as λ-dependent Boresch restraints. We benchmark the performance of CBFE across 4 protein systems consisting of 56 small molecules, and find that the results are consistent with RBFE for a congeneric series of ligands and offer considerable improvement in computational cost and precision relative to ABFE results for a series of small molecules with diverse cores and binding modes. All CBFE-related developments are fully implemented in the GPU-accelerated AMBER free energy module (pmemd.cuda) and are available as part of the latest official AMBER release.
Modeling the wireless radiance field (WRF) is fundamental to modern communication systems, enabling key tasks such as localization, sensing, and channel estimation. Traditional approaches, which rely on empirical formulas or physical simulations, often suffer from limited accuracy or require strong scene priors. Recent neural radiance field (NeRF)-based methods improve reconstruction fidelity through differentiable volumetric rendering, but their reliance on computationally expensive multilayer perceptron (MLP) queries hinders real-time deployment. To overcome these challenges, we introduce Gaussian splatting (GS) to the wireless domain, leveraging its efficiency in modeling optical radiance fields to enable compact and accurate WRF reconstruction. Specifically, we propose SwiftWRF, a deformable 2D Gaussian splatting framework that synthesizes WRF spectra at arbitrary positions under single-sided transceiver mobility. SwiftWRF employs CUDA-accelerated rasterization to render spectra at over 100k FPS and uses a lightweight MLP to model the deformation of 2D Gaussians, effectively capturing mobility-induced WRF variations. In addition to novel spectrum synthesis, the efficacy of SwiftWRF is further underscored by its application to angle-of-arrival (AoA) and received signal strength indicator (RSSI) prediction. Experiments conducted on both real-world and synthetic indoor scenes demonstrate that SwiftWRF can reconstruct WRF spectra up to 500× faster than existing state-of-the-art methods, while significantly improving signal quality.
The van der Waals (vdW) interaction is ubiquitous in materials and is long-range by nature. To facilitate vdW-included atomic simulations in large systems with tens of thousands of atoms, we developed LASP-D3, a Compute Unified Device Architecture (CUDA) implementation of the DFT-D3 method on graphics processing unit (GPU) devices, which realizes fast vdW corrections compatible with state-of-the-art machine-learning potential calculations. Our implementation achieves a linear-scaling time complexity, O(N), for large periodic systems, being up to 2 orders of magnitude faster than all current versions for systems above 100,000 atoms, and significantly reduces GPU memory consumption compared to existing PyTorch-based GPU implementations. By combining LASP-D3 with the generalized global neural network potential developed by us, we show that the leading solid electrolyte LiTaCl6 can achieve high conductivity, where the vdW interaction plays a key role in governing Li-ion diffusion and the simulated conductivity reproduces experimental measurements.
We present an open-source, graphics processing unit (GPU)-accelerated software implementation of the Uneyama-Doi model (UDM) for studying the collective dynamics of block copolymer blends and solutions. The UDM provides a field-theoretic framework that includes the entropy of mixing, binary interactions between segment species, and molecular connectivity, thereby capturing interfacial properties even in the strong-segregation regime. Our implementation utilizes a semi-implicit time-stepping scheme, incorporates thermal noise, and employs a concentration-conserving regularization algorithm that maintains non-negative concentrations. Spatial derivatives and convolutions are computed via optimized CUDA-based pseudo-spectral methods, enabling simulations of systems spanning tens of polymer end-to-end distances and thousands of molecular relaxation times within hours on a single GPU. We validate the implementation against established results, including the mean-field phase diagram of diblock copolymers, structure factors of disordered systems, and the fluctuation-induced order-disorder transition for symmetric copolymers. Dynamic simulations reproduce experimentally observed amphiphilic morphologies, including micellar lattices, vesicles, and phase-separated structures. The software provides an efficient and versatile tool for investigating equilibrium and nonequilibrium behavior of complex polymer systems.
X-ray propagation-based phase contrast imaging, a well-established imaging technology in synchrotron radiation facilities, enables high-resolution 3D structural reconstruction. Nevertheless, the phase retrieval process required to restore quantitative phase information from holograms remains a significant challenge. Existing software solutions face problems such as performance bottlenecks and limitations in hardware support. Here, we describe a high-performance software package named HiHolo, based on the CUDA-MPI architecture, for the holographic regime, and propose three improved iterative phase retrieval algorithms, providing an efficient framework for achieving high-quality holographic reconstruction. Experimental results demonstrate that HiHolo achieves a 24%-37% performance improvement compared with current mainstream software and exhibits near-linear scalability in multi-GPU systems. The alternating projections with probe algorithm effectively reduces artifacts in traditional empty beam correction by simultaneously optimizing both object and probe wavefields; the extrapolation iteration method computationally enhances the spatial resolution of a limited field of view; furthermore, the parallel iterative reprojection optimizes the efficiency of 3D reconstruction, achieving a speedup of about 6-14 times compared with the serial version.
We present MXtalTools, a flexible Python package for the data-driven modeling of molecular crystals, facilitating machine learning studies of the molecular solid state. MXtalTools comprises several classes of utilities: (1) synthesis, collation, and curation of molecule and crystal data sets, (2) integrated workflows for model training and inference, (3) crystal parametrization and representation, (4) crystal structure sampling and optimization, (5) end-to-end differentiable crystal sampling, construction, and analysis. Our modular functions can be integrated into existing workflows or combined and used to build novel modeling pipelines. MXtalTools leverages CUDA acceleration to enable high-throughput crystal modeling. The Python code is available open-source on our GitHub page, with detailed documentation on ReadTheDocs.
Deep learning has become a key tool for carbonate thin-section image analysis. However, the lack of large public datasets limits reproducibility and fair model comparison. To address this, we present DeepCarbonate, a cleaned and standardized benchmark dataset. Samples were collected from the Ediacaran Dengying, Cambrian Longwangmiao, and Triassic Leikoupo and Jialingjiang Formations in the Sichuan Basin, China, and the Cretaceous Mishrif Formation in the UAE. The dataset was curated by petroleum geology experts; invalid images (blurred, low brightness, or corrupted) were removed through expert voting and 2σ filtering, and all images were reorganized in the ImageNet format. DeepCarbonate contains 22 lithological categories, hierarchically organized by optical mode (PPL, XPL, R) and split into train, validation, and test subsets, ensuring standardized benchmarking and reproducible experiments. Using PyTorch with CUDA acceleration, we evaluated ResNet, VGG, DenseNet, MobileNet, and EfficientNet models under baseline, ablation, long-tailed distribution, and balanced Top-9 subset experiments. Results highlight the dataset's value as a robust benchmark for carbonate petrography research and applications.
This study numerically solves inhomogeneous Helmholtz equations modeling acoustic wave propagation in homogeneous and lossless, absorbing and dispersive, and inhomogeneous and nonlinear media. The traditional Born series (TBS) method has been employed to solve such equations. Simulated pressure field patterns for a linear array of acoustic sources (a line source) estimated by the TBS procedure exhibit excellent agreement with those of a standard time domain approach (the k-Wave toolbox). For instance, the maximum absolute error of the normalized pressure amplitude produced by the proposed technique for the homogeneous and lossless medium is ≈2% with respect to the latter method. The TBS scheme, though iterative, is a very fast method. For example, the graphics processing unit (GPU)-enabled CUDA C code implementing the TBS procedure for calculating the pressure field for the homogeneous and lossless medium is 102× faster than the k-Wave module and also 4× faster than the corresponding central processing unit (CPU) C code for the computational domain considered in this study (4096×4096). The findings of this study demonstrate the effectiveness of the TBS method for solving the inhomogeneous Helmholtz equation, while the GPU-based implementation significantly reduces the computation time. In this work, the capability and performance of the method have been tested in two dimensions only.
Quantitative phase microscopy (QPM) is a holographic imaging technique often applied to studying cell morphology. To advance QPM for clinical applications, high-throughput implementations have been developed to allow imaging of thousands of cells at a time. To meet the needs of processing raw data and creating QPM holographic images, higher throughput processing methods are needed. Here, we report on the use of a system-on-module approach for QPM data processing. We have developed a real-time processing pipeline that leverages the parallel processing capabilities of the NVIDIA Jetson Orin Nano to implement processing of cell data. We demonstrate this pipeline on a holographic cytometry (HC) system, a high-throughput QPM implementation. The CUDA processing algorithm enables the generation of QPM data from raw interferograms followed by phase unwrapping, cell segmentation, and refocusing. We captured, processed, and analyzed 107,631 red blood cell images. The processing speed reaches 1200 cells/s in the speed test. Benchmarking shows that real-time refocusing maintains a high degree of structural similarity to the traditional refocusing method. The result demonstrates that our pipeline could accelerate the statistical analysis of cell populations. We expect this study to benefit the development of a portable, low-cost HC system.
Community detection methods are applied to single-cell RNA sequencing (scRNA-seq) and mass cytometry data to efficiently identify major cell types and their subtypes, but their computational demands are increasing with the substantial growth in dataset sizes. The Leiden algorithm, an emerging method in this field, offers inherent parallelism that remains underutilized due to the limited parallel processing capabilities of modern multi-core CPUs, which have fewer than 100 cores (typically 32-64 cores). However, Leiden can achieve significant performance gains when implemented on GPUs. GPUs offer high memory bandwidth and an extensive array of parallel processing units that map well to the parallelism in Leiden. As far as we know, cuGraph is the only implementation that has mapped the Leiden algorithm to GPUs, using a blend of Python and C. However, it only supports undirected graphs, potentially discarding the valuable information carried by edge directionality. In addition, this Python implementation for GPUs is comparatively slower than a C/C++ based implementation, reducing the significant performance gains provided by a GPU-based speedup. Conversely, a C/C++ based implementation optimizes performance more effectively, ensuring an accurate baseline comparison when performing GPU acceleration. We developed a tool named gLeiden, a lightweight CUDA C++ based GPU implementation of the Leiden algorithm and, to the best of our knowledge, the very first GPU implementation that supports directed graphs, which generally demand nearly twice the computational time and memory resources of undirected graphs. The results show that our directed gLeiden outperforms the directed cLeiden version, with 11× and 12× speedups on very large datasets. Our undirected ucLeiden and ugLeiden implementations significantly outperform the original Java version, with up to 42× speedup on large datasets.
When the undirected ugLeiden version is compared with cuGraph, its performance is comparable on smaller datasets and 58% faster on larger datasets. These results position our GPU-based Leiden implementation as a high-performance alternative to existing state-of-the-art community detection tools. The source code and sample data are available at: https://github.com/Beenishgul/Leiden and https://figshare.com/s/3b51e463a56e2a374bdf.
Reconstructing large-scale 3D scenes remains challenging due to the need to balance photorealistic quality, real-time rendering, and compact storage. Recent progress in 3D Gaussian Splatting (3DGS) has achieved impressive fidelity and speed, yet its large-scale application suffers from excessive primitive counts, leading to prohibitive storage and rendering costs. To overcome this inefficiency, we introduce a novel semantic-guided hybrid representation that unifies textured meshes and 3D Gaussians in a differentiable framework. The key idea is to leverage meshes for geometrically regular regions such as roads and building facades, while reserving Gaussians for fine, complex details like vegetation. Our method is realized through three key technical contributions. First, we develop a semantic-guided adaptive modeling pipeline that fuses multi-view segmentation onto the scene mesh to robustly partition the scene and prune redundant Gaussians. Second, we introduce a high-performance CUDA-based hybrid renderer that seamlessly combines mesh rasterization with Gaussian splatting, enabling correct occlusion handling and joint optimization of both representations. Finally, we propose a mesh-guided sampling strategy that adaptively adds Gaussians to recover fine details in under-reconstructed areas. Extensive experiments on diverse large-scale datasets demonstrate that our approach significantly reduces storage requirements and accelerates rendering performance while maintaining comparable or superior visual quality.