Modern datacenters increasingly rely on low-power, single-slot inference accelerators to balance performance, energy efficiency, and rack density constraints. The NVIDIA T4 GPU has become widely deployed due to strong performance per watt and mature software support. Its successor, the NVIDIA L4 GPU, introduces improvements in Tensor Core throughput, cache capacity, memory bandwidth, and parallel execution capability. However, limited empirical evidence quantifies the practical inference performance gap between these two generations under controlled and reproducible conditions. This work introduces DEEP-GAP, a systematic evaluation extending the GDEV-AI methodology to GPU inference. Using identical configurations and workloads, we evaluate ResNet18, ResNet50, and ResNet101 across FP32, FP16, and INT8 precision modes using PyTorch and TensorRT. Results show that reduced precision significantly improves performance, with INT8 achieving up to 58x throughput improvement over CPU baselines. L4 achieves up to 4.4x higher throughput than T4 while reaching peak efficiency at smaller batch sizes between 16 and 32, improving latency-throughput tradeoffs for latency-sensitive workloads. T4 re
In GitHub with its 518 million hosted projects, performance changes within these projects are highly relevant to the project's users. Although performance measurement is supported by GitHub CI/CD, performance change detection is a challenging topic. In this paper, we demonstrate how we incorporated Nyrkiö to MooBench. Prior to this work, Moobench continuously ran on GitHub virtual machines, measuring overhead of tracing agents, but without change detection. By adding the upload of the measurements to the Nyrkiö change detection service, we made it possible to detect performance changes. We identified one major performance regression and examined the performance change in depth. We report that (1) it is reproducible with GitHub actions, and (2) the performance regression is caused by a Linux Kernel version change.
The hardware diversity in leadership-class computing facilities, alongside the immense performance boosts from today's GPUs when computing in lower precision, incentivizes scientific HPC workflows to adopt mixed-precision algorithms and performance portability models. We present an on-the-fly framework using hipify for performance portability and apply it to FFTMatvec - an HPC application that computes matrix-vector products with block-triangular Toeplitz matrices. Our approach enables FFTMatvec, initially a CUDA-only application, to run seamlessly on AMD GPUs with excellent performance. Performance optimizations for AMD GPUs are integrated into the open-source rocBLAS library, keeping the application code unchanged. We then present a dynamic mixed-precision framework for FFTMatvec; a Pareto front analysis determines the optimal mixed-precision configuration for a desired error tolerance. Results are shown for AMD Instinct MI250X, MI300X, and the newly launched MI355X GPUs. The performance-portable, mixed-precision FFTMatvec is scaled to 4,096 GPUs on the OLCF Frontier supercomputer.
This study was aimed at simultaneously achieving sufficient accuracy and high performance for general matrix multiplications. Recent architectures, such as NVIDIA GPUs, feature high-performance units designed for low-precision matrix multiplications in machine learning models, and next-generation architectures are expected to follow the same design principle. The key to achieving superior performance is to fully leverage such architectures. The Ozaki scheme, a highly accurate matrix multiplication algorithm using error-free transformations, enables higher-precision matrix multiplication to be performed through multiple lower-precision matrix multiplications and higher-precision matrix additions. Ootomo et al. implemented the Ozaki scheme on high-performance matrix multiplication units with the aim of achieving both sufficient accuracy and high performance. This paper proposes alternative approaches to improving performance by reducing the numbers of lower-precision matrix multiplications and higher-precision matrix additions. Numerical experiments demonstrate the accuracy of the results and conduct performance benchmarks of the proposed approaches. These approaches are expected to
The high-performance computing (HPC) community has recently seen a substantial diversification of hardware platforms and their associated programming models. From traditional multicore processors to highly specialized accelerators, vendors and tool developers back up the relentless progress of those architectures. In the context of scientific programming, it is fundamental to consider performance portability frameworks, i.e., software tools that allow programmers to write code once and run it on different computer architectures without sacrificing performance. We report here on the benefits and challenges of performance portability using a field-line tracing simulation and a particle-in-cell code, two relevant applications in computational plasma physics with applications to magnetically-confined nuclear-fusion energy research. For these applications we report performance results obtained on four HPC platforms with server-class CPUs from Intel (Xeon) and AMD (EPYC), and high-end GPUs from Nvidia and AMD, including the latest Nvidia H100 GPU and the novel AMD Instinct MI300A APU. Our results show that both Kokkos and OpenMP are powerful tools to achieve performance portability and d
This review examines the roles of adaptation and synchronization in music performance, drawing on concepts from complex systems theory to understand the dynamic interactions between musicians, music, and listeners. Adaptation is explored through how musicians adjust their cognitive, emotional, and motor systems across the stages of preparation, execution, and reception, while synchronization is emphasized as essential for aligning internal states, coordinating actions with other performers, and engaging with the audience. The review highlights the interdisciplinary nature of music performance research, integrating cognitive, motor, and emotional processes to enhance both individual and collective musical expression. It also addresses the psychological state of flow, which arises from synchronized neurocognitive mechanisms that optimize performance. Additionally, the emotional synchronization facilitated by music is explored, emphasizing its role in both individual emotional coherence and social coordination within musical ensembles. Finally, the review highlights recent findings on interpersonal and inter-brain synchronization, particularly in live music performances and improvisat
Container startup latency is a critical performance metric for CI/CD pipelines, serverless computing, and auto-scaling systems, yet practitioners lack empirical guidance on how infrastructure choices affect this latency. We present a systematic measurement study that decomposes Docker container startup into constituent operations across three heterogeneous infrastructure tiers: Azure Premium SSD (cloud SSD), Azure Standard HDD (cloud HDD), and macOS Docker Desktop (developer workstation with hypervisor-based virtualization). Using a reproducible benchmark suite that executes 50 iterations per test across 10 performance dimensions, we quantify previously under-characterized relationships between infrastructure configuration and container runtime behavior. Our key findings include: (1) container startup is dominated by runtime overhead rather than image size, with only 2.5% startup variation across images ranging from 5 MB to 155 MB on SSD; (2) storage tier selection imposes a 2.04x startup penalty (HDD 1157 ms vs. SSD 568 ms); (3) Docker Desktop's hypervisor layer introduces a 2.69x startup penalty and 9.5x higher CPU throttling variance compared to native Linux; (4) OverlayFS write
Configuring stream processing systems for efficient performance, especially in cloud-native deployments, is a challenging and largely manual task. We present an experiment-driven approach for automated configuration optimization that combines three phases: Latin Hypercube Sampling for initial exploration, Simulated Annealing for guided stochastic search, and Hill Climbing for local refinement. The workflow is integrated with the cloud-native Theodolite benchmarking framework, enabling automated experiment orchestration on Kubernetes and early termination of underperforming configurations. In an experimental evaluation with Kafka Streams and a Kubernetes-based cloud testbed, our approach identifies configurations that improve throughput by up to 23% over the default. The results indicate that Latin Hypercube Sampling with early termination and Simulated Annealing are particularly effective in navigating the configuration space, whereas additional fine-tuning via Hill Climbing yields limited benefits.
Efficient simulation of complex plasma dynamics is crucial for advancing fusion energy research. Particle-in-Cell (PIC) Monte Carlo (MC) simulations provide insights into plasma behavior, including turbulence and confinement, which are essential for optimizing fusion reactor performance. Transitioning to exascale simulations introduces significant challenges, with traditional file input/output (I/O) inefficiencies remaining a key bottleneck. This work advances BIT1, an electrostatic PIC MC code, by improving the particle mover with OpenMP task-based parallelism, integrating the openPMD streaming API, and enabling in-memory data streaming with ADIOS2's Sustainable Staging Transport (SST) engine to enhance I/O performance, computational efficiency, and system storage utilization. We employ profiling tools such as gprof, perf, IPM and Darshan, which provide insights into computation, communication, and I/O operations. We implement time-dependent data checkpointing with the openPMD API enabling seamless data movement and in-situ visualization for real-time analysis without interrupting the simulation. We demonstrate improvements in simulation runtime, data accessibility and real-time i
GPU-embedded systems have gained popularity across various domains due to their efficient power consumption. However, in order to meet the demands of real-time or time-consuming applications running on these systems, it is crucial for them to be tuned to exhibit high performance. This paper addresses the issue by developing and comparing two tuning methodologies on GPU-embedded systems, and also provides performance insights for developers and researchers seeking to optimize applications running on these architectures. We focus on parallel prefix operations, such as FFT, scan primitives, and tridiagonal system solvers, which are performance-critical components in many applications. The study introduces an analytical model-driven tuning methodology and a Machine Learning (ML)-based tuning methodology. We evaluate the performance of the two tuning methodologies for different parallel prefix implementations of the BPLG library in an NVIDIA Jetson system, and compare their performance to the ones achieved through an exhaustive search. The findings shed light on the best strategies for handling the open challenge of performance portability for major computational patterns among server a
Performance analysis of microservices can be a challenging task, as a typical request to these systems involves multiple Remote Procedure Calls (RPC) spanning across independent services and machines. Practitioners primarily rely on distributed tracing tools to closely monitor microservices performance. These tools enable practitioners to trace, collect, and visualize RPC workflows and associated events in the context of individual end-to-end requests. While effective for analyzing individual end-to-end requests, current distributed tracing visualizations often fall short in providing a comprehensive understanding of the system's overall performance. To address this limitation, we propose a novel visualization approach that enables aggregate performance analysis of multiple end-to-end requests. Our approach builds on a previously developed technique for comparing structural differences of request pairs and extends it for aggregate performance analysis of sets of requests. This paper presents our proposal and discusses our preliminary ongoing progress in developing this innovative approach.
Collective operations are cornerstones of both HPC applications and large-scale AI training and inference, yet benchmarking them in a systematic and reproducible way remains difficult on modern systems due to the complexity of their hardware and software stacks. Existing suites primarily report end-to-end timings and offer limited support for controlled algorithm and configuration selection, fine-grained profiling, and capturing the runtime environment. We present PICO (Performance Insights for Collective Operations), an open-source framework that decouples portable experiment setup from platform execution, provides a backend-adaptive parameter selection interface across MPI and NCCL, supplies plain-MPI reference collective implementations, optionally instrumentable, and records the system configuration for reproducible comparisons. Evaluated on three major supercomputers, PICO shows that default collective algorithms and transport settings can be up to $5\times$ slower than the best available choice. It provides diagnostic evidence by isolating topology sensitive algorithmic choices and, through instrumentation, reveals detailed algorithmic breakdowns. To assess end-to-end effects
Graphics Processing Units (GPUs) have become an integral part of High-Performance Computing to achieve an Exascale performance. The main goal of application developers of GPU is to tune their code extensively to obtain optimal performance, making efficient use of different resources available. While extracting optimal performance of applications on an HPC infrastructure, developers should also ensure the applications have the least energy usage considering the massive power consumption of data centres and HPC servers. This thesis presents two models developed which can be utilized by developers in analysing the CUDA kernel's energy dissipation. The first one is a model that predicts the CUDA kernel's execution time. Here a PTX code is statically analysed to extract instruction features, control flow, and data dependence. We propose two scheduling algorithm approaches that satisfy the performance and hardware constraints. The second model is a static analysis-based power prediction built by utilizing machine learning techniques. Features used for building the model are derived using static analysis of PTX code. These features are chosen to understand the relationship between GPU pow
Processor design validation and debug is a difficult and complex task, which consumes the lion's share of the design process. Design bugs that affect processor performance rather than its functionality are especially difficult to catch, particularly in new microarchitectures. This is because, unlike functional bugs, the correct processor performance of new microarchitectures on complex, long-running benchmarks is typically not deterministically known. Thus, when performance benchmarking new microarchitectures, performance teams may assume that the design is correct when the performance of the new microarchitecture exceeds that of the previous generation, despite significant performance regressions existing in the design. In this work, we present a two-stage, machine learning-based methodology that is able to detect the existence of performance bugs in microprocessors. Our results show that our best technique detects 91.5% of microprocessor core performance bugs whose average IPC impact across the studied applications is greater than 1% versus a bug-free design with zero false positives. When evaluated on memory system bugs, our technique achieves 100% detection with zero false posi
The first generation of exascale systems will include a variety of machine architectures, featuring GPUs from multiple vendors. As a result, many developers are interested in adopting portable programming models to avoid maintaining multiple versions of their code. It is necessary to document experiences with such programming models to assist developers in understanding the advantages and disadvantages of different approaches. To this end, this paper evaluates the performance portability of a SYCL implementation of a large-scale cosmology application (CRK-HACC) running on GPUs from three different vendors: AMD, Intel, and NVIDIA. We detail the process of migrating the original code from CUDA to SYCL and show that specializing kernels for specific targets can greatly improve performance portability without significantly impacting programmer productivity. The SYCL version of CRK-HACC achieves a performance portability of 0.96 with a code divergence of almost 0, demonstrating that SYCL is a viable programming model for performance-portable applications.
The rapid advancement of machine learning (ML) technologies has driven the development of specialized hardware accelerators designed to facilitate more efficient model training. This paper introduces the CARAML benchmark suite, which is employed to assess performance and energy consumption during the training of transformer-based large language models and computer vision models on a range of hardware accelerators, including systems from NVIDIA, AMD, and Graphcore. CARAML provides a compact, automated, extensible, and reproducible framework for assessing the performance and energy of ML workloads across various novel hardware architectures. The design and implementation of CARAML, along with a custom power measurement tool called jpwr, are discussed in detail.
The utilization of performance monitoring probes is a valuable tool for programmers to gather performance data. However, the manual insertion of these probes can result in an increase in code size, code obfuscation, and an added burden of learning different APIs associated with performance monitoring tools. To mitigate these issues, EDPM, an embedded domain-specific language, was developed to provide a higher level of abstraction for annotating regions of code that require instrumentation in C and C++ programs. This paper presents the design and implementation of EDPM and compares it to the well-known tool PAPI, in terms of required lines of code, flexibility in configuring regions, and performance overhead. The results of this study demonstrate that EDPM is a low-resolution profiling tool that offers a reduction in required lines of code and enables programmers to express various configurations of regions. Furthermore, the design of EDPM is such that its pragmas are ignored by the standard compiler, allowing for seamless integration into existing software processes without disrupting build systems or increasing the size of the executable. Additionally, the design of the EDPM pre-c
The term "performance portability" has been informally used in computing to refer to a variety of notions which generally include: 1) the ability to run one application across multiple hardware platforms; and 2) achieving some notional level of performance on these platforms. However, there has been a noticeable lack of consensus on the precise meaning of the term, and authors' conclusions regarding their success (or failure) to achieve performance portability have thus been subjective. Comparing one approach to performance portability with another has generally been marked with vague claims and verbose, qualitative explanation of the comparison. This paper presents a concise definition for performance portability, along with a simple metric that accurately captures the performance and portability of an application across different platforms. The utility of this metric is then demonstrated with a retroactive application to previous work.
Hardware performance monitoring (HPM) is a crucial ingredient of performance analysis tools. While there are interfaces like LIKWID, PAPI or the kernel interface perf\_event which provide HPM access with some additional features, many higher level tools combine event counts with results retrieved from other sources like function call traces to derive (semi-)automatic performance advice. However, although HPM is available for x86 systems since the early 90s, only a small subset of the HPM features is used in practice. Performance patterns provide a more comprehensive approach, enabling the identification of various performance-limiting effects. Patterns address issues like bandwidth saturation, load imbalance, non-local data access in ccNUMA systems, or false sharing of cache lines. This work defines HPM event sets that are best suited to identify a selection of performance patterns on the Intel Haswell processor. We validate the chosen event sets for accuracy in order to arrive at a reliable pattern detection mechanism and point out shortcomings that cannot be easily circumvented due to bugs or limitations in the hardware.
Software systems continuously evolve due to new functionalities, requirements, or maintenance activities. In the context of software evolution, software refactoring has gained a strategic relevance. The space of possible software refactoring is usually very large, as it is given by the combinations of different refactoring actions that can produce software system alternatives. Multi-objective algorithms have shown the ability to discover alternatives by pursuing different objectives simultaneously. Performance of such algorithms in the context of software model refactoring is of paramount importance. Therefore, in this paper, we conduct a performance analysis of three genetic algorithms to compare them in terms of performance and quality of solutions. Our results show that there are significant differences in performance among the algorithms (e.g., PESA2 seems to be the fastest one, while NSGA-II shows the least memory usage).