Computing demands for large scientific experiments, such as the CMS experiment at the CERN LHC, will increase dramatically in the next decades. To complement the future performance increases of software running on central processing units (CPUs), explorations of coprocessor usage in data processing hold great potential and interest. Coprocessors are a class of computer processors that supplement CPUs, often improving the execution of certain functions due to architectural design choices. We explore the approach of Services for Optimized Network Inference on Coprocessors (SONIC) and study the deployment of this as-a-service approach in large-scale data processing. In the studies, we take a data processing workflow of the CMS experiment and run the main workflow on CPUs, while offloading several machine learning (ML) inference tasks onto either remote or local coprocessors, specifically graphics processing units (GPUs). With experiments performed at Google Cloud, the Purdue Tier-2 computing center, and combinations of the two, we demonstrate the acceleration of these ML algorithms individually on coprocessors and the corresponding throughput improvement for the entire workflow. This
Platforms are nowadays typically equipped with tristed execution environments (TEES), such as Intel SGX and ARM TrustZone. However, recent microarchitectural attacks on TEEs repeatedly broke their confidentiality guarantees, including the leakage of long-term cryptographic secrets. These systems are typically also equipped with a cryptographic coprocessor, such as a TPM or Google Titan. These coprocessors offer a unique set of security features focused on safeguarding cryptographic secrets. Still, despite their simultaneous availability, the integration between these technologies is practically nonexistent, which prevents them from benefitting from each other's strengths. In this paper, we propose TALUS, a general design and a set of three main requirements for a secure symbiosis between TEEs and cryptographic coprocessors. We implement a proof-of-concept of TALUS based on Intel SGX and a hardware TPM. We show that with TALUS, the long-term secrets used in the SGX life cycle can be moved to the TPM. We demonstrate that our design is robust even in the presence of transient execution attacks, preventing an entire class of attacks due to the reduced attack surface on the shared hardw
The appearance and disappearance of coprocessors by integration into the CPU, the success or failure of coprocessors are examined by summarizing their characteristics from the mainframes of the 1960s. The coprocessors most particularly reviewed are the IBM 360 and CDC-6600 I/O processors, the Intel 8087 math coprocessor, the Cell processor, the Intel Xeon Phi coprocessors, the GPUs, the FPGAs, and the coprocessors of manycores SW26010 and Pezy SC-2 used in high-ranked supercomputers in the TOP500 or Green500. The conditions for a coprocessor to be viable in the medium or long-term are defined.
The maximal sensitivity of the Smith-Waterman (SW) algorithm has enabled its wide use in biological sequence database search. Unfortunately, the high sensitivity comes at the expense of quadratic time complexity, which makes the algorithm computationally demanding for big databases. In this paper, we present SWAPHI, the first parallelized algorithm employing Xeon Phi coprocessors to accelerate SW protein database search. SWAPHI is designed based on the scale-and-vectorize approach, i.e. it boosts alignment speed by effectively utilizing both the coarse-grained parallelism from the many co-processing cores (scale) and the fine-grained parallelism from the 512-bit wide single instruction, multiple data (SIMD) vectors within each core (vectorize). By searching against the large UniProtKB/TrEMBL protein database, SWAPHI achieves a performance of up to 58.8 billion cell updates per second (GCUPS) on one coprocessor and up to 228.4 GCUPS on four coprocessors. Furthermore, it demonstrates good parallel scalability on varying number of coprocessors, and is also superior to both SWIPE on 16 high-end CPU cores and BLAST+ on 8 cores when using four coprocessors, with the maximum speedup of 1.
This paper concerns development of a high-performance implementation of the Particle-in-Cell method for plasma simulation on Intel Xeon Phi coprocessors. We discuss suitability of the method for Xeon Phi architecture and present our experience of porting and optimization of the existing parallel Particle-in-Cell code PICADOR. Direct porting with no code modification gives performance on Xeon Phi close to 8-core CPU on a benchmark problem with 50 particles per cell. We demonstrate step-by-step application of optimization techniques such as improving data locality, enhancing parallelization efficiency and vectorization that leads to 3.75 x speedup on CPU and 7.5 x on Xeon Phi. The optimized version achieves 18.8 ns per particle update on Intel Xeon E5-2660 CPU and 9.3 ns per particle update on Intel Xeon Phi 5110P. On a real problem of laser ion acceleration in targets with surface grating that requires a large number of macroparticles per cell the speedup of Xeon Phi compared to CPU is 1.6 x.
In the next decade, the demands for computing in large scientific experiments are expected to grow tremendously. During the same time period, CPU performance increases will be limited. At the CERN Large Hadron Collider (LHC), these two issues will confront one another as the collider is upgraded for high luminosity running. Alternative processors such as graphics processing units (GPUs) can resolve this confrontation provided that algorithms can be sufficiently accelerated. In many cases, algorithmic speedups are found to be largest through the adoption of deep learning algorithms. We present a comprehensive exploration of the use of GPU-based hardware acceleration for deep learning inference within the data reconstruction workflow of high energy physics. We present several realistic examples and discuss a strategy for the seamless integration of coprocessors so that the LHC can maintain, if not exceed, its current performance throughout its running.
Efficient ordinary differential equation solvers for chemical kinetics must take into account the available thread and instruction-level parallelism of the underlying hardware, especially on many-core coprocessors, as well as the numerical efficiency. A stiff Rosenbrock and nonstiff Runge-Kutta solver are implemented using the single instruction, multiple thread (SIMT) and single instruction, multiple data (SIMD) paradigms with OpenCL. The performances of these parallel implementations were measured with three chemical kinetic models across several multicore and many-core platforms. Two runtime benchmarks were conducted to clearly determine any performance advantage offered by either method: evaluating the right-hand-side source terms in parallel, and integrating a series of constant-pressure homogeneous reactors using the Rosenbrock and Runge-Kutta solvers. The right-hand-side evaluations with SIMD parallelism on the host multicore Xeon CPU and many-core Xeon Phi co-processor performed approximately three times faster than the baseline multithreaded code. The SIMT model on the host and Phi was 13-35% slower than the baseline while the SIMT model on the GPU provided approximately t
We implement the Lanczos algorithm on an Intel Xeon Phi coprocessor and compare its performance to a multi-core Intel Xeon CPU and an NVIDIA graphics processor. The Xeon and the Xeon Phi are parallelized with OpenMP and the graphics processor is programmed with CUDA. The performance is evaluated by measuring the execution time of a single step in the Lanczos algorithm. We study two quantum lattice models with different particle numbers, and conclude that for small systems, the multi-core CPU is the fastest platform, while for large systems, the graphics processor is the clear winner, reaching speedups of up to 7.6 compared to the CPU. The Xeon Phi outperforms the CPU with sufficiently large particle number, reaching a speedup of 2.5.
Computation intensive kernels, such as convolutions, matrix multiplication and Fourier transform, are fundamental to edge-computing AI, signal processing and cryptographic applications. Interleaved-Multi-Threading (IMT) processor cores are interesting to pursue energy efficiency and low hardware cost for edge-computing, yet they need hardware acceleration schemes to run heavy computational workloads. Following a vector approach to accelerate computations, this study explores possible alternatives to implement vector coprocessing units in RISC-V cores, showing the synergy between IMT and data-level parallelism in the target workloads.
The increasing computational demand from growing data rates and complex machine learning (ML) algorithms in large-scale scientific experiments has driven the adoption of the Services for Optimized Network Inference on Coprocessors (SONIC) approach. SONIC accelerates ML inference by offloading it to local or remote coprocessors to optimize resource utilization. Leveraging its portability to different types of coprocessors, SONIC enhances data processing and model deployment efficiency for cutting-edge research in high energy physics (HEP) and multi-messenger astrophysics (MMA). We developed the SuperSONIC project, a scalable server infrastructure for SONIC, enabling the deployment of computationally intensive tasks to Kubernetes clusters equipped with graphics processing units (GPUs). Using NVIDIA Triton Inference Server, SuperSONIC decouples client workflows from server infrastructure, standardizing communication, optimizing throughput, load balancing, and monitoring. SuperSONIC has been successfully deployed for the CMS and ATLAS experiments at the CERN Large Hadron Collider (LHC), the IceCube Neutrino Observatory (IceCube), and the Laser Interferometer Gravitational-Wave Observat
Optimizing charged-particle track reconstruction algorithms is crucial for efficient event reconstruction in Large Hadron Collider (LHC) experiments due to their significant computational demands. Existing track reconstruction algorithms have been adapted to run on massively parallel coprocessors, such as graphics processing units (GPUs), to reduce processing time. Nevertheless, challenges remain in fully harnessing the computational capacity of coprocessors in a scalable and non-disruptive manner. This paper proposes an inference-as-a-service approach for particle tracking in high energy physics experiments. To evaluate the efficacy of this approach, two distinct tracking algorithms are tested: Patatrack, a rule-based algorithm, and Exa$.$TrkX, a machine learning-based algorithm. The as-a-service implementations show enhanced GPU utilization and can process requests from multiple CPU cores concurrently without increasing per-request latency. The impact of data transfer is minimal and insignificant compared to running on local coprocessors. This approach greatly improves the computational efficiency of charged particle tracking, providing a solution to the computing challenges anti
Linear operations, e.g., vector-matrix and vector-vector multiplications, are core operations of modern neural networks. To diminish computational time, these operations are implemented by parallel computations using different coprocessors. In this work we show that an open quantum system consisting of bosonic modes and interacting with bosonic reservoirs can be used as an analog thermodynamic coprocessor implementing multiple vector-matrix multiplications with stochastic matrices in parallel. Input vectors are encoded in occupancies of reservoirs, and the output result is presented by stationary energy flows. The operation takes time needed for the system's transition to a non-equilibrium stationary state independently on the number of the reservoirs, i.e., on the input vector dimension. With technological limitations being considered, a device of $5\times5$ cm$^2$ area covered with the coprocessors can conduct of the order of $10^{11}$ operations per second per a mode of the OQS. The computations are accompanied by an entropy growth. We construct a direct mapping between open quantum systems and electrical crossbar structures frequently used in analog vector-matrix multiplication
Analog coprocessors are intensively developing nowadays with the aim to optimize energy computations of neural networks. In this work we focus on the possibility of using detection of collective oscillations in optical systems for computational purposes. We show that in a system of coupled resonators, collective oscillations can be used to implement matrix-vector multiplication. The matrix is formed by the coupling constants between the resonators, and the input vector is formed by the initial occupancies of the involved modes. The frequency of the collective oscillations is growing with the number of the involved modes, similarly to Rabi oscillations. The time needed for their detection, i.e., averaging, decreases with an increase in the input vector dimension. We discuss the limitations imposed on parallel computation in the system by restriction of the allowed optical frequency band.
This paper proposes a model that enables permissionless and decentralized networks for complex computations. We explore the integration and optimize load balancing in an open, decentralized computational network. Our model leverages economic incentives and reputation-based mechanisms to dynamically allocate tasks between operators and coprocessors. This approach eliminates the need for specialized hardware or software, thereby reducing operational costs and complexities. We present a mathematical model that enhances restaking processes in blockchain systems by enabling operators to delegate complex tasks to coprocessors. The model's effectiveness is demonstrated through experimental simulations, showcasing its ability to optimize reward distribution, enhance security, and improve operational efficiency. Our approach facilitates a more flexible and scalable network through the use of economic commitments, adaptable dynamic rating models, and a coprocessor load incentivization system. Supported by experimental simulations, the model demonstrates its capability to optimize resource allocation, enhance system resilience, and reduce operational risks. This ensures significant improvemen
Analog coprocessors for neural networks are an intensively developing field. They provide approximate results of computations for relatively low energy cost and at high speed. We show that a set of ring resonators with Brillouin interaction between photons and phonons, being coupled to a waveguide, can be used to implement matrix-vector multiplication. The input vector is formed by occupancies of the anti-Stokes optical modes pumped via spontaneous Brillouin scattering, i.e, scattering on thermal phonons. Brillouin scattering rates and coupling constants between ring resonators and the waveguide form the matrix. The system allows for parallel computations in frequency band.
The rapid growth of AI-based Internet-of-Things applications increased the demand for high-performance edge processing engines on a low-power budget and tight area constraints. As a consequence, vector processor architectures, traditionally designed for high-performance computing (HPC), made their way into edge devices, promising high utilization of floating-point units (FPUs) and low power consumption. However, vector processors can only exploit a single dimension of parallelism, leading to expensive accesses to the vector register file (VRF) when performing matrix computations, which are pervasive in AI workloads. To overcome these limitations while guaranteeing programmability, many researchers and companies are developing dedicated instructions for a more efficient matrix multiplication (MatMul) execution. In this context, we propose Quadrilatero, an open-source RISC-V programmable systolic array coprocessor for low-power edge applications that implements a streamlined matrix ISA extension. We evaluate the post-synthesis power, performance, and area (PPA) metrics of Quadrilatero in a mature 65-nm technology node, showing that it requires only 0.65 mm^2 and that it can reach up
We formalize a cross-domain "ZK coprocessor bridge" that lets Solana programs request private execution on Aztec L2 (via Ethereum) using Wormhole Verifiable Action Approvals (VAAs) as authenticated transport. The system comprises: (i) a Solana program that posts messages to Wormhole Core with explicit finality; (ii) an EVM Portal that verifies VAAs, enforces a replay lock, parses a bound payload secretHash||m from the attested VAA, derives a domain-separated field commitment, and enqueues an L1->L2 message into the Aztec Inbox (our reference implementation v0.1.0 currently uses consumeWithSecret(vaa, secretHash); we provide migration guidance to the payload-bound interface); (iii) a minimal Aztec contract that consumes the message privately; and (iv) an off-chain relayer that ferries VAAs and can record receipts on Solana. We present state machines, message formats, and proof sketches for replay-safety, origin authenticity, finality alignment, parameter binding (no relayer front-running of Aztec parameters), privacy, idempotence, and liveness. Finally, we include a concise Reproducibility note with pinned versions and artifacts to replicate a public testnet run.
Recent studies have shown promising results for track finding in dense environments using Graph Neural Network (GNN)-based algorithms. However, GNN-based track finding is computationally slow on CPUs, necessitating the use of coprocessors to accelerate the inference time. Additionally, the large input graph size demands a large device memory for efficient computation, a requirement not met by all computing facilities used for particle physics experiments, particularly those lacking advanced GPUs. Furthermore, deploying the GNN-based track-finding algorithm in a production environment requires the installation of all dependent software packages, exclusively utilized by this algorithm. These computing challenges must be addressed for the successful implementation of GNN-based track-finding algorithm into production settings. In response, we introduce a ``GNN-based tracking as a service'' approach, incorporating a custom backend within the NVIDIA Triton inference server to facilitate GNN-based tracking. This paper presents the performance of this approach using the Perlmutter supercomputer at NERSC.
Techniques enabling large language models (LLMs) to "think more" by generating and attending to intermediate reasoning steps have shown promise in solving complex problems. However, the standard approaches generate sequences of discrete tokens immediately before responding, and so they can incur significant latency costs and be challenging to optimize. In this work, we demonstrate that a frozen LLM can be augmented with an offline coprocessor that operates on the model's key-value (kv) cache. This coprocessor augments the cache with a set of latent embeddings designed to improve the fidelity of subsequent decoding. We train this coprocessor using the language modeling loss from the decoder on standard pretraining data, while keeping the decoder itself frozen. This approach enables the model to learn, in an end-to-end differentiable fashion, how to distill additional computation into its kv-cache. Because the decoder remains unchanged, the coprocessor can operate offline and asynchronously, and the language model can function normally if the coprocessor is unavailable or if a given cache is deemed not to require extra computation. We show experimentally that when a cache is augmente
Adaptive brain stimulation can treat neurological conditions such as Parkinson's disease and post-stroke motor deficits by influencing abnormal neural activity. Because of patient heterogeneity, each patient requires a unique stimulation policy to achieve optimal neural responses. Model-free reinforcement learning (MFRL) holds promise in learning effective policies for a variety of similar control tasks, but is limited in domains like brain stimulation by a need for numerous costly environment interactions. In this work we introduce Coprocessor Actor Critic, a novel, model-based reinforcement learning (MBRL) approach for learning neural coprocessor policies for brain stimulation. Our key insight is that coprocessor policy learning is a combination of learning how to act optimally in the world and learning how to induce optimal actions in the world through stimulation of an injured brain. We show that our approach overcomes the limitations of traditional MFRL methods in terms of sample efficiency and task success and outperforms baseline MBRL approaches in a neurologically realistic model of an injured brain.