Instruction set architectures are complex, with hundreds of registers and instructions that can modify dozens of them during execution, variably on each instance. Prose-style ISA specifications struggle to capture these intricacies of the ISAs, where often the important details about a single register are spread out across hundreds of pages of documentation. Ensuring that all ISA-state is swapped in context switch implementations of privileged software requires meticulous examination of these pages. This manual process is tedious and error-prone. We propose a tool called Sailor that leverages machine-readable ISA specifications written in Sail to automate this task. Sailor determines the ISA-state necessary to swap during the context switch using the data collected from Sail and a novel algorithm to classify ISA-state as security-sensitive. Using Sailor's output, we identify three different classes of mishandled ISA-state across four open-source confidential computing systems. We further reveal five distinct security vulnerabilities that can be exploited using the mishandled ISA-state. This research exposes an often overlooked attack surface that stems from mishandled ISA-state, en
Heterogeneous-ISA processor designs have attracted considerable research interest. However, unlike their homogeneous-ISA counterparts, explicit software support for bridging ISA heterogeneity is required. The lack of a compilation toolchain ready to support heterogeneous-ISA targets has been a major factor hindering research in this exciting emerging area. For any such compiler, "getting right" the mechanics involved in state transformation upon migration and doing this efficiently is of critical importance. In particular, any runtime conversion of the current program stack from one architecture to another would be prohibitively expensive. In this paper, we design and develop Unifico, a new multi-ISA compiler that generates binaries that maintain the same stack layout during their execution on either architecture. Unifico avoids the need for runtime stack transformation, thus eliminating overheads associated with ISA migration. Additional responsibilities of the Unifico compiler backend include maintenance of a uniform ABI and virtual address space across ISAs. Unifico is implemented using the LLVM compiler infrastructure, and we are currently targeting the x86-64 and ARMv8 ISAs. W
Deep Neural Networks (DNNs) are widely applied across domains and have shown strong effectiveness. As DNN workloads increasingly run on CPUs, dedicated Matrix Processing Units (MPUs) and Matrix Instruction Set Architectures (ISAs) have been introduced. At the same time, sparsity techniques are widely adopted in algorithms to reduce computational cost. Despite these advances, insufficient hardware-algorithm co-optimization leads to suboptimal performance. On the memory side, sparse DNNs incur irregular access patterns that cause high cache miss rates. While runahead execution is a promising prefetching technique, its direct application to MPUs is often ineffective due to significant prefetch redundancy. On the compute side, stride constraints in current Matrix ISAs prevent the densification of multiple logically related sparse operations, resulting in poor utilization of MPU processing elements. To address these irregularities, we propose DARE, an irregularity-tolerant MPU with a Densifying ISA and filtered Runahead Execution. DARE extends the ISA to support densifying sparse operations and equips a lightweight runahead mechanism with filtering capability. Experimental results show
Edge AI deployment faces critical challenges balancing computational performance, energy efficiency, and resource constraints. This paper presents FPGA-accelerated RISC-V instruction set architecture (ISA) extensions for efficient neural network inference on resource-constrained edge devices. We introduce a custom RISC-V core with four novel ISA extensions (FPGA.VCONV, FPGA.GEMM, FPGA.RELU, FPGA.CUSTOM) and integrated neural network accelerators, implemented and validated on the Xilinx PYNQ-Z2 platform. The complete system achieves 2.14x average latency speedup and 49.1% energy reduction versus an ARM Cortex-A9 software baseline across four benchmark models (MobileNet V2, ResNet-18, EfficientNet Lite, YOLO Tiny). Hardware implementation closes timing with +12.793 ns worst negative slack at 50 MHz while using 0.43% LUTs and 11.4% BRAM for the base core and 38.8% DSPs when accelerators are active. Hardware verification confirms successful FPGA deployment with verified 64 KB BRAM memory interface and AXI interconnect functionality. All performance metrics are obtained from physical hardware measurements. This work establishes a reproducible framework for ISA-guided FPGA acceleration t
Solana is an emerging blockchain platform, recognized for its high throughput and low transaction costs, positioning it as a preferred infrastructure for Decentralized Finance (DeFi), Non-Fungible Tokens (NFTs), and other Web 3.0 applications. In the Solana ecosystem, transaction initiators submit various instructions to interact with a diverse range of Solana smart contracts, among which are decentralized exchanges (DEXs) that utilize automated market makers (AMMs), allowing users to trade cryptocurrencies directly on the blockchain without the need for intermediaries. Despite the high throughput and low transaction costs of Solana, the advantages have exposed Solana to bot spamming for financial exploitation, resulting in the prevalence of failed transactions and network congestion. Prior work on Solana has mainly focused on the evaluation of the performance of the Solana blockchain, particularly scalability and transaction throughput, as well as on the improvement of smart contract security, leaving a gap in understanding the characteristics and implications of failed transactions on Solana. To address this gap, we conducted a large-scale empirical study of failed transactions o
Modern microprocessors extend their instruction set architecture (ISA) with Single Instruction, Multiple Data (SIMD) operations to improve performance. The Intel Advanced Vector Extensions (AVX) enhance the x86 ISA and are widely supported in Intel and AMD processors. The latest version, AVX10.2, places a strong emphasis on low-precision, non-standard floating-point formats, including bfloat16 and E4M3/E5M2 float8 (OCP 8-bit Floating Point, OFP8), primarily catering to deep learning applications rather than general-purpose arithmetic. However, as these formats remain within the IEEE 754 framework, they inherit its limitations, introducing inconsistencies and added complexity into the ISA. This paper examines the recently proposed tapered-precision takum floating-point format, which has been shown to offer significant advantages over IEEE 754 and its derivatives as a general-purpose number format. Using AVX10.2 as a case study, the paper explores the potential benefits of replacing the multitude of floating-point formats with takum as a uniform basis. The results indicate a more consistent instruction set, improving readability and flexibility while offering potential for 8- and 16-
Progress has recently been made on specifying instruction set architectures (ISAs) in executable formalisms rather than through prose. However, to date, those formal specifications are limited to the functional aspects of the ISA and do not cover its security guarantees. We present a novel, general method for formally specifying an ISAs security guarantees to (1) balance the needs of ISA implementations (hardware) and clients (software), (2) can be semi-automatically verified to hold for the ISA operational semantics, producing a high-assurance mechanically-verifiable proof, and (3) support informal and formal reasoning about security-critical software in the presence of adversarial code. Our method leverages universal contracts: software contracts that express bounds on the authority of arbitrary untrusted code. Universal contracts can be kept agnostic of software abstractions, and strike the right balance between requiring sufficient detail for reasoning about software and preserving implementation freedom of ISA designers and CPU implementers. We semi-automatically verify universal contracts against Sail implementations of ISA semantics using our Katamaran tool; a semi-automatic
The evolution of quantization and mixed-precision techniques has unlocked new possibilities for enhancing the speed and energy efficiency of NNs. Several recent studies indicate that adapting precision levels across different parameters can maintain accuracy comparable to full-precision models while significantly reducing computational demands. However, existing embedded microprocessors lack sufficient architectural support for efficiently executing mixed-precision NNs, both in terms of ISA extensions and hardware design, resulting in inefficiencies such as excessive data packing/unpacking and underutilized arithmetic units. In this work, we propose novel ISA extensions and a micro-architecture implementation specifically designed to optimize mixed-precision execution, enabling energy-efficient deep learning inference on RISC-V architectures. We introduce MaRVIn, a cross-layer hardware-software co-design framework that enhances power efficiency and performance through a combination of hardware improvements, mixed-precision quantization, ISA-level optimizations, and cycle-accurate emulation. At the hardware level, we enhance the ALU with configurable mixed-precision arithmetic (2, 4
In-cache computing technology transforms existing caches into long-vector compute units and offers low-cost alternatives to building expensive vector engines for mobile CPUs. Unfortunately, existing long-vector Instruction Set Architecture (ISA) extensions, such as RISC-V Vector Extension (RVV) and Arm Scalable Vector Extension (SVE), provide only one-dimensional strided and random memory accesses. While this is sufficient for typical vector engines, it fails to effectively utilize the large Single Instruction, Multiple Data (SIMD) widths of in-cache vector engines. This is because mobile data-parallel kernels expose limited parallelism across a single dimension. Based on our analysis of mobile vector kernels, we introduce a long-vector Multi-dimensional Vector ISA Extension (MVE) for mobile in-cache computing. MVE achieves high SIMD resource utilization and enables flexible programming by abstracting cache geometry and data layout. The proposed ISA features multi-dimensional strided and random memory accesses and efficient dimension-level masked execution to encode parallelism across multiple dimensions. Using a wide range of data-parallel mobile workloads, we demonstrate that MVE
Spiking Neural Network processing promises to provide high energy efficiency due to the sparsity of the spiking events. However, when realized on general-purpose hardware -- such as a RISC-V processor -- this promise can be undermined and overshadowed by the inefficient code, stemming from repeated usage of basic instructions for updating all the neurons in the network. One of the possible solutions to this issue is the introduction of a custom ISA extension with neuromorphic instructions for spiking neuron updating, and realizing those instructions in bespoke hardware expansion to the existing ALU. In this paper, we present the first step towards realizing a large-scale system based on the RISC-V-compliant processor called IzhiRISC-V, supporting the custom neuromorphic ISA extension.
As is widely known, the computational speed and power consumption are two critical parameters in microprocessor design. A solution for these issues is the application specific instruction set processor (ASIP) methodology, which can improve speed and reduce power consumption of the general purpose processor (GPP) technique. In ASIP, changing the instruction set architecture (ISA) of the processor will lead to alter the number and the mean time of accesses to the cache memory. This issue has a direct impact on the processor energy consumption. In this work, we study the impacts of extended ISA on the energy consumption of the extended ISA processor. Also, we demonstrate the extended ISA let the designer to reduce the cache size in order to minimize the energy consumption while meeting performance constraint.
In this article, we introduce an instruction set architecture (ISA) for processing-in-memory (PIM) based deep neural network (DNN) accelerators. The proposed ISA is for DNN inference on PIM-based architectures. It is assumed that the weights have been trained and programmed into PIM-based DNN accelerators before inference, and they are fixed during inference. We do not restrict the devices of PIM-based DNN accelerators. Popular devices used to build PIM-based DNN accelerators include resistive random-access memory (RRAM), flash, ferroelectric field-effect transistor (FeFET), static random-access memory (SRAM), etc. The target DNNs include convolutional neural networks (CNNs) and multi-layer perceptrons (MLPs). The proposed ISA is transparent to both applications and hardware implementations. It enables to develop unified toolchains for PIM-based DNN accelerators and software stacks. For practical hardware that uses a different ISA, the generated instructions by unified toolchains can easily converted to the target ISA. The proposed ISA has been used in the open-source DNN compiler PIMCOMP-NN (https://github.com/sunxt99/PIMCOMP-NN) and the associated open-source simulator PIMSIM-NN
ISA extensions are increasingly adopted to boost the performance of specialized workloads without requiring an entire architectural redesign. However, these enhancements can inadvertently expose new attack surfaces in the microarchitecture. In this paper, we investigate Intel's recently introduced cldemote extension, which promotes efficient data sharing by transferring cache lines from upper-level caches to the Last Level Cache (LLC). Despite its performance benefits, we uncover critical properties-unprivileged access, inter-cache state transition, and fault suppression-that render cldemote exploitable for microarchitectural attacks. We propose two new attack primitives, Flush+Demote and Demote+Time, built on our analysis. Flush+Demote constructs a covert channel with a bandwidth of 2.84 Mbps and a bit error rate of 0.018%, while Demote+Time derandomizes the kernel base address in 2.49 ms on Linux. Furthermore, we show that leveraging cldemote accelerates eviction set construction in non-inclusive LLC designs by obviating the need for helper threads or extensive cache conflicts, thereby reducing construction time by 36% yet retaining comparable success rates. Finally, we examine h
Recent advancements in quantization and mixed-precision approaches offers substantial opportunities to improve the speed and energy efficiency of Neural Networks (NN). Research has shown that individual parameters with varying low precision, can attain accuracies comparable to full-precision counterparts. However, modern embedded microprocessors provide very limited support for mixed-precision NNs regarding both Instruction Set Architecture (ISA) extensions and their hardware design for efficient execution of mixed-precision operations, i.e., introducing several performance bottlenecks due to numerous instructions for data packing and unpacking, arithmetic unit under-utilizations etc. In this work, we bring together, for the first time, ISA extensions tailored to mixed-precision hardware optimizations, targeting energy-efficient DNN inference on leading RISC-V CPU architectures. To this end, we introduce a hardware-software co-design framework that enables cooperative hardware design, mixed-precision quantization, ISA extensions and inference in cycle-accurate emulations. At hardware level, we firstly expand the ALU unit within our proof-of-concept micro-architecture to support con
Processing-in-memory (PIM) has shown extraordinary potential in accelerating neural networks. To evaluate the performance of PIM accelerators, we present an ISA-based simulation framework including a dedicated ISA targeting neural networks running on PIM architectures, a compiler, and a cycleaccurate configurable simulator. Compared with prior works, this work decouples software algorithms and hardware architectures through the proposed ISA, providing a more convenient way to evaluate the effectiveness of software/hardware optimizations. The simulator adopts an event-driven simulation approach and has better support for hardware parallelism. The framework is open-sourced at https://github.com/wangxy-2000/pimsim-nn.
Despite the recent improvements in supporting Persistent Hardware Transactions (PHTs) on emerging persistent memories (PM), the poor performance of Read-Only (RO) transactions remains largely overlooked. We propose DUMBO, a new design for PHT that eliminates the two most crucial bottlenecks that hinder RO transactions in state-of-the-art PHT. At its core, DUMBO exploits advanced instructions that some contemporary HTMs provide to suspend (and resume) transactional access tracking. Our experimental evaluation with an IBM POWER9 system using the TPC-C benchmark shows that DUMBO can outperform the state of the art designs for persistent hardware (SPHT) and software memory transactions (Pisces), by up to 4.0x.
Hardware Transactional Memory (HTM) allows lock-free programming as easy as with traditional coarse-grain locks or similar, while benefiting from the performance advantages of fine-grained locking. Many HTM implementations have been proposed, but they have not received widespread adoption because of their high hardware complexity, their need for additions to the Instruction Set Architecture (ISA), and often for modifications to the cache coherence protocol. We show that HTM can be implemented without adding new instructions -- merely by extending the semantics of two existing, Load-Linked and Store-Conditional. Also, our proposed design does not modify or extend standard coherence protocols. We further propose to drastically simplify the implementation of HTM -- confined to modifications in the L1 Data Cache only -- by restricting it to applications where the write set plus the read set of each transaction do not exceed a small number of cache lines. We also propose two alternative mechanisms to guarantee forward progress, both based on detecting retrial attempts. We simulated our proposed design in Gem5, and we used it to implement several popular concurrent data structures, showi
Blockchains are being positioned as the "technology of trust" that can be used to mediate transactions between non-trusting parties without the need for a central authority. They support transaction types that are native to the blockchain platform or user-defined via user programs called smart contracts. Despite the significant flexibility in transaction programmability that smart contracts offer, they pose several usability, robustness, and performance challenges. This paper proposes an alternative transaction framework that incorporates more primitives into the native set of transaction types (reducing the likelihood of requiring user-defined transaction programs often). The framework is based on the concept of declarative blockchain transactions whose strength lies in the fact that it addresses several of the limitations of smart contracts simultaneously. A formal and implementation framework is presented, and a subset of commonly occurring transaction behaviors are modeled and implemented as use cases, using an open-source blockchain database, BigchchainDB, as the implementation context. A performance study comparing the declarative transaction approach to equivalent smart cont
Symbolic execution is an SMT-based software verification and testing technique. Symbolic execution requires tracking performed computations during software simulation to reason about branches in the software under test. The prevailing approach on symbolic execution of binary code tracks computations by transforming the code to be tested to an architecture-independent IR and then symbolically executes this IR. However, the resulting IR must be semantically equivalent to the binary code, making this process complex and error-prone. The semantics of the binary code are specified by the targeted ISA, commonly given in natural language and requiring a manual implementation of the transformation to an IR. In recent years, the use of formal languages to describe ISA semantics in a machine-readable way has gained increased popularity. We investigate the utilization of such formal semantics for symbolic execution of binary code, achieving an accurate representation of instruction semantics. We present a prototype for the RISC-V ISA and conduct a case study to demonstrate that it can be easily extended to additional instructions. Furthermore, we perform an experimental comparison with prior
Cross-chain bridges are essential decentralized applications (DApps) to facilitate interoperability between different blockchain networks. Unlike regular DApps, the functionality of cross-chain bridges relies on the collaboration of information both on and off the chain, which exposes them to a wider risk of attacks. According to our statistics, attacks on cross-chain bridges have resulted in losses of nearly 4.3 billion dollars since 2021. Therefore, it is particularly necessary to understand and detect attacks on cross-chain bridges. In this paper, we collect the largest number of cross-chain bridge attack incidents to date, including 49 attacks that occurred between June 2021 and September 2024. Our analysis reveal that attacks against cross-chain business logic cause significantly more damage than those that do not. These cross-chain attacks exhibit different patterns compared to normal transactions in terms of call structure, which effectively indicates potential attack behaviors. Given the significant losses in these cases and the scarcity of related research, this paper aims to detect attacks against cross-chain business logic, and propose the BridgeGuard tool. Specifically,