With the increasing popularity of AArch64 processors in general-purpose computing, securing software running on AArch64 systems against control-flow hijacking attacks has become a critical part toward secure computation. Shadow stacks keep shadow copies of function return addresses and, when protected from illegal modifications and coupled with forward-edge control-flow integrity, form an effective and proven defense against such attacks. However, AArch64 lacks native support for write-protected shadow stacks, while software alternatives either incur prohibitive performance overhead or provide weak security guarantees. We present InversOS, the first hardware-assisted write-protected shadow stacks for AArch64 user-space applications, utilizing commonly available features of AArch64 to achieve efficient intra-address space isolation (called Privilege Inversion) required to protect shadow stacks. Privilege Inversion adopts unconventional design choices that run protected applications in the kernel mode and mark operating system (OS) kernel memory as user-accessible; InversOS therefore uses a novel combination of OS kernel modifications, compiler transformations, and another AArch64 fe
The plasma membrane and accompanying cortex serve as major hubs of signal transduction and cytoskeletal activities that collectively regulate cell physiological processes such as migration, polarity, macropinocytosis, phagocytosis, and cytokinesis. Yet, dynamically tracking membrane-cortex associated protein or lipid kinetics from live-cell image series remains challenging, primarily due to the difficulty of accurately extracting and aligning the cell boundary between consecutive frames as the cell continuously deforms and moves. Here, we present Membrane Kymograph Generator, a cross-platform software that accepts multichannel time-lapse live-cell fluorescent imaging datasets and automates boundary tracking, inter-frame alignment, and intensity sampling along the boundary. The software implements a rotational offset minimization algorithm that aligns boundaries across consecutive frames by exhaustively searching for the optimal angular shift that minimizes point-to-point distances, while handling variations in boundary point counts due to cell shape changes. The software outputs kymographs representing spatiotemporal dynamics of membrane-associated proteins or biosensors, allows users to fine-tune visualization parameters through an interactive interface, and provides built-in correlation analysis tools for multi-channel datasets. Furthermore, a native Python API enables programmatic usage for batch processing and further downstream analysis. Validation tests demonstrated that the Membrane Kymograph Generator accurately tracks, visualizes, and quantitates the spatial kinetics of a wide array of membrane proteins and lipid biosensors over extended time periods, in a variety of cell types including Dictyostelium amoeba, human neutrophils, mouse macrophages, and mammalian cancer cells. The GUI-based software is user-friendly, requires no technical expertise, and significantly reduces the manual effort required for kymograph generation and analysis while ensuring high accuracy and reproducibility. Membrane Kymograph Generator is free and open-source, licensed under GNU General Public License 3.0 or later. It can be installed on both x86-64 and AArch64/ARM64 computers running Windows, macOS, or any standard Linux distribution. The software is distributed as standalone installer files and portable executables targeting specific architectures and operating systems, requiring no dependency resolution. The source code, documentation/wiki, installers, portable binaries, and test data are freely available at https://github.com/tatsatb/membrane-kymograph-generator. The software can also be installed via PIP (package ID: membrane-kymograph, https://pypi.org/project/membrane-kymograph) and accessed programmatically via a built-in Python API. The source code is also archived on Zenodo (DOI: 10.5281/zenodo.20318834).
Sequence alignment pipelines for human genomes are an emerging workload that will dominate in the precision medicine field. BWA-MEM2 is a tool widely used in the scientific community to perform read mapping studies. In this paper, we port BWA-MEM2 to the AArch64 architecture using the ARMv8-A specification, and we compare the resulting version against an Intel Skylake system both in performance and in energy-to-solution. The porting effort entails numerous code modifications, since BWA-MEM2 implements certain kernels using x86_64 specific intrinsics, e.g., AVX-512. To adapt this code we use the recently introduced Arm's Scalable Vector Extensions (SVE). More specifically, we use Fujitsu's A64FX processor, the first to implement SVE. The A64FX powers the Fugaku Supercomputer that led the Top500 ranking from June 2020 to November 2021. After porting BWA-MEM2 we define and implement a number of optimizations to improve performance in the A64FX target architecture. We show that while the A64FX performance is lower than that of the Skylake system, A64FX delivers 11.6% better energy-to-solution on average. All the code used for this article is available at https://gitlab.bsc.es/rlangari/bwa-a64fx.
Modern compilers are complex software systems that must correctly translate high-level programming languages into machine code across multiple architectures. Cranelift, a fast and modern compiler backend originally developed for WebAssembly and recently adopted as an experimental backend for Rust, has gained increasing importance due to its superior compilation speed compared to LLVM and comprehensive multi-architecture support, including x86-64, AArch64, s390x, and RISCV64. However, despite decades of development in compiler testing, testing Cranelift still presents unique challenges, including (1) constructing valid IR under the strict enforcement of SSA form, (2) generating sequences with sufficient computational density to stress backend components, and (3) balancing broad backend coverage with efficient root cause analysis across heterogeneous architectures. To address these challenges, we propose CLIR, a differential testing framework that integrates a syntax-preserving hierarchical generation strategy to guarantee SSA validity, a liveness-guided instruction refinement mechanism to maximize computational density, and a diagnosis-guided cross-architecture adaptation scheme to
Tagging of generic dynamic values is important in symbolic-computation and dynamic-language systems, but the trade-offs change as machine architectures and workloads evolve. In particular, old folklore about boxed values, immediate values, and type tags must be recalibrated from time to time. We revisit the performance of badged object headers, low-bit tagging, and two NaN-boxing layouts on a range of platforms in use today, including AArch64 and x86-64 architectures from different manufacturers. The experiments isolate two distinct effects: the cost avoided by not heap-allocating common scalar values, and the cost avoided by obtaining tag information from the value word rather than by performing a heap read. The results show that several local bit operations are often cheaper than opening a heap object to obtain a tag or small value. Low-bit tagging remains the simplest and usually fastest choice for mostly symbolic workloads, while NaN-boxing is close in access cost and avoids the time and space of heap allocation for ordinary floating-point values.
We present Elevator, the first binary translator that statically translates entire x86-64 executables to AArch64 without debug information, source code, or assumptions about code layout. Unlike existing systems, which rely on heuristics or runtime fallbacks to handle code-versus-data decoding errors, Elevator considers all possible interpretations of every byte and produces a separate translation for each feasible one ahead of time. Any byte may be interpreted as data, an opcode, or an opcode argument; we generate separate control flow paths for all interpretations, pruning only those leading to abnormal termination. Translations are built by composing code "tiles" automatically derived from a high-level description of the source ISA, yielding a nimble translation framework. The approach is deterministic and produces complete, self-contained binaries with no runtime component in the trusted code base. The principal cost is substantial code size expansion. The key benefit is that the output is the actual code that will run, enabling testing, validation, certification, and cryptographic signing prior to deployment, reducing risk compared to emulators or JIT compilers. We evaluate Ele
ARM-based and x86-64 laptop processors differ not only in instruction-set design, but also in memory hierarchy, core organization, system integration, and power-management mechanisms. This study presents a combined architectural and experimental comparison of an Apple M3 system and an AMD Ryzen 7 3750H system. The architectural analysis contrasts AArch64's fixed-width load-store design with the variable-length, memory-operand-rich x86-64 instruction model, and discusses how register organization, calling conventions, heterogeneous core organization, memory behavior, and low-power mechanisms shape observed performance and energy characteristics. The experimental part uses two native assembly benchmarks: a recursive Fibonacci workload and an integer matrix-multiplication workload. The analysis combines repeated timing measurements, processor-energy measurements, and cross-platform microarchitectural counter measurements from matched portable-C profiling runs. The Ryzen platform is decisively faster on the branch-heavy Fibonacci benchmark, while matrix multiplication shows no meaningful timing advantage for either platform in the present measurements. In contrast, the Apple platform i
Continuous electrocardiography (ECG) monitoring could surface rhythm abnormalities before they escalate into cardiovascular events. However, a deployable system must satisfy three requirements simultaneously: legal-grade privacy (GDPR, HIPAA), real-time inference on constrained edge hardware, and detection quality under non-IID cross-hospital data. We design and evaluate an end-to-end federated system addressing all three for unsupervised 12-lead ECG anomaly detection on PTB-XL dataset, combining three autoencoder families (VanillaAE, ConvAE, VAE), Flower-based federated averaging (FedAvg) across ten simulated hospitals, client-side differentially private SGD (DP-SGD) with a Rényi-DP accountant, and 8-bit integer (INT8) post-training quantization with Raspberry Pi 4 benchmarking. Our main contributions are: an empirical characterization of how these mechanisms compose, practical DP-specific recommendations, and technical and security insights for a clinically sensitive setting. Federated learning matches or exceeds the centralized baseline across all architectures (ConvAE federated area under the ROC curve, AUROC, $0.782$), and an $\varepsilon$ sweep identifies $\varepsilon=4$ as t
Frequent toolchain updates and growing ISA diversity have made system-level software package repair increasingly important. Diagnosing and repairing build failures remains challenging because failures involve heterogeneous evidence, dependency constraints, and architecture-specific build conventions. While recent LLM-based repair methods show promise for project-level source fixes, they struggle with system-level repair, where failures span multi-language artifacts such as build recipes, scripts, and source archives, and require iterative validation through external build services. In this paper, we first conduct a systematic empirical study of real-world system-level build failures. We find that 72% of failures stem from dependency and environment misconfigurations rather than isolated code defects, suggesting that effective repair must prioritize packaging logic and iterative feedback. Motivated by these insights, we propose EvidenT, an evidence-preserving repair framework that decouples iteration-aware evidence management from tool execution. EvidenT includes: (1) an external Build Service for reproducible execution and feedback; (2) an Evidence-Preserving Repair Controller that
Traditional side-channels take advantage of secrets being used as inputs to unsafe instructions, used for memory accesses, or used in control flow decisions. Constant-time programming, which restricts such code patterns, has been widely adopted as a defense against these vulnerabilities. However, new hardware optimizations in the form of Data Memory-dependent Prefetchers (DMP) present in Apple, Intel, and ARM CPUs have shown such defenses are not sufficient. These prefetchers, unlike classical prefetchers, use the content of memory as well as the trace of prior accesses to determine prefetch targets. An adversary abusing such a prefetcher has been shown to be able to mount attacks leaking data-at-rest; data that is never used by the program, even speculatively, in an unsafe manner. In response, this paper introduces SplittingSecrets, a compiler-based tool that can harden software libraries against side-channels arising from DMPs. SplittingSecrets's approach avoids reasoning about the complex internals of different DMPs and instead relies on one key aspect of all DMPs: activation requires data to resemble addresses. To prevent secret data from leaking, SplittingSecrets transforms me
Memory-safety escapes continue to form the launching pad for a wide range of security attacks, especially for the substantial base of deployed software that is coded in pointer-based languages such as C/C++. Although compiler and Instruction Set Architecture (ISA) extensions have been introduced to address elements of this issue, the overhead and/or comprehensive applicability have limited broad production deployment. The Memory Tagging Extension (MTE) to the ARM AArch64 Instruction Set Architecture is a valuable tool to address memory-safety escapes; when used in synchronous tag-checking mode, MTE provides deterministic detection and prevention of sequential buffer overflow attacks, and probabilistic detection and prevention of exploits resulting from temporal use-after-free pointer programming bugs. The AmpereOne processor, launched in 2024, is the first datacenter processor to support MTE. Its optimized MTE implementation uniquely incurs no memory capacity overhead for tag storage and provides synchronous tag-checking with single-digit performance impact across a broad range of datacenter class workloads. Furthermore, this paper analyzes the complete hardware-software stack, ide
Large language models (LLMs) have shown growing potential in software engineering, yet few benchmarks evaluate their ability to repair software during migration across instruction set architectures (ISAs). Cross-ISA migration, such as between x86_64 and aarch64, requires handling complex dependencies, heterogeneous toolchains, and long build logs while ensuring executable verification. To address this challenge, we present Build-bench, an end-to-end benchmark that systematically evaluates the capability of LLMs to repair build failures in cross-ISA settings. Build-bench collects 268 real-world failed packages and integrates auxiliary tools including Structure Extraction, File Content Extraction, Content Modification, and Build Verification to support autonomous, tool-augmented reasoning. The repair process operates in an iterative loop where, upon failure, the model receives updated build logs and previous repair outcomes to refine subsequent attempts. Through a comparative evaluation across the studied models, Build-bench reveals that current models achieve a maximum build success rate of 63.19% and tool usage patterns differ significantly across models. By coupling real build env
Disassembly is the first step of a variety of binary analysis and transformation techniques, such as reverse engineering, or binary rewriting. Recent disassembly approaches consist of three phases: an exploration phase, that overapproximates the binary's code; an analysis phase, that assigns weights to candidate instructions or basic blocks; and a conflict resolution phase, that downselects the final set of instructions. We present a disassembly algorithm that generalizes this pattern for a wide range of architectures, namely x86, x64, arm32, and aarch64. Our algorithm presents a novel conflict resolution method that reduces disassembly to weighted interval scheduling.
Whilst RISC-V has grown phenomenally quickly in embedded computing, it is yet to gain significant traction in High Performance Computing (HPC). However, as we move further into the exascale era, the flexibility offered by RISC-V has the potential to be very beneficial in future supercomputers especially as the community places an increased emphasis on decarbonising its workloads. Sophon's SG2042 is the first mass produced, commodity available, high-core count RISC-V CPU designed for high performance workloads. First released in summer 2023, and at the time of writing now becoming widely available, a key question is whether this is a realistic proposition for HPC applications. In this paper we use NASA's NAS Parallel Benchmark (NPB) suite to characterise performance of the SG2042 against other CPUs implementing the RISC-V, x86-64, and AArch64 ISAs. We find that the SG2042 consistently outperforms all other RISC-V solutions, delivering between a 2.6 and 16.7 performance improvement at the single core level. When compared against the x86-64 and AArch64 CPUs, which are commonplace for high performance workloads, we find that the SG2042 performs comparatively well with computationally b
Fast machine code generation is especially important for fast start-up just-in-time compilation, where the compilation time is part of the end-to-end latency. However, widely used compiler frameworks like LLVM do not prioritize fast compilation and require an extra IR translation step increasing latency even further; and rolling a custom code generator is a substantial engineering effort, especially when targeting multiple architectures. Therefore, in this paper, we present TPDE, a compiler back-end framework that adapts to existing code representations in SSA form. Using an IR-specific adapter providing canonical access to IR data structures and a specification of the IR semantics, the framework performs one analysis pass and then performs the compilation in just a single pass, combining instruction selection, register allocation, and instruction encoding. The generated target instructions are primarily derived code written in high-level language through LLVM's Machine IR, easing portability to different architectures while enabling optimizations during code generation. To show the generality of our framework, we build a new back-end for LLVM from scratch targeting x86-64 and AArc
Compilers, while essential, are notoriously complex systems that demand prohibitively expensive human expertise to develop and maintain. The recent advancements in Large Language Models (LLMs) offer a compelling new paradigm: Neural Compilation, which could potentially simplify compiler development for new architectures and facilitate the discovery of innovative optimization techniques. However, several critical obstacles impede its practical adoption. Firstly, a significant lack of dedicated benchmarks and robust evaluation methodologies hinders objective assessment and tracking of progress in the field. Secondly, systematically enhancing the reliability and performance of LLM-generated assembly remains a critical challenge. Addressing these challenges, this paper introduces NeuComBack, a novel benchmark dataset specifically designed for IR-to-assembly compilation. Leveraging this dataset, we first define a foundational Neural Compilation workflow and conduct a comprehensive evaluation of the capabilities of recent frontier LLMs on Neural Compilation, establishing new performance baselines. We further propose a self-evolving prompt optimization method that enables LLMs to iterativ
Large Language Models (LLMs) have demonstrated great potential in various language processing tasks, and recent studies have explored their application in compiler optimizations. However, all these studies focus on the conventional open-source LLMs, such as Llama2, which lack enhanced reasoning mechanisms. In this study, we investigate the errors produced by the fine-tuned 7B-parameter Llama2 model as it attempts to learn and apply a simple peephole optimization for the AArch64 assembly code. We provide an analysis of the errors produced by the LLM and compare it with state-of-the-art OpenAI models which implement advanced reasoning logic, including GPT-4o and GPT-o1 (preview). We demonstrate that OpenAI GPT-o1, despite not being fine-tuned, outperforms the fine-tuned Llama2 and GPT-4o. Our findings indicate that this advantage is largely due to the chain-of-thought reasoning implemented in GPT-o1. We hope our work will inspire further research on using LLMs with enhanced reasoning mechanisms and chain-of-thought for code generation and optimization.
As Convolutional Neural Networks (CNNs) gain prominence in deep learning, algorithms like Winograd Convolution have been introduced to enhance computational efficiency. However, existing implementations often face challenges such as high transformation overhead, suboptimal computation efficiency, and reduced parallel performance in some layers. We propose a fused Winograd Convolution algorithm optimized for ARMv8 CPUs, integrating input transformation, filter transformation, computation, and output transformation into a single pipeline. By maintaining consecutive memory access and using a custom z-shaped data layout, our approach fully utilizes an optimized GEMM micro-kernel with a ping-pong technique. Additionally, we introduce a multi-dimensional parallel strategy that adapts to convolutional layer scales. To maximize performance, we manually optimize each kernel in AArch64 assembly and carefully tune blocking parameters. Experimental results show speedups of up to 4.74x, 4.10x, 4.72x, and 10.57x over NCNN, NNPACK, FastConv, and ACL on the Kunpeng 920 platform using multiple threads, with respective gains of 3.85x, 2.81x, 4.20x, and 7.80x on the AWS Graviton2, and 3.32x, 3.68x, 8
Upcoming CXL-based disaggregated memory devices feature special purpose units to offload compute to near-memory. In this paper, we explore opportunities for offloading compute to general purpose cores on CXL memory devices, thereby enabling a greater utility and diversity of offload. We study two classes of popular memory intensive applications: ML inference and vector database as candidates for computational offload. The study uses Arm AArch64-based dual-socket NUMA systems to emulate CXL type-2 devices. Our study shows promising results. With our ML inference model partitioning strategy for compute offload, we can place up to 90% data in remote memory with just 20% performance trade-off. Offloading Hierarchical Navigable Small World (HNSW) kernels in vector databases can provide upto 6.87$\times$ performance improvement with under 10% offload overhead.