In the second and third quarters of 2025, Google collaborated with Intel to conduct a security assessment of Intel Trust Domain Extensions (TDX), extending Google's previous review and covering major changes since Intel TDX Module 1.0 - namely support for Live Migration and Trusted Domain (TD) Partitioning (nested VMs within TDs). Intel provided guidance and support, including documentation and updated TDX 1.5 source code. Unlike the previous review, this time, we had access to a compute node capable of running TDX to develop a toolkit for live testing and Proof-of-Concept (PoC) generation. Furthermore, we integrated Gemini for analysis and NotebookLM to efficiently navigate complex specifications. This assessment resulted in the discovery of one vulnerability that enables a VMM to fully compromise a TD, and four vulnerabilities that enable a malicious VMM or TD to leak confidential memory of the Intel TDX Module. Several other security weaknesses and/or bugs were identified but not categorized as vulnerabilities despite having some impact on security. Beyond presenting the technical details of multiple bugs and vulnerabilities in this report, these findings underscore that confide
Novel confidential computing technologies such as Intel TDX, AMD SEV, and Arm CCA have recently emerged. In practice, due to its minimal trust boundaries, Intel SGX still remains widely used for enclave-based applications in cloud environments, including confidential cloud services, privacy-preserving communication, secure payment processing, and privacy-focused advertising. With the growing adoption of Arm CPUs in cloud systems, however, existing SGX applications face a significant portability challenge: they are tightly coupled to SGX-specific APIs and execution semantics. In this paper, we present the design and implementation of CCX, a framework that enables existing SGX applications to run on Arm CCA without source code modification. To this end, CCX redesigns SGX functionality within Arm CCA firmware, adapting SGX abstractions to CCA's architecture design while preserving full compatibility with existing applications originally developed for SGX. We implemented a prototype of CCX on both the QEMU emulator and a Nitrogen8M development board. Our evaluation shows that CCX is capable of executing existing SGX applications without requiring source code changes, while providing se
To mitigate interrupt-based stepping attacks (notably using SGX-Step), Intel introduced AEX-Notify, an ISA extension to Intel SGX that aims to prevent deterministic single-stepping. In this work, we introduce AEX-NStep, the first interrupt counting attack on AEX-Notify-enabled Enclaves. We show that deterministic single-stepping is not required for interrupt counting attacks to be practical and that, therefore, AEX-Notify does not entirely prevent such attacks. We specifically show that one of AEX-Notify's security guarantees, obfuscated forward progress, does not hold, and we introduce two new probabilistic interrupt counting attacks. We use these attacks to construct a practical ECDSA key leakage attack on an AEX-Notify-enabled SGX enclave. Our results extend the original security analysis of AEX-Notify and inform the design of future mitigations.
Low-precision data types are essential in modern neural networks during both training and inference as they enhance throughput and computational capacity by better exploiting available hardware resources. Despite the incorporation of FP8 in commercially available neural network accelerators, a comprehensive exposition of its underlying mechanisms, along with rigorous performance and accuracy evaluations, is still lacking. In this work, we contribute in three significant ways. First, we analyze the implementation details and quantization options associated with FP8 for inference on the Intel Gaudi AI accelerator. Second, we empirically quantify the throughput improvements afforded by the use of FP8 at both the operator level and in end-to-end scenarios. Third, we assess the accuracy impact of various FP8 quantization methods. Our experimental results indicate that the Intel Gaudi 2 accelerator consistently achieves high computational unit utilization, frequently exceeding 90% MFU, while incurring an accuracy degradation of less than 1%.
Data plane programming enables the programmability of network devices with domain-specific programming languages, like P4. One commonly used P4-programmable hardware target is the Intel Tofino switching ASIC. The runtime behavior of an implemented P4 program on Tofino can be configured with shell scripts or a Python library from Barefoot provided with the Tofino. Both are limited in their capabilities and usability. This paper introduces the Rust Barefoot Runtime (RBFRT), a Rust-based control plane library. The RBFRT provides a fast and memory-safe interface to configure the Intel Tofino. We showed that the RBFRT achieves a higher insertion rate for MAT entries and has a shorter response time compared to the Python library.
Intel Max GPUs are a new option available to CGYRO fusion simulation users. This paper outlines the changes that were needed to successfully run CGYRO on Intel Max 1550 GPUs on TACC's Stampede3 HPC system and presents benchmark results obtained there. Benchmark results were also run on Stampede3 Intel Max CPUs, as well as NVIDIA A100 and AMD MI250X GPUs at other major HPC systems. The Intel Max GPUs are shown to perform comparably to the other tested GPUs for smaller simulations but are noticeably slower for larger ones. Moreover, Intel Max GPUs are significantly faster than the tested Intel Max CPUs on Stampede3.
Trusted Execution Environments (TEEs), such as Intel's Software Guard Extensions (SGX), are increasingly being adopted to address trust and compliance issues in the public cloud. Intel SGX's second generation (SGXv2) addresses many limitations of its predecessor (SGXv1), offering the potential for secure and efficient analytical cloud DBMSs. We assess this potential and conduct the first in-depth evaluation study of analytical query processing algorithms inside SGXv2. Our study reveals that, unlike SGXv1, state-of-the-art algorithms like radix joins and SIMD-based scans are a good starting point for achieving high-performance query processing inside SGXv2. However, subtle hardware and software differences still influence code execution inside SGX enclaves and cause substantial overheads. We investigate these differences and propose new optimizations to bring the performance inside enclaves on par with native code execution outside enclaves.
This paper presents a SYCL implementation of Multi-Layer Perceptrons (MLPs), which targets and is optimized for the Intel Data Center GPU Max 1550. To increase the performance, our implementation minimizes the slow global memory accesses by maximizing the data reuse within the general register file and the shared local memory by fusing the operations in each layer of the MLP. We show with a simple roofline model that this results in a significant increase in the arithmetic intensity, leading to improved performance, especially for inference. We compare our approach to a similar CUDA implementation for MLPs and show that our implementation on the Intel Data Center GPU outperforms the CUDA implementation on Nvidia's H100 GPU by a factor up to 2.84 in inference and 1.75 in training. The paper also showcases the efficiency of our SYCL implementation in three significant areas: Image Compression, Neural Radiance Fields, and Physics-Informed Machine Learning. In all cases, our implementation outperforms the off-the-shelf Intel Extension for PyTorch (IPEX) implementation on the same Intel GPU by up to a factor of 30 and the CUDA PyTorch version on Nvidia's H100 GPU by up to a factor 19. T
Monte Carlo (MC) simulations play a pivotal role in diverse scientific and engineering domains, with applications ranging from nuclear physics to materials science. Harnessing the computational power of high-performance computing (HPC) systems, especially Graphics Processing Units (GPUs), has become essential for accelerating MC simulations. This paper focuses on the adaptation and optimization of the OpenMC neutron and photon transport Monte Carlo code for Intel GPUs, specifically the Intel Data Center Max 1100 GPU (codename Ponte Vecchio, PVC), through distributed OpenMP offloading. Building upon prior work by Tramm J.R., et al. (2022), which laid the groundwork for GPU adaptation, our study meticulously extends the OpenMC code's capabilities to Intel GPUs. We present a comprehensive benchmarking and scaling analysis, comparing performance on Intel MAX GPUs to state-of-the-art CPU execution (Intel Xeon Platinum 8480+ Processor, codename 4th generation Sapphire Rapids). The results demonstrate a remarkable acceleration factor compared to CPU execution, showcasing the GPU-adapted code's superiority over its CPU counterpart as computational load increases.
Large Language Models (LLMs) have become essential tools in natural language processing, finding large usage in chatbots such as ChatGPT and Gemini, and are a central area of research. A particular area of interest includes designing hardware specialized for these AI applications, with one such example being the neural processing unit (NPU). In 2023, Intel released the Intel Core Ultra processor with codename Meteor Lake, featuring a CPU, GPU, and NPU system-on-chip. However, official software support for the NPU through Intel's OpenVINO framework is limited to static model inference. The dynamic nature of autoregressive token generation in LLMs is therefore not supported out of the box. To address this shortcoming, we present NITRO (NPU Inference for Transformers Optimization), a Python-based framework built on top of OpenVINO to support text and chat generation on NPUs. In this paper, we discuss in detail the key modifications made to the transformer architecture to enable inference, some performance benchmarks, and future steps towards improving the package. The code repository for NITRO can be found here: https://github.com/abdelfattah-lab/nitro.
User programs recover from hardware exceptions and respond to signals by executing custom handlers that they register specifically for such events. We present SIGY attack, which abuses this programming model on Intel SGX to break the confidentiality and integrity guarantees of enclaves. SIGY uses the untrusted OS to deliver fake hardware events and injects fake signals in an enclave at any point. Such unintended execution of benign program-defined handlers in an enclave corrupts its state and violates execution integrity. 7 runtimes and library OSes (OpenEnclave, Gramine, Scone, Asylo, Teaclave, Occlum, EnclaveOS) are vulnerable to SIGY. 8 languages supported in Intel SGX have programming constructs that are vulnerable to SIGY. We use SIGY to demonstrate 4 proof of concept exploits on webservers (Nginx, Node.js) to leak secrets and data analytics workloads in different languages (C and Java) to break execution integrity.
Intel Trust Domain Extensions (TDX) is a new architectural extension in the 4th Generation Intel Xeon Scalable Processor that supports confidential computing. TDX allows the deployment of virtual machines in the Secure-Arbitration Mode (SEAM) with encrypted CPU state and memory, integrity protection, and remote attestation. TDX aims to enforce hardware-assisted isolation for virtual machines and minimize the attack surface exposed to host platforms, which are considered to be untrustworthy or adversarial in the confidential computing's new threat model. TDX can be leveraged by regulated industries or sensitive data holders to outsource their computations and data with end-to-end protection in public cloud infrastructure. This paper aims to provide a comprehensive understanding of TDX to potential adopters, domain experts, and security researchers looking to leverage the technology for their own purposes. We adopt a top-down approach, starting with high-level security principles and moving to low-level technical details of TDX. Our analysis is based on publicly available documentation and source code, offering insights from security researchers outside of Intel.
In this paper we explore the performance of Intel Xeon MAX CPU Series, representing the most significant new variation upon the classical CPU architecture since the Intel Xeon Phi Processor. Given the availability of a large on-package high-bandwidth memory, the bandwidth-to-compute ratio has significantly shifted compared to other CPUs on the market. Since a large fraction of HPC workloads are sensitive to the available bandwidth, we explore how this architecture performs on a selection of HPC proxies and applications that are mostly sensitive to bandwidth, and how it compares to the previous 3rd generation Intel Xeon Scalable processors (codenamed Ice Lake) and an AMD EPYC 7003 Series Processor with 3D V-Cache Technology (codenamed Milan-X). We explore performance with different parallel implementations (MPI, MPI+OpenMP, MPI+SYCL), compiled with different compilers and flags, and executed with or without hyperthreading. We show how performance bottlenecks are shifted from bandwidth to communication latencies for some applications, and demonstrate speedups compared to the previous generation between 2.0x-4.3x.
In this paper, we explore how transfer learning, coupled with Intel Xeon, specifically 4th Gen Intel Xeon scalable processor, defies the conventional belief that training is primarily GPU-dependent. We present a case study where we achieved near state-of-the-art accuracy for image classification on a publicly available Image Classification TensorFlow dataset using Intel Advanced Matrix Extensions(AMX) and distributed training with Horovod.
In this paper, we focus on three sparse matrix operations that are relevant for machine learning applications, namely, the sparse-dense matrix multiplication (SPMM), the sampled dense-dense matrix multiplication (SDDMM), and the composition of the SDDMM with SPMM, also termed as FusedMM. We develop optimized implementations for SPMM, SDDMM, and FusedMM operations utilizing Intel oneAPI's Explicit SIMD (ESIMD) SYCL extension API. In contrast to CUDA or SYCL, the ESIMD API enables the writing of explicitly vectorized kernel code. Sparse matrix algorithms implemented with the ESIMD API achieved performance close to the peak of the targeted Intel Data Center GPU. We compare our performance results to Intel's oneMKL library on Intel GPUs and to a recent CUDA implementation for the sparse matrix operations on NVIDIA's V100 GPU and demonstrate that our implementations for sparse matrix operations outperform either.
Batched linear solvers play a vital role in computational sciences, especially in the fields of plasma physics and combustion simulations. With the imminent deployment of the Aurora Supercomputer and other upcoming systems equipped with Intel GPUs, there is a compelling demand to expand the capabilities of these solvers for Intel GPU architectures. In this paper, we present our efforts in porting and optimizing the batched iterative solvers on Intel GPUs using the SYCL programming model. These new solvers achieve impressive performance on the Intel GPU Max 1550s (Ponte Vecchio GPUs) which surpass our previous CUDA implementation on NVIDIA H100 GPUs by an average of 2.4x for the PeleLM application inputs. The batched solvers are ready for production use in real-world scientific applications through the Ginkgo library, complementing the performance portability of the batched functionality of Ginkgo.
A critical enabler for progress in neuromorphic computing research is the ability to transparently evaluate different neuromorphic solutions on important tasks and to compare them to state-of-the-art conventional solutions. The Intel Neuromorphic Deep Noise Suppression Challenge (Intel N-DNS Challenge), inspired by the Microsoft DNS Challenge, tackles a ubiquitous and commercially relevant task: real-time audio denoising. Audio denoising is likely to reap the benefits of neuromorphic computing due to its low-bandwidth, temporal nature and its relevance for low-power devices. The Intel N-DNS Challenge consists of two tracks: a simulation-based algorithmic track to encourage algorithmic innovation, and a neuromorphic hardware (Loihi 2) track to rigorously evaluate solutions. For both tracks, we specify an evaluation methodology based on energy, latency, and resource consumption in addition to output audio quality. We make the Intel N-DNS Challenge dataset scripts and evaluation code freely accessible, encourage community participation with monetary prizes, and release a neuromorphic baseline solution which shows promising audio quality, high power efficiency, and low resource consump
Transformer based Large Language Models (LLMs) have been widely used in many fields, and the efficiency of LLM inference becomes hot topic in real applications. However, LLMs are usually complicatedly designed in model structure with massive operations and perform inference in the auto-regressive mode, making it a challenging task to design a system with high efficiency. In this paper, we propose an efficient LLM inference solution with low latency and high throughput. Firstly, we simplify the LLM decoder layer by fusing data movement and element-wise operations to reduce the memory access frequency and lower system latency. We also propose a segment KV cache policy to keep key/value of the request and response tokens in separate physical memory for effective device memory management, helping enlarge the runtime batch size and improve system throughput. A customized Scaled-Dot-Product-Attention kernel is designed to match our fusion policy based on the segment KV cache solution. We implement our LLM inference solution on Intel GPU and publish it publicly. Compared with the standard HuggingFace implementation, the proposed solution achieves up to 7x lower token latency and 27x highe
Grey box fuzzing is one of the most successful methods for automatic vulnerability detection. However,conventional Grey box Fuzzers like AFL can open perform fuzzing against the whole input and spend more time on smaller seeds with lower execution time, which significantly impact fuzzing efficiency for complicated input types. In this work, we introduce one intelligent grey box fuzzing for Intel Media driver, MediaFuzzer, which can perform effective fuzzing based on selective fields of complicated input. Also, with one novel calling depth-based power schedule biased toward seed corpus which can lead to deeper calling chain, it dramatically improves the vulnerability exposures (~6.6 times more issues exposed) and fuzzing efficiency (~2.7 times more efficient) against the baseline AFL for Intel media driver with almost negligible overhead.
The Intel Software Guard Extensions (SGX) technology enables applications to run in an isolated SGX enclave environment, with elevated confidentiality and integrity guarantees. Gramine Library OS facilitates execution of existing unmodified applications in SGX enclaves, requiring only an accompanying manifest file that describes the application's security posture and configuration. However, Intel SGX is a CPU-only technology, thus Gramine currently supports CPU-only workloads. To enable a broader class of applications that offload computations to hardware accelerators - GPU offload, NIC offload, FPGA offload, TPM communications - Gramine must be augmented with device-backed mmap support and generic ioctl support. In this paper, we describe the design and implementation of this newly added support, the corresponding changes to the manifest-file syntax and the requisite deep copy algorithm. We evaluate our implementation on Intel Media SDK workloads and discuss the encountered caveats and limitations. Finally, we outline a use case for the presented mmap/ioctl support beyond mere device communication, namely the mechanism to slice the application into the trusted enclave part (where