Scalable nonvolatile memory DIMMs will finally be commercially available with the release of the Intel Optane DC Persistent Memory Module (or just "Optane DC PMM"). This new nonvolatile DIMM supports byte-granularity accesses with access times on the order of DRAM, while also providing data storage that survives power outages. This work comprises the first in-depth, scholarly performance review of Intel's Optane DC PMM, exploring its capabilities as a main memory device, and as persistent, byte-addressable memory exposed to user-space applications. This report details the technology's performance under a number of modes and scenarios, and across a wide variety of macro-scale benchmarks. Optane DC PMMs can be used as large memory devices with a DRAM cache to hide their lower bandwidth and higher latency. When used in this Memory (or cached) mode, Optane DC memory has little impact on applications with small memory footprints. Applications with larger memory footprints may experience some slow-down relative to DRAM, but are now able to keep much more data in memory. When used under a file system, Optane DC PMMs can result in significant performance gains, especially when the file system is optimized to use the load/store interface of the Optane DC PMM and the application uses many small, persistent writes. For instance, using the NOVA-relaxed NVMM file system, we can improve the performance of Kyoto Cabinet by almost 2x. Optane DC PMMs can also enable user-space persistence where the application explicitly controls its writes into persistent Optane DC media. In our experiments, modified applications that used user-space Optane DC persistence generally outperformed their file system counterparts. For instance, the persistent version of RocksDB performed almost 2x faster than the equivalent program utilizing an NVMM-aware file system.
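The user-space persistence model described above, in which the application stores into mapped memory and decides when data must become durable, can be illustrated with a minimal sketch. It assumes an ordinary file stands in for a persistent-memory region; the helper names `persist_counter` and `load_counter` are hypothetical, and real Optane DC deployments would typically rely on a DAX-mounted file system and cache-flush instructions rather than this portable approximation.

```python
import mmap
import os
import struct
import tempfile


def persist_counter(path: str, value: int) -> None:
    """Store an 8-byte counter through a memory mapping, then flush it.

    An ordinary file is an assumed stand-in for a persistent-memory region.
    """
    size = mmap.PAGESIZE
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, size)  # make the file large enough to map
        with mmap.mmap(fd, size) as region:
            # Plain store into mapped memory (the "load/store interface").
            struct.pack_into("<Q", region, 0, value)
            # The application chooses the point at which data is made durable.
            region.flush()
    finally:
        os.close(fd)


def load_counter(path: str) -> int:
    """Read the counter back through the regular file interface."""
    with open(path, "rb") as handle:
        return struct.unpack("<Q", handle.read(8))[0]


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "pmem_region.bin")
    persist_counter(demo_path, 42)
    print(load_counter(demo_path))
```

The point of the sketch is the division of labour: the write itself is an ordinary memory store, and durability is an explicit, separate step under application control.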
Intel has recently introduced Intel® Transactional Synchronization Extensions (Intel® TSX) in the Intel 4th Generation Core™ Processors. With Intel TSX, a processor can dynamically determine whether threads need to serialize through lock-protected critical sections. In this paper, we evaluate the first hardware implementation of Intel TSX using a set of high-performance computing (HPC) workloads, and demonstrate that applying Intel TSX to these workloads can provide significant performance improvements. On a set of real-world HPC workloads, applying Intel TSX provides an average speedup of 1.41x. When applied to a parallel user-level TCP/IP stack, Intel TSX provides 1.31x average bandwidth improvement on network intensive applications. We also demonstrate the ease with which we were able to apply Intel TSX to the various workloads.
This book is an all-in-one source of information for programming the Second-Generation Intel Xeon Phi product family, also called Knights Landing. The authors provide detailed and timely Knights Landing-specific details, programming advice, and real-world examples. The authors distill their years of Xeon Phi programming experience, coupled with insights from many expert customers, Intel Field Engineers, Application Engineers, and Technical Consulting Engineers, to create this authoritative book on the essentials of programming for Intel Xeon Phi products. Intel Xeon Phi Processor High-Performance Programming is useful even before you ever program a system with an Intel Xeon Phi processor. To help ensure that your applications run at maximum efficiency, the authors emphasize key techniques for programming any modern parallel computing system, whether based on Intel Xeon processors, Intel Xeon Phi processors, or other high-performance microprocessors. Applying these techniques will generally increase your program performance on any system and prepare you better for Intel Xeon Phi processors. * A practical guide to the essentials for programming Intel Xeon Phi processors * Definitive coverage of the Knights Landing architecture * Presents best practices for portable, high-performance computing and a familiar and proven threads and vectors programming model * Includes real-world code examples that highlight usages of the unique aspects of this new highly parallel and high-performance computational product * Covers use of MCDRAM, AVX-512, Intel Omni-Path fabric, many-cores (up to 72), and many threads (4 per core) * Covers software developer tools, libraries and programming models * Covers using Knights Landing as a processor and a coprocessor
Modern implementations of homomorphic encryption (HE) rely heavily on polynomial arithmetic over a finite field. This is particularly true of the BGV, BFV, and CKKS HE schemes. Two of the biggest performance bottlenecks in HE primitives and applications are polynomial modular multiplication and the forward and inverse number-theoretic transform (NTT). Here, we introduce Intel® Homomorphic Encryption Acceleration Library (Intel® HEXL), a C++ library which provides optimized implementations of polynomial arithmetic for Intel® processors. Intel HEXL takes advantage of the recent Intel® Advanced Vector Extensions 512 (Intel® AVX512) instruction set to provide state-of-the-art implementations of the NTT and modular multiplication, measuring up to 7.2x single-threaded speedup over a native C++ baseline. Intel HEXL is available open-source at https://github.com/intel/hexl under the Apache 2.0 license and has been adopted by the Microsoft SEAL and PALISADE homomorphic encryption libraries.
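To make the role of the NTT in polynomial multiplication concrete, the following is a minimal, unoptimized sketch in plain Python. It is not Intel HEXL's API or implementation: the function names, the toy modulus 7681, and the choice of cyclic (mod X^n - 1) rather than the negacyclic (mod X^n + 1) multiplication common in HE schemes are all assumptions made for brevity.

```python
def find_root_of_unity(n, q):
    """Return a primitive n-th root of unity modulo the prime q.

    Assumes n >= 2 is a power of two that divides q - 1.
    """
    assert n >= 2 and (q - 1) % n == 0
    for x in range(2, q):
        w = pow(x, (q - 1) // n, q)
        # For n a power of two, w is primitive iff w^(n/2) == -1 (mod q).
        if pow(w, n // 2, q) == q - 1:
            return w
    raise ValueError("no primitive root of unity found")


def ntt(a, w, q):
    """Forward NTT of a (length a power of two) via recursive Cooley-Tukey."""
    n = len(a)
    if n == 1:
        return list(a)
    w_sq = w * w % q
    even = ntt(a[0::2], w_sq, q)
    odd = ntt(a[1::2], w_sq, q)
    out = [0] * n
    twiddle = 1
    for k in range(n // 2):
        t = twiddle * odd[k] % q
        out[k] = (even[k] + t) % q           # butterfly: upper half
        out[k + n // 2] = (even[k] - t) % q  # butterfly: lower half
        twiddle = twiddle * w % q
    return out


def intt(values, w, q):
    """Inverse NTT: transform with w^-1, then scale by n^-1 (Python 3.8+)."""
    n_inv = pow(len(values), -1, q)
    return [v * n_inv % q for v in ntt(values, pow(w, -1, q), q)]


def cyclic_multiply(a, b, q):
    """Multiply two polynomials modulo (X^n - 1, q) with pointwise products."""
    w = find_root_of_unity(len(a), q)
    fa, fb = ntt(a, w, q), ntt(b, w, q)
    return intt([x * y % q for x, y in zip(fa, fb)], w, q)


if __name__ == "__main__":
    print(cyclic_multiply([1, 2, 3, 4], [5, 6, 7, 8], 7681))
```

The transform turns an O(n^2) convolution into O(n log n) butterflies plus n pointwise modular multiplications, which is why these two kernels dominate HE running time and are natural targets for vectorization.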
Intel’s Software Guard Extensions (SGX) is a set of extensions to the Intel architecture that aims to provide integrity and confidentiality guarantees to security-sensitive computation performed on a computer where all the privileged software (kernel, hypervisor, etc) is potentially malicious. This paper analyzes Intel SGX, based on the 3 papers [14, 78, 137] that introduced it, on the Intel Software Developer’s Manual [100] (which supersedes the SGX manuals [94, 98]), on an ISCA 2015 tutorial [102], and on two patents [108, 136]. We use the papers, reference manuals, and tutorial as primary data sources, and only draw on the patents to fill in missing information. This paper’s contributions are a summary of the Intel-specific architectural and micro-architectural details needed to understand SGX, a detailed and structured presentation of the publicly available information on SGX, a series of intelligent guesses about some important but undocumented aspects of SGX, and an analysis of SGX’s security properties.
Multi-core chips from Intel and AMD offer a dramatic boost in speed and responsiveness, and plenty of opportunities for multiprocessing on ordinary desktop computers. But they also present a challenge: More than ever, multithreading is a requirement for good performance. This guide explains how to maximize the benefits of these processors through a portable C++ library that works on Windows, Linux, Macintosh, and Unix systems. With it, you'll learn how to use Intel Threading Building Blocks (TBB) effectively for parallel programming -- without having to be a threading expert. Written by James Reinders, Chief Evangelist of Intel Software Products, and based on the experience of Intel's developers and customers, this book explains the key tasks in multithreading and how to accomplish them with TBB in a portable and robust manner. With plenty of examples and full reference material, the book lays out common patterns of uses, reveals the gotchas in TBB, and gives important guidelines for choosing among alternatives in order to get the best performance. You'll learn how Intel Threading Building Blocks: * Enables you to specify tasks instead of threads for better portability, easier programming, more understandable source code, and better performance and scalability in general * Focuses on the goal of parallelizing computationally intensive work to deliver high-level solutions * Is compatible with other threading packages, and doesn't force you to pick one package for your entire program * Emphasizes scalable, data-parallel programming, which allows program performance to increase as you add processors * Relies on generic programming, which enables you to write the best possible algorithms with the fewest constraints Any C++ programmer who wants to write an application to run on a multi-core system will benefit from this book. TBB is also very approachable for a C programmer or a C++ programmer without much experience with templates. 
Best of all, you don't need experience with parallel programming or multi-core processors to use this book.
This paper studies the advent of Intel, manufacturer of microprocessors, to Costa Rica. We use indicators of both direct effects and selected macroeconomic effects as evidence, even though these indicators are in some cases more qualitative than quantitative. We also examine training externalities, as well as the “signaling” effect that Intel has had on other firms’ decision to enter the Costa Rican economy, thus making Intel itself into a factor of attraction. The gross income generated by Intel in terms of net exports, investment, wages and benefits, and local purchases is very important for the Costa Rican economy. Net exports and the economy as a whole have been growing at a significantly higher rate since 1997, the year Intel started operations in the country. Also, the share of natural resource-based exports in total exports has declined while the share of manufactures has risen significantly. This has implied a dramatic change in the composition of Costa Rica’s exports. Available evidence supports the view that Intel has generated positive externalities for the Costa Rican economy.
Trusted execution environments, and particularly the Software Guard eXtensions (SGX) included in recent Intel x86 processors, gained significant traction in recent years. A long track of research papers, and increasingly also real-world industry applications, take advantage of the strong hardware-enforced confidentiality and integrity guarantees provided by Intel SGX. Ultimately, enclaved execution holds the compelling potential of securely offloading sensitive computations to untrusted remote platforms. We present Foreshadow, a practical software-only microarchitectural attack that decisively dismantles the security objectives of current SGX implementations. Crucially, unlike previous SGX attacks, we do not make any assumptions on the victim enclave’s code and do not necessarily require kernel-level access. At its core, Foreshadow abuses a speculative execution bug in modern Intel processors, on top of which we develop a novel exploitation methodology to reliably leak plaintext enclave secrets from the CPU cache. We demonstrate our attacks by extracting full cryptographic keys from Intel’s vetted architectural enclaves, and validate their correctness by launching rogue production enclaves and forging arbitrary local and remote attestation responses. The extracted remote attestation keys affect millions of devices.
This paper explores Intel's strategy with respect to complements. We find that, as the literature predicts, Intel's entry decisions are shaped by the belief that it does not have the capabilities to enter all possible markets, and thus that it must encourage widespread entry despite the fact that potential entrants (rationally) fear Intel's ability to “squeeze” them ex post. We explore the ways in which Intel addresses this issue, highlighting in particular the firm's use of organizational structure and processes as commitment mechanisms. Our results have implications for our understanding of the dynamics of competition in complements and of the role of organizational form in shaping competition.
For the first time, we practically demonstrate that Intel SGX enclaves are vulnerable against cache-timing attacks. As a case study, we present an access-driven cache-timing attack on AES when running inside an Intel SGX enclave. Using Neve and Seifert's elimination method, as well as a cache probing mechanism relying on Intel PMC, we are able to extract the AES secret key in less than 10 seconds by investigating 480 encrypted blocks on average. The AES implementation we attack is based on a Gladman AES implementation taken from an older version of OpenSSL, which is known to be vulnerable to cache-timing attacks. In contrast to previous works on cache-timing attacks, our attack is executed with root privileges running on the same host as the vulnerable enclave. Intel SGX, however, was designed to precisely protect applications against such root-level attacks. As a consequence, we show that SGX cannot withstand its designated attacker model when it comes to side-channel vulnerabilities. To the contrary, the attack surface for side-channels increases dramatically in the scenario of SGX due to the power of root-level attackers, for example, by exploiting the accuracy of PMC, which is restricted to kernel code.
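The idea behind an elimination-style, access-driven cache attack can be sketched with a toy simulation. This is not the paper's attack code and makes several loud assumptions: the S-box is a seeded random permutation rather than the AES tables, 16 table entries are assumed to share one cache line, the observer is assumed to learn exactly which lines were left untouched, and all names are hypothetical. The principle shown is that any table entry on an untouched line cannot have produced the observed ciphertext byte, so the key byte it would imply is ruled out.

```python
import random

LINE_ENTRIES = 16  # assumption: 16 table entries share one cache line
NUM_LINES = 256 // LINE_ENTRIES


def make_sbox(seed=1):
    """Toy 8-bit S-box: a seeded random permutation, NOT the real AES S-box."""
    rng = random.Random(seed)
    sbox = list(range(256))
    rng.shuffle(sbox)
    return sbox


def observe(sbox, key_byte, rng):
    """Simulate one encryption; return (ciphertext byte, untouched lines)."""
    lookups = [rng.randrange(256) for _ in range(16)]  # 16 final-round lookups
    touched = {index // LINE_ENTRIES for index in lookups}
    untouched = set(range(NUM_LINES)) - touched
    ciphertext_byte = sbox[lookups[0]] ^ key_byte  # output of byte position 0
    return ciphertext_byte, untouched


def eliminate(sbox, observations):
    """Discard every key-byte candidate that an untouched line contradicts."""
    candidates = set(range(256))
    for ciphertext_byte, untouched in observations:
        for line in untouched:
            start = line * LINE_ENTRIES
            for index in range(start, start + LINE_ENTRIES):
                # If this entry had been used, the key would be c XOR S[index];
                # the line was not accessed, so that key value is impossible.
                candidates.discard(ciphertext_byte ^ sbox[index])
    return candidates


def demo(key_byte=0x3C, samples=200, seed=7):
    """Run the simulation and return the surviving key-byte candidates."""
    sbox = make_sbox()
    rng = random.Random(seed)
    observations = [observe(sbox, key_byte, rng) for _ in range(samples)]
    return eliminate(sbox, observations)


if __name__ == "__main__":
    print(sorted(demo()))
```

Because the entry that was really used always lies on a touched line, the true key byte is never discarded, while each additional observation removes further wrong candidates.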
Speculative execution side-channel vulnerabilities in micro-architecture processors have raised concerns about the security of Intel SGX. To understand clearly the security impact of these vulnerabilities against SGX, this paper makes the following studies: First, to demonstrate the feasibility of the attacks, we present SgxPectre Attacks (the SGX-variants of Spectre attacks) that exploit speculative execution side-channel vulnerabilities to subvert the confidentiality of SGX enclaves. We show that when the branch prediction of the enclave code can be influenced by programs outside the enclave, the control flow of the enclave program can be temporarily altered to execute instructions that lead to observable cache-state changes. An adversary observing such changes can learn secrets inside the enclave memory or its internal registers, thus completely defeating the confidentiality guarantee offered by SGX. Second, to determine whether real-world enclave programs are impacted by the attacks, we develop techniques to automate the search of vulnerable code patterns in enclave binaries using symbolic execution. Our study suggests that nearly any enclave program could be vulnerable to SgxPectre Attacks since vulnerable code patterns are available in most SGX runtimes (e.g., Intel SGX SDK, Rust-SGX, and Graphene-SGX). Third, we apply SgxPectre Attacks to steal seal keys and attestation keys from Intel signed quoting enclaves. The seal key can be used to decrypt sealed storage outside the enclaves and forge valid sealed data; the attestation key can be used to forge attestation signatures. For these reasons, SgxPectre Attacks practically defeat SGX's security protection. Finally, we evaluate Intel's existing countermeasures against SgxPectre Attacks and discuss the security implications.
Dynamic frequency and voltage scaling features have been introduced to manage ever-growing heat and power consumption in modern processors. Design restrictions ensure frequency and voltage are adjusted as a pair, based on the current load, because for each frequency there is only a certain voltage range where the processor can operate correctly. For this purpose, many processors (including the widespread Intel Core series) expose privileged software interfaces to dynamically regulate processor frequency and operating voltage. In this paper, we demonstrate that these privileged interfaces can be reliably exploited to undermine the system's security. We present the Plundervolt attack, in which a privileged software adversary abuses an undocumented Intel Core voltage scaling interface to corrupt the integrity of Intel SGX enclave computations. Plundervolt carefully controls the processor's supply voltage during an enclave computation, inducing predictable faults within the processor package. Consequently, even Intel SGX's memory encryption/authentication technology cannot protect against Plundervolt. In multiple case studies, we show how the induced faults in enclave computations can be leveraged in real-world attacks to recover keys from cryptographic algorithms (including the AES-NI instruction set extension) or to induce memory safety vulnerabilities into bug-free enclave code. We finally discuss why mitigating Plundervolt is not trivial, requiring trusted computing base recovery through microcode updates or hardware changes.
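Why a single computational fault can be enough to recover a key is easiest to see on a classic textbook case, the fault attack on RSA signatures computed with the Chinese Remainder Theorem. The sketch below is an illustration under stated assumptions, not the paper's code: the choice of RSA-CRT as the target, the toy primes, and the function names are all assumptions, and the fault is modelled as a software bit flip rather than produced by undervolting.

```python
from math import gcd

# Toy parameters (assumed for illustration; far too small for real use).
P, Q, E = 1009, 1013, 65537
N = P * Q
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent (Python 3.8+)


def sign_crt(message, inject_fault=False):
    """RSA signature via CRT; optionally corrupt the mod-Q half."""
    s_p = pow(message, D % (P - 1), P)
    s_q = pow(message, D % (Q - 1), Q)
    if inject_fault:
        s_q ^= 1  # a bit flip stands in for a hardware-induced fault
    # Garner recombination: result is s_p modulo P and s_q modulo Q.
    h = (pow(Q, -1, P) * (s_p - s_q)) % P
    return s_q + Q * h


def recover_factor(message, faulty_signature):
    """Recover a prime factor of N from one faulty signature.

    The faulty signature is still correct modulo P but wrong modulo Q,
    so s^E - m is divisible by P and not by Q.
    """
    return gcd((pow(faulty_signature, E, N) - message) % N, N)


if __name__ == "__main__":
    msg = 42
    print(pow(sign_crt(msg), E, N) == msg)  # a correct signature verifies
    print(recover_factor(msg, sign_crt(msg, inject_fault=True)))
```

Once one prime factor of the modulus is known, the private key follows directly, which is why integrity of the computation matters as much as confidentiality of memory.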
Virtualizing the physical resources of a computing system to improve sharing and utilization has been done for decades. Virtualization had once been confined to specialized server and mainframe systems, but improvements in the performance of platforms based on Intel technology now allow those platforms to efficiently support virtualization. However, the IA-32 and Itanium processor architectures pose a number of significant challenges to virtualization. The first generation of Intel Virtualization Technology (VT) for IA-32 and Itanium processors provides hardware support that simplifies processor virtualization, enabling reductions in virtual machine monitor (VMM) software size and complexity. Resulting VMMs can support a wider range of legacy and future operating systems (OSs) on the same physical platform while maintaining high performance. In this paper, we provide details of the virtualization challenges posed by IA-32 and Itanium processors; present an overview and furnish details of VT-x (Intel Virtualization Technology for the IA-32 architecture) and VT-i (Intel Virtualization Technology for the Itanium architecture); show how VT-x and VT-i address virtualization challenges; and finally provide examples of usage of the VT-x and VT-i architecture.
We introduce Intel® Software Guard Extensions (Intel® SGX) SGX2, which extends the SGX instruction set to include dynamic memory management support for enclaves. Intel® SGX is a subset of the Intel Architecture Instruction Set [1]. SGX1 allows an application developer to build a trusted environment and execute inside that space. However, SGX1 imposes limitations regarding memory commitment and reuse of enclave memory. The software developer is required to allocate all memory at enclave instantiation. This paper describes new instructions and programming models to extend support for dynamic memory management inside an enclave.
This paper draws on a detailed history of Intel's strategy with respect to the complementary markets for microprocessors to explore the usefulness of the current theoretical literature for explaining behavior. We find that as the literature predicts, Intel invests heavily in these markets, both through direct entry and through subsidy. We also find, again consistent with the literature, that the firm's entry decisions are shaped by the belief that it does not have either the capabilities or the resources to enter all possible markets, and thus that it believes it is critical to encourage widespread entry. As several authors have pointed out, this imperative places the firm in a difficult strategic position, since it needs to attempt to commit to potential entrants that it will not engage in an ex-post "squeeze", despite the fact that ex post it has very strong incentives to do so. We find that the fact that the complementary markets in which Intel competes are complex, dynamic and multilayered considerably sharpens this dilemma. We explore the ways in which Intel attempts to solve it, highlighting in particular the organizational structure and processes through which they attempt to commit to making money in the markets which they choose to enter while also committing not to making too much. Our results have implications for both our understanding of the dynamics of competition in complements and of the role of organizational structures and processes in shaping competition.
A virtual machine monitor (VMM) allows multiple operating systems to run concurrently on virtual machines (VMs) on a single hardware platform. Each VM can be treated as an independent operating system platform. A secure VMM would enforce an overarching security policy on its VMs. The potential benefits of a secure VMM for PCs include: a more secure environment, familiar COTS operating systems and applications, and enormous savings resulting from the elimination of the need for separate platforms when both high assurance policy enforcement and COTS software are required. This paper addresses the problem of implementing secure VMMs on the Intel Pentium architecture. The requirements for various types of VMMs are reviewed. We report an analysis of the virtualizability of all of the approximately 250 instructions of the Intel Pentium platform and address its ability to support a VMM. Current “virtualization” techniques for the Intel Pentium architecture are examined and several security problems are identified. An approach to providing a virtualizable hardware base for a highly secure VMM is discussed.
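An instruction-by-instruction virtualizability analysis of this kind can be pictured with a small sketch. It assumes the classical criterion that an architecture is virtualizable by trap-and-emulate only if every sensitive instruction is also privileged (so that it traps when run outside the most privileged level); the abstract does not state which criterion the paper applies. The table is a tiny hand-picked sample, not the paper's data, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    name: str
    privileged: bool  # traps when executed outside the most privileged level
    sensitive: bool   # reads or changes privileged machine state


# Illustrative sample only (assumed classification for this sketch).
SAMPLE = [
    Instruction("LGDT", privileged=True, sensitive=True),
    Instruction("LIDT", privileged=True, sensitive=True),
    Instruction("SGDT", privileged=False, sensitive=True),
    Instruction("SMSW", privileged=False, sensitive=True),
    Instruction("POPF", privileged=False, sensitive=True),
    Instruction("ADD", privileged=False, sensitive=False),
]


def virtualization_holes(instructions):
    """Names of sensitive instructions that execute without trapping."""
    return sorted(
        ins.name for ins in instructions if ins.sensitive and not ins.privileged
    )


def is_virtualizable(instructions):
    """True when every sensitive instruction is also privileged."""
    return not virtualization_holes(instructions)


if __name__ == "__main__":
    print(virtualization_holes(SAMPLE))
    print(is_virtualizable(SAMPLE))
```

Each instruction reported by `virtualization_holes` is one a VMM cannot intercept by trapping alone, so a guest could observe or alter real machine state without the monitor noticing.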
Etch is a general-purpose tool for rewriting arbitrary Win32/x86 binaries without requiring source code. Etch provides a framework for modifying executables for both measurement and optimization. Etch handles the complexities of the Win32 executable file format and the x86 instruction set, allowing tool builders to focus on specifying transformations. Etch also handles the complexities of the Win32 execution environment, allowing tool users to focus on performing experiments. This paper describes Etch and some of the tools that we have built using Etch, including a hierarchical call graph profiler and an instruction layout optimization tool. 1 Introduction During the last decade, the Intel x86 instruction set has become a mainstay of the computing industry. Arguably, Intel processors have executed more instructions than all other computers ever built. Despite the widespread use of Intel processors and applications, however, few tools are available to assist the programmer and user in...
Summary: This article and the accompanying paper have outlined an ambitious vision which, in some respects, is a wide departure from present-day processors and platforms. But in reality, this vision is based on a continued evolution of Intel’s drive for increased parallelism, and our proven investment, research, development, manufacturing and unparalleled ecosystem enabling capability that, when taken together, will continue to lead us into an era of more powerful, versatile and efficient processing engines and platforms containing those engines. Ultimately, this evolution is driven by usage: what people want from technology and what they do with it. And though no one can precisely predict the future course of technology, the developments now underway point to some likely outcomes. Based on current requirements and trends, we at Intel believe that processor and platform architecture needs to move toward a virtualized, reconfigurable CMP architecture with a large number of cores, a rich set of built-in processing capabilities, a large on-chip memory subsystem and a sophisticated microkernel. This architectural evolution, delivered with volume computing economics and an adherence to maintaining compatibility with thousands of already existing applications, will ensure that Intel processors and platforms will continue to power a breathtaking array of sophisticated new applications over the coming years, transforming business and daily life in ways we can only begin to imagine.
Skylake's core, processor graphics, and system on chip were designed to meet a demanding set of requirements for a wide range of power-performance points. Its coherent fabric was designed to provide high memory bandwidth from multiple memory sources. Skylake's power management, which includes Intel Speed Shift technology, was designed to provide the largest dynamic power range among prior Intel processors. The Intel Architecture core delivers higher power efficiency, higher frequency, and a wider dynamic power range, supporting smaller form factors. Skylake's Gen9 graphics provides new features designed to maximize energy efficiency and bring the best visual experience for gaming and media. Skylake offers a rich performance monitoring unit that enhances software developers' ability to optimize their applications.
We analyse gather-scatter performance bottlenecks in molecular dynamics codes and the challenges that they pose for obtaining benefits from SIMD execution. This analysis informs a number of novel code-level and algorithmic improvements to Sandia's miniMD benchmark, which we demonstrate using three SIMD widths (128-, 256- and 512-bit). The applicability of these optimisations to wider SIMD is discussed, and we show that the conventional approach of exposing more parallelism through redundant computation is not necessarily best. In single precision, our optimised implementation is up to 5x faster than the original scalar code running on Intel® Xeon® processors with 256-bit SIMD, and adding a single Intel® Xeon Phi™ coprocessor provides up to an additional 2x performance increase. These results demonstrate: (i) the importance of effective SIMD utilisation for molecular dynamics codes on current and future hardware; and (ii) the considerable performance increase afforded by the use of Intel® Xeon Phi™ coprocessors for highly parallel workloads.