Found 20 results
FPGAs are increasingly gaining traction in cloud and edge computing environments due to their hardware flexibility, low latency, and low energy consumption. However, the existing hardware stack of FPGA and the host-FPGA connectivity does not allow flexible scaling and simultaneous reconfiguration of multiple devices, which limits the adoption of FPGA at scale. In this paper, we present SAF -- an Ethernet-based scalable acceleration framework that allows FPGA to be hot-plugged into a network in a stand-alone fashion without connecting to a local host CPU, which enables flexible scalability. SAF provides a custom FPGA shell and a set of Ethernet protocols that allow FPGAs to connect with a remote host to accelerate application kernels. SAF can configure multiple FPGAs simultaneously, which significantly reduces the reconfiguration time in scaling effort. We implemented the SAF framework using Intel FPGA SDK for OpenCL and 20 Bittware 385A cards with Arria-10 FPGAs. We analyze a case study and conduct experiments to compare SAF with state-of-the-art multi-FPGA clusters. Results show that SAF provides 13X faster reconfiguration than sequential PCIe programming, reduces the hardware set
Trends in hardware, the prevalence of the cloud, and the rise of highly demanding applications have ushered in an era of specialization that is quickly changing how data is processed at scale. These changes are likely to continue and accelerate in the coming years as new technologies are adopted and deployed: smart NICs, smart storage, smart memory, disaggregated storage, disaggregated memory, specialized accelerators (GPUs, TPUs, FPGAs), and a wealth of ASICs specifically created to deal with computationally expensive tasks (e.g., cryptography or compression). In this tutorial, we focus on data processing on FPGAs, a technology that has received less attention than, e.g., TPUs or GPUs but that is nevertheless increasingly being deployed in the cloud for data processing tasks due to the architectural flexibility of FPGAs, along with their ability to process data at line rate, something not possible with other types of processors or accelerators. In the tutorial, we will cover what FPGAs are, their characteristics, their advantages and disadvantages, as well as examples from deployments in the industry and how they are used in various data processing tasks. We will introduce FPGA programming wi
It is well known that, to accelerate stencil codes on CPUs or GPUs and to exploit hardware caches and their cache lines, optimizers must find spatial and temporal locality of array accesses to harvest data-reuse opportunities. On FPGAs, the burden is that there are no built-in caches (or only pre-built hardware descriptions for cache blocks that are inefficient for stencil codes). But this paper demonstrates that this lack is also an opportunity, as polyhedral methods can be used to generate stencil-specific cache structures of the right sizes on the FPGA and to fill and flush them efficiently with wide bursts during stencil execution. The paper shows how to derive the appropriate directives and code restructurings from stencil codes so that the FPGA compiler generates fast stencil hardware. Switching on our optimization improves the runtime of a set of 10 stencils by between 43x and 156x.
An increasing number of unhardened commercial-off-the-shelf embedded devices are deployed under harsh operating conditions and in highly-dependable systems. Due to the mechanisms of hardware degradation that affect these devices, ageing detection and monitoring are crucial to prevent critical failures. In this paper, we empirically study the propagation delay of 298 naturally-aged FPGA devices that are deployed in the European XFEL particle accelerator. Based on in-field measurements, we find that operational devices show significantly slower switching frequencies than unused chips, and that increased gamma and neutron radiation doses correlate with increased hardware degradation. Furthermore, we demonstrate the feasibility of developing machine learning models that estimate the switching frequencies of the devices based on historical and environmental data.
There is a growing call for greater amounts of increasingly agile computational power for edge and cloud infrastructure to serve the computationally complex needs of ubiquitous computing devices. Thus, an important challenge is addressing the holistic environmental impacts of these next-generation computing systems. To accomplish this, a life-cycle view of sustainability for computing advancements is necessary to reduce environmental impacts such as greenhouse gas emissions from these computing choices. Unfortunately, decadal efforts to address operational energy efficiency in computing devices have ignored, and in some cases exacerbated, embodied impacts from manufacturing these edge and cloud systems, particularly their integrated circuits. During this time, FPGA architectures have not changed dramatically except to increase in size. Given this context, we propose REFRESH FPGAs to build new FPGA devices and architectures from recently retired FPGA dies using 2.5D integration. Building REFRESH FPGAs requires creative architectures that leverage existing chiplet pins with an inexpensive-to-manufacture interposer coupled with creative design automation. In this paper, we discus
Cloud FPGAs strike an alluring balance between computational efficiency, energy efficiency, and cost. It is the flexibility of the FPGA architecture that enables these benefits, but that very same flexibility also exposes new security vulnerabilities. We show that a remote attacker can recover "FPGA pentimenti" - long-removed secret data belonging to a prior user of a cloud FPGA. The sensitive data constituting an FPGA pentimento is an analog imprint from bias temperature instability (BTI) effects on the underlying transistors. We demonstrate how this slight degradation can be measured using a time-to-digital converter (TDC) when an adversary programs one into the target cloud FPGA. This technique allows an attacker to ascertain previously safe information on cloud FPGAs, even after it is no longer explicitly present. Notably, it can allow an attacker who knows a non-secret "skeleton" (the physical structure, but not the contents) of the victim's design to (1) extract proprietary details from an encrypted FPGA design image available on the AWS marketplace and (2) recover data loaded at runtime by a previous user of a cloud FPGA using a known design. Our experiments show that BTI de
FPGAs are quickly becoming available in the cloud as one more heterogeneous processing element complementing CPUs and GPUs. There are many reports in the literature showing the potential for FPGAs to accelerate a wide variety of algorithms, which, combined with their growing availability, would seem to also indicate widespread use in many applications. Unfortunately, there is not much published research exploring what it takes to integrate an FPGA into an existing application in a cost-effective way while keeping the algorithmic performance advantages. Building on recent results exploring how to employ FPGAs to improve the search engines used in the travel industry, this paper analyses the end-to-end performance of the search engine when using FPGAs, as well as the necessary changes to the software and the cost of such deployments. The results provide important insights on current FPGA deployments and what needs to be done to make FPGAs more widely used. For instance, the large potential performance gains provided by an FPGA are greatly diminished in practice if the application cannot submit requests in the optimal way, something that is not always possible and might require s
The rapid growth of Internet-of-Things (IoT) and artificial intelligence applications has called forth a new computing paradigm: edge computing. In this paper, we study the suitability of deploying FPGAs for edge computing from the perspectives of throughput sensitivity to workload size, architectural adaptiveness to algorithm characteristics, and energy efficiency. This goal is accomplished by conducting comparison experiments on an Intel Arria 10 GX1150 FPGA and an Nvidia Tesla K40m GPU. The experimental results imply that the key advantages of adopting FPGAs for edge computing over GPUs are three-fold: (1) FPGAs can provide a consistent throughput invariant to the size of the application workload, which is critical to aggregating individual service requests from various IoT sensors; (2) FPGAs offer both spatial and temporal parallelism at a fine granularity and a massive scale, which guarantees a consistently high performance for accelerating both high-concurrency and high-dependency algorithms; and (3) FPGAs feature 3-4 times lower power consumption and up to 30.7 times better energy efficiency, offering better thermal stability and lower energy cost per functionality.
We propose FPGA-Patch, the first-of-its-kind defense that leverages automated program repair concepts to thwart power side-channel attacks on cloud FPGAs. FPGA-Patch generates isofunctional variants of the target hardware by injecting faults and finding transformations that eliminate failure. The obtained variants display different hardware characteristics, ensuring maximal diversity in power traces once dynamically swapped at run-time. Yet, FPGA-Patch forces the variants to have enough similarity, enabling bitstream compression and minimizing dynamic exchange costs. Considering AES running on an AMD/Xilinx FPGA, FPGA-Patch increases the attacker's effort by three orders of magnitude, while preserving the performance of AES and incurring a minimal area overhead of 14.2%.
Recently, FPGA accelerators have risen in popularity as they present a suitable way of satisfying the high-computation and low-power demands of real-time applications. Modern electric transportation systems (such as aircraft and road vehicles) can greatly profit from embedded FPGAs, which incorporate both high-performance and flexibility features into a single SoC. At the same time, the virtualization of FPGA resources aims to reinforce these systems with strong isolation, consolidation and security. In this paper, we present a novel virtualization framework aimed at SoC-attached FPGA devices, in a Linux and QEMU/KVM setup. We use Virtio as a means to enable the configuration of FPGA resources from guest systems in an efficient way. Also, we employ the Linux VFIO and Device Tree Overlays technologies in order to render the FPGA resources dynamically accessible to guest systems. The ability to dynamically configure and utilize the FPGA resources from a virtualization environment is described in detail. The evaluation procedure of the solution is presented and the virtualization overhead is benchmarked as minimal (around 10%) when accessing the FPGA devices from guest systems.
The P4 language has drastically changed the networking field as it allows new networking applications to be quickly described and implemented. Although a large variety of applications can be described with the P4 language, current programmable switch architectures impose significant constraints on P4 programs. To address this shortcoming, FPGAs have been explored as potential targets for P4 applications. P4 applications are described using three abstractions: a packet parser, match-action tables, and a packet deparser, which reassembles the output packet with the result of the match-action tables. While implementations of packet parsers and match-action tables on FPGAs have been widely covered in the literature, no general design principles have been presented for the packet deparser. Indeed, implementing a high-speed and efficient deparser on FPGAs remains an open issue because it requires a large number of interconnections and the architecture must be tailored to a P4 program. As a result, in several works where a P4 application is implemented on FPGAs, the deparser consumes a significant proportion of chip resources. Hence, in this paper, we address this issue by presenting design pri
FPGA-based hardware accelerators have received increasing attention mainly due to their ability to accelerate deeply pipelined applications, thus resulting in higher computational performance and energy efficiency. Nevertheless, the amount of resources available on even the most powerful FPGA is still not enough to speed up very large modern workloads. To achieve that, FPGAs need to be interconnected in a Multi-FPGA architecture capable of accelerating a single application. However, programming such an architecture is a challenging endeavor that still requires additional research. This paper extends the OpenMP task-based computation offloading model to enable a number of FPGAs to work together as a single Multi-FPGA architecture. Experimental results for a set of OpenMP stencil applications running on a Multi-FPGA platform consisting of 6 Xilinx VC709 boards interconnected through fiber-optic links have shown close to linear speedups as the numbers of FPGAs and IP-cores per FPGA increase.
Field programmable gate arrays (FPGAs) can accelerate image processing by exploiting fine-grained parallelism opportunities in image operations. FPGA language designs are often subsets or extensions of existing languages, though these typically lack suitable hardware computation models so compiling them to FPGAs leads to inefficient designs. Moreover, these languages lack image processing domain specificity. Our solution is RIPL, an image processing domain specific language (DSL) for FPGAs. It has algorithmic skeletons to express image processing, and these are exploited to generate deep pipelines of highly concurrent and memory-efficient image processing components.
FPGAs (Field Programmable Gate Arrays) have gained massive popularity today as accelerators for a variety of workloads, including big data analytics, and parallel and distributed computing. This has fueled the study of mechanisms to provision FPGAs among multiple tenants as general purpose computing resources on the cloud. Such mechanisms pose new challenges, such as ensuring IP protection and bitstream confidentiality for mutually distrusting clients sharing the same FPGA. A direct adoption of existing IP protection techniques from the single tenancy setting does not completely address these challenges, and is also not scalable enough for practical deployment. In this paper, we propose a dedicated and scalable framework for secure multi-tenant FPGA provisioning that can be easily integrated into existing cloud-based infrastructures such as OpenStack. Our technique has constant resource/memory overhead irrespective of the number of tenants sharing a given FPGA, and is provably secure under well-studied cryptographic assumptions. A prototype implementation of our proposition on Xilinx Virtex-7 UltraScale FPGAs is presented to validate its overheads and scalability when supporting mu
The emergence of P4, a domain-specific language, coupled with PISA, a domain-specific architecture, is revolutionizing the networking field. P4 makes it possible to describe how packets are processed by a programmable data plane, spanning ASICs and CPUs, implementing PISA. Because the processing flexibility can be limited on ASICs, while the performance of CPUs for networking tasks lags behind, recent works have proposed to implement PISA on FPGAs. However, little effort has been dedicated to analyzing whether FPGAs are good candidates to implement PISA. In this work, we take a step back and evaluate the micro-architecture efficiency of various PISA blocks. We demonstrate, supported by a theoretical and experimental analysis, that the performance of a few PISA blocks is severely limited by the current FPGA architectures. Specifically, we show that match tables and programmable packet schedulers represent the main performance bottlenecks for FPGA-based programmable switches. Thus, we explore two avenues to alleviate these shortcomings. First, we identify network applications well tailored to current FPGAs. Second, to support a wider range of networking applications, we propose modifications to the FPG
Integrating Field Programmable Gate Arrays (FPGAs) with cloud computing instances is a rapidly emerging trend on commercial cloud computing platforms such as Amazon Web Services (AWS), Huawei cloud, and Alibaba cloud. Cloud FPGAs allow cloud users to build hardware accelerators to speed up the computation in the cloud. However, since the cloud FPGA technology is still in its infancy, the security implications of this integration of FPGAs in the cloud are not clear. In this paper, we survey the emerging field of cloud FPGA security, providing a comprehensive overview of the security issues related to cloud FPGAs, and highlighting future challenges in this research area.
FPGAs are rarely mentioned when discussing the implementation of large machine learning applications, such as Large Language Models (LLMs), in the data center. There has been much evidence showing that single FPGAs can be competitive with GPUs in performance for some computations, especially for low latency, and often much more efficient when power is considered. This suggests that there is merit to exploring the use of multiple FPGAs for large machine learning applications. The challenge with using multiple FPGAs is that there is no commonly-accepted flow for developing and deploying multi-FPGA applications, i.e., there are no tools to describe a large application, map it to multiple FPGAs and then deploy the application on a multi-FPGA platform. In this paper, we explore the feasibility of implementing large transformers using multiple FPGAs by developing a scalable multi-FPGA platform and some tools to map large applications to the platform. We validate our approach by designing an efficient multi-FPGA version of the I-BERT transformer and implement one encoder using six FPGAs as a working proof-of-concept to show that our platform and tools work. Based on our proof-of-concept p
FPGA-level emulation is a key step in pre-silicon chip design validation. However, emulating large-scale multi-core systems increasingly exceeds the hardware resource capacity of a single FPGA, limiting the feasibility of full-system emulation. To address this challenge, we introduce EMiX, a scalable multi-FPGA framework that enables distributed emulation of multi-core RISC-V architectures beyond single-FPGA resource limits. EMiX systematically partitions a monolithic multi-core design into multiple components and deploys them across multiple interconnected FPGAs, effectively exploiting inter-FPGA interconnects to balance scalability and performance without requiring fundamental RTL redesign. We prototype EMiX with a 64-core architecture across eight interconnected Alveo U55c FPGAs (scalable on core and FPGA counts), successfully demonstrating full-system execution including Linux boot. EMiX will be released as an open-source platform.
The adoption of FPGAs in cloud-native environments is facing impediments due to FPGA limitations and CPU-oriented design of orchestrators, as they lack virtualization, isolation, and preemption support for FPGAs. Consequently, cloud providers offer no orchestration services for FPGAs, leading to low scalability, flexibility, and resiliency. This paper presents Funky, a full-stack FPGA-aware orchestration engine for cloud-native applications. Funky offers primary orchestration services for FPGA workloads to achieve high performance, utilization, scalability, and fault tolerance, accomplished by three contributions: (1) FPGA virtualization for lightweight sandboxes, (2) FPGA state management enabling task preemption and checkpointing, and (3) FPGA-aware orchestration components following the industry-standard CRI/OCI specifications. We implement and evaluate Funky using four x86 servers with Alveo U50 FPGA cards. Our evaluation highlights that Funky allows us to port 23 OpenCL applications from the Xilinx Vitis and Rosetta benchmark suites by modifying 3.4% of the source code while keeping the OCI image sizes 28.7 times smaller than AMD's FPGA-accessible Docker containers. In additio
Recently, recycled field-programmable gate arrays (FPGAs) have come to pose a significant hardware security problem due to the proliferation of the semiconductor supply chain. Ring oscillator (RO) based frequency analysis is one of the popular detection methods, but most studies use known fresh FPGAs (KFFs) in machine learning-based detection, which is not a realistic approach. In this paper, we present a novel recycled FPGA detection method that examines the symmetry information of the RO frequency using an unsupervised anomaly detection method. Due to the symmetrical array structure of the FPGA, some adjacent logic blocks on an FPGA have comparable RO frequencies; hence, our method simply analyzes the RO frequencies of those blocks to determine how similar they are. The proposed approach efficiently categorizes recycled FPGAs by utilizing direct density ratio estimation through outlier detection. Experiments using Xilinx Artix-7 FPGAs demonstrate that the proposed method accurately classifies recycled FPGAs from 10 fresh FPGAs by x fewer computations compared with the conventional method.