DNNs are rapidly evolving from streamlined single-modality single-task (SMST) to multi-modality multi-task (MMMT) with large variations for different layers and complex data dependencies among layers. To support such models, hardware systems also evolved to be heterogeneous. The heterogeneous system comes from the prevailing trend to integrate diverse accelerators into the system for lower latency. FPGAs have high computation density and communication bandwidth and are configurable to be deployed with different designs of accelerators, which are widely used for various machine-learning applications. However, scaling from SMST to MMMT on heterogeneous FPGAs is challenging since MMMT has much larger layer variations, a massive number of layers, and complex data dependency among different backbones. Previous mapping algorithms are either inefficient or over-simplified which makes them impractical in general scenarios. In this work, we propose CHEF to enable efficient implementation of MMMT models in realistic heterogeneous FPGA clusters, i.e. deploying heterogeneous accelerators on heterogeneous FPGAs (A2F) and mapping the heterogeneous DNNs on the deployed heterogeneous accelerators (M2A). We propose CHEF-A2F, a two-stage accelerators-to-FPGAs deployment approach to co-optimize hardware deployment and accelerator mapping. In addition, we propose CHEF-M2A, which can support general and practical cases compared to previous mapping algorithms. To the best of our knowledge, this is the first attempt to implement MMMT models in real heterogeneous FPGA clusters. Experimental results show that the latency obtained with CHEF is near-optimal while the search time is 10000X less than exhaustively searching the optimal solution.
Long-Time Coherent Integration (LTCI) utilizes digital integration to combine multiple coherent cycles, thereby improving the signal-to-noise ratio (SNR). Our previous work introduced single-bit LTCI, an approach optimized for FPGA implementation, but faced challenges of output saturation at high SNR levels and inherent limitations in SNR gain (SNRG), which are insufficient for certain applications. This paper presents a threshold tracking method that improves the performance of single-bit LTCI in high-SNR scenarios. In addition, a sampling rate enhancement technique and a Kalman filtering method are introduced to further enhance the SNR of the processed signals. An FPGA-based prototype was developed to validate these methods. The results demonstrate that the threshold tracking method extends the measurable input SNR range to 30. Under the specified conditions, the sampling rate enhancement technique yields a 30% improvement in SNR over the original method, while the Kalman filter reduces noise levels to 60% of their original values.
Over the past few years, the emergence and development of design space exploration (DSE) have shortened the deployment cycle of deep neural networks (DNNs). As a result, with these open-source DSE tools, we can automatically compute the optimal configuration and generate the corresponding accelerator intellectual properties (IPs) from the pretrained neural network models and hardware constraints. However, to date, the security of DSE has received little attention. Therefore, we explore this issue from an adversarial perspective and propose an automated hardware Trojan (HT) generation framework embedded within DSE. The framework uses an evolutionary algorithm (EA) to analyze user-input data and automatically generate the attack code before placing it in the final output accelerator IPs. The proposed HT is sufficiently stealthy and suitable for both single- and multi-field-programmable gate array (FPGA) designs. It can also implement controlled accuracy degradation attacks and specified category attacks. We conducted experiments on LeNet, VGG-16, and YOLO, and found that for the LeNet model trained on the CIFAR-10 dataset, attacking only one kernel resulted in 97.3% of images being classified in the category specified by the adversary and reduced accuracy by 59.58%. Moreover, for the VGG-16 model trained on the ImageNet dataset, attacking eight kernels can cause up to 96.53% of the images to be classified into the category specified by the adversary and causes the model's accuracy to decrease to 2.5%. Finally, for the YOLO model trained on the PASCAL VOC dataset, attacking with eight kernels can cause the model to identify the target as the specified category and cause slight perturbations to the bounding boxes. Compared to the un-compromised designs, the look-up table (LUT) overhead of the proposed HT design does not exceed 0.6%.
One of the challenges that wireless sensor networks (WSNs) need to address is achieving security and privacy while keeping low power consumption at sensor nodes. Physically unclonable functions (PUFs) offer a challenge-response functionality that leverages the inherent variations in the manufacturing process of a device, making them an optimal solution for sensor node authentication in WSNs. Thus, identifiability is the fundamental property of any PUF. Consequently, it is necessary to design structures that optimize the PUF in terms of identifiability. This work studies different architectures of oscillators to analyze which ones exhibit the best properties to construct an RO-based PUF. For this purpose, Generalized Galois Ring Oscillators (GenGAROs) are used. A GenGARO is a novel type of oscillator formed by a combination of up to two input logical operations connected in cascade, where one input is the output of the previous operation and the other is the feedback signal. GenGAROs include some previously proposed oscillators as well as many new oscillator designs. Thus, the architecture of GenGAROs is analyzed to implement a GenGARO-PUF on an Artix-FPGA. With this purpose, an exhaustive study of logical operation combinations that optimize PUF performance in terms of identifiability has been conducted. From this, it has been observed that certain logic gates in specific positions within the oscillator contribute to constructing a PUF with good properties, and by applying certain constraints, any oscillator generated with these constraints can be used to construct a PUF with an equal error rate on the order of or below 10^-11 using 100-bit responses. As a result, a design methodology for FPGA-based RO-PUFs has been developed, enabling the generation of multiple PUF primitives with high identifiability that other designers could exploit to implement RO-based PUFs with good properties.
Various data acquisition systems have been developed based on the time-interleaved analog-to-digital converter (TIADC) technique, in which digital calibration is typically implemented on field-programmable gate arrays (FPGAs). However, FPGA-based TIADC systems suffer from high power consumption, high implementation complexity, limited hardware resources, and low integration efficiency. In this work, we propose a systematic design method to reduce the resource consumption of TIADC systems. By analyzing the characteristics of filter coefficients and optimizing allocation of computational errors, a resource-efficient digital calibration filter is proposed, achieving over 80% reduction in area and power consumption. A multi-phase sampling clock generation circuit with adjustable delay is integrated to simplify system implementation and provide coarse timing mismatch adjustment. Furthermore, the proposed architecture supports calibration of high-speed ADCs with different resolutions. To validate the proposed approach, a prototype application-specific integrated circuit (ASIC) was implemented in a 130 nm CMOS technology. The chip is designed to interface with up to four 5-Gsps ADCs and consumes 11.5 W of power. Simulation and analysis results demonstrate that broadband mismatch errors can be effectively calibrated by the ASIC. According to the scaling model, if fabricated in a process node comparable with that of modern FPGAs, the ASIC would achieve more than 90% reduction in power consumption relative to FPGA-based implementations consuming several tens of watts, while also lowering the implementation complexity of the TIADC system.
Aerospace-grade SRAM-based field-programmable gate arrays (FPGAs) used in space applications are highly susceptible to single event effects, which lead to soft errors. Additionally, as FPGAs scale up, the difficulty of correcting soft errors also increases. This paper proposes that performing soft error sensitivity analysis on FPGAs can help target the more sensitive areas for detection and correction, thereby improving the efficiency of soft error repair. Firstly, in accordance with the dual-layer architecture of SRAM-based FPGAs, methods for the soft error sensitivity analysis of FPGA application layer resources and configuration bitstreams are reviewed. Subsequently, based on the analysis results, the corresponding application layer memory scrubbing and configuration scrubbing techniques are covered. The review closes with a prospective look at emerging soft error mitigation technologies, supporting the development of highly reliable aerospace-grade SRAM-based FPGAs.
The advent of next-generation sequencing (NGS) has revolutionized genomic research by enabling cost-effective, high-throughput sequencing of a diverse range of organisms. This breakthrough has unleashed a "Cambrian explosion" in genomic data volume and diversity. This volume of workloads places genomics among the top four big data challenges anticipated for this decade. In this context, pairwise sequence alignment represents a very time- and energy-intensive step in common bioinformatics pipelines. Speeding up these computations requires the implementation of heuristic approaches, optimized algorithms, and/or hardware acceleration. Among the metrics used in sequence comparison, edit distance is an adopted measure of sequence similarity. Although state-of-the-art CPU and GPU implementations have demonstrated significant performance gains, recent FPGA implementations have shown improved energy efficiency. However, the latter often suffer from limited read-length scalability due to constraints on hardware resources, with some reported designs supporting comparison matrices for sequences of only up to 227 nucleotides. In this work, we present a flexible FPGA-based accelerator template that implements Myers's algorithm to compute exact unit-cost edit-distance up to 1000 bp using high-level synthesis and a worker-based architecture. GeneTEK, a set of instances of this accelerator template in a Xilinx Zynq UltraScale+ FPGA, achieves up to 113% increase in execution speed and up to 111× reduction in energy consumption compared to leading CPU and GPU solutions, while fitting comparison matrices up to 13× larger than previous FPGA-based systolic-array solutions. By following a SW-HW co-design approach, GeneTEK implements efficient memory access and exploits parallelization at multiple levels. These results reaffirm the potential of FPGAs as an energy-efficient platform for computing the exact unit-cost edit distance used in sequence comparisons of read-lengths up to 1000 bp.
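As a software illustration of the bit-vector technique named in the abstract above, the following is a minimal Python sketch of Myers's bit-parallel recurrence in its global edit-distance form (the variant usually attributed to Hyyrö). It is not the GeneTEK hardware design: the function name `myers_edit_distance` and the use of Python's arbitrary-precision integers as a single wide bit-vector are assumptions made for illustration only.

```python
def myers_edit_distance(a: str, b: str) -> int:
    """Unit-cost (Levenshtein) edit distance via Myers's bit-vector recurrence.

    One column of the dynamic-programming matrix is encoded as two bit-vectors
    of vertical deltas (vp: +1, vn: -1), so each character of `b` is processed
    with a constant number of word-level operations.
    """
    m = len(a)
    if m == 0:
        return len(b)

    full = (1 << m) - 1      # mask that keeps exactly m bits
    top = 1 << (m - 1)       # bit corresponding to the last row of the column

    # peq[ch] has bit i set when a[i] == ch
    peq = {}
    for i, ch in enumerate(a):
        peq[ch] = peq.get(ch, 0) | (1 << i)

    vp, vn, dist = full, 0, m  # first column: every vertical delta is +1
    for ch in b:
        x = peq.get(ch, 0)
        d0 = ((((x & vp) + vp) ^ vp) | x | vn) & full  # diagonal delta is 0
        hp = (vn | ~(d0 | vp)) & full                  # horizontal delta +1
        hn = d0 & vp                                   # horizontal delta -1
        if hp & top:
            dist += 1
        if hn & top:
            dist -= 1
        # Shift in a 1: the top boundary row grows by one per column (global alignment)
        hp = ((hp << 1) | 1) & full
        hn = (hn << 1) & full
        vp = (hn | ~(d0 | hp)) & full
        vn = hp & d0
    return dist
```

For example, `myers_edit_distance("kitten", "sitting")` evaluates to 3. A hardware implementation would typically split the bit-vector into fixed-width words rather than rely on big integers; that partitioning is left out of this sketch.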
The continually increasing volume of DNA sequence data has resulted in a growing demand for fast implementations of core algorithms. Computation of pairwise alignments between candidate haplotypes and sequencing reads using Pair-HMMs is a key component in DNA variant calling tools such as the GATK HaplotypeCaller but can be highly time consuming due to its quadratic time complexity and the large number of pairs to be aligned. Unfortunately, previous approaches to accelerate this task using the massively parallel processing capabilities of modern GPUs are limited by inefficient memory access schemes. This established the need for significantly faster solutions. We address this need by presenting gpuPairHMM - a novel GPU-based parallelization scheme for the dynamic-programming based Pair-HMM forward algorithm based on wavefronts and warp-shuffles. It gains efficiency by minimizing both memory accesses and instructions. We show that our approach achieves close-to-peak performance on several generations of modern CUDA-enabled GPUs (Volta, Ampere, Ada, Hopper, Blackwell). It also outperforms prior implementations on GPUs, CPUs, and FPGAs by a factor of at least 11.7, 14.2, and 19.8, respectively.
Multiplication is a fundamental mathematical operation that finds extensive applications across various disciplines, particularly in computation-intensive and error-resilient applications, such as image processing. As hardware circuits become more complex, there is a growing demand for approximation circuit methods. Implementation of approximate multipliers has the potential to yield substantial reductions in hardware costs while maintaining acceptable performance levels. Most current designs for approximate multipliers are optimized for ASIC-based circuits, which may not produce similar performance improvements when adapted for FPGA-based circuits. Additionally, many of these existing multiplier designs are limited to unsigned numbers. This paper proposes a novel approach for designing signed approximate multipliers tailored specifically for FPGAs. Two efficient architectures are introduced that efficiently utilize key FPGA components, such as LUTs and Carry4 primitives, by designing the optimal LUT-Carry4 netlists. A Pareto-based analysis is also performed to balance trade-offs and achieve a low mean error distance (MED). Simulation results confirm that the proposed architectures offer superior performance compared to existing signed approximate multipliers, delivering improved power efficiency, reduced resource usage, shorter critical path delay (CPD), and enhanced computational accuracy. The practical applicability of these approximate multipliers is further validated through their use in image processing applications.
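The mean error distance (MED) metric mentioned in the abstract above can be illustrated with a short sketch. The approximate multiplier used here (`approx_mul_trunc`, which clears the low bits of each operand) is a hypothetical stand-in chosen for simplicity; it is not the LUT-Carry4 architecture proposed in the paper, and the 8-bit operand width is likewise an assumption.

```python
def approx_mul_trunc(a: int, b: int, k: int = 2) -> int:
    """Hypothetical approximate signed multiplier: clear the k low bits of each
    operand (rounding toward negative infinity) before multiplying exactly."""
    mask = ~((1 << k) - 1)
    return (a & mask) * (b & mask)


def mean_error_distance(approx, bits: int = 8) -> float:
    """Average |approx(a, b) - a * b| over all pairs of signed `bits`-bit operands."""
    lo, hi = -(1 << (bits - 1)), 1 << (bits - 1)
    total = sum(
        abs(approx(a, b) - a * b)
        for a in range(lo, hi)
        for b in range(lo, hi)
    )
    return total / (hi - lo) ** 2


if __name__ == "__main__":
    # An exact multiplier has MED 0; the truncating one has a positive MED.
    print(mean_error_distance(lambda a, b: a * b))
    print(mean_error_distance(approx_mul_trunc))
```

In a Pareto-style analysis, a value such as this would be plotted against a hardware cost (for example LUT count) for each candidate design.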
With the rise of smart cities, technology has enabled more efficient urban management. A key part of this is the Internet of Vehicles (IoVs), which connects vehicles to smart city systems to improve transportation safety and efficiency. This integrated system enables wireless connection between vehicles, allowing for the sharing of essential traffic information. However, with all this connectivity, there are growing concerns about IoV security and privacy. This paper presents a new privacy-preserving authentication scheme for Autonomous Vehicles (AVs) in the IoV field using physical unclonable functions (PUFs). This scheme employs a bilinear pairing-based encryption technique that supports search over encrypted data. The primary aim of this scheme is to authenticate AVs inside the IoV architecture. A novel PUF design generates random keys for our authentication technique, hence boosting security. This dual-layer security strategy safeguards against a range of cyber threats, including identity fraud, man-in-the-middle attacks, and unauthorized access to personal user data. The PUF design will guarantee the true randomness of the AVs' users' secret keys. To handle the large amount of data involved, we use hardware acceleration with different Field-Programmable Gate Arrays (FPGAs). Our examination of privacy and security demonstrates the achievement of the defined design goals. The proposed authentication framework was fully implemented and validated on FPGA platforms to demonstrate its hardware feasibility and efficiency. The integrated heterogeneous PUF achieves an average reliability exceeding 98.5% across a wide temperature range, while maintaining near-ideal randomness with an average Hamming weight of 49.7% over multiple challenge sets. Furthermore, the uniqueness metric approaches 49.9%, confirming strong inter-device distinguishability among different PUF instances. The complete authentication architecture was synthesized on Nexys-100T, Zynq-104, and Kintex-116 devices, where the design utilizes less than 80% of slice Look-Up Tables (LUTs), under 27% of on-chip memory resources, and below 16% of DSP blocks, demonstrating low hardware overhead.
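The PUF figures quoted in the abstract above (Hamming weight, uniqueness, reliability) are commonly computed from response bit-strings as sketched below. The definitions follow the usual conventions in the PUF literature and are assumed here; the abstract does not state the exact formulas the authors used, and the function names are illustrative.

```python
from itertools import combinations


def hamming_distance(x, y):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(a != b for a, b in zip(x, y))


def uniformity(response):
    """Hamming weight of one response as a percentage (ideal: 50%)."""
    return 100.0 * sum(response) / len(response)


def uniqueness(responses):
    """Mean pairwise inter-device Hamming distance as a percentage (ideal: 50%).
    `responses` holds one response per device, all to the same challenge."""
    pairs = list(combinations(responses, 2))
    return 100.0 * sum(hamming_distance(u, v) / len(u) for u, v in pairs) / len(pairs)


def reliability(reference, remeasured):
    """100% minus the mean intra-device Hamming distance between a reference
    response and repeated measurements, e.g. at other temperatures (ideal: 100%)."""
    mean_hd = sum(hamming_distance(reference, r) / len(reference) for r in remeasured)
    return 100.0 * (1.0 - mean_hd / len(remeasured))
```

As a usage example, `reliability([1, 0, 1, 1], [[1, 0, 1, 1], [1, 0, 1, 0]])` gives 87.5, because one of eight re-measured bits flipped.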
Optical coherence tomography (OCT) is a non-invasive, high-resolution imaging technique widely used in medical diagnosis, biomedical research and other fields. It plays an important role in the early detection and accurate diagnosis of diseases. The superluminescent light-emitting diode (SLED) is the ideal light source for OCT systems, where the stability of its drive current and operating temperature directly determines the imaging quality of OCT. Existing driving and temperature control schemes for similar light sources predominantly rely on microcontrollers or field programmable gate arrays (FPGAs), a reliance which often results in complex system architectures and difficulties in balancing simplicity with control precision. To address these issues, a stable and compact SLED source driver module designed for OCT was developed in this study, integrating both a constant-current drive circuit and a temperature control circuit. Negative feedback control and improved current-limiting protection are employed in the constant-current drive circuit to maintain stable SLED operation and reduce the circuit footprint. A miniature dedicated temperature control chip is adopted in the temperature control circuit. The operating temperature of the SLED is acquired by linearizing the negative temperature coefficient (NTC) thermistor value and regulated through a proportional-integral-derivative (PID) compensation circuit. The size of the fabricated module (including casing) is less than 10 × 8 × 3 cm³. Experimental results show that the driver module achieves a drive current control accuracy of 0.1% and a temperature control accuracy of 0.01 °C. The output optical power fluctuation is less than 0.005 mW and the average axial resolution for OCT is 6.5992 μm with a standard deviation of 0.0107 μm. This light source driver module successfully balances control precision with structural simplicity, demonstrating excellent applicability in OCT systems.
This paper presents a methodology for the remaining useful life (RUL) prediction in power electronic converters based on the health monitoring of semiconductor devices and DC capacitor(s). The primary components considered are Gallium Nitride High Electron Mobility Transistors (GaN HEMTs) and aluminum electrolytic capacitors (AECs). The proposed methodology leverages long-term component characterization under accelerated aging testing in a laboratory setup. A statistical approach based on uniform probability density functions (PDFs) is utilized to estimate the system-level probability of survival under the desired operating conditions. Given that the PDFs and system-level probability calculations involve taking integrals and other complex operations, a machine learning (ML)-assisted model is utilized to reduce the computational burden on the controller units in power electronics converters. A neural network (NN) is used to process the experimentally derived degradation data and the extracted PDFs for arriving at a simple model presented by a few matrices that can even be deployed on modern microcontrollers or field-programmable gate arrays (FPGAs) for in-situ implementation. Experimental results obtained using a laboratory-scale prototype show that the proposed data-driven approach can achieve accuracy levels higher than 99% in predicting the time evolution between degradation checkpoints under accelerated thermal cycling conditions. This validation confirms the model's consistency with established statistical approaches across the full reliability range (T99-T01). Hence, this approach can identify aged or potentially failing converters in-situ and avoid early decommissioning, thereby extending the operational life.
Autonomous driving perception demands low latency, high temporal resolution, and stringent hardware efficiency. While event-based spiking neural networks (SNNs) offer bio-inspired sparse computation, their deployment on edge field-programmable gate arrays (FPGAs) is obstructed by irregular execution patterns and temporal state storage overhead. To address this, we propose HAPQ, a unified hardware-aware pruning and quantization pipeline for compact event-based object detection. Starting from an end-to-end adaptive sampling SNN detector (EAS-SNN), HAPQ conducts hardware-aware configuration search within discrete digital signal processor (DSP) and block RAM (BRAM) budgets, applies single-instruction-multiple-data (SIMD)-aligned structured pruning for computational regularity, and jointly quantizes synaptic weights and membrane potentials via a shift-friendly fixed-point recurrence. Evaluation on the Prophesee Gen1 dataset and an FPGA accelerator shows that HAPQ improves detection accuracy from 0.284 to 0.425 in mean average precision (mAP50:95) and achieves 0.722 mAP50. Hardware implementation reveals a reduction in lookup table (LUT) usage to 1680, complete DSP elimination, and a maximum operating frequency of 920.81 MHz at 0.630 W. These results confirm that effective temporal SNN deployment requires joint optimization of model architecture, state precision, and hardware-aligned workload organization.
Heterogeneous computing infrastructures integrating CPUs, GPUs, and FPGAs present critical challenges in efficient task scheduling due to hardware diversity, complex task dependencies, and conflicting optimization objectives. This work formulates workflow scheduling as a multi-objective optimization problem. For synthetic benchmarks (FFT, Molecular), the approach minimizes makespan and maximizes resource utilization; for the CyberShake seismic workflow, energy consumption is added as a third objective. This research proposes QLSA-MOEAD, a hybrid framework combining three complementary mechanisms: Q-learning for intelligent initialization, Simulated Annealing for local refinement, and MOEA/D for multi-objective decomposition. This integration balances exploration and exploitation effectively. Comprehensive evaluations on 20 test cases (structured FFT, unstructured molecular, and real-world CyberShake workflows) show superior performance. QLSA-MOEAD achieves the best solution quality in 14 out of 16 FFT/molecular cases and outperforms all baselines on CyberShake. A large-scale Montage workflow (100 tasks, 179 dependencies) validates scalability under real-time task arrivals. The framework maintains excellent convergence and diversity across different CCR levels. Q-learning achieves fast decision-making with 0.80-1.70 ms response time. Statistical validation (Wilcoxon and Friedman tests), ablation studies, and parameter sensitivity analysis confirm framework robustness. These results establish QLSA-MOEAD as an effective solution for both static and dynamic workflow scheduling in heterogeneous environments.
This paper presents a neuromorphic processing system integrating a compressed sensing spiking neural network (CSSNN) designed for sparse signal classification. The proposed CSSNN combines data coding, data compression, and SNN classification, enabling end-to-end optimization of network performance and model compression. Evaluated on the MNIST, N-MNIST, and DVS Gesture datasets, under uniform compression ratios (CRs) of 0.1, 0.05, 0.025, and 0.01, the proposed CSSNN consistently reduces the total number of network operations (OPs) by at least 80% compared with compressed learning methods using fixed Gaussian random matrix (GRM) sampling matrices, while maintaining minimal accuracy loss. A specialized CSSNN processor is designed based on a spike-driven processing flow. Validated on field-programmable gate arrays (FPGAs) and evaluated in a 40 nm CMOS process for application-specific integrated circuit (ASIC) design, this CSSNN processor achieves 96.12% classification accuracy with 8-bit fixed-point quantization on the MNIST dataset. The power consumption of the ASIC is estimated to be 2.089 mW under a 1.1-V supply voltage and 100 MHz frequency.
Mechanized harvesting in the industrial tomato sector is currently bottlenecked by excessive mechanical injuries and elevated levels of foreign materials generated during electro-mechanical combine harvesting operations. To combat these limitations, this comprehensive review explores recent breakthroughs in harvester-mounted smart grading systems engineered specifically for complex, open-field conditions. Rather than relying solely on conventional optical inspection, the study examines the transition toward advanced, heterogeneous edge-computing frameworks (incorporating FPGAs and embedded GPUs) deployed within electro-mechanical harvesting platforms. This architectural evolution plays a crucial role in mitigating unpredictable processing delays caused by intense operational vibrations, although achieving absolute real-time stability under extreme field conditions remains an ongoing challenge. To minimize bruising and physical deterioration, our analysis synthesizes findings from multi-scale explicit dynamic finite element simulations, unpacking the underlying microstructural failure modes of the crop. We illustrate how regulating applied forces via soft robotic effectors can help approach a 'damage-free' handling threshold, though empirical results vary depending on fruit maturity and dynamic operational speeds. Furthermore, coupling multi-modal sensor fusion with Convolutional Neural Networks (CNNs) shows promising potential for non-destructive internal property evaluation under the vibration, dust, and throughput constraints of electro-mechanical harvesters, pending broader validation across diverse field datasets. Ultimately, by projecting future trends in onboard electro-mechanical harvester separation and advocating for a closer synergy between agronomic practices and machine engineering, this paper delivers a comprehensive blueprint for building next-generation, highly resilient, and gentle sorting machinery.
Adaptive Banded Event Alignment (ABEA) is a critical algorithmic component in sequence polishing and DNA methylation detection, employing dynamic programming to align raw Nanopore signal with reference reads. Motivated by the observation that, compared to CPUs and GPUs, cutting-edge FPGAs demonstrate, in certain cases, superior performance at a reduced cost and energy consumption, this paper presents an efficient FPGA-based accelerator for ABEA, leveraging the inherent high parallelism and sequential access pattern within ABEA. Our proposed FPGA-based ABEA accelerator significantly enhances ABEA performance compared to the original CPU-based implementation in Nanopolish as well as the state-of-the-art acceleration on GPU and FPGA platforms. Specifically, targeting Xilinx VU9P, our accelerator achieves an average throughput speedup of 10.05× over the CPU-only implementation, an average 1.81× speedup over the state-of-the-art GPU acceleration with only 7.2% of the energy, and a speedup of 10.11× compared to an existing FPGA accelerator. Our work demonstrates that intensive genome analysis can benefit significantly from cutting-edge FPGAs, offering improvements in both performance and energy consumption.
This paper presents the implementation of a picosecond resolution timing generator (TG) insensitive to process, voltage, and temperature (PVT) variations for automatic test equipment. The TG is implemented in field-programmable gate arrays (FPGAs) using two-stage time interpolation, which utilizes a multi-phase generator, IDELAY3, and carry-chain resources. To enhance the test rate, each channel of the proposed TG consists of four parallel operating edge generators. The TG performance will deteriorate severely without offset correction due to its sensitivity to PVT variations. To improve the adaptability of the TG, we design a robust offset canceler to ensure stable performance of the TG, resilient to PVT variations. With the proposed architecture and offset canceler, the PVT-insensitive TG achieves a time resolution of 5 ps and offers a maximum dynamic range of 10 s. It also shows improved worst case integral non-linearity ranging from -4.7 to +4.6 ps with the operating temperature continuously varying from 15 to 65 °C and voltage ranging from 0.95 to 1.01 V in FPGAs. The proposed TG can be implemented in the Ultrascale or Ultrascale+ FPGA platform.
Wearable devices can be developed using hardware platforms such as Application Specific Integrated Circuits (ASICs), Graphics Processing Units (GPUs), Digital Signal Processors (DSPs), Microcontroller Units (MCUs), or Field Programmable Gate Arrays (FPGAs), each with distinct advantages and limitations. ASICs offer high efficiency but lack flexibility. GPUs excel in parallel processing but consume significant power. DSPs are optimized for signal processing but are limited in versatility. CPUs provide low power consumption but lack computational power. FPGAs are highly flexible, enabling powerful parallel processing at lower energy costs than GPUs but with higher resource demands than ASICs. The combined use of FPGAs and CPUs balances power efficiency and computational capability, making it ideal for wearable systems requiring complex algorithms in far-edge computing, where data processing occurs onboard the device. This approach promotes green electronics, extending battery life and reducing user inconvenience. The primary goal of this work was to develop a versatile framework, similar to existing software development frameworks, but specifically tailored for mixed FPGA/MCU platforms. The framework was validated through a real-world use case, demonstrating significant improvements in execution speed and power consumption. These results confirm its effectiveness in developing green and smart wearable systems.