This paper presents KyberSlash1 and KyberSlash2 – two timing vulnerabilities in several implementations (including the official reference code) of the Kyber Post-Quantum Key Encapsulation Mechanism, recently standardized as ML-KEM. We demonstrate the exploitability of both KyberSlash1 and KyberSlash2 on two popular platforms: the Raspberry Pi 2 (Arm Cortex-A7) and the Arm Cortex-M4 microprocessor. Kyber secret keys are reliably recovered within minutes for KyberSlash2 and a few hours for KyberSlash1. We responsibly disclosed these vulnerabilities to maintainers of various libraries and they have swiftly been patched. We present two approaches for detecting and avoiding similar vulnerabilities. First, we patch the dynamic analysis tool Valgrind to allow detection of variable-time instructions operating on secret data, and apply it to more than 1000 implementations of cryptographic primitives in SUPERCOP. We report multiple findings. Second, we propose a more rigid approach to guarantee the absence of variable-time instructions in cryptographic software using formal methods.
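The class of bug can be illustrated with a minimal sketch, assuming the leak comes from a division by the Kyber modulus q = 3329 applied to a secret-dependent coefficient during message decoding (division latency depends on operand values on many cores). The function names are hypothetical, and the constants 1665, 80635, and 28 are an assumption based on the multiply-and-shift replacement found in publicly patched code; the exhaustive check confirms that both variants agree on every coefficient. Python offers no timing guarantees, so this shows arithmetic equivalence only.

```python
Q = 3329  # Kyber modulus


def decode_bit_with_division(t: int) -> int:
    """Round 2*t/Q to the nearest integer and keep the low bit (uses a division)."""
    return (((t << 1) + Q // 2) // Q) & 1


def decode_bit_division_free(t: int) -> int:
    """Same result using only an add, a multiply by a constant, and a shift."""
    t = (t << 1) + 1665
    t *= 80635          # floor(2**28 / Q)
    t >>= 28
    return t & 1


# Exhaustive comparison over all valid coefficients 0 <= t < Q.
MISMATCHES = [t for t in range(Q)
              if decode_bit_with_division(t) != decode_bit_division_free(t)]
```

The list `MISMATCHES` is empty when the two functions agree, which is what makes the replacement safe to use for inputs in the range 0 to q − 1.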
The emergence of quantum computing and its impact on current cryptographic algorithms has triggered the migration to post-quantum cryptography (PQC). Among the PQC candidates, CRYSTALS-Kyber is a key encapsulation mechanism (KEM) that stands out from the National Institute of Standards and Technology (NIST) standardization project. While software implementations of Kyber have been developed and evaluated recently, Kyber’s hardware implementations, especially those designed with parallel architectures, are rarely discussed. To help better understand Kyber hardware designs and their security against side-channel analysis (SCA) attacks, we first adapt the two most recent Kyber hardware designs for FPGA implementation. We then perform SCA attacks against these hardware designs with different architectures, i.e., parallelization and pipelining. Our experimental results show that Kyber designs on FPGA boards are vulnerable to SCA attacks including electromagnetic (EM) and power side channels. An attacker needs only 27 to 1,600 power traces or 60 to 2,680 EM traces to recover the decryption key successfully. Furthermore, we propose two first-order masked designs for IND-CPA Kyber decapsulation, and we evaluate their security and overheads. The experimental results demonstrate that the side-channel security of the masked Kyber designs has increased by more than 10x.
Quantum computing raises questions about the security of data encrypted using modern methods. Hence, the National Institute of Standards and Technology (NIST) has undertaken standardization of post-quantum cryptography (PQC) algorithms to defend against attacks from both classical and quantum computers. Following four rounds of evaluation, CRYSTALS-Kyber has been selected for standardization. In this paper, we present an efficient hardware architecture of CRYSTALS-Kyber for resource-constrained IoT devices. Firstly, we propose a compact hash module for CRYSTALS-Kyber. A single buffer is designed to perform padding, hashing, and holding data. Hence, using large FIFOs for data input/output is eliminated. Then, we propose a novel non-memory-based iterative number theoretic transform (NMI-NTT) architecture. Finally, the data flow between modules is optimized to improve parallelization and execution time. Implementation results on an Artix-7 FPGA show that our design consumes minimal hardware resources compared to the designs reported to date, corresponding to 5487 LUTs, 3426 FFs, 1548 SLICEs, 3.5 BRAMs, and 2 DSPs. Our design computes key generation, encapsulation, and decapsulation phases in 3.3/4.5/6.1 K-cycles for Kyber512, 5.6/7.1/9.2 K-cycles for Kyber768, and 8.5/10.1/12.9 K-cycles for Kyber1024, with 185MHz operating frequency. Our area-time-product (ATP) performance outperforms other designs.
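The cycle counts above can be converted to wall-clock latency with simple arithmetic. The sketch below assumes the reported figures are rounded thousands of cycles at exactly 185 MHz, and it assumes one common definition of area-time product (LUT count multiplied by latency); the abstract does not state which area metric its ATP uses, so `area_time_product` is illustrative only.

```python
FREQ_HZ = 185e6  # operating frequency reported in the abstract

# (key generation, encapsulation, decapsulation) in cycles, from the abstract
CYCLES = {
    "Kyber512": (3300, 4500, 6100),
    "Kyber768": (5600, 7100, 9200),
    "Kyber1024": (8500, 10100, 12900),
}


def latency_us(cycles: int, freq_hz: float = FREQ_HZ) -> float:
    """Latency in microseconds for a given cycle count and clock frequency."""
    return cycles / freq_hz * 1e6


def area_time_product(area: float, cycles: int, freq_hz: float = FREQ_HZ) -> float:
    """Assumed ATP definition: area (e.g. LUTs) times latency in microseconds."""
    return area * latency_us(cycles, freq_hz)


LATENCIES_US = {name: tuple(round(latency_us(c), 2) for c in phases)
                for name, phases in CYCLES.items()}
```

Under these assumptions, Kyber512 key generation takes roughly 18 µs and Kyber1024 decapsulation roughly 70 µs.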
In 2022, the National Institute of Standards and Technology (NIST) announced the Post-Quantum Cryptography (PQC) candidates selected for standardization. Among all the Key Encapsulation Mechanism (KEM) schemes, CRYSTALS-Kyber emerged as the sole winner. This paper presents an improved version of Plantard arithmetic that speeds up Kyber implementations on two low-end 32-bit IoT platforms (ARM Cortex-M3 and RISC-V) without SIMD extensions. Specifically, we further enlarge the input range of the Plantard arithmetic without modifying its computation steps. After tailoring the Plantard arithmetic for Kyber’s modulus, we show that the input range of the Plantard multiplication by a constant is at least 2.14× larger than the original design in TCHES2022. Then, two optimization techniques for efficient Plantard arithmetic on Cortex-M3 and RISC-V are presented. We show that the Plantard arithmetic supersedes both Montgomery and Barrett arithmetic on low-end 32-bit platforms. With the enlarged input range and the efficient implementation of the Plantard arithmetic on these platforms, we propose various optimization strategies for NTT/INTT. We minimize or entirely eliminate the modular reduction of coefficients in NTT/INTT by taking advantage of the larger input range of the proposed Plantard arithmetic on low-end 32-bit platforms. Furthermore, we propose two memory optimization strategies that reduce stack usage by 23.50% to 28.31% for the speed-version Kyber implementation when compared to its counterpart on Cortex-M4. The proposed optimizations make the speed-version implementation more feasible on low-end IoT devices. Thanks to the aforementioned optimizations, our NTT/INTT implementation shows considerable speedups compared to the state-of-the-art work. Overall, we demonstrate the applicability of the speed-version Kyber implementation on memory-constrained IoT platforms and set new speed records for Kyber on these platforms.
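For readers unfamiliar with Plantard arithmetic, the following is a minimal sketch of the baseline signed Plantard multiplication for Kyber's modulus, not the enlarged-range variant this abstract proposes. The parameters l = 16 and α = 3, the helper names, and the input bound |a·b| ≤ (q·2^α)² are assumptions taken from the commonly described form of the algorithm; the result is congruent to −a·b·2^(−2l) modulo q, so one operand is usually premultiplied by −2^(2l) mod q.

```python
Q = 3329                     # Kyber modulus
L = 16                       # half word size (assumption: 32-bit words)
ALPHA = 3                    # slack parameter; requires Q < 2**(L - ALPHA - 1)
MOD = 1 << (2 * L)
Q_INV = pow(Q, -1, MOD)      # q^{-1} mod 2^{2l} (needs Python 3.8+)


def signed_mod_2l(x: int) -> int:
    """Reduce x modulo 2^{2l} into the signed range [-2^{2l-1}, 2^{2l-1})."""
    x &= MOD - 1
    return x - MOD if x >= MOD >> 1 else x


def plantard_mul(a: int, b: int) -> int:
    """Return r with r = -a*b*2^{-2l} (mod Q), for |a*b| <= (Q * 2**ALPHA)**2."""
    p = signed_mod_2l(a * b * Q_INV)
    return (((p >> L) + (1 << ALPHA)) * Q) >> L


def to_plantard(c: int) -> int:
    """Premultiply a constant by -2^{2l} mod Q so plantard_mul yields a*c mod Q."""
    return (c * -MOD) % Q
```

With a twiddle factor stored as `to_plantard(c)`, `plantard_mul(a, to_plantard(c)) % Q` equals `(a * c) % Q`, which is how a multiplication by a constant would typically be used inside an NTT butterfly.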
CRYSTALS-Kyber, as the only public key encryption (PKE) algorithm selected by the National Institute of Standards and Technology (NIST) in the third round, is considered one of the most promising post-quantum cryptography (PQC) schemes. Lattice-based cryptography relies on hard mathematical problems on lattices to build secure encryption and decryption systems that resist attacks from quantum computers. Performance is an important bottleneck affecting the adoption of post-quantum cryptography. In this paper, we present a High-performance Implementation of Kyber (named HI-Kyber) on NVIDIA GPUs, which raises the key-exchange throughput of Kyber to the level of millions of operations per second. Firstly, we propose a lattice-based PQC implementation architecture based on kernel fusion, which avoids redundant global-memory access operations. Secondly, we optimize and implement the core operations of CRYSTALS-Kyber, including the Number Theoretic Transform (NTT), inverse NTT (INTT), pointwise multiplication, etc. For the NTT in particular, which is the computational bottleneck, three novel methods are proposed to explore extreme performance: the sliced layer merging (SLM), the sliced depth-first search (SDFS-NTT) and the entire depth-first search (EDFS-NTT), which achieve speedups of 7.5%, 28.5%, and 41.6%, respectively, compared to the native implementation. Thirdly, we conduct comprehensive performance experiments with different parallel dimensions based on the above optimizations. Finally, our key exchange performance reaches 1,664 kops/s. Specifically, on the same platform, the throughput of HI-Kyber is 3.52× that of the GPU implementation based on the same instruction set and 1.78× that of the state-of-the-art one based on AI-accelerated tensor cores.
Large-degree polynomial multiplication is an integral component of post-quantum secure lattice-based cryptographic algorithms like CRYSTALS-Kyber and Dilithium. The computational complexity of large-degree polynomial multiplication can be reduced significantly through Number Theoretic Transformation (NTT). In this paper, we aim to develop a unified and shared NTT architecture that can support polynomial multiplication for both CRYSTALS-Kyber and Dilithium. More specifically, in this paper, we have proposed three different unified architectures for NTT multiplication in CRYSTALS-Kyber and Dilithium with varying number of configurable radix-2 butterfly units. Additionally, the developed implementation is coupled with a conflict-free memory mapping scheme that allows the architecture to be fully pipelined. We have validated our implementation on Artix-7, Zynq-7000 and Zynq Ultrascale+ FPGAs. Our standalone implementations for NTT multiplication for CRYSTALS-Kyber and Dilithium perform better than the existing works, and our unified architecture shows excellent area and timing performance compared to both standalone and existing unified implementations. This architecture can potentially be used for compact and efficient implementation for CRYSTALS-Kyber and Dilithium.
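As a software reference for what such an NTT unit computes, the sketch below multiplies two Kyber polynomials in Z_q[X]/(X^256 + 1) using radix-2 butterflies. It assumes the Kyber parameters q = 3329 and root of unity 17 and follows the incomplete seven-layer transform as specified for ML-KEM; the function names are illustrative, Dilithium parameters are not covered, and nothing here reflects the memory mapping or pipelining of the proposed hardware.

```python
Q, N, ROOT = 3329, 256, 17  # 17 is a primitive 256th root of unity mod 3329


def bitrev7(i: int) -> int:
    """Reverse the 7-bit binary representation of i."""
    return int(format(i, "07b")[::-1], 2)


ZETAS = [pow(ROOT, bitrev7(i), Q) for i in range(128)]
GAMMAS = [pow(ROOT, 2 * bitrev7(i) + 1, Q) for i in range(128)]


def ntt(f):
    """Forward transform: seven layers of Cooley-Tukey butterflies."""
    f, k, length = list(f), 1, 128
    while length >= 2:
        for start in range(0, N, 2 * length):
            zeta = ZETAS[k]
            k += 1
            for j in range(start, start + length):
                t = zeta * f[j + length] % Q
                f[j + length] = (f[j] - t) % Q
                f[j] = (f[j] + t) % Q
        length //= 2
    return f


def intt(f):
    """Inverse transform: Gentleman-Sande butterflies, then scaling by 128^-1."""
    f, k, length = list(f), 127, 2
    while length <= 128:
        for start in range(0, N, 2 * length):
            zeta = ZETAS[k]
            k -= 1
            for j in range(start, start + length):
                t = f[j]
                f[j] = (t + f[j + length]) % Q
                f[j + length] = zeta * (f[j + length] - t) % Q
        length *= 2
    return [x * 3303 % Q for x in f]  # 3303 = 128^-1 mod 3329


def multiply_ntts(f, g):
    """Pointwise product of 128 degree-1 residues modulo X^2 - gamma_i."""
    h = [0] * N
    for i in range(128):
        a0, a1, b0, b1 = f[2 * i], f[2 * i + 1], g[2 * i], g[2 * i + 1]
        h[2 * i] = (a0 * b0 + a1 * b1 * GAMMAS[i]) % Q
        h[2 * i + 1] = (a0 * b1 + a1 * b0) % Q
    return h


def poly_mul(a, b):
    """Negacyclic product via NTT, O(n log n) butterflies plus 128 base cases."""
    return intt(multiply_ntts(ntt(a), ntt(b)))


def schoolbook(a, b):
    """O(n^2) reference product modulo X^256 + 1, used only for checking."""
    c = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j < N:
                c[i + j] = (c[i + j] + ai * bj) % Q
            else:
                c[i + j - N] = (c[i + j - N] - ai * bj) % Q
    return c
```

Comparing `poly_mul` against `schoolbook` on a few inputs is a quick sanity check before mapping the same butterfly schedule to hardware.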
Commercially available quantum computers are expected to reshape the world in the near future. They are expected to break conventional cryptographic security mechanisms that are deeply embedded in today’s communication. Symmetric cryptography, such as AES, will withstand quantum attacks as long as the key sizes are doubled compared to today’s key lengths. Asymmetric cryptographic schemes, e.g. RSA, however, will be broken. It is therefore necessary to change the way we ensure our privacy by adopting post-quantum cryptography (PQC) principles. In this work, we benchmark three PQC algorithms: Falcon, Dilithium, and Kyber. Moreover, we present implementations of two PQC stacks, Dilithium/Kyber and Falcon/Kyber, which use hardware accelerators for some key functions, and we evaluate their performance and resource utilization. In a classic server-client model, the computational load of the Dilithium/Kyber stack is distributed more equally between server and client. The Falcon/Kyber stack shifts the computational load towards the server, hence relieving the client of costly operations. We found that Dilithium’s advantage over Falcon is that its execution is faster while the workload is distributed equally between client and server, whereas Falcon’s advantage over Dilithium lies in its small signature sizes and its unequally distributed computational tasks. In a client-server model with a performance-limited client (i.e., Internet-of-Things (IoT) environments), Falcon could prove useful because it confines the computationally hard tasks to the server and leaves a minimal workload to the client. Furthermore, Falcon requires less bandwidth, making it a strong candidate for deep-edge or IoT applications.
The efficiency of polynomial multiplication strongly affects the performance of lattice-based post-quantum cryptosystems. In this research, we propose a high-speed hardware architecture to accelerate polynomial multiplication based on the Number Theoretic Transform (NTT) in CRYSTALS-Kyber and CRYSTALS-Dilithium. We design a Digital Signal Processing (DSP) architecture for modular multiplication in butterfly and Point-Wise Multiplication (PWM) operations. Our method reduces the critical path delay of an n-bit multiplier to that of a (2n-2)-bit adder, optimizing both area and speed. These dedicated DSPs are employed in butterfly and PWM operations, completely eliminating the pre-processing and post-processing of NTT transforms. Furthermore, we introduce a novel unified pipelined architecture for the NTT and Inverse NTT (INTT) transformations of Kyber and Dilithium, with corresponding high-speed (Radix-2) and ultra high-speed (Radix-4) versions. Lastly, we construct a complete hardware accelerator for polynomial matrix-vector multiplication in Kyber. The Field-Programmable Gate Array (FPGA) implementation results show that our designs improve execution time by 3.4×–9.6× for the NTT transforms in Dilithium and 1.36×–34.16× for Kyber polynomial multiplication, compared to previous studies reported to date. Additionally, the hardware footprint results indicate that our proposed architectures achieve a 44%–96% improvement in Area-Time-Product (ATP). The proposed architectures are efficient and well-suited for high-performance lattice-based cryptography systems.
CRYSTALS-Kyber has been recently selected by the NIST as a new public-key encryption and key-establishment algorithm to be standardized. This makes it important to assess how well CRYSTALS-Kyber implementations withstand side-channel attacks. Software implementations of CRYSTALS-Kyber have already been analyzed and the discovered vulnerabilities were patched in the subsequently released versions. In this paper, we present a profiling side-channel attack on a hardware implementation of CRYSTALS-Kyber. Since hardware implementations carry out computations in parallel, they are typically more difficult to break than their software counterparts. We demonstrate a successful message (session key) recovery attack on a Xilinx Artix-7 FPGA implementation of CRYSTALS-Kyber by deep learning-based power analysis. Our results indicate that currently available hardware implementations of CRYSTALS-Kyber need better protection against side-channel attacks.
Significant advancements have been achieved in the field of quantum computing in recent years. If a sufficiently powerful quantum computer is ever built, many of the public-key cryptosystems in use today might be compromised. Kyber is a post-quantum encryption scheme whose security rests on the hardness of lattice problems, and it was recently standardized. Despite extensive evaluation by the National Institute of Standards and Technology (NIST), recent investigations have demonstrated the effectiveness of attacks on CRYSTALS-Kyber and their applicability in non-controlled environments. We investigated CRYSTALS-Kyber’s susceptibility to side-channel attacks. In the reference implementation of Kyber512, additional functions can be compromised by using chosen ciphertexts. The use of chosen ciphertexts allows the attacks to succeed. Real-time recovery of the entire secret key is possible for all attacks.
Current improvements in quantum computing present a substantial challenge to classical cryptographic systems, which typically rely on problems that can be solved in polynomial time using quantum algorithms. Consequently, post-quantum cryptography (PQC) has emerged as a promising solution to emerging quantum-based cryptographic challenges. The greatest threat is to public-key cryptosystems, which are primarily responsible for key exchanges. In PQC, key encapsulation mechanisms (KEMs) are crucial for securing key exchange protocols, particularly in Internet communication, virtual private networks (VPNs), and secure messaging applications. CRYSTALS-Kyber and NTRU are two well-known PQC KEMs offering robust security in the quantum world. However, even when quantum computers are functional, they will not be easily accessible. IoT devices will not be able to utilize them directly, so there will still be a requirement to protect IoT devices from quantum attacks. Limited computational power, energy budgets, and memory in IoT devices, embedded systems, and smart cards restrict the use of these techniques in constrained environments. To address this issue, this study conducts a broad comparative analysis of Kyber and NTRU, with special focus on their security, performance, and implementation efficiency in IoT and other constrained environments. In addition, a case study was conducted by applying the KEMs to a low-power embedded device to analyze their performance in real-world scenarios. These results offer an important comparison for cybersecurity engineers and cryptographers who are involved in integrating post-quantum cryptography into resource-constrained devices.
CRYSTALS-Kyber (Kyber) was recently chosen as the first quantum-resistant Key Encapsulation Mechanism (KEM) scheme for standardisation, after three rounds of the PQC competition initiated by the National Institute of Standards and Technology (NIST), which began in 2016 in search of the best quantum-resistant KEMs and digital signatures. Kyber is based on the Module-Learning with Errors (M-LWE) class of lattice-based cryptography, which is known to map efficiently onto FPGAs. This work explores several architectural optimizations and proposes a high-performance and area-time (AT) product efficient hardware accelerator for Kyber. The proposed architectural optimizations include inter-module and intra-module pipelining, which are designed and balanced via FIFO-based buffering to ensure maximum parallelisation. The implementation results show that, compared to state-of-the-art designs, the proposed architecture delivers 25–51% speedups for Kyber's three different security levels on Artix-7 and Zynq UltraScale+ devices, and a 50–75% reduction in DSPs at a comparable security level. Consequently, the proposed design achieves 19–33% higher AT product efficiency.
Attacks by quantum computers are an enormous threat to conventional public-key cryptography. Hence, it is crucial to study quantum-resistant cryptosystems. After four rounds of evaluation, the National Institute of Standards and Technology (NIST) has decided to standardize CRYSTALS-Kyber as one of the public-key post-quantum cryptography (PQC) algorithms. In the hardware design of CRYSTALS-Kyber, the polynomial-related calculations are the most time-consuming. In this paper, we present a highly efficient hardware architecture for CRYSTALS-Kyber. Firstly, we propose a CRYSTALS-Kyber-oriented conflict-free memory mapping scheme with two modes. Based on this scheme, we construct, for the first time, a mixed radix-2/4 NTT/INTT algorithm that has no pre- or post-processing. By using the “lazy-last-layer” trick, the available memory bandwidth of the NTT is temporarily increased, and the average performance of the NTT is improved. Besides, the point-wise multiplication (PWM) is performed in a single memory bank by cooperating with the two modes of our memory mapping scheme. This avoids wasting memory bandwidth and thus avoids the use of large FIFOs for the sampled data. Last, we propose an efficient modular multiplier for CRYSTALS-Kyber, and we merge the divide-by-2 operations in the finite field into modular adders and subtractors to reduce resource consumption. This design, which supports all three security levels, is implemented on a Xilinx Artix-7 FPGA with 7.3k LUTs, 3.2k FFs, 2.2k Slices, 5 BRAMs, and 4 DSPs. It performs 12% better in area-time-product than other leading designs in the literature.
In this work, we propose generic and novel adaptations to the binary Plaintext-Checking (PC) oracle based side-channel attacks for Kyber KEM. These attacks operate in a chosen-ciphertext setting, and are fairly generic and easy to mount on a given target, as the attacker requires very minimal information about the target device. However, these attacks have an inherent disadvantage of requiring a few thousand traces to perform full key recovery. This is due to the fact that these attacks typically work by recovering a single bit of information about the secret key per query/trace. In this respect, we propose novel parallel PC oracle based side-channel attacks, which are capable of recovering a generic P number of bits of information about the secret key in a single query/trace. We propose novel techniques to build chosen ciphertexts so as to efficiently realize a parallel PC oracle for Kyber KEM. We also build a multi-class classifier, which is capable of realizing a practical side-channel based parallel PC oracle with a very high success rate. We experimentally validated the proposed attacks (up to P = 10) on the fastest implementation of unprotected Kyber KEM in the pqm4 library. Our experiments yielded improvements in the range of 2.89× to 7.65× in the number of queries, compared to state-of-the-art binary PC oracle attacks, while arbitrarily higher improvements are possible for a motivated attacker, given the generic nature of the proposed attacks. We further conduct a thorough study on applicability to different scenarios, based on the presence/absence of a clone device, and also partial key recovery. Finally, we also show that the proposed attacks are able to achieve the lowest number of queries for key recovery, even for implementations protected with low-cost countermeasures such as shuffling.
Our work therefore concretely demonstrates the power of PC oracle attacks on Kyber KEM, thereby stressing the need for concrete countermeasures such as masking for Kyber and other lattice-based KEMs.
This study proposes a chosen-ciphertext side-channel attack against a lattice-based key encapsulation mechanism (KEM), the third-round candidate of the National Institute of Standards and Technology (NIST) standardization project. Unlike existing attacks that target operations, such as inverse NTT and message encoding/decoding, we target Barrett reduction in the decapsulation phase of CRYSTALS-KYBER to obtain a secret key. We show that a sensitive variable-dependent leakage of Barrett reduction exposes an entire secret key. The results of experiments conducted on the ARM Cortex-M4 microcontroller accomplish a success rate of 100%. We only need six chosen ciphertexts for KYBER512 and KYBER768 and eight chosen ciphertexts for KYBER1024.
We also show that the m4 scheme of the pqm4 library, an implementation with the ARM Cortex-M4 specific optimization (typically in assembly), is vulnerable to the proposed attack. In this scheme, six, nine, and twelve chosen ciphertexts are required for KYBER512, KYBER768, and KYBER1024, respectively.
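For context on the targeted operation, the sketch below shows a signed Barrett reduction for q = 3329 on 16-bit inputs. The constants (a 26-bit shift and the multiplier 20159) are an assumption modelled on the style of the public C reference code; the assembly-optimized m4 variant mentioned above is implemented differently, and the function name is illustrative. The intermediate `t` depends directly on the input, which is the kind of sensitive variable-dependent value the attack exploits.

```python
Q = 3329
V = ((1 << 26) + Q // 2) // Q   # rounded approximation of 2^26 / Q


def barrett_reduce(a: int) -> int:
    """Map a 16-bit signed a to the centered representative of a mod Q."""
    t = (V * a + (1 << 25)) >> 26   # t = round(a / Q); >> floors like an arithmetic shift
    return a - t * Q
```

For every input in the signed 16-bit range, the result is congruent to `a` modulo q and lies between −1664 and 1664.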
NIST has recently selected CRYSTALS-Kyber as a new public key encryption and key establishment algorithm to be standardized. This makes it important to evaluate the resistance of CRYSTALS-Kyber implementations to side-channel attacks. Software implementations of CRYSTALS-Kyber have already been thoroughly analysed. The discovered vulnerabilities helped improve the subsequently released versions and promoted stronger countermeasures against side-channel attacks. In this paper, we present the first attack on a protected hardware implementation of CRYSTALS-Kyber. We demonstrate a practical message (shared key) recovery attack on the first-order masked FPGA implementation of Kyber-512 by Kamucheka et al. (2022) using power analysis based on the Hamming distance leakage model. The presented attack exploits a vulnerability located in the masked message decoding procedure which is called during the decryption step of the decapsulation. The message recovery is performed using a profiled deep learning-based method which extracts the message directly, without extracting each share explicitly. By repeating the same decapsulation process multiple times, it is possible to increase the success rate of full shared key recovery to 99%.
In this work, we present a systematic study of Side-Channel Attacks (SCA) and Fault Injection Attacks (FIA) on structured lattice-based schemes, with main focus on the Kyber Key Encapsulation Mechanism (KEM) and the Dilithium signature scheme, which are leading candidates in the NIST standardization process for Post-Quantum Cryptography (PQC). Through our study, we attempt to understand the underlying similarities and differences between the existing attacks while classifying them into different categories. Given the wide variety of reported attacks, simultaneous protection against all of them requires implementing customized countermeasures for both Kyber and Dilithium. We therefore present a range of customized countermeasures, capable of providing defenses/mitigations against existing SCA/FIA, and incorporate several SCA and FIA countermeasures within a single design of Kyber and Dilithium. Among the several countermeasures discussed in this work, we present novel countermeasures that offer simultaneous protection against several SCA- and FIA-based chosen-ciphertext attacks for Kyber KEM. We implement the presented countermeasures within two well-known public software libraries for PQC: (1) the pqm4 library for the ARM Cortex-M4-based microcontroller and (2) the liboqs library for the Raspberry Pi 3 Model B Plus based on the ARM Cortex-A53 processor. Our performance evaluation reveals that the presented custom countermeasures incur reasonable performance overheads on both of the evaluated embedded platforms. We therefore believe our work argues for the use of custom countermeasures within real-world implementations of lattice-based schemes, either in a standalone manner or as reinforcements to generic countermeasures such as masking.
CRYSTALS-Kyber has been selected by NIST as a public-key encryption and key encapsulation mechanism to be standardized. It is also included in the NSA’s suite of cryptographic algorithms recommended for national security systems. This makes it important to evaluate the resistance of CRYSTALS-Kyber implementations to side-channel attacks. The unprotected and first-order masked software implementations have already been analysed. In this paper, we present deep learning-based message recovery attacks on ω-order masked implementations of CRYSTALS-Kyber on an ARM Cortex-M4 CPU for ω ≤ 5. The main contribution is a new neural network training method called recursive learning. In the attack on an ω-order masked implementation, we start training from an artificially constructed neural network M_ω whose weights are partly copied from a model M_{ω−1} trained on the (ω − 1)-order masked implementation, and then extended to one more share. Such a method allows us to train neural networks that can recover a message bit with probability above 99% from high-order masked implementations.
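To make the notion of an ω-order masked value concrete, the sketch below splits a coefficient into ω + 1 additive shares modulo q. Arithmetic masking modulo q = 3329 and all function names are assumptions made for illustration; the attacked implementation may combine Boolean and arithmetic masking, and its message decoding step is not modelled here.

```python
import secrets

Q = 3329  # Kyber modulus


def mask(secret: int, order: int) -> list:
    """Split secret into order + 1 shares whose sum is secret mod Q."""
    shares = [secrets.randbelow(Q) for _ in range(order)]
    shares.append((secret - sum(shares)) % Q)
    return shares


def unmask(shares: list) -> int:
    """Recombine shares; a real masked implementation avoids doing this early."""
    return sum(shares) % Q


def add_masked(x: list, y: list) -> list:
    """Linear operations act share by share, never touching the secret itself."""
    return [(a + b) % Q for a, b in zip(x, y)]
```

Any ω of the ω + 1 shares are uniformly distributed on their own, which is why an attacker has to combine leakage from all shares to learn the secret.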
In this work, we present a configurable and side-channel resistant implementation of the post-quantum key-exchange algorithm CRYSTALS-Kyber. The implemented design can be configured for different performance and area requirements, leading to different trade-offs for different applications. A low-area implementation can be achieved in 5,269 LUTs and 2,422 FFs, whereas a high-performance implementation requires 7,151 LUTs and 3,730 FFs. Due to a deeply pipelined architecture, a high operating speed of more than 250 MHz could be achieved on 28nm Xilinx FPGAs. The side-channel resistance is implemented using a carefully chosen set of novel and known techniques such as Fault Detection Hashes, Instruction Randomization, FSM Protection and so on, resulting in a low overhead of less than 5% while being highly configurable. To the best of our knowledge, this work presents the first side-channel and fault attack protected configurable accelerator for CRYSTALS-Kyber. Using TVLA (test vector leakage assessment), we validate the implemented protection techniques and demonstrate that the design does not leak information even after 200 K traces. Furthermore, one of the configuration choices results in the smallest hardware implementation of CRYSTALS-Kyber known in the literature.
CRYSTALS-Kyber is the first quantum-resilient, lattice-based Public Key Encryption (PKE)/Key Encapsulation Mechanism (KEM) cryptosystem chosen for standardization by the ongoing National Institute of Standards and Technology post-quantum cryptography (NIST PQC) standardization process. This work presents a lightweight and efficient FPGA-based hardware implementation of the polynomial multiplication unit (NTT), which is the major bottleneck in the Kyber scheme. As a first step, an optimized modular multiplication architecture combining KRED and lookup table-based algorithms is presented, which reduces slice usage by 16.7%. It is used in a pipelined NTT/INTT architecture that is completely BRAM-free and instead uses 3 FIFOs for coefficient storage. We hereby present the most compact FPGA-based design of an NTT architecture for Kyber to date. Experimental results benchmarked on comparable FPGA devices show that our proposed design is 36–75% better than the state-of-the-art implementations in terms of hardware efficiency for NTT/INTT calculations and 3.4–4.4× better for the Point-wise Multiplication (PWM) operation.