782 resultados para FPGA, VHDL, Picoblaze, SERDES
Resumo:
The inclusion of the Discrete Wavelet Transform in the JPEG-2000 standard has added impetus to the research of hardware architectures for the two-dimensional wavelet transform. In this paper, a VLSI architecture for performing the symmetrically extended two-dimensional transform is presented. This architecture conforms to the JPEG-2000 standard and is capable of near-optimal performance when dealing with the image boundaries. The architecture also achieves efficient processor utilization. Implementation results based on a Xilinx Virtex-2 FPGA device are included.
Resumo:
Security devices are vulnerable to Differential Power Analysis (DPA) that reveals the key by monitoring the power consumption of the circuits. In this paper, we present the first DPA attack against an FPGA implementation of the Camellia encryption algorithm with all key sizes and evaluate the DPA resistance of the algorithm. The Camellia cryptographic algorithm involves several different key-dependent intermediate operations including S-Box operations. In previous research, it was believed that the Camellia is stronger than AES due to the additional Whitening phase protecting the S-Box operation. However, we propose an attack that bypasses the Whitening phase and targets the S-Box. In this paper, we also discuss a lowcost countermeasure strategy to protect the Pre-whitening / Post-whitening and FL function of Camellia using Dual-rail Precharged Logic and to protect against attacks of the S-Box using Random Delay Insertion. © 2009 IEEE.
Resumo:
A new, single and unified Montgomery modular inverse algorithm, which performs both classical and Montgomery modular inversion, is proposed. This reduces the number of Montgomery multiplication operations required by 33% when compared with previous algorithms reported in the literature. The use of this in practice has been investigated by implementation of the improved unified algorithm and the previous algorithms on FPGA devices. The unified algorithm implementation shows a significant speed-up and a reduction in silicon area usage.
Resumo:
A methodology for the production of silicon cores for wavelet packet decomposition has been developed. The scheme utilizes efficient scalable architectures for both orthonormal and biorthogonal wavelet transforms. The cores produced from these architectures can be readily scaled for any wavelet function and are easily configurable for any subband structure. The cores are fully parameterized in terms of wavelet choice and appropriate wordlengths. Designs produced are portable across a range of silicon foundries as well as FPGA and PLD technologies. A number of exemplar implementations have been produced.
Resumo:
As ubiquitous computing becomes a reality, sensitive information is increasingly processed and transmitted by smart cards, mobile devices and various types of embedded systems. This has led to the requirement of a new class of lightweight cryptographic algorithm to ensure security in these resource constrained environments. The International Organization for Standardization (ISO) has recently standardised two low-cost block ciphers for this purpose, Clefia and Present. In this paper we provide the first comprehensive hardware architecture comparison between these ciphers, as well as a comparison with the current National Institute of Standards and Technology (NIST) standard, the Advanced Encryption Standard.
Resumo:
With the over-provisioned routing resource on FPGA, the topology choice for NoC implementation on FPGA is more flexible than on ASIC. However, it is well understood that the global wire routing impacts the performance of NoC on FPGA because the topology is routed by using fixed routing fabric. An important question that arises is: will the benefit of diameter reduction by using a highly connective topology outweigh the impact of global routing? To answer this question, we investigate FPGA based packet switched NoC implementations with different sizes and topologies, and quantitatively measure the impact of global routing to each of these networks. The result shows that with sufficient routing resources on modern FPGA, the global routing is not on the critical path of the system, and thus is not a dominating factor for the performance of practical multi-hop NoC system. © 2011 IEEE.
Resumo:
Hardware designers and engineers typically need to explore a multi-parametric design space in order to find the best configuration for their designs using simulations that can take weeks to months to complete. For example, designers of special purpose chips need to explore parameters such as the optimal bitwidth and data representation. This is the case for the development of complex algorithms such as Low-Density Parity-Check (LDPC) decoders used in modern communication systems. Currently, high-performance computing offers a wide set of acceleration options, that range from multicore CPUs to graphics processing units (GPUs) and FPGAs. Depending on the simulation requirements, the ideal architecture to use can vary. In this paper we propose a new design flow based on OpenCL, a unified multiplatform programming model, which accelerates LDPC decoding simulations, thereby significantly reducing architectural exploration and design time. OpenCL-based parallel kernels are used without modifications or code tuning on multicore CPUs, GPUs and FPGAs. We use SOpenCL (Silicon to OpenCL), a tool that automatically converts OpenCL kernels to RTL for mapping the simulations into FPGAs. To the best of our knowledge, this is the first time that a single, unmodified OpenCL code is used to target those three different platforms. We show that, depending on the design parameters to be explored in the simulation, on the dimension and phase of the design, the GPU or the FPGA may suit different purposes more conveniently, providing different acceleration factors. For example, although simulations can typically execute more than 3x faster on FPGAs than on GPUs, the overhead of circuit synthesis often outweighs the benefits of FPGA-accelerated execution.
Resumo:
A fully homomorphic encryption (FHE) scheme is envisioned as a key cryptographic tool in building a secure and reliable cloud computing environment, as it allows arbitrary evaluation of a ciphertext without revealing the plaintext. However, existing FHE implementations remain impractical due to very high time and resource costs. To the authors’ knowledge, this paper presents the first hardware implementation of a full encryption primitive for FHE over the integers using FPGA technology. A large-integer multiplier architecture utilising Integer-FFT multiplication is proposed, and a large-integer Barrett modular reduction module is designed incorporating the proposed multiplier. The encryption primitive used in the integer-based FHE scheme is designed employing the proposed multiplier and modular reduction modules. The designs are verified using the Xilinx Virtex-7 FPGA platform. Experimental results show that a speed improvement factor of up to 44 is achievable for the hardware implementation of the FHE encryption scheme when compared to its corresponding software implementation. Moreover, performance analysis shows further speed improvements of the integer-based FHE encryption primitives may still be possible, for example through further optimisations or by targeting an ASIC platform.
Resumo:
The overall aim of the work presented in this paper has been to develop Montgomery modular multiplication architectures suitable for implementation on modern reconfigurable hardware. Accordingly, novel high-radix systolic array Montgomery multiplier designs are presented, as we believe that the inherent regular structure and absence of global interconnect associated with these, make them well-suited for implementation on modern FPGAs. Unlike previous approaches, each processing element (PE) comprises both an adder and a multiplier. The inclusion of a multiplier in the PE means that the need to pre-compute or store any multiples of the operands is avoided. This also allows very high-radix implementations to be realised, further reducing the amount of clock cycles per modular multiplication, while still maintaining a competitive critical delay. For demonstrative purposes, 512-bit and 1024-bit FPGA implementations using radices of 2(8) and 2(16) are presented. The subsequent throughput rates are the fastest reported to date.
Resumo:
The design and VLSI implementation of two key components of the class-IV partial response maximum likelihood channel (PR-IV) the adaptive filter and the Viterbi decoder are described. These blocks are implemented using parameterised VHDL modules, from a library of common digital signal processing (DSP) and arithmetic functions. Design studies, based on 0.6 micron 3.3V standard cell processes, indicate that worst case sampling rates of 49 mega-samples per second are achievable for this system, with proportionally high sampling rates for full custom designs and smaller dimension processes. Significant increases in the sampling rate, from 49 MHz to approximately 180 MHz, can be achieved by operating four filter modules in parallel, and this implementation has 50% lower power consumption than a pipelined filter operating at the same speed.
Resumo:
This paper presents a hardware solution for network flow processing at full line rate. Advanced memory architecture using DDR3 SDRAMs is proposed to cope with the flow match limitations in packet throughput, number of supported flows and number of packet header fields (or tuples) supported for flow identifications. The described architecture has been prototyped for accommodating 8 million flows, and tested on an FPGA platform achieving a minimum of 70 million lookups per second. This is sufficient to process internet traffic flows at 40 Gigabit Ethernet.
Resumo:
Fully Homomorphic Encryption (FHE) is a recently developed cryptographic technique which allows computations on encrypted data. There are many interesting applications for this encryption method, especially within cloud computing. However, the computational complexity is such that it is not yet practical for real-time applications. This work proposes optimised hardware architectures of the encryption step of an integer-based FHE scheme with the aim of improving its practicality. A low-area design and a high-speed parallel design are proposed and implemented on a Xilinx Virtex-7 FPGA, targeting the available DSP slices, which offer high-speed multiplication and accumulation. Both use the Comba multiplication scheduling method to manage the large multiplications required with uneven sized multiplicands and to minimise the number of read and write operations to RAM. Results show that speed up factors of 3.6 and 10.4 can be achieved for the encryption step with medium-sized security parameters for the low-area and parallel designs respectively, compared to the benchmark software implementation on an Intel Core2 Duo E8400 platform running at 3 GHz.
Resumo:
The design cycle for complex special-purpose computing systems is extremely costly and time-consuming. It involves a multiparametric design space exploration for optimization, followed by design verification. Designers of special purpose VLSI implementations often need to explore parameters, such as optimal bitwidth and data representation, through time-consuming Monte Carlo simulations. A prominent example of this simulation-based exploration process is the design of decoders for error correcting systems, such as the Low-Density Parity-Check (LDPC) codes adopted by modern communication standards, which involves thousands of Monte Carlo runs for each design point. Currently, high-performance computing offers a wide set of acceleration options that range from multicore CPUs to Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs). The exploitation of diverse target architectures is typically associated with developing multiple code versions, often using distinct programming paradigms. In this context, we evaluate the concept of retargeting a single OpenCL program to multiple platforms, thereby significantly reducing design time. A single OpenCL-based parallel kernel is used without modifications or code tuning on multicore CPUs, GPUs, and FPGAs. We use SOpenCL (Silicon to OpenCL), a tool that automatically converts OpenCL kernels to RTL in order to introduce FPGAs as a potential platform to efficiently execute simulations coded in OpenCL. We use LDPC decoding simulations as a case study. Experimental results were obtained by testing a variety of regular and irregular LDPC codes that range from short/medium (e.g., 8,000 bit) to long length (e.g., 64,800 bit) DVB-S2 codes. We observe that, depending on the design parameters to be simulated, on the dimension and phase of the design, the GPU or FPGA may suit different purposes more conveniently, thus providing different acceleration factors over conventional multicore CPUs.
Resumo:
A Physical Unclonable Function (PUF) can be used to provide authentication of devices by producing die-unique responses. In PUFs based on ring oscillators (ROs) the responses are derived from the oscillation frequencies of the ROs. However, RO PUFs can be vulnerable to attack due to the frequency distribution characteristics of the RO arrays. In this letter, in order to improve the design of RO PUFs for FPGA devices, the frequencies of RO arrays implemented on a large number of FPGA chips are statistically analyzed. Three RO frequency distribution (ROFD) characteristics, which can be used to improve the design of RO PUFs are observed and discussed.