718 resultados para Fpga
Resumo:
Massively parallel networks of highly efficient, high performance Single Instruction Multiple Data (SIMD) processors have been shown to enable FPGA-based implementation of real-time signal processing applications with performance and
cost comparable to dedicated hardware architectures. This is achieved by exploiting simple datapath units with deep processing pipelines. However, these architectures are highly susceptible to pipeline bubbles resulting from data and control hazards; the only way to mitigate against these is manual interleaving of
application tasks on each datapath, since no suitable automated interleaving approach exists. In this paper we describe a new automated integrated mapping/scheduling approach to map algorithm tasks to processors and a new low-complexity list scheduling technique to generate the interleaved schedules. When applied to a spatial Fixed-Complexity Sphere Decoding (FSD) detector
for next-generation Multiple-Input Multiple-Output (MIMO) systems, the resulting schedules achieve real-time performance for IEEE 802.11n systems on a network of 16-way SIMD processors on FPGA, enable better performance/complexity balance than current approaches and produce results comparable to handcrafted implementations.
Resumo:
This paper investigates the implementation of a number of circuits used to perform a high speed closest value match lookup. The design is targeted particularly for use in a search trie, as used in various networking lookup applications, but can be applied to many other areas where such a match is required. A range of different designs have been considered and implemented on FPGA. A detailed description of the architectures investigated is followed by an analysis of the synthesis results. © 2006 IEEE.
Resumo:
A methodology which allows a non-specialist to rapidly design silicon wavelet transform cores has been developed. This methodology is based on a generic architecture utilizing time-interleaved coefficients for the wavelet transform filters. The architecture is scaleable and it has been parameterized in terms of wavelet family, wavelet type, data word length and coefficient word length. The control circuit is designed in such a way that the cores can also be cascaded without any interface glue logic for any desired level of decomposition. This parameterization allows the use of any orthonormal wavelet family thereby extending the design space for improved transformation from algorithm to silicon. Case studies for stand alone and cascaded silicon cores for single and multi-stage analysis respectively are reported. The typical design time to produce silicon layout of a wavelet based system has been reduced by an order of magnitude. The cores are comparable in area and performance to hand-crafted designs. The designs have been captured in VHDL so they are portable across a range of foundries and are also applicable to FPGA and PLD implementations.
Resumo:
The design of a System-on-a-Chip (SoC) demonstrator for a baseline JPEG encoder core is presented. This combines a highly optimized Discrete Cosine Transform (DCT) and quantization unit with an entropy coder which has been realized using off-the-shelf synthesizable IP cores (Run-length coder, Huffman coder and data packer). When synthesized in a 0.35 µm CMOS process, the core can operate at speeds up to 100 MHz and contains 50 k gates plus 11.5 kbits of RAM. This is approximately 20% less than similar JPEG encoder designs reported in literature. When targeted at FPGA the core can operate up to 30 MHz and is capable of compressing 9-bit full-frame color input data at NTSC or PAL rates.
Resumo:
A novel hardware architecture for elliptic curve cryptography (ECC) over GF(p) is introduced. This can perform the main prime field arithmetic functions needed in these cryptosystems including modular inversion and multiplication. This is based on a new unified modular inversion algorithm that offers considerable improvement over previous ECC techniques that use Fermat's Little Theorem for this operation. The processor described uses a full-word multiplier which requires much fewer clock cycles than previous methods, while still maintaining a competitive critical path delay. The benefits of the approach have been demonstrated by utilizing these techniques to create a field-programmable gate array (FPGA) design. This can perform a 256-bit prime field scalar point multiplication in 3.86 ms, the fastest FPGA time reported to date. The ECC architecture described can also perform four different types of modular inversion, making it suitable for use in many different ECC applications. © 2006 IEEE.
Resumo:
The inclusion of the Discrete Wavelet Transform in the JPEG-2000 standard has added impetus to the research of hardware architectures for the two-dimensional wavelet transform. In this paper, a VLSI architecture for performing the symmetrically extended two-dimensional transform is presented. This architecture conforms to the JPEG-2000 standard and is capable of near-optimal performance when dealing with the image boundaries. The architecture also achieves efficient processor utilization. Implementation results based on a Xilinx Virtex-2 FPGA device are included.
Resumo:
Security devices are vulnerable to Differential Power Analysis (DPA) that reveals the key by monitoring the power consumption of the circuits. In this paper, we present the first DPA attack against an FPGA implementation of the Camellia encryption algorithm with all key sizes and evaluate the DPA resistance of the algorithm. The Camellia cryptographic algorithm involves several different key-dependent intermediate operations including S-Box operations. In previous research, it was believed that the Camellia is stronger than AES due to the additional Whitening phase protecting the S-Box operation. However, we propose an attack that bypasses the Whitening phase and targets the S-Box. In this paper, we also discuss a lowcost countermeasure strategy to protect the Pre-whitening / Post-whitening and FL function of Camellia using Dual-rail Precharged Logic and to protect against attacks of the S-Box using Random Delay Insertion. © 2009 IEEE.
Resumo:
A methodology has been developed which allows a non-specialist to rapidly design silicon wavelet transform cores for a variety of specifications. The cores include both forward and inverse orthonormal wavelet transforms. This methodology is based on efficient, modular and scaleable architectures utilising time-interleaved coefficients for the wavelet transform filters. The cores are parameterized in terms of wavelet type and data and coefficient word lengths. The designs have been captured in VHDL and are hence portable across a range of silicon foundries as well as FPGA and PLD implementations.
Resumo:
A new, single and unified Montgomery modular inverse algorithm, which performs both classical and Montgomery modular inversion, is proposed. This reduces the number of Montgomery multiplication operations required by 33% when compared with previous algorithms reported in the literature. The use of this in practice has been investigated by implementation of the improved unified algorithm and the previous algorithms on FPGA devices. The unified algorithm implementation shows a significant speed-up and a reduction in silicon area usage.
Resumo:
A rapid design methodology for biorthogonal wavelet transform cores has been developed. This methodology is based on a generic, scaleable architecture for the wavelet filters. The architecture offers efficient hardware utilization by combining the linear phase property of biorthogonal filters with decimation in a MAC based implementation. The design has been captured in VHDL and parameterized in terms of wavelet type, data word length and coefficient word length. The control circuit is embedded within the cores and allows them to be cascaded without any interface glue logic for any desired level of decomposition. The design time to produce silicon layout of a biorthogonal wavelet based system is typically less than a day. The resulting silicon cores produced are comparable in area and performance to hand-crafted designs. The designs are portable across a range of foundries and are also applicable to FPGA and PLD implementations.
Resumo:
A methodology for the production of silicon cores for wavelet packet decomposition has been developed. The scheme utilizes efficient scalable architectures for both orthonormal and biorthogonal wavelet transforms. The cores produced from these architectures can be readily scaled for any wavelet function and are easily configurable for any subband structure. The cores are fully parameterized in terms of wavelet choice and appropriate wordlengths. Designs produced are portable across a range of silicon foundries as well as FPGA and PLD technologies. A number of exemplar implementations have been produced.
Resumo:
As ubiquitous computing becomes a reality, sensitive information is increasingly processed and transmitted by smart cards, mobile devices and various types of embedded systems. This has led to the requirement of a new class of lightweight cryptographic algorithm to ensure security in these resource constrained environments. The International Organization for Standardization (ISO) has recently standardised two low-cost block ciphers for this purpose, Clefia and Present. In this paper we provide the first comprehensive hardware architecture comparison between these ciphers, as well as a comparison with the current National Institute of Standards and Technology (NIST) standard, the Advanced Encryption Standard.
Resumo:
With the over-provisioned routing resource on FPGA, the topology choice for NoC implementation on FPGA is more flexible than on ASIC. However, it is well understood that the global wire routing impacts the performance of NoC on FPGA because the topology is routed by using fixed routing fabric. An important question that arises is: will the benefit of diameter reduction by using a highly connective topology outweigh the impact of global routing? To answer this question, we investigate FPGA based packet switched NoC implementations with different sizes and topologies, and quantitatively measure the impact of global routing to each of these networks. The result shows that with sufficient routing resources on modern FPGA, the global routing is not on the critical path of the system, and thus is not a dominating factor for the performance of practical multi-hop NoC system. © 2011 IEEE.
Resumo:
Hardware designers and engineers typically need to explore a multi-parametric design space in order to find the best configuration for their designs using simulations that can take weeks to months to complete. For example, designers of special purpose chips need to explore parameters such as the optimal bitwidth and data representation. This is the case for the development of complex algorithms such as Low-Density Parity-Check (LDPC) decoders used in modern communication systems. Currently, high-performance computing offers a wide set of acceleration options, that range from multicore CPUs to graphics processing units (GPUs) and FPGAs. Depending on the simulation requirements, the ideal architecture to use can vary. In this paper we propose a new design flow based on OpenCL, a unified multiplatform programming model, which accelerates LDPC decoding simulations, thereby significantly reducing architectural exploration and design time. OpenCL-based parallel kernels are used without modifications or code tuning on multicore CPUs, GPUs and FPGAs. We use SOpenCL (Silicon to OpenCL), a tool that automatically converts OpenCL kernels to RTL for mapping the simulations into FPGAs. To the best of our knowledge, this is the first time that a single, unmodified OpenCL code is used to target those three different platforms. We show that, depending on the design parameters to be explored in the simulation, on the dimension and phase of the design, the GPU or the FPGA may suit different purposes more conveniently, providing different acceleration factors. For example, although simulations can typically execute more than 3x faster on FPGAs than on GPUs, the overhead of circuit synthesis often outweighs the benefits of FPGA-accelerated execution.
Resumo:
A fully homomorphic encryption (FHE) scheme is envisioned as a key cryptographic tool in building a secure and reliable cloud computing environment, as it allows arbitrary evaluation of a ciphertext without revealing the plaintext. However, existing FHE implementations remain impractical due to very high time and resource costs. To the authors’ knowledge, this paper presents the first hardware implementation of a full encryption primitive for FHE over the integers using FPGA technology. A large-integer multiplier architecture utilising Integer-FFT multiplication is proposed, and a large-integer Barrett modular reduction module is designed incorporating the proposed multiplier. The encryption primitive used in the integer-based FHE scheme is designed employing the proposed multiplier and modular reduction modules. The designs are verified using the Xilinx Virtex-7 FPGA platform. Experimental results show that a speed improvement factor of up to 44 is achievable for the hardware implementation of the FHE encryption scheme when compared to its corresponding software implementation. Moreover, performance analysis shows further speed improvements of the integer-based FHE encryption primitives may still be possible, for example through further optimisations or by targeting an ASIC platform.