138 resultados para FPGA, Elettronica digitale, Sintesi logica
Resumo:
Realising memory intensive applications such as image and video processing on FPGA requires creation of complex, multi-level memory hierarchies to achieve real-time performance; however commerical High Level Synthesis tools are unable to automatically derive such structures and hence are unable to meet the demanding bandwidth and capacity constraints of these applications. Current approaches to solving this problem can only derive either single-level memory structures or very deep, highly inefficient hierarchies, leading in either case to one or more of high implementation cost and low performance. This paper presents an enhancement to an existing MC-HLS synthesis approach which solves this problem; it exploits and eliminates data duplication at multiple levels levels of the generated hierarchy, leading to a reduction in the number of levels and ultimately higher performance, lower cost implementations. When applied to synthesis of C-based Motion Estimation, Matrix Multiplication and Sobel Edge Detection applications, this enables reductions in Block RAM and Look Up Table (LUT) cost of up to 25%, whilst simultaneously increasing throughput.
Resumo:
The increasing design complexity associated with modern Field Programmable Gate Array (FPGA) has prompted the emergence of 'soft'-programmable processors which attempt to replace at least part of the custom circuit design problem with a problem of programming parallel processors. Despite substantial advances in this technology, its performance and resource efficiency for computationally complex operations remains in doubt. In this paper we present the first recorded implementation of a softcore Fast-Fourier Transform (FFT) on Xilinx Virtex FPGA technology. By employing a streaming processing architecture, we show how it is possible to achieve architectures which offer 1.1 GSamples/s throughput and up to 19 times speed-up against the Xilinx Radix-2 FFT dedicated circuit with comparable cost.
Resumo:
Current data-intensive image processing applications push traditional embedded architectures to their limits. FPGA based hardware acceleration is a potential solution but the programmability gap and time consuming HDL design flow is significant. The proposed research approach to develop “FPGA based programmable hardware acceleration platform” that uses, large number of Streaming Image processing Processors (SIPPro) potentially addresses these issues. SIPPro is pipelined in-order soft-core processor architecture with specific optimisations for image processing applications. Each SIPPro core uses 1 DSP48, 2 Block RAMs and 370 slice-registers, making the processor as compact as possible whilst maintaining flexibility and programmability. It is area efficient, scalable and high performance softcore architecture capable of delivering 530 MIPS per core using Xilinx Zynq SoC (ZC7Z020-3). To evaluate the feasibility of the proposed architecture, a Traffic Sign Recognition (TSR) algorithm has been prototyped on a Zedboard with the color and morphology operations accelerated using multiple SIPPros. Simulation and experimental results demonstrate that the processing platform is able to achieve a speedup of 15 and 33 times for color filtering and morphology operations respectively, with a significant reduced design effort and time.
Resumo:
Homomorphic encryption offers potential for secure cloud computing. However due to the complexity of homomorphic encryption schemes, performance of implemented schemes to date have been unpractical. This work investigates the use of hardware, specifically Field Programmable Gate Array (FPGA) technology, for implementing the building blocks involved in somewhat and fully homomorphic encryption schemes in order to assess the practicality of such schemes. We concentrate on the selection of a suitable multiplication algorithm and hardware architecture for large integer multiplication, one of the main bottlenecks in many homomorphic encryption schemes. We focus on the encryption step of an integer-based fully homomorphic encryption (FHE) scheme. We target the DSP48E1 slices available on Xilinx Virtex 7 FPGAs to ascertain whether the large integer multiplier within the encryption step of a FHE scheme could fit on a single FPGA device. We find that, for toy size parameters for the FHE encryption step, the large integer multiplier fits comfortably within the DSP48E1 slices, greatly improving the practicality of the encryption step compared to a software implementation. As multiplication is an important operation in other FHE schemes, a hardware implementation using this multiplier could also be used to improve performance of these schemes.
Resumo:
The increasing scale of Multiple-Input Multiple- Output (MIMO) topologies employed in forthcoming wireless communications standards presents a substantial implementation challenge to designers of embedded baseband signal processing architectures for MIMO transceivers. Specifically the increased scale of such systems has a substantial impact on the perfor- mance/cost balance of detection algorithms for these systems. Whilst in small-scale systems Sphere Decoding (SD) algorithms offer the best quasi-ML performance/cost balance, in larger systems heuristic detectors, such Tabu-Search (TS) detectors are superior. This paper addresses a dearth of research in architectures for TS-based MIMO detection, presenting the first known realisations of TS detectors for 4 × 4 and 10 × 10 MIMO systems. To the best of the authors’ knowledge, these are the largest single-chip detectors on record.
Resumo:
Field-programmable gate arrays are ideal hosts to custom accelerators for signal, image, and data processing but de- mand manual register transfer level design if high performance and low cost are desired. High-level synthesis reduces this design burden but requires manual design of complex on-chip and off-chip memory architectures, a major limitation in applications such as video processing. This paper presents an approach to resolve this shortcoming. A constructive process is described that can derive such accelerators, including on- and off-chip memory storage from a C description such that a user-defined throughput constraint is met. By employing a novel statement-oriented approach, dataflow intermediate models are derived and used to support simple ap- proaches for on-/off-chip buffer partitioning, derivation of custom on-chip memory hierarchies and architecture transformation to ensure user-defined throughput constraints are met with minimum cost. When applied to accelerators for full search motion estima- tion, matrix multiplication, Sobel edge detection, and fast Fourier transform, it is shown how real-time performance up to an order of magnitude in advance of existing commercial HLS tools is enabled whilst including all requisite memory infrastructure. Further, op- timizations are presented that reduce the on-chip buffer capacity and physical resource cost by up to 96% and 75%, respectively, whilst maintaining real-time performance.
Resumo:
With security and surveillance, there is an increasing need to process image data efficiently and effectively either at source or in a large data network. Whilst a Field-Programmable Gate Array (FPGA) has been seen as a key technology for enabling this, the design process has been viewed as problematic in terms of the time and effort needed for implementation and verification. The work here proposes a different approach of using optimized FPGA-based soft-core processors which allows the user to exploit the task and data level parallelism to achieve the quality of dedicated FPGA implementations whilst reducing design time. The paper also reports some preliminary
progress on the design flow to program the structure. An implementation for a Histogram of Gradients algorithm is also reported which shows that a performance of 328 fps can be achieved with this design approach, whilst avoiding the long design time, verification and debugging steps associated with conventional FPGA implementations.
Resumo:
A hardware performance analysis of the SHACAL-2 encryption algorithm is presented in this paper. SHACAL-2 was one of four symmetric key algorithms chosen in the New European Schemes for Signatures, Integrity and Encryption (NESSIE) initiative in 2003. The paper describes a fully pipelined encryption SHACAL-2 architecture implemented on a Xilinx Field Programmable Gate Array (FPGA) device that achieves a throughput of over 25 Gbps. This is the fastest private key encryption algorithm architecture currently available. The SHACAL-2 decryption algorithm is also defined in the paper as it was not provided in the NESSIE submission.
Resumo:
A generic architecture for implementing the advanced encryption standard (AES) encryption algorithm in silicon is proposed. This allows the instantiation of a wide range of chip specifications, with these taking the form of semiconductor intellectual property (IP) cores. Cores implemented from this architecture can perform both encryption and decryption and support four modes of operation: (i) electronic codebook mode; (ii) output feedback mode; (iii) cipher block chaining mode; and (iv) ciphertext feedback mode. Chip designs can also be generated to cover all three AES key lengths, namely 128 bits, 192 bits and 256 bits. On-the-fly generation of the round keys required during decryption is also possible. The general, flexible and multi-functional nature of the approach described contrasts with previous designs which, to date, have been focused on specific implementations. The presented ideas are demonstrated by implementation in FPGA technology. However, the architecture and IP cores derived from this are easily migratable to other silicon technologies including ASIC and PLD and are capable of covering a wide range of modem communication systems cryptographic requirements. Moreover, the designs produced have a gate count and throughput comparable with or better than the previous one-off solutions.
Resumo:
A novel tag computation circuit for a credit based Self-Clocked Fair Queuing (SCFQ) Scheduler is presented. The scheduler combines Weighted Fair Queuing (WFQ) with a credit based bandwidth reallocation scheme. The proposed architecture is able to reallocate bandwidth on the fly if particular links suffer from channel quality degradation .The hardware architecture is parallel and pipelined enabling an aggregated throughput rate of 180 million tag computations per second. The throughput performance is ideal for Broadband Wireless Access applications, allowing room for relatively complex computations in QoS aware adaptive scheduling. The high-level system break-down is described and synthesis results for Altera Stratix II FPGA technology are presented.
Resumo:
An area-efficient high-throughput architecture based on distributed arithmetic is proposed for 3D discrete wavelet transform (DWT). The 3D DWT processor was designed in VHDL and mapped to a Xilinx Virtex-E FPGA. The processor runs up to 85 MHz, which can process the five-level DWT analysis of a 128 x 128 x 128 fMRI volume image in 20 ms.
Resumo:
A methodology for rapid silicon design of biorthogonal wavelet transform systems has been developed. This is based on generic, scalable architectures for the forward and inverse wavelet filters. These architectures offer efficient hardware utilisation by combining the linear phase property of biorthogonal filters with decimation and interpolation. The resulting designs have been parameterised in terms of types of wavelet and wordlengths for data and coefficients. Control circuitry is embedded within these cores that allows them to be cascaded for any desired level of decomposition without any interface logic. The time to produce silicon designs for a biorthogonal wavelet system is only the time required to run synthesis and layout tools with no further design effort required. The resulting silicon cores produced are comparable in area and performance to hand-crafted designs. These designs are also portable across a range of foundries and are suitable for FPGA and PLD implementations.
Resumo:
In DSP applications such as fixed transforms and filtering, the full flexibility of a general-purpose multiplier is not required and only a limited range of values is needed on one of the multiplier inputs. A new design technique has been developed for deriving multipliers that operate on a limited range of multiplicands. This can be used to produce FPGA implementations of DSP systems where area is dramatically improved. The paper describes the technique and its application to the design of a poly-phase filter on a Virtex FPGA. A 62% area reduction and 7% speed increase is gained when compared to an equivalent design using general purpose multipliers. It is also compared favourably to other known fixed coefficient approaches.
Resumo:
This paper presents a matrix inversion architecture based on the novel Modified Squared Givens Rotations (MSGR) algorithm, which extends the original SGR method to complex valued data, and also corrects erroneous results in the original SGR method when zeros occur on the diagonal of the matrix either initially or during processing. The MSGR algorithm also avoids complex dividers in the matrix inversion, thus minimising the complexity of potential real-time implementations. A systolic array architecture is implemented and FPGA synthesis results indicate a high-throughput low-latency complex matrix inversion solution. © 2008 IEEE.
Resumo:
SoC systems are now being increasingly constructed using a hierarchy of subsystems or silicon Intellectual Property (IP) cores. The key challenge is to use these cores in a highly efficient manner which can be difficult as the internal core structure may not be known. A design methodology based on synthesizing hierarchical circuit descriptions is presented. The paper employs the MARS synthesis scheduling algorithm within the existing IRIS synthesis flow and details how it can be enhanced to allow for design exploration of IP cores. It is shown that by accessing parameterised expressions for the datapath latencies in the cores, highly efficient FPGA solutions can be achieved. Hardware sharing at both the hierarchical and flattened levels is explored for a normalized lattice filter and results are presented.