116 resultados para FPGA (Field programmable gate arrays)
Resumo:
Dynamic power consumption is very dependent on interconnect, so clever mapping of digital signal processing algorithms to parallelised realisations with data locality is vital. This is a particular problem for fast algorithm implementations where typically, designers will have sacrificed circuit structure for efficiency in software implementation. This study outlines an approach for reducing the dynamic power consumption of a class of fast algorithms by minimising the index space separation; this allows the generation of field programmable gate array (FPGA) implementations with reduced power consumption. It is shown how a 50% reduction in relative index space separation results in a measured power gain of 36 and 37% over a Cooley-Tukey Fast Fourier Transform (FFT)-based solution for both actual power measurements for a Xilinx Virtex-II FPGA implementation and circuit measurements for a Xilinx Virtex-5 implementation. The authors show the generality of the approach by applying it to a number of other fast algorithms namely the discrete cosine, the discrete Hartley and the Walsh-Hadamard transforms.
Resumo:
A queue manager (QM) is a core traffic management (TM) function used to provide per-flow queuing in access andmetro networks; however current designs have limited scalability. An on-demand QM (OD-QM) which is part of a new modular field-programmable gate-array (FPGA)-based TM is presented that dynamically maps active flows to the available physical resources; its scalability is derived from exploiting the observation that there are only a few hundred active flows in a high speed network. Simulations with real traffic show that it is a scalable, cost-effective approach that enhances per-flow queuing performance, thereby allowing per-flow QM without the need for extra external memory at speeds up to 10 Gbps. It utilizes 2.3%–16.3% of a Xilinx XC5VSX50t FPGA and works at 111 MHz.
Resumo:
This paper presents single-chip FPGA Rijndael algorithm implementations of the Advanced Encryption Standard (AES) algorithm, Rijndael. In particular, the designs utilise look-up tables to implement the entire Rijndael Round function. A comparison is provided between these designs and similar existing implementations. Hardware implementations of encryption algorithms prove much faster than equivalent software implementations and since there is a need to perform encryption on data in real time, speed is very important. In particular, Field Programmable Gate Arrays (FPGAs) are well suited to encryption implementations due to their flexibility and an architecture, which can be exploited to accommodate typical encryption transformations. In this paper, a Look-Up Table (LUT) methodology is introduced where complex and slow operations are replaced by simple LUTs. A LUT-based fully pipelined Rijndael implementation is described which has a pre-placement performance of 12 Gbits/sec, which is a factor 1.2 times faster than an alternative design in which look-up tables are utilised to implement only one of the Round function transformations, and 6 times faster than other previous single-chip implementations. Iterative Rijndael implementations based on the Look-Up-Table design approach are also discussed and prove faster than typical iterative implementations.
Resumo:
With security and surveillance, there is an increasing need to be able to process image data efficiently and effectively either at source or in a large data networks. Whilst Field Programmable Gate Arrays have been seen as a key technology for enabling this, they typically use high level and/or hardware description language synthesis approaches; this provides a major disadvantage in terms of the time needed to design or program them and to verify correct operation; it considerably reduces the programmability capability of any technique based on this technology. The work here proposes a different approach of using optimised soft-core processors which can be programmed in software. In particular, the paper proposes a design tool chain for programming such processors that uses the CAL Actor Language as a starting point for describing an image processing algorithm and targets its implementation to these custom designed, soft-core processors on FPGA. The main purpose is to exploit the task and data parallelism in order to achieve the same parallelism as a previous HDL implementation but avoiding the design time, verification and debugging steps associated with such approaches.
Resumo:
Physically Unclonable Functions (PUFs), exploit inherent manufacturing variations and present a promising solution for hardware security. They can be used for key storage, authentication and ID generations. Low power cryptographic design is also very important for security applications. However, research to date on digital PUF designs, such as Arbiter PUFs and RO PUFs, is not very efficient. These PUF designs are difficult to implement on Field Programmable Gate Arrays (FPGAs) or consume many FPGA hardware resources. In previous work, a new and efficient PUF identification generator was presented for FPGA. The PUF identification generator is designed to fit in a single slice per response bit by using a 1-bit PUF identification generator cell formed as a hard-macro. In this work, we propose an ultra-compact PUF identification generator design. It is implemented on ten low-cost Xilinx Spartan-6 FPGA LX9 microboards. The resource utilization is only 2.23%, which, to the best of the authors' knowledge, is the most compact and robust FPGA-based PUF identification generator design reported to date. This PUF identification generator delivers a stable range of uniqueness of around 50% and good reliability between 85% and 100%.
Resumo:
Field-programmable gate arrays are ideal hosts to custom accelerators for signal, image, and data processing but de- mand manual register transfer level design if high performance and low cost are desired. High-level synthesis reduces this design burden but requires manual design of complex on-chip and off-chip memory architectures, a major limitation in applications such as video processing. This paper presents an approach to resolve this shortcoming. A constructive process is described that can derive such accelerators, including on- and off-chip memory storage from a C description such that a user-defined throughput constraint is met. By employing a novel statement-oriented approach, dataflow intermediate models are derived and used to support simple ap- proaches for on-/off-chip buffer partitioning, derivation of custom on-chip memory hierarchies and architecture transformation to ensure user-defined throughput constraints are met with minimum cost. When applied to accelerators for full search motion estima- tion, matrix multiplication, Sobel edge detection, and fast Fourier transform, it is shown how real-time performance up to an order of magnitude in advance of existing commercial HLS tools is enabled whilst including all requisite memory infrastructure. Further, op- timizations are presented that reduce the on-chip buffer capacity and physical resource cost by up to 96% and 75%, respectively, whilst maintaining real-time performance.
Resumo:
In this paper, we present a methodology for implementing a complete Digital Signal Processing (DSP) system onto a heterogeneous network including Field Programmable Gate Arrays (FPGAs) automatically. The methodology aims to allow design refinement and real time verification at the system level. The DSP application is constructed in the form of a Data Flow Graph (DFG) which provides an entry point to the methodology. The netlist for parts that are mapped onto the FPGA(s) together with the corresponding software and hardware Application Protocol Interface (API) are also generated. Using a set of case studies, we demonstrate that the design and development time can be significantly reduced using the methodology developed.
Resumo:
The design cycle for complex special-purpose computing systems is extremely costly and time-consuming. It involves a multiparametric design space exploration for optimization, followed by design verification. Designers of special purpose VLSI implementations often need to explore parameters, such as optimal bitwidth and data representation, through time-consuming Monte Carlo simulations. A prominent example of this simulation-based exploration process is the design of decoders for error correcting systems, such as the Low-Density Parity-Check (LDPC) codes adopted by modern communication standards, which involves thousands of Monte Carlo runs for each design point. Currently, high-performance computing offers a wide set of acceleration options that range from multicore CPUs to Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs). The exploitation of diverse target architectures is typically associated with developing multiple code versions, often using distinct programming paradigms. In this context, we evaluate the concept of retargeting a single OpenCL program to multiple platforms, thereby significantly reducing design time. A single OpenCL-based parallel kernel is used without modifications or code tuning on multicore CPUs, GPUs, and FPGAs. We use SOpenCL (Silicon to OpenCL), a tool that automatically converts OpenCL kernels to RTL in order to introduce FPGAs as a potential platform to efficiently execute simulations coded in OpenCL. We use LDPC decoding simulations as a case study. Experimental results were obtained by testing a variety of regular and irregular LDPC codes that range from short/medium (e.g., 8,000 bit) to long length (e.g., 64,800 bit) DVB-S2 codes. We observe that, depending on the design parameters to be simulated, on the dimension and phase of the design, the GPU or FPGA may suit different purposes more conveniently, thus providing different acceleration factors over conventional multicore CPUs.
Resumo:
Cloud computing technology has rapidly evolved over the last decade, offering an alternative way to store and work with large amounts of data. However data security remains an important issue particularly when using a public cloud service provider. The recent area of homomorphic cryptography allows computation on encrypted data, which would allow users to ensure data privacy on the cloud and increase the potential market for cloud computing. A significant amount of research on homomorphic cryptography appeared in the literature over the last few years; yet the performance of existing implementations of encryption schemes remains unsuitable for real time applications. One way this limitation is being addressed is through the use of graphics processing units (GPUs) and field programmable gate arrays (FPGAs) for implementations of homomorphic encryption schemes. This review presents the current state of the art in this promising new area of research and highlights the interesting remaining open problems.
Resumo:
High-speed field-programmable gate array (FPGA) implementations of an adaptive least mean square (LMS) filter with application in an electronic support measures (ESM) digital receiver, are presented. They employ "fine-grained" pipelining, i.e., pipelining within the processor and result in an increased output latency when used in the LMS recursive system. Therefore, the major challenge is to maintain a low latency output whilst increasing the pipeline stage in the filter for higher speeds. Using the delayed LMS (DLMS) algorithm, fine-grained pipelined FPGA implementations using both the direct form (DF) and the transposed form (TF) are considered and compared. It is shown that the direct form LMS filter utilizes the FPGA resources more efficiently thereby allowing a 120 MHz sampling rate.
Resumo:
Sphere Decoding (SD) is a highly effective detection technique for Multiple-Input Multiple-Output (MIMO) wireless communications receivers, offering quasi-optimal accuracy with relatively low computational complexity as compared to the ideal ML detector. Despite this, the computational demands of even low-complexity SD variants, such as Fixed Complexity SD (FSD), remains such that implementation on modern software-defined network equipment is a highly challenging process, and indeed real-time solutions for MIMO systems such as 4 4 16-QAM 802.11n are unreported. This paper overcomes this barrier. By exploiting large-scale networks of fine-grained softwareprogrammable processors on Field Programmable Gate Array (FPGA), a series of unique SD implementations are presented, culminating in the only single-chip, real-time quasi-optimal SD for 44 16-QAM 802.11n MIMO. Furthermore, it demonstrates that the high performance software-defined architectures which enable these implementations exhibit cost comparable to dedicated circuit architectures.
Resumo:
The paper presents IPPro which is a high performance, scalable soft-core processor targeted for image processing applications. It has been based on the Xilinx DSP48E1 architecture using the ZYNQ Field Programmable Gate Array and is a scalar 16-bit RISC processor that operates at 526MHz, giving 526MIPS of performance. Each IPPro core uses 1 DSP48, 1 Block RAM and 330 Kintex-7 slice-registers, thus making the processor as compact as possible whilst maintaining flexibility and programmability. A key aspect of the approach is in reducing the application design time and implementation effort by using multiple IPPro processors in a SIMD mode. For different applications, this allows us to exploit different levels of parallelism and mapping for the specified processing architecture with the supported instruction set. In this context, a Traffic Sign Recognition (TSR) algorithm has been prototyped on a Zedboard with the colour and morphology operations accelerated using multiple IPPros. Simulation and experimental results demonstrate that the processing platform is able to achieve a speedup of 15 to 33 times for colour filtering and morphology operations respectively, with a reduced design effort and time.
Resumo:
In this paper, a new field-programmable gate array (FPGA) identification generator circuit is introduced based on physically unclonable function (PUF) technology. The new identification generator is able to convert flip-flop delay path variations to unique n-bit digital identifiers (IDs), while requiring only a single slice per ID bit by using 1-bit ID cells formed as hard-macros. An exemplary 128-bit identification generator is implemented on ten Xilinx Spartan-6 FPGA devices. Experimental results show an uniqueness of 48.52%, and reliability of 92.41% over a 25°C to 70°C temperature range and 10% fluctuation in supply voltage