138 resultados para FPGA, Elettronica digitale, Sintesi logica
Resumo:
For modern FPGA, implementation of memory intensive processing applications such as high end image and video processing systems necessitates manual design of complex multilevel memory hierarchies incorporating off-chip DDR and onchip BRAM and LUT RAM. In fact, automated synthesis of multi-level memory hierarchies is an open problem facing high level synthesis technologies for FPGA devices. In this paper we describe the first automated solution to this problem.
By exploiting a novel dataflow application modelling dialect, known as Valved Dataflow, we show for the first time how, not only can such architectures be automatically derived, but also that the resulting implementations support real-time processing for current image processing application standards such as H.264. We demonstrate the viability of this approach by reporting the performance and cost of hierarchies automatically generated for Motion Estimation, Matrix Multiplication and Sobel Edge Detection applications on Virtex-5 FPGA.
Resumo:
Realising high performance image and signal processing
applications on modern FPGA presents a challenging implementation problem due to the large data frames streaming through these systems. Specifically, to meet the high bandwidth and data storage demands of these applications, complex hierarchical memory architectures must be manually specified
at the Register Transfer Level (RTL). Automated approaches which convert high-level operation descriptions, for instance in the form of C programs, to an FPGA architecture, are unable to automatically realise such architectures. This paper
presents a solution to this problem. It presents a compiler to automatically derive such memory architectures from a C program. By transforming the input C program to a unique dataflow modelling dialect, known as Valved Dataflow (VDF), a mapping and synthesis approach developed for this dialect can
be exploited to automatically create high performance image and video processing architectures. Memory intensive C kernels for Motion Estimation (CIF Frames at 30 fps), Matrix Multiplication (128x128 @ 500 iter/sec) and Sobel Edge Detection (720p @ 30 fps), which are unrealisable by current state-of-the-art C-based synthesis tools, are automatically derived from a C description of the algorithm.
Resumo:
This paper presents single-chip FPGA Rijndael algorithm implementations of the Advanced Encryption Standard (AES) algorithm, Rijndael. In particular, the designs utilise look-up tables to implement the entire Rijndael Round function. A comparison is provided between these designs and similar existing implementations. Hardware implementations of encryption algorithms prove much faster than equivalent software implementations and since there is a need to perform encryption on data in real time, speed is very important. In particular, Field Programmable Gate Arrays (FPGAs) are well suited to encryption implementations due to their flexibility and an architecture, which can be exploited to accommodate typical encryption transformations. In this paper, a Look-Up Table (LUT) methodology is introduced where complex and slow operations are replaced by simple LUTs. A LUT-based fully pipelined Rijndael implementation is described which has a pre-placement performance of 12 Gbits/sec, which is a factor 1.2 times faster than an alternative design in which look-up tables are utilised to implement only one of the Round function transformations, and 6 times faster than other previous single-chip implementations. Iterative Rijndael implementations based on the Look-Up-Table design approach are also discussed and prove faster than typical iterative implementations.
Resumo:
A generic, parameterisable key scheduling core is presented, which can be utilised in pipelinable private-key encryption algorithms. The data encryption standard (DES) algorithm, which lends itself readily to pipelining, is utilised to exemplify this novel key scheduling method and the broader applicability of the method to other encryption algorithms is illustrated. The DES design is implemented on Xilinx Virtex FPGA technology. Utilising the novel method, a 16-stage pipelined DES design is achieved, which can run at an encryption rate of 3.87 Gbit/s. This result is among the fastest hardware implementations and is a factor 28 times faster than software implementations.
Resumo:
A novel Networks-on-Chip (NoC) router architecture specified for FPGA based implementation with configurable Virtual-Channel (VC) is presented. Each pipeline stage of the proposed architecture has been optimized so that low packet propagation latency and reduced hardware overhead can be achieved. The proposed architecture enables high performance and cost effective VC NoC based on-chip system interconnects to be deployed on FPGA.
Resumo:
A novel cost-effective and low-latency wormhole router for packet-switched NoC designs, tailored for FPGA, is presented. This has been designed to be scalable at system level to fully exploit the characteristics and constraints of FPGA based systems, rather than custom ASIC technology. A key feature is that it achieves a low packet propagation latency of only two cycles per hop including both router pipeline delay and link traversal delay - a significant enhancement over existing FPGA designs - whilst being very competitive in terms of performance and hardware complexity. It can also be configured in various network topologies including 1-D, 2-D, and 3-D. Detailed design-space exploration has been carried for a range of scaling parameters, with the results of various design trade-offs being presented and discussed. By taking advantage of abundant buildin reconfigurable logic and routing resources, we have been able to create a new scalable on-chip FPGA based router that exhibits high dimensionality and connectivity. The architecture proposed can be easily migrated across many FPGA families to provide flexible, robust and cost-effective NoC solutions suitable for the implementation of high-performance FPGA computing systems. © 2011 IEEE.
Resumo:
Performance evaluation of parallel software and architectural exploration of innovative hardware support face a common challenge with emerging manycore platforms: they are limited by the slow running time and the low accuracy of software simulators. Manycore FPGA prototypes are difficult to build, but they offer great rewards. Software running on such prototypes runs orders of magnitude faster than current simulators. Moreover, researchers gain significant architectural insight during the modeling process. We use the Formic FPGA prototyping board [1], which specifically targets scalable and cost-efficient multi-board prototyping, to build and test a 64-board model of a 512-core, MicroBlaze-based, non-coherent hardware prototype with a full network-on-chip in a 3D-mesh topology. We expand the hardware architecture to include the ARM Versatile Express platforms and build a 520-core heterogeneous prototype of 8 Cortex-A9 cores and 512 MicroBlaze cores. We then develop an MPI library for the prototype and evaluate it extensively using several bare-metal and MPI benchmarks. We find that our processor prototype is highly scalable, models faithfully single-chip multicore architectures, and is a very efficient platform for parallel programming research, being 50,000 times faster than software simulation.
Resumo:
The paper presents IPPro which is a high performance, scalable soft-core processor targeted for image processing applications. It has been based on the Xilinx DSP48E1 architecture using the ZYNQ Field Programmable Gate Array and is a scalar 16-bit RISC processor that operates at 526MHz, giving 526MIPS of performance. Each IPPro core uses 1 DSP48, 1 Block RAM and 330 Kintex-7 slice-registers, thus making the processor as compact as possible whilst maintaining flexibility and programmability. A key aspect of the approach is in reducing the application design time and implementation effort by using multiple IPPro processors in a SIMD mode. For different applications, this allows us to exploit different levels of parallelism and mapping for the specified processing architecture with the supported instruction set. In this context, a Traffic Sign Recognition (TSR) algorithm has been prototyped on a Zedboard with the colour and morphology operations accelerated using multiple IPPros. Simulation and experimental results demonstrate that the processing platform is able to achieve a speedup of 15 to 33 times for colour filtering and morphology operations respectively, with a reduced design effort and time.
Resumo:
Low-power processors and accelerators that were originally designed for the embedded systems market are emerging as building blocks for servers. Power capping has been actively explored as a technique to reduce the energy footprint of high-performance processors. The opportunities and limitations of power capping on the new low-power processor and accelerator ecosystem are less understood. This paper presents an efficient power capping and management infrastructure for heterogeneous SoCs based on hybrid ARM/FPGA designs. The infrastructure coordinates dynamic voltage and frequency scaling with task allocation on a customised Linux system for the Xilinx Zynq SoC. We present a compiler-assisted power model to guide voltage and frequency scaling, in conjunction with workload allocation between the ARM cores and the FPGA, under given power caps. The model achieves less than 5% estimation bias to mean power consumption. In an FFT case study, the proposed power capping schemes achieve on average 97.5% of the performance of the optimal execution and match the optimal execution in 87.5% of the cases, while always meeting power constraints.
Resumo:
With security and surveillance, there is an increasing need to be able to process image data efficiently and effectively either at source or in a large data networks. Whilst Field Programmable Gate Arrays have been seen as a key technology for enabling this, they typically use high level and/or hardware description language synthesis approaches; this provides a major disadvantage in terms of the time needed to design or program them and to verify correct operation; it considerably reduces the programmability capability of any technique based on this technology. The work here proposes a different approach of using optimised soft-core processors which can be programmed in software. In particular, the paper proposes a design tool chain for programming such processors that uses the CAL Actor Language as a starting point for describing an image processing algorithm and targets its implementation to these custom designed, soft-core processors on FPGA. The main purpose is to exploit the task and data parallelism in order to achieve the same parallelism as a previous HDL implementation but avoiding the design time, verification and debugging steps associated with such approaches.
Resumo:
In this paper, a new field-programmable gate array (FPGA) identification generator circuit is introduced based on physically unclonable function (PUF) technology. The new identification generator is able to convert flip-flop delay path variations to unique n-bit digital identifiers (IDs), while requiring only a single slice per ID bit by using 1-bit ID cells formed as hard-macros. An exemplary 128-bit identification generator is implemented on ten Xilinx Spartan-6 FPGA devices. Experimental results show an uniqueness of 48.52%, and reliability of 92.41% over a 25°C to 70°C temperature range and 10% fluctuation in supply voltage
Resumo:
The Field Programmable Gate Array (FPGA) implementation of the commonly used Histogram of Oriented Gradients (HOG) algorithm is explored. The HOG algorithm is employed to extract features for object detection. A key focus has been to explore the use of a new FPGA-based processor which has been targeted at image processing. The paper gives details of the mapping and scheduling factors that influence the performance and the stages that were undertaken to allow the algorithm to be deployed on FPGA hardware, whilst taking into account the specific IPPro architecture features. We show that multi-core IPPro performance can exceed that of against state-of-the-art FPGA designs by up to 3.2 times with reduced design and implementation effort and increased flexibility all on a low cost, Zynq programmable system.
Resumo:
Physically Unclonable Functions (PUFs), exploit inherent manufacturing variations and present a promising solution for hardware security. They can be used for key storage, authentication and ID generations. Low power cryptographic design is also very important for security applications. However, research to date on digital PUF designs, such as Arbiter PUFs and RO PUFs, is not very efficient. These PUF designs are difficult to implement on Field Programmable Gate Arrays (FPGAs) or consume many FPGA hardware resources. In previous work, a new and efficient PUF identification generator was presented for FPGA. The PUF identification generator is designed to fit in a single slice per response bit by using a 1-bit PUF identification generator cell formed as a hard-macro. In this work, we propose an ultra-compact PUF identification generator design. It is implemented on ten low-cost Xilinx Spartan-6 FPGA LX9 microboards. The resource utilization is only 2.23%, which, to the best of the authors' knowledge, is the most compact and robust FPGA-based PUF identification generator design reported to date. This PUF identification generator delivers a stable range of uniqueness of around 50% and good reliability between 85% and 100%.
Resumo:
Field programmable gate array devices boast abundant resources with which custom accelerator components for signal, image and data processing may be realised; however, realising high performance, low cost accelerators currently demands manual register transfer level design. Software-programmable ’soft’ processors have been proposed as a way to reduce this design burden but they are unable to support performance and cost comparable to custom circuits. This paper proposes a new soft processing approach for FPGA which promises to overcome this barrier. A high performance, fine-grained streaming processor, known as a Streaming Accelerator Element, is proposed which realises accelerators as large scale custom multicore networks. By adopting a streaming execution approach with advanced program control and memory addressing capabilities, typical program inefficiencies can be almost completely eliminated to enable performance and cost which are unprecedented amongst software-programmable solutions. When used to realise accelerators for fast fourier transform, motion estimation, matrix multiplication and sobel edge detection it is shown how the proposed architecture enables real-time performance and with performance and cost comparable with hand-crafted custom circuit accelerators and up to two orders of magnitude beyond existing soft processors.
Resumo:
There is demand for an easily programmable, high performance image processing platform based on FPGAs. In previous work, a novel, high performance processor - IPPro was developed and a Histogram of Orientated Gradients (HOG) algorithm study undertaken on a Xilinx Zynq platform. Here, we identify and explore a number of mapping strategies to improve processing efficiency for soft-cores and a number of options for creation of a division coprocessor. This is demonstrated for the revised high definition HOG implementation on a Zynq platform, resulting in a performance of 328 fps which represents a 146% speed improvement over the original realization and a tenfold reduction in energy.