941 resultados para Field-Programmable Gate Array (FPGA)
Resumo:
This paper develops cycle-level FPGA circuits of an organization for a fast path-based neural branch predictor Our results suggest that practical sizes of prediction tables are limited to around 32 KB to 64 KB in current FPGA technology due mainly to FPGA area of logic resources to maintain the tables. However the predictor scales well in terms of prediction speed. Table sizes alone should not be used as the only metric for hardware budget when comparing neural-based predictor to predictors of totally different organizations. This paper also gives early evidence to shift the attention on to the recovery from mis-prediction latency rather than on prediction latency as the most critical factor impacting accuracy of predictions for this class of branch predictors.
Resumo:
This thesis develops high performance real-time signal processing modules for direction of arrival (DOA) estimation for localization systems. It proposes highly parallel algorithms for performing subspace decomposition and polynomial rooting, which are otherwise traditionally implemented using sequential algorithms. The proposed algorithms address the emerging need for real-time localization for a wide range of applications. As the antenna array size increases, the complexity of signal processing algorithms increases, making it increasingly difficult to satisfy the real-time constraints. This thesis addresses real-time implementation by proposing parallel algorithms, that maintain considerable improvement over traditional algorithms, especially for systems with larger number of antenna array elements. Singular value decomposition (SVD) and polynomial rooting are two computationally complex steps and act as the bottleneck to achieving real-time performance. The proposed algorithms are suitable for implementation on field programmable gated arrays (FPGAs), single instruction multiple data (SIMD) hardware or application specific integrated chips (ASICs), which offer large number of processing elements that can be exploited for parallel processing. The designs proposed in this thesis are modular, easily expandable and easy to implement. Firstly, this thesis proposes a fast converging SVD algorithm. The proposed method reduces the number of iterations it takes to converge to correct singular values, thus achieving closer to real-time performance. A general algorithm and a modular system design are provided making it easy for designers to replicate and extend the design to larger matrix sizes. Moreover, the method is highly parallel, which can be exploited in various hardware platforms mentioned earlier. A fixed point implementation of proposed SVD algorithm is presented. The FPGA design is pipelined to the maximum extent to increase the maximum achievable frequency of operation. The system was developed with the objective of achieving high throughput. Various modern cores available in FPGAs were used to maximize the performance and details of these modules are presented in detail. Finally, a parallel polynomial rooting technique based on Newton’s method applicable exclusively to root-MUSIC polynomials is proposed. Unique characteristics of root-MUSIC polynomial’s complex dynamics were exploited to derive this polynomial rooting method. The technique exhibits parallelism and converges to the desired root within fixed number of iterations, making this suitable for polynomial rooting of large degree polynomials. We believe this is the first time that complex dynamics of root-MUSIC polynomial were analyzed to propose an algorithm. In all, the thesis addresses two major bottlenecks in a direction of arrival estimation system, by providing simple, high throughput, parallel algorithms.
Resumo:
This paper proposes an automatic framework for the seamless integration of hardware accelerators, starting from an OpenMP-based application and an XML file describing the HW/SW partitioning. It extends a fully software architecture by generating and integrating the cores, along with the proper interfaces, and the code for scheduling and synchronization. Experimental results show that it is possible to validate different solutions only by varying the input code.
Resumo:
Boron nitride is a promising material for nanotechnology applications due to its two-dimensional graphene-like, insulating, and highly-resistant structure. Recently it has received a lot of attention as a substrate to grow and isolate graphene as well as for its intrinsic UV lasing response. Similar to carbon, one-dimensional boron nitride nanotubes (BNNTs) have been theoretically predicted and later synthesised. Here we use first principles simulations to unambiguously demonstrate that i) BN nanotubes inherit the highly efficient UV luminescence of hexagonal BN; ii) the application of an external perpendicular field closes the electronic gap keeping the UV lasing with lower yield; iii) defects in BNNTS are responsible for tunable light emission from the UV to the visible controlled by a transverse electric field (TEF). Our present findings pave the road towards optoelectronic applications of BN-nanotube-based devices that are simple to implement because they do not require any special doping or complex growth
Resumo:
Bloch modes can be excited in planar array due to its periodic lateral refractive index. The power coupled into each eigenmode of the array waveguides is calculated through the overlap integrals of the input field with the eigenmode fields of the coupled infinite array waveguides projected onto the x-axis. Low losses can be obtained if the transition from the array to the free propagation region is adiabatic. Due to the finite resolution of lithographic process the gap between the waveguides will stop abruptly, however, when the waveguides come into too close together. Calculation results show that losses will occur at this discontinuity, which are dependent on the ratio of the gap between the waveguides and grating pitch and on the confinement of field in the array waveguides. Tapered waveguides and low index contrast between the core and cladding layers can lower the transmitted losses.
Resumo:
In DSP applications such as fixed transforms and filtering, the full flexibility of a general-purpose multiplier is not required and only a limited range of values is needed on one of the multiplier inputs. A new design technique has been developed for deriving multipliers that operate on a limited range of multiplicands. This can be used to produce FPGA implementations of DSP systems where area is dramatically improved. The paper describes the technique and its application to the design of a poly-phase filter on a Virtex FPGA. A 62% area reduction and 7% speed increase is gained when compared to an equivalent design using general purpose multipliers. It is also compared favourably to other known fixed coefficient approaches.
Resumo:
Security devices are vulnerable to Differential Power Analysis (DPA) that reveals the key by monitoring the power consumption of the circuits. In this paper, we present the first DPA attack against an FPGA implementation of the Camellia encryption algorithm with all key sizes and evaluate the DPA resistance of the algorithm. The Camellia cryptographic algorithm involves several different key-dependent intermediate operations including S-Box operations. In previous research, it was believed that the Camellia is stronger than AES due to the additional Whitening phase protecting the S-Box operation. However, we propose an attack that bypasses the Whitening phase and targets the S-Box. In this paper, we also discuss a lowcost countermeasure strategy to protect the Pre-whitening / Post-whitening and FL function of Camellia using Dual-rail Precharged Logic and to protect against attacks of the S-Box using Random Delay Insertion. © 2009 IEEE.
Resumo:
Hardware designers and engineers typically need to explore a multi-parametric design space in order to find the best configuration for their designs using simulations that can take weeks to months to complete. For example, designers of special purpose chips need to explore parameters such as the optimal bitwidth and data representation. This is the case for the development of complex algorithms such as Low-Density Parity-Check (LDPC) decoders used in modern communication systems. Currently, high-performance computing offers a wide set of acceleration options, that range from multicore CPUs to graphics processing units (GPUs) and FPGAs. Depending on the simulation requirements, the ideal architecture to use can vary. In this paper we propose a new design flow based on OpenCL, a unified multiplatform programming model, which accelerates LDPC decoding simulations, thereby significantly reducing architectural exploration and design time. OpenCL-based parallel kernels are used without modifications or code tuning on multicore CPUs, GPUs and FPGAs. We use SOpenCL (Silicon to OpenCL), a tool that automatically converts OpenCL kernels to RTL for mapping the simulations into FPGAs. To the best of our knowledge, this is the first time that a single, unmodified OpenCL code is used to target those three different platforms. We show that, depending on the design parameters to be explored in the simulation, on the dimension and phase of the design, the GPU or the FPGA may suit different purposes more conveniently, providing different acceleration factors. For example, although simulations can typically execute more than 3x faster on FPGAs than on GPUs, the overhead of circuit synthesis often outweighs the benefits of FPGA-accelerated execution.
Resumo:
A crescente evolução dos dispositivos contendo circuitos integrados, em especial os FPGAs (Field Programmable Logic Arrays) e atualmente os System on a chip (SoCs) baseados em FPGAs, juntamente com a evolução das ferramentas, tem deixado um espaço entre o lançamento e a produção de materiais didáticos que auxiliem os engenheiros no Co- Projecto de hardware/software a partir dessas tecnologias. Com o intuito de auxiliar na redução desse intervalo temporal, o presente trabalho apresenta o desenvolvimento de documentos (tutoriais) direcionados a duas tecnologias recentes: a ferramenta de desenvolvimento de hardware/software VIVADO; e o SoC Zynq-7000, Z-7010, ambos desenvolvidos pela Xilinx. Os documentos produzidos são baseados num projeto básico totalmente implementado em lógica programável e do mesmo projeto implementado através do processador programável embarcado, para que seja possível avaliar o fluxo de projeto da ferramenta para um projeto totalmente implementado em hardware e o fluxo de projeto para o mesmo projeto implementado numa estrutura de harware/software.
Resumo:
Adaptive hardware requires some reconfiguration capabilities. FPGAs with native dynamic partial reconfiguration (DPR) support pose a dilemma for system designers: whether to use native DPR or to build a virtual reconfigurable circuit (VRC) on top of the FPGA which allows selecting alternative functions by a multiplexing scheme. This solution allows much faster reconfiguration, but with higher resource overhead. This paper discusses the advantages of both implementations for a 2D image processing matrix. Results show how higher operating frequency is obtained for the matrix using DPR. However, this is compensated in the VRC during evolution due to the comparatively negligible reconfiguration time. Regarding area, the DPR implementation consumes slightly more resources due to the reconfiguration engine, but adds further more capabilities to the system.
Resumo:
Evolvable Hardware (EH) is a technique that consists of using reconfigurable hardware devices whose configuration is controlled by an Evolutionary Algorithm (EA). Our system consists of a fully-FPGA implemented scalable EH platform, where the Reconfigurable processing Core (RC) can adaptively increase or decrease in size. Figure 1 shows the architecture of the proposed System-on-Programmable-Chip (SoPC), consisting of a MicroBlaze processor responsible of controlling the whole system operation, a Reconfiguration Engine (RE), and a Reconfigurable processing Core which is able to change its size in both height and width. This system is used to implement image filters, which are generated autonomously thanks to the evolutionary process. The system is complemented with a camera that enables the usage of the platform for real time applications.
Resumo:
As the development of a viable quantum computer nears, existing widely used public-key cryptosystems, such as RSA, will no longer be secure. Thus, significant effort is being invested into post-quantum cryptography (PQC). Lattice-based cryptography (LBC) is one such promising area of PQC, which offers versatile, efficient, and high performance security services. However, the vulnerabilities of these implementations against side-channel attacks (SCA) remain significantly understudied. Most, if not all, lattice-based cryptosystems require noise samples generated from a discrete Gaussian distribution, and a successful timing analysis attack can render the whole cryptosystem broken, making the discrete Gaussian sampler the most vulnerable module to SCA. This research proposes countermeasures against timing information leakage with FPGA-based designs of the CDT-based discrete Gaussian samplers with constant response time, targeting encryption and signature scheme parameters. The proposed designs are compared against the state-of-the-art and are shown to significantly outperform existing implementations. For encryption, the proposed sampler is 9x faster in comparison to the only other existing time-independent CDT sampler design. For signatures, the first time-independent CDT sampler in hardware is proposed.
Resumo:
The problem of determining a minimal number of control inputs for converting a programmable logic array (PLA) with undetectable faults to crosspoint-irredundant PLA for testing has been formulated as a nonstandard set covering problem. By representing subsets of sets as cubes, this problem has been reformulated as familiar problems. It is noted that this result has significance because a crosspoint-irredundant PLA can be converted to a completely testable PLA in a straightforward fashion, thus achieving very good fault coverage and easy testability.
Resumo:
In this paper we propose the architecture of a SoC fabric onto which applications described in a HLL are synthesized. The fabric is a homogeneous layout of computation, storage and communication resources on silicon. Through a process of composition of resources (as opposed to decomposition of applications), application specific computational structures are defined on the fabric at runtime to realize different modules of the applications in hardware. Applications synthesized on this fabric offers performance comparable to ASICs while retaining the programmability of processing cores. We outline the application synthesis methodology through examples, and compare our results with software implementations on traditional platforms with unbounded resources.