142 resultados para High performance processors
Resumo:
Low-power processors and accelerators that were originally designed for the embedded systems market are emerging as building blocks for servers. Power capping has been actively explored as a technique to reduce the energy footprint of high-performance processors. The opportunities and limitations of power capping on the new low-power processor and accelerator ecosystem are less understood. This paper presents an efficient power capping and management infrastructure for heterogeneous SoCs based on hybrid ARM/FPGA designs. The infrastructure coordinates dynamic voltage and frequency scaling with task allocation on a customised Linux system for the Xilinx Zynq SoC. We present a compiler-assisted power model to guide voltage and frequency scaling, in conjunction with workload allocation between the ARM cores and the FPGA, under given power caps. The model achieves less than 5% estimation bias to mean power consumption. In an FFT case study, the proposed power capping schemes achieve on average 97.5% of the performance of the optimal execution and match the optimal execution in 87.5% of the cases, while always meeting power constraints.
Resumo:
Massively parallel networks of highly efficient, high performance Single Instruction Multiple Data (SIMD) processors have been shown to enable FPGA-based implementation of real-time signal processing applications with performance and
cost comparable to dedicated hardware architectures. This is achieved by exploiting simple datapath units with deep processing pipelines. However, these architectures are highly susceptible to pipeline bubbles resulting from data and control hazards; the only way to mitigate against these is manual interleaving of
application tasks on each datapath, since no suitable automated interleaving approach exists. In this paper we describe a new automated integrated mapping/scheduling approach to map algorithm tasks to processors and a new low-complexity list scheduling technique to generate the interleaved schedules. When applied to a spatial Fixed-Complexity Sphere Decoding (FSD) detector
for next-generation Multiple-Input Multiple-Output (MIMO) systems, the resulting schedules achieve real-time performance for IEEE 802.11n systems on a network of 16-way SIMD processors on FPGA, enable better performance/complexity balance than current approaches and produce results comparable to handcrafted implementations.
Resumo:
There is demand for an easily programmable, high performance image processing platform based on FPGAs. In previous work, a novel, high performance processor - IPPro was developed and a Histogram of Orientated Gradients (HOG) algorithm study undertaken on a Xilinx Zynq platform. Here, we identify and explore a number of mapping strategies to improve processing efficiency for soft-cores and a number of options for creation of a division coprocessor. This is demonstrated for the revised high definition HOG implementation on a Zynq platform, resulting in a performance of 328 fps which represents a 146% speed improvement over the original realization and a tenfold reduction in energy.
Resumo:
Presented is a study that expands the body of knowledge on the effect of in-cycle speed fluctuations on performance of small engines. It uses the methods developed previously by Callahan, et al. (1) to examine a variety of two-stroke engines and one four-stroke engine. The two-stroke engines were: a high performance single-cylinder, a low performance single-cylinder, a high performance multi-cylinder, and a medium performance multi-cylinder. The four-stroke engine was a high performance single-cylinder unit. Each engine was modeled in Virtual Engines, which is a fully detailed one-dimensional thermodynamic engine simulator. Measured or predicted in-cycle speed data were input into the engine models. Predicted performance changes due to drivetrain effects are shown in each case, and conclusions are drawn from those results. The simulations for the high performance single-cylinder two-stroke engine predicted significant in-cycle crankshaft speed fluctuation amplitudes and significant changes in performance when the fluctuations were input into the engine model. This was validated experimentally on a firing test engine based on a Yamaha YZ250. The four-stroke engine showed significant changes in predicted performance compared to the prediction with zero speed fluctuation assumed in the model. Measured speed fluctuations from a firing Yamaha YZ400F engine were applied to the simulation in addition to data from a simple free mass model. Both methods predicted similar fluctuation profiles and changes in performance. It is shown that the gear reduction between the crankshaft and clutch allowed for this similar behavior. The multi-cylinder, high performance two-stroke engine also showed significant changes in performance, in this case depending on the firing configuration. The low output two-stroke engine simulation showed only a negligible change in performance in spite of high amplitude speed fluctuations. This was due to its flat torque versus speed characteristic. The medium performance multi-cylinder two-stroke engine also showed only a negligible change in performance, in this case due to a relatively high inertia rotating assembly and multiple cylinder firing events within the revolution. These smoothed the net torque pulsations and reduced the amplitude of the speed fluctuation itself.
Resumo:
This is the first paper to describe performance assessment of triple and double gate FinFETs for High Performance (HP), Low Operating Power (LOP) and Low Standby Power (LSTP) logic technologies is investigated. The impact of gate work-function, spacer width, lateral source/drain doping gradient, fin aspect ratio, fin thickness on device performance, has been analysed in detail and guidelines are presented to meet ITRS specification at 65 and 45 nm nodes. Optimal design of lateral source/drain doping profile can not only effectively control short channel effects, yielding low off-current, but also achieve low values of intrinsic gate delay.
Resumo:
In this paper a novel scalable public-key processor architecture is presented that supports modular exponentiation and Elliptic Curve Cryptography over both prime GF(p) and binary GF(2) extension fields. This is achieved by a high performance instruction set that provides a comprehensive range of integer and polynomial basis field arithmetic. The instruction set and associated hardware are generic in nature and do not specifically support any cryptographic algorithms or protocols. Firmware within the device is used to efficiently implement complex and data intensive arithmetic. A firmware library has been developed in order to demonstrate support for numerous exponentiation and ECC approaches, such as different coordinate systems and integer recoding methods. The processor has been developed as a high-performance asymmetric cryptography platform in the form of a scalable Verilog RTL core. Various features of the processor may be scaled, such as the pipeline width and local memory subsystem, in order to suit area, speed and power requirements. The processor is evaluated and compares favourably with previous work in terms of performance while offering an unparalleled degree of flexibility. © 2006 IEEE.
Resumo:
In this paper, by investigating the influence of source/drain extension region engineering (also known as gate-source/drain underlap) in nanoscale planar double gate (DG) SOI MOSFETs, we offer new insights into the design of future nanoscale gate-underlap DG devices to achieve ITRS projections for high performance (HP), low standby power (LSTP) and low operating power (LOP) logic technologies. The impact of high-kappa gate dielectric, silicon film thickness, together with parameters associated with the lateral source/drain doping profile, is investigated in detail. The results show that spacer width along with lateral straggle can not only effectively control short-channel effects, thus presenting low off-current in a gate underlap device, but can also be optimized to achieve lower intrinsic delay and higher on-off current ratio (I-on/I-off). Based on the investigation of on-current (I-on), off-current (I-off), I-on/I-off, intrinsic delay (tau), energy delay product and static power dissipation, we present design guidelines to select key device parameters to achieve ITRS projections. Using nominal gate lengths for different technologies, as recommended from ITRS specification, optimally designed gate-underlap DG MOSFETs with a spacer-to-straggle (s/sigma) ratio of 2.3 for HP/LOP and 3.2 for LSTP logic technologies will meet ITRS projection. However, a relatively narrow range of lateral straggle lying between 7 to 8 nm is recommended. A sensitivity analysis of intrinsic delay, on-current and off-current to important parameters allows a comparative analysis of the various design options and shows that gate workfunction appears to be the most crucial parameter in the design of DG devices for all three technologies. The impact of back gate misalignment on I-on, I-off and tau is also investigated for optimized underlap devices.
Resumo:
The development of high performance, low computational complexity detection algorithms is a key challenge for real-time Multiple-Input Multiple-Output (MIMO) communication system design. The Fixed-Complexity Sphere Decoder (FSD) algorithm is one of the most promising approaches, enabling quasi-ML decoding accuracy and high performance implementation due to its deterministic, highly parallel structure. However, it suffers from exponential growth in computational complexity as the number of MIMO transmit antennas increases, critically limiting its scalability to larger MIMO system topologies. In this paper, we present a solution to this problem by applying a novel cutting protocol to the decoding tree of a real-valued FSD algorithm. The new Real-valued Fixed-Complexity Sphere Decoder (RFSD) algorithm derived achieves similar quasi-ML decoding performance as FSD, but with an average 70% reduction in computational complexity, as we demonstrate from both theoretical and implementation perspectives for Quadrature Amplitude Modulation (QAM)-MIMO systems.
Resumo:
Multicore computational accelerators such as GPUs are now commodity components for highperformance computing at scale. While such accelerators have been studied in some detail as stand-alone computational engines, their integration in large-scale distributed systems raises new challenges and trade-offs. In this paper, we present an exploration of resource management alternatives for building asymmetric accelerator-based distributed systems. We present these alternatives in the context of a capabilities-aware framework for data-intensive computing, which uses an enhanced implementation of the MapReduce programming model for accelerator-based clusters, compared to the state of the art. The framework can transparently utilize heterogeneous accelerators for deriving high performance with low programming effort. Our work is the first to compare heterogeneous types of accelerators, GPUs and a Cell processors, in the same environment and the first to explore the trade-offs between compute-efficient and control-efficient accelerators on data-intensive systems. Our investigation shows that our framework scales well with the number of different compute nodes. Furthermore, it runs simultaneously on two different types of accelerators, successfully adapts to the resource capabilities, and performs 26.9% better on average than a static execution approach.
Resumo:
The Cell Broadband Engine (BE) Architecture is a new heterogeneous multi-core architecture targeted at compute-intensive workloads. The architecture of the Cell BE has several features that are unique in high-performance general-purpose processors, most notably the extensive support for vectorization, scratch pad memories and explicit programming of direct memory accesses (DMAs) and mailbox communication. While these features strongly increase programming complexity, it is generally claimed that significant speedups can be obtained by using Cell BE processors. This paper presents our experiences with using the Cell BE architecture to accelerate Clustal W, a bio-informatics program for multiple sequence alignment. We report on how we apply the unique features of the Cell BE to Clustal W and how important each is in obtaining high performance. By making extensive use of vectorization and by parallelizing the application across all cores, we demonstrate a speedup of 24.4 times when using 16 synergistic processor units on a QS21 Cell Blade compared to single-thread execution on the power processing unit. As the Cell BE exploits a large number of slim cores, our highly optimized implementation is just 3.8 times faster than a 3-thread version running on an Intel Core2 Duo, as the latter processor exploits a small number of fat cores.