47 resultados para High performance processors
Resumo:
Designing and optimizing high performance microprocessors is an increasingly difficult task due to the size and complexity of the processor design space, high cost of detailed simulation and several constraints that a processor design must satisfy. In this paper, we propose the use of empirical non-linear modeling techniques to assist processor architects in making design decisions and resolving complex trade-offs. We propose a procedure for building accurate non-linear models that consists of the following steps: (i) selection of a small set of representative design points spread across processor design space using latin hypercube sampling, (ii) obtaining performance measures at the selected design points using detailed simulation, (iii) building non-linear models for performance using the function approximation capabilities of radial basis function networks, and (iv) validating the models using an independently and randomly generated set of design points. We evaluate our model building procedure by constructing non-linear performance models for programs from the SPEC CPU2000 benchmark suite with a microarchitectural design space that consists of 9 key parameters. Our results show that the models, built using a relatively small number of simulations, achieve high prediction accuracy (only 2.8% error in CPI estimates on average) across a large processor design space. Our models can potentially replace detailed simulation for common tasks such as the analysis of key microarchitectural trends or searches for optimal processor design points.
Resumo:
High end network security applications demand high speed operation and large rule set support. Packet classification is the core functionality that demands high throughput in such applications. This paper proposes a packet classification architecture to meet such high throughput. We have implemented a Firewall with this architecture in reconflgurable hardware. We propose an extension to Distributed Crossproducting of Field Labels (DCFL) technique to achieve scalable and high performance architecture. The implemented Firewall takes advantage of inherent structure and redundancy of rule set by using our DCFL Extended (DCFLE) algorithm. The use of DCFLE algorithm results in both speed and area improvement when it is implemented in hardware. Although we restrict ourselves to standard 5-tuple matching, the architecture supports additional fields. High throughput classification invariably uses Ternary Content Addressable Memory (TCAM) for prefix matching, though TCAM fares poorly in terms of area and power efficiency. Use of TCAM for port range matching is expensive, as the range to prefix conversion results in large number of prefixes leading to storage inefficiency. Extended TCAM (ETCAM) is fast and the most storage efficient solution for range matching. We present for the first time a reconfigurable hardware implementation of ETCAM. We have implemented our Firewall as an embedded system on Virtex-II Pro FPGA based platform, running Linux with the packet classification in hardware. The Firewall was tested in real time with 1 Gbps Ethernet link and 128 sample rules. The packet classification hardware uses a quarter of logic resources and slightly over one third of memory resources of XC2VP30 FPGA. It achieves a maximum classification throughput of 50 million packet/s corresponding to 16 Gbps link rate for the worst case packet size. The Firewall rule update involves only memory re-initialization in software without any hardware change.
Resumo:
An artificial neural network (ANN) is presented to predict a 28-day compressive strength of a normal and high strength self compacting concrete (SCC) and high performance concrete (HPC) with high volume fly ash. The ANN is trained by the data available in literature on normal volume fly ash because data on SCC with high volume fly ash is not available in sufficient quantity. Further, while predicting the strength of HPC the same data meant for SCC has been used to train in order to economise on computational effort. The compressive strengths of SCC and HPC as well as slump flow of SCC estimated by the proposed neural network are validated by experimental results.
Resumo:
High end network security applications demand high speed operation and large rule set support. Packet classification is the core functionality that demands high throughput in such applications. This paper proposes a packet classification architecture to meet such high throughput. We have Implemented a Firewall with this architecture in reconfigurable hardware. We propose an extension to Distributed Crossproducting of Field Labels (DCFL) technique to achieve scalable and high performance architecture. The implemented Firewall takes advantage of inherent structure and redundancy of rule set by using, our DCFL Extended (DCFLE) algorithm. The use of DCFLE algorithm results In both speed and area Improvement when It is Implemented in hardware. Although we restrict ourselves to standard 5-tuple matching, the architecture supports additional fields.High throughput classification Invariably uses Ternary Content Addressable Memory (TCAM) for prefix matching, though TCAM fares poorly In terms of area and power efficiency. Use of TCAM for port range matching is expensive, as the range to prefix conversion results in large number of prefixes leading to storage inefficiency. Extended TCAM (ETCAM) is fast and the most storage efficient solution for range matching. We present for the first time a reconfigurable hardware Implementation of ETCAM. We have implemented our Firewall as an embedded system on Virtex-II Pro FPGA based platform, running Linux with the packet classification in hardware. The Firewall was tested in real time with 1 Gbps Ethernet link and 128 sample rules. The packet classification hardware uses a quarter of logic resources and slightly over one third of memory resources of XC2VP30 FPGA. It achieves a maximum classification throughput of 50 million packet/s corresponding to 16 Gbps link rate for file worst case packet size. The Firewall rule update Involves only memory re-initialiization in software without any hardware change.
Resumo:
Conducting and semiconducting polymers are important materials in the development of printed, flexible, large-area electronics such as flat-panel displays and photovoltaic cells. There has been rapid progress in developing conjugated polymers with high transport mobility required for high-performance field-effect transistors (FETs), beginning(1) with mobilities around 10(-4) cm(2) V-1 s(-1) to a recent report(2) of 1 cm(2) V-1 s(-1) for poly(2,5-bis(3-tetradecylthiophen-2-yl) thieno[3,2-b] thiophene) (PBTTT). Here, the electrical properties of PBTTT are studied at high charge densities both as the semiconductor layer in FETs and in electrochemically doped films to determine the transport mechanism. We show that data obtained using a wide range of parameters (temperature, gate-induced carrier density, source-drain voltage and doping level) scale onto the universal curve predicted for transport in the Luttinger liquid description of the one-dimensional `metal'.
Resumo:
he growth of high-performance application in computer graphics, signal processing and scientific computing is a key driver for high performance, fixed latency; pipelined floating point dividers. Solutions available in the literature use large lookup table for double precision floating point operations.In this paper, we propose a cost effective, fixed latency pipelined divider using modified Taylor-series expansion for double precision floating point operations. We reduce chip area by using a smaller lookup table. We show that the latency of the proposed divider is 49.4 times the latency of a full-adder. The proposed divider reduces chip area by about 81% than the pipelined divider in [9] which is based on modified Taylor-series.
Resumo:
Processor architects have a challenging task of evaluating a large design space consisting of several interacting parameters and optimizations. In order to assist architects in making crucial design decisions, we build linear regression models that relate Processor performance to micro-architecture parameters, using simulation based experiments. We obtain good approximate models using an iterative process in which Akaike's information criteria is used to extract a good linear model from a small set of simulations, and limited further simulation is guided by the model using D-optimal experimental designs. The iterative process is repeated until desired error bounds are achieved. We used this procedure to establish the relationship of the CPI performance response to 26 key micro-architectural parameters using a detailed cycle-by-cycle superscalar processor simulator The resulting models provide a significance ordering on all micro-architectural parameters and their interactions, and explain the performance variations of micro-architectural techniques.
Resumo:
High performance video standards use prediction techniques to achieve high picture quality at low bit rates. The type of prediction decides the bit rates and the image quality. Intra Prediction achieves high video quality with significant reduction in bit rate. This paper present an area optimized architecture for Intra prediction, for H.264 decoding at HDTV resolution with a target of achieving 60 fps. The architecture was validated on Virtex-5 FPGA based platform. The architecture achieves a frame rate of 64 fps. The architecture is based on multi-level memory hierarchy to reduce latency and ensure optimum resources utilization. It removes redundancy by reusing same functional blocks across different modes. The proposed architecture uses only 13% of the total LUTs available on the Xilinx FPGA XC5VLX50T.
Resumo:
Computational grids are increasingly being used for executing large multi-component scientific applications. The most widely reported advantages of application execution on grids are the performance benefits, in terms of speeds, problem sizes or quality of solutions, due to increased number of processors. We explore the possibility of improved performance on grids without increasing the application’s processor space. For this, we consider grids with multiple batch systems. We explore the challenges involved in and the advantages of executing long-running multi-component applications on multiple batch sites with a popular multi-component climate simulation application, CCSM, as the motivation.We have performed extensive simulation studies to estimate the single and multi-site execution rates of the applications for different system characteristics.Our experiments show that in many cases, multiple batch executions can have better execution rates than a single site execution.
Resumo:
The paper presents an adaptive Fourier filtering technique and a relaying scheme based on a combination of a digital band-pass filter along with a three-sample algorithm, for applications in high-speed numerical distance protection. To enhance the performance of above-mentioned technique, a high-speed fault detector has been used. MATLAB based simulation studies show that the adaptive Fourier filtering technique provides fast tripping for near faults and security for farther faults. The digital relaying scheme based on a combination of digital band-pass filter along with three-sample data window algorithm also provides accurate and high-speed detection of faults. The paper also proposes a high performance 16-bit fixed point DSP (Texas Instruments TMS320LF2407A) processor based hardware scheme suitable for implementation of the above techniques. To evaluate the performance of the proposed relaying scheme under steady state and transient conditions, PC based menu driven relay test procedures are developed using National Instruments LabVIEW software. The test signals are generated in real time using LabVIEW compatible analog output modules. The results obtained from the simulation studies as well as hardware implementations are also presented.
Resumo:
Abstract—A new breed of processors like the Cell Broadband Engine, the Imagine stream processor and the various GPU processors emphasize data-level parallelism (DLP) and threadlevel parallelism (TLP) as opposed to traditional instructionlevel parallelism (ILP). This allows them to achieve order-ofmagnitude improvements over conventional superscalar processors for many workloads. However, it is unclear as to how much parallelism of these types exists in current programs. Most earlier studies have largely concentrated on the amount of ILP in a program, without differentiating DLP or TLP. In this study, we investigate the extent of data-level parallelism available in programs in the MediaBench suite. By packing instructions in a SIMD fashion, we observe reductions of up to 91 % (84 % on average) in the number of dynamic instructions, indicating a very high degree of DLP in several applications. I.
Resumo:
Workstation clusters equipped with high performance interconnect having programmable network processors facilitate interesting opportunities to enhance the performance of parallel application run on them. In this paper, we propose schemes where certain application level processing in parallel database query execution is performed on the network processor. We evaluate the performance of TPC-H queries executing on a high end cluster where all tuple processing is done on the host processor, using a timed Petri net model, and find that tuple processing costs on the host processor dominate the execution time. These results are validated using a small cluster. We therefore propose 4 schemes where certain tuple processing activity is offloaded to the network processor. The first 2 schemes offload the tuple splitting activity - computation to identify the node on which to process the tuples, resulting in an execution time speedup of 1.09 relative to the base scheme, but with I/O bus becoming the bottleneck resource. In the 3rd scheme in addition to offloading tuple processing activity, the disk and network interface are combined to avoid the I/O bus bottleneck, which results in speedups up to 1.16, but with high host processor utilization. Our 4th scheme where the network processor also performs apart of join operation along with the host processor, gives a speedup of 1.47 along with balanced system resource utilizations. Further we observe that the proposed schemes perform equally well even in a scaled architecture i.e., when the number of processors is increased from 2 to 64
Resumo:
Critical applications like cyclone tracking and earthquake modeling require simultaneous high-performance simulations and online visualization for timely analysis. Faster simulations and simultaneous visualization enable scientists provide real-time guidance to decision makers. In this work, we have developed an integrated user-driven and automated steering framework that simultaneously performs numerical simulations and efficient online remote visualization of critical weather applications in resource-constrained environments. It considers application dynamics like the criticality of the application and resource dynamics like the storage space, network bandwidth and available number of processors to adapt various application and resource parameters like simulation resolution, simulation rate and the frequency of visualization. We formulate the problem of finding an optimal set of simulation parameters as a linear programming problem. This leads to 30% higher simulation rate and 25-50% lesser storage consumption than a naive greedy approach. The framework also provides the user control over various application parameters like region of interest and simulation resolution. We have also devised an adaptive algorithm to reduce the lag between the simulation and visualization times. Using experiments with different network bandwidths, we find that our adaptive algorithm is able to reduce lag as well as visualize the most representative frames.
Resumo:
Data Prefetchers identify and make use of any regularity present in the history/training stream to predict future references and prefetch them into the cache. The training information used is typically the primary misses seen at a particular cache level, which is a filtered version of the accesses seen by the cache. In this work we demonstrate that extending the training information to include secondary misses and hits along with primary misses helps improve the performance of prefetchers. In addition to empirical evaluation, we use the information theoretic metric entropy, to quantify the regularity present in extended histories. Entropy measurements indicate that extended histories are more regular than the default primary miss only training stream. Entropy measurements also help corroborate our empirical findings. With extended histories, further benefits can be achieved by triggering prefetches during secondary misses also. In this paper we explore the design space of extended prefetch histories and alternative prefetch trigger points for delta correlation prefetchers. We observe that different prefetch schemes benefit to a different extent with extended histories and alternative trigger points. Also the best performing design point varies on a per-benchmark basis. To meet these requirements, we propose a simple adaptive scheme that identifies the best performing design point for a benchmark-prefetcher combination at runtime. In SPEC2000 benchmarks, using all the L2 accesses as history for prefetcher improves the performance in terms of both IPC and misses reduced over techniques that use only primary misses as history. The adaptive scheme improves the performance of CZone prefetcher over Baseline by 4.6% on an average. These performance gains are accompanied by a moderate reduction in the memory traffic requirements.
Resumo:
Hafnium dioxide (HfO2) films, deposited using electron beam evaporation, are optimized for high performance back-gated graphene transistors. Bilayer graphene is identified on HfO2/Si substrate using optical microscope and subsequently confirmed with Raman spectroscopy. Back-gated graphene transistor, with 32 nm thick HfO2 gate dielectric, has been fabricated with very high transconductance value of 60 mu S. From the hysteresis of the current-voltage characteristics, we estimate the trap density in HfO2 to be in the mid 10(11)/cm(2) range, comparable to SiO2.