109 resultados para intel processor
Resumo:
Abstract—A new breed of processors like the Cell Broadband Engine, the Imagine stream processor and the various GPU processors emphasize data-level parallelism (DLP) and threadlevel parallelism (TLP) as opposed to traditional instructionlevel parallelism (ILP). This allows them to achieve order-ofmagnitude improvements over conventional superscalar processors for many workloads. However, it is unclear as to how much parallelism of these types exists in current programs. Most earlier studies have largely concentrated on the amount of ILP in a program, without differentiating DLP or TLP. In this study, we investigate the extent of data-level parallelism available in programs in the MediaBench suite. By packing instructions in a SIMD fashion, we observe reductions of up to 91 % (84 % on average) in the number of dynamic instructions, indicating a very high degree of DLP in several applications. I.
Resumo:
Community Climate System Model (CCSM) is a Multiple Program Multiple Data (MPMD) parallel global climate model comprising atmosphere, ocean, land, ice and coupler components. The simulations have a time-step of the order of tens of minutes and are typically performed for periods of the order of centuries. These climate simulations are highly computationally intensive and can take several days to weeks to complete on most of today’s multi-processor systems. ExecutingCCSM on grids could potentially lead to a significant reduction in simulation times due to the increase in number of processors. However, in order to obtain performance gains on grids, several challenges have to be met. In this work,we describe our load balancing efforts in CCSM to make it suitable for grid enabling.We also identify the various challenges in executing CCSM on grids. Since CCSM is an MPI application, we also describe our current work on building a MPI implementation for grids to grid-enable CCSM.
Resumo:
Large instruction windows and issue queues are key to exploiting greater instruction level parallelism in out-of-order superscalar processors. However, the cycle time and energy consumption of conventional large monolithic issue queues are high. Previous efforts to reduce cycle time segment the issue queue and pipeline wakeup. Unfortunately, this results in significant IPC loss. Other proposals which address energy efficiency issues by avoiding only the unnecessary tag-comparisons do not reduce broadcasts. These schemes also increase the issue latency.To address both these issues comprehensively, we propose the Scalable Lowpower Issue Queue (SLIQ). SLIQ augments a pipelined issue queue with direct indexing to mitigate the problem of delayed wakeups while reducing the cycle time. Also, the SLIQ design naturally leads to significant energy savings by reducing both the number of tag broadcasts and comparisons required.A 2 segment SLIQ incurs an average IPC loss of 0.2% over the entire SPEC CPU2000 suite, while achieving a 25.2% reduction in issue latency when compared to a monolithic 128-entry issue queue for an 8-wide superscalar processor. An 8 segment SLIQ improves scalability by reducing the issue latency by 38.3% while incurring an IPC loss of only 2.3%. Further, the 8 segment SLIQ significantly reduces the energy consumption and energy-delay product by 48.3% and 67.4% respectively on average.
Resumo:
Workstation clusters equipped with high performance interconnect having programmable network processors facilitate interesting opportunities to enhance the performance of parallel application run on them. In this paper, we propose schemes where certain application level processing in parallel database query execution is performed on the network processor. We evaluate the performance of TPC-H queries executing on a high end cluster where all tuple processing is done on the host processor, using a timed Petri net model, and find that tuple processing costs on the host processor dominate the execution time. These results are validated using a small cluster. We therefore propose 4 schemes where certain tuple processing activity is offloaded to the network processor. The first 2 schemes offload the tuple splitting activity - computation to identify the node on which to process the tuples, resulting in an execution time speedup of 1.09 relative to the base scheme, but with I/O bus becoming the bottleneck resource. In the 3rd scheme in addition to offloading tuple processing activity, the disk and network interface are combined to avoid the I/O bus bottleneck, which results in speedups up to 1.16, but with high host processor utilization. Our 4th scheme where the network processor also performs apart of join operation along with the host processor, gives a speedup of 1.47 along with balanced system resource utilizations. Further we observe that the proposed schemes perform equally well even in a scaled architecture i.e., when the number of processors is increased from 2 to 64
Resumo:
Designing and optimizing high performance microprocessors is an increasingly difficult task due to the size and complexity of the processor design space, high cost of detailed simulation and several constraints that a processor design must satisfy. In this paper, we propose the use of empirical non-linear modeling techniques to assist processor architects in making design decisions and resolving complex trade-offs. We propose a procedure for building accurate non-linear models that consists of the following steps: (i) selection of a small set of representative design points spread across processor design space using latin hypercube sampling, (ii) obtaining performance measures at the selected design points using detailed simulation, (iii) building non-linear models for performance using the function approximation capabilities of radial basis function networks, and (iv) validating the models using an independently and randomly generated set of design points. We evaluate our model building procedure by constructing non-linear performance models for programs from the SPEC CPU2000 benchmark suite with a microarchitectural design space that consists of 9 key parameters. Our results show that the models, built using a relatively small number of simulations, achieve high prediction accuracy (only 2.8% error in CPI estimates on average) across a large processor design space. Our models can potentially replace detailed simulation for common tasks such as the analysis of key microarchitectural trends or searches for optimal processor design points.
Resumo:
We propose the design and implementation of hardware architecture for spatial prediction based image compression scheme, which consists of prediction phase and quantization phase. In prediction phase, the hierarchical tree structure obtained from the test image is used to predict every central pixel of an image by its four neighboring pixels. The prediction scheme generates an error image, to which the wavelet/sub-band coding algorithm can be applied to obtain efficient compression. The software model is tested for its performance in terms of entropy, standard deviation. The memory and silicon area constraints play a vital role in the realization of the hardware for hand-held devices. The hardware architecture is constructed for the proposed scheme, which involves the aspects of parallelism in instructions and data. The processor consists of pipelined functional units to obtain the maximum throughput and higher speed of operation. The hardware model is analyzed for performance in terms throughput, speed and power. The results of hardware model indicate that the proposed architecture is suitable for power constrained implementations with higher data rate
Resumo:
Miniaturization of devices and the ensuing decrease in the threshold voltage has led to a substantial increase in the leakage component of the total processor energy consumption. Relatively simpler issue logic and the presence of a large number of function units in the VLIW and the clustered VLIW architectures attribute a large fraction of this leakage energy consumption in the functional units. However, functional units are not fully utilized in the VLIW architectures because of the inherent variations in the ILP of the programs. This underutilization is even more pronounced in the context of clustered VLIW architectures because of the contentions for the limited number of slow intercluster communication channels which lead to many short idle cycles.In the past, some architectural schemes have been proposed to obtain leakage energy bene .ts by aggressively exploiting the idleness of functional units. However, presence of many short idle cycles cause frequent transitions from the active mode to the sleep mode and vice-versa and adversely a ffects the energy benefits of a purely hardware based scheme. In this paper, we propose and evaluate a compiler instruction scheduling algorithm that assist such a hardware based scheme in the context of VLIW and clustered VLIW architectures. The proposed scheme exploits the scheduling slacks of instructions to orchestrate the functional unit mapping with the objective of reducing the number of transitions in functional units thereby keeping them off for a longer duration. The proposed compiler-assisted scheme obtains a further 12% reduction of energy consumption of functional units with negligible performance degradation over a hardware-only scheme for a VLIW architecture. The benefits are 15% and 17% in the context of a 2-clustered and a 4-clustered VLIW architecture respectively. Our test bed uses the Trimaran compiler infrastructure.
Resumo:
Past studies of memory interference in multiprocessor systems have generally assumed that the references of each processor are uniformly distributed among the memory modules. In this paper we develop a model with local referencing, which reflects more closely the behavior of real-life programs. This model is analyzed using Markov chain techniques and expressions are derived for the multiprocessor performance. New expressions are also obtained for the performance in the traditional uniform reference model and are compared with other expressions-available in the literature. Results of a simulation study are given to show the accuracy of the expressions for both models.
Resumo:
ASICs offer the best realization of DSP algorithms in terms of performance, but the cost is prohibitive, especially when the volumes involved are low. However, if the architecture synthesis trajectory for such algorithms is such that the target architecture can be identified as an interconnection of elementary parameterized computational structures, then it is possible to attain a close match, both in terms of performance and power with respect to an ASIC, for any algorithmic parameters of the given algorithm. Such an architecture is weakly programmable (configurable) and can be viewed as an application specific instruction-set processor (ASIP). In this work, we present a methodology to synthesize ASIPs for DSP algorithms.
Resumo:
Unending quest for performance improvement coupled with the advancements in integrated circuit technology have led to the development of new architectural paradigm. Speculative multithreaded architecture (SpMT) philosophy relies on aggressive speculative execution for improved performance. However, aggressive speculative execution comes with a mixed flavor of improving performance, when successful, and adversely affecting the energy consumption (and performance) because of useless computation in the event of mis-speculation. Dynamic instruction criticality information can be usefully applied to control and guide such an aggressive speculative execution. In this paper, we present a model of micro-execution for SpMT architecture that we have developed to determine the dynamic instruction criticality. We have also developed two novel techniques utilizing the criticality information namely delaying the non-critical loads and the criticality based thread-prediction for reducing useless computations and energy consumption. Experimental results showing break-up of critical instructions and effectiveness of proposed techniques in reducing energy consumption are presented in the context of multiscalar processor that implements SpMT architecture. Our experiments show 17.7% and 11.6% reduction in dynamic energy for criticality based thread prediction and criticality based delayed load scheme respectively while the improvement in dynamic energy delay product is 13.9% and 5.5%, respectively. (c) 2012 Published by Elsevier B.V.
Resumo:
Clustered architecture processors are preferred for embedded systems because centralized register file architectures scale poorly in terms of clock rate, chip area, and power consumption. Although clustering helps by improving the clock speed, reducing the energy consumption of the logic, and making the design simpler, it introduces extra overheads by way of inter-cluster communication. This communication happens over long global wires having high load capacitance which leads to delay in execution and significantly high energy consumption. Inter-cluster communication also introduces many short idle cycles, thereby significantly increasing the overall leakage energy consumption in the functional units. The trend towards miniaturization of devices (and associated reduction in threshold voltage) makes energy consumption in interconnects and functional units even worse, and limits the usability of clustered architectures in smaller technologies. However, technological advancements now permit the design of interconnects and functional units with varying performance and power modes. In this paper, we propose scheduling algorithms that aggregate the scheduling slack of instructions and communication slack of data values to exploit the low-power modes of functional units and interconnects. Finally, we present a synergistic combination of these algorithms that simultaneously saves energy in functional units and interconnects to improves the usability of clustered architectures by achieving better overall energy-performance trade-offs. Even with conservative estimates of the contribution of the functional units and interconnects to the overall processor energy consumption, the proposed combined scheme obtains on average 8% and 10% improvement in overall energy-delay product with 3.5% and 2% performance degradation for a 2-clustered and a 4-clustered machine, respectively. We present a detailed experimental evaluation of the proposed schemes. Our test bed uses the Trimaran compiler infrastructure. (C) 2012 Elsevier Inc. All rights reserved.
Resumo:
Software transactional memory (STM) is a promising programming paradigm for shared memory multithreaded programs. In order for STMs to be adopted widely for performance critical software, understanding and improving the cache performance of applications running on STM becomes increasingly crucial, as the performance gap between processor and memory continues to grow. In this paper, we present the most detailed experimental evaluation to date, of the cache behavior of STM applications and quantify the impact of the different STM factors on the cache misses experienced by the applications. We find that STMs are not cache friendly, with the data cache stall cycles contributing to more than 50% of the execution cycles in a majority of the benchmarks. We find that on an average, misses occurring inside the STM account for 62% of total data cache miss latency cycles experienced by the applications and the cache performance is impacted adversely due to certain inherent characteristics of the STM itself. The above observations motivate us to propose a set of specific compiler transformations targeted at making the STMs cache friendly. We find that STM's fine grained and application unaware locking is a major contributor to its poor cache behavior. Hence we propose selective Lock Data co-location (LDC) and Redundant Lock Access Removal (RLAR) to address the lock access misses. We find that even transactions that are completely disjoint access parallel, suffer from costly coherence misses caused by the centralized global time stamp updates and hence we propose the Selective Per-Partition Time Stamp (SPTS) transformation to address this. We show that our transformations are effective in improving the cache behavior of STM applications by reducing the data cache miss latency by 20.15% to 37.14% and improving execution time by 18.32% to 33.12% in five of the 8 STAMP applications.
Resumo:
Computational grids with multiple batch systems (batch grids) can be powerful infrastructures for executing long-running multi-component parallel applications. In this paper, we evaluate the potential improvements in throughput of long-running multi-component applications when the different components of the applications are executed on multiple batch systems of batch grids. We compare the multiple batch executions with executions of the components on a single batch system without increasing the number of processors used for executions. We perform our analysis with a foremost long-running multi-component application for climate modeling, the Community Climate System Model (CCSM). We have built a robust simulator that models the characteristics of both the multi-component application and the batch systems. By conducting large number of simulations with different workload characteristics and queuing policies of the systems, processor allocations to components of the application, distributions of the components to the batch systems and inter-cluster bandwidths, we show that multiple batch executions lead to 55% average increase in throughput over single batch executions for long-running CCSM. We also conducted real experiments with a practical middleware infrastructure and showed that multi-site executions lead to effective utilization of batch systems for executions of CCSM and give higher simulation throughput than single-site executions. Copyright (c) 2011 John Wiley & Sons, Ltd.
Resumo:
Introduction of processor based instruments in power systems is resulting in the rapid growth of the measured data volume. The present practice in most of the utilities is to store only some of the important data in a retrievable fashion for a limited period. Subsequently even this data is either deleted or stored in some back up devices. The investigations presented here explore the application of lossless data compression techniques for the purpose of archiving all the operational data - so that they can be put to more effective use. Four arithmetic coding methods suitably modified for handling power system steady state operational data are proposed here. The performance of the proposed methods are evaluated using actual data pertaining to the Southern Regional Grid of India. (C) 2012 Elsevier Ltd. All rights reserved.
Resumo:
High-level loop transformations are a key instrument in mapping computational kernels to effectively exploit the resources in modern processor architectures. Nevertheless, selecting required compositions of loop transformations to achieve this remains a significantly challenging task; current compilers may be off by orders of magnitude in performance compared to hand-optimized programs. To address this fundamental challenge, we first present a convex characterization of all distinct, semantics-preserving, multidimensional affine transformations. We then bring together algebraic, algorithmic, and performance analysis results to design a tractable optimization algorithm over this highly expressive space. Our framework has been implemented and validated experimentally on a representative set of benchmarks running on state-of-the-art multi-core platforms.