942 results for 291605 Processor Architectures
Abstract:
Most existing WCET estimation methods directly estimate execution time, ET, in cycles. We propose to study ET as a product of two factors, ET = IC * CPI, where IC is the instruction count and CPI is cycles per instruction. Estimating ET directly may lead to a highly pessimistic estimate, since these methods may implicitly be combining the worst-case IC with the worst-case CPI. We hypothesize that there exists a functional relationship between CPI and IC such that CPI = f(IC). This is ascertained by computing the covariance matrix and studying scatter plots of CPI versus IC. IC and CPI values are obtained by running benchmarks with a large number of inputs on the cycle-accurate architectural simulator SimpleScalar for two different architectures. It is shown that the benchmarks can be grouped into different classes based on the CPI-versus-IC relationship. For some benchmarks, such as FFT and FIR, both IC and CPI are almost constant irrespective of the input. Other benchmarks exhibit a direct or an inverse relationship between CPI and IC; in such cases, one can predict CPI for a given IC as CPI = f(IC). We derive the theoretical worst-case IC for a program, denoted SWIC, using integer linear programming (ILP) and estimate WCET as SWIC * f(SWIC). However, if CPI decreases sharply with IC, the measured maximum cycle count is observed to be a better estimate. For certain other benchmarks, the CPI-versus-IC relationship is either random or CPI remains constant as IC varies; in such cases, WCET is estimated as the product of SWIC and the measured maximum CPI. The proposed method is observed to yield tighter WCET estimates than Chronos, a static WCET analyzer, for most benchmarks on the two architectures considered in this paper.
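A minimal sketch of the estimation step, assuming an affine f fitted by least squares; the (IC, CPI) samples and the SWIC value below are hypothetical, and the paper obtains SWIC via ILP and selects the estimator per benchmark class:

```python
# Sketch: fit CPI = f(IC) from (IC, CPI) samples gathered by running a
# benchmark over many inputs, then estimate WCET = SWIC * f(SWIC).
import numpy as np

# Hypothetical (IC, CPI) pairs from a cycle-accurate simulator, one per input.
ic  = np.array([10_000, 12_500, 15_000, 17_500, 20_000], dtype=float)
cpi = np.array([1.42,   1.38,   1.35,   1.33,   1.31])

# Least-squares fit of CPI as an affine function of IC (direct/inverse trend).
slope, intercept = np.polyfit(ic, cpi, deg=1)

def f(instruction_count: float) -> float:
    """Predicted CPI for a given instruction count."""
    return slope * instruction_count + intercept

SWIC = 25_000            # hypothetical worst-case IC (derived via ILP in the paper)
wcet_cycles = SWIC * f(SWIC)
print(f"Estimated WCET: {wcet_cycles:.0f} cycles")
```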
Abstract:
Data prefetchers identify and exploit any regularity present in the history/training stream to predict future references and prefetch them into the cache. The training information used is typically the primary misses seen at a particular cache level, which is a filtered version of the accesses seen by the cache. In this work, we demonstrate that extending the training information to include secondary misses and hits along with primary misses helps improve the performance of prefetchers. In addition to empirical evaluation, we use the information-theoretic metric of entropy to quantify the regularity present in extended histories. Entropy measurements indicate that extended histories are more regular than the default primary-miss-only training stream, and they also help corroborate our empirical findings. With extended histories, further benefits can be achieved by also triggering prefetches on secondary misses. In this paper, we explore the design space of extended prefetch histories and alternative prefetch trigger points for delta-correlation prefetchers. We observe that different prefetch schemes benefit to different extents from extended histories and alternative trigger points, and that the best-performing design point varies on a per-benchmark basis. To meet these requirements, we propose a simple adaptive scheme that identifies the best-performing design point for a benchmark-prefetcher combination at runtime. On the SPEC2000 benchmarks, using all L2 accesses as the prefetcher's history improves performance, in terms of both IPC and misses reduced, over techniques that use only primary misses as history. The adaptive scheme improves the performance of the CZone prefetcher over the baseline by 4.6% on average. These performance gains are accompanied by a moderate reduction in memory traffic requirements.
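A minimal sketch of the entropy metric, assuming Shannon entropy over the stream of address deltas; the two traces below are hypothetical, with the extended history deliberately more regular than the filtered primary-miss stream:

```python
# Sketch: Shannon entropy of the delta stream (differences between successive
# addresses) for a given training history. Lower entropy means a more
# regular, and hence more prefetchable, stream.
from collections import Counter
from math import log2

def delta_entropy(addresses):
    deltas = [b - a for a, b in zip(addresses, addresses[1:])]
    counts = Counter(deltas)
    total = len(deltas)
    return -sum(c / total * log2(c / total) for c in counts.values())

primary_misses = [0x100, 0x180, 0x240, 0x300, 0x3c0]          # filtered stream
extended       = [0x100, 0x140, 0x180, 0x1c0, 0x200, 0x240,   # misses + hits
                  0x280, 0x2c0, 0x300, 0x340, 0x380, 0x3c0]

print(f"primary-miss entropy : {delta_entropy(primary_misses):.2f} bits")
print(f"extended entropy     : {delta_entropy(extended):.2f} bits")
```

Here the extended history has a single recurring delta and thus zero entropy, while the filtered stream mixes two deltas; the paper uses the same kind of measurement to corroborate its empirical findings.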
Abstract:
In this paper, based on the temporal and spatial locality characteristics of memory accesses in multicores, we propose a re-organization of the existing single large row-buffer in a DRAM bank into multiple smaller row-buffers. The proposed configuration helps improve row hit rates and also brings down the energy required for row activations. The major contribution of this work is proposing such a reorganization without requiring any significant changes to the existing, widely accepted DRAM specifications. Our proposed reorganization improves performance by 35.8%, 14.5% and 21.6% in quad-, eight- and sixteen-core workloads, along with a 42%, 28% and 31% reduction in DRAM energy, respectively. We also introduce a Need Based Allocation scheme for buffer management that yields further performance improvement.
Abstract:
Network lifetime maximization is becoming an important design goal in wireless sensor networks. Energy harvesting has recently become a preferred means of achieving this goal, as it provides near-perpetual operation. We study such a sensor node with an energy-harvesting source and compare various architectures by which the harvested energy is used. We find its Shannon capacity when it is transmitting its observations over a fading AWGN channel with perfect or no channel state information provided at the transmitter. We obtain an achievable rate when there are inefficiencies in energy storage, and the capacity when energy is spent on activities other than transmission.
Abstract:
Decoherence as an obstacle to quantum computation is viewed as a struggle between two forces [1]: the computation, which uses the exponential dimension of Hilbert space, and decoherence, which destroys this entanglement by collapse. In this model of decohered quantum computation, a sequential quantum computer loses the battle because, at each time step, only a local operation is carried out while g*(t) gates collapse. With quantum circuits computing in parallel, the situation is different: g(t) gates can be applied at each time step while g*(t) gates collapse because of decoherence. Since g(t) ≈ g*(t), the competition here is even [1]. Our paper improves on this model by slowing down g*(t): we encode the circuit in parallel computing architectures and run it in the Single Instruction Multiple Data (SIMD) paradigm. We propose a parallel ion-trap architecture for the single-bit rotation of a qubit.
Abstract:
Sensor nodes with energy-harvesting sources are gaining popularity due to their ability to improve network lifetime, and they are becoming a preferred choice in support of 'green communication'. We study such a sensor node with an energy-harvesting source and compare various architectures by which the harvested energy is used. We find its Shannon capacity when it is transmitting its observations over an AWGN channel and show that the capacity-achieving energy management policies are related to the throughput-optimal policies. We also obtain the capacity when energy-conserving sleep-wake modes are supported, and an achievable rate for the system with inefficiencies in energy storage.
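As a hedged illustration of the kind of baseline involved (the paper's exact statements may differ): with mean harvested energy per channel use denoted Ē and noise variance σ², the natural benchmark for the harvesting node is the AWGN capacity under an average power constraint of Ē.

```latex
% Illustrative baseline, assuming mean harvested energy $\bar{E}$ per channel
% use and noise variance $\sigma^2$: the harvesting node is benchmarked
% against the average-power-constrained AWGN capacity.
\[
  C \;=\; \frac{1}{2}\,\log_2\!\left(1 + \frac{\bar{E}}{\sigma^2}\right)
  \quad\text{bits per channel use.}
\]
```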
Abstract:
Video decoders used in emerging applications need to be flexible enough to handle a large variety of video formats and to deliver scalable performance under wide variations in workload. In this paper we propose a unified software and hardware architecture for video decoding that achieves scalable performance with flexibility. The lightweight processor tiles and reconfigurable hardware tiles in our architecture enable software and hardware implementations to co-exist, while a programmable interconnect enables dynamic interconnection of the tiles. Our process-network-oriented compilation flow achieves realization-agnostic application partitioning and enables seamless migration across uniprocessor, multi-processor, semi-hardware and full-hardware implementations of a video decoder. An application quality-of-service-aware scheduler monitors and controls the operation of the entire system. We prove the concept with a prototype of the architecture on an off-the-shelf FPGA. The prototype shows performance scaling from QCIF to 1080p resolution in four discrete steps. We also demonstrate that the reconfiguration time is short enough to allow migration from one configuration to another without any frame loss.
Abstract:
An in situ seeding-growth methodology for the preparation of core-shell nanoparticles composed of noble metals has been developed, employing trimethylamine borane (TMAB) as the reducing agent. Being a weak reducing agent, TMAB is able to distinguish even the smallest reduction-potential window between any two metals, which enables selective reduction of metal ions and thus affords a core-shell architecture of the nanoparticles. A dramatic solvent effect was noted during the reduction of Ag+ ions: immediate reduction took place at room temperature when dry THF was used as the solvent; with wet THF (THF used directly from the bottle), however, reduction occurred only under reflux conditions. In the case of Au and Pd nanoparticles, preparation was found to be independent of the quality of the solvent used: Au nanoparticles are obtained at room temperature, whereas reflux conditions are required for Pd nanoparticles. This difference in the behavior of the monometallic nanoparticles was successfully exploited to construct noble-metal nanoparticles with core-shell architectures such as Au@Ag, Ag@Au, and Ag@Pd. Transformation of these core-shell nanoparticles to their thermodynamically stable alloy counterparts is also demonstrated, under the mildest conditions reported to date.
Abstract:
Managing the heat produced by computer processors is an important issue today, especially as processor sizes decrease rapidly while transistor counts increase rapidly. This poster describes a preliminary study of adding carbon nanotubes (CNTs) to a standard silicon paste covering a CPU. Measurements were made in two rounds of tests to compare the rate of cool-down with and without CNTs present. The silicon paste acts as an interface between the CPU and the heat sink, increasing the rate of heat transfer away from the CPU. CNTs were added to the silicon paste at 0.05% by weight; they were not aligned. A series of K-type thermocouples was used to measure the temperature as a function of time in the vicinity of the CPU following its shut-off, with an Omega data acquisition system attached to the thermocouples. The CPU temperature was not measured directly because attaching a thermocouple would have prevented the CPU's automatic shut-off. A thermocouple in the paste containing the CNTs actually reached a higher temperature than one in the standard paste, an effect easily explained. However, the rate of cooling with the CNTs was about 4.55% better.
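A minimal sketch of how such cool-down rates can be compared, assuming the traces follow Newton's law of cooling, T(t) = T_env + (T0 - T_env)·exp(-kt); the temperature data below are hypothetical and merely tuned to produce a difference of the reported magnitude:

```python
# Sketch: fit the cooling-rate constant k to each thermocouple trace and
# compare pastes. All data are hypothetical.
import numpy as np

def cooling_rate(t, temps, t_env):
    """Least-squares estimate of k from ln(T - T_env) = ln(T0 - T_env) - k t."""
    y = np.log(np.asarray(temps) - t_env)
    slope, _ = np.polyfit(t, y, deg=1)
    return -slope

t = np.arange(0, 600, 60)                      # seconds after CPU shut-off
T_ENV = 25.0
plain = T_ENV + 40.0 * np.exp(-0.0044 * t)     # hypothetical standard paste
cnt   = T_ENV + 42.0 * np.exp(-0.0046 * t)     # hypothetical paste with CNTs
                                               # (starts hotter, cools faster)
k_plain = cooling_rate(t, plain, T_ENV)
k_cnt   = cooling_rate(t, cnt, T_ENV)
print(f"improvement: {100 * (k_cnt - k_plain) / k_plain:.2f}%")
```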
Abstract:
Single-carrier frequency division multiple access (SC-FDMA) has become a popular alternative to orthogonal frequency division multiple access (OFDMA) for multiuser communication on the uplink, mainly due to the low peak-to-average power ratio (PAPR) of SC-FDMA compared to that of OFDMA. Long-Term Evolution (LTE) uses SC-FDMA on the uplink to exploit this PAPR advantage and reduce transmit power amplifier backoff in user terminals. In this paper, we show that SC-FDMA can be beneficially used for multiuser communication on the downlink as well. We present SC-FDMA transmit and receive signaling architectures for multiuser communication on the downlink. The benefits of using SC-FDMA on the downlink are that it can achieve (i) significantly better bit error rate (BER) performance at the user terminal compared to OFDMA, and (ii) improved PAPR compared to OFDMA, which reduces base station (BS) power amplifier backoff (making BSs greener). The SC-FDMA receiver needs to perform joint equalization, which can be carried out using low-complexity equalization techniques. For this, we present a local-neighborhood-search-based equalization algorithm for SC-FDMA that is very attractive in both complexity and performance. We present simulation results that establish the PAPR and BER advantages of SC-FDMA over OFDMA in the multiuser SISO/MIMO downlink, as well as in the large-scale multiuser MISO downlink with tens to hundreds of antennas at the BS.
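A minimal sketch of the PAPR comparison, assuming QPSK symbols, localized subcarrier mapping, and illustrative block sizes; DFT-spreading the symbols before the IFFT (SC-FDMA) yields a noticeably lower PAPR than mapping them directly onto subcarriers (OFDMA):

```python
# Sketch: Monte Carlo PAPR comparison of OFDMA vs. DFT-spread SC-FDMA.
# Block sizes and the localized mapping are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
M, N, BLOCKS = 64, 256, 2000          # user symbols, total subcarriers, trials

def papr_db(x):
    p = np.abs(x) ** 2
    return 10 * np.log10(p.max() / p.mean())

ofdma, scfdma = [], []
for _ in range(BLOCKS):
    d = (rng.choice([-1, 1], M) + 1j * rng.choice([-1, 1], M)) / np.sqrt(2)
    grid_ofdma = np.zeros(N, complex)
    grid_sc    = np.zeros(N, complex)
    grid_ofdma[:M] = d                 # OFDMA: symbols straight onto subcarriers
    grid_sc[:M] = np.fft.fft(d)        # SC-FDMA: M-point DFT spreading first
    ofdma.append(papr_db(np.fft.ifft(grid_ofdma)))
    scfdma.append(papr_db(np.fft.ifft(grid_sc)))

print(f"mean PAPR, OFDMA  : {np.mean(ofdma):.1f} dB")
print(f"mean PAPR, SC-FDMA: {np.mean(scfdma):.1f} dB")
```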
Abstract:
Estimating program worst-case execution time (WCET) accurately and efficiently is a challenging task. Several programs exhibit phase behavior, wherein cycles per instruction (CPI) varies in phases during execution. Recent work has suggested using the phases of such programs to estimate WCET with minimal instrumentation. However, the suggested model uses a function of mean CPI that carries no probabilistic guarantees. We propose to use Chebyshev's inequality, which can be applied to any arbitrary distribution of CPI samples, to probabilistically bound the CPI of a phase. Applying Chebyshev's inequality to phases that exhibit high CPI variation leads to pessimistic upper bounds. We propose a mechanism that refines such phases into sub-phases, based on program counter (PC) signatures collected using profiling, and also allows the user to control the variance of CPI within a sub-phase. We describe a WCET analyzer built along these lines and evaluate it with standard WCET and embedded benchmark suites on two different architectures for three chosen probabilities, p = {0.9, 0.95, 0.99}. For p = 0.99, refinement based on PC signatures alone reduces the average pessimism of the WCET estimate by 36% (77%) on Arch1 (Arch2). Compared to Chronos, an open-source static WCET analyzer, the average improvement in estimates obtained by refinement is 5% (125%) on Arch1 (Arch2). On limiting the variance of CPI within a sub-phase to {50%, 10%, 5%, 1%} of its original value, the average accuracy of the WCET estimate improves further to {9%, 11%, 12%, 13%}, respectively, on Arch1. On Arch2, the average accuracy of WCET improves to 159% when the CPI variance is limited to 50% of its original value; improvement is marginal beyond that point.
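A minimal sketch of the bounding step, assuming the two-sided form of Chebyshev's inequality is applied conservatively; the CPI samples below are hypothetical:

```python
# Sketch: Chebyshev's inequality, P(|X - mu| >= k*sigma) <= 1/k^2, holds for
# any distribution. Choosing k = 1/sqrt(1 - p) gives an upper bound
# mu + k*sigma on CPI that holds with probability at least p.
import statistics
from math import sqrt

def cpi_upper_bound(samples, p):
    mu = statistics.mean(samples)
    sigma = statistics.pstdev(samples)
    k = 1 / sqrt(1 - p)
    return mu + k * sigma

phase_cpi = [1.21, 1.25, 1.19, 1.30, 1.22, 1.27, 1.24]   # hypothetical samples
for p in (0.90, 0.95, 0.99):
    print(f"p={p:.2f}: CPI <= {cpi_upper_bound(phase_cpi, p):.3f}")
```

The bound grows with the sample variance, which is why refining high-variance phases into sub-phases, as the paper proposes, tightens the estimate.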
Abstract:
Accurate and timely prediction of weather phenomena, such as hurricanes and flash floods, requires high-fidelity, compute-intensive simulations of multiple finer regions of interest within a coarse simulation domain. Current weather applications execute these nested simulations sequentially using all the available processors, which is sub-optimal due to their sub-linear scalability. In this work, we present a strategy for the parallel execution of multiple nested-domain simulations based on partitioning the 2-D processor grid into disjoint rectangular regions associated with each domain. We propose a novel combination of performance prediction, processor allocation methods and topology-aware mapping of the regions on torus interconnects. Experiments with WRF on IBM Blue Gene systems show that the proposed strategies yield performance improvements of up to 33% with topology-oblivious mapping, and up to an additional 7% with topology-aware mapping, over the default sequential strategy.
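A minimal sketch of the allocation idea, assuming each nested domain receives processors in proportion to its predicted workload; the workloads and processor count below are hypothetical, and the paper pairs such allocation with performance prediction and topology-aware rectangular mapping:

```python
# Sketch: split P processors among nested domains in proportion to predicted
# workload; each share then occupies its own rectangular region of the grid.

def allocate(workloads, total_procs):
    total = sum(workloads)
    shares = [max(1, round(total_procs * w / total)) for w in workloads]
    # Fix rounding drift so the shares sum exactly to total_procs.
    while sum(shares) != total_procs:
        if sum(shares) > total_procs:
            shares[shares.index(max(shares))] -= 1
        else:
            shares[shares.index(min(shares))] += 1
    return shares

print(allocate([3.0, 1.5, 1.0], 512))   # e.g. three nested domains on 512 cores
```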
Abstract:
Exploiting the performance potential of GPUs requires managing the data transfers to and from them efficiently, which is an error-prone and tedious task. In this paper, we develop a software coherence mechanism to fully automate all data transfers between the CPU and GPU without any assistance from the programmer. Our mechanism uses compiler analysis to identify potential stale accesses and a runtime to initiate transfers as necessary. This allows us to avoid the redundant transfers exhibited by all other existing automatic memory management proposals. We integrate our automatic memory manager into the X10 compiler and runtime, and find that it not only results in smaller and simpler programs, but also eliminates redundant memory transfers. Tested on eight programs ported from the Rodinia benchmark suite, it achieves (i) a 1.06x speedup over hand-tuned manual memory management, and (ii) a 1.29x speedup over another recently proposed compiler-runtime automatic memory management system. Compared to other existing runtime-only and compiler-only proposals, it also transfers 2.2x to 13.3x less data on average.
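A minimal sketch of the coherence idea, with illustrative state names and an assumed Array wrapper: track where the latest copy of each array lives and transfer only when an access would otherwise read stale data. In the paper, the compiler identifies the potentially stale accesses and the runtime issues the transfers:

```python
# Sketch: per-array validity flags drive host-to-device (H2D) and
# device-to-host (D2H) transfers, skipping redundant copies.

class Array:
    def __init__(self, name):
        self.name = name
        self.cpu_valid = True     # freshly allocated on the CPU
        self.gpu_valid = False

    def read_on_gpu(self):
        if not self.gpu_valid:            # GPU copy is stale: transfer needed
            print(f"H2D transfer: {self.name}")
            self.gpu_valid = True

    def write_on_gpu(self):
        self.read_on_gpu()                # conservatively refresh the GPU copy
        self.cpu_valid = False            # CPU copy is now stale

    def read_on_cpu(self):
        if not self.cpu_valid:
            print(f"D2H transfer: {self.name}")
            self.cpu_valid = True

a = Array("a")
a.read_on_gpu()    # H2D
a.write_on_gpu()   # no transfer: GPU copy already valid
a.read_on_gpu()    # no transfer: the redundant copy is avoided
a.read_on_cpu()    # D2H, only because the GPU wrote
```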
Abstract:
The twin demands of energy efficiency and higher performance on DRAM are strongly emphasized in multicore architectures. A variety of schemes have been proposed to address either the latency or the energy consumption of DRAMs. These schemes typically require non-trivial hardware changes and end up improving latency at the cost of energy, or vice versa. One specific DRAM performance problem in multicores is that interleaved accesses from different cores can degrade row-buffer locality. In this paper, based on the temporal and spatial locality characteristics of memory accesses, we propose a reorganization of the existing single large row-buffer in a DRAM bank into multiple sub-row buffers (MSRB). This reorganization not only improves row hit rates, and hence average memory latency, but also brings down the energy consumed by the DRAM. The first major contribution of this work is proposing such a reorganization without requiring any significant changes to the existing, widely accepted DRAM specifications. Our proposed reorganization improves weighted speedup by 35.8%, 14.5% and 21.6% in quad-, eight- and sixteen-core workloads, along with a 42%, 28% and 31% reduction in DRAM energy, respectively. The MSRB organization also opens up opportunities for managing the multiple row-buffers at the memory controller. Since the memory controller is aware of the behaviour of individual cores, it can implement coordinated buffer allocation schemes for different cores that take program behaviour into account. We demonstrate two such schemes, Fairness Oriented Allocation and Performance Oriented Allocation, which show the flexibility that memory controllers can exploit in our MSRB organization to improve overall performance and/or fairness. Further, the MSRB organization enables additional opportunities for DRAM intra-bank parallelism and for selective early precharging of the LRU row-buffer, further improving memory access latencies. Together, these two optimizations provide an additional 5.9% performance improvement.
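A minimal sketch of why multiple sub-row buffers help, assuming an LRU policy over k open rows per bank and a hypothetical interleaved access stream; the paper's MSRB design additionally layers allocation policies on top:

```python
# Sketch: count row-buffer hits with k open rows per bank (LRU) versus the
# conventional single open row, over an interleaved two-core access stream.
from collections import OrderedDict

def row_hits(row_stream, k):
    open_rows, hits = OrderedDict(), 0
    for row in row_stream:
        if row in open_rows:
            hits += 1
            open_rows.move_to_end(row)        # refresh LRU position
        else:
            if len(open_rows) >= k:
                open_rows.popitem(last=False) # evict LRU row (precharge)
            open_rows[row] = True
    return hits / len(row_stream)

# Two cores, each with good locality, interleaved at the bank: this access
# pattern thrashes a single row-buffer but not multiple sub-row buffers.
stream = [r for pair in zip([1] * 8, [2] * 8) for r in pair]
print(f"1 row buffer : {row_hits(stream, 1):.0%} hits")
print(f"4 sub-buffers: {row_hits(stream, 4):.0%} hits")
```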
Abstract:
An organometallic building block, 1,3,5-tris(4-trans-Pt(PEt3)(2)I(ethynyl)phenyl)benzene (1), incorporating Pt-ethynyl functionality has been synthesized and characterized. [2 + 3] self-assembly of its nitrate analogue, 1,3,5-tris(4-trans-Pt(PEt3)(2)(ONO2)(ethynyl)phenyl)benzene (2), with 'clip'-type bidentate donors (L1-L3) separately afforded three trigonal prismatic architectures (3a-3c), respectively. All these prisms were characterized, and their shapes/sizes were predicted through geometry optimization employing molecular mechanics universal force field (MMUFF) simulation. The extended π-conjugation, together with the Pt-ethynyl functionality, makes them electron-rich as well as luminescent in nature. Macrocycles 3b and 3c exhibit fluorescence quenching in solution upon addition of picric acid (PA), a common constituent of many explosives. Interestingly, the non-responsiveness of the fluorescence intensity towards other electron-deficient nitroaromatic explosives (NAEs) makes them promising selective sensors for PA, with a detection limit predicted to be at the ppb level. Furthermore, solid-state quenching of the fluorescence intensity of a thin film of 3b upon exposure to saturated picric acid vapor has drawn special attention for in-field applications.