16 resultados para multicore

em Indian Institute of Science - Bangalore - Índia


Relevância:

20.00% 20.00%

Publicador:

Resumo:

In this work, we propose a new organization for the last level shared cache of a rnulticore system. Our design is based on the observation that the Next-Use distance, measured in terms of intervening misses between the eviction of a line and its next use, for lines brought in by a given delinquent PC falls within a predictable range of values. We exploit this correlation to improve the performance of shared caches in multi-core architectures by proposing the NUcache organization.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

In this work, we evaluate performance of a real-world image processing application that uses a cross-correlation algorithm to compare a given image with a reference one. The algorithm processes individual images represented as 2-dimensional matrices of single-precision floating-point values using O(n4) operations involving dot-products and additions. We implement this algorithm on a nVidia GTX 285 GPU using CUDA, and also parallelize it for the Intel Xeon (Nehalem) and IBM Power7 processors, using both manual and automatic techniques. Pthreads and OpenMP with SSE and VSX vector intrinsics are used for the manually parallelized version, while a state-of-the-art optimization framework based on the polyhedral model is used for automatic compiler parallelization and optimization. The performance of this algorithm on the nVidia GPU suffers from: (1) a smaller shared memory, (2) unaligned device memory access patterns, (3) expensive atomic operations, and (4) weaker single-thread performance. On commodity multi-core processors, the application dataset is small enough to fit in caches, and when parallelized using a combination of task and short-vector data parallelism (via SSE/VSX) or through fully automatic optimization from the compiler, the application matches or beats the performance of the GPU version. The primary reasons for better multi-core performance include larger and faster caches, higher clock frequency, higher on-chip memory bandwidth, and better compiler optimization and support for parallelization. The best performing versions on the Power7, Nehalem, and GTX 285 run in 1.02s, 1.82s, and 1.75s, respectively. These results conclusively demonstrate that, under certain conditions, it is possible for a FLOP-intensive structured application running on a multi-core processor to match or even beat the performance of an equivalent GPU version.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

We propose a novel technique for reducing the power consumed by the on-chip cache in SNUCA chip multicore platform. This is achieved by what we call a "remap table", which maps accesses to the cache banks that are as close as possible to the cores, on which the processes are scheduled. With this technique, instead of using all the available cache, we use a portion of the cache and allocate lesser cache to the application. We formulate the problem as an energy-delay (ED) minimization problem and solve it offline using a scalable genetic algorithm approach. Our experiments show up to 40% of savings in the memory sub-system power consumption and 47% savings in energy-delay product (ED).

Relevância:

20.00% 20.00%

Publicador:

Resumo:

We propose a novel technique for reducing the power consumed by the on-chip cache in SNUCA chip multicore platform. This is achieved by what we call a "remap table", which maps accesses to the cache banks that are as close as possible to the cores, on which the processes are scheduled. With this technique, instead of using all the available cache, we use a portion of the cache and allocate lesser cache to the application. We formulate the problem as an energy-delay (ED) minimization problem and solve it offline using a scalable genetic algorithm approach. Our experiments show up to 40% of savings in the memory sub-system power consumption and 47% savings in energy-delay product (ED).

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The effectiveness of the last-level shared cache is crucial to the performance of a multi-core system. In this paper, we observe and make use of the DelinquentPC - Next-Use characteristic to improve shared cache performance. We propose a new PC-centric cache organization, NUcache, for the shared last level cache of multi-cores. NUcache logically partitions the associative ways of a cache set into MainWays and DeliWays. While all lines have access to the MainWays, only lines brought in by a subset of delinquent PCs, selected by a PC selection mechanism, are allowed to enter the DeliWays. The PC selection mechanism is an intelligent cost-benefit analysis based algorithm that utilizes Next-Use information to select the set of PCs that can maximize the hits experienced in DeliWays. Performance evaluation reveals that NUcache improves the performance over a baseline design by 9.6%, 30% and 33% respectively for dual, quad and eight core workloads comprised of SPEC benchmarks. We also show that NUcache is more effective than other well-known cache-partitioning algorithms.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Advances in technology have increased the number of cores and size of caches present on chip multicore platforms(CMPs). As a result, leakage power consumption of on-chip caches has already become a major power consuming component of the memory subsystem. We propose to reduce leakage power consumption in static nonuniform cache architecture(SNUCA) on a tiled CMP by dynamically varying the number of cache slices used and switching off unused cache slices. A cache slice in a tile includes all cache banks present in that tile. Switched-off cache slices are remapped considering the communication costs to reduce cache usage with minimal impact on execution time. This saves leakage power consumption in switched-off L2 cache slices. On an average, there map policy achieves 41% and 49% higher EDP savings compared to static and dynamic NUCA (DNUCA) cache policies on a scalable tiled CMP, respectively.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The correctness of a hard real-time system depends its ability to meet all its deadlines. Existing real-time systems use either a pure real-time scheduler or a real-time scheduler embedded as a real-time scheduling class in the scheduler of an operating system (OS). Existing implementations of schedulers in multicore systems that support real-time and non-real-time tasks, permit the execution of non-real-time tasks in all the cores with priorities lower than those of real-time tasks, but interrupts and softirqs associated with these non-real-time tasks can execute in any core with priorities higher than those of real-time tasks. As a result, the execution overhead of real-time tasks is quite large in these systems, which, in turn, affects their runtime. In order that the hard real-time tasks can be executed in such systems with minimal interference from other Linux tasks, we propose, in this paper, an integrated scheduler architecture, called SchedISA, which aims to considerably reduce the execution overhead of real-time tasks in these systems. In order to test the efficacy of the proposed scheduler, we implemented partitioned earliest deadline first (P-EDF) scheduling algorithm in SchedISA on Linux kernel, version 3.8, and conducted experiments on Intel core i7 processor with eight logical cores. We compared the execution overhead of real-time tasks in the above implementation of SchedISA with that in SCHED_DEADLINE's P-EDF implementation, which concurrently executes real-time and non-real-time tasks in Linux OS in all the cores. The experimental results show that the execution overhead of real-time tasks in the above implementation of SchedISA is considerably less than that in SCHED_DEADLINE. We believe that, with further refinement of SchedISA, the execution overhead of real-time tasks in SchedISA can be reduced to a predictable maximum, making it suitable for scheduling hard real-time tasks without affecting the CPU share of Linux tasks.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

The StreamIt programming model has been proposed to exploit parallelism in streaming applications oil general purpose multicore architectures. The StreamIt graphs describe task, data and pipeline parallelism which can be exploited on accelerators such as Graphics Processing Units (GPUs) or CellBE which support abundant parallelism in hardware. In this paper, we describe a novel method to orchestrate the execution of if StreamIt program oil a multicore platform equipped with an accelerator. The proposed approach identifies, using profiling, the relative benefits of executing a task oil the superscalar CPU cores and the accelerator. We formulate the problem of partitioning the work between the CPU cores and the GPU, taking into account the latencies for data transfers and the required buffer layout transformations associated with the partitioning, as all integrated Integer Linear Program (ILP) which can then be solved by an ILP solver. We also propose an efficient heuristic algorithm for the work-partitioning between the CPU and the GPU, which provides solutions which are within 9.05% of the optimal solution on an average across the benchmark Suite. The partitioned tasks are then software pipelined to execute oil the multiple CPU cores and the Streaming Multiprocessors (SMs) of the GPU. The software pipelining algorithm orchestrates the execution between CPU cores and the GPU by emitting the code for the CPU and the GPU, and the code for the required data transfers. Our experiments on a platform with 8 CPU cores and a GeForce 8800 GTS 512 GPU show a geometric mean speedup of 6.94X with it maximum of 51.96X over it single threaded CPU execution across the StreamIt benchmarks. This is a 18.9% improvement over it partitioning strategy that maps only the filters that cannot be executed oil the GPU - the filters with state that is persistent across firings - onto the CPU.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

The Morse-Smale complex is a useful topological data structure for the analysis and visualization of scalar data. This paper describes an algorithm that processes all mesh elements of the domain in parallel to compute the Morse-Smale complex of large two-dimensional data sets at interactive speeds. We employ a reformulation of the Morse-Smale complex using Forman's Discrete Morse Theory and achieve scalability by computing the discrete gradient using local accesses only. We also introduce a novel approach to merge gradient paths that ensures accurate geometry of the computed complex. We demonstrate that our algorithm performs well on both multicore environments and on massively parallel architectures such as the GPU.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Effective sharing of the last level cache has a significant influence on the overall performance of a multicore system. We observe that existing solutions control cache occupancy at a coarser granularity, do not scale well to large core counts and in some cases lack the flexibility to support a variety of performance goals. In this paper, we propose Probabilistic Shared Cache Management (PriSM), a framework to manage the cache occupancy of different cores at cache block granularity by controlling their eviction probabilities. The proposed framework requires only simple hardware changes to implement, can scale to larger core count and is flexible enough to support a variety of performance goals. We demonstrate the flexibility of PriSM, by computing the eviction probabilities needed to achieve goals like hit-maximization, fairness and QOS. PriSM-HitMax improves performance by 18.7% over LRU and 11.8% over previously proposed schemes in a sixteen core machine. PriSM-Fairness improves fairness over existing solutions by 23.3% along with a performance improvement of 19.0%. PriSM-QOS successfully achieves the desired QOS targets.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

The twin demands of energy-efficiency and higher performance on DRAM are highly emphasized in multicore architectures. A variety of schemes have been proposed to address either the latency or the energy consumption of DRAMs. These schemes typically require non-trivial hardware changes and end up improving latency at the cost of energy or vice-versa. One specific DRAM performance problem in multicores is that interleaved accesses from different cores can potentially degrade row-buffer locality. In this paper, based on the temporal and spatial locality characteristics of memory accesses, we propose a reorganization of the existing single large row-buffer in a DRAM bank into multiple sub-row buffers (MSRB). This re-organization not only improves row hit rates, and hence the average memory latency, but also brings down the energy consumed by the DRAM. The first major contribution of this work is proposing such a reorganization without requiring any significant changes to the existing widely accepted DRAM specifications. Our proposed reorganization improves weighted speedup by 35.8%, 14.5% and 21.6% in quad, eight and sixteen core workloads along with a 42%, 28% and 31% reduction in DRAM energy. The proposed MSRB organization enables opportunities for the management of multiple row-buffers at the memory controller level. As the memory controller is aware of the behaviour of individual cores it allows us to implement coordinated buffer allocation schemes for different cores that take into account program behaviour. We demonstrate two such schemes, namely Fairness Oriented Allocation and Performance Oriented Allocation, which show the flexibility that memory controllers can now exploit in our MSRB organization to improve overall performance and/or fairness. Further, the MSRB organization enables additional opportunities for DRAM intra-bank parallelism and selective early precharging of the LRU row-buffer to further improve memory access latencies. These two optimizations together provide an additional 5.9% performance improvement.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Moore's Law has driven the semiconductor revolution enabling over four decades of scaling in frequency, size, complexity, and power. However, the limits of physics are preventing further scaling of speed, forcing a paradigm shift towards multicore computing and parallelization. In effect, the system is taking over the role that the single CPU was playing: high-speed signals running through chips but also packages and boards connect ever more complex systems. High-speed signals making their way through the entire system cause new challenges in the design of computing hardware. Inductance, phase shifts and velocity of light effects, material resonances, and wave behavior become not only prevalent but need to be calculated accurately and rapidly to enable short design cycle times. In essence, to continue scaling with Moore's Law requires the incorporation of Maxwell's equations in the design process. Incorporating Maxwell's equations into the design flow is only possible through the combined power that new algorithms, parallelization and high-speed computing provide. At the same time, incorporation of Maxwell-based models into circuit and system-level simulation presents a massive accuracy, passivity, and scalability challenge. In this tutorial, we navigate through the often confusing terminology and concepts behind field solvers, show how advances in field solvers enable integration into EDA flows, present novel methods for model generation and passivity assurance in large systems, and demonstrate the power of cloud computing in enabling the next generation of scalable Maxwell solvers and the next generation of Moore's Law scaling of systems. We intend to show the truly symbiotic growing relationship between Maxwell and Moore!

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Affine transformations have proven to be very powerful for loop restructuring due to their ability to model a very wide range of transformations. A single multi-dimensional affine function can represent a long and complex sequence of simpler transformations. Existing affine transformation frameworks like the Pluto algorithm, that include a cost function for modern multicore architectures where coarse-grained parallelism and locality are crucial, consider only a sub-space of transformations to avoid a combinatorial explosion in finding the transformations. The ensuing practical tradeoffs lead to the exclusion of certain useful transformations, in particular, transformation compositions involving loop reversals and loop skewing by negative factors. In this paper, we propose an approach to address this limitation by modeling a much larger space of affine transformations in conjunction with the Pluto algorithm's cost function. We perform an experimental evaluation of both, the effect on compilation time, and performance of generated codes. The evaluation shows that our new framework, Pluto+, provides no degradation in performance in any of the Polybench benchmarks. For Lattice Boltzmann Method (LBM) codes with periodic boundary conditions, it provides a mean speedup of 1.33x over Pluto. We also show that Pluto+ does not increase compile times significantly. Experimental results on Polybench show that Pluto+ increases overall polyhedral source-to-source optimization time only by 15%. In cases where it improves execution time significantly, it increased polyhedral optimization time only by 2.04x.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Graph algorithms have been shown to possess enough parallelism to keep several computing resources busy-even hundreds of cores on a GPU. Unfortunately, tuning their implementation for efficient execution on a particular hardware configuration of heterogeneous systems consisting of multicore CPUs and GPUs is challenging, time consuming, and error prone. To address these issues, we propose a domain-specific language (DSL), Falcon, for implementing graph algorithms that (i) abstracts the hardware, (ii) provides constructs to write explicitly parallel programs at a higher level, and (iii) can work with general algorithms that may change the graph structure (morph algorithms). We illustrate the usage of our DSL to implement local computation algorithms (that do not change the graph structure) and morph algorithms such as Delaunay mesh refinement, survey propagation, and dynamic SSSP on GPU and multicore CPUs. Using a set of benchmark graphs, we illustrate that the generated code performs close to the state-of-the-art hand-tuned implementations.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

This article is aimed to delineate groundwater sources in Holocene deposits area in the Gulf of Mannar Coast from Southern India. For this purpose 2-D electrical resistivity tomography (ERT), hydrochemical and granulomerical studies were carried out and integrated to identify hydrogeological structures and portable groundwater resource in shallow depths which in general appears in the coastal tracts. The 2-D ERT was used to determine the two-dimensional subsurface geological formations by multicore cable with Wenner array. Low resistivity of 1-5 Omega m for saline water appeared due to calcite at the depth of about 5 m below the ground level (bgl). Sea water intrusion was observed around the maximum resistivity as 5 Omega m at the 8 m depth, bgl in the calcite environs, but the calcareous sandstone layer shows around 15-64 Omega m at the 6 m depth, bgl. The hydrochemical variation of TDS, HCO3-, Cl-, Na+, K+, Ca2+, and Mg2+ concentrations was observed for the saline and sea water intrusion in the groundwater system. The granulometic analysis shows that the study area was under the sea between 5400 and 3000 year ago. The events of ice melting an unnatural ice-stone rain/hail among 5000-4000 years ago resulted in the inundation of sea over the area and deposits of late Holocene marine transgression formation up to Puthukottai quartzite region for a stretch of around 17 km.