11 results for Multiprocessors.

at Indian Institute of Science - Bangalore - India


Relevance: 20.00%

Publisher:

Abstract:

Utilization bounds for Earliest Deadline First (EDF) and Rate Monotonic (RM) scheduling are known and well understood for uniprocessor systems. In this paper, we derive limits on similar bounds for the multiprocessor case, where the individual processors need not be identical. Tasks are partitioned among the processors, and RM scheduling is assumed to be the policy used on each individual processor. Since the actual value of the bound depends on the algorithm that allocates the tasks, we give and prove a minimum limit on the bounds for a 'greedy' class of allocation algorithms. We also derive the utilization bound of an algorithm that allocates tasks in decreasing order of utilization factors. Knowledge of such bounds allows very fast schedulability tests, with the caveat that the tests are sufficient but not necessary to ensure schedulability.
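
To illustrate the flavor of such a test, the sketch below partitions tasks in decreasing order of utilization and admits a task onto a processor only if the classic Liu-Layland per-processor RM bound still holds. This is a minimal sketch, not the paper's algorithm: it assumes identical processors (the paper treats non-identical ones) and the task utilizations are invented.

```python
def rm_bound(n):
    """Liu-Layland utilization bound for n tasks under RM on one processor."""
    return n * (2 ** (1.0 / n) - 1)

def partition_decreasing_utilization(utilizations, num_procs):
    """Greedy first-fit allocation in decreasing order of utilization factor.
    Returns a per-processor task list, or None when the sufficient test fails;
    failure does NOT prove the task set unschedulable."""
    procs = [[] for _ in range(num_procs)]
    for u in sorted(utilizations, reverse=True):
        for tasks in procs:
            if sum(tasks) + u <= rm_bound(len(tasks) + 1):
                tasks.append(u)
                break
        else:
            return None  # no processor can provably accept this task
    return procs

# Example: six tasks on two identical processors.
print(partition_decreasing_utilization([0.3, 0.3, 0.2, 0.2, 0.1, 0.1], 2))
```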

Relevance: 20.00%

Publisher:

Abstract:

Relentless CMOS scaling coupled with lower design tolerances is making ICs increasingly susceptible to wear-out-related permanent faults and to transient faults, necessitating on-chip fault tolerance in future chip multiprocessors (CMPs). In this paper we introduce a new energy-efficient fault-tolerant CMP architecture known as Redundant Execution using Critical Value Forwarding (RECVF). RECVF is based on two observations: (i) forwarding critical instruction results from the leading to the trailing core enables the latter to execute faster, and (ii) this speedup can be exploited to reduce energy consumption by operating the trailing core at a lower voltage-frequency level. Our evaluation shows that RECVF consumes 37% less energy than conventional dual modular redundant (DMR) execution of a program. It consumes only 1.26 times the energy of a non-fault-tolerant baseline and has a performance overhead of just 1.2%.
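
To see why slowing the trailing core saves energy, a first-order dynamic-energy model is enough: with P ~ C·V²·f and runtime ~ cycles/f, energy for a fixed cycle count is ~ C·V²·cycles, so a lower voltage point directly cuts the redundant core's energy. The sketch below uses this textbook approximation, not the paper's evaluation methodology; the 0.8 V operating point is an invented illustration, and it assumes the forwarded results let the trailing core finish on time at the lower V-f level.

```python
def dynamic_energy(v, cycles, c=1.0):
    """First-order CMOS model: power ~ C*V^2*f, runtime ~ cycles/f,
    so energy for a fixed cycle count ~ C * V^2 * cycles."""
    return c * v * v * cycles

CYCLES = 1e9
baseline = dynamic_energy(1.0, CYCLES)   # non-redundant single core
leading  = dynamic_energy(1.0, CYCLES)   # leading core at full voltage
trailing = dynamic_energy(0.8, CYCLES)   # trailing core at a lower (assumed) V-f point

print("redundant / baseline energy:", (leading + trailing) / baseline)  # 1.64x
print("savings vs. lock-stepped DMR:", 1 - (leading + trailing) / (2 * baseline))
```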

Relevance: 20.00%

Publisher:

Abstract:

In this paper we present a cache coherence protocol for multistage interconnection network (MIN)-based multiprocessors with two distinct private caches: private-block caches (PCache) containing blocks private to a process, and shared-block caches (SCache) containing data accessible by all processes. The architecture is extended by a coherence control bus connecting all shared-block cache controllers. Timing problems due to variable transit delays through the MIN are dealt with by introducing transient states into the proposed cache coherence protocol. The impact of the coherence protocol on system performance is evaluated through a three-phase performance study. Assuming homogeneity of all nodes, a single-node queueing model (phase 3) is developed to analyze system performance. This model is solved for processor and coherence-bus utilizations using the mean value analysis (MVA) technique, with shared-block steady-state probabilities (phase 1) and communication delays (phase 2) as input parameters. The performance of our system is compared to that of a system with an equivalent-sized unified cache and to a multiprocessor implementing a directory-based coherence protocol. System performance measures are verified through simulation.
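
For readers unfamiliar with MVA, the sketch below is the standard exact MVA recursion for a closed, single-class queueing network: the arrival theorem gives per-station response times from the queue lengths at population n-1, and Little's law closes the loop. This is the generic textbook algorithm; the two-station topology and the service demands are illustrative assumptions, not the model from the paper.

```python
def exact_mva(visit_ratios, service_times, num_customers):
    """Exact Mean Value Analysis for a closed single-class queueing network.
    Returns system throughput and per-station mean queue lengths."""
    k = len(service_times)
    q = [0.0] * k                       # queue lengths at population n-1
    x = 0.0
    for n in range(1, num_customers + 1):
        # Arrival theorem: an arriving customer sees the network with
        # one customer removed.
        r = [service_times[i] * (1 + q[i]) for i in range(k)]
        x = n / sum(visit_ratios[i] * r[i] for i in range(k))   # throughput
        q = [x * visit_ratios[i] * r[i] for i in range(k)]      # Little's law
    return x, q

# Two stations, e.g. "processor" and "coherence bus", 8 circulating requests.
throughput, queues = exact_mva([1.0, 0.4], [1.0, 0.5], 8)
print(throughput, queues)
```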

Relevance: 20.00%

Publisher:

Abstract:

Relentless CMOS scaling coupled with lower design tolerances is making ICs increasingly susceptible to wear-out-related permanent faults and to transient faults, necessitating on-chip fault tolerance in future chip multiprocessors (CMPs). In this paper, we describe a power-efficient architecture for redundant execution on CMPs which, when coupled with our per-core dynamic voltage and frequency scaling (DVFS) algorithm, significantly reduces the energy overhead of redundant execution without sacrificing performance. Our evaluation shows that this architecture has a performance overhead of only 0.3% and consumes only 1.48 times the energy of a non-fault-tolerant baseline.
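
One way to picture a per-core DVFS policy for redundant execution is as a feedback controller: the leading core fills a queue of results, and the trailing core's frequency is stepped up or down to keep that queue near a set point. The sketch below is only this intuition, not the paper's algorithm; the frequency levels and occupancy thresholds are invented.

```python
FREQ_LEVELS = [1.0e9, 1.4e9, 1.8e9, 2.2e9]  # assumed available V-f steps (Hz)

def next_freq_level(level, queue_occupancy, low=0.25, high=0.75):
    """Step the trailing core's frequency from the occupancy of the
    leading-to-trailing queue: a full queue means the trailing core is
    falling behind (speed up); a near-empty one means it can slow down."""
    if queue_occupancy > high and level < len(FREQ_LEVELS) - 1:
        return level + 1
    if queue_occupancy < low and level > 0:
        return level - 1
    return level

level = 2
for occupancy in [0.9, 0.8, 0.5, 0.2, 0.1]:   # sampled each control interval
    level = next_freq_level(level, occupancy)
    print(occupancy, FREQ_LEVELS[level])
```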

Relevance: 20.00%

Publisher:

Abstract:

It is essential to accurately estimate the working set size (WSS) of an application for various optimizations, such as partitioning a cache among virtual machines or reducing leakage power by switching off over-allocated portions of a cache. However, state-of-the-art heuristics such as average memory access latency (AMAL) or cache miss ratio (CMR) correlate poorly with the WSS of an application because of (1) over-sized caches and (2) their dispersed nature. Past studies focus on estimating the WSS of an application executing on a uniprocessor platform; estimating the same for a chip multiprocessor (CMP) with a large dispersed cache is challenging due to the presence of concurrently executing threads and processes. Hence, we propose a scalable, highly accurate method to estimate the WSS of an application, which we call the tagged WSS (TWSS) estimation method. We demonstrate the use of TWSS to switch off over-allocated cache ways in static and dynamic non-uniform cache architectures (SNUCA, DNUCA) on a tiled CMP. In our implementation of adaptable-way SNUCA and DNUCA caches, the decision to alter associativity is taken locally by each L2 controller; hence, the approach scales with the number of cores on a CMP. It gives 26% and 19% higher energy-delay product savings (geometric mean) than the AMAL and CMR heuristics on SNUCA, respectively.
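
A toy sketch of tag-based WSS estimation, under the assumption that a sampled subset of cache sets carries per-line "accessed" tags whose count is scaled up by the sampling ratio; the paper's TWSS mechanism lives in the L2 controllers of a tiled NUCA and is more elaborate than this.

```python
def estimate_wss(accesses, num_sets, line_size=64, sample_every=8):
    """Estimate WSS by tagging distinct lines touched in 1/sample_every
    of the cache sets, then scaling up. `accesses` are byte addresses."""
    tagged = set()
    for addr in accesses:
        line = addr // line_size
        if (line % num_sets) % sample_every == 0:   # line maps to a sampled set?
            tagged.add(line)
    return len(tagged) * sample_every * line_size   # estimated WSS in bytes

addrs = [i * 64 for i in range(1000)] * 4           # 1000 distinct lines, reused
print(estimate_wss(addrs, num_sets=256))            # ~64000 bytes, the true WSS
```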

Relevance: 10.00%

Publisher:

Abstract:

The StreamIt programming model has been proposed to exploit parallelism in streaming applications on general-purpose multicore architectures. StreamIt graphs describe task, data and pipeline parallelism, which can be exploited on accelerators such as Graphics Processing Units (GPUs) or the CellBE, which support abundant parallelism in hardware. In this paper, we describe a novel method to orchestrate the execution of a StreamIt program on a multicore platform equipped with an accelerator. The proposed approach identifies, using profiling, the relative benefits of executing a task on the superscalar CPU cores and on the accelerator. We formulate the problem of partitioning the work between the CPU cores and the GPU, taking into account the latencies for data transfers and the required buffer layout transformations associated with the partitioning, as an integrated Integer Linear Program (ILP) which can then be solved by an ILP solver. We also propose an efficient heuristic algorithm for the work partitioning between the CPU and the GPU, which provides solutions within 9.05% of the optimal solution on average across the benchmark suite. The partitioned tasks are then software-pipelined to execute on the multiple CPU cores and the Streaming Multiprocessors (SMs) of the GPU. The software pipelining algorithm orchestrates the execution between the CPU cores and the GPU by emitting the code for the CPU and the GPU, and the code for the required data transfers. Our experiments on a platform with 8 CPU cores and a GeForce 8800 GTS 512 GPU show a geometric mean speedup of 6.94X, with a maximum of 51.96X, over single-threaded CPU execution across the StreamIt benchmarks. This is an 18.9% improvement over a partitioning strategy that maps onto the CPU only the filters that cannot be executed on the GPU - the filters with state that is persistent across firings.
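
The flavor of such an ILP can be sketched with an off-the-shelf solver. The model below minimizes the larger of the CPU and GPU loads and pays an assumed transfer cost for every producer-consumer edge that crosses the partition (linearized with per-edge cut indicators). It uses the PuLP library as one possible solver front-end, and the task times, edges and transfer cost are invented; the paper's formulation additionally models buffer layout transformations.

```python
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

cpu_time = {"A": 4, "B": 9, "C": 3, "D": 7}      # assumed per-task costs
gpu_time = {"A": 6, "B": 2, "C": 5, "D": 1}
edges = [("A", "B"), ("B", "C"), ("C", "D")]
XFER = 2                                          # assumed cost per cut edge

prob = LpProblem("cpu_gpu_partition", LpMinimize)
on_gpu = LpVariable.dicts("gpu", cpu_time, cat=LpBinary)
cut = LpVariable.dicts("cut", range(len(edges)), cat=LpBinary)
span = LpVariable("span", lowBound=0)
prob += span                                      # objective: minimize max load

# An edge is cut when its endpoints land on different devices.
for k, (u, v) in enumerate(edges):
    prob += cut[k] >= on_gpu[u] - on_gpu[v]
    prob += cut[k] >= on_gpu[v] - on_gpu[u]

xfer_total = XFER * lpSum(cut.values())
prob += span >= lpSum((1 - on_gpu[t]) * cpu_time[t] for t in cpu_time) + xfer_total
prob += span >= lpSum(on_gpu[t] * gpu_time[t] for t in gpu_time) + xfer_total
prob.solve()
print({t: int(on_gpu[t].value()) for t in cpu_time}, span.value())
```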

Relevance: 10.00%

Publisher:

Abstract:

Many novel computer architectures, such as array processors and multiprocessors, which achieve high performance through the use of concurrency, exploit variations of the von Neumann model of computation. The effective utilization of such machines makes special demands on programmers and their programming languages, such as structuring data into vectors or partitioning programs into concurrent processes. In comparison, the data flow model of computation demands only that the principle of structured programming be followed. A data flow program, often represented as a data flow graph, is a program that expresses a computation by indicating the data dependencies among operators. A data flow computer is a machine designed to take advantage of concurrency in data flow graphs by executing data-independent operations in parallel. In this paper, we discuss the design of a high-level language (DFL: Data Flow Language) suitable for data flow computers, and present some sample procedures in DFL. The implementation aspects are not discussed in detail, since no new problems are encountered there. The language DFL embodies the concepts of functional programming, but in appearance closely resembles Pascal. The language is a better vehicle than the data flow graph for expressing a parallel algorithm. The compiler has been implemented on a DEC 1090 system in Pascal.
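
The execution model can be made concrete with a toy dataflow interpreter: an operator fires as soon as all of its inputs are available, so data-independent operators may fire in the same step. This is a minimal sketch of the general model, with an invented graph; DFL itself is a Pascal-like textual language that compiles to such graphs.

```python
import operator

# Each node: (function, list of input node names). Sources take no inputs.
graph = {
    "x":   (lambda: 3, []),
    "y":   (lambda: 4, []),
    "sq1": (lambda a: a * a, ["x"]),   # sq1 and sq2 are data-independent, so a
    "sq2": (lambda a: a * a, ["y"]),   # dataflow machine may fire them in parallel
    "sum": (operator.add, ["sq1", "sq2"]),
}

def run(graph):
    values = {}
    while len(values) < len(graph):
        # Fire every node whose inputs are all available (one "step").
        ready = [n for n, (_, ins) in graph.items()
                 if n not in values and all(i in values for i in ins)]
        for n in ready:
            fn, ins = graph[n]
            values[n] = fn(*(values[i] for i in ins))
    return values

print(run(graph)["sum"])   # 25 == x*x + y*y
```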

Relevance: 10.00%

Publisher:

Abstract:

The StreamIt programming model has been proposed to exploit parallelism in streaming applications on general-purpose multi-core architectures. This model allows programmers to specify the structure of a program as a set of filters that act upon data, and a set of communication channels between them. StreamIt graphs describe task, data and pipeline parallelism, which can be exploited on modern Graphics Processing Units (GPUs), as they support abundant parallelism in hardware. In this paper, we describe the challenges in mapping StreamIt to GPUs and propose an efficient technique to software-pipeline the execution of stream programs on GPUs. We formulate this problem - both the scheduling and the assignment of filters to processors - as an efficient Integer Linear Program (ILP), which is then solved using ILP solvers. We also describe a novel buffer layout technique for GPUs which facilitates exploiting the high memory bandwidth available in GPUs. The proposed scheduling utilizes both the scalar units in the GPU, to exploit data parallelism, and the multiprocessors, to exploit task and pipeline parallelism. Further, it takes into consideration the synchronization and bandwidth limitations of GPUs, and yields speedups between 1.87X and 36.83X over a single-threaded CPU.
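
The buffer-layout point can be illustrated in host code: if thread t of a GPU kernel reads element i of its own stream, storing the streams interleaved lets consecutive threads touch consecutive addresses at each step (coalesced access). The sketch below shows only this generic transformation with invented data; it is independent of the paper's actual CUDA implementation.

```python
def interleave(buffers):
    """Turn per-thread contiguous buffers [[t0e0, t0e1, ...], [t1e0, ...]]
    into one interleaved buffer [t0e0, t1e0, ..., t0e1, t1e1, ...], so all
    threads read a contiguous chunk at each step (coalesced on a GPU)."""
    num_threads, num_elems = len(buffers), len(buffers[0])
    return [buffers[t][i] for i in range(num_elems) for t in range(num_threads)]

def element(flat, t, i, num_threads):
    """Index arithmetic a kernel would use: element i of thread t's stream."""
    return flat[i * num_threads + t]

bufs = [[10, 11, 12], [20, 21, 22], [30, 31, 32], [40, 41, 42]]
flat = interleave(bufs)
print(flat)                       # [10, 20, 30, 40, 11, 21, 31, 41, ...]
print(element(flat, 2, 1, 4))     # 31 == bufs[2][1]
```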

Relevance: 10.00%

Publisher:

Abstract:

In this paper, we introduce an analytical technique based on queueing networks and Petri nets for the performance analysis of dataflow computations executed on the Manchester machine. The technique is also applicable to the analysis of parallel computations on multiprocessors. We characterize the parallelism in dataflow computations through four parameters: the minimum parallelism, the maximum parallelism, the average parallelism and the variance in parallelism. Detailed investigation of our analytical models shows that the average parallelism is a good characterization of a dataflow computation only as long as the variance in parallelism is small; significant differences in performance measures result when the variance in parallelism is comparable to or higher than the average parallelism.
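
The four-parameter characterization is straightforward to compute from a parallelism profile, i.e. the number of concurrently enabled operations in each time step; the profile below is invented for illustration.

```python
def characterize(profile):
    """Four-parameter characterization of a parallelism profile:
    minimum, maximum, (time-)average and variance of the parallelism."""
    n = len(profile)
    avg = sum(profile) / n
    var = sum((p - avg) ** 2 for p in profile) / n
    return min(profile), max(profile), avg, var

# Invented profile: parallelism observed in each step of a computation.
profile = [1, 2, 4, 8, 8, 6, 3, 1]
p_min, p_max, p_avg, p_var = characterize(profile)
print(p_min, p_max, p_avg, p_var)
```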

Relevance: 10.00%

Publisher:

Abstract:

This paper describes the design of a power-efficient microarchitecture for transient fault detection in chip multiprocessors (CMPs). We introduce a new per-core dynamic voltage and frequency scaling (DVFS) algorithm for our architecture that significantly reduces power dissipation for redundant execution with minimal performance overhead. Using cycle-accurate simulation combined with a simple first-order power model, we estimate that our architecture reduces dynamic power dissipation in the redundant core by a mean of 79% and a maximum of 85%, with an associated mean performance overhead of only 1.2%.
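
A simple first-order power model of the kind mentioned here is P_dyn ~ C·V²·f, so scaling the redundant core's voltage and frequency together shrinks its dynamic power roughly cubically. The operating points below are illustrative assumptions, not the paper's measured configuration.

```python
def dynamic_power(v, f, c=1.0):
    """First-order CMOS dynamic power: P ~ C * V^2 * f."""
    return c * v * v * f

# Illustrative points: redundant core scaled from (1.0 V, 2.0 GHz)
# down to (0.7 V, 1.2 GHz) once it only needs to keep up with checking.
full = dynamic_power(1.0, 2.0e9)
scaled = dynamic_power(0.7, 1.2e9)
print("redundant-core dynamic power reduction: %.0f%%"
      % (100 * (1 - scaled / full)))   # ~71% under these assumed points
```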

Relevance: 10.00%

Publisher:

Abstract:

Streaming applications demand hard bandwidth and throughput guarantees in a multiprocessor environment amidst resource-competing processes. We present a Label Switching based Network-on-Chip (LS-NoC), motivated by the throughput guarantees offered by bandwidth reservation. Label switching is a packet relaying technique in which individual packets carry route information in the form of labels. A centralized LS-NoC management framework engineers traffic into Quality of Service (QoS) guaranteed routes. LS-NoC caters to the requirements of streaming applications, where communication channels are fixed over the lifetime of the application. The proposed NoC framework inherently supports heterogeneous and ad hoc systems-on-chip, and LS-NoC can be used in conjunction with a conventional best-effort NoC as a QoS-guaranteed communication network or as a replacement for the conventional NoC. A multicast- and broadcast-capable label switched router for the LS-NoC has been designed. A 5-port router with a 256-bit data bus and 4-bit labels occupies 0.431 mm² in 130 nm and delivers a peak bandwidth of 80 Gbit/s per link at 312.5 MHz. Bandwidth and latency guarantees of LS-NoC have been demonstrated on traffic from example streaming applications and on constant and variable bit rate traffic patterns. LS-NoC was found to have a competitive Area × Power / Throughput figure of merit compared with state-of-the-art NoCs providing QoS. Circuit switching with link-sharing abilities and support for asynchronous operation make LS-NoC a desirable choice for QoS servicing in chip multiprocessors.
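
Label switching itself is simple to sketch: each router keeps a table mapping an incoming label to an output port and an outgoing label, so per-hop forwarding is one table lookup plus a label swap. The sketch below is a generic MPLS-style illustration with an invented three-router route; the LS-NoC router additionally supports multicast, broadcast and QoS-reserved routes.

```python
# Per-router label table, as a centralized management framework might install
# it when reserving a guaranteed route: label -> (output port, new label).
tables = {
    "R0": {5: ("east", 9)},
    "R1": {9: ("north", 2)},
    "R2": {2: ("local", None)},   # route terminates at R2's local port
}
links = {("R0", "east"): "R1", ("R1", "north"): "R2"}

def forward(router, label, payload):
    """Relay one packet hop by hop using only its label."""
    while True:
        port, new_label = tables[router][label]
        if port == "local":
            return router, payload                 # delivered
        router, label = links[(router, port)], new_label

print(forward("R0", 5, "stream data"))             # ('R2', 'stream data')
```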