8 results for Individualized instruction.

in Indian Institute of Science - Bangalore - India


Relevance:

20.00% 20.00%

Publisher:

Abstract:

Traditionally, an instruction decoder is designed as a monolithic structure that inhibits leakage energy optimization. In this paper, we consider a split instruction decoder that enables leakage energy optimization. We also propose a compiler scheduling algorithm that exploits instruction slack to increase the simultaneous active and idle durations in the instruction decoder. The proposed compiler-assisted scheme obtains a further 14.5% reduction in the energy consumption of the instruction decoder over a hardware-only scheme for a VLIW architecture. The benefits are 17.3% and 18.7% in the context of a 2-clustered and a 4-clustered VLIW architecture, respectively.

Relevance:

20.00% 20.00%

Publisher:

Abstract:

Large instruction windows and issue queues are key to exploiting greater instruction-level parallelism in out-of-order superscalar processors. However, the cycle time and energy consumption of conventional large monolithic issue queues are high. Previous efforts to reduce cycle time segment the issue queue and pipeline wakeup. Unfortunately, this results in significant IPC loss. Other proposals, which address energy efficiency by avoiding only the unnecessary tag comparisons, do not reduce broadcasts. These schemes also increase the issue latency. To address both these issues comprehensively, we propose the Scalable Lowpower Issue Queue (SLIQ). SLIQ augments a pipelined issue queue with direct indexing to mitigate the problem of delayed wakeups while reducing the cycle time. Also, the SLIQ design naturally leads to significant energy savings by reducing both the number of tag broadcasts and the number of comparisons required. A 2-segment SLIQ incurs an average IPC loss of 0.2% over the entire SPEC CPU2000 suite, while achieving a 25.2% reduction in issue latency when compared to a monolithic 128-entry issue queue for an 8-wide superscalar processor. An 8-segment SLIQ improves scalability by reducing the issue latency by 38.3% while incurring an IPC loss of only 2.3%. Further, the 8-segment SLIQ significantly reduces the energy consumption and energy-delay product by 48.3% and 67.4%, respectively, on average.

Relevance:

20.00% 20.00%

Publisher:

Abstract:

Superscalar processors currently have the potential to fetch multiple basic blocks per cycle by employing one of several recently proposed instruction fetch mechanisms. However, this increased fetch bandwidth cannot be exploited unless pipeline stages further downstream correspondingly improve. In particular, register renaming a large number of instructions per cycle is difficult. A large instruction window, needed to receive multiple basic blocks per cycle, will slow down dependence resolution and instruction issue. This paper addresses these and related issues by proposing (i) partitioning of the instruction window into multiple blocks, each holding a dynamic code sequence; (ii) logical partitioning of the register file into a global file and several local files, the latter holding registers local to a dynamic code sequence; (iii) the dynamic recording and reuse of register renaming information for registers local to a dynamic code sequence. Performance studies show these mechanisms improve performance over traditional superscalar processors by factors ranging from 1.5 to a little over 3 for the SPEC Integer programs. Next, it is observed that several of the loops in the benchmarks display vector-like behavior during execution, even if the static loop bodies are likely complex for compile-time vectorization. A dynamic loop vectorization mechanism that builds on top of the above mechanisms is briefly outlined. The mechanism vectorizes up to 60% of the dynamic instructions for some programs, albeit the average number of iterations per loop is quite small.

Relevance:

20.00% 20.00%

Publisher:

Abstract:

Instruction reuse is a microarchitectural technique that improves the execution time of a program by removing redundant computations at run-time. Although this is the job of an optimizing compiler, compilers often do not succeed because of their limited knowledge of run-time data. In this paper we examine instruction reuse of integer ALU and load instructions in network processing applications. Specifically, this paper attempts to answer the following questions: (1) How much instruction reuse is inherent in network processing applications? (2) Can reuse be improved by reducing interference in the reuse buffer? (3) What characteristics of network applications can be exploited to improve reuse? (4) What is the effect of reuse on resource contention and memory accesses? We propose an aggregation scheme that combines the high-level concept of network traffic, i.e. "flows", with a low-level microarchitectural feature of programs, i.e. repetition of instructions and data, along with an architecture that exploits temporal locality in incoming packet data to improve reuse. We find that for the benchmarks considered, 1% to 50% of instructions are reused while the speedup achieved varies between 1% and 24%. As a side effect, instruction reuse reduces memory traffic and can therefore be considered as a scheme for low power.
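
To make the idea of a reuse buffer concrete, the following is a minimal software sketch and not the architecture evaluated in the paper. It assumes a small table keyed by the instruction address and its operand values, with least-recently-used eviction; the class name `ReuseBuffer`, the `execute` method, and the capacity are all hypothetical. It covers only pure ALU-style operations; reusing loads would additionally require invalidating entries when memory changes, which is omitted here.

```python
from collections import OrderedDict


class ReuseBuffer:
    """Toy reuse buffer: caches results keyed by (pc, operand values)."""

    def __init__(self, capacity=4):
        self.capacity = capacity      # hypothetical size, not from the paper
        self.entries = OrderedDict()  # (pc, operands) -> cached result
        self.hits = 0
        self.lookups = 0

    def execute(self, pc, operands, compute):
        """Return the cached result if present, else compute and cache it."""
        key = (pc, operands)
        self.lookups += 1
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)     # mark as most recently used
            return self.entries[key]
        result = compute(*operands)
        self.entries[key] = result
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return result


if __name__ == "__main__":
    rb = ReuseBuffer(capacity=2)
    add = lambda a, b: a + b
    rb.execute(0x10, (1, 2), add)  # miss: computed and stored
    rb.execute(0x10, (1, 2), add)  # hit: same pc and operands
    print(rb.hits, rb.lookups)
```

Interference in such a table arises when unrelated instructions evict each other's entries, which is the effect the abstract's second question refers to.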

Relevance:

20.00% 20.00%

Publisher:

Abstract:

Most of the existing WCET estimation methods directly estimate execution time, ET, in cycles. We propose to study ET as a product of two factors, ET = IC * CPI, where IC is instruction count and CPI is cycles per instruction. Directly estimating ET may lead to a highly pessimistic estimate, since these methods may implicitly be using worst-case IC and worst-case CPI. We hypothesize that there exists a functional relationship between CPI and IC such that CPI = f(IC). This is ascertained by computing the covariance matrix and studying the scatter plots of CPI versus IC. IC and CPI values are obtained by running benchmarks with a large number of inputs using the cycle-accurate architectural simulator SimpleScalar on two different architectures. It is shown that the benchmarks can be grouped into different classes based on the CPI versus IC relationship. For some benchmarks, such as FFT and FIR, both IC and CPI are almost constant irrespective of the input. There are other benchmarks that exhibit a direct or an inverse relationship between CPI and IC. In such a case, one can predict CPI for a given IC as CPI = f(IC). We derive the theoretical worst-case IC for a program, denoted as SWIC, using integer linear programming (ILP) and estimate WCET as SWIC * f(SWIC). However, if CPI decreases sharply with IC, then the measured maximum cycle count is observed to be a better estimate. For certain other benchmarks, it is observed that the CPI versus IC relationship is either random or CPI remains constant with varying IC. In such cases, WCET is estimated as the product of SWIC and measured maximum CPI. It is observed that use of the proposed method results in tighter WCET estimates than Chronos, a static WCET analyzer, for most benchmarks for the two architectures considered in this paper.
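
The arithmetic behind the two estimates can be sketched as follows. The abstract does not state the form of f, so this sketch assumes a straight-line least-squares fit, CPI = a + b * IC, purely for illustration; the function names and the sample measurements are hypothetical and not taken from the paper.

```python
from statistics import mean


def fit_linear(ics, cpis):
    """Least-squares fit of CPI = a + b * IC (assumed form); returns (a, b)."""
    ic_bar, cpi_bar = mean(ics), mean(cpis)
    sxx = sum((x - ic_bar) ** 2 for x in ics)
    sxy = sum((x - ic_bar) * (y - cpi_bar) for x, y in zip(ics, cpis))
    b = sxy / sxx if sxx else 0.0  # flat line if every IC is identical
    a = cpi_bar - b * ic_bar
    return a, b


def estimate_wcet(ics, cpis, swic, use_fit=True):
    """WCET in cycles for a worst-case instruction count `swic`.

    use_fit=True  -> SWIC * f(SWIC), for benchmarks with a CPI-IC relationship
    use_fit=False -> SWIC * max measured CPI, for random or constant CPI
    """
    if use_fit:
        a, b = fit_linear(ics, cpis)
        return swic * (a + b * swic)
    return swic * max(cpis)


if __name__ == "__main__":
    # Made-up measurements in which CPI falls as IC grows.
    ics = [100, 200, 300]
    cpis = [2.0, 1.8, 1.6]
    print(estimate_wcet(ics, cpis, swic=400))                 # fitted CPI
    print(estimate_wcet(ics, cpis, swic=400, use_fit=False))  # max CPI
```

With these made-up numbers the fitted estimate is lower than the one using the maximum measured CPI, which illustrates why combining worst-case IC with worst-case CPI can be pessimistic.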

Relevance:

20.00% 20.00%

Publisher:

Abstract:

In this paper we present a framework for realizing arbitrary instruction set extensions (IEs) that are identified post-silicon. The proposed framework has two components, viz. an IE synthesis methodology and the architecture of a reconfigurable data-path for the realization of such IEs. The IE synthesis methodology ensures maximal utilization of resources on the reconfigurable data-path. In this context we present the techniques used to realize IEs for applications that demand high throughput or that must process data streams. The reconfigurable hardware, called HyperCell, comprises a reconfigurable execution fabric. The fabric is a collection of interconnected compute units. A typical use case of HyperCell is one where it acts as a co-processor with a host and accelerates execution of IEs that are defined post-silicon. We demonstrate the effectiveness of our approach by evaluating the performance of some well-known integer kernels that are realized as IEs on HyperCell. Our methodology for realizing IEs through HyperCells permits overlapping of potentially all memory transactions with computations. We show significant improvement in performance for streaming applications over general-purpose-processor-based solutions, by fully pipelining the data-path. (C) 2014 Elsevier B.V. All rights reserved.

Relevance:

20.00% 20.00%

Publisher:

Abstract:

In this paper we present HyperCell as a reconfigurable datapath for Instruction Extensions (IEs). HyperCell comprises an array of compute units laid over a switch network. We present an IE synthesis methodology that enables post-silicon realization of IE datapaths on HyperCell. The synthesis methodology optimally exploits hardware resources in HyperCell to enable software-pipelined execution of IEs. Exploiting temporal reuse of data in HyperCell significantly reduces its input/output bandwidth requirements.