686 resultados para Processors
Resumo:
In this paper, we present an improved load distribution strategy, for arbitrarily divisible processing loads, to minimize the processing time in a distributed linear network of communicating processors by an efficient utilization of their front-ends. Closed-form solutions are derived, with the processing load originating at the boundary and at the interior of the network, under some important conditions on the arrangement of processors and links in the network. Asymptotic analysis is carried out to explore the ultimate performance limits of such networks. Two important theorems are stated regarding the optimal load sequence and the optimal load origination point. Comparative study of this new strategy with an earlier strategy is also presented.
Resumo:
The microcommands constituting the microprogram of the control memory of a microprogrammed processor can be partitioned into a number of disjoint sets. Some of these sets are then encoded to minimize the word width of the ROM storing the microprogram. A further reduction in the width of the ROM words can be achieved by a technique known as bit steering where one or more bits are shared by two or more sets of microcommands. These sets are called the steerable sets. This correspondence presents a simple method for the detection and encoding of steerable sets. It has been shown that the concurrency matrix of two steerable sets exhibits definite patterns of clusters which can be easily recognized. A relation "connection" has been defined which helps in the detection of three-set steerability. Once steerable sets are identified, their encoding becomes a straightforward procedure following the location of the identifying clusters on the concurrency matrix or matrices.
Resumo:
A major concern of embedded system architects is the design for low power. We address one aspect of the problem in this paper, namely the effect of executable code compression. There are two benefits of code compression – firstly, a reduction in the memory footprint of embedded software, and secondly, potential reduction in memory bus traffic and power consumption. Since decompression has to be performed at run time it is achieved by hardware. We describe a tool called COMPASS which can evaluate a range of strategies for any given set of benchmarks and display compression ratios. Also, given an execution trace, it can compute the effect on bus toggles, and cache misses for a range of compression strategies. The tool is interactive and allows the user to vary a set of parameters, and observe their effect on performance. We describe an implementation of the tool and demonstrate its effectiveness. To the best of our knowledge this is the first tool proposed for such a purpose.
Resumo:
Clustered architecture processors are preferred for embedded systems because centralized register file architectures scale poorly in terms of clock rate, chip area, and power consumption. Although clustering helps by improving clock speed, reducing energy consumption of the logic, and making the design simpler, it introduces extra overheads by way of inter-cluster communication. This communication happens over long global wires which leads to delay in execution and significantly high energy consumption.In this paper, we propose a new instruction scheduling algorithm that exploits scheduling slacks of instructions and communication slacks of data values together to achieve better energy-performance trade-offs for clustered architectures with heterogeneous interconnect. Our instruction scheduling algorithm achieves 35% and 40% reduction in communication energy, whereas the overall energy-delay product improves by 4.5% and 6.5% respectively for 2 cluster and 4 cluster machines with marginal increase (1.6% and 1.1%) in execution time. Our test bed uses the Trimaran compiler infrastructure.
Resumo:
The present work concerns with the static scheduling of jobs to parallel identical batch processors with incompatible job families for minimizing the total weighted tardiness. This scheduling problem is applicable in burn-in operations and wafer fabrication in semiconductor manufacturing. We decompose the problem into two stages: batch formation and batch scheduling, as in the literature. The Ant Colony Optimization (ACO) based algorithm called ATC-BACO algorithm is developed in which ACO is used to solve the batch scheduling problems. Our computational experimentation shows that the proposed ATC-BACO algorithm performs better than the available best traditional dispatching rule called ATC-BATC rule.
Resumo:
Catalytic combustion of H-2 was carried out over combustion synthesized noble metal (Pd or Pt) ion-substituted CeO2 based catalysts using a feed stream that simulated exhaust gases from a fuel cell processor The catalysts showed a high activity for H-2-combustion and complete conversion was achieved below 200 C over all the catalysts when O-2 was used in a stoichiometric amount With higher amounts of O-2 the reaction rates Increased and complete conversions were possible below 100 C The reaction was also carried out over Pd-impregnated CeO2 The conversions of H-2 with stoichiometric amount of O-2 were found to be higher over Pd-substituted compound The mechanism of the reaction over noble metal-substituted compounds was proposed on the basis of X-ray photoelectron spectroscopy studies The redox couples between Ce and metal ions were established and a dual site redox mechanism was pi posed for the reaction (C) 2010 Elsevier B V All rights reserved
Resumo:
Trace processors rely on hierarchy, replication, and prediction to dramatically increase the execution speed of ordinary sequential programs. The authors describe some of the processors will meet future technology demands.
Resumo:
Digest caches have been proposed as an effective method tospeed up packet classification in network processors. In this paper, weshow that the presence of a large number of small flows and a few largeflows in the Internet has an adverse impact on the performance of thesedigest caches. In the Internet, a few large flows transfer a majority ofthe packets whereas the contribution of several small flows to the totalnumber of packets transferred is small. In such a scenario, the LRUcache replacement policy, which gives maximum priority to the mostrecently accessed digest, tends to evict digests belonging to the few largeflows. We propose a new cache management algorithm called SaturatingPriority (SP) which aims at improving the performance of digest cachesin network processors by exploiting the disparity between the number offlows and the number of packets transferred. Our experimental resultsdemonstrate that SP performs better than the widely used LRU cachereplacement policy in size constrained caches. Further, we characterizethe misses experienced by flow identifiers in digest caches.
Resumo:
MATLAB is an array language, initially popular for rapid prototyping, but is now being increasingly used to develop production code for numerical and scientific applications. Typical MATLAB programs have abundant data parallelism. These programs also have control flow dominated scalar regions that have an impact on the program's execution time. Today's computer systems have tremendous computing power in the form of traditional CPU cores and throughput oriented accelerators such as graphics processing units(GPUs). Thus, an approach that maps the control flow dominated regions to the CPU and the data parallel regions to the GPU can significantly improve program performance. In this paper, we present the design and implementation of MEGHA, a compiler that automatically compiles MATLAB programs to enable synergistic execution on heterogeneous processors. Our solution is fully automated and does not require programmer input for identifying data parallel regions. We propose a set of compiler optimizations tailored for MATLAB. Our compiler identifies data parallel regions of the program and composes them into kernels. The problem of combining statements into kernels is formulated as a constrained graph clustering problem. Heuristics are presented to map identified kernels to either the CPU or GPU so that kernel execution on the CPU and the GPU happens synergistically and the amount of data transfer needed is minimized. In order to ensure required data movement for dependencies across basic blocks, we propose a data flow analysis and edge splitting strategy. Thus our compiler automatically handles composition of kernels, mapping of kernels to CPU and GPU, scheduling and insertion of required data transfer. The proposed compiler was implemented and experimental evaluation using a set of MATLAB benchmarks shows that our approach achieves a geometric mean speedup of 19.8X for data parallel benchmarks over native execution of MATLAB.
Resumo:
Network processors today consist of multiple parallel processors (micro engines) with support for multiple threads to exploit packet level parallelism inherent in network workloads. With such concurrency, packet ordering at the output of the network processor cannot be guaranteed. This paper studies the effect of concurrency in network processors on packet ordering. We use a validated Petri net model of a commercial network processor, Intel IXP 2400, to determine the extent of packet reordering for IPv4 forwarding application. Our study indicates that in addition to the parallel processing in the network processor, the allocation scheme for the transmit buffer also adversely impacts packet ordering. In particular, our results reveal that these packet reordering results in a packet retransmission rate of up to 61%. We explore different transmit buffer allocation schemes namely, contiguous, strided, local, and global which reduces the packet retransmission to 24%. We propose an alternative scheme, packet sort, which guarantees complete packet ordering while achieving a throughput of 2.5 Gbps. Further, packet sort outperforms the in-built packet ordering schemes in the IXP processor by up to 35%.
Resumo:
Large instruction windows and issue queues are key to exploiting greater instruction level parallelism in out-of-order superscalar processors. However, the cycle time and energy consumption of conventional large monolithic issue queues are high. Previous efforts to reduce cycle time segment the issue queue and pipeline wakeup. Unfortunately, this results in significant IPC loss. Other proposals which address energy efficiency issues by avoiding only the unnecessary tag-comparisons do not reduce broadcasts. These schemes also increase the issue latency.To address both these issues comprehensively, we propose the Scalable Lowpower Issue Queue (SLIQ). SLIQ augments a pipelined issue queue with direct indexing to mitigate the problem of delayed wakeups while reducing the cycle time. Also, the SLIQ design naturally leads to significant energy savings by reducing both the number of tag broadcasts and comparisons required.A 2 segment SLIQ incurs an average IPC loss of 0.2% over the entire SPEC CPU2000 suite, while achieving a 25.2% reduction in issue latency when compared to a monolithic 128-entry issue queue for an 8-wide superscalar processor. An 8 segment SLIQ improves scalability by reducing the issue latency by 38.3% while incurring an IPC loss of only 2.3%. Further, the 8 segment SLIQ significantly reduces the energy consumption and energy-delay product by 48.3% and 67.4% respectively on average.