942 resultados para 291605 Processor Architectures
Resumo:
Network processors today consist of multiple parallel processors (micro engines) with support for multiple threads to exploit packet level parallelism inherent in network workloads. With such concurrency, packet ordering at the output of the network processor cannot be guaranteed. This paper studies the effect of concurrency in network processors on packet ordering. We use a validated Petri net model of a commercial network processor, Intel IXP 2400, to determine the extent of packet reordering for IPv4 forwarding application. Our study indicates that in addition to the parallel processing in the network processor, the allocation scheme for the transmit buffer also adversely impacts packet ordering. In particular, our results reveal that these packet reordering results in a packet retransmission rate of up to 61%. We explore different transmit buffer allocation schemes namely, contiguous, strided, local, and global which reduces the packet retransmission to 24%. We propose an alternative scheme, packet sort, which guarantees complete packet ordering while achieving a throughput of 2.5 Gbps. Further, packet sort outperforms the in-built packet ordering schemes in the IXP processor by up to 35%.
Resumo:
We propose a new abstract domain for static analysis of executable code. Concrete states are abstracted using circular linear progressions (CLPs). CLPs model computations using a finite word length as is seen in any real life processor. The finite abstraction allows handling overflow scenarios in a natural and straight-forward manner. Abstract transfer functions have been defined for a wide range of operations which makes this domain easily applicable for analyzing code for a wide range of ISAs. CLPs combine the scalability of interval domains with the discreteness of linear congruence domains. We also present a novel, lightweight method to track linear equality relations between static objects that is used by the analysis to improve precision. The analysis is efficient, the total space and time overhead being quadratic in the number of static objects being tracked.
Resumo:
This paper describes techniques to estimate the worst case execution time of executable code on architectures with data caches. The underlying mechanism is Abstract Interpretation, which is used for the dual purposes of tracking address computations and cache behavior. A simultaneous numeric and pointer analysis using an abstraction for discrete sets of values computes safe approximations of access addresses which are then used to predict cache behavior using Must Analysis. A heuristic is also proposed which generates likely worst case estimates. It can be used in soft real time systems and also for reasoning about the tightness of the safe estimate. The analysis methods can handle programs with non-affine access patterns, for which conventional Presburger Arithmetic formulations or Cache Miss Equations do not apply. The precision of the estimates is user-controlled and can be traded off against analysis time. Executables are analyzed directly, which, apart from enhancing precision, renders the method language independent.
INTACTE: An Interconnect Area, Delay, and Energy Estimation Tool for Microarchitectural Explorations
Resumo:
Prior work on modeling interconnects has focused on optimizing the wire and repeater design for trading off energy and delay, and is largely based on low level circuit parameters. Hence these models are hard to use directly to make high level microarchitectural trade-offs in the initial exploration phase of a design. In this paper, we propose INTACTE, a tool that can be used by architects toget reasonably accurate interconnect area, delay, and power estimates based on a few architecture level parameters for the interconnect such as length, width (in number of bits), frequency, and latency for a specified technology and voltage. The tool uses well known models of interconnect delay and energy taking into account the wire pitch, repeater size, and spacing for a range of voltages and technologies.It then solves an optimization problem of finding the lowest energy interconnect design in terms of the low level circuit parameters, which meets the architectural constraintsgiven as inputs. In addition, the tool also provides the area, energy, and delay for a range of supply voltages and degrees of pipelining, which can be used for micro-architectural exploration of a chip. The delay and energy models used by the tool have been validated against low level circuit simulations. We discuss several potential applications of the tool and present an example of optimizing interconnect design in the context of clustered VLIW architectures. Copyright 2007 ACM.
Resumo:
The paper presents an adaptive Fourier filtering technique and a relaying scheme based on a combination of a digital band-pass filter along with a three-sample algorithm, for applications in high-speed numerical distance protection. To enhance the performance of above-mentioned technique, a high-speed fault detector has been used. MATLAB based simulation studies show that the adaptive Fourier filtering technique provides fast tripping for near faults and security for farther faults. The digital relaying scheme based on a combination of digital band-pass filter along with three-sample data window algorithm also provides accurate and high-speed detection of faults. The paper also proposes a high performance 16-bit fixed point DSP (Texas Instruments TMS320LF2407A) processor based hardware scheme suitable for implementation of the above techniques. To evaluate the performance of the proposed relaying scheme under steady state and transient conditions, PC based menu driven relay test procedures are developed using National Instruments LabVIEW software. The test signals are generated in real time using LabVIEW compatible analog output modules. The results obtained from the simulation studies as well as hardware implementations are also presented.
Resumo:
The memory subsystem is a major contributor to the performance, power, and area of complex SoCs used in feature rich multimedia products. Hence, memory architecture of the embedded DSP is complex and usually custom designed with multiple banks of single-ported or dual ported on-chip scratch pad memory and multiple banks of off-chip memory. Building software for such large complex memories with many of the software components as individually optimized software IPs is a big challenge. In order to obtain good performance and a reduction in memory stalls, the data buffers of the application need to be placed carefully in different types of memory. In this paper we present a unified framework (MODLEX) that combines different data layout optimizations to address the complex DSP memory architectures. Our method models the data layout problem as multi-objective genetic algorithm (GA) with performance and power being the objectives and presents a set of solution points which is attractive from a platform design viewpoint. While most of the work in the literature assumes that performance and power are non-conflicting objectives, our work demonstrates that there is significant trade-off (up to 70%) that is possible between power and performance.
Resumo:
Abstract—A new breed of processors like the Cell Broadband Engine, the Imagine stream processor and the various GPU processors emphasize data-level parallelism (DLP) and threadlevel parallelism (TLP) as opposed to traditional instructionlevel parallelism (ILP). This allows them to achieve order-ofmagnitude improvements over conventional superscalar processors for many workloads. However, it is unclear as to how much parallelism of these types exists in current programs. Most earlier studies have largely concentrated on the amount of ILP in a program, without differentiating DLP or TLP. In this study, we investigate the extent of data-level parallelism available in programs in the MediaBench suite. By packing instructions in a SIMD fashion, we observe reductions of up to 91 % (84 % on average) in the number of dynamic instructions, indicating a very high degree of DLP in several applications. I.
Resumo:
Community Climate System Model (CCSM) is a Multiple Program Multiple Data (MPMD) parallel global climate model comprising atmosphere, ocean, land, ice and coupler components. The simulations have a time-step of the order of tens of minutes and are typically performed for periods of the order of centuries. These climate simulations are highly computationally intensive and can take several days to weeks to complete on most of today’s multi-processor systems. ExecutingCCSM on grids could potentially lead to a significant reduction in simulation times due to the increase in number of processors. However, in order to obtain performance gains on grids, several challenges have to be met. In this work,we describe our load balancing efforts in CCSM to make it suitable for grid enabling.We also identify the various challenges in executing CCSM on grids. Since CCSM is an MPI application, we also describe our current work on building a MPI implementation for grids to grid-enable CCSM.
Resumo:
Large instruction windows and issue queues are key to exploiting greater instruction level parallelism in out-of-order superscalar processors. However, the cycle time and energy consumption of conventional large monolithic issue queues are high. Previous efforts to reduce cycle time segment the issue queue and pipeline wakeup. Unfortunately, this results in significant IPC loss. Other proposals which address energy efficiency issues by avoiding only the unnecessary tag-comparisons do not reduce broadcasts. These schemes also increase the issue latency.To address both these issues comprehensively, we propose the Scalable Lowpower Issue Queue (SLIQ). SLIQ augments a pipelined issue queue with direct indexing to mitigate the problem of delayed wakeups while reducing the cycle time. Also, the SLIQ design naturally leads to significant energy savings by reducing both the number of tag broadcasts and comparisons required.A 2 segment SLIQ incurs an average IPC loss of 0.2% over the entire SPEC CPU2000 suite, while achieving a 25.2% reduction in issue latency when compared to a monolithic 128-entry issue queue for an 8-wide superscalar processor. An 8 segment SLIQ improves scalability by reducing the issue latency by 38.3% while incurring an IPC loss of only 2.3%. Further, the 8 segment SLIQ significantly reduces the energy consumption and energy-delay product by 48.3% and 67.4% respectively on average.
Resumo:
Packet forwarding is a memory-intensive application requiring multiple accesses through a trie structure. The efficiency of a cache for this application critically depends on the placement function to reduce conflict misses. Traditional placement functions use a one-level mapping that naively partitions trie-nodes into cache sets. However, as a significant percentage of trie nodes are not useful, these schemes suffer from a non-uniform distribution of useful nodes to sets. This in turn results in increased conflict misses. Newer organizations such as variable associativity caches achieve flexibility in placement at the expense of increased hit-latency. This makes them unsuitable for L1 caches.We propose a novel two-level mapping framework that retains the hit-latency of one-level mapping yet incurs fewer conflict misses. This is achieved by introducing a secondlevel mapping which reorganizes the nodes in the naive initial partitions into refined partitions with near-uniform distribution of nodes. Further as this remapping is accomplished by simply adapting the index bits to a given routing table the hit-latency is not affected. We propose three new schemes which result in up to 16% reduction in the number of misses and 13% speedup in memory access time. In comparison, an XOR-based placement scheme known to perform extremely well for general purpose architectures, can obtain up to 2% speedup in memory access time.
Resumo:
Workstation clusters equipped with high performance interconnect having programmable network processors facilitate interesting opportunities to enhance the performance of parallel application run on them. In this paper, we propose schemes where certain application level processing in parallel database query execution is performed on the network processor. We evaluate the performance of TPC-H queries executing on a high end cluster where all tuple processing is done on the host processor, using a timed Petri net model, and find that tuple processing costs on the host processor dominate the execution time. These results are validated using a small cluster. We therefore propose 4 schemes where certain tuple processing activity is offloaded to the network processor. The first 2 schemes offload the tuple splitting activity - computation to identify the node on which to process the tuples, resulting in an execution time speedup of 1.09 relative to the base scheme, but with I/O bus becoming the bottleneck resource. In the 3rd scheme in addition to offloading tuple processing activity, the disk and network interface are combined to avoid the I/O bus bottleneck, which results in speedups up to 1.16, but with high host processor utilization. Our 4th scheme where the network processor also performs apart of join operation along with the host processor, gives a speedup of 1.47 along with balanced system resource utilizations. Further we observe that the proposed schemes perform equally well even in a scaled architecture i.e., when the number of processors is increased from 2 to 64
Resumo:
Designing and optimizing high performance microprocessors is an increasingly difficult task due to the size and complexity of the processor design space, high cost of detailed simulation and several constraints that a processor design must satisfy. In this paper, we propose the use of empirical non-linear modeling techniques to assist processor architects in making design decisions and resolving complex trade-offs. We propose a procedure for building accurate non-linear models that consists of the following steps: (i) selection of a small set of representative design points spread across processor design space using latin hypercube sampling, (ii) obtaining performance measures at the selected design points using detailed simulation, (iii) building non-linear models for performance using the function approximation capabilities of radial basis function networks, and (iv) validating the models using an independently and randomly generated set of design points. We evaluate our model building procedure by constructing non-linear performance models for programs from the SPEC CPU2000 benchmark suite with a microarchitectural design space that consists of 9 key parameters. Our results show that the models, built using a relatively small number of simulations, achieve high prediction accuracy (only 2.8% error in CPI estimates on average) across a large processor design space. Our models can potentially replace detailed simulation for common tasks such as the analysis of key microarchitectural trends or searches for optimal processor design points.
Resumo:
We propose the design and implementation of hardware architecture for spatial prediction based image compression scheme, which consists of prediction phase and quantization phase. In prediction phase, the hierarchical tree structure obtained from the test image is used to predict every central pixel of an image by its four neighboring pixels. The prediction scheme generates an error image, to which the wavelet/sub-band coding algorithm can be applied to obtain efficient compression. The software model is tested for its performance in terms of entropy, standard deviation. The memory and silicon area constraints play a vital role in the realization of the hardware for hand-held devices. The hardware architecture is constructed for the proposed scheme, which involves the aspects of parallelism in instructions and data. The processor consists of pipelined functional units to obtain the maximum throughput and higher speed of operation. The hardware model is analyzed for performance in terms throughput, speed and power. The results of hardware model indicate that the proposed architecture is suitable for power constrained implementations with higher data rate
Resumo:
H.264 is a video codec standard which delivers high resolution video even at low bit rates. To provide high throughput at low bit rates hardware implementations are essential. In this paper, we propose hardware implementations for speed and area optimized DCT and quantizer modules. To target above criteria we propose two architectures. First architecture is speed optimized which gives a high throughput and can meet requirements of 4096x2304 frame at 30 frames/sec. Second architecture is area optimized and occupies 2009 LUTs in Altera’s stratix-II and can meet the requirements of 1080HD at 30 frames/sec.