986 resultados para Parallel Architectures


Relevância:

30.00% 30.00%

Publicador:

Resumo:

Conventional parallel computer architectures do not provide support for non-uniformly distributed objects. In this thesis, I introduce sparsely faceted arrays (SFAs), a new low-level mechanism for naming regions of memory, or facets, on different processors in a distributed, shared memory parallel processing system. Sparsely faceted arrays address the disconnect between the global distributed arrays provided by conventional architectures (e.g. the Cray T3 series), and the requirements of high-level parallel programming methods that wish to use objects that are distributed over only a subset of processing elements. A sparsely faceted array names a virtual globally-distributed array, but actual facets are lazily allocated. By providing simple semantics and making efficient use of memory, SFAs enable efficient implementation of a variety of non-uniformly distributed data structures and related algorithms. I present example applications which use SFAs, and describe and evaluate simple hardware mechanisms for implementing SFAs. Keeping track of which nodes have allocated facets for a particular SFA is an important task that suggests the need for automatic memory management, including garbage collection. To address this need, I first argue that conventional tracing techniques such as mark/sweep and copying GC are inherently unscalable in parallel systems. I then present a parallel memory-management strategy, based on reference-counting, that is capable of garbage collecting sparsely faceted arrays. I also discuss opportunities for hardware support of this garbage collection strategy. I have implemented a high-level hardware/OS simulator featuring hardware support for sparsely faceted arrays and automatic garbage collection. I describe the simulator and outline a few of the numerous details associated with a "real" implementation of SFAs and SFA-aware garbage collection. Simulation results are used throughout this thesis in the evaluation of hardware support mechanisms.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

We propose a data flow based run time system as an efficient tool for supporting execution of parallel code on heterogeneous architectures hosting both multicore CPUs and GPUs. We discuss how the proposed run time system may be the target of both structured parallel applications developed using algorithmic skeletons/parallel design patterns and also more "domain specific" programming models. Experimental results demonstrating the feasibility of the approach are presented. © 2012 World Scientific Publishing Company.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This paper investigates the implementation of a number of circuits used to perform a high speed closest value match lookup. The design is targeted particularly for use in a search trie, as used in various networking lookup applications, but can be applied to many other areas where such a match is required. A range of different designs have been considered and implemented on FPGA. A detailed description of the architectures investigated is followed by an analysis of the synthesis results. © 2006 IEEE.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Recently, a number of most significant digit (msd) first bit parallel multipliers for recursive filtering have been reported. However, the design approach which has been used has, in general, been heuristic and consequently, optimality has not always been assured. In this paper, msd first multiply accumulate algorithms are described and important relationships governing the dependencies between latency, number representations, etc are derived. A more systematic approach to designing recursive filters is illustrated by applying the algorithms and associated relationships to the design of cascadable modules for high sample rate IIR filtering and wave digital filtering.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The real time implementation of an efficient signal compression technique, Vector Quantization (VQ), is of great importance to many digital signal coding applications. In this paper, we describe a new family of bit level systolic VLSI architectures which offer an attractive solution to this problem. These architectures are based on a bit serial, word parallel approach and high performance and efficiency can be achieved for VQ applications of a wide range of bandwidths. Compared with their bit parallel counterparts, these bit serial circuits provide better alternatives for VQ implementations in terms of performance and cost. © 1995 Kluwer Academic Publishers.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

A number of high-performance VLSI architectures for real-time image coding applications are described. In particular, attention is focused on circuits for computing the 2-D DCT (discrete cosine transform) and for 2-D vector quantization. The former circuits are based on Winograd algorithms and comprise a number of bit-level systolic arrays with a bit-serial, word-parallel input. The latter circuits exhibit a similar data organization and consist of a number of inner product array circuits. Both circuits are highly regular and allow extremely high data rates to be achieved through extensive use of parallelism.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Several novel systolic architectures for implementing densely pipelined bit parallel IIR filter sections are presented. The fundamental problem of latency in the feedback loop is overcome by employing redundant arithmetic in combination with bit-level feedback, allowing a basic first-order section to achieve a wordlength-independent latency of only two clock cycles. This is extended to produce a building block from which higher order sections can be constructed. The architecture is then refined by combining the use of both conventional and redundant arithmetic, resulting in two new structures offering substantial hardware savings over the original design. In contrast to alternative techniques, bit-level pipelinability is achieved with no net cost in hardware. © 1989 Kluwer Academic Publishers.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Optimized circuits for implementing high-performance bit-parallel IIR filters are presented. Circuits constructed mainly from simple carry save adders and based on most-significant-bit (MSB) first arithmetic are described. Two methods resulting in systems which are 100% efficient in that they are capable of sampling data every cycle are presented. In the first approach the basic circuit is modified so that the level of pipelining used is compatible with the small, but fixed, latency associated with the computation in question. This is achieved through insertion of pipeline delays (half latches) on every second row of cells. This produces an area-efficient solution in which the throughput rate is determined by a critical path of 76 gate delays. A second approach combines the MSB first arithmetic methods with the scattered look-ahead methods. Important design issues are addressed, including wordlength truncation, overflow detection, and saturation.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

We describe recent progress of an ongoing research programme aimed at producing computational science software that can exploit high performance architectures in the atomic physics application domain. We examine the computational bottleneck of matrix construction in a suite of two-dimensional R-matrix propagation programs, 2DRMP, that are aimed at creating virtual electron collision experiments on HPC architectures. We build on Ixaru's extended frequency dependent quadrature rules (EFDQR) for Slater integrals and examine the challenge of constructing Hamiltonian matrices in parallel across an m-processor compute node in a block cyclic distribution for subsequent diagonalization by ScaLAPACK.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

An approach to the management of non-functional concerns in massively parallel and/or distributed architectures that marries parallel programming patterns with autonomic computing is presented. The necessity and suitability of the adoption of autonomic techniques are evidenced. Issues arising in the implementation of autonomic managers taking care of multiple concerns and of coordination among hierarchies of such autonomic managers are discussed. Experimental results are presented that demonstrate the feasibility of the approach.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Data flow techniques have been around since the early '70s when they were used in compilers for sequential languages. Shortly after their introduction they were also consideredas a possible model for parallel computing, although the impact here was limited. Recently, however, data flow has been identified as a candidate for efficient implementation of various programming models on multi-core architectures. In most cases, however, the burden of determining data flow "macro" instructions is left to the programmer, while the compiler/run time system manages only the efficient scheduling of these instructions. We discuss a structured parallel programming approach supporting automatic compilation of programs to macro data flow and we show experimental results demonstrating the feasibility of the approach and the efficiency of the resulting "object" code on different classes of state-of-the-art multi-core architectures. The experimental results use different base mechanisms to implement the macro data flow run time support, from plain pthreads with condition variables to more modern and effective lock- and fence-free parallel frameworks. Experimental results comparing efficiency of the proposed approach with those achieved using other, more classical, parallel frameworks are also presented. © 2012 IEEE.