960 resultados para middleware architectures


Relevância:

10.00% 10.00%

Publicador:

Resumo:

We describe recent progress of an ongoing research programme aimed at producing computational science software that can exploit high performance architectures in the atomic physics application domain. We examine the computational bottleneck of matrix construction in a suite of two-dimensional R-matrix propagation programs, 2DRMP, that are aimed at creating virtual electron collision experiments on HPC architectures. We build on Ixaru's extended frequency dependent quadrature rules (EFDQR) for Slater integrals and examine the challenge of constructing Hamiltonian matrices in parallel across an m-processor compute node in a block cyclic distribution for subsequent diagonalization by ScaLAPACK.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

This paper presents a thorough investigation of the combined allocator design for Networks-on-Chip (NoC). Particularly, we discuss the interlock of the combined NoC allocator, which is caused by the lock mechanism of priority updating between the local and global arbiters. Architectures and implementations of three interlock-free combined allocators are presented in detail. Their cost, critical path, as well as network level performance are demonstrated based on 65-nm standard cell technology.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

We propose a dynamic verification approach for large-scale message passing programs to locate correctness bugs caused by unforeseen nondeterministic interactions. This approach hinges on an efficient protocol to track the causality between nondeterministic message receive operations and potentially matching send operations. We show that causality tracking protocols that rely solely on logical clocks fail to capture all nuances of MPI program behavior, including the variety of ways in which nonblocking calls can complete. Our approach is hinged on formally defining the matches-before relation underlying the MPI standard, and devising lazy update logical clock based algorithms that can correctly discover all potential outcomes of nondeterministic receives in practice. can achieve the same coverage as a vector clock based algorithm while maintaining good scalability. LLCP allows us to analyze realistic MPI programs involving a thousand MPI processes, incurring only modest overheads in terms of communication bandwidth, latency, and memory consumption. © 2011 IEEE.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Chapter eleven on Mm-wave broadband wireless systems and enabling MMIC technologies, is contributed by Jian Zhang, Mury Thian, Guochi Huang, George Goussetis and Vincent F. Fusco, from Queen's University Belfast, UK. Millimeter wave bands provide large available bandwidths for high data rate wireless communication systems, which are envisaged to shift data throughput well in the GBps range. This capability has over past few years driven rapid developments in the technology underpinning broadband wireless systems as well as in the standardisation activity from various non-governmental consortia and the band allocation from spectrum regulators globally. This chapter provides an overview of the recent developments on V-band broadband wireless systems with the emphasis placed on enabling MMIC technologies. An overview of the key applications and available standards is presented. System-level architectures for broadband wireless applications are being reviewed. Examples of analysis, design and testing on MMIC components in SiGe BiCMOS are presented and the outlook of the technology is discussed.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

An approach to the management of non-functional concerns in massively parallel and/or distributed architectures that marries parallel programming patterns with autonomic computing is presented. The necessity and suitability of the adoption of autonomic techniques are evidenced. Issues arising in the implementation of autonomic managers taking care of multiple concerns and of coordination among hierarchies of such autonomic managers are discussed. Experimental results are presented that demonstrate the feasibility of the approach.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

With the rapid expansion of the internet and the increasing demand on Web servers, many techniques were developed to overcome the servers' hardware performance limitation. Mirrored Web Servers is one of the techniques used where a number of servers carrying the same "mirrored" set of services are deployed. Client access requests are then distributed over the set of mirrored servers to even up the load. In this paper we present a generic reference software architecture for load balancing over mirrored web servers. The architecture was designed adopting the latest NaSr architectural style [1] and described using the ADLARS [2] architecture description language. With minimal effort, different tailored product architectures can be generated from the reference architecture to serve different network protocols and server operating systems. An example product system is described and a sample Java implementation is presented.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

FastFlow is a structured parallel programming framework targeting shared memory multi-core architectures. In this paper we introduce a FastFlow extension aimed at supporting also a network of multi-core workstations. The extension supports the execution of FastFlow programs by coordinating-in a structured way-the fine grain parallel activities running on a single workstation. We discuss the design and the implementation of this extension presenting preliminary experimental results validating it on state-of-the-art networked multi-core nodes. © 2013 Springer-Verlag.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

The use of efficient synchronization mechanisms is crucial for implementing fine grained parallel programs on modern shared cache multi-core architectures. In this paper we study this problem by considering Single-Producer/Single- Consumer (SPSC) coordination using unbounded queues. A novel unbounded SPSC algorithm capable of reducing the row synchronization latency and speeding up Producer-Consumer coordination is presented. The algorithm has been extensively tested on a shared-cache multi-core platform and a sketch proof of correctness is presented. The queues proposed have been used as basic building blocks to implement the FastFlow parallel framework, which has been demonstrated to offer very good performance for fine-grain parallel applications. © 2012 Springer-Verlag.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Data flow techniques have been around since the early '70s when they were used in compilers for sequential languages. Shortly after their introduction they were also consideredas a possible model for parallel computing, although the impact here was limited. Recently, however, data flow has been identified as a candidate for efficient implementation of various programming models on multi-core architectures. In most cases, however, the burden of determining data flow "macro" instructions is left to the programmer, while the compiler/run time system manages only the efficient scheduling of these instructions. We discuss a structured parallel programming approach supporting automatic compilation of programs to macro data flow and we show experimental results demonstrating the feasibility of the approach and the efficiency of the resulting "object" code on different classes of state-of-the-art multi-core architectures. The experimental results use different base mechanisms to implement the macro data flow run time support, from plain pthreads with condition variables to more modern and effective lock- and fence-free parallel frameworks. Experimental results comparing efficiency of the proposed approach with those achieved using other, more classical, parallel frameworks are also presented. © 2012 IEEE.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Task-based dataflow programming models and runtimes emerge as promising candidates for programming multicore and manycore architectures. These programming models analyze dynamically task dependencies at runtime and schedule independent tasks concurrently to the processing elements. In such models, cache locality, which is critical for performance, becomes more challenging in the presence of fine-grain tasks, and in architectures with many simple cores.

This paper presents a combined hardware-software approach to improve cache locality and offer better performance is terms of execution time and energy in the memory system. We propose the explicit bulk prefetcher (EBP) and epoch-based cache management (ECM) to help runtimes prefetch task data and guide the replacement decisions in caches. The runtimem software can use this hardware support to expose its internal knowledge about the tasks to the architecture and achieve more efficient task-based execution. Our combined scheme outperforms HW-only prefetchers and state-of-the-art replacement policies, improves performance by an average of 17%, generates on average 26% fewer L2 misses, and consumes on average 28% less energy in the components of the memory system.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

This paper describes the deployment on GPUs of PROP, a program of the 2DRMP suite which models electron collisions with H-like atoms and ions. Because performance on GPUs is better in single precision than in double precision, the numerical stability of the PROP program in single precision has been studied. The numerical quality of PROP results computed in single precision and their impact on the next program of the 2DRMP suite has been analyzed. Successive versions of the PROP program on GPUs have been developed in order to improve its performance. Particular attention has been paid to the optimization of data transfers and of linear algebra operations. Performance obtained on several architectures (including NVIDIA Fermi) are presented.