951 results for distributed shared memory
Abstract:
This paper introduces the stochastic version of the Geometric Machine Model for the modelling of sequential, alternative, parallel (synchronous) and nondeterministic computations with stochastic numbers stored in a (possibly infinite) shared memory. The programming language L(D! 1), induced by the Coherence Space of Processes D! 1, can be applied to sequential and parallel products in order to provide recursive definitions for such processes, together with a domain-theoretic semantics of the Stochastic Arithmetic. We analyze both the spatial (ordinal) recursion, related to spatial modelling of the stochastic memory, and the temporal (structural) recursion, given by the inclusion relation modelling partial objects in the ordered structure of process construction.
Abstract:
Processors with large numbers of cores are becoming commonplace. In order to utilise the available resources in such systems, the programming paradigm has to move towards increased parallelism. However, increased parallelism does not necessarily lead to better performance. Parallel programming models have to provide not only flexible ways of defining parallel tasks, but also efficient methods to manage the created tasks. Moreover, in a general-purpose system, applications residing in the system compete for the shared resources. Thread and task scheduling in such a multiprogrammed multithreaded environment is a significant challenge. In this thesis, we introduce a new task-based parallel reduction model, called the Glasgow Parallel Reduction Machine (GPRM). Our main objective is to provide high performance while maintaining ease of programming. GPRM supports native parallelism; it provides a modular way of expressing parallel tasks and the communication patterns between them. Compiling a GPRM program results in an Intermediate Representation (IR) containing useful information about tasks and their dependencies, as well as the initial mapping information. This compile-time information helps reduce the overhead of runtime task scheduling and is key to high performance. Generally speaking, the granularity and the number of tasks are major factors in achieving high performance. These factors are even more important in the case of GPRM, as it is highly dependent on tasks, rather than threads. We use three basic benchmarks to provide a detailed comparison of GPRM with Intel OpenMP, Cilk Plus, and Threading Building Blocks (TBB) on the Intel Xeon Phi, and with GNU OpenMP on the Tilera TILEPro64. GPRM shows superior performance in almost all cases, simply by controlling the number of tasks. GPRM also provides a low-overhead mechanism, called “Global Sharing”, which improves performance in multiprogramming situations. We use OpenMP, the most popular model for shared-memory parallel programming, as the main GPRM competitor for solving three well-known problems on both platforms: LU Factorisation of Sparse Matrices, Image Convolution, and Linked List Processing. We focus on proposing solutions that best fit GPRM’s model of execution. GPRM outperforms OpenMP in all cases on the TILEPro64. On the Xeon Phi, our solution for the LU Factorisation results in notable performance improvement for sparse matrices with large numbers of small blocks. We investigate the overhead of GPRM’s task creation and distribution for very short computations using the Image Convolution benchmark. We show that this overhead can be mitigated by combining smaller tasks into larger ones. As a result, GPRM can outperform OpenMP for convolving large 2D matrices on the Xeon Phi. Finally, we demonstrate that our parallel worksharing construct provides an efficient solution for Linked List Processing and performs better than OpenMP implementations on the Xeon Phi. The results are very promising, as they verify that our parallel programming framework for manycore processors is flexible and scalable, and can provide high performance without sacrificing productivity.
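To make the granularity argument concrete, the sketch below shows the general idea of combining smaller tasks into larger ones for a row-wise image filter. It uses plain OpenMP tasks in C++ purely for illustration; GPRM's own task constructs differ, and the names here (convolve_row, convolve_chunked, the chunk parameter) are illustrative assumptions, not taken from the thesis.

```cpp
#include <cstddef>
#include <vector>

// Trivial stand-in for the per-row work: a horizontal 3-point average
// confined to one row, so tasks over different rows do not interfere.
static void convolve_row(std::vector<float>& img, std::size_t w, std::size_t row) {
    for (std::size_t c = 1; c + 1 < w; ++c)
        img[row * w + c] =
            (img[row * w + c - 1] + img[row * w + c] + img[row * w + c + 1]) / 3.0f;
}

// Combine 'chunk' per-row tasks into one larger task, amortising task
// creation and scheduling overhead for very short computations.
void convolve_chunked(std::vector<float>& img, std::size_t w, std::size_t h,
                      std::size_t chunk) {
    #pragma omp parallel
    #pragma omp single
    for (std::size_t r0 = 0; r0 < h; r0 += chunk) {
        const std::size_t r1 = (r0 + chunk < h) ? r0 + chunk : h;
        #pragma omp task firstprivate(r0, r1) shared(img)
        for (std::size_t r = r0; r < r1; ++r)
            convolve_row(img, w, r);
    }
}
```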
Abstract:
A dissertation submitted in fulfillment of the requirements for the degree of Master in Computer Science and Computer Engineering
Abstract:
The symbolic and improvisational nature of livecoding requires a shared networking framework to be flexible and extensible, while at the same time providing support for synchronisation, persistence and redundancy. Above all, the framework should be robust and available across a range of platforms. This paper proposes tuple space as a suitable framework for network communication in ensemble livecoding contexts. The role of tuple space as a concurrency framework and the associated timing aspects of the tuple space model are explored through Spaces, an implementation of tuple space for the Impromptu environment.
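As a rough illustration of the tuple space model the paper builds on, here is a minimal in-process sketch of its three core operations (out, rd, in) in C++. It is a deliberately simplified, string-keyed analogue under stated assumptions; the actual Spaces implementation for Impromptu adds the network distribution, timing and persistence aspects that are not modelled here.

```cpp
#include <condition_variable>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

class TupleSpace {
    std::vector<std::pair<std::string, std::string>> tuples_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    // out: publish a tuple into the shared space.
    void out(const std::string& key, const std::string& value) {
        { std::lock_guard<std::mutex> lk(m_); tuples_.emplace_back(key, value); }
        cv_.notify_all();
    }
    // rd: blocking read of a matching tuple, leaving it in place.
    std::string rd(const std::string& key) {
        std::unique_lock<std::mutex> lk(m_);
        for (;;) {
            for (auto& t : tuples_)
                if (t.first == key) return t.second;
            cv_.wait(lk);
        }
    }
    // in: blocking read that also removes the matching tuple.
    std::string in(const std::string& key) {
        std::unique_lock<std::mutex> lk(m_);
        for (;;) {
            for (auto it = tuples_.begin(); it != tuples_.end(); ++it)
                if (it->first == key) {
                    std::string v = it->second;
                    tuples_.erase(it);
                    return v;
                }
            cv_.wait(lk);
        }
    }
};
```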
Abstract:
Programming for parallel architectures that do not have a shared address space is extremely difficult due to the need for explicit communication between memories of different compute devices. A heterogeneous system with CPUs and multiple GPUs, or a distributed-memory cluster, is an example of such a system. Past works that try to automate data movement for distributed-memory architectures can lead to excessive redundant communication. In this paper, we propose an automatic data movement scheme that minimizes the volume of communication between compute devices in heterogeneous and distributed-memory systems. We show that by partitioning data dependences in a particular non-trivial way, one can generate data movement code that results in the minimum volume for a vast majority of cases. The techniques are applicable to any sequence of affine loop nests and work on top of any choice of loop transformations, parallelization, and computation placement. The data movement code generated minimizes the volume of communication for a particular configuration of these. We combine powerful static analyses relying on the polyhedral compiler framework with lightweight runtime routines they generate to build a source-to-source transformation tool that automatically generates communication code. We demonstrate that the tool is scalable and leads to substantial gains in efficiency. On a heterogeneous system, the communication volume is reduced by a factor of 11X to 83X over state-of-the-art, translating into a mean execution time speedup of 1.53X. On a distributed-memory cluster, our scheme reduces the communication volume by a factor of 1.4X to 63.5X over state-of-the-art, resulting in a mean speedup of 1.55X. In addition, our scheme yields a mean speedup of 2.19X over hand-optimized UPC codes.
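The core idea, sending a receiver only the values it will actually read rather than a whole block or halo, can be worked out by hand for a 1D block-distributed array with a 3-point stencil. The flow_out function below is a toy, hand-written analogue of the per-receiver communication sets the tool derives automatically for arbitrary affine loop nests; the function name, the block distribution and the stencil are illustrative assumptions.

```cpp
#include <cstdio>
#include <set>

// Indices owned by rank p that rank q reads when every rank applies a
// 3-point stencil (a[i-1], a[i], a[i+1]) to its own block of an n-element array.
std::set<long> flow_out(long n, int nproc, int p, int q) {
    auto lo = [&](int r) { return (long)r * n / nproc; };       // first owned index
    auto hi = [&](int r) { return (long)(r + 1) * n / nproc; }; // one past the last
    std::set<long> out;
    for (long i = lo(q); i < hi(q); ++i)
        for (long j : {i - 1, i + 1})
            if (j >= 0 && j < n && j >= lo(p) && j < hi(p))
                out.insert(j);
    return out;
}

int main() {
    // Rank 1 owns [25,50) of a 100-element array; rank 2 only ever reads
    // element 49 from it, so that single value is all that must be sent.
    for (long j : flow_out(100, 4, 1, 2))
        std::printf("%ld\n", j);
    return 0;
}
```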
Abstract:
Background: Failure to convey time-critical information to team members during surgery diminishes members’ perception of the dynamic information relevant to their task, and compromises shared situational awareness. This research reports the dialog around clinical decisions made by team members in the time-pressured and high-risk context of surgery, and the impact of these communications on shared situational awareness. Methods: Fieldwork methods were used to capture the dynamic integration of individual and situational elements in surgery that provided the backdrop for clinical decisions. Nineteen semi-structured interviews were performed with 24 participants from anaesthesia, surgery, and nursing in the operating rooms of a large metropolitan hospital in Queensland, Australia. Thematic analysis was used. Results: The domain “coordinating decisions in surgery” was generated from textual data. Within this domain, three themes illustrated the dialog of clinical decisions, i.e., synchronizing and strategizing actions, sharing local knowledge, and planning contingency decisions based on priority. Conclusion: Strategies used to convey decisions that enhanced shared situational awareness included the use of “self-talk”, closed-loop communications, and “overhearing” conversations that occurred at the operating table. Behaviours that compromised a team’s shared situational awareness included tunnelling and fixating on one aspect of the situation.
Abstract:
Coarse-Grained Reconfigurable Architectures (CGRAs) are emerging as embedded application processing units in computing platforms for Exascale computing. Such CGRAs are distributed-memory multi-core compute elements on a chip that communicate over a Network-on-Chip (NoC). Numerical Linear Algebra (NLA) kernels are key to several high performance computing applications. In this paper we propose a systematic methodology to obtain the specification of Compute Elements (CEs) for such CGRAs. We analyze block Matrix Multiplication and block LU Decomposition algorithms in the context of a CGRA, and obtain theoretical bounds on communication requirements and memory sizes for a CE. Support for high-performance custom computations common to NLA kernels is met through custom function units (CFUs) in the CEs. We present results to justify the merits of such CFUs.
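For reference, a plain blocked matrix multiplication of the kind analysed above is sketched below in single-threaded C++; it makes visible the b x b tile working set (one tile each of A, B and C per step) from which per-CE memory sizes and communication volumes can be bounded. The CGRA mapping, NoC traffic and custom function units are not modelled, and the function name blocked_matmul is purely illustrative.

```cpp
#include <vector>

// C = A * B on n x n row-major matrices, processed in b x b tiles.
void blocked_matmul(const std::vector<double>& A, const std::vector<double>& B,
                    std::vector<double>& C, int n, int b) {
    for (int ii = 0; ii < n; ii += b)
        for (int jj = 0; jj < n; jj += b)
            for (int kk = 0; kk < n; kk += b)
                // One (ii,jj,kk) step touches one b x b tile of A, B and C:
                // roughly the working set a compute element must hold locally.
                for (int i = ii; i < ii + b && i < n; ++i)
                    for (int j = jj; j < jj + b && j < n; ++j) {
                        double s = C[i * n + j];
                        for (int k = kk; k < kk + b && k < n; ++k)
                            s += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = s;
                    }
}
```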
Abstract:
An efficient one-step digit-set-restricted modified signed-digit (MSD) adder based on symbolic substitution is presented. In this technique, carry propagation is avoided by introducing reference digits to restrict the intermediate carry and sum digits to {1̄, 0} and {0, 1}, respectively. The proposed technique requires significantly fewer minterms and simplifies system complexity compared to the reported one-step MSD addition techniques. An incoherent correlator based on an optoelectronic shared content-addressable memory processor is suggested to perform the addition operation. In this technique, only one set of minterms needs to be stored, independent of the operand length. (C) 2002 Society of Photo-Optical Instrumentation Engineers.
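For readers unfamiliar with the number system, the small sketch below only illustrates the modified signed-digit (MSD) representation assumed above: radix-2 digits drawn from {1̄, 0, 1} (with 1̄ denoting -1), which makes representations redundant and is what allows carry propagation to be limited. The digit-set-restricted addition itself (reference digits, symbolic-substitution minterms, the optoelectronic correlator) is not reproduced here.

```cpp
#include <cstdio>
#include <vector>

// Evaluate an MSD digit string, most significant digit first, in radix 2.
long msd_value(const std::vector<int>& digits) {
    long v = 0;
    for (int d : digits) v = 2 * v + d;   // each d is in {-1, 0, 1}
    return v;
}

int main() {
    // Two redundant MSD representations of the value 3: 011 and 101̄.
    std::printf("%ld %ld\n", msd_value({0, 1, 1}), msd_value({1, 0, -1}));
    return 0;
}
```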
Abstract:
A two-step digit-set-restricted modified signed-digit (MSD) adder based on symbolic substitution is presented. In the proposed addition algorithm, carry propagation is avoided by using reference digits to restrict the intermediate MSD carry and sum digits to {1̄, 0} and {0, 1}, respectively. The algorithm requires only 12 minterms to generate the final results, and no complementarity operations for nonzero outputs are involved, which simplifies the system complexity significantly. An optoelectronic shared content-addressable memory based on an incoherent correlator is used for experimental demonstration. (c) 2005 Society of Photo-Optical Instrumentation Engineers.
Abstract:
Neurodevelopmental disruptions caused by obstetric complications play a role in the etiology of several phenotypes associated with neuropsychiatric diseases and cognitive dysfunctions. Importantly, it has been noticed that epigenetic processes occurring early in life may mediate these associations. Here, DNA methylation signatures at IGF2 (insulin-like growth factor 2) and IGF2BP1-3 (IGF2-binding proteins 1-3) were examined in a sample consisting of 34 adult monozygotic (MZ) twins informative for obstetric complications and cognitive performance. Multivariate linear regression analysis of twin data was implemented to test for associations between methylation levels and both birth weight (BW) and adult working memory (WM) performance. Familial and unique environmental factors underlying these potential relationships were evaluated. A link was detected between DNA methylation levels of two CpG sites in the IGF2BP1 gene and both BW and adult WM performance. The BW-IGF2BP1 methylation association seemed due to non-shared environmental factors influencing BW, whereas the WM-IGF2BP1 methylation relationship seemed mediated by both genes and environment. Our data is in agreement with previous evidence indicating that DNA methylation status may be related to prenatal stress and later neurocognitive phenotypes. While former reports independently detected associations between DNA methylation and either BW or WM, current results suggest that these relationships are not confounded by each other.
Abstract:
The proliferation of inexpensive workstations and networks has created a new era in distributed computing. At the same time, non-traditional applications such as computer-aided design (CAD), computer-aided software engineering (CASE), geographic-information systems (GIS), and office-information systems (OIS) have placed increased demands for high-performance transaction processing on database systems. The combination of these factors gives rise to significant challenges in the design of modern database systems. In this thesis, we propose novel techniques whose aim is to improve the performance and scalability of these new database systems. These techniques exploit client resources through client-based transaction management. Client-based transaction management is realized by providing logging facilities locally even when data is shared in a global environment. This thesis presents several recovery algorithms which utilize client disks for storing recovery related information (i.e., log records). Our algorithms work with both coarse and fine-granularity locking and they do not require the merging of client logs at any time. Moreover, our algorithms support fine-granularity locking with multiple clients permitted to concurrently update different portions of the same database page. The database state is recovered correctly when there is a complex crash as well as when the updates performed by different clients on a page are not present on the disk version of the page, even though some of the updating transactions have committed. This thesis also presents the implementation of the proposed algorithms in a memory-mapped storage manager as well as a detailed performance study of these algorithms using the OO1 database benchmark. The performance results show that client-based logging is superior to traditional server-based logging. This is because client-based logging is an effective way to reduce dependencies on server CPU and disk resources and, thus, prevents the server from becoming a performance bottleneck as quickly when the number of clients accessing the database increases.
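A minimal sketch of the client-based logging idea is given below: before a client's update to a cached page is allowed to propagate, a redo record is appended to a log file on the client's own local disk, so recovery information never has to be shipped to or merged at the server. The record layout, class names and write-ahead rule shown here are illustrative assumptions, not the recovery algorithms of the thesis.

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

struct LogRecord {
    std::uint64_t txn_id;
    std::uint64_t page_id;
    std::uint32_t offset;          // byte offset within the page
    std::vector<char> after_image; // redo data for that range
};

class ClientLog {
    std::ofstream out_;
public:
    explicit ClientLog(const std::string& path)
        : out_(path, std::ios::binary | std::ios::app) {}

    // Append-only write to the client's local disk; under a write-ahead rule
    // the record must be durable before the page update leaves the client.
    void append(const LogRecord& r) {
        const std::uint32_t len = static_cast<std::uint32_t>(r.after_image.size());
        out_.write(reinterpret_cast<const char*>(&r.txn_id), sizeof r.txn_id);
        out_.write(reinterpret_cast<const char*>(&r.page_id), sizeof r.page_id);
        out_.write(reinterpret_cast<const char*>(&r.offset), sizeof r.offset);
        out_.write(reinterpret_cast<const char*>(&len), sizeof len);
        out_.write(r.after_image.data(), len);
        out_.flush();   // force the record to local storage
    }
};
```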
Abstract:
The availability of a very accurate dependence graph for a scalar code is the basis for the automatic generation of an efficient parallel implementation. The strategy for this task, which is encapsulated in a comprehensive data partitioning code generation algorithm, is described. This algorithm involves data partitioning, calculation of assignment ranges for partitioned arrays, addition of a comprehensive set of execution control masks, alteration of loop limits, and addition and optimisation of communications for all data. In this context, the development and implementation of strategies to merge communications wherever possible has proved an important feature in producing efficient parallel implementations for numerical mesh based codes. The code generation strategies described here are embedded within the Computer Aided Parallelisation Tools (CAPTools) software as a key part of a toolkit for automating as much as possible of the parallelisation process for mesh based numerical codes. The algorithms used enable parallelisation of real computational mechanics codes with only minor user interaction and without any prior manual customisation of the serial code to suit the parallelisation tool.
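The execution control masks mentioned above can be pictured with a hand-written analogue: for a block-partitioned array, each process executes only the loop iterations whose assigned elements it owns. The sketch below is an illustrative analogue under assumed names (my_lo, my_hi, an owner-computes rule), not CAPTools output; the tool also generates and merges the associated overlap communications, which are omitted here.

```cpp
#include <vector>

// Each process holds the full loop but is masked down to its assignment
// range [my_lo, my_hi] for the partitioned array a.
void relax(std::vector<double>& a, const std::vector<double>& b,
           long my_lo, long my_hi) {
    const long n = static_cast<long>(a.size());
    for (long i = 1; i < n - 1; ++i) {
        if (i < my_lo || i > my_hi) continue;  // execution control mask
        a[i] = 0.5 * (b[i - 1] + b[i + 1]);
    }
    // Overlap (halo) communication for b would be added and optimised
    // separately by the code generator.
}
```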
Abstract:
Realizing scalable performance on high performance computing systems is not straightforward for single-phenomenon codes (such as computational fluid dynamics [CFD]). This task is magnified considerably when the target software involves the interactions of a range of phenomena that have distinctive solution procedures involving different discretization methods. The problems of addressing the key issues of retaining data integrity and the ordering of the calculation procedures are significant. A strategy for parallelizing this multiphysics family of codes is described for software exploiting finite-volume discretization methods on unstructured meshes using iterative solution procedures. A mesh partitioning-based SPMD approach is used. However, since different variables use distinct discretization schemes, distinct partitions are required; techniques for addressing this issue using the mesh-partitioning tool JOSTLE are described. In this contribution, the strategy is tested for a variety of test cases under a wide range of conditions (e.g., problem size, number of processors, asynchronous/synchronous communications) using a variety of strategies for mapping the mesh partition onto the processor topology.