937 resultados para Distributed non-coherent shared memory
Resumo:
The past few decades have seen a considerable increase in the number of parallel and distributed systems. With the development of more complex applications, the need for more powerful systems has emerged and various parallel and distributed environments have been designed and implemented. Each of the environments, including hardware and software, has unique strengths and weaknesses. There is no single parallel environment that can be identified as the best environment for all applications with respect to hardware and software properties. The main goal of this thesis is to provide a novel way of performing data-parallel computation in parallel and distributed environments by utilizing the best characteristics of difference aspects of parallel computing. For the purpose of this thesis, three aspects of parallel computing were identified and studied. First, three parallel environments (shared memory, distributed memory, and a network of workstations) are evaluated to quantify theirsuitability for different parallel applications. Due to the parallel and distributed nature of the environments, networks connecting the processors in these environments were investigated with respect to their performance characteristics. Second, scheduling algorithms are studied in order to make them more efficient and effective. A concept of application-specific information scheduling is introduced. The application- specific information is data about the workload extractedfrom an application, which is provided to a scheduling algorithm. Three scheduling algorithms are enhanced to utilize the application-specific information to further refine their scheduling properties. A more accurate description of the workload is especially important in cases where the workunits are heterogeneous and the parallel environment is heterogeneous and/or non-dedicated. The results obtained show that the additional information regarding the workload has a positive impact on the performance of applications. Third, a programming paradigm for networks of symmetric multiprocessor (SMP) workstations is introduced. The MPIT programming paradigm incorporates the Message Passing Interface (MPI) with threads to provide a methodology to write parallel applications that efficiently utilize the available resources and minimize the overhead. The MPIT allows for communication and computation to overlap by deploying a dedicated thread for communication. Furthermore, the programming paradigm implements an application-specific scheduling algorithm. The scheduling algorithm is executed by the communication thread. Thus, the scheduling does not affect the execution of the parallel application. Performance results achieved from the MPIT show that considerable improvements over conventional MPI applications are achieved.
Resumo:
As the number of processors in distributed-memory multiprocessors grows, efficiently supporting a shared-memory programming model becomes difficult. We have designed the Protocol for Hierarchical Directories (PHD) to allow shared-memory support for systems containing massive numbers of processors. PHD eliminates bandwidth problems by using a scalable network, decreases hot-spots by not relying on a single point to distribute blocks, and uses a scalable amount of space for its directories. PHD provides a shared-memory model by synthesizing a global shared memory from the local memories of processors. PHD supports sequentially consistent read, write, and test- and-set operations. This thesis also introduces a method of describing locality for hierarchical protocols and employs this method in the derivation of an abstract model of the protocol behavior. An embedded model, based on the work of Johnson[ISCA19], describes the protocol behavior when mapped to a k-ary n-cube. The thesis uses these two models to study the average height in the hierarchy that operations reach, the longest path messages travel, the number of messages that operations generate, the inter-transaction issue time, and the protocol overhead for different locality parameters, degrees of multithreading, and machine sizes. We determine that multithreading is only useful for approximately two to four threads; any additional interleaving does not decrease the overall latency. For small machines and high locality applications, this limitation is due mainly to the length of the running threads. For large machines with medium to low locality, this limitation is due mainly to the protocol overhead being too large. Our study using the embedded model shows that in situations where the run length between references to shared memory is at least an order of magnitude longer than the time to process a single state transition in the protocol, applications exhibit good performance. If separate controllers for processing protocol requests are included, the protocol scales to 32k processor machines as long as the application exhibits hierarchical locality: at least 22% of the global references must be able to be satisfied locally; at most 35% of the global references are allowed to reach the top level of the hierarchy.
Resumo:
The foreseen evolution of chip architectures to higher number of, heterogeneous, cores, with non-uniform memory and non-coherent caches, brings renewed attention to the use of Software Transactional Memory (STM) as an alternative to lock-based synchronisation. However, STM relies on the possibility of aborting conflicting transactions to maintain data consistency, which impacts on the responsiveness and timing guarantees required by real-time systems. In these systems, contention delays must be (efficiently) limited so that the response times of tasks executing transactions are upperbounded and task sets can be feasibly scheduled. In this paper we defend the role of the transaction contention manager to reduce the number of transaction retries and to help the real-time scheduler assuring schedulability. For such purpose, the contention management policy should be aware of on-line scheduling information.
Resumo:
An approximate analytic model of a shared memory multiprocessor with a Cache Only Memory Architecture (COMA), the busbased Data Difussion Machine (DDM), is presented and validated. It describes the timing and interference in the system as a function of the hardware, the protocols, the topology and the workload. Model results have been compared to results from an independent simulator. The comparison shows good model accuracy specially for non-saturated systems, where the errors in response times and device utilizations are independent of the number of processors and remain below 10% in 90% of the simulations. Therefore, the model can be used as an average performance prediction tool that avoids expensive simulations in the design of systems with many processors.
Resumo:
The last decade has witnessed a major shift towards the deployment of embedded applications on multi-core platforms. However, real-time applications have not been able to fully benefit from this transition, as the computational gains offered by multi-cores are often offset by performance degradation due to shared resources, such as main memory. To efficiently use multi-core platforms for real-time systems, it is hence essential to tightly bound the interference when accessing shared resources. Although there has been much recent work in this area, a remaining key problem is to address the diversity of memory arbiters in the analysis to make it applicable to a wide range of systems. This work handles diverse arbiters by proposing a general framework to compute the maximum interference caused by the shared memory bus and its impact on the execution time of the tasks running on the cores, considering different bus arbiters. Our novel approach clearly demarcates the arbiter-dependent and independent stages in the analysis of these upper bounds. The arbiter-dependent phase takes the arbiter and the task memory-traffic pattern as inputs and produces a model of the availability of the bus to a given task. Then, based on the availability of the bus, the arbiter-independent phase determines the worst-case request-release scenario that maximizes the interference experienced by the tasks due to the contention for the bus. We show that the framework addresses the diversity problem by applying it to a memory bus shared by a fixed-priority arbiter, a time-division multiplexing (TDM) arbiter, and an unspecified work-conserving arbiter using applications from the MediaBench test suite. We also experimentally evaluate the quality of the analysis by comparison with a state-of-the-art TDM analysis approach and consistently showing a considerable reduction in maximum interference.
Resumo:
Dissertação para obtenção do Grau de Mestre em Engenharia Biomédica
Resumo:
La gestión de recursos en los procesadores multi-core ha ganado importancia con la evolución de las aplicaciones y arquitecturas. Pero esta gestión es muy compleja. Por ejemplo, una misma aplicación paralela ejecutada múltiples veces con los mismos datos de entrada, en un único nodo multi-core, puede tener tiempos de ejecución muy variables. Hay múltiples factores hardware y software que afectan al rendimiento. La forma en que los recursos hardware (cómputo y memoria) se asignan a los procesos o threads, posiblemente de varias aplicaciones que compiten entre sí, es fundamental para determinar este rendimiento. La diferencia entre hacer la asignación de recursos sin conocer la verdadera necesidad de la aplicación, frente a asignación con una meta específica es cada vez mayor. La mejor manera de realizar esta asignación és automáticamente, con una mínima intervención del programador. Es importante destacar, que la forma en que la aplicación se ejecuta en una arquitectura no necesariamente es la más adecuada, y esta situación puede mejorarse a través de la gestión adecuada de los recursos disponibles. Una apropiada gestión de recursos puede ofrecer ventajas tanto al desarrollador de las aplicaciones, como al entorno informático donde ésta se ejecuta, permitiendo un mayor número de aplicaciones en ejecución con la misma cantidad de recursos. Así mismo, esta gestión de recursos no requeriría introducir cambios a la aplicación, o a su estrategia operativa. A fin de proponer políticas para la gestión de los recursos, se analizó el comportamiento de aplicaciones intensivas de cómputo e intensivas de memoria. Este análisis se llevó a cabo a través del estudio de los parámetros de ubicación entre los cores, la necesidad de usar la memoria compartida, el tamaño de la carga de entrada, la distribución de los datos dentro del procesador y la granularidad de trabajo. Nuestro objetivo es identificar cómo estos parámetros influyen en la eficiencia de la ejecución, identificar cuellos de botella y proponer posibles mejoras. Otra propuesta es adaptar las estrategias ya utilizadas por el Scheduler con el fin de obtener mejores resultados.
Resumo:
En el entorno actual, diversas ramas de las ciencias, tienen la necesidad de auxiliarse de la computación de altas prestaciones para la obtención de resultados a relativamente corto plazo. Ello es debido fundamentalmente, al alto volumen de información que necesita ser procesada y también al costo computacional que demandan dichos cálculos. El beneficio al realizar este procesamiento de manera distribuida y paralela, logra acortar los tiempos de espera en la obtención de los resultados y de esta forma posibilita una toma decisiones con mayor anticipación. Para soportar ello, existen fundamentalmente dos modelos de programación ampliamente extendidos: el modelo de paso de mensajes a través de librerías basadas en el estándar MPI, y el de memoria compartida con la utilización de OpenMP. Las aplicaciones híbridas son aquellas que combinan ambos modelos con el fin de aprovechar en cada caso, las potencialidades específicas del paralelismo en cada uno. Lamentablemente, la práctica ha demostrado que la utilización de esta combinación de modelos, no garantiza necesariamente una mejoría en el comportamiento de las aplicaciones. Por lo tanto, un análisis de los factores que influyen en el rendimiento de las mismas, nos beneficiaría a la hora de implementarlas pero también, sería un primer paso con el fin de llegar a predecir su comportamiento. Adicionalmente, supondría una vía para determinar que parámetros de la aplicación modificar con el fin de mejorar su rendimiento. En el trabajo actual nos proponemos definir una metodología para la identificación de factores de rendimiento en aplicaciones híbridas y en congruencia, la identificación de algunos factores que influyen en el rendimiento de las mismas.
Resumo:
In fear conditioning, an animal learns to associate an unconditioned stimulus (US), such as a shock, and a conditioned stimulus (CS), such as a tone, so that the presentation of the CS alone can trigger conditioned responses. Recent research on the lateral amygdala has shown that following cued fear conditioning, only a subset of higher-excitable neurons are recruited in the memory trace. Their selective deletion after fear conditioning results in a selective erasure of the fearful memory. I hypothesize that the recruitment of highly excitable neurons depends on responsiveness to stimuli, intrinsic excitability and local connectivity. In addition, I hypothesize that neurons recruited for an initial memory also participate in subsequent memories, and that changes in neuronal excitability affect secondary fear learning. To address these hypotheses, I will show that A) a rat can learn to associate two successive short-term fearful memories; B) neuronal populations in the LA are competitively recruited in the memory traces depending on individual neuronal advantages, as well as advantages granted by the local network. By performing two successive cued fear conditioning experiments, I found that rats were able to learn and extinguish the two successive short-term memories, when tested 1 hour after learning for each memory. These rats were equipped with a system of stable extracellular recordings that I developed, which allowed to monitor neuronal activity during fear learning. 233 individual putative pyramidal neurons could modulate their firing rate in response to the conditioned tone (conditioned neurons) and/or non- conditioned tones (generalizing neurons). Out of these recorded putative pyramidal neurons 86 (37%) neurons were conditioned to one or both tones. More precisely, one population of neurons encoded for a shared memory while another group of neurons likely encoded the memories' new features. Notably, in spite of a successful behavioral extinction, the firing rate of those conditioned neurons in response to the conditioned tone remained unchanged throughout memory testing. Furthermore, by analyzing the pre-conditioning characteristics of the conditioned neurons, I determined that it was possible to predict neuronal recruitment based on three factors: 1) initial sensitivity to auditory inputs, with tone-sensitive neurons being more easily recruited than tone- insensitive neurons; 2) baseline excitability levels, with more highly excitable neurons being more likely to become conditioned; and 3) the number of afferent connections received from local neurons, with neurons destined to become conditioned receiving more connections than non-conditioned neurons. - En conditionnement de la peur, un animal apprend à associer un stimulus inconditionnel (SI), tel un choc électrique, et un stimulus conditionné (SC), comme un son, de sorte que la présentation du SC seul suffit pour déclencher des réflexes conditionnés. Des recherches récentes sur l'amygdale latérale (AL) ont montré que, suite au conditionnement à la peur, seul un sous-ensemble de neurones plus excitables sont recrutés pour constituer la trace mnésique. Pour apprendre à associer deux sons au même SI, je fais l'hypothèse que les neurones entrent en compétition afin d'être sélectionnés lors du recrutement pour coder la trace mnésique. Ce recrutement dépendrait d'un part à une activation facilité des neurones ainsi qu'une activation facilité de réseaux de neurones locaux. En outre, je fais l'hypothèse que l'activation de ces réseaux de l'AL, en soi, est suffisante pour induire une mémoire effrayante. Pour répondre à ces hypothèses, je vais montrer que A) selon un processus de mémoire à court terme, un rat peut apprendre à associer deux mémoires effrayantes apprises successivement; B) des populations neuronales dans l'AL sont compétitivement recrutées dans les traces mnésiques en fonction des avantages neuronaux individuels, ainsi que les avantages consentis par le réseau local. En effectuant deux expériences successives de conditionnement à la peur, des rats étaient capables d'apprendre, ainsi que de subir un processus d'extinction, pour les deux souvenirs effrayants. La mesure de l'efficacité du conditionnement à la peur a été effectuée 1 heure après l'apprentissage pour chaque souvenir. Ces rats ont été équipés d'un système d'enregistrements extracellulaires stables que j'ai développé, ce qui a permis de suivre l'activité neuronale pendant l'apprentissage de la peur. 233 neurones pyramidaux individuels pouvaient moduler leur taux d'activité en réponse au son conditionné (neurones conditionnés) et/ou au son non conditionné (neurones généralisant). Sur les 233 neurones pyramidaux putatifs enregistrés 86 (37%) d'entre eux ont été conditionnés à un ou deux tons. Plus précisément, une population de neurones code conjointement pour un souvenir partagé, alors qu'un groupe de neurones différent code pour de nouvelles caractéristiques de nouveaux souvenirs. En particulier, en dépit d'une extinction du comportement réussie, le taux de décharge de ces neurones conditionné en réponse à la tonalité conditionnée est resté inchangée tout au long de la mesure d'apprentissage. En outre, en analysant les caractéristiques de pré-conditionnement des neurones conditionnés, j'ai déterminé qu'il était possible de prévoir le recrutement neuronal basé sur trois facteurs : 1) la sensibilité initiale aux entrées auditives, avec les neurones sensibles aux sons étant plus facilement recrutés que les neurones ne répondant pas aux stimuli auditifs; 2) les niveaux d'excitabilité des neurones, avec les neurones plus facilement excitables étant plus susceptibles d'être conditionnés au son ; et 3) le nombre de connexions reçues, puisque les neurones conditionné reçoivent plus de connexions que les neurones non-conditionnés. Enfin, nous avons constaté qu'il était possible de remplacer de façon satisfaisante le SI lors d'un conditionnement à la peur par des injections bilatérales de bicuculline, un antagoniste des récepteurs de l'acide y-Aminobutirique.
Resumo:
We present a method to compute, quickly and efficiently, the mutual information achieved by an IID (independent identically distributed) complex Gaussian signal on a block Rayleigh-faded channel without side information at the receiver. The method accommodates both scalar and MIMO (multiple-input multiple-output) settings. Operationally, this mutual information represents the highest spectral efficiency that can be attained using Gaussiancodebooks. Examples are provided that illustrate the loss in spectral efficiency caused by fast fading and how that loss is amplified when multiple transmit antennas are used. These examples are further enriched by comparisons with the channel capacity under perfect channel-state information at the receiver, and with the spectral efficiency attained by pilot-based transmission.
Resumo:
We present a method to compute, quickly and efficiently, the mutual information achieved by an IID (independent identically distributed) complex Gaussian signal on a block Rayleigh-faded channel without side information at the receiver. The method accommodates both scalar and MIMO (multiple-input multiple-output) settings. Operationally, this mutual information represents the highest spectral efficiency that can be attained using Gaussiancodebooks. Examples are provided that illustrate the loss in spectral efficiency caused by fast fading and how that loss is amplified when multiple transmit antennas are used. These examples are further enriched by comparisons with the channel capacity under perfect channel-state information at the receiver, and with the spectral efficiency attained by pilot-based transmission.
Resumo:
MOTIVATION: The detection of positive selection is widely used to study gene and genome evolution, but its application remains limited by the high computational cost of existing implementations. We present a series of computational optimizations for more efficient estimation of the likelihood function on large-scale phylogenetic problems. We illustrate our approach using the branch-site model of codon evolution. RESULTS: We introduce novel optimization techniques that substantially outperform both CodeML from the PAML package and our previously optimized sequential version SlimCodeML. These techniques can also be applied to other likelihood-based phylogeny software. Our implementation scales well for large numbers of codons and/or species. It can therefore analyse substantially larger datasets than CodeML. We evaluated FastCodeML on different platforms and measured average sequential speedups of FastCodeML (single-threaded) versus CodeML of up to 5.8, average speedups of FastCodeML (multi-threaded) versus CodeML on a single node (shared memory) of up to 36.9 for 12 CPU cores, and average speedups of the distributed FastCodeML versus CodeML of up to 170.9 on eight nodes (96 CPU cores in total).Availability and implementation: ftp://ftp.vital-it.ch/tools/FastCodeML/. CONTACT: selectome@unil.ch or nicolas.salamin@unil.ch.
Resumo:
The motion instability is an important issue that occurs during the operation of towed underwater vehicles (TUV), which considerably affects the accuracy of high precision acoustic instrumentations housed inside the same. Out of the various parameters responsible for this, the disturbances from the tow-ship are the most significant one. The present study focus on the motion dynamics of an underwater towing system with ship induced disturbances as the input. The study focus on an innovative system called two-part towing. The methodology involves numerical modeling of the tow system, which consists of modeling of the tow-cables and vehicles formulation. Previous study in this direction used a segmental approach for the modeling of the cable. Even though, the model was successful in predicting the heave response of the tow-body, instabilities were observed in the numerical solution. The present study devises a simple approach called lumped mass spring model (LMSM) for the cable formulation. In this work, the traditional LMSM has been modified in two ways. First, by implementing advanced time integration procedures and secondly, use of a modified beam model which uses only translational degrees of freedoms for solving beam equation. A number of time integration procedures, such as Euler, Houbolt, Newmark and HHT-α were implemented in the traditional LMSM and the strength and weakness of each scheme were numerically estimated. In most of the previous studies, hydrodynamic forces acting on the tow-system such as drag and lift etc. are approximated as analytical expression of velocities. This approach restricts these models to use simple cylindrical shaped towed bodies and may not be applicable modern tow systems which are diversed in shape and complexity. Hence, this particular study, hydrodynamic parameters such as drag and lift of the tow-system are estimated using CFD techniques. To achieve this, a RANS based CFD code has been developed. Further, a new convection interpolation scheme for CFD simulation, called BNCUS, which is blend of cell based and node based formulation, was proposed in the study and numerically tested. To account for the fact that simulation takes considerable time in solving fluid dynamic equations, a dedicated parallel computing setup has been developed. Two types of computational parallelisms are explored in the current study, viz; the model for shared memory processors and distributed memory processors. In the present study, shared memory model was used for structural dynamic analysis of towing system, distributed memory one was devised in solving fluid dynamic equations.
Resumo:
With the transition to multicore processors almost complete, the parallel processing community is seeking efficient ways to port legacy message passing applications on shared memory and multicore processors. MPJ Express is our reference implementation of Message Passing Interface (MPI)-like bindings for the Java language. Starting with the current release, the MPJ Express software can be configured in two modes: the multicore and the cluster mode. In the multicore mode, parallel Java applications execute on shared memory or multicore processors. In the cluster mode, Java applications parallelized using MPJ Express can be executed on distributed memory platforms like compute clusters and clouds. The multicore device has been implemented using Java threads in order to satisfy two main design goals of portability and performance. We also discuss the challenges of integrating the multicore device in the MPJ Express software. This turned out to be a challenging task because the parallel application executes in a single JVM in the multicore mode. On the contrary in the cluster mode, the parallel user application executes in multiple JVMs. Due to these inherent architectural differences between the two modes, the MPJ Express runtime is modified to ensure correct semantics of the parallel program. Towards the end, we compare performance of MPJ Express (multicore mode) with other C and Java message passing libraries---including mpiJava, MPJ/Ibis, MPICH2, MPJ Express (cluster mode)---on shared memory and multicore processors. We found out that MPJ Express performs signicantly better in the multicore mode than in the cluster mode. Not only this but the MPJ Express software also performs better in comparison to other Java messaging libraries including mpiJava and MPJ/Ibis when used in the multicore mode on shared memory or multicore processors. We also demonstrate effectiveness of the MPJ Express multicore device in Gadget-2, which is a massively parallel astrophysics N-body siimulation code.
Resumo:
Recently major processor manufacturers have announced a dramatic shift in their paradigm to increase computing power over the coming years. Instead of focusing on faster clock speeds and more powerful single core CPUs, the trend clearly goes towards multi core systems. This will also result in a paradigm shift for the development of algorithms for computationally expensive tasks, such as data mining applications. Obviously, work on parallel algorithms is not new per se but concentrated efforts in the many application domains are still missing. Multi-core systems, but also clusters of workstations and even large-scale distributed computing infrastructures provide new opportunities and pose new challenges for the design of parallel and distributed algorithms. Since data mining and machine learning systems rely on high performance computing systems, research on the corresponding algorithms must be on the forefront of parallel algorithm research in order to keep pushing data mining and machine learning applications to be more powerful and, especially for the former, interactive. To bring together researchers and practitioners working in this exciting field, a workshop on parallel data mining was organized as part of PKDD/ECML 2006 (Berlin, Germany). The six contributions selected for the program describe various aspects of data mining and machine learning approaches featuring low to high degrees of parallelism: The first contribution focuses the classic problem of distributed association rule mining and focuses on communication efficiency to improve the state of the art. After this a parallelization technique for speeding up decision tree construction by means of thread-level parallelism for shared memory systems is presented. The next paper discusses the design of a parallel approach for dis- tributed memory systems of the frequent subgraphs mining problem. This approach is based on a hierarchical communication topology to solve issues related to multi-domain computational envi- ronments. The forth paper describes the combined use and the customization of software packages to facilitate a top down parallelism in the tuning of Support Vector Machines (SVM) and the next contribution presents an interesting idea concerning parallel training of Conditional Random Fields (CRFs) and motivates their use in labeling sequential data. The last contribution finally focuses on very efficient feature selection. It describes a parallel algorithm for feature selection from random subsets. Selecting the papers included in this volume would not have been possible without the help of an international Program Committee that has provided detailed reviews for each paper. We would like to also thank Matthew Otey who helped with publicity for the workshop.