868 results for Parallel Processors
Abstract:
In this and a preceding paper, we provide an introduction to the Fujitsu VPP range of vector-parallel supercomputers and to some of the computational chemistry software available for the VPP. Here, we consider the implementation and performance of seven popular chemistry application packages. The codes discussed range from classical molecular dynamics to semiempirical and ab initio quantum chemistry. All have evolved from sequential codes, and have typically been parallelised using a replicated data approach. As such they are well suited to the large-memory/fast-processor architecture of the VPP. For one code, CASTEP, a distributed-memory data-driven parallelisation scheme is presented. (C) 2000 Published by Elsevier Science B.V. All rights reserved.
Abstract:
Scheduling tasks to efficiently use the available processor resources is crucial to minimizing the runtime of applications on shared-memory parallel processors. One factor that contributes to poor processor utilization is the idle time caused by long latency operations, such as remote memory references or processor synchronization operations. One way of tolerating this latency is to use a processor with multiple hardware contexts that can rapidly switch to executing another thread of computation whenever a long latency operation occurs, thus increasing processor utilization by overlapping computation with communication. Although multiple contexts are effective for tolerating latency, this effectiveness can be limited by memory and network bandwidth, by cache interference effects among the multiple contexts, and by critical tasks sharing processor resources with less critical tasks. This thesis presents techniques that increase the effectiveness of multiple contexts by intelligently scheduling threads to make more efficient use of processor pipeline, bandwidth, and cache resources. This thesis proposes thread prioritization as a fundamental mechanism for directing the thread schedule on a multiple-context processor. A priority is assigned to each thread either statically or dynamically and is used by the thread scheduler to decide which threads to load in the contexts, and to decide which context to switch to on a context switch. We develop a multiple-context model that integrates both cache and network effects, and shows how thread prioritization can both maintain high processor utilization, and limit increases in critical path runtime caused by multithreading. The model also shows that in order to be effective in bandwidth limited applications, thread prioritization must be extended to prioritize memory requests. 
We show how simple hardware can prioritize the running of threads in the multiple contexts, and the issuing of requests to both the local memory and the network. Simulation experiments show how thread prioritization is used in a variety of applications. Thread prioritization can improve the performance of synchronization primitives by minimizing the number of processor cycles wasted in spinning and devoting more cycles to critical threads. Thread prioritization can be used in combination with other techniques to improve cache performance and minimize cache interference between different working sets in the cache. For applications that are critical path limited, thread prioritization can improve performance by allowing processor resources to be devoted preferentially to critical threads. These experimental results show that thread prioritization is a mechanism that can be used to implement a wide range of scheduling policies.
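The scheduling decision described in this abstract (pick the next thread by priority on a context switch) can be illustrated with a toy software model. This is only a sketch: the class `PriorityScheduler`, its methods, and the helper `run_order` are hypothetical names, and the thesis describes a hardware mechanism, not this code.

```python
import heapq


class PriorityScheduler:
    """Toy model: on a context switch, run the highest-priority ready thread."""

    def __init__(self):
        self._ready = []  # min-heap of (-priority, sequence, name)
        self._seq = 0     # tie-breaker that preserves insertion order

    def make_ready(self, name, priority):
        # Negate the priority so the largest priority is popped first.
        heapq.heappush(self._ready, (-priority, self._seq, name))
        self._seq += 1

    def switch(self):
        """Return the highest-priority ready thread, or None if none is ready."""
        if not self._ready:
            return None
        return heapq.heappop(self._ready)[2]


def run_order(threads):
    """Return the order in which (name, priority) threads would be selected."""
    sched = PriorityScheduler()
    for name, priority in threads:
        sched.make_ready(name, priority)
    order = []
    while (name := sched.switch()) is not None:
        order.append(name)
    return order
```

In this model a spinning thread with low priority is selected only after the critical thread it is waiting on, which is the effect the abstract attributes to prioritizing synchronization.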
Abstract:
"Supported in part by the Advanced Research Projects Agency ... under Contract no. US AF 30(602) 4144."
Abstract:
Extra t.p. with thesis statement inserted.
Abstract:
The difficulties encountered in implementing large scale CM codes on multiprocessor systems are now fairly well understood. Despite the claims of shared memory architecture manufacturers to provide effective parallelizing compilers, these have not proved to be adequate for large or complex programs. Significant programmer effort is usually required to achieve reasonable parallel efficiencies on significant numbers of processors. The paradigm of Single Program Multi Data (SPMD) domain decomposition with message passing, where each processor runs the same code on a subdomain of the problem, communicating through exchange of messages, has for some time been demonstrated to provide the required level of efficiency, scalability, and portability across both shared and distributed memory systems, without the need to re-author the code into a new language or even to support differing message passing implementations. Extension of the methods into three dimensions has been enabled through the engineering of PHYSICA, a framework for supporting 3D, unstructured mesh and continuum mechanics modeling. In PHYSICA, six inspectors are used. Part of the challenge for automation of parallelization is being able to prove the equivalence of inspectors so that they can be merged into as few as possible.
Abstract:
Applying optimization algorithms to PDE-based models of groundwater remediation can greatly reduce remediation cost. However, groundwater remediation analysis requires a computationally expensive simulation, so effective parallel optimization could greatly reduce the computational expense. The optimization algorithm used in this research is Parallel Stochastic Radial Basis Function (RBF). It is designed for global optimization of computationally expensive functions with multiple local optima, and it does not require derivatives. In each iteration of the algorithm, an RBF surface is updated based on all the evaluated points in order to approximate the expensive function. The new RBF surface is then used to generate the next set of points, which are distributed to multiple processors for evaluation. The criteria for selecting the next function evaluation points are the estimated function value and the distance from all previously evaluated points. Algorithms created for serial computing are not necessarily efficient in parallel, so Parallel Stochastic RBF is a different algorithm from its serial ancestor. The algorithm is applied to two groundwater Superfund remediation sites: the Umatilla Chemical Depot and the former Blaine Naval Ammunition Depot. In the study, the formulation treats pumping rates as decision variables in order to remove the plume of contaminated groundwater. Groundwater flow and contaminant transport are simulated with MODFLOW-MT3DMS. For both problems, computation takes a large amount of CPU time, especially for the Blaine problem, which requires nearly fifty minutes per simulation for a single set of decision variables. Thus, an efficient algorithm and powerful computing resources are essential in both cases. The results are discussed in terms of parallel computing metrics, i.e. speedup and efficiency. We find that with up to 24 parallel processors, the results of the Parallel Stochastic RBF algorithm are excellent, with speedup efficiencies close to or exceeding 100%.
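The two selection criteria named in this abstract (estimated function value and distance from known points) and the reported metrics can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the function names, the equal default `weight`, and the min-max scaling are all hypothetical choices, and the candidates are ranked once rather than re-scored after each pick.

```python
import math


def select_candidates(candidates, predicted, evaluated, n_select, weight=0.5):
    """Pick n_select candidates with a low predicted value and a large
    distance from the already evaluated points (assumed weighted score)."""

    def scale(values):
        # Min-max scale to [0, 1]; a constant list maps to all zeros.
        lo, hi = min(values), max(values)
        return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

    # Distance from each candidate to its nearest evaluated point.
    min_dist = [min(math.dist(c, e) for e in evaluated) for c in candidates]
    value_score = scale(predicted)                   # 0 = best predicted value
    dist_score = [1.0 - d for d in scale(min_dist)]  # 0 = farthest from known points
    total = [weight * v + (1.0 - weight) * d
             for v, d in zip(value_score, dist_score)]
    ranked = sorted(range(len(candidates)), key=lambda i: total[i])
    return [candidates[i] for i in ranked[:n_select]]


def parallel_efficiency(t_serial, t_parallel, n_processors):
    """Efficiency = speedup / processors, with speedup = t_serial / t_parallel."""
    speedup = t_serial / t_parallel
    return speedup / n_processors
```

Each selected point would then be sent to a separate processor for one expensive simulation. An efficiency of 1.0 corresponds to the 100% figure quoted in the abstract.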
Abstract:
One of the challenges in scientific visualization is to generate software libraries suitable for the large-scale data emerging from tera-scale simulations and instruments. We describe the efforts currently under way at SDSC and NPACI to address these challenges. The scope of the SDSC project spans data handling, graphics, visualization, and scientific application domains. Components of the research focus on the following areas: intelligent data storage, layout and handling, using an associated “Floor-Plan” (meta data); performance optimization on parallel architectures; extension of SDSC’s scalable, parallel, direct volume renderer to allow perspective viewing; and interactive rendering of fractional images (“imagelets”), which facilitates the examination of large datasets. These concepts are coordinated within a data-visualization pipeline, which operates on component data blocks sized to fit within the available computing resources. A key feature of the scheme is that the meta data, which tag the data blocks, can be propagated and applied consistently. This is possible at the disk level, in distributing the computations across parallel processors; in “imagelet” composition; and in feature tagging. The work reflects the emerging challenges and opportunities presented by the ongoing progress in high-performance computing (HPC) and the deployment of the data, computational, and visualization Grids.
Abstract:
Since its introduction in 1993, the Message Passing Interface (MPI) has become a de facto standard for writing High Performance Computing (HPC) applications on clusters and Massively Parallel Processors (MPPs). The recent emergence of multi-core processor systems presents a new challenge for established parallel programming paradigms, including those based on MPI. This paper presents a new Java messaging system called MPJ Express. Using this system, we exploit multiple levels of parallelism - messaging and threading - to improve application performance on multi-core processors. We refer to our approach as nested parallelism. This MPI-like Java library can support nested parallelism by using Java or Java OpenMP (JOMP) threads within an MPJ Express process. Practicality of this approach is assessed by porting to Java a massively parallel structure formation code from Cosmology called Gadget-2. We introduce nested parallelism in the Java version of the simulation code and report good speed-ups. To the best of our knowledge it is the first time this kind of hybrid parallelism is demonstrated in a high performance Java application. (C) 2009 Elsevier Inc. All rights reserved.
Abstract:
This paper describes the design, implementation and testing of a high speed controlled stereo “head/eye” platform which facilitates the rapid redirection of gaze in response to visual input. It details the mechanical device, which is based around geared DC motors, and describes hardware aspects of the controller and vision system, which are implemented on a reconfigurable network of general purpose parallel processors. The servo-controller is described in detail and higher level gaze and vision constructs outlined. The paper gives performance figures gained both from mechanical tests on the platform alone, and from closed loop tests on the entire system using visual feedback from a feature detector.
Abstract:
The K-Means algorithm for cluster analysis is one of the most influential and popular data mining methods. Its straightforward parallel formulation is well suited for distributed memory systems with reliable interconnection networks, such as massively parallel processors and clusters of workstations. However, in large-scale geographically distributed systems the straightforward parallel algorithm can be rendered useless by a single communication failure or high latency in communication paths. The lack of scalable and fault tolerant global communication and synchronisation methods in large-scale systems has hindered the adoption of the K-Means algorithm for applications in large networked systems such as wireless sensor networks, peer-to-peer systems and mobile ad hoc networks. This work proposes a fully distributed K-Means algorithm (EpidemicK-Means) which does not require global communication and is intrinsically fault tolerant. The proposed distributed K-Means algorithm provides a clustering solution which can approximate the solution of an ideal centralised algorithm over the aggregated data as closely as desired. A comparative performance analysis is carried out against state-of-the-art sampling methods and shows that the proposed method overcomes the limitations of the sampling-based approaches for skewed cluster distributions. The experimental analysis confirms that the proposed algorithm is very accurate and fault tolerant under unreliable network conditions (message loss and node failures) and is suitable for asynchronous networks of very large and extreme scale.
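One common way to avoid global communication is pairwise gossip averaging, in which nodes repeatedly average a local value with a randomly chosen peer. The sketch below shows that general idea only. It is an assumption that it resembles the aggregation used in EpidemicK-Means, and the function name and parameters are hypothetical.

```python
import random


def gossip_average(values, rounds=50, seed=0):
    """Each node repeatedly averages its value with a random peer.

    Every exchange preserves the sum, so all values drift toward the
    global mean without any node ever seeing the full data set.
    Requires at least two nodes.
    """
    rng = random.Random(seed)
    state = list(values)
    n = len(state)
    for _ in range(rounds):
        for i in range(n):
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1  # choose a peer other than node i
            mean = (state[i] + state[j]) / 2.0
            state[i] = state[j] = mean
    return state
```

In a clustering setting, the gossiped quantities could be per-cluster sums and counts, from which each node would estimate the global centroids locally.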
Abstract:
Huge image collections have recently become available. In this scenario, Content-Based Image Retrieval (CBIR) systems have emerged as a promising approach to support image searches. The objective of CBIR systems is to retrieve the most similar images in a collection, given a query image, by taking into account visual properties of the image such as texture, color, and shape. In these systems, the effectiveness of the retrieval process depends heavily on the accuracy of the ranking approaches. Recently, re-ranking approaches have been proposed to improve the effectiveness of CBIR systems by taking into account the relationships among all images in a given dataset. These approaches typically demand a huge amount of computational power, which hampers their use in practical situations. On the other hand, these methods can be massively parallelized. In this paper, we propose to speed up the computation of the RL-Sim algorithm, a recently proposed image re-ranking approach, by using the computational power of Graphics Processing Units (GPUs). GPUs are emerging as relatively inexpensive parallel processors that are becoming available on a wide range of computer systems. We address the image re-ranking performance challenges by proposing a parallel solution designed to fit the computational model of GPUs. We conducted an experimental evaluation considering different implementations and devices. Experimental results demonstrate that significant performance gains can be obtained. Our approach achieves speedups of 7x over the serial implementation for the overall algorithm and up to 36x on its core steps.
Abstract:
An often-overlooked aspect of neural plasticity is the plasticity of neuronal composition, in which the numbers of neurons of particular classes are altered in response to environment and experience. The Drosophila brain features several well-characterized lineages in which a single neuroblast gives rise to multiple neuronal classes in a stereotyped sequence during development. We find that in the intrinsic mushroom body neuron lineage, the numbers for each class are highly plastic, depending on the timing of temporal fate transitions and the rate of neuroblast proliferation. For example, mushroom body neuroblast cycling can continue under starvation conditions, uncoupled from temporal fate transitions that depend on extrinsic cues reflecting organismal growth and development. In contrast, the proliferation rates of antennal lobe lineages are closely associated with organismal development, and their temporal fate changes appear to be cell-cycle dependent, such that the same numbers and types of uniglomerular projection neurons innervate the antennal lobe following various perturbations. We propose that this surprising difference in plasticity for these brain lineages is adaptive, given their respective roles as parallel processors versus discrete carriers of olfactory information.
Abstract:
This thesis focuses on the study of soft, jammed granular media through the application of statistical physics. This approach lies between the traditional macro- and micromechanical approaches: it seeks to establish the macroscopic properties that can be expected of a granular system from an analysis of the properties of the particles and the interactions between them, and from a consideration of the macroscopic constraints of the system. To this end, statistical theory is used together with some principles, concepts, and definitions from continuum mechanics (stress and strain fields, elastic potential energy, etc.) and some homogenization techniques. The interaction between particles is analyzed using contributions from contact theory and from the theory of capillary forces (produced by liquid menisci that may form when the medium is wet). The basic idea of statistical mechanics is that among all the solutions of a physical problem (such as the assembly of the particles of a granular medium in static equilibrium) there is a set that is compatible with the macroscopic knowledge we have of the system (for example, its volume, the stress to which it is subjected, the elastic potential energy it stores, etc.). This set still contains an enormous number of solutions. If there is no additional information, it is reasonable to think that there is no reason for any of these solutions to be more probable than the others. It then seems natural to assign all of them the same statistical weight and to construct a compatible mathematical function. Proceeding in this way yields the most probable distribution function of some quantities associated with the solutions, for which it is very important to ensure that all of them are equally accessible through the assembly procedure or protocol.
This approach was originally developed for the study of ideal gases, but it can be extended to non-thermal systems such as those analyzed in this thesis. In this regard, the first attempt was made a little over twenty years ago and is the volume ensemble. Since then it has been used and improved by many researchers around the world, while others have emerged, such as the energy ensemble and the force-moment ensemble (stress multiplied by volume). Each ensemble ultimately describes sets of solutions characterized by different macroscopic constraints, but all of them yield Maxwell-Boltzmann-type statistical distributions controlled by those constraints. Building on this previous work, this thesis adapts the classical approach of statistical physics to the case of soft granular media. A general framework for studying these ensembles is proposed, based on the comparison of all possible solutions in a mathematical space defined by the force-moment components and on density-of-states functions. This theoretical development is complemented by results obtained by simulating the cyclic compression of two-dimensional granular systems. A molecular dynamics method, MD (or DEM), was used for this purpose. The simulations consider a linear, damped elastic mechanical interaction to which, in some cases, the cohesive force produced by water menisci has been added. Calculations were run in serial and in parallel. The results not only show that the distribution functions of the force-moment components of a system subjected to a specific protocol appear to be universal, but also reveal that many computational aspects can determine which solutions are accessible.
This thesis focuses on the application of statistical mechanics to the study of static and jammed packings of soft granular media.
This approach lies between micro- and macromechanics: it tries to establish the expected macroscopic properties of a granular system by starting from a micromechanical analysis of the features of the particles and the interactions between them, and by considering the macroscopic constraints of the system. To do that, statistics are used together with some principles, concepts, and definitions of continuum mechanics (e.g. stress and strain fields, elastic potential energy) as well as some homogenization techniques. The interaction between the particles of a granular system is also examined, and theories of contact and capillary forces (when the media are wet) are revisited. The basic idea of statistical mechanics is that among the solutions of a physical problem (e.g. the static arrangement of particles in mechanical equilibrium) there is a class that is compatible with our macroscopic knowledge of the system (volume, stress, elastic potential energy, ...). This class still contains an enormous number of solutions. In the absence of further information, there is no a priori reason for favoring one of these over any other. Hence we naturally construct the equilibrium function by assigning equal statistical weights to all the solutions compatible with our requirements. This procedure leads to the most probable statistical distribution of some quantities, but it is necessary to guarantee that all the solutions are equally likely to be accessed. This approach was originally set up for the study of ideal gases, but it can be extended to non-thermal systems too. In this connection, the first attempt for granular systems was the volume ensemble, developed about 20 years ago. Since then, this model has been followed and improved upon by many researchers around the world, while two other approaches have also been set up: the energy and force-moment (i.e. stress multiplied by volume) ensembles.
Each ensemble is described by different macroscopic constraints, but all of them result in a Maxwell-Boltzmann statistical distribution, which is controlled by the respective constraints. Building on this previous work, this thesis introduces the classical statistical mechanics approach and adapts it to the case of soft granular media. A general framework, which includes these three ensembles and uses a force-moment phase space and a density of states function, is proposed. This theoretical development is complemented by molecular dynamics (or DEM) simulations of the cyclic compression of 2D granular systems. Simulations were carried out by considering spring-dashpot mechanical interactions and, in some cases, attractive capillary forces. They were run on single and parallel processors. The results not only show that the statistical distributions of the force-moment components obtained with a specific protocol seem to be universal, but also that many computational issues can determine which packings or solutions are attained.
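A linear spring-dashpot interaction of the kind mentioned in this abstract is commonly written as a repulsive spring force proportional to the overlap plus a damping term proportional to the normal relative velocity. The sketch below covers only the normal force between two disks in 2D. The function name and the parameters `k` (stiffness) and `c` (damping coefficient) are hypothetical, and the capillary force is left out because the abstract does not specify its form.

```python
import math


def normal_contact_force(pos_i, pos_j, vel_i, vel_j, r_i, r_j, k, c):
    """Force on disk i from disk j for a linear spring-dashpot contact (2D)."""
    dx, dy = pos_i[0] - pos_j[0], pos_i[1] - pos_j[1]
    dist = math.hypot(dx, dy)
    overlap = r_i + r_j - dist
    if overlap <= 0.0 or dist == 0.0:
        return (0.0, 0.0)  # no contact, or coincident centers (normal undefined)
    nx, ny = dx / dist, dy / dist  # unit normal pointing from j to i
    # Relative velocity of i with respect to j along the normal;
    # negative when the disks approach each other.
    v_n = (vel_i[0] - vel_j[0]) * nx + (vel_i[1] - vel_j[1]) * ny
    magnitude = k * overlap - c * v_n  # spring repulsion plus damping
    return (magnitude * nx, magnitude * ny)
```

For two unit disks at rest whose centers are 1.5 apart, the overlap is 0.5, so with `k = 100` the force on the first disk has magnitude 50 and points away from the second.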
Abstract:
Membrane computing arises as an alternative to traditional computing. Within this field lie the so-called Transition P Systems, which are based on the existence of regions that contain resources and rules that make those resources evolve, taking each region to a new situation called a configuration. The succession of the different configurations makes up the computation. In this field, the Natural Computing Group of the Universidad Politécnica de Madrid carries out numerous lines of research, under which numerous articles have been published and several doctoral theses completed. The main lines of research have so far been the study of the theoretical model on which P Systems are defined, the study of the algorithms used to apply the evolution rules in the regions, the design of new architectures that improve communication among the different membranes (regions) that make up the system, and the implementation of these systems on hardware devices that could define future machines based on this model. Within this last field, that is, with the goal of eventually building machines that can carry out the functionality of computing with P Systems, this doctoral thesis focuses on the design of two parallel processors that, by applying variants of existing algorithms, increase the level of intra-parallelism when applying the rules. The design and creation of both processors make novel contributions to research on Transition P Systems insofar as they use concepts that, although previously defined theoretically, had not been introduced into the hardware designed for these systems.
Thus, the two processors have the following characteristics:
- They offer high performance in the rule application phase, while maintaining medium flexibility and scalability that depend on the final technology on which the processors are synthesized.
- They offer a high level of intra-parallelism in the regions by allowing rules to be applied simultaneously.
- They are universal insofar as they do not depend on the nature of the rules that make up the P System.
- They behave non-deterministically, which is inherent to the nature of these systems.
The first of the circuits uses the power set of the set of application rules together with the concept of maximal applicability to promote intra-parallelism, and the second also includes the concept of applicability domain to determine the set of rules that are applicable at each moment with the existing resources. Both processors are designed and tested using electronic design tools and are prepared to be synthesized on FPGAs.
ABSTRACT
Membrane computing appears as an alternative to traditional computing. P Systems belong to this field; they are based on the existence of regions called “membranes” that contain resources and rules describing how the resources may evolve to take each of these regions to a new situation called a "configuration". The succession of configurations constitutes the computation. Within this field, the Natural Computing Group of the Universidad Politécnica de Madrid carries out a large amount of research that has produced many papers and several doctoral theses.
The main research lines so far have been the study of the theoretical model on which Transition P Systems are defined, the study of the algorithms used to apply the evolution rules in the regions, the design of new architectures that may improve communication among the different membranes (regions) that compose the whole system, and the implementation of such systems on hardware devices that may define machines based on this new model. Within this last research field, that is, with the objective of finally building machines that can carry out computation with P Systems, the present thesis centers on the design of two parallel processors that, applying several variants of some known algorithms, improve the level of internal parallelism in the evolution rule application phase. The design and creation of both processors bring innovations to the field of Transition P Systems research because they use concepts that, although known before, had never been used in circuits that implement the application phase of evolution rules. Both processors have the following characteristics:
- They achieve very high performance during the rule application phase while keeping a level of flexibility and scalability that, although not very high, seems acceptable.
- They offer a very high level of internal parallelism inside the regions, allowing several rules to be applied at the same time.
- They are universal, meaning that they do not depend on the active rules that compose the P System.
- They have a non-deterministic behavior that is inherent to the nature of these systems.
The first processor uses the concept of the "power set of the application rule set" and the concept of "maximal applicability" to improve parallelism, and the second one includes, besides the previous ones, the concept of "applicability domain" to determine the set of rules that may be applied at each moment with the existing resources. Both processors are designed and tested with the design software by Altera Corporation and they are ready to be synthesized on FPGAs.
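Two of the concepts named in this abstract can be illustrated in software, assuming that a region's resources and each rule's antecedent are multisets of symbols. Under that assumption, the maximal applicability of a rule is taken here to be the largest number of times it could fire with the available resources, and the power set is every subset of the rule set. The function names and the rule encoding are hypothetical, and the thesis implements these ideas in hardware, not in this form.

```python
from collections import Counter
from itertools import chain, combinations


def max_applicability(antecedent, resources):
    """Largest number of times a rule could fire with the given resources.

    antecedent: non-empty Counter of symbols consumed by one application.
    resources:  Counter of symbols available in the region.
    """
    return min(resources[symbol] // needed
               for symbol, needed in antecedent.items())


def power_set(rules):
    """Yield every subset of rule names, from the empty set to the full set."""
    names = list(rules)
    return chain.from_iterable(
        combinations(names, size) for size in range(len(names) + 1)
    )


def applicable_rules(rules, resources):
    """Names of the rules that can fire at least once with the resources."""
    return [name for name, antecedent in rules.items()
            if max_applicability(antecedent, resources) > 0]
```

With resources `{a: 5, b: 2}`, a rule consuming two `a` can fire at most twice, a rule consuming one `a` and one `b` can also fire at most twice, and a rule consuming `c` is not applicable.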