991 resultados para Sparse distributed memory
Resumo:
Modelos de tomada de decisão necessitam refletir os aspectos da psi- cologia humana. Com este objetivo, este trabalho é baseado na Sparse Distributed Memory (SDM), um modelo psicologicamente e neuro- cientificamente plausível da memória humana, publicado por Pentti Kanerva, em 1988. O modelo de Kanerva possui um ponto crítico: um item de memória aquém deste ponto é rapidamente encontrado, e items além do ponto crítico não o são. Kanerva calculou este ponto para um caso especial com um seleto conjunto de parâmetros (fixos). Neste trabalho estendemos o conhecimento deste ponto crítico, através de simulações computacionais, e analisamos o comportamento desta “Critical Distance” sob diferentes cenários: em diferentes dimensões; em diferentes números de items armazenados na memória; e em diferentes números de armazenamento do item. Também é derivada uma função que, quando minimizada, determina o valor da “Critical Distance” de acordo com o estado da memória. Um objetivo secundário do trabalho é apresentar a SDM de forma simples e intuitiva para que pesquisadores de outras áreas possam imaginar como ela pode ajudá-los a entender e a resolver seus problemas.
Resumo:
We consider the often-studied problem of sorting, for a parallel computer. Given an input array distributed evenly over p processors, the task is to compute the sorted output array, also distributed over the p processors. Many existing algorithms take the approach of approximately load-balancing the output, leaving each processor with Θ(n/p) elements. However, in many cases, approximate load-balancing leads to inefficiencies in both the sorting itself and in further uses of the data after sorting. We provide a deterministic parallel sorting algorithm that uses parallel selection to produce any output distribution exactly, particularly one that is perfectly load-balanced. Furthermore, when using a comparison sort, this algorithm is 1-optimal in both computation and communication. We provide an empirical study that illustrates the efficiency of exact data splitting, and shows an improvement over two sample sort algorithms.
Resumo:
A parallel technique, for a distributed memory machine, based on domain decomposition for solving the Navier-Stokes equations in cartesian and cylindrical coordinates in two dimensions with free surfaces is described. It is based on the code by Tome and McKee (J. Comp. Phys. 110 (1994) 171-186) and Tome (Ph.D. Thesis, University of Strathclyde, Glasgow, 1993) which in turn is based on the SMAC method by Amsden and Harlow (Report LA-4370, Los Alamos Scientific Laboratory, 1971), which solves the Navier-Stokes equations in three steps: the momentum and Poisson equations and particle movement, These equations are discretized by explicit and 5-point finite differences. The parallelization is performed by splitting the computation domain into vertical panels and assigning each of these panels to a processor. All the computation can then be performed using nearest neighbour communication. Test runs comparing the performance of the parallel with the serial code, and a discussion of the load balancing question are presented. PVM is used for communication between processes. (C) 1999 Elsevier B.V. B.V. All rights reserved.
Resumo:
A large class of computational problems are characterised by frequent synchronisation, and computational requirements which change as a function of time. When such a problem is solved on a message passing multiprocessor machine [5], the combination of these characteristics leads to system performance which deteriorate in time. As the communication performance of parallel hardware steadily improves so load balance becomes a dominant factor in obtaining high parallel efficiency. Performance can be improved with periodic redistribution of computational load; however, redistribution can sometimes be very costly. We study the issue of deciding when to invoke a global load re-balancing mechanism. Such a decision policy must actively weigh the costs of remapping against the performance benefits, and should be general enough to apply automatically to a wide range of computations. This paper discusses a generic strategy for Dynamic Load Balancing (DLB) in unstructured mesh computational mechanics applications. The strategy is intended to handle varying levels of load changes throughout the run. The major issues involved in a generic dynamic load balancing scheme will be investigated together with techniques to automate the implementation of a dynamic load balancing mechanism within the Computer Aided Parallelisation Tools (CAPTools) environment, which is a semi-automatic tool for parallelisation of mesh based FORTRAN codes.
Resumo:
As the complexity of parallel applications increase, the performance limitations resulting from computational load imbalance become dominant. Mapping the problem space to the processors in a parallel machine in a manner that balances the workload of each processors will typically reduce the run-time. In many cases the computation time required for a given calculation cannot be predetermined even at run-time and so static partition of the problem returns poor performance. For problems in which the computational load across the discretisation is dynamic and inhomogeneous, for example multi-physics problems involving fluid and solid mechanics with phase changes, the workload for a static subdomain will change over the course of a computation and cannot be estimated beforehand. For such applications the mapping of loads to process is required to change dynamically, at run-time in order to maintain reasonable efficiency. The issue of dynamic load balancing are examined in the context of PHYSICA, a three dimensional unstructured mesh multi-physics continuum mechanics computational modelling code.
Resumo:
A method is outlined for optimising graph partitions which arise in mapping unstructured mesh calculations to parallel computers. The method employs a relative gain iterative technique to both evenly balance the workload and minimise the number and volume of interprocessor communications. A parallel graph reduction technique is also briefly described and can be used to give a global perspective to the optimisation. The algorithms work efficiently in parallel as well as sequentially and when combined with a fast direct partitioning technique (such as the Greedy algorithm) to give an initial partition, the resulting two-stage process proves itself to be both a powerful and flexible solution to the static graph-partitioning problem. Experiments indicate that the resulting parallel code can provide high quality partitions, independent of the initial partition, within a few seconds. The algorithms can also be used for dynamic load-balancing, reusing existing partitions and in this case the procedures are much faster than static techniques, provide partitions of similar or higher quality and, in comparison, involve the migration of a fraction of the data.
Resumo:
Who was the cowboy in Washington? What is the land of sushi? Most people would have answers to these questions readily available,yet, modern search engines, arguably the epitome of technology in finding answers to most questions, are completely unable to do so. It seems that people capture few information items to rapidly converge to a seemingly 'obvious' solution. We will study approaches for this problem, with two additional hard demands that constrain the space of possible theories: the sought model must be both psychologically and neuroscienti cally plausible. Building on top of the mathematical model of memory called Sparse Distributed Memory, we will see how some well-known methods in cryptography can point toward a promising, comprehensive, solution that preserves four crucial properties of human psychology.
Resumo:
The subject of this thesis is the n-tuple net.work (RAMnet). The major advantage of RAMnets is their speed and the simplicity with which they can be implemented in parallel hardware. On the other hand, this method is not a universal approximator and the training procedure does not involve the minimisation of a cost function. Hence RAMnets are potentially sub-optimal. It is important to understand the source of this sub-optimality and to develop the analytical tools that allow us to quantify the generalisation cost of using this model for any given data. We view RAMnets as classifiers and function approximators and try to determine how critical their lack of' universality and optimality is. In order to understand better the inherent. restrictions of the model, we review RAMnets showing their relationship to a number of well established general models such as: Associative Memories, Kamerva's Sparse Distributed Memory, Radial Basis Functions, General Regression Networks and Bayesian Classifiers. We then benchmark binary RAMnet. model against 23 other algorithms using real-world data from the StatLog Project. This large scale experimental study indicates that RAMnets are often capable of delivering results which are competitive with those obtained by more sophisticated, computationally expensive rnodels. The Frequency Weighted version is also benchmarked and shown to perform worse than the binary RAMnet for large values of the tuple size n. We demonstrate that the main issues in the Frequency Weighted RAMnets is adequate probability estimation and propose Good-Turing estimates in place of the more commonly used :Maximum Likelihood estimates. Having established the viability of the method numerically, we focus on providillg an analytical framework that allows us to quantify the generalisation cost of RAMnets for a given datasetL. For the classification network we provide a semi-quantitative argument which is based on the notion of Tuple distance. It gives a good indication of whether the network will fail for the given data. A rigorous Bayesian framework with Gaussian process prior assumptions is given for the regression n-tuple net. We show how to calculate the generalisation cost of this net and verify the results numerically for one dimensional noisy interpolation problems. We conclude that the n-tuple method of classification based on memorisation of random features can be a powerful alternative to slower cost driven models. The speed of the method is at the expense of its optimality. RAMnets will fail for certain datasets but the cases when they do so are relatively easy to determine with the analytical tools we provide.
Resumo:
Dissertação para obtenção do Grau de Mestre em Engenharia Informática
Resumo:
Résumé: L'impact de la maladie d'Alzheimer (MA) est dévastateur pour la vie quotidienne de la personne affectée, avec perte progressive de la mémoire et d'autres facultés cognitives jusqu'à la démence. Il n'existe toujours pas de traitement contre cette maladie et il y a aussi une grande incertitude sur le diagnostic des premiers stades de la MA. La signature anatomique de la MA, en particulier l'atrophie du lobe temporal moyen (LTM) mesurée avec la neuroimagerie, peut être utilisée comme un biomarqueur précoce, in vivo, des premiers stades de la MA. Toutefois, malgré le rôle évident du LMT dans les processus de la mémoire, nous savons que les modèles anatomiques prédictifs de la MA basés seulement sur des mesures d'atrophie du LTM n'expliquent pas tous les cas cliniques. Au cours de ma thèse, j'ai conduit trois projets pour comprendre l'anatomie et le fonctionnement du LMT dans (1) les processus de la maladie et dans (2) les processus de mémoire ainsi que (3) ceux de l'apprentissage. Je me suis intéressée à une population avec déficit cognitif léger (« Mild Cognitive Impairment », MCI), à risque pour la MA. Le but du premier projet était de tester l'hypothèse que des facteurs, autres que ceux cognitifs, tels que les traits de personnalité peuvent expliquer les différences interindividuelles dans le LTM. De plus, la diversité phénotypique des manifestations précliniques de la MA provient aussi d'une connaissance limitée des processus de mémoire et d'apprentissage dans le cerveau sain. L'objectif du deuxième projet porte sur l'investigation des sous-régions du LTM, et plus particulièrement de leur contribution dans différentes composantes de la mémoire de reconnaissance chez le sujet sain. Pour étudier cela, j'ai utilisé une nouvelle méthode multivariée ainsi que l'IRM à haute résolution pour tester la contribution de ces sous-régions dans les processus de familiarité (« ou Know ») et de remémoration (ou « Recollection »). Finalement, l'objectif du troisième projet était de tester la contribution du LTM en tant que système de mémoire dans l'apprentissage et l'interaction dynamique entre différents systèmes de mémoire durant l'apprentissage. Les résultats du premier projet montrent que, en plus du déficit cognitif observé dans une population avec MCI, les traits de personnalité peuvent expliquer les différences interindividuelles du LTM ; notamment avec une plus grande contribution du neuroticisme liée à une vulnérabilité au stress et à la dépression. Mon étude a permis d'identifier un pattern d'anormalité anatomique dans le LTM associé à la personnalité avec des mesures de volume et de diffusion moyenne du tissu. Ce pattern est caractérisé par une asymétrie droite-gauche du LTM et un gradient antéro-postérieur dans le LTM. J'ai interprété ce résultat par des propriétés tissulaires et neurochimiques différemment sensibles au stress. Les résultats de mon deuxième projet ont contribué au débat actuel sur la contribution des sous-régions du LTM dans les processus de familiarité et de remémoration. Utilisant une nouvelle méthode multivariée, les résultats supportent premièrement une dissociation des sous-régions associées aux différentes composantes de la mémoire. L'hippocampe est le plus associé à la mémoire de type remémoration et le cortex parahippocampique, à la mémoire de type familiarité. Deuxièmement, l'activation correspondant à la trace mnésique pour chaque type de mémoire est caractérisée par une distribution spatiale distincte. La représentation neuronale spécifique, « sparse-distributed», associée à la mémoire de remémoration dans l'hippocampe serait la meilleure manière d'encoder rapidement des souvenirs détaillés sans interférer les souvenirs précédemment stockés. Dans mon troisième projet, j'ai mis en place une tâche d'apprentissage en IRM fonctionnelle pour étudier les processus d'apprentissage d'associations probabilistes basé sur le feedback/récompense. Cette étude m'a permis de mettre en évidence le rôle du LTM dans l'apprentissage et l'interaction entre différents systèmes de mémoire comme la mémoire procédurale, perceptuelle ou d'amorçage et la mémoire de travail. Nous avons trouvé des activations dans le LTM correspondant à un processus de mémoire épisodique; les ganglions de la base (GB), à la mémoire procédurale et la récompense; le cortex occipito-temporal (OT), à la mémoire de représentation perceptive ou l'amorçage et le cortex préfrontal, à la mémoire de travail. Nous avons également observé que ces régions peuvent interagir; le type de relation entre le LTM et les GB a été interprété comme une compétition, ce qui a déjà été reporté dans des études récentes. De plus, avec un modèle dynamique causal, j'ai démontré l'existence d'une connectivité effective entre des régions. Elle se caractérise par une influence causale de type « top-down » venant de régions corticales associées avec des processus de plus haut niveau venant du cortex préfrontal sur des régions corticales plus primaires comme le OT cortex. Cette influence diminue au cours du de l'apprentissage; cela pourrait correspondre à un mécanisme de diminution de l'erreur de prédiction. Mon interprétation est que cela est à l'origine de la connaissance sémantique. J'ai également montré que les choix du sujet et l'activation cérébrale associée sont influencés par les traits de personnalité et des états affectifs négatifs. Les résultats de cette thèse m'ont amenée à proposer (1) un modèle expliquant les mécanismes possibles liés à l'influence de la personnalité sur le LTM dans une population avec MCI, (2) une dissociation des sous-régions du LTM dans différents types de mémoire et une représentation neuronale spécifique à ces régions. Cela pourrait être une piste pour résoudre les débats actuels sur la mémoire de reconnaissance. Finalement, (3) le LTM est aussi un système de mémoire impliqué dans l'apprentissage et qui peut interagir avec les GB par une compétition. Nous avons aussi mis en évidence une interaction dynamique de type « top -down » et « bottom-up » entre le cortex préfrontal et le cortex OT. En conclusion, les résultats peuvent donner des indices afin de mieux comprendre certains dysfonctionnements de la mémoire liés à l'âge et la maladie d'Alzheimer ainsi qu'à améliorer le développement de traitement. Abstract: The impact of Alzheimer's disease is devastating for the daily life of the affected patients, with progressive loss of memory and other cognitive skills until dementia. We still lack disease modifying treatment and there is also a great amount of uncertainty regarding the accuracy of diagnostic classification in the early stages of AD. The anatomical signature of AD, in particular the medial temporal lobe (MTL) atrophy measured with neuroimaging, can be used as an early in vivo biomarker in early stages of AD. However, despite the evident role of MTL in memory, we know that the derived predictive anatomical model based only on measures of brain atrophy in MTL does not explain all clinical cases. Throughout my thesis, I have conducted three projects to understand the anatomy and the functioning of MTL on (1) disease's progression, (2) memory process and (3) learning process. I was interested in a population with mild cognitive impairment (MCI), at risk for AD. The objective of the first project was to test the hypothesis that factors, other than the cognitive ones, such as the personality traits, can explain inter-individual differences in the MTL. Moreover, the phenotypic diversity in the manifestations of preclinical AD arises also from the limited knowledge of memory and learning processes in healthy brain. The objective of the second project concerns the investigation of sub-regions of the MTL, and more particularly their contributions in the different components of recognition memory in healthy subjects. To study that, I have used a new multivariate method as well as MRI at high resolution to test the contribution of those sub-regions in the processes of familiarity and recollection. Finally, the objective of the third project was to test the contribution of the MTL as a memory system in learning and the dynamic interaction between memory systems during learning. The results of the first project show that, beyond cognitive state of impairment observed in the population with MCI, the personality traits can explain the inter-individual differences in the MTL; notably with a higher contribution of neuroticism linked to proneness to stress and depression. My study has allowed identifying a pattern of anatomical abnormality in the MTL related to personality with measures of volume and mean diffusion of the tissue. That pattern is characterized by right-left asymmetry in MTL and an anterior to posterior gradient within MTL. I have interpreted that result by tissue and neurochemical properties differently sensitive to stress. Results of my second project have contributed to the actual debate on the contribution of MTL sub-regions in the processes of familiarity and recollection. Using a new multivariate method, the results support firstly a dissociation of the subregions associated with different memory components. The hippocampus was mostly associated with recollection and the surrounding parahippocampal cortex, with familiarity type of memory. Secondly, the activation corresponding to the mensic trace for each type of memory is characterized by a distinct spatial distribution. The specific neuronal representation, "sparse-distributed", associated with recollection in the hippocampus would be the best way to rapidly encode detailed memories without overwriting previously stored memories. In the third project, I have created a learning task with functional MRI to sudy the processes of learning of probabilistic associations based on feedback/reward. That study allowed me to highlight the role of the MTL in learning and the interaction between different memory systems such as the procedural memory, the perceptual memory or priming and the working memory. We have found activations in the MTL corresponding to a process of episodic memory; the basal ganglia (BG), to a procedural memory and reward; the occipito-temporal (OT) cortex, to a perceptive memory or priming and the prefrontal cortex, to working memory. We have also observed that those regions can interact; the relation type between the MTL and the BG has been interpreted as a competition. In addition, with a dynamic causal model, I have demonstrated a "top-down" influence from cortical regions associated with high level cortical area such as the prefrontal cortex on lower level cortical regions such as the OT cortex. That influence decreases during learning; that could correspond to a mechanism linked to a diminution of prediction error. My interpretation is that this is at the origin of the semantic knowledge. I have also shown that the subject's choice and the associated brain activation are influenced by personality traits and negative affects. Overall results of this thesis have brought me to propose (1) a model explaining the possible mechanism linked to the influence of personality on the MTL in a population with MCI, (2) a dissociation of MTL sub-regions in different memory types and a neuronal representation specific to each region. This could be a cue to resolve the actual debates on recognition memory. Finally, (3) the MTL is also a system involved in learning and that can interact with the BG by a competition. We have also shown a dynamic interaction of « top -down » and « bottom-up » types between the pre-frontal cortex and the OT cortex. In conclusion, the results could give cues to better understand some memory dysfunctions in aging and Alzheimer's disease and to improve development of treatment.
Resumo:
Numerical methods related to Krylov subspaces are widely used in large sparse numerical linear algebra. Vectors in these subspaces are manipulated via their representation onto orthonormal bases. Nowadays, on serial computers, the method of Arnoldi is considered as a reliable technique for constructing such bases. However, although easily parallelizable, this technique is not as scalable as expected for communications. In this work we examine alternative methods aimed at overcoming this drawback. Since they retrieve upon completion the same information as Arnoldi's algorithm does, they enable us to design a wide family of stable and scalable Krylov approximation methods for various parallel environments. We present timing results obtained from their implementation on two distributed-memory multiprocessor supercomputers: the Intel Paragon and the IBM Scalable POWERparallel SP2. (C) 1997 by John Wiley & Sons, Ltd.
Resumo:
The past few decades have seen a considerable increase in the number of parallel and distributed systems. With the development of more complex applications, the need for more powerful systems has emerged and various parallel and distributed environments have been designed and implemented. Each of the environments, including hardware and software, has unique strengths and weaknesses. There is no single parallel environment that can be identified as the best environment for all applications with respect to hardware and software properties. The main goal of this thesis is to provide a novel way of performing data-parallel computation in parallel and distributed environments by utilizing the best characteristics of difference aspects of parallel computing. For the purpose of this thesis, three aspects of parallel computing were identified and studied. First, three parallel environments (shared memory, distributed memory, and a network of workstations) are evaluated to quantify theirsuitability for different parallel applications. Due to the parallel and distributed nature of the environments, networks connecting the processors in these environments were investigated with respect to their performance characteristics. Second, scheduling algorithms are studied in order to make them more efficient and effective. A concept of application-specific information scheduling is introduced. The application- specific information is data about the workload extractedfrom an application, which is provided to a scheduling algorithm. Three scheduling algorithms are enhanced to utilize the application-specific information to further refine their scheduling properties. A more accurate description of the workload is especially important in cases where the workunits are heterogeneous and the parallel environment is heterogeneous and/or non-dedicated. The results obtained show that the additional information regarding the workload has a positive impact on the performance of applications. Third, a programming paradigm for networks of symmetric multiprocessor (SMP) workstations is introduced. The MPIT programming paradigm incorporates the Message Passing Interface (MPI) with threads to provide a methodology to write parallel applications that efficiently utilize the available resources and minimize the overhead. The MPIT allows for communication and computation to overlap by deploying a dedicated thread for communication. Furthermore, the programming paradigm implements an application-specific scheduling algorithm. The scheduling algorithm is executed by the communication thread. Thus, the scheduling does not affect the execution of the parallel application. Performance results achieved from the MPIT show that considerable improvements over conventional MPI applications are achieved.
Resumo:
Traditional psychometric theory and practice classify people according to broad ability dimensions but do not examine how these mental processes occur. Hunt and Lansman (1975) proposed a 'distributed memory' model of cognitive processes with emphasis on how to describe individual differences based on the assumption that each individual possesses the same components. It is in the quality of these components ~hat individual differences arise. Carroll (1974) expands Hunt's model to include a production system (after Newell and Simon, 1973) and a response system. He developed a framework of factor analytic (FA) factors for : the purpose of describing how individual differences may arise from them. This scheme is to be used in the analysis of psychometric tes ts . Recent advances in the field of information processing are examined and include. 1) Hunt's development of differences between subjects designated as high or low verbal , 2) Miller's pursuit of the magic number seven, plus or minus two, 3) Ferguson's examination of transfer and abilities and, 4) Brown's discoveries concerning strategy teaching and retardates . In order to examine possible sources of individual differences arising from cognitive tasks, traditional psychometric tests were searched for a suitable perceptual task which could be varied slightly and administered to gauge learning effects produced by controlling independent variables. It also had to be suitable for analysis using Carroll's f ramework . The Coding Task (a symbol substitution test) found i n the Performance Scale of the WISe was chosen. Two experiments were devised to test the following hypotheses. 1) High verbals should be able to complete significantly more items on the Symbol Substitution Task than low verbals (Hunt, Lansman, 1975). 2) Having previous practice on a task, where strategies involved in the task may be identified, increases the amount of output on a similar task (Carroll, 1974). J) There should be a sUbstantial decrease in the amount of output as the load on STM is increased (Miller, 1956) . 4) Repeated measures should produce an increase in output over trials and where individual differences in previously acquired abilities are involved, these should differentiate individuals over trials (Ferguson, 1956). S) Teaching slow learners a rehearsal strategy would improve their learning such that their learning would resemble that of normals on the ,:same task. (Brown, 1974). In the first experiment 60 subjects were d.ivided·into high and low verbal, further divided randomly into a practice group and nonpractice group. Five subjects in each group were assigned randomly to work on a five, seven and nine digit code throughout the experiment. The practice group was given three trials of two minutes each on the practice code (designed to eliminate transfer effects due to symbol similarity) and then three trials of two minutes each on the actual SST task . The nonpractice group was given three trials of two minutes each on the same actual SST task . Results were analyzed using a four-way analysis of variance . In the second experiment 18 slow learners were divided randomly into two groups. one group receiving a planned strategy practioe, the other receiving random practice. Both groups worked on the actual code to be used later in the actual task. Within each group subjects were randomly assigned to work on a five, seven or nine digit code throughout. Both practice and actual tests consisted on three trials of two minutes each. Results were analyzed using a three-way analysis of variance . It was found in t he first experiment that 1) high or low verbal ability by itself did not produce significantly different results. However, when in interaction with the other independent variables, a difference in performance was noted . 2) The previous practice variable was significant over all segments of the experiment. Those who received previo.us practice were able to score significantly higher than those without it. J) Increasing the size of the load on STM severely restricts performance. 4) The effect of repeated trials proved to be beneficial. Generally, gains were made on each successive trial within each group. S) In the second experiment, slow learners who were allowed to practice randomly performed better on the actual task than subjeots who were taught the code by means of a planned strategy. Upon analysis using the Carroll scheme, individual differences were noted in the ability to develop strategies of storing, searching and retrieving items from STM, and in adopting necessary rehearsals for retention in STM. While these strategies may benef it some it was found that for others they may be harmful . Temporal aspects and perceptual speed were also found to be sources of variance within individuals . Generally it was found that the largest single factor i nfluencing learning on this task was the repeated measures . What e~ables gains to be made, varies with individuals . There are environmental factors, specific abilities, strategy development, previous learning, amount of load on STM , perceptual and temporal parameters which influence learning and these have serious implications for educational programs .
Resumo:
Cache-coherent non uniform memory access (ccNUMA) architecture is a standard design pattern for contemporary multicore processors, and future generations of architectures are likely to be NUMA. NUMA architectures create new challenges for managed runtime systems. Memory-intensive applications use the system’s distributed memory banks to allocate data, and the automatic memory manager collects garbage left in these memory banks. The garbage collector may need to access remote memory banks, which entails access latency overhead and potential bandwidth saturation for the interconnection between memory banks. This dissertation makes five significant contributions to garbage collection on NUMA systems, with a case study implementation using the Hotspot Java Virtual Machine. It empirically studies data locality for a Stop-The-World garbage collector when tracing connected objects in NUMA heaps. First, it identifies a locality richness which exists naturally in connected objects that contain a root object and its reachable set— ‘rooted sub-graphs’. Second, this dissertation leverages the locality characteristic of rooted sub-graphs to develop a new NUMA-aware garbage collection mechanism. A garbage collector thread processes a local root and its reachable set, which is likely to have a large number of objects in the same NUMA node. Third, a garbage collector thread steals references from sibling threads that run on the same NUMA node to improve data locality. This research evaluates the new NUMA-aware garbage collector using seven benchmarks of an established real-world DaCapo benchmark suite. In addition, evaluation involves a widely used SPECjbb benchmark and Neo4J graph database Java benchmark, as well as an artificial benchmark. The results of the NUMA-aware garbage collector on a multi-hop NUMA architecture show an average of 15% performance improvement. Furthermore, this performance gain is shown to be as a result of an improved NUMA memory access in a ccNUMA system. Fourth, the existing Hotspot JVM adaptive policy for configuring the number of garbage collection threads is shown to be suboptimal for current NUMA machines. The policy uses outdated assumptions and it generates a constant thread count. In fact, the Hotspot JVM still uses this policy in the production version. This research shows that the optimal number of garbage collection threads is application-specific and configuring the optimal number of garbage collection threads yields better collection throughput than the default policy. Fifth, this dissertation designs and implements a runtime technique, which involves heuristics from dynamic collection behavior to calculate an optimal number of garbage collector threads for each collection cycle. The results show an average of 21% improvements to the garbage collection performance for DaCapo benchmarks.
Resumo:
In this and a preceding paper, we provide an introduction to the Fujitsu VPP range of vector-parallel supercomputers and to some of the computational chemistry software available for the VPP. Here, we consider the implementation and performance of seven popular chemistry application packages. The codes discussed range from classical molecular dynamics to semiempirical and ab initio quantum chemistry. All have evolved from sequential codes, and have typically been parallelised using a replicated data approach. As such they are well suited to the large-memory/fast-processor architecture of the VPP. For one code, CASTEP, a distributed-memory data-driven parallelisation scheme is presented. (C) 2000 Published by Elsevier Science B.V. All rights reserved.