222 results for multicore


Relevance:

10.00% 10.00%

Publisher:

Abstract:

Fueled by an increasing human appetite for high computing performance, semiconductor technology has now marched into the deep sub-micron era. As transistor size keeps shrinking, more and more transistors are integrated into a single chip. This has tremendously increased the power consumption and heat generation of IC chips. The rapidly growing heat dissipation greatly increases packaging and cooling costs, and adversely affects the performance and reliability of a computing system. In addition, it reduces the processor's life span and may even crash the entire computing system. Therefore, dynamic thermal management (DTM) is becoming a critical problem in modern computer system design. Extensive theoretical research has been conducted to study the DTM problem. However, most of these studies are based on theoretically idealized assumptions or simplified models. While these models and assumptions help to greatly simplify a complex problem and make it theoretically manageable, real computer systems and applications must deal with many practical factors and details beyond these models or assumptions. The goal of our research was to develop a test platform that can be used to validate theoretical results on DTM under well-controlled conditions, to identify the limitations of existing theoretical results, and to develop new and practical DTM techniques. This dissertation details the background and our research efforts in this endeavor. Specifically, we first developed a customized test platform based on an Intel desktop. We then tested a number of related theoretical works and examined their limitations in a practical hardware environment. With these limitations in mind, we developed a new reactive thermal management algorithm for single-core computing systems to optimize throughput under a peak temperature constraint.
We further extended our research to a multicore platform and developed an effective proactive DTM technique for throughput maximization on multicore processors based on task migration and dynamic voltage and frequency scaling (DVFS). The significance of our research lies in the fact that it complements the current extensive theoretical research in dealing with increasingly critical thermal problems and in enabling the continuous evolution of high-performance computing systems.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

We attempt a reconstruction of salinity levels of the central Baltic Sea based on diatom assemblages, the isotopic composition of organic matter and sedimentological expression of anoxia over the last 10 000 years. We use the data to investigate the dependence of salinity levels on climate evolution and isostasy. Changes in salinity of surface and deep waters were most pronounced from 8400 to approximately 5000 cal. BP. Density stratification between salty deep and fresher surface waters caused the frequent development of anoxic conditions and deposition of laminated sediments on large parts of the sea floor in the central Baltic Sea, and dramatic changes in organic carbon-accumulation rates. From 5000 to 3100 cal. BP, the salinity of the basin decreased, oxygenation of deep sea floors was improved, and fertility of the sea surface was significantly reduced. This is reflected by low accumulation rates of organic carbon in bioturbated sediments. Since 2800 cal. BP, salinity rose again and anoxic periods were more common. Even though the major steps in environmental evolution in the Baltic Sea coincide with known patterns of climatic change of the North Atlantic realm over the last 10 000 years, we find no conclusive evidence for synchronous changes or linear responses on submillennial timescales. However, we note that major variations in our salinity records agree with temporal patterns of reconstructed summer warmth and winter precipitation in southern Scandinavia. Both types of record suggest that climate in the mid-Holocene was far from stable. Our data also confirm that climate evolution over the late Holocene had significant impact on environmental conditions in the Baltic Sea.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

We analyze five high-resolution time series spanning the last 1.65 m.y.: benthic foraminiferal delta18O and delta13C, percent CaCO3, and estimated sea surface temperature (SST) at North Atlantic Deep Sea Drilling Project site 607 and percent CaCO3 at site 609. Each record is a multicore composite verified for continuity by splicing among multiple holes. These climatic indices portray changes in northern hemisphere ice sheet size and in North Atlantic surface and deep circulation. By tuning obliquity and precession components in the delta18O record to orbital variations, we have devised a time scale (TP607) for the entire Pleistocene that agrees in age with all K/Ar-dated magnetic reversals to within 1.5%. The Brunhes time scale is taken from Imbrie et al. [1984], except for differences near the stage 17/16 transition (0.70 to 0.64 Ma). All indicators show a similar evolution from the Matuyama to the Brunhes chrons: orbital eccentricity and precession responses increased in amplitude; those at orbital obliquity decreased. The change in dominance from obliquity to eccentricity occurred over several hundred thousand years, with the fastest changes around 0.7 to 0.6 Ma. The coherent, in-phase responses of delta18O, delta13C, CaCO3 and SST at these rhythms indicate that northern hemisphere ice volume changes have controlled most of the North Atlantic surface-ocean and deep-ocean responses for the last 1.6 m.y. The delta13C, percent CaCO3, and SST records at site 607 also show prominent changes at low frequencies, including a prominent long-wavelength oscillation toward glacial conditions that is centered between 0.9 and 0.6 Ma. These changes appear to be associated neither with orbital forcing nor with changes in ice volume.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

A scenario-based two-stage stochastic programming model for gas production network planning under uncertainty is usually a large-scale nonconvex mixed-integer nonlinear programme (MINLP), which can be efficiently solved to global optimality with nonconvex generalized Benders decomposition (NGBD). This paper is concerned with the parallelization of NGBD to exploit multiple available computing resources. Three parallelization strategies are proposed, namely, naive scenario parallelization, adaptive scenario parallelization, and adaptive scenario and bounding parallelization. A case study of two industrial natural gas production network planning problems shows that, while NGBD without parallelization is already faster than a state-of-the-art global optimization solver by an order of magnitude, parallelization can improve efficiency several-fold on computers with multicore processors. The adaptive scenario and bounding parallelization achieves the best overall performance among the three proposed parallelization strategies.
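As a rough illustration of why scenario parallelization is possible at all: in a two-stage stochastic programme, once the first-stage decision is fixed, the per-scenario subproblems no longer share variables and can be solved independently. The sketch below is not NGBD and not the paper's implementation. The function `solve_scenario`, its toy cost model, and the penalty values are hypothetical stand-ins for a real subproblem solver.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Sequence, Tuple


def solve_scenario(first_stage: float, demand: float) -> float:
    """Hypothetical toy recourse cost for one scenario (assumed penalties)."""
    shortfall = max(demand - first_stage, 0.0)  # unmet demand
    surplus = max(first_stage - demand, 0.0)    # unused capacity
    return 3.0 * shortfall + 1.0 * surplus


def expected_recourse_cost(
    first_stage: float,
    scenarios: Sequence[Tuple[float, float]],
    workers: int = 4,
) -> float:
    """Solve every scenario subproblem concurrently, then aggregate.

    `scenarios` holds (probability, demand) pairs. With the first-stage
    decision fixed, each subproblem depends only on its own scenario data.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        costs = list(
            pool.map(lambda s: solve_scenario(first_stage, s[1]), scenarios)
        )
    # Probability-weighted sum of the independent subproblem results.
    return sum(prob * cost for (prob, _), cost in zip(scenarios, costs))


if __name__ == "__main__":
    print(expected_recourse_cost(10.0, [(0.5, 8.0), (0.5, 12.0)]))
```

Threads are used here only to keep the sketch short. For CPU-bound subproblems in pure Python, a process pool or a solver that releases the interpreter lock would be needed to see real multicore speedup.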

Relevance:

10.00% 10.00%

Publisher:

Abstract:

In the highly competitive world of modern finance, new derivatives are continually required to take advantage of changes in financial markets, and to hedge businesses against new risks. The research described in this paper aims to accelerate the development and pricing of new derivatives in two different ways. Firstly, new derivatives can be specified mathematically within a general framework, enabling new mathematical formulae to be specified rather than just new parameter settings. This Generic Pricing Engine (GPE) is expressively powerful enough to specify a wide range of standard pricing engines. Secondly, the associated price simulation using the Monte Carlo method is accelerated using GPU or multicore hardware. The parallel implementation (in OpenCL) is automatically derived from the mathematical description of the derivative. As a test, for a Basket Option Pricing Engine (BOPE) generated using the GPE, on the largest problem size, an NVidia GPU runs the generated pricing engine at 45 times the speed of a sequential, specific hand-coded implementation of the same BOPE. Thus a user can more rapidly devise, simulate and experiment with new derivatives without actual programming.
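To make the Monte Carlo step concrete, here is a minimal sequential sketch of pricing a European basket call. It is not the GPE or the generated OpenCL code. It assumes uncorrelated assets following geometric Brownian motion, and the function name `basket_call_price` and all parameter names are illustrative. Each simulated path is independent of the others, which is what makes the method a natural fit for GPU or multicore execution.

```python
import math
import random
from typing import Sequence


def basket_call_price(
    spots: Sequence[float],
    weights: Sequence[float],
    vols: Sequence[float],
    rate: float,
    maturity: float,
    strike: float,
    n_paths: int,
    seed: int = 0,
) -> float:
    """Monte Carlo price of a European call on a weighted basket.

    Assumption: assets are uncorrelated and follow geometric Brownian
    motion, so only the terminal price of each asset is sampled.
    """
    rng = random.Random(seed)
    sqrt_t = math.sqrt(maturity)
    payoff_sum = 0.0
    for _ in range(n_paths):  # every path is independent of the others
        basket = 0.0
        for s0, w, sigma in zip(spots, weights, vols):
            z = rng.gauss(0.0, 1.0)
            drift = (rate - 0.5 * sigma * sigma) * maturity
            basket += w * s0 * math.exp(drift + sigma * sqrt_t * z)
        payoff_sum += max(basket - strike, 0.0)
    # Discount the average payoff back to today.
    return math.exp(-rate * maturity) * payoff_sum / n_paths


if __name__ == "__main__":
    price = basket_call_price(
        spots=[100.0, 100.0], weights=[0.5, 0.5], vols=[0.2, 0.3],
        rate=0.01, maturity=1.0, strike=100.0, n_paths=20000,
    )
    print(f"estimated basket call price: {price:.4f}")
```

A parallel version would split the path loop across workers, give each its own random stream, and sum the partial payoff totals at the end.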

Relevance:

10.00% 10.00%

Publisher:

Abstract:

In this paper we advocate the Loop-of-stencil-reduce pattern as a way to simplify the parallel programming of heterogeneous platforms (multicore+GPUs). Loop-of-stencil-reduce is general enough to subsume map, reduce, map-reduce, stencil, stencil-reduce, and, crucially, their usage in a loop. It transparently targets (by using OpenCL) combinations of CPU cores and GPUs, and it makes it possible to simplify the deployment of a single stencil computation kernel on different GPUs. The paper discusses the implementation of Loop-of-stencil-reduce within the FastFlow parallel framework, considering a simple iterative data-parallel application as a running example (Game of Life) and a highly effective parallel filter for visual data restoration to assess performance. Thanks to the high-level design of Loop-of-stencil-reduce, it was possible to run the filter seamlessly on a multicore machine, on multi-GPUs, and on both.
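The shape of the pattern can be sketched sequentially with the Game of Life example mentioned above. This is not the FastFlow or OpenCL implementation. The names `loop_of_stencil_reduce`, `life_stencil`, and `live_cells` are illustrative, and the toroidal boundary is an assumption made to keep the kernel short.

```python
from typing import Callable, List, Tuple

Grid = List[List[int]]


def life_stencil(grid: Grid, r: int, c: int) -> int:
    """Stencil kernel: next state of cell (r, c) from its 8 neighbours.

    Assumption: the grid wraps around at the edges (toroidal boundary).
    """
    rows, cols = len(grid), len(grid[0])
    neighbours = sum(
        grid[(r + dr) % rows][(c + dc) % cols]
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    alive = grid[r][c] == 1
    return 1 if neighbours == 3 or (alive and neighbours == 2) else 0


def live_cells(grid: Grid) -> int:
    """Reduce step: collapse the whole grid to one scalar (population)."""
    return sum(sum(row) for row in grid)


def loop_of_stencil_reduce(
    grid: Grid,
    stencil: Callable[[Grid, int, int], int],
    reduce_op: Callable[[Grid], int],
    keep_going: Callable[[int, int], bool],
    max_iters: int,
) -> Tuple[Grid, int]:
    """Repeat: stencil over every cell, reduce, then test the loop condition."""
    reduced = reduce_op(grid)
    for iteration in range(max_iters):
        # Each cell update reads only the old grid, so the cell updates
        # are independent and could run in parallel.
        grid = [
            [stencil(grid, r, c) for c in range(len(grid[0]))]
            for r in range(len(grid))
        ]
        reduced = reduce_op(grid)
        if not keep_going(reduced, iteration):
            break
    return grid, reduced


if __name__ == "__main__":
    blinker = [
        [0, 0, 0, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0],
    ]
    final, population = loop_of_stencil_reduce(
        blinker, life_stencil, live_cells, lambda pop, i: pop > 0, 4
    )
    print(population, final)
```

Swapping the stencil for an identity-neighbourhood function gives a plain map, and running a single iteration gives stencil-reduce, which is the sense in which the looped form subsumes the simpler patterns.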

Relevance:

10.00% 10.00%

Publisher:

Abstract:

In this paper, we develop a fast implementation of a hyperspectral coded aperture (HYCA) algorithm on different platforms using OpenCL, an open standard for parallel programming on heterogeneous systems, which include a wide variety of devices, from dense multicore systems from major manufacturers such as Intel or ARM to new accelerators such as graphics processing units (GPUs), field programmable gate arrays (FPGAs), the Intel Xeon Phi and other custom devices. Our proposed implementation of HYCA significantly reduces its computational cost. Our experiments have been conducted using simulated data and reveal considerable acceleration factors. Implementations of this kind, which use the same descriptive language on different architectures, are very important for properly assessing the possibility of using heterogeneous platforms for efficient hyperspectral image processing in real remote sensing missions.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

The exponential growth in demand for communication bandwidth suggests that the capacity of telecommunication networks will soon saturate, a limit expected to be reached within the next decade. Indeed, information theory predicts that nonlinear effects in single-mode fibres limit their transmission capacity, and little further gain can be expected from the traditional multiplexing techniques developed and used so far in high-speed systems. The spatial dimension of the optical channel is proposed as a new degree of freedom that can be used to increase the number of transmission channels and thereby address this looming "capacity crunch". Thus, inspired by microwave techniques, the emerging technique known as space-division multiplexing (SDM) is a promising technology for building next-generation optical networks. Realizing SDM in optical fibre links requires re-examining all integrated devices, equipment, and subsystems. Among these elements, the SDM optical amplifier is critical, particularly for long-haul transmission systems. Owing to the excellent characteristics of the erbium-doped fibre amplifier (EDFA) used in current state-of-the-art systems, the EDFA is once again a prime candidate for implementing practical SDM amplifiers. However, because SDM introduces a spatial variation of the field in the transverse plane of the fibre, spatially integrated erbium-doped fibre amplifiers (SIEDFAs) require careful design. In this thesis, we first review recent progress in SDM, in particular SDM optical amplifiers. We then identify and discuss the main SIEDFA issues that require scientific investigation.
Next, EDFA theory is briefly presented and a numerical model that can be used to simulate SIEDFAs is proposed. Using an in-house simulation tool, we propose a new design for the annular doping profiles of erbium-doped few-mode fibres (ED-FMFs) and numerically evaluate the performance of a single-stage amplifier with annularly doped fibre, as well as a dual-stage amplifier, for communications over few-mode fibres. Subsequently, we design erbium-doped multicore fibres with an annular cladding (ED-MCFs). We numerically evaluate the overlap of the pump with the multiple cores of these amplifiers. In addition to the design work, we fabricate and characterize an erbium-doped few-mode multicore fibre. We carry out the first demonstration of spatially integrated optical fibre amplifiers incorporating such doped fibres. Finally, we present the conclusions and perspectives of this research. The research and development of SIEDFAs will offer enormous benefits not only for future SDM transmission systems, but also for single-mode transmission systems over standard single-core fibres, since they allow several amplifiers to be replaced by one integrated amplifier.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

Scientific applications rely heavily on floating point data types. Floating point operations are complex and require complicated hardware that is both area and power intensive. The emergence of massively parallel architectures like Rigel creates new challenges and poses new questions with respect to floating point support. The massively parallel aspect of Rigel places great emphasis on area efficient, low power designs. At the same time, Rigel is a general purpose accelerator and must provide high performance for a wide class of applications. This thesis presents an analysis of various floating point unit (FPU) components with respect to Rigel, and attempts to present a candidate design of an FPU that balances performance, area, and power and is suitable for massively parallel architectures like Rigel.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

Cache-coherent non-uniform memory access (ccNUMA) architecture is a standard design pattern for contemporary multicore processors, and future generations of architectures are likely to be NUMA. NUMA architectures create new challenges for managed runtime systems. Memory-intensive applications use the system's distributed memory banks to allocate data, and the automatic memory manager collects garbage left in these memory banks. The garbage collector may need to access remote memory banks, which entails access latency overhead and potential bandwidth saturation for the interconnection between memory banks. This dissertation makes five significant contributions to garbage collection on NUMA systems, with a case study implementation using the Hotspot Java Virtual Machine. It empirically studies data locality for a Stop-The-World garbage collector when tracing connected objects in NUMA heaps. First, it identifies a locality richness which exists naturally in connected objects that contain a root object and its reachable set, termed 'rooted sub-graphs'. Second, this dissertation leverages the locality characteristic of rooted sub-graphs to develop a new NUMA-aware garbage collection mechanism. A garbage collector thread processes a local root and its reachable set, which is likely to have a large number of objects in the same NUMA node. Third, a garbage collector thread steals references from sibling threads that run on the same NUMA node to improve data locality. This research evaluates the new NUMA-aware garbage collector using seven benchmarks of the established real-world DaCapo benchmark suite. In addition, the evaluation involves the widely used SPECjbb benchmark and the Neo4J graph database Java benchmark, as well as an artificial benchmark. The results of the NUMA-aware garbage collector on a multi-hop NUMA architecture show an average performance improvement of 15%.
Furthermore, this performance gain is shown to result from improved NUMA memory access in a ccNUMA system. Fourth, the existing Hotspot JVM adaptive policy for configuring the number of garbage collection threads is shown to be suboptimal for current NUMA machines. The policy uses outdated assumptions and generates a constant thread count. In fact, the Hotspot JVM still uses this policy in the production version. This research shows that the optimal number of garbage collection threads is application-specific and that configuring the optimal number of garbage collection threads yields better collection throughput than the default policy. Fifth, this dissertation designs and implements a runtime technique that uses heuristics derived from dynamic collection behavior to calculate an optimal number of garbage collector threads for each collection cycle. The results show an average improvement of 21% in garbage collection performance for the DaCapo benchmarks.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

Due to the growth of design size and complexity, design verification is an important aspect of the logic circuit development process. The purpose of verification is to validate that the design meets the system requirements and specification. This is done by either functional or formal verification. The most popular approach to functional verification is the use of simulation-based techniques. Using models to replicate the behaviour of an actual system is called simulation. In this thesis, a software/data structure architecture without explicit locks is proposed to accelerate logic gate circuit simulation. We call this system ZSIM. The ZSIM software architecture simulator targets low-cost SIMD multi-core machines. Its performance is evaluated on the Intel Xeon Phi and 2 other machines (Intel Xeon and AMD Opteron). The aim of these experiments is to:
• Verify that the data structure used allows SIMD acceleration, particularly on machines with gather instructions (section 5.3.1).
• Verify that, on sufficiently large circuits, substantial gains could be made from multicore parallelism (section 5.3.2).
• Show that a simulator using this approach out-performs an existing commercial simulator on a standard workstation (section 5.3.3).
• Show that the performance on a cheap Xeon Phi card is competitive with results reported elsewhere on much more expensive super-computers (section 5.3.5).
To evaluate ZSIM, two types of test circuits were used:
1. Circuits from the IWLS benchmark suite [1], which allow direct comparison with other published studies of parallel simulators.
2. Circuits generated by a parametrised circuit synthesizer. The synthesizer used an algorithm that has been shown to generate circuits that are statistically representative of real logic circuits. The synthesizer allowed testing of a range of very large circuits, larger than the ones for which it was possible to obtain open source files.
The experimental results show that with SIMD acceleration and multicore, ZSIM gained a peak parallelisation factor of 300 on Intel Xeon Phi and 11 on Intel Xeon. With only SIMD enabled, ZSIM achieved a maximum parallelisation gain of 10 on Intel Xeon Phi and 4 on Intel Xeon. Furthermore, it was shown that this software architecture simulator running on a SIMD machine is much faster than, and can handle much bigger circuits than, a widely used commercial simulator (Xilinx) running on a workstation. The performance achieved by ZSIM was also compared with similar pre-existing work on logic simulation targeting GPUs and supercomputers. It was shown that the ZSIM simulator running on a Xeon Phi machine gives comparable simulation performance to the IBM Blue Gene supercomputer at very much lower cost. The experimental results have shown that the Xeon Phi is competitive with simulation on GPUs and allows the handling of much larger circuits than have been reported for GPU simulation. When targeting the Xeon Phi architecture, the automatic cache management of the Xeon Phi handles and manages the on-chip local store without any explicit mention of the local store being made in the architecture of the simulator itself. However, when targeting GPUs, explicit cache management in the program increases the complexity of the software architecture. Furthermore, one of the strongest points of the ZSIM simulator is its portability. Note that the same code was tested on both AMD and Xeon Phi machines. The same architecture that performs efficiently on Xeon Phi was ported to a 64-core NUMA AMD Opteron. To conclude, the two main achievements are restated as follows: The primary achievement of this work was proving that the ZSIM architecture was faster than previously published logic simulators on low-cost platforms.
The secondary achievement was the development of a synthetic testing suite that went beyond the scale range that was previously publicly available, based on prior work that showed the synthesis technique is valid.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

Due to the limitation of the lens effect of the optical fibre and the inhomogeneity of the laser fluence on different cores, it is still challenging to controllably inscribe different fibre Bragg gratings (FBGs) in multicore fibres. In this article, we report FBG inscription in four-core fibres (FCFs), whose cores are arranged at the corners of a square lattice. By investigating the influence of different inscription conditions, different results, such as simultaneous inscription of all cores, selective inscription of individual or two cores, and even double scanning in perpendicular core couples by diagonal, are achieved. The phase mask scanning method, consisting of a 244 nm frequency-doubled argon-ion laser, an air-bearing linear transfer stage, and a cylindrical lens and mirror setup, is used to precisely control the grating inscription in FCFs. The influence of three factors is systematically investigated to overcome the limitations: the defocusing length between the cylindrical lens and the bare fibre, the rotation geometry of the fibre relative to the irradiation beam, and the relative position of the fibre in the vertical direction of the laser beam.