893 resultados para Single Graphics Processing Units
Resumo:
The efficient computation of matrix function vector products has become an important area of research in recent times, driven in particular by two important applications: the numerical solution of fractional partial differential equations and the integration of large systems of ordinary differential equations. In this work we consider a problem that combines these two applications, in the form of a numerical solution algorithm for fractional reaction diffusion equations that after spatial discretisation, is advanced in time using the exponential Euler method. We focus on the efficient implementation of the algorithm on Graphics Processing Units (GPU), as we wish to make use of the increased computational power available with this hardware. We compute the matrix function vector products using the contour integration method in [N. Hale, N. Higham, and L. Trefethen. Computing Aα, log(A), and related matrix functions by contour integrals. SIAM J. Numer. Anal., 46(5):2505–2523, 2008]. Multiple levels of preconditioning are applied to reduce the GPU memory footprint and to further accelerate convergence. We also derive an error bound for the convergence of the contour integral method that allows us to pre-determine the appropriate number of quadrature points. Results are presented that demonstrate the effectiveness of the method for large two-dimensional problems, showing a speedup of more than an order of magnitude compared to a CPU-only implementation.
Resumo:
This article describes a maximum likelihood method for estimating the parameters of the standard square-root stochastic volatility model and a variant of the model that includes jumps in equity prices. The model is fitted to data on the S&P 500 Index and the prices of vanilla options written on the index, for the period 1990 to 2011. The method is able to estimate both the parameters of the physical measure (associated with the index) and the parameters of the risk-neutral measure (associated with the options), including the volatility and jump risk premia. The estimation is implemented using a particle filter whose efficacy is demonstrated under simulation. The computational load of this estimation method, which previously has been prohibitive, is managed by the effective use of parallel computing using graphics processing units (GPUs). The empirical results indicate that the parameters of the models are reliably estimated and consistent with values reported in previous work. In particular, both the volatility risk premium and the jump risk premium are found to be significant.
Resumo:
MATLAB is an array language, initially popular for rapid prototyping, but is now being increasingly used to develop production code for numerical and scientific applications. Typical MATLAB programs have abundant data parallelism. These programs also have control flow dominated scalar regions that have an impact on the program's execution time. Today's computer systems have tremendous computing power in the form of traditional CPU cores and throughput oriented accelerators such as graphics processing units(GPUs). Thus, an approach that maps the control flow dominated regions to the CPU and the data parallel regions to the GPU can significantly improve program performance. In this paper, we present the design and implementation of MEGHA, a compiler that automatically compiles MATLAB programs to enable synergistic execution on heterogeneous processors. Our solution is fully automated and does not require programmer input for identifying data parallel regions. We propose a set of compiler optimizations tailored for MATLAB. Our compiler identifies data parallel regions of the program and composes them into kernels. The problem of combining statements into kernels is formulated as a constrained graph clustering problem. Heuristics are presented to map identified kernels to either the CPU or GPU so that kernel execution on the CPU and the GPU happens synergistically and the amount of data transfer needed is minimized. In order to ensure required data movement for dependencies across basic blocks, we propose a data flow analysis and edge splitting strategy. Thus our compiler automatically handles composition of kernels, mapping of kernels to CPU and GPU, scheduling and insertion of required data transfer. The proposed compiler was implemented and experimental evaluation using a set of MATLAB benchmarks shows that our approach achieves a geometric mean speedup of 19.8X for data parallel benchmarks over native execution of MATLAB.
Resumo:
MATLAB is an array language, initially popular for rapid prototyping, but is now being increasingly used to develop production code for numerical and scientific applications. Typical MATLAB programs have abundant data parallelism. These programs also have control flow dominated scalar regions that have an impact on the program's execution time. Today's computer systems have tremendous computing power in the form of traditional CPU cores and throughput oriented accelerators such as graphics processing units(GPUs). Thus, an approach that maps the control flow dominated regions to the CPU and the data parallel regions to the GPU can significantly improve program performance. In this paper, we present the design and implementation of MEGHA, a compiler that automatically compiles MATLAB programs to enable synergistic execution on heterogeneous processors. Our solution is fully automated and does not require programmer input for identifying data parallel regions. We propose a set of compiler optimizations tailored for MATLAB. Our compiler identifies data parallel regions of the program and composes them into kernels. The problem of combining statements into kernels is formulated as a constrained graph clustering problem. Heuristics are presented to map identified kernels to either the CPU or GPU so that kernel execution on the CPU and the GPU happens synergistically and the amount of data transfer needed is minimized. In order to ensure required data movement for dependencies across basic blocks, we propose a data flow analysis and edge splitting strategy. Thus our compiler automatically handles composition of kernels, mapping of kernels to CPU and GPU, scheduling and insertion of required data transfer. The proposed compiler was implemented and experimental evaluation using a set of MATLAB benchmarks shows that our approach achieves a geometric mean speedup of 19.8X for data parallel benchmarks over native execution of MATLAB.
Resumo:
We present a method of rapidly producing computer-generated holograms that exhibit geometric occlusion in the reconstructed image. Conceptually, a bundle of rays is shot from every hologram sample into the object volume.We use z buffering to find the nearest intersecting object point for every ray and add its complex field contribution to the corresponding hologram sample. Each hologram sample belongs to an independent operation, allowing us to exploit the parallel computing capability of modern programmable graphics processing units (GPUs). Unlike algorithms that use points or planar segments as the basis for constructing the hologram, our algorithm's complexity is dependent on fixed system parameters, such as the number of ray-casting operations, and can therefore handle complicated models more efficiently. The finite number of hologram pixels is, in effect, a windowing function, and from analyzing the Wigner distribution function of windowed free-space transfer function we find an upper limit on the cone angle of the ray bundle. Experimentally, we found that an angular sampling distance of 0:01' for a 2:66' cone angle produces acceptable reconstruction quality. © 2009 Optical Society of America.
Resumo:
Os métodos numéricos convencionais, baseados em malhas, têm sido amplamente aplicados na resolução de problemas da Dinâmica dos Fluidos Computacional. Entretanto, em problemas de escoamento de fluidos que envolvem superfícies livres, grandes explosões, grandes deformações, descontinuidades, ondas de choque etc., estes métodos podem apresentar algumas dificuldades práticas quando da resolução destes problemas. Como uma alternativa viável, existem os métodos de partículas livre de malhas. Neste trabalho é feita uma introdução ao método Lagrangeano de partículas, livre de malhas, Smoothed Particle Hydrodynamics (SPH) voltado para a simulação numérica de escoamentos de fluidos newtonianos compressíveis e quase-incompressíveis. Dois códigos numéricos foram desenvolvidos, uma versão serial e outra em paralelo, empregando a linguagem de programação C/C++ e a Compute Unified Device Architecture (CUDA), que possibilita o processamento em paralelo empregando os núcleos das Graphics Processing Units (GPUs) das placas de vídeo da NVIDIA Corporation. Os resultados numéricos foram validados e a eficiência computacional avaliada considerandose a resolução dos problemas unidimensionais Shock Tube e Blast Wave e bidimensional da Cavidade (Shear Driven Cavity Problem).
Resumo:
A obtenção de imagens usando tomografia computadorizada revolucionou o diagnóstico de doenças na medicina e é usada amplamente em diferentes áreas da pesquisa científica. Como parte do processo de obtenção das imagens tomográficas tridimensionais um conjunto de radiografias são processadas por um algoritmo computacional, o mais usado atualmente é o algoritmo de Feldkamp, David e Kress (FDK). Os usos do processamento paralelo para acelerar os cálculos em algoritmos computacionais usando as diferentes tecnologias disponíveis no mercado têm mostrado sua utilidade para diminuir os tempos de processamento. No presente trabalho é apresentada a paralelização do algoritmo de reconstrução de imagens tridimensionais FDK usando unidades gráficas de processamento (GPU) e a linguagem CUDA-C. São apresentadas as GPUs como uma opção viável para executar computação paralela e abordados os conceitos introdutórios associados à tomografia computadorizada, GPUs, CUDA-C e processamento paralelo. A versão paralela do algoritmo FDK executada na GPU é comparada com uma versão serial do mesmo, mostrando maior velocidade de processamento. Os testes de desempenho foram feitos em duas GPUs de diferentes capacidades: a placa NVIDIA GeForce 9400GT (16 núcleos) e a placa NVIDIA Quadro 2000 (192 núcleos).
Resumo:
A new three-dimensional Navier-Stokes solver for flows in turbomachines has been developed. The new solver is based on the latest version of the Denton codes, but has been implemented to run on Graphics Processing Units (GPUs) instead of the traditional Central Processing Unit (CPU). The change in processor enables an order-of-magnitude reduction in run-time due to the higher performance of the GPU. Scaling results for a 16 node GPU cluster are also presented, showing almost linear scaling for typical turbomachinery cases. For validation purposes, a test case consisting of a three-stage turbine with complete hub and casing leakage paths is described. Good agreement is obtained with previously published experimental results. The simulation runs in less than 10 minutes on a cluster with four GPUs. Copyright © 2009 by ASME.
Resumo:
Fuel treatment is considered a suitable way to mitigate the hazard related to potential wildfires on a landscape. However, designing an optimal spatial layout of treatment units represents a difficult optimization problem. In fact, budget constraints, the probabilistic nature of fire spread and interactions among the different area units composing the whole treatment, give rise to challenging search spaces on typical landscapes. In this paper we formulate such optimization problem with the objective of minimizing the extension of land characterized by high fire hazard. Then, we propose a computational approach that leads to a spatially-optimized treatment layout exploiting Tabu Search and General-Purpose computing on Graphics Processing Units (GPGPU). Using an application example, we also show that the proposed methodology can provide high-quality design solutions in low computing time. © 2013 The Authors. Published by Elsevier B.V.
Resumo:
Cloud computing technology has rapidly evolved over the last decade, offering an alternative way to store and work with large amounts of data. However data security remains an important issue particularly when using a public cloud service provider. The recent area of homomorphic cryptography allows computation on encrypted data, which would allow users to ensure data privacy on the cloud and increase the potential market for cloud computing. A significant amount of research on homomorphic cryptography appeared in the literature over the last few years; yet the performance of existing implementations of encryption schemes remains unsuitable for real time applications. One way this limitation is being addressed is through the use of graphics processing units (GPUs) and field programmable gate arrays (FPGAs) for implementations of homomorphic encryption schemes. This review presents the current state of the art in this promising new area of research and highlights the interesting remaining open problems.
Resumo:
La tomographie d’émission par positrons (TEP) est une modalité d’imagerie moléculaire utilisant des radiotraceurs marqués par des isotopes émetteurs de positrons permettant de quantifier et de sonder des processus biologiques et physiologiques. Cette modalité est surtout utilisée actuellement en oncologie, mais elle est aussi utilisée de plus en plus en cardiologie, en neurologie et en pharmacologie. En fait, c’est une modalité qui est intrinsèquement capable d’offrir avec une meilleure sensibilité des informations fonctionnelles sur le métabolisme cellulaire. Les limites de cette modalité sont surtout la faible résolution spatiale et le manque d’exactitude de la quantification. Par ailleurs, afin de dépasser ces limites qui constituent un obstacle pour élargir le champ des applications cliniques de la TEP, les nouveaux systèmes d’acquisition sont équipés d’un grand nombre de petits détecteurs ayant des meilleures performances de détection. La reconstruction de l’image se fait en utilisant les algorithmes stochastiques itératifs mieux adaptés aux acquisitions à faibles statistiques. De ce fait, le temps de reconstruction est devenu trop long pour une utilisation en milieu clinique. Ainsi, pour réduire ce temps, on les données d’acquisition sont compressées et des versions accélérées d’algorithmes stochastiques itératifs qui sont généralement moins exactes sont utilisées. Les performances améliorées par l’augmentation de nombre des détecteurs sont donc limitées par les contraintes de temps de calcul. Afin de sortir de cette boucle et permettre l’utilisation des algorithmes de reconstruction robustes, de nombreux travaux ont été effectués pour accélérer ces algorithmes sur les dispositifs GPU (Graphics Processing Units) de calcul haute performance. Dans ce travail, nous avons rejoint cet effort de la communauté scientifique pour développer et introduire en clinique l’utilisation des algorithmes de reconstruction puissants qui améliorent la résolution spatiale et l’exactitude de la quantification en TEP. Nous avons d’abord travaillé sur le développement des stratégies pour accélérer sur les dispositifs GPU la reconstruction des images TEP à partir des données d’acquisition en mode liste. En fait, le mode liste offre de nombreux avantages par rapport à la reconstruction à partir des sinogrammes, entre autres : il permet d’implanter facilement et avec précision la correction du mouvement et le temps de vol (TOF : Time-Of Flight) pour améliorer l’exactitude de la quantification. Il permet aussi d’utiliser les fonctions de bases spatio-temporelles pour effectuer la reconstruction 4D afin d’estimer les paramètres cinétiques des métabolismes avec exactitude. Cependant, d’une part, l’utilisation de ce mode est très limitée en clinique, et d’autre part, il est surtout utilisé pour estimer la valeur normalisée de captation SUV qui est une grandeur semi-quantitative limitant le caractère fonctionnel de la TEP. Nos contributions sont les suivantes : - Le développement d’une nouvelle stratégie visant à accélérer sur les dispositifs GPU l’algorithme 3D LM-OSEM (List Mode Ordered-Subset Expectation-Maximization), y compris le calcul de la matrice de sensibilité intégrant les facteurs d’atténuation du patient et les coefficients de normalisation des détecteurs. Le temps de calcul obtenu est non seulement compatible avec une utilisation clinique des algorithmes 3D LM-OSEM, mais il permet également d’envisager des reconstructions rapides pour les applications TEP avancées telles que les études dynamiques en temps réel et des reconstructions d’images paramétriques à partir des données d’acquisitions directement. - Le développement et l’implantation sur GPU de l’approche Multigrilles/Multitrames pour accélérer l’algorithme LMEM (List-Mode Expectation-Maximization). L’objectif est de développer une nouvelle stratégie pour accélérer l’algorithme de référence LMEM qui est un algorithme convergent et puissant, mais qui a l’inconvénient de converger très lentement. Les résultats obtenus permettent d’entrevoir des reconstructions en temps quasi-réel que ce soit pour les examens utilisant un grand nombre de données d’acquisition aussi bien que pour les acquisitions dynamiques synchronisées. Par ailleurs, en clinique, la quantification est souvent faite à partir de données d’acquisition en sinogrammes généralement compressés. Mais des travaux antérieurs ont montré que cette approche pour accélérer la reconstruction diminue l’exactitude de la quantification et dégrade la résolution spatiale. Pour cette raison, nous avons parallélisé et implémenté sur GPU l’algorithme AW-LOR-OSEM (Attenuation-Weighted Line-of-Response-OSEM) ; une version de l’algorithme 3D OSEM qui effectue la reconstruction à partir de sinogrammes sans compression de données en intégrant les corrections de l’atténuation et de la normalisation dans les matrices de sensibilité. Nous avons comparé deux approches d’implantation : dans la première, la matrice système (MS) est calculée en temps réel au cours de la reconstruction, tandis que la seconde implantation utilise une MS pré- calculée avec une meilleure exactitude. Les résultats montrent que la première implantation offre une efficacité de calcul environ deux fois meilleure que celle obtenue dans la deuxième implantation. Les temps de reconstruction rapportés sont compatibles avec une utilisation clinique de ces deux stratégies.
Resumo:
clRNG et clProbdist sont deux interfaces de programmation (APIs) que nous avons développées pour la génération de nombres aléatoires uniformes et non uniformes sur des dispositifs de calculs parallèles en utilisant l’environnement OpenCL. La première interface permet de créer au niveau d’un ordinateur central (hôte) des objets de type stream considérés comme des générateurs virtuels parallèles qui peuvent être utilisés aussi bien sur l’hôte que sur les dispositifs parallèles (unités de traitement graphique, CPU multinoyaux, etc.) pour la génération de séquences de nombres aléatoires. La seconde interface permet aussi de générer au niveau de ces unités des variables aléatoires selon différentes lois de probabilité continues et discrètes. Dans ce mémoire, nous allons rappeler des notions de base sur les générateurs de nombres aléatoires, décrire les systèmes hétérogènes ainsi que les techniques de génération parallèle de nombres aléatoires. Nous présenterons aussi les différents modèles composant l’architecture de l’environnement OpenCL et détaillerons les structures des APIs développées. Nous distinguons pour clRNG les fonctions qui permettent la création des streams, les fonctions qui génèrent les variables aléatoires uniformes ainsi que celles qui manipulent les états des streams. clProbDist contient les fonctions de génération de variables aléatoires non uniformes selon la technique d’inversion ainsi que les fonctions qui permettent de retourner différentes statistiques des lois de distribution implémentées. Nous évaluerons ces interfaces de programmation avec deux simulations qui implémentent un exemple simplifié d’un modèle d’inventaire et un exemple d’une option financière. Enfin, nous fournirons les résultats d’expérimentation sur les performances des générateurs implémentés.
Resumo:
Simulating spiking neural networks is of great interest to scientists wanting to model the functioning of the brain. However, large-scale models are expensive to simulate due to the number and interconnectedness of neurons in the brain. Furthermore, where such simulations are used in an embodied setting, the simulation must be real-time in order to be useful. In this paper we present NeMo, a platform for such simulations which achieves high performance through the use of highly parallel commodity hardware in the form of graphics processing units (GPUs). NeMo makes use of the Izhikevich neuron model which provides a range of realistic spiking dynamics while being computationally efficient. Our GPU kernel can deliver up to 400 million spikes per second. This corresponds to a real-time simulation of around 40 000 neurons under biologically plausible conditions with 1000 synapses per neuron and a mean firing rate of 10 Hz.
Resumo:
SAFT techniques are based on the sequential activation, in emission and reception, of the array elements and the post-processing of all the received signals to compose the image. Thus, the image generation can be divided into two stages: (1) the excitation and acquisition stage, where the signals received by each element or group of elements are stored; and (2) the beamforming stage, where the signals are combined together to obtain the image pixels. The use of Graphics Processing Units (GPUs), which are programmable devices with a high level of parallelism, can accelerate the computations of the beamforming process, that usually includes different functions such as dynamic focusing, band-pass filtering, spatial filtering or envelope detection. This work shows that using GPU technology can accelerate, in more than one order of magnitude with respect to CPU implementations, the beamforming and post-processing algorithms in SAFT imaging. ©2009 IEEE.
Resumo:
In general, pattern recognition techniques require a high computational burden for learning the discriminating functions that are responsible to separate samples from distinct classes. As such, there are several studies that make effort to employ machine learning algorithms in the context of big data classification problems. The research on this area ranges from Graphics Processing Units-based implementations to mathematical optimizations, being the main drawback of the former approaches to be dependent on the graphic video card. Here, we propose an architecture-independent optimization approach for the optimum-path forest (OPF) classifier, that is designed using a theoretical formulation that relates the minimum spanning tree with the minimum spanning forest generated by the OPF over the training dataset. The experiments have shown that the approach proposed can be faster than the traditional one in five public datasets, being also as accurate as the original OPF. (C) 2014 Elsevier B. V. All rights reserved.