802 resultados para GPU computing
Resumo:
Breast cancer is the most common cancer among women, being a major public health problem. Worldwide, X-ray mammography is the current gold-standard for medical imaging of breast cancer. However, it has associated some well-known limitations. The false-negative rates, up to 66% in symptomatic women, and the false-positive rates, up to 60%, are a continued source of concern and debate. These drawbacks prompt the development of other imaging techniques for breast cancer detection, in which Digital Breast Tomosynthesis (DBT) is included. DBT is a 3D radiographic technique that reduces the obscuring effect of tissue overlap and appears to address both issues of false-negative and false-positive rates. The 3D images in DBT are only achieved through image reconstruction methods. These methods play an important role in a clinical setting since there is a need to implement a reconstruction process that is both accurate and fast. This dissertation deals with the optimization of iterative algorithms, with parallel computing through an implementation on Graphics Processing Units (GPUs) to make the 3D reconstruction faster using Compute Unified Device Architecture (CUDA). Iterative algorithms have shown to produce the highest quality DBT images, but since they are computationally intensive, their clinical use is currently rejected. These algorithms have the potential to reduce patient dose in DBT scans. A method of integrating CUDA in Interactive Data Language (IDL) is proposed in order to accelerate the DBT image reconstructions. This method has never been attempted before for DBT. In this work the system matrix calculation, the most computationally expensive part of iterative algorithms, is accelerated. A speedup of 1.6 is achieved proving the fact that GPUs can accelerate the IDL implementation.
Resumo:
Le code source de la libraire développée accompagne ce dépôt dans l'état où il était à ce moment. Il est possible de trouver une version plus à jour sur github (http://github.com/abergeron).
Resumo:
In this article we explore the NVIDIA graphical processing units (GPU) computational power in cryptography using CUDA (Compute Unified Device Architecture) technology. CUDA makes the general purpose computing easy using the parallel processing presents in GPUs. To do this, the NVIDIA GPUs architectures and CUDA are presented, besides cryptography concepts. Furthermore, we do the comparison between the versions executed in CPU with the parallel version of the cryptography algorithms Advanced Encryption Standard (AES) and Message-digest Algorithm 5 (MD5) wrote in CUDA. © 2011 AISTI.
Resumo:
Microprocessori basati su singolo processore (CPU), hanno visto una rapida crescita di performances ed un abbattimento dei costi per circa venti anni. Questi microprocessori hanno portato una potenza di calcolo nell’ordine del GFLOPS (Giga Floating Point Operation per Second) sui PC Desktop e centinaia di GFLOPS su clusters di server. Questa ascesa ha portato nuove funzionalità nei programmi, migliori interfacce utente e tanti altri vantaggi. Tuttavia questa crescita ha subito un brusco rallentamento nel 2003 a causa di consumi energetici sempre più elevati e problemi di dissipazione termica, che hanno impedito incrementi di frequenza di clock. I limiti fisici del silicio erano sempre più vicini. Per ovviare al problema i produttori di CPU (Central Processing Unit) hanno iniziato a progettare microprocessori multicore, scelta che ha avuto un impatto notevole sulla comunità degli sviluppatori, abituati a considerare il software come una serie di comandi sequenziali. Quindi i programmi che avevano sempre giovato di miglioramenti di prestazioni ad ogni nuova generazione di CPU, non hanno avuto incrementi di performance, in quanto essendo eseguiti su un solo core, non beneficiavano dell’intera potenza della CPU. Per sfruttare appieno la potenza delle nuove CPU la programmazione concorrente, precedentemente utilizzata solo su sistemi costosi o supercomputers, è diventata una pratica sempre più utilizzata dagli sviluppatori. Allo stesso tempo, l’industria videoludica ha conquistato una fetta di mercato notevole: solo nel 2013 verranno spesi quasi 100 miliardi di dollari fra hardware e software dedicati al gaming. Le software houses impegnate nello sviluppo di videogames, per rendere i loro titoli più accattivanti, puntano su motori grafici sempre più potenti e spesso scarsamente ottimizzati, rendendoli estremamente esosi in termini di performance. Per questo motivo i produttori di GPU (Graphic Processing Unit), specialmente nell’ultimo decennio, hanno dato vita ad una vera e propria rincorsa alle performances che li ha portati ad ottenere dei prodotti con capacità di calcolo vertiginose. Ma al contrario delle CPU che agli inizi del 2000 intrapresero la strada del multicore per continuare a favorire programmi sequenziali, le GPU sono diventate manycore, ovvero con centinaia e centinaia di piccoli cores che eseguono calcoli in parallelo. Questa immensa capacità di calcolo può essere utilizzata in altri campi applicativi? La risposta è si e l’obiettivo di questa tesi è proprio quello di constatare allo stato attuale, in che modo e con quale efficienza pùo un software generico, avvalersi dell’utilizzo della GPU invece della CPU.
Resumo:
In accordo con la filosofia della Software Defined Radio è stato progettato un decoder LDPC software che utilizza una GPU per ottenere prestazioni migliori. Il lavoro, che comprende anche l'encoder e un simulatore di canale AWGN, può essere utilizzato sia per eseguire simulazioni che per elaborare dati in real time. Come caso di studio si sono considerati i codici LDPC dello standard DVB-S2.
Resumo:
This thesis deals with heterogeneous architectures in standard workstations. Heterogeneous architectures represent an appealing alternative to traditional supercomputers because they are based on commodity components fabricated in large quantities. Hence their price-performance ratio is unparalleled in the world of high performance computing (HPC). In particular, different aspects related to the performance and consumption of heterogeneous architectures have been explored. The thesis initially focuses on an efficient implementation of a parallel application, where the execution time is dominated by an high number of floating point instructions. Then the thesis touches the central problem of efficient management of power peaks in heterogeneous computing systems. Finally it discusses a memory-bounded problem, where the execution time is dominated by the memory latency. Specifically, the following main contributions have been carried out: A novel framework for the design and analysis of solar field for Central Receiver Systems (CRS) has been developed. The implementation based on desktop workstation equipped with multiple Graphics Processing Units (GPUs) is motivated by the need to have an accurate and fast simulation environment for studying mirror imperfection and non-planar geometries. Secondly, a power-aware scheduling algorithm on heterogeneous CPU-GPU architectures, based on an efficient distribution of the computing workload to the resources, has been realized. The scheduler manages the resources of several computing nodes with a view to reducing the peak power. The two main contributions of this work follow: the approach reduces the supply cost due to high peak power whilst having negligible impact on the parallelism of computational nodes. from another point of view the developed model allows designer to increase the number of cores without increasing the capacity of the power supply unit. Finally, an implementation for efficient graph exploration on reconfigurable architectures is presented. The purpose is to accelerate graph exploration, reducing the number of random memory accesses.
Resumo:
We present in this paper several contributions on the collision detection optimization centered on hardware performance. We focus on the broad phase which is the first step of the collision detection process and propose three new ways of parallelization of the well-known Sweep and Prune algorithm. We first developed a multi-core model takes into account the number of available cores. Multi-core architecture enables us to distribute geometric computations with use of multi-threading. Critical writing section and threads idling have been minimized by introducing new data structures for each thread. Programming with directives, like OpenMP, appears to be a good compromise for code portability. We then proposed a new GPU-based algorithm also based on the "Sweep and Prune" that has been adapted to multi-GPU architectures. Our technique is based on a spatial subdivision method used to distribute computations among GPUs. Results show that significant speed-up can be obtained by passing from 1 to 4 GPUs in a large-scale environment.
Resumo:
A parallel algorithm for image noise removal is proposed. The algorithm is based on peer group concept and uses a fuzzy metric. An optimization study on the use of the CUDA platform to remove impulsive noise using this algorithm is presented. Moreover, an implementation of the algorithm on multi-core platforms using OpenMP is presented. Performance is evaluated in terms of execution time and a comparison of the implementation parallelised in multi-core, GPUs and the combination of both is conducted. A performance analysis with large images is conducted in order to identify the amount of pixels to allocate in the CPU and GPU. The observed time shows that both devices must have work to do, leaving the most to the GPU. Results show that parallel implementations of denoising filters on GPUs and multi-cores are very advisable, and they open the door to use such algorithms for real-time processing.
Resumo:
In this work, we propose the use of the neural gas (NG), a neural network that uses an unsupervised Competitive Hebbian Learning (CHL) rule, to develop a reverse engineering process. This is a simple and accurate method to reconstruct objects from point clouds obtained from multiple overlapping views using low-cost sensors. In contrast to other methods that may need several stages that include downsampling, noise filtering and many other tasks, the NG automatically obtains the 3D model of the scanned objects. To demonstrate the validity of our proposal we tested our method with several models and performed a study of the neural network parameterization computing the quality of representation and also comparing results with other neural methods like growing neural gas and Kohonen maps or classical methods like Voxel Grid. We also reconstructed models acquired by low cost sensors that can be used in virtual and augmented reality environments for redesign or manipulation purposes. Since the NG algorithm has a strong computational cost we propose its acceleration. We have redesigned and implemented the NG learning algorithm to fit it onto Graphics Processing Units using CUDA. A speed-up of 180× faster is obtained compared to the sequential CPU version.
Resumo:
The astonishing development of diverse and different hardware platforms is twofold: on one side, the challenge for the exascale performance for big data processing and management; on the other side, the mobile and embedded devices for data collection and human machine interaction. This drove to a highly hierarchical evolution of programming models. GVirtuS is the general virtualization system developed in 2009 and firstly introduced in 2010 enabling a completely transparent layer among GPUs and VMs. This paper shows the latest achievements and developments of GVirtuS, now supporting CUDA 6.5, memory management and scheduling. Thanks to the new and improved remoting capabilities, GVirtus now enables GPU sharing among physical and virtual machines based on x86 and ARM CPUs on local workstations,computing clusters and distributed cloud appliances.
Resumo:
The high performance computing community has traditionally focused uniquely on the reduction of execution time, though in the last years, the optimization of energy consumption has become a main issue. A reduction of energy usage without a degradation of performance requires the adoption of energy-efficient hardware platforms accompanied by the development of energy-aware algorithms and computational kernels. The solution of linear systems is a key operation for many scientific and engineering problems. Its relevance has motivated an important amount of work, and consequently, it is possible to find high performance solvers for a wide variety of hardware platforms. In this work, we aim to develop a high performance and energy-efficient linear system solver. In particular, we develop two solvers for a low-power CPU-GPU platform, the NVIDIA Jetson TK1. These solvers implement the Gauss-Huard algorithm yielding an efficient usage of the target hardware as well as an efficient memory access. The experimental evaluation shows that the novel proposal reports important savings in both time and energy-consumption when compared with the state-of-the-art solvers of the platform.
Resumo:
In the multi-core CPU world, transactional memory (TM)has emerged as an alternative to lock-based programming for thread synchronization. Recent research proposes the use of TM in GPU architectures, where a high number of computing threads, organized in SIMT fashion, requires an effective synchronization method. In contrast to CPUs, GPUs offer two memory spaces: global memory and local memory. The local memory space serves as a shared scratch-pad for a subset of the computing threads, and it is used by programmers to speed-up their applications thanks to its low latency. Prior work from the authors proposed a lightweight hardware TM (HTM) support based in the local memory, modifying the SIMT execution model and adding a conflict detection mechanism. An efficient implementation of these features is key in order to provide an effective synchronization mechanism at the local memory level. After a quick description of the main features of our HTM design for GPU local memory, in this work we gather together a number of proposals designed with the aim of improving those mechanisms with high impact on performance. Firstly, the SIMT execution model is modified to increase the parallelism of the application when transactions must be serialized in order to make forward progress. Secondly, the conflict detection mechanism is optimized depending on application characteristics, such us the read/write sets, the probability of conflict between transactions and the existence of read-only transactions. As these features can be present in hardware simultaneously, it is a task of the compiler and runtime to determine which ones are more important for a given application. This work includes a discussion on the analysis to be done in order to choose the best configuration solution.
Resumo:
We investigate in detail the effects of a QND vibrational number measurement made on single ions in a recently proposed measurement scheme for the vibrational state of a register of ions in a linear rf trap [C. D'HELON and G. J. MILBURN, Phys Rev. A 54, 5141 (1996)]. The performance of a measurement shows some interesting patterns which are closely related to searching.
Resumo:
Expokit provides a set of routines aimed at computing matrix exponentials. More precisely, it computes either a small matrix exponential in full, the action of a large sparse matrix exponential on an operand vector, or the solution of a system of linear ODEs with constant inhomogeneity. The backbone of the sparse routines consists of matrix-free Krylov subspace projection methods (Arnoldi and Lanczos processes), and that is why the toolkit is capable of coping with sparse matrices of large dimension. The software handles real and complex matrices and provides specific routines for symmetric and Hermitian matrices. The computation of matrix exponentials is a numerical issue of critical importance in the area of Markov chains and furthermore, the computed solution is subject to probabilistic constraints. In addition to addressing general matrix exponentials, a distinct attention is assigned to the computation of transient states of Markov chains.