106 results for NVIDIA CUDA
Abstract:
The Nha Trang Bay (latitude 12°15'N) and central areas of Vietnam present strong ecological differences from other parts of the country.
Abstract:
This document presents a solution to the travelling salesman problem based on a custom self-organizing map, building on earlier solutions and adapting them to the CUDA architecture. It also compares the efficient CUDA C/C++ implementation with one that uses the GPU functions included in Matlab's Parallel Computing Toolbox. The proposed solution reduces by almost a quarter the number of iterations needed to reach a good solution to the problem, in addition to the improvement expected from using parallel architectures. The study examines the time improvement obtained through the specific use of shared memory, one of the most powerful tools for improving performance. With regard to execution times, it concludes that the best option is to launch a CUDA kernel from Matlab through the functionality included in the Parallel Computing Toolbox.
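As an illustration of the self-organizing-map approach to the travelling salesman problem mentioned in this abstract, the following is a minimal sequential sketch in Python, not the authors' CUDA implementation. The ring size, learning rate, neighbourhood radius, and decay factors are assumptions chosen only for the example, and the names `som_tsp` and `tour_length` are hypothetical.

```python
import math
import random


def som_tsp(cities, n_iter=2000, seed=0):
    """Order cities with a ring-shaped self-organizing map (illustrative sketch)."""
    rng = random.Random(seed)
    n_neurons = 4 * len(cities)           # assumed ring size
    # Start the neurons at random positions in the unit square.
    ring = [[rng.random(), rng.random()] for _ in range(n_neurons)]
    lr, radius = 0.8, n_neurons / 8.0     # assumed initial learning rate and radius

    for _ in range(n_iter):
        cx, cy = rng.choice(cities)
        # Winner: the neuron closest to the sampled city.
        win = min(range(n_neurons),
                  key=lambda i: (ring[i][0] - cx) ** 2 + (ring[i][1] - cy) ** 2)
        for i in range(n_neurons):
            # Distance measured along the ring, which wraps around.
            d = min(abs(i - win), n_neurons - abs(i - win))
            h = math.exp(-(d * d) / (2.0 * radius * radius))
            ring[i][0] += lr * h * (cx - ring[i][0])
            ring[i][1] += lr * h * (cy - ring[i][1])
        lr *= 0.9997                       # assumed decay schedule
        radius = max(radius * 0.999, 1.0)

    def winner(c):
        return min(range(n_neurons),
                   key=lambda i: (ring[i][0] - c[0]) ** 2 + (ring[i][1] - c[1]) ** 2)

    # Read the tour by sorting the cities by the ring index of their winner.
    return sorted(range(len(cities)), key=lambda k: winner(cities[k]))


def tour_length(cities, order):
    """Length of the closed tour that visits the cities in the given order."""
    return sum(math.dist(cities[order[i]], cities[order[(i + 1) % len(order)]])
               for i in range(len(order)))
```

The winner search and the per-neuron update are both loops over independent neurons, which is the kind of work a GPU version would plausibly distribute across threads.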
Abstract:
After a decade evolving in the High Performance Computing arena, GPU-equipped supercomputers have conquered the TOP500 and Green500 lists, providing us with unprecedented levels of computational power and memory bandwidth. This year, major vendors have introduced new accelerators based on 3D memory, like Xeon Phi Knights Landing by Intel and Pascal architecture by Nvidia. This paper reviews hardware features of those new HPC accelerators and unveils potential performance for scientific applications, with an emphasis on Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM) used by commercial products according to roadmaps already announced.
Abstract:
In this thesis we discuss how a parallel program written in CUDA can be translated into OpenCL. We cover the technologies used to develop a parallel cardiac simulator and discuss in particular how to derive from them a version that can run on arbitrary graphics cards and processors. This version is then compared with the existing ones in order to analyse its performance and any structural changes to the code. This was made possible largely by a wrapper called SimpleCL, designed to make OpenCL programming resemble programming in the CUDA environment. OpenCL allows compute units to be handled in a highly abstract way, loosely recalling the memory and processor abstractions of its NVIDIA counterpart. Reasonably, SimpleCL provides only an interface that resembles CUDA calls, keeping the underlying flow faithful to what OpenCL expects.
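To make the correspondence between the two programming models more concrete, the sketch below lists some standard CUDA and OpenCL counterparts and emulates the one-dimensional global index computation in Python. It does not reproduce SimpleCL, whose interface is not described here; the function and dictionary names are hypothetical.

```python
def cuda_global_index(block_idx, block_dim, thread_idx):
    """CUDA-style 1D global index: blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx


def opencl_global_id(group_id, local_size, local_id, global_offset=0):
    """OpenCL-style 1D global ID.

    Equals get_global_id(0), i.e. the global offset plus
    get_group_id(0) * get_local_size(0) + get_local_id(0).
    """
    return global_offset + group_id * local_size + local_id


# Standard terminology and built-in counterparts (CUDA -> OpenCL).
CUDA_TO_OPENCL = {
    "thread": "work-item",
    "block": "work-group",
    "grid": "NDRange",
    "shared memory": "local memory",
    "threadIdx.x": "get_local_id(0)",
    "blockIdx.x": "get_group_id(0)",
    "blockDim.x": "get_local_size(0)",
    "gridDim.x": "get_num_groups(0)",
    "__syncthreads()": "barrier(CLK_LOCAL_MEM_FENCE)",
}
```

With a zero global offset the two index functions agree for every block and thread, which is why a thin wrapper can present a CUDA-like interface over OpenCL without changing how work is decomposed.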
Abstract:
Stereo vision is a popular research topic in computer vision; it consists of using two images of the same scene, produced by two different cameras, to extract 3D information. The basic idea of stereo vision is to simulate human binocular vision: the two cameras are arranged horizontally to act as "eyes" looking at the 3D scene. By comparing the two resulting images, information about the positions of the objects in the scene can be obtained. In this report we present a stereo vision algorithm: a parallel algorithm whose goal is to trace the contour lines of a geographic area. The algorithm was originally implemented for the Connection Machine CM-2, a supercomputer developed in the 1980s, and was expressed in *Lisp, a language derived from Lisp and designed for that machine. This report also covers the translation and implementation of the algorithm in CUDA, a hardware architecture for parallel processing developed by NVIDIA that allows parallel code to run on GPUs. It also looks at the difficulties encountered in translating from *Lisp to CUDA.
Abstract:
Graphics processors were originally developed for rendering graphics but have recently evolved towards being an architecture for general-purpose computations. They are also expected to become important parts of embedded systems hardware, and not just for graphics. However, this necessitates the development of appropriate timing analysis techniques, because techniques developed for CPU scheduling are not applicable. The reason is that we are not interested in how long it takes for any given GPU thread to complete, but rather how long it takes for all of them to complete. We therefore develop a simple method for finding an upper bound on the makespan of a group of GPU threads executing the same program and competing for the resources of a single streaming multiprocessor (whose architecture is based on NVIDIA Fermi, with some simplifying assumptions). We then build upon this method to formulate the derivation of the exact worst-case makespan (and corresponding schedule) as an optimization problem. Addressing the issue of tractability, we also present a technique for efficiently computing a safe estimate of the worst-case makespan with minimal pessimism, which may be used when finding an exact value would take too long.
Abstract:
Graphics processing units (GPUs) today can be used for computations that go beyond graphics, and such use can attain a performance that is orders of magnitude greater than that of a normal processor. The software executing on a graphics processor is composed of a set of (often thousands of) threads which operate on different parts of the data and thereby jointly compute a result which is delivered to another thread executing on the main processor. Hence the response time of a thread executing on the main processor depends on the finishing time of the threads executing on the GPU. Therefore, we present a simple method for calculating an upper bound on the finishing time of threads executing on a GPU, in particular NVIDIA Fermi. Developing such a method is nontrivial because threads executing on a GPU share hardware resources at very fine granularity.
Abstract:
This paper presents a new parallel implementation of a previously proposed hyperspectral coded aperture (HYCA) algorithm for compressive sensing on graphics processing units (GPUs). The HYCA method combines the ideas of spectral unmixing and compressive sensing, exploiting the high spatial correlation that can be observed in the data and the generally low number of endmembers needed to explain the data. The proposed implementation exploits the GPU architecture at a low level, thus taking full advantage of the computational power of GPUs using shared memory and coalesced accesses to memory. The proposed algorithm is evaluated not only in terms of reconstruction error but also in terms of computational performance using two different GPU architectures by NVIDIA: GeForce GTX 590 and GeForce GTX TITAN. Experimental results using real data reveal significant speedups with respect to a serial implementation.
Abstract:
Hyperspectral imaging can be used for object detection and for discriminating between different objects based on their spectral characteristics. One of the main problems of hyperspectral data analysis is the presence of mixed pixels, due to the low spatial resolution of such images. This means that several spectrally pure signatures (endmembers) are combined into the same mixed pixel. Linear spectral unmixing follows an unsupervised approach which aims at inferring pure spectral signatures and their material fractions at each pixel of the scene. The huge data volumes acquired by such sensors put stringent requirements on processing and unmixing methods. This paper proposes an efficient implementation of an unsupervised linear unmixing method on GPUs using CUDA. The method finds the smallest simplex by solving a sequence of nonsmooth convex subproblems using variable splitting to obtain a constraint formulation, and then applying an augmented Lagrangian technique. The parallel implementation of SISAL presented in this work exploits the GPU architecture at a low level, using shared memory and coalesced accesses to memory. The results presented here indicate that the GPU implementation can significantly accelerate the method's execution over big datasets while maintaining the method's accuracy.
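The linear mixing model that underlies this kind of unmixing can be sketched in a few lines of Python. This is only the forward model (how a mixed pixel arises from endmembers and their fractions), not the SISAL method itself; the function name `mix_pixel` and the three-band spectra are hypothetical values chosen for illustration.

```python
def mix_pixel(endmembers, abundances):
    """Linear mixing model: pixel[b] = sum over k of endmembers[k][b] * abundances[k].

    Abundances are material fractions, so they must be non-negative and sum to one.
    """
    if any(a < 0 for a in abundances) or abs(sum(abundances) - 1.0) > 1e-9:
        raise ValueError("abundances must be non-negative and sum to one")
    n_bands = len(endmembers[0])
    return [sum(e[b] * a for e, a in zip(endmembers, abundances))
            for b in range(n_bands)]


# Hypothetical three-band reflectance spectra of two pure materials.
vegetation = [0.05, 0.10, 0.60]
soil = [0.30, 0.35, 0.40]

# A pixel covering 25% vegetation and 75% soil.
pixel = mix_pixel([vegetation, soil], [0.25, 0.75])
```

Because the fractions are non-negative and sum to one, every mixed pixel lies inside the simplex whose vertices are the endmembers, which is why searching for a smallest enclosing simplex is a way to recover them.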
Abstract:
Hyperspectral imaging has become one of the main topics in remote sensing applications. Hyperspectral images comprise hundreds of spectral bands at different (almost contiguous) wavelength channels over the same area, generating large data volumes of several GBs per flight. This high spectral resolution can be used for object detection and for discriminating between different objects based on their spectral characteristics. One of the main problems involved in hyperspectral analysis is the presence of mixed pixels, which arise when the spatial resolution of the sensor is not able to separate spectrally distinct materials. Spectral unmixing is one of the most important tasks for hyperspectral data exploitation. However, unmixing algorithms can be computationally very expensive and power-hungry, which compromises their use in applications under on-board constraints. In recent years, graphics processing units (GPUs) have evolved into highly parallel and programmable systems. Specifically, several hyperspectral imaging algorithms have been shown to benefit from this hardware, taking advantage of the extremely high floating-point processing performance, compact size, huge memory bandwidth, and relatively low cost of these units, which make them appealing for onboard data processing. In this paper, we propose a parallel implementation of an augmented Lagrangian based method for unsupervised hyperspectral linear unmixing on GPUs using CUDA. The method, called simplex identification via split augmented Lagrangian (SISAL), aims to identify the endmembers of a scene, i.e., it is able to unmix hyperspectral data sets in which the pure pixel assumption is violated. The efficient implementation of the SISAL method presented in this work exploits the GPU architecture at a low level, using shared memory and coalesced accesses to memory.
Abstract:
Dissertation submitted for the degree of Master in Computer Engineering (Engenharia Informática)
Abstract:
Dissertation submitted for the degree of Master in Computer Engineering (Engenharia Informática)
Abstract:
Dissertation submitted for the degree of Master in Computer Engineering (Engenharia Informática)
Abstract:
Breast cancer is the most common cancer among women and a major public health problem. Worldwide, X-ray mammography is the current gold standard for medical imaging of breast cancer. However, it has some well-known limitations. The false-negative rates, up to 66% in symptomatic women, and the false-positive rates, up to 60%, are a continued source of concern and debate. These drawbacks prompt the development of other imaging techniques for breast cancer detection, among them Digital Breast Tomosynthesis (DBT). DBT is a 3D radiographic technique that reduces the obscuring effect of tissue overlap and appears to address both false-negative and false-positive rates. The 3D images in DBT are only achieved through image reconstruction methods. These methods play an important role in a clinical setting, since there is a need for a reconstruction process that is both accurate and fast. This dissertation deals with the optimization of iterative algorithms through parallel computing, with an implementation on Graphics Processing Units (GPUs) that uses Compute Unified Device Architecture (CUDA) to make the 3D reconstruction faster. Iterative algorithms have been shown to produce the highest quality DBT images, but since they are computationally intensive, their clinical use is currently rejected. These algorithms have the potential to reduce patient dose in DBT scans. A method of integrating CUDA in Interactive Data Language (IDL) is proposed in order to accelerate the DBT image reconstructions. This method has never been attempted before for DBT. In this work the system matrix calculation, the most computationally expensive part of iterative algorithms, is accelerated. A speedup of 1.6 is achieved, showing that GPUs can accelerate the IDL implementation.
Abstract:
Simulations that aim to model real phenomena with high accuracy in useful time demand enormous amounts of computational resources, whether processing, memory, or communication. While until recently these capabilities were confined to large supercomputers, with the advent of multicore processors and manycore GPUs the resources needed for this type of problem are now available at reasonable prices not only to researchers but to users in general. This work focuses on optimizing an application that simulates the dynamic behaviour of dry granular materials, a problem in Civil Engineering, more specifically in Geotechnics, where such simulations make it possible, for example, to investigate the displacement of large solid masses caused by slope collapse. There has accordingly been interest in addressing this topic and producing simulations representative of real situations, notably from the CGSE (Australian Research Council Centre of Excellence for Geotechnical Science and Engineering) at the University of Newcastle, in collaboration with a member of UNIC (Centro de Investigação em Estruturas de Construção da FCT/UNL) who has been developing their own line of research, which resulted in a CUDA implementation of an algorithm for GPUs that enables simulations of systems with a large number of particles. The work presented consists of optimizing that implementation, on the premise of no (or minimal) change to the original code, so as to obtain significant improvements both in the application's overall execution time and in the number of particles that can be simulated. At the same time, the proposed formulation is validated by achieving simulations that reflect the physical phenomena with high accuracy.
With the optimizations carried out, the initial execution time was reduced by about 30% while meeting the required correctness and accuracy.