6 results for virtualised GPU

in Repositorio Institucional de la Universidad de Málaga



Abstract: As the general-purpose programming paradigm has evolved, it has produced hardware architectures with widely differing characteristics. In this work we use several image processing applications to demonstrate the differences between three Nvidia hardware platforms: two cards from the GeForce graphics series, the GTX 480 and the GTX 980, and the Jetson TK1, a low-power platform whose purpose is to run embedded applications with extreme efficiency. As test applications we use five examples from the Nvidia CUDA Samples. These applications are directly related to image processing, since the algorithms they use are similar to those found in medical image registration. The tests show that the GTX 980 is both the device with the highest computational power and the one with the highest power consumption, that the Jetson TK1 is the most efficient platform, and that the GTX 480 produces more heat than the others. They also reveal other effects caused by the architectural differences between the devices.



In the multi-core CPU world, transactional memory (TM) has emerged as an alternative to lock-based programming for thread synchronization. Recent research proposes the use of TM in GPU architectures, where a large number of computing threads, organized in SIMT fashion, require an effective synchronization method. In contrast to CPUs, GPUs offer two memory spaces: global memory and local memory. The local memory space serves as a shared scratch-pad for a subset of the computing threads, and programmers use it to speed up their applications thanks to its low latency. Prior work from the authors proposed lightweight hardware TM (HTM) support based on the local memory, modifying the SIMT execution model and adding a conflict detection mechanism. An efficient implementation of these features is key to providing an effective synchronization mechanism at the local memory level. After a brief description of the main features of our HTM design for GPU local memory, this work gathers a number of proposals aimed at improving the mechanisms that have the greatest impact on performance. Firstly, the SIMT execution model is modified to increase the parallelism of the application when transactions must be serialized in order to make forward progress. Secondly, the conflict detection mechanism is optimized according to application characteristics, such as the read/write sets, the probability of conflict between transactions, and the existence of read-only transactions. As these features can be present in hardware simultaneously, it is up to the compiler and runtime to determine which ones matter most for a given application. This work includes a discussion of the analysis needed to choose the best configuration.



Abstract: Medical image processing in general, and brain image processing in particular, are computationally intensive tasks. Fortunately, they can be made more widely accessible through techniques such as GPU programming. In this article we study NiftyReg, a brain image processing library with a GPU implementation in CUDA, and analyse different ways of further optimising the existing code. We focus on making full use of the memory hierarchy and on exploiting the computational power of the CPU. The ideas that led us to each attempt to change and optimize the code are presented as hypotheses, which we then test empirically using the results obtained from running the application. Finally, for each set of related optimizations we study the validity of the results in terms of both performance and the accuracy of the resulting images.



This document presents an implementation of the travelling salesman problem using a custom self-organizing map, building on earlier solutions and adapting them to the CUDA architecture. It also compares the efficient CUDA C/C++ implementation with an implementation based on the GPU functions included in Matlab's Parallel Computing Toolbox. The proposed solution reduces by almost a quarter the number of iterations needed to reach a good solution to the problem, in addition to the immediate improvement gained from using parallel architectures. The study also examines the time improvement achieved through the specific use of shared memory, one of the most powerful tools for improving performance. Regarding execution times, it concludes that the best solution is to launch a CUDA kernel from Matlab through the functionality included in the Parallel Computing Toolbox.



This doctoral thesis presents the theoretical foundations needed to design finite volume numerical schemes for nonconservative hyperbolic systems in one and two dimensions. For the one-dimensional case, it reviews the concepts of path-conservative and well-balanced schemes, as well as the extension of numerical schemes to high order based on state reconstruction. In particular, it presents PVM (Polynomial Viscosity Matrix) schemes, together with several flux-limiter schemes that arise as the natural extension of the WAF method, built on some PVM-type schemes. For the two-dimensional case, it addresses the design of path-conservative, well-balanced finite volume schemes for nonconservative hyperbolic systems and their extension to high order; in particular, it presents a compact third-order state reconstruction obtained from a WENO combination of paraboloids and planes.
The thesis also presents the development of numerical methods for the two-dimensional one-layer shallow water system. In particular, it defines first-order HLL and FORCE schemes and their extension to high order, a flux-limiter method based on the HLL-WAF scheme, and their implementation on GPU architectures using the CUDA programming environment. It then presents a first-order numerical scheme for the two-dimensional one-layer shallow water system in spherical (longitude/latitude) coordinates, as well as the natural extension to this system of the flux-limiter method presented in Chapter 3. Finally, it validates the flux-limiter scheme by simulating real tsunamis and comparing the results with field data.



After a decade evolving in the High Performance Computing arena, GPU-equipped supercomputers have conquered the TOP500 and Green500 lists, providing unprecedented levels of computational power and memory bandwidth. This year, major vendors have introduced new accelerators based on 3D memory, such as the Xeon Phi Knights Landing by Intel and the Pascal architecture by Nvidia. This paper reviews the hardware features of those new HPC accelerators and examines their potential performance for scientific applications, with an emphasis on the Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM) used by commercial products according to roadmaps already announced.