Biblioteca Digital

999 resultados para Multi-CPU

Collision Detection: Broad Phase Adaptation from Multi-Core to Multi-GPU Architecture

Relevância:

70.00% 70.00%

Publicador:

Resumo:

We present in this paper several contributions on the collision detection optimization centered on hardware performance. We focus on the broad phase which is the first step of the collision detection process and propose three new ways of parallelization of the well-known Sweep and Prune algorithm. We first developed a multi-core model takes into account the number of available cores. Multi-core architecture enables us to distribute geometric computations with use of multi-threading. Critical writing section and threads idling have been minimized by introducing new data structures for each thread. Programming with directives, like OpenMP, appears to be a good compromise for code portability. We then proposed a new GPU-based algorithm also based on the "Sweep and Prune" that has been adapted to multi-GPU architectures. Our technique is based on a spatial subdivision method used to distribute computations among GPUs. Results show that significant speed-up can be obtained by passing from 1 to 4 GPUs in a large-scale environment.

Optimization of Pattern Matching Algorithms for Multi- and Many-Core Platforms

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Image and video compression play a major role in the world today, allowing the storage and transmission of large multimedia content volumes. However, the processing of this information requires high computational resources, hence the improvement of the computational performance of these compression algorithms is very important. The Multidimensional Multiscale Parser (MMP) is a pattern-matching-based compression algorithm for multimedia contents, namely images, achieving high compression ratios, maintaining good image quality, Rodrigues et al. [2008]. However, in comparison with other existing algorithms, this algorithm takes some time to execute. Therefore, two parallel implementations for GPUs were proposed by Ribeiro [2016] and Silva [2015] in CUDA and OpenCL-GPU, respectively. In this dissertation, to complement the referred work, we propose two parallel versions that run the MMP algorithm in CPU: one resorting to OpenMP and another that converts the existing OpenCL-GPU into OpenCL-CPU. The proposed solutions are able to improve the computational performance of MMP by 3 and 2:7 , respectively. The High Efficiency Video Coding (HEVC/H.265) is the most recent standard for compression of image and video. Its impressive compression performance, makes it a target for many adaptations, particularly for holoscopic image/video processing (or light field). Some of the proposed modifications to encode this new multimedia content are based on geometry-based disparity compensations (SS), developed by Conti et al. [2014], and a Geometric Transformations (GT) module, proposed by Monteiro et al. [2015]. These compression algorithms for holoscopic images based on HEVC present an implementation of specific search for similar micro-images that is more efficient than the one performed by HEVC, but its implementation is considerably slower than HEVC. In order to enable better execution times, we choose to use the OpenCL API as the GPU enabling language in order to increase the module performance. With its most costly setting, we are able to reduce the GT module execution time from 6.9 days to less then 4 hours, effectively attaining a speedup of 45 .

Aceleração da classificação de documentos com recurso a GPU

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Dissertação para obtenção do Grau de Mestre em Engenharia Informática

Heterogeneous multicore systems for signal processing

Relevância:

60.00% 60.00%

Publicador:

Resumo:

This thesis explores the capabilities of heterogeneous multi-core systems, based on multiple Graphics Processing Units (GPUs) in a standard desktop framework. Multi-GPU accelerated desk side computers are an appealing alternative to other high performance computing (HPC) systems: being composed of commodity hardware components fabricated in large quantities, their price-performance ratio is unparalleled in the world of high performance computing. Essentially bringing “supercomputing to the masses”, this opens up new possibilities for application fields where investing in HPC resources had been considered unfeasible before. One of these is the field of bioelectrical imaging, a class of medical imaging technologies that occupy a low-cost niche next to million-dollar systems like functional Magnetic Resonance Imaging (fMRI). In the scope of this work, several computational challenges encountered in bioelectrical imaging are tackled with this new kind of computing resource, striving to help these methods approach their true potential. Specifically, the following main contributions were made: Firstly, a novel dual-GPU implementation of parallel triangular matrix inversion (TMI) is presented, addressing an crucial kernel in computation of multi-mesh head models of encephalographic (EEG) source localization. This includes not only a highly efficient implementation of the routine itself achieving excellent speedups versus an optimized CPU implementation, but also a novel GPU-friendly compressed storage scheme for triangular matrices. Secondly, a scalable multi-GPU solver for non-hermitian linear systems was implemented. It is integrated into a simulation environment for electrical impedance tomography (EIT) that requires frequent solution of complex systems with millions of unknowns, a task that this solution can perform within seconds. In terms of computational throughput, it outperforms not only an highly optimized multi-CPU reference, but related GPU-based work as well. Finally, a GPU-accelerated graphical EEG real-time source localization software was implemented. Thanks to acceleration, it can meet real-time requirements in unpreceeded anatomical detail running more complex localization algorithms. Additionally, a novel implementation to extract anatomical priors from static Magnetic Resonance (MR) scansions has been included.

Selecting methods to solve multi-well master equations

Relevância:

30.00% 30.00%

Publicador:

Resumo:

There are several competing methods commonly used to solve energy grained master equations describing gas-phase reactive systems. When it comes to selecting an appropriate method for any particular problem, there is little guidance in the literature. In this paper we directly compare several variants of spectral and numerical integration methods from the point of view of computer time required to calculate the solution and the range of temperature and pressure conditions under which the methods are successful. The test case used in the comparison is an important reaction in combustion chemistry and incorporates reversible and irreversible bimolecular reaction steps as well as isomerizations between multiple unimolecular species. While the numerical integration of the ODE with a stiff ODE integrator is not the fastest method overall, it is the fastest method applicable to all conditions.

A Multi-DAG Model for Real-Time Parallel Applications with Conditional Execution

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The 30th ACM/SIGAPP Symposium On Applied Computing (SAC 2015). 13 to 17, Apr, 2015, Embedded Systems. Salamanca, Spain.

Multi-objective optimization scheme for static and dynamic multicast flows

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Muchas de las nuevas aplicaciones emergentes de Internet tales como TV sobre Internet, Radio sobre Internet,Video Streamming multi-punto, entre otras, necesitan los siguientes requerimientos de recursos: ancho de banda consumido, retardo extremo-a-extremo, tasa de paquetes perdidos, etc. Por lo anterior, es necesario formular una propuesta que especifique y provea para este tipo de aplicaciones los recursos necesarios para su buen funcionamiento. En esta tesis, proponemos un esquema de ingeniería de tráfico multi-objetivo a través del uso de diferentes árboles de distribución para muchos flujos multicast. En este caso, estamos usando la aproximación de múltiples caminos para cada nodo egreso y de esta forma obtener la aproximación de múltiples árboles y a través de esta forma crear diferentes árboles multicast. Sin embargo, nuestra propuesta resuelve la fracción de la división del tráfico a través de múltiples árboles. La propuesta puede ser aplicada en redes MPLS estableciendo rutas explícitas en eventos multicast. En primera instancia, el objetivo es combinar los siguientes objetivos ponderados dentro de una métrica agregada: máxima utilización de los enlaces, cantidad de saltos, el ancho de banda total consumido y el retardo total extremo-a-extremo. Nosotros hemos formulado esta función multi-objetivo (modelo MHDB-S) y los resultados obtenidos muestran que varios objetivos ponderados son reducidos y la máxima utilización de los enlaces es minimizada. El problema es NP-duro, por lo tanto, un algoritmo es propuesto para optimizar los diferentes objetivos. El comportamiento que obtuvimos usando este algoritmo es similar al que obtuvimos con el modelo. Normalmente, durante la transmisión multicast los nodos egresos pueden salir o entrar del árbol y por esta razón en esta tesis proponemos un esquema de ingeniería de tráfico multi-objetivo usando diferentes árboles para grupos multicast dinámicos. (en el cual los nodos egresos pueden cambiar durante el tiempo de vida de la conexión). Si un árbol multicast es recomputado desde el principio, esto podría consumir un tiempo considerable de CPU y además todas las comuicaciones que están usando el árbol multicast serán temporalmente interrumpida. Para aliviar estos inconvenientes, proponemos un modelo de optimización (modelo dinámico MHDB-D) que utilice los árboles multicast previamente computados (modelo estático MHDB-S) adicionando nuevos nodos egreso. Usando el método de la suma ponderada para resolver el modelo analítico, no necesariamente es correcto, porque es posible tener un espacio de solución no convexo y por esta razón algunas soluciones pueden no ser encontradas. Adicionalmente, otros tipos de objetivos fueron encontrados en diferentes trabajos de investigación. Por las razones mencionadas anteriormente, un nuevo modelo llamado GMM es propuesto y para dar solución a este problema un nuevo algoritmo usando Algoritmos Evolutivos Multi-Objetivos es propuesto. Este algoritmo esta inspirado por el algoritmo Strength Pareto Evolutionary Algorithm (SPEA). Para dar una solución al caso dinámico con este modelo generalizado, nosotros hemos propuesto un nuevo modelo dinámico y una solución computacional usando Breadth First Search (BFS) probabilístico. Finalmente, para evaluar nuestro esquema de optimización propuesto, ejecutamos diferentes pruebas y simulaciones. Las principales contribuciones de esta tesis son la taxonomía, los modelos de optimización multi-objetivo para los casos estático y dinámico en transmisiones multicast (MHDB-S y MHDB-D), los algoritmos para dar solución computacional a los modelos. Finalmente, los modelos generalizados también para los casos estático y dinámico (GMM y GMM Dinámico) y las propuestas computacionales para dar slución usando MOEA y BFS probabilístico.

A multi-GPU algorithm for large-scale neuronal networks

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Large-scale simulations of parts of the brain using detailed neuronal models to improve our understanding of brain functions are becoming a reality with the usage of supercomputers and large clusters. However, the high acquisition and maintenance cost of these computers, including the physical space, air conditioning, and electrical power, limits the number of simulations of this kind that scientists can perform. Modern commodity graphical cards, based on the CUDA platform, contain graphical processing units (GPUs) composed of hundreds of processors that can simultaneously execute thousands of threads and thus constitute a low-cost solution for many high-performance computing applications. In this work, we present a CUDA algorithm that enables the execution, on multiple GPUs, of simulations of large-scale networks composed of biologically realistic Hodgkin-Huxley neurons. The algorithm represents each neuron as a CUDA thread, which solves the set of coupled differential equations that model each neuron. Communication among neurons located in different GPUs is coordinated by the CPU. We obtained speedups of 40 for the simulation of 200k neurons that received random external input and speedups of 9 for a network with 200k neurons and 20M neuronal connections, in a single computer with two graphic boards with two GPUs each, when compared with a modern quad-core CPU. Copyright (C) 2010 John Wiley & Sons, Ltd.

Use of shared memory in the context of embedded multi-core processors: exploration of the technology and its limits

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Modern embedded systems embrace many-core shared-memory designs. Due to constrained power and area budgets, most of them feature software-managed scratchpad memories instead of data caches to increase the data locality. It is therefore programmers’ responsibility to explicitly manage the memory transfers, and this make programming these platform cumbersome. Moreover, complex modern applications must be adequately parallelized before they can the parallel potential of the platform into actual performance. To support this, programming languages were proposed, which work at a high level of abstraction, and rely on a runtime whose cost hinders performance, especially in embedded systems, where resources and power budget are constrained. This dissertation explores the applicability of the shared-memory paradigm on modern many-core systems, focusing on the ease-of-programming. It focuses on OpenMP, the de-facto standard for shared memory programming. In a first part, the cost of algorithms for synchronization and data partitioning are analyzed, and they are adapted to modern embedded many-cores. Then, the original design of an OpenMP runtime library is presented, which supports complex forms of parallelism such as multi-level and irregular parallelism. In the second part of the thesis, the focus is on heterogeneous systems, where hardware accelerators are coupled to (many-)cores to implement key functional kernels with orders-of-magnitude of speedup and energy efficiency compared to the “pure software” version. However, three main issues rise, namely i) platform design complexity, ii) architectural scalability and iii) programmability. To tackle them, a template for a generic hardware processing unit (HWPU) is proposed, which share the memory banks with cores, and the template for a scalable architecture is shown, which integrates them through the shared-memory system. Then, a full software stack and toolchain are developed to support platform design and to let programmers exploiting the accelerators of the platform. The OpenMP frontend is extended to interact with it.

Heterogeneous Multi-core Architectures for High Performance Computing

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This thesis deals with heterogeneous architectures in standard workstations. Heterogeneous architectures represent an appealing alternative to traditional supercomputers because they are based on commodity components fabricated in large quantities. Hence their price-performance ratio is unparalleled in the world of high performance computing (HPC). In particular, different aspects related to the performance and consumption of heterogeneous architectures have been explored. The thesis initially focuses on an efficient implementation of a parallel application, where the execution time is dominated by an high number of floating point instructions. Then the thesis touches the central problem of efficient management of power peaks in heterogeneous computing systems. Finally it discusses a memory-bounded problem, where the execution time is dominated by the memory latency. Specifically, the following main contributions have been carried out: A novel framework for the design and analysis of solar field for Central Receiver Systems (CRS) has been developed. The implementation based on desktop workstation equipped with multiple Graphics Processing Units (GPUs) is motivated by the need to have an accurate and fast simulation environment for studying mirror imperfection and non-planar geometries. Secondly, a power-aware scheduling algorithm on heterogeneous CPU-GPU architectures, based on an efficient distribution of the computing workload to the resources, has been realized. The scheduler manages the resources of several computing nodes with a view to reducing the peak power. The two main contributions of this work follow: the approach reduces the supply cost due to high peak power whilst having negligible impact on the parallelism of computational nodes. from another point of view the developed model allows designer to increase the number of cores without increasing the capacity of the power supply unit. Finally, an implementation for efficient graph exploration on reconfigurable architectures is presented. The purpose is to accelerate graph exploration, reducing the number of random memory accesses.

Towards a fuzzy-based multi-classifier selection module for activity recognition applications

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Performing activity recognition using the information provided by the different sensors embedded in a smartphone face limitations due to the capabilities of those devices when the computations are carried out in the terminal. In this work a fuzzy inference module is implemented in order to decide which classifier is the most appropriate to be used at a specific moment regarding the application requirements and the device context characterized by its battery level, available memory and CPU load. The set of classifiers that is considered is composed of Decision Tables and Trees that have been trained using different number of sensors and features. In addition, some classifiers perform activity recognition regardless of the on-body device position and others rely on the previous recognition of that position to use a classifier that is trained with measurements gathered with the mobile placed on that specific position. The modules implemented show that an evaluation of the classifiers allows sorting them so the fuzzy inference module can choose periodically the one that best suits the device context and application requirements.

A multi-resource load balancing algorithm for cloud cache systems

Relevância:

30.00% 30.00%

Publicador:

Resumo:

With the advent of cloud computing model, distributed caches have become the cornerstone for building scalable applications. Popular systems like Facebook [1] or Twitter use Memcached [5], a highly scalable distributed object cache, to speed up applications by avoiding database accesses. Distributed object caches assign objects to cache instances based on a hashing function, and objects are not moved from a cache instance to another unless more instances are added to the cache and objects are redistributed. This may lead to situations where some cache instances are overloaded when some of the objects they store are frequently accessed, while other cache instances are less frequently used. In this paper we propose a multi-resource load balancing algorithm for distributed cache systems. The algorithm aims at balancing both CPU and Memory resources among cache instances by redistributing stored data. Considering the possible conflict of balancing multiple resources at the same time, we give CPU and Memory resources weighted priorities based on the runtime load distributions. A scarcer resource is given a higher weight than a less scarce resource when load balancing. The system imbalance degree is evaluated based on monitoring information, and the utility load of a node, a unit for resource consumption. Besides, since continuous rebalance of the system may affect the QoS of applications utilizing the cache system, our data selection policy ensures that each data migration minimizes the system imbalance degree and hence, the total reconfiguration cost can be minimized. An extensive simulation is conducted to compare our policy with other policies. Our policy shows a significant improvement in time efficiency and decrease in reconfiguration cost.

A parallel method for impulsive image noise removal on hybrid CPU/GPU systems

Relevância:

30.00% 30.00%

Publicador:

Resumo:

A parallel algorithm for image noise removal is proposed. The algorithm is based on peer group concept and uses a fuzzy metric. An optimization study on the use of the CUDA platform to remove impulsive noise using this algorithm is presented. Moreover, an implementation of the algorithm on multi-core platforms using OpenMP is presented. Performance is evaluated in terms of execution time and a comparison of the implementation parallelised in multi-core, GPUs and the combination of both is conducted. A performance analysis with large images is conducted in order to identify the amount of pixels to allocate in the CPU and GPU. The observed time shows that both devices must have work to do, leaving the most to the GPU. Results show that parallel implementations of denoising filters on GPUs and multi-cores are very advisable, and they open the door to use such algorithms for real-time processing.

Image Noise Removal on Heterogeneous CPU-GPU Configurations

Relevância:

30.00% 30.00%

Publicador:

Resumo:

A parallel algorithm to remove impulsive noise in digital images using heterogeneous CPU/GPU computing is proposed. The parallel denoising algorithm is based on the peer group concept and uses an Euclidean metric. In order to identify the amount of pixels to be allocated in multi-core and GPUs, a performance analysis using large images is presented. A comparison of the parallel implementation in multi-core, GPUs and a combination of both is performed. Performance has been evaluated in terms of execution time and Megapixels/second. We present several optimization strategies especially effective for the multi-core environment, and demonstrate significant performance improvements. The main advantage of the proposed noise removal methodology is its computational speed, which enables efficient filtering of color images in real-time applications.

Validation of a multi-component digital dissolution model for irregular particles

Relevância:

30.00% 30.00%

Publicador:

Resumo:

A new mesoscale simulation model for solids dissolution based on an computationally efficient and versatile digital modelling approach (DigiDiss) is considered and validated against analytical solutions and published experimental data for simple geometries. As the digital model is specifically designed to handle irregular shapes and complex multi-component structures, use of the model is explored for single crystals (sugars) and clusters. Single crystals and the cluster were first scanned using X-ray microtomography to obtain a digital version of their structures. The digitised particles and clusters were used as a structural input to digital simulation. The same particles were then dissolved in water and the dissolution process was recorded by a video camera and analysed yielding: the overall dissolution times and images of particle size and shape during the dissolution. The results demonstrate the coherence of simulation method to reproduce experimental behaviour, based on known chemical and diffusion properties of constituent phase. The paper discusses how further sophistications to the modelling approach will need to include other important effects such as complex disintegration effects (particle ejection, uncertainties in chemical properties). The nature of the digital modelling approach is well suited to for future implementation with high speed computation using hybrid conventional (CPU) and graphical processor (GPU) systems.

«
1
2
3
4
5
6
7
8
...
66
67
»