986 results for LDPC, CUDA, GPGPU, computing, GPU, DVB, S2, SDR


Relevance:

30.00%

Publisher:

Abstract:

The contour tree is a topological abstraction of a scalar field that captures the evolution in level set connectivity. It is an effective representation for visual exploration and analysis of scientific data. We describe a work-efficient, output-sensitive, and scalable parallel algorithm for computing the contour tree of a scalar field defined on a domain that is represented using either an unstructured mesh or a structured grid. A hybrid implementation of the algorithm using the GPU and multi-core CPU can compute the contour tree of an input containing 16 million vertices in less than ten seconds, with a speedup factor of up to 13. Experiments based on an implementation in a multi-core CPU environment show near-linear speedup for large data sets.
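As context, a minimal serial sketch of the join-tree sweep that contour-tree algorithms build on (following the general Carr et al. scheme): vertices are processed in decreasing scalar order and components are merged with union-find. This is not the paper's hybrid GPU/multi-core algorithm; all function and type names here are ours.

```cuda
// Serial join-tree sweep sketch (host-side C++, compiles under nvcc).
#include <numeric>
#include <utility>
#include <vector>

struct UnionFind {
    std::vector<int> parent;
    explicit UnionFind(int n) : parent(n) { std::iota(parent.begin(), parent.end(), 0); }
    int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
    void unite(int a, int b) { parent[find(a)] = find(b); }
};

// order = vertices sorted by scalar value (descending); adj = mesh adjacency.
// Emits the arcs of a fully augmented join tree as (child, parent) pairs.
std::vector<std::pair<int,int>> joinTree(const std::vector<int>& order,
                                         const std::vector<std::vector<int>>& adj) {
    int n = (int)order.size();
    std::vector<int> rank(n);
    for (int i = 0; i < n; ++i) rank[order[i]] = i;
    UnionFind uf(n);
    std::vector<int> lowest(n);            // lowest vertex seen in each component
    std::iota(lowest.begin(), lowest.end(), 0);
    std::vector<std::pair<int,int>> arcs;
    for (int v : order) {                  // sweep from high to low scalar value
        for (int u : adj[v]) {
            if (rank[u] < rank[v] && uf.find(u) != uf.find(v)) {
                arcs.push_back({lowest[uf.find(u)], v});  // component joins at v
                uf.unite(u, v);
                lowest[uf.find(v)] = v;
            }
        }
    }
    return arcs;
}
```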

Relevance:

30.00%

Publisher:

Abstract:

Rapid advancements in multi-core processor architectures, coupled with low-cost, low-latency, high-bandwidth interconnects, have made clusters of multi-core machines a common computing resource. Unfortunately, writing good parallel programs that efficiently utilize all the resources in such a cluster is still a major challenge. Various programming languages have been proposed as a solution to this problem, but they are yet to be adopted widely for performance-critical code, mainly due to relatively immature software frameworks and the effort involved in rewriting existing code in a new language. In this paper, we motivate and describe our initial study exploring CUDA as a programming language for a cluster of multi-cores. We develop CUDA-For-Clusters (CFC), a framework that transparently orchestrates execution of CUDA kernels on a cluster of multi-core machines. The well-structured nature of a CUDA kernel and the growing popularity, support, and stability of the CUDA software stack collectively make CUDA a good candidate programming language for a cluster. CFC uses a mixture of source-to-source compiler transformations, a work-distribution runtime, and a lightweight software distributed shared memory to manage parallel executions. Initial results from running several standard CUDA benchmark programs show speedups of up to 7.5X on a cluster with 8 nodes, opening up an interesting direction for further research.
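For context, a minimal, self-contained example of the "well-structured" kind of CUDA kernel CFC targets. CFC itself would transform and distribute such a kernel across cluster nodes; here we only show an ordinary single-GPU launch, and the kernel choice (SAXPY) is ours.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, x, y);
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", y[0]);  // expect 5.0
    cudaFree(x); cudaFree(y);
    return 0;
}
```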

Relevance:

30.00%

Publisher:

Abstract:

Multi-GPU machines are being increasingly used in high-performance computing. Each GPU in such a machine has its own memory and does not share the address space either with the host CPU or with other GPUs. Hence, applications utilizing multiple GPUs have to manually allocate and manage data on each GPU. Existing works that propose to automate data allocation for GPUs have limitations and inefficiencies in terms of allocation sizes, exploiting reuse, transfer costs, and scalability. We propose a scalable and fully automatic data allocation and buffer management scheme for affine loop nests on multi-GPU machines, which we call the Bounding-Box-based Memory Manager (BBMM). At runtime, BBMM can perform standard set operations such as union, intersection, and difference, as well as find subset and superset relations, on hyperrectangular regions of array data (bounding boxes). It uses these operations, along with some compiler assistance, to identify, allocate, and manage data required by applications in terms of disjoint bounding boxes. This allows it to (1) allocate exactly or nearly as much data as is required by computations running on each GPU, (2) efficiently track buffer allocations and hence maximize data reuse across tiles and minimize data transfer overhead, and (3) as a result, maximize utilization of the combined memory on multi-GPU machines. BBMM can work with any choice of parallelizing transformations, computation placement, and scheduling schemes, whether static or dynamic. Experiments run on a four-GPU machine with various scientific programs showed that BBMM reduces data allocations on each GPU by up to 75% compared to current allocation schemes, yields performance of at least 88% of manually written code, and allows excellent weak scaling.
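A hedged sketch of the hyperrectangle (bounding-box) primitives a manager like BBMM needs at runtime: intersection, containment (subset), and the enclosing-box union. The fixed dimensionality and all names are our assumptions, not BBMM's implementation.

```cuda
#include <algorithm>
#include <array>

constexpr int DIMS = 3;

struct Box {
    std::array<int, DIMS> lo, hi;  // inclusive index bounds per array dimension
};

bool intersects(const Box& a, const Box& b) {
    for (int d = 0; d < DIMS; ++d)
        if (a.hi[d] < b.lo[d] || b.hi[d] < a.lo[d]) return false;
    return true;
}

bool contains(const Box& outer, const Box& inner) {  // subset test
    for (int d = 0; d < DIMS; ++d)
        if (inner.lo[d] < outer.lo[d] || inner.hi[d] > outer.hi[d]) return false;
    return true;
}

Box boundingUnion(const Box& a, const Box& b) {  // smallest box covering both
    Box r;
    for (int d = 0; d < DIMS; ++d) {
        r.lo[d] = std::min(a.lo[d], b.lo[d]);
        r.hi[d] = std::max(a.hi[d], b.hi[d]);
    }
    return r;
}
```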

Relevance:

30.00%

Publisher:

Abstract:

This paper presents a GPU implementation of normalized cuts for the road extraction problem using panchromatic satellite imagery. Roads are extracted in three stages, namely pre-processing, image segmentation, and post-processing. Initially, the image is pre-processed to improve tolerance by reducing clutter (mostly buildings, vegetation, and fallow regions). The road regions are then extracted using the normalized cuts algorithm, a graph-based partitioning approach that focuses on extracting the global impression (perceptual grouping) of an image rather than local features. The segmented image is post-processed using the morphological operations erosion and dilation. Finally, the extracted road image is overlaid on the original image. A GPGPU (General-Purpose computing on Graphics Processing Units) approach has been adopted to implement the algorithm on the GPU for fast processing. A performance comparison of the proposed GPU implementation of normalized cuts with the earlier CPU implementation is presented. The results show that the computational time improvement of the GPU implementation grows as the image size increases. A qualitative and quantitative assessment of the segmentation results is also presented.
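As an illustration of the post-processing step, a hedged sketch of a 3x3 binary dilation kernel (erosion is the same pattern with a minimum instead of a maximum). This shows the morphological operations named in the abstract, not the authors' code; all names are ours.

```cuda
#include <cuda_runtime.h>

__global__ void dilate3x3(const unsigned char* in, unsigned char* out,
                          int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    unsigned char v = 0;
    for (int dy = -1; dy <= 1; ++dy)        // scan the 3x3 neighborhood
        for (int dx = -1; dx <= 1; ++dx) {
            int nx = x + dx, ny = y + dy;
            if (nx >= 0 && nx < width && ny >= 0 && ny < height &&
                in[ny * width + nx] > v)
                v = in[ny * width + nx];
        }
    out[y * width + x] = v;                 // pixel takes its neighborhood maximum
}
```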

Relevance:

30.00%

Publisher:

Abstract:

A block-structured adaptive mesh refinement (AMR) technique has been used to obtain numerical solutions for many scientific applications. Some block-structured AMR approaches focus on forming patches of non-uniform sizes, where the size of a patch can be tuned to the geometry of a region of interest. In this paper, we develop strategies for adaptive execution of block-structured AMR applications on GPUs, for hyperbolic directionally split solvers. While effective hybrid execution strategies exist for applications with uniform patches, our work considers efficient execution of non-uniform patches with different workloads. Our techniques include bin-packing work units to load-balance GPU computations, adaptive asynchronism between CPU and GPU executions using a knapsack formulation, and scheduling communications for multi-GPU executions. Our experiments with synthetic and real data, for single-GPU and multi-GPU executions, on Tesla S1070 and Fermi C2070 clusters, show that our strategies yield up to a 3.23x performance speedup over existing strategies.
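A hedged illustration of one ingredient named above: first-fit-decreasing bin-packing of non-uniform AMR patch workloads into roughly equal GPU work units. The capacity heuristic, types, and names are our assumptions, not the paper's scheduler.

```cuda
#include <algorithm>
#include <vector>

struct Patch { int id; double work; };  // estimated workload of one patch

// Pack patches into bins of capacity `cap` using first-fit-decreasing.
std::vector<std::vector<Patch>> firstFitDecreasing(std::vector<Patch> patches,
                                                   double cap) {
    std::sort(patches.begin(), patches.end(),
              [](const Patch& a, const Patch& b) { return a.work > b.work; });
    std::vector<std::vector<Patch>> bins;
    std::vector<double> load;
    for (const Patch& p : patches) {
        size_t b = 0;
        while (b < bins.size() && load[b] + p.work > cap) ++b;  // first fit
        if (b == bins.size()) { bins.push_back({}); load.push_back(0.0); }
        bins[b].push_back(p);
        load[b] += p.work;
    }
    return bins;  // each bin becomes one GPU work unit / kernel launch
}
```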

Relevance:

30.00%

Publisher:

Abstract:

Rapid reconstruction of multidimensional images is crucial for enabling real-time 3D fluorescence imaging. This becomes a key factor for imaging rapidly occurring events in the cellular environment. To facilitate real-time imaging, we have developed a graphics processing unit (GPU) based real-time maximum a posteriori (MAP) image reconstruction system. The parallel processing capability of the GPU device, which consists of a large number of tiny processing cores, and the adaptability of the image reconstruction algorithm to parallel processing (employing multiple independent computing modules called threads) result in high temporal resolution. Moreover, the proposed quadratic-potential-based MAP algorithm effectively deconvolves the images as well as suppresses the noise. The multi-node, multi-threaded GPU and the Compute Unified Device Architecture (CUDA) efficiently execute the iterative image reconstruction algorithm, which is approximately 200-fold faster (for large datasets) than existing CPU-based systems. (C) 2015 Author(s). All article content, except where otherwise noted, is licensed under a Creative Commons Attribution 3.0 Unported License.
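A hedged sketch of the kind of per-voxel step such a system parallelizes: an element-wise MAP-EM update damped by a quadratic-prior gradient (Green's one-step-late form), one CUDA thread per voxel. The exact update rule, the variable names, and the assumption that the blur/backprojection ratio is computed elsewhere are all ours, not the paper's algorithm.

```cuda
#include <cuda_runtime.h>

__global__ void mapUpdate(float* img, const float* backproj_ratio,
                          const float* prior_grad, float beta, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Multiplicative EM update damped by the quadratic-prior gradient:
    // x <- x * r / (1 + beta * dU/dx), clamped away from zero for stability.
    float denom = 1.0f + beta * prior_grad[i];
    img[i] = img[i] * backproj_ratio[i] / fmaxf(denom, 1e-6f);
}
```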

Relevance:

30.00%

Publisher:

Abstract:

We present a method of rapidly producing computer-generated holograms that exhibit geometric occlusion in the reconstructed image. Conceptually, a bundle of rays is shot from every hologram sample into the object volume. We use z-buffering to find the nearest intersecting object point for every ray and add its complex field contribution to the corresponding hologram sample. Each hologram sample belongs to an independent operation, allowing us to exploit the parallel computing capability of modern programmable graphics processing units (GPUs). Unlike algorithms that use points or planar segments as the basis for constructing the hologram, our algorithm's complexity depends on fixed system parameters, such as the number of ray-casting operations, and can therefore handle complicated models more efficiently. The finite number of hologram pixels is, in effect, a windowing function, and by analyzing the Wigner distribution function of the windowed free-space transfer function we find an upper limit on the cone angle of the ray bundle. Experimentally, we found that an angular sampling distance of 0.01° for a 2.66° cone angle produces acceptable reconstruction quality. © 2009 Optical Society of America.
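A hedged sketch of the per-sample ray-casting idea in 2D: each CUDA thread owns one hologram sample, casts a small ray bundle, keeps only the nearest hit per ray (the z-buffering/occlusion step), and accumulates that point's complex contribution. The stub scene query and every name are our assumptions; the real renderer intersects actual geometry.

```cuda
#include <cuda_runtime.h>

struct Hit { float dist; bool valid; };

__device__ Hit nearestHit(float2 o, float2 d) {
    // Stub scene: a single opaque plane 0.1 units in front of the hologram
    // (stands in for the z-buffered nearest-object intersection).
    Hit h; h.valid = d.y > 0.0f; h.dist = h.valid ? 0.1f / d.y : 0.0f;
    return h;
}

__global__ void castHologram(float2* hologram, int samples,
                             float pitch, float k, int raysPerSample) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= samples) return;
    float2 origin = make_float2(s * pitch, 0.0f);
    float2 acc = make_float2(0.0f, 0.0f);
    for (int r = 0; r < raysPerSample; ++r) {          // bundle within cone angle
        float theta = (r - raysPerSample / 2) * 0.01f; // angular step (illustrative)
        Hit h = nearestHit(origin, make_float2(sinf(theta), cosf(theta)));
        if (h.valid) {                                 // nearest point only: occlusion
            float phase = k * h.dist;
            acc.x += cosf(phase) / h.dist;             // real part of e^{ikd}/d
            acc.y += sinf(phase) / h.dist;             // imaginary part
        }
    }
    hologram[s] = acc;
}
```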

Relevance:

30.00%

Publisher:

Abstract:

Conventional mesh-based numerical methods have been widely applied to solve problems in Computational Fluid Dynamics. However, in fluid flow problems involving free surfaces, large explosions, large deformations, discontinuities, shock waves, and the like, these methods can run into practical difficulties. Mesh-free particle methods are a viable alternative. This work presents an introduction to the Lagrangian, mesh-free Smoothed Particle Hydrodynamics (SPH) particle method, aimed at the numerical simulation of compressible and quasi-incompressible Newtonian fluid flows. Two numerical codes were developed, a serial version and a parallel one, using the C/C++ programming language and the Compute Unified Device Architecture (CUDA), which enables parallel processing on the cores of the Graphics Processing Units (GPUs) of NVIDIA Corporation video cards. The numerical results were validated and the computational efficiency evaluated by solving the one-dimensional Shock Tube and Blast Wave problems and the two-dimensional Shear-Driven Cavity Problem.
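A hedged sketch of the core SPH step: per-particle density summation with a cubic-spline smoothing kernel in 1D, one CUDA thread per particle. A production code would use a neighbor grid instead of this O(N²) loop; the names and the 1D setting are our simplifications, not the thesis code.

```cuda
#include <cuda_runtime.h>

__device__ float cubicSpline(float r, float h) {
    float q = r / h, sigma = 2.0f / (3.0f * h);  // 1D normalization constant
    if (q < 1.0f) return sigma * (1.0f - 1.5f * q * q + 0.75f * q * q * q);
    if (q < 2.0f) { float t = 2.0f - q; return sigma * 0.25f * t * t * t; }
    return 0.0f;
}

__global__ void density(const float* x, const float* mass, float* rho,
                        int n, float h) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float sum = 0.0f;
    for (int j = 0; j < n; ++j)               // brute-force neighbor sweep
        sum += mass[j] * cubicSpline(fabsf(x[i] - x[j]), h);
    rho[i] = sum;                             // rho_i = sum_j m_j W(|x_i - x_j|, h)
}
```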

Relevance:

30.00%

Publisher:

Abstract:

This work proposes a middleware, called DistributedCL, that makes parallel processing on distributed GPUs transparent. With the support of the DistributedCL middleware, an application written against the OpenCL API can execute in a distributed fashion, using remote GPUs, transparently and without any change to, or recompilation of, its code. The proposed architecture for the DistributedCL middleware is modular, with well-defined layers. A prototype was built according to this architecture, employing several optimizations, including batching of data transfers, asynchronous network communication, and asynchronous OpenCL API calls. The DistributedCL prototype was evaluated using available benchmarks, and the CLBench benchmark was also developed to evaluate performance as a function of data size. The prototype showed good performance, superior to similar proposals, with some results close to ideal; the size of the data to be transmitted over the network was the main limiting factor.
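A hedged sketch of one of the optimizations named above: batching small serialized API-call payloads into a single network message before flushing. This is generic host-side C++, not DistributedCL's actual protocol; the class, threshold policy, and the elided socket send are all our assumptions.

```cuda
#include <cstdint>
#include <vector>

class CommandBatcher {
    std::vector<uint8_t> buffer_;
    size_t threshold_;
public:
    explicit CommandBatcher(size_t threshold) : threshold_(threshold) {}

    // Append one serialized API call; flush when the batch is large enough.
    void enqueue(const void* payload, size_t size) {
        const uint8_t* p = static_cast<const uint8_t*>(payload);
        buffer_.insert(buffer_.end(), p, p + size);
        if (buffer_.size() >= threshold_) flush();
    }

    void flush() {
        if (buffer_.empty()) return;
        sendOverNetwork(buffer_.data(), buffer_.size());  // one message, many calls
        buffer_.clear();
    }
private:
    void sendOverNetwork(const uint8_t*, size_t) { /* socket send elided */ }
};
```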

Relevance:

30.00%

Publisher:

Abstract:

The graphics pipeline of modern GPUs is generally based on scan-line conversion. Triangles in the model are rasterized and projected onto the screen, generating a basic processing unit, called a fragment, at each pixel position in the projected region. During rasterization, when objects in the scene occlude one another, multiple fragments are generated at the corresponding pixel position. Typical graphics applications only need to process the surfaces directly visible from the viewpoint, i.e., each pixel only needs to keep the fragment nearest to (or farthest from) the viewpoint. Some rendering effects, however, need to process multiple fragments at the same pixel position simultaneously; these special effects are usually called multi-fragment effects and include order-independent transparency, translucency, volume rendering, and refraction. Existing GPUs are optimized at the hardware level only for rendering opaque objects, so after rasterization each pixel keeps only the nearest or farthest fragment and the rest are discarded when the final image is generated. Rendering multi-fragment effects therefore currently requires rasterizing the entire scene repeatedly in order to collect all fragments for each pixel. For small scenes this performs well, but for large, complex scenes the repeated vertex transformations of the model become the rendering bottleneck, degrading efficiency and making such effects hard to use widely in interactive applications.

To address this problem, this thesis proposes an efficient depth-peeling algorithm based on bucket sort. It uses the GPU's multiple render targets as a bucket array and applies the bucket-sort principle, together with max/min blending modes, to collect the fragments projected onto each pixel and sort them by depth; deferred shading is then applied to the scene in a post-processing pass to render multi-fragment effects. When fragments collide within a bucket, multi-pass rendering or adaptive subdivision can be used to lower the collision probability and further improve accuracy. For rendering transparency, an improved algorithm based on dynamic intra-bucket blending is proposed: concurrent reads and writes are used to blend, one by one, all fragments falling into the same bucket, and the color values of the buckets are then blended front-to-back in post-processing. Since the probability of an intra-bucket fragment collision and a read-write conflict occurring simultaneously is very small, the accuracy of the rendered result improves further. Experimental results show that the bucket-sort-based depth-peeling algorithm can efficiently render multi-fragment effects in large scenes while producing results very close to ground truth.

To address the limitations of the traditional graphics pipeline, this thesis further designs and implements a CUDA renderer: the first fully programmable graphics pipeline system that runs on current graphics hardware. On top of this framework, two new efficient single-pass strategies for rendering transparency are designed. The first, a multi-level depth-test strategy, uses CUDA's atomic operator atomicMin to collect and sort all fragments dynamically in a single pass. The second, a fixed-size array buffer strategy, uses CUDA's atomic operator atomicInc to collect all fragments in rasterization order in a single pass and sorts them in post-processing. Experiments show that both fragment-collection strategies of the CUDA renderer can efficiently render multi-fragment effects in a single scene traversal while producing results very close to ground truth.

Future work includes refining the bucket-sort-based depth-peeling algorithm with a better partitioning of depth intervals, so that buckets correspond one-to-one with fragments and intra-bucket collisions are completely eliminated, and extending the CUDA renderer to handle other classic graphics problems, such as occlusion culling and antialiasing, more efficiently.
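A hedged sketch of the second single-pass strategy described above: per-pixel fragment collection into a fixed-size array using atomicInc, with depth sorting deferred to a post-process. The data layout, the capacity K, and all names are our assumptions about how such a collector could look, not the thesis implementation.

```cuda
#include <cuda_runtime.h>

constexpr int K = 8;  // fixed per-pixel fragment capacity

struct Fragment { float depth; unsigned int rgba; };

__global__ void collectFragments(const Fragment* frags, const int* fragPixel,
                                 int numFrags, unsigned int* counts,
                                 Fragment* lists) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numFrags) return;
    int pix = fragPixel[i];                        // pixel this fragment covers
    // Reserve a slot in rasterization order; atomicInc returns the old count.
    unsigned int slot = atomicInc(&counts[pix], 0xFFFFFFFFu);
    if (slot < K)                                  // drop overflow fragments
        lists[pix * K + slot] = frags[i];
    // A post-processing kernel (elided) sorts each per-pixel list by depth
    // and blends the colors front-to-back.
}
```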

Relevance:

30.00%

Publisher:

Abstract:

This article describes advances in statistical computation for large-scale data analysis in structured Bayesian mixture models via graphics processing unit (GPU) programming. The developments are partly motivated by computational challenges arising in fitting models of increasing heterogeneity to increasingly large datasets. An example context concerns common biological studies using high-throughput technologies that generate many very large datasets and require increasingly high-dimensional mixture models with large numbers of mixture components. We outline important strategies and processes for GPU computation in Bayesian simulation and optimization approaches, give examples of the benefits of GPU implementations in terms of processing speed and scale-up in the ability to analyze large datasets, and provide a detailed, tutorial-style exposition that will benefit readers interested in developing GPU-based approaches in other statistical models. Novel, GPU-oriented approaches to modifying existing algorithms and software design can lead to vast speed-ups and, critically, enable statistical analyses that presently would not be performed due to compute time limitations in traditional computational environments. Supplemental materials are provided with all source code, example data, and details that will enable readers to implement and explore the GPU approach in this mixture modeling context. © 2010 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America.
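A hedged sketch of the kind of per-observation step GPU mixture fitting parallelizes: computing the responsibilities of k univariate Gaussian components for each data point, one thread per point, normalized with log-sum-exp. The univariate setting and all names are our simplifying assumptions, not the article's code.

```cuda
#include <cuda_runtime.h>
#include <math.h>

__global__ void responsibilities(const float* x, int n,
                                 const float* mu, const float* sigma,
                                 const float* logw, int k, float* r) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float maxlog = -1e30f;
    for (int j = 0; j < k; ++j) {               // log w_j + log N(x | mu_j, sigma_j)
        float z = (x[i] - mu[j]) / sigma[j];
        float lp = logw[j] - logf(sigma[j]) - 0.5f * z * z;  // const term cancels
        r[i * k + j] = lp;
        maxlog = fmaxf(maxlog, lp);
    }
    float sum = 0.0f;                           // normalize via log-sum-exp
    for (int j = 0; j < k; ++j) sum += expf(r[i * k + j] - maxlog);
    for (int j = 0; j < k; ++j)
        r[i * k + j] = expf(r[i * k + j] - maxlog) / sum;
}
```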

Relevance:

30.00%

Publisher:

Abstract:

The use of accelerators, with compute architectures different and distinct from the CPU, has become a new research frontier in high-performance computing over the past five years. This paper is a case study of how the instruction-level parallelism offered by three accelerator technologies, FPGA, GPU, and ClearSpeed, can be exploited in atomic physics. The algorithm studied is the evaluation of two-electron integrals by direct numerical quadrature, a task that arises in the study of intermediate-energy electron scattering by hydrogen atoms. The results of our 'productivity' study show that while each accelerator is viable, there are considerable differences in the implementation strategies that must be followed on each.
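A hedged sketch of the computational pattern behind such integrals: direct numerical quadrature on a two-dimensional (r1, r2) grid, one CUDA thread per grid point, with trapezoid weights. The integrand here is a toy stand-in with a plausible shape, not the paper's actual two-electron integrand.

```cuda
#include <cuda_runtime.h>
#include <math.h>

__device__ float integrand(float r1, float r2) {
    // Toy stand-in: f(r1)^2 f(r2)^2 / max(r1, r2) with f(r) = exp(-r).
    float f1 = expf(-r1), f2 = expf(-r2);
    return f1 * f1 * f2 * f2 / fmaxf(fmaxf(r1, r2), 1e-6f);
}

__global__ void quadrature(float* partial, int n, float hstep) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // r1 index
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // r2 index
    if (i >= n || j >= n) return;
    float w = ((i == 0 || i == n - 1) ? 0.5f : 1.0f) *
              ((j == 0 || j == n - 1) ? 0.5f : 1.0f);  // trapezoid weights
    partial[i * n + j] = w * hstep * hstep * integrand(i * hstep, j * hstep);
    // a parallel reduction over `partial` (elided) yields the integral value
}
```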

Relevance:

30.00%

Publisher:

Abstract:

Wireless sensor node platforms are very diverse and very constrained, particularly in power consumption. When choosing or sizing a platform for a given application, it is necessary to be able to evaluate, at an early design stage, the impact of those choices. Applied to the computing platform implemented on the sensor node, this requires a good understanding of the workload it must perform. Nevertheless, this workload is highly application-dependent: it depends on the data sampling frequency together with application-specific data processing and management. It is thus necessary to have a model that can represent the workload of applications with various needs and characteristics. In this paper, we propose a workload model for wireless sensor node computing platforms. The model is based on a synthetic application that models the different computational tasks the computing platform will perform to process sensor data. It can model the workload of a variety of applications by tuning the data sampling rate and processing. A case study is performed by modeling different applications and showing how the model can be used for workload characterization. © 2011 IEEE.
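A hedged sketch, in the spirit of the abstract, of a synthetic workload model where the per-second processor load is driven by the sampling rate and per-sample processing and data-management costs. The parameters, the linear cost model, and all names are our assumptions, not the paper's model.

```cuda
#include <cstdio>

struct WorkloadModel {
    double sampleRateHz;        // sensor sampling frequency
    double cyclesPerSample;     // application-specific processing cost
    double cyclesPerSampleMgmt; // data management (buffering, storage) cost
};

// Estimated processor load (cycles per second) for the modeled application.
double cyclesPerSecond(const WorkloadModel& m) {
    return m.sampleRateHz * (m.cyclesPerSample + m.cyclesPerSampleMgmt);
}

int main() {
    WorkloadModel lowRate  {10.0,   2000.0,  500.0};  // e.g. temperature logging
    WorkloadModel highRate {1000.0, 8000.0, 1000.0};  // e.g. vibration analysis
    printf("low-rate app:  %.0f cycles/s\n", cyclesPerSecond(lowRate));
    printf("high-rate app: %.0f cycles/s\n", cyclesPerSecond(highRate));
    return 0;
}
```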