957 results for High-performance computing


Relevance:

100.00% 100.00%

Publisher:

Abstract:

Most stencil computations allow tile-wise concurrent start, i.e., there always exists a face of the iteration space and a set of tiling hyperplanes such that all tiles along that face can be started concurrently. This provides load balance and maximizes parallelism. However, existing automatic tiling frameworks often choose hyperplanes that lead to pipelined start-up and load imbalance. We address this issue with a new tiling technique that ensures concurrent start-up as well as perfect load-balance whenever possible. We first provide necessary and sufficient conditions on tiling hyperplanes to enable concurrent start for programs with affine data accesses. We then provide an approach to find such hyperplanes. Experimental evaluation on a 12-core Intel Westmere shows that our code is able to outperform a tuned domain-specific stencil code generator by 4% to 27%, and previous compiler techniques by a factor of 2x to 10.14x.
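As a rough illustration of what a constraint on tiling hyperplanes looks like, the following Python sketch checks the standard tiling legality condition (every hyperplane must have a non-negative dot product with every dependence vector) for a 1-D, 3-point stencil. The dependence vectors and the three candidate hyperplane sets are assumptions chosen for illustration. This is the usual validity test, not the concurrent-start condition derived in the paper.

```python
# Dependence vectors (time, space) of a 1-D 3-point stencil:
# A[t+1][i] reads A[t][i-1], A[t][i], A[t][i+1].
DEPS = [(1, -1), (1, 0), (1, 1)]


def dot(h, d):
    """Dot product of a hyperplane normal and a dependence vector."""
    return sum(a * b for a, b in zip(h, d))


def is_valid_tiling(hyperplanes, deps=DEPS):
    """Legal tiling: h . d >= 0 for every hyperplane h and dependence d."""
    return all(dot(h, d) >= 0 for h in hyperplanes for d in deps)


# Illustrative (assumed) candidate hyperplane sets.
rectangular = [(1, 0), (0, 1)]   # plain time/space tiling
skewed = [(1, 0), (1, 1)]        # classic time-skewed tiling
diamond = [(1, 1), (1, -1)]      # diamond-shaped tiles

for name, hs in [("rectangular", rectangular),
                 ("skewed", skewed),
                 ("diamond", diamond)]:
    print(name, is_valid_tiling(hs))
```

Legality alone does not separate the skewed and diamond choices. The abstract's contribution concerns which legal hyperplanes additionally permit concurrent start.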

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Past studies use deterministic models to evaluate optimal cache configurations or to explore the cache design space. However, with the increasing number of components present on a chip multiprocessor (CMP), deterministic approaches do not scale well. Hence, we apply probabilistic genetic algorithms (GA) to determine a near-optimal cache configuration for a sixteen-tile CMP. We propose and implement a faster trace-based approach to estimate the fitness of a chromosome. It shows up to 218x simulation speedup over cycle-accurate architectural simulation. Our methodology can be applied to other cache optimization problems, such as design space exploration of caches and cache partitioning among applications or virtual machines.
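A minimal sketch of a genetic algorithm over per-tile cache sizes is given below. The chromosome encoding, the size choices, the budget, and above all the `fitness` function are hypothetical stand-ins. The abstract's trace-based fitness estimator is not described in enough detail to reproduce here.

```python
import random

SIZES_KB = [128, 256, 512, 1024]   # hypothetical per-tile cache sizes
NUM_TILES = 16                     # sixteen tiles, as in the abstract
BUDGET_KB = 16 * 512               # hypothetical total capacity budget


def fitness(chrom):
    """Toy stand-in for a trace-based estimate: reward capacity with
    diminishing returns and penalize exceeding the budget."""
    benefit = sum(size ** 0.5 for size in chrom)
    over = max(0, sum(chrom) - BUDGET_KB)
    return benefit - 0.1 * over


def evolve(generations=50, pop_size=30, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    pop = [[rng.choice(SIZES_KB) for _ in range(NUM_TILES)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # truncation selection, elites kept
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, NUM_TILES)   # one-point crossover
            child = a[:cut] + b[cut:]
            child = [rng.choice(SIZES_KB) if rng.random() < mutation_rate else g
                     for g in child]            # per-gene mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)


best = evolve()
print(best, fitness(best))
```

Because the better half of each generation is carried over unchanged, the best fitness never decreases from one generation to the next in this sketch.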

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Accurate and timely prediction of weather phenomena, such as hurricanes and flash floods, requires high-fidelity, compute-intensive simulations of multiple finer regions of interest within a coarse simulation domain. Current weather applications execute these nested simulations sequentially using all the available processors, which is sub-optimal due to their sub-linear scalability. In this work, we present a strategy for parallel execution of multiple nested domain simulations based on partitioning the 2-D processor grid into disjoint rectangular regions associated with each domain. We propose a novel combination of performance prediction, processor allocation methods, and topology-aware mapping of the regions on torus interconnects. Experiments on IBM Blue Gene systems using WRF show that the proposed strategies result in a performance improvement of up to 33% with topology-oblivious mapping and up to an additional 7% with topology-aware mapping over the default sequential strategy.
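The idea of splitting a 2-D processor grid into disjoint rectangles can be sketched as follows. This simplified version only cuts the grid into vertical strips whose widths are roughly proportional to each domain's predicted execution time. The function name, the strip-only layout, and the example times are assumptions for illustration, not the allocation method of the paper.

```python
def partition_grid(px, py, predicted_times):
    """Split a px-by-py processor grid into disjoint vertical strips.

    Strip widths are roughly proportional to the predicted times.
    Returns a list of (x0, y0, width, height) rectangles.
    """
    n = len(predicted_times)
    if n > px:
        raise ValueError("more domains than processor columns")
    total = sum(predicted_times)
    # Give every domain one column, then share the rest proportionally.
    spare = px - n
    shares = [t / total * spare for t in predicted_times]
    widths = [1 + int(s) for s in shares]
    # Hand out columns lost to rounding, largest fractional part first.
    leftover = px - sum(widths)
    by_remainder = sorted(range(n), key=lambda i: shares[i] - int(shares[i]),
                          reverse=True)
    for i in by_remainder[:leftover]:
        widths[i] += 1
    rects, x0 = [], 0
    for w in widths:
        rects.append((x0, 0, w, py))
        x0 += w
    return rects


# Hypothetical predicted times for three nested domains on a 32x32 grid.
rects = partition_grid(32, 32, [3.0, 1.0, 4.0])
print(rects)
```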

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Multi-GPU machines are being increasingly used in high-performance computing. Each GPU in such a machine has its own memory and does not share the address space either with the host CPU or other GPUs. Hence, applications utilizing multiple GPUs have to manually allocate and manage data on each GPU. Existing works that propose to automate data allocations for GPUs have limitations and inefficiencies in terms of allocation sizes, exploiting reuse, transfer costs, and scalability. We propose a scalable and fully automatic data allocation and buffer management scheme for affine loop nests on multi-GPU machines. We call it the Bounding-Box-based Memory Manager (BBMM). At runtime, BBMM can perform standard set operations such as union, intersection, and difference, and find subset and superset relations, on hyperrectangular regions of array data (bounding boxes). It uses these operations along with some compiler assistance to identify, allocate, and manage data required by applications in terms of disjoint bounding boxes. This allows it to (1) allocate exactly or nearly as much data as is required by computations running on each GPU, (2) efficiently track buffer allocations and hence maximize data reuse across tiles and minimize data transfer overhead, and (3) as a result, maximize utilization of the combined memory on multi-GPU machines. BBMM can work with any choice of parallelizing transformations, computation placement, and scheduling schemes, whether static or dynamic. Experiments run on a four-GPU machine with various scientific programs showed that BBMM reduces data allocations on each GPU by up to 75% compared to current allocation schemes, yields performance of at least 88% of manually written code, and allows excellent weak scaling.
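The set operations on bounding boxes mentioned above can be sketched in a few lines. The representation below (a tuple of inclusive per-dimension bounds) and all function names are assumptions made for illustration. BBMM's actual data structures are not described in the abstract.

```python
from typing import List, Optional, Tuple

# A box is a tuple of inclusive (lo, hi) bounds, one pair per array dimension.
Box = Tuple[Tuple[int, int], ...]


def intersect(a: Box, b: Box) -> Optional[Box]:
    """Intersection of two boxes, or None if they do not overlap."""
    out = tuple((max(al, bl), min(ah, bh))
                for (al, ah), (bl, bh) in zip(a, b))
    return out if all(lo <= hi for lo, hi in out) else None


def is_subset(a: Box, b: Box) -> bool:
    """True if box a lies entirely inside box b."""
    return all(bl <= al and ah <= bh for (al, ah), (bl, bh) in zip(a, b))


def difference(a: Box, b: Box) -> List[Box]:
    """Return a minus b as a list of pairwise disjoint boxes."""
    inter = intersect(a, b)
    if inter is None:
        return [a]
    pieces = []
    rest = list(a)
    for d, ((lo, hi), (ilo, ihi)) in enumerate(zip(a, inter)):
        if lo < ilo:   # slab below the intersection in dimension d
            pieces.append(tuple(rest[:d] + [(lo, ilo - 1)] + rest[d + 1:]))
        if ihi < hi:   # slab above the intersection in dimension d
            pieces.append(tuple(rest[:d] + [(ihi + 1, hi)] + rest[d + 1:]))
        rest[d] = (ilo, ihi)   # clamp this dimension before slicing the next
    return pieces


tile_a: Box = ((0, 9), (0, 9))
tile_b: Box = ((5, 14), (5, 14))
print(intersect(tile_a, tile_b))
print(difference(tile_a, tile_b))
```

In this sketch, the data already resident for `tile_b` would cover the intersection, so only the boxes returned by `difference` would need a fresh allocation or transfer.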

Relevance:

100.00% 100.00%

Publisher:

Abstract:

We have developed a real-time imaging method for two-color wide-field fluorescence microscopy using a combined approach that integrates multi-spectral imaging and a Bayesian image reconstruction technique. To enable simultaneous observation of two dyes (primary and secondary), we exploit their spectral properties, which allow parallel recording in both channels. The key advantage of this technique is the use of a single wavelength of light to excite both the primary dye and the secondary dye. The primary and secondary dyes respectively give rise to fluorescence and bleed-through signals, which after normalization were merged to obtain two-color 3D images. To realize real-time imaging, we employed maximum likelihood (ML) and maximum a posteriori (MAP) techniques on a high-performance computing platform (GPU). The results show a two-fold improvement in contrast, while the signal-to-background ratio (SBR) is improved by a factor of 4. We report speedup factors of 52 and 350 for 2D and 3D images, respectively. Using this system, we have studied real-time protein aggregation in yeast cells and HeLa cells that exhibit dot-like protein distribution. The proposed technique has the ability to temporally resolve rapidly occurring biological events.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Block-structured adaptive mesh refinement (AMR) has been used to obtain numerical solutions for many scientific applications. Some block-structured AMR approaches have focused on forming patches of non-uniform sizes, where the size of a patch can be tuned to the geometry of a region of interest. In this paper, we develop strategies for adaptive execution of block-structured AMR applications on GPUs, for hyperbolic directionally split solvers. While effective hybrid execution strategies exist for applications with uniform patches, our work considers efficient execution of non-uniform patches with different workloads. Our techniques include bin-packing work units to load balance GPU computations, adaptive asynchronism between CPU and GPU executions using a knapsack formulation, and scheduling communications for multi-GPU executions. Our experiments with synthetic and real data, for single-GPU and multi-GPU executions, on Tesla S1070 and Fermi C2070 clusters, show that our strategies result in a speedup of up to 3.23x over existing strategies.
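Bin-packing of non-uniform work units can be illustrated with the classic first-fit-decreasing heuristic. The abstract does not say which bin-packing algorithm the authors use, so the heuristic, the workload values, and the capacity below are assumptions for illustration only.

```python
def first_fit_decreasing(workloads, capacity):
    """Pack workloads into bins of the given capacity.

    Items are placed largest first into the first bin that still has room.
    Returns a list of bins, each a list of workloads.
    """
    bins = []
    for w in sorted(workloads, reverse=True):
        if w > capacity:
            raise ValueError("work unit larger than bin capacity")
        for b in bins:
            if sum(b) + w <= capacity:
                b.append(w)
                break
        else:
            bins.append([w])   # no existing bin had room: open a new one
    return bins


# Hypothetical patch workloads (arbitrary units) and per-launch capacity.
patches = [7, 5, 4, 3, 2, 2, 1]
bins = first_fit_decreasing(patches, capacity=8)
print(bins)
```

With these numbers every bin ends up exactly full, so each batch handed to a GPU would carry the same amount of work.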

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Coarse-Grained Reconfigurable Architectures (CGRAs) are emerging as embedded application processing units in computing platforms for Exascale computing. Such CGRAs are distributed-memory multi-core compute elements on a chip that communicate over a Network-on-Chip (NoC). Numerical Linear Algebra (NLA) kernels are key to several high-performance computing applications. In this paper, we propose a systematic methodology to obtain the specification of Compute Elements (CEs) for such CGRAs. We analyze block Matrix Multiplication and block LU Decomposition algorithms in the context of a CGRA, and obtain theoretical bounds on communication requirements and memory sizes for a CE. Support for high-performance custom computations common to NLA kernels is provided through custom function units (CFUs) in the CEs. We present results to justify the merits of such CFUs.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

The polyhedral model provides an expressive intermediate representation that is convenient for the analysis and subsequent transformation of affine loop nests. Several heuristics exist for achieving complex program transformations in this model. However, there is also considerable scope to utilize this model to tackle the problem of automatic memory footprint optimization. In this paper, we present a new automatic storage optimization technique which can be used to achieve both intra-array and inter-array storage reuse with a pre-determined schedule for the computation. Our approach works by finding statement-wise storage partitioning hyperplanes that partition a unified global array space so that values with overlapping live ranges are not mapped to the same partition. Our heuristic is driven by a fourfold objective function which not only minimizes the dimensionality and storage requirements of arrays required for each high-level statement, but also maximizes inter-statement storage reuse. The storage mappings obtained using our heuristic can be asymptotically better than those obtained by any existing technique. We implement our technique and demonstrate its practical impact by evaluating its effectiveness on several benchmarks chosen from the domains of image processing, stencil computations, and high-performance computing.
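A textbook instance of intra-array storage reuse is contracting the time dimension of an iterated stencil with a modulo mapping: since step t reads only step t-1, `a[t][i]` can be stored at `a[t % 2][i]`. The sketch below compares the full and contracted versions. It illustrates the general idea of reusing storage for values whose live ranges do not overlap, and is not the partitioning heuristic proposed in the paper. The function names and input values are assumptions.

```python
def stencil_full(init, steps):
    """3-point averaging stencil storing every time step (steps+1 rows)."""
    n = len(init)
    a = [list(init)] + [[0.0] * n for _ in range(steps)]
    for t in range(1, steps + 1):
        a[t][0], a[t][n - 1] = a[t - 1][0], a[t - 1][n - 1]   # fixed boundary
        for i in range(1, n - 1):
            a[t][i] = (a[t - 1][i - 1] + a[t - 1][i] + a[t - 1][i + 1]) / 3.0
    return a[steps]


def stencil_mod2(init, steps):
    """Same computation with the time dimension mapped to t % 2 (2 rows)."""
    n = len(init)
    a = [list(init), [0.0] * n]
    for t in range(1, steps + 1):
        cur, prev = a[t % 2], a[(t - 1) % 2]
        cur[0], cur[n - 1] = prev[0], prev[n - 1]             # fixed boundary
        for i in range(1, n - 1):
            cur[i] = (prev[i - 1] + prev[i] + prev[i + 1]) / 3.0
    return a[steps % 2]


data = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]
print(stencil_full(data, 5) == stencil_mod2(data, 5))
```

The contracted version needs two rows regardless of the number of time steps, whereas the full version's storage grows linearly with `steps`.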

Relevance:

100.00% 100.00%

Publisher:

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Radiation physics is a branch of physics that is present in several fields of study and is related to the concept of spectrometry. Among the numerous existing spectrometric techniques, X-ray fluorescence spectrometry stands out. It in turn has a range of variations, from which a particular subset of techniques can be emphasized. The production of X-ray fluorescence allows (in certain cases) the analysis of the physicochemical properties of a specific sample, making it possible to determine its chemical composition and opening up a range of applications. However, experimental study can demand a heavy workload, both in terms of the physical apparatus and in terms of technical knowledge. Simulation thus comes into play as a viable path between theory and experimentation. Through the Monte Carlo method, which relies on the manipulation of random numbers, simulation serves as a kind of alternative to experimental work. It plays this role through a modeling process, within a safe and risk-free environment. It can also draw on high-performance computing to optimize the whole work by means of a distributed architecture. The central objective of this work is the development of a computational simulator for the analysis and study of X-ray fluorescence systems, built natively on a distributed computing platform with the aim of generating optimized data. As results of this work, we show the feasibility of building the simulator with CHARM++, a C++-based language that incorporates routines for distributed processing; the value of the methodology for modeling systems; and its application in building a simulator for X-ray fluorescence spectrometry. The simulator was built with the capability to reproduce a source of electromagnetic radiation, complex samples, and a set of detectors.
The modeling of the detectors incorporates the capability to generate images based on the recorded counts. To validate the simulator, its spectrometric results were compared with those generated by another, already validated simulator: MCNP.
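To give a flavor of how the Monte Carlo method uses random numbers in radiation transport, the sketch below estimates the fraction of photons that cross a uniform slab without interacting. Each photon's free path is sampled from an exponential distribution, and the estimate is compared with the analytic value exp(-mu * thickness). The function name, the attenuation coefficient, and the slab thickness are assumed values for illustration. This is a serial Python toy, not the CHARM++ simulator described in the abstract.

```python
import math
import random


def transmitted_fraction(mu, thickness, n_photons=100_000, seed=1):
    """Estimate the fraction of photons crossing a slab without interacting.

    mu        -- linear attenuation coefficient (1/cm), assumed uniform
    thickness -- slab thickness (cm)
    """
    rng = random.Random(seed)
    passed = 0
    for _ in range(n_photons):
        # Sample a free path from an exponential distribution with mean 1/mu.
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        path = -math.log(1.0 - rng.random()) / mu
        if path > thickness:
            passed += 1
    return passed / n_photons


estimate = transmitted_fraction(mu=0.5, thickness=2.0)
exact = math.exp(-0.5 * 2.0)
print(estimate, exact)
```

Because photon histories are independent, the loop can be split across processes or nodes, which is the kind of workload a distributed architecture is suited to.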