459 results for T-parallelism
Abstract:
Task-parallel languages are increasingly popular. Many of them provide expressive mechanisms for intertask synchronization. For example, OpenMP 4.0 will integrate data-driven execution semantics derived from the StarSs research language. Compared to the more restrictive data-parallel and fork-join concurrency models, the advanced features being introduced into task-parallel models in turn enable improved scalability through load balancing, memory latency hiding, mitigation of the pressure on memory bandwidth, and, as a side effect, reduced power consumption. In this article, we develop a systematic approach to compile loop nests into concurrent, dynamically constructed graphs of dependent tasks. We propose a simple and effective heuristic that selects the most profitable parallelization idiom for every dependence type and communication pattern. This heuristic enables the extraction of interband parallelism (cross-barrier parallelism) in a number of numerical computations that range from linear algebra to structured grids and image processing. The proposed static analysis and code generation alleviate the burden of a full-blown dependence resolver to track the readiness of tasks at runtime. We evaluate our approach and algorithms in the PPCG compiler, targeting OpenStream, a representative dataflow task-parallel language with explicit intertask dependences and a lightweight runtime. Experimental results demonstrate the effectiveness of the approach.
Abstract:
QR decomposition (QRD) is a widely used Numerical Linear Algebra (NLA) kernel with applications ranging from SONAR beamforming to wireless MIMO receivers. In this paper, we propose a novel Givens Rotation (GR) based QRD (GR QRD) in which we reduce the computational complexity of GR and exploit a higher degree of parallelism. This low-complexity Column-wise GR (CGR) can annihilate multiple elements of a column of a matrix simultaneously. The algorithm is first realized on a Two-Dimensional (2D) systolic array and then implemented on REDEFINE, which is a Coarse Grained run-time Reconfigurable Architecture (CGRA). We benchmark the proposed implementation against state-of-the-art implementations to report better throughput, convergence and scalability.
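As a point of reference for the abstract above, the following is a minimal sketch of the classical, textbook Givens-rotation QR decomposition, in which each rotation annihilates a single sub-diagonal element. It is not the column-wise CGR variant proposed in the paper, whose details are not given here; the function name `givens_qr` and the list-of-rows matrix layout are assumptions made for illustration.

```python
import math


def givens_qr(a):
    """Classical Givens-rotation QR of a matrix given as a list of rows.

    Illustrative textbook version (one element annihilated per rotation),
    NOT the column-wise CGR scheme described in the abstract.
    Returns (q, r) with q orthogonal (m x m) and r upper triangular (m x n).
    """
    m, n = len(a), len(a[0])
    r = [[float(v) for v in row] for row in a]
    q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for j in range(n):
        # Work bottom-up so that zeros created earlier are preserved.
        for i in range(m - 1, j, -1):
            x, y = r[i - 1][j], r[i][j]
            if y == 0.0:
                continue  # already zero, no rotation needed
            h = math.hypot(x, y)
            c, s = x / h, y / h
            # Apply the rotation G to rows i-1 and i of r (r <- G r).
            for k in range(n):
                u, v = r[i - 1][k], r[i][k]
                r[i - 1][k] = c * u + s * v
                r[i][k] = -s * u + c * v
            # Accumulate q <- q G^T so that q r stays equal to a.
            for k in range(m):
                u, v = q[k][i - 1], q[k][i]
                q[k][i - 1] = c * u + s * v
                q[k][i] = -s * u + c * v
            r[i][j] = 0.0  # remove rounding residue in the annihilated entry
    return q, r
```

Rotations that touch disjoint row pairs are independent of one another, which is the kind of parallelism that systolic-array realizations of Givens-based QRD typically exploit.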
Abstract:
Branch divergence is a common performance problem in GPGPU computing, in which the execution of diverging branches is serialized so that only one control flow path executes at a time. The existing hardware mechanism, which reconverges threads using a stack, causes duplicate execution of code for unstructured control flow graphs (CFGs). The stack mechanism also cannot effectively utilize the available parallelism among diverging branches. Further, the amount of nested divergence allowed is limited by the depth of the branch divergence stack. In this paper we propose a simple and elegant transformation to handle all of the above-mentioned problems. The transformation converts an unstructured CFG to a structured CFG without duplicating user code. It incurs only a linear increase in the number of basic blocks and in the number of instructions. Our solution linearizes the CFG using a predicate variable. This mechanism reconverges the divergent threads as early as possible. It also reduces the depth of the reconvergence stack. The available parallelism in nested branches can be effectively extracted by scheduling the basic blocks to reduce the effect of stalls due to memory accesses. It can also increase the execution efficiency of nested loops with different trip counts for different threads. We implemented the proposed transformation at the PTX level using the Ocelot compiler infrastructure. We evaluated the technique using various benchmarks to show that it can be effective in handling the performance problem due to divergence in unstructured CFGs.
Abstract:
Graph algorithms have been shown to possess enough parallelism to keep several computing resources busy, even hundreds of cores on a GPU. Unfortunately, tuning their implementation for efficient execution on a particular hardware configuration of heterogeneous systems consisting of multicore CPUs and GPUs is challenging, time consuming, and error prone. To address these issues, we propose a domain-specific language (DSL), Falcon, for implementing graph algorithms that (i) abstracts the hardware, (ii) provides constructs to write explicitly parallel programs at a higher level, and (iii) can work with general algorithms that may change the graph structure (morph algorithms). We illustrate the usage of our DSL to implement local computation algorithms (that do not change the graph structure) and morph algorithms such as Delaunay mesh refinement, survey propagation, and dynamic SSSP on GPU and multicore CPUs. Using a set of benchmark graphs, we illustrate that the generated code performs close to the state-of-the-art hand-tuned implementations.
Abstract:
This paper presents the design and implementation of PolyMage, a domain-specific language and compiler for image processing pipelines. An image processing pipeline can be viewed as a graph of interconnected stages which process images successively. Each stage typically performs one of point-wise, stencil, reduction or data-dependent operations on image pixels. Individual stages in a pipeline typically exhibit abundant data parallelism that can be exploited with relative ease. However, the stages also require high memory bandwidth preventing effective utilization of parallelism available on modern architectures. For applications that demand high performance, the traditional options are to use optimized libraries like OpenCV or to optimize manually. While using libraries precludes optimization across library routines, manual optimization accounting for both parallelism and locality is very tedious. The focus of our system, PolyMage, is on automatically generating high-performance implementations of image processing pipelines expressed in a high-level declarative language. Our optimization approach primarily relies on the transformation and code generation capabilities of the polyhedral compiler framework. To the best of our knowledge, this is the first model-driven compiler for image processing pipelines that performs complex fusion, tiling, and storage optimization automatically. Experimental results on a modern multicore system show that the performance achieved by our automatic approach is up to 1.81x better than that achieved through manual tuning in Halide, a state-of-the-art language and compiler for image processing pipelines. For a camera raw image processing pipeline, our performance is comparable to that of a hand-tuned implementation.
Abstract:
The idea of a “paradise in politics” is an answer to the cosmogonic-anthropogonic problem that, through their bodies, the life of human beings has been shaped politically from the very beginning: all creation is a creation of bodies and bodies are power. All creation, furthermore, means separation; it emerges only through a multiplicity of things and beings. The conventional solution for the problem, in the realm of human beings, consists in forming societies out of a multiplicity of individuals that remains as such. The solution of a “paradise in politics”, however, envisions a “healing” of creation through a bodily transmutation by which a world of bodies emerges that is freed from the problem of bodies: separation, power. The article discusses the negative cosmology with which all tales of a paradise in politics start. It shows the essential role of phantasy in the constitution of these tales, and elucidates the principal structural elements through which visions of a paradise in politics are built. Special attention is given to the parallelism between these visions and known religious thought, as in the case of the concepts of apokatastasis or perichoresis, for instance. Methodically, the article demonstrates its subject through an extensive presentation and analysis of two case studies: Rousseau’s vision of a “terrestrial paradise” and the attempt at “bodily redemption” put on the stage in 1968-69 by the “Living Theatre” Group with its performance “Paradise Now”.
Abstract:
The genus Percichthys (Serranidae) includes three nominal species in Argentina: trucha, vinciguerrae and altispinis. The authors of this paper examine materials from: 1: the lower course of the Río Negro, opposite Viedma; 2: lake Pellegrini, near Neuquén, where the rivers Neuquén and Limay meet and form the Negro; 3: Plottier, near the place just named; 4: the Colorado river, at Fortín Uno; 5: the Curacó river, a tributary of the Colorado, which has been cut into separate sections for years owing to lack of water; this river would normally connect the Colorado with the rivers up to the San Juan, where the «trucha» lives; 6: Luro or La Salada lagoon, formed by the Colorado river near its mouth; 7: lake Argentino, in southern Patagonia. These fishes are known as «trucha criolla» or «native trout», although the old Spanish name, «perca», was more appropriate. Percichthys altispinis Regan 1905 is a good species; it has been found again in the Colorado river, at Fortín Uno. An illustration of it is given, together with the characters of four specimens and a note on its scales. P. trucha C. V. reveals itself on close examination as a complex species or Linnaean species (linneon), with several combinations of characters, but more material is needed to establish whether there are geographical races (subspecies). A new examination of the Chilean materials is required (former authors considered them jointly with the materials from the Atlantic slope, that is, the Argentine materials). Some of the infraspecific forms are prognathous and low finned; others, the contrary; the head may be normal, or conical and bony; etc. As to P. vinciguerrae, its standing as a valid species is doubtful; perhaps, with P. laevis Jenyns, it is a southern form. In the same reduced habitat (lagoon, or isolated course) diversified forms are present; some show parallelism with those of other places; it is supposed that they show ecological influences according to the year or season of birth or development.
A thorough study of the scales is given, with lepidological characteristics and general conclusions as to the method of measuring and comparing their «reading». There are some marked differences even in the same habitat.
Abstract:
Coarse particle sedimentation is studied using an algorithm with no adjustable parameters based on Stokesian dynamics. Only inter-particle hydrodynamic interactions and gravity are considered. The sedimentation of a simple cubic array of spheres is used to verify the computational results. The scaling of the method and its parallelization with OpenMP are presented. Random suspension sedimentation is investigated with Monte Carlo simulation. The computational results are shown to be in good agreement with experimental fits, at the lower computational cost of O(N ln N).
Abstract:
Technology scaling has enabled drastic growth in the computational and storage capacity of integrated circuits (ICs). This constant growth drives an increasing demand for high-bandwidth communication between and within ICs. In this dissertation we focus on low-power solutions that address this demand. We divide communication links into three subcategories depending on the communication distance. Each category has a different set of challenges and requirements and is affected by CMOS technology scaling in a different manner. We start with short-range chip-to-chip links for board-level communication. Next we will discuss board-to-board links, which demand a longer communication range. Finally on-chip links with communication ranges of a few millimeters are discussed.
Electrical signaling is a natural choice for chip-to-chip communication due to efficient integration and low cost. IO data rates have increased to the point where electrical signaling is now limited by the channel bandwidth. In order to achieve multi-Gb/s data rates, complex designs that equalize the channel are necessary. In addition, a high level of parallelism is central to sustaining bandwidth growth. Decision feedback equalization (DFE) is one of the most commonly employed techniques to overcome the limited bandwidth problem of the electrical channels. A linear and low-power summer is the central block of a DFE. Conventional approaches employ current-mode techniques to implement the summer, which require high power consumption. In order to achieve low-power operation we propose performing the summation in the charge domain. This approach enables a low-power and compact realization of the DFE as well as crosstalk cancellation. A prototype receiver was fabricated in 45nm SOI CMOS to validate the functionality of the proposed technique and was tested over channels with different levels of loss and coupling. Measurement results show that the receiver can equalize channels with maximum 21dB loss while consuming about 7.5mW from a 1.2V supply. We also introduce a compact, low-power transmitter employing passive equalization. The efficacy of the proposed technique is demonstrated through implementation of a prototype in 65nm CMOS. The design achieves up to 20Gb/s data rate while consuming less than 10mW.
An alternative to electrical signaling is to employ optical signaling for chip-to-chip interconnections, which offers low channel loss and cross-talk while providing high communication bandwidth. In this work we demonstrate the possibility of building compact and low-power optical receivers. A novel RC front-end is proposed that combines dynamic offset modulation and double-sampling techniques to eliminate the need for a short time constant at the input of the receiver. Unlike conventional designs, this receiver does not require a high-gain stage that runs at the data rate, making it suitable for low-power implementations. In addition, it allows time-division multiplexing to support very high data rates. A prototype was implemented in 65nm CMOS and achieved up to 24Gb/s with less than 0.4pJ/b power efficiency per channel. As the proposed design mainly employs digital blocks, it benefits greatly from technology scaling in terms of power and area saving.
As the technology scales, the number of transistors on the chip grows. This necessitates a corresponding increase in the bandwidth of the on-chip wires. In this dissertation, we take a close look at wire scaling and investigate its effect on wire performance metrics. We explore a novel on-chip communication link based on a double-sampling architecture and dynamic offset modulation technique that enables low power consumption and high data rates while achieving high bandwidth density in 28nm CMOS technology. The functionality of the link is demonstrated using different length minimum-pitch on-chip wires. Measurement results show that the link achieves up to 20Gb/s of data rate (12.5Gb/s/$\mu$m) with better than 136fJ/b of power efficiency.
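The decision feedback equalization (DFE) mentioned in this abstract can be illustrated with a purely behavioral sketch: the receiver subtracts the inter-symbol interference predicted from its previous decision before slicing the current sample. The one-tap channel model, the noiseless signal, and the names `isi_channel`, `slicer` and `dfe_1tap` are assumptions for illustration only; the dissertation's contribution is a charge-domain circuit implementation of the summer, which this sketch does not model.

```python
def isi_channel(symbols, tap):
    """Hypothetical channel: each sample is the current symbol plus
    `tap` times the previous symbol (one post-cursor ISI tap)."""
    out, prev = [], 0
    for x in symbols:
        out.append(x + tap * prev)
        prev = x
    return out


def slicer(received):
    """Plain threshold detector with no equalization."""
    return [1 if y >= 0 else -1 for y in received]


def dfe_1tap(received, tap):
    """One-tap DFE: subtract the ISI estimated from the previous decision
    (the 'summer' step), then slice the corrected sample."""
    decisions, prev = [], 0
    for y in received:
        corrected = y - tap * prev   # summation of sample and feedback term
        d = 1 if corrected >= 0 else -1
        decisions.append(d)
        prev = d
    return decisions


if __name__ == "__main__":
    sent = [1, -1, -1, 1, 1, -1, 1, -1, -1, -1]
    rx = isi_channel(sent, tap=1.2)       # strong post-cursor interference
    print(slicer(rx) == sent)             # unequalized slicer makes errors
    print(dfe_1tap(rx, tap=1.2) == sent)  # DFE recovers the sequence
```

In this noiseless model the feedback cancels the interference exactly as long as earlier decisions are correct; real links add noise, more taps, and tap adaptation.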
Abstract:
A neural network is a highly interconnected set of simple processors. The many connections allow information to travel rapidly through the network, and due to their simplicity, many processors in one network are feasible. Together these properties imply that we can build efficient massively parallel machines using neural networks. The primary problem is how to specify the interconnections in a neural network. The various approaches developed so far, such as outer product, learning algorithm, or energy function, suffer from the following deficiencies: long training/specification times; not guaranteed to work on all inputs; requires full connectivity.
Alternatively, we discuss methods of using the topology and constraints of the problems themselves to design the topology and connections of the neural solution. We define several useful circuits, generalizations of the Winner-Take-All circuit, that allow us to incorporate constraints using feedback in a controlled manner. These circuits are proven to be stable, and to converge only on valid states. We use the Hopfield electronic model since this is close to an actual implementation. We also discuss methods for incorporating these circuits into larger systems, neural and nonneural. By exploiting regularities in our definition, we can construct efficient networks. To demonstrate the methods, we look to three problems from communications. We first discuss two applications to problems from circuit switching: finding routes in large multistage switches, and the call rearrangement problem. These show both how we can use many neurons to build massively parallel machines, and how the Winner-Take-All circuits can simplify our designs.
Next we develop a solution to the contention arbitration problem of high-speed packet switches. We define a useful class of switching networks and then design a neural network to solve the contention arbitration problem for this class. Various aspects of the neural network/switch system are analyzed to measure the queueing performance of this method. Using the basic design, a feasible architecture for a large (1024-input) ATM packet switch is presented. Using the massive parallelism of neural networks, we can consider algorithms that were previously computationally unattainable. These now viable algorithms lead us to new perspectives on switch design.
Abstract:
Negabinary is a component of the positional number system. A complete set of negabinary arithmetic operations is presented, including the basic addition/subtraction logic, the two-step carry-free addition/subtraction algorithm based on negabinary signed-digit (NSD) representation, parallel multiplication, and the fast conversion from NSD to the normal negabinary in the carry-look-ahead mode. All the arithmetic operations can be performed with binary logic. By programming the binary reference bits, addition and subtraction can be realized in parallel with the same binary logic functions. This offers a technique to perform space-variant arithmetic-logic functions with space-invariant instructions. Multiplication can be performed in the tree structure and it is simpler than the modified signed-digit (MSD) counterpart. The parallelism of the algorithms is very suitable for optical implementation. Correspondingly, a general-purpose optical logic system using an electron trapping device is suggested. Various complex logic functions can be performed by programming the illumination of the data arrays without additional temporal latency of the intermediate results. The system can be compact. These properties make the proposed negabinary arithmetic-logic system a strong candidate for future applications in digital optical computing with the development of smart pixel arrays. (C) 1999 Society of Photo-Optical Instrumentation Engineers. [S0091-3286(99)00803-X].
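For readers unfamiliar with negabinary, the sketch below shows ordinary base -2 conversion, in which every integer, positive or negative, has a representation using only the digits 0 and 1 and no sign bit. It illustrates the number system only; the carry-free signed-digit (NSD) addition and the optical logic described in the abstract are not reproduced. The function names are assumptions chosen for illustration.

```python
def to_negabinary(n):
    """Return the base -2 digit string of the integer n (digits 0/1 only)."""
    if n == 0:
        return "0"
    digits = []
    while n != 0:
        n, rem = divmod(n, -2)
        if rem < 0:
            # Python's remainder takes the sign of the divisor (-2);
            # shift it into {0, 1} and compensate in the quotient.
            n, rem = n + 1, rem + 2
        digits.append(str(rem))
    return "".join(reversed(digits))


def from_negabinary(s):
    """Evaluate a base -2 digit string as an integer."""
    return sum(int(d) * (-2) ** i for i, d in enumerate(reversed(s)))


if __name__ == "__main__":
    for value in (-3, -2, -1, 0, 1, 2, 3):
        text = to_negabinary(value)
        print(value, text, from_negabinary(text))
```

Because the place values alternate in sign (1, -2, 4, -8, ...), negative numbers need no separate sign handling, which is one reason the representation is attractive for uniform parallel logic.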
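Abstract:
In the design of tiled gratings and tiled-grating compressors, alignment errors of the sub-gratings are unavoidable, and the relationship between the error in each dimension and the temporal characteristics of the tiled grating is critical. Using pulse compression theory, analytical relations are derived between the error in each dimension and the temporal broadening of the focused pulse. The numerical results show that the left-right (yaw) error in surface parallelism has a large influence on the pulse duration and must be kept within 21.08 μrad; the difference in groove density has a very significant effect on the pulse width, and the relative groove-density ratio should be kept within $10^{-5}$. From the standpoint of eliminating angular dispersion, the pitch error in surface parallelism and the groove-parallelism error can compensate each other, and the groove-density difference and the left-right error in surface parallelism can also compensate each other.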
Abstract:
The work was carried out on the northern coast of São Paulo state, where there are good exposures of intrusive rocks of the southern portion of the Serra do Mar Dike Swarm, of Early Cretaceous age. The main objective of the dissertation is to characterize the tectonic regimes associated with the emplacement and deformation of mafic dikes in the São Sebastião (SP) area, and their spatial distribution, based on the interpretation of remote sensing images, the analysis of structural field data, and the petrographic description of the igneous rocks. The area is highly complex in terms of magmatism, since it contains dikes of tholeiitic and alkaline diabase, lamprophyre, and felsic alkaline rocks such as phonolites, trachytes and syenites, the latter in the form of dikes, sills and plugs. The tholeiitic diabases have ages of around 134 Ma, correlated with the onset of South Atlantic rifting, whereas the alkaline rocks date from 86 Ma and are related to later intraplate magmatism. The structural lineaments are oriented mostly ENE-WSW, parallel to the metamorphic foliations and shear zones observed in the field and described in the literature for the Coastal Domain of the Ribeira Belt. The dikes are oriented NE-SW, with a similar azimuth but with dip angles discordant with the foliation over much of the area, where the foliations are low-angle. A second set of lineaments, oriented NW-SE, occurs as an important set of fractures that cut both the Proterozoic basement rocks and the Late Cretaceous alkaline rocks. Dikes with this orientation are scarce. A third set, oriented NNE-SSW, occurs in the western portion of the area, associated with diabase dikes that sometimes show indicators of sinistral movement. Kinematic analysis of the dikes shows a predominance of pure extension during their emplacement, with a minimum compressive stress oriented NW-SE, orthogonal to the main trend of the dikes.
Strike-slip components, sometimes ambiguous, are commonly observed, with a slight predominance of the sinistral component. The same kinematic pattern is observed for the tholeiitic and the alkaline dikes, suggesting that the local stress field varied little during the Cretaceous. Although the basement was not directly reactivated during dike emplacement, its anisotropy may have controlled, to some extent, the orientation of the local stress field during the Cretaceous. Geophysical maps of the Santos Basin available in the literature suggest a certain parallelism between the structures observed in the study area and those interpreted in the basin. The NNE-SSW structures are parallel to the trend of the sub-basins and to the Merluza graben, whereas the NW-SE structures are parallel to transfer zones described in the literature.
Abstract:
Techniques based on the Tikhonov functional have been widely used in image processing in recent years. The basic idea is to modify an initial image via a convolution equation and to find a parameter that minimizes this functional in order to obtain an approximation of the original image. A typical problem with this method, however, is selecting a regularization parameter that suits the trade-off between the accuracy and the stability of the solution. A method developed by researchers at IPRJ and UFRJ working in the area of inverse problems consists of minimizing a residual functional through the Tikhonov regularization parameter. A strategy that searches for this parameter iteratively, aiming to obtain a minimum value of the functional at the next iteration, was recently adopted in a serial restoration algorithm. However, computational cost is a problem when the iterative search method is used. With this in mind, this work presents an implementation in C++ that employs parallel computing techniques using MPI (Message Passing Interface) for the functional minimization strategy with the iterative search method, thereby reducing the execution time required by the algorithm. A modified version of the Jacobi method is considered in two versions of the algorithm, one serial and one parallel. This algorithm is suitable for parallel implementation because it does not have data dependences like those of the Gauss-Seidel method, which is also shown to converge. As a performance indicator for evaluating the restoration algorithm, in addition to the traditional measures, a new metric based on subjective criteria, called IWMSE (Information Weighted Mean Square Error), is employed. These metrics were introduced into the serial image processing program and allow the restoration to be analyzed at each iteration step.
The results obtained with the two versions made it possible to verify the speedup and the efficiency of the parallel implementation. The parallel method produced satisfactory results in a shorter processing time and with acceptable performance.
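The remark that Jacobi lacks the data dependences of Gauss-Seidel can be made concrete with a small sketch of the standard Jacobi iteration for a linear system Ax = b. This is the textbook method, not the modified Jacobi variant or the C++/MPI code of the dissertation; the function name `jacobi`, the stopping tolerance and the example matrix are assumptions for illustration.

```python
def jacobi(a, b, iterations=200, tol=1e-12):
    """Textbook Jacobi iteration for a x = b (a given as a list of rows).

    Every component of the new iterate depends only on the PREVIOUS
    iterate, so the n updates in one sweep are mutually independent and
    could be distributed across processes. Gauss-Seidel, by contrast,
    reuses components updated earlier in the same sweep.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        x_new = [
            (b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
            for i in range(n)
        ]
        # Stop when the largest component change falls below the tolerance.
        if max(abs(x_new[i] - x[i]) for i in range(n)) < tol:
            return x_new
        x = x_new
    return x


if __name__ == "__main__":
    # Diagonally dominant example, for which Jacobi is known to converge.
    a = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
    b = [6.0, 12.0, 14.0]
    print(jacobi(a, b))  # approaches the solution [1, 2, 3]
```

In a message-passing setting, each process would typically own a block of rows and exchange the previous iterate once per sweep; that communication pattern is not shown here.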
Abstract:
Most Brazilian Paleozoic basins contain thermally immature organic matter in the intervals corresponding to the Devonian. The most suitable model for understanding the generation, migration and accumulation of hydrocarbons (HC) would be related to the phases of diabase intrusion. In the case of the Amazonas Basin, although burial conditions were sufficient for hydrocarbon generation, the unconventional generation model should not be ruled out as one of the possible origins of commercial oil and gas accumulations. It is believed that the most appropriate interval for hydrocarbon generation includes only the rocks deposited in the Frasnian interval, although the rocks associated with the Llandoverian interval should also be examined carefully. In order to better understand the role of magmatic activity in the evolution of the Amazonas Basin, seismic mapping of diabase sills was carried out, together with an analysis of geochemical data from Rock-Eval pyrolysis and total organic carbon (TOC). It was thus possible to evaluate hydrocarbon generation/migration and the variation of geothermal parameters in the Amazonas Basin caused by the intrusion of the diabase sills. The seismic-stratigraphic analysis was based on the interpretation of 20 post-stack 2D seismic lines, in which seismic horizons (formation tops and intrusive igneous bodies) were recognized and mapped, using well data and data from the literature for correlation. The sill intrusions are present in the successions of shales/siltstones and anhydrites of the Andirá and Nova Olinda formations, respectively. It was observed that the diabase sills may be closely related to systematically oriented dikes, with these dikes acting as feeders of the sills. Extensive planar sills with transgressive segments occur at the shallowest stratigraphic levels of the Amazonas Basin, and in greater volumes in the Andirá and Nova Olinda formations.
In some regions the sills develop striking saucer-shaped morphologies. These bodies have thicknesses that can reach 500 m. Commonly, the sheet geometry indicated by the parallelism of the reflectors is present throughout the mapped extent of the basin. Dome structures were also observed. The thermal effect imposed by the intrusions of igneous bodies, both dikes and sills, was of great importance, since without it there would not have been enough heat to transform the organic matter. Through the analysis of Rock-Eval pyrolysis and total organic carbon content, it was possible to evaluate parameters such as S2 (generation potential), HI (hydrogen index), S1 (free hydrocarbons) and Tmax (thermal evolution) and to correlate them with depth. Data from four wells were used, two of which were compiled from published articles and theses. Potential petroleum source rocks are those with TOC equal to or greater than 1%. Of the four wells analyzed, two have TOC > 1% for the Barreirinhas Formation, showing that the sedimentary rocks are potential HC source rocks. High Tmax values can be explained by the thermal effect caused by diabase intrusions. The hydrogen index (HI) results showed values below 200 mg HC/g TOC, indicating that the generation potential of this basin is for gas.