947 resultados para parallel processing


Relevância:

30.00% 30.00%

Publicador:

Resumo:

In this publication, we report on an online survey that was carried out among parallel programmers. More than 250 people worldwide have submitted answers to our questions, and their responses are analyzed here. Although not statistically sound, the data we provide give useful insights about which parallel programming systems and languages are known and in actual use. For instance, the collected data indicate that for our survey group MPI and (to a lesser extent) C are the most widely used parallel programming system and language, respectively.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The process of developing software that takes advantage of multiple processors is commonly referred to as parallel programming. For various reasons, this process is much harder than the sequential case. For decades, parallel programming has been a problem for a small niche only: engineers working on parallelizing mostly numerical applications in High Performance Computing. This has changed with the advent of multi-core processors in mainstream computer architectures. Parallel programming in our days becomes a problem for a much larger group of developers. The main objective of this thesis was to find ways to make parallel programming easier for them. Different aims were identified in order to reach the objective: research the state of the art of parallel programming today, improve the education of software developers about the topic, and provide programmers with powerful abstractions to make their work easier. To reach these aims, several key steps were taken. To start with, a survey was conducted among parallel programmers to find out about the state of the art. More than 250 people participated, yielding results about the parallel programming systems and languages in use, as well as about common problems with these systems. Furthermore, a study was conducted in university classes on parallel programming. It resulted in a list of frequently made mistakes that were analyzed and used to create a programmers' checklist to avoid them in the future. For programmers' education, an online resource was setup to collect experiences and knowledge in the field of parallel programming - called the Parawiki. Another key step in this direction was the creation of the Thinking Parallel weblog, where more than 50.000 readers to date have read essays on the topic. For the third aim (powerful abstractions), it was decided to concentrate on one parallel programming system: OpenMP. Its ease of use and high level of abstraction were the most important reasons for this decision. Two different research directions were pursued. The first one resulted in a parallel library called AthenaMP. It contains so-called generic components, derived from design patterns for parallel programming. These include functionality to enhance the locks provided by OpenMP, to perform operations on large amounts of data (data-parallel programming), and to enable the implementation of irregular algorithms using task pools. AthenaMP itself serves a triple role: the components are well-documented and can be used directly in programs, it enables developers to study the source code and learn from it, and it is possible for compiler writers to use it as a testing ground for their OpenMP compilers. The second research direction was targeted at changing the OpenMP specification to make the system more powerful. The main contributions here were a proposal to enable thread-cancellation and a proposal to avoid busy waiting. Both were implemented in a research compiler, shown to be useful in example applications, and proposed to the OpenMP Language Committee.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This paper contributes to the study of Freely Rewriting Restarting Automata (FRR-automata) and Parallel Communicating Grammar Systems (PCGS), which both are useful models in computational linguistics. For PCGSs we study two complexity measures called 'generation complexity' and 'distribution complexity', and we prove that a PCGS Pi, for which the generation complexity and the distribution complexity are both bounded by constants, can be transformed into a freely rewriting restarting automaton of a very restricted form. From this characterization it follows that the language L(Pi) generated by Pi is semi-linear, that its characteristic analysis is of polynomial size, and that this analysis can be computed in polynomial time.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

In der vorliegenden Dissertation werden Systeme von parallel arbeitenden und miteinander kommunizierenden Restart-Automaten (engl.: systems of parallel communicating restarting automata; abgekürzt PCRA-Systeme) vorgestellt und untersucht. Dabei werden zwei bekannte Konzepte aus den Bereichen Formale Sprachen und Automatentheorie miteinander vescrknüpft: das Modell der Restart-Automaten und die sogenannten PC-Systeme (systems of parallel communicating components). Ein PCRA-System besteht aus endlich vielen Restart-Automaten, welche einerseits parallel und unabhängig voneinander lokale Berechnungen durchführen und andererseits miteinander kommunizieren dürfen. Die Kommunikation erfolgt dabei durch ein festgelegtes Kommunikationsprotokoll, das mithilfe von speziellen Kommunikationszuständen realisiert wird. Ein wesentliches Merkmal hinsichtlich der Kommunikationsstruktur in Systemen von miteinander kooperierenden Komponenten ist, ob die Kommunikation zentralisiert oder nichtzentralisiert erfolgt. Während in einer nichtzentralisierten Kommunikationsstruktur jede Komponente mit jeder anderen Komponente kommunizieren darf, findet jegliche Kommunikation innerhalb einer zentralisierten Kommunikationsstruktur ausschließlich mit einer ausgewählten Master-Komponente statt. Eines der wichtigsten Resultate dieser Arbeit zeigt, dass zentralisierte Systeme und nichtzentralisierte Systeme die gleiche Berechnungsstärke besitzen (das ist im Allgemeinen bei PC-Systemen nicht so). Darüber hinaus bewirkt auch die Verwendung von Multicast- oder Broadcast-Kommunikationsansätzen neben Punkt-zu-Punkt-Kommunikationen keine Erhöhung der Berechnungsstärke. Desweiteren wird die Ausdrucksstärke von PCRA-Systemen untersucht und mit der von PC-Systemen von endlichen Automaten und mit der von Mehrkopfautomaten verglichen. PC-Systeme von endlichen Automaten besitzen bekanntermaßen die gleiche Ausdrucksstärke wie Einwegmehrkopfautomaten und bilden eine untere Schranke für die Ausdrucksstärke von PCRA-Systemen mit Einwegkomponenten. Tatsächlich sind PCRA-Systeme auch dann stärker als PC-Systeme von endlichen Automaten, wenn die Komponenten für sich genommen die gleiche Ausdrucksstärke besitzen, also die regulären Sprachen charakterisieren. Für PCRA-Systeme mit Zweiwegekomponenten werden als untere Schranke die Sprachklassen der Zweiwegemehrkopfautomaten im deterministischen und im nichtdeterministischen Fall gezeigt, welche wiederum den bekannten Komplexitätsklassen L (deterministisch logarithmischer Platz) und NL (nichtdeterministisch logarithmischer Platz) entsprechen. Als obere Schranke wird die Klasse der kontextsensitiven Sprachen gezeigt. Außerdem werden Erweiterungen von Restart-Automaten betrachtet (nonforgetting-Eigenschaft, shrinking-Eigenschaft), welche bei einzelnen Komponenten eine Erhöhung der Berechnungsstärke bewirken, in Systemen jedoch deren Stärke nicht erhöhen. Die von PCRA-Systemen charakterisierten Sprachklassen sind unter diversen Sprachoperationen abgeschlossen und einige Sprachklassen sind sogar abstrakte Sprachfamilien (sogenannte AFL's). Abschließend werden für PCRA-Systeme spezifische Probleme auf ihre Entscheidbarkeit hin untersucht. Es wird gezeigt, dass Leerheit, Universalität, Inklusion, Gleichheit und Endlichkeit bereits für Systeme mit zwei Restart-Automaten des schwächsten Typs nicht semientscheidbar sind. Für das Wortproblem wird gezeigt, dass es im deterministischen Fall in quadratischer Zeit und im nichtdeterministischen Fall in exponentieller Zeit entscheidbar ist.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This paper proposes a parallel architecture for estimation of the motion of an underwater robot. It is well known that image processing requires a huge amount of computation, mainly at low-level processing where the algorithms are dealing with a great number of data. In a motion estimation algorithm, correspondences between two images have to be solved at the low level. In the underwater imaging, normalised correlation can be a solution in the presence of non-uniform illumination. Due to its regular processing scheme, parallel implementation of the correspondence problem can be an adequate approach to reduce the computation time. Taking into consideration the complexity of the normalised correlation criteria, a new approach using parallel organisation of every processor from the architecture is proposed

Relevância:

30.00% 30.00%

Publicador:

Resumo:

It has been previously demonstrated that extensive activation in the dorsolateral temporal lobes associated with masking a speech target with a speech masker, consistent with the hypothesis that competition for central auditory processes is an important factor in informational masking. Here, masking from speech and two additional maskers derived from the original speech were investigated. One of these is spectrally rotated speech, which is unintelligible and has a similar (inverted) spectrotemporal profile to speech. The authors also controlled for the possibility of “glimpsing” of the target signal during modulated masking sounds by using speech-modulated noise as a masker in a baseline condition. Functional imaging results reveal that masking speech with speech leads to bilateral superior temporal gyrus (STG) activation relative to a speech-in-noise baseline, while masking speech with spectrally rotated speech leads solely to right STG activation relative to the baseline. This result is discussed in terms of hemispheric asymmetries for speech perception, and interpreted as showing that masking effects can arise through two parallel neural systems, in the left and right temporal lobes. This has implications for the competition for resources caused by speech and rotated speech maskers, and may illuminate some of the mechanisms involved in informational masking.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

It has been previously demonstrated that extensive activation in the dorsolateral temporal lobes associated with masking a speech target with a speech masker, consistent with the hypothesis that competition for central auditory processes is an important factor in informational masking. Here, masking from speech and two additional maskers derived from the original speech were investigated. One of these is spectrally rotated speech, which is unintelligible and has a similar (inverted) spectrotemporal profile to speech. The authors also controlled for the possibility of "glimpsing" of the target signal during modulated masking sounds by using speech-modulated noise as a masker in a baseline condition. Functional imaging results reveal that masking speech with speech leads to bilateral superior temporal gyrus (STG) activation relative to a speech-in-noise baseline, while masking speech with spectrally rotated speech leads solely to right STG activation relative to the baseline. This result is discussed in terms of hemispheric asymmetries for speech perception, and interpreted as showing that masking effects can arise through two parallel neural systems, in the left and right temporal lobes. This has implications for the competition for resources caused by speech and rotated speech maskers, and may illuminate some of the mechanisms involved in informational masking.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This paper presents a paralleled Two-Pass Hexagonal (TPA) algorithm constituted by Linear Hashtable Motion Estimation Algorithm (LHMEA) and Hexagonal Search (HEXBS) for motion estimation. In the TPA., Motion Vectors (MV) are generated from the first-pass LHMEA and are used as predictors for second-pass HEXBS motion estimation, which only searches a small number of Macroblocks (MBs). We introduced hashtable into video processing and completed parallel implementation. We propose and evaluate parallel implementations of the LHMEA of TPA on clusters of workstations for real time video compression. It discusses how parallel video coding on load balanced multiprocessor systems can help, especially on motion estimation. The effect of load balancing for improved performance is discussed. The performance or the algorithm is evaluated by using standard video sequences and the results are compared to current algorithms.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This paper presents a paralleled Two-Pass Hexagonal (TPA) algorithm constituted by Linear Hashtable Motion Estimation Algorithm (LHMEA) and Hexagonal Search (HEXBS) for motion estimation. In the TPA, Motion Vectors (MV) are generated from the first-pass LHMEA and are used as predictors for second-pass HEXBS motion estimation, which only searches a small number of Macroblocks (MBs). We introduced hashtable into video processing and completed parallel implementation. We propose and evaluate parallel implementations of the LHMEA of TPA on clusters of workstations for real time video compression. It discusses how parallel video coding on load balanced multiprocessor systems can help, especially on motion estimation. The effect of load balancing for improved performance is discussed. The performance of the algorithm is evaluated by using standard video sequences and the results are compared to current algorithms.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This paper presents an improved parallel Two-Pass Hexagonal (TPA) algorithm constituted by Linear Hashtable Motion Estimation Algorithm (LHMEA) and Hexagonal Search (HEXBS) for motion estimation. Motion Vectors (MV) are generated from the first-pass LHMEA and used as predictors for second-pass HEXBS motion estimation, which only searches a small number of Macroblocks (MBs). We used bashtable into video processing and completed parallel implementation. The hashtable structure of LHMEA is improved compared to the original TPA and LHMEA. We propose and evaluate parallel implementations of the LHMEA of TPA on clusters of workstations for real time video compression. The implementation contains spatial and temporal approaches. The performance of the algorithm is evaluated by using standard video sequences and the results are compared to current algorithms.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The real-time parallel computation of histograms using an array of pipelined cells is proposed and prototyped in this paper with application to consumer imaging products. The array operates in two modes: histogram computation and histogram reading. The proposed parallel computation method does not use any memory blocks. The resulting histogram bins can be stored into an external memory block in a pipelined fashion for subsequent reading or streaming of the results. The array of cells can be tuned to accommodate the required data path width in a VLSI image processing engine as present in many imaging consumer devices. Synthesis of the architectures presented in this paper in FPGA are shown to compute the real-time histogram of images streamed at over 36 megapixels at 30 frames/s by processing in parallel 1, 2 or 4 pixels per clock cycle.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Global communicationrequirements andloadimbalanceof someparalleldataminingalgorithms arethe major obstacles to exploitthe computational power of large-scale systems. This work investigates how non-uniform data distributions can be exploited to remove the global communication requirement and to reduce the communication costin parallel data mining algorithms and, in particular, in the k-means algorithm for cluster analysis. In the straightforward parallel formulation of the k-means algorithm, data and computation loads are uniformly distributed over the processing nodes. This approach has excellent load balancing characteristics that may suggest it could scale up to large and extreme-scale parallel computing systems. However, at each iteration step the algorithm requires a global reduction operationwhichhinders thescalabilityoftheapproach.Thisworkstudiesadifferentparallelformulation of the algorithm where the requirement of global communication is removed, while maintaining the same deterministic nature ofthe centralised algorithm. The proposed approach exploits a non-uniform data distribution which can be either found in real-world distributed applications or can be induced by means ofmulti-dimensional binary searchtrees. The approachcanalso be extended to accommodate an approximation error which allows a further reduction ofthe communication costs. The effectiveness of the exact and approximate methods has been tested in a parallel computing system with 64 processors and in simulations with 1024 processing element

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Background Selective serotonin reuptake inhibitors (SSRIs) are popular medications for anxiety and depression, but their effectiveness, particularly in patients with prominent symptoms of loss of motivation and pleasure, has been questioned. There are few studies of the effect of SSRIs on neural reward mechanisms in humans. Methods We studied 45 healthy participants who were randomly allocated to receive the SSRI citalopram, the noradrenaline reuptake inhibitor reboxetine, or placebo for 7 days in a double-blind, parallel group design. We used functional magnetic resonance imaging to measure the neural response to rewarding (sight and/or flavor of chocolate) and aversive stimuli (sight of moldy strawberries and/or an unpleasant strawberry taste) on the final day of drug treatment. Results Citalopram reduced activation to the chocolate stimuli in the ventral striatum and the ventral medial/orbitofrontal cortex. In contrast, reboxetine did not suppress ventral striatal activity and in fact increased neural responses within medial orbitofrontal cortex to reward. Citalopram also decreased neural responses to the aversive stimuli conditions in key “punishment” areas such as the lateral orbitofrontal cortex. Reboxetine produced a similar, although weaker effect. Conclusions Our findings are the first to show that treatment with SSRIs can diminish the neural processing of both rewarding and aversive stimuli. The ability of SSRIs to decrease neural responses to reward might underlie the questioned efficacy of SSRIs in depressive conditions characterized by decreased motivation and anhedonia and could also account for the experience of emotional blunting described by some patients during SSRI treatment.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

We have optimised the atmospheric radiation algorithm of the FAMOUS climate model on several hardware platforms. The optimisation involved translating the Fortran code to C and restructuring the algorithm around the computation of a single air column. Instead of the existing MPI-based domain decomposition, we used a task queue and a thread pool to schedule the computation of individual columns on the available processors. Finally, four air columns are packed together in a single data structure and computed simultaneously using Single Instruction Multiple Data operations. The modified algorithm runs more than 50 times faster on the CELL’s Synergistic Processing Elements than on its main PowerPC processing element. On Intel-compatible processors, the new radiation code runs 4 times faster. On the tested graphics processor, using OpenCL, we find a speed-up of more than 2.5 times as compared to the original code on the main CPU. Because the radiation code takes more than 60% of the total CPU time, FAMOUS executes more than twice as fast. Our version of the algorithm returns bit-wise identical results, which demonstrates the robustness of our approach. We estimate that this project required around two and a half man-years of work.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Exascale systems are the next frontier in high-performance computing and are expected to deliver a performance of the order of 10^18 operations per second using massive multicore processors. Very large- and extreme-scale parallel systems pose critical algorithmic challenges, especially related to concurrency, locality and the need to avoid global communication patterns. This work investigates a novel protocol for dynamic group communication that can be used to remove the global communication requirement and to reduce the communication cost in parallel formulations of iterative data mining algorithms. The protocol is used to provide a communication-efficient parallel formulation of the k-means algorithm for cluster analysis. The approach is based on a collective communication operation for dynamic groups of processes and exploits non-uniform data distributions. Non-uniform data distributions can be either found in real-world distributed applications or induced by means of multidimensional binary search trees. The analysis of the proposed dynamic group communication protocol has shown that it does not introduce significant communication overhead. The parallel clustering algorithm has also been extended to accommodate an approximation error, which allows a further reduction of the communication costs. The effectiveness of the exact and approximate methods has been tested in a parallel computing system with 64 processors and in simulations with 1024 processing elements.