813 resultados para Parallel operations
Resumo:
This thesis defines Pi, a parallel architecture interface that separates model and machine issues, allowing them to be addressed independently. This provides greater flexibility for both the model and machine builder. Pi addresses a set of common parallel model requirements including low latency communication, fast task switching, low cost synchronization, efficient storage management, the ability to exploit locality, and efficient support for sequential code. Since Pi provides generic parallel operations, it can efficiently support many parallel programming models including hybrids of existing models. Pi also forms a basis of comparison for architectural components.
Resumo:
In this work it is proposed an optimized dynamic response of parallel operation of two single-phase inverters with no control communication. The optimization aims the tuning of the slopes of P-ω and Q-V curves so that the system is stable, damped and minimum settling time. The slopes are tuned using an algorithm based on evolutionary theory. Simulation and experimental results are presented to prove the feasibility of the proposed approach. © 2010 IEEE.
Resumo:
This work focuses on the design of high-efficient DC-DC converters based on WBG power devices. The first objective is the development of an isolated bidirectional converter for the distribution network of future electrical aircrafts. A SiC-based Dual Active Bridge converter is designed and fabricated. Control strategies for individual and parallel operations are investigated and implemented into a FPGA platform. Experimental results on 1.2kW 270V/28V prototype are presented to confirm the proper behavior of the proposed solution. The second project belongs to the field of photovoltaic systems and aims to develop a three-port converter with multiple power elements interfacing capability. A GaN-based Triple Active Bridge has been designed, regarding both the controller and the hardware realization.
Resumo:
This paper is focused on a parallel JAVA implementation of a processor defined in a Network of Evolutionary Processors. Processor description is based on JDom, which provides a complete, Java-based solution for accessing, manipulating, and outputting XML data from Java code. Communication among different processor to obtain a fully functional simulation of a Network of Evolutionary Processors will be treated in future. A safe-thread model of processors performs all parallel operations such as rules and filters. A non-deterministic behavior of processors is achieved with a thread for each rule and for each filter (input and output). Different results of a processor evolution are shown.
Resumo:
Every year, the number of discarded electro-electronic products is increasing. For this reason recycling is needed, to avoid wasting non-renewable natural resources. The objective of this work is to study the recycling of materials from parallel wire cable through Unit operations of mineral processing. Parallel wire cables are basically composed of polymer and copper. The following unit operations were tested: grinding, size classification, dense medium separation, electrostatic separation, scrubbing, panning, and elutriation. It was observed that the operations used obtained copper and PVC concentrates with a low degree of cross contamination. It was Concluded that total liberation of the materials was accomplished after grinding to less than 3 mm, using a cage mill. Separation using panning and elutriation presented the best results in terms of recovery and cross contamination. (C) 2007 Elsevier Ltd. All rights reserved.
Resumo:
The cost of spatial join processing can be very high because of the large sizes of spatial objects and the computation-intensive spatial operations. While parallel processing seems a natural solution to this problem, it is not clear how spatial data can be partitioned for this purpose. Various spatial data partitioning methods are examined in this paper. A framework combining the data-partitioning techniques used by most parallel join algorithms in relational databases and the filter-and-refine strategy for spatial operation processing is proposed for parallel spatial join processing. Object duplication caused by multi-assignment in spatial data partitioning can result in extra CPU cost as well as extra communication cost. We find that the key to overcome this problem is to preserve spatial locality in task decomposition. We show in this paper that a near-optimal speedup can be achieved for parallel spatial join processing using our new algorithms.
Resumo:
Single processor architectures are unable to provide the required performance of high performance embedded systems. Parallel processing based on general-purpose processors can achieve these performances with a considerable increase of required resources. However, in many cases, simplified optimized parallel cores can be used instead of general-purpose processors achieving better performance at lower resource utilization. In this paper, we propose a configurable many-core architecture to serve as a co-processor for high-performance embedded computing on Field-Programmable Gate Arrays. The architecture consists of an array of configurable simple cores with support for floating-point operations interconnected with a configurable interconnection network. For each core it is possible to configure the size of the internal memory, the supported operations and number of interfacing ports. The architecture was tested in a ZYNQ-7020 FPGA in the execution of several parallel algorithms. The results show that the proposed many-core architecture achieves better performance than that achieved with a parallel generalpurpose processor and that up to 32 floating-point cores can be implemented in a ZYNQ-7020 SoC FPGA.
Resumo:
We address the problem of scheduling a multiclass $M/M/m$ queue with Bernoulli feedback on $m$ parallel servers to minimize time-average linear holding costs. We analyze the performance of a heuristic priority-index rule, which extends Klimov's optimal solution to the single-server case: servers select preemptively customers with larger Klimov indices. We present closed-form suboptimality bounds (approximate optimality) for Klimov's rule, which imply that its suboptimality gap is uniformly bounded above with respect to (i) external arrival rates, as long as they stay within system capacity;and (ii) the number of servers. It follows that its relativesuboptimality gap vanishes in a heavy-traffic limit, as external arrival rates approach system capacity (heavy-traffic optimality). We obtain simpler expressions for the special no-feedback case, where the heuristic reduces to the classical $c \mu$ rule. Our analysis is based on comparing the expected cost of Klimov's ruleto the value of a strong linear programming (LP) relaxation of the system's region of achievable performance of mean queue lengths. In order to obtain this relaxation, we derive and exploit a new set ofwork decomposition laws for the parallel-server system. We further report on the results of a computational study on the quality of the $c \mu$ rule for parallel scheduling.
Resumo:
The process of developing software that takes advantage of multiple processors is commonly referred to as parallel programming. For various reasons, this process is much harder than the sequential case. For decades, parallel programming has been a problem for a small niche only: engineers working on parallelizing mostly numerical applications in High Performance Computing. This has changed with the advent of multi-core processors in mainstream computer architectures. Parallel programming in our days becomes a problem for a much larger group of developers. The main objective of this thesis was to find ways to make parallel programming easier for them. Different aims were identified in order to reach the objective: research the state of the art of parallel programming today, improve the education of software developers about the topic, and provide programmers with powerful abstractions to make their work easier. To reach these aims, several key steps were taken. To start with, a survey was conducted among parallel programmers to find out about the state of the art. More than 250 people participated, yielding results about the parallel programming systems and languages in use, as well as about common problems with these systems. Furthermore, a study was conducted in university classes on parallel programming. It resulted in a list of frequently made mistakes that were analyzed and used to create a programmers' checklist to avoid them in the future. For programmers' education, an online resource was setup to collect experiences and knowledge in the field of parallel programming - called the Parawiki. Another key step in this direction was the creation of the Thinking Parallel weblog, where more than 50.000 readers to date have read essays on the topic. For the third aim (powerful abstractions), it was decided to concentrate on one parallel programming system: OpenMP. Its ease of use and high level of abstraction were the most important reasons for this decision. Two different research directions were pursued. The first one resulted in a parallel library called AthenaMP. It contains so-called generic components, derived from design patterns for parallel programming. These include functionality to enhance the locks provided by OpenMP, to perform operations on large amounts of data (data-parallel programming), and to enable the implementation of irregular algorithms using task pools. AthenaMP itself serves a triple role: the components are well-documented and can be used directly in programs, it enables developers to study the source code and learn from it, and it is possible for compiler writers to use it as a testing ground for their OpenMP compilers. The second research direction was targeted at changing the OpenMP specification to make the system more powerful. The main contributions here were a proposal to enable thread-cancellation and a proposal to avoid busy waiting. Both were implemented in a research compiler, shown to be useful in example applications, and proposed to the OpenMP Language Committee.
Resumo:
In der vorliegenden Dissertation werden Systeme von parallel arbeitenden und miteinander kommunizierenden Restart-Automaten (engl.: systems of parallel communicating restarting automata; abgekürzt PCRA-Systeme) vorgestellt und untersucht. Dabei werden zwei bekannte Konzepte aus den Bereichen Formale Sprachen und Automatentheorie miteinander vescrknüpft: das Modell der Restart-Automaten und die sogenannten PC-Systeme (systems of parallel communicating components). Ein PCRA-System besteht aus endlich vielen Restart-Automaten, welche einerseits parallel und unabhängig voneinander lokale Berechnungen durchführen und andererseits miteinander kommunizieren dürfen. Die Kommunikation erfolgt dabei durch ein festgelegtes Kommunikationsprotokoll, das mithilfe von speziellen Kommunikationszuständen realisiert wird. Ein wesentliches Merkmal hinsichtlich der Kommunikationsstruktur in Systemen von miteinander kooperierenden Komponenten ist, ob die Kommunikation zentralisiert oder nichtzentralisiert erfolgt. Während in einer nichtzentralisierten Kommunikationsstruktur jede Komponente mit jeder anderen Komponente kommunizieren darf, findet jegliche Kommunikation innerhalb einer zentralisierten Kommunikationsstruktur ausschließlich mit einer ausgewählten Master-Komponente statt. Eines der wichtigsten Resultate dieser Arbeit zeigt, dass zentralisierte Systeme und nichtzentralisierte Systeme die gleiche Berechnungsstärke besitzen (das ist im Allgemeinen bei PC-Systemen nicht so). Darüber hinaus bewirkt auch die Verwendung von Multicast- oder Broadcast-Kommunikationsansätzen neben Punkt-zu-Punkt-Kommunikationen keine Erhöhung der Berechnungsstärke. Desweiteren wird die Ausdrucksstärke von PCRA-Systemen untersucht und mit der von PC-Systemen von endlichen Automaten und mit der von Mehrkopfautomaten verglichen. PC-Systeme von endlichen Automaten besitzen bekanntermaßen die gleiche Ausdrucksstärke wie Einwegmehrkopfautomaten und bilden eine untere Schranke für die Ausdrucksstärke von PCRA-Systemen mit Einwegkomponenten. Tatsächlich sind PCRA-Systeme auch dann stärker als PC-Systeme von endlichen Automaten, wenn die Komponenten für sich genommen die gleiche Ausdrucksstärke besitzen, also die regulären Sprachen charakterisieren. Für PCRA-Systeme mit Zweiwegekomponenten werden als untere Schranke die Sprachklassen der Zweiwegemehrkopfautomaten im deterministischen und im nichtdeterministischen Fall gezeigt, welche wiederum den bekannten Komplexitätsklassen L (deterministisch logarithmischer Platz) und NL (nichtdeterministisch logarithmischer Platz) entsprechen. Als obere Schranke wird die Klasse der kontextsensitiven Sprachen gezeigt. Außerdem werden Erweiterungen von Restart-Automaten betrachtet (nonforgetting-Eigenschaft, shrinking-Eigenschaft), welche bei einzelnen Komponenten eine Erhöhung der Berechnungsstärke bewirken, in Systemen jedoch deren Stärke nicht erhöhen. Die von PCRA-Systemen charakterisierten Sprachklassen sind unter diversen Sprachoperationen abgeschlossen und einige Sprachklassen sind sogar abstrakte Sprachfamilien (sogenannte AFL's). Abschließend werden für PCRA-Systeme spezifische Probleme auf ihre Entscheidbarkeit hin untersucht. Es wird gezeigt, dass Leerheit, Universalität, Inklusion, Gleichheit und Endlichkeit bereits für Systeme mit zwei Restart-Automaten des schwächsten Typs nicht semientscheidbar sind. Für das Wortproblem wird gezeigt, dass es im deterministischen Fall in quadratischer Zeit und im nichtdeterministischen Fall in exponentieller Zeit entscheidbar ist.
Resumo:
Scheduling tasks to efficiently use the available processor resources is crucial to minimizing the runtime of applications on shared-memory parallel processors. One factor that contributes to poor processor utilization is the idle time caused by long latency operations, such as remote memory references or processor synchronization operations. One way of tolerating this latency is to use a processor with multiple hardware contexts that can rapidly switch to executing another thread of computation whenever a long latency operation occurs, thus increasing processor utilization by overlapping computation with communication. Although multiple contexts are effective for tolerating latency, this effectiveness can be limited by memory and network bandwidth, by cache interference effects among the multiple contexts, and by critical tasks sharing processor resources with less critical tasks. This thesis presents techniques that increase the effectiveness of multiple contexts by intelligently scheduling threads to make more efficient use of processor pipeline, bandwidth, and cache resources. This thesis proposes thread prioritization as a fundamental mechanism for directing the thread schedule on a multiple-context processor. A priority is assigned to each thread either statically or dynamically and is used by the thread scheduler to decide which threads to load in the contexts, and to decide which context to switch to on a context switch. We develop a multiple-context model that integrates both cache and network effects, and shows how thread prioritization can both maintain high processor utilization, and limit increases in critical path runtime caused by multithreading. The model also shows that in order to be effective in bandwidth limited applications, thread prioritization must be extended to prioritize memory requests. We show how simple hardware can prioritize the running of threads in the multiple contexts, and the issuing of requests to both the local memory and the network. Simulation experiments show how thread prioritization is used in a variety of applications. Thread prioritization can improve the performance of synchronization primitives by minimizing the number of processor cycles wasted in spinning and devoting more cycles to critical threads. Thread prioritization can be used in combination with other techniques to improve cache performance and minimize cache interference between different working sets in the cache. For applications that are critical path limited, thread prioritization can improve performance by allowing processor resources to be devoted preferentially to critical threads. These experimental results show that thread prioritization is a mechanism that can be used to implement a wide range of scheduling policies.
Resumo:
This paper presents the research and development of a 3-legged micro Parallel Kinematic Manipulator (PKM) for positioning in micro-machining and assembly operations. The structural characteristics associated with parallel manipulators are evaluated and the PKMs with translational and rotational movements are identified. Based on these identifications, a hybrid 3-UPU (Universal Joint-Prismatic Joint-Universal Joint) parallel manipulator is designed and fabricated. The principles of the operation and modeling of this micro PKM is largely similar to a normal size Stewart Platform (SP). A modular design methodology is introduced for the construction of this micro PKM. Calibration results of this hybrid 3-UPU PKM are discussed in this paper.
Resumo:
A parallel formulation of an algorithm for the histogram computation of n data items using an on-the-fly data decomposition and a novel quantum-like representation (QR) is developed. The QR transformation separates multiple data read operations from multiple bin update operations thereby making it easier to bind data items into their corresponding histogram bins. Under this model the steps required to compute the histogram is n/s + t steps, where s is a speedup factor and t is associated with pipeline latency. Here, we show that an overall speedup factor, s, is available for up to an eightfold acceleration. Our evaluation also shows that each one of these cells requires less area/time complexity compared to similar proposals found in the literature.
Resumo:
We have optimised the atmospheric radiation algorithm of the FAMOUS climate model on several hardware platforms. The optimisation involved translating the Fortran code to C and restructuring the algorithm around the computation of a single air column. Instead of the existing MPI-based domain decomposition, we used a task queue and a thread pool to schedule the computation of individual columns on the available processors. Finally, four air columns are packed together in a single data structure and computed simultaneously using Single Instruction Multiple Data operations. The modified algorithm runs more than 50 times faster on the CELL’s Synergistic Processing Elements than on its main PowerPC processing element. On Intel-compatible processors, the new radiation code runs 4 times faster. On the tested graphics processor, using OpenCL, we find a speed-up of more than 2.5 times as compared to the original code on the main CPU. Because the radiation code takes more than 60% of the total CPU time, FAMOUS executes more than twice as fast. Our version of the algorithm returns bit-wise identical results, which demonstrates the robustness of our approach. We estimate that this project required around two and a half man-years of work.
Resumo:
Exascale systems are the next frontier in high-performance computing and are expected to deliver a performance of the order of 10^18 operations per second using massive multicore processors. Very large- and extreme-scale parallel systems pose critical algorithmic challenges, especially related to concurrency, locality and the need to avoid global communication patterns. This work investigates a novel protocol for dynamic group communication that can be used to remove the global communication requirement and to reduce the communication cost in parallel formulations of iterative data mining algorithms. The protocol is used to provide a communication-efficient parallel formulation of the k-means algorithm for cluster analysis. The approach is based on a collective communication operation for dynamic groups of processes and exploits non-uniform data distributions. Non-uniform data distributions can be either found in real-world distributed applications or induced by means of multidimensional binary search trees. The analysis of the proposed dynamic group communication protocol has shown that it does not introduce significant communication overhead. The parallel clustering algorithm has also been extended to accommodate an approximation error, which allows a further reduction of the communication costs. The effectiveness of the exact and approximate methods has been tested in a parallel computing system with 64 processors and in simulations with 1024 processing elements.