949 results for Parallel programming (computer)
Abstract:
Faced with the stagnation of uniprocessor technology over the past decade, the major microprocessor manufacturers found in multi-core technology the answer to the market's growing processing demands. For years, software developers watched their applications track the performance gains delivered by each new generation of sequential processors, but as processing capacity now scales with the number of processors, sequential computations must be decomposed into multiple concurrent parts that can execute in parallel, so as to use the additional processing units and complete sooner. Parallel programming implies a paradigm entirely distinct from sequential programming. Unlike the sequential computers typified by the Von Neumann model, the heterogeneity of parallel architectures demands parallel programming models that abstract architectural details away from the programmer and simplify the development of concurrent applications. The most popular parallel programming models encourage programmers to identify concurrent instructions in their program logic and to specify them as tasks that can be assigned to distinct processors and executed simultaneously. These tasks are typically spawned at run time and assigned to processors by the underlying execution engine. Since processing requirements tend to vary and are not known a priori, the mapping of tasks to processors must be determined dynamically, in response to unpredictable changes in execution requirements. As the volume of computation grows, it becomes ever less feasible to guarantee its timing constraints on uniprocessor platforms. While real-time systems begin to adapt to the parallel computing paradigm, there is a growing drive to integrate real-time executions with interactive applications on the same hardware, in a world where technology becomes ever smaller, lighter, more ubiquitous, and more portable. This integration requires scheduling solutions that simultaneously guarantee the timing requirements of real-time tasks and maintain an acceptable level of QoS for all other executions. To that end, it becomes imperative that real-time applications be parallelized, so as to minimize their response times and maximize the utilization of processing resources. This adds a new dimension to the scheduling problem, which must respond correctly to new and unpredictable execution requirements, and quickly devise the task-to-processor mapping that best serves the system's performance criteria. Server-based scheduling makes it possible to reserve a fraction of the processing capacity for the execution of real-time tasks and to ensure that latency effects on their execution do not disturb the reservations stipulated for other executions. For tasks scheduled by their worst-case execution time, or tasks with variable execution times, it is likely that the allotted bandwidth will not be fully consumed. To improve system utilization, capacity-sharing algorithms donate this unused capacity to the execution of other tasks while preserving the isolation guarantees between servers.
With proven efficiency in terms of space, time, and communication, the work-stealing mechanism has been gaining popularity as a methodology for scheduling tasks with dynamic, irregular parallelism. The p-CSWS algorithm combines server-based scheduling with capacity-sharing and work-stealing to meet the scheduling needs of open real-time systems. While server-based scheduling allows processing resources to be shared without timing interference, a new work-stealing policy operating on top of the capacity-sharing mechanism exploits parallelism in a way that improves application response times and overall system utilization. This thesis proposes an implementation of the p-CSWS algorithm for Linux. In line with the modular structure of the Linux scheduler, a new scheduling class is defined with the aim of evaluating the applicability of the p-CSWS heuristic under real circumstances. Once the obstacles intrinsic to Linux kernel programming were overcome, extensive experimental tests prove that p-CSWS is more than an attractive theoretical concept, and that the heuristic exploitation of parallelism proposed by the algorithm benefits the response times of real-time applications, as well as the performance and efficiency of the multiprocessor platform.
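The core mechanism p-CSWS builds on can be illustrated with the per-worker deque discipline that work-stealing schedulers commonly use. The sketch below is not the thesis's Linux implementation (which lives inside a kernel scheduling class); it is a minimal lock-based, user-space rendering, and the class name and the use of std::function are assumptions made for the example.

```cpp
#include <deque>
#include <functional>
#include <mutex>
#include <optional>

// Illustrative only: a lock-based work-stealing deque. The owner pushes and
// pops tasks at the bottom (LIFO, good cache locality); idle workers steal
// from the top (FIFO), taking the oldest tasks, which tend to represent the
// largest remaining subcomputations.
class WorkStealingDeque {
public:
    using Task = std::function<void()>;

    void push(Task t) {                      // owner: enqueue a new task
        std::lock_guard<std::mutex> lock(m_);
        tasks_.push_back(std::move(t));
    }
    std::optional<Task> pop() {              // owner: take the newest task
        std::lock_guard<std::mutex> lock(m_);
        if (tasks_.empty()) return std::nullopt;
        Task t = std::move(tasks_.back());
        tasks_.pop_back();
        return t;
    }
    std::optional<Task> steal() {            // thief: take the oldest task
        std::lock_guard<std::mutex> lock(m_);
        if (tasks_.empty()) return std::nullopt;
        Task t = std::move(tasks_.front());
        tasks_.pop_front();
        return t;
    }
private:
    std::mutex m_;
    std::deque<Task> tasks_;
};
```

The asymmetry is the point: owners work LIFO on their own deques while thieves steal FIFO, which is what gives work stealing its proven space, time, and communication bounds.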
Abstract:
Dissertation submitted to obtain the Master's Degree in Computer Engineering
Abstract:
Vita.
Abstract:
Thesis (Ph.D.)--University of Washington, 2016-08
Abstract:
Processors with large numbers of cores are becoming commonplace. In order to utilise the available resources in such systems, the programming paradigm has to move towards increased parallelism. However, increased parallelism does not necessarily lead to better performance. Parallel programming models have to provide not only flexible ways of defining parallel tasks, but also efficient methods to manage the created tasks. Moreover, in a general-purpose system, applications residing in the system compete for the shared resources. Thread and task scheduling in such a multiprogrammed multithreaded environment is a significant challenge. In this thesis, we introduce a new task-based parallel reduction model, called the Glasgow Parallel Reduction Machine (GPRM). Our main objective is to provide high performance while maintaining ease of programming. GPRM supports native parallelism; it provides a modular way of expressing parallel tasks and the communication patterns between them. Compiling a GPRM program results in an Intermediate Representation (IR) containing useful information about tasks and their dependencies, as well as the initial mapping information. This compile-time information helps reduce the overhead of runtime task scheduling and is key to high performance. Generally speaking, the granularity and the number of tasks are major factors in achieving high performance. These factors are even more important in the case of GPRM, as it is highly dependent on tasks, rather than threads. We use three basic benchmarks to provide a detailed comparison of GPRM with Intel OpenMP, Cilk Plus, and Threading Building Blocks (TBB) on the Intel Xeon Phi, and with GNU OpenMP on the Tilera TILEPro64. GPRM shows superior performance in almost all cases, merely by controlling the number of tasks. GPRM also provides a low-overhead mechanism, called Global Sharing, which improves performance in multiprogramming situations. We use OpenMP, the most popular model for shared-memory parallel programming, as the main GPRM competitor for solving three well-known problems on both platforms: LU factorisation of Sparse Matrices, Image Convolution, and Linked List Processing. We focus on proposing solutions that best fit GPRM's model of execution. GPRM outperforms OpenMP in all cases on the TILEPro64. On the Xeon Phi, our solution for LU Factorisation results in notable performance improvement for sparse matrices with large numbers of small blocks. We investigate the overhead of GPRM's task creation and distribution for very short computations using the Image Convolution benchmark. We show that this overhead can be mitigated by combining smaller tasks into larger ones. As a result, GPRM can outperform OpenMP for convolving large 2D matrices on the Xeon Phi. Finally, we demonstrate that our parallel worksharing construct provides an efficient solution for Linked List processing and performs better than OpenMP implementations on the Xeon Phi. The results are very promising, as they verify that our parallel programming framework for manycore processors is flexible and scalable, and can provide high performance without sacrificing productivity.
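The granularity point above, that per-task overhead can be mitigated by combining smaller tasks into larger ones, can be sketched independently of GPRM. The following is an illustration only; run_chunked and its parameters are invented for the example and are not part of the GPRM API.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Illustrative only (not the GPRM API): combining many small work items into
// a fixed number of coarser tasks, so that the task count -- not the item
// count -- determines the scheduling overhead.
void run_chunked(std::size_t n_items, std::size_t n_tasks,
                 const std::function<void(std::size_t)>& work) {
    std::vector<std::thread> workers;
    std::size_t chunk = (n_items + n_tasks - 1) / n_tasks;   // ceil division
    for (std::size_t t = 0; t < n_tasks; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = std::min(begin + chunk, n_items);
        if (begin >= end) break;                 // no items left for this task
        workers.emplace_back([=, &work] {
            for (std::size_t i = begin; i < end; ++i) work(i);
        });
    }
    for (auto& w : workers) w.join();
}
```

Tuning n_tasks is exactly the "controlling the number of tasks" knob the abstract credits for GPRM's performance: too many tasks and overhead dominates, too few and load balance suffers.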
Abstract:
The TCP/IP architecture was consolidated as a standard for distributed systems. However, there is considerable research and debate about alternatives for the evolution of this architecture. In this area, this work presents the Title Model, which contributes to supporting application needs through the use of a cross-layer ontology and horizontal addressing in a next-generation Internet. From a practical viewpoint, the reduction in network cost is shown for a distributed programming example in networks with layer-2 connectivity. To demonstrate the Title Model's improvement, a network analysis is presented for the message passing interface, sending a vector of integers and returning its sum. This analysis confirms that the current proposal allows, in this environment, a reduction of 15.23% in total network traffic, in bytes.
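The benchmark described, sending a vector of integers and returning its sum over a message passing interface, has roughly the following shape in MPI. This is an illustrative reconstruction, not the paper's code; the slice size and fill values are arbitrary.

```cpp
#include <mpi.h>
#include <iostream>
#include <numeric>
#include <vector>

// Sketch of the benchmark shape: each rank sums its slice of a vector of
// integers and MPI_Reduce combines the partial sums at rank 0.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    std::vector<int> slice(1000, rank + 1);  // this rank's share of the vector
    long local = std::accumulate(slice.begin(), slice.end(), 0L);

    long total = 0;
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        std::cout << "sum = " << total << "\n";   // sum over all ranks' slices
    MPI_Finalize();
    return 0;
}
```

It is the per-message header and routing overhead of exchanges like these that the paper measures when comparing its horizontal addressing against the TCP/IP stack.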
Abstract:
Multicore platforms have transformed parallelism into a main concern. Parallel programming models are being put forward to provide a better approach for application programmers to expose the opportunities for parallelism by pointing out potentially parallel regions within tasks, leaving the actual and dynamic scheduling of these regions onto processors to be performed at runtime, exploiting the maximum amount of parallelism. It is in this context that this paper proposes a scheduling approach that combines the constant-bandwidth server abstraction with a priority-aware work-stealing load balancing scheme which, while ensuring isolation among tasks, enables parallel tasks to be executed on more than one processor at a given time instant.
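For readers unfamiliar with the constant-bandwidth server abstraction the paper builds on, a minimal sketch of the classic CBS accounting rules follows. Field names and microsecond units are assumptions for the example; the paper's priority-aware work-stealing layer is not reproduced here.

```cpp
// Illustrative constant-bandwidth server (CBS) accounting: each server is
// given a budget Q every period T, so the tasks it serves can never consume
// more than Q/T of a processor regardless of their demand -- this is what
// provides the isolation among tasks mentioned above.
struct CbsServer {
    long budget_us;    // remaining budget in the current period
    long Q_us;         // maximum budget per period
    long T_us;         // server period
    long deadline_us;  // current absolute deadline

    void consume(long exec_us) {
        budget_us -= exec_us;
        if (budget_us <= 0) {          // budget exhausted:
            deadline_us += T_us;       //   postpone the deadline one period
            budget_us = Q_us;          //   and replenish the budget
        }
    }
};
```

Because overruns only postpone the server's own deadline, a misbehaving task degrades its own service rather than that of other servers, which is precisely the isolation property the work-stealing scheme must preserve.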
Abstract:
In recent years, computers equipped with multiple processors and multiple cores have become commonplace. To exploit the new characteristics of this hardware efficiently, tools have emerged to ease the development of parallel software, through languages and frameworks adapted to different languages. With the wide adoption of high-speed networks, such as Gigabit Ethernet and the latest generation of Wi-Fi, the opportunity arises to parallelize processing not only across processors and cores, but simultaneously across different machines. The model that parallelizes processing locally while simultaneously distributing it to machines that can also parallelize it themselves is called the distributed parallel model. This dissertation analyses the techniques and tools used for parallel programming and the existing work in the field of parallel and distributed programming. With these two factors in mind, a framework is proposed that applies the simplicity of parallel programming to the distributed parallel concept. The proposal consists of a Java framework with a simple, easy-to-learn, and readable programming interface that transparently parallelizes and distributes processing. Although simple, an effort was made to make it configurable so that it adapts to as many situations as possible. This dissertation focuses in particular on issues concerning the execution and distribution of work, and on how code is automatically shipped over the network to other cooperating nodes, thereby avoiding the manual installation of applications on every node of the network. To validate this concept and the ideas defended in this dissertation, the framework, named DPF4j (Distributed Parallel Framework for JAVA), was implemented, and tests were run and metrics collected to verify the existence of performance gains over existing solutions.
Abstract:
Dissertation submitted to obtain the Master's Degree in Computer Engineering
Abstract:
Breast cancer is the most common cancer among women and a major public health problem. Worldwide, X-ray mammography is the current gold standard for medical imaging of breast cancer. However, it has some well-known limitations. The false-negative rates, up to 66% in symptomatic women, and the false-positive rates, up to 60%, are a continued source of concern and debate. These drawbacks prompt the development of other imaging techniques for breast cancer detection, among which is Digital Breast Tomosynthesis (DBT). DBT is a 3D radiographic technique that reduces the obscuring effect of tissue overlap and appears to address both the false-negative and the false-positive rates. The 3D images in DBT are only achieved through image reconstruction methods. These methods play an important role in a clinical setting, since there is a need to implement a reconstruction process that is both accurate and fast. This dissertation deals with the optimization of iterative algorithms through parallel computing on Graphics Processing Units (GPUs), using the Compute Unified Device Architecture (CUDA) to make 3D reconstruction faster. Iterative algorithms have been shown to produce the highest-quality DBT images, but because they are computationally intensive, their clinical use is currently rejected. These algorithms have the potential to reduce patient dose in DBT scans. A method of integrating CUDA in Interactive Data Language (IDL) is proposed in order to accelerate the DBT image reconstructions. This method has never been attempted before for DBT. In this work the system matrix calculation, the most computationally expensive part of iterative algorithms, is accelerated. A speedup of 1.6 is achieved, proving that GPUs can accelerate the IDL implementation.
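The reason the system matrix calculation parallelizes so well, on GPUs or otherwise, is that each entry can be computed independently of the others. The CPU-thread sketch below only illustrates that structure; weight() is a placeholder for the actual geometric computation, and none of this is the dissertation's CUDA/IDL code.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Placeholder geometry: a real implementation computes the contribution
// (e.g. ray/voxel intersection) linking detector pixel i to voxel j.
double weight(std::size_t pixel, std::size_t voxel) {
    return (pixel + 1.0) / (voxel + 1.0);
}

// Each row of the system matrix A is independent, so rows can be striped
// across CPU threads here -- or across thousands of GPU threads in a CUDA
// kernel, which is where the dissertation's speedup comes from.
void system_matrix_rows(std::vector<std::vector<double>>& A,
                        std::size_t n_voxels, unsigned n_threads) {
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n_threads; ++t)
        pool.emplace_back([&, t] {
            for (std::size_t i = t; i < A.size(); i += n_threads)
                for (std::size_t j = 0; j < n_voxels; ++j)
                    A[i][j] = weight(i, j);
        });
    for (auto& th : pool) th.join();
}
```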
Abstract:
The Intel® Xeon Phi™ is the first processor based on Intel's MIC (Many Integrated Cores) architecture. It is a co-processor specially tailored for data-parallel computations, whose basic architectural design is similar to that of GPUs (Graphics Processing Units), leveraging the use of many integrated low-power computational cores to perform parallel computations. The main novelty of the MIC architecture, relative to GPUs, is its compatibility with the Intel x86 architecture. This enables the use of many of the tools commonly available for the parallel programming of x86-based architectures, which may lead to a smaller learning curve. However, programming the Xeon Phi still entails aspects intrinsic to accelerator-based computing in general, and to the MIC architecture in particular. In this thesis we advocate the use of algorithmic skeletons for programming the Xeon Phi. Algorithmic skeletons abstract the complexity inherent to parallel programming, hiding details such as resource management, parallel decomposition, and inter-execution-flow communication, thus removing these concerns from the programmer's mind. In this context, the goal of the thesis is to lay the foundations for the development of a simple but powerful and efficient skeleton framework for the programming of the Xeon Phi processor. For this purpose we build upon Marrow, an existing framework for the orchestration of OpenCL™ computations in multi-GPU and CPU environments. We extend Marrow to execute both OpenCL and C++ parallel computations on the Xeon Phi. We evaluate the newly developed framework with several well-known benchmarks, such as Saxpy and N-Body, comparing not only its performance to that of the existing framework when executing on the co-processor, but also the performance of the Xeon Phi against a multi-GPU environment.
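To make the skeleton idea concrete, a minimal map skeleton is sketched below. It is not Marrow's API (Marrow orchestrates OpenCL computations); it merely shows how decomposition, thread management, and joining disappear behind a single call.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// A minimal map skeleton in the spirit described above (not Marrow's API):
// the programmer supplies only the element-wise function; the skeleton hides
// parallel decomposition and resource management.
template <typename T, typename F>
void map_skeleton(std::vector<T>& data, F f,
                  unsigned workers = std::thread::hardware_concurrency()) {
    if (workers == 0) workers = 1;     // hardware_concurrency() may report 0
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w)
        pool.emplace_back([&, w] {
            for (std::size_t i = w; i < data.size(); i += workers)
                data[i] = f(data[i]);  // each worker handles a strided slice
        });
    for (auto& t : pool) t.join();
}

// Usage: a Saxpy-like update with no explicit threading in user code.
// std::vector<float> y(1 << 20, 1.0f);
// map_skeleton(y, [](float v) { return 2.0f * v; });
```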
Abstract:
This paper shows how a high level matrix programming language may be used to perform Monte Carlo simulation, bootstrapping, estimation by maximum likelihood and GMM, and kernel regression in parallel on symmetric multiprocessor computers or clusters of workstations. The implementation of parallelization is done in a way such that an investigator may use the programs without any knowledge of parallel programming. A bootable CD that allows rapid creation of a cluster for parallel computing is introduced. Examples show that parallelization can lead to important reductions in computational time. Detailed discussion of how the Monte Carlo problem was parallelized is included as an example for learning to write parallel programs for Octave.
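The paper's code is written for Octave; the C++ sketch below only illustrates the embarrassingly parallel structure that makes Monte Carlo such a natural target for parallelization: replications are independent, so each worker runs its share with a private RNG stream and the partial results are merged. Seeds and the simulated quantity are arbitrary.

```cpp
#include <cstddef>
#include <random>
#include <thread>
#include <vector>

// Illustrative parallel Monte Carlo: estimate the mean of a standard normal
// draw by splitting replications across workers. Assumes reps is a multiple
// of workers, for simplicity.
double monte_carlo_mean(std::size_t reps, unsigned workers) {
    std::vector<double> partial(workers, 0.0);
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w)
        pool.emplace_back([&, w] {
            std::mt19937_64 rng(12345 + w);           // per-worker RNG stream
            std::normal_distribution<double> draw(0.0, 1.0);
            std::size_t n = reps / workers;           // this worker's share
            double sum = 0.0;
            for (std::size_t i = 0; i < n; ++i) sum += draw(rng);
            partial[w] = sum / n;
        });
    for (auto& t : pool) t.join();
    double mean = 0.0;
    for (double p : partial) mean += p / workers;     // merge partial means
    return mean;
}
```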
Abstract:
The past few decades have seen a considerable increase in the number of parallel and distributed systems. With the development of more complex applications, the need for more powerful systems has emerged and various parallel and distributed environments have been designed and implemented. Each of the environments, including hardware and software, has unique strengths and weaknesses. There is no single parallel environment that can be identified as the best environment for all applications with respect to hardware and software properties. The main goal of this thesis is to provide a novel way of performing data-parallel computation in parallel and distributed environments by utilizing the best characteristics of different aspects of parallel computing. For the purpose of this thesis, three aspects of parallel computing were identified and studied. First, three parallel environments (shared memory, distributed memory, and a network of workstations) are evaluated to quantify their suitability for different parallel applications. Due to the parallel and distributed nature of the environments, the networks connecting the processors in these environments were investigated with respect to their performance characteristics. Second, scheduling algorithms are studied in order to make them more efficient and effective. A concept of application-specific information scheduling is introduced. The application-specific information is data about the workload extracted from an application, which is provided to a scheduling algorithm. Three scheduling algorithms are enhanced to utilize the application-specific information to further refine their scheduling properties. A more accurate description of the workload is especially important in cases where the work units are heterogeneous and the parallel environment is heterogeneous and/or non-dedicated. The results obtained show that the additional information regarding the workload has a positive impact on the performance of applications. Third, a programming paradigm for networks of symmetric multiprocessor (SMP) workstations is introduced. The MPIT programming paradigm incorporates the Message Passing Interface (MPI) with threads to provide a methodology to write parallel applications that efficiently utilize the available resources and minimize the overhead. MPIT allows communication and computation to overlap by deploying a dedicated thread for communication. Furthermore, the programming paradigm implements an application-specific scheduling algorithm. The scheduling algorithm is executed by the communication thread; thus, the scheduling does not affect the execution of the parallel application. Performance results achieved with MPIT show that considerable improvements over conventional MPI applications are achieved.
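The MPIT overlap idea, a dedicated thread that handles communication so compute threads never stall on messaging, can be sketched as follows. This is not the MPIT implementation; CommThread and SendFn are invented names, and a real version would issue MPI calls where the placeholder callback runs.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Illustrative only (not the MPIT API): compute threads enqueue send
// requests and keep working; one dedicated communication thread drains the
// queue, so messaging overlaps with computation.
class CommThread {
public:
    using SendFn = std::function<void()>;   // stands in for an MPI send

    CommThread() : worker_([this] { run(); }) {}
    ~CommThread() {
        { std::lock_guard<std::mutex> l(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();
    }
    void post(SendFn s) {                   // called by compute threads
        { std::lock_guard<std::mutex> l(m_); q_.push(std::move(s)); }
        cv_.notify_one();
    }
private:
    void run() {
        std::unique_lock<std::mutex> l(m_);
        while (!done_ || !q_.empty()) {
            cv_.wait(l, [this] { return done_ || !q_.empty(); });
            while (!q_.empty()) {
                SendFn s = std::move(q_.front());
                q_.pop();
                l.unlock();
                s();                        // perform the send off the lock
                l.lock();
            }
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<SendFn> q_;
    bool done_ = false;
    std::thread worker_;
};
```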
Abstract:
This thesis will introduce a new strongly typed programming language utilizing Self types, named Win--*Foy, along with a suitable user interface designed specifically to highlight language features. The need for such a programming language is based on deficiencies found in programming languages that support both Self types and subtyping. Subtyping is a concept that is taken for granted by most software engineers programming in object-oriented languages. Subtyping supports subsumption but it does not support the inheritance of binary methods. Binary methods contain an argument of type Self, the same type as the object itself, in a contravariant position, i.e. as a parameter. There are several arguments in favour of introducing Self types into a programming language [1]. This rationale led to the development of a relation that has become known as matching [4, 5]. The matching relation does not support subsumption; however, it does support the inheritance of binary methods. Two forms of matching have been proposed [1]. Specifically, these relations are known as higher-order matching and F-bound matching. Previous research on these relations indicates that the higher-order matching relation is both reflexive and transitive whereas F-bound matching is reflexive but not transitive [7]. The higher-order matching relation provides significant flexibility regarding inheritance of methods that utilize or return values of the same type. This flexibility, in certain situations, can restrict the programmer from defining specific classes and methods which are based on constant values [21]. For this reason, the type This is used as a second reference to the type of the object that cannot, contrary to Self, be specialized in subclasses. F-bound matching allows a programmer to define a function that will work for all types A', a subtype of an upper bound function of type A, with the result type being dependent on A'. The use of parametric polymorphism in F-bound matching provides a connection to subtyping in object-oriented languages. This thesis will contain two main sections. Firstly, significant details concerning deficiencies of the subtype relation and the need to introduce higher-order and F-bound matching relations into programming languages will be explored. Secondly, a new programming language named Win--*Foy Functional Object-Oriented Programming Language has been created, along with a suitable user interface, in order to facilitate experimentation by programmers regarding the matching relation. The construction of the programming language and the user interface will be explained in detail.
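Although the thesis's language is Win--*Foy, the F-bounded pattern it discusses has a well-known C++ analogue (the curiously recurring template pattern), which may help make binary methods concrete: the base is parameterized by the derived type, so a method such as equals() takes an argument of the Self type rather than the base type. The names below are invented for the example.

```cpp
#include <iostream>

// C++ rendering of the F-bounded idea: Comparable<Self> plays the role of
// the bound F(Self), and the binary method equals() takes the Self type.
template <typename Self>
struct Comparable {
    bool differs(const Self& other) const {
        // static_cast to Self recovers the precise type of the object,
        // so equals() is inherited as a true binary method.
        return !static_cast<const Self&>(*this).equals(other);
    }
};

struct Point : Comparable<Point> {   // Point matches the bound F(Point)
    int x = 0, y = 0;
    bool equals(const Point& p) const { return x == p.x && y == p.y; }
};

int main() {
    Point a, b;
    a.x = 1; a.y = 2;
    b.x = 1; b.y = 2;
    std::cout << std::boolalpha << a.differs(b) << '\n';   // false
}
```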
Abstract:
The Robocup Rescue Simulation System (RCRSS) is a dynamic system of multi-agent interaction, simulating a large-scale urban disaster scenario. Teams of rescue agents are charged with the tasks of minimizing civilian casualties and infrastructure damage while competing against limitations on time, communication, and awareness. This thesis provides the first known attempt to apply Genetic Programming (GP) to the development of behaviours necessary to perform well in the RCRSS. Specifically, this thesis studies the suitability of GP to evolve the operational behaviours required of each type of rescue agent in the RCRSS. The system developed is evaluated in terms of the consistency with which expected solutions are the target of convergence, as well as by comparison to previous competition results. The results indicate that GP is capable of converging to some forms of expected behaviour, but that additional evolution in strategizing behaviours must be performed in order to become competitive. An enhancement to the standard GP algorithm is proposed which is shown to simplify the initial search space, allowing evolution to occur much more quickly. In addition, two forms of population are employed and compared in terms of their apparent effects on the evolution of control structures for intelligent rescue agents. The first is a single population in which each individual comprises three distinct trees for the respective control of three types of agents; the second is a set of three co-evolving subpopulations, one for each type of agent. Multiple populations of cooperating individuals appear to achieve higher proficiencies in training, but testing on unseen instances raises the issue of overfitting.