74 resultados para Loops parallelization
Resumo:
This paper focuses on the parallelization of an ocean model applying current multicore processor-based cluster architectures to an irregular computational mesh. The aim is to maximize the efficiency of the computational resources used. To make the best use of the resources offered by these architectures, this parallelization has been addressed at all the hardware levels of modern supercomputers: firstly, exploiting the internal parallelism of the CPU through vectorization; secondly, taking advantage of the multiple cores of each node using OpenMP; and finally, using the cluster nodes to distribute the computational mesh, using MPI for communication within the nodes. The speedup obtained with each parallelization technique as well as the combined overall speedup have been measured for the western Mediterranean Sea for different cluster configurations, achieving a speedup factor of 73.3 using 256 processors. The results also show the efficiency achieved in the different cluster nodes and the advantages obtained by combining OpenMP and MPI versus using only OpenMP or MPI. Finally, the scalability of the model has been analysed by examining computation and communication times as well as the communication and synchronization overhead due to parallelization.
Resumo:
We show a method for parallelizing top down dynamic programs in a straightforward way by a careful choice of a lock-free shared hash table implementation and randomization of the order in which the dynamic program computes its subproblems. This generic approach is applied to dynamic programs for knapsack, shortest paths, and RNA structure alignment, as well as to a state-of-the-art solution for minimizing the máximum number of open stacks. Experimental results are provided on three different modern multicore architectures which show that this parallelization is effective and reasonably scalable. In particular, we obtain over 10 times speedup for 32 threads on the open stacks problem.
Resumo:
The technique of Abstract Interpretation has allowed the development of very sophisticated global program analyses which are at the same time provably correct and practical. We present in a tutorial fashion a novel program development framework which uses abstract interpretation as a fundamental tool. The framework uses modular, incremental abstract interpretation to obtain information about the program. This information is used to validate programs, to detect bugs with respect to partial specifications written using assertions (in the program itself and/or in system libraries), to generate and simplify run-time tests, and to perform high-level program transformations such as multiple abstract specialization, parallelization, and resource usage control, all in a provably correct way. In the case of validation and debugging, the assertions can refer to a variety of program points such as procedure entry, procedure exit, points within procedures, or global computations. The system can reason with much richer information than, for example, traditional types. This includes data structure shape (including pointer sharing), bounds on data structure sizes, and other operational variable instantiation properties, as well as procedure-level properties such as determinacy, termination, nonfailure, and bounds on resource consumption (time or space cost). CiaoPP, the preprocessor of the Ciao multi-paradigm programming system, which implements the described functionality, will be used to illustrate the fundamental ideas.
Resumo:
This paper outlines the problems found in the parallelization of SPH (Smoothed Particle Hydrodynamics) algorithms using Graphics Processing Units. Different results of some parallel GPU implementations in terms of the speed-up and the scalability compared to the CPU sequential codes are shown. The most problematic stage in the GPU-SPH algorithms is the one responsible for locating neighboring particles and building the vectors where this information is stored, since these specific algorithms raise many dificulties for a data-level parallelization. Because of the fact that the neighbor location using linked lists does not show enough data-level parallelism, two new approaches have been pro- posed to minimize bank conflicts in the writing and subsequent reading of the neighbor lists. The first strategy proposes an efficient coordination between CPU-GPU, using GPU algorithms for those stages that allow a straight forward parallelization, and sequential CPU algorithms for those instructions that involve some kind of vector reduction. This coordination provides a relatively orderly reading of the neighbor lists in the interactions stage, achieving a speed-up factor of x47 in this stage. However, since the construction of the neighbor lists is quite expensive, it is achieved an overall speed-up of x41. The second strategy seeks to maximize the use of the GPU in the neighbor's location process by executing a specific vector sorting algorithm that allows some data-level parallelism. Al- though this strategy has succeeded in improving the speed-up on the stage of neighboring location, the global speed-up on the interactions stage falls, due to inefficient reading of the neighbor vectors. Some changes to these strategies are proposed, aimed at maximizing the computational load of the GPU and using the GPU texture-units, in order to reach the maximum speed-up for such codes. Different practical applications have been added to the mentioned GPU codes. First, the classical dam-break problem is studied. Second, the wave impact of the sloshing fluid contained in LNG vessel tanks is also simulated as a practical example of particle methods
Resumo:
A highly parallel and scalable Deblocking Filter (DF) hardware architecture for H.264/AVC and SVC video codecs is presented in this paper. The proposed architecture mainly consists on a coarse grain systolic array obtained by replicating a unique and homogeneous Functional Unit (FU), in which a whole Deblocking-Filter unit is implemented. The proposal is also based on a novel macroblock-level parallelization strategy of the filtering algorithm which improves the final performance by exploiting specific data dependences. This way communication overhead is reduced and a more intensive parallelism in comparison with the existing state-of-the-art solutions is obtained. Furthermore, the architecture is completely flexible, since the level of parallelism can be changed, according to the application requirements. The design has been implemented in a Virtex-5 FPGA, and it allows filtering 4CIF (704 × 576 pixels @30 fps) video sequences in real-time at frequencies lower than 10.16 Mhz.
Resumo:
Irregular computations pose sorne of the most interesting and challenging problems in automatic parallelization. Irregularity appears in certain kinds of numerical problems and is pervasive in symbolic applications. Such computations often use dynamic data structures, which make heavy use of pointers. This complicates all the steps of a parallelizing compiler, from independence detection to task partitioning and placement. Starting in the mid 80s there has been significant progress in the development of parallelizing compilers for logic programming (and more recently, constraint programming) resulting in quite capable parallelizers. The typical applications of these paradigms frequently involve irregular computations, and make heavy use of dynamic data structures with pointers, since logical variables represent in practice a well-behaved form of pointers. This arguably makes the techniques used in these compilers potentially interesting. In this paper, we introduce in a tutoríal way, sorne of the problems faced by parallelizing compilers for logic and constraint programs and provide pointers to sorne of the significant progress made in the area. In particular, this work has resulted in a series of achievements in the areas of inter-procedural pointer aliasing analysis for independence detection, cost models and cost analysis, cactus-stack memory management, techniques for managing speculative and irregular computations through task granularity control and dynamic task allocation such as work-stealing schedulers), etc.
Resumo:
Systems relying on fixed hardware components with a static level of parallelism can suffer from an underuse of logical resources, since they have to be designed for the worst-case scenario. This problem is especially important in video applications due to the emergence of new flexible standards, like Scalable Video Coding (SVC), which offer several levels of scalability. In this paper, Dynamic and Partial Reconfiguration (DPR) of modern FPGAs is used to achieve run-time variable parallelism, by using scalable architectures where the size can be adapted at run-time. Based on this proposal, a scalable Deblocking Filter core (DF), compliant with the H.264/AVC and SVC standards has been designed. This scalable DF allows run-time addition or removal of computational units working in parallel. Scalability is offered together with a scalable parallelization strategy at the macroblock (MB) level, such that when the size of the architecture changes, MB filtering order is modified accordingly
Resumo:
In this paper we present a scalable software architecture for on-line multi-camera video processing, that guarantees a good trade off between computational power, scalability and flexibility. The software system is modular and its main blocks are the Processing Units (PUs), and the Central Unit. The Central Unit works as a supervisor of the running PUs and each PU manages the acquisition phase and the processing phase. Furthermore, an approach to easily parallelize the desired processing application has been presented. In this paper, as case study, we apply the proposed software architecture to a multi-camera system in order to efficiently manage multiple 2D object detection modules in a real-time scenario. System performance has been evaluated under different load conditions such as number of cameras and image sizes. The results show that the software architecture scales well with the number of camera and can easily works with different image formats respecting the real time constraints. Moreover, the parallelization approach can be used in order to speed up the processing tasks with a low level of overhead
Resumo:
Studying independence of goals has proven very useful in the context of logic programming. In particular, it has provided a formal basis for powerful automatic parallelization tools, since independence ensures that two goals may be evaluated in parallel while preserving correctness and eciency. We extend the concept of independence to constraint logic programs (CLP) and prove that it also ensures the correctness and eciency of the parallel evaluation of independent goals. Independence for CLP languages is more complex than for logic programming as search space preservation is necessary but no longer sucient for ensuring correctness and eciency. Two additional issues arise. The rst is that the cost of constraint solving may depend upon the order constraints are encountered. The second is the need to handle dynamic scheduling. We clarify these issues by proposing various types of search independence and constraint solver independence, and show how they can be combined to allow dierent optimizations, from parallelism to intelligent backtracking. Sucient conditions for independence which can be evaluated \a priori" at run-time are also proposed. Our study also yields new insights into independence in logic programming languages. In particular, we show that search space preservation is not only a sucient but also a necessary condition for ensuring correctness and eciency of parallel execution.
Resumo:
The properties of data and activities in business processes can be used to greatly facilítate several relevant tasks performed at design- and run-time, such as fragmentation, compliance checking, or top-down design. Business processes are often described using workflows. We present an approach for mechanically inferring business domain-specific attributes of workflow components (including data Ítems, activities, and elements of sub-workflows), taking as starting point known attributes of workflow inputs and the structure of the workflow. We achieve this by modeling these components as concepts and applying sharing analysis to a Horn clause-based representation of the workflow. The analysis is applicable to workflows featuring complex control and data dependencies, embedded control constructs, such as loops and branches, and embedded component services.
Resumo:
We present two concurrent semantics (i.e. semantics where concurrency is explicitely represented) for CC programs with atomic tells. One is based on simple partial orders of computation steps, while the other one is based on contextual nets and it is an extensión of a previous one for eventual CC programs. Both such semantics allow us to derive concurrency, dependency, and nondeterminism information for the considered languages. We prove some properties about the relation between the two semantics, and also about the relation between them and the operational semantics. Moreover, we discuss how to use the contextual net semantics in the context of CLP programs. More precisely, by interpreting concurrency as possible parallelism, our semantics can be useful for a safe parallelization of some CLP computation steps. Dually, the dependency information may also be interpreted as necessary sequentialization, thus possibly exploiting it for the task of scheduling CC programs. Moreover, our semantics is also suitable for CC programs with a new kind of atomic tell (called locally atomic tell), which checks for consistency only the constraints it depends on. Such a tell achieves a reasonable trade-off between efficiency and atomicity, since the checked constraints can be stored in a local memory and are thus easily accessible even in a distributed implementation.
Resumo:
In recent years, applications in domains such as telecommunications, network security or large scale sensor networks showed the limits of the traditional store-then-process paradigm. In this context, Stream Processing Engines emerged as a candidate solution for all these applications demanding for high processing capacity with low processing latency guarantees. With Stream Processing Engines, data streams are not persisted but rather processed on the fly, producing results continuously. Current Stream Processing Engines, either centralized or distributed, do not scale with the input load due to single-node bottlenecks. Moreover, they are based on static configurations that lead to either under or over-provisioning. This Ph.D. thesis discusses StreamCloud, an elastic paralleldistributed stream processing engine that enables for processing of large data stream volumes. Stream- Cloud minimizes the distribution and parallelization overhead introducing novel techniques that split queries into parallel subqueries and allocate them to independent sets of nodes. Moreover, Stream- Cloud elastic and dynamic load balancing protocols enable for effective adjustment of resources depending on the incoming load. Together with the parallelization and elasticity techniques, Stream- Cloud defines a novel fault tolerance protocol that introduces minimal overhead while providing fast recovery. StreamCloud has been fully implemented and evaluated using several real word applications such as fraud detection applications or network analysis applications. The evaluation, conducted using a cluster with more than 300 cores, demonstrates the large scalability, the elasticity and fault tolerance effectiveness of StreamCloud. Resumen En los útimos años, aplicaciones en dominios tales como telecomunicaciones, seguridad de redes y redes de sensores de gran escala se han encontrado con múltiples limitaciones en el paradigma tradicional de bases de datos. En este contexto, los sistemas de procesamiento de flujos de datos han emergido como solución a estas aplicaciones que demandan una alta capacidad de procesamiento con una baja latencia. En los sistemas de procesamiento de flujos de datos, los datos no se persisten y luego se procesan, en su lugar los datos son procesados al vuelo en memoria produciendo resultados de forma continua. Los actuales sistemas de procesamiento de flujos de datos, tanto los centralizados, como los distribuidos, no escalan respecto a la carga de entrada del sistema debido a un cuello de botella producido por la concentración de flujos de datos completos en nodos individuales. Por otra parte, éstos están basados en configuraciones estáticas lo que conducen a un sobre o bajo aprovisionamiento. Esta tesis doctoral presenta StreamCloud, un sistema elástico paralelo-distribuido para el procesamiento de flujos de datos que es capaz de procesar grandes volúmenes de datos. StreamCloud minimiza el coste de distribución y paralelización por medio de una técnica novedosa la cual particiona las queries en subqueries paralelas repartiéndolas en subconjuntos de nodos independientes. Ademas, Stream- Cloud posee protocolos de elasticidad y equilibrado de carga que permiten una optimización de los recursos dependiendo de la carga del sistema. Unidos a los protocolos de paralelización y elasticidad, StreamCloud define un protocolo de tolerancia a fallos que introduce un coste mínimo mientras que proporciona una rápida recuperación. StreamCloud ha sido implementado y evaluado mediante varias aplicaciones del mundo real tales como aplicaciones de detección de fraude o aplicaciones de análisis del tráfico de red. La evaluación ha sido realizada en un cluster con más de 300 núcleos, demostrando la alta escalabilidad y la efectividad tanto de la elasticidad, como de la tolerancia a fallos de StreamCloud.
Resumo:
The &-Prolog system, a practical implementation of a parallel execution niodel for Prolog exploiting strict and non-strict independent and-parallelism, is described. Both automatic and manual parallelization of programs is supported. This description includes a summary of the system's language and architecture, some details of its execution model (based on the RAP-WAM model), and data on its performance on sequential workstations and shared memory multiprocessors, which is compared to that of current Prolog systems. The results to date show significant speed advantages over state-of-the-art sequential systems.
Resumo:
The Set-Sharing domain has been widely used to infer at compiletime interesting properties of logic programs such as occurs-check reduction, automatic parallelization, and flnite-tree analysis. However, performing abstract uniflcation in this domain requires a closure operation that increases the number of sharing groups exponentially. Much attention has been given to mitigating this key inefflciency in this otherwise very useful domain. In this paper we present a novel approach to Set-Sharing: we define a new representation that leverages the complement (or negative) sharing relationships of the original sharing set, without loss of accuracy. Intuitively, given an abstract state sh\> over the finite set of variables of interest V, its negative representation is p(V) \ shy. Using this encoding during analysis dramatically reduces the number of elements that need to be represented in the abstract states and during abstract uniflcation as the cardinality of the original set grows toward 2 . To further compress the number of elements, we express the set-sharing relationships through a set of ternary strings that compacts the representation by eliminating redundancies among the sharing sets. Our experiments show that our approach can compress the number of relationships, reducing signiflcantly the memory usage and running time of all abstract operations, including abstract uniflcation.
Resumo:
Precise modeling of the program heap is fundamental for understanding the behavior of a program, and is thus of signiflcant interest for many optimization applications. One of the fundamental properties of the heap that can be used in a range of optimization techniques is the sharing relationships between the elements in an array or collection. If an analysis can determine that the memory locations pointed to by different entries of an array (or collection) are disjoint, then in many cases loops that traverse the array can be vectorized or transformed into a thread-parallel versión. This paper introduces several novel sharing properties over the concrete heap and corresponding abstractions to represent them. In conjunction with an existing shape analysis technique, these abstractions allow us to precisely resolve the sharing relations in a wide range of heap structures (arrays, collections, recursive data structures, composite heap structures) in a computationally efflcient manner. The effectiveness of the approach is evaluated on a set of challenge problems from the JOlden and SPECjvm98 suites. Sharing information obtained from the analysis is used to achieve substantial thread-level parallel speedups.