459 results for Parallelism
Abstract:
Membrane systems are computationally equivalent to Turing machines. However, their distributed and massively parallel nature yields polynomial-time solutions to problems for which traditional models offer only non-polynomial ones. It is therefore important to develop dedicated hardware and software implementations that exploit these two features of membrane systems. In distributed implementations of P systems, a communication bottleneck arises: as the number of membranes grows, the network becomes congested. The purpose of distributed architectures is to reach a compromise between the massively parallel character of the system and the time needed for an evolution step, i.e., the transition from one configuration of the system to the next, thereby solving the communication bottleneck problem. The goal of this paper is twofold. Firstly, to survey in a systematic and uniform way the main results regarding how membranes can be placed on processors in order to obtain a software/hardware simulation of P-Systems in a distributed environment. Secondly, we improve some results about the membrane dissolution problem, prove that it is connected, and discuss the possibility of simulating this property in the distributed model. All this yields an improvement in the implementation of system parallelism, since it increases the parallelism of the external communication among processors. The proposed ideas improve on previous architectures for tackling the communication bottleneck problem: they reduce the total time of an evolution step, increase the number of membranes that can run on a processor, and reduce the number of processors required.
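As a toy illustration of the trade-off these distributed architectures must resolve, the following Python sketch (our own, with an invented cost model, not taken from the paper) balances parallel rule application inside processors against serialized external communication between them:

```python
# Illustrative sketch (not from the paper): distributing M membranes over P
# processors and estimating the time of one evolution step. The cost model
# and constants are hypothetical; they only illustrate the compromise between
# parallelism (more processors) and external-communication overhead.

def evolution_step_time(membranes, processors, t_apply, t_comm):
    """Estimate one evolution step: rule application runs in parallel across
    processors, but membranes on the same processor are handled sequentially,
    and inter-processor traffic is serialized on the shared network."""
    per_proc = -(-membranes // processors)               # ceil: membranes per processor
    apply_time = per_proc * t_apply                      # sequential within a processor
    comm_time = processors * (processors - 1) * t_comm   # pairwise external traffic
    return apply_time + comm_time

if __name__ == "__main__":
    M, t_apply, t_comm = 256, 1.0, 0.1
    best = min(range(1, M + 1),
               key=lambda p: evolution_step_time(M, p, t_apply, t_comm))
    print("best processor count:", best,
          "step time:", evolution_step_time(M, best, t_apply, t_comm))
```

Under such a model, too few processors waste the system's parallelism while too many saturate the network, which is exactly the compromise the surveyed architectures target.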
Abstract:
The goal of this paper is twofold. Firstly, to survey in a systematic and uniform way the main results regarding how membranes can be placed on processors in order to obtain a software/hardware simulation of P-Systems in a distributed environment. Secondly, we improve some results about the membrane dissolution problem, prove that it is connected, and discuss the possibility of simulating this property in the distributed model. All this yields an improvement in the implementation of system parallelism, since it increases the parallelism of the external communication among processors. Moreover, the number of processors grows in a way that markedly increases the parallelism in the application of the evolution rules and in the internal communications. The proposed ideas improve on previous architectures for tackling the communication bottleneck problem: they reduce the total time of an evolution step, increase the number of membranes that can run on a processor, and reduce the number of processors required.
Abstract:
This paper outlines the problems found in the parallelization of SPH (Smoothed Particle Hydrodynamics) algorithms using Graphics Processing Units. Results of several parallel GPU implementations are shown in terms of speed-up and scalability compared to the sequential CPU codes. The most problematic stage in the GPU-SPH algorithms is the one responsible for locating neighboring particles and building the vectors where this information is stored, since these specific algorithms raise many difficulties for data-level parallelization. Because neighbor location using linked lists does not expose enough data-level parallelism, two new approaches have been proposed to minimize bank conflicts in the writing and subsequent reading of the neighbor lists. The first strategy proposes an efficient CPU-GPU coordination, using GPU algorithms for those stages that allow a straightforward parallelization and sequential CPU algorithms for those instructions that involve some kind of vector reduction. This coordination provides a relatively orderly reading of the neighbor lists in the interactions stage, achieving a speed-up factor of x47 in this stage. However, since the construction of the neighbor lists is quite expensive, the overall speed-up achieved is x41. The second strategy seeks to maximize the use of the GPU in the neighbor location process by executing a specific vector sorting algorithm that allows some data-level parallelism. Although this strategy has succeeded in improving the speed-up in the neighbor location stage, the global speed-up in the interactions stage falls, due to inefficient reading of the neighbor vectors. Some changes to these strategies are proposed, aimed at maximizing the computational load of the GPU and using the GPU texture units, in order to reach the maximum speed-up for such codes. Several practical applications have been added to these GPU codes. First, the classical dam-break problem is studied. Second, the wave impact of the sloshing fluid contained in LNG vessel tanks is simulated as a practical example of particle methods.
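For readers unfamiliar with the stage in question, here is a minimal sequential sketch (ours, not the paper's GPU code) of neighbor location with cell lists: each particle is binned into a cell of side h and only the 27 surrounding cells are searched. On a GPU, the concurrent writes to these per-cell lists are what produce the bank conflicts discussed above.

```python
# Minimal CPU sketch of linked-list/cell-list neighbor search for SPH.
# Not the paper's implementation; it only illustrates the data structure
# whose parallel construction and readout the paper optimizes.
import math
from collections import defaultdict

def find_neighbors(positions, h):
    """positions: list of (x, y, z) tuples; h: smoothing length."""
    cells = defaultdict(list)
    for i, (x, y, z) in enumerate(positions):
        cells[(int(x // h), int(y // h), int(z // h))].append(i)

    neighbors = [[] for _ in positions]
    for i, (x, y, z) in enumerate(positions):
        cx, cy, cz = int(x // h), int(y // h), int(z // h)
        for dx in (-1, 0, 1):              # scan the 27 surrounding cells
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    for j in cells.get((cx + dx, cy + dy, cz + dz), ()):
                        if j != i and math.dist(positions[i], positions[j]) < h:
                            neighbors[i].append(j)
    return neighbors
```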
Abstract:
A highly parallel and scalable Deblocking Filter (DF) hardware architecture for the H.264/AVC and SVC video codecs is presented in this paper. The proposed architecture consists mainly of a coarse-grain systolic array obtained by replicating a unique, homogeneous Functional Unit (FU), in which a whole Deblocking-Filter unit is implemented. The proposal is also based on a novel macroblock-level parallelization strategy for the filtering algorithm, which improves the final performance by exploiting specific data dependences. In this way, communication overhead is reduced and more intensive parallelism is obtained in comparison with existing state-of-the-art solutions. Furthermore, the architecture is completely flexible, since the level of parallelism can be changed according to the application requirements. The design has been implemented on a Virtex-5 FPGA, and it allows filtering 4CIF (704 × 576 pixels @ 30 fps) video sequences in real time at frequencies below 10.16 MHz.
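A common way to exploit such macroblock-level data dependences is a wavefront schedule. The sketch below is an illustration under the usual assumption that filtering an MB depends on its left and top neighbors (not necessarily the paper's exact strategy): all MBs on the same anti-diagonal are independent and can be dispatched to FUs in parallel.

```python
# Hedged sketch of a macroblock-level wavefront schedule, assuming each MB
# depends on its left and top neighbors (the common H.264 pattern).
def wavefront_schedule(mb_cols, mb_rows):
    """Return lists of (x, y) macroblocks; each list can run in parallel."""
    waves = []
    for d in range(mb_cols + mb_rows - 1):
        wave = [(x, d - x) for x in range(mb_cols) if 0 <= d - x < mb_rows]
        waves.append(wave)
    return waves

# A 4CIF frame is 44 x 36 macroblocks (704/16 x 576/16):
for step, wave in enumerate(wavefront_schedule(44, 36)):
    pass  # dispatch each MB in `wave` to a free Functional Unit (FU)
```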
Abstract:
Systems relying on fixed hardware components with a static level of parallelism can suffer from an underuse of logical resources, since they have to be designed for the worst-case scenario. This problem is especially important in video applications due to the emergence of new flexible standards, like Scalable Video Coding (SVC), which offer several levels of scalability. In this paper, Dynamic and Partial Reconfiguration (DPR) of modern FPGAs is used to achieve run-time variable parallelism, by using scalable architectures whose size can be adapted at run-time. Based on this proposal, a scalable Deblocking Filter (DF) core, compliant with the H.264/AVC and SVC standards, has been designed. This scalable DF allows run-time addition or removal of computational units working in parallel. Scalability is offered together with a scalable parallelization strategy at the macroblock (MB) level, such that when the size of the architecture changes, the MB filtering order is modified accordingly.
Abstract:
Studying independence of goals has proven very useful in the context of logic programming. In particular, it has provided a formal basis for powerful automatic parallelization tools, since independence ensures that two goals may be evaluated in parallel while preserving correctness and efficiency. We extend the concept of independence to constraint logic programs (CLP) and prove that it also ensures the correctness and efficiency of the parallel evaluation of independent goals. Independence for CLP languages is more complex than for logic programming, as search space preservation is necessary but no longer sufficient for ensuring correctness and efficiency. Two additional issues arise. The first is that the cost of constraint solving may depend upon the order in which constraints are encountered. The second is the need to handle dynamic scheduling. We clarify these issues by proposing various types of search independence and constraint solver independence, and show how they can be combined to allow different optimizations, from parallelism to intelligent backtracking. Sufficient conditions for independence which can be evaluated "a priori" at run-time are also proposed. Our study also yields new insights into independence in logic programming languages. In particular, we show that search space preservation is not only a sufficient but also a necessary condition for ensuring correctness and efficiency of parallel execution.
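For the classical logic programming case, the baseline notion the paper generalizes can be sketched as follows (a minimal illustration of strict independence, with hypothetical helper names): two goals can run in parallel if every variable they share is already bound to a ground term.

```python
# Toy sketch (not from the paper) of strict goal independence in classical
# logic programming. The paper shows why this is no longer *sufficient* for
# CLP, where solver cost can depend on the order constraints are encountered.
def strict_independent(goal1_vars, goal2_vars, bound_vars):
    """goalN_vars: variables occurring in each goal; bound_vars: variables
    already bound to ground terms at this point of the execution."""
    shared = set(goal1_vars) & set(goal2_vars)
    return shared <= set(bound_vars)   # every shared variable must be ground

# p(X, Y), q(Y, Z): independent only if Y is already ground.
print(strict_independent({"X", "Y"}, {"Y", "Z"}, bound_vars={"Y"}))  # True
print(strict_independent({"X", "Y"}, {"Y", "Z"}, bound_vars=set()))  # False
```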
Abstract:
Compilation techniques such as those portrayed by the Warren Abstract Machine (WAM) have greatly improved the speed of execution of logic programs. The research presented herein is geared towards providing additional performance to logic programs through the use of parallelism, while preserving the conventional semantics of logic languages. Two areas to which special attention is given are the preservation of sequential performance and storage efficiency, and the use of low-overhead mechanisms for controlling parallel execution. Accordingly, the techniques used for supporting parallelism are efficient extensions of those which have brought high inferencing speeds to sequential implementations. At a lower level, special attention is also given to design and simulation detail and to the architectural implications of the execution model's behavior. This paper offers an overview of the basic concepts and techniques used in the parallel design, the simulation tools used, and some of the results obtained to date.
Abstract:
Within the membrane computing research field, there are many papers about software simulations and a few about hardware implementations. In both cases, algorithms for implementing membrane systems in software and hardware are presented which try to take advantage of their massive parallelism. P-systems are parallel, non-deterministic systems which simulate the behavior of membranes when processing information. This paper presents software techniques based on the proper utilization of a computer's virtual memory. A study is included of how much virtual memory is necessary to host a membrane model. This method improves performance in terms of time.
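As a rough, hypothetical illustration of the kind of sizing study the paper performs (the figures below are ours, not the paper's), reserving a fixed virtual-memory region per membrane makes object multiplicities directly addressable without dynamic allocation during an evolution step:

```python
# Back-of-the-envelope sketch; all constants are assumptions, not the
# paper's data. With a fixed region per membrane, the counter for object o
# in membrane m lives at base + m * REGION + o * WORD.
MEMBRANES = 1024        # membranes in the model (assumed)
ALPHABET = 256          # distinct object symbols (assumed)
WORD = 4                # bytes per multiplicity counter (assumed)

region_per_membrane = ALPHABET * WORD
total_bytes = MEMBRANES * region_per_membrane
print(f"{total_bytes / 2**20:.1f} MiB of virtual memory reserved")  # 1.0 MiB
```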
Abstract:
Single-core processors have reached their maximum clock speeds; new multicore architectures provide an alternative way to tackle this issue. Designing decoding applications that run on top of these multicore platforms, and optimizing them to exploit all of the system's computational power, is crucial to obtaining the best results. Since development at the integration level of printed circuit boards is increasingly difficult to optimize due to physical constraints and the inherent increase in power consumption, the development of multiprocessor architectures is becoming the new Holy Grail. In this sense, it is crucial to develop applications that can run on the new multi-core architectures and to find distributions that maximize the potential use of the system. Today, most commercial electronic devices on the market are built around embedded systems, and these devices have recently begun to incorporate multi-core processors. Managing tasks on multiple cores/processors is not a trivial issue, and good task/actor scheduling can yield significant improvements in terms of efficiency and processor power consumption. Scheduling the data flows between the actors that implement an application aims to harness multi-core architectures for more types of applications, with an explicit expression of parallelism within the application. On the other hand, the recent development of the MPEG Reconfigurable Video Coding (RVC) standard allows the reconfiguration of video decoders. RVC is a flexible standard compatible with previously developed MPEG codecs, making it the ideal tool to integrate into new multimedia terminals for decoding video sequences. With the new versions of the Open RVC-CAL Compiler (Orcc), a static mapping of the actors that implement the functionality of the application can be performed once the application executable has been generated. This static mapping must be done for each of the different cores available on the working platform. An embedded system with a dual-core ARMv7 processor has been chosen. This platform allows us to carry out the desired tests, measure the improvement over execution on a single core, and contrast both against a PC-based multiprocessor system.
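The static mapping step described above can be pictured with the following hypothetical sketch (actor names and loads are invented, and Orcc performs the real mapping at build time, not with this code): a greedy partition of decoder actors over the two ARMv7 cores by estimated load.

```python
# Illustrative greedy partition of dataflow actors over two cores.
# Actor names and relative loads are assumptions for the example only.
actors = {"parser": 3.0, "inv_quant": 1.0, "idct": 2.5,
          "motion_comp": 3.5, "merger": 1.0}

cores = {0: [], 1: []}
load = {0: 0.0, 1: 0.0}
for name, cost in sorted(actors.items(), key=lambda a: -a[1]):
    target = min(load, key=load.get)    # assign heaviest actors first,
    cores[target].append(name)          # always to the least-loaded core
    load[target] += cost

print(cores, load)   # a balanced static mapping for the dual-core platform
```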
Abstract:
We present two concurrent semantics (i.e. semantics where concurrency is explicitly represented) for CC programs with atomic tells. One is based on simple partial orders of computation steps, while the other is based on contextual nets and is an extension of a previous one for eventual CC programs. Both semantics allow us to derive concurrency, dependency, and nondeterminism information for the considered languages. We prove some properties about the relation between the two semantics, and also about the relation between them and the operational semantics. Moreover, we discuss how to use the contextual net semantics in the context of CLP programs. More precisely, by interpreting concurrency as possible parallelism, our semantics can be useful for a safe parallelization of some CLP computation steps. Dually, the dependency information may also be interpreted as necessary sequentialization, thus possibly exploiting it for the task of scheduling CC programs. Moreover, our semantics is also suitable for CC programs with a new kind of atomic tell (called locally atomic tell), which checks for consistency only the constraints it depends on. Such a tell achieves a reasonable trade-off between efficiency and atomicity, since the checked constraints can be stored in a local memory and are thus easily accessible even in a distributed implementation.
Abstract:
We propose a number of challenges for future constraint programming systems, including improvements in implementation technology (using global-analysis-based optimization and parallelism), debugging facilities, and the extension of the application domain to distributed, global programming. We also briefly discuss how we are exploring techniques to meet these challenges in the context of the development of the CIAO constraint logic programming system.
Abstract:
We present the design and implementation of the and-parallel component of ACE. ACE is a computational model for the full Prolog language that simultaneously exploits both or-parallelism and independent and-parallelism. A high-performance implementation of the ACE model has been realized, and its performance is reported in this paper. We discuss how some of the standard problems which appear when implementing and-parallel systems are solved in ACE. We then propose a number of optimizations aimed at reducing the overheads and the increased memory consumption which occur in such systems when using previously proposed solutions. Finally, we present results from an implementation of ACE which includes the proposed optimizations. The results show that ACE exploits and-parallelism with high efficiency and high speedups. Furthermore, they also show that the proposed optimizations, which are applicable to many other and-parallel systems, significantly decrease memory consumption and increase speedups and absolute performance, both in forward execution and during backtracking.
Abstract:
We argue that in order to exploit both Independent And-parallelism and Or-parallelism in Prolog programs, there is an advantage in recomputing some of the independent goals, as opposed to reusing all of their solutions. We present an abstract model, called the Composition-Tree, for representing and-or parallelism in Prolog programs. The Composition-Tree closely mirrors sequential Prolog execution by recomputing some independent goals rather than fully reusing them. We also outline two environment representation techniques for And-Or parallel execution of full Prolog based on the Composition-Tree model abstraction. We argue that these techniques have advantages over earlier proposals for exploiting and-or parallelism in Prolog.
Abstract:
This paper addresses the issue of the practicality of global flow analysis in logic program compilation, in terms of speed of the analysis, precision, and usefulness of the information obtained. To this end, design and implementation aspects are discussed for two practical abstract interpretation-based flow analysis systems: MA3, the MCC And-parallel Analyzer and Annotator, and Ms, an experimental mode inference system developed for SB-Prolog. The paper also provides performance data obtained from these implementations and, as an example of an application, a study of the usefulness of the mode information obtained in reducing run-time checks in independent and-parallelism. Based on the results obtained, it is concluded that the overhead of global flow analysis is not prohibitive, while the results of analysis can be quite precise and useful.
Abstract:
Concurrency in Logic Programming has received much attention in the past. One problem with many proposals, when applied to Prolog, is that they involve large modifications to the standard implementations, and/or the communication and synchronization facilities provided do not fit as naturally within the language model as we feel is possible. In this paper we propose a new mechanism for implementing synchronization and communication for concurrency, based on atomic accesses to designated facts in the (shared) database. We argue that this model is comparatively easy to implement and harmonizes better than previous proposals with the Prolog control model and standard set of built-ins. We show how, in the proposed model, it is easy to express classical concurrency algorithms and to subsume other mechanisms such as Linda, variable-based communication, or classical parallelism-oriented primitives. We also report on an implementation of the model and provide performance and resource consumption data.
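To make the idea concrete, here is an illustrative sketch in Python (the paper's mechanism is a Prolog extension built on its standard built-ins, not this code): synchronization expressed as atomic assert/consume operations on designated facts in a shared store.

```python
# Illustrative sketch only: Linda-like synchronization through atomic
# accesses to facts in a shared database. Method names echo Prolog's
# assertz/retract but the code is a hypothetical Python analogue.
import threading

class FactBase:
    def __init__(self):
        self._facts = set()
        self._cond = threading.Condition()

    def assertz(self, fact):            # atomically add a fact
        with self._cond:
            self._facts.add(fact)
            self._cond.notify_all()

    def retract_when(self, fact):       # block until fact holds, then consume it
        with self._cond:
            self._cond.wait_for(lambda: fact in self._facts)
            self._facts.remove(fact)

# Producer/consumer synchronization via the shared fact 'ready':
db = FactBase()
threading.Thread(target=lambda: db.assertz("ready")).start()
db.retract_when("ready")   # the consumer proceeds once 'ready' is asserted
```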