79 resultados para parallel processor
em Universidad Politécnica de Madrid
Resumo:
We propose a computational methodology -"B-LOG"-, which offers the potential for an effective implementation of Logic Programming in a parallel computer. We also propose a weighting scheme to guide the search process through the graph and we apply the concepts of parallel "branch and bound" algorithms in order to perform a "best-first" search using an information theoretic bound. The concept of "session" is used to speed up the search process in a succession of similar queries. Within a session, we strongly modify the bounds in a local database, while bounds kept in a global database are weakly modified to provide a better initial condition for other sessions. We also propose an implementation scheme based on a database machine using "semantic paging", and the "B-LOG processor" based on a scoreboard driven controller.
Resumo:
We show a method for parallelizing top down dynamic programs in a straightforward way by a careful choice of a lock-free shared hash table implementation and randomization of the order in which the dynamic program computes its subproblems. This generic approach is applied to dynamic programs for knapsack, shortest paths, and RNA structure alignment, as well as to a state-of-the-art solution for minimizing the máximum number of open stacks. Experimental results are provided on three different modern multicore architectures which show that this parallelization is effective and reasonably scalable. In particular, we obtain over 10 times speedup for 32 threads on the open stacks problem.
Resumo:
Since the early days of logic programming, researchers in the field realized the potential for exploitation of parallelism present in the execution of logic programs. Their high-level nature, the presence of nondeterminism, and their referential transparency, among other characteristics, make logic programs interesting candidates for obtaining speedups through parallel execution. At the same time, the fact that the typical applications of logic programming frequently involve irregular computations, make heavy use of dynamic data structures with logical variables, and involve search and speculation, makes the techniques used in the corresponding parallelizing compilers and run-time systems potentially interesting even outside the field. The objective of this article is to provide a comprehensive survey of the issues arising in parallel execution of logic programming languages along with the most relevant approaches explored to date in the field. Focus is mostly given to the challenges emerging from the parallel execution of Prolog programs. The article describes the major techniques used for shared memory implementation of Or-parallelism, And-parallelism, and combinations of the two. We also explore some related issues, such as memory management, compile-time analysis, and execution visualization.
Resumo:
Membrane systems are computational equivalent to Turing machines. However, their distributed and massively parallel nature obtains polynomial solutions opposite to traditional non-polynomial ones. At this point, it is very important to develop dedicated hardware and software implementations exploiting those two membrane systems features. Dealing with distributed implementations of P systems, the bottleneck communication problem has arisen. When the number of membranes grows up, the network gets congested. The purpose of distributed architectures is to reach a compromise between the massively parallel character of the system and the needed evolution step time to transit from one configuration of the system to the next one, solving the bottleneck communication problem. The goal of this paper is twofold. Firstly, to survey in a systematic and uniform way the main results regarding the way membranes can be placed on processors in order to get a software/hardware simulation of P-Systems in a distributed environment. Secondly, we improve some results about the membrane dissolution problem, prove that it is connected, and discuss the possibility of simulating this property in the distributed model. All this yields an improvement in the system parallelism implementation since it gets an increment of the parallelism of the external communication among processors. Proposed ideas improve previous architectures to tackle the communication bottleneck problem, such as reduction of the total time of an evolution step, increase of the number of membranes that could run on a processor and reduction of the number of processors.
Resumo:
This paper describes new improvements for BB-MaxClique (San Segundo et al. in Comput Oper Resour 38(2):571–581, 2011 ), a leading maximum clique algorithm which uses bit strings to efficiently compute basic operations during search by bit masking. Improvements include a recently described recoloring strategy in Tomita et al. (Proceedings of the 4th International Workshop on Algorithms and Computation. Lecture Notes in Computer Science, vol 5942. Springer, Berlin, pp 191–203, 2010 ), which is now integrated in the bit string framework, as well as different optimization strategies for fast bit scanning. Reported results over DIMACS and random graphs show that the new variants improve over previous BB-MaxClique for a vast majority of cases. It is also established that recoloring is mainly useful for graphs with high densities.
Resumo:
We have developed a new projector model specifically tailored for fast list-mode tomographic reconstructions in Positron emission tomography (PET) scanners with parallel planar detectors. The model provides an accurate estimation of the probability distribution of coincidence events defined by pairs of scintillating crystals. This distribution is parameterized with 2D elliptical Gaussian functions defined in planes perpendicular to the main axis of the tube of response (TOR). The parameters of these Gaussian functions have been obtained by fitting Monte Carlo simulations that include positron range, acolinearity of gamma rays, as well as detector attenuation and scatter effects. The proposed model has been applied efficiently to list-mode reconstruction algorithms. Evaluation with Monte Carlo simulations over a rotating high resolution PET scanner indicates that this model allows to obtain better recovery to noise ratio in OSEM (ordered-subsets, expectation-maximization) reconstruction, if compared to list-mode reconstruction with symmetric circular Gaussian TOR model, and histogram-based OSEM with precalculated system matrix using Monte Carlo simulated models and symmetries.
Resumo:
The manipulation and handling of an ever increasing volume of data by current data-intensive applications require novel techniques for e?cient data management. Despite recent advances in every aspect of data management (storage, access, querying, analysis, mining), future applications are expected to scale to even higher degrees, not only in terms of volumes of data handled but also in terms of users and resources, often making use of multiple, pre-existing autonomous, distributed or heterogeneous resources.
Resumo:
We developed a new FPGA-based method for coincidence detection in positronemissiontomography. The method requires low device resources and no specific peripherals in order to resolve coincident digital pulses within a time window of a few nanoseconds. This method has been validated with a low-end Xilinx Spartan-3E and provided coincidence resolutions lower than 6 ns. This resolution depends directly on the signal propagation properties of the target device and the maximum available clock frequency, therefore it is expected to improve considerably on higher-end FPGAs.
Resumo:
Zernike polynomials are a well known set of functions that find many applications in image or pattern characterization because they allow to construct shape descriptors that are invariant against translations, rotations or scale changes. The concepts behind them can be extended to higher dimension spaces, making them also fit to describe volumetric data. They have been less used than their properties might suggest due to their high computational cost. We present a parallel implementation of 3D Zernike moments analysis, written in C with CUDA extensions, which makes it practical to employ Zernike descriptors in interactive applications, yielding a performance of several frames per second in voxel datasets about 2003 in size. In our contribution, we describe the challenges of implementing 3D Zernike analysis in a general-purpose GPU. These include how to deal with numerical inaccuracies, due to the high precision demands of the algorithm, or how to deal with the high volume of input data so that it does not become a bottleneck for the system.
Resumo:
This paper outlines the problems found in the parallelization of SPH (Smoothed Particle Hydrodynamics) algorithms using Graphics Processing Units. Different results of some parallel GPU implementations in terms of the speed-up and the scalability compared to the CPU sequential codes are shown. The most problematic stage in the GPU-SPH algorithms is the one responsible for locating neighboring particles and building the vectors where this information is stored, since these specific algorithms raise many dificulties for a data-level parallelization. Because of the fact that the neighbor location using linked lists does not show enough data-level parallelism, two new approaches have been pro- posed to minimize bank conflicts in the writing and subsequent reading of the neighbor lists. The first strategy proposes an efficient coordination between CPU-GPU, using GPU algorithms for those stages that allow a straight forward parallelization, and sequential CPU algorithms for those instructions that involve some kind of vector reduction. This coordination provides a relatively orderly reading of the neighbor lists in the interactions stage, achieving a speed-up factor of x47 in this stage. However, since the construction of the neighbor lists is quite expensive, it is achieved an overall speed-up of x41. The second strategy seeks to maximize the use of the GPU in the neighbor's location process by executing a specific vector sorting algorithm that allows some data-level parallelism. Al- though this strategy has succeeded in improving the speed-up on the stage of neighboring location, the global speed-up on the interactions stage falls, due to inefficient reading of the neighbor vectors. Some changes to these strategies are proposed, aimed at maximizing the computational load of the GPU and using the GPU texture-units, in order to reach the maximum speed-up for such codes. Different practical applications have been added to the mentioned GPU codes. First, the classical dam-break problem is studied. Second, the wave impact of the sloshing fluid contained in LNG vessel tanks is also simulated as a practical example of particle methods
Resumo:
This article describes a new visual servo control and strategies that are used to carry out dynamic tasks by the Robotenis platform. This platform is basically a parallel robot that is equipped with an acquisition and processing system of visual information, its main feature is that it has a completely open architecture control, and planned in order to design, implement, test and compare control strategies and algorithms (visual and actuated joint controllers). Following sections describe a new visual control strategy specially designed to track and intercept objects in 3D space. The results are compared with a controller shown in previous woks, where the end effector of the robot keeps a constant distance from the tracked object. In this work, the controller is specially designed in order to allow changes in the tracking reference. Changes in the tracking reference can be used to grip an object that is under movement, or as in this case, hitting a hanging Ping-Pong ball. Lyapunov stability is taken into account in the controller design.
Resumo:
The main purpose of robot calibration is the correction of the possible errors in the robot parameters. This paper presents a method for a kinematic calibration of a parallel robot that is equipped with one camera in hand. In order to preserve the mechanical configuration of the robot, the camera is utilized to acquire incremental positions of the end effector from a spherical object that is fixed in the word reference frame. The positions of the end effector are related to incremental positions of resolvers of the motors of the robot, and a kinematic model of the robot is used to find a new group of parameters which minimizes errors in the kinematic equations. Additionally, properties of the spherical object and intrinsic camera parameters are utilized to model the projection of the object in the image and improving spatial measurements. Finally, the robotic system is designed to carry out tracking tasks and the calibration of the robot is validated by means of integrating the errors of the visual controller.
Resumo:
Rms voltage regulation may be an attractive possibility for controlling power inverters. Combined with a Hall Effect sensor for current control, it keeps its parallel operation capability while increasing its noise immunity, which may lead to a reduction of the Total Harmonic Distortion (THD). Besides, as voltage regulation is designed in DC, a simple PI regulator can provide accurate voltage tracking. Nevertheless, this approach does not lack drawbacks. Its narrow voltage bandwidth makes transients last longer and it increases the voltage THD when feeding non-linear loads, such as rectifying stages. On the other hand, the implementation can fall into offset voltage error. Furthermore, the information of the output voltage phase is hidden for the control as well, making the synchronization of a 3-phase setup not trivial. This paper explains the concept, design and implementation of the whole control scheme, in an on board inverter able to run in parallel and within a 3-phase setup. Special attention is paid to solve the problems foreseen at implementation level: a third analog loop accounts for the offset level is added and a digital algorithm guarantees 3-phase voltage synchronization.
Resumo:
In this paper we will see how the efficiency of the MBS simulations can be improved in two different ways, by considering both an explicit and implicit semi-recursive formulation. The explicit method is based on a double velocity transformation that involves the solution of a redundant but compatible system of equations. The high computational cost of this operation has been drastically reduced by taking into account the sparsity pattern of the system. Regarding this, the goal of this method is the introduction of MA48, a high performance mathematical library provided by Harwell Subroutine Library. The second method proposed in this paper has the particularity that, depending on the case, between 70 and 85% of the computation time is devoted to the evaluation of forces derivatives with respect to the relative position and velocity vectors. Keeping in mind that evaluating these derivatives can be decomposed into concurrent tasks, the main goal of this paper lies on a successful and straightforward parallel implementation that have led to a substantial improvement with a speedup of 3.2 by keeping all the cores busy in a quad-core processor and distributing the workload between them, achieving on this way a huge time reduction by doing an ideal CPU usage
Resumo:
This paper presents a low-power, high-speed 4-data-path 128-point mixed-radix (radix-2 & radix-2 2 ) FFT processor for MB-OFDM Ultra-WideBand (UWB) systems. The processor employs the single-path delay feedback (SDF) pipelined structure for the proposed algorithm, it uses substructure-sharing multiplication units and shift-add structure other than traditional complex multipliers. Furthermore, the word lengths are properly chosen, thus the hardware costs and power consumption of the proposed FFT processor are efficiently reduced. The proposed FFT processor is verified and synthesized by using 0.13 µm CMOS technology with a supply voltage of 1.32 V. The implementation results indicate that the proposed 128-point mixed-radix FFT architecture supports a throughput rate of 1Gsample/s with lower power consumption in comparison to existing 128-point FFT architectures