880 resultados para Parallel processing (Electronic computer)
Resumo:
Java is becoming an increasingly popular language for developing distributed and parallel scientific and engineering applications. Jini is a Java-based infrastructure developed by Sun that can allegedly provide all the services necessary to support distributed applications. It is the aim of this paper to explore and investigate the services and properties that Jini actually provides and match these against the needs of high performance distributed and parallel applications written in Java. The motivation for this work is the need to develop a distributed infrastructure to support an MPI-like interface to Java known as MPJ. In the first part of the paper we discuss the needs of MPJ, the parallel environment that we wish to support. In particular we look at aspects such as reliability and ease of use. We then move on to sketch out the Jini architecture and review the components and services that Jini provides. In the third part of the paper we critically explore a Jini infrastructure that could be used to support MPJ. Here we are particularly concerned with Jini's ability to support reliably a cocoon of MPJ processes executing in a heterogeneous envirnoment. In the final part of the paper we summarise our findings and report on future work being undertaken on Jini and MPJ.
Resumo:
This paper presents the use of a multiprocessor architecture for the performance improvement of tomographic image reconstruction. Image reconstruction in computed tomography (CT) is an intensive task for single-processor systems. We investigate the filtered image reconstruction suitability based on DSPs organized for parallel processing and its comparison with the Message Passing Interface (MPI) library. The experimental results show that the speedups observed for both platforms were increased in the same direction of the image resolution. In addition, the execution time to communication time ratios (Rt/Rc) as a function of the sample size have shown a narrow variation for the DSP platform in comparison with the MPI platform, which indicates its better performance for parallel image reconstruction.
Resumo:
In large distributed systems, where shared resources are owned by distinct entities, there is a need to reflect resource ownership in resource allocation. An appropriate resource management system should guarantee that resource's owners have access to a share of resources proportional to the share they provide. In order to achieve that some policies can be used for revoking access to resources currently used by other users. In this paper, a scheduling policy based in the concept of distributed ownership is introduced called Owner Share Enforcement Policy (OSEP). OSEP goal is to guarantee that owner do not have their jobs postponed for longer periods of time. We evaluate the results achieved with the application of this policy using metrics that describe policy violation, loss of capacity, policy cost and user satisfaction in environments with and without job checkpointing. We also evaluate and compare the OSEP policy with the Fair-Share policy, and from these results it is possible to capture the trade-offs from different ways to achieve fairness based on the user satisfaction. © 2009 IEEE.
Resumo:
Software Transactional Memory (STM) systems have poor performance under high contention scenarios. Since many transactions compete for the same data, most of them are aborted, wasting processor runtime. Contention management policies are typically used to avoid that, but they are passive approaches as they wait for an abort to happen so they can take action. More proactive approaches have emerged, trying to predict when a transaction is likely to abort so its execution can be delayed. Such techniques are limited, as they do not replace the doomed transaction by another or, when they do, they rely on the operating system for that, having little or no control on which transaction should run. In this paper we propose LUTS, a Lightweight User-Level Transaction Scheduler, which is based on an execution context record mechanism. Unlike other techniques, LUTS provides the means for selecting another transaction to run in parallel, thus improving system throughput. Moreover, it avoids most of the issues caused by pseudo parallelism, as it only launches as many system-level threads as the number of available processor cores. We discuss LUTS design and present three conflict-avoidance heuristics built around LUTS scheduling capabilities. Experimental results, conducted with STMBench7 and STAMP benchmark suites, show LUTS efficiency when running high contention applications and how conflict-avoidance heuristics can improve STM performance even more. In fact, our transaction scheduling techniques are capable of improving program performance even in overloaded scenarios. © 2011 Springer-Verlag.
Resumo:
Transactional memory (TM) is a new synchronization mechanism devised to simplify parallel programming, thereby helping programmers to unleash the power of current multicore processors. Although software implementations of TM (STM) have been extensively analyzed in terms of runtime performance, little attention has been paid to an equally important constraint faced by nearly all computer systems: energy consumption. In this work we conduct a comprehensive study of energy and runtime tradeoff sin software transactional memory systems. We characterize the behavior of three state-of-the-art lock-based STM algorithms, along with three different conflict resolution schemes. As a result of this characterization, we propose a DVFS-based technique that can be integrated into the resolution policies so as to improve the energy-delay product (EDP). Experimental results show that our DVFS-enhanced policies are indeed beneficial for applications with high contention levels. Improvements of up to 59% in EDP can be observed in this scenario, with an average EDP reduction of 16% across the STAMP workloads. © 2012 IEEE.
Resumo:
Research on the micro-structural characterization of metal-matrix composites uses X-ray computed tomography to collect information about the interior features of the samples, in order to elucidate their exhibited properties. The tomographic raw data needs several steps of computational processing in order to eliminate noise and interference. Our experience with a program (Tritom) that handles these questions has shown that in some cases the processing steps take a very long time and that it is not easy for a Materials Science specialist to interact with Tritom in order to define the most adequate parameter values and the proper sequence of the available processing steps. For easing the use of Tritom, a system was built which addresses the aspects described before and that is based on the OpenDX visualization system. OpenDX visualization facilities constitute a great benefit to Tritom. The visual programming environment of OpenDX allows an easy definition of a sequence of processing steps thus fulfilling the requirement of an easy use by non-specialists on Computer Science. Also the possibility of incorporating external modules in a visual OpenDX program allows the researchers to tackle the aspect of reducing the long execution time of some processing steps. The longer processing steps of Tritom have been parallelized in two different types of hardware architectures (message-passing and shared-memory); the corresponding parallel programs can be easily incorporated in a sequence of processing steps defined in an OpenDX program. The benefits of our system are illustrated through an example where the tool is applied in the study of the sensitivity to crushing – and the implications thereof – of the reinforcements used in a functionally graded syntactic metallic foam.
Resumo:
In the past few decades, integrated circuits have become a major part of everyday life. Every circuit that is created needs to be tested for faults so faulty circuits are not sent to end-users. The creation of these tests is time consuming, costly and difficult to perform on larger circuits. This research presents a novel method for fault detection and test pattern reduction in integrated circuitry under test. By leveraging the FPGA's reconfigurability and parallel processing capabilities, a speed up in fault detection can be achieved over previous computer simulation techniques. This work presents the following contributions to the field of Stuck-At-Fault detection: We present a new method for inserting faults into a circuit net list. Given any circuit netlist, our tool can insert multiplexers into a circuit at correct internal nodes to aid in fault emulation on reconfigurable hardware. We present a parallel method of fault emulation. The benefit of the FPGA is not only its ability to implement any circuit, but its ability to process data in parallel. This research utilizes this to create a more efficient emulation method that implements numerous copies of the same circuit in the FPGA. A new method to organize the most efficient faults. Most methods for determinin the minimum number of inputs to cover the most faults require sophisticated softwareprograms that use heuristics. By utilizing hardware, this research is able to process data faster and use a simpler method for an efficient way of minimizing inputs.
Resumo:
This thesis develops high performance real-time signal processing modules for direction of arrival (DOA) estimation for localization systems. It proposes highly parallel algorithms for performing subspace decomposition and polynomial rooting, which are otherwise traditionally implemented using sequential algorithms. The proposed algorithms address the emerging need for real-time localization for a wide range of applications. As the antenna array size increases, the complexity of signal processing algorithms increases, making it increasingly difficult to satisfy the real-time constraints. This thesis addresses real-time implementation by proposing parallel algorithms, that maintain considerable improvement over traditional algorithms, especially for systems with larger number of antenna array elements. Singular value decomposition (SVD) and polynomial rooting are two computationally complex steps and act as the bottleneck to achieving real-time performance. The proposed algorithms are suitable for implementation on field programmable gated arrays (FPGAs), single instruction multiple data (SIMD) hardware or application specific integrated chips (ASICs), which offer large number of processing elements that can be exploited for parallel processing. The designs proposed in this thesis are modular, easily expandable and easy to implement. Firstly, this thesis proposes a fast converging SVD algorithm. The proposed method reduces the number of iterations it takes to converge to correct singular values, thus achieving closer to real-time performance. A general algorithm and a modular system design are provided making it easy for designers to replicate and extend the design to larger matrix sizes. Moreover, the method is highly parallel, which can be exploited in various hardware platforms mentioned earlier. A fixed point implementation of proposed SVD algorithm is presented. The FPGA design is pipelined to the maximum extent to increase the maximum achievable frequency of operation. The system was developed with the objective of achieving high throughput. Various modern cores available in FPGAs were used to maximize the performance and details of these modules are presented in detail. Finally, a parallel polynomial rooting technique based on Newton’s method applicable exclusively to root-MUSIC polynomials is proposed. Unique characteristics of root-MUSIC polynomial’s complex dynamics were exploited to derive this polynomial rooting method. The technique exhibits parallelism and converges to the desired root within fixed number of iterations, making this suitable for polynomial rooting of large degree polynomials. We believe this is the first time that complex dynamics of root-MUSIC polynomial were analyzed to propose an algorithm. In all, the thesis addresses two major bottlenecks in a direction of arrival estimation system, by providing simple, high throughput, parallel algorithms.
DESIGN AND IMPLEMENT DYNAMIC PROGRAMMING BASED DISCRETE POWER LEVEL SMART HOME SCHEDULING USING FPGA
Resumo:
With the development and capabilities of the Smart Home system, people today are entering an era in which household appliances are no longer just controlled by people, but also operated by a Smart System. This results in a more efficient, convenient, comfortable, and environmentally friendly living environment. A critical part of the Smart Home system is Home Automation, which means that there is a Micro-Controller Unit (MCU) to control all the household appliances and schedule their operating times. This reduces electricity bills by shifting amounts of power consumption from the on-peak hour consumption to the off-peak hour consumption, in terms of different “hour price”. In this paper, we propose an algorithm for scheduling multi-user power consumption and implement it on an FPGA board, using it as the MCU. This algorithm for discrete power level tasks scheduling is based on dynamic programming, which could find a scheduling solution close to the optimal one. We chose FPGA as our system’s controller because FPGA has low complexity, parallel processing capability, a large amount of I/O interface for further development and is programmable on both software and hardware. In conclusion, it costs little time running on FPGA board and the solution obtained is good enough for the consumers.
Resumo:
We investigate parallel algorithms for the solution of the Navier–Stokes equations in space-time. For periodic solutions, the discretized problem can be written as a large non-linear system of equations. This system of equations is solved by a Newton iteration. The Newton correction is computed using a preconditioned GMRES solver. The parallel performance of the algorithm is illustrated.
Resumo:
The manipulation and handling of an ever increasing volume of data by current data-intensive applications require novel techniques for e?cient data management. Despite recent advances in every aspect of data management (storage, access, querying, analysis, mining), future applications are expected to scale to even higher degrees, not only in terms of volumes of data handled but also in terms of users and resources, often making use of multiple, pre-existing autonomous, distributed or heterogeneous resources.
Resumo:
Zernike polynomials are a well known set of functions that find many applications in image or pattern characterization because they allow to construct shape descriptors that are invariant against translations, rotations or scale changes. The concepts behind them can be extended to higher dimension spaces, making them also fit to describe volumetric data. They have been less used than their properties might suggest due to their high computational cost. We present a parallel implementation of 3D Zernike moments analysis, written in C with CUDA extensions, which makes it practical to employ Zernike descriptors in interactive applications, yielding a performance of several frames per second in voxel datasets about 2003 in size. In our contribution, we describe the challenges of implementing 3D Zernike analysis in a general-purpose GPU. These include how to deal with numerical inaccuracies, due to the high precision demands of the algorithm, or how to deal with the high volume of input data so that it does not become a bottleneck for the system.
Resumo:
Abstract is not available.
Resumo:
We present a technique to estimate accurate speedups for parallel logic programs with relative independence from characteristics of a given implementation or underlying parallel hardware. The proposed technique is based on gathering accurate data describing one execution at run-time, which is fed to a simulator. Alternative schedulings are then simulated and estimates computed for the corresponding speedups. A tool implementing the aforementioned techniques is presented, and its predictions are compared to the performance of real systems, showing good correlation.
Resumo:
In recent years a lot of research has been invested in parallel processing of numerical applications. However, parallel processing of Symbolic and AI applications has received less attention. This paper presents a system for parallel symbolic computitig, narned ACE, based on the logic programming paradigm. ACE is a computational model for the full Prolog language, capable of exploiting Or-parall< lism and Independent And-parallelism. In this paper vve focus on the implementation of the and-parallel part of the ACE system (ralled &ACE) on a shared memory multiprocessor, d< scribing its organization, some optimizations, and presenting some performance figures, proving the abilhy of &ACE to efficiently exploit parallelism.