843 results for parallel processing systems
Abstract:
Embedded real-time applications increasingly present high computation requirements that must be completed within specific deadlines, yet exhibit highly variable processing patterns depending on the data available at a given instant. The current trend towards parallel processing in the embedded domain provides higher processing power; however, it does not address the variability of the processing pattern. Dimensioning each device for its worst-case scenario implies lower average utilization and leaves processing capacity in the overall system available but unusable. A solution to this problem is to extend the parallel execution of the applications, allowing networked nodes to distribute workload to neighbouring nodes during peak situations. In this context, this report proposes a framework to develop parallel and distributed real-time embedded applications, transparently using OpenMP and the Message Passing Interface (MPI) within a programming model based on OpenMP. The technical report also devises an integrated timing model, which enables structured reasoning about the timing behaviour of these hybrid architectures.
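For orientation, a minimal sketch of the hybrid execution model this abstract describes, assuming a plain MPI + OpenMP toolchain; the framework's actual API is not shown in the abstract, so all names and the workload split here are illustrative:

```c
/* Minimal hybrid MPI + OpenMP sketch: each networked node runs an MPI
 * process that parallelizes its local share of the work with OpenMP.
 * Illustrative only; not the report's framework API. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv) {
    int rank, size, provided;
    /* MPI_THREAD_FUNNELED: only the main thread makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = N / size;              /* static split of the workload */
    double local = 0.0, total = 0.0;

    /* Each node processes its chunk in parallel with OpenMP */
    #pragma omp parallel for reduction(+:local)
    for (int i = rank * chunk; i < (rank + 1) * chunk; i++)
        local += (double)i * i;        /* stand-in for real work */

    /* Combine partial results across nodes */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("total = %f\n", total);
    MPI_Finalize();
    return 0;
}
```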
Abstract:
The performance of parallel vector implementations of one- and two-dimensional orthogonal transforms is evaluated. The orthogonal transforms are computed using actual or modified fast Fourier transform (FFT) kernels. The factors considered in comparing the speed-up of these vectorized digital signal processing algorithms are discussed, and it is shown that the traditional way of comparing the execution speed of digital signal processing algorithms by the ratios of the numbers of multiplications and additions is no longer effective for vector implementation; the structure of the algorithm must also be considered when comparing the execution speed of vectorized digital signal processing algorithms. Simulation results on the Cray X-MP are presented for the following orthogonal transforms: the discrete Fourier transform (DFT), discrete cosine transform (DCT), discrete sine transform (DST), discrete Hartley transform (DHT), discrete Walsh transform (DWHT), and discrete Hadamard transform (DHDT).A comparison between the DHT and the fast Hartley transform is also included.
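As a concrete example of computing one of these transforms through an FFT kernel (a standard identity, not necessarily the exact kernel used in the paper), the DHT follows directly from the DFT coefficients:

```latex
% DHT via a DFT/FFT kernel. With the cas kernel,
% \mathrm{cas}\,\theta = \cos\theta + \sin\theta, and X(k) the DFT of x(n):
H(k) = \sum_{n=0}^{N-1} x(n)\,\mathrm{cas}\!\left(\frac{2\pi kn}{N}\right)
     = \Re\{X(k)\} - \Im\{X(k)\},
\qquad
X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-i 2\pi kn/N}.
```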
Abstract:
The Streaming SIMD Extensions (SSE) are a special feature embedded in the Intel Pentium III and IV classes of microprocessors, enabling the execution of SIMD-type operations to exploit data parallelism. This article presents how the computation performance of a railway network simulator can be improved by means of SSE. Voltage and current at various points of the supply system to an electrified railway line are crucial for design, daily operation and planning. With computer simulation, their time variations can be obtained by solving a matrix equation whose size depends mainly upon the number of trains present in the system. A large coefficient matrix, resulting from a congested railway line, inevitably leads to heavier computational demand and hence jeopardizes the simulation speed. With the special architectural features of the latest processors on PC platforms, significant speed-up in computations can be achieved.
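The kind of inner loop that benefits from SSE is the row update at the heart of solving such matrix equations; a hedged sketch using SSE intrinsics (illustrative only, not the article's code):

```c
/* Core of Gaussian elimination, y[i] -= a * x[i], vectorized four
 * single-precision floats at a time with the SSE intrinsics available
 * on Pentium III/4-class processors. Illustrative sketch only. */
#include <xmmintrin.h>   /* SSE intrinsics */

void row_update_sse(float *y, const float *x, float a, int n) {
    __m128 va = _mm_set1_ps(a);               /* broadcast multiplier */
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(&x[i]);
        __m128 vy = _mm_loadu_ps(&y[i]);
        vy = _mm_sub_ps(vy, _mm_mul_ps(va, vx));
        _mm_storeu_ps(&y[i], vy);             /* y[i..i+3] -= a*x[i..i+3] */
    }
    for (; i < n; i++)                        /* scalar tail */
        y[i] -= a * x[i];
}
```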
Abstract:
Streaming SIMD Extensions (SSE) is a unique feature embedded in the Pentium III and P4 classes of microprocessors. By fully exploiting SSE, parallel algorithms can be implemented on a standard personal computer and a theoretical speedup of four can be achieved. In this paper, we demonstrate the implementation of a parallel LU matrix decomposition algorithm for solving power systems network equations with SSE and discuss advantages and disadvantages of this approach.
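A minimal in-place LU factorization (Doolittle form, no pivoting; not the paper's implementation) makes visible where the four-wide SSE parallelism applies: the innermost update runs over independent elements and can be packed four single-precision floats per instruction, which is the source of the theoretical speedup of four:

```c
/* In-place LU factorization of a row-major n x n matrix (Doolittle,
 * no pivoting). After the call, U sits on and above the diagonal and
 * the multipliers of L below it. Sketch under stated assumptions. */
void lu_decompose(float *a, int n) {
    for (int k = 0; k < n; k++) {
        for (int i = k + 1; i < n; i++) {
            float m = a[i*n + k] / a[k*n + k];   /* multiplier -> L */
            a[i*n + k] = m;
            /* SSE-friendly loop: independent element-wise updates */
            for (int j = k + 1; j < n; j++)
                a[i*n + j] -= m * a[k*n + j];
        }
    }
}
```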
Abstract:
In this paper we propose a novel technique to model and analyze the performability of parallel and distributed architectures using GSPN-reward models.
Abstract:
Modeling the performance behavior of parallel applications to predict their execution times for larger problem sizes and numbers of processors has been an active area of research for several years. The existing curve-fitting strategies for performance modeling utilize data from experiments that are conducted under uniform loading conditions. Hence the accuracy of these models degrades when the load conditions on the machines and network change. In this paper, we analyze a curve-fitting model that attempts to predict execution times for any load conditions that may exist on the systems during application execution. Based on the experiments conducted with the model for a parallel eigenvalue problem, we propose a multi-dimensional curve-fitting model based on rational polynomials for performance prediction of parallel applications in non-dedicated environments. We used the rational-polynomial-based model to predict execution times for two other parallel applications on systems with large load dynamics. In all cases, the model gave good predictions of execution times, with average percentage prediction errors of less than 20%.
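A hedged sketch of the general idea, with a hypothetical two-variable rational model and made-up coefficients (the paper's fitted model form and values are not given in the abstract):

```c
/* Rational-polynomial performance model, T(p, l) = P(p, l) / Q(p, l),
 * where p is the number of processors and l a measured load index.
 * The model form and all coefficients below are hypothetical,
 * not the paper's fitted values. */
#include <stdio.h>

double predict_time(double p, double l,
                    const double c[4], const double d[2]) {
    double P = c[0] + c[1]/p + c[2]*l + c[3]*l/p;  /* numerator   */
    double Q = 1.0 + d[0]*p + d[1]*l;              /* denominator */
    return P / Q;
}

int main(void) {
    const double c[4] = {0.5, 120.0, 2.0, 40.0};   /* illustrative only */
    const double d[2] = {0.01, 0.1};
    printf("T(16 procs, load 0.3) = %.2f s\n",
           predict_time(16.0, 0.3, c, d));
    return 0;
}
```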
Abstract:
Information forms the basis of modern technology. To meet the ever-increasing demand for information, means have to be devised for a more efficient and better-equipped technology to intelligibly process data. Advances in photonics have made their impact on each of the four key applications in information processing, i.e., acquisition, transmission, storage and processing of information. The inherent advantages of ultrahigh bandwidth, high speed and low-loss transmission have already established fiber optics as the backbone of communication technology. However, the optics-to-electronics inter-conversion at the transmitter and receiver ends severely limits both the speed and bit rate of lightwave communication systems. As the trend towards still faster and higher-capacity systems continues, it has become increasingly necessary to perform more and more signal-processing operations in the optical domain itself, i.e., with all-optical components and devices that possess a high bandwidth and can perform parallel processing functions to eliminate the electronic bottleneck.
Abstract:
This paper studies the development of a real-time stereovision system to track multiple infrared markers attached to a surgical instrument. Multiple pipeline stages are developed in a field-programmable gate array (FPGA) to recognize the targets in both the left and right image planes and to give each target a unique label. The pipeline architecture includes a smoothing filter, an adaptive threshold module, a connected component labeling operation, and a centroid extraction process. A parallel distortion correction method is proposed and implemented in a dual-core DSP. A suitable kinematic model is established for the moving targets, and a novel set of parallel and interactive computation mechanisms is proposed to position and track the targets, carried out by a cross-computation method in the dual-core DSP. The proposed tracking system can track the 3-D coordinates, velocity, and acceleration of four infrared markers with a delay of 9.18 ms. Furthermore, it is capable of tracking a maximum of 110 infrared markers without frame dropping at a frame rate of 60 f/s. The accuracy of the proposed system can reach the scale of 0.37 mm RMS along the x- and y-directions and 0.45 mm RMS along the depth direction (for depths from 0.8 to 0.45 m). The performance of the proposed system can meet the requirements of applications such as surgical navigation, which demand high real-time performance and accuracy.
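As an illustration of the last pipeline stage listed above, a software sketch of centroid extraction from a connected-component-labelled image (the FPGA realizes this as streaming hardware; names and data layout here are illustrative):

```c
/* Centroid extraction from a labelled image: labels are 1..num_labels,
 * label 0 is background. Sketch only, assuming num_labels < MAX_LABELS. */
#define MAX_LABELS 128   /* the paper's system tracks up to 110 markers */

void extract_centroids(const int *labels, int w, int h,
                       int num_labels, float *cx, float *cy) {
    int  count[MAX_LABELS] = {0};               /* pixels per label   */
    long sx[MAX_LABELS]    = {0};               /* sum of x per label */
    long sy[MAX_LABELS]    = {0};               /* sum of y per label */

    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            int l = labels[y*w + x];
            if (l > 0) { count[l]++; sx[l] += x; sy[l] += y; }
        }
    for (int l = 1; l <= num_labels; l++) {     /* mean position = centroid */
        cx[l] = count[l] ? (float)sx[l] / count[l] : -1.0f;
        cy[l] = count[l] ? (float)sy[l] / count[l] : -1.0f;
    }
}
```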
Abstract:
Conventional parallel computer architectures do not provide support for non-uniformly distributed objects. In this thesis, I introduce sparsely faceted arrays (SFAs), a new low-level mechanism for naming regions of memory, or facets, on different processors in a distributed, shared memory parallel processing system. Sparsely faceted arrays address the disconnect between the global distributed arrays provided by conventional architectures (e.g., the Cray T3 series) and the requirements of high-level parallel programming methods that wish to use objects that are distributed over only a subset of processing elements. A sparsely faceted array names a virtual globally-distributed array, but actual facets are lazily allocated. By providing simple semantics and making efficient use of memory, SFAs enable efficient implementation of a variety of non-uniformly distributed data structures and related algorithms. I present example applications which use SFAs, and describe and evaluate simple hardware mechanisms for implementing SFAs. Keeping track of which nodes have allocated facets for a particular SFA is an important task that suggests the need for automatic memory management, including garbage collection. To address this need, I first argue that conventional tracing techniques such as mark/sweep and copying GC are inherently unscalable in parallel systems. I then present a parallel memory-management strategy, based on reference-counting, that is capable of garbage collecting sparsely faceted arrays. I also discuss opportunities for hardware support of this garbage collection strategy. I have implemented a high-level hardware/OS simulator featuring hardware support for sparsely faceted arrays and automatic garbage collection. I describe the simulator and outline a few of the numerous details associated with a "real" implementation of SFAs and SFA-aware garbage collection. Simulation results are used throughout this thesis in the evaluation of hardware support mechanisms.
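A conceptual C rendering of the lazy-allocation idea (the thesis describes hardware/OS-level mechanisms; this sketch only illustrates the semantics, with invented names):

```c
/* Sparsely faceted array sketch: the array names one facet slot per
 * node, but a node's facet is only allocated on first touch.
 * Illustrative structures and names, not the thesis's mechanisms. */
#include <stdlib.h>

typedef struct {
    int    nodes;        /* number of processing elements           */
    size_t facet_size;   /* bytes per facet                         */
    void **facets;       /* facets[i] == NULL until node i uses it  */
} sfa_t;

sfa_t *sfa_create(int nodes, size_t facet_size) {
    sfa_t *a = malloc(sizeof *a);
    a->nodes = nodes;
    a->facet_size = facet_size;
    a->facets = calloc(nodes, sizeof *a->facets);  /* all unallocated */
    return a;
}

void *sfa_facet(sfa_t *a, int node) {
    if (!a->facets[node])                          /* lazy allocation */
        a->facets[node] = calloc(1, a->facet_size);
    return a->facets[node];
}
```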
A policy-definition language and prototype implementation library for policy-based autonomic systems
Abstract:
This paper presents work towards generic policy toolkit support for autonomic computing systems in which the policies themselves can be adapted dynamically and automatically. The work is motivated by three needs: the need for longer-term policy-based adaptation, where the policy itself is dynamically adapted to continually maintain or improve its effectiveness despite changing environmental conditions; the need to enable practitioners who are not autonomics experts to embed self-managing behaviours with low cost and risk; and the need for adaptive policy mechanisms that are easy to deploy into legacy code. A policy definition language is presented, designed to permit powerful expression of self-managing behaviours. The language is very flexible, with simple yet expressive syntax and semantics, and facilitates a very diverse policy behaviour space through both hierarchical and recursive uses of language elements. A prototype library implementation of the policy support mechanisms is described. The library reads and writes policies in well-formed XML. The implementation extends the state of the art in policy-based autonomics through innovations that include support for multiple policy versions of a given policy type, multiple configuration templates, and meta-policies to dynamically select between policy instances and templates. Most significantly, the scheme supports hot-swapping between policy instances. To illustrate the feasibility and generalised applicability of these tools, two dissimilar example deployment scenarios are examined. The first is taken from an exploratory implementation of self-managing parallel processing and is used to demonstrate the simple and efficient use of the tools. The second example demonstrates more advanced functionality, in the context of an envisioned multi-policy stock trading scheme which is sensitive to environmental volatility.
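A toy sketch of the hot-swapping idea, with invented names (the toolkit itself expresses policies in XML, not C): several versions of one policy type are registered, and a meta-policy selects which instance handles the next decision, so instances can be swapped without stopping the managed system.

```c
/* Hot-swapping between policy instances, sketched with function
 * pointers. All names, policies, and thresholds are illustrative. */
#include <stdio.h>

typedef int (*policy_fn)(double observed);     /* a policy instance */

static int conservative(double v) { return v > 0.9; }
static int aggressive(double v)   { return v > 0.5; }

int main(void) {
    policy_fn versions[] = { conservative, aggressive };
    int active = 0;                            /* meta-policy's choice */

    for (int step = 0; step < 4; step++) {
        double load = 0.3 + 0.2 * step;        /* stand-in observation */
        if (load > 0.6) active = 1;            /* meta-policy: hot-swap */
        printf("step %d: act=%d (policy %d)\n",
               step, versions[active](load), active);
    }
    return 0;
}
```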
Abstract:
A novel application-specific instruction set processor (ASIP) for use in the construction of modern signal processing systems is presented. This is a flexible device that can be used in the construction of array processor systems for the real-time implementation of functions such as singular-value decomposition (SVD) and QR decomposition (QRD), as well as other important matrix computations. It uses a coordinate rotation digital computer (CORDIC) module to perform arithmetic operations, and several approaches are adopted to achieve high performance, including pipelining of the micro-rotations, the use of parallel instructions and a dual-bus architecture. In addition, a novel method for scale factor correction is presented which only needs to be applied once at the end of the computation. This also reduces computation time and enhances performance. Methods are described which allow this processor to be used in reduced-dimension (i.e., folded) array processor structures that allow tradeoffs between hardware and performance. The net result is a flexible matrix computational processing element (PE) whose functionality can be changed under program control for use in a wider range of scenarios than previous work. Details are presented of the results of a design study, which considers the application of this decomposition PE architecture in a combined SVD/QRD system and demonstrates that a combination of high performance and efficient silicon implementation is achievable. © 2005 IEEE.
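For reference, the CORDIC micro-rotation recurrence with a single scale-factor correction at the end, sketched in C (the ASIP realizes this with shifts, adds, and lookup tables in hardware; this is the textbook algorithm, not the processor's microcode):

```c
/* CORDIC rotation mode: rotate (x, y) by 'angle' using iters
 * micro-rotations. Each step uses only a shift-like scaling and adds;
 * the accumulated gain K is corrected once at the end, mirroring the
 * abstract's single scale-factor correction. */
#include <math.h>

void cordic_rotate(double *x, double *y, double angle, int iters) {
    double cx = *x, cy = *y, z = angle;
    for (int i = 0; i < iters; i++) {
        double d = (z >= 0) ? 1.0 : -1.0;  /* rotation direction    */
        double t = ldexp(1.0, -i);         /* 2^-i: a hardware shift */
        double nx = cx - d * cy * t;
        double ny = cy + d * cx * t;
        z -= d * atan(t);                  /* table lookup in hardware */
        cx = nx; cy = ny;
    }
    /* one-off correction of the CORDIC gain K = prod sqrt(1 + 2^-2i) */
    double k = 1.0;
    for (int i = 0; i < iters; i++)
        k *= sqrt(1.0 + ldexp(1.0, -2 * i));
    *x = cx / k;
    *y = cy / k;
}
```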
Abstract:
We discuss how common problems arising with multi/many-core distributed architectures can be effectively handled through co-design of parallel/distributed programming abstractions and of autonomic management of non-functional concerns. In particular, we demonstrate how restricted patterns (or skeletons) may be efficiently managed by rule-based autonomic managers. We discuss the basic principles underlying pattern+manager co-design, current implementations inspired by this approach, and some results achieved with proof-of-concept prototypes.
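A toy rendering of such a rule-based manager for a task-farm skeleton, with invented thresholds (the paper's managers are rule-engine based; this only illustrates the control loop):

```c
/* Autonomic manager sketch for a task-farm skeleton: periodically
 * compare measured throughput with a performance contract and adapt
 * the parallelism degree. Names and thresholds are illustrative. */
typedef struct {
    int    workers;       /* current parallelism degree      */
    double target_tput;   /* contracted throughput (items/s) */
} farm_t;

void manage(farm_t *f, double measured_tput, int max_workers) {
    if (measured_tput < 0.9 * f->target_tput && f->workers < max_workers)
        f->workers++;     /* rule: under contract -> grow the farm    */
    else if (measured_tput > 1.2 * f->target_tput && f->workers > 1)
        f->workers--;     /* rule: over-provisioned -> release a worker */
}
```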