Biblioteca Digital

57 resultados para Contracts of execution

em Indian Institute of Science - Bangalore - Índia

Estimation of Execution Times of Process Control Algorithms on Microcomputers

Relevância:

100.00% 100.00%

Publicador:

Resumo:

An important question which has to be answered in evaluting the suitability of a microcomputer for a control application is the time it would take to execute the specified control algorithm. In this paper, we present a method of obtaining closed-form formulas to estimate this time. These formulas are applicable to control algorithms in which arithmetic operations and matrix manipulations dominate. The method does not require writing detailed programs for implementing the control algorithm. Using this method, the execution times of a variety of control algorithms on a range of 16-bit mini- and recently announced microcomputers are calculated. The formulas have been verified independently by an analysis program, which computes the execution time bounds of control algorithms coded in Pascal when they are run on a specified micro- or minicomputer.

RETHROTTLE: Execution Throttling in the REDEFINE SoC Architecture

Relevância:

90.00% 90.00%

Publicador:

Resumo:

REDEFINE is a reconfigurable SoC architecture that provides a unique platform for high performance and low power computing by exploiting the synergistic interaction between coarse grain dynamic dataflow model of computation (to expose abundant parallelism in applications) and runtime composition of efficient compute structures (on the reconfigurable computation resources). We propose and study the throttling of execution in REDEFINE to maximize the architecture efficiency. A feature specific fast hybrid (mixed level) simulation framework for early in design phase study is developed and implemented to make the huge design space exploration practical. We do performance modeling in terms of selection of important performance criteria, ranking of the explored throttling schemes and investigate effectiveness of the design space exploration using statistical hypothesis testing. We find throttling schemes which give appreciable (24.8%) overall performance gain in the architecture and 37% resource usage gain in the throttling unit simultaneously.

Outward spherical solidification of a superheated melt with time dependent boundary flux

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Short-time analytical solutions of solid and liquid temperatures and freezing front have been obtained for the outward radially symmetric spherical solidification of a superheated melt. Although results are presented here only for time dependent boundary flux, the method of solution can be used for other kinds of boundary conditions also. Later, the analytical solution has been compared with the numerical solution obtained with the help of a finite difference numerical scheme in which the grid points change with the freezing front position. An efficient method of execution of the numerical scheme has been discussed in details. Graphs have been drawn for the total solidification times and temperature distributions in the solid.

Performance Modeling based on Multidimensional Surface Learning for Performance Predictions of Parallel Applications in Non-Dedicated Environments

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Modeling the performance behavior of parallel applications to predict the execution times of the applications for larger problem sizes and number of processors has been an active area of research for several years. The existing curve fitting strategies for performance modeling utilize data from experiments that are conducted under uniform loading conditions. Hence the accuracy of these models degrade when the load conditions on the machines and network change. In this paper, we analyze a curve fitting model that attempts to predict execution times for any load conditions that may exist on the systems during application execution. Based on the experiments conducted with the model for a parallel eigenvalue problem, we propose a multi-dimensional curve-fitting model based on rational polynomials for performance predictions of parallel applications in non-dedicated environments. We used the rational polynomial based model to predict execution times for 2 other parallel applications on systems with large load dynamics. In all the cases, the model gave good predictions of execution times with average percentage prediction errors of less than 20%

A Generalized Approach to Multiresolution Complex SAR Signal Processing

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Multiresolution synthetic aperture radar (SAR) image formation has been proven to be beneficial in a variety of applications such as improved imaging and target detection as well as speckle reduction. SAR signal processing traditionally carried out in the Fourier domain has inherent limitations in the context of image formation at hierarchical scales. We present a generalized approach to the formation of multiresolution SAR images using biorthogonal shift-invariant discrete wavelet transform (SIDWT) in both range and azimuth directions. Particularly in azimuth, the inherent subband decomposition property of wavelet packet transform is introduced to produce multiscale complex matched filtering without involving any approximations. This generalized approach also includes the formulation of multilook processing within the discrete wavelet transform (DWT) paradigm. The efficiency of the algorithm in parallel form of execution to generate hierarchical scale SAR images is shown. Analytical results and sample imagery of diffuse backscatter are presented to validate the method.

Evaluation of Charges for Power Transmission and Losses in Bilateral Power Contracts

Relevância:

40.00% 40.00%

Publicador:

Resumo:

The paper presents a method for transmission loss charge allocation in deregulated power systems based on Relative Electrical Distance (RED) concept. Based on RED between the generator and load nodes and the predefined bilateral power contracts, charge evaluation is carried out. Generally through some power exchange mechanism a set of bilateral contracts are determined that facilitate bilateral agreements between the generation and distribution entities. In this paper the possible charges incurred in meeting loads like generation charge, transmission charge and charge due to losses are evaluated. Case studies have been carried out on a few practical equivalent systems. Due to space limitation results for a sample 5 bus system are presented considering ideal load/generation power contracts and deviated load/generation power contracts. Extensive numerical testing indicates that the proposed allocation scheme produces loss allocations that are appropriate and that behave in a physically reasonable manner.

Synergistic Execution of Stream Programs on Multicores with Accelerators

Relevância:

40.00% 40.00%

Publicador:

Resumo:

The StreamIt programming model has been proposed to exploit parallelism in streaming applications oil general purpose multicore architectures. The StreamIt graphs describe task, data and pipeline parallelism which can be exploited on accelerators such as Graphics Processing Units (GPUs) or CellBE which support abundant parallelism in hardware. In this paper, we describe a novel method to orchestrate the execution of if StreamIt program oil a multicore platform equipped with an accelerator. The proposed approach identifies, using profiling, the relative benefits of executing a task oil the superscalar CPU cores and the accelerator. We formulate the problem of partitioning the work between the CPU cores and the GPU, taking into account the latencies for data transfers and the required buffer layout transformations associated with the partitioning, as all integrated Integer Linear Program (ILP) which can then be solved by an ILP solver. We also propose an efficient heuristic algorithm for the work-partitioning between the CPU and the GPU, which provides solutions which are within 9.05% of the optimal solution on an average across the benchmark Suite. The partitioned tasks are then software pipelined to execute oil the multiple CPU cores and the Streaming Multiprocessors (SMs) of the GPU. The software pipelining algorithm orchestrates the execution between CPU cores and the GPU by emitting the code for the CPU and the GPU, and the code for the required data transfers. Our experiments on a platform with 8 CPU cores and a GeForce 8800 GTS 512 GPU show a geometric mean speedup of 6.94X with it maximum of 51.96X over it single threaded CPU execution across the StreamIt benchmarks. This is a 18.9% improvement over it partitioning strategy that maps only the filters that cannot be executed oil the GPU - the filters with state that is persistent across firings - onto the CPU.

Software Pipelined Execution of Stream Programs on GPUs

Relevância:

40.00% 40.00%

Publicador:

Resumo:

The StreamIt programming model has been proposed to exploit parallelism in streaming applications on general purpose multi-core architectures. This model allows programmers to specify the structure of a program as a set of filters that act upon data, and a set of communication channels between them. The StreamIt graphs describe task, data and pipeline parallelism which can be exploited on modern Graphics Processing Units (GPUs), as they support abundant parallelism in hardware. In this paper, we describe the challenges in mapping StreamIt to GPUs and propose an efficient technique to software pipeline the execution of stream programs on GPUs. We formulate this problem - both scheduling and assignment of filters to processors - as an efficient Integer Linear Program (ILP), which is then solved using ILP solvers. We also describe a novel buffer layout technique for GPUs which facilitates exploiting the high memory bandwidth available in GPUs. The proposed scheduling utilizes both the scalar units in GPU, to exploit data parallelism, and multiprocessors, to exploit task and pipelin parallelism. Further it takes into consideration the synchronization and bandwidth limitations of GPUs, and yields speedups between 1.87X and 36.83X over a single threaded CPU.

Voluntary Control of Multisaccade Gaze Shifts During Movement Preparation and Execution

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Ramakrishnan A, Chokhandre S, Murthy A. Voluntary control of multisaccade gaze shifts during movement preparation and execution. J Neurophysiol 103: 2400-2416, 2010. First published February 17, 2010; doi: 10.1152/jn.00843.2009. Although the nature of gaze control regulating single saccades is relatively well documented, how such control is implemented to regulate multisaccade gaze shifts is not known. We used highly eccentric targets to elicit multisaccade gaze shifts and tested the ability of subjects to control the saccade sequence by presenting a second target on random trials. Their response allowed us to test the nature of control at many levels: before, during, and between saccades. Although the saccade sequence could be inhibited before it began, we observed clear signs of truncation of the first saccade, which confirmed that it could be inhibited in midflight as well. Using a race model that explains the control of single saccades, we estimated that it took about 100 ms to inhibit a planned saccade but took about 150 ms to inhibit a saccade during its execution. Although the time taken to inhibit was different, the high subject-wise correlation suggests a unitary inhibitory control acting at different levels in the oculomotor system. We also frequently observed responses that consisted of hypometric initial saccades, followed by secondary saccades to the initial target. Given the estimates of the inhibitory process provided by the model that also took into account the variances of the processes as well, the secondary saccades (average latency similar to 215 ms) should have been inhibited. Failure to inhibit the secondary saccade suggests that the intersaccadic interval in a multisaccade response is a ballistic stage. Collectively, these data indicate that the oculomotor system can control a response until a very late stage in its execution. However, if the response consists of multiple movements then the preparation of the second movement becomes refractory to new visual input, either because it is part of a preprogrammed sequence or as a consequence of being a corrective response to a motor error.

Lenient Execution and Concurrent Execution of Re-entrant Routines: Efficient Implementation in Data Flow Systems

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Implementation details of efficient schemes for lenient execution and concurrent execution of re-entrant routines in a data flow model have been discussed in this paper. The proposed schemes require no extra hardware support and utilise the existing hardware resources such as the Matching Unit and Memory Network Interface, effectively to achieve the above mentioned goals.

Automatic Compilation of MATLAB Programs for Synergistic Execution on Heterogeneous Processors

Relevância:

40.00% 40.00%

Publicador:

Resumo:

MATLAB is an array language, initially popular for rapid prototyping, but is now being increasingly used to develop production code for numerical and scientific applications. Typical MATLAB programs have abundant data parallelism. These programs also have control flow dominated scalar regions that have an impact on the program's execution time. Today's computer systems have tremendous computing power in the form of traditional CPU cores and throughput oriented accelerators such as graphics processing units(GPUs). Thus, an approach that maps the control flow dominated regions to the CPU and the data parallel regions to the GPU can significantly improve program performance. In this paper, we present the design and implementation of MEGHA, a compiler that automatically compiles MATLAB programs to enable synergistic execution on heterogeneous processors. Our solution is fully automated and does not require programmer input for identifying data parallel regions. We propose a set of compiler optimizations tailored for MATLAB. Our compiler identifies data parallel regions of the program and composes them into kernels. The problem of combining statements into kernels is formulated as a constrained graph clustering problem. Heuristics are presented to map identified kernels to either the CPU or GPU so that kernel execution on the CPU and the GPU happens synergistically and the amount of data transfer needed is minimized. In order to ensure required data movement for dependencies across basic blocks, we propose a data flow analysis and edge splitting strategy. Thus our compiler automatically handles composition of kernels, mapping of kernels to CPU and GPU, scheduling and insertion of required data transfer. The proposed compiler was implemented and experimental evaluation using a set of MATLAB benchmarks shows that our approach achieves a geometric mean speedup of 19.8X for data parallel benchmarks over native execution of MATLAB.

Automatic compilation of MATLAB programs for synergistic execution on heterogeneous processors

Relevância:

40.00% 40.00%

Publicador:

Resumo:

MATLAB is an array language, initially popular for rapid prototyping, but is now being increasingly used to develop production code for numerical and scientific applications. Typical MATLAB programs have abundant data parallelism. These programs also have control flow dominated scalar regions that have an impact on the program's execution time. Today's computer systems have tremendous computing power in the form of traditional CPU cores and throughput oriented accelerators such as graphics processing units(GPUs). Thus, an approach that maps the control flow dominated regions to the CPU and the data parallel regions to the GPU can significantly improve program performance. In this paper, we present the design and implementation of MEGHA, a compiler that automatically compiles MATLAB programs to enable synergistic execution on heterogeneous processors. Our solution is fully automated and does not require programmer input for identifying data parallel regions. We propose a set of compiler optimizations tailored for MATLAB. Our compiler identifies data parallel regions of the program and composes them into kernels. The problem of combining statements into kernels is formulated as a constrained graph clustering problem. Heuristics are presented to map identified kernels to either the CPU or GPU so that kernel execution on the CPU and the GPU happens synergistically and the amount of data transfer needed is minimized. In order to ensure required data movement for dependencies across basic blocks, we propose a data flow analysis and edge splitting strategy. Thus our compiler automatically handles composition of kernels, mapping of kernels to CPU and GPU, scheduling and insertion of required data transfer. The proposed compiler was implemented and experimental evaluation using a set of MATLAB benchmarks shows that our approach achieves a geometric mean speedup of 19.8X for data parallel benchmarks over native execution of MATLAB.

Runtime dependence computation and execution of loops on heterogeneous systems

Relevância:

40.00% 40.00%

Publicador:

Resumo:

GPUs have been used for parallel execution of DOALL loops. However, loops with indirect array references can potentially cause cross iteration dependences which are hard to detect using existing compilation techniques. Applications with such loops cannot easily use the GPU and hence do not benefit from the tremendous compute capabilities of GPUs. In this paper, we present an algorithm to compute at runtime the cross iteration dependences in such loops. The algorithm uses both the CPU and the GPU to compute the dependences. Specifically, it effectively uses the compute capabilities of the GPU to quickly collect the memory accesses performed by the iterations by executing the slice functions generated for the indirect array accesses. Using the dependence information, the loop iterations are levelized such that each level contains independent iterations which can be executed in parallel. Another interesting aspect of the proposed solution is that it pipelines the dependence computation of the future level with the actual computation of the current level to effectively utilize the resources available in the GPU. We use NVIDIA Tesla C2070 to evaluate our implementation using benchmarks from Polybench suite and some synthetic benchmarks. Our experiments show that the proposed technique can achieve an average speedup of 6.4x on loops with a reasonable number of cross iteration dependences.

CUDA-for-clusters: a system for efficient execution of CUDA kernels on multi-core clusters

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Rapid advancements in multi-core processor architectures coupled with low-cost, low-latency, high-bandwidth interconnects have made clusters of multi-core machines a common computing resource. Unfortunately, writing good parallel programs that efficiently utilize all the resources in such a cluster is still a major challenge. Various programming languages have been proposed as a solution to this problem, but are yet to be adopted widely to run performance-critical code mainly due to the relatively immature software framework and the effort involved in re-writing existing code in the new language. In this paper, we motivate and describe our initial study in exploring CUDA as a programming language for a cluster of multi-cores. We develop CUDA-For-Clusters (CFC), a framework that transparently orchestrates execution of CUDA kernels on a cluster of multi-core machines. The well-structured nature of a CUDA kernel, the growing popularity, support and stability of the CUDA software stack collectively make CUDA a good candidate to be considered as a programming language for a cluster. CFC uses a mixture of source-to-source compiler transformations, a work distribution runtime and a light-weight software distributed shared memory to manage parallel executions. Initial results on running several standard CUDA benchmark programs achieve impressive speedups of up to 7.5X on a cluster with 8 nodes, thereby opening up an interesting direction of research for further investigation.

Implementation of Art1 and Art2 Artificial Neural Networks on Ring and Mesh Architectures

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The Artificial Neural Networks (ANNs) are being used to solve a variety of problems in pattern recognition, robotic control, VLSI CAD and other areas. In most of these applications, a speedy response from the ANNs is imperative. However, ANNs comprise a large number of artificial neurons, and a massive interconnection network among them. Hence, implementation of these ANNs involves execution of computer-intensive operations. The usage of multiprocessor systems therefore becomes necessary. In this article, we have presented the implementation of ART1 and ART2 ANNs on ring and mesh architectures. The overall system design and implementation aspects are presented. The performance of the algorithm on ring, 2-dimensional mesh and n-dimensional mesh topologies is presented. The parallel algorithm presented for implementation of ART1 is not specific to any particular architecture. The parallel algorithm for ARTE is more suitable for a ring architecture.

«
1
2
3
4
»