44 resultados para forced execution of obligations

em Indian Institute of Science - Bangalore - Índia


Relevância:

100.00% 100.00%

Publicador:

Resumo:

The StreamIt programming model has been proposed to exploit parallelism in streaming applications oil general purpose multicore architectures. The StreamIt graphs describe task, data and pipeline parallelism which can be exploited on accelerators such as Graphics Processing Units (GPUs) or CellBE which support abundant parallelism in hardware. In this paper, we describe a novel method to orchestrate the execution of if StreamIt program oil a multicore platform equipped with an accelerator. The proposed approach identifies, using profiling, the relative benefits of executing a task oil the superscalar CPU cores and the accelerator. We formulate the problem of partitioning the work between the CPU cores and the GPU, taking into account the latencies for data transfers and the required buffer layout transformations associated with the partitioning, as all integrated Integer Linear Program (ILP) which can then be solved by an ILP solver. We also propose an efficient heuristic algorithm for the work-partitioning between the CPU and the GPU, which provides solutions which are within 9.05% of the optimal solution on an average across the benchmark Suite. The partitioned tasks are then software pipelined to execute oil the multiple CPU cores and the Streaming Multiprocessors (SMs) of the GPU. The software pipelining algorithm orchestrates the execution between CPU cores and the GPU by emitting the code for the CPU and the GPU, and the code for the required data transfers. Our experiments on a platform with 8 CPU cores and a GeForce 8800 GTS 512 GPU show a geometric mean speedup of 6.94X with it maximum of 51.96X over it single threaded CPU execution across the StreamIt benchmarks. This is a 18.9% improvement over it partitioning strategy that maps only the filters that cannot be executed oil the GPU - the filters with state that is persistent across firings - onto the CPU.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The motion generated by forced oscillations in an incompressible inviscid rotating and/or stratified fluid is examined under linear theory taking the density variation on the inertia terms. The solution consists of numerous internal modes in addition to the mode which oscillates with forcing frequency. Resonance occurs when the forcing frequency is equal to one of the frequencies of the internal modes. Some of these modes grow linearly or exponentially with time rendering the motion unstable and eventually may lead to turbulence. Most of the results discussed here will be missed under Boussinesq approximation.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

A new mode of driven nonlinear vibrations of a stretched string is investigated with reference to conditions of existence, properties, and regions of stability. It is shown that this mode exhibits negative resistance properties at all frequencies and driving force amplitudes. Discovery of this mode helps to fill certain gaps in the theory of forced nonlinear vibrations of strings.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

In this paper the method of ultraspherical polynomial approximation is applied to study the steady-state response in forced oscillations of a third-order non-linear system. The non-linear function is expanded in ultraspherical polynomials and the expansion is restricted to the linear term. The equation for the response curve is obtained by using the linearized equation and the results are presented graphically. The agreement between the approximate solution and the analog computer solution is satisfactory. The problem of stability is not dealt with in this paper.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The StreamIt programming model has been proposed to exploit parallelism in streaming applications on general purpose multi-core architectures. This model allows programmers to specify the structure of a program as a set of filters that act upon data, and a set of communication channels between them. The StreamIt graphs describe task, data and pipeline parallelism which can be exploited on modern Graphics Processing Units (GPUs), as they support abundant parallelism in hardware. In this paper, we describe the challenges in mapping StreamIt to GPUs and propose an efficient technique to software pipeline the execution of stream programs on GPUs. We formulate this problem - both scheduling and assignment of filters to processors - as an efficient Integer Linear Program (ILP), which is then solved using ILP solvers. We also describe a novel buffer layout technique for GPUs which facilitates exploiting the high memory bandwidth available in GPUs. The proposed scheduling utilizes both the scalar units in GPU, to exploit data parallelism, and multiprocessors, to exploit task and pipelin parallelism. Further it takes into consideration the synchronization and bandwidth limitations of GPUs, and yields speedups between 1.87X and 36.83X over a single threaded CPU.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Implementation details of efficient schemes for lenient execution and concurrent execution of re-entrant routines in a data flow model have been discussed in this paper. The proposed schemes require no extra hardware support and utilise the existing hardware resources such as the Matching Unit and Memory Network Interface, effectively to achieve the above mentioned goals.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Laminar forced convection of nanofluids in a vertical channel with symmetrically mounted rib heaters on surfaces of opposite walls is numerically studied. The fluid flow and heat transfer characteristics are examined for various Reynolds numbers and nanoparticles volume fractions of water-Al2O3 nanofluid. The flow exhibits various structures with varying Reynolds number. Even though the geometry and heating is symmetric with respect to a channel vertical mid-plane, asymmetric flow and heat transfer are found for Reynolds number greater than a critical value. Introduction of nanofluids in the base fluid delays the flow solution bifurcation point, and the critical Reynolds number increases with increasing nanoparticle volume fraction. A skin friction coefficient along the solid-fluid interfaces increases and decreases sharply along the bottom and top faces of the heaters, respectively, due to sudden acceleration and deceleration of the fluid at the respective faces. The skin friction coefficient, as well as Nusselt numbers in the channel, increase with increasing volume fraction of nanoparticles.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

GPUs have been used for parallel execution of DOALL loops. However, loops with indirect array references can potentially cause cross iteration dependences which are hard to detect using existing compilation techniques. Applications with such loops cannot easily use the GPU and hence do not benefit from the tremendous compute capabilities of GPUs. In this paper, we present an algorithm to compute at runtime the cross iteration dependences in such loops. The algorithm uses both the CPU and the GPU to compute the dependences. Specifically, it effectively uses the compute capabilities of the GPU to quickly collect the memory accesses performed by the iterations by executing the slice functions generated for the indirect array accesses. Using the dependence information, the loop iterations are levelized such that each level contains independent iterations which can be executed in parallel. Another interesting aspect of the proposed solution is that it pipelines the dependence computation of the future level with the actual computation of the current level to effectively utilize the resources available in the GPU. We use NVIDIA Tesla C2070 to evaluate our implementation using benchmarks from Polybench suite and some synthetic benchmarks. Our experiments show that the proposed technique can achieve an average speedup of 6.4x on loops with a reasonable number of cross iteration dependences.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Rapid advancements in multi-core processor architectures coupled with low-cost, low-latency, high-bandwidth interconnects have made clusters of multi-core machines a common computing resource. Unfortunately, writing good parallel programs that efficiently utilize all the resources in such a cluster is still a major challenge. Various programming languages have been proposed as a solution to this problem, but are yet to be adopted widely to run performance-critical code mainly due to the relatively immature software framework and the effort involved in re-writing existing code in the new language. In this paper, we motivate and describe our initial study in exploring CUDA as a programming language for a cluster of multi-cores. We develop CUDA-For-Clusters (CFC), a framework that transparently orchestrates execution of CUDA kernels on a cluster of multi-core machines. The well-structured nature of a CUDA kernel, the growing popularity, support and stability of the CUDA software stack collectively make CUDA a good candidate to be considered as a programming language for a cluster. CFC uses a mixture of source-to-source compiler transformations, a work distribution runtime and a light-weight software distributed shared memory to manage parallel executions. Initial results on running several standard CUDA benchmark programs achieve impressive speedups of up to 7.5X on a cluster with 8 nodes, thereby opening up an interesting direction of research for further investigation.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

MATLAB is an array language, initially popular for rapid prototyping, but is now being increasingly used to develop production code for numerical and scientific applications. Typical MATLAB programs have abundant data parallelism. These programs also have control flow dominated scalar regions that have an impact on the program's execution time. Today's computer systems have tremendous computing power in the form of traditional CPU cores and throughput oriented accelerators such as graphics processing units(GPUs). Thus, an approach that maps the control flow dominated regions to the CPU and the data parallel regions to the GPU can significantly improve program performance. In this paper, we present the design and implementation of MEGHA, a compiler that automatically compiles MATLAB programs to enable synergistic execution on heterogeneous processors. Our solution is fully automated and does not require programmer input for identifying data parallel regions. We propose a set of compiler optimizations tailored for MATLAB. Our compiler identifies data parallel regions of the program and composes them into kernels. The problem of combining statements into kernels is formulated as a constrained graph clustering problem. Heuristics are presented to map identified kernels to either the CPU or GPU so that kernel execution on the CPU and the GPU happens synergistically and the amount of data transfer needed is minimized. In order to ensure required data movement for dependencies across basic blocks, we propose a data flow analysis and edge splitting strategy. Thus our compiler automatically handles composition of kernels, mapping of kernels to CPU and GPU, scheduling and insertion of required data transfer. The proposed compiler was implemented and experimental evaluation using a set of MATLAB benchmarks shows that our approach achieves a geometric mean speedup of 19.8X for data parallel benchmarks over native execution of MATLAB.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

MATLAB is an array language, initially popular for rapid prototyping, but is now being increasingly used to develop production code for numerical and scientific applications. Typical MATLAB programs have abundant data parallelism. These programs also have control flow dominated scalar regions that have an impact on the program's execution time. Today's computer systems have tremendous computing power in the form of traditional CPU cores and throughput oriented accelerators such as graphics processing units(GPUs). Thus, an approach that maps the control flow dominated regions to the CPU and the data parallel regions to the GPU can significantly improve program performance. In this paper, we present the design and implementation of MEGHA, a compiler that automatically compiles MATLAB programs to enable synergistic execution on heterogeneous processors. Our solution is fully automated and does not require programmer input for identifying data parallel regions. We propose a set of compiler optimizations tailored for MATLAB. Our compiler identifies data parallel regions of the program and composes them into kernels. The problem of combining statements into kernels is formulated as a constrained graph clustering problem. Heuristics are presented to map identified kernels to either the CPU or GPU so that kernel execution on the CPU and the GPU happens synergistically and the amount of data transfer needed is minimized. In order to ensure required data movement for dependencies across basic blocks, we propose a data flow analysis and edge splitting strategy. Thus our compiler automatically handles composition of kernels, mapping of kernels to CPU and GPU, scheduling and insertion of required data transfer. The proposed compiler was implemented and experimental evaluation using a set of MATLAB benchmarks shows that our approach achieves a geometric mean speedup of 19.8X for data parallel benchmarks over native execution of MATLAB.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The Artificial Neural Networks (ANNs) are being used to solve a variety of problems in pattern recognition, robotic control, VLSI CAD and other areas. In most of these applications, a speedy response from the ANNs is imperative. However, ANNs comprise a large number of artificial neurons, and a massive interconnection network among them. Hence, implementation of these ANNs involves execution of computer-intensive operations. The usage of multiprocessor systems therefore becomes necessary. In this article, we have presented the implementation of ART1 and ART2 ANNs on ring and mesh architectures. The overall system design and implementation aspects are presented. The performance of the algorithm on ring, 2-dimensional mesh and n-dimensional mesh topologies is presented. The parallel algorithm presented for implementation of ART1 is not specific to any particular architecture. The parallel algorithm for ARTE is more suitable for a ring architecture.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The frequency-dependent response of a pinned charge density wave is considered in terms of forced vibration of an oscillator held in an anharmonic well. It is shown that the effective pinning-frequency can be reduced by applying a d.c. field. If a strong a.c. field, superposed on a d.c. field is applied on such a system “jumps” can be observed in the frequency dependent response of the system. The conditions at which these “jumps” occur are investigated with reference to NbSe3. The possibility of observing such phenomena in other systems like superionic conductors, non-linear dielectrics like ferroelectrics is pointed out. The characteristics are expressed in terms of some “scaled variables” — in terms of which the characteristics show a universal behaviour.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The frequency-dependent response of a pinned charge density wave is considered in terms of forced vibration of an oscillator held in an anharmonic well. It is shown that the effective pinning-frequency can be reduced by applying a d.c. field. If a strong a.c. field, superposed on a d.c. field is applied on such a system “jumps” can be observed in the frequency dependent response of the system. The conditions at which these “jumps” occur are investigated with reference to NbSe3. The possibility of observing such phenomena in other systems like superionic conductors, non-linear dielectrics like ferroelectrics is pointed out. The characteristics are expressed in terms of some “scaled variables” — in terms of which the characteristics show a universal behaviour