34 results for Parallelism

at Indian Institute of Science - Bangalore - India


Relevance:

20.00% 20.00%

Publisher:

Abstract:

The cytokinins (benzyladenine or benzyladenosine) decreased spermidine and spermine contents despite increasing putrescine content, when administered to isolated cotyledons of Cucumis sativus L. var. Guntur in organ culture. KCl decreased putrescine contents, although marginally increasing polyamine contents. The cytokinins and/or KCl augmented nucleic acid biosynthesis and accumulation, resulting in enhanced growth and differentiation of the isolated cotyledons. These observations show that polyamine accumulation and growth are not always coupled.

Relevance:

20.00% 20.00%

Publisher:

Abstract:

In this paper, we introduce an analytical technique based on queueing networks and Petri nets for the performance analysis of dataflow computations executed on the Manchester machine. This technique is also applicable to the analysis of parallel computations on multiprocessors. We characterize the parallelism in dataflow computations through a four-parameter characterization, namely, the minimum parallelism, the maximum parallelism, the average parallelism and the variance in parallelism. We observe through detailed investigation of our analytical models that the average parallelism is a good characterization of the dataflow computations only as long as the variance in parallelism is small. However, significant differences in performance measures result when the variance in parallelism is comparable to or higher than the average parallelism.
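The four-parameter characterization named in this abstract can be illustrated with a minimal sketch. It assumes a hypothetical "parallelism profile" (the number of operations ready to execute at each time step); the function name and the sample profiles are illustrative only and are not taken from the paper.

```python
from statistics import mean, pvariance


def characterize_parallelism(profile):
    """Return (minimum, maximum, average, variance) of a parallelism profile.

    profile[t] is assumed to be the number of operations that are ready
    to execute at time step t (a hypothetical input format).
    """
    if not profile:
        raise ValueError("profile must be non-empty")
    return min(profile), max(profile), mean(profile), pvariance(profile)


# Two made-up profiles with the same average parallelism.
steady = [4, 4, 4, 4, 4, 4]    # no variation around the average
bursty = [1, 1, 1, 1, 1, 19]   # same average, very different shape

print(characterize_parallelism(steady))
print(characterize_parallelism(bursty))
```

Both profiles have the same average, yet the second has a variance well above that average, which is the situation in which the abstract says the average alone stops being a good characterization.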

Relevance:

20.00% 20.00%

Publisher:

Abstract:

A new breed of processors, such as the Cell Broadband Engine, the Imagine stream processor and the various GPU processors, emphasizes data-level parallelism (DLP) and thread-level parallelism (TLP) as opposed to traditional instruction-level parallelism (ILP). This allows them to achieve order-of-magnitude improvements over conventional superscalar processors for many workloads. However, it is unclear how much parallelism of these types exists in current programs. Most earlier studies have concentrated on the amount of ILP in a program, without differentiating DLP or TLP. In this study, we investigate the extent of data-level parallelism available in programs in the MediaBench suite. By packing instructions in a SIMD fashion, we observe reductions of up to 91% (84% on average) in the number of dynamic instructions, indicating a very high degree of DLP in several applications.
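The link between SIMD packing and dynamic instruction count can be made concrete with a toy model. This is only a sketch under stated assumptions, not the methodology of the study: it assumes the dynamic instruction stream has already been grouped into runs of identical, mutually independent scalar operations, and that a run of length n packs into ceil(n / width) SIMD instructions. The function names, the run lengths and the SIMD width are all hypothetical.

```python
from math import ceil


def packed_instruction_count(run_lengths, simd_width):
    """Instructions left after packing each run into SIMD operations.

    run_lengths: sizes of groups of identical, independent scalar
    operations (an assumed, simplified input format).
    """
    return sum(ceil(n / simd_width) for n in run_lengths)


def dynamic_instruction_reduction(run_lengths, simd_width):
    """Fractional reduction in dynamic instructions due to packing."""
    scalar = sum(run_lengths)
    packed = packed_instruction_count(run_lengths, simd_width)
    return 1 - packed / scalar


# Made-up example: 25 scalar operations in three runs, 4-wide SIMD.
runs = [16, 8, 1]
print(packed_instruction_count(runs, 4))        # 4 + 2 + 1 packed instructions
print(dynamic_instruction_reduction(runs, 4))
```

In this model the reduction is bounded by 1 - 1/width, so a reduction as large as the reported 91% would require wide packing and long runs of packable operations.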

Relevance:

20.00% 20.00%

Publisher:

Abstract:

Most stencil computations allow tile-wise concurrent start, i.e., there always exists a face of the iteration space and a set of tiling hyperplanes such that all tiles along that face can be started concurrently. This provides load balance and maximizes parallelism. However, existing automatic tiling frameworks often choose hyperplanes that lead to pipelined start-up and load imbalance. We address this issue with a new tiling technique that ensures concurrent start-up as well as perfect load-balance whenever possible. We first provide necessary and sufficient conditions on tiling hyperplanes to enable concurrent start for programs with affine data accesses. We then provide an approach to find such hyperplanes. Experimental evaluation on a 12-core Intel Westmere shows that our code is able to outperform a tuned domain-specific stencil code generator by 4% to 27%, and previous compiler techniques by a factor of 2x to 10.14x.

Relevance:

20.00% 20.00%

Publisher:

Abstract:

Affine transformations have proven to be very powerful for loop restructuring due to their ability to model a very wide range of transformations. A single multi-dimensional affine function can represent a long and complex sequence of simpler transformations. Existing affine transformation frameworks like the Pluto algorithm, which include a cost function for modern multicore architectures where coarse-grained parallelism and locality are crucial, consider only a sub-space of transformations to avoid a combinatorial explosion in finding the transformations. The ensuing practical tradeoffs lead to the exclusion of certain useful transformations, in particular, transformation compositions involving loop reversals and loop skewing by negative factors. In this paper, we propose an approach to address this limitation by modeling a much larger space of affine transformations in conjunction with the Pluto algorithm's cost function. We perform an experimental evaluation of both the effect on compilation time and the performance of the generated code. The evaluation shows that our new framework, Pluto+, provides no degradation in performance in any of the Polybench benchmarks. For Lattice Boltzmann Method (LBM) codes with periodic boundary conditions, it provides a mean speedup of 1.33x over Pluto. We also show that Pluto+ does not increase compile times significantly. Experimental results on Polybench show that Pluto+ increases overall polyhedral source-to-source optimization time only by 15%. In cases where it improves execution time significantly, it increases polyhedral optimization time only by 2.04x.
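Loop reversal and skewing by a negative factor, the transformations this abstract singles out, can be written as small integer matrices acting on iteration vectors. The sketch below uses the classical legality test for constant dependence distance vectors (every transformed distance must remain lexicographically positive). It is a textbook illustration, not the Pluto or Pluto+ formulation; the matrices, dependences and function names are assumptions chosen for the example.

```python
def apply(transform, vector):
    """Multiply an integer matrix (list of rows) by a vector."""
    return tuple(sum(a * b for a, b in zip(row, vector)) for row in transform)


def lex_positive(vector):
    """True if the first non-zero component is positive."""
    for x in vector:
        if x > 0:
            return True
        if x < 0:
            return False
    return False  # the all-zero vector is not lexicographically positive


def is_legal(transform, distance_vectors):
    """Classical test: all transformed distances stay lexicographically positive."""
    return all(lex_positive(apply(transform, d)) for d in distance_vectors)


# Hypothetical 2-D loop nest (i, j).
REVERSE_INNER = [[1, 0], [0, -1]]   # (i, j) -> (i, -j): reverse the j loop
NEGATIVE_SKEW = [[1, 0], [-1, 1]]   # (i, j) -> (i, j - i): skew j by factor -1

print(is_legal(REVERSE_INNER, [(1, -1)]))  # distance becomes (1, 1)
print(is_legal(REVERSE_INNER, [(0, 1)]))   # distance becomes (0, -1)
print(is_legal(NEGATIVE_SKEW, [(1, -1)]))  # distance becomes (1, -2)
```

Only the linear part of the affine function matters here, because a constant offset cancels out when two iteration vectors are subtracted to form a distance.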

Relevance:

10.00% 10.00%

Publisher:

Abstract:

Multiprocessor systems which afford a high degree of parallelism are used in a variety of applications. The extremely stringent reliability requirement has made the provision of fault-tolerance an important aspect in the design of such systems. This paper presents a review of the various approaches towards tolerating hardware faults in multiprocessor systems. It emphasizes the basic concepts of fault tolerant design and the various problems to be taken care of by the designer. An in-depth survey of the various models, techniques and methods for fault diagnosis is given. Further, we consider the strategies for fault-tolerance in specialized multiprocessor architectures which have the ability of dynamic reconfiguration and are suited to VLSI implementation. An analysis of the state-of-the-art is given which points out the major aspects of fault-tolerance in such architectures.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

The StreamIt programming model has been proposed to exploit parallelism in streaming applications on general purpose multicore architectures. The StreamIt graphs describe task, data and pipeline parallelism which can be exploited on accelerators such as Graphics Processing Units (GPUs) or CellBE which support abundant parallelism in hardware. In this paper, we describe a novel method to orchestrate the execution of a StreamIt program on a multicore platform equipped with an accelerator. The proposed approach identifies, using profiling, the relative benefits of executing a task on the superscalar CPU cores and the accelerator. We formulate the problem of partitioning the work between the CPU cores and the GPU, taking into account the latencies for data transfers and the required buffer layout transformations associated with the partitioning, as an integrated Integer Linear Program (ILP) which can then be solved by an ILP solver. We also propose an efficient heuristic algorithm for the work-partitioning between the CPU and the GPU, which provides solutions that are within 9.05% of the optimal solution on average across the benchmark suite. The partitioned tasks are then software pipelined to execute on the multiple CPU cores and the Streaming Multiprocessors (SMs) of the GPU. The software pipelining algorithm orchestrates the execution between CPU cores and the GPU by emitting the code for the CPU and the GPU, and the code for the required data transfers. Our experiments on a platform with 8 CPU cores and a GeForce 8800 GTS 512 GPU show a geometric mean speedup of 6.94X with a maximum of 51.96X over a single-threaded CPU execution across the StreamIt benchmarks. This is an 18.9% improvement over a partitioning strategy that maps only the filters that cannot be executed on the GPU (the filters with state that is persistent across firings) onto the CPU.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

This paper presents an inverse dynamic formulation by the Newton–Euler approach for the Stewart platform manipulator of the most general architecture and models all the dynamic and gravity effects as well as the viscous friction at the joints. It is shown that a proper elimination procedure results in a remarkably economical and fast algorithm for the solution of actuator forces, which makes the method quite suitable for on-line control purposes. In addition, the parallelism inherent in the manipulator and in the modelling makes the algorithm quite efficient in a parallel computing environment, where it can be made as fast as the corresponding formulation for the 6-dof serial manipulator. The formulation has been implemented in a program and has been used for a few trajectories planned for a test manipulator. Results of simulation presented in the paper reveal the nature of the variation of actuator forces in the Stewart platform and justify the dynamic modelling for control.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

Previous studies have shown that buffering packets in DRAM is a performance bottleneck. In order to understand the impediments in accessing the DRAM, we developed a detailed Petri net model of an IP forwarding application on the IXP2400 that models the different levels of the memory hierarchy. The cell-based interface used to receive and transmit packets in a network processor leads to some small DRAM accesses. Such narrow accesses to the DRAM expose the bank access latency, reducing the bandwidth that can be realized. With real traces, up to 30% of the accesses are smaller than the cell size, resulting in a 7.7% reduction in DRAM bandwidth. To overcome this problem, we propose buffering these small chunks of data in the on-chip scratchpad memory. This scheme also exploits a greater degree of parallelism between different levels of the memory hierarchy. Using real traces from the internet, we show that the transmit rate can be improved by an average of 21% over the base scheme without the use of additional hardware. Further, the impact of different traffic patterns on the network processor resources is studied. Under real traffic conditions, we show that the data bus which connects the off-chip packet buffer to the micro-engines is the obstacle to achieving higher throughput.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

The StreamIt programming model has been proposed to exploit parallelism in streaming applications on general purpose multi-core architectures. This model allows programmers to specify the structure of a program as a set of filters that act upon data, and a set of communication channels between them. The StreamIt graphs describe task, data and pipeline parallelism which can be exploited on modern Graphics Processing Units (GPUs), as they support abundant parallelism in hardware. In this paper, we describe the challenges in mapping StreamIt to GPUs and propose an efficient technique to software pipeline the execution of stream programs on GPUs. We formulate this problem, both the scheduling and the assignment of filters to processors, as an efficient Integer Linear Program (ILP), which is then solved using ILP solvers. We also describe a novel buffer layout technique for GPUs which facilitates exploiting the high memory bandwidth available in GPUs. The proposed scheduling utilizes both the scalar units in the GPU, to exploit data parallelism, and the multiprocessors, to exploit task and pipeline parallelism. Further, it takes into consideration the synchronization and bandwidth limitations of GPUs, and yields speedups between 1.87X and 36.83X over a single-threaded CPU.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

REDEFINE is a reconfigurable SoC architecture that provides a unique platform for high-performance, low-power computing by exploiting the synergistic interaction between a coarse-grain dynamic dataflow model of computation (to expose abundant parallelism in applications) and runtime composition of efficient compute structures (on the reconfigurable computation resources). We propose and study the throttling of execution in REDEFINE to maximize architectural efficiency. A feature-specific, fast hybrid (mixed-level) simulation framework for early-design-phase study is developed and implemented to make exploration of the huge design space practical. We model performance by selecting important performance criteria and ranking the explored throttling schemes, and we investigate the effectiveness of the design space exploration using statistical hypothesis testing. We find throttling schemes that give an appreciable (24.8%) overall performance gain in the architecture and, simultaneously, a 37% resource usage gain in the throttling unit.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

Administration of 3,5-diethoxy carbonyl-1,4-dihydrocollidine (DDC) to mice resulted in a striking increase in the level of δ-aminolevulinic acid (ALA) synthetase in liver. Although the enzyme activity was primarily localized in mitochondria and postmicrosomal supernatant fluid, a significant level of activity was also detected in purified nuclei. The time course of induction showed a close parallelism between the bound and free enzyme activities, with the former always accounting for a higher percentage of the total activity than the latter. Studies with cycloheximide indicated a half-life of around 3 hr for both the bound and free ALA synthetase. Actinomycin D and hemin prevented enzyme induction when administered along with DDC. When administered 12 hr after DDC treatment, however, Actinomycin D did not lead to a decay of either the bound or free enzyme activity, and hemin inhibited the bound enzyme activity but not the free enzyme level. The molecular sizes of the mitochondrial and cytosolic ALA synthetase(s) were found to be similar on Sephadex columns.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

Massively parallel SIMD computing is applied to obtain an order-of-magnitude improvement in the execution speed of an important algorithm in VLSI design automation. The physical design of a VLSI circuit involves logic module placement as a subtask. The paper is concerned with accelerating the well-known Min-cut placement technique for logic cell placement. The inherent parallelism of the Min-cut algorithm is identified, and it is shown that a parallel machine based on the efficient execution of the placement procedure.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

Conventional random access scan (RAS) for testing has lower test application time, lower power dissipation, and lower test data volume than standard serial scan chain based design. In this paper, we present two cluster-based techniques, namely, Serial Input Random Access Scan and Variable Word Length Random Access Scan, to reduce test application time even further by exploiting the parallelism among the clusters and performing write operations on multiple bits. Experimental results on benchmark circuits show, on average, a 2-3 times speedup in test write time and a 60% reduction in write test data volume.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

The development of a radioreceptor assay (RRA) that can measure serum LH in a variety of species and CG in sera and urine of pregnant women and monkeys is reported. Using sheep luteal membrane as the receptor source and I-125-labelled hLH/hCG as the tracer, dose-response (displacement) curves were obtained using hLH or hCG as standard. The addition of LH-free serum (200 µl per tube) had no effect on the standard displacement curve. The assay is simple, requires less than 90 min to complete and provides reproducible results. The sensitivity of the assay was 0.6 ng hLH per tube and the intra- and interassay variations were 9.6 and 9.8, respectively. Sera obtained from male and female bonnet monkeys (Macaca radiata) and monkey pituitary extract showed parallelism to the standard curve. The concentrations of LH measured correlated with the physiological status of the animals. Sera of rats, rabbits, hamsters, guinea-pigs, sheep and humans showed parallelism to the hLH standard curve, indicating the viability of the RRA for measuring serum LH of different species. Since the receptors recognize LH and CG, detection of pregnancy in monkeys and women was possible using this assay. The sensitivity of the assay for hCG was 8.7 mIU per tube. This RRA could be a convenient alternative to the Leydig cell bioassay for obtaining the LH bioactivity profile of sera and biological fluids.