143 resultados para Parallel Architectures


Relevância:

20.00% 20.00%

Publicador:

Resumo:

This paper deals with the development of a new model for the cooling process on the runout table of hot strip mills, The suitability of different numerical methods for the solution of the proposed model equation from the point of view of accuracy and computation time are studied, Parallel solutions for the model equation are proposed.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The use of delayed coefficient adaptation in the least mean square (LMS) algorithm has enabled the design of pipelined architectures for real-time transversal adaptive filtering. However, the convergence speed of this delayed LMS (DLMS) algorithm, when compared with that of the standard LMS algorithm, is degraded and worsens with increase in the adaptation delay. Existing pipelined DLMS architectures have large adaptation delay and hence degraded convergence speed. We in this paper, first present a pipelined DLMS architecture with minimal adaptation delay for any given sampling rate. The architecture is synthesized by using a number of function preserving transformations on the signal flow graph representation of the DLMS algorithm. With the use of carry-save arithmetic, the pipelined architecture can support high sampling rates, limited only by the delay of a full adder and a 2-to-1 multiplexer. In the second part of this paper, we extend the synthesis methodology described in the first part, to synthesize pipelined DLMS architectures whose power dissipation meets a specified budget. This low-power architecture exploits the parallelism in the DLMS algorithm to meet the required computational throughput. The architecture exhibits a novel tradeoff between algorithmic performance (convergence speed) and power dissipation. (C) 1999 Elsevier Science B.V. All rights resented.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This paper discusses the parallel implementation of the solution of a set of linear equations using the Alternative Quadrant Interlocking Factorisation Methods (AQIF), on a star topology. Both the AQIF and LU decomposition methods are mapped onto star topology on an IBM SP2 system, with MPI as the internode communicator. Performance parameters such as speedup, efficiency have been obtained through experimental and theoretical means. The studies demonstrate (i) a mismatch of 15% between the theoretical and experimental results, (ii) scalability of the AQIF algorithm, and (iii) faster executing AQIF algorithm.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Parallel execution of computational mechanics codes requires efficient mesh-partitioning techniques. These mesh-partitioning techniques divide the mesh into specified number of submeshes of approximately the same size and at the same time, minimise the interface nodes of the submeshes. This paper describes a new mesh partitioning technique, employing Genetic Algorithms. The proposed algorithm operates on the deduced graph (dual or nodal graph) of the given finite element mesh rather than directly on the mesh itself. The algorithm works by first constructing a coarse graph approximation using an automatic graph coarsening method. The coarse graph is partitioned and the results are interpolated onto the original graph to initialise an optimisation of the graph partition problem. In practice, hierarchy of (usually more than two) graphs are used to obtain the final graph partition. The proposed partitioning algorithm is applied to graphs derived from unstructured finite element meshes describing practical engineering problems and also several example graphs related to finite element meshes given in the literature. The test results indicate that the proposed GA based graph partitioning algorithm generates high quality partitions and are superior to spectral and multilevel graph partitioning algorithms.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

In this paper, we present a differential-geometric approach to analyze the singularities of task space point trajectories of two and three-degree-of-freedom serial and parallel manipulators. At non-singular configurations, the first-order, local properties are characterized by metric coefficients, and, geometrically, by the shape and size of a velocity ellipse or an ellipsoid. At singular configurations, the determinant of the matrix of metric coefficients is zero and the velocity ellipsoid degenerates to an ellipse, a line or a point, and the area or the volume of the velocity ellipse or ellipsoid becomes zero. The degeneracies of the velocity ellipsoid or ellipse gives a simple geometric picture of the possible task space velocities at a singular configuration. To study the second-order properties at a singularity, we use the derivatives of the metric coefficients and the rate of change of area or volume. The derivatives are shown to be related to the possible task space accelerations at a singular configuration. In the case of parallel manipulators, singularities may lead to either loss or gain of one or more degrees-of-freedom. For loss of one or more degrees-of-freedom, ther possible velocities and accelerations are again obtained from a modified metric and derivatives of the metric coefficients. In the case of a gain of one or more degrees-of-freedom, the possible task space velocities can be pictured as growth to lines, ellipses, and ellipsoids. The theoretical results are illustrated with the help of a general spatial 2R manipulator and a three-degree-of-freedom RPSSPR-SPR parallel manipulator.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

In this paper we consider the problem of scheduling expression trees on delayed-load architectures. The problem tackled here takes root from the one considered in [Proceedings of the ACM SIGPLAN '91 Conf. on Programming Language Design and Implementation, 1991. p. 256] in which the leaves of the expression trees all refer to memory locations. A generalization of this involves the situation in which the trees may contain register variables, with the registers being used only at the leaves. Solutions to this generalization are given in [ACM Trans. Prog. Lang. Syst. 17 (1995) 740, Microproc. Microprog. 40 (1994) 577]. This paper considers the most general case in which the registers are reusable. This problem is tackled in [Comput. Lang, 21 (1995) 49] which gives an approximate solution to the problem under certain assumptions about the contiguity of the evaluation order: Here we propose an optimal solution (which may involve even a non-contiguous evaluation of the tree). The schedule generated by the algorithm given in this paper is optimal in the sense that it is an interlock-free schedule which uses the minimum number of registers required. An extension to the algorithm incorporates spilling. The problem as stated in this paper is an instruction scheduling problem. However, the problem could also be rephrased as an operations research problem with a difference in terminology. (C) 2002 Elsevier Science B.V. All rights reserved.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

In this paper a new parallel algorithm for nonlinear transient dynamic analysis of large structures has been presented. An unconditionally stable Newmark-beta method (constant average acceleration technique) has been employed for time integration. The proposed parallel algorithm has been devised within the broad framework of domain decomposition techniques. However, unlike most of the existing parallel algorithms (devised for structural dynamic applications) which are basically derived using nonoverlapped domains, the proposed algorithm uses overlapped domains. The parallel overlapped domain decomposition algorithm proposed in this paper has been formulated by splitting the mass, damping and stiffness matrices arises out of finite element discretisation of a given structure. A predictor-corrector scheme has been formulated for iteratively improving the solution in each step. A computer program based on the proposed algorithm has been developed and implemented with message passing interface as software development environment. PARAM-10000 MIMD parallel computer has been used to evaluate the performances. Numerical experiments have been conducted to validate as well as to evaluate the performance of the proposed parallel algorithm. Comparisons have been made with the conventional nonoverlapped domain decomposition algorithms. Numerical studies indicate that the proposed algorithm is superior in performance to the conventional domain decomposition algorithms. (C) 2003 Elsevier Ltd. All rights reserved.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Very Long Instruction Word (VLIW) architectures exploit instruction level parallelism (ILP) with the help of the compiler to achieve higher instruction throughput with minimal hardware. However, control and data dependencies between operations limit the available ILP, which not only hinders the scalability of VLIW architectures, but also result in code size expansion. Although speculation and predicated execution mitigate ILP limitations due to control dependencies to a certain extent, they increase hardware cost and exacerbate code size expansion. Simultaneous multistreaming (SMS) can significantly improve operation throughput by allowing interleaved execution of operations from multiple instruction streams. In this paper we study SMS for VLIW architectures and quantify the benefits associated with it using a case study of the MPEG-2 video decoder. We also propose the notion of virtual resources for VLIW architectures, which decouple architectural resources (resources exposed to the compiler) from the microarchitectural resources, to limit code size expansion. Our results for a VLIW architecture demonstrate that: (1) SMS delivers much higher throughput than that achieved by speculation and predicated execution, (2) the increase in performance due to the addition of speculation and predicated execution support over SMS averages around 12%. The minor increase in performance might not warrant the additional hardware complexity involved, and (3) the notion of virtual resources is very effective in reducing no-operations (NOPs) and consequently reduce code size with little or no impact on performance.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Protein folding and unfolding are complex phenomena, and it is accepted that multidomain proteins generally follow multiple pathways. Maltose-binding protein (MBP) is a large (a two-domain, 370-amino acid residue) bacterial periplasmic protein involved in maltose uptake. Despite the large size, it has been shown to exhibit an apparent two-state equilibrium unfolding in bulk experiments. Single-molecule studies can uncover rare events that are masked by averaging in bulk studies. Here, we use single-molecule force spectroscopy to study the mechanical unfolding pathways of MBP and its precursor protein (preMBP) in the presence and absence of ligands. Our results show that MBP exhibits kinetic partitioning on mechanical stretching and unfolds via two parallel pathways: one of them involves a mechanically stable intermediate (path I) whereas the other is devoid of it (path II). The apoMBP unfolds via path I in 62% of the mechanical unfolding events, and the remaining 38% follow path II. In the case of maltose-bound MBP, the protein unfolds via the intermediate in 79% of the cases, the remaining 21% via path II. Similarly, on binding to maltotriose, a ligand whose binding strength with the polyprotein is similar to that of maltose, the occurrence of the intermediate is comparable (82% via path I) with that of maltose. The precursor protein preMBP also shows a similar behavior upon mechanical unfolding. The percentages of molecules unfolding via path I are 53% in the apo form and 68% and 72% upon binding to maltose and maltotriose, respectively, for preMBP. These observations demonstrate that ligand binding can modulate the mechanical unfolding pathways of proteins by a kinetic partitioning mechanism. This could be a general mechanism in the unfolding of other large two-domain ligand-binding proteins of the bacterial periplasmic space.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

As computational Grids are increasingly used for executing long running multi-phase parallel applications, it is important to develop efficient rescheduling frameworks that adapt application execution in response to resource and application dynamics. In this paper, three strategies or algorithms have been developed for deciding when and where to reschedule parallel applications that execute on multi-cluster Grids. The algorithms derive rescheduling plans that consist of potential points in application execution for rescheduling and schedules of resources for application execution between two consecutive rescheduling points. Using large number of simulations, it is shown that the rescheduling plans developed by the algorithms can lead to large decrease in application execution times when compared to executions without rescheduling on dynamic Grid resources. The rescheduling plans generated by the algorithms are also shown to be competitive when compared to the near-optimal plans generated by brute-force methods. Of the algorithms, genetic algorithm yielded the most efficient rescheduling plans with 9-12% smaller average execution times than the other algorithms.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Frequent accesses to the register file make it one of the major sources of energy consumption in ILP architectures. The large number of functional units connected to a large unified register file in VLIW architectures make power dissipation in the register file even worse because of the need for a large number of ports. High power dissipation in a relatively smaller area occupied by a register file leads to a high power density in the register file and makes it one of the prime hot-spots. This makes it highly susceptible to the possibility of a catastrophic heatstroke. This in turn impacts the performance and cost because of the need for periodic cool down and sophisticated packaging and cooling techniques respectively. Clustered VLIW architectures partition the register file among clusters of functional units and reduce the number of ports required thereby reducing the power dissipation. However, we observe that the aggregate accesses to register files in clustered VLIW architectures (and associated energy consumption) become very high compared to the centralized VLIW architectures and this can be attributed to a large number of explicit inter-cluster communications. Snooping based clustered VLIW architectures provide very limited but very fast way of inter-cluster communication by allowing some of the functional units to directly read some of the operands from the register file of some of the other clusters. In this paper, we propose instruction scheduling algorithms that exploit the limited snooping capability to reduce the register file energy consumption on an average by 12% and 18% and improve the overall performance by 5% and 11% for a 2-clustered and a 4-clustered machine respectively, over an earlier state-of-the-art clustered scheduling algorithm when evaluated in the context of snooping based clustered VLIW architectures.