135 results for Loaders (Machines)


Relevance:

20.00%

Publisher:

Abstract:

Multi-GPU machines are being increasingly used in high-performance computing. Each GPU in such a machine has its own memory and does not share the address space either with the host CPU or other GPUs. Hence, applications utilizing multiple GPUs have to manually allocate and manage data on each GPU. Existing works that propose to automate data allocations for GPUs have limitations and inefficiencies in terms of allocation sizes, exploiting reuse, transfer costs, and scalability. We propose a scalable and fully automatic data allocation and buffer management scheme for affine loop nests on multi-GPU machines. We call it the Bounding-Box-based Memory Manager (BBMM). BBMM can perform, at runtime, standard set operations like union, intersection, and difference, as well as find subset and superset relations, on hyperrectangular regions of array data (bounding boxes). It uses these operations along with some compiler assistance to identify, allocate, and manage data required by applications in terms of disjoint bounding boxes. This allows it to (1) allocate exactly or nearly as much data as is required by computations running on each GPU, (2) efficiently track buffer allocations and hence maximize data reuse across tiles and minimize data transfer overhead, and (3) as a result, maximize utilization of the combined memory on multi-GPU machines. BBMM can work with any choice of parallelizing transformations, computation placement, and scheduling schemes, whether static or dynamic. Experiments run on a four-GPU machine with various scientific programs showed that BBMM reduces data allocations on each GPU by up to 75% compared to current allocation schemes, yields performance of at least 88% of manually written code, and allows excellent weak scaling.
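The set operations on bounding boxes named in this abstract can be illustrated with a minimal Python sketch. It is not BBMM's implementation: the `Box` representation (inclusive index ranges per dimension), the function names, and the tile sizes are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Assumed representation: one inclusive (lo, hi) index range per array dimension.
Box = Tuple[Tuple[int, int], ...]


def intersect(a: Box, b: Box) -> Optional[Box]:
    """Return the overlap of two boxes, or None if they are disjoint."""
    out = []
    for (alo, ahi), (blo, bhi) in zip(a, b):
        lo, hi = max(alo, blo), min(ahi, bhi)
        if lo > hi:
            return None
        out.append((lo, hi))
    return tuple(out)


def is_subset(a: Box, b: Box) -> bool:
    """True if box a lies entirely inside box b."""
    return all(blo <= alo and ahi <= bhi for (alo, ahi), (blo, bhi) in zip(a, b))


def difference(a: Box, b: Box) -> List[Box]:
    """Return a minus b as a list of pairwise disjoint boxes."""
    inter = intersect(a, b)
    if inter is None:
        return [a]
    pieces = []
    rest = list(a)  # shrinks towards the intersection one dimension at a time
    for d, ((lo, hi), (ilo, ihi)) in enumerate(zip(a, inter)):
        if lo < ilo:
            pieces.append(tuple(rest[:d] + [(lo, ilo - 1)] + rest[d + 1:]))
        if ihi < hi:
            pieces.append(tuple(rest[:d] + [(ihi + 1, hi)] + rest[d + 1:]))
        rest[d] = (ilo, ihi)
    return pieces


def volume(a: Box) -> int:
    """Number of array elements covered by the box."""
    n = 1
    for lo, hi in a:
        n *= hi - lo + 1
    return n


# Two hypothetical 64x64 tiles that overlap in half of their rows.
tile_a: Box = ((0, 63), (0, 63))
tile_b: Box = ((32, 95), (0, 63))
shared = intersect(tile_a, tile_b)      # data already resident, reusable
only_a = difference(tile_a, tile_b)     # data needed by tile_a alone
```

In this sketch, a buffer holding `shared` would not need to be transferred again when the second tile runs, which is the kind of reuse the abstract describes.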

Relevance:

10.00%

Publisher:

Abstract:

Recently, efficient scheduling algorithms based on Lagrangian relaxation have been proposed for scheduling parallel machine systems and job shops. In this article, we develop real-world extensions to these scheduling methods. In the first part of the paper, we consider the problem of scheduling single-operation jobs on parallel identical machines and extend the methodology to handle multiple classes of jobs, taking into account setup times and setup costs. The proposed methodology uses Lagrangian relaxation and simulated annealing in a hybrid framework. In the second part of the paper, we consider a Lagrangian-relaxation-based method for scheduling job shops and extend it to obtain a scheduling methodology for a real-world flexible manufacturing system with centralized material handling.

Relevance:

10.00%

Publisher:

Abstract:

Proteins are polymerized by cyclic machines called ribosomes, which use their messenger RNA (mRNA) track also as the corresponding template; the process is called translation. We explore, in depth and detail, the stochastic nature of translation. We compute various distributions associated with the translation process; one of them, namely the dwell time distribution, has been measured in recent single-ribosome experiments. The form of the distribution that fits best with our simulation data is consistent with that extracted from the experimental data. For our computations, we use a model that captures both the mechanochemistry of each individual ribosome and their steric interactions. We also demonstrate the effects of the sequence inhomogeneities of real genes on the fluctuations and noise in translation. Finally, inspired by recent advances in the experimental techniques of manipulating single ribosomes, we make theoretical predictions on the force-velocity relation for individual ribosomes. In principle, all our predictions can be tested by carrying out in vitro experiments.

Relevance:

10.00%

Publisher:

Abstract:

Among the multitude of test specimen geometries used for dynamic fracture toughness evaluation, the most widely used is the Charpy specimen, due to its simple geometry and the availability of testing machines. The standard Charpy specimen dimensions may not always give plane strain conditions, and hence it may be necessary to conduct tests using specimens of different thicknesses to establish the plane strain K_Id. An axisymmetric specimen, on the other hand, would always give flow constraints for a nominal specimen thickness irrespective of the test material. Although the notched disk specimen proposed by Bernard et al. [1] for static and dynamic initiation toughness measurement provides plane-strain conditions, the crack propagates at an angle to the direction of applied load. This makes interpretation of the test results difficult, as it requires radial slices to be cut from the fractured specimen to ascertain the angle of crack growth and a finite element model for that particular crack orientation.

Relevance:

10.00%

Publisher:

Abstract:

High-end network security applications demand high-speed operation and large rule set support. Packet classification is the core functionality that demands high throughput in such applications. This paper proposes a packet classification architecture to meet such high throughput. We have implemented a Firewall with this architecture in reconfigurable hardware. We propose an extension to the Distributed Crossproducting of Field Labels (DCFL) technique to achieve a scalable and high-performance architecture. The implemented Firewall takes advantage of the inherent structure and redundancy of the rule set by using our DCFL Extended (DCFLE) algorithm. The use of the DCFLE algorithm results in both speed and area improvement when it is implemented in hardware. Although we restrict ourselves to standard 5-tuple matching, the architecture supports additional fields. High-throughput classification invariably uses Ternary Content Addressable Memory (TCAM) for prefix matching, though TCAM fares poorly in terms of area and power efficiency. Use of TCAM for port range matching is expensive, as the range-to-prefix conversion results in a large number of prefixes, leading to storage inefficiency. Extended TCAM (ETCAM) is fast and the most storage-efficient solution for range matching. We present for the first time a reconfigurable hardware implementation of ETCAM. We have implemented our Firewall as an embedded system on a Virtex-II Pro FPGA based platform, running Linux with the packet classification in hardware. The Firewall was tested in real time with a 1 Gbps Ethernet link and 128 sample rules. The packet classification hardware uses a quarter of the logic resources and slightly over one third of the memory resources of the XC2VP30 FPGA. It achieves a maximum classification throughput of 50 million packets/s, corresponding to a 16 Gbps link rate for the worst-case packet size. The Firewall rule update involves only memory re-initialization in software without any hardware change.
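The storage cost of range-to-prefix conversion mentioned in this abstract can be made concrete with a short Python sketch. It uses the standard greedy split of an integer range into aligned power-of-two blocks; the function name and the 16-bit default width (the size of a TCP/UDP port field) are assumptions for illustration, and the code is unrelated to the paper's ETCAM hardware.

```python
from typing import List, Tuple


def range_to_prefixes(lo: int, hi: int, width: int = 16) -> List[Tuple[int, int]]:
    """Split the inclusive range [lo, hi] into (value, prefix_length) pairs.

    Each pair stands for one ternary entry: the top prefix_length bits of
    value are matched exactly and the remaining bits are "don't care".
    """
    prefixes = []
    while lo <= hi:
        # Largest aligned block that can start at lo (lowest set bit of lo).
        size = lo & -lo if lo else 1 << width
        # Shrink the block until it no longer overshoots the end of the range.
        while size > hi - lo + 1:
            size >>= 1
        prefixes.append((lo, width - size.bit_length() + 1))
        lo += size
    return prefixes


# Ports 1024-65535 ("greater than 1023") need six entries.
unprivileged = range_to_prefixes(1024, 65535)
# The range 1-65534 is the worst case for a 16-bit field: 2 * 16 - 2 entries.
worst = range_to_prefixes(1, 65534)
```

Because a 5-tuple rule carries both a source and a destination port range, the two expansions multiply, so under this scheme a single rule could occupy up to 30 x 30 = 900 TCAM entries in the worst case.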

Relevance:

10.00%

Publisher:

Abstract:

This paper presents three methodologies for determining optimum locations and magnitudes of reactive power compensation in power distribution systems. Method I and Method II are suitable for complex distribution systems with a combination of both radial and ring-main feeders and having different voltage levels. Method III is suitable for low-tension, single-voltage-level radial feeders. Method I is based on an iterative scheme with successive power flow analyses, with formulation and solution of the optimization problem using linear programming. Method II and Method III are essentially based on the steady-state performance of distribution systems. These methods are simple to implement and yield satisfactory results comparable with the results of Method I. The proposed methods have been applied to a few distribution systems, and results obtained for two typical systems are presented for illustration purposes.

Relevance:

10.00%

Publisher:

Abstract:

The wedge shape is a fairly common cross-section found in many non-axisymmetric components used in machines, aircraft, ships and automobiles. If such components are forged between two mutually inclined dies, the metal displaced by the dies flows into the converging as well as into the diverging channels created by the inclined dies. The extent of each type of flow (convergent/divergent) depends on the die-material interface friction and the included die angle. Given the initial cross-section, the length as well as the exact geometry of the forged cross-section are therefore uniquely determined by these parameters. In this paper a simple stress analysis is used to predict changes in the geometry of a wedge undergoing compression between inclined platens. The flow in directions normal to the cross-section is assumed to be negligible. Experiments carried out using wedge-shaped lead billets show that, knowing the interface friction and as long as the deformation is not too large, the dimensional changes in the wedge can be predicted with reasonable accuracy. The predicted flow behaviour of metal for a wide range of die angles and interface friction is presented: these characteristics can be used by the die designer to choose only the die lubricant if the die angle is specified, and to choose both of these parameters if there is no restriction on the exact die angle. The present work shows that the length of a wedge undergoing compression is highly sensitive to die-material interface friction. Thus, in a situation where the top and bottom dies are inclined to each other, a wedge made of the material to be forged could be put between the dies and then compressed, whereupon the length of the compressed wedge, given the degree of compression, affords an estimate of the die-material interface friction.

Relevance:

10.00%

Publisher:

Abstract:

The analysis of transient electrical stresses in the insulation of high voltage rotating machines is rendered difficult because of the existence of capacitive and inductive couplings between phases. The published theories ignore many of the couplings between phases to obtain the solution. A new procedure is proposed here to determine the transient voltage distribution on rotating machine windings. All the significant capacitive and inductive couplings between different sections in a phase and between different sections in different phases have been considered in this analysis. The experimental results show good correlation with those computed.

Relevance:

10.00%

Publisher:

Abstract:

The paper describes a Simultaneous Implicit (SI) approach for transient stability simulations based on an iterative technique using a triangularised admittance matrix [1]. The reduced saliency of the generator in the subtransient state is taken advantage of to speed up the algorithm. Accordingly, the generator differential equations, except rotor swing, contain voltages proportional to fluxes in the main field, dampers and a hypothetical winding representing deep-flowing eddy currents, as state variables. The simulation results are validated by comparison with two independent methods, viz. Runge-Kutta simulation for a simplified system and a method based on modelling damper windings using conventional induction motor theory.

Relevance:

10.00%

Publisher:

Abstract:

For systems which can be decomposed into slow and fast subsystems, a near-optimum linear state regulator consisting of two subsystem regulators can be developed. Depending upon the desired criteria, either a short-term (fast) controller or a long-term (slow) controller can be easily designed with minimum computational cost. Using this approach, an example of a power system supplying a cyclic load is studied and the performance of the different controllers is compared.

Relevance:

10.00%

Publisher:

Abstract:

Near the boundaries of shells, thin shell theories cannot always provide a satisfactory description of the kinematic situation. This imposes severe limitations on simulating the boundary conditions in theoretical shell models. Here an attempt is made to overcome this limitation. Three-dimensional theory of elasticity is used near the boundaries, while thin shell theory covers the major part of the shell away from the boundaries. Both regions are connected by means of an “interphase element.” This method is used to study typical static stress and natural vibration problems.

Relevance:

10.00%

Publisher:

Abstract:

Many novel computer architectures, such as array processors and multiprocessors, which achieve high performance through the use of concurrency, exploit variations of the von Neumann model of computation. Effective utilization of these machines makes special demands on programmers and their programming languages, such as the structuring of data into vectors or the partitioning of programs into concurrent processes. In comparison, the data flow model of computation demands only that the principle of structured programming be followed. A data flow program, often represented as a data flow graph, is a program that expresses a computation by indicating the data dependencies among operators. A data flow computer is a machine designed to take advantage of concurrency in data flow graphs by executing data-independent operations in parallel. In this paper, we discuss the design of a high-level language (DFL: Data Flow Language) suitable for data flow computers. Some sample procedures in DFL are presented. The implementation aspects are not discussed in detail, since no new problems were encountered. The language DFL embodies the concepts of functional programming, but in appearance closely resembles Pascal. The language is a better vehicle than the data flow graph for expressing a parallel algorithm. The compiler has been implemented on a DEC 1090 system in Pascal.

Relevance:

10.00%

Publisher:

Abstract:

In this paper, we propose an extension to the I/O device architecture, as recommended in the PCI-SIG IOV specification, for virtualizing network I/O devices. The aim is to enable fine-grained controls to a virtual machine on the I/O path of a shared device. The architecture allows native access of I/O devices to virtual machines and provides device level QoS hooks for controlling VM specific device usage. For evaluating the architecture we use layered queuing network (LQN) models. We implement the architecture and evaluate it using simulation techniques, on the LQN model, to demonstrate the benefits. With the architecture, the benefit for network I/O is 60% more than what can be expected on the existing architecture. Also, the proposed architecture improves scalability in terms of the number of virtual machines intending to share the I/O device.

Relevance:

10.00%

Publisher:

Abstract:

Data flow computers are high-speed machines in which an instruction is executed as soon as all its operands are available. This paper describes the EXtended MANchester (EXMAN) data flow computer, which incorporates three major extensions to the basic Manchester machine. As extensions we provide a multiple matching units scheme, an efficient implementation of the array data structure, and a facility to concurrently execute reentrant routines. A simulator for the EXMAN computer has been coded in the discrete event simulation language SIMULA 67 on the DEC 1090 system. Performance analysis studies have been conducted on the simulated EXMAN computer to study the effectiveness of the proposed extensions. The performance experiments have been carried out using three sample problems: matrix multiplication, Bresenham's line drawing algorithm, and the polygon scan-conversion algorithm.
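The firing rule stated at the start of this abstract (an instruction executes as soon as all its operands are available) can be illustrated with a small Python sketch. This is a sequential toy interpreter, not a model of the EXMAN machine or its matching units; the graph encoding, the `run_dataflow` function, and the example expression (a + b) * (a - b) are assumptions made for illustration.

```python
import operator
from collections import deque

# Hypothetical data flow graph for (a + b) * (a - b).
# Each node maps to (operation, names of the tokens it consumes).
GRAPH = {
    "sum": (operator.add, ["a", "b"]),
    "diff": (operator.sub, ["a", "b"]),
    "prod": (operator.mul, ["sum", "diff"]),
}


def run_dataflow(graph, inputs):
    """Fire each node once, as soon as all of its operand tokens exist."""
    tokens = dict(inputs)   # every value produced so far, keyed by name
    fired = []              # order in which nodes fired
    # consumers[x] lists the nodes that take token x as an operand.
    consumers = {}
    for node, (_, args) in graph.items():
        for arg in args:
            consumers.setdefault(arg, []).append(node)
    ready = deque(inputs)   # tokens whose consumers still need checking
    while ready:
        tok = ready.popleft()
        for node in consumers.get(tok, []):
            fn, args = graph[node]
            if node not in tokens and all(a in tokens for a in args):
                tokens[node] = fn(*(tokens[a] for a in args))
                fired.append(node)
                ready.append(node)
    return tokens, fired


result, order = run_dataflow(GRAPH, {"a": 7, "b": 3})
```

Here `sum` and `diff` depend only on the inputs, so they have no data dependence on each other and a real data flow machine could execute them in parallel; `prod` can fire only after both have produced their tokens.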