43 results for data transfer

at Indian Institute of Science - Bangalore - India


Relevance:

70.00%

Publisher:

Abstract:

Multi-GPU machines are being increasingly used in high-performance computing. Each GPU in such a machine has its own memory and does not share the address space with either the host CPU or other GPUs. Hence, applications utilizing multiple GPUs have to manually allocate and manage data on each GPU. Existing works that propose to automate data allocations for GPUs have limitations and inefficiencies in terms of allocation sizes, exploiting reuse, transfer costs, and scalability. We propose a scalable and fully automatic data allocation and buffer management scheme for affine loop nests on multi-GPU machines. We call it the Bounding-Box-based Memory Manager (BBMM). At runtime, BBMM can perform standard set operations such as union, intersection, and difference, and find subset and superset relations, on hyperrectangular regions of array data (bounding boxes). It uses these operations along with some compiler assistance to identify, allocate, and manage data required by applications in terms of disjoint bounding boxes. This allows it to (1) allocate exactly or nearly as much data as is required by computations running on each GPU, (2) efficiently track buffer allocations and hence maximize data reuse across tiles and minimize data transfer overhead, and (3) as a result, maximize utilization of the combined memory on multi-GPU machines. BBMM can work with any choice of parallelizing transformations, computation placement, and scheduling schemes, whether static or dynamic. Experiments run on a four-GPU machine with various scientific programs showed that BBMM reduces data allocations on each GPU by up to 75% compared to current allocation schemes, yields performance of at least 88% of manually written code, and allows excellent weak scaling.
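The set operations on bounding boxes mentioned in this abstract can be illustrated with a minimal Python sketch. It is not BBMM's implementation: the `Box` class, its inclusive integer bounds, and the slab-based `difference` are assumptions made here for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Hypothetical hyperrectangle with inclusive integer bounds per dimension."""
    lo: Tuple[int, ...]
    hi: Tuple[int, ...]

    def volume(self) -> int:
        # Number of array elements covered by the box.
        n = 1
        for l, h in zip(self.lo, self.hi):
            n *= h - l + 1
        return n

    def intersect(self, other: "Box") -> Optional["Box"]:
        # Overlap of two boxes, or None when they are disjoint.
        lo = tuple(max(a, b) for a, b in zip(self.lo, other.lo))
        hi = tuple(min(a, b) for a, b in zip(self.hi, other.hi))
        if any(l > h for l, h in zip(lo, hi)):
            return None
        return Box(lo, hi)

    def contains(self, other: "Box") -> bool:
        # Superset test: every bound of `other` lies inside this box.
        return (all(a <= b for a, b in zip(self.lo, other.lo))
                and all(a >= b for a, b in zip(self.hi, other.hi)))

    def difference(self, other: "Box") -> List["Box"]:
        # self minus other, returned as pairwise disjoint boxes.
        cut = self.intersect(other)
        if cut is None:
            return [self]
        pieces = []
        lo, hi = list(self.lo), list(self.hi)
        for d in range(len(lo)):
            if lo[d] < cut.lo[d]:  # slab below the overlap in dimension d
                pieces.append(Box(tuple(lo), tuple(hi[:d] + [cut.lo[d] - 1] + hi[d + 1:])))
            if cut.hi[d] < hi[d]:  # slab above the overlap in dimension d
                pieces.append(Box(tuple(lo[:d] + [cut.hi[d] + 1] + lo[d + 1:]), tuple(hi)))
            lo[d], hi[d] = cut.lo[d], cut.hi[d]  # clamp before the next dimension
        return pieces


resident = Box((0, 0), (9, 9))    # region already held in a GPU buffer
needed = Box((5, 5), (14, 14))    # region required by the next tile
to_transfer = needed.difference(resident)
```

In this sketch only the boxes in `to_transfer` would have to be copied to the device, because the overlap with `resident` is already present; that is the kind of reuse the abstract attributes to tracking allocations as disjoint bounding boxes.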

Relevance:

60.00%

Publisher:

Abstract:

Multi-access techniques are widely used in computer networking and distributed multiprocessor systems. On-the-fly arbitration schemes permit one of the many contenders to access the medium without collisions. Serial arbitration is cost effective but is slow and hence unsuitable for high-speed multiprocessor environments supporting very high data transfer rates. A fully parallel arbitration scheme takes less time but is not practically realisable for large numbers of contenders. In this paper, a generalised parallel-serial scheme is proposed which significantly reduces the arbitration time and is practically realisable.

Relevance:

60.00%

Publisher:

Abstract:

Emerging embedded applications are based on evolving standards (e.g., MPEG2/4, H.264/265, IEEE802.11a/b/g/n). Since most of these applications run on handheld devices, there is an increasing need for a single-chip solution that can dynamically interoperate between different standards and their derivatives. In order to achieve high resource utilization and low power dissipation, we propose REDEFINE, a polymorphic ASIC in which specialized hardware units are replaced with basic hardware units that can create the same functionality by runtime re-composition. It is a "future-proof" custom hardware solution for multiple applications and their derivatives in a domain. In this article, we describe a compiler framework and supporting hardware comprising compute, storage, and communication resources. Applications described in a high-level language (e.g., C) are compiled into application substructures. For each application substructure, a set of compute elements (CEs) on the hardware are interconnected during runtime to form a pattern that closely matches the communication pattern of that particular application. The advantage is that the bounded CEs are neither processor cores nor logic elements as in FPGAs. Hence, REDEFINE offers the power and performance advantage of an ASIC and the hardware reconfigurability and programmability of an FPGA or instruction-set processor. In addition, the hardware supports custom instruction pipelining. Existing instruction-set extensible processors determine a sequence of instructions that repeatedly occur within the application to create custom instructions at design time to speed up the execution of this sequence. We extend this scheme further, where a kernel is compiled into custom instructions that bear a strong producer-consumer relationship (and are not limited to frequently occurring sequences of instructions). Custom instructions, realized as hardware compositions effected at runtime, allow several instances of the same to be active in parallel. A key distinguishing factor in the majority of emerging embedded applications is stream processing. To reduce the overheads of data transfer between custom instructions, direct communication paths are employed among custom instructions. In this article, we present an overview of the hardware-aware compiler framework, which determines the NoC-aware schedule of transports of the data exchanged between the custom instructions on the interconnect. The results for the FFT kernel indicate a 25% reduction in the number of loads/stores, and throughput improves by log(n) for an n-point FFT when compared to a sequential implementation. Overall, REDEFINE offers flexibility and runtime reconfigurability at the expense of 1.16x in power and 8x in area when compared to an ASIC. The REDEFINE implementation consumes 0.1x the power of an FPGA implementation. In addition, the configuration overhead of the FPGA implementation is 1,000x more than that of REDEFINE.

Relevance:

60.00%

Publisher:

Abstract:

RECONNECT is a Network-on-Chip (NoC) using a honeycomb topology. In this paper we focus on properties of general rules applicable to a variety of routing algorithms for the NoC which take into account the missing links of the honeycomb topology when compared to a mesh. We also extend the original proposal [5] and show a method to insert and extract data to and from the network. Access Routers at the boundary of the execution fabric establish connections to multiple periphery modules and create a torus to decrease the node distances. Our approach is scalable and ensures homogeneity among the compute elements in the NoC. We synthesized and evaluated the proposed enhancement in terms of power dissipation and area. Our results indicate that the impact of necessary alterations to the fabric is negligible and affects the data transfer between the fabric and the periphery only marginally.

Relevance:

60.00%

Publisher:

Abstract:

Biomedical engineering solutions like surgical simulators need High Performance Computing (HPC) to achieve real-time performance. Graphics Processing Units (GPUs) offer HPC capabilities at low cost and low power consumption. In this work, it is demonstrated that a liver discretized by about 2500 finite element nodes can be graphically simulated in real time by making use of a GPU. The present work takes into consideration the time needed for the data transfer from CPU to GPU and back from GPU to CPU. Although the behaviour of the liver is very complicated, the present computer simulation assumes linear elastostatics. The commercial software ANSYS is needed to obtain the global stiffness matrix of the liver. Results show that GPUs are useful for the real-time graphical simulation of the liver, which in turn is needed in simulators that are used for training surgeons in laparoscopic surgery. Although the computer simulation should also involve rendering, neither rendering nor the time needed for rendering and displaying the liver on a screen is considered in the present work. The present work is just a demonstration of a concept; the concept is not really implemented and validated. Future work is to develop software which can accomplish real-time and very realistic graphical simulation of the liver, with the rendered image of the liver on the screen changing in real time according to the position of the surgical tool tip, approximated as the mouse cursor in 3D.

Relevance:

60.00%

Publisher:

Abstract:

MATLAB is an array language, initially popular for rapid prototyping, that is now being increasingly used to develop production code for numerical and scientific applications. Typical MATLAB programs have abundant data parallelism. These programs also have control-flow-dominated scalar regions that have an impact on the program's execution time. Today's computer systems have tremendous computing power in the form of traditional CPU cores and throughput-oriented accelerators such as graphics processing units (GPUs). Thus, an approach that maps the control-flow-dominated regions to the CPU and the data-parallel regions to the GPU can significantly improve program performance. In this paper, we present the design and implementation of MEGHA, a compiler that automatically compiles MATLAB programs to enable synergistic execution on heterogeneous processors. Our solution is fully automated and does not require programmer input for identifying data-parallel regions. We propose a set of compiler optimizations tailored for MATLAB. Our compiler identifies data-parallel regions of the program and composes them into kernels. The problem of combining statements into kernels is formulated as a constrained graph clustering problem. Heuristics are presented to map identified kernels to either the CPU or GPU so that kernel execution on the CPU and the GPU happens synergistically and the amount of data transfer needed is minimized. In order to ensure required data movement for dependencies across basic blocks, we propose a data flow analysis and edge splitting strategy. Thus our compiler automatically handles composition of kernels, mapping of kernels to CPU and GPU, scheduling, and insertion of required data transfers. The proposed compiler was implemented, and experimental evaluation using a set of MATLAB benchmarks shows that our approach achieves a geometric mean speedup of 19.8X for data-parallel benchmarks over native execution of MATLAB.

Relevance:

60.00%

Publisher:

Abstract:

We implement two energy models that accurately and comprehensively estimate the system energy cost and the communication energy cost of using Bluetooth and Wi-Fi interfaces. The energy models running on a system are used to pick the most energy-optimal network interface so that data transfer between two end points is maximized.
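As a rough illustration of energy-based interface selection, the following Python sketch picks the interface with the lowest estimated energy for a given payload. It does not reproduce the paper's two models: the linear cost formula (fixed setup energy plus active power times transfer time) and every numeric value below are hypothetical assumptions.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Interface:
    name: str
    throughput_bps: float   # assumed sustained throughput
    active_power_w: float   # assumed power draw while transferring
    setup_energy_j: float   # assumed fixed cost to wake up and connect


def transfer_energy_j(iface: Interface, payload_bytes: int) -> float:
    # Energy = fixed setup cost + active power * time on air.
    duration_s = payload_bytes * 8 / iface.throughput_bps
    return iface.setup_energy_j + iface.active_power_w * duration_s


def pick_interface(ifaces: Iterable[Interface], payload_bytes: int) -> Interface:
    # Choose the interface whose estimated energy for this payload is lowest.
    return min(ifaces, key=lambda i: transfer_energy_j(i, payload_bytes))


# Hypothetical parameters, not measurements from the paper.
bluetooth = Interface("bluetooth", 1e6, 0.1, 0.05)
wifi = Interface("wifi", 20e6, 0.8, 0.5)

small = pick_interface([bluetooth, wifi], 10_000)        # 10 kB message
large = pick_interface([bluetooth, wifi], 100_000_000)   # 100 MB bulk transfer
```

Under these assumed numbers the low-power interface is preferred for the small message and the faster interface for the bulk transfer, because the fixed setup cost is amortized over a much shorter transfer time.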

Relevance:

60.00%

Publisher:

Abstract:

We consider the wireless two-way relay channel, in which two-way data transfer takes place between the end nodes with the help of a relay. For the Denoise-And-Forward (DNF) protocol, it was shown by Koike-Akino et al. that adaptively changing the network coding map used at the relay greatly reduces the impact of Multiple Access Interference at the relay. The harmful effect of the deep channel fade conditions can be effectively mitigated by proper choice of these network coding maps at the relay. Alternatively, in this paper we propose a Distributed Space Time Coding (DSTC) scheme, which effectively removes most of the deep fade channel conditions at the transmitting nodes themselves without any CSIT and without any need to adaptively change the network coding map used at the relay. It is shown that the deep fades occur when the channel fade coefficient vector falls in one of a finite number of vector subspaces, which are referred to as the singular fade subspaces. A DSTC design criterion, referred to as the singularity minimization criterion, under which the number of such vector subspaces is minimized, is obtained. Also, a criterion to maximize the coding gain of the DSTC is obtained. Explicit low-decoding-complexity DSTC designs which satisfy the singularity minimization criterion and maximize the coding gain for QAM and PSK signal sets are provided. Simulation results show that at high Signal to Noise Ratio, the DSTC scheme provides large gains when compared to the conventional Exclusive OR network code and performs better than the adaptive network coding scheme.
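The conventional Exclusive OR network code used as the baseline in this abstract can be sketched at the bit level in Python. The sketch assumes an idealized, noise-free relay that already knows both messages, so it omits the multiple-access phase, fading, and denoising that the paper is actually concerned with; the function names are hypothetical.

```python
def relay_xor(msg_a: bytes, msg_b: bytes) -> bytes:
    # Relay combines both end nodes' messages into one broadcast packet.
    if len(msg_a) != len(msg_b):
        raise ValueError("messages must have equal length")
    return bytes(x ^ y for x, y in zip(msg_a, msg_b))


def recover_peer(broadcast: bytes, own_msg: bytes) -> bytes:
    # Each end node cancels its own message to obtain the other node's message.
    return bytes(x ^ y for x, y in zip(broadcast, own_msg))


msg_a = b"from-node-A"
msg_b = b"from-node-B"
broadcast = relay_xor(msg_a, msg_b)
at_a = recover_peer(broadcast, msg_a)   # node A learns B's message
at_b = recover_peer(broadcast, msg_b)   # node B learns A's message
```

A single broadcast therefore serves both directions of the exchange, which is why XOR is the usual reference point for network coding maps at the relay.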

Relevance:

60.00%

Publisher:

Abstract:

Many wireless sensor network (WSN) applications require reliable data transfer between the nodes. Several techniques, including link-level retransmission, error correction methods, and hybrid Automatic Repeat reQuest (ARQ), were introduced into wireless sensor networks for ensuring reliability. In this paper, we use the Automatic reSend request (ASQ) technique with regular acknowledgement to design a reliable end-to-end communication protocol, called the Adaptive Reliable Transport (ARTP) protocol, for WSNs. Besides ensuring reliability, the objective of the ARTP protocol is to ensure message stream FIFO at the receiver side instead of the byte stream FIFO used in the TCP/IP protocol suite. To realize this objective, a new protocol stack has been used in the ARTP protocol. The ARTP protocol saves energy without affecting the throughput by sending three different types of acknowledgements, viz. ACK, NACK, and FNACK, with semantics different from those currently in the literature, and by adapting to the network conditions. Additionally, the protocol controls flow based on the receiver's feedback and controls congestion by holding ACK messages. To the best of our knowledge, there has been little or no attempt to build a receiver-controlled, regularly acknowledged reliable communication protocol. We have carried out extensive simulation studies of our protocol using the Castalia simulator, and the study shows that our protocol performs better than related protocols in wireless/wireline networks in terms of throughput and energy efficiency.

Relevance:

60.00%

Publisher:

Abstract:

In wireless sensor networks (WSNs) the communication traffic is often time and space correlated, where multiple nodes in proximity start transmitting at the same time. Such a situation is known as spatially correlated contention. The random access methods to resolve such contention suffer from a high collision rate, whereas the traditional distributed TDMA scheduling techniques primarily try to improve the network capacity by reducing the schedule length. Usually, the situation of spatially correlated contention persists only for a short duration, and therefore generating an optimal or sub-optimal schedule is not very useful. On the other hand, if the algorithm takes a very long time to schedule, it will not only introduce additional delay in the data transfer but also consume more energy. To efficiently handle spatially correlated contention in WSNs, we present a distributed TDMA slot scheduling algorithm, called the DTSS algorithm. The DTSS algorithm is designed with the primary objective of reducing the time required to perform scheduling, while restricting the schedule length to the maximum degree of the interference graph. The algorithm uses randomized TDMA channel access as the mechanism to transmit protocol messages, which bounds the message delay and therefore reduces the time required to get a feasible schedule. The DTSS algorithm supports unicast, multicast, and broadcast scheduling simultaneously without any modification in the protocol. The protocol has been simulated using the Castalia simulator to evaluate the run-time performance. Simulation results show that our protocol is able to considerably reduce the time required to schedule.
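To make the relationship between TDMA slot assignment and the interference graph concrete, here is a small Python sketch of a centralized greedy slot assignment. It is a stand-in for illustration, not the distributed DTSS algorithm: the graph, the node ordering, and the function name are assumptions, and the greedy rule guarantees at most (maximum degree + 1) slots rather than the bound stated in the abstract.

```python
from typing import Dict, Set


def greedy_slots(interference: Dict[int, Set[int]]) -> Dict[int, int]:
    # Give each node the smallest slot not used by an already scheduled
    # interfering neighbour, so no two interfering nodes share a slot.
    slots: Dict[int, int] = {}
    for node in sorted(interference):
        taken = {slots[n] for n in interference[node] if n in slots}
        slot = 0
        while slot in taken:
            slot += 1
        slots[node] = slot
    return slots


# Hypothetical interference graph: node -> set of interfering nodes.
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
schedule = greedy_slots(graph)
schedule_length = max(schedule.values()) + 1
max_degree = max(len(neighbours) for neighbours in graph.values())
```

The schedule length is tied to the local density of the interference graph rather than to the total number of nodes, which is the property that makes a degree-based bound on schedule length attractive.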

Relevance:

60.00%

Publisher:

Abstract:

Different medium access control (MAC) layer protocols, for example, the IEEE 802.11 series and others, are used in wireless local area networks. They have limitations in handling bulk data transfer applications such as video-on-demand and videoconferencing. To avoid this problem, a cooperative MAC protocol environment has been introduced, which enables the MAC protocol of a node to use its nearby nodes' MAC protocols as and when required. We have found on various occasions that the specified cooperative MAC establishes cooperative transmissions to send the specified data to the destination. In this paper we propose the cooperative MAC priority (CoopMACPri) protocol, which exploits the priority value given by the upper layers to select different paths to nodes running heterogeneous applications in a wireless ad hoc network environment. The CoopMACPri protocol improves the system throughput and minimizes energy consumption. Using a Markov chain, we developed a model to analyse the performance of the CoopMACPri protocol and derived closed-form expressions for saturated system throughput and energy consumption. Performance evaluations validate the accuracy of the theoretical analysis and also show that the performance of the CoopMACPri protocol varies with the number of nodes. We observed that the simulation results and analysis reflect the effectiveness of the proposed protocol as per the specifications.

Relevance:

60.00%

Publisher:

Abstract:

In WSNs the communication traffic is often time and space correlated, where multiple nodes in proximity start transmitting simultaneously. Such a situation is known as spatially correlated contention. The random access method to resolve such contention suffers from a high collision rate, whereas the traditional distributed TDMA scheduling techniques primarily try to improve the network capacity by reducing the schedule length. Usually, the situation of spatially correlated contention persists only for a short duration, and therefore generating an optimal or suboptimal schedule is not very useful. Additionally, if an algorithm takes a very long time to schedule, it will not only introduce additional delay in the data transfer but also consume more energy. In this paper, we present a distributed TDMA slot scheduling (DTSS) algorithm, which considerably reduces the time required to perform scheduling, while restricting the schedule length to the maximum degree of the interference graph. The DTSS algorithm supports unicast, multicast, and broadcast scheduling simultaneously without any modification in the protocol. We have analyzed the protocol for average-case performance and also simulated it using the Castalia simulator to evaluate its runtime performance. Both analytical and simulation results show that our protocol is able to considerably reduce the time required for scheduling.

Relevance:

40.00%

Publisher:

Abstract:

Aerodynamic forces and fore-body convective surface heat transfer rates over a 60° apex-angle blunt cone have been simultaneously measured at a nominal Mach number of 5.75 in the hypersonic shock tunnel HST2. An aluminum model is used in the investigations; it incorporates a three-component accelerometer-based balance system for measuring the aerodynamic forces and, for convective surface heat transfer measurement, an array of platinum thin-film gauges deposited on thermally insulating backing material flush mounted on the model surface. The measured value of the drag coefficient varies by about +/-6% from the theoretically estimated value based on the modified Newtonian theory, while the axi-symmetric Navier-Stokes computations overpredict the drag coefficient by about 9%. The normalized values of measured heat transfer rates at 0° angle of attack are about 11% higher than the theoretically estimated values. The aerodynamic and heat transfer data presented here are very valuable for the validation of CFD codes used for the numerical computation of flow fields around hypersonic vehicles.
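The modified Newtonian estimate referred to in this abstract rests on the surface pressure coefficient C_p = C_p,max sin²θ, where θ is the local flow inclination and C_p,max is the stagnation-point value behind a normal shock. The Python sketch below evaluates it under assumptions that the abstract does not state: a calorically perfect gas with γ = 1.4, and a conical surface inclined at 30° (half of the 60° apex angle) at zero angle of attack. It does not integrate over the blunt nose, so it is not the paper's drag estimate.

```python
import math


def pitot_pressure_ratio(mach: float, gamma: float = 1.4) -> float:
    # Rayleigh pitot formula: stagnation pressure behind a normal shock
    # divided by freestream static pressure (supersonic freestream only).
    m2 = mach * mach
    a = ((gamma + 1) ** 2 * m2 / (4 * gamma * m2 - 2 * (gamma - 1))) ** (gamma / (gamma - 1))
    b = (1 - gamma + 2 * gamma * m2) / (gamma + 1)
    return a * b


def cp_max(mach: float, gamma: float = 1.4) -> float:
    # Stagnation-point pressure coefficient used by modified Newtonian theory.
    return 2.0 / (gamma * mach * mach) * (pitot_pressure_ratio(mach, gamma) - 1.0)


def newtonian_cp(theta_rad: float, mach: float, gamma: float = 1.4) -> float:
    # Modified Newtonian surface pressure coefficient at inclination theta.
    return cp_max(mach, gamma) * math.sin(theta_rad) ** 2


mach = 5.75                       # nominal Mach number from the abstract
half_angle = math.radians(30.0)   # assumed: half of the 60 degree apex angle
cp_cone = newtonian_cp(half_angle, mach)
```

Integrating this pressure coefficient over the fore-body, including the blunt nose where θ varies, would be the next step toward a drag coefficient that could be compared with a measured value.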

Relevance:

40.00%

Publisher:

Abstract:

A technology transfer office (TTO) based in a university system has multiple goals. Whilst commercialization is a critical goal, maintenance and cleaning of the TTO's database also needs detailing. Literature in the area is scarce, and only some researchers make reference to TTO data cleaning. During an attempt to understand the commercial strategy of a university TTO in Bangalore, the challenge of data cleaning was encountered. This paper describes a case study of data cleaning at an Indian university-based TTO. In the study, 382 patent records were analyzed. The case study first describes the background of the university system. Second, the method to clean the data and the experiences encountered are highlighted. Insights drawn indicate that patent data cleaning in a TTO is a specialized area which needs attention. Overlooking this activity can have legal implications and may result in an inability to commercialize the patent. Two levels of patent data cleaning are discussed in this case study. Best practices of data cleaning in academic TTOs are discussed.