186 resultados para Graphical processing units
em Indian Institute of Science - Bangalore - Índia
Resumo:
Biomedical engineering solutions like surgical simulators need High Performance Computing (HPC) to achieve real-time performance. Graphics Processing Units (GPUs) offer HPC capabilities at low cost and low power consumption. In this work, it is demonstrated that a liver which is discretized by about 2500 finite element nodes, can be graphically simulated in realtime, by making use of a GPU. Present work takes into consideration the time needed for the data transfer from CPU to GPU and back from GPU to CPU. Although behaviour of liver is very complicated, present computer simulation assumes linear elastostatics. One needs to use the commercial software ANSYS to obtain the global stiffness matrix of the liver. Results show that GPUs are useful for the real-time graphical simulation of liver, which in turn is needed in simulators that are used for training surgeons in laparoscopic surgery. Although the computer simulation should involve rendering also, neither rendering, nor the time needed for rendering and displaying the liver on a screen, is considered in the present work. The present work is just a demonstration of a concept; the concept is not really implemented and validated. Future work is to develop software which can accomplish real-time and very realistic graphical simulation of liver, with rendered image of liver on the screen changing in real-time according to the position of the surgical tool tip approximated as the mouse cursor in 3D.
Resumo:
Diffuse optical tomographic image reconstruction uses advanced numerical models that are computationally costly to be implemented in the real time. The graphics processing units (GPUs) offer desktop massive parallelization that can accelerate these computations. An open-source GPU-accelerated linear algebra library package is used to compute the most intensive matrix-matrix calculations and matrix decompositions that are used in solving the system of linear equations. These open-source functions were integrated into the existing frequency-domain diffuse optical image reconstruction algorithms to evaluate the acceleration capability of the GPUs (NVIDIA Tesla C 1060) with increasing reconstruction problem sizes. These studies indicate that single precision computations are sufficient for diffuse optical tomographic image reconstruction. The acceleration per iteration can be up to 40, using GPUs compared to traditional CPUs in case of three-dimensional reconstruction, where the reconstruction problem is more underdetermined, making the GPUs more attractive in the clinical settings. The current limitation of these GPUs in the available onboard memory (4 GB) that restricts the reconstruction of a large set of optical parameters, more than 13, 377. (C) 2010 Society of Photo-Optical Instrumentation Engineers. DOI: 10.1117/1.3506216]
Resumo:
The StreamIt programming model has been proposed to exploit parallelism in streaming applications oil general purpose multicore architectures. The StreamIt graphs describe task, data and pipeline parallelism which can be exploited on accelerators such as Graphics Processing Units (GPUs) or CellBE which support abundant parallelism in hardware. In this paper, we describe a novel method to orchestrate the execution of if StreamIt program oil a multicore platform equipped with an accelerator. The proposed approach identifies, using profiling, the relative benefits of executing a task oil the superscalar CPU cores and the accelerator. We formulate the problem of partitioning the work between the CPU cores and the GPU, taking into account the latencies for data transfers and the required buffer layout transformations associated with the partitioning, as all integrated Integer Linear Program (ILP) which can then be solved by an ILP solver. We also propose an efficient heuristic algorithm for the work-partitioning between the CPU and the GPU, which provides solutions which are within 9.05% of the optimal solution on an average across the benchmark Suite. The partitioned tasks are then software pipelined to execute oil the multiple CPU cores and the Streaming Multiprocessors (SMs) of the GPU. The software pipelining algorithm orchestrates the execution between CPU cores and the GPU by emitting the code for the CPU and the GPU, and the code for the required data transfers. Our experiments on a platform with 8 CPU cores and a GeForce 8800 GTS 512 GPU show a geometric mean speedup of 6.94X with it maximum of 51.96X over it single threaded CPU execution across the StreamIt benchmarks. This is a 18.9% improvement over it partitioning strategy that maps only the filters that cannot be executed oil the GPU - the filters with state that is persistent across firings - onto the CPU.
Resumo:
The StreamIt programming model has been proposed to exploit parallelism in streaming applications on general purpose multi-core architectures. This model allows programmers to specify the structure of a program as a set of filters that act upon data, and a set of communication channels between them. The StreamIt graphs describe task, data and pipeline parallelism which can be exploited on modern Graphics Processing Units (GPUs), as they support abundant parallelism in hardware. In this paper, we describe the challenges in mapping StreamIt to GPUs and propose an efficient technique to software pipeline the execution of stream programs on GPUs. We formulate this problem - both scheduling and assignment of filters to processors - as an efficient Integer Linear Program (ILP), which is then solved using ILP solvers. We also describe a novel buffer layout technique for GPUs which facilitates exploiting the high memory bandwidth available in GPUs. The proposed scheduling utilizes both the scalar units in GPU, to exploit data parallelism, and multiprocessors, to exploit task and pipelin parallelism. Further it takes into consideration the synchronization and bandwidth limitations of GPUs, and yields speedups between 1.87X and 36.83X over a single threaded CPU.
Resumo:
MATLAB is an array language, initially popular for rapid prototyping, but is now being increasingly used to develop production code for numerical and scientific applications. Typical MATLAB programs have abundant data parallelism. These programs also have control flow dominated scalar regions that have an impact on the program's execution time. Today's computer systems have tremendous computing power in the form of traditional CPU cores and throughput oriented accelerators such as graphics processing units(GPUs). Thus, an approach that maps the control flow dominated regions to the CPU and the data parallel regions to the GPU can significantly improve program performance. In this paper, we present the design and implementation of MEGHA, a compiler that automatically compiles MATLAB programs to enable synergistic execution on heterogeneous processors. Our solution is fully automated and does not require programmer input for identifying data parallel regions. We propose a set of compiler optimizations tailored for MATLAB. Our compiler identifies data parallel regions of the program and composes them into kernels. The problem of combining statements into kernels is formulated as a constrained graph clustering problem. Heuristics are presented to map identified kernels to either the CPU or GPU so that kernel execution on the CPU and the GPU happens synergistically and the amount of data transfer needed is minimized. In order to ensure required data movement for dependencies across basic blocks, we propose a data flow analysis and edge splitting strategy. Thus our compiler automatically handles composition of kernels, mapping of kernels to CPU and GPU, scheduling and insertion of required data transfer. The proposed compiler was implemented and experimental evaluation using a set of MATLAB benchmarks shows that our approach achieves a geometric mean speedup of 19.8X for data parallel benchmarks over native execution of MATLAB.
Resumo:
MATLAB is an array language, initially popular for rapid prototyping, but is now being increasingly used to develop production code for numerical and scientific applications. Typical MATLAB programs have abundant data parallelism. These programs also have control flow dominated scalar regions that have an impact on the program's execution time. Today's computer systems have tremendous computing power in the form of traditional CPU cores and throughput oriented accelerators such as graphics processing units(GPUs). Thus, an approach that maps the control flow dominated regions to the CPU and the data parallel regions to the GPU can significantly improve program performance. In this paper, we present the design and implementation of MEGHA, a compiler that automatically compiles MATLAB programs to enable synergistic execution on heterogeneous processors. Our solution is fully automated and does not require programmer input for identifying data parallel regions. We propose a set of compiler optimizations tailored for MATLAB. Our compiler identifies data parallel regions of the program and composes them into kernels. The problem of combining statements into kernels is formulated as a constrained graph clustering problem. Heuristics are presented to map identified kernels to either the CPU or GPU so that kernel execution on the CPU and the GPU happens synergistically and the amount of data transfer needed is minimized. In order to ensure required data movement for dependencies across basic blocks, we propose a data flow analysis and edge splitting strategy. Thus our compiler automatically handles composition of kernels, mapping of kernels to CPU and GPU, scheduling and insertion of required data transfer. The proposed compiler was implemented and experimental evaluation using a set of MATLAB benchmarks shows that our approach achieves a geometric mean speedup of 19.8X for data parallel benchmarks over native execution of MATLAB.
Resumo:
This paper presents a GPU implementation of normalized cuts for road extraction problem using panchromatic satellite imagery. The roads have been extracted in three stages namely pre-processing, image segmentation and post-processing. Initially, the image is pre-processed to improve the tolerance by reducing the clutter (that mostly represents the buildings, vegetation,. and fallow regions). The road regions are then extracted using the normalized cuts algorithm. Normalized cuts algorithm is a graph-based partitioning `approach whose focus lies in extracting the global impression (perceptual grouping) of an image rather than local features. For the segmented image, post-processing is carried out using morphological operations - erosion and dilation. Finally, the road extracted image is overlaid on the original image. Here, a GPGPU (General Purpose Graphical Processing Unit) approach has been adopted to implement the same algorithm on the GPU for fast processing. A performance comparison of this proposed GPU implementation of normalized cuts algorithm with the earlier algorithm (CPU implementation) is presented. From the results, we conclude that the computational improvement in terms of time as the size of image increases for the proposed GPU implementation of normalized cuts. Also, a qualitative and quantitative assessment of the segmentation results has been projected.
Resumo:
The growing number of applications and processing units in modern Multiprocessor Systems-on-Chips (MPSoCs) come along with reduced time to market. Different IP cores can come from different vendors, and their trust levels are also different, but typically they use Network-on-Chip (NoC) as their communication infrastructure. An MPSoC can have multiple Trusted Execution Environments (TEEs). Apart from performance, power, and area research in the field of MPSoC, robust and secure system design is also gaining importance in the research community. To build a secure system, the designer must know beforehand all kinds of attack possibilities for the respective system (MPSoC). In this paper we survey the possible attack scenarios on present-day MPSoCs and investigate a new attack scenario, i.e., router attack targeted toward NoC architecture. We show the validity of this attack by analyzing different present-day NoC architectures and show that they are all vulnerable to this type of attack. By launching a router attack, an attacker can control the whole chip very easily, which makes it a very serious issue. Both routing tables and routing logic-based routers are vulnerable to such attacks. In this paper, we address attacks on routing tables. We propose different monitoring-based countermeasures against routing table-based router attack in an MPSoC having multiple TEEs. Synthesis results show that proposed countermeasures, viz. Runtime-monitor, Restart-monitor, Intermediate manager, and Auditor, occupy areas that are 26.6, 22, 0.2, and 12.2 % of a routing table-based router area. Apart from these, we propose Ejection address checker and Local monitoring module inside a router that cause 3.4 and 10.6 % increase of a router area, respectively. Simulation results are also given, which shows effectiveness of proposed monitoring-based countermeasures.
Resumo:
Coarse Grained Reconfigurable Architectures (CGRA) are emerging as embedded application processing units in computing platforms for Exascale computing. Such CGRAs are distributed memory multi- core compute elements on a chip that communicate over a Network-on-chip (NoC). Numerical Linear Algebra (NLA) kernels are key to several high performance computing applications. In this paper we propose a systematic methodology to obtain the specification of Compute Elements (CE) for such CGRAs. We analyze block Matrix Multiplication and block LU Decomposition algorithms in the context of a CGRA, and obtain theoretical bounds on communication requirements, and memory sizes for a CE. Support for high performance custom computations common to NLA kernels are met through custom function units (CFUs) in the CEs. We present results to justify the merits of such CFUs.
Resumo:
In this paper, we deal with low-complexity near-optimal detection/equalization in large-dimension multiple-input multiple-output inter-symbol interference (MIMO-ISI) channels using message passing on graphical models. A key contribution in the paper is the demonstration that near-optimal performance in MIMO-ISI channels with large dimensions can be achieved at low complexities through simple yet effective simplifications/approximations, although the graphical models that represent MIMO-ISI channels are fully/densely connected (loopy graphs). These include 1) use of Markov random field (MRF)-based graphical model with pairwise interaction, in conjunction with message damping, and 2) use of factor graph (FG)-based graphical model with Gaussian approximation of interference (GAI). The per-symbol complexities are O(K(2)n(t)(2)) and O(Kn(t)) for the MRF and the FG with GAI approaches, respectively, where K and n(t) denote the number of channel uses per frame, and number of transmit antennas, respectively. These low-complexities are quite attractive for large dimensions, i.e., for large Kn(t). From a performance perspective, these algorithms are even more interesting in large-dimensions since they achieve increasingly closer to optimum detection performance for increasing Kn(t). Also, we show that these message passing algorithms can be used in an iterative manner with local neighborhood search algorithms to improve the reliability/performance of M-QAM symbol detection.
Resumo:
It is well known that extremely long low-density parity-check (LDPC) codes perform exceptionally well for error correction applications, short-length codes are preferable in practical applications. However, short-length LDPC codes suffer from performance degradation owing to graph-based impairments such as short cycles, trapping sets and stopping sets and so on in the bipartite graph of the LDPC matrix. In particular, performance degradation at moderate to high E-b/N-0 is caused by the oscillations in bit node a posteriori probabilities induced by short cycles and trapping sets in bipartite graphs. In this study, a computationally efficient algorithm is proposed to improve the performance of short-length LDPC codes at moderate to high E-b/N-0. This algorithm makes use of the information generated by the belief propagation (BP) algorithm in previous iterations before a decoding failure occurs. Using this information, a reliability-based estimation is performed on each bit node to supplement the BP algorithm. The proposed algorithm gives an appreciable coding gain as compared with BP decoding for LDPC codes of a code rate equal to or less than 1/2 rate coding. The coding gains are modest to significant in the case of optimised (for bipartite graph conditioning) regular LDPC codes, whereas the coding gains are huge in the case of unoptimised codes. Hence, this algorithm is useful for relaxing some stringent constraints on the graphical structure of the LDPC code and for developing hardware-friendly designs.
Resumo:
This paper reports a new class of photo-cross-linkable side chain liquid crystalline polymers (PSCLCPs) based on the bis(benzylidene)cyclohexanone unit, which functions as both a mesogen and a photoactive center. Polymers with the bis(benzylidene)cyclohexanone unit and varying spacer length have been synthesized. Copolymers of bis(benzylidene)cyclohexanone containing monomer and cholesterol benzoate containing monomer with different compositions have also been prepared. All these polymers have been structurally characterized by spectroscopic techniques. Thermal transitions were studied by DSC, and mesophases were identified by polarized light optical microscopy (POM). The intermediate compounds OH-x, the monomers SCLCM-x, and the corresponding polymers PSCLCP-x, which are essentially based on bis(benzylidene)cyclohexanone, all show a nematic mesophase. Transition temperatures were observed to decrease with increasing spacer length. The copolymers with varying compositions exhibit a cholesteric mesophase, and the transition temperatures increase with the cholesteric benzoate units in the copolymer. Photolysis of the low molecular weight liquid crystalline bis(benzylidene)-cyclohexanone compound reveals that there are two kinds of photoreactions in these systems: the EZ photoisomerization and 2 pi + 2 pi addition. The EZ photoisomerization in the LC phase disrupts the parallel stacking of the mesogens, resulting in the transition from the LC phase to the isotropic phase. The photoreaction involving the 2 pi + 2 pi addition of the bis(benzylidene)cyclohexanone units in the polymer results in the cross-linking of the chains. The liquid crystalline induced circular dichroism (LCICD) studies of the cholesterol benzoate copolymers revealed that the cholesteric supramolecular order remains even after the photo-cross-linking.
Resumo:
The development of a microstructure in 304L stainless steel during industrial hot-forming operations, including press forging (mean strain rate of 0.15 s(-1)), rolling/extrusion (2-5 s(-1)), and hammer forging (100 s(-1)) at different temperatures in the range 600-1200 degrees C, was studied with a view to validating the predictions of the processing map. The results have shown that excellent correlation exists between the regimes exhibited by the map and the product microstructures. 304L stainless steel exhibits instability bands when hammer forged at temperatures below 1100 degrees C, rolled/extruded below 1000 degrees C, or press forged below 800 degrees C. All of these conditions must be avoided in mechanical processing of the material. On the other hand, ideally, the material may be rolled, extruded, or press forged at 1200 degrees C to obtain a defect-free microstructure.
Resumo:
The hot deformation behavior of hot isostatically pressed (HIPd) P/M IN-100 superalloy has been studied in the temperature range 1000-1200 degrees C and strain rate range 0.0003-10 s(-1) using hot compression testing. A processing map has been developed on the basis of these data and using the principles of dynamic materials modelling. The map exhibited three domains: one at 1050 degrees C and 0.01 s(-1), with a peak efficiency of power dissipation of approximate to 32%, the second at 1150 degrees C and 10 s(-1), with a peak efficiency of approximate to 36% and the third at 1200 degrees C and 0.1 s(-1), with a similar efficiency. On the basis of optical and electron microscopic observations, the first domain was interpreted to represent dynamic recovery of the gamma phase, the second domain represents dynamic recrystallization (DRX) of gamma in the presence of softer gamma', while the third domain represents DRX of the gamma phase only. The gamma' phase is stable upto 1150 degrees C, gets deformed below this temperature and the chunky gamma' accumulates dislocations, which at larger strains cause cracking of this phase. At temperatures lower than 1080 degrees C and strain rates higher than 0.1 s(-1), the material exhibits flow instability, manifested in the form of adiabatic shear bands. The material may be subjected to mechanical processing without cracking or instabilities at 1200 degrees C and 0.1 s(-1), which are the conditions for DRX of the gamma phase.
Resumo:
An overview of the synthesis of materials under microwave irradiation has been presented based on the work performed recently. A variety of reactions such as direct combination, carbothermal reduction, carbidation and nitridation have been described. Examples of microwave preparation of glasses are also presented. Great advantages of fast, clean and reduced reaction temperature of microwave methods are emphasized. The example of ZrO2-CeO2 ceramics has been used show the extraordinarily fast and effective sintering which occurs in microwave irradiation.