182 results for Compute Unified Device Architecture (CUDA)
in Indian Institute of Science - Bangalore - India
Abstract:
Real-time image reconstruction is essential for improving the temporal resolution of fluorescence microscopy. A number of unavoidable processes, such as optical aberration, noise, and scattering, degrade image quality, thereby making image reconstruction an ill-posed problem. Maximum likelihood is an attractive technique for data reconstruction, especially when the problem is ill-posed. However, the iterative nature of the maximum likelihood technique hinders real-time imaging. Here we propose and demonstrate a compute unified device architecture (CUDA) based fast computing engine for real-time 3D fluorescence imaging. A maximum performance boost of 210x is reported. The easy availability of powerful computing engines is a boon and may accelerate the realization of real-time 3D fluorescence imaging. Copyright 2012 Author(s). This article is distributed under a Creative Commons Attribution 3.0 Unported License. [http://dx.doi.org/10.1063/1.4754604]
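As a rough illustration of why iterative maximum likelihood reconstruction is costly yet parallelizable, the sketch below runs a 1D Richardson-Lucy iteration in plain Python. This is an assumption made for illustration: the abstract does not state which update rule the authors use, and Richardson-Lucy is only one common maximum likelihood scheme (for Poisson noise). The names `convolve_same` and `richardson_lucy` are hypothetical. Each output element of every step is computed independently, which is the property a GPU implementation would exploit.

```python
def convolve_same(signal, kernel):
    """Same-size 1D convolution with zero padding (odd-length kernel assumed)."""
    half = len(kernel) // 2
    n = len(signal)
    out = [0.0] * n
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + half - k
            if 0 <= j < n:
                acc += w * signal[j]
        out[i] = acc
    return out


def richardson_lucy(observed, psf, iterations=50, eps=1e-12):
    """Illustrative Richardson-Lucy deconvolution (assumed algorithm, not the paper's code)."""
    flipped = psf[::-1]
    # Start from a flat, strictly positive estimate.
    start = max(sum(observed) / len(observed), eps)
    estimate = [start] * len(observed)
    for _ in range(iterations):
        blurred = convolve_same(estimate, psf)
        # Ratio of the measured data to the data predicted by the current estimate.
        ratio = [o / max(b, eps) for o, b in zip(observed, blurred)]
        correction = convolve_same(ratio, flipped)
        # Multiplicative update keeps the estimate non-negative.
        estimate = [e * c for e, c in zip(estimate, correction)]
    return estimate
```

Calling `richardson_lucy(observed, psf)` with a blurred point source and the blur kernel that produced it should concentrate intensity back toward the source position; the iteration count is a tunable assumption.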
Abstract:
The three-dimensional diffuse optical tomographic (3-D DOT) image reconstruction algorithm is computationally complex and requires extensive matrix computations, which hampers reconstruction in real time. In this paper, we present near-real-time 3-D DOT image reconstruction based on the Broyden approach for updating the Jacobian matrix. The Broyden method simplifies the algorithm by avoiding re-computation of the Jacobian matrix in each iteration. We have developed CPU and heterogeneous CPU/GPU code for 3-D DOT image reconstruction in C and MATLAB. We have used the Compute Unified Device Architecture (CUDA) programming framework and the CUDA linear algebra library (CULA) to utilize the massively parallel computational power of GPUs (NVIDIA Tesla K20c). The computation time achieved by the C implementation on a CPU/GPU system, for a three-plane measurement and an FEM mesh of 19172 tetrahedral elements, is 806 milliseconds per iteration.
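The idea of updating the Jacobian instead of recomputing it can be sketched with the standard Broyden rank-one update, J_new = J + ((dy - J dx) dx^T) / (dx^T dx). The function below is a minimal pure-Python illustration of that textbook formula. The name `broyden_update` and the list-of-lists matrix layout are assumptions, and the code is not taken from the paper's C/CUDA implementation.

```python
def broyden_update(J, dx, dy):
    """Return Broyden's rank-one update of J for a step dx and observed change dy."""
    denom = sum(v * v for v in dx)
    if denom == 0.0:
        # No step was taken, so there is no new information to fold in.
        return [row[:] for row in J]
    # Change predicted by the current Jacobian for this step.
    predicted = [sum(a * b for a, b in zip(row, dx)) for row in J]
    # Mismatch between the observed and the predicted change.
    residual = [o - p for o, p in zip(dy, predicted)]
    # Add the outer product residual * dx^T, scaled by 1 / (dx^T dx).
    return [[a + r * b / denom for a, b in zip(row, dx)]
            for row, r in zip(J, residual)]
```

The updated matrix satisfies the secant condition (J_new applied to dx reproduces dy) at the cost of one matrix-vector product and one outer product, which is far cheaper than rebuilding the full Jacobian.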
Abstract:
Rapid reconstruction of multidimensional images is crucial for enabling real-time 3D fluorescence imaging. This becomes a key factor for imaging rapidly occurring events in the cellular environment. To facilitate real-time imaging, we have developed a graphics processing unit (GPU) based real-time maximum a-posteriori (MAP) image reconstruction system. The parallel processing capability of the GPU device, which consists of a large number of tiny processing cores, and the adaptability of the image reconstruction algorithm to parallel processing (which employs multiple independent computing modules called threads) result in high temporal resolution. Moreover, the proposed quadratic potential based MAP algorithm effectively deconvolves the images and suppresses the noise. The multi-node multi-threaded GPU and the Compute Unified Device Architecture (CUDA) efficiently execute the iterative image reconstruction algorithm, which is approximately 200-fold faster (for large datasets) than existing CPU based systems. (C) 2015 Author(s). All article content, except where otherwise noted, is licensed under a Creative Commons Attribution 3.0 Unported License.
Abstract:
In this paper, we propose an extension to the I/O device architecture, as recommended in the PCI-SIG IOV specification, for virtualizing network I/O devices. The aim is to enable fine-grained controls to a virtual machine on the I/O path of a shared device. The architecture allows native access of I/O devices to virtual machines and provides device level QoS hooks for controlling VM specific device usage. For evaluating the architecture we use layered queuing network (LQN) models. We implement the architecture and evaluate it using simulation techniques, on the LQN model, to demonstrate the benefits. With the architecture, the benefit for network I/O is 60% more than what can be expected on the existing architecture. Also, the proposed architecture improves scalability in terms of the number of virtual machines intending to share the I/O device.
Abstract:
The prevalent virtualization technologies provide QoS support within the software layers of the virtual machine monitor (VMM) or the operating system of the virtual machine (VM). The QoS features are mostly provided as extensions to the existing software used for accessing the I/O device. As a result, applications sharing the I/O device lose performance because of crosstalk effects or reduced usable bandwidth. In this paper we examine the NIC sharing effects across VMs on a Xen virtualized server and present an alternate paradigm that improves the shared bandwidth and reduces the crosstalk effect on the VMs. We implement the proposed hardware-software changes in a layered queuing network (LQN) model and use simulation techniques to evaluate the architecture. We find that simple changes in the device architecture and associated system software lead to application throughput improvement of up to 60%. The architecture also enables finer QoS controls at device level and increases the scalability of device sharing across multiple virtual machines. We find that the performance improvement derived using the LQN model is comparable to that reported by similar but real implementations.
Abstract:
Due to its extremely low off-state current (IOFF) and excellent sub-threshold characteristics, the tunnel field effect transistor (TFET) has attracted a lot of attention for low standby power applications. In this work, we aim to increase the on-state current (ION) of the device. A novel device architecture with a SiGe source is proposed. The proposed structure shows an order of magnitude improvement in ION compared to the conventional Si structure. A process flow adaptable to conventional CMOS technology is also addressed.
Abstract:
The electronic state in ultrathin gold nanowires is tuned by careful engineering of the device architecture via a chemical methodology. The electrons are localized to an insulating state (showing variable range hopping transport) by simply bringing them close to the substrate, while the insertion of an interlayer leads to a Tomonaga-Luttinger liquid state.
Abstract:
High sensitivity gas sensors are typically realized using metal catalysts and nanostructured materials, utilizing non-conventional synthesis and processing techniques that are incompatible with on-chip integration of sensor arrays. In this work, we report a new device architecture, a suspended core-shell Pt-PtOx nanostructure, that is fully CMOS-compatible. The device consists of a metal gate core embedded within a partially suspended semiconductor shell, with source and drain contacts in the anchored region. The reduced work function in the suspended region, coupled with the built-in electric field of the metal-semiconductor junction, enables the modulation of drain current due to room temperature redox reactions on exposure to gas. The device architecture is validated using a Pt-PtO2 suspended nanostructure for sensing H2 down to 200 ppb at room temperature. By exploiting the catalytic activity of PtO2, in conjunction with its p-type semiconducting behavior, we demonstrate about two orders of magnitude improvement in sensitivity and limit of detection compared to the sensors reported in recent literature. A Pt thin film, deposited on SiO2, is lithographically patterned and converted into a suspended Pt-PtO2 sensor in a single-step isotropic SiO2 etch. An optimum design space for the sensor is elucidated, with the initial Pt film thickness ranging between 10 nm and 30 nm, for low power (< 5 µW), room temperature operation. (C) 2015 AIP Publishing LLC.
Abstract:
Video decoders used in emerging applications need to be flexible enough to handle a large variety of video formats and to deliver scalable performance across wide variations in workload. In this paper we propose a unified software and hardware architecture for video decoding to achieve scalable performance with flexibility. The lightweight processor tiles and the reconfigurable hardware tiles in our architecture enable software and hardware implementations to co-exist, while a programmable interconnect enables dynamic interconnection of the tiles. Our process network oriented compilation flow achieves realization-agnostic application partitioning and enables seamless migration across uniprocessor, multi-processor, semi-hardware and full-hardware implementations of a video decoder. An application quality of service aware scheduler monitors and controls the operation of the entire system. We prove the concept through a prototype of the architecture on an off-the-shelf FPGA. The FPGA prototype shows a scaling in performance from QCIF to 1080p resolutions in four discrete steps. We also demonstrate that the reconfiguration time is short enough to allow migration from one configuration to the other without any frame loss.
Abstract:
Packet forwarding is a memory-intensive application requiring multiple accesses through a trie structure. With the requirement to process packets at line rates, high-performance routers need to forward millions of packets every second, with each packet needing up to seven memory accesses. Earlier work shows that a single cache for the nodes of a trie can reduce the number of external memory accesses. It is observed that the locality characteristics of the level-one nodes of a trie are significantly different from those of lower level nodes. Hence, we propose a heterogeneously segmented cache architecture (HSCA) which uses separate caches for level-one and lower level nodes, each with carefully chosen sizes. Besides reducing misses, segmenting the cache allows us to focus on optimizing the more frequently accessed level-one node segment. We find that due to the nonuniform distribution of nodes among cache sets, the level-one nodes cache is susceptible to high conflict misses. We reduce conflict misses by introducing a novel two-level mapping-based cache placement framework. We also propose an elegant way to fit the modified placement function into the cache organization with minimal increase in access time. Further, we propose an attribute preserving trace generation methodology which emulates real traces and can generate traces with varying locality. Performance results reveal that our HSCA scheme results in a 32 percent speedup in average memory access time over a unified nodes cache. Also, HSCA outperforms IHARC, a cache for lookup results, with as high as a 10-fold speedup in average memory access time. Two-level mapping further enhances the performance of the base HSCA by up to 13 percent, leading to an overall improvement of up to 40 percent over the unified scheme.
Abstract:
In modern wireline and wireless communication systems, the Viterbi decoder is one of the most compute intensive and essential elements. Each standard requires a different configuration of the Viterbi decoder. Hence there is a need to design a flexible reconfigurable Viterbi decoder to support different configurations on a single platform. In this paper we present a reconfigurable Viterbi decoder which can be reconfigured for standards such as WCDMA, CDMA2000, IEEE 802.11, DAB, DVB, and GSM. Different parameters like code rate, constraint length, polynomials and truncation length can be configured to map any of the above mentioned standards. Our design provides higher throughput and scalable power consumption in various configurations of the reconfigurable Viterbi decoder. The power and throughput can also be optimized for different standards.
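To make the configurable parameters concrete, the following is a minimal software sketch of a hard-decision Viterbi decoder parameterized by constraint length `K` and generator polynomials `polys` (the code rate is 1/len(polys)). It is an illustration under assumptions, not the hardware design described above: the function names are hypothetical, truncation length is not modeled (full survivor paths are kept), and the polynomials are supplied by the caller as integers.

```python
def parity(x):
    """Return 1 if x has an odd number of set bits, else 0."""
    return bin(x).count("1") & 1


def encode(bits, K, polys):
    """Rate 1/len(polys) convolutional encoder starting in the all-zero state."""
    state, out = 0, []
    for b in bits:
        reg = (b << (K - 1)) | state          # newest bit enters at the top
        out.extend(parity(reg & g) for g in polys)
        state = reg >> 1                      # keep the most recent K-1 bits
    return out


def viterbi_decode(received, K, polys):
    """Hard-decision Viterbi decoding with Hamming-distance branch metrics."""
    n = len(polys)
    n_states = 1 << (K - 1)
    inf = float("inf")
    metrics = [0] + [inf] * (n_states - 1)    # encoder is assumed to start in state 0
    paths = [[] for _ in range(n_states)]
    for i in range(0, len(received), n):
        symbol = received[i:i + n]
        new_metrics = [inf] * n_states
        new_paths = [None] * n_states
        for state in range(n_states):
            if metrics[state] == inf:
                continue                      # state not reachable yet
            for b in (0, 1):
                reg = (b << (K - 1)) | state
                expected = [parity(reg & g) for g in polys]
                cost = metrics[state] + sum(e != r for e, r in zip(expected, symbol))
                nxt = reg >> 1
                if cost < new_metrics[nxt]:   # keep only the survivor per state
                    new_metrics[nxt] = cost
                    new_paths[nxt] = paths[state] + [b]
        metrics, paths = new_metrics, new_paths
    best = min(range(n_states), key=lambda s: metrics[s])
    return paths[best]
```

For example, `viterbi_decode(encode(msg, 3, (0b111, 0b101)), 3, (0b111, 0b101))` round-trips a bit list `msg`; the (7, 5) octal polynomial pair is a common textbook choice used here only as an assumption. Switching standards in this sketch amounts to passing a different `K` and `polys`.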
Abstract:
REDEFINE is a reconfigurable SoC architecture that provides a unique platform for high performance and low power computing by exploiting the synergistic interaction between a coarse grain dynamic dataflow model of computation (to expose abundant parallelism in applications) and runtime composition of efficient compute structures (on the reconfigurable computation resources). We propose and study the throttling of execution in REDEFINE to maximize the architecture efficiency. A feature specific fast hybrid (mixed level) simulation framework for early-design-phase study is developed and implemented to make the huge design space exploration practical. We do performance modeling in terms of selection of important performance criteria and ranking of the explored throttling schemes, and investigate the effectiveness of the design space exploration using statistical hypothesis testing. We find throttling schemes which give an appreciable (24.8%) overall performance gain in the architecture and a 37% resource usage gain in the throttling unit simultaneously.
Abstract:
Modern wireline and wireless communication devices are multimode and multifunctional. In order to support multiple standards on a single platform, it is necessary to develop a reconfigurable architecture that can provide the required flexibility and performance. The channel decoder is one of the most compute intensive and essential elements of any communication system. Most of the standards require a reconfigurable channel decoder that is capable of performing Viterbi decoding and Turbo decoding. Furthermore, the channel decoder needs to support different configurations of Viterbi and Turbo decoders. In this paper, we propose a reconfigurable channel decoder that can be reconfigured for standards such as WCDMA, CDMA2000, IEEE 802.11, DAB, DVB and GSM. Different parameters like code rate, constraint length, polynomials and truncation length can be configured to map any of the above mentioned standards. A multiprocessor approach has been followed to provide higher throughput and scalable power consumption in various configurations of the reconfigurable Viterbi decoder and Turbo decoder. We have proposed a hybrid register exchange approach for the multiprocessor architecture to minimize power consumption.
Abstract:
We propose a unified model for large signal and small signal non-quasi-static analysis of the long channel symmetric double gate MOSFET. The model is physics based and relies only on the very basic approximation needed for a charge-based model. It is based on the EKV formalism [Enz C, Vittoz EA. Charge based MOS transistor modeling. Wiley; 2006] and is valid in all regions of operation and thus suitable for RF circuit design. The proposed model is verified against a professional numerical device simulator, and excellent agreement is found. (C) 2010 Elsevier Ltd. All rights reserved.
Abstract:
FACTS controllers are emerging as viable and economic solutions to the problems of large interconnected networks, which can endanger the system security. These devices are characterized by their fast response, absence of inertia, and minimum maintenance requirements. Thyristor controlled equipment such as the Thyristor Controlled Series Capacitor (TCSC), Static Var Compensator (SVC), and Thyristor Controlled Phase angle Regulator (TCPR), which involve passive elements, result in devices of large size with substantial cost and significant labour for installation. An all solid-state device using GTOs leads to a reduction in equipment size and has improved performance. The Unified Power Flow Controller (UPFC) is a versatile controller which can be used to control the active and reactive power in the line independently. The concept of the UPFC makes it possible to handle practically all power flow control and transmission line compensation problems using solid-state controllers, which provide functional flexibility generally not attainable by conventional thyristor controlled systems. In this paper, we present the development of a control scheme for the series injected voltage of the UPFC to damp the power oscillations and improve transient stability in a power system. (C) 1998 Elsevier Science Ltd. All rights reserved.