139 resultados para Compute unified device architectures
Resumo:
Real-time image reconstruction is essential for improving the temporal resolution of fluorescence microscopy. A number of unavoidable processes such as, optical aberration, noise and scattering degrade image quality, thereby making image reconstruction an ill-posed problem. Maximum likelihood is an attractive technique for data reconstruction especially when the problem is ill-posed. Iterative nature of the maximum likelihood technique eludes real-time imaging. Here we propose and demonstrate a compute unified device architecture (CUDA) based fast computing engine for real-time 3D fluorescence imaging. A maximum performance boost of 210x is reported. Easy availability of powerful computing engines is a boon and may accelerate to realize real-time 3D fluorescence imaging. Copyright 2012 Author(s). This article is distributed under a Creative Commons Attribution 3.0 Unported License. http://dx.doi.org/10.1063/1.4754604]
Resumo:
3-Dimensional Diffuse Optical Tomographic (3-D DOT) image reconstruction algorithm is computationally complex and requires excessive matrix computations and thus hampers reconstruction in real time. In this paper, we present near real time 3D DOT image reconstruction that is based on Broyden approach for updating Jacobian matrix. The Broyden method simplifies the algorithm by avoiding re-computation of the Jacobian matrix in each iteration. We have developed CPU and heterogeneous CPU/GPU code for 3D DOT image reconstruction in C and MatLab programming platform. We have used Compute Unified Device Architecture (CUDA) programming framework and CUDA linear algebra library (CULA) to utilize the massively parallel computational power of GPUs (NVIDIA Tesla K20c). The computation time achieved for C program based implementation for a CPU/GPU system for 3 planes measurement and FEM mesh size of 19172 tetrahedral elements is 806 milliseconds for an iteration.
Resumo:
Rapid reconstruction of multidimensional image is crucial for enabling real-time 3D fluorescence imaging. This becomes a key factor for imaging rapidly occurring events in the cellular environment. To facilitate real-time imaging, we have developed a graphics processing unit (GPU) based real-time maximum a-posteriori (MAP) image reconstruction system. The parallel processing capability of GPU device that consists of a large number of tiny processing cores and the adaptability of image reconstruction algorithm to parallel processing (that employ multiple independent computing modules called threads) results in high temporal resolution. Moreover, the proposed quadratic potential based MAP algorithm effectively deconvolves the images as well as suppresses the noise. The multi-node multi-threaded GPU and the Compute Unified Device Architecture (CUDA) efficiently execute the iterative image reconstruction algorithm that is similar to 200-fold faster (for large dataset) when compared to existing CPU based systems. (C) 2015 Author(s). All article content, except where otherwise noted, is licensed under a Creative Commons Attribution 3.0 Unported License.
Resumo:
We propose a unified model for large signal and small signal non-quasi-static analysis of long channel symmetric double gate MOSFET. The model is physics based and relies only on the very basic approximation needed for a charge-based model. It is based on the EKV formalism Enz C, Vittoz EA. Charge based MOS transistor modeling. Wiley; 2006] and is valid in all regions of operation and thus suitable for RF circuit design. Proposed model is verified with professional numerical device simulator and excellent agreement is found. (C) 2010 Elsevier Ltd. All rights reserved.
Resumo:
FACTS controllers are emerging as viable and economic solutions to the problems of large interconnected ne networks, which can endanger the system security. These devices are characterized by their fast response, absence of inertia, and minimum maintenance requirements. Thyristor controlled equipment like Thyristor Controlled Series Capacitor (TCSC), Static Var Compensator (SVC), Thyristor Controlled Phase angle Regulator (TCPR) etc. which involve passive elements result in devices of large sizes with substantial cost and significant labour for installation. An all solid-state device using GTOs leads to reduction in equipment size and has improved performance. The Unified Power Flow Controller (UPFC) is a versatile controller which can be used to control the active and reactive power in the Line independently. The concept of UPFC makes it possible to handle practically all power flow control and transmission line compensation problems, using solid-state controllers, which provide functional flexibility, generally not attainable by conventional thyristor controlled systems. In this paper, we present the development of a control scheme for the series injected voltage of the UPFC to damp the power oscillations and improve transient stability in a power system. (C) 1998 Elsevier Science Ltd. All rights reserved.
Resumo:
Frequent accesses to the register file make it one of the major sources of energy consumption in ILP architectures. The large number of functional units connected to a large unified register file in VLIW architectures make power dissipation in the register file even worse because of the need for a large number of ports. High power dissipation in a relatively smaller area occupied by a register file leads to a high power density in the register file and makes it one of the prime hot-spots. This makes it highly susceptible to the possibility of a catastrophic heatstroke. This in turn impacts the performance and cost because of the need for periodic cool down and sophisticated packaging and cooling techniques respectively. Clustered VLIW architectures partition the register file among clusters of functional units and reduce the number of ports required thereby reducing the power dissipation. However, we observe that the aggregate accesses to register files in clustered VLIW architectures (and associated energy consumption) become very high compared to the centralized VLIW architectures and this can be attributed to a large number of explicit inter-cluster communications. Snooping based clustered VLIW architectures provide very limited but very fast way of inter-cluster communication by allowing some of the functional units to directly read some of the operands from the register file of some of the other clusters. In this paper, we propose instruction scheduling algorithms that exploit the limited snooping capability to reduce the register file energy consumption on an average by 12% and 18% and improve the overall performance by 5% and 11% for a 2-clustered and a 4-clustered machine respectively, over an earlier state-of-the-art clustered scheduling algorithm when evaluated in the context of snooping based clustered VLIW architectures.
Resumo:
Programming for parallel architectures that do not have a shared address space is extremely difficult due to the need for explicit communication between memories of different compute devices. A heterogeneous system with CPUs and multiple GPUs, or a distributed-memory cluster are examples of such systems. Past works that try to automate data movement for distributed-memory architectures can lead to excessive redundant communication. In this paper, we propose an automatic data movement scheme that minimizes the volume of communication between compute devices in heterogeneous and distributed-memory systems. We show that by partitioning data dependences in a particular non-trivial way, one can generate data movement code that results in the minimum volume for a vast majority of cases. The techniques are applicable to any sequence of affine loop nests and works on top of any choice of loop transformations, parallelization, and computation placement. The data movement code generated minimizes the volume of communication for a particular configuration of these. We use a combination of powerful static analyses relying on the polyhedral compiler framework and lightweight runtime routines they generate, to build a source-to-source transformation tool that automatically generates communication code. We demonstrate that the tool is scalable and leads to substantial gains in efficiency. On a heterogeneous system, the communication volume is reduced by a factor of 11X to 83X over state-of-the-art, translating into a mean execution time speedup of 1.53X. On a distributed-memory cluster, our scheme reduces the communication volume by a factor of 1.4X to 63.5X over state-of-the-art, resulting in a mean speedup of 1.55X. In addition, our scheme yields a mean speedup of 2.19X over hand-optimized UPC codes.
Resumo:
Coarse Grained Reconfigurable Architectures (CGRA) are emerging as embedded application processing units in computing platforms for Exascale computing. Such CGRAs are distributed memory multi- core compute elements on a chip that communicate over a Network-on-chip (NoC). Numerical Linear Algebra (NLA) kernels are key to several high performance computing applications. In this paper we propose a systematic methodology to obtain the specification of Compute Elements (CE) for such CGRAs. We analyze block Matrix Multiplication and block LU Decomposition algorithms in the context of a CGRA, and obtain theoretical bounds on communication requirements, and memory sizes for a CE. Support for high performance custom computations common to NLA kernels are met through custom function units (CFUs) in the CEs. We present results to justify the merits of such CFUs.
Resumo:
The Artificial Neural Networks (ANNs) are being used to solve a variety of problems in pattern recognition, robotic control, VLSI CAD and other areas. In most of these applications, a speedy response from the ANNs is imperative. However, ANNs comprise a large number of artificial neurons, and a massive interconnection network among them. Hence, implementation of these ANNs involves execution of computer-intensive operations. The usage of multiprocessor systems therefore becomes necessary. In this article, we have presented the implementation of ART1 and ART2 ANNs on ring and mesh architectures. The overall system design and implementation aspects are presented. The performance of the algorithm on ring, 2-dimensional mesh and n-dimensional mesh topologies is presented. The parallel algorithm presented for implementation of ART1 is not specific to any particular architecture. The parallel algorithm for ARTE is more suitable for a ring architecture.
Resumo:
The properties of the manifold of a Lie groupG, fibered by the cosets of a sub-groupH, are exploited to obtain a geometrical description of gauge theories in space-timeG/H. Gauge potentials and matter fields are pullbacks of equivariant fields onG. Our concept of a connection is more restricted than that in the similar scheme of Ne'eman and Regge, so that its degrees of freedom are just those of a set of gauge potentials forG, onG/H, with no redundant components. The ldquotranslationalrdquo gauge potentials give rise in a natural way to a nonsingular tetrad onG/H. The underlying groupG to be gauged is the groupG of left translations on the manifoldG and is associated with a ldquotrivialrdquo connection, namely the Maurer-Cartan form. Gauge transformations are all those diffeomorphisms onG that preserve the fiber-bundle structure.
Resumo:
Experimental results pertaining to the initiation, dynamics and mechanism of cavitation erosion on poly(methyl methacrylate) specimens tested in a rotating disk device are described in detail. Erosion normally starts at the location nearest to the center of rotation (CR). As the exposure time to cavitation increases, additional erosion areas or sites appear away from the CR and secondary erosion (induced by eroded pits) spreads upstream and merges with the main pit. The microcracks increase in density towards the end of the incubation period and transform into macrocracks in most cases. A study of light optical photographs and scanning electron micrographs of the eroded area shows that material particles are removed from the network of cracks because of crack joining and pits indicate particle debris. Optical degradation (loss of transmittance) is observed to be greater on the back of the specimen than on the front.
Resumo:
In this paper we develop compilation techniques for the realization of applications described in a High Level Language (HLL) onto a Runtime Reconfigurable Architecture. The compiler determines Hyper Operations (HyperOps) that are subgraphs of a data flow graph (of an application) and comprise elementary operations that have strong producer-consumer relationship. These HyperOps are hosted on computation structures that are provisioned on demand at runtime. We also report compiler optimizations that collectively reduce the overheads of data-driven computations in runtime reconfigurable architectures. On an average, HyperOps offer a 44% reduction in total execution time and a 18% reduction in management overheads as compared to using basic blocks as coarse grained operations. We show that HyperOps formed using our compiler are suitable to support data flow software pipelining.
Resumo:
We discuss micro ring resonator based optical logic gates using Kerr-type nonlinearity. Resonant wavelength selectivity is one key factor in achieving the desired gate. Based on basic gates like AND gate, OR gate etc. We proceed to propose a 3-bit binary adder circuit.Due to the presence of more than a single wavelength, the system gets complicated as we increase the number of components in the circuit. Hence it has been observed that for efficient designing and functioning of digital circuits in optical domain, we need a device which can give single wavelength output, filtering out all other wavelengths and at the same time preserve the digital characteristics of the output. We propose such filter-preserver device based on micro ring resonator.
Resumo:
A recent article on the unified theory of Elementary Particle Forces by Howard Georgi and Sheldon Glashow (September 1980, page 30) points out that the unification of strong, weak and electromagnetic interactions involves the appearance of particles having almost macroscopic masses of about a nanogram (~1014 GeV). Such superheavy particles seem to be an inevitable feature of most grand unified theories Gravitation is still, however, left out of these various schemes.