37 resultados para implementations
Resumo:
QR decomposition (QRD) is a widely used Numerical Linear Algebra (NLA) kernel with applications ranging from SONAR beamforming to wireless MIMO receivers. In this paper, we propose a novel Givens Rotation (GR) based QRD (GR QRD) where we reduce the computational complexity of GR and exploit higher degree of parallelism. This low complexity Column-wise GR (CGR) can annihilate multiple elements of a column of a matrix simultaneously. The algorithm is first realized on a Two-Dimensional (2 D) systolic array and then implemented on REDEFINE which is a Coarse Grained run-time Reconfigurable Architecture (CGRA). We benchmark the proposed implementation against state-of-the-art implementations to report better throughput, convergence and scalability.
Resumo:
The correctness of a hard real-time system depends its ability to meet all its deadlines. Existing real-time systems use either a pure real-time scheduler or a real-time scheduler embedded as a real-time scheduling class in the scheduler of an operating system (OS). Existing implementations of schedulers in multicore systems that support real-time and non-real-time tasks, permit the execution of non-real-time tasks in all the cores with priorities lower than those of real-time tasks, but interrupts and softirqs associated with these non-real-time tasks can execute in any core with priorities higher than those of real-time tasks. As a result, the execution overhead of real-time tasks is quite large in these systems, which, in turn, affects their runtime. In order that the hard real-time tasks can be executed in such systems with minimal interference from other Linux tasks, we propose, in this paper, an integrated scheduler architecture, called SchedISA, which aims to considerably reduce the execution overhead of real-time tasks in these systems. In order to test the efficacy of the proposed scheduler, we implemented partitioned earliest deadline first (P-EDF) scheduling algorithm in SchedISA on Linux kernel, version 3.8, and conducted experiments on Intel core i7 processor with eight logical cores. We compared the execution overhead of real-time tasks in the above implementation of SchedISA with that in SCHED_DEADLINE's P-EDF implementation, which concurrently executes real-time and non-real-time tasks in Linux OS in all the cores. The experimental results show that the execution overhead of real-time tasks in the above implementation of SchedISA is considerably less than that in SCHED_DEADLINE. We believe that, with further refinement of SchedISA, the execution overhead of real-time tasks in SchedISA can be reduced to a predictable maximum, making it suitable for scheduling hard real-time tasks without affecting the CPU share of Linux tasks.
Resumo:
Remote sensing of physiological parameters could be a cost effective approach to improving health care, and low-power sensors are essential for remote sensing because these sensors are often energy constrained. This paper presents a power optimized photoplethysmographic sensor interface to sense arterial oxygen saturation, a technique to dynamically trade off SNR for power during sensor operation, and a simple algorithm to choose when to acquire samples in photoplethysmography. A prototype of the proposed pulse oximeter built using commercial-off-the-shelf (COTS) components is tested on 10 adults. The dynamic adaptation techniques described reduce power consumption considerably compared to our reference implementation, and our approach is competitive to state-of-the-art implementations. The techniques presented in this paper may be applied to low-power sensor interface designs where acquiring samples is expensive in terms of power as epitomized by pulse oximetry.
Resumo:
This paper presents the experimental results for an attractive control scheme implementation using an 8 bit microcontroller. The power converter involved is a 3 phase full controlled bridge rectifier. A single quadrant DC drive has been realized and results have been presented for both open and closed loop implementations.
Resumo:
The charge-pump (CP) mismatch current is a dominant source of static phase error and reference spur in the nano-meter CMOS PLL implementations due to its worsened channel length modulation effect. This paper presents a charge-pump (CP) mismatch current reduction technique utilizing an adaptive body bias tuning of CP transistors and a zero CP mismatch current tracking PLL architecture for reference spur suppression. A chip prototype of the proposed circuit was implemented in 0.13 mu m CMOS technology. The frequency synthesizer consumes 8.2 mA current from a 13 V supply voltage and achieves a phase noise of -96.01 dBc/Hz @ 1 MHz offset from a 2.4 GHz RF carrier. The charge-pump measurements using the proposed calibration technique exhibited a mismatch current of less than 0.3 mu A (0.55%) over the VCO control voltage range of 0.3-1.0 V. The closed loop measurements show a minimized static phase error of within +/- 70 ps and a similar or equal to 9 dB reduction in reference spur level across the PLL output frequency range 2.4-2.5 GHz. The presented CP calibration technique compensates for the DC current mismatch and the mismatch due to channel length modulation effect and therefore improves the performance of CP-PLLs in nano-meter CMOS implementations. (C) 2015 Elsevier Ltd. All rights reserved.
Resumo:
Graph algorithms have been shown to possess enough parallelism to keep several computing resources busy-even hundreds of cores on a GPU. Unfortunately, tuning their implementation for efficient execution on a particular hardware configuration of heterogeneous systems consisting of multicore CPUs and GPUs is challenging, time consuming, and error prone. To address these issues, we propose a domain-specific language (DSL), Falcon, for implementing graph algorithms that (i) abstracts the hardware, (ii) provides constructs to write explicitly parallel programs at a higher level, and (iii) can work with general algorithms that may change the graph structure (morph algorithms). We illustrate the usage of our DSL to implement local computation algorithms (that do not change the graph structure) and morph algorithms such as Delaunay mesh refinement, survey propagation, and dynamic SSSP on GPU and multicore CPUs. Using a set of benchmark graphs, we illustrate that the generated code performs close to the state-of-the-art hand-tuned implementations.
Resumo:
This paper presents the design and implementation of PolyMage, a domain-specific language and compiler for image processing pipelines. An image processing pipeline can be viewed as a graph of interconnected stages which process images successively. Each stage typically performs one of point-wise, stencil, reduction or data-dependent operations on image pixels. Individual stages in a pipeline typically exhibit abundant data parallelism that can be exploited with relative ease. However, the stages also require high memory bandwidth preventing effective utilization of parallelism available on modern architectures. For applications that demand high performance, the traditional options are to use optimized libraries like OpenCV or to optimize manually. While using libraries precludes optimization across library routines, manual optimization accounting for both parallelism and locality is very tedious. The focus of our system, PolyMage, is on automatically generating high-performance implementations of image processing pipelines expressed in a high-level declarative language. Our optimization approach primarily relies on the transformation and code generation capabilities of the polyhedral compiler framework. To the best of our knowledge, this is the first model-driven compiler for image processing pipelines that performs complex fusion, tiling, and storage optimization automatically. Experimental results on a modern multicore system show that the performance achieved by our automatic approach is up to 1.81x better than that achieved through manual tuning in Halide, a state-of-the-art language and compiler for image processing pipelines. For a camera raw image processing pipeline, our performance is comparable to that of a hand-tuned implementation.