857 resultados para run-time allocation
Resumo:
Floquet analysis is widely used for small-order systems (say, order M < 100) to find trim results of control inputs and periodic responses, and stability results of damping levels and frequencies, Presently, however, it is practical neither for design applications nor for comprehensive analysis models that lead to large systems (M > 100); the run time on a sequential computer is simply prohibitive, Accordingly, a massively parallel Floquet analysis is developed with emphasis on large systems, and it is implemented on two SIMD or single-instruction, multiple-data computers with 4096 and 8192 processors, The focus of this development is a parallel shooting method with damped Newton iteration to generate trim results; the Floquet transition matrix (FTM) comes out as a byproduct, The eigenvalues and eigenvectors of the FTM are computed by a parallel QR method, and thereby stability results are generated, For illustration, flap and flap-lag stability of isolated rotors are treated by the parallel analysis and by a corresponding sequential analysis with the conventional shooting and QR methods; linear quasisteady airfoil aerodynamics and a finite-state three-dimensional wake model are used, Computational reliability is quantified by the condition numbers of the Jacobian matrices in Newton iteration, the condition numbers of the eigenvalues and the residual errors of the eigenpairs, and reliability figures are comparable in both the parallel and sequential analyses, Compared to the sequential analysis, the parallel analysis reduces the run time of large systems dramatically, and the reduction increases with increasing system order; this finding offers considerable promise for design and comprehensive-analysis applications.
Resumo:
We propose a Low Noise Amplifier (LNA) architecture for power scalable receiver front end (FE) for Zigbee. The motivation for power scalable receiver is to enable minimum power operation while meeting the run-time performance needed. We use simple models to find empirical relations between the available signal and interference levels to come up with required Noise Figure (NF) and 3rd order Intermodulation Product (IIP3) numbers. The architecture has two independent digital knobs to control the NF and IIP3. Acceptable input match while using adaptation has been achieved by using an Active Inductor configuration for the source degeneration inductor of the LNA. The low IF receiver front end (LNA with I and Q mixers) was fabricated in 130nm RFCMOS process and tested.
Resumo:
Instruction reuse is a microarchitectural technique that improves the execution time of a program by removing redundant computations at run-time. Although this is the job of an optimizing compiler, they do not succeed many a time due to limited knowledge of run-time data. In this paper we examine instruction reuse of integer ALU and load instructions in network processing applications. Specifically, this paper attempts to answer the following questions: (1) How much of instruction reuse is inherent in network processing applications?, (2) Can reuse be improved by reducing interference in the reuse buffer?, (3) What characteristics of network applications can be exploited to improve reuse?, and (4) What is the effect of reuse on resource contention and memory accesses? We propose an aggregation scheme that combines the high-level concept of network traffic i.e. "flows" with a low level microarchitectural feature of programs i.e. repetition of instructions and data along with an architecture that exploits temporal locality in incoming packet data to improve reuse. We find that for the benchmarks considered, 1% to 50% of instructions are reused while the speedup achieved varies between 1% and 24%. As a side effect, instruction reuse reduces memory traffic and can therefore be considered as a scheme for low power.
Resumo:
Most Java programmers would agree that Java is a language that promotes a philosophy of “create and go forth”. By design, temporary objects are meant to be created on the heap, possibly used and then abandoned to be collected by the garbage collector. Excessive generation of temporary objects is termed “object churn” and is a form of software bloat that often leads to performance and memory problems. To mitigate this problem, many compiler optimizations aim at identifying objects that may be allocated on the stack. However, most such optimizations miss large opportunities for memory reuse when dealing with objects inside loops or when dealing with container objects. In this paper, we describe a novel algorithm that detects bloat caused by the creation of temporary container and String objects within a loop. Our analysis determines which objects created within a loop can be reused. Then we describe a source-to-source transformation that efficiently reuses such objects. Empirical evaluation indicates that our solution can reduce upto 40% of temporary object allocations in large programs, resulting in a performance improvement that can be as high as a 20% reduction in the run time, specifically when a program has a high churn rate or when the program is memory intensive and needs to run the GC often.
Resumo:
Frequent episode discovery is a popular framework for pattern discovery from sequential data. It has found many applications in domains like alarm management in telecommunication networks, fault analysis in the manufacturing plants, predicting user behavior in web click streams and so on. In this paper, we address the discovery of serial episodes. In the episodes context, there have been multiple ways to quantify the frequency of an episode. Most of the current algorithms for episode discovery under various frequencies are apriori-based level-wise methods. These methods essentially perform a breadth-first search of the pattern space. However currently there are no depth-first based methods of pattern discovery in the frequent episode framework under many of the frequency definitions. In this paper, we try to bridge this gap. We provide new depth-first based algorithms for serial episode discovery under non-overlapped and total frequencies. Under non-overlapped frequency, we present algorithms that can take care of span constraint and gap constraint on episode occurrences. Under total frequency we present an algorithm that can handle span constraint. We provide proofs of correctness for the proposed algorithms. We demonstrate the effectiveness of the proposed algorithms by extensive simulations. We also give detailed run-time comparisons with the existing apriori-based methods and illustrate scenarios under which the proposed pattern-growth algorithms perform better than their apriori counterparts. (C) 2013 Elsevier B.V. All rights reserved.
Resumo:
A power scalable receiver architecture is presented for low data rate Wireless Sensor Network (WSN) applications in 130nm RF-CMOS technology. Power scalable receiver is motivated by the ability to leverage lower run-time performance requirement to save power. The proposed receiver is able to switch power settings based on available signal and interference levels while maintaining requisite BER. The Low-IF receiver consists of Variable Noise and Linearity LNA, IQ Mixers, VGA, Variable Order Complex Bandpass Filter and Variable Gain and Bandwidth Amplifier (VGBWA) capable of driving variable sampling rate ADC. Various blocks have independent power scaling controls depending on their noise, gain and interference rejection (IR) requirements. The receiver is designed for constant envelope QPSK-type modulation with 2.4GHz RF input, 3MHz IF and 2MHz bandwidth. The chip operates at 1V Vdd with current scalable from 4.5mA to 1.3mA and chip area of 0.65mm2.
Resumo:
In wireless sensor networks (WSNs) the communication traffic is often time and space correlated, where multiple nodes in a proximity start transmitting at the same time. Such a situation is known as spatially correlated contention. The random access methods to resolve such contention suffers from high collision rate, whereas the traditional distributed TDMA scheduling techniques primarily try to improve the network capacity by reducing the schedule length. Usually, the situation of spatially correlated contention persists only for a short duration and therefore generating an optimal or sub-optimal schedule is not very useful. On the other hand, if the algorithm takes very large time to schedule, it will not only introduce additional delay in the data transfer but also consume more energy. To efficiently handle the spatially correlated contention in WSNs, we present a distributed TDMA slot scheduling algorithm, called DTSS algorithm. The DTSS algorithm is designed with the primary objective of reducing the time required to perform scheduling, while restricting the schedule length to maximum degree of interference graph. The algorithm uses randomized TDMA channel access as the mechanism to transmit protocol messages, which bounds the message delay and therefore reduces the time required to get a feasible schedule. The DTSS algorithm supports unicast, multicast and broadcast scheduling, simultaneously without any modification in the protocol. The protocol has been simulated using Castalia simulator to evaluate the run time performance. Simulation results show that our protocol is able to considerably reduce the time required to schedule.
Resumo:
QR decomposition (QRD) is a widely used Numerical Linear Algebra (NLA) kernel with applications ranging from SONAR beamforming to wireless MIMO receivers. In this paper, we propose a novel Givens Rotation (GR) based QRD (GR QRD) where we reduce the computational complexity of GR and exploit higher degree of parallelism. This low complexity Column-wise GR (CGR) can annihilate multiple elements of a column of a matrix simultaneously. The algorithm is first realized on a Two-Dimensional (2 D) systolic array and then implemented on REDEFINE which is a Coarse Grained run-time Reconfigurable Architecture (CGRA). We benchmark the proposed implementation against state-of-the-art implementations to report better throughput, convergence and scalability.
Resumo:
This paper presents exploratory and statistical analyses of the activity-travel behaviour of non-workers in Bangalore city in India. The study summarises the socio-demographic characteristics as well as the activity-travel behaviour of non-workers using a primary activity-travel survey data collected by the authors. Where possible, the research also compares the analysis findings with the case studies on activity-travel behaviour of non-workers, carried out in developed and developing countries. This gives an opportunity to understand the differences/similarities in the activity-travel behaviour of non-workers across diverse socio-cultural settings. The preliminary exploratory analysis shed light on the differences in activity participation, trip chaining, time-of-day preference for trip departure, and mode use behaviour of non-workers in Bangalore city. Statistical models were developed for investigating the effects of individual and household socio-demographics, land use parameters, and travel context attributes on activity participation, trip chaining, time-of-day choice, and mode choice decisions of non-workers. A few important results of the analysis are the influence of viewing television at home on out-of-home activity participation and trip-chaining behaviour, and the impact of in-home maintenance activity duration on time-of-day choice. Further, based on the findings of the initial analyses, an attempt has been made in this study to develop an integrated model that links time allocation, time-of-day choice, and trip chaining behaviour of non-workers. The study also discusses the implications of the research findings for transportation planning and policy for Bangalore city. (C) 2015 Elsevier Ltd. All rights reserved.
Resumo:
In this paper, we present Bi-Modal Cache - a flexible stacked DRAM cache organization which simultaneously achieves several objectives: (i) improved cache hit ratio, (ii) moving the tag storage overhead to DRAM, (iii) lower cache hit latency than tags-in-SRAM, and (iv) reduction in off-chip bandwidth wastage. The Bi-Modal Cache addresses the miss rate versus off-chip bandwidth dilemma by organizing the data in a bi-modal fashion - blocks with high spatial locality are organized as large blocks and those with little spatial locality as small blocks. By adaptively selecting the right granularity of storage for individual blocks at run-time, the proposed DRAM cache organization is able to make judicious use of the available DRAM cache capacity as well as reduce the off-chip memory bandwidth consumption. The Bi-Modal Cache improves cache hit latency despite moving the metadata to DRAM by means of a small SRAM based Way Locator. Further by leveraging the tremendous internal bandwidth and capacity that stacked DRAM organizations provide, the Bi-Modal Cache enables efficient concurrent accesses to tags and data to reduce hit time. Through detailed simulations, we demonstrate that the Bi-Modal Cache achieves overall performance improvement (in terms of Average Normalized Turnaround Time (ANTT)) of 10.8%, 13.8% and 14.0% in 4-core, 8-core and 16-core workloads respectively.
Resumo:
It was demonstrated in earlier work that, by approximating its range kernel using shiftable functions, the nonlinear bilateral filter can be computed using a series of fast convolutions. Previous approaches based on shiftable approximation have, however, been restricted to Gaussian range kernels. In this work, we propose a novel approximation that can be applied to any range kernel, provided it has a pointwise-convergent Fourier series. More specifically, we propose to approximate the Gaussian range kernel of the bilateral filter using a Fourier basis, where the coefficients of the basis are obtained by solving a series of least-squares problems. The coefficients can be efficiently computed using a recursive form of the QR decomposition. By controlling the cardinality of the Fourier basis, we can obtain a good tradeoff between the run-time and the filtering accuracy. In particular, we are able to guarantee subpixel accuracy for the overall filtering, which is not provided by the most existing methods for fast bilateral filtering. We present simulation results to demonstrate the speed and accuracy of the proposed algorithm.
Resumo:
Transient test facilities offer the potential for the simultaneous study of turbine aerodynamic performance, unsteady flow phenomena and the heat transfer characteristics of a turbine stage. This paper describes the development of aerodynamic performance measurement techniques in the Oxford Rotor Facility (ORF). The solutions to the technological issues involved with transient testing presented in this paper are expected to achieve levels of precision uncertainty comparable with traditional steady flow test rigs. The theoretical background to the measurement of aerodynamic performance is presented together with a comprehensive pre-test uncertainty analysis. The instrumentation scheme for the measurement of stage mass flow rate is discussed in detail, the measurements of shaft power, total inlet enthalpy, and stage pressure ratio are also outlined. The current working section features a 62% scale, 1-1/2 stage, high-pressure shroudless transonic turbine. The required inlet flow conditions are provided by an Isentropic Light Piston Tunnel (ILPT) with a quasi-steady state run time of approximately 70ms. The testing is conducted at engine representative specific speed, pressure ratio, gas-to-wall temperature ratio, Mach number and Reynolds number.
Resumo:
A new three-dimensional Navier-Stokes solver for flows in turbomachines has been developed. The new solver is based on the latest version of the Denton codes, but has been implemented to run on Graphics Processing Units (GPUs) instead of the traditional Central Processing Unit (CPU). The change in processor enables an order-of-magnitude reduction in run-time due to the higher performance of the GPU. Scaling results for a 16 node GPU cluster are also presented, showing almost linear scaling for typical turbomachinery cases. For validation purposes, a test case consisting of a three-stage turbine with complete hub and casing leakage paths is described. Good agreement is obtained with previously published experimental results. The simulation runs in less than 10 minutes on a cluster with four GPUs. Copyright © 2009 by ASME.
Resumo:
A type checking method for the functional language LFC is presented. A distinct feature of LFC is that it uses Context-Free (CF) languages as data types to represent compound data structures. This makes LFC a dynamically typed language. To improve efficiency, a practical type checking method is presented, which consists of both static and dynamic type checking. Although the inclusion relation of CF.languages is not decidable,a special subset of the relation is decidable, i.e., the sentential form relation, which can be statically checked.Moreover, most of the expressions in actual LFC programs appear to satisfy this relation according to the statistic data of experiments. So, despite that the static type checking is not complete, it undertakes most of the type checking task. Consequently the run-time efficiency is effectively improved. Another feature of the type checking is that it converts the expressions with implicit structures to structured representation. Structure reconstruction technique is presented.