942 resultados para 291605 Processor Architectures
Resumo:
This paper reports new results concerning the capabilities of a family of service disciplines aimed at providing per-connection end-to-end delay (and throughput) guarantees in high-speed networks. This family consists of the class of rate-controlled service disciplines, in which traffic from a connection is reshaped to conform to specific traffic characteristics, at every hop on its path. When used together with a scheduling policy at each node, this reshaping enables the network to provide end-to-end delay guarantees to individual connections. The main advantages of this family of service disciplines are their implementation simplicity and flexibility. On the other hand, because the delay guarantees provided are based on summing worst case delays at each node, it has also been argued that the resulting bounds are very conservative which may more than offset the benefits. In particular, other service disciplines such as those based on Fair Queueing or Generalized Processor Sharing (GPS), have been shown to provide much tighter delay bounds. As a result, these disciplines, although more complex from an implementation point-of-view, have been considered for the purpose of providing end-to-end guarantees in high-speed networks. In this paper, we show that through ''proper'' selection of the reshaping to which we subject the traffic of a connection, the penalty incurred by computing end-to-end delay bounds based on worst cases at each node can be alleviated. Specifically, we show how rate-controlled service disciplines can be designed to outperform the Rate Proportional Processor Sharing (RPPS) service discipline. Based on these findings, we believe that rate-controlled service disciplines provide a very powerful and practical solution to the problem of providing end-to-end guarantees in high-speed networks.
Resumo:
The paper examines the suitability of the generalized data rule in training artificial neural networks (ANN) for damage identification in structures. Several multilayer perceptron architectures are investigated for a typical bridge truss structure with simulated damage stares generated randomly. The training samples have been generated in terms of measurable structural parameters (displacements and strains) at suitable selected locations in the structure. Issues related to the performance of the network with reference to hidden layers and hidden. neurons are examined. Some heuristics are proposed for the design of neural networks for damage identification in structures. These are further supported by an investigation conducted on five other bridge truss configurations.
Resumo:
ASICs offer the best realization of DSP algorithms in terms of performance, but the cost is prohibitive, especially when the volumes involved are low. However, if the architecture synthesis trajectory for such algorithms is such that the target architecture can be identified as an interconnection of elementary parameterized computational structures, then it is possible to attain a close match, both in terms of performance and power with respect to an ASIC, for any algorithmic parameters of the given algorithm. Such an architecture is weakly programmable (configurable) and can be viewed as an application specific integrated processor (ASIP). In this work, we present a methodology to synthesize ASIPs for DSP algorithms. (C) 1999 Elsevier Science B.V. All rights reserved.
Resumo:
Simulation is an important means of evaluating new microarchitectures. With the invention of multi-core (CMP) platforms, simulators are becoming larger and more complex. However, with the availability of CMPs with larger caches and higher operating frequency, the wall clock time required for simulating an application has become comparatively shorter. Reducing this simulation time further is a great challenge, especially in the case of multi-threaded workload due to indeterminacy introduced due to simultaneously executing various threads. In this paper, we propose a technique for speeding multi-core simulation. The model of the processor core and cache are replaced with functional models, to achieve speedup. A timed Petri net model is used to estimate the execution time of the processor and the memory access latencies are estimated using hit/miss information obtained from the functional model of the cache. This model can be used to predict performance of data parallel applications or multiprogramming workload on CMP platform with various cache hierarchies and shared bus interconnect. The error in estimation of the execution time of an application is within 6%. The speedup achieved ranges between an average of 2x--4x over the cycle accurate simulator.
Resumo:
In this paper, we consider the application of belief propagation (BP) to achieve near-optimal signal detection in large multiple-input multiple-output (MIMO) systems at low complexities. Large-MIMO architectures based on spatial multiplexing (V-BLAST) as well as non-orthogonal space-time block codes(STBC) from cyclic division algebra (CDA) are considered. We adopt graphical models based on Markov random fields (MRF) and factor graphs (FG). In the MRF based approach, we use pairwise compatibility functions although the graphical models of MIMO systems are fully/densely connected. In the FG approach, we employ a Gaussian approximation (GA) of the multi-antenna interference, which significantly reduces the complexity while achieving very good performance for large dimensions. We show that i) both MRF and FG based BP approaches exhibit large-system behavior, where increasingly closer to optimal performance is achieved with increasing number of dimensions, and ii) damping of messages/beliefs significantly improves the bit error performance.
Resumo:
In this paper, a method of tracking the peak power in a wind energy conversion system (WECS) is proposed, which is independent of the turbine parameters and air density. The algorithm searches for the peak power by varying the speed in the desired direction. The generator is operated in the speed control mode with the speed reference being dynamically modified in accordance with the magnitude and direction of change of active power. The peak power points in the P-omega curve correspond to dP/domega = 0. This fact is made use of in the optimum point search algorithm. The generator considered is a wound rotor induction machine whose stator is connected directly to the grid and the rotor is fed through back-to-back pulse-width-modulation (PWM) converters. Stator flux-oriented vector control is applied to control the active and reactive current loops independently. The turbine characteristics are generated by a dc motor fed from a commercial dc drive. All of the control loops are executed by a single-chip digital signal processor (DSP) controller TMS320F240. Experimental results show that the performance of the control algorithm compares well with the conventional torque control method.
Resumo:
In this work, we evaluate performance of a real-world image processing application that uses a cross-correlation algorithm to compare a given image with a reference one. The algorithm processes individual images represented as 2-dimensional matrices of single-precision floating-point values using O(n4) operations involving dot-products and additions. We implement this algorithm on a nVidia GTX 285 GPU using CUDA, and also parallelize it for the Intel Xeon (Nehalem) and IBM Power7 processors, using both manual and automatic techniques. Pthreads and OpenMP with SSE and VSX vector intrinsics are used for the manually parallelized version, while a state-of-the-art optimization framework based on the polyhedral model is used for automatic compiler parallelization and optimization. The performance of this algorithm on the nVidia GPU suffers from: (1) a smaller shared memory, (2) unaligned device memory access patterns, (3) expensive atomic operations, and (4) weaker single-thread performance. On commodity multi-core processors, the application dataset is small enough to fit in caches, and when parallelized using a combination of task and short-vector data parallelism (via SSE/VSX) or through fully automatic optimization from the compiler, the application matches or beats the performance of the GPU version. The primary reasons for better multi-core performance include larger and faster caches, higher clock frequency, higher on-chip memory bandwidth, and better compiler optimization and support for parallelization. The best performing versions on the Power7, Nehalem, and GTX 285 run in 1.02s, 1.82s, and 1.75s, respectively. These results conclusively demonstrate that, under certain conditions, it is possible for a FLOP-intensive structured application running on a multi-core processor to match or even beat the performance of an equivalent GPU version.
Resumo:
We describe a System-C based framework we are developing, to explore the impact of various architectural and microarchitectural level parameters of the on-chip interconnection network elements on its power and performance. The framework enables one to choose from a variety of architectural options like topology, routing policy, etc., as well as allows experimentation with various microarchitectural options for the individual links like length, wire width, pitch, pipelining, supply voltage and frequency. The framework also supports a flexible traffic generation and communication model. We provide preliminary results of using this framework to study the power, latency and throughput of a 4x4 multi-core processing array using mesh, torus and folded torus, for two different communication patterns of dense and sparse linear algebra. The traffic consists of both Request-Response messages (mimicing cache accesses)and One-Way messages. We find that the average latency can be reduced by increasing the pipeline depth, as it enables higher link frequencies. We also find that there exists an optimum degree of pipelining which minimizes energy-delay product.
Resumo:
Segmental dynamic time warping (DTW) has been demonstrated to be a useful technique for finding acoustic similarity scores between segments of two speech utterances. Due to its high computational requirements, it had to be computed in an offline manner, limiting the applications of the technique. In this paper, we present results of parallelization of this task by distributing the workload in either a static or dynamic way on an 8-processor cluster and discuss the trade-offs among different distribution schemes. We show that online unsupervised pattern discovery using segmental DTW is plausible with as low as 8 processors. This brings the task within reach of today's general purpose multi-core servers. We also show results on a 32-processor system, and discuss factors affecting scalability of our methods.
Resumo:
Regular Expressions are generic representations for a string or a collection of strings. This paper focuses on implementation of a regular expression matching architecture on reconfigurable fabric like FPGA. We present a Nondeterministic Finite Automata based implementation with extended regular expression syntax set compared to previous approaches. We also describe a dynamically reconfigurable generic block that implements the supported regular expression syntax. This enables formation of the regular expression hardware by a simple cascade of generic blocks as well as a possibility for reconfiguring the generic blocks to change the regular expression being matched. Further,we have developed an HDL code generator to obtain the VHDL description of the hardware for any regular expression set. Our optimized regular expression engine achieves a throughput of 2.45 Gbps. Our dynamically reconfigurable regular expression engine achieves a throughput of 0.8 Gbps using 12 FPGA slices per generic block on Xilinx Virtex2Pro FPGA.
Resumo:
Today's feature-rich multimedia products require embedded system solution with complex System-on-Chip (SoC) to meet market expectations of high performance at a low cost and lower energy consumption. The memory architecture of the embedded system strongly influences critical system design objectives like area, power and performance. Hence the embedded system designer performs a complete memory architecture exploration to custom design a memory architecture for a given set of applications. Further, the designer would be interested in multiple optimal design points to address various market segments. However, tight time-to-market constraints enforces short design cycle time. In this paper we address the multi-level multi-objective memory architecture exploration problem through a combination of exhaustive-search based memory exploration at the outer level and a two step based integrated data layout for SPRAM-Cache based architectures at the inner level. We present a two step integrated approach for data layout for SPRAM-Cache based hybrid architectures with the first step as data-partitioning that partitions data between SPRAM and Cache, and the second step is the cache conscious data layout. We formulate the cache-conscious data layout as a graph partitioning problem and show that our approach gives up to 34% improvement over an existing approach and also optimizes the off-chip memory address space. We experimented our approach with 3 embedded multimedia applications and our approach explores several hundred memory configurations for each application, yielding several optimal design points in a few hours of computation on a standard desktop.
Resumo:
Computational grids are increasingly being used for executing large multi-component scientific applications. The most widely reported advantages of application execution on grids are the performance benefits, in terms of speeds, problem sizes or quality of solutions, due to increased number of processors. We explore the possibility of improved performance on grids without increasing the application’s processor space. For this, we consider grids with multiple batch systems. We explore the challenges involved in and the advantages of executing long-running multi-component applications on multiple batch sites with a popular multi-component climate simulation application, CCSM, as the motivation.We have performed extensive simulation studies to estimate the single and multi-site execution rates of the applications for different system characteristics.Our experiments show that in many cases, multiple batch executions can have better execution rates than a single site execution.
Resumo:
Over past few years, the studies of cultured neuronal networks have opened up avenues for understanding the ion channels, receptor molecules, and synaptic plasticity that may form the basis of learning and memory. The hippocampal neurons from rats are dissociated and cultured on a surface containing a grid of 64 electrodes. The signals from these 64 electrodes are acquired using a fast data acquisition system MED64 (Alpha MED Sciences, Japan) at a sampling rate of 20 K samples with a precision of 16-bits per sample. A few minutes of acquired data runs in to a few hundreds of Mega Bytes. The data processing for the neural analysis is highly compute-intensive because the volume of data is huge. The major processing requirements are noise removal, pattern recovery, pattern matching, clustering and so on. In order to interface a neuronal colony to a physical world, these computations need to be performed in real-time. A single processor such as a desk top computer may not be adequate to meet this computational requirements. Parallel computing is a method used to satisfy the real-time computational requirements of a neuronal system that interacts with an external world while increasing the flexibility and scalability of the application. In this work, we developed a parallel neuronal system using a multi-node Digital Signal processing system. With 8 processors, the system is able to compute and map incoming signals segmented over a period of 200 ms in to an action in a trained cluster system in real time.
Resumo:
Scalable Networks on Chips (NoCs) are needed to match the ever-increasing communication demands of large-scale Multi-Processor Systems-on-chip (MPSoCs) for multi media communication applications. The heterogeneous nature of application specific on-chip cores along with the specific communication requirements among the cores calls for the design of application-specific NoCs for improved performance in terms of communication energy, latency, and throughput. In this work, we propose a methodology for the design of customized irregular networks-on-chip. The proposed method exploits a priori knowledge of the applications communication characteristic to generate an optimized network topology and corresponding routing tables.
Resumo:
Earlier studies have exploited statistical multiplexing of flows in the core of the Internet to reduce the buffer requirement in routers. Reducing the memory requirement of routers is important as it enables an improvement in performance and at the same time a decrease in the cost. In this paper, we observe that the links in the core of the Internet are typically over-provisioned and this can be exploited to reduce the buffering requirement in routers. The small on-chip memory of a network processor (NP) can be effectively used to buffer packets during most regimes of traffic. We propose a dynamic buffering strategy which buffers packets in the receive and transmit buffers of a NP when the memory requirement is low. When the buffer requirement increases due to bursts in the traffic, memory is allocated to packets in the off-chip DRAM. This scheme effectively mitigates the DRAM access bottleneck, as only a part of the traffic is stored in the DRAM. We build a Petri net model and evaluate the proposed scheme with core Internet like traffic. At 77% link utilization, the dynamic buffering scheme has a drop rate of just 0.65%, whereas the traditional DRAM buffering has 4.64% packet drop rate. Even with a high link utilization of 90%, which rarely happens in the core, our dynamic buffering results in a packet drop rate of only 2.17%, while supporting a throughput of 7.39 Gbps. We study the proposed scheme under different conditions to understand the provisioning of processing threads and to determine the queue length at which packets must be buffered in the DRAM. We show that the proposed dynamic buffering strategy drastically reduces the buffering requirement while still maintaining low packet drop rates.