218 results for 291605 Processor Architectures


Relevance:

10.00%

Publisher:

Abstract:

In this paper, a method of tracking the peak power in a wind energy conversion system (WECS) is proposed that is independent of the turbine parameters and air density. The algorithm searches for the peak power by varying the speed in the desired direction. The generator is operated in speed control mode, with the speed reference dynamically modified according to the magnitude and direction of the change in active power. The peak power points on the P-omega curve correspond to dP/domega = 0, and the optimum-point search algorithm exploits this fact. The generator considered is a wound-rotor induction machine whose stator is connected directly to the grid and whose rotor is fed through back-to-back pulse-width-modulation (PWM) converters. Stator-flux-oriented vector control is applied to control the active and reactive current loops independently. The turbine characteristics are generated by a dc motor fed from a commercial dc drive. All of the control loops are executed by a single-chip digital signal processor (DSP) controller, the TMS320F240. Experimental results show that the performance of the control algorithm compares well with the conventional torque control method.
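As a rough illustration of the search described above, here is a minimal hill-climbing sketch in Python, assuming a hypothetical measure_power() interface and an illustrative fixed step size; the actual controller runs the speed and current loops on the TMS320F240 DSP, which is not modeled here.

# Hill-climbing peak-power search: perturb the speed reference and keep
# moving in the direction that increases measured active power, reversing
# when power falls (i.e., when the dP/domega = 0 peak has been crossed).
# measure_power() is a hypothetical stand-in for the DSP's power estimate.
def track_peak_power(measure_power, w_ref=100.0, step=1.0, iters=50):
    p_prev = measure_power(w_ref)
    direction = 1.0
    for _ in range(iters):
        w_ref += direction * step
        p = measure_power(w_ref)
        if p < p_prev:           # power dropped: we overshot the peak
            direction = -direction
        p_prev = p
    return w_ref                 # oscillates tightly around the optimum

# Toy P-omega curve with its peak at omega = 120 (illustrative only).
print(track_peak_power(lambda w: -(w - 120.0) ** 2))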

Relevance:

10.00%

Publisher:

Abstract:

In this work, we evaluate the performance of a real-world image processing application that uses a cross-correlation algorithm to compare a given image with a reference one. The algorithm processes individual images represented as 2-dimensional matrices of single-precision floating-point values using O(n^4) operations involving dot-products and additions. We implement this algorithm on an nVidia GTX 285 GPU using CUDA, and also parallelize it for the Intel Xeon (Nehalem) and IBM Power7 processors, using both manual and automatic techniques. Pthreads and OpenMP with SSE and VSX vector intrinsics are used for the manually parallelized version, while a state-of-the-art optimization framework based on the polyhedral model is used for automatic compiler parallelization and optimization. The performance of this algorithm on the nVidia GPU suffers from: (1) a smaller shared memory, (2) unaligned device memory access patterns, (3) expensive atomic operations, and (4) weaker single-thread performance. On commodity multi-core processors, the application dataset is small enough to fit in caches, and when parallelized using a combination of task and short-vector data parallelism (via SSE/VSX) or through fully automatic optimization from the compiler, the application matches or beats the performance of the GPU version. The primary reasons for better multi-core performance include larger and faster caches, higher clock frequency, higher on-chip memory bandwidth, and better compiler optimization and support for parallelization. The best-performing versions on the Power7, Nehalem, and GTX 285 run in 1.02 s, 1.82 s, and 1.75 s, respectively. These results conclusively demonstrate that, under certain conditions, it is possible for a FLOP-intensive structured application running on a multi-core processor to match or even beat the performance of an equivalent GPU version.
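The O(n^4) kernel has roughly the following shape, sketched here in NumPy assuming the score is a plain dot-product of the reference against every aligned window; this is an illustration only, not the paper's SSE/VSX or CUDA implementation.

import numpy as np

def cross_correlate(image, reference):
    # Dot-product of the reference against every aligned window of the
    # image: O(n^4) scalar operations for n x n inputs.
    n, m = image.shape
    r, c = reference.shape
    out = np.empty((n - r + 1, m - c + 1), dtype=np.float32)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = image[i:i + r, j:j + c]
            out[i, j] = np.dot(window.ravel(), reference.ravel())
    return out

img = np.random.rand(64, 64).astype(np.float32)
ref = np.random.rand(16, 16).astype(np.float32)
print(cross_correlate(img, ref).shape)    # (49, 49)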

Relevance:

10.00%

Publisher:

Abstract:

We describe a SystemC-based framework we are developing to explore the impact of various architectural and microarchitectural parameters of the on-chip interconnection network elements on its power and performance. The framework enables one to choose from a variety of architectural options like topology, routing policy, etc., and also allows experimentation with various microarchitectural options for the individual links, like length, wire width, pitch, pipelining, supply voltage, and frequency. The framework also supports a flexible traffic generation and communication model. We provide preliminary results of using this framework to study the power, latency, and throughput of a 4x4 multi-core processing array using mesh, torus, and folded torus topologies, for two different communication patterns of dense and sparse linear algebra. The traffic consists of both Request-Response messages (mimicking cache accesses) and One-Way messages. We find that the average latency can be reduced by increasing the pipeline depth, as it enables higher link frequencies. We also find that there exists an optimum degree of pipelining which minimizes the energy-delay product.
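A toy Python model of that pipelining trade-off, with all coefficients invented for illustration: deeper link pipelines shorten the cycle time (the link stage sets the network clock) but add stages and latch energy, so the energy-delay product is minimized at an interior depth.

def packet_latency(depth, hops=4, router_cycles=3,
                   wire_delay=10.0, latch_delay=0.5):
    cycle = wire_delay / depth + latch_delay     # ns per pipeline stage
    cycles = hops * (router_cycles + depth)      # router + link stages
    return cycles * cycle

def link_energy(depth, wire_energy=1.0, latch_energy=0.3):
    return wire_energy + latch_energy * depth    # latches cost energy

def energy_delay_product(depth):
    return link_energy(depth) * packet_latency(depth)

best = min(range(1, 16), key=energy_delay_product)
print("optimum pipeline depth:", best)           # interior minimum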

Relevance:

10.00%

Publisher:

Abstract:

Segmental dynamic time warping (DTW) has been demonstrated to be a useful technique for finding acoustic similarity scores between segments of two speech utterances. Due to its high computational requirements, it has had to be computed offline, limiting the applications of the technique. In this paper, we present results of parallelizing this task by distributing the workload either statically or dynamically on an 8-processor cluster, and we discuss the trade-offs among the different distribution schemes. We show that online unsupervised pattern discovery using segmental DTW is feasible with as few as 8 processors, bringing the task within reach of today's general-purpose multi-core servers. We also show results on a 32-processor system and discuss factors affecting the scalability of our methods.
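The dynamic distribution scheme can be sketched with a worker pool that hands out segment pairs on demand, which balances load when pair costs vary; segment_pair_score() below is a cheap hypothetical stand-in for the actual segmental-DTW similarity computation.

import itertools
from multiprocessing import Pool

def segment_pair_score(pair):
    # Stand-in for segmental DTW on two utterance segments; the real
    # alignment cost is what makes parallelization worthwhile.
    (i, a), (j, b) = pair
    return (i, j, abs(sum(a) - sum(b)))

def discover_patterns(utterances, workers=8):
    # Dynamic distribution: chunksize=1 makes the pool hand out one pair
    # at a time, so idle workers always pick up the next piece of work.
    pairs = itertools.combinations(enumerate(utterances), 2)
    with Pool(workers) as pool:
        return pool.map(segment_pair_score, pairs, chunksize=1)

if __name__ == "__main__":
    data = [[0.1 * k for k in range(50)] for _ in range(10)]
    print(len(discover_patterns(data)))          # 45 segment pairs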

Relevance:

10.00%

Publisher:

Abstract:

Regular expressions are generic representations for a string or a collection of strings. This paper focuses on the implementation of a regular expression matching architecture on a reconfigurable fabric such as an FPGA. We present a Nondeterministic Finite Automaton (NFA)-based implementation with an extended regular expression syntax set compared to previous approaches. We also describe a dynamically reconfigurable generic block that implements the supported regular expression syntax. This enables formation of the regular expression hardware by a simple cascade of generic blocks, as well as the possibility of reconfiguring the generic blocks to change the regular expression being matched. Further, we have developed an HDL code generator to obtain the VHDL description of the hardware for any regular expression set. Our optimized regular expression engine achieves a throughput of 2.45 Gbps. Our dynamically reconfigurable regular expression engine achieves a throughput of 0.8 Gbps using 12 FPGA slices per generic block on a Xilinx Virtex2Pro FPGA.
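In software terms, the cascade of generic blocks behaves like an NFA simulated one input character at a time. The Python toy below supports only literals, '.', and a postfix '*' (a far smaller syntax than the engine's), with one (char, starred) block per regex element.

def nfa_match(pattern, text):
    # Parse the pattern into (char, starred) blocks -- the software
    # analogue of cascading one generic block per regex element.
    blocks, i = [], 0
    while i < len(pattern):
        starred = i + 1 < len(pattern) and pattern[i + 1] == '*'
        blocks.append((pattern[i], starred))
        i += 2 if starred else 1

    def closure(states):
        # A starred block may match zero times, so it can be skipped.
        states, changed = set(states), True
        while changed:
            changed = False
            for s in list(states):
                if s < len(blocks) and blocks[s][1] and s + 1 not in states:
                    states.add(s + 1)
                    changed = True
        return states

    states = closure({0})
    for ch in text:               # advance all active states in lockstep
        nxt = set()
        for s in states:
            if s < len(blocks) and blocks[s][0] in (ch, '.'):
                nxt.add(s + 1)
                if blocks[s][1]:  # a starred block may match again
                    nxt.add(s)
        states = closure(nxt)
    return len(blocks) in states  # accept if the final state is active

print(nfa_match("ab*c", "abbbc"), nfa_match("a.c", "axc"))   # True True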

Relevance:

10.00%

Publisher:

Abstract:

Today's feature-rich multimedia products require embedded system solutions with complex Systems-on-Chip (SoCs) to meet market expectations of high performance at low cost and low energy consumption. The memory architecture of an embedded system strongly influences critical system design objectives such as area, power, and performance. Hence the embedded system designer performs a complete memory architecture exploration to custom-design a memory architecture for a given set of applications. Further, the designer is interested in multiple optimal design points to address various market segments. However, tight time-to-market constraints enforce a short design cycle. In this paper we address the multi-level, multi-objective memory architecture exploration problem through a combination of exhaustive-search-based memory exploration at the outer level and a two-step integrated data layout for SPRAM-Cache-based hybrid architectures at the inner level. In the two-step integrated approach, the first step partitions data between SPRAM and cache, and the second performs cache-conscious data layout. We formulate the cache-conscious data layout as a graph partitioning problem and show that our approach gives up to 34% improvement over an existing approach while also optimizing the off-chip memory address space. We evaluated our approach on 3 embedded multimedia applications; it explores several hundred memory configurations for each application, yielding several optimal design points in a few hours of computation on a standard desktop.
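A greedy Python sketch of the first step (data partitioning), assuming hypothetical (name, size, access_count) profile tuples: the hottest objects per byte go to SPRAM until it fills, and the remainder fall to the cache, where the second step (cast as graph partitioning in the paper, and not reproduced here) would lay them out to avoid conflicts.

def partition_data(objects, spram_bytes):
    # objects: (name, size_bytes, access_count) -- hypothetical profile.
    # Greedy rule: highest accesses-per-byte first, until SPRAM is full.
    by_hotness = sorted(objects, key=lambda o: o[2] / o[1], reverse=True)
    spram, cached, used = [], [], 0
    for name, size, _count in by_hotness:
        if used + size <= spram_bytes:
            spram.append(name)
            used += size
        else:
            cached.append(name)   # handled by cache-conscious layout next
    return spram, cached

objs = [("coeffs", 2048, 90000), ("frame", 8192, 40000),
        ("lut", 1024, 70000), ("scratch", 4096, 5000)]
print(partition_data(objs, spram_bytes=4096))
# (['lut', 'coeffs'], ['frame', 'scratch'])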

Relevance:

10.00%

Publisher:

Abstract:

Computational grids are increasingly being used for executing large multi-component scientific applications. The most widely reported advantages of application execution on grids are performance benefits, in terms of speed, problem size, or quality of solutions, due to the increased number of processors. We explore the possibility of improved performance on grids without increasing the application's processor space. For this, we consider grids with multiple batch systems. We explore the challenges involved in, and the advantages of, executing long-running multi-component applications on multiple batch sites, with a popular multi-component climate simulation application, CCSM, as the motivation. We have performed extensive simulation studies to estimate the single-site and multi-site execution rates of the applications for different system characteristics. Our experiments show that in many cases, multi-site batch executions can have better execution rates than a single-site execution.
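The flavor of such a simulation fits in a few lines of Python: a job alternately waits in a batch queue and runs for a fixed slot, and with several sites it effectively takes the shortest queue wait. Every parameter below (mean waits, slot length) is invented for illustration, not taken from the CCSM study.

import random

def execution_rate(mean_waits_h, slot_h=6.0, slots=100, seed=1):
    # Execution rate = useful compute time / total wall-clock time.
    rng = random.Random(seed)
    wall = 0.0
    for _ in range(slots):
        # With multiple sites the job resubmits to whichever queue
        # clears first; a single site is the one-element case.
        wait = min(rng.expovariate(1.0 / w) for w in mean_waits_h)
        wall += wait + slot_h
    return slots * slot_h / wall

print("single site:", round(execution_rate([12.0]), 3))
print("three sites:", round(execution_rate([12.0, 12.0, 12.0]), 3))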

Relevance:

10.00%

Publisher:

Abstract:

Over the past few years, studies of cultured neuronal networks have opened up avenues for understanding the ion channels, receptor molecules, and synaptic plasticity that may form the basis of learning and memory. Hippocampal neurons from rats are dissociated and cultured on a surface containing a grid of 64 electrodes. The signals from these 64 electrodes are acquired using a fast data acquisition system, MED64 (Alpha MED Sciences, Japan), at a sampling rate of 20 K samples per second with a precision of 16 bits per sample. A few minutes of acquired data runs into a few hundred megabytes. The data processing for the neural analysis is highly compute-intensive because the volume of data is huge. The major processing requirements are noise removal, pattern recovery, pattern matching, clustering, and so on. In order to interface a neuronal colony to the physical world, these computations need to be performed in real time. A single processor such as a desktop computer may not be adequate to meet these computational requirements. Parallel computing is a method used to satisfy the real-time computational requirements of a neuronal system that interacts with an external world while increasing the flexibility and scalability of the application. In this work, we developed a parallel neuronal system using a multi-node digital signal processing system. With 8 processors, the system is able to compute and map incoming signals, segmented over a period of 200 ms, into an action in a trained cluster system in real time.
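A minimal sketch of the per-channel parallel split, assuming the 64 electrode channels of one 200 ms segment are farmed out to 8 workers; the toy 4-sigma threshold detector stands in for the much richer real processing (noise removal, pattern matching, clustering).

import numpy as np
from concurrent.futures import ProcessPoolExecutor

FS = 20000                      # 20 K samples/s, as in the MED64 setup
SEGMENT = FS // 5               # samples in one 200 ms segment

def detect_spikes(channel):
    # Toy detector: sample indices where the signal crosses 4 sigma.
    sig = np.asarray(channel)
    thresh = 4.0 * sig.std()
    return np.nonzero(np.abs(sig) > thresh)[0].tolist()

def process_segment(segment_64ch, workers=8):
    # One channel per task, spread across the worker processes.
    with ProcessPoolExecutor(workers) as ex:
        return list(ex.map(detect_spikes, segment_64ch))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(64, SEGMENT))    # 64 electrodes, 200 ms
    spikes = process_segment(data)
    print(sum(len(s) for s in spikes), "threshold crossings")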

Relevance:

10.00%

Publisher:

Abstract:

Scalable Networks on Chips (NoCs) are needed to match the ever-increasing communication demands of large-scale Multi-Processor Systems-on-Chip (MPSoCs) for multimedia communication applications. The heterogeneous nature of application-specific on-chip cores, along with the specific communication requirements among the cores, calls for the design of application-specific NoCs for improved performance in terms of communication energy, latency, and throughput. In this work, we propose a methodology for the design of customized irregular networks-on-chip. The proposed method exploits a priori knowledge of the application's communication characteristics to generate an optimized network topology and the corresponding routing tables.
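A small Python sketch of the idea: given an a priori core-to-core traffic profile, grant the heaviest flows direct links under a per-router port budget, then derive routing tables by shortest path. Both the greedy rule and all traffic numbers are illustrative assumptions, not the paper's actual optimization.

from collections import defaultdict

def synthesize_topology(traffic, max_ports=3):
    # Heaviest flows get direct links, subject to a port budget per core.
    links, degree = set(), defaultdict(int)
    for (a, b), _vol in sorted(traffic.items(), key=lambda kv: -kv[1]):
        if degree[a] < max_ports and degree[b] < max_ports:
            links.add((a, b))
            degree[a] += 1
            degree[b] += 1
    return links

def routing_table(links, src):
    # BFS shortest paths over the custom topology; the table maps each
    # destination to the first link to take out of 'src'.
    adj = defaultdict(list)
    for a, b in links:
        adj[a].append(b)
        adj[b].append(a)
    hop, frontier = {src: src}, [src]
    while frontier:
        nxt = []
        for u in frontier:
            for v in adj[u]:
                if v not in hop:
                    hop[v] = v if u == src else hop[u]
                    nxt.append(v)
        frontier = nxt
    del hop[src]
    return hop

traffic = {(0, 1): 900, (1, 2): 700, (2, 3): 650, (0, 2): 300, (1, 3): 100}
links = synthesize_topology(traffic)
print(sorted(links), routing_table(links, 0))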

Relevance:

10.00%

Publisher:

Abstract:

Earlier studies have exploited statistical multiplexing of flows in the core of the Internet to reduce the buffer requirement in routers. Reducing the memory requirement of routers is important as it enables an improvement in performance and, at the same time, a decrease in cost. In this paper, we observe that the links in the core of the Internet are typically over-provisioned, and this can be exploited to reduce the buffering requirement in routers. The small on-chip memory of a network processor (NP) can be effectively used to buffer packets during most regimes of traffic. We propose a dynamic buffering strategy which buffers packets in the receive and transmit buffers of an NP when the memory requirement is low. When the buffer requirement increases due to bursts in the traffic, memory is allocated to packets in the off-chip DRAM. This scheme effectively mitigates the DRAM access bottleneck, as only a part of the traffic is stored in the DRAM. We build a Petri net model and evaluate the proposed scheme with core-Internet-like traffic. At 77% link utilization, the dynamic buffering scheme has a drop rate of just 0.65%, whereas traditional DRAM buffering has a 4.64% packet drop rate. Even at a high link utilization of 90%, which rarely happens in the core, our dynamic buffering results in a packet drop rate of only 2.17%, while supporting a throughput of 7.39 Gbps. We study the proposed scheme under different conditions to understand the provisioning of processing threads and to determine the queue length at which packets must be buffered in the DRAM. We show that the proposed dynamic buffering strategy drastically reduces the buffering requirement while still maintaining low packet drop rates.
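The policy itself reduces to a threshold test, sketched below in Python with illustrative capacities; the real scheme lives in an NP's receive/transmit buffers and was evaluated with a Petri net model, none of which is reproduced here.

from collections import deque

class DynamicBuffer:
    # Packets stay in the NP's small on-chip buffer while the queue is
    # short; a burst past the threshold spills to (slow) off-chip DRAM.
    def __init__(self, onchip_slots=64, dram_threshold=48):
        self.onchip, self.dram = deque(), deque()
        self.onchip_slots = onchip_slots
        self.dram_threshold = dram_threshold
        self.dram_writes = 0

    def enqueue(self, pkt):
        # Once anything sits in DRAM, later packets must follow it there
        # to preserve FIFO order.
        if not self.dram and len(self.onchip) < self.dram_threshold:
            self.onchip.append(pkt)      # common case: no DRAM access
        else:
            self.dram.append(pkt)
            self.dram_writes += 1

    def dequeue(self):
        pkt = self.onchip.popleft() if self.onchip else self.dram.popleft()
        if self.dram and len(self.onchip) < self.onchip_slots:
            self.onchip.append(self.dram.popleft())  # drain DRAM backlog
        return pkt

buf = DynamicBuffer()
for i in range(100):
    buf.enqueue(i)
print("packets spilled to DRAM:", buf.dram_writes)   # 52 of 100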

Relevance:

10.00%

Publisher:

Abstract:

Network processors today consist of multiple parallel processors (microengines) with support for multiple threads to exploit the packet-level parallelism inherent in network workloads. With such concurrency, packet ordering at the output of the network processor cannot be guaranteed. This paper studies the effect of concurrency in network processors on packet ordering. We use a validated Petri net model of a commercial network processor, the Intel IXP 2400, to determine the extent of packet reordering for the IPv4 forwarding application. Our study indicates that, in addition to the parallel processing in the network processor, the allocation scheme for the transmit buffer also adversely impacts packet ordering. In particular, our results reveal that this packet reordering results in a packet retransmission rate of up to 61%. We explore different transmit buffer allocation schemes, namely contiguous, strided, local, and global, which reduce the packet retransmission rate to 24%. We propose an alternative scheme, packet sort, which guarantees complete packet ordering while achieving a throughput of 2.5 Gbps. Further, packet sort outperforms the built-in packet ordering schemes in the IXP processor by up to 35%.
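The ordering property that packet sort guarantees can be pictured with a generic reorder buffer, sketched below in Python: packets completed out of order by parallel engines are released only in sequence-number order. This is an illustration of the invariant, not Intel IXP microengine code or the paper's exact scheme.

class ReorderBuffer:
    def __init__(self):
        self.next_seq = 0
        self.pending = {}

    def submit(self, seq, pkt):
        # Accept a completed packet; return all packets now transmittable
        # (i.e., the contiguous run starting at next_seq).
        self.pending[seq] = pkt
        out = []
        while self.next_seq in self.pending:
            out.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return out

rob = ReorderBuffer()
for seq in [2, 0, 3, 1, 4]:          # completion order from the engines
    print(seq, "->", rob.submit(seq, f"pkt{seq}"))
# pkt0..pkt4 are emitted strictly in order despite out-of-order completion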

Relevance:

10.00%

Publisher:

Abstract:

We propose a new abstract domain for static analysis of executable code. Concrete states are abstracted using circular linear progressions (CLPs). CLPs model computations using a finite word length, as in any real-life processor. The finite abstraction allows handling overflow scenarios in a natural and straightforward manner. Abstract transfer functions have been defined for a wide range of operations, which makes this domain easily applicable for analyzing code for a wide range of ISAs. CLPs combine the scalability of interval domains with the discreteness of linear congruence domains. We also present a novel, lightweight method to track linear equality relations between static objects, which the analysis uses to improve precision. The analysis is efficient, with total space and time overhead quadratic in the number of static objects being tracked.
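A stripped-down Python rendering of the core idea (the paper's actual domain and transfer functions are richer): a CLP is a strided set over a finite word, transfer functions such as addition over-approximate soundly, and wraparound falls out of the modular arithmetic with no special casing.

from math import gcd

class CLP:
    # Simplified circular linear progression: the set
    # {(base + i*step) mod 2**width | 0 <= i <= count}.
    def __init__(self, base, step, count, width=32):
        self.m = 1 << width
        self.base, self.step, self.count = base % self.m, step % self.m, count

    def concretize(self):
        return {(self.base + i * self.step) % self.m
                for i in range(self.count + 1)}

    def add(self, other):
        # Sound abstract addition: every concrete sum lies on a
        # progression of stride gcd(step1, step2) from base1 + base2.
        g = gcd(self.step, other.step)
        count = ((self.count * self.step + other.count * other.step) // g
                 if g else 0)
        return CLP(self.base + other.base, g, count)

a = CLP(0xFFFFFFF0, 4, 8)      # straddles the 32-bit wraparound boundary
b = CLP(8, 2, 3)
sums = {(x + y) % (1 << 32) for x in a.concretize() for y in b.concretize()}
assert sums <= a.add(b).concretize()   # the abstraction over-approximates
print(len(sums), "concrete sums,", a.add(b).count + 1, "abstract elements")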

Relevance:

10.00%

Publisher:

Abstract:

This paper describes techniques to estimate the worst-case execution time of executable code on architectures with data caches. The underlying mechanism is Abstract Interpretation, which is used for the dual purposes of tracking address computations and cache behavior. A simultaneous numeric and pointer analysis, using an abstraction for discrete sets of values, computes safe approximations of access addresses, which are then used to predict cache behavior using Must Analysis. A heuristic is also proposed which generates likely worst-case estimates. It can be used in soft real-time systems and also for reasoning about the tightness of the safe estimate. The analysis methods can handle programs with non-affine access patterns, for which conventional Presburger Arithmetic formulations or Cache Miss Equations do not apply. The precision of the estimates is user-controlled and can be traded off against analysis time. Executables are analyzed directly, which, apart from enhancing precision, renders the method language-independent.
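The Must Analysis component can be illustrated in a few lines of Python for a single LRU cache set (a simplification of the paper's setting): abstract states bound each block's LRU age from above, and the join at control-flow merges keeps only blocks present on all paths, at their oldest age, so any predicted hit is safe for the WCET bound.

ASSOC = 4

def access(state, block):
    # Abstract LRU update: 'block' becomes youngest; others age by one
    # and drop out once their age bound reaches the associativity.
    new = {b: age + 1 for b, age in state.items() if age + 1 < ASSOC}
    new[block] = 0
    return new

def join(s1, s2):
    # Keep blocks known on BOTH paths, with the pessimistic (older)
    # age -- the defining move of must analysis.
    return {b: max(s1[b], s2[b]) for b in s1.keys() & s2.keys()}

then_branch = access(access({}, "A"), "B")      # path accessing A, then B
else_branch = access({}, "A")                   # path accessing only A
print(join(then_branch, else_branch))           # {'A': 1}: A is a sure hit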

Relevance:

10.00%

Publisher:

Abstract:

Prior work on modeling interconnects has focused on optimizing wire and repeater design for trading off energy and delay, and is largely based on low-level circuit parameters. Hence these models are hard to use directly to make high-level microarchitectural trade-offs in the initial exploration phase of a design. In this paper, we propose INTACTE, a tool that architects can use to get reasonably accurate interconnect area, delay, and power estimates based on a few architecture-level parameters for the interconnect, such as length, width (in number of bits), frequency, and latency for a specified technology and voltage. The tool uses well-known models of interconnect delay and energy, taking into account the wire pitch, repeater size, and spacing for a range of voltages and technologies. It then solves an optimization problem: finding the lowest-energy interconnect design, in terms of the low-level circuit parameters, that meets the architectural constraints given as inputs. In addition, the tool provides the area, energy, and delay for a range of supply voltages and degrees of pipelining, which can be used for microarchitectural exploration of a chip. The delay and energy models used by the tool have been validated against low-level circuit simulations. We discuss several potential applications of the tool and present an example of optimizing interconnect design in the context of clustered VLIW architectures.
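The inner optimization has roughly this flavor, sketched in Python: sweep repeater size and spacing and keep the lowest-energy design that meets an architectural delay target. The expressions follow the usual repeated-wire RC form, but every coefficient here is invented for illustration, and the real tool also sweeps voltages, technologies, and pipelining.

def wire_delay_energy(length_mm, size, spacing_mm):
    n = max(1, round(length_mm / spacing_mm))          # repeater count
    seg_delay = (1.0 / size + 0.1 * size * spacing_mm
                 + 0.5 * spacing_mm ** 2)              # RC-style, per segment
    energy = n * 0.05 * size + 0.8 * length_mm         # repeater + wire
    return n * seg_delay, energy

def optimize(length_mm, delay_budget):
    # Exhaustive sweep: lowest energy subject to the delay constraint.
    best = None
    for size in range(1, 65):                          # repeater size (x min)
        for spacing in (0.25, 0.5, 1.0, 2.0):          # repeater spacing (mm)
            d, e = wire_delay_energy(length_mm, size, spacing)
            if d <= delay_budget and (best is None or e < best[0]):
                best = (e, size, spacing, d)
    return best   # (energy, size, spacing, delay)

print(optimize(10.0, delay_budget=12.5))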