Digital Library

109 results for intel processor

A Cache coherence protocol for MIN-based multiprocessors

Abstract:

In this paper we present a cache coherence protocol for multistage interconnection network (MIN)-based multiprocessors with two distinct private caches: private-blocks caches (PCache) containing blocks private to a process and shared-blocks caches (SCache) containing data accessible by all processes. The architecture is extended by a coherence control bus connecting all shared-block cache controllers. Timing problems due to variable transit delays through the MIN are dealt with by introducing transient states in the proposed cache coherence protocol. The impact of the coherence protocol on system performance is evaluated through a three-phase performance study. Assuming homogeneity of all nodes, a single-node queuing model (phase 3) is developed to analyze system performance. This model is solved for processor and coherence bus utilizations using the mean value analysis (MVA) technique, with shared-blocks steady-state probabilities (phase 1) and communication delays (phase 2) as input parameters. The performance of our system is compared to that of a system with an equivalent-sized unified cache and to that of a multiprocessor implementing a directory-based coherence protocol. System performance measures are verified through simulation.
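
As a rough illustration of the MVA technique named in this abstract, the sketch below runs the textbook exact MVA recursion for a closed, single-class queuing network. It is not the paper's single-node model: the function name `mva`, the two stations, and the service demands are hypothetical placeholders, and the paper's phase 1 and phase 2 inputs are not reproduced here.

```python
def mva(service_demands, n_customers, think_time=0.0):
    """Exact MVA for a closed, single-class network of queueing stations.

    service_demands: total service demand per station (hypothetical values).
    Returns (throughput, per-station utilizations, per-station mean queue lengths).
    """
    queue = [0.0] * len(service_demands)
    throughput = 0.0
    for n in range(1, n_customers + 1):
        # Arrival theorem: an arriving customer sees the queue lengths
        # of the same network with one customer fewer.
        residence = [d * (1.0 + q) for d, q in zip(service_demands, queue)]
        throughput = n / (think_time + sum(residence))
        # Little's law applied to each station.
        queue = [throughput * r for r in residence]
    utilization = [throughput * d for d in service_demands]
    return throughput, utilization, queue


# Hypothetical example: station 0 stands in for a processor,
# station 1 for a coherence bus, with equal service demands.
x, util, q = mva([1.0, 1.0], n_customers=2)
```

With equal demands the two stations end up equally utilized, and the queue lengths sum to the number of customers, which is a quick sanity check on the recursion.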

Deblurring in a noncoherent optical processing system: pupil function synthesis and experimental implementation

Abstract:

The bipolar point spread function (PSF) corresponding to the Wiener filter for correcting linear-motion-blurred pictures is implemented in a noncoherent optical processor. The following two approaches are taken for this implementation: (1) the PSF is modulated and biased so that the resulting function is non-negative and (2) the PSF is split into its positive and sign-reversed negative parts, and these two parts are dealt with separately. The phase problem associated with arriving at the pupil function from these modified PSFs is solved using both analytical and combined analytical-iterative techniques available in the literature. The designed pupil functions are experimentally implemented, and deblurring in a noncoherent processor is demonstrated. The postprocessing required (i.e., demodulation in the first approach to modulating the PSF and intensity subtraction in the second approach) is carried out either in a coherent processor or with the help of a PC-based vision system. The deblurred outputs are presented.

A Flat Concurrent Prolog compiler for PARAM

Abstract:

We describe a compiler for the Flat Concurrent Prolog language on a message passing multiprocessor architecture. This compiler permits symbolic and declarative programming in the syntax of Guarded Horn Rules. The implementation has been verified and tested on the 64-node PARAM parallel computer developed by C-DAC (Centre for the Development of Advanced Computing, India). Flat Concurrent Prolog (FCP) is a logic programming language designed for concurrent programming and parallel execution. It is a process-oriented language, which embodies dataflow synchronization and guarded-command as its basic control mechanisms. An identical algorithm is executed on every processor in the network. We assume regular network topologies like mesh, ring, etc. Each node has a local memory. The algorithm comprises two important parts: reduction and communication. The most difficult task is to integrate the solutions of problems that arise in the implementation in a coherent and efficient manner. We have tested the efficacy of the compiler on various benchmark problems of the ICOT project that have been reported in the recent book by Evan Tick. These problems include Quicksort, 8-queens, and Prime Number Generation. The results of the preliminary tests are favourable. We are currently examining issues like indexing and load balancing to further optimize our compiler.

The effect of traffic shaping in efficiently providing end-to-end performance guarantees

Abstract:

This paper reports new results concerning the capabilities of a family of service disciplines aimed at providing per-connection end-to-end delay (and throughput) guarantees in high-speed networks. This family consists of the class of rate-controlled service disciplines, in which traffic from a connection is reshaped to conform to specific traffic characteristics at every hop on its path. When used together with a scheduling policy at each node, this reshaping enables the network to provide end-to-end delay guarantees to individual connections. The main advantages of this family of service disciplines are their implementation simplicity and flexibility. On the other hand, because the delay guarantees provided are based on summing worst-case delays at each node, it has also been argued that the resulting bounds are very conservative, which may more than offset the benefits. In particular, other service disciplines, such as those based on Fair Queueing or Generalized Processor Sharing (GPS), have been shown to provide much tighter delay bounds. As a result, these disciplines, although more complex from an implementation point of view, have been considered for the purpose of providing end-to-end guarantees in high-speed networks. In this paper, we show that through "proper" selection of the reshaping to which we subject the traffic of a connection, the penalty incurred by computing end-to-end delay bounds based on worst cases at each node can be alleviated. Specifically, we show how rate-controlled service disciplines can be designed to outperform the Rate Proportional Processor Sharing (RPPS) service discipline. Based on these findings, we believe that rate-controlled service disciplines provide a very powerful and practical solution to the problem of providing end-to-end guarantees in high-speed networks.

Synthesis of ASIPs for DSP algorithms

Abstract:

ASICs offer the best realization of DSP algorithms in terms of performance, but the cost is prohibitive, especially when the volumes involved are low. However, if the architecture synthesis trajectory for such algorithms is such that the target architecture can be identified as an interconnection of elementary parameterized computational structures, then it is possible to attain a close match, both in terms of performance and power with respect to an ASIC, for any algorithmic parameters of the given algorithm. Such an architecture is weakly programmable (configurable) and can be viewed as an application specific integrated processor (ASIP). In this work, we present a methodology to synthesize ASIPs for DSP algorithms. (C) 1999 Elsevier Science B.V. All rights reserved.

Accelerating multi-core simulators

Abstract:

Simulation is an important means of evaluating new microarchitectures. With the invention of multi-core (CMP) platforms, simulators are becoming larger and more complex. However, with the availability of CMPs with larger caches and higher operating frequency, the wall clock time required for simulating an application has become comparatively shorter. Reducing this simulation time further is a great challenge, especially in the case of multi-threaded workloads, because of the indeterminacy introduced by threads executing simultaneously. In this paper, we propose a technique for speeding up multi-core simulation. The models of the processor core and cache are replaced with functional models to achieve speedup. A timed Petri net model is used to estimate the execution time of the processor, and the memory access latencies are estimated using hit/miss information obtained from the functional model of the cache. This model can be used to predict the performance of data-parallel applications or multiprogramming workloads on CMP platforms with various cache hierarchies and a shared bus interconnect. The error in the estimated execution time of an application is within 6%. The average speedup achieved ranges from 2x to 4x over the cycle-accurate simulator.

A method of tracking the peak power points for a variable speed wind energy conversion system

Abstract:

In this paper, a method of tracking the peak power in a wind energy conversion system (WECS) is proposed, which is independent of the turbine parameters and air density. The algorithm searches for the peak power by varying the speed in the desired direction. The generator is operated in the speed control mode, with the speed reference being dynamically modified in accordance with the magnitude and direction of change of active power. The peak power points in the P-ω curve correspond to dP/dω = 0. This fact is made use of in the optimum point search algorithm. The generator considered is a wound rotor induction machine whose stator is connected directly to the grid and whose rotor is fed through back-to-back pulse-width-modulation (PWM) converters. Stator flux-oriented vector control is applied to control the active and reactive current loops independently. The turbine characteristics are generated by a dc motor fed from a commercial dc drive. All of the control loops are executed by a single-chip digital signal processor (DSP) controller, the TMS320F240. Experimental results show that the performance of the control algorithm compares well with the conventional torque control method.
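
The search described in this abstract can be pictured with a generic perturb-and-observe loop: nudge the speed reference, watch the change in power, and reverse direction when power falls. The sketch below is a simplification under stated assumptions, not the paper's controller: `track_peak` and `toy_curve` are hypothetical names, the power curve is an invented parabola, and the step size is fixed, whereas the abstract says the reference is also adjusted according to the magnitude of the power change.

```python
def track_peak(power_at, omega0, step=0.05, iterations=200):
    """Hill-climb toward the speed at which power is maximal.

    power_at: callable returning measured power for a speed reference
              (a stand-in for a real measurement; hypothetical).
    """
    omega = omega0
    prev_power = power_at(omega)
    direction = 1.0
    for _ in range(iterations):
        omega += direction * step        # perturb the speed reference
        power = power_at(omega)          # observe the resulting power
        if power < prev_power:           # power fell: we passed the peak
            direction = -direction       # so search the other way
        prev_power = power
    return omega


def toy_curve(omega):
    # Invented P-omega curve with its peak (dP/domega = 0) at omega = 1.2.
    return 1.0 - (omega - 1.2) ** 2


omega_final = track_peak(toy_curve, omega0=0.5)
```

Because the step is fixed, the loop does not settle exactly on the peak; it ends up oscillating within a couple of steps of the point where dP/dω = 0.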

Online Unsupervised Pattern Discovery in Speech Using Parallelization

Abstract:

Segmental dynamic time warping (DTW) has been demonstrated to be a useful technique for finding acoustic similarity scores between segments of two speech utterances. Due to its high computational requirements, it had to be computed in an offline manner, limiting the applications of the technique. In this paper, we present results of parallelization of this task by distributing the workload in either a static or dynamic way on an 8-processor cluster and discuss the trade-offs among different distribution schemes. We show that online unsupervised pattern discovery using segmental DTW is plausible with as few as 8 processors. This brings the task within reach of today's general purpose multi-core servers. We also show results on a 32-processor system, and discuss factors affecting scalability of our methods.
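
To give a sense of where the computational cost comes from, the sketch below implements plain DTW between two one-dimensional sequences. It is the classic dynamic-programming recurrence, not the segmental variant used in the paper and not its parallel distribution schemes; the function name `dtw_distance` and the use of scalar features with an absolute-difference cost are assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Classic DTW alignment cost between two sequences of numbers."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])  # time-stretched copy
offset = dtw_distance([0, 0], [1, 1])
```

The table has one cell per pair of frames, so the work grows with the product of the two sequence lengths, which is the kind of load that motivates spreading comparisons across processors.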

Executing Long-running Multi-component Applications on Batch Grids

Abstract:

Computational grids are increasingly being used for executing large multi-component scientific applications. The most widely reported advantages of application execution on grids are the performance benefits, in terms of speeds, problem sizes or quality of solutions, due to an increased number of processors. We explore the possibility of improved performance on grids without increasing the application's processor space. For this, we consider grids with multiple batch systems. We explore the challenges involved in and the advantages of executing long-running multi-component applications on multiple batch sites with a popular multi-component climate simulation application, CCSM, as the motivation. We have performed extensive simulation studies to estimate the single and multi-site execution rates of the applications for different system characteristics. Our experiments show that in many cases, multiple batch executions can have better execution rates than a single site execution.

A Real-time clustering system for spatio-temporal signals from network of neurons

Abstract:

Over the past few years, studies of cultured neuronal networks have opened up avenues for understanding the ion channels, receptor molecules, and synaptic plasticity that may form the basis of learning and memory. The hippocampal neurons from rats are dissociated and cultured on a surface containing a grid of 64 electrodes. The signals from these 64 electrodes are acquired using a fast data acquisition system, MED64 (Alpha MED Sciences, Japan), at a sampling rate of 20 K samples with a precision of 16 bits per sample. A few minutes of acquired data runs into a few hundred megabytes. The data processing for the neural analysis is highly compute-intensive because the volume of data is huge. The major processing requirements are noise removal, pattern recovery, pattern matching, clustering and so on. In order to interface a neuronal colony to a physical world, these computations need to be performed in real time. A single processor such as a desktop computer may not be adequate to meet these computational requirements. Parallel computing is a method used to satisfy the real-time computational requirements of a neuronal system that interacts with an external world while increasing the flexibility and scalability of the application. In this work, we developed a parallel neuronal system using a multi-node digital signal processing system. With 8 processors, the system is able to compute and map incoming signals segmented over a period of 200 ms into an action in a trained cluster system in real time.
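
The data volumes quoted in this abstract can be checked with a short calculation. The sketch below assumes the 20 K figure means 20,000 samples per second on each of the 64 electrodes, which the abstract does not state explicitly, and counts only raw sample bytes; the constant and function names are made up for the example.

```python
ELECTRODES = 64
SAMPLES_PER_SECOND = 20_000   # assumption: per electrode, per second
BYTES_PER_SAMPLE = 2          # 16-bit samples


def raw_bytes(milliseconds):
    """Raw sample bytes produced by all electrodes in the given time."""
    per_second = ELECTRODES * SAMPLES_PER_SECOND * BYTES_PER_SAMPLE
    return per_second * milliseconds // 1000


per_second = raw_bytes(1_000)        # 2,560,000 bytes every second
three_minutes = raw_bytes(180_000)   # 460,800,000 bytes, about 460 MB
one_window = raw_bytes(200)          # 512,000 bytes per 200 ms segment
```

Under that assumption, three minutes of recording comes to roughly 460 MB, in line with the "few hundred megabytes" in the abstract, and each 200 ms segment that must be processed in real time carries about half a megabyte of raw samples.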

Cojoined Irregular Topology and Routing Table Generation for Network-on-Chip

Abstract:

Scalable Networks on Chips (NoCs) are needed to match the ever-increasing communication demands of large-scale Multi-Processor Systems-on-chip (MPSoCs) for multimedia communication applications. The heterogeneous nature of application-specific on-chip cores, along with the specific communication requirements among the cores, calls for the design of application-specific NoCs for improved performance in terms of communication energy, latency, and throughput. In this work, we propose a methodology for the design of customized irregular networks-on-chip. The proposed method exploits a priori knowledge of the application's communication characteristics to generate an optimized network topology and corresponding routing tables.

Region Based Structure Layout Optimization by Selective Data Copying

Abstract:

As the gap between processor and memory continues to grow, memory performance becomes a key performance bottleneck for many applications. Compilers therefore increasingly seek to modify an application's data layout to improve cache locality and cache reuse. Whole Program Structure Layout (WPSL) transformations can significantly increase the spatial locality of data and reduce the runtime of programs that use link-based data structures by increasing cache line utilization. However, in production compilers, WPSL transformations do not realize their full performance potential due to a number of factors. Structure layout decisions made on the basis of whole-program aggregated affinity/hotness of structure fields can be suboptimal for local code regions. WPSL is also restricted in applicability in production compilers for type-unsafe languages like C/C++ due to the extensive legality checks and field-sensitive pointer analysis required over the entire application. In order to overcome the issues associated with WPSL, we propose the Region Based Structure Layout (RBSL) optimization framework, which uses selective data copying. We describe our RBSL framework, implemented in the production compiler for C/C++ on HP-UX IA-64. We show that, acting in complement to the existing and mature WPSL transformation framework in our compiler, RBSL improves application performance in pointer-intensive SPEC benchmarks by 3% to 28% over WPSL.

Reducing Buffer Requirements in Core Routers using Dynamic Buffering

Abstract:

Earlier studies have exploited statistical multiplexing of flows in the core of the Internet to reduce the buffer requirement in routers. Reducing the memory requirement of routers is important as it enables an improvement in performance and at the same time a decrease in cost. In this paper, we observe that the links in the core of the Internet are typically over-provisioned and that this can be exploited to reduce the buffering requirement in routers. The small on-chip memory of a network processor (NP) can be effectively used to buffer packets during most regimes of traffic. We propose a dynamic buffering strategy which buffers packets in the receive and transmit buffers of an NP when the memory requirement is low. When the buffer requirement increases due to bursts in the traffic, memory is allocated to packets in the off-chip DRAM. This scheme effectively mitigates the DRAM access bottleneck, as only a part of the traffic is stored in the DRAM. We build a Petri net model and evaluate the proposed scheme with core-Internet-like traffic. At 77% link utilization, the dynamic buffering scheme has a drop rate of just 0.65%, whereas traditional DRAM buffering has a 4.64% packet drop rate. Even with a high link utilization of 90%, which rarely happens in the core, our dynamic buffering results in a packet drop rate of only 2.17%, while supporting a throughput of 7.39 Gbps. We study the proposed scheme under different conditions to understand the provisioning of processing threads and to determine the queue length at which packets must be buffered in the DRAM. We show that the proposed dynamic buffering strategy drastically reduces the buffering requirement while still maintaining low packet drop rates.
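
The buffering policy in this abstract can be pictured as a two-tier queue: packets stay in on-chip memory while there is room and spill to DRAM only during bursts. The sketch below is a toy model of that idea, not the paper's Petri net model or its NP implementation; the class name `DynamicBuffer`, the slot counts, and the rule of spilling only when the on-chip tier is full are all assumptions made for illustration.

```python
from collections import deque


class DynamicBuffer:
    """Toy two-tier packet buffer: on-chip first, DRAM during bursts."""

    def __init__(self, on_chip_slots, dram_slots):
        self.on_chip_slots = on_chip_slots
        self.dram_slots = dram_slots
        self.queue = deque()      # (packet, location) in arrival order
        self.used = {"on_chip": 0, "dram": 0}
        self.dropped = 0

    def enqueue(self, packet):
        """Place a packet and report where it went."""
        if self.used["on_chip"] < self.on_chip_slots:
            location = "on_chip"  # low demand: avoid the DRAM access
        elif self.used["dram"] < self.dram_slots:
            location = "dram"     # burst: spill to off-chip memory
        else:
            self.dropped += 1     # both tiers full
            return "dropped"
        self.used[location] += 1
        self.queue.append((packet, location))
        return location

    def dequeue(self):
        """Transmit the oldest packet, freeing its slot."""
        if not self.queue:
            return None
        packet, location = self.queue.popleft()
        self.used[location] -= 1
        return packet


buf = DynamicBuffer(on_chip_slots=2, dram_slots=1)
placements = [buf.enqueue(p) for p in range(4)]
```

In this toy run the first two packets stay on chip, the third spills to DRAM, and the fourth is dropped, so only part of the traffic ever touches DRAM.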

Executable Analysis using Abstract Interpretation with Circular Linear Progressions

Abstract:

We propose a new abstract domain for static analysis of executable code. Concrete states are abstracted using circular linear progressions (CLPs). CLPs model computations using a finite word length, as is seen in any real-life processor. The finite abstraction allows handling overflow scenarios in a natural and straightforward manner. Abstract transfer functions have been defined for a wide range of operations, which makes this domain easily applicable for analyzing code for a wide range of ISAs. CLPs combine the scalability of interval domains with the discreteness of linear congruence domains. We also present a novel, lightweight method to track linear equality relations between static objects that is used by the analysis to improve precision. The analysis is efficient, with the total space and time overhead being quadratic in the number of static objects being tracked.

Adaptive Filtering Technique and DSP Based Implementation for High-Speed Distance Protection

Abstract:

The paper presents an adaptive Fourier filtering technique and a relaying scheme based on a combination of a digital band-pass filter and a three-sample algorithm, for applications in high-speed numerical distance protection. To enhance the performance of the above-mentioned technique, a high-speed fault detector has been used. MATLAB-based simulation studies show that the adaptive Fourier filtering technique provides fast tripping for near faults and security for farther faults. The digital relaying scheme based on a combination of a digital band-pass filter and a three-sample data window algorithm also provides accurate and high-speed detection of faults. The paper also proposes a high-performance hardware scheme, based on a 16-bit fixed-point DSP processor (Texas Instruments TMS320LF2407A), suitable for implementation of the above techniques. To evaluate the performance of the proposed relaying scheme under steady-state and transient conditions, PC-based, menu-driven relay test procedures are developed using National Instruments LabVIEW software. The test signals are generated in real time using LabVIEW-compatible analog output modules. The results obtained from the simulation studies as well as hardware implementations are also presented.
