Biblioteca Digital

985 results for intel processor

Construction and use of linear regression models for processor performance analysis

Abstract:

Processor architects have the challenging task of evaluating a large design space consisting of several interacting parameters and optimizations. To assist architects in making crucial design decisions, we build linear regression models that relate processor performance to micro-architecture parameters, using simulation-based experiments. We obtain good approximate models using an iterative process in which Akaike's information criterion is used to extract a good linear model from a small set of simulations, and limited further simulation is guided by the model using D-optimal experimental designs. The iterative process is repeated until the desired error bounds are achieved. We used this procedure to establish the relationship of the CPI performance response to 26 key micro-architectural parameters using a detailed cycle-by-cycle superscalar processor simulator. The resulting models provide a significance ordering on all micro-architectural parameters and their interactions, and explain the performance variations of micro-architectural techniques.
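
The model-selection step in this abstract can be illustrated with a small sketch. It assumes the common least-squares form of Akaike's information criterion, AIC = n ln(RSS/n) + 2k (valid up to an additive constant for Gaussian errors). The candidate models, their residual sums of squares, and the simulation count are made-up numbers for illustration, not results from the paper.

```python
import math

def aic_least_squares(n_obs, rss, n_params):
    """AIC of a least-squares fit with Gaussian errors, up to an additive constant."""
    return n_obs * math.log(rss / n_obs) + 2 * n_params

# Hypothetical candidates: (label, residual sum of squares, fitted coefficients).
candidates = [
    ("main effects only", 4.0, 5),
    ("main effects + one interaction", 2.0, 6),
    ("main effects + many interactions", 1.9, 12),
]
n_simulations = 40  # hypothetical number of simulator runs

# The model with the lowest AIC balances fit quality against model size.
best = min(candidates, key=lambda c: aic_least_squares(n_simulations, c[1], c[2]))
print(best[0])
```

In this toy setting the extra interaction terms of the third candidate reduce the residual only slightly, so the penalty term 2k makes the smaller second model preferable.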

Adaptive load control of the central processor in a distributed system with a star topology

Abstract:

The author presents adaptive control techniques for controlling the flow of real-time jobs from the peripheral processors (PPs) to the central processor (CP) of a distributed system with a star topology. He considers two classes of flow control mechanisms: (1) proportional control, where a certain proportion of the load offered to each PP is sent to the CP, and (2) threshold control, where there is a maximum rate at which each PP can send jobs to the CP. The problem is to obtain good algorithms for dynamically adjusting the control level at each PP in order to prevent overload of the CP, when the load offered by the PPs is unknown and varying. The author formulates the problem approximately as a standard system control problem in which the system has unknown parameters that are subject to change. Using well-known techniques (e.g., naive-feedback-controller and stochastic approximation techniques), he derives adaptive controls for the system control problem. He demonstrates the efficacy of these controls in the original problem by using the control algorithms in simulations of a queuing model of the CP and the load controls.
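
The two classes of flow control named in this abstract can be sketched as follows. This is only an illustration of the definitions given in the text; the function names and the sample rates are assumptions, and the adaptive adjustment of the control level is not shown.

```python
def proportional_control(offered_load, fraction):
    """Forward a fixed proportion of the load offered to a PP on to the CP."""
    return fraction * offered_load

def threshold_control(offered_load, max_rate):
    """Forward the offered load to the CP, capped at a maximum rate."""
    return min(offered_load, max_rate)

# Hypothetical job rates (jobs per second) offered to one peripheral processor.
print(proportional_control(10.0, 0.5))  # half of the offered load is forwarded
print(threshold_control(10.0, 4.0))     # load above the cap is held back
print(threshold_control(3.0, 4.0))      # load below the cap passes unchanged
```

An adaptive scheme in the sense of the abstract would adjust `fraction` or `max_rate` over time from observations of the CP load.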

A Parallel Progressive Refinement Image Rendering Algorithm on a Scalable Multithreaded VLSI Processor Array

Abstract:

In this paper we develop a multithreaded VLSI processor linear array architecture to render complex environments based on the radiosity approach. The processing elements are identical and multithreaded. They work in Single Program Multiple Data (SPMD) mode. A new algorithm to do the radiosity computations based on the progressive refinement approach[2] is proposed. Simulation results indicate that the architecture is latency tolerant and scalable. It is shown that a linear array of 128 uni-threaded processing elements sustains a throughput close to 0.4 million patches/sec.

Optimal scheduling of a processor executing a communication protocol stack

Abstract:

We consider the problem of optimally scheduling a processor executing a multilayer protocol in an intelligent Network Interface Controller (NIC). In particular, we assume a typical LAN environment with class 4 transport service, a connectionless network service, and a class 1 link-level protocol. We develop a queuing model for the problem. In the most general case this becomes a cyclic queuing network in which some queues have dedicated servers, and the others have a common schedulable server. We use sample path arguments and Markov decision theory to determine optimal service schedules. The optimal throughputs are compared with those obtained with simple policies. The optimal policy yields up to 25% improvement in some cases. In other cases, the optimal policy does only slightly better than much simpler policies.

Heuristic algorithms for scheduling of a batch processor in automobile gear manufacturing

Abstract:

In this paper we address a scheduling problem for minimising total weighted tardiness. The motivation for the paper comes from the automobile gear manufacturing process. We consider the bottleneck operation of the heat treatment stage of gear manufacturing. Real-life scenarios such as unequal release times, incompatible job families, non-identical job sizes and allowance for job splitting have been considered. A mathematical model taking into account dynamic starting conditions has been developed. Due to the NP-hard nature of the problem, a few heuristic algorithms have been proposed. The performance of the proposed heuristic algorithms is evaluated: (a) in comparison with the optimal solution for small problem instances, and (b) in comparison with an 'estimated optimal solution' for large problem instances. Extensive computational analyses reveal that the proposed heuristic algorithms are capable of consistently obtaining near-optimal solutions (that is, close to the statistically estimated optimum) in very reasonable computational time.
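
The objective named in this abstract, total weighted tardiness, has a standard definition: the sum over jobs of weight times max(0, completion time minus due date). A minimal sketch follows; the job data are hypothetical, and the batching constraints of the heat treatment problem are not modelled.

```python
def total_weighted_tardiness(jobs):
    """Sum of weight * max(0, completion - due) over (weight, completion, due) tuples."""
    return sum(weight * max(0, completion - due) for weight, completion, due in jobs)

# Hypothetical schedule: (weight, completion time, due date) for three jobs.
schedule = [
    (2, 10, 8),  # late by 2, contributes 2 * 2
    (1, 5, 7),   # early, contributes 0
    (3, 12, 9),  # late by 3, contributes 3 * 3
]
print(total_weighted_tardiness(schedule))
```

A heuristic of the kind described would compare candidate schedules by this value and keep the one with the smallest total.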

Symmetrizing a Hessenberg matrix: Designs for VLSI parallel processor arrays

Abstract:

A symmetrizer of a nonsymmetric matrix A is a symmetric matrix X that satisfies the equation XA = A^T X, where A^T denotes the transpose of A. A symmetrizer is useful in converting a nonsymmetric eigenvalue problem into a symmetric one, which is relatively easy to solve, and finds applications in stability problems in control theory and in the study of general matrices. Three designs based on VLSI parallel processor arrays are presented to compute a symmetrizer of a lower Hessenberg matrix. Their scope is discussed. The first one is the Leiserson systolic design, while the remaining two, viz., the double pipe design and the fitted diagonal design, are derived versions of the first design with improved performance.
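
The defining equation of a symmetrizer can be checked numerically on a small case. The sketch below only verifies the definition for a hand-picked 2x2 example; the matrices and helper names are assumptions for illustration, and it does not reproduce any of the VLSI array designs.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def transpose(a):
    """Return the transpose of a square matrix."""
    return [list(row) for row in zip(*a)]

def is_symmetrizer(x, a):
    """True if x is symmetric and x*a equals transpose(a)*x."""
    return x == transpose(x) and matmul(x, a) == matmul(transpose(a), x)

A = [[1, 2], [3, 4]]  # nonsymmetric example matrix
X = [[3, 0], [0, 2]]  # candidate symmetrizer, chosen by hand

print(is_symmetrizer(X, A))
print(matmul(X, A))   # the product X*A is itself symmetric
```

The last line illustrates why a symmetrizer helps with eigenvalue problems: the product XA is a symmetric matrix even though A is not.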

Believe it or Not! Multicore CPUs can Match GPUs for FLOP-intensive Applications

Abstract:

In this work, we evaluate the performance of a real-world image processing application that uses a cross-correlation algorithm to compare a given image with a reference one. The algorithm processes individual images represented as 2-dimensional matrices of single-precision floating-point values using O(n^4) operations involving dot-products and additions. We implement this algorithm on an nVidia GTX 285 GPU using CUDA, and also parallelize it for the Intel Xeon (Nehalem) and IBM Power7 processors, using both manual and automatic techniques. Pthreads and OpenMP with SSE and VSX vector intrinsics are used for the manually parallelized version, while a state-of-the-art optimization framework based on the polyhedral model is used for automatic compiler parallelization and optimization. The performance of this algorithm on the nVidia GPU suffers from: (1) a smaller shared memory, (2) unaligned device memory access patterns, (3) expensive atomic operations, and (4) weaker single-thread performance. On commodity multi-core processors, the application dataset is small enough to fit in caches, and when parallelized using a combination of task and short-vector data parallelism (via SSE/VSX) or through fully automatic optimization from the compiler, the application matches or beats the performance of the GPU version. The primary reasons for better multi-core performance include larger and faster caches, higher clock frequency, higher on-chip memory bandwidth, and better compiler optimization and support for parallelization. The best performing versions on the Power7, Nehalem, and GTX 285 run in 1.02s, 1.82s, and 1.75s, respectively. These results demonstrate that, under certain conditions, it is possible for a FLOP-intensive structured application running on a multi-core processor to match or even beat the performance of an equivalent GPU version.
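
The quartic operation count mentioned in this abstract follows from evaluating one dot product of n^2 terms for each of n^2 offsets. The sketch below is a plain direct cross-correlation that shows this loop structure. The wraparound (circular) treatment of shifts is an assumption made to keep the example short, and the abstract does not state how the original application handles image boundaries.

```python
def cross_correlate(image, reference):
    """Direct 2-D cross-correlation of two n x n matrices with circular shifts."""
    n = len(image)
    scores = [[0.0] * n for _ in range(n)]
    for dy in range(n):              # n vertical offsets
        for dx in range(n):          # n horizontal offsets
            total = 0.0
            for i in range(n):       # n^2-term dot product per offset
                for j in range(n):
                    total += image[i][j] * reference[(i + dy) % n][(j + dx) % n]
            scores[dy][dx] = total
    return scores

img = [[1.0, 2.0], [3.0, 4.0]]
scores = cross_correlate(img, img)
print(scores)  # the largest score is at offset (0, 0) when an image is compared with itself
```

Each offset is independent of the others, which is the property that makes the computation amenable to task and vector parallelism.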

Grid Power Quality Analysis of 3-Phase System Using Low Cost Digital Signal Processor

Fault-Tolerant Average Execution Time Optimization for General Purpose Multi-Processor System-on-Chips

Abstract:

Owing to developments in semiconductor technology, fault tolerance is important not only for safety-critical systems but also for general-purpose (non-safety-critical) systems. However, for general-purpose systems it is important to minimize the average execution time (AET) while ensuring fault tolerance, rather than to guarantee that deadlines are always met. For a given job and a soft (transient) error probability, we define mathematical formulas for AET that include bus communication overhead for both voting (active replication) and rollback-recovery with checkpointing (RRC). For a given multi-processor system-on-chip (MPSoC), we define integer linear programming (ILP) models that minimize AET, including bus communication overhead, when: (1) selecting the number of checkpoints when using RRC, (2) finding the number of processors and job-to-processor assignment when using voting, and (3) defining the fault-tolerance scheme (voting or RRC) per job and defining its usage for each job. Experiments demonstrate significant savings in AET.

Packet Reordering in Network Processors

Abstract:

Network processors today consist of multiple parallel processors (micro engines) with support for multiple threads to exploit the packet-level parallelism inherent in network workloads. With such concurrency, packet ordering at the output of the network processor cannot be guaranteed. This paper studies the effect of concurrency in network processors on packet ordering. We use a validated Petri net model of a commercial network processor, Intel IXP 2400, to determine the extent of packet reordering for an IPv4 forwarding application. Our study indicates that, in addition to the parallel processing in the network processor, the allocation scheme for the transmit buffer also adversely impacts packet ordering. In particular, our results reveal that this packet reordering results in a packet retransmission rate of up to 61%. We explore different transmit buffer allocation schemes, namely contiguous, strided, local, and global, which reduce the packet retransmission rate to 24%. We propose an alternative scheme, packet sort, which guarantees complete packet ordering while achieving a throughput of 2.5 Gbps. Further, packet sort outperforms the in-built packet ordering schemes in the IXP processor by up to 35%.

Optimizing Multimedia Experience in a Thin Client Environment for a Resource Constrained Processor

Abstract:

In this paper, we study how TCP and UDP flows interact with each other when the end system is a CPU resource constrained thin client. The problem addressed is twofold: (1) the throughput of TCP flows degrades severely in the presence of heavily loaded UDP flows, and (2) fairness and minimum QoS requirements of UDP are not maintained. First, we identify the factors affecting the TCP throughput by providing an in-depth analysis of end-to-end delay and packet loss variations. The results obtained from the first part lead us to our second contribution. We propose and study the use of an algorithm that ensures fairness across flows. The algorithm improves the performance of TCP flows in the presence of multiple UDP flows admitted under an admission algorithm and maintains the minimum QoS requirements of the UDP flows. The advantage of the algorithm is that it requires no changes to the TCP/IP stack, and control is achieved through receiver window control.

On a programmable signal processor for VLSI

Abstract:

This paper presents a method of designing a programmable signal processor based on a bit parallel matrix vector matrix multiplier (linear transformer). The salient feature of this design is that the efficiency of the direct vector matrix multiplier is improved and VLSI design is made much simpler by trading off the more expensive arithmetic operation (multiplication) for 'cheaper' manipulation (addition/subtraction) of the data.

Experimental implementation of quantum Ulam’s problem in a nuclear magnetic resonance quantum information processor

Abstract:

Ulam’s problem is a two-person game in which one player tries to find, in the minimum number of queries, a number thought of by the other player. Classically, the problem scales polynomially with the size of the number. The quantum version of Ulam’s problem has a query complexity that is independent of the dimension of the search space. The experimental implementation of the quantum Ulam’s problem in a Nuclear Magnetic Resonance Information Processor with 3 quantum bits is reported here.

Performance modeling and architecture exploration of network processors

Abstract:

This paper proposes a Petri net model for a commercial network processor (Intel IXP architecture), which is a multithreaded multiprocessor architecture. We consider and model three different applications, viz., IPv4 forwarding, network address translation, and IP security, running on IXP 2400/2850. A salient feature of the Petri net model is its ability to model the application, the architecture and their interaction in great detail. The model is validated using the Intel proprietary tool (SDK 3.51 for IXP architecture) over a range of configurations. We conduct a detailed performance evaluation, identify the bottleneck resource, and propose a few architectural extensions and evaluate them in detail.

On-Chip Memory Architecture Exploration Framework for DSP Processor-Based Embedded System on Chip

Abstract:

Today's SoCs are complex designs with multiple embedded processors, memory subsystems, and application-specific peripherals. The memory architecture of embedded SoCs strongly influences the power and performance of the entire system. Further, the memory subsystem constitutes a major part (typically up to 70%) of the silicon area of a current-day SoC. In this article, we address on-chip memory architecture exploration for DSP processors whose memory is organized as multiple banks, where banks can be single or dual ported with non-uniform bank sizes. We propose two different methods for physical memory architecture exploration and identify the strengths and applicability of these methods in a systematic way. Both methods address memory architecture exploration for a given target application by considering the application's data access characteristics, and generate a set of Pareto-optimal design points that are interesting from a power, performance and VLSI area perspective. To the best of our knowledge, this is the first comprehensive work on memory space exploration at the physical memory level that integrates data layout and memory exploration to address the system objectives from both the hardware design and the application software development perspectives. Further, we propose an automatic framework that explores the design space, identifying hundreds of Pareto-optimal design points within a few hours of running on a standard desktop configuration.
