129 resultados para intel processor


Relevância:

10.00% 10.00%

Publicador:

Resumo:

This paper describes the design and the architecture of a bit-level systolic array processor. The bit-level systolic array described is directly applicable to a wide range of image processing operations where high performance and throughput are essential. The architecture is illustrated by describing the operation of the correlator and convolver chips which are being developed. The advantage of the system is also discussed.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Details of a new low power fast Fourier transform (FFT) processor for use in digital television applications are presented. This has been fabricated using a 0.6-µm CMOS technology and can perform a 64 point complex forward or inverse FFT on real-time video at up to 18 Megasamples per second. It comprises 0.5 million transistors in a die area of 7.8 × 8 mm and dissipates 1 W. The chip design is based on a novel VLSI architecture which has been derived from a first principles factorization of the discrete Fourier transform (DFT) matrix and tailored to a direct silicon implementation.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

The inclusion of the Discrete Wavelet Transform in the JPEG-2000 standard has added impetus to the research of hardware architectures for the two-dimensional wavelet transform. In this paper, a VLSI architecture for performing the symmetrically extended two-dimensional transform is presented. This architecture conforms to the JPEG-2000 standard and is capable of near-optimal performance when dealing with the image boundaries. The architecture also achieves efficient processor utilization. Implementation results based on a Xilinx Virtex-2 FPGA device are included.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

The design of a generic QR core for adaptive beamforming is presented. The work relies on an existing mapping technique that can be applied to a triangular QR array in such a way to allow the generation of a range of QR architectures. All scheduling of data inputs and retiming to include processor latency has been included within the generic representation.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Methods are presented for developing synthesizable FFT cores. These are based on a modular approach in which parameterizable blocks are cascaded to implement the computations required across a range of typical FFT signal flow graphs. The underlying architectural approach combines the use of a digital serial data organization with generic commutator blocks to produce systems that offer 100% processor utilization with storage requirements less than previous designs. The approach has been used to create generators for the automated synthesis of FFT cores that are portable across a broad range of silicon technologies. Resulting chip designs are competitive with manual methods but with significant reductions in design times.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

This paper presents the design of a single chip adaptive beamformer which contains 5 million transistors and can perform 50 GigaFlops. The core processor of the adaptive beamformer is a QR-array processor implemented on a fully efficient linear systolic architecture. The paper highlights a number of rapid design techniques that have been used to realize the design. These include an architecture synthesis tool for quickly developing the circuit architecture and the utilization of a library of parameterizable silicon intellectual property (IP) cores, to rapidly develop the circuit layouts.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Details of a new low power FFT processor for use in digital television applications are presented. This has been fabricated using a 0.6 µm CMOS technology and can perform a 64 point complex forward or inverse FFT on real-rime video at up to 18 Megasamples per second. It comprises 0.5 million transistors in a die area of 7.8×8 mm and dissipates 1 W. Its performance, in terms of computational rate per area per watt, is significantly higher than previously reported devices, leading to a cost-effective silicon solution for high quality video processing applications. This is the result of using a novel VLSI architecture which has been derived from a first principles factorisation of the DFT matrix and tailored to a direct silicon implementation.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

We describe recent progress of an ongoing research programme aimed at producing computational science software that can exploit high performance architectures in the atomic physics application domain. We examine the computational bottleneck of matrix construction in a suite of two-dimensional R-matrix propagation programs, 2DRMP, that are aimed at creating virtual electron collision experiments on HPC architectures. We build on Ixaru's extended frequency dependent quadrature rules (EFDQR) for Slater integrals and examine the challenge of constructing Hamiltonian matrices in parallel across an m-processor compute node in a block cyclic distribution for subsequent diagonalization by ScaLAPACK.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

The R-matrix method when applied to the study of intermediate energy electron scattering by the hydrogen atom gives rise to a large number of two electron integrals between numerical basis functions. Each integral is evaluated independently of the others, thereby rendering this a prime candidate for a parallel implementation. In this paper, we present a parallel implementation of this routine which uses a Graphical Processing Unit as a co-processor, giving a speedup of approximately 20 times when compared with a sequential version. We briefly consider properties of this calculation which make a GPU implementation appropriate with a view to identifying other calculations which might similarly benet.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

In intelligent video surveillance systems, scalability (of the number of simultaneous video streams) is important. Two key factors which hinder scalability are the time spent in decompressing the input video streams, and the limited computational power of the processor. This paper demonstrates how a combination of algorithmic and hardware techniques can overcome these limitations, and significantly increase the number of simultaneous streams. The techniques used are processing in the compressed domain, and exploitation of the multicore and vector processing capability of modern processors. The paper presents a system which performs background modeling, using a Mixture of Gaussians approach. This is an important first step in the segmentation of moving targets. The paper explores the effects of reducing the number of coefficients in the compressed domain, in terms of throughput speed and quality of the background modeling. The speedups achieved by exploiting compressed domain processing, multicore and vector processing are explored individually. Experiments show that a combination of all these techniques can give a speedup of 170 times on a single CPU compared to a purely serial, spatial domain implementation, with a slight gain in quality.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

We address the problem of designing distributed algorithms for large scale networks that are robust to Byzantine faults. We consider a message passing, full information model: the adversary is malicious, controls a constant fraction of processors, and can view all messages in a round before sending out its own messages for that round. Furthermore, each bad processor may send an unlimited number of messages. The only constraint on the adversary is that it must choose its corrupt processors at the start, without knowledge of the processors’ private random bits.

A good quorum is a set of O(logn) processors, which contains a majority of good processors. In this paper, we give a synchronous algorithm which uses polylogarithmic time and Õ(vn) bits of communication per processor to bring all processors to agreement on a collection of n good quorums, solving Byzantine agreement as well. The collection is balanced in that no processor is in more than O(logn) quorums. This yields the first solution to Byzantine agreement which is both scalable and load-balanced in the full information model.

The technique which involves going from situation where slightly more than 1/2 fraction of processors are good and and agree on a short string with a constant fraction of random bits to a situation where all good processors agree on n good quorums can be done in a fully asynchronous model as well, providing an approach for extending the Byzantine agreement result to this model.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Processor architectures has taken a turn towards many-core processors, which integrate multiple processing cores on a single chip to increase overall performance, and there are no signs that this trend will stop in the near future. Many-core processors are harder to program than multi-core and single-core processors due to the need of writing parallel or concurrent programs with high degrees of parallelism. Moreover, many-cores have to operate in a mode of strong scaling because of memory bandwidth constraints. In strong scaling increasingly finer-grain parallelism must be extracted in order to keep all processing cores busy.

Task dataflow programming models have a high potential to simplify parallel program- ming because they alleviate the programmer from identifying precisely all inter-task de- pendences when writing programs. Instead, the task dataflow runtime system detects and enforces inter-task dependences during execution based on the description of memory each task accesses. The runtime constructs a task dataflow graph that captures all tasks and their dependences. Tasks are scheduled to execute in parallel taking into account dependences specified in the task graph.

Several papers report important overheads for task dataflow systems, which severely limits the scalability and usability of such systems. In this paper we study efficient schemes to manage task graphs and analyze their scalability. We assume a programming model that supports input, output and in/out annotations on task arguments, as well as commutative in/out and reductions. We analyze the structure of task graphs and identify versions and generations as key concepts for efficient management of task graphs. Then, we present three schemes to manage task graphs building on graph representations, hypergraphs and lists. We also consider a fourth edge-less scheme that synchronizes tasks using integers. Analysis using micro-benchmarks shows that the graph representation is not always scalable and that the edge-less scheme introduces least overhead in nearly all situations.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Application Specific Instruction Set Processor (ASIP) becomes an attractive substitute for ASIC as transistor density, logic complexity and market competition boost. Similar to ASIC, ASIP is based on customized and tailored architectures. In this way, ASIP delivers high performances with low overheads on cost and power whilst taking the advantages of high flexibility and fast time-to-market as a processor-based solution. To demonstrate this effective solution for embedded applications, this paper performs an overall investigation on ASIP's developments, challenges, trends in terms of architectures and design methodologies.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Performance evaluation of parallel software and architectural exploration of innovative hardware support face a common challenge with emerging manycore platforms: they are limited by the slow running time and the low accuracy of software simulators. Manycore FPGA prototypes are difficult to build, but they offer great rewards. Software running on such prototypes runs orders of magnitude faster than current simulators. Moreover, researchers gain significant architectural insight during the modeling process. We use the Formic FPGA prototyping board [1], which specifically targets scalable and cost-efficient multi-board prototyping, to build and test a 64-board model of a 512-core, MicroBlaze-based, non-coherent hardware prototype with a full network-on-chip in a 3D-mesh topology. We expand the hardware architecture to include the ARM Versatile Express platforms and build a 520-core heterogeneous prototype of 8 Cortex-A9 cores and 512 MicroBlaze cores. We then develop an MPI library for the prototype and evaluate it extensively using several bare-metal and MPI benchmarks. We find that our processor prototype is highly scalable, models faithfully single-chip multicore architectures, and is a very efficient platform for parallel programming research, being 50,000 times faster than software simulation.