99 results for PROCESSOR
Abstract:
Caches hide the growing latency of main memory accesses from the processor by storing the most recently used data on-chip. To limit the search time through the caches, they are organized in a direct-mapped or set-associative way. Such an organization introduces many conflict misses that hamper performance. This paper studies randomizing set index functions, a technique that places data in the cache in such a way that conflict misses are avoided. The performance of such a randomized cache strongly depends on the randomization function. This paper discusses a methodology to generate randomization functions that perform well over a broad range of benchmarks. The methodology uses profiling information to predict the conflict miss rate of randomization functions. Using this information, a search algorithm then finds the best randomization function. For implementation reasons, the randomization function should be extremely simple and fast to evaluate. We therefore use randomization functions where each randomized address bit is computed as the XOR of a subset of the original address bits. These functions are chosen such that they operate on as few address bits as possible and have few inputs to each XOR. This paper shows that to index a 2^m-set cache, it suffices to randomize m+2 or m+3 address bits and to limit each XOR to 2 inputs to obtain the full potential of randomization. Furthermore, it is shown that the randomization function generated for one set of benchmarks also works well for an entirely different set of benchmarks. Using the described methodology, the implementation cost of randomization functions can be reduced with only an insignificant loss in conflict reduction.
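As an illustration of the XOR-based index functions described above, the following C sketch computes a randomized set index; the bit masks are hypothetical examples, not the functions produced by the paper's profile-guided search.

    #include <stdint.h>

    #define M 4  /* 2^4 = 16 sets; a small value chosen for illustration */

    /* xor_masks[i] selects the address bits XORed into index bit i.
       Each mask has at most two bits set, matching the two-input-XOR
       restriction; the masks themselves are hypothetical. */
    static const uint32_t xor_masks[M] = {
        (1u << 4) | (1u << 8),    /* index bit 0 = addr bit 4 ^ addr bit 8  */
        (1u << 5) | (1u << 9),    /* index bit 1 = addr bit 5 ^ addr bit 9  */
        (1u << 6) | (1u << 10),   /* index bit 2 = addr bit 6 ^ addr bit 10 */
        (1u << 7) | (1u << 11),   /* index bit 3 = addr bit 7 ^ addr bit 11 */
    };

    static unsigned set_index(uint32_t addr)
    {
        unsigned idx = 0;
        for (int i = 0; i < M; i++) {
            uint32_t b = addr & xor_masks[i];
            /* XOR-reduce the selected bits to a single parity bit */
            b ^= b >> 16; b ^= b >> 8; b ^= b >> 4; b ^= b >> 2; b ^= b >> 1;
            idx |= (b & 1u) << i;
        }
        return idx;  /* set number in [0, 2^M) */
    }

In hardware, each index bit is a single two-input XOR gate, which is why restricting the number of inputs keeps the evaluation delay negligible.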
Abstract:
Embedded processors are used in numerous devices executing dedicated applications. This setting makes it worthwhile to optimize the processor for the application it executes, in order to increase its power efficiency. This paper proposes to enhance direct mapped data caches with automatically tuned randomized set index functions to achieve that goal. We show how randomization functions can be generated automatically and compare them to traditional set-associative caches in terms of performance and energy consumption. A 16 kB randomized direct mapped cache consumes 22% less energy than a 2-way set-associative cache, while being less than 3% slower. When the randomization function is made configurable (i.e., it can be adapted to the program), the additional reduction in conflicts outweighs the added hardware complexity, provided there are enough conflict misses.
Abstract:
In a dynamic reordering superscalar processor, the front-end fetches instructions and places them in the issue queue. Instructions are then issued by the back-end execution core. Until recently, the front-end was designed to maximize performance without considering energy consumption. The front-end fetches instructions as fast as it can until it is stalled by a full issue queue or some other blocking structure. This approach wastes energy: (i) speculative execution causes many wrong-path instructions to be fetched and executed, and (ii) the back-end execution rate is usually below its peak rate, yet front-end structures are dimensioned to sustain peak performance. Dynamically reducing the front-end instruction rate and the active size of front-end structures (e.g., the issue queue) is a necessary performance-energy trade-off. Techniques proposed in the literature attack only one of these effects.
In previous work, we proposed Speculative Instruction Window Weighting (SIWW) [21], a fetch gating technique that addresses both fetch gating and dynamic sizing of the instruction issue queue. SIWW computes a global weight over the set of inflight instructions. This weight depends on the number and types of inflight instructions (non-branches, high confidence or low confidence branches, ...). The front-end instruction rate can be continuously adapted based on this weight. This paper extends the analysis of SIWW performed in previous work. It shows that SIWW performs better than previously proposed fetch gating techniques and that it allows the size of the active instruction queue to be adapted dynamically.
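A minimal C sketch of the weighting idea follows; the weight values and gating threshold are hypothetical, since the abstract does not give the paper's actual parameters.

    /* Instruction classes distinguished by SIWW (simplified). */
    enum insn_kind { NON_BRANCH, HI_CONF_BRANCH, LO_CONF_BRANCH };

    /* Hypothetical per-instruction weights: a low-confidence branch is
       far more likely to put the front-end on the wrong path, so it
       contributes much more to the global weight. */
    static int siww_weight(enum insn_kind k)
    {
        switch (k) {
        case HI_CONF_BRANCH: return 2;
        case LO_CONF_BRANCH: return 8;
        default:             return 1;
        }
    }

    /* Gate fetch while the summed weight of inflight instructions is
       above a threshold, throttling the front-end instruction rate. */
    static int fetch_allowed(const enum insn_kind *inflight, int n, int threshold)
    {
        int total = 0;
        for (int i = 0; i < n; i++)
            total += siww_weight(inflight[i]);
        return total < threshold;
    }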
Abstract:
The emergence of programmable logic devices as processing platforms for digital signal processing applications poses challenges concerning rapid implementation and high-level optimization of algorithms on these platforms. This paper describes Abhainn, a rapid implementation methodology and toolsuite for translating an algorithmic expression of the system into a working implementation on a heterogeneous multiprocessor/field programmable gate array platform, or a standalone system-on-programmable-chip solution. Two particular focuses of Abhainn are the automated but configurable realisation of inter-processor communication fabrics, and the establishment of novel dedicated hardware component design methodologies allowing algorithm-level transformation for system optimization. This paper outlines the approaches employed in both these instances.
Abstract:
To enable reliable data transfer in next-generation Multiple-Input Multiple-Output (MIMO) communication systems, terminals must be able to react to fluctuating channel conditions by having flexible modulation schemes and antenna configurations. This creates a challenging real-time implementation problem: providing the high performance required of cutting-edge MIMO standards, such as 802.11n, together with the flexibility for this behavioural variability. FPGA softcore processors offer a solution to this problem, and in this paper we show how heterogeneous SISD/SIMD/MIMD architectures can enable programmable multicore architectures on FPGA with performance and cost similar to traditional dedicated circuit-based architectures. When applied to a 4×4 16-QAM Fixed-Complexity Sphere Decoder (FSD) detector, this approach yields the first soft-processor-based solution for real-time 802.11n MIMO.
Abstract:
Background: This article describes a 'back to the future' approach to case 'write-ups', with medical students producing handwritten instead of word-processed case reports during their clinical placements. Word-processed reports had been found to have a number of drawbacks, including the inappropriate use of 'cutting and pasting', undue length and lack of focus. Method: We developed a template to be completed by hand, based on the hospital 'clerking-in process', and matched this to a new assessment proforma. An electronic survey of both students and assessors was conducted after the first year of operation to evaluate impact and utility. Results: The new template was well received by both students and assessors. Most students said they preferred handwriting the case reports (55.6%), although a significant proportion (44.4%) preferred the word processor. Many commented that the template enabled them to learn the structure of a case history effectively and to improve their history-taking skills. Most assessors who had previously marked case reports felt the new system represented an improvement. The average time spent marking each report fell from 23.56 to 16.38 minutes using the new proforma. Discussion: Free-text comments from the survey have led to the development of a more flexible case report template better suited to certain specialties (e.g. dermatology). This is an evolving process and there will be opportunities for further adaptation as electronic medical records become more common in hospital.
Abstract:
Two major UK systolic array projects are described. The first concerns the development of a wavefront array processor for adaptive beamforming; the second concerns the design of bit-level systolic arrays for high-performance signal processing.
Abstract:
Methods are presented for developing synthesizable FFT cores. These are based on a modular approach in which parameterized commutator and processor blocks are cascaded to implement the computations required in many important FFT signal flow graphs. In addition, it is shown how a digit-serial data organization can be used to produce systems that offer 100% processor utilization along with reductions in storage requirements. The approach has been used to create generators for the automated synthesis of FFT cores that are portable across a broad range of silicon technologies. Resulting chip designs are competitive with ones created using manual methods, but with significant reductions in design times.
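To illustrate the digit-serial data organization mentioned above, here is a small C model of a W-bit addition performed D bits per clock cycle; the word and digit widths are illustrative, and the model captures only the arithmetic, not the commutator structure.

    #define W 16   /* word length in bits (illustrative)     */
    #define D 4    /* digit size: D bits processed per cycle */

    /* Add two W-bit words digit-serially, least significant digit
       first, with the carry held between digits exactly as a
       digit-serial adder holds it between clock cycles. */
    static unsigned digit_serial_add(unsigned a, unsigned b)
    {
        unsigned sum = 0, carry = 0;
        for (int d = 0; d < W / D; d++) {
            unsigned da = (a >> (d * D)) & ((1u << D) - 1);
            unsigned db = (b >> (d * D)) & ((1u << D) - 1);
            unsigned s  = da + db + carry;
            sum  |= (s & ((1u << D) - 1)) << (d * D);
            carry = s >> D;   /* carry into the next digit/cycle */
        }
        return sum & ((1u << W) - 1);  /* result wraps at W bits */
    }

Processing one digit per cycle lets a single operator be time-shared across the word, which is what keeps the processing elements fully utilized.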
Abstract:
A fiber-optic multichannel correlator/convolver based on a two-dimensional systolic array architecture is described. Experimental verification of processor performance is presented.
Abstract:
A novel design for multibit convolver circuits is described. The circuits take the form of systolic arrays of simple one-bit processor and memory cells, with the result that they can operate at very high data rates and should be easy to implement using VLSI technology. An efficient method for handling two's complement data within the array is described, and the relative advantages of this convolver design compared with more conventional circuits are discussed.
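The arithmetic identity such an array exploits can be stated in a short C model: every output is a sum of one-bit partial products, with the most significant bit of each two's-complement word weighted negatively. This is a software restatement of the arithmetic only, not a model of the paper's cell layout.

    #define B 8   /* word length in bits (illustrative) */

    static int bit_of(int v, int i) { return (v >> i) & 1; }

    /* Weight of bit i in a B-bit two's-complement word: the MSB
       carries weight -2^(B-1), every other bit +2^i. */
    static long weight(int i) { return (i == B - 1) ? -(1L << i) : (1L << i); }

    /* y[n] = sum_k h[k] * x[n-k], evaluated purely from one-bit
       partial products as a bit-level array would; h[] and x[]
       must hold values representable in B bits. */
    static long conv_bitlevel(const int *h, int hlen, const int *x, int n)
    {
        long y = 0;
        for (int k = 0; k < hlen && k <= n; k++)
            for (int i = 0; i < B; i++)       /* bits of h[k]   */
                for (int j = 0; j < B; j++)   /* bits of x[n-k] */
                    if (bit_of(h[k], i) & bit_of(x[n - k], j))
                        y += weight(i) * weight(j);
        return y;
    }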
Abstract:
This paper describes the design and architecture of a bit-level systolic array processor. The bit-level systolic array described is directly applicable to a wide range of image processing operations where high performance and throughput are essential. The architecture is illustrated by describing the operation of the correlator and convolver chips currently under development. The advantages of the system are also discussed.
Abstract:
Details of a new low power fast Fourier transform (FFT) processor for use in digital television applications are presented. This has been fabricated using a 0.6-µm CMOS technology and can perform a 64 point complex forward or inverse FFT on real-time video at up to 18 Megasamples per second. It comprises 0.5 million transistors in a die area of 7.8 × 8 mm and dissipates 1 W. The chip design is based on a novel VLSI architecture which has been derived from a first principles factorization of the discrete Fourier transform (DFT) matrix and tailored to a direct silicon implementation.
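For background, a first-principles factorization of the DFT matrix of the kind referred to here can be written in its generic radix-2 (Cooley-Tukey) form; this is the standard identity, not necessarily the paper's exact factorization:

\[
F_{2N} = \begin{bmatrix} I_N & D_N \\ I_N & -D_N \end{bmatrix}
         \begin{bmatrix} F_N & 0 \\ 0 & F_N \end{bmatrix} P_{2N},
\qquad
D_N = \operatorname{diag}\bigl(1, \omega, \ldots, \omega^{N-1}\bigr),\quad
\omega = e^{-2\pi i/2N},
\]

where \(P_{2N}\) is the even-odd permutation of the inputs. Applying the identity recursively gives the \(O(N \log N)\) FFT, and each sparse factor maps naturally onto a regular hardware stage.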
Abstract:
The inclusion of the Discrete Wavelet Transform in the JPEG-2000 standard has added impetus to the research of hardware architectures for the two-dimensional wavelet transform. In this paper, a VLSI architecture for performing the symmetrically extended two-dimensional transform is presented. This architecture conforms to the JPEG-2000 standard and is capable of near-optimal performance when dealing with the image boundaries. The architecture also achieves efficient processor utilization. Implementation results based on a Xilinx Virtex-2 FPGA device are included.
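For reference, the symmetric extension JPEG-2000 specifies at image boundaries (whole-sample symmetric, as used with its odd-length filters) can be modelled by a simple index fold; the C sketch below is a generic software restatement, independent of the paper's hardware.

    /* Whole-sample symmetric extension index: reflects about the
       boundary samples without repeating them (assumes n >= 2). */
    static int wss_index(int i, int n)
    {
        while (i < 0 || i >= n) {
            if (i < 0)  i = -i;            /* reflect about sample 0   */
            if (i >= n) i = 2*n - 2 - i;   /* reflect about sample n-1 */
        }
        return i;
    }

    /* Apply an odd-length filter g[] (taps coefficients) to a row x[]
       of length n, using symmetric extension at both boundaries. */
    static void filter_row(const double *x, int n,
                           const double *g, int taps, double *y)
    {
        int half = taps / 2;
        for (int i = 0; i < n; i++) {
            double acc = 0.0;
            for (int t = 0; t < taps; t++)
                acc += g[t] * x[wss_index(i + t - half, n)];
            y[i] = acc;
        }
    }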
Abstract:
The design of a generic QR core for adaptive beamforming is presented. The work relies on an existing mapping technique that can be applied to a triangular QR array in such a way as to allow the generation of a range of QR architectures. All scheduling of data inputs, together with retiming to account for processor latency, is included within the generic representation.
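For orientation, the cells of a triangular QR array conventionally implement Givens rotations; the C sketch below gives the textbook boundary- and internal-cell arithmetic, not the specific generic representation developed in the paper.

    #include <math.h>

    /* Boundary cell: annihilate input x against the stored element r,
       producing the rotation (c, s) passed along the row. */
    static void qr_boundary_cell(double *r, double x, double *c, double *s)
    {
        double rr = sqrt((*r) * (*r) + x * x);
        if (rr == 0.0) { *c = 1.0; *s = 0.0; }
        else           { *c = *r / rr; *s = x / rr; }
        *r = rr;
    }

    /* Internal cell: apply the rotation (c, s) to the stored element r
       and input x, forwarding the updated x to the cell below. */
    static double qr_internal_cell(double *r, double x, double c, double s)
    {
        double x_out = c * x - s * (*r);
        *r = c * (*r) + s * x;
        return x_out;
    }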
Abstract:
Methods are presented for developing synthesizable FFT cores. These are based on a modular approach in which parameterizable blocks are cascaded to implement the computations required across a range of typical FFT signal flow graphs. The underlying architectural approach combines a digit-serial data organization with generic commutator blocks to produce systems that offer 100% processor utilization with lower storage requirements than previous designs. The approach has been used to create generators for the automated synthesis of FFT cores that are portable across a broad range of silicon technologies. Resulting chip designs are competitive with those created using manual methods, but with significant reductions in design times.