129 resultados para intel processor
Resumo:
Apparatus for scanning a moving object includes a visible waveband sensor oriented to collect a series of images of the object as it passes through a field of view. An image processor uses the series of images to form a composite image. The image processor stores image pixel data for a current image and predecessor image in the series. It uses information in the current image and its predecessor to analyse images and derive likelihood measures indicating probabilities that current image pixels correspond to parts of the object. The image processor estimates motion between the current image and its predecessor from likelihood weighted pixels. It generates the composite image from frames positioned according to respective estimates of object image motion. Image motion may alternatively be detected be a speed sensor such as Doppler radar sensing object motion directly and providing image timing signals
Resumo:
Simultaneous multithreading processors dynamically share processor resources between multiple threads. In general, shared SMT resources may be managed explicitly, for instance, by dynamically setting queue occupation bounds for each thread as in the DCRA and Hill-Climbing policies. Alternatively, resources may be managed implicitly; that is, resource usage is controlled by placing the desired instruction mix in the resources. In this case, the main resource management tool is the instruction fetch policy which must predict the behavior of each thread (branch mispredictions, long-latency loads, etc.) as it fetches instructions.
Resumo:
Speeding up sequential programs on multicores is a challenging problem that is in urgent need of a solution. Automatic parallelization of irregular pointer-intensive codes, exempli?ed by the SPECint codes, is a very hard problem. This paper shows that, with a helping hand, such auto-parallelization is possible and fruitful. This paper makes the following contributions: (i) A compiler framework for extracting pipeline-like parallelism from outer program loops is presented. (ii) Using a light-weight programming model based on annotations, the programmer helps the compiler to ?nd thread-level parallelism. Each of the annotations speci?es only a small piece of semantic information that compiler analysis misses, e.g. stating that a variable is dead at a certain program point. The annotations are designed such that correctness is easily veri?ed. Furthermore, we present a tool for suggesting annotations to the programmer. (iii) The methodology is applied to autoparallelize several SPECint benchmarks. For the benchmark with most parallelism (hmmer), we obtain a scalable 7-fold speedup on an AMD quad-core dual processor. The annotations constitute a parallel programming model that relies extensively on a sequential program representation. Hereby, the complexity of debugging is not increased and it does not obscure the source code. These properties could prove valuable to increase the ef?ciency of parallel programming.
Resumo:
Branch prediction feeds a speculative execution processor core with instructions. Branch mispredictions are inevitable and have negative effects on performance and energy consumption. With the advent of highly accurate conditional branch predictors, nonconditional branch instructions are gaining importance.
Resumo:
Caches hide the growing latency of accesses to the main memory from the processor by storing the most recently used data on-chip. To limit the search time through the caches, they are organized in a direct mapped or set-associative way. Such an organization introduces many conflict misses that hamper performance. This paper studies randomizing set index functions, a technique to place the data in the cache in such a way that conflict misses are avoided. The performance of such a randomized cache strongly depends on the randomization function. This paper discusses a methodology to generate randomization functions that perform well over a broad range of benchmarks. The methodology uses profiling information to predict the conflict miss rate of randomization functions. Then, using this information, a search algorithm finds the best randomization function. Due to implementation issues, it is preferable to use a randomization function that is extremely simple and can be evaluated in little time. For these reasons, we use randomization functions where each randomized address bit is computed as the XOR of a subset of the original address bits. These functions are chosen such that they operate on as few address bits as possible and have few inputs to each XOR. This paper shows that to index a 2(m)-set cache, it suffices to randomize m+2 or m+3 address bits and to limit the number of inputs to each XOR to 2 bits to obtain the full potential of randomization. Furthermore, it is shown that the randomization function that we generate for one set of benchmarks also works well for an entirely different set of benchmarks. Using the described methodology, it is possible to reduce the implementation cost of randomization functions with only an insignificant loss in conflict reduction.
Resumo:
Embedded processors are used in numerous devices executing dedicated applications. This setting makes it worthwhile to optimize the processor to the application it executes, in order to increase its power-efficiency. This paper proposes to enhance direct mapped data caches with automatically tuned randomized set index functions to achieve that goal. We show how randomization functions can be automatically generated and compare them to traditional set-associative caches in terms of performance and energy consumption. A 16 kB randomized direct mapped cache consumes 22% less energy than a 2-way set-associative cache, while it is less than 3% slower. When the randomization function is made configurable (i.e., it can be adapted to the program), the additional reduction of conflicts outweighs the added complexity of the hardware, provided there is a sufficient amount of conflict misses.
Resumo:
In a dynamic reordering superscalar processor, the front-end fetches instructions and places them in the issue queue. Instructions are then issued by the back-end execution core. Till recently, the front-end was designed to maximize performance without considering energy consumption. The front-end fetches instructions as fast as it can until it is stalled by a filled issue queue or some other blocking structure. This approach wastes energy: (i) speculative execution causes many wrong-path instructions to be fetched and executed, and (ii) back-end execution rate is usually less than its peak rate, but front-end structures are dimensioned to sustained peak performance. Dynamically reducing the front-end instruction rate and the active size of front-end structure (e.g. issue queue) is a required performance-energy trade-off. Techniques proposed in the literature attack only one of these effects.
In previous work, we have proposed Speculative Instruction Window Weighting (SIWW) [21], a fetch gating technique that allows to address both fetch gating and instruction issue queue dynamic sizing. SIWW computes a global weight on the set of inflight instructions. This weight depends on the number and types of inflight instructions (non-branches, high confidence or low confidence branches, ...). The front-end instruction rate can be continuously adapted based on this weight. This paper extends the analysis of SIWW performed in previous work. It shows that SIWW performs better than previously proposed fetch gating techniques and that SIWW allows to dynamically adapt the size of the active instruction queue.
Resumo:
The emergence of programmable logic devices as processing platforms for digital signal processing applications poses challenges concerning rapid implementation and high level optimization of algorithms on these platforms. This paper describes Abhainn, a rapid implementation methodology and toolsuite for translating an algorithmic expression of the system to a working implementation on a heterogeneous multiprocessor/field programmable gate array platform, or a standalone system on programmable chip solution. Two particular focuses for Abhainn are the automated but configurable realisation of inter-processor communuication fabrics, and the establishment of novel dedicated hardware component design methodologies allowing algorithm level transformation for system optimization. This paper outlines the approaches employed in both these particular instances.
Resumo:
To enable reliable data transfer in next generation Multiple-Input Multiple-Output (MIMO) communication systems, terminals must be able to react to fluctuating channel conditions by having flexible modulation schemes and antenna configurations. This creates a challenging real-time implementation problem: to provide the high performance required of cutting edge MIMO standards, such as 802.11n, with the flexibility for this behavioural variability. FPGA softcore processors offer a solution to this problem, and in this paper we show how heterogeneous SISD/SIMD/MIMD architectures can enable programmable multicore architectures on FPGA with similar performance and cost as traditional dedicated circuit-based architectures. When applied to a 4×4 16-QAM Fixed-Complexity Sphere Decoder (FSD) detector we present the first soft-processor based solution for real-time 802.11n MIMO.
Resumo:
The prevalence of multicore processors is bound to drive most kinds of software development towards parallel programming. To limit the difficulty and overhead of parallel software design and maintenance, it is crucial that parallel programming models allow an easy-to-understand, concise and dense representation of parallelism. Parallel programming models such as Cilk++ and Intel TBBs attempt to offer a better, higher-level abstraction for parallel programming than threads and locking synchronization. It is not straightforward, however, to express all patterns of parallelism in these models. Pipelines are an important parallel construct, although difficult to express in Cilk and TBBs in a straightfor- ward way, not without a verbose restructuring of the code. In this paper we demonstrate that pipeline parallelism can be easily and concisely expressed in a Cilk-like language, which we extend with input, output and input/output dependency types on procedure arguments, enforced at runtime by the scheduler. We evaluate our implementation on real applications and show that our Cilk-like scheduler, extended to track and enforce these dependencies has performance comparable to Cilk++.
Resumo:
Background: This article describes a 'back to the future' approach to case 'write-ups', with medical students producing handwritten instead of word-processed case reports during their clinical placements. Word-processed reports had been found to have a number of drawbacks, including the inappropriate use of 'cutting and pasting', undue length and lack of focus. Method: We developed a template to be completed by hand, based on the hospital 'clerking-in process', and matched this to a new assessment proforma. An electronic survey was conducted of both students and assessors after the first year of operation to evaluate impact and utility. Results: The new template was well received by both students and assessors. Most students said they preferred handwriting the case reports (55.6%), although a significant proportion (44.4%) preferred the word processor. Many commented that the template enabled them to effectively learn the structure of a case history and to improve their history-taking skills. Most assessors who had previously marked case reports felt the new system represented an improvement. The average time spent marking each report fell from 23.56 to 16.38minutes using the new proforma. Discussion: Free text comments from the survey have led to the development of a more flexible case report template better suited to certain specialties (e.g. dermatology). This is an evolving process and there will be opportunities for further adaptation as electronic medical records become more common in hospital. © Blackwell Publishing Ltd 2012.
Resumo:
Two major UK systolic array projects are described. The first concerns the development of a wavefront array processor for adaptive beamforming; the second concerns the design of bit-level systolic arrays for high-performance signal processing.
Resumo:
Methods are presented for developing synthesizable FFT cores. These are based on a modular approach in which parameterized commutator and processor blocks are cascaded to implement the computations required in many important FFT signal flow graphs. In addition, it is shown how the use of a digital serial data organization can be used to produce systems that offer 100% processor utilization along with reductions in storage requirements. The approach has been used to create generators for the automated synthesis of FFT cores that are portable across a broad range of silicon technologies. Resulting chip designs are competitive with ones created using manual methods but with significant reductions in design times.
Resumo:
A fiber-optic multichannel correlator/convolver based on a two-dimensional systolic array architecture is described. Experimental verification of processor performance is presented.
Resumo:
A novel design for multibit convolver circuits is described. The circuits take the form of systolic arrays of simple one-bit processor and memory cells, with the result that they can operate at very high data rates and should be easy to implement using VLSI technology. An efficient method for handling two's complement data within the array is described and the relative advantages of this convolver design compared with more conventional circuits is discussed.