686 resultados para Processors
Resumo:
Processor architectures has taken a turn towards many-core processors, which integrate multiple processing cores on a single chip to increase overall performance, and there are no signs that this trend will stop in the near future. Many-core processors are harder to program than multi-core and single-core processors due to the need of writing parallel or concurrent programs with high degrees of parallelism. Moreover, many-cores have to operate in a mode of strong scaling because of memory bandwidth constraints. In strong scaling increasingly finer-grain parallelism must be extracted in order to keep all processing cores busy.
Task dataflow programming models have a high potential to simplify parallel program- ming because they alleviate the programmer from identifying precisely all inter-task de- pendences when writing programs. Instead, the task dataflow runtime system detects and enforces inter-task dependences during execution based on the description of memory each task accesses. The runtime constructs a task dataflow graph that captures all tasks and their dependences. Tasks are scheduled to execute in parallel taking into account dependences specified in the task graph.
Several papers report important overheads for task dataflow systems, which severely limits the scalability and usability of such systems. In this paper we study efficient schemes to manage task graphs and analyze their scalability. We assume a programming model that supports input, output and in/out annotations on task arguments, as well as commutative in/out and reductions. We analyze the structure of task graphs and identify versions and generations as key concepts for efficient management of task graphs. Then, we present three schemes to manage task graphs building on graph representations, hypergraphs and lists. We also consider a fourth edge-less scheme that synchronizes tasks using integers. Analysis using micro-benchmarks shows that the graph representation is not always scalable and that the edge-less scheme introduces least overhead in nearly all situations.
Resumo:
This paper investigates sub-integer implementations of the adaptive Gaussian mixture model (GMM) for background/foreground segmentation to allow the deployment of the method on low cost/low power processors that lack Floating Point Unit (FPU). We propose two novel integer computer arithmetic techniques to update Gaussian parameters. Specifically, the mean value and the variance of each Gaussian are updated by a redefined and generalised "round'' operation that emulates the original updating rules for a large set of learning rates. Weights are represented by counters that are updated following stochastic rules to allow a wider range of learning rates and the weight trend is approximated by a line or a staircase. We demonstrate that the memory footprint and computational cost of GMM are significantly reduced, without significantly affecting the performance of background/foreground segmentation.
Resumo:
Dynamic Voltage and Frequency Scaling (DVFS) exhibits fundamental limitations as a method to reduce energy consumption in computing systems. In the HPC domain, where performance is of highest priority and codes are heavily optimized to minimize idle time, DVFS has limited opportunity to achieve substantial energy savings. This paper explores if operating processors Near the transistor Threshold Volt- age (NTV) is a better alternative to DVFS for break- ing the power wall in HPC. NTV presents challenges, since it compromises both performance and reliability to reduce power consumption. We present a first of its kind study of a significance-driven execution paradigm that selectively uses NTV and algorithmic error tolerance to reduce energy consumption in performance- constrained HPC environments. Using an iterative algorithm as a use case, we present an adaptive execution scheme that switches between near-threshold execution on many cores and above-threshold execution on one core, as the computational significance of iterations in the algorithm evolves over time. Using this scheme on state-of-the-art hardware, we demonstrate energy savings ranging between 35% to 67%, while compromising neither correctness nor performance.
Resumo:
Polymer extrusion, in which a polymer is melted and conveyed to a mould or die, forms the basis of most polymer processing techniques. Extruders frequently run at non-optimised conditions and can account for 15–20% of overall process energy losses. In times of increasing energy efficiency such losses are a major concern for the industry. Product quality, which depends on the homogeneity and stability of the melt flow which in turn depends on melt temperature and screw speed, is also an issue of concern of processors. Gear pumps can be used to improve the stability of the production line, but the cost is usually high. Likewise it is possible to introduce energy meters but they also add to the capital cost of the machine. Advanced control incorporating soft sensing capabilities offers opportunities to this industry to improve both quality and energy efficiency. Due to strong correlations between the critical variables, such as the melt temperature and melt pressure, traditional decentralized PID (Proportional–Integral–Derivative) control is incapable of handling such processes if stricter product specifications are imposed or the material is changed from one batch to another. In this paper, new real-time energy monitoring methods have been introduced without the need to install power meters or develop data-driven models. The effects of process settings on energy efficiency and melt quality are then studied based on developed monitoring methods. Process variables include barrel heating temperature, water cooling temperature, and screw speed. Finally, a fuzzy logic controller is developed for a single screw extruder to achieve high melt quality. The resultant performance of the developed controller has shown it to be a satisfactory alternative to the expensive gear pump. Energy efficiency of the extruder can further be achieved by optimising the temperature settings. Experimental results from open-loop control and fuzzy control on a Killion 25 mm single screw extruder are presented to confirm the efficacy of the proposed approach.
Resumo:
Abstract—Power capping is an essential function for efficient power budgeting and cost management on modern server systems. Contemporary server processors operate under power caps by using dynamic voltage and frequency scaling (DVFS). However, these processors are often deployed in non-uniform memory
access (NUMA) architectures, where thread allocation between cores may significantly affect performance and power consumption. This paper proposes a method which maximizes performance under power caps on NUMA systems by dynamically optimizing two knobs: DVFS and thread allocation. The method selects the optimal combination of the two knobs with models based on artificial neural network (ANN) that captures the nonlinear effect of thread allocation on performance. We implement
the proposed method as a runtime system and evaluate it with twelve multithreaded benchmarks on a real AMD Opteron based NUMA system. The evaluation results show that our method outperforms a naive technique optimizing only DVFS by up to
67.1%, under a power cap.
Resumo:
The paper presents IPPro which is a high performance, scalable soft-core processor targeted for image processing applications. It has been based on the Xilinx DSP48E1 architecture using the ZYNQ Field Programmable Gate Array and is a scalar 16-bit RISC processor that operates at 526MHz, giving 526MIPS of performance. Each IPPro core uses 1 DSP48, 1 Block RAM and 330 Kintex-7 slice-registers, thus making the processor as compact as possible whilst maintaining flexibility and programmability. A key aspect of the approach is in reducing the application design time and implementation effort by using multiple IPPro processors in a SIMD mode. For different applications, this allows us to exploit different levels of parallelism and mapping for the specified processing architecture with the supported instruction set. In this context, a Traffic Sign Recognition (TSR) algorithm has been prototyped on a Zedboard with the colour and morphology operations accelerated using multiple IPPros. Simulation and experimental results demonstrate that the processing platform is able to achieve a speedup of 15 to 33 times for colour filtering and morphology operations respectively, with a reduced design effort and time.
Resumo:
In this paper, we present a unified approach to an energy-efficient variation-tolerant design of Discrete Wavelet Transform (DWT) in the context of image processing applications. It is to be noted that it is not necessary to produce exactly correct numerical outputs in most image processing applications. We exploit this important feature and propose a design methodology for DWT which shows energy quality tradeoffs at each level of design hierarchy starting from the algorithm level down to the architecture and circuit levels by taking advantage of the limited perceptual ability of the Human Visual System. A unique feature of this design methodology is that it guarantees robustness under process variability and facilitates aggressive voltage over-scaling. Simulation results show significant energy savings (74% - 83%) with minor degradations in output image quality and avert catastrophic failures under process variations compared to a conventional design. © 2010 IEEE.
Resumo:
Low-power processors and accelerators that were originally designed for the embedded systems market are emerging as building blocks for servers. Power capping has been actively explored as a technique to reduce the energy footprint of high-performance processors. The opportunities and limitations of power capping on the new low-power processor and accelerator ecosystem are less understood. This paper presents an efficient power capping and management infrastructure for heterogeneous SoCs based on hybrid ARM/FPGA designs. The infrastructure coordinates dynamic voltage and frequency scaling with task allocation on a customised Linux system for the Xilinx Zynq SoC. We present a compiler-assisted power model to guide voltage and frequency scaling, in conjunction with workload allocation between the ARM cores and the FPGA, under given power caps. The model achieves less than 5% estimation bias to mean power consumption. In an FFT case study, the proposed power capping schemes achieve on average 97.5% of the performance of the optimal execution and match the optimal execution in 87.5% of the cases, while always meeting power constraints.
Resumo:
The authors present a VLSI circuit for implementing wave digital filter (WDF) two-port adaptors. Considerable speedups over conventional designs have been obtained using fine grained pipelining. This has been achieved through the use of most significant bit (MSB) first carry-save arithmetic, which allows systems to be designed in which latency L is small and independent of either coefficient or input data wordlength. L is determined by the online delay associated with the computation required at each node in the circuit (in this case a multiply/add plus two separate additions). This in turn means that pipelining can be used to considerably enhance the sampling rate of a recursive digital filter. The level of pipelining which will offer enhancement is determined by L and is fine-grained rather than bit level. In the case of the circuit considered, L = 3. For this reason pipeline delays (half latches) have been introduced between every two rows of cells to produce a system with a once every cycle sample rate.
Resumo:
Molecular logic-based computation is a broad umbrella covering molecular sensors at its simplest level and logic gate arrays involving steadily increasing levels of parallel and serial integration. The fluorescent PET(photoinduced electron transfer) switching principle remains a loyal servant of this entire field. Applications arise from the convenient operation of molecular information processors in very small spaces.
Resumo:
The end of Dennard scaling has pushed power consumption into a first order concern for current systems, on par with performance. As a result, near-threshold voltage computing (NTVC) has been proposed as a potential means to tackle the limited cooling capacity of CMOS technology. Hardware operating in NTV consumes significantly less power, at the cost of lower frequency, and thus reduced performance, as well as increased error rates. In this paper, we investigate if a low-power systems-on-chip, consisting of ARM's asymmetric big.LITTLE technology, can be an alternative to conventional high performance multicore processors in terms of power/energy in an unreliable scenario. For our study, we use the Conjugate Gradient solver, an algorithm representative of the computations performed by a large range of scientific and engineering codes.