694 results for CMOS process
Abstract:
Details of a new low power fast Fourier transform (FFT) processor for use in digital television applications are presented. This has been fabricated using a 0.6-µm CMOS technology and can perform a 64 point complex forward or inverse FFT on real-time video at up to 18 Megasamples per second. It comprises 0.5 million transistors in a die area of 7.8 × 8 mm and dissipates 1 W. The chip design is based on a novel VLSI architecture which has been derived from a first principles factorization of the discrete Fourier transform (DFT) matrix and tailored to a direct silicon implementation.
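As a rough illustration of the kind of DFT-matrix factorization such an architecture is built on, the sketch below implements a radix-4 decimation-in-time decomposition of a 64-point FFT in Python and checks it against a reference transform. It is a numerical sketch only, with assumed function names, and does not represent the chip's actual dataflow.

```python
import numpy as np

def radix4_fft(x):
    """Radix-4 decimation-in-time FFT for len(x) a power of 4 (illustrative
    sketch of a DFT-matrix factorisation, not the chip's architecture)."""
    n = len(x)
    if n == 1:
        return np.asarray(x, dtype=complex)
    # split into four interleaved quarter-length sequences and transform each
    quarters = [radix4_fft(x[r::4]) for r in range(4)]
    k = np.arange(n // 4)
    w = np.exp(-2j * np.pi * k / n)               # twiddle factors W_N^k
    t = [quarters[r] * w**r for r in range(4)]
    out = np.empty(n, dtype=complex)
    out[0 * n // 4:1 * n // 4] = t[0] + t[1] + t[2] + t[3]
    out[1 * n // 4:2 * n // 4] = t[0] - 1j * t[1] - t[2] + 1j * t[3]
    out[2 * n // 4:3 * n // 4] = t[0] - t[1] + t[2] - t[3]
    out[3 * n // 4:4 * n // 4] = t[0] + 1j * t[1] - t[2] - 1j * t[3]
    return out

x = np.random.default_rng(0).standard_normal(64) + 0j
print(np.allclose(radix4_fft(x), np.fft.fft(x)))   # True
```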
Abstract:
A high-performance VLSI architecture to perform multiply-accumulate, division and square root operations is proposed. The circuit is highly regular, requires only minimal control and can be pipelined right down to the bit level. The system can also be reconfigured on every cycle to perform any one of these operations. The gate count per row has been estimated at (27n+70) gate equivalents where n is the divisor wordlength. The throughput rate, which equals the clock speed, is the same for each operation and is independent of the wordlength. This is achieved through the combination of pipelining and redundant arithmetic. With a 1.0 µm CMOS technology and extensive pipelining, throughput rates in excess of 70 million operations per second are expected.
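For concreteness, the quoted gate-count expression can be evaluated for a few wordlengths; this is simply a worked example of the abstract's 27n + 70 formula, with the wordlength values chosen arbitrarily.

```python
def gates_per_row(n):
    """Gate-equivalent count per array row quoted in the abstract: 27n + 70,
    where n is the divisor wordlength."""
    return 27 * n + 70

for n in (8, 16, 32):
    print(n, "bits ->", gates_per_row(n), "gate equivalents per row")
```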
Abstract:
The design of a System-on-a-Chip (SoC) demonstrator for a baseline JPEG encoder core is presented. This combines a highly optimized Discrete Cosine Transform (DCT) and quantization unit with an entropy coder that has been realized using off-the-shelf synthesizable IP cores (run-length coder, Huffman coder and data packer). When synthesized in a 0.35 µm CMOS process, the core can operate at speeds of up to 100 MHz and contains 50 k gates plus 11.5 kbits of RAM, approximately 20% less than similar JPEG encoder designs reported in the literature. When targeted at an FPGA, the core can operate at up to 30 MHz and is capable of compressing 9-bit full-frame color input data at NTSC or PAL rates.
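As a hedged sketch of the entropy-coder front end mentioned above, the fragment below zigzag-scans a quantized 8x8 block and emits the (zero-run, value) symbols a JPEG Huffman coder consumes. It is a simplified illustration (no ZRL handling for runs longer than 15) and not the off-the-shelf IP used in the core.

```python
import numpy as np

def zigzag_order(n=8):
    """Standard JPEG zigzag scan order for an n x n block of coefficients."""
    return sorted(((i, j) for i in range(n) for j in range(n)),
                  key=lambda p: (p[0] + p[1],
                                 p[0] if (p[0] + p[1]) % 2 else p[1]))

def run_length_encode(block):
    """Emit (zero-run, value) pairs for the 63 AC coefficients of a quantised
    8x8 block; the DC coefficient is coded separately in JPEG."""
    zz = [int(block[i, j]) for i, j in zigzag_order()]
    pairs, run = [], 0
    for v in zz[1:]:               # skip DC coefficient
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    pairs.append((0, 0))           # end-of-block symbol
    return pairs

block = np.zeros((8, 8), dtype=int)
block[0, 0], block[0, 1], block[2, 0] = 26, -3, -2
print(run_length_encode(block))    # [(0, -3), (1, -2), (0, 0)]
```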
Abstract:
A generator for the automated design of Discrete Cosine Transform (DCT) cores is presented. This can be used to rapidly create silicon circuits from a high-level specification, and the resulting circuits compare very favourably with existing designs. The DCT cores produced are scalable in terms of point size as well as input/output and coefficient wordlengths, providing a high degree of flexibility. An example 8-point 1D DCT design produced by the generator occupies less than 0.92 mm² when implemented in a 0.35 µm double-level-metal CMOS technology and can be clocked at a rate of 100 MHz.
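A minimal sketch of what such a generated core computes: the 8-point 1D DCT expressed as an orthonormal matrix-vector product, with an assumed 12-bit coefficient wordlength to illustrate the coefficient-scaling option. The wordlength and scaling here are illustrative choices, not the generator's defaults.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal n-point DCT-II matrix."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c

COEFF_BITS = 12                                   # assumed coefficient wordlength
scale = 1 << (COEFF_BITS - 1)
C_fixed = np.round(dct_matrix() * scale) / scale  # quantised coefficients

x = np.arange(8, dtype=float)
print(C_fixed @ x)         # 8-point 1D DCT with quantised coefficients
print(dct_matrix() @ x)    # full-precision reference
```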
Abstract:
A high performance VLSI architecture to perform combined multiply-accumulate, divide, and square root operations is proposed. The circuit is highly regular, requires only minimal control, and can be pipelined right down to the bit level. The system can also be reconfigured on every cycle to perform one or more of these operations. The throughput rate for each operation is the same and is wordlength independent. This is achieved using redundant arithmetic. With current CMOS technology, throughput rates in excess of 80 million operations per second are expected.
Abstract:
This paper describes how worst-case error analysis can be applied to solve some of the practical issues in the development and implementation of a low power, high performance radix-4 FFT chip for digital video applications. The chip has been fabricated using a 0.6 µm CMOS technology and can perform a 64 point complex forward or inverse FFT on real-time video at up to 18 Megasamples per second. It comprises 0.5 million transistors in a die area of 7.8×8 mm and dissipates 1 W, leading to a cost-effective silicon solution for high quality video processing applications. The analysis focuses on the effect that different radix-4 architectural configurations and finite wordlengths have on the FFT output dynamic range. These issues are addressed using both mathematical error models and extensive simulation.
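The sketch below gives a crude Monte-Carlo flavour of the wordlength question: it estimates the output SNR of a 64-point FFT when only the input is quantized to a given number of bits. It is an assumed, simplified model; the per-stage internal rounding and scaling effects that the paper analyses are not modelled here.

```python
import numpy as np

def fixed_point_fft_snr(n=64, data_bits=10, trials=200,
                        rng=np.random.default_rng(0)):
    """Monte-Carlo estimate of output SNR when the FFT input is quantised to
    data_bits bits (illustrative stand-in for analytical error models)."""
    snrs = []
    for _ in range(trials):
        x = rng.uniform(-1, 1, n) + 1j * rng.uniform(-1, 1, n)
        step = 2.0 ** (1 - data_bits)              # quantisation step over [-1, 1)
        xq = (np.round(x.real / step) + 1j * np.round(x.imag / step)) * step
        ref = np.fft.fft(x)
        err = np.fft.fft(xq) - ref
        snrs.append(10 * np.log10(np.sum(np.abs(ref)**2) /
                                  np.sum(np.abs(err)**2)))
    return np.mean(snrs)

for b in (8, 10, 12):
    print(b, "bits ->", round(fixed_point_fft_snr(data_bits=b), 1), "dB")
```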
Abstract:
A new high-performance, programmable image processing chip targeted at video and HDTV applications is described. This was initially developed for small-object recognition in images but has much broader application, including 1D and 2D FIR filtering as well as neural network computation. The core of the circuit is made up of an array of twenty-one multiply-accumulate cells based on a systolic architecture. Devices can be cascaded to increase the order of the filter both vertically and horizontally. The chip has been fabricated in a 0.6 µm, low-power CMOS technology and operates on 10-bit input data at over 54 Megasamples per second. The introduction gives some background to the chip design and highlights that there are few other comparable devices. Section 2 gives a brief introduction to small-object detection. The chip architecture and the chip design are described in detail in the later sections.
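A minimal behavioural sketch of a systolic MAC array for 1-D FIR filtering is given below: one cell per tap, the input sample broadcast to all cells, and partial sums shifting one cell per cycle. This models the dataflow only, under assumed coefficient ordering, and not the chip's actual cell design or wordlengths.

```python
import numpy as np

def systolic_fir(x, coeffs):
    """Cycle-level model of a 1-D systolic FIR filter: each multiply-accumulate
    cell adds its product to the partial sum received from its left neighbour.
    Coefficients are loaded in reverse tap order (see usage below)."""
    taps = len(coeffs)
    partial = [0.0] * taps
    out = []
    for sample in x:
        new_partial = [0.0] * taps
        for i, c in enumerate(coeffs):
            new_partial[i] = (partial[i - 1] if i else 0.0) + c * sample
        partial = new_partial
        out.append(partial[-1])        # result emerges from the last cell
    return np.array(out)

h = np.array([0.25, 0.5, 0.25])
x = np.arange(10, dtype=float)
print(systolic_fir(x, h[::-1]))        # matches the direct convolution below
print(np.convolve(x, h)[:len(x)])
```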
Abstract:
Details of a new low power FFT processor for use in digital television applications are presented. This has been fabricated using a 0.6 µm CMOS technology and can perform a 64 point complex forward or inverse FFT on real-time video at up to 18 Megasamples per second. It comprises 0.5 million transistors in a die area of 7.8×8 mm and dissipates 1 W. Its performance, in terms of computational rate per area per watt, is significantly higher than previously reported devices, leading to a cost-effective silicon solution for high quality video processing applications. This is the result of using a novel VLSI architecture which has been derived from a first principles factorisation of the DFT matrix and tailored to a direct silicon implementation.
Abstract:
In this paper we present a design methodology for algorithm/architecture co-design of a voltage-scalable, process-variation-aware motion estimator based on significance-driven computation. The fundamental premise of our approach is that not all computations are equally significant in shaping the output response of video systems. We use a statistical technique to intelligently identify these significant/not-so-significant computations at the algorithmic level and subsequently change the underlying architecture such that the significant computations are computed in an error-free manner under voltage over-scaling. Furthermore, our design includes an adaptive quality compensation (AQC) block which "tunes" the algorithm and architecture depending on the magnitude of voltage over-scaling and the severity of process variations. Simulation results show average power savings of approximately 33% for the proposed architecture when compared to a conventional implementation in a 90 nm CMOS technology. The maximum output quality loss in terms of Peak Signal to Noise Ratio (PSNR) was approximately 1 dB, without incurring any throughput penalty.
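As an illustration of the significance-driven idea (with assumed names and an assumed MSB/LSB split, not the paper's statistical technique), the sketch below separates a sum-of-absolute-differences into a most-significant part that would need to stay error-free under voltage over-scaling and a less-significant part whose errors have limited PSNR impact.

```python
import numpy as np

def significance_split_sad(cur, ref, msb_bits=4, pixel_bits=8):
    """Split a SAD computation into 'significant' (MSB) and 'less significant'
    (LSB) contributions; the 4/8-bit split is an illustrative assumption."""
    diff = np.abs(cur.astype(int) - ref.astype(int))
    lsb_mask = (1 << (pixel_bits - msb_bits)) - 1
    msb_part = int((diff & ~lsb_mask).sum())   # must stay error-free
    lsb_part = int((diff & lsb_mask).sum())    # can tolerate occasional errors
    return msb_part, lsb_part, msb_part + lsb_part

rng = np.random.default_rng(0)
cur = rng.integers(0, 256, (8, 8))
ref = rng.integers(0, 256, (8, 8))
print(significance_split_sad(cur, ref))
```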
Abstract:
The end of Dennard scaling has pushed power consumption into a first-order concern for current systems, on par with performance. As a result, near-threshold voltage computing (NTVC) has been proposed as a potential means to tackle the limited cooling capacity of CMOS technology. Hardware operating at near-threshold voltage consumes significantly less power, at the cost of lower frequency and thus reduced performance, as well as increased error rates. In this paper, we investigate whether a low-power system-on-chip based on ARM's asymmetric big.LITTLE technology can be an alternative to conventional high-performance multicore processors in terms of power/energy in an unreliable scenario. For our study, we use the Conjugate Gradient solver, an algorithm representative of the computations performed by a large range of scientific and engineering codes.
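For reference, the benchmark kernel is the textbook Conjugate Gradient iteration for symmetric positive-definite systems; the sketch below is the generic algorithm, not the paper's tuned implementation.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-8, max_iter=1000):
    """Standard CG iteration for symmetric positive-definite A."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b))        # approx [0.0909, 0.6364]
```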
Abstract:
Quantum-dot Cellular Automata (QCA) is an emerging post-CMOS technology with various advantages, a key one being its low power dissipation. One method used to analyse power dissipation in QCA circuits is bit erasure analysis. This method has been applied to previously proposed QCA binary adders; however, a number of improved QCA adders have been proposed more recently that have only been evaluated in terms of area and speed. As the three key performance metrics for QCA circuits are speed, area and power, this paper presents a bit erasure analysis of these adders to determine their power dissipation. The adders analysed are the Carry Flow Adder (CFA), the Brent-Kung Adder (B-K), the Ladner-Fischer Adder (L-F) and a more recently developed area-delay efficient adder. This research allows for a more comprehensive comparison between the different QCA adder proposals. To the best of the authors' knowledge, this is the first time power dissipation analysis has been carried out on these adders.
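A simplified flavour of bit-erasure analysis, applied to a single full adder under uniformly distributed inputs: three input bits map to a two-bit (sum, carry) output, so some information is necessarily erased. The Landauer-style count below is an illustration only, not the exact procedure applied to the adders in the paper.

```python
from math import log2
from collections import Counter

def erased_bits_full_adder():
    """Information lost by a full adder assuming uniform inputs: input entropy
    (3 bits) minus the entropy of the (sum, carry) output distribution."""
    outputs = Counter()
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                s = a ^ b ^ c
                cout = (a & b) | (b & c) | (a & c)
                outputs[(s, cout)] += 1
    h_out = -sum((n / 8) * log2(n / 8) for n in outputs.values())
    return 3.0 - h_out

print(round(erased_bits_full_adder(), 3), "bits erased per full-adder operation")
```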
Abstract:
Quantum-dot cellular automata (QCA) is potentially a very attractive alternative to CMOS for future digital designs. Circuit designs in QCA have been extensively studied. However, how to properly evaluate the QCA circuits has not been carefully considered. To date, metrics and area-delay cost functions directly mapped from CMOS technology have been used to compare QCA designs, which is inappropriate due to the differences between these two technologies. In this paper, several cost metrics specifically aimed at QCA circuits are studied. It is found that delay, the number of QCA logic gates, and the number and type of crossovers, are important metrics that should be considered when comparing QCA designs. A family of new cost functions for QCA circuits is proposed. As fundamental components in QCA computing arithmetic, QCA adders are reviewed and evaluated with the proposed cost functions. By taking the new cost metrics into account, previous best adders become unattractive and it has been shown that different optimization goals lead to different “best” adders.
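One illustrative shape such a cost function could take, combining gate count, crossovers and delay in a single figure of merit, is sketched below; the particular form, terms and exponents are assumptions for illustration and are not claimed to be the family proposed in the paper.

```python
def qca_cost(majority_gates, inverters, crossovers, delay, k=2, l=2, p=1):
    """Illustrative QCA cost function: gate count and crossovers are penalised
    (with tunable exponents k, l) and the result is weighted by delay**p.
    All choices here are assumptions, not the paper's proposed functions."""
    return (majority_gates ** k + inverters + crossovers ** l) * delay ** p

# hypothetical comparison of two adder designs
print(qca_cost(majority_gates=12, inverters=4, crossovers=2, delay=5))
print(qca_cost(majority_gates=9,  inverters=6, crossovers=4, delay=4))
```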
Abstract:
The design of a high-performance IIR (infinite impulse response) digital filter is described. The chip architecture operates on 11-b parallel, two's complement input data with a 12-b parallel two's complement coefficient to produce a 14-b two's complement output. The chip is implemented in 1.5-µm, double-layer-metal CMOS technology, consumes 0.5 W, and can operate up to 15 Msample/s. The main component of the system is a fine-grained systolic array that internally is based on a signed binary number representation (SBNR). Issues addressed include testing, clock distribution, and circuitry for conversion between two's complement and SBNR.
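The conversion circuitry mentioned above implements straightforward arithmetic between two's complement and the redundant signed-binary digit set {-1, 0, 1}; the sketch below shows that arithmetic only (illustrative function names, not the chip's converter design).

```python
def sbnr_to_int(digits):
    """Evaluate a signed-binary (SBNR) word with digits in {-1, 0, 1},
    most significant digit first."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value

def twos_complement_to_sbnr(value, width):
    """A two's complement word is already a valid SBNR word once the sign bit
    is read with weight -2**(width-1); express it digit by digit."""
    bits = [(value >> i) & 1 for i in reversed(range(width))]
    return [-bits[0]] + bits[1:]

assert sbnr_to_int(twos_complement_to_sbnr(-3, 4)) == -3
print(twos_complement_to_sbnr(-3, 4))   # [-1, 1, 0, 1]
```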
Abstract:
Current variation-aware design methodologies, tuned for worst-case scenarios, are becoming increasingly pessimistic from the perspective of power and performance. A good example of such pessimism is setting the refresh rate of DRAMs according to the worst-case access statistics, resulting in very frequent refresh cycles, which are responsible for the majority of the standby power consumption of these memories. However, such a high refresh rate may not be required, either due to the extremely low probability of the actual occurrence of such a worst case, or due to the inherent error-resilient nature of many applications that can tolerate a certain number of potential failures. In this paper, we exploit and quantify the possibilities that exist in dynamic memory design by shifting to the so-called approximate computing paradigm in order to save power and enhance yield at no cost. The statistical characteristics of the retention time in dynamic memories were revealed by studying a fabricated 2 kb CMOS-compatible embedded DRAM (eDRAM) memory array based on gain cells. Measurements show that up to 73% of the retention power can be saved by altering the refresh time and setting it such that a small number of failures is allowed. We show that these savings can be further increased by utilizing known circuit techniques, such as body biasing, which can help not only in extending but also in preferably shaping the retention time distribution. Our approach is one of the first attempts to assess the data-integrity and energy tradeoffs achieved in eDRAMs for utilizing them in error-resilient applications and can prove helpful in the anticipated shift to approximate computing.
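A back-of-the-envelope model of the refresh/retention tradeoff is sketched below, assuming a lognormal retention-time distribution and refresh power proportional to refresh frequency; both assumptions are illustrative and are not taken from the paper's measurements.

```python
import numpy as np

def refresh_power_saving(worst_case_t, retention_samples, allowed_fail_fraction):
    """Refresh power scales roughly with 1/t_refresh, so relaxing the refresh
    period from the worst-case retention time to the allowed_fail_fraction
    quantile of the retention distribution saves 1 - t_wc / t_relaxed."""
    t_relaxed = np.quantile(retention_samples, allowed_fail_fraction)
    return 1.0 - worst_case_t / t_relaxed

# hypothetical retention-time distribution for a 2 kb array (arbitrary units)
rng = np.random.default_rng(1)
retention = rng.lognormal(mean=0.0, sigma=0.5, size=2048)
print(refresh_power_saving(retention.min(), retention, allowed_fail_fraction=1e-3))
```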
Abstract:
Static timing analysis provides the basis for setting the clock period of a microprocessor core, based on its worst-case critical path. However, depending on the design, this critical path is not always excited, and therefore dynamic timing margins exist that can theoretically be exploited for the benefit of higher speed or lower power consumption (through voltage scaling). This paper introduces predictive instruction-based dynamic clock adjustment as a technique to trim dynamic timing margins in pipelined microprocessors. To this end, we exploit the different timing requirements of individual instructions during the dynamically varying program execution flow, without the need for complex circuit-level measures to detect and correct timing violations. We provide a design flow to extract the dynamic timing information for the design using post-layout dynamic timing analysis, and we integrate the results into a custom cycle-accurate simulator. This simulator allows annotation of individual instructions with their impact on timing (in each pipeline stage) and rapidly derives the overall code execution time for complex benchmarks. The design methodology is illustrated at the microarchitecture level, demonstrating the performance and power gains possible on a 6-stage OpenRISC in-order general-purpose processor core in a 28 nm CMOS technology. We show that employing instruction-dependent dynamic clock adjustment leads on average to a 38% increase in operating speed or a 24% reduction in power consumption, compared to traditional synchronous clocking, which at all times has to respect the worst-case timing identified through static timing analysis.
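A toy model of the comparison between fixed worst-case clocking and per-instruction dynamic clock adjustment is sketched below; the per-instruction periods are invented numbers, whereas the paper extracts them from post-layout dynamic timing analysis.

```python
def execution_time(instr_trace, per_instr_period, worst_case_period):
    """Total execution time under a fixed worst-case clock versus a clock
    adjusted to each instruction's own timing requirement."""
    static = len(instr_trace) * worst_case_period
    dynamic = sum(per_instr_period[op] for op in instr_trace)
    return static, dynamic

# assumed per-instruction clock periods in nanoseconds (illustrative only)
periods_ns = {"add": 0.9, "load": 1.4, "mul": 1.6, "branch": 1.0}
trace = ["add", "load", "add", "branch", "mul", "add"]
t_static, t_dynamic = execution_time(trace, periods_ns, worst_case_period=1.6)
print(f"static clocking: {t_static:.1f} ns, dynamic adjustment: {t_dynamic:.1f} ns")
```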