50 resultados para pipelining
Resumo:
A novel bit-level systolic array architecture for implementing first-order IIR filter sections is presented. A latency of only two clock cycles is achieved by using a radix-4 redundant number representation, performing the recursive computation most-significant-digit first, and feeding back each digit of the result as soon as it is available.
Resumo:
The authors propose a bit serial pipeline used to perform the genetic operators in a hardware genetic algorithm. The bit-serial nature of the dataflow allows the operators to be pipelined, resulting in an architecture which is area efficient, easily scaled and is independent of the lengths of the chromosomes. An FPGA implementation of the device achieves a throughput of >25 million genes per second
Resumo:
This paper presents a semi-synchronous pipeline scheme, here referred as single-pulse pipeline, to the problem of mapping pipelined circuits to a Field Programmable Gate Array (FPGA). Area and timing considerations are given for a general case and later applied to a systolic circuit as illustration. The single-pulse pipeline can manage asynchronous worst-case data completion and it is evaluated against two chosen asynchronous pipelining: a four-phase bundle-data pipeline and a doubly-latched asynchronous pipeline. The semi-synchronous pipeline proposal takes less FPGA area and operates faster than the two selected fully-asynchronous schemes for an FPGA case.
Resumo:
Alongside the advances of technologies, embedded systems are increasingly present in our everyday. Due to increasing demand for functionalities, many tasks are split among processors, requiring more efficient communication architectures, such as networks on chip (NoC). The NoCs are structures that have routers with channel point-to-point interconnect the cores of system on chip (SoC), providing communication. There are several networks on chip in the literature, each with its specific characteristics. Among these, for this work was chosen the Integrated Processing System NoC (IPNoSyS) as a network on chip with different characteristics compared to general NoCs, because their routing components also accumulate processing function, ie, units have functional able to execute instructions. With this new model, packets are processed and routed by the router architecture. This work aims at improving the performance of applications that have repetition, since these applications spend more time in their execution, which occurs through repeated execution of his instructions. Thus, this work proposes to optimize the runtime of these structures by employing a technique of instruction-level parallelism, in order to optimize the resources offered by the architecture. The applications are tested on a dedicated simulator and the results compared with the original version of the architecture, which in turn, implements only packet level parallelism
Resumo:
This paper presents a power, latency and throughput trade-off study on NoCs by varying microarchitectural (e.g. pipelining) and circuit level (e.g. frequency and voltage) parameters. We change pipelining depth, operating frequency and supply voltage for 3 example NoCs - 16 node 2D Torus, Tree network and Reduced 2D Torus. We use an in-house NoC exploration framework capable of topology generation and comparison using parameterized models of Routers and links developed in SystemC. The framework utilizes interconnect power and delay models from a low-level modelling tool called Intacte[1]1. We find that increased pipelining can actually reduce latency. We also find that there exists an optimal degree of pipelining which is the most energy efficient in terms of minimizing energy-delay product.
Resumo:
In this paper we develop compilation techniques for the realization of applications described in a High Level Language (HLL) onto a Runtime Reconfigurable Architecture. The compiler determines Hyper Operations (HyperOps) that are subgraphs of a data flow graph (of an application) and comprise elementary operations that have strong producer-consumer relationship. These HyperOps are hosted on computation structures that are provisioned on demand at runtime. We also report compiler optimizations that collectively reduce the overheads of data-driven computations in runtime reconfigurable architectures. On an average, HyperOps offer a 44% reduction in total execution time and a 18% reduction in management overheads as compared to using basic blocks as coarse grained operations. We show that HyperOps formed using our compiler are suitable to support data flow software pipelining.
Resumo:
The StreamIt programming model has been proposed to exploit parallelism in streaming applications oil general purpose multicore architectures. The StreamIt graphs describe task, data and pipeline parallelism which can be exploited on accelerators such as Graphics Processing Units (GPUs) or CellBE which support abundant parallelism in hardware. In this paper, we describe a novel method to orchestrate the execution of if StreamIt program oil a multicore platform equipped with an accelerator. The proposed approach identifies, using profiling, the relative benefits of executing a task oil the superscalar CPU cores and the accelerator. We formulate the problem of partitioning the work between the CPU cores and the GPU, taking into account the latencies for data transfers and the required buffer layout transformations associated with the partitioning, as all integrated Integer Linear Program (ILP) which can then be solved by an ILP solver. We also propose an efficient heuristic algorithm for the work-partitioning between the CPU and the GPU, which provides solutions which are within 9.05% of the optimal solution on an average across the benchmark Suite. The partitioned tasks are then software pipelined to execute oil the multiple CPU cores and the Streaming Multiprocessors (SMs) of the GPU. The software pipelining algorithm orchestrates the execution between CPU cores and the GPU by emitting the code for the CPU and the GPU, and the code for the required data transfers. Our experiments on a platform with 8 CPU cores and a GeForce 8800 GTS 512 GPU show a geometric mean speedup of 6.94X with it maximum of 51.96X over it single threaded CPU execution across the StreamIt benchmarks. This is a 18.9% improvement over it partitioning strategy that maps only the filters that cannot be executed oil the GPU - the filters with state that is persistent across firings - onto the CPU.
Resumo:
The paper describes the application of the pipelining principle to the realization of an analogue-to-ternary converter. The circuit shows a considerable saving in hard-ware compared with an earlier proposed circuit. The main hardware components used are analogue comparators, subtractors and the delay elements; hence this method of A/T conversion can operate at a higher sampling frequency.
Resumo:
Emerging embedded applications are based on evolving standards (e.g., MPEG2/4, H.264/265, IEEE802.11a/b/g/n). Since most of these applications run on handheld devices, there is an increasing need for a single chip solution that can dynamically interoperate between different standards and their derivatives. In order to achieve high resource utilization and low power dissipation, we propose REDEFINE, a polymorphic ASIC in which specialized hardware units are replaced with basic hardware units that can create the same functionality by runtime re-composition. It is a ``future-proof'' custom hardware solution for multiple applications and their derivatives in a domain. In this article, we describe a compiler framework and supporting hardware comprising compute, storage, and communication resources. Applications described in high-level language (e.g., C) are compiled into application substructures. For each application substructure, a set of compute elements on the hardware are interconnected during runtime to form a pattern that closely matches the communication pattern of that particular application. The advantage is that the bounded CEs are neither processor cores nor logic elements as in FPGAs. Hence, REDEFINE offers the power and performance advantage of an ASIC and the hardware reconfigurability and programmability of that of an FPGA/instruction set processor. In addition, the hardware supports custom instruction pipelining. Existing instruction-set extensible processors determine a sequence of instructions that repeatedly occur within the application to create custom instructions at design time to speed up the execution of this sequence. We extend this scheme further, where a kernel is compiled into custom instructions that bear strong producer-consumer relationship (and not limited to frequently occurring sequences of instructions). Custom instructions, realized as hardware compositions effected at runtime, allow several instances of the same to be active in parallel. A key distinguishing factor in majority of the emerging embedded applications is stream processing. To reduce the overheads of data transfer between custom instructions, direct communication paths are employed among custom instructions. In this article, we present the overview of the hardware-aware compiler framework, which determines the NoC-aware schedule of transports of the data exchanged between the custom instructions on the interconnect. The results for the FFT kernel indicate a 25% reduction in the number of loads/stores, and throughput improves by log(n) for n-point FFT when compared to sequential implementation. Overall, REDEFINE offers flexibility and a runtime reconfigurability at the expense of 1.16x in power and 8x in area when compared to an ASIC. REDEFINE implementation consumes 0.1x the power of an FPGA implementation. In addition, the configuration overhead of the FPGA implementation is 1,000x more than that of REDEFINE.
Resumo:
This paper presents design of a Low power 256x72 bit TCAM in 0.13um CMOS technology. In contrast to conventional Match line (ML) sensing scheme in which equal power is consumed irrespective of match or mismatch, the ML scheme employed in this design allocates less power to match decisions involving a large number of mismatched bits. Typically, the probability of mismatch is high so this scheme results in significant CAM power reduction. We propose to use this technique along with pipelining of search operation in which the MLs are broken into several segments. Since most words fail to match in first segment, the search operation for subsequent segments is discontinued, resulting in further reduction in power consumption. The above architecture provides 70% power reduction while performing search in 3ns.
Resumo:
The performance of a program will ultimately be limited by its serial (scalar) portion, as pointed out by Amdahl′s Law. Reported studies thus far of instruction-level parallelism have mixed data-parallel program portions with scalar program portions, often leading to contradictory and controversial results. We report an instruction-level behavioral characterization of scalar code containing minimal data-parallelism, extracted from highly vectorized programs of the PERFECT benchmark suite running on a Cray Y-MP system. We classify scalar basic blocks according to their instruction mix, characterize the data dependencies seen in each class, and, as a first step, measure the maximum intrablock instruction-level parallelism available. We observe skewed rather than balanced instruction distributions in scalar code and in individual basic block classes of scalar code; nonuniform distribution of parallelism across instruction classes; and, as expected, limited available intrablock parallelism. We identify frequently occurring data-dependence patterns and discuss new instructions to reduce latency. Toward effective scalar hardware, we study latency-pipelining trade-offs and restricted multiple instruction issue mechanisms.
Resumo:
Wave pipelining is a design technique for increasing the throughput of a digital circuit or system without introducing pipelining registers between adjacent combinational logic blocks in the circuit/system. However, this requires balancing of the delays along all the paths from the input to the output which comes the way of its implementation. Static CMOS is inherently susceptible to delay variation with input data, and hence, receives a low priority for wave pipelined digital design. On the other hand, ECL and CML, which are amenable to wave pipelining, lack the compactness and low power attributes of CMOS. In this paper we attempt to exploit wave pipelining in CMOS technology. We use a single generic building block in Normal Process Complementary Pass Transistor Logic (NPCPL), modeled after CPL, to achieve equal delay along all the propagation paths in the logic structure. An 8×8 b multiplier is designed using this logic in a 0.8 ?m technology. The carry-save multiplier architecture is modified suitably to support wave pipelining, viz., the logic depth of all the paths are made identical. The 1 mm×0.6 mm multiplier core supports a throughput of 400 MHz and dissipates a total power of 0.6 W. We develop simple enhancements to the NPCPL building blocks that allow the multiplier to sustain throughputs in excess of 600 MHz. The methodology can be extended to introduce wave pipelining in other circuits as well