79 resultados para network-on-chip
Resumo:
With proliferation of chip multicores (CMPs) on desktops and embedded platforms, multi-threaded programs have become ubiquitous. Existence of multiple threads may cause resource contention, such as, in on-chip shared cache and interconnects, depending upon how they access resources. Hence, we propose a tool - Thread Contention Predictor (TCP) to help quantify the number of threads sharing data and their sharing pattern. We demonstrate its use to predict a more profitable shared, last level on-chip cache (LLC) access policy on CMPs. Our cache configuration predictor is 2.2 times faster compared to the cycle-accurate simulations. We also demonstrate its use for identifying hot data structures in a program which may cause performance degradation due to false data sharing. We fix layout of such data structures and show up-to 10% and 18% improvement in execution time and energy-delay product (EDP), respectively.
Resumo:
We report a circuit technique to measure the on-chip delay of an individual logic gate (both inverting and non-inverting) in its unmodified form using digitally reconfigurable ring oscillator (RO). Solving a system of linear equations with different configuration setting of the RO gives delay of an individual gate. Experimental results from a test chip in 65nm process node show the feasibility of measuring the delay of an individual inverter to within 1pS accuracy. Delay measurements of different nominally identical inverters in close physical proximity show variations of up to 26% indicating the large impact of local or within-die variations.
Resumo:
We report the design and characterization of a circuit technique to measure the on-chip delay of an individual logic gate (both inverting and noninverting) in its unmodified form. The test circuit comprises of digitally reconfigurable ring oscillator (RO). The gate under test is embedded in each stage of the ring oscillator. A system of linear equations is then formed with different configuration settings of the RO, relating the individual gate delay to the measured period of the RO, whose solution gives the delay of the individual gates. Experimental results from a test chip in 65-nm process node show the feasibility of measuring the delay of an individual inverter to within 1 ps accuracy. Delay measurements of different nominally identicall inverters in close physical proximity show variations of up to 28% indicating the large impact of local variations. As a demonstration of this technique, we have studied delay variation with poly-pitch, length of diffusion (LOD) and different orientations of layout in silicon. The proposed technique is quite suitable for early process characterization, monitoring mature process in manufacturing and correlating model-to-hardware.
Resumo:
The increasing variability in device leakage has made the design of keepers for wide OR structures a challenging task. The conventional feedback keepers (CONV) can no longer improve the performance of wide dynamic gates for the future technologies. In this paper, we propose an adaptive keeper technique called rate sensing keeper (RSK) that enables faster switching and tracks the variation across different process corners. It can switch upto 1.9x faster (for 20 legs) than CONV and can scale upto 32 legs as against 20 legs for CONV in a 130-nm 1.2-V process. The delay tracking is within 8% across the different process corners. We demonstrate the circuit operation of RSK using a 32 x 8 register file implemented in an industrial 130-nm 1.2-V CMOS process. The performance of individual dynamic logic gates are also evaluated on chip for various keeper techniques. We show that the RSK technique gives superior performance compared to the other alternatives such as Conditional Keeper (CKP) and current mirror-based keeper (LCR).
Resumo:
his paper studies the problem of designing a logical topology over a wavelength-routed all-optical network (AON) physical topology, The physical topology consists of the nodes and fiber links in the network, On an AON physical topology, we can set up lightpaths between pairs of nodes, where a lightpath represents a direct optical connection without any intermediate electronics, The set of lightpaths along with the nodes constitutes the logical topology, For a given network physical topology and traffic pattern (relative traffic distribution among the source-destination pairs), our objective is to design the logical topology and the routing algorithm on that topology so as to minimize the network congestion while constraining the average delay seen by a source-destination pair and the amount of processing required at the nodes (degree of the logical topology), We will see that ignoring the delay constraints can result in fairly convoluted logical topologies with very long delays, On the other hand, in all our examples, imposing it results in a minimal increase in congestion, While the number of wavelengths required to imbed the resulting logical topology on the physical all optical topology is also a constraint in general, we find that in many cases of interest this number can be quite small, We formulate the combined logical topology design and routing problem described above (ignoring the constraint on the number of available wavelengths) as a mixed integer linear programming problem which we then solve for a number of cases of a six-node network, Since this programming problem is computationally intractable for larger networks, we split it into two subproblems: logical topology design, which is computationally hard and will probably require heuristic algorithms, and routing, which can be solved by a linear program, We then compare the performance of several heuristic topology design algorithms (that do take wavelength assignment constraints into account) against that of randomly generated topologies, as well as lower bounds derived in the paper.
Resumo:
In recent years, parallel computers have been attracting attention for simulating artificial neural networks (ANN). This is due to the inherent parallelism in ANN. This work is aimed at studying ways of parallelizing adaptive resonance theory (ART), a popular neural network algorithm. The core computations of ART are separated and different strategies of parallelizing ART are discussed. We present mapping strategies for ART 2-A neural network onto ring and mesh architectures. The required parallel architecture is simulated using a parallel architectural simulator, PROTEUS and parallel programs are written using a superset of C for the algorithms presented. A simulation-based scalability study of the algorithm-architecture match is carried out. The various overheads are identified in order to suggest ways of improving the performance. Our main objective is to find out the performance of the ART2-A network on different parallel architectures. (C) 1999 Elsevier Science B.V. All rights reserved.
Resumo:
In this work, we evaluate performance of a real-world image processing application that uses a cross-correlation algorithm to compare a given image with a reference one. The algorithm processes individual images represented as 2-dimensional matrices of single-precision floating-point values using O(n4) operations involving dot-products and additions. We implement this algorithm on a nVidia GTX 285 GPU using CUDA, and also parallelize it for the Intel Xeon (Nehalem) and IBM Power7 processors, using both manual and automatic techniques. Pthreads and OpenMP with SSE and VSX vector intrinsics are used for the manually parallelized version, while a state-of-the-art optimization framework based on the polyhedral model is used for automatic compiler parallelization and optimization. The performance of this algorithm on the nVidia GPU suffers from: (1) a smaller shared memory, (2) unaligned device memory access patterns, (3) expensive atomic operations, and (4) weaker single-thread performance. On commodity multi-core processors, the application dataset is small enough to fit in caches, and when parallelized using a combination of task and short-vector data parallelism (via SSE/VSX) or through fully automatic optimization from the compiler, the application matches or beats the performance of the GPU version. The primary reasons for better multi-core performance include larger and faster caches, higher clock frequency, higher on-chip memory bandwidth, and better compiler optimization and support for parallelization. The best performing versions on the Power7, Nehalem, and GTX 285 run in 1.02s, 1.82s, and 1.75s, respectively. These results conclusively demonstrate that, under certain conditions, it is possible for a FLOP-intensive structured application running on a multi-core processor to match or even beat the performance of an equivalent GPU version.
Resumo:
Today's feature-rich multimedia products require embedded system solution with complex System-on-Chip (SoC) to meet market expectations of high performance at a low cost and lower energy consumption. The memory architecture of the embedded system strongly influences critical system design objectives like area, power and performance. Hence the embedded system designer performs a complete memory architecture exploration to custom design a memory architecture for a given set of applications. Further, the designer would be interested in multiple optimal design points to address various market segments. However, tight time-to-market constraints enforces short design cycle time. In this paper we address the multi-level multi-objective memory architecture exploration problem through a combination of exhaustive-search based memory exploration at the outer level and a two step based integrated data layout for SPRAM-Cache based architectures at the inner level. We present a two step integrated approach for data layout for SPRAM-Cache based hybrid architectures with the first step as data-partitioning that partitions data between SPRAM and Cache, and the second step is the cache conscious data layout. We formulate the cache-conscious data layout as a graph partitioning problem and show that our approach gives up to 34% improvement over an existing approach and also optimizes the off-chip memory address space. We experimented our approach with 3 embedded multimedia applications and our approach explores several hundred memory configurations for each application, yielding several optimal design points in a few hours of computation on a standard desktop.
Resumo:
A low-power frequency multiplication technique, developed for ZigBee (IEEE 802.15.4) like applications is presented. We have provided an estimate for the power consumption for a given output voltage swing using our technique. The advantages and disadvantages which determine the application areas of the technique are discussed. The issues related to design, layout and process variation are also addressed. Finally, a design is presented for operation in 2.405-2.485-GHz band of ZigBee receiver. SpectreRF simulations show 30% improvement in efficiency for our circuit with regard to conversion of DC bias current to output amplitude, against a LC-VCO. To establish the low-power credentials, we have compared our circuit with an existing technique; our circuit performs better with just 1/3 of total current from supply, and uses one inductor as against three in the latter case. A test chip was implemented in UMC 0.13-mum RF process with spiral on-chip inductors and MIM (metal-insulator-metal) capacitor option.
Resumo:
Denial-of-service (DoS) attacks form a very important category of security threats that are prevalent in MIPv6 (mobile internet protocol version 6) today. Many schemes have been proposed to alleviate such threats, including one of our own [9]. However, reasoning about the correctness of such protocols is not trivial. In addition, new solutions to mitigate attacks may need to be deployed in the network on a frequent basis as and when attacks are detected, as it is practically impossible to anticipate all attacks and provide solutions in advance. This makes it necessary to validate the solutions in a timely manner before deployment in the real network. However, threshold schemes needed in group protocols make analysis complex. Model checking threshold-based group protocols that employ cryptography have not been successful so far. Here, we propose a new simulation based approach for validation using a tool called FRAMOGR that supports executable specification of group protocols that use cryptography. FRAMOGR allows one to specify attackers and track probability distributions of values or paths. We believe that infrastructure such as FRAMOGR would be required in future for validating new group based threshold protocols that may be needed for making MIPv6 more robust.
Resumo:
An all-digital on-chip clock skew measurement system via subsampling is presented. The clock nodes are sub-sampled with a near-frequency asynchronous sampling clock to result in beat signals which are themselves skewed in the same proportion but on a larger time scale. The beat signals are then suitably masked to extract only the skews of the rising edges of the clock signals. We propose a histogram of the arithmetic difference of the beat signals which decouples the relationship of clock jitter to the minimum measurable skew, and allows skews arbitrarily close to zero to be measured with a precision limited largely by measurement time, unlike the conventional XOR based histogram approach. We also analytically show that the proposed approach leads to an unbiased estimate of skew. The measured results from a 65 nm delay measurement front-end indicate that for an input skew range of +/- 1 fan-out-of-4 (FO4) delay, +/- 3 sigma resolution of 0.84 ps can be obtained with an integral error of 0.65 ps. We also experimentally demonstrate that a frequency modulation on a sampling clock maintains precision, indicating the robustness of the technique to jitter. We also show how FM modulation helps in restoring precision in case of rationally related clocks.
Resumo:
The success of an ABV IP depends highly on the associated debugging environment. An efficient debugging environment helps the user to find out the exact location of the failure. Moreover, it provides information to the user in a refined detail of abstraction and permit adequate interaction. It has also been realized that adequate visualization support helps in tracking the behavioral aspects of the Design Under Test (DUT). Currently, the debugging tools provide information in the signal level and do not provide any information about the high-level behavior of the DUT. We present a debugging framework that takes the design specification, assertions and the user intent in a simple format and provides detailed information by processing the design trace on-line, or off-line. We also present a visualization framework to ease the debugging procedure. We have experimented with industrial standard on-chip bus protocols that ensure that this utility can be incorporated successfully in the present functional verification flow.
Resumo:
A method of precise measurement of on-chip analog voltages in a mostly-digital manner, with minimal overhead, is presented. A pair of clock signals is routed to the node of an analog voltage. This analog voltage controls the delay between this pair of clock signals, which is then measured in an all-digital manner using the technique of sub-sampling. This sub-sampling technique, having measurement time and accuracy trade-off, is well suited for low bandwidth signals. This concept is validated by designing delay cells, using current starved inverters in UMC 130nm CMOS process. Sub-mV accuracy is demonstrated for a measurement time of few seconds.
Resumo:
Continuous advances in VLSI technology have made implementation of very complicated systems possible. Modern System-on -Chips (SoCs) have many processors, IP cores and other functional units. As a result, complete verification of whole systems before implementation is becoming infeasible; hence it is likely that these systems may have some errors after manufacturing. This increases the need to find design errors in chips after fabrication. The main challenge for post-silicon debug is the observability of the internal signals. Post-silicon debug is the problem of determining what's wrong when the fabricated chip of a new design behaves incorrectly. This problem now consumes over half of the overall verification effort on large designs, and the problem is growing worse.Traditional post-silicon debug methods concentrate on functional parts of systems and provide mechanisms to increase the observability of internal state of systems. Those methods may not be sufficient as modern SoCs have lots of blocks (processors, IP cores, etc.) which are communicating with one another and communication is another source of design errors. This tutorial will be provide an insight into various observability enhancement techniques, on chip instrumentation techniques and use of high level models to support the debug process targeting both inside blocks and communication among them. It will also cover the use of formal methods to help debug process.
Resumo:
Engineering devices with a large electrical response to magnetic field is of fundamental importance for a range of applications such as magnetic field sensing and magnetic read heads. We show that a colossal nonsaturating linear magnetoresistance (NLMR) arises in two-dimensional electron systems hosted in a GaAs/AlGaAs heterostructure in the strongly insulating regime. When operated at high source-drain bias, the magnetoresistance of our devices increases almost linearly with magnetic field, reaching nearly 10 000% at 8 T, thus surpassing many known nonmagnetic materials that exhibit giant NLMR. The temperature dependence and mobility analysis indicate that the NLMR has a purely classical origin, driven by nanoscale inhomogeneities. A large NLMR combined with small device dimensions makes these systems an attractive candidate for on-chip magnetic field sensing.