974 resultados para critical path timeline
Resumo:
A novel hardware architecture for elliptic curve cryptography (ECC) over GF(p) is introduced. This can perform the main prime field arithmetic functions needed in these cryptosystems including modular inversion and multiplication. This is based on a new unified modular inversion algorithm that offers considerable improvement over previous ECC techniques that use Fermat's Little Theorem for this operation. The processor described uses a full-word multiplier which requires much fewer clock cycles than previous methods, while still maintaining a competitive critical path delay. The benefits of the approach have been demonstrated by utilizing these techniques to create a field-programmable gate array (FPGA) design. This can perform a 256-bit prime field scalar point multiplication in 3.86 ms, the fastest FPGA time reported to date. The ECC architecture described can also perform four different types of modular inversion, making it suitable for use in many different ECC applications. © 2006 IEEE.
Resumo:
Optimized circuits for implementing high-performance bit-parallel IIR filters are presented. Circuits constructed mainly from simple carry save adders and based on most-significant-bit (MSB) first arithmetic are described. Two methods resulting in systems which are 100% efficient in that they are capable of sampling data every cycle are presented. In the first approach the basic circuit is modified so that the level of pipelining used is compatible with the small, but fixed, latency associated with the computation in question. This is achieved through insertion of pipeline delays (half latches) on every second row of cells. This produces an area-efficient solution in which the throughput rate is determined by a critical path of 76 gate delays. A second approach combines the MSB first arithmetic methods with the scattered look-ahead methods. Important design issues are addressed, including wordlength truncation, overflow detection, and saturation.
Resumo:
This paper presents a thorough investigation of the combined allocator design for Networks-on-Chip (NoC). Particularly, we discuss the interlock of the combined NoC allocator, which is caused by the lock mechanism of priority updating between the local and global arbiters. Architectures and implementations of three interlock-free combined allocators are presented in detail. Their cost, critical path, as well as network level performance are demonstrated based on 65-nm standard cell technology.
Resumo:
Task dataflow languages simplify the specification of parallel programs by dynamically detecting and enforcing dependencies between tasks. These languages are, however, often restricted to a single level of parallelism. This language design is reflected in the runtime system, where a master thread explicitly generates a task graph and worker threads execute ready tasks and wake-up their dependents. Such an approach is incompatible with state-of-the-art schedulers such as the Cilk scheduler, that minimize the creation of idle tasks (work-first principle) and place all task creation and scheduling off the critical path. This paper proposes an extension to the Cilk scheduler in order to reconcile task dependencies with the work-first principle. We discuss the impact of task dependencies on the properties of the Cilk scheduler. Furthermore, we propose a low-overhead ticket-based technique for dependency tracking and enforcement at the object level. Our scheduler also supports renaming of objects in order to increase task-level parallelism. Renaming is implemented using versioned objects, a new type of hyper object. Experimental evaluation shows that the unified scheduler is as efficient as the Cilk scheduler when tasks have no dependencies. Moreover, the unified scheduler is more efficient than SMPSS, a particular implementation of a task dataflow language.
Resumo:
Larsen and Toubro (L&T) Limited is India’s largest construction conglomerate. L&T’s expertise is harnessed to execute high value projects that demand adherence to stringent timelines in a scenario where disparate disciplines of engineering are required to be coordinated on a critical path. However, no company can acquire such a feat without systematic management of its human resource. An investigation on the human resource management practices in orienting L&T’s success can help to identify some of the ethical human resource practices, especially in the context of Indian market. Accordingly, a well-designed employee satisfaction survey was conducted for assessment of the HRM practices being followed in L&T. Unlike other companies, L&T aims to meet the long-term needs of its employees rather than short-term needs. There were however few areas of concerns, such as yearly appraisal system and equality to treat the employees. It is postulated that the inequality to treat the male and female employees is primarily a typical stereotype due to the fact that construction is conventionally believed to be a male dominant activity. A periodic survey intended to provide 360° feedback system can help to avoid such irregularities. This study is thus expected to provide healthy practices of HRM to nurture the young talents of India. This may help them to evaluate their decisions by analyzing the complex relationship between HRM practices and output of an organization.
Resumo:
With the over-provisioned routing resource on FPGA, the topology choice for NoC implementation on FPGA is more flexible than on ASIC. However, it is well understood that the global wire routing impacts the performance of NoC on FPGA because the topology is routed by using fixed routing fabric. An important question that arises is: will the benefit of diameter reduction by using a highly connective topology outweigh the impact of global routing? To answer this question, we investigate FPGA based packet switched NoC implementations with different sizes and topologies, and quantitatively measure the impact of global routing to each of these networks. The result shows that with sufficient routing resources on modern FPGA, the global routing is not on the critical path of the system, and thus is not a dominating factor for the performance of practical multi-hop NoC system. © 2011 IEEE.
Resumo:
In this paper we propose a design methodology for low-power high-performance, process-variation tolerant architecture for arithmetic units. The novelty of our approach lies in the fact that possible delay failures due to process variations and/or voltage scaling are predicted in advance and addressed by employing an elastic clocking technique. The prediction mechanism exploits the dependence of delay of arithmetic units upon input data patterns and identifies specific inputs that activate the critical path. Under iso-yield conditions, the proposed design operates at a lower scaled down Vdd without any performance degradation, while it ensures a superlative yield under a design style employing nominal supply and transistor threshold voltage. Simulation results show power savings of upto 29%, energy per computation savings of upto 25.5% and yield enhancement of upto 11.1% compared to the conventional adders and multipliers implemented in the 70nm BPTM technology. We incorporated the proposed modules in the execution unit of a five stage DLX pipeline to measure performance using SPEC2000 benchmarks [9]. Maximum area and throughput penalty obtained were 10% and 3% respectively.
Resumo:
Static timing analysis provides the basis for setting the clock period of a microprocessor core, based on its worst-case critical path. However, depending on the design, this critical path is not always excited and therefore dynamic timing margins exist that can theoretically be exploited for the benefit of better speed or lower power consumption (through voltage scaling). This paper introduces predictive instruction-based dynamic clock adjustment as a technique to trim dynamic timing margins in pipelined microprocessors. To this end, we exploit the different timing requirements for individual instructions during the dynamically varying program execution flow without the need for complex circuit-level measures to detect and correct timing violations. We provide a design flow to extract the dynamic timing information for the design using post-layout dynamic timing analysis and we integrate the results into a custom cycle-accurate simulator. This simulator allows annotation of individual instructions with their impact on timing (in each pipeline stage) and rapidly derives the overall code execution time for complex benchmarks. The design methodology is illustrated at the microarchitecture level, demonstrating the performance and power gains possible on a 6-stage OpenRISC in-order general purpose processor core in a 28nm CMOS technology. We show that employing instruction-dependent dynamic clock adjustment leads on average to an increase in operating speed by 38% or to a reduction in power consumption by 24%, compared to traditional synchronous clocking, which at all times has to respect the worst-case timing identified through static timing analysis.
Resumo:
This qualitative study explores Thomas Green's (1999) treatise, Voices: The Educational Formation of Conscience; for the purpose of reconstruing the transformative usefulness of conscience in moral education. Conscience is "reflexive judgment about things that matter" (Green, 1999, p. 21). Paul Lehmann (1963) suggested that we must "do the conscience over or do the conscience in" (p. 327). Thomas Green "does the conscience over", arguing that a philosophy of moral education, and not a moral philosophy, provides the only framework from which governance of moral behaviour can be understood. Narratives from four one-to-one interviews and a focus group are analysed and interpreted in search of: (a) awareness and understanding of conscience, (b) voices of conscience, (c) normation, (d) reflexive emotions, and (e) the idea of the sacred. Participants in this study (ages 16-21) demonstrated an active awareness of their conscience and a willingness to engage in a reflective process of their moral behaviour. They understood their conscience to be a process of self-judgment about what is right and wrong, and that its authority comes from within themselves. Narrative accounts from childhood indicated that conscience is there "from the beginning" with evidence of selfcorrecting behaviour. A maturing conscience is accompanied by an increased cognitive capacity, more complicated life experiences, and individualization. Moral motivation was grounded in " a desire to connect with things that are most important." A model for conscience formation is proposed, which visualizes a critical path of reflexive emotions. It is argued that schools, striving to shape good citizens, can promote conscience formation through a "curriculum of moral skills"; a curriculum that embraces complexity, diversity, social criticism, and selfhood.
Resumo:
Decimal multiplication is an integral part offinancial, commercial, and internet-based computations. The basic building block of a decimal multiplier is a single digit multiplier. It accepts two Binary Coded Decimal (BCD) inputs and gives a product in the range [0, 81] represented by two BCD digits. A novel design for single digit decimal multiplication that reduces the critical path delay and area is proposed in this research. Out of the possible 256 combinations for the 8-bit input, only hundred combinations are valid BCD inputs. In the hundred valid combinations only four combinations require 4 x 4 multiplication, combinations need x multiplication, and the remaining combinations use either x or x 3 multiplication. The proposed design makes use of this property. This design leads to more regular VLSI implementation, and does not require special registers for storing easy multiples. This is a fully parallel multiplier utilizing only combinational logic, and is extended to a Hex/Decimal multiplier that gives either a decimal output or a binary output. The accumulation ofpartial products generated using single digit multipliers is done by an array of multi-operand BCD adders for an (n-digit x n-digit) multiplication.
Resumo:
The recent trends envisage multi-standard architectures as a promising solution for the future wireless transceivers to attain higher system capacities and data rates. The computationally intensive decimation filter plays an important role in channel selection for multi-mode systems. An efficient reconfigurable implementation is a key to achieve low power consumption. To this end, this paper presents a dual-mode Residue Number System (RNS) based decimation filter which can be programmed for WCDMA and 802.16e standards. Decimation is done using multistage, multirate finite impulse response (FIR) filters. These FIR filters implemented in RNS domain offers high speed because of its carry free operation on smaller residues in parallel channels. Also, the FIR filters exhibit programmability to a selected standard by reconfiguring the hardware architecture. The total area is increased only by 24% to include WiMAX compared to a single mode WCDMA transceiver. In each mode, the unused parts of the overall architecture is powered down and bypassed to attain power saving. The performance of the proposed decimation filter in terms of critical path delay and area are tabulated.
Resumo:
Decimal multiplication is an integral part of financial, commercial, and internet-based computations. A novel design for single digit decimal multiplication that reduces the critical path delay and area for an iterative multiplier is proposed in this research. The partial products are generated using single digit multipliers, and are accumulated based on a novel RPS algorithm. This design uses n single digit multipliers for an n × n multiplication. The latency for the multiplication of two n-digit Binary Coded Decimal (BCD) operands is (n + 1) cycles and a new multiplication can begin every n cycle. The accumulation of final partial products and the first iteration of partial product generation for next set of inputs are done simultaneously. This iterative decimal multiplier offers low latency and high throughput, and can be extended for decimal floating-point multiplication.
Resumo:
The recent trends envisage multi-standard architectures as a promising solution for the future wireless transceivers. The computationally intensive decimation filter plays an important role in channel selection for multi-mode systems. An efficient reconfigurable implementation is a key to achieve low power consumption. To this end, this paper presents a dual-mode Residue Number System (RNS) based decimation filter which can be programmed for WCDMA and 802.11a standards. Decimation is done using multistage, multirate finite impulse response (FIR) filters. These FIR filters implemented in RNS domain offers high speed because of its carry free operation on smaller residues in parallel channels. Also, the FIR filters exhibit programmability to a selected standard by reconfiguring the hardware architecture. The total area is increased only by 33% to include WLANa compared to a single mode WCDMA transceiver. In each mode, the unused parts of the overall architecture is powered down and bypassed to attain power saving. The performance of the proposed decimation filter in terms of critical path delay and area are tabulated
Resumo:
Nanoparticles are of immense importance both from the fundamental and application points of view. They exhibit quantum size effects which are manifested in their improved magnetic and electric properties. Mechanical attrition by high energy ball milling (HEBM) is a top down process for producing fine particles. However, fineness is associated with high surface area and hence is prone to oxidation which has a detrimental effect on the useful properties of these materials. Passivation of nanoparticles is known to inhibit surface oxidation. At the same time, coating polymer film on inorganic materials modifies the surface properties drastically. In this work a modified set-up consisting of an RF plasma polymerization technique is employed to coat a thin layer of a polymer film on Fe nanoparticles produced by HEBM. Ball-milled particles having different particle size ranges are coated with polyaniline. Their electrical properties are investigated by measuring the dc conductivity in the temperature range 10–300 K. The low temperature dc conductivity (I–V ) exhibited nonlinearity. This nonlinearity observed is explained on the basis of the critical path model. There is clear-cut evidence for the occurrence of intergranular tunnelling. The results are presented here in this paper
Resumo:
Scheduling tasks to efficiently use the available processor resources is crucial to minimizing the runtime of applications on shared-memory parallel processors. One factor that contributes to poor processor utilization is the idle time caused by long latency operations, such as remote memory references or processor synchronization operations. One way of tolerating this latency is to use a processor with multiple hardware contexts that can rapidly switch to executing another thread of computation whenever a long latency operation occurs, thus increasing processor utilization by overlapping computation with communication. Although multiple contexts are effective for tolerating latency, this effectiveness can be limited by memory and network bandwidth, by cache interference effects among the multiple contexts, and by critical tasks sharing processor resources with less critical tasks. This thesis presents techniques that increase the effectiveness of multiple contexts by intelligently scheduling threads to make more efficient use of processor pipeline, bandwidth, and cache resources. This thesis proposes thread prioritization as a fundamental mechanism for directing the thread schedule on a multiple-context processor. A priority is assigned to each thread either statically or dynamically and is used by the thread scheduler to decide which threads to load in the contexts, and to decide which context to switch to on a context switch. We develop a multiple-context model that integrates both cache and network effects, and shows how thread prioritization can both maintain high processor utilization, and limit increases in critical path runtime caused by multithreading. The model also shows that in order to be effective in bandwidth limited applications, thread prioritization must be extended to prioritize memory requests. We show how simple hardware can prioritize the running of threads in the multiple contexts, and the issuing of requests to both the local memory and the network. Simulation experiments show how thread prioritization is used in a variety of applications. Thread prioritization can improve the performance of synchronization primitives by minimizing the number of processor cycles wasted in spinning and devoting more cycles to critical threads. Thread prioritization can be used in combination with other techniques to improve cache performance and minimize cache interference between different working sets in the cache. For applications that are critical path limited, thread prioritization can improve performance by allowing processor resources to be devoted preferentially to critical threads. These experimental results show that thread prioritization is a mechanism that can be used to implement a wide range of scheduling policies.