31 results for File organization (Computer science)
Location of concentrators in a computer communication network: a stochastic automaton search method
Abstract:
The following problem is considered. Given the locations of the Central Processing Unit (CPU) and the terminals which have to communicate with it, determine the number and locations of the concentrators and assign the terminals to the concentrators in such a way that the total cost is minimized. There is also a fixed cost associated with each concentrator, and there is an upper limit to the number of terminals which can be connected to a concentrator. The terminals can also be connected directly to the CPU. In this paper it is assumed that the concentrators can be located anywhere in the area A containing the CPU and the terminals. This then becomes a multimodal optimization problem. In the proposed algorithm a stochastic automaton is used as a search device to locate the minimum of the multimodal cost function. The proposed algorithm involves the following. The area A containing the CPU and the terminals is divided into an arbitrary number of regions (say K). An approximate value for the number of concentrators is assumed (say m); the optimum number is determined later by iteration. The m concentrators can be assigned to the K regions in (m choose K) ways (m > K) or (K choose m) ways (K > m). (All possible assignments are feasible, i.e. a region can contain 0, 1, …, m concentrators.) Each possible assignment is taken to represent a state of the variable-structure stochastic automaton. To start with, all the states are assigned equal probabilities. At each stage of the search the automaton visits a state according to the current probability distribution. At each visit the automaton selects a point inside that state with uniform probability. The cost associated with that point is calculated and the average cost of that state is updated. Then the probabilities of all the states are updated; the probabilities are taken to be inversely proportional to the average costs of the states. After a certain number of searches the search probabilities become stationary and the automaton visits a particular state again and again; the automaton is then said to have converged to that state. The exact locations of the concentrators are then determined by a local gradient search within that state. This algorithm was applied to a set of test problems and the results were compared with those given by Cooper's (1964, 1967) EAC algorithm; on average, the proposed algorithm was found to perform better.
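A minimal sketch of the search loop described above, under illustrative assumptions: `cost_of_point` and `sample_point` stand in for the paper's cost function and state geometry, costs are assumed strictly positive, and the inverse-cost update and convergence threshold are simplified stand-ins for the paper's exact rules.

```python
import random

def automaton_search(states, cost_of_point, sample_point, iters=2000, eps=0.95):
    """states: one id per concentrator-to-region assignment.
    cost_of_point(p): total network cost of a concrete placement p (assumed > 0).
    sample_point(s): draw a placement uniformly inside state s."""
    probs = {s: 1.0 / len(states) for s in states}       # equal initial probabilities
    avg_cost = {s: 0.0 for s in states}
    visits = {s: 0 for s in states}
    for _ in range(iters):
        s = random.choices(states, weights=[probs[t] for t in states])[0]
        c = cost_of_point(sample_point(s))               # cost of a random point in s
        visits[s] += 1
        avg_cost[s] += (c - avg_cost[s]) / visits[s]     # running average cost of s
        # probabilities inversely proportional to average cost;
        # unvisited states optimistically get the best inverse cost seen so far
        top = max(1.0 / avg_cost[t] for t in states if visits[t])
        inv = {t: (1.0 / avg_cost[t]) if visits[t] else top for t in states}
        z = sum(inv.values())
        probs = {t: inv[t] / z for t in states}
        best = max(probs, key=probs.get)
        if probs[best] > eps:                            # automaton has converged to one state
            return best                                  # follow with a local gradient search
    return max(probs, key=probs.get)
```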
Abstract:
With the increasing adoption of wireless technology, it is reasonable to expect an increase in the demand for supporting both real-time multimedia and high-rate reliable data services. Next-generation wireless systems employ an Orthogonal Frequency Division Multiplexing (OFDM) physical layer owing to the high data rate transmissions that are possible without an increase in bandwidth. Towards improving the performance of these systems, we look at the design of resource allocation algorithms at the medium-access (MAC) layer and their impact on higher layers. While TCP-based elastic traffic needs reliable transport, UDP-based real-time applications have stringent delay and rate requirements. The MAC algorithms, while catering to the heterogeneous service needs of these higher layers, trade off between maximizing the system capacity and providing fairness among users. The novelty of this work is the proposal of various channel-aware resource allocation algorithms at the MAC layer, which can result in significant performance gains in an OFDM-based wireless system.
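One concrete instance of the capacity/fairness trade-off mentioned above is proportional-fair subcarrier assignment. The sketch below is an illustration under an assumed rate model and averaging constant, not one of the paper's actual algorithms.

```python
def proportional_fair(rates, avg_thru, alpha=0.1):
    """rates[u][k]: instantaneous rate user u would get on subcarrier k.
    avg_thru[u]: exponentially averaged throughput of user u (updated in place)."""
    n_users, n_sub = len(rates), len(rates[0])
    alloc = [None] * n_sub
    got = [0.0] * n_users
    for k in range(n_sub):
        # pick the user with the best rate relative to its long-term average:
        # this trades total capacity against fairness among users
        u = max(range(n_users), key=lambda v: rates[v][k] / max(avg_thru[v], 1e-9))
        alloc[k] = u
        got[u] += rates[u][k]
    for u in range(n_users):
        avg_thru[u] = (1 - alpha) * avg_thru[u] + alpha * got[u]
    return alloc
```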
Abstract:
Clustered architecture processors are preferred for embedded systems because centralized register file architectures scale poorly in terms of clock rate, chip area, and power consumption. Although clustering helps by improving clock speed, reducing energy consumption of the logic, and making the design simpler, it introduces extra overheads by way of inter-cluster communication. This communication happens over long global wires, which leads to delays in execution and significantly higher energy consumption. In this paper, we propose a new instruction scheduling algorithm that exploits scheduling slacks of instructions and communication slacks of data values together to achieve better energy-performance trade-offs for clustered architectures with heterogeneous interconnect. Our instruction scheduling algorithm achieves 35% and 40% reduction in communication energy, whereas the overall energy-delay product improves by 4.5% and 6.5%, respectively, for 2-cluster and 4-cluster machines, with a marginal increase (1.6% and 1.1%) in execution time. Our test bed uses the Trimaran compiler infrastructure.
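The slack notion the scheduler relies on can be made concrete with a standard ASAP/ALAP pass over the dependence DAG; an instruction's scheduling slack is its ALAP minus ASAP start time. The DAG encoding below (successor lists and per-instruction latencies) is an illustrative assumption.

```python
def topo_order(succs):
    """Kahn's algorithm over a DAG given as {node: [successors]}."""
    indeg = {i: 0 for i in succs}
    for i in succs:
        for j in succs[i]:
            indeg[j] += 1
    ready = [i for i in succs if indeg[i] == 0]
    order = []
    while ready:
        i = ready.pop()
        order.append(i)
        for j in succs[i]:
            indeg[j] -= 1
            if indeg[j] == 0:
                ready.append(j)
    return order

def slacks(succs, latency):
    """Return {instruction: ALAP - ASAP} scheduling slack."""
    order = topo_order(succs)
    asap = {i: 0 for i in succs}
    for i in order:                                  # forward (ASAP) pass
        for j in succs[i]:
            asap[j] = max(asap[j], asap[i] + latency[i])
    length = max(asap[i] + latency[i] for i in succs)
    alap = {i: length - latency[i] for i in succs}
    for i in reversed(order):                        # backward (ALAP) pass
        for j in succs[i]:
            alap[i] = min(alap[i], alap[j] - latency[i])
    return {i: alap[i] - asap[i] for i in succs}
```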
Abstract:
Clustered VLIW architectures solve the scalability problem associated with flat VLIW architectures by partitioning the register file and connecting only a subset of the functional units to a register file. However, inter-cluster communication in clustered architectures leads to increased leakage in functional components and a high number of register accesses. In this paper, we propose compiler scheduling algorithms targeting two previously ignored power-hungry components in clustered VLIW architectures, viz., the instruction decoder and the register file. We consider a split decoder design and propose a new energy-aware instruction scheduling algorithm that provides 14.5% and 17.3% benefit in decoder power consumption on average over a purely hardware-based scheme in the context of 2-clustered and 4-clustered VLIW machines. In the case of register files, we propose two new scheduling algorithms that exploit limited register snooping capability to reduce extra register file accesses. The proposed algorithms reduce register file power consumption on average by 6.85% and 11.90% (10.39% and 17.78%), respectively, along with performance improvements of 4.81% and 5.34% (9.39% and 11.16%) over a traditional greedy algorithm for a 2-clustered (4-clustered) VLIW machine. (C) 2010 Elsevier B.V. All rights reserved.
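A toy model of the register-snooping idea, under assumptions invented for illustration: a consumer that issues within a small window of its producer picks the value off the bypass network instead of reading the register file, and the window size and schedule encoding below are not the paper's.

```python
def rf_reads_saved(schedule, window=2):
    """schedule: list of (cycle, dst_reg, src_regs) in issue order.
    Returns (reads served by snooping, total source reads)."""
    last_write = {}                   # reg -> cycle its producer wrote it
    saved = total = 0
    for cycle, dst, srcs in schedule:
        for r in srcs:
            total += 1
            if r in last_write and cycle - last_write[r] <= window:
                saved += 1            # snooped from the bypass: no RF read
        last_write[dst] = cycle
    return saved, total
```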
Abstract:
In this paper cognitive science is defined as the study of intelligence as a computational process. The several disciplines which contribute to cognitive science are briefly described, so as to give a bird's-eye view of the subject.
Abstract:
In this paper we propose a new method of data handling for web servers. We call this method Network Aware Buffering and Caching (NABC for short). NABC facilitates reduction of data copies in the web server's data sending path by doing three things: (1) laying out the data in main memory in such a way that protocol processing can be done without data copies, (2) keeping a unified cache of data in the kernel and ensuring safe access to it by various processes and the kernel, and (3) passing only the necessary metadata between processes so that bulk data handling time spent during IPC can be reduced. We realize NABC by implementing a set of system calls and a user library. The end product of the implementation is a set of APIs specifically designed for use by web servers. We port an in-house web server called SWEET to the NABC APIs and evaluate performance using a range of workloads, both simulated and real. The results show a very impressive gain of 12% to 21% in throughput for static file serving and a 1.6 to 4 times gain in throughput for lightweight dynamic content serving for a server using the NABC APIs over one using the UNIX APIs.
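A purely hypothetical sketch of what serving a file through NABC-style APIs might look like; the object and method names (`nabc.lookup`, `nabc.insert`, `nabc.send`) are invented for illustration and are not the paper's actual interface.

```python
def serve(conn, path, nabc):
    """Hypothetical NABC-style send path: `nabc` is assumed to expose a unified
    kernel cache (lookup/insert) and a metadata-only send call, so bulk data is
    never copied between user space and the kernel on the way out."""
    handle = nabc.lookup(path)      # hit in the shared kernel cache: no data copy
    if handle is None:
        # lay the file out in memory so protocol processing needs no further copies
        handle = nabc.insert(path)
    nabc.send(conn, handle)         # only metadata crosses the process/kernel boundary
```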
Abstract:
We propose a scheme for the compression of tree structured intermediate code consisting of a sequence of trees specified by a regular tree grammar. The scheme is based on arithmetic coding, and the model that works in conjunction with the coder is automatically generated from the syntactical specification of the tree language. Experiments on data sets consisting of intermediate code trees yield compression ratios ranging from 2.5 to 8, for file sizes ranging from 167 bytes to 1 megabyte.
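A minimal sketch of how the syntactical specification can drive the coder's model: at each node only the productions of the expected nonterminal are possible, and adaptive counts over them give the symbol probabilities. Summing ideal code lengths (-log2 p) approximates the arithmetic coder's output size; the grammar encoding here is an assumption.

```python
import math

def ideal_code_bits(tree, grammar, start):
    """tree: (production_name, [child subtrees]); grammar[nt]: list of
    (production_name, [child nonterminals]); start: the root nonterminal."""
    counts = {nt: {p: 1 for p, _ in grammar[nt]} for nt in grammar}  # Laplace counts
    bits = 0.0
    stack = [(start, tree)]
    while stack:
        nt, (prod, children) = stack.pop()
        total = sum(counts[nt].values())
        bits += -math.log2(counts[nt][prod] / total)   # ideal cost of this production
        counts[nt][prod] += 1                          # adapt the model
        child_nts = dict(grammar[nt])[prod]
        stack.extend(zip(child_nts, children))
    return bits

# e.g. a grammar for fully parenthesized sums:
#   grammar = {"E": [("plus", ["E", "E"]), ("const", [])]}
#   ideal_code_bits(("plus", [("const", []), ("const", [])]), grammar, "E")
```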
Abstract:
The inherent temporal locality in memory accesses is filtered out by the L1 cache. As a consequence, an L2 cache with LRU replacement incurs significantly more misses than the optimal replacement policy (OPT). We propose to narrow this gap through a novel replacement strategy that mimics the replacement decisions of OPT. The L2 cache is logically divided into two components, a Shepherd Cache (SC) with simple FIFO replacement and a Main Cache (MC) with an emulation of optimal replacement. The SC plays the dual role of caching lines and guiding the replacement decisions in the MC. Our proposed organization can cover 40% of the gap between OPT and LRU for a 2MB cache, resulting in 7% overall speedup. Comparison with the dynamic insertion policy, a victim buffer, a V-Way cache and an LRU-based fully associative cache demonstrates that our scheme performs better than all these strategies.
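For reference, the OPT (Belady) policy that the SC/MC organization tries to emulate evicts, on a miss with a full cache, the resident line whose next use lies farthest in the future. A minimal miss-counting sketch (the trace encoding is an assumption; OPT needs the future trace, which is why hardware can only approximate it):

```python
def opt_misses(trace, capacity):
    """trace: list of line addresses; returns the miss count under OPT."""
    cache, misses = set(), 0
    for i, line in enumerate(trace):
        if line in cache:
            continue
        misses += 1
        if len(cache) >= capacity:
            def next_use(l):
                # index of the next reference to l, or infinity if never reused
                for j in range(i + 1, len(trace)):
                    if trace[j] == l:
                        return j
                return float('inf')
            cache.remove(max(cache, key=next_use))   # evict farthest-future line
        cache.add(line)
    return misses
```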
Abstract:
A computer-aided procedure is described for analyzing the reliability of complicated networks. This procedure breaks down a network into small subnetworks whose reliability can be more readily calculated. The subnetworks which are searched for are those with only two nodes; this allows the original network to be considerably simplified.
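The two-node subnetworks the procedure searches for collapse to a single equivalent link by the standard series/parallel rules for independent link reliabilities; a minimal sketch:

```python
def series(p1, p2):
    return p1 * p2                      # both links must work

def parallel(p1, p2):
    return 1 - (1 - p1) * (1 - p2)      # at least one link must work

# e.g. two parallel paths, each made of two links of reliability 0.9:
# parallel(series(0.9, 0.9), series(0.9, 0.9)) ~= 0.9639
```

Repeating these reductions shrinks any series-parallel subnetwork between two boundary nodes to one equivalent link, which is what allows the original network to be considerably simplified.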
Abstract:
Superscalar processors currently have the potential to fetch multiple basic blocks per cycle by employing one of several recently proposed instruction fetch mechanisms. However, this increased fetch bandwidth cannot be exploited unless pipeline stages further downstream correspondingly improve. In particular, register renaming a large number of instructions per cycle is difficult. A large instruction window, needed to receive multiple basic blocks per cycle, will slow down dependence resolution and instruction issue. This paper addresses these and related issues by proposing (i) partitioning of the instruction window into multiple blocks, each holding a dynamic code sequence; (ii) logical partitioning of the register file into a global file and several local files, the latter holding registers local to a dynamic code sequence; and (iii) the dynamic recording and reuse of register renaming information for registers local to a dynamic code sequence. Performance studies show these mechanisms improve performance over traditional superscalar processors by factors ranging from 1.5 to a little over 3 for the SPEC Integer programs. Next, it is observed that several of the loops in the benchmarks display vector-like behavior during execution, even if the static loop bodies are likely too complex for compile-time vectorization. A dynamic loop vectorization mechanism that builds on top of the above mechanisms is briefly outlined. The mechanism vectorizes up to 60% of the dynamic instructions for some programs, albeit the average number of iterations per loop is quite small.
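A toy model of point (iii), the recording and reuse of local-register rename information: the first execution of a dynamic code sequence builds its local mapping, and later executions reuse it instead of renaming instruction by instruction. The data layout is an illustrative assumption.

```python
class LocalRenamer:
    def __init__(self):
        self.recorded = {}                    # trace id -> saved local rename map

    def rename(self, trace_id, local_regs):
        """Map each architectural register local to this dynamic code sequence
        to a slot in a fresh local register file partition."""
        if trace_id in self.recorded:         # reuse: no per-instruction rename work
            return self.recorded[trace_id]
        mapping = {r: slot for slot, r in enumerate(sorted(local_regs))}
        self.recorded[trace_id] = mapping     # record for future executions
        return mapping
```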
Abstract:
Clustered architecture processors are preferred for embedded systems because centralized register file architectures scale poorly in terms of clock rate, chip area, and power consumption. Although clustering helps by improving the clock speed, reducing the energy consumption of the logic, and making the design simpler, it introduces extra overheads by way of inter-cluster communication. This communication happens over long global wires having high load capacitance, which leads to delays in execution and significantly higher energy consumption. Inter-cluster communication also introduces many short idle cycles, thereby significantly increasing the overall leakage energy consumption in the functional units. The trend towards miniaturization of devices (and the associated reduction in threshold voltage) makes energy consumption in interconnects and functional units even worse, and limits the usability of clustered architectures in smaller technologies. However, technological advancements now permit the design of interconnects and functional units with varying performance and power modes. In this paper, we propose scheduling algorithms that aggregate the scheduling slack of instructions and the communication slack of data values to exploit the low-power modes of functional units and interconnects. Finally, we present a synergistic combination of these algorithms that simultaneously saves energy in functional units and interconnects to improve the usability of clustered architectures by achieving better overall energy-performance trade-offs. Even with conservative estimates of the contribution of the functional units and interconnects to the overall processor energy consumption, the proposed combined scheme obtains on average 8% and 10% improvement in overall energy-delay product with 3.5% and 2% performance degradation for a 2-clustered and a 4-clustered machine, respectively. We present a detailed experimental evaluation of the proposed schemes. Our test bed uses the Trimaran compiler infrastructure. (C) 2012 Elsevier Inc. All rights reserved.
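The core decision the aggregated slack enables can be stated in a few lines: a transfer whose consumer has enough slack rides the slow, low-energy interconnect without stretching the schedule. The energy numbers below are illustrative assumptions, not measurements from the paper.

```python
def pick_wire(comm_slack, extra_latency_slow, e_fast=1.0, e_slow=0.4):
    """Return (wire, energy) for one inter-cluster transfer.
    comm_slack: cycles the value may be delayed without delaying its consumer.
    extra_latency_slow: added latency of the low-power wire over the fast one."""
    if comm_slack >= extra_latency_slow:
        return "slow", e_slow        # the slack absorbs the added wire latency
    return "fast", e_fast
```

The same test applies to functional units: an instruction with enough scheduling slack can execute on a unit in its low-power mode, which is why the combined scheme saves energy in both components at once.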
Abstract:
In this paper, based on the temporal and spatial locality characteristics of memory accesses in multicores, we propose a reorganization of the existing single large row-buffer in a DRAM bank into multiple smaller row-buffers. The proposed configuration helps improve row hit rates and also brings down the energy required for row activations. The major contribution of this work is proposing such a reorganization without requiring any significant changes to the existing widely accepted DRAM specifications. Our proposed reorganization improves performance by 35.8%, 14.5% and 21.6% in quad-, eight- and sixteen-core workloads, along with a 42%, 28% and 31% reduction in DRAM energy. Additionally, we introduce a Need Based Allocation scheme for buffer management that shows additional performance improvement.
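A minimal sketch of a bank with multiple small row-buffers: an access hits if its row is held in any sub-buffer; otherwise a victim sub-buffer is precharged and the requested row activated into it. The buffer count and LRU victim choice are illustrative assumptions (the paper's Need Based Allocation is a different, smarter policy).

```python
from collections import OrderedDict

class MultiRowBufferBank:
    def __init__(self, n_buffers=4):
        self.bufs = OrderedDict()          # resident rows, in LRU -> MRU order
        self.n = n_buffers

    def access(self, row):
        if row in self.bufs:
            self.bufs.move_to_end(row)     # row-buffer hit: no activation energy
            return "hit"
        if len(self.bufs) >= self.n:
            self.bufs.popitem(last=False)  # precharge and evict the LRU sub-buffer
        self.bufs[row] = None              # activate the requested row
        return "miss"
```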
Abstract:
The twin demands of energy efficiency and higher performance on DRAM are strongly emphasized in multicore architectures. A variety of schemes have been proposed to address either the latency or the energy consumption of DRAMs. These schemes typically require non-trivial hardware changes and end up improving latency at the cost of energy or vice versa. One specific DRAM performance problem in multicores is that interleaved accesses from different cores can potentially degrade row-buffer locality. In this paper, based on the temporal and spatial locality characteristics of memory accesses, we propose a reorganization of the existing single large row-buffer in a DRAM bank into multiple sub-row buffers (MSRB). This reorganization not only improves row hit rates, and hence the average memory latency, but also brings down the energy consumed by the DRAM. The first major contribution of this work is proposing such a reorganization without requiring any significant changes to the existing widely accepted DRAM specifications. Our proposed reorganization improves weighted speedup by 35.8%, 14.5% and 21.6% in quad-, eight- and sixteen-core workloads, along with a 42%, 28% and 31% reduction in DRAM energy. The proposed MSRB organization enables opportunities for the management of multiple row-buffers at the memory controller level. As the memory controller is aware of the behaviour of individual cores, it allows us to implement coordinated buffer allocation schemes for different cores that take program behaviour into account. We demonstrate two such schemes, namely Fairness Oriented Allocation and Performance Oriented Allocation, which show the flexibility that memory controllers can now exploit in our MSRB organization to improve overall performance and/or fairness. Further, the MSRB organization enables additional opportunities for DRAM intra-bank parallelism and selective early precharging of the LRU row-buffer to further improve memory access latencies. These two optimizations together provide an additional 5.9% performance improvement.
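A hedged sketch of what a fairness-oriented allocation at the memory controller could look like: per-core quotas on sub-row buffers, skewed toward cores whose locality suffers most. The quota rule below is an assumption for illustration, not the paper's actual Fairness Oriented Allocation scheme.

```python
def fairness_quotas(miss_rates, n_buffers):
    """miss_rates[c]: recent row-buffer miss rate of core c (0..1).
    Cores with worse locality get proportionally more sub-buffers.
    Quotas are advisory: rounding may over-commit and need renormalizing."""
    total = sum(miss_rates.values()) or 1.0
    return {c: max(1, round(n_buffers * m / total)) for c, m in miss_rates.items()}
```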