130 resultados para DL-PCBs
Resumo:
In achieving higher instruction level parallelism, software pipelining increases the register pressure in the loop. The usefulness of the generated schedule may be restricted to cases where the register pressure is less than the available number of registers. Spill instructions need to be introduced otherwise. But scheduling these spill instructions in the compact schedule is a difficult task. Several heuristics have been proposed to schedule spill code. These heuristics may generate more spill code than necessary, and scheduling them may necessitate increasing the initiation interval. We model the problem of register allocation with spill code generation and scheduling in software pipelined loops as a 0-1 integer linear program. The formulation minimizes the increase in initiation interval (II) by optimally placing spill code and simultaneously minimizes the amount of spill code produced. To the best of our knowledge, this is the first integrated formulation for register allocation, optimal spill code generation and scheduling for software pipelined loops. The proposed formulation performs better than the existing heuristics by preventing an increase in II in 11.11% of the loops and generating 18.48% less spill code on average among the loops extracted from Perfect Club and SPEC benchmarks with a moderate increase in compilation time.
Resumo:
In this paper, we study how TCP and UDP flows interact with each other when the end system is a CPU resource constrained thin client. The problem addressed is twofold, 1) the throughput of TCP flows degrades severely in the presence of heavily loaded UDP flows 2) fairness and minimum QoS requirements of UDP are not maintained. First, we identify the factors affecting the TCP throughput by providing an in-depth analysis of end to end delay and packet loss variations. The results obtained from the first part leads us to our second contribution. We propose and study the use of an algorithm that ensures fairness across flows. The algorithm improves the performance of TCP flows in the presence of multiple UDP flows admitted under an admission algorithm and maintains the minimum QoS requirements of the UDP flows. The advantage of the algorithm is that it requires no changes to TCP/IP stack and control is achieved through receiver window control.
Resumo:
Even though several techniques have been proposed in the literature for achieving multiclass classification using Support Vector Machine(SVM), the scalability aspect of these approaches to handle large data sets still needs much of exploration. Core Vector Machine(CVM) is a technique for scaling up a two class SVM to handle large data sets. In this paper we propose a Multiclass Core Vector Machine(MCVM). Here we formulate the multiclass SVM problem as a Quadratic Programming(QP) problem defining an SVM with vector valued output. This QP problem is then solved using the CVM technique to achieve scalability to handle large data sets. Experiments done with several large synthetic and real world data sets show that the proposed MCVM technique gives good generalization performance as that of SVM at a much lesser computational expense. Further, it is observed that MCVM scales well with the size of the data set.
Resumo:
This paper describes techniques to estimate the worst case execution time of executable code on architectures with data caches. The underlying mechanism is Abstract Interpretation, which is used for the dual purposes of tracking address computations and cache behavior. A simultaneous numeric and pointer analysis using an abstraction for discrete sets of values computes safe approximations of access addresses which are then used to predict cache behavior using Must Analysis. A heuristic is also proposed which generates likely worst case estimates. It can be used in soft real time systems and also for reasoning about the tightness of the safe estimate. The analysis methods can handle programs with non-affine access patterns, for which conventional Presburger Arithmetic formulations or Cache Miss Equations do not apply. The precision of the estimates is user-controlled and can be traded off against analysis time. Executables are analyzed directly, which, apart from enhancing precision, renders the method language independent.
Resumo:
A geometric and non parametric procedure for testing if two finite set of points are linearly separable is proposed. The Linear Separability Test is equivalent to a test that determines if a strictly positive point h > 0 exists in the range of a matrix A (related to the points in the two finite sets). The algorithm proposed in the paper iteratively checks if a strictly positive point exists in a subspace by projecting a strictly positive vector with equal co-ordinates (p), on the subspace. At the end of each iteration, the subspace is reduced to a lower dimensional subspace. The test is completed within r ≤ min(n, d + 1) steps, for both linearly separable and non separable problems (r is the rank of A, n is the number of points and d is the dimension of the space containing the points). The worst case time complexity of the algorithm is O(nr3) and space complexity of the algorithm is O(nd). A small review of some of the prominent algorithms and their time complexities is included. The worst case computational complexity of our algorithm is lower than the worst case computational complexity of Simplex, Perceptron, Support Vector Machine and Convex Hull Algorithms, if d
Resumo:
In this paper we propose a novel, scalable, clustering based Ordinal Regression formulation, which is an instance of a Second Order Cone Program (SOCP) with one Second Order Cone (SOC) constraint. The main contribution of the paper is a fast algorithm, CB-OR, which solves the proposed formulation more eficiently than general purpose solvers. Another main contribution of the paper is to pose the problem of focused crawling as a large scale Ordinal Regression problem and solve using the proposed CB-OR. Focused crawling is an efficient mechanism for discovering resources of interest on the web. Posing the problem of focused crawling as an Ordinal Regression problem avoids the need for a negative class and topic hierarchy, which are the main drawbacks of the existing focused crawling methods. Experiments on large synthetic and benchmark datasets show the scalability of CB-OR. Experiments also show that the proposed focused crawler outperforms the state-of-the-art.
Resumo:
Structural alignments are the most widely used tools for comparing proteins with low sequence similarity. The main contribution of this paper is to derive various kernels on proteins from structural alignments, which do not use sequence information. Central to the kernels is a novel alignment algorithm which matches substructures of fixed size using spectral graph matching techniques. We derive positive semi-definite kernels which capture the notion of similarity between substructures. Using these as base more sophisticated kernels on protein structures are proposed. To empirically evaluate the kernels we used a 40% sequence non-redundant structures from 15 different SCOP superfamilies. The kernels when used with SVMs show competitive performance with CE, a state of the art structure comparison program.
Resumo:
Pricing is an effective tool to control congestion and achieve quality of service (QoS) provisioning for multiple differentiated levels of service. In this paper, we consider the problem of pricing for congestion control in the case of a network of nodes under a single service class and multiple queues, and present a multi-layered pricing scheme. We propose an algorithm for finding the optimal state dependent price levels for individual queues, at each node. The pricing policy used depends on a weighted average queue length at each node. This helps in reducing frequent price variations and is in the spirit of the random early detection (RED) mechanism used in TCP/IP networks. We observe in our numerical results a considerable improvement in performance using our scheme over that of a recently proposed related scheme in terms of both throughput and delay performance. In particular, our approach exhibits a throughput improvement in the range of 34 to 69 percent in all cases studied (over all routes) over the above scheme.
Resumo:
We present a fast algorithm for computing a Gomory-Hu tree or cut tree for an unweighted undirected graph G = (V,E). The expected running time of our algorithm is Õ(mc) where |E| = m and c is the maximum u-vedge connectivity, where u,v ∈ V. When the input graph is also simple (i.e., it has no parallel edges), then the u-v edge connectivity for each pair of vertices u and v is at most n-1; so the expected running time of our algorithm for simple unweighted graphs is Õ(mn).All the algorithms currently known for constructing a Gomory-Hu tree [8,9] use n-1 minimum s-t cut (i.e., max flow) subroutines. This in conjunction with the current fastest Õ(n20/9) max flow algorithm due to Karger and Levine [11] yields the current best running time of Õ(n20/9n) for Gomory-Hu tree construction on simpleunweighted graphs with m edges and n vertices. Thus we present the first Õ(mn) algorithm for constructing a Gomory-Hu tree for simple unweighted graphs.We do not use a max flow subroutine here; we present an efficient tree packing algorithm for computing Steiner edge connectivity and use this algorithm as our main subroutine. The advantage in using a tree packing algorithm for constructing a Gomory-Hu tree is that the work done in computing a minimum Steiner cut for a Steiner set S ⊆ V can be reused for computing a minimum Steiner cut for certain Steiner sets S' ⊆ S.
Resumo:
Frequent episode discovery is a popular framework for mining data available as a long sequence of events. An episode is essentially a short ordered sequence of event types and the frequency of an episode is some suitable measure of how often the episode occurs in the data sequence. Recently,we proposed a new frequency measure for episodes based on the notion of non-overlapped occurrences of episodes in the event sequence, and showed that, such a definition, in addition to yielding computationally efficient algorithms, has some important theoretical properties in connecting frequent episode discovery with HMM learning. This paper presents some new algorithms for frequent episode discovery under this non-overlapped occurrences-based frequency definition. The algorithms presented here are better (by a factor of N, where N denotes the size of episodes being discovered) in terms of both time and space complexities when compared to existing methods for frequent episode discovery. We show through some simulation experiments, that our algorithms are very efficient. The new algorithms presented here have arguably the least possible orders of spaceand time complexities for the task of frequent episode discovery.
Resumo:
Energy consumption has become a major constraint in providing increased functionality for devices with small form factors. Dynamic voltage and frequency scaling has been identified as an effective approach for reducing the energy consumption of embedded systems. Earlier works on dynamic voltage scaling focused mainly on performing voltage scaling when the CPU is waiting for memory subsystem or concentrated chiefly on loop nests and/or subroutine calls having sufficient number of dynamic instructions. This paper concentrates on coarser program regions and for the first time uses program phase behavior for performing dynamic voltage scaling. Program phases are annotated at compile time with mode switch instructions. Further, we relate the Dynamic Voltage Scaling Problem to the Multiple Choice Knapsack Problem, and use well known heuristics to solve it efficiently. Also, we develop a simple integer linear program formulation for this problem. Experimental evaluation on a set of media applications reveal that our heuristic method obtains a 38% reduction in energy consumption on an average, with a performance degradation of 1% and upto 45% reduction in energy with a performance degradation of 5%. Further, the energy consumed by the heuristic solution is within 1% of the optimal solution obtained from the ILP approach.
Resumo:
The inherent temporal locality in memory accesses is filtered out by the L1 cache. As a consequence, an L2 cache with LRU replacement incurs significantly higher misses than the optimal replacement policy (OPT). We propose to narrow this gap through a novel replacement strategy that mimics the replacement decisions of OPT. The L2 cache is logically divided into two components, a Shepherd Cache (SC) with a simple FIFO replacement and a Main Cache (MC) with an emulation of optimal replacement. The SC plays the dual role of caching lines and guiding the replacement decisions in MC. Our pro- posed organization can cover 40% of the gap between OPT and LRU for a 2MB cache resulting in 7% overall speedup. Comparison with the dynamic insertion policy, a victim buffer, a V-Way cache and an LRU based fully associative cache demonstrates that our scheme performs better than all these strategies.
Resumo:
Allgather is an important MPI collective communication. Most of the algorithms for allgather have been designed for homogeneous and tightly coupled systems. The existing algorithms for allgather on Gridsystems do not efficiently utilize the bandwidths available on slow wide-area links of the grid. In this paper, we present an algorithm for allgather on grids that efficiently utilizes wide-area bandwidths and is also wide-area optimal. Our algorithm is also adaptive to gridload dynamics since it considers transient network characteristics for dividing the nodes into clusters. Our experiments on a real-grid setup consisting of 3 sites show that our algorithm gives an average performance improvement of 52% over existing strategies.
Resumo:
Wireless LAN (WLAN) market consists of IEEE 802.11 MAC standard conformant devices (e.g., access points (APs), client adapters) from multiple vendors. Certain third party certifications such as those specified by the Wi-Fi alliance have been widely used by vendors to ensure basic conformance to the 802.11 standard, thus leading to the expectation that the available devices exhibit identical MAC level behavior. In this paper, however, we present what we believe to be the first ever set of experimental results that highlight the fact that WLAN devices from different vendors in the market can have heterogeneous MAC level behavior. Specifically, we demonstrate with examples and data that in certain cases, devices may not be conformant with the 802.11 standard while in other cases, they may differ in significant details that are not a part of mandatory specifications of the standard. We argue that heterogeneous MAC implementations can adversely impact WLAN operations leading to unfair bandwidth allocation, potential break-down of related MAC functionality and difficulties in provisioning the capacity of a WLAN. However, on the positive side, MAC level heterogeneity can be useful in applications such as vendor/model level device fingerprinting.
Resumo:
Large instruction windows and issue queues are key to exploiting greater instruction level parallelism in out-of-order superscalar processors. However, the cycle time and energy consumption of conventional large monolithic issue queues are high. Previous efforts to reduce cycle time segment the issue queue and pipeline wakeup. Unfortunately, this results in significant IPC loss. Other proposals which address energy efficiency issues by avoiding only the unnecessary tag-comparisons do not reduce broadcasts. These schemes also increase the issue latency.To address both these issues comprehensively, we propose the Scalable Lowpower Issue Queue (SLIQ). SLIQ augments a pipelined issue queue with direct indexing to mitigate the problem of delayed wakeups while reducing the cycle time. Also, the SLIQ design naturally leads to significant energy savings by reducing both the number of tag broadcasts and comparisons required.A 2 segment SLIQ incurs an average IPC loss of 0.2% over the entire SPEC CPU2000 suite, while achieving a 25.2% reduction in issue latency when compared to a monolithic 128-entry issue queue for an 8-wide superscalar processor. An 8 segment SLIQ improves scalability by reducing the issue latency by 38.3% while incurring an IPC loss of only 2.3%. Further, the 8 segment SLIQ significantly reduces the energy consumption and energy-delay product by 48.3% and 67.4% respectively on average.