8 resultados para Classify
em Boston University Digital Common
Resumo:
Detecting and understanding anomalies in IP networks is an open and ill-defined problem. Toward this end, we have recently proposed the subspace method for anomaly diagnosis. In this paper we present the first large-scale exploration of the power of the subspace method when applied to flow traffic. An important aspect of this approach is that it fuses information from flow measurements taken throughout a network. We apply the subspace method to three different types of sampled flow traffic in a large academic network: multivariate timeseries of byte counts, packet counts, and IP-flow counts. We show that each traffic type brings into focus a different set of anomalies via the subspace method. We illustrate and classify the set of anomalies detected. We find that almost all of the anomalies detected represent events of interest to network operators. Furthermore, the anomalies span a remarkably wide spectrum of event types, including denial of service attacks (single-source and distributed), flash crowds, port scanning, downstream traffic engineering, high-rate flows, worm propagation, and network outage.
Resumo:
The majority of the traffic (bytes) flowing over the Internet today have been attributed to the Transmission Control Protocol (TCP). This strong presence of TCP has recently spurred further investigations into its congestion avoidance mechanism and its effect on the performance of short and long data transfers. At the same time, the rising interest in enhancing Internet services while keeping the implementation cost low has led to several service-differentiation proposals. In such service-differentiation architectures, much of the complexity is placed only in access routers, which classify and mark packets from different flows. Core routers can then allocate enough resources to each class of packets so as to satisfy delivery requirements, such as predictable (consistent) and fair service. In this paper, we investigate the interaction among short and long TCP flows, and how TCP service can be improved by employing a low-cost service-differentiation scheme. Through control-theoretic arguments and extensive simulations, we show the utility of isolating TCP flows into two classes based on their lifetime/size, namely one class of short flows and another of long flows. With such class-based isolation, short and long TCP flows have separate service queues at routers. This protects each class of flows from the other as they possess different characteristics, such as burstiness of arrivals/departures and congestion/sending window dynamics. We show the benefits of isolation, in terms of better predictability and fairness, over traditional shared queueing systems with both tail-drop and Random-Early-Drop (RED) packet dropping policies. The proposed class-based isolation of TCP flows has several advantages: (1) the implementation cost is low since it only requires core routers to maintain per-class (rather than per-flow) state; (2) it promises to be an effective traffic engineering tool for improved predictability and fairness for both short and long TCP flows; and (3) stringent delay requirements of short interactive transfers can be met by increasing the amount of resources allocated to the class of short flows.
Resumo:
The cost and complexity of deploying measurement infrastructure in the Internet for the purpose of analyzing its structure and behavior is considerable. Basic questions about the utility of increasing the number of measurements and/or measurement sites have not yet been addressed which has lead to a "more is better" approach to wide-area measurements. In this paper, we quantify the marginal utility of performing wide-area measurements in the context of Internet topology discovery. We characterize topology in terms of nodes, links, node degree distribution, and end-to-end flows using statistical and information-theoretic techniques. We classify nodes discovered on the routes between a set of 8 sources and 1277 destinations to differentiate nodes which make up the so called "backbone" from those which border the backbone and those on links between the border nodes and destination nodes. This process includes reducing nodes that advertise multiple interfaces to single IP addresses. We show that the utility of adding sources goes down significantly after 2 from the perspective of interface, node, link and node degree discovery. We show that the utility of adding destinations is constant for interfaces, nodes, links and node degree indicating that it is more important to add destinations than sources. Finally, we analyze paths through the backbone and show that shared link distributions approximate a power law indicating that a small number of backbone links in our study are very heavily utilized.
Resumo:
The increasing practicality of large-scale flow capture makes it possible to conceive of traffic analysis methods that detect and identify a large and diverse set of anomalies. However the challenge of effectively analyzing this massive data source for anomaly diagnosis is as yet unmet. We argue that the distributions of packet features (IP addresses and ports) observed in flow traces reveals both the presence and the structure of a wide range of anomalies. Using entropy as a summarization tool, we show that the analysis of feature distributions leads to significant advances on two fronts: (1) it enables highly sensitive detection of a wide range of anomalies, augmenting detections by volume-based methods, and (2) it enables automatic classification of anomalies via unsupervised learning. We show that using feature distributions, anomalies naturally fall into distinct and meaningful clusters. These clusters can be used to automatically classify anomalies and to uncover new anomaly types. We validate our claims on data from two backbone networks (Abilene and Geant) and conclude that feature distributions show promise as a key element of a fairly general network anomaly diagnosis framework.
Resumo:
Nearest neighbor retrieval is the task of identifying, given a database of objects and a query object, the objects in the database that are the most similar to the query. Retrieving nearest neighbors is a necessary component of many practical applications, in fields as diverse as computer vision, pattern recognition, multimedia databases, bioinformatics, and computer networks. At the same time, finding nearest neighbors accurately and efficiently can be challenging, especially when the database contains a large number of objects, and when the underlying distance measure is computationally expensive. This thesis proposes new methods for improving the efficiency and accuracy of nearest neighbor retrieval and classification in spaces with computationally expensive distance measures. The proposed methods are domain-independent, and can be applied in arbitrary spaces, including non-Euclidean and non-metric spaces. In this thesis particular emphasis is given to computer vision applications related to object and shape recognition, where expensive non-Euclidean distance measures are often needed to achieve high accuracy. The first contribution of this thesis is the BoostMap algorithm for embedding arbitrary spaces into a vector space with a computationally efficient distance measure. Using this approach, an approximate set of nearest neighbors can be retrieved efficiently - often orders of magnitude faster than retrieval using the exact distance measure in the original space. The BoostMap algorithm has two key distinguishing features with respect to existing embedding methods. First, embedding construction explicitly maximizes the amount of nearest neighbor information preserved by the embedding. Second, embedding construction is treated as a machine learning problem, in contrast to existing methods that are based on geometric considerations. The second contribution is a method for constructing query-sensitive distance measures for the purposes of nearest neighbor retrieval and classification. In high-dimensional spaces, query-sensitive distance measures allow for automatic selection of the dimensions that are the most informative for each specific query object. It is shown theoretically and experimentally that query-sensitivity increases the modeling power of embeddings, allowing embeddings to capture a larger amount of the nearest neighbor structure of the original space. The third contribution is a method for speeding up nearest neighbor classification by combining multiple embedding-based nearest neighbor classifiers in a cascade. In a cascade, computationally efficient classifiers are used to quickly classify easy cases, and classifiers that are more computationally expensive and also more accurate are only applied to objects that are harder to classify. An interesting property of the proposed cascade method is that, under certain conditions, classification time actually decreases as the size of the database increases, a behavior that is in stark contrast to the behavior of typical nearest neighbor classification systems. The proposed methods are evaluated experimentally in several different applications: hand shape recognition, off-line character recognition, online character recognition, and efficient retrieval of time series. In all datasets, the proposed methods lead to significant improvements in accuracy and efficiency compared to existing state-of-the-art methods. In some datasets, the general-purpose methods introduced in this thesis even outperform domain-specific methods that have been custom-designed for such datasets.
Resumo:
The SNBENCH is a general-purpose programming environment and run-time system targeted towards a variety of Sensor applications such as environmental sensing, location sensing, video sensing, etc. In its current structure, the run-time engine of the SNBENCH namely, the Sensorium Execution Environment (SXE) processes the entities of execution in a single thread of operation. In order to effectively support applications that are time-sensitive and need priority, it is imperative to process the tasks discretely so that specific policies can be applied at a much granular level. The goal of this project was to modify the SXE to enable efficient use of system resources by way of multi-tasking the individual components. Additionally, the transformed SXE offers the ability to classify and employ different schemes of processing to the individual tasks.
Resumo:
Fusion ARTMAP is a self-organizing neural network architecture for multi-channel, or multi-sensor, data fusion. Fusion ARTMAP generalizes the fuzzy ARTMAP architecture in order to adaptively classify multi-channel data. The network has a symmetric organization such that each channel can be dynamically configured to serve as either a data input or a teaching input to the system. An ART module forms a compressed recognition code within each channel. These codes, in turn, beco1ne inputs to a single ART system that organizes the global recognition code. When a predictive error occurs, a process called parallel match tracking simultaneously raises vigilances in multiple ART modules until reset is triggered in one of thmn. Parallel match tracking hereby resets only that portion of the recognition code with the poorest match, or minimum predictive confidence. This internally controlled selective reset process is a type of credit assignment that creates a parsimoniously connected learned network.
Resumo:
This article introduces a new neural network architecture, called ARTMAP, that autonomously learns to classify arbitrarily many, arbitrarily ordered vectors into recognition categories based on predictive success. This supervised learning system is built up from a pair of Adaptive Resonance Theory modules (ARTa and ARTb) that are capable of self-organizing stable recognition categories in response to arbitrary sequences of input patterns. During training trials, the ARTa module receives a stream {a^(p)} of input patterns, and ARTb receives a stream {b^(p)} of input patterns, where b^(p) is the correct prediction given a^(p). These ART modules are linked by an associative learning network and an internal controller that ensures autonomous system operation in real time. During test trials, the remaining patterns a^(p) are presented without b^(p), and their predictions at ARTb are compared with b^(p). Tested on a benchmark machine learning database in both on-line and off-line simulations, the ARTMAP system learns orders of magnitude more quickly, efficiently, and accurately than alternative algorithms, and achieves 100% accuracy after training on less than half the input patterns in the database. It achieves these properties by using an internal controller that conjointly maximizes predictive generalization and minimizes predictive error by linking predictive success to category size on a trial-by-trial basis, using only local operations. This computation increases the vigilance parameter ρa of ARTa by the minimal amount needed to correct a predictive error at ARTb· Parameter ρa calibrates the minimum confidence that ARTa must have in a category, or hypothesis, activated by an input a^(p) in order for ARTa to accept that category, rather than search for a better one through an automatically controlled process of hypothesis testing. Parameter ρa is compared with the degree of match between a^(p) and the top-down learned expectation, or prototype, that is read-out subsequent to activation of an ARTa category. Search occurs if the degree of match is less than ρa. ARTMAP is hereby a type of self-organizing expert system that calibrates the selectivity of its hypotheses based upon predictive success. As a result, rare but important events can be quickly and sharply distinguished even if they are similar to frequent events with different consequences. Between input trials ρa relaxes to a baseline vigilance pa When ρa is large, the system runs in a conservative mode, wherein predictions are made only if the system is confident of the outcome. Very few false-alarm errors then occur at any stage of learning, yet the system reaches asymptote with no loss of speed. Because ARTMAP learning is self stabilizing, it can continue learning one or more databases, without degrading its corpus of memories, until its full memory capacity is utilized.