866 resultados para High-Dimensional Space Geometrical Informatics (HDSGI)
Resumo:
Anomalies are unusual and significant changes in a network's traffic levels, which can often involve multiple links. Diagnosing anomalies is critical for both network operators and end users. It is a difficult problem because one must extract and interpret anomalous patterns from large amounts of high-dimensional, noisy data. In this paper we propose a general method to diagnose anomalies. This method is based on a separation of the high-dimensional space occupied by a set of network traffic measurements into disjoint subspaces corresponding to normal and anomalous network conditions. We show that this separation can be performed effectively using Principal Component Analysis. Using only simple traffic measurements from links, we study volume anomalies and show that the method can: (1) accurately detect when a volume anomaly is occurring; (2) correctly identify the underlying origin-destination (OD) flow which is the source of the anomaly; and (3) accurately estimate the amount of traffic involved in the anomalous OD flow. We evaluate the method's ability to diagnose (i.e., detect, identify, and quantify) both existing and synthetically injected volume anomalies in real traffic from two backbone networks. Our method consistently diagnoses the largest volume anomalies, and does so with a very low false alarm rate.
Resumo:
Vector space models (VSMs) represent word meanings as points in a high dimensional space. VSMs are typically created using a large text corpora, and so represent word semantics as observed in text. We present a new algorithm (JNNSE) that can incorporate a measure of semantics not previously used to create VSMs: brain activation data recorded while people read words. The resulting model takes advantage of the complementary strengths and weaknesses of corpus and brain activation data to give a more complete representation of semantics. Evaluations show that the model 1) matches a behavioral measure of semantics more closely, 2) can be used to predict corpus data for unseen words and 3) has predictive power that generalizes across brain imaging technologies and across subjects. We believe that the model is thus a more faithful representation of mental vocabularies.
Resumo:
Cost-effective semantic description and annotation of shared knowledge resources has always been of great importance for digital libraries and large scale information systems in general. With the emergence of the Social Web and Web 2.0 technologies, a more effective semantic description and annotation, e.g., folksonomies, of digital library contents is envisioned to take place in collaborative and personalised environments. However, there is a lack of foundation and mathematical rigour for coping with contextualised management and retrieval of semantic annotations throughout their evolution as well as diversity in users and user communities. In this paper, we propose an ontological foundation for semantic annotations of digital libraries in terms of flexonomies. The proposed theoretical model relies on a high dimensional space with algebraic operators for contextualised access of semantic tags and annotations. The set of the proposed algebraic operators, however, is an adaptation of the set theoretic operators selection, projection, difference, intersection, union in database theory. To this extent, the proposed model is meant to lay the ontological foundation for a Digital Library 2.0 project in terms of geometric spaces rather than logic (description) based formalisms as a more efficient and scalable solution to the semantic annotation problem in large scale.
Resumo:
The identification and visualization of clusters formed by motor unit action potentials (MUAPs) is an essential step in investigations seeking to explain the control of the neuromuscular system. This work introduces the generative topographic mapping (GTM), a novel machine learning tool, for clustering of MUAPs, and also it extends the GTM technique to provide a way of visualizing MUAPs. The performance of GTM was compared to that of three other clustering methods: the self-organizing map (SOM), a Gaussian mixture model (GMM), and the neural-gas network (NGN). The results, based on the study of experimental MUAPs, showed that the rate of success of both GTM and SOM outperformed that of GMM and NGN, and also that GTM may in practice be used as a principled alternative to the SOM in the study of MUAPs. A visualization tool, which we called GTM grid, was devised for visualization of MUAPs lying in a high-dimensional space. The visualization provided by the GTM grid was compared to that obtained from principal component analysis (PCA). (c) 2005 Elsevier Ireland Ltd. All rights reserved.
Resumo:
Most multidimensional projection techniques rely on distance (dissimilarity) information between data instances to embed high-dimensional data into a visual space. When data are endowed with Cartesian coordinates, an extra computational effort is necessary to compute the needed distances, making multidimensional projection prohibitive in applications dealing with interactivity and massive data. The novel multidimensional projection technique proposed in this work, called Part-Linear Multidimensional Projection (PLMP), has been tailored to handle multivariate data represented in Cartesian high-dimensional spaces, requiring only distance information between pairs of representative samples. This characteristic renders PLMP faster than previous methods when processing large data sets while still being competitive in terms of precision. Moreover, knowing the range of variation for data instances in the high-dimensional space, we can make PLMP a truly streaming data projection technique, a trait absent in previous methods.
Resumo:
In the pattern recognition research field, Support Vector Machines (SVM) have been an effectiveness tool for classification purposes, being successively employed in many applications. The SVM input data is transformed into a high dimensional space using some kernel functions where linear separation is more likely. However, there are some computational drawbacks associated to SVM. One of them is the computational burden required to find out the more adequate parameters for the kernel mapping considering each non-linearly separable input data space, which reflects the performance of SVM. This paper introduces the Polynomial Powers of Sigmoid for SVM kernel mapping, and it shows their advantages over well-known kernel functions using real and synthetic datasets.
Resumo:
In this paper, we present a novel indexing technique called Multi-scale Similarity Indexing (MSI) to index image's multi-features into a single one-dimensional structure. Both for text and visual feature spaces, the similarity between a point and a local partition's center in individual space is used as the indexing key, where similarity values in different features are distinguished by different scale. Then a single indexing tree can be built on these keys. Based on the property that relevant images have similar similarity values from the center of the same local partition in any feature space, certain number of irrelevant images can be fast pruned based on the triangle inequity on indexing keys. To remove the dimensionality curse existing in high dimensional structure, we propose a new technique called Local Bit Stream (LBS). LBS transforms image's text and visual feature representations into simple, uniform and effective bit stream (BS) representations based on local partition's center. Such BS representations are small in size and fast for comparison since only bit operation are involved. By comparing common bits existing in two BSs, most of irrelevant images can be immediately filtered. To effectively integrate multi-features, we also investigated the following evidence combination techniques-Certainty Factor, Dempster Shafer Theory, Compound Probability, and Linear Combination. Our extensive experiment showed that single one-dimensional index on multi-features improves multi-indices on multi-features greatly. Our LBS method outperforms sequential scan on high dimensional space by an order of magnitude. And Certainty Factor and Dempster Shafer Theory perform best in combining multiple similarities from corresponding multiple features.
Resumo:
In this paper, we present a novel indexing technique called Multi-scale Similarity Indexing (MSI) to index imagersquos multi-features into a single one-dimensional structure. Both for text and visual feature spaces, the similarity between a point and a local partitionrsquos center in individual space is used as the indexing key, where similarity values in different features are distinguished by different scale. Then a single indexing tree can be built on these keys. Based on the property that relevant images haves similar similarity values from the center of the same local partition in any feature space, certain number of irrelevant images can be fast pruned based on the triangle inequity on indexing keys. To remove the ldquodimensionality curserdquo existing in high dimensional structure, we propose a new technique called Local Bit Stream (LBS). LBS transforms imagersquos text and visual feature representations into simple, uniform and effective bit stream (BS) representations based on local partitionrsquos center. Such BS representations are small in size and fast for comparison since only bit operation are involved. By comparing common bits existing in two BSs, most of irrelevant images can be immediately filtered. Our extensive experiment showed that single one-dimensional index on multi-features improves multi-indices on multi-features greatly. Our LBS method outperforms sequential scan on high dimensional space by an order of magnitude.
Resumo:
Visualization has proven to be a powerful and widely-applicable tool the analysis and interpretation of data. Most visualization algorithms aim to find a projection from the data space down to a two-dimensional visualization space. However, for complex data sets living in a high-dimensional space it is unlikely that a single two-dimensional projection can reveal all of the interesting structure. We therefore introduce a hierarchical visualization algorithm which allows the complete data set to be visualized at the top level, with clusters and sub-clusters of data points visualized at deeper levels. The algorithm is based on a hierarchical mixture of latent variable models, whose parameters are estimated using the expectation-maximization algorithm. We demonstrate the principle of the approach first on a toy data set, and then apply the algorithm to the visualization of a synthetic data set in 12 dimensions obtained from a simulation of multi-phase flows in oil pipelines and to data in 36 dimensions derived from satellite images.
Resumo:
Event extraction from texts aims to detect structured information such as what has happened, to whom, where and when. Event extraction and visualization are typically considered as two different tasks. In this paper, we propose a novel approach based on probabilistic modelling to jointly extract and visualize events from tweets where both tasks benefit from each other. We model each event as a joint distribution over named entities, a date, a location and event-related keywords. Moreover, both tweets and event instances are associated with coordinates in the visualization space. The manifold assumption that the intrinsic geometry of tweets is a low-rank, non-linear manifold within the high-dimensional space is incorporated into the learning framework using a regularization. Experimental results show that the proposed approach can effectively deal with both event extraction and visualization and performs remarkably better than both the state-of-the-art event extraction method and a pipeline approach for event extraction and visualization.
Resumo:
This paper addresses the problem of determining an optimal (shortest) path in three dimensional space for a constant speed and turn-rate constrained aerial vehicle, that would enable the vehicle to converge to a rectilinear path, starting from any arbitrary initial position and orientation. Based on 3D geometry, we propose an optimal and also a suboptimal path planning approach. Unlike the existing numerical methods which are computationally intensive, this optimal geometrical method generates an optimal solution in lesser time. The suboptimal solution approach is comparatively more efficient and gives a solution that is very close to the optimal one. Due to its simplicity and low computational requirements this approach can be implemented on an aerial vehicle with constrained turn radius to reach a straight line with a prescribed orientation as required in several applications. But, if the distance between the initial point and the straight line to be followed along the vertical axis is high, then the generated path may not be flyable for an aerial vehicle with limited range of flight path angle and we resort to a numerical method for obtaining the optimal solution. The numerical method used here for simulation is based on multiple shooting and is found to be comparatively more efficient than other methods for solving such two point boundary value problem.
Resumo:
K-Means is a popular clustering algorithm which adopts an iterative refinement procedure to determine data partitions and to compute their associated centres of mass, called centroids. The straightforward implementation of the algorithm is often referred to as `brute force' since it computes a proximity measure from each data point to each centroid at every iteration of the K-Means process. Efficient implementations of the K-Means algorithm have been predominantly based on multi-dimensional binary search trees (KD-Trees). A combination of an efficient data structure and geometrical constraints allow to reduce the number of distance computations required at each iteration. In this work we present a general space partitioning approach for improving the efficiency and the scalability of the K-Means algorithm. We propose to adopt approximate hierarchical clustering methods to generate binary space partitioning trees in contrast to KD-Trees. In the experimental analysis, we have tested the performance of the proposed Binary Space Partitioning K-Means (BSP-KM) when a divisive clustering algorithm is used. We have carried out extensive experimental tests to compare the proposed approach to the one based on KD-Trees (KD-KM) in a wide range of the parameters space. BSP-KM is more scalable than KDKM, while keeping the deterministic nature of the `brute force' algorithm. In particular, the proposed space partitioning approach has shown to overcome the well-known limitation of KD-Trees in high-dimensional spaces and can also be adopted to improve the efficiency of other algorithms in which KD-Trees have been used.
Resumo:
We present a test for identifying clusters in high dimensional data based on the k-means algorithm when the null hypothesis is spherical normal. We show that projection techniques used for evaluating validity of clusters may be misleading for such data. In particular, we demonstrate that increasingly well-separated clusters are identified as the dimensionality increases, when no such clusters exist. Furthermore, in a case of true bimodality, increasing the dimensionality makes identifying the correct clusters more difficult. In addition to the original conservative test, we propose a practical test with the same asymptotic behavior that performs well for a moderate number of points and moderate dimensionality. ACM Computing Classification System (1998): I.5.3.
Resumo:
This thesis is concerned with change point analysis for time series, i.e. with detection of structural breaks in time-ordered, random data. This long-standing research field regained popularity over the last few years and is still undergoing, as statistical analysis in general, a transformation to high-dimensional problems. We focus on the fundamental »change in the mean« problem and provide extensions of the classical non-parametric Darling-Erdős-type cumulative sum (CUSUM) testing and estimation theory within highdimensional Hilbert space settings. In the first part we contribute to (long run) principal component based testing methods for Hilbert space valued time series under a rather broad (abrupt, epidemic, gradual, multiple) change setting and under dependence. For the dependence structure we consider either traditional m-dependence assumptions or more recently developed m-approximability conditions which cover, e.g., MA, AR and ARCH models. We derive Gumbel and Brownian bridge type approximations of the distribution of the test statistic under the null hypothesis of no change and consistency conditions under the alternative. A new formulation of the test statistic using projections on subspaces allows us to simplify the standard proof techniques and to weaken common assumptions on the covariance structure. Furthermore, we propose to adjust the principal components by an implicit estimation of a (possible) change direction. This approach adds flexibility to projection based methods, weakens typical technical conditions and provides better consistency properties under the alternative. In the second part we contribute to estimation methods for common changes in the means of panels of Hilbert space valued time series. We analyze weighted CUSUM estimates within a recently proposed »high-dimensional low sample size (HDLSS)« framework, where the sample size is fixed but the number of panels increases. We derive sharp conditions on »pointwise asymptotic accuracy« or »uniform asymptotic accuracy« of those estimates in terms of the weighting function. Particularly, we prove that a covariance-based correction of Darling-Erdős-type CUSUM estimates is required to guarantee uniform asymptotic accuracy under moderate dependence conditions within panels and that these conditions are fulfilled, e.g., by any MA(1) time series. As a counterexample we show that for AR(1) time series, close to the non-stationary case, the dependence is too strong and uniform asymptotic accuracy cannot be ensured. Finally, we conduct simulations to demonstrate that our results are practically applicable and that our methodological suggestions are advantageous.
Resumo:
In computational linguistics, information retrieval and applied cognition, words and concepts are often represented as vectors in high dimensional spaces computed from a corpus of text. These high dimensional spaces are often referred to as Semantic Spaces. We describe a novel and efficient approach to computing these semantic spaces via the use of complex valued vector representations. We report on the practical implementation of the proposed method and some associated experiments. We also briefly discuss how the proposed system relates to previous theoretical work in Information Retrieval and Quantum Mechanics and how the notions of probability, logic and geometry are integrated within a single Hilbert space representation. In this sense the proposed system has more general application and gives rise to a variety of opportunities for future research.