106 resultados para High dimensional
em Indian Institute of Science - Bangalore - Índia
Resumo:
Emerging high-dimensional data mining applications needs to find interesting clusters embeded in arbitrarily aligned subspaces of lower dimensionality. It is difficult to cluster high-dimensional data objects, when they are sparse and skewed. Updations are quite common in dynamic databases and they are usually processed in batch mode. In very large dynamic databases, it is necessary to perform incremental cluster analysis only to the updations. We present a incremental clustering algorithm for subspace clustering in very high dimensions, which handles both insertion and deletions of datapoints to the backend databases.
Operator-splitting finite element algorithms for computations of high-dimensional parabolic problems
Resumo:
An operator-splitting finite element method for solving high-dimensional parabolic equations is presented. The stability and the error estimates are derived for the proposed numerical scheme. Furthermore, two variants of fully-practical operator-splitting finite element algorithms based on the quadrature points and the nodal points, respectively, are presented. Both the quadrature and the nodal point based operator-splitting algorithms are validated using a three-dimensional (3D) test problem. The numerical results obtained with the full 3D computations and the operator-split 2D + 1D computations are found to be in a good agreement with the analytical solution. Further, the optimal order of convergence is obtained in both variants of the operator-splitting algorithms. (C) 2012 Elsevier Inc. All rights reserved.
Resumo:
Experimental characterization of high dimensional dynamic systems sometimes uses the proper orthogonal decomposition (POD). If there are many measurement locations and relatively fewer sensors, then steady-state behavior can still be studied by sequentially taking several sets of simultaneous measurements. The number required of such sets of measurements can be minimized if we solve a combinatorial optimization problem. We aim to bring this problem to the attention of engineering audiences, summarize some known mathematical results about this problem, and present a heuristic (suboptimal) calculation that gives reasonable, if not stellar, results.
Resumo:
Background: A genetic network can be represented as a directed graph in which a node corresponds to a gene and a directed edge specifies the direction of influence of one gene on another. The reconstruction of such networks from transcript profiling data remains an important yet challenging endeavor. A transcript profile specifies the abundances of many genes in a biological sample of interest. Prevailing strategies for learning the structure of a genetic network from high-dimensional transcript profiling data assume sparsity and linearity. Many methods consider relatively small directed graphs, inferring graphs with up to a few hundred nodes. This work examines large undirected graphs representations of genetic networks, graphs with many thousands of nodes where an undirected edge between two nodes does not indicate the direction of influence, and the problem of estimating the structure of such a sparse linear genetic network (SLGN) from transcript profiling data. Results: The structure learning task is cast as a sparse linear regression problem which is then posed as a LASSO (l1-constrained fitting) problem and solved finally by formulating a Linear Program (LP). A bound on the Generalization Error of this approach is given in terms of the Leave-One-Out Error. The accuracy and utility of LP-SLGNs is assessed quantitatively and qualitatively using simulated and real data. The Dialogue for Reverse Engineering Assessments and Methods (DREAM) initiative provides gold standard data sets and evaluation metrics that enable and facilitate the comparison of algorithms for deducing the structure of networks. The structures of LP-SLGNs estimated from the INSILICO1, INSILICO2 and INSILICO3 simulated DREAM2 data sets are comparable to those proposed by the first and/or second ranked teams in the DREAM2 competition. The structures of LP-SLGNs estimated from two published Saccharomyces cerevisae cell cycle transcript profiling data sets capture known regulatory associations. In each S. cerevisiae LP-SLGN, the number of nodes with a particular degree follows an approximate power law suggesting that its degree distributions is similar to that observed in real-world networks. Inspection of these LP-SLGNs suggests biological hypotheses amenable to experimental verification. Conclusion: A statistically robust and computationally efficient LP-based method for estimating the topology of a large sparse undirected graph from high-dimensional data yields representations of genetic networks that are biologically plausible and useful abstractions of the structures of real genetic networks. Analysis of the statistical and topological properties of learned LP-SLGNs may have practical value; for example, genes with high random walk betweenness, a measure of the centrality of a node in a graph, are good candidates for intervention studies and hence integrated computational – experimental investigations designed to infer more realistic and sophisticated probabilistic directed graphical model representations of genetic networks. The LP-based solutions of the sparse linear regression problem described here may provide a method for learning the structure of transcription factor networks from transcript profiling and transcription factor binding motif data.
Resumo:
The development of techniques for scaling up classifiers so that they can be applied to problems with large datasets of training examples is one of the objectives of data mining. Recently, AdaBoost has become popular among machine learning community thanks to its promising results across a variety of applications. However, training AdaBoost on large datasets is a major problem, especially when the dimensionality of the data is very high. This paper discusses the effect of high dimensionality on the training process of AdaBoost. Two preprocessing options to reduce dimensionality, namely the principal component analysis and random projection are briefly examined. Random projection subject to a probabilistic length preserving transformation is explored further as a computationally light preprocessing step. The experimental results obtained demonstrate the effectiveness of the proposed training process for handling high dimensional large datasets.
Resumo:
Downscaling to station-scale hydrologic variables from large-scale atmospheric variables simulated by general circulation models (GCMs) is usually necessary to assess the hydrologic impact of climate change. This work presents CRF-downscaling, a new probabilistic downscaling method that represents the daily precipitation sequence as a conditional random field (CRF). The conditional distribution of the precipitation sequence at a site, given the daily atmospheric (large-scale) variable sequence, is modeled as a linear chain CRF. CRFs do not make assumptions on independence of observations, which gives them flexibility in using high-dimensional feature vectors. Maximum likelihood parameter estimation for the model is performed using limited memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) optimization. Maximum a posteriori estimation is used to determine the most likely precipitation sequence for a given set of atmospheric input variables using the Viterbi algorithm. Direct classification of dry/wet days as well as precipitation amount is achieved within a single modeling framework. The model is used to project the future cumulative distribution function of precipitation. Uncertainty in precipitation prediction is addressed through a modified Viterbi algorithm that predicts the n most likely sequences. The model is applied for downscaling monsoon (June-September) daily precipitation at eight sites in the Mahanadi basin in Orissa, India, using the MIROC3.2 medium-resolution GCM. The predicted distributions at all sites show an increase in the number of wet days, and also an increase in wet day precipitation amounts. A comparison of current and future predicted probability density functions for daily precipitation shows a change in shape of the density function with decreasing probability of lower precipitation and increasing probability of higher precipitation.
Resumo:
We show that the extended Ananthakrishna's model exhibits all the features of the Portevin - Le Chatelier effect including the three types of bands. The model reproduces the recently observed crossover from a low dimensional chaotic state at low and medium strain rates to a high dimensional power law state of stress drops at high strain rates. The dynamics of crossover is elucidated through a study of the Lyapunov spectrum.
Resumo:
We propose, for the first time, a reinforcement learning (RL) algorithm with function approximation for traffic signal control. Our algorithm incorporates state-action features and is easily implementable in high-dimensional settings. Prior work, e. g., the work of Abdulhai et al., on the application of RL to traffic signal control requires full-state representations and cannot be implemented, even in moderate-sized road networks, because the computational complexity exponentially grows in the numbers of lanes and junctions. We tackle this problem of the curse of dimensionality by effectively using feature-based state representations that use a broad characterization of the level of congestion as low, medium, or high. One advantage of our algorithm is that, unlike prior work based on RL, it does not require precise information on queue lengths and elapsed times at each lane but instead works with the aforementioned described features. The number of features that our algorithm requires is linear to the number of signaled lanes, thereby leading to several orders of magnitude reduction in the computational complexity. We perform implementations of our algorithm on various settings and show performance comparisons with other algorithms in the literature, including the works of Abdulhai et al. and Cools et al., as well as the fixed-timing and the longest queue algorithms. For comparison, we also develop an RL algorithm that uses full-state representation and incorporates prioritization of traffic, unlike the work of Abdulhai et al. We observe that our algorithm outperforms all the other algorithms on all the road network settings that we consider.
Resumo:
Given a parametrized n-dimensional SQL query template and a choice of query optimizer, a plan diagram is a color-coded pictorial enumeration of the execution plan choices of the optimizer over the query parameter space. These diagrams have proved to be a powerful metaphor for the analysis and redesign of modern optimizers, and are gaining currency in diverse industrial and academic institutions. However, their utility is adversely impacted by the impractically large computational overheads incurred when standard brute-force exhaustive approaches are used for producing fine-grained diagrams on high-dimensional query templates. In this paper, we investigate strategies for efficiently producing close approximations to complex plan diagrams. Our techniques are customized to the features available in the optimizer's API, ranging from the generic optimizers that provide only the optimal plan for a query, to those that also support costing of sub-optimal plans and enumerating rank-ordered lists of plans. The techniques collectively feature both random and grid sampling, as well as inference techniques based on nearest-neighbor classifiers, parametric query optimization and plan cost monotonicity. Extensive experimentation with a representative set of TPC-H and TPC-DS-based query templates on industrial-strength optimizers indicates that our techniques are capable of delivering 90% accurate diagrams while incurring less than 15% of the computational overheads of the exhaustive approach. In fact, for full-featured optimizers, we can guarantee zero error with less than 10% overheads. These approximation techniques have been implemented in the publicly available Picasso optimizer visualization tool.
Resumo:
Applications in various domains often lead to very large and frequently high-dimensional data. Successful algorithms must avoid the curse of dimensionality but at the same time should be computationally efficient. Finding useful patterns in large datasets has attracted considerable interest recently. The primary goal of the paper is to implement an efficient Hybrid Tree based clustering method based on CF-Tree and KD-Tree, and combine the clustering methods with KNN-Classification. The implementation of the algorithm involves many issues like good accuracy, less space and less time. We will evaluate the time and space efficiency, data input order sensitivity, and clustering quality through several experiments.
Resumo:
We present a heterogeneous finite element method for the solution of a high-dimensional population balance equation, which depends both the physical and the internal property coordinates. The proposed scheme tackles the two main difficulties in the finite element solution of population balance equation: (i) spatial discretization with the standard finite elements, when the dimension of the equation is more than three, (ii) spurious oscillations in the solution induced by standard Galerkin approximation due to pure advection in the internal property coordinates. The key idea is to split the high-dimensional population balance equation into two low-dimensional equations, and discretize the low-dimensional equations separately. In the proposed splitting scheme, the shape of the physical domain can be arbitrary, and different discretizations can be applied to the low-dimensional equations. In particular, we discretize the physical and internal spaces with the standard Galerkin and Streamline Upwind Petrov Galerkin (SUPG) finite elements, respectively. The stability and error estimates of the Galerkin/SUPG finite element discretization of the population balance equation are derived. It is shown that a slightly more regularity, i.e. the mixed partial derivatives of the solution has to be bounded, is necessary for the optimal order of convergence. Numerical results are presented to support the analysis.
Resumo:
Structural Support Vector Machines (SSVMs) have recently gained wide prominence in classifying structured and complex objects like parse-trees, image segments and Part-of-Speech (POS) tags. Typical learning algorithms used in training SSVMs result in model parameters which are vectors residing in a large-dimensional feature space. Such a high-dimensional model parameter vector contains many non-zero components which often lead to slow prediction and storage issues. Hence there is a need for sparse parameter vectors which contain a very small number of non-zero components. L1-regularizer and elastic net regularizer have been traditionally used to get sparse model parameters. Though L1-regularized structural SVMs have been studied in the past, the use of elastic net regularizer for structural SVMs has not been explored yet. In this work, we formulate the elastic net SSVM and propose a sequential alternating proximal algorithm to solve the dual formulation. We compare the proposed method with existing methods for L1-regularized Structural SVMs. Experiments on large-scale benchmark datasets show that the proposed dual elastic net SSVM trained using the sequential alternating proximal algorithm scales well and results in highly sparse model parameters while achieving a comparable generalization performance. Hence the proposed sequential alternating proximal algorithm is a competitive method to achieve sparse model parameters and a comparable generalization performance when elastic net regularized Structural SVMs are used on very large datasets.
Resumo:
In this paper, we present a methodology for identifying best features from a large feature space. In high dimensional feature space nearest neighbor search is meaningless. In this feature space we see quality and performance issue with nearest neighbor search. Many data mining algorithms use nearest neighbor search. So instead of doing nearest neighbor search using all the features we need to select relevant features. We propose feature selection using Non-negative Matrix Factorization(NMF) and its application to nearest neighbor search. Recent clustering algorithm based on Locally Consistent Concept Factorization(LCCF) shows better quality of document clustering by using local geometrical and discriminating structure of the data. By using our feature selection method we have shown further improvement of performance in the clustering.
Resumo:
When document corpus is very large, we often need to reduce the number of features. But it is not possible to apply conventional Non-negative Matrix Factorization(NMF) on billion by million matrix as the matrix may not fit in memory. Here we present novel Online NMF algorithm. Using Online NMF, we reduced original high-dimensional space to low-dimensional space. Then we cluster all the documents in reduced dimension using k-means algorithm. We experimentally show that by processing small subsets of documents we will be able to achieve good performance. The method proposed outperforms existing algorithms.
Resumo:
Outlier detection in high dimensional categorical data has been a problem of much interest due to the extensive use of qualitative features for describing the data across various application areas. Though there exist various established methods for dealing with the dimensionality aspect through feature selection on numerical data, the categorical domain is actively being explored. As outlier detection is generally considered as an unsupervised learning problem due to lack of knowledge about the nature of various types of outliers, the related feature selection task also needs to be handled in a similar manner. This motivates the need to develop an unsupervised feature selection algorithm for efficient detection of outliers in categorical data. Addressing this aspect, we propose a novel feature selection algorithm based on the mutual information measure and the entropy computation. The redundancy among the features is characterized using the mutual information measure for identifying a suitable feature subset with less redundancy. The performance of the proposed algorithm in comparison with the information gain based feature selection shows its effectiveness for outlier detection. The efficacy of the proposed algorithm is demonstrated on various high-dimensional benchmark data sets employing two existing outlier detection methods.