957 resultados para Points distribution in high dimensional space
Resumo:
We consider the problem of variable selection in regression modeling in high-dimensional spaces where there is known structure among the covariates. This is an unconventional variable selection problem for two reasons: (1) The dimension of the covariate space is comparable, and often much larger, than the number of subjects in the study, and (2) the covariate space is highly structured, and in some cases it is desirable to incorporate this structural information in to the model building process. We approach this problem through the Bayesian variable selection framework, where we assume that the covariates lie on an undirected graph and formulate an Ising prior on the model space for incorporating structural information. Certain computational and statistical problems arise that are unique to such high-dimensional, structured settings, the most interesting being the phenomenon of phase transitions. We propose theoretical and computational schemes to mitigate these problems. We illustrate our methods on two different graph structures: the linear chain and the regular graph of degree k. Finally, we use our methods to study a specific application in genomics: the modeling of transcription factor binding sites in DNA sequences. © 2010 American Statistical Association.
Resumo:
In general, particle filters need large numbers of model runs in order to avoid filter degeneracy in high-dimensional systems. The recently proposed, fully nonlinear equivalent-weights particle filter overcomes this requirement by replacing the standard model transition density with two different proposal transition densities. The first proposal density is used to relax all particles towards the high-probability regions of state space as defined by the observations. The crucial second proposal density is then used to ensure that the majority of particles have equivalent weights at observation time. Here, the performance of the scheme in a high, 65 500 dimensional, simplified ocean model is explored. The success of the equivalent-weights particle filter in matching the true model state is shown using the mean of just 32 particles in twin experiments. It is of particular significance that this remains true even as the number and spatial variability of the observations are changed. The results from rank histograms are less easy to interpret and can be influenced considerably by the parameter values used. This article also explores the sensitivity of the performance of the scheme to the chosen parameter values and the effect of using different model error parameters in the truth compared with the ensemble model runs.
Resumo:
Let P be a probability distribution on q -dimensional space. The so-called Diaconis-Freedman effect means that for a fixed dimension d<dimensional projections of P look like a scale mixture of spherically symmetric Gaussian distributions. The present paper provides necessary and sufficient conditions for this phenomenon in a suitable asymptotic framework with increasing dimension q . It turns out, that the conditions formulated by Diaconis and Freedman (1984) are not only sufficient but necessary as well. Moreover, letting P ^ be the empirical distribution of n independent random vectors with distribution P , we investigate the behavior of the empirical process n √ (P ^ −P) under random projections, conditional on P ^ .
Resumo:
Indexing high dimensional datasets has attracted extensive attention from many researchers in the last decade. Since R-tree type of index structures are known as suffering curse of dimensionality problems, Pyramid-tree type of index structures, which are based on the B-tree, have been proposed to break the curse of dimensionality. However, for high dimensional data, the number of pyramids is often insufficient to discriminate data points when the number of dimensions is high. Its effectiveness degrades dramatically with the increase of dimensionality. In this paper, we focus on one particular issue of curse of dimensionality; that is, the surface of a hypercube in a high dimensional space approaches 100% of the total hypercube volume when the number of dimensions approaches infinite. We propose a new indexing method based on the surface of dimensionality. We prove that the Pyramid tree technology is a special case of our method. The results of our experiments demonstrate clear priority of our novel method.
Resumo:
We present a test for identifying clusters in high dimensional data based on the k-means algorithm when the null hypothesis is spherical normal. We show that projection techniques used for evaluating validity of clusters may be misleading for such data. In particular, we demonstrate that increasingly well-separated clusters are identified as the dimensionality increases, when no such clusters exist. Furthermore, in a case of true bimodality, increasing the dimensionality makes identifying the correct clusters more difficult. In addition to the original conservative test, we propose a practical test with the same asymptotic behavior that performs well for a moderate number of points and moderate dimensionality. ACM Computing Classification System (1998): I.5.3.
Resumo:
Data structures such as k-D trees and hierarchical k-means trees perform very well in approximate k nearest neighbour matching, but are only marginally more effective than linear search when performing exact matching in high-dimensional image descriptor data. This paper presents several improvements to linear search that allows it to outperform existing methods and recommends two approaches to exact matching. The first method reduces the number of operations by evaluating the distance measure in order of significance of the query dimensions and terminating when the partial distance exceeds the search threshold. This method does not require preprocessing and significantly outperforms existing methods. The second method improves query speed further by presorting the data using a data structure called d-D sort. The order information is used as a priority queue to reduce the time taken to find the exact match and to restrict the range of data searched. Construction of the d-D sort structure is very simple to implement, does not require any parameter tuning, and requires significantly less time than the best-performing tree structure, and data can be added to the structure relatively efficiently.
Resumo:
In this paper, we propose a novel high-dimensional index method, the BM+-tree, to support efficient processing of similarity search queries in high-dimensional spaces. The main idea of the proposed index is to improve data partitioning efficiency in a high-dimensional space by using a rotary binary hyperplane, which further partitions a subspace and can also take advantage of the twin node concept used in the M+-tree. Compared with the key dimension concept in the M+-tree, the binary hyperplane is more effective in data filtering. High space utilization is achieved by dynamically performing data reallocation between twin nodes. In addition, a post processing step is used after index building to ensure effective filtration. Experimental results using two types of real data sets illustrate a significantly improved filtering efficiency.
Resumo:
Hierarchical visualization systems are desirable because a single two-dimensional visualization plot may not be sufficient to capture all of the interesting aspects of complex high-dimensional data sets. We extend an existing locally linear hierarchical visualization system PhiVis [1] in several directions: bf(1) we allow for em non-linear projection manifolds (the basic building block is the Generative Topographic Mapping -- GTM), bf(2) we introduce a general formulation of hierarchical probabilistic models consisting of local probabilistic models organized in a hierarchical tree, bf(3) we describe folding patterns of low-dimensional projection manifold in high-dimensional data space by computing and visualizing the manifold's local directional curvatures. Quantities such as magnification factors [3] and directional curvatures are helpful for understanding the layout of the nonlinear projection manifold in the data space and for further refinement of the hierarchical visualization plot. Like PhiVis, our system is statistically principled and is built interactively in a top-down fashion using the EM algorithm. We demonstrate the visualization system principle of the approach on a complex 12-dimensional data set and mention possible applications in the pharmaceutical industry.
Resumo:
Document clustering is one of the prominent methods for mining important information from the vast amount of data available on the web. However, document clustering generally suffers from the curse of dimensionality. Providentially in high dimensional space, data points tend to be more concentrated in some areas of clusters. We take advantage of this phenomenon by introducing a novel concept of dynamic cluster representation named as loci. Clusters’ loci are efficiently calculated using documents’ ranking scores generated from a search engine. We propose a fast loci-based semi-supervised document clustering algorithm that uses clusters’ loci instead of conventional centroids for assigning documents to clusters. Empirical analysis on real-world datasets shows that the proposed method produces cluster solutions with promising quality and is substantially faster than several benchmarked centroid-based semi-supervised document clustering methods.
Resumo:
The mandarin keyword spotting system was investigated, and a new approach was proposed based on the principle of homology continuity and point location analysis in high-dimensional space geometry theory which are both parts of biomimetic pattern recognition theory. This approach constructed a hyper-polyhedron with sample points in the training set and calculated the distance between each test point and the hyper-polyhedron. The classification resulted from the value of those distances. The approach was tested by a speech database which was created by ourselves. The performance was compared with the classic HMM approach and the results show that the new approach is much better than HMM approach when the training data is not sufficient.
Resumo:
Digitization is the main feature of modern Information Science. Conjoining the digits and the coordinates, the relation between Information Science and high-dimensional space is consanguineous, and the information issues are transformed to the geometry problems in some high-dimensional spaces. From this basic idea, we propose Computational Information Geometry (CIG) to make information analysis and processing. Two kinds of applications of CIG are given, which are blurred image restoration and pattern recognition. Experimental results are satisfying. And in this paper, how to combine with groups of simple operators in some 2D planes to implement the geometrical computations in high-dimensional space is also introduced. Lots of the algorithms have been realized using software.
Resumo:
Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)
Resumo:
The three-dimensional spatial distribution of Al in the high-k metal gates of metal-oxide-semiconductor field-effect-transistors is measured by atom probe tomography. Chemical distribution is correlated with the transistor voltage threshold (VTH) shift generated by the introduction of a metallic Al layer in the metal gate. After a 1050 °C annealing, it is shown that a 2-Å thick Al layer completely diffuses into oxide layers, while a positive VTH shift is measured. On the contrary, for thicker Al layers, Al precipitation in the metal gate stack is observed and the VTH shift becomes negative. © 2012 American Institute of Physics.
Resumo:
A question central to modelling and, ultimately, managing food webs concerns the dimensionality of trophic niche space, that is, the number of independent traits relevant for determining consumer-resource links. Food-web topologies can often be interpreted by assuming resource traits to be specified by points along a line and each consumer's diet to be given by resources contained in an interval on this line. This phenomenon, called intervality, has been known for 30 years and is widely acknowledged to indicate that trophic niche space is close to one-dimensional. We show that the degrees of intervality observed in nature can be reproduced in arbitrary-dimensional trophic niche spaces, provided that the processes of evolutionary diversification and adaptation are taken into account. Contrary to expectations, intervality is least pronounced at intermediate dimensions and steadily improves towards lower- and higher-dimensional trophic niche spaces.
Resumo:
A framework that connects computational mechanics and molecular dynamics has been developed and described. As the key parts of the framework, the problem of symbolising molecular trajectory and the associated interrelation between microscopic phase space variables and macroscopic observables of the molecular system are considered. Following Shalizi and Moore, it is shown that causal states, the constituent parts of the main construct of computational mechanics, the e-machine, define areas of the phase space that are optimal in the sense of transferring information from the micro-variables to the macro-observables. We have demonstrated that, based on the decay of their Poincare´ return times, these areas can be divided into two classes that characterise the separation of the phase space into resonant and chaotic areas. The first class is characterised by predominantly short time returns, typical to quasi-periodic or periodic trajectories. This class includes a countable number of areas corresponding to resonances. The second class includes trajectories with chaotic behaviour characterised by the exponential decay of return times in accordance with the Poincare´ theorem.