850 resultados para High-dimensional data visualization
Resumo:
Hierarchical visualization systems are desirable because a single two-dimensional visualization plot may not be sufficient to capture all of the interesting aspects of complex high-dimensional data sets. We extend an existing locally linear hierarchical visualization system PhiVis [1] in several directions: bf(1) we allow for em non-linear projection manifolds (the basic building block is the Generative Topographic Mapping -- GTM), bf(2) we introduce a general formulation of hierarchical probabilistic models consisting of local probabilistic models organized in a hierarchical tree, bf(3) we describe folding patterns of low-dimensional projection manifold in high-dimensional data space by computing and visualizing the manifold's local directional curvatures. Quantities such as magnification factors [3] and directional curvatures are helpful for understanding the layout of the nonlinear projection manifold in the data space and for further refinement of the hierarchical visualization plot. Like PhiVis, our system is statistically principled and is built interactively in a top-down fashion using the EM algorithm. We demonstrate the visualization system principle of the approach on a complex 12-dimensional data set and mention possible applications in the pharmaceutical industry.
Resumo:
Dimensionality reduction is employed for visual data analysis as a way to obtaining reduced spaces for high dimensional data or to mapping data directly into 2D or 3D spaces. Although techniques have evolved to improve data segregation on reduced or visual spaces, they have limited capabilities for adjusting the results according to user's knowledge. In this paper, we propose a novel approach to handling both dimensionality reduction and visualization of high dimensional data, taking into account user's input. It employs Partial Least Squares (PLS), a statistical tool to perform retrieval of latent spaces focusing on the discriminability of the data. The method employs a training set for building a highly precise model that can then be applied to a much larger data set very effectively. The reduced data set can be exhibited using various existing visualization techniques. The training data is important to code user's knowledge into the loop. However, this work also devises a strategy for calculating PLS reduced spaces when no training data is available. The approach produces increasingly precise visual mappings as the user feeds back his or her knowledge and is capable of working with small and unbalanced training sets.
Resumo:
The task in text retrieval is to find the subset of a collection of documents relevant to a user's information request, usually expressed as a set of words. Classically, documents and queries are represented as vectors of word counts. In its simplest form, relevance is defined to be the dot product between a document and a query vector--a measure of the number of common terms. A central difficulty in text retrieval is that the presence or absence of a word is not sufficient to determine relevance to a query. Linear dimensionality reduction has been proposed as a technique for extracting underlying structure from the document collection. In some domains (such as vision) dimensionality reduction reduces computational complexity. In text retrieval it is more often used to improve retrieval performance. We propose an alternative and novel technique that produces sparse representations constructed from sets of highly-related words. Documents and queries are represented by their distance to these sets. and relevance is measured by the number of common clusters. This technique significantly improves retrieval performance, is efficient to compute and shares properties with the optimal linear projection operator and the independent components of documents.
Resumo:
Normal mixture models are often used to cluster continuous data. However, conventional approaches for fitting these models will have problems in producing nonsingular estimates of the component-covariance matrices when the dimension of the observations is large relative to the number of observations. In this case, methods such as principal components analysis (PCA) and the mixture of factor analyzers model can be adopted to avoid these estimation problems. We examine these approaches applied to the Cabernet wine data set of Ashenfelter (1999), considering the clustering of both the wines and the judges, and comparing our results with another analysis. The mixture of factor analyzers model proves particularly effective in clustering the wines, accurately classifying many of the wines by location.
Resumo:
With recent advances in mass spectrometry techniques, it is now possible to investigate proteins over a wide range of molecular weights in small biological specimens. This advance has generated data-analytic challenges in proteomics, similar to those created by microarray technologies in genetics, namely, discovery of "signature" protein profiles specific to each pathologic state (e.g., normal vs. cancer) or differential profiles between experimental conditions (e.g., treated by a drug of interest vs. untreated) from high-dimensional data. We propose a data analytic strategy for discovering protein biomarkers based on such high-dimensional mass-spectrometry data. A real biomarker-discovery project on prostate cancer is taken as a concrete example throughout the paper: the project aims to identify proteins in serum that distinguish cancer, benign hyperplasia, and normal states of prostate using the Surface Enhanced Laser Desorption/Ionization (SELDI) technology, a recently developed mass spectrometry technique. Our data analytic strategy takes properties of the SELDI mass-spectrometer into account: the SELDI output of a specimen contains about 48,000 (x, y) points where x is the protein mass divided by the number of charges introduced by ionization and y is the protein intensity of the corresponding mass per charge value, x, in that specimen. Given high coefficients of variation and other characteristics of protein intensity measures (y values), we reduce the measures of protein intensities to a set of binary variables that indicate peaks in the y-axis direction in the nearest neighborhoods of each mass per charge point in the x-axis direction. We then account for a shifting (measurement error) problem of the x-axis in SELDI output. After these pre-analysis processing of data, we combine the binary predictors to generate classification rules for cancer, benign hyperplasia, and normal states of prostate. Our approach is to apply the boosting algorithm to select binary predictors and construct a summary classifier. We empirically evaluate sensitivity and specificity of the resulting summary classifiers with a test dataset that is independent from the training dataset used to construct the summary classifiers. The proposed method performed nearly perfectly in distinguishing cancer and benign hyperplasia from normal. In the classification of cancer vs. benign hyperplasia, however, an appreciable proportion of the benign specimens were classified incorrectly as cancer. We discuss practical issues associated with our proposed approach to the analysis of SELDI output and its application in cancer biomarker discovery.
Resumo:
Due to the rapid advances in computing and sensing technologies, enormous amounts of data are being generated everyday in various applications. The integration of data mining and data visualization has been widely used to analyze these massive and complex data sets to discover hidden patterns. For both data mining and visualization to be effective, it is important to include the visualization techniques in the mining process and to generate the discovered patterns for a more comprehensive visual view. In this dissertation, four related problems: dimensionality reduction for visualizing high dimensional datasets, visualization-based clustering evaluation, interactive document mining, and multiple clusterings exploration are studied to explore the integration of data mining and data visualization. In particular, we 1) propose an efficient feature selection method (reliefF + mRMR) for preprocessing high dimensional datasets; 2) present DClusterE to integrate cluster validation with user interaction and provide rich visualization tools for users to examine document clustering results from multiple perspectives; 3) design two interactive document summarization systems to involve users efforts and generate customized summaries from 2D sentence layouts; and 4) propose a new framework which organizes the different input clusterings into a hierarchical tree structure and allows for interactive exploration of multiple clustering solutions.
Resumo:
Emerging high-dimensional data mining applications needs to find interesting clusters embeded in arbitrarily aligned subspaces of lower dimensionality. It is difficult to cluster high-dimensional data objects, when they are sparse and skewed. Updations are quite common in dynamic databases and they are usually processed in batch mode. In very large dynamic databases, it is necessary to perform incremental cluster analysis only to the updations. We present a incremental clustering algorithm for subspace clustering in very high dimensions, which handles both insertion and deletions of datapoints to the backend databases.
Resumo:
In Information Visualization, adding and removing data elements can strongly impact the underlying visual space. We have developed an inherently incremental technique (incBoard) that maintains a coherent disposition of elements from a dynamic multidimensional data set on a 2D grid as the set changes. Here, we introduce a novel layout that uses pairwise similarity from grid neighbors, as defined in incBoard, to reposition elements on the visual space, free from constraints imposed by the grid. The board continues to be updated and can be displayed alongside the new space. As similar items are placed together, while dissimilar neighbors are moved apart, it supports users in the identification of clusters and subsets of related elements. Densely populated areas identified in the incSpace can be efficiently explored with the corresponding incBoard visualization, which is not susceptible to occlusion. The solution remains inherently incremental and maintains a coherent disposition of elements, even for fully renewed sets. The algorithm considers relative positions for the initial placement of elements, and raw dissimilarity to fine tune the visualization. It has low computational cost, with complexity depending only on the size of the currently viewed subset, V. Thus, a data set of size N can be sequentially displayed in O(N) time, reaching O(N (2)) only if the complete set is simultaneously displayed.
Resumo:
Indexing high dimensional datasets has attracted extensive attention from many researchers in the last decade. Since R-tree type of index structures are known as suffering curse of dimensionality problems, Pyramid-tree type of index structures, which are based on the B-tree, have been proposed to break the curse of dimensionality. However, for high dimensional data, the number of pyramids is often insufficient to discriminate data points when the number of dimensions is high. Its effectiveness degrades dramatically with the increase of dimensionality. In this paper, we focus on one particular issue of curse of dimensionality; that is, the surface of a hypercube in a high dimensional space approaches 100% of the total hypercube volume when the number of dimensions approaches infinite. We propose a new indexing method based on the surface of dimensionality. We prove that the Pyramid tree technology is a special case of our method. The results of our experiments demonstrate clear priority of our novel method.
Resumo:
Thesis (Ph.D.)--University of Washington, 2016-08
Resumo:
The Self-OrganizingMap (SOM) is a neural network model that performs an ordered projection of a high dimensional input space in a low-dimensional topological structure. The process in which such mapping is formed is defined by the SOM algorithm, which is a competitive, unsupervised and nonparametric method, since it does not make any assumption about the input data distribution. The feature maps provided by this algorithm have been successfully applied for vector quantization, clustering and high dimensional data visualization processes. However, the initialization of the network topology and the selection of the SOM training parameters are two difficult tasks caused by the unknown distribution of the input signals. A misconfiguration of these parameters can generate a feature map of low-quality, so it is necessary to have some measure of the degree of adaptation of the SOM network to the input data model. The topologypreservation is the most common concept used to implement this measure. Several qualitative and quantitative methods have been proposed for measuring the degree of SOM topologypreservation, particularly using Kohonen's model. In this work, two methods for measuring the topologypreservation of the Growing Cell Structures (GCSs) model are proposed: the topographic function and the topology preserving map
Resumo:
Chow and Liu introduced an algorithm for fitting a multivariate distribution with a tree (i.e. a density model that assumes that there are only pairwise dependencies between variables) and that the graph of these dependencies is a spanning tree. The original algorithm is quadratic in the dimesion of the domain, and linear in the number of data points that define the target distribution $P$. This paper shows that for sparse, discrete data, fitting a tree distribution can be done in time and memory that is jointly subquadratic in the number of variables and the size of the data set. The new algorithm, called the acCL algorithm, takes advantage of the sparsity of the data to accelerate the computation of pairwise marginals and the sorting of the resulting mutual informations, achieving speed ups of up to 2-3 orders of magnitude in the experiments.