982 resultados para Dimension reduction
Resumo:
The aim of this paper is to provide a comparison of various algorithms and parameters to build reduced semantic spaces. The effect of dimension reduction, the stability of the representation and the effect of word order are examined in the context of the five algorithms bearing on semantic vectors: Random projection (RP), singular value decom- position (SVD), non-negative matrix factorization (NMF), permutations and holographic reduced representations (HRR). The quality of semantic representation was tested by means of synonym finding task using the TOEFL test on the TASA corpus. Dimension reduction was found to improve the quality of semantic representation but it is hard to find the optimal parameter settings. Even though dimension reduction by RP was found to be more generally applicable than SVD, the semantic vectors produced by RP are somewhat unstable. The effect of encoding word order into the semantic vector representation via HRR did not lead to any increase in scores over vectors constructed from word co-occurrence in context information. In this regard, very small context windows resulted in better semantic vectors for the TOEFL test.
Resumo:
Ce mémoire de maîtrise présente une nouvelle approche non supervisée pour détecter et segmenter les régions urbaines dans les images hyperspectrales. La méthode proposée n ́ecessite trois étapes. Tout d’abord, afin de réduire le coût calculatoire de notre algorithme, une image couleur du contenu spectral est estimée. A cette fin, une étape de réduction de dimensionalité non-linéaire, basée sur deux critères complémentaires mais contradictoires de bonne visualisation; à savoir la précision et le contraste, est réalisée pour l’affichage couleur de chaque image hyperspectrale. Ensuite, pour discriminer les régions urbaines des régions non urbaines, la seconde étape consiste à extraire quelques caractéristiques discriminantes (et complémentaires) sur cette image hyperspectrale couleur. A cette fin, nous avons extrait une série de paramètres discriminants pour décrire les caractéristiques d’une zone urbaine, principalement composée d’objets manufacturés de formes simples g ́eométriques et régulières. Nous avons utilisé des caractéristiques texturales basées sur les niveaux de gris, la magnitude du gradient ou des paramètres issus de la matrice de co-occurrence combinés avec des caractéristiques structurelles basées sur l’orientation locale du gradient de l’image et la détection locale de segments de droites. Afin de réduire encore la complexité de calcul de notre approche et éviter le problème de la ”malédiction de la dimensionnalité” quand on décide de regrouper des données de dimensions élevées, nous avons décidé de classifier individuellement, dans la dernière étape, chaque caractéristique texturale ou structurelle avec une simple procédure de K-moyennes et ensuite de combiner ces segmentations grossières, obtenues à faible coût, avec un modèle efficace de fusion de cartes de segmentations. Les expérimentations données dans ce rapport montrent que cette stratégie est efficace visuellement et se compare favorablement aux autres méthodes de détection et segmentation de zones urbaines à partir d’images hyperspectrales.
Resumo:
Approximate Bayesian computation (ABC) methods make use of comparisons between simulated and observed summary statistics to overcome the problem of computationally intractable likelihood functions. As the practical implementation of ABC requires computations based on vectors of summary statistics, rather than full data sets, a central question is how to derive low-dimensional summary statistics from the observed data with minimal loss of information. In this article we provide a comprehensive review and comparison of the performance of the principal methods of dimension reduction proposed in the ABC literature. The methods are split into three nonmutually exclusive classes consisting of best subset selection methods, projection techniques and regularization. In addition, we introduce two new methods of dimension reduction. The first is a best subset selection method based on Akaike and Bayesian information criteria, and the second uses ridge regression as a regularization procedure. We illustrate the performance of these dimension reduction techniques through the analysis of three challenging models and data sets.
DIMENSION REDUCTION FOR POWER SYSTEM MODELING USING PCA METHODS CONSIDERING INCOMPLETE DATA READINGS
Resumo:
Principal Component Analysis (PCA) is a popular method for dimension reduction that can be used in many fields including data compression, image processing, exploratory data analysis, etc. However, traditional PCA method has several drawbacks, since the traditional PCA method is not efficient for dealing with high dimensional data and cannot be effectively applied to compute accurate enough principal components when handling relatively large portion of missing data. In this report, we propose to use EM-PCA method for dimension reduction of power system measurement with missing data, and provide a comparative study of traditional PCA and EM-PCA methods. Our extensive experimental results show that EM-PCA method is more effective and more accurate for dimension reduction of power system measurement data than traditional PCA method when dealing with large portion of missing data set.
Resumo:
Pathway based genome wide association study evolves from pathway analysis for microarray gene expression and is under rapid development as a complementary for single-SNP based genome wide association study. However, it faces new challenges, such as the summarization of SNP statistics to pathway statistics. The current study applies the ridge regularized Kernel Sliced Inverse Regression (KSIR) to achieve dimension reduction and compared this method to the other two widely used methods, the minimal-p-value (minP) approach of assigning the best test statistics of all SNPs in each pathway as the statistics of the pathway and the principal component analysis (PCA) method of utilizing PCA to calculate the principal components of each pathway. Comparison of the three methods using simulated datasets consisting of 500 cases, 500 controls and100 SNPs demonstrated that KSIR method outperformed the other two methods in terms of causal pathway ranking and the statistical power. PCA method showed similar performance as the minP method. KSIR method also showed a better performance over the other two methods in analyzing a real dataset, the WTCCC Ulcerative Colitis dataset consisting of 1762 cases, 3773 controls as the discovery cohort and 591 cases, 1639 controls as the replication cohort. Several immune and non-immune pathways relevant to ulcerative colitis were identified by these methods. Results from the current study provided a reference for further methodology development and identified novel pathways that may be of importance to the development of ulcerative colitis.^
Resumo:
2002 Mathematics Subject Classification: 62J05, 62G35.
Resumo:
Published as an article in: Studies in Nonlinear Dynamics & Econometrics, 2004, vol. 8, issue 3, article 6.
Resumo:
2000 Mathematics Subject Classification: 68T01, 62H30, 32C09.
Resumo:
Two decades after its inception, Latent Semantic Analysis(LSA) has become part and parcel of every modern introduction to Information Retrieval. For any tool that matures so quickly, it is important to check its lore and limitations, or else stagnation will set in. We focus here on the three main aspects of LSA that are well accepted, and the gist of which can be summarized as follows: (1) that LSA recovers latent semantic factors underlying the document space, (2) that such can be accomplished through lossy compression of the document space by eliminating lexical noise, and (3) that the latter can best be achieved by Singular Value Decomposition. For each aspect we performed experiments analogous to those reported in the LSA literature and compared the evidence brought to bear in each case. On the negative side, we show that the above claims about LSA are much more limited than commonly believed. Even a simple example may show that LSA does not recover the optimal semantic factors as intended in the pedagogical example used in many LSA publications. Additionally, and remarkably deviating from LSA lore, LSA does not scale up well: the larger the document space, the more unlikely that LSA recovers an optimal set of semantic factors. On the positive side, we describe new algorithms to replace LSA (and more recent alternatives as pLSA, LDA, and kernel methods) by trading its l2 space for an l1 space, thereby guaranteeing an optimal set of semantic factors. These algorithms seem to salvage the spirit of LSA as we think it was initially conceived.
Resumo:
The discovery of protein variation is an important strategy in disease diagnosis within the biological sciences. The current benchmark for elucidating information from multiple biological variables is the so called “omics” disciplines of the biological sciences. Such variability is uncovered by implementation of multivariable data mining techniques which come under two primary categories, machine learning strategies and statistical based approaches. Typically proteomic studies can produce hundreds or thousands of variables, p, per observation, n, depending on the analytical platform or method employed to generate the data. Many classification methods are limited by an n≪p constraint, and as such, require pre-treatment to reduce the dimensionality prior to classification. Recently machine learning techniques have gained popularity in the field for their ability to successfully classify unknown samples. One limitation of such methods is the lack of a functional model allowing meaningful interpretation of results in terms of the features used for classification. This is a problem that might be solved using a statistical model-based approach where not only is the importance of the individual protein explicit, they are combined into a readily interpretable classification rule without relying on a black box approach. Here we incorporate statistical dimension reduction techniques Partial Least Squares (PLS) and Principal Components Analysis (PCA) followed by both statistical and machine learning classification methods, and compared them to a popular machine learning technique, Support Vector Machines (SVM). Both PLS and SVM demonstrate strong utility for proteomic classification problems.
Resumo:
An important responsibility of the Environment Protection Authority, Victoria, is to set objectives for levels of environmental contaminants. To support the development of environmental objectives for water quality, a need has been identified to understand the dual impacts of concentration and duration of a contaminant on biota in freshwater streams. For suspended solids contamination, information reported by Newcombe and Jensen [ North American Journal of Fisheries Management , 16(4):693--727, 1996] study of freshwater fish and the daily suspended solids data from the United States Geological Survey stream monitoring network is utilised. The study group was requested to examine both the utility of the Newcombe and Jensen and the USA data, as well as the formulation of a procedure for use by the Environment Protection Authority Victoria that takes concentration and duration of harmful episodes into account when assessing water quality. The extent to which the impact of a toxic event on fish health could be modelled deterministically was also considered. It was found that concentration and exposure duration were the main compounding factors on the severity of effects of suspended solids on freshwater fish. A protocol for assessing the cumulative effect on fish health and a simple deterministic model, based on the biology of gill harm and recovery, was proposed. References D. W. T. Au, C. A. Pollino, R. S. S Wu, P. K. S. Shin, S. T. F. Lau, and J. Y. M. Tang. Chronic effects of suspended solids on gill structure, osmoregulation, growth, and triiodothyronine in juvenile green grouper epinephelus coioides . Marine Ecology Press Series , 266:255--264, 2004. J.C. Bezdek, S.K. Chuah, and D. Leep. Generalized k-nearest neighbor rules. Fuzzy Sets and Systems , 18:237--26, 1986. E. T. Champagne, K. L. Bett-Garber, A. M. McClung, and C. Bergman. {Sensory characteristics of diverse rice cultivars as influenced by genetic and environmental factors}. Cereal Chem. , {81}:{237--243}, {2004}. S. G. Cheung and P. K. S. Shin. Size effects of suspended particles on gill damage in green-lipped mussel perna viridis. Marine Pollution Bulletin , 51(8--12):801--810, 2005. D. H. Evans. The fish gill: site of action and model for toxic effects of environmental pollutants. Environmental Health Perspectives , 71:44--58, 1987. G. C. Grigg. The failure of oxygen transport in a fish at low levels of ambient oxygen. Comp. Biochem. Physiol. , 29:1253--1257, 1969. G. Holmes, A. Donkin, and I.H. Witten. {Weka: A machine learning workbench}. In Proceedings of the Second Australia and New Zealand Conference on Intelligent Information Systems , volume {24}, pages {357--361}, {Brisbane, Australia}, {1994}. {IEEE Computer Society}. D. D. Macdonald and C. P. Newcombe. Utility of the stress index for predicting suspended sediment effects: response to comments. North American Journal of Fisheries Management , 13:873--876, 1993. C. P. Newcombe. Suspended sediment in aquatic ecosystems: ill effects as a function of concentration and duration of exposure. Technical report, British Columbia Ministry of Environment, Lands and Parks, Habitat Protection branch, Victoria, 1994. C. P. Newcombe and J. O. T. Jensen. Channel suspended sediment and fisheries: A synthesis for quantitative assessment of risk and impact. North American Journal of Fisheries Management , 16(4):693--727, 1996. C. P. Newcombe and D. D. Macdonald. Effects of suspended sediments on aquatic ecosystems. North American Journal of Fisheries Management , 11(1):72--82, 1991. K. Schmidt-Nielsen. Scaling. Why is animal size so important? Cambridge University Press, NY, 1984. J. S. Schwartz, A. Simon, and L. Klimetz. Use of fish functional traits to associate in-stream suspended sediment transport metrics with biological impairment. Environmental Monitoring and Assessment , 179(1--4):347--369, 2011. E. Al Shaw and J. S. Richardson. Direct and indirect effects of sediment pulse duration on stream invertebrate assemb ages and rainbow trout ( Oncorhynchus mykiss ) growth and survival. Canadian Journal of Fish and Aquatic Science , 58:2213--2221, 2001. P. Tiwari and H. Hasegawa. {Demand for housing in Tokyo: A discrete choice analysis}. Regional Studies , {38}:{27--42}, {2004}. Y. Tramblay, A. Saint-Hilaire, T. B. M. J. Ouarda, F. Moatar, and B Hecht. Estimation of local extreme suspended sediment concentrations in california rivers. Science of the Total Environment , 408:4221--
Resumo:
Big Datasets are endemic, but they are often notoriously difficult to analyse because of their size, heterogeneity, history and quality. The purpose of this paper is to open a discourse on the use of modern experimental design methods to analyse Big Data in order to answer particular questions of interest. By appealing to a range of examples, it is suggested that this perspective on Big Data modelling and analysis has wide generality and advantageous inferential and computational properties. In particular, the principled experimental design approach is shown to provide a flexible framework for analysis that, for certain classes of objectives and utility functions, delivers near equivalent answers compared with analyses of the full dataset under a controlled error rate. It can also provide a formalised method for iterative parameter estimation, model checking, identification of data gaps and evaluation of data quality. Finally, it has the potential to add value to other Big Data sampling algorithms, in particular divide-and-conquer strategies, by determining efficient sub-samples.