52 results for Principal components
at Consorci de Serveis Universitaris de Catalunya (CSUC), Spain
Abstract:
In human population genetics, routine applications of principal component techniques are often required. Population biologists make widespread use of certain discrete classifications of human samples into haplotypes, the monophyletic units of phylogenetic trees constructed from several hierarchically ordered single nucleotide polymorphisms. Compositional frequencies of the haplotypes are recorded within the different samples. Principal component techniques are then required as a dimension-reducing strategy to bring the dimension of the problem to a manageable level, say two, to allow for graphical analysis. Population biologists at large are not aware of the special features of compositional data and normally make use of the crude covariance of compositional relative frequencies to construct principal components. In this short note we present our experience with using traditional linear principal components versus compositional principal components based on log-ratios, with reference to a specific dataset.
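The contrast this note draws — "crude" PCA on raw relative frequencies versus compositional PCA on log-ratios — can be sketched in a few lines of numpy. The toy frequency table below is illustrative only (it is not the authors' dataset), and the centred log-ratio (clr) transform is one standard choice of log-ratio representation:

```python
import numpy as np

# Toy haplotype frequency table: 5 samples x 3 haplotypes, rows sum to 1.
X = np.array([
    [0.60, 0.30, 0.10],
    [0.55, 0.35, 0.10],
    [0.20, 0.30, 0.50],
    [0.25, 0.25, 0.50],
    [0.40, 0.30, 0.30],
])

def pca_components(Y):
    """Principal directions via SVD of the column-centred matrix."""
    Yc = Y - Y.mean(axis=0)
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    return Vt  # rows are the principal directions

# "Crude" PCA directly on the relative frequencies.
crude = pca_components(X)

# Compositional PCA: centred log-ratio (clr) transform first.
logX = np.log(X)
clr = logX - logX.mean(axis=1, keepdims=True)  # rows now sum to zero
compositional = pca_components(clr)

print("crude first PC:        ", crude[0])
print("compositional first PC:", compositional[0])
```

The two first components generally differ; the compositional version respects the unit-sum constraint that the crude covariance ignores.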
Abstract:
This paper addresses the application of PCA to categorical data prior to diagnosing a patient data set with a Case-Based Reasoning (CBR) system. The particularity is that standard PCA techniques are designed to deal with numerical attributes, but our medical data set contains many categorical attributes, so alternative methods such as RS-PCA (Regular Simplex PCA) are required. Thus, we propose to hybridize RS-PCA and a simple CBR. Results show that, when diagnosing a medical data set, the hybrid system produces results similar to those obtained using the original attributes. These results are quite promising, since they allow diagnosis with less computational effort and memory storage.
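The abstract does not spell out the RS-PCA construction, but a common way to make categorical attributes amenable to PCA is to map each k-level factor onto the vertices of a regular simplex (all categories mutually equidistant) and then run ordinary PCA on the coded matrix. The sketch below uses centred one-hot vectors as simplex vertices; the helper names and toy data are hypothetical, not taken from the paper:

```python
import numpy as np

def simplex_codes(k):
    """Vertices of a regular simplex for a k-level factor:
    centred one-hot vectors, all pairwise equidistant."""
    return np.eye(k) - 1.0 / k

def encode(column, levels):
    """Code one categorical column as rows of simplex vertices."""
    codes = simplex_codes(len(levels))
    index = {lev: i for i, lev in enumerate(levels)}
    return np.array([codes[index[v]] for v in column])

# Toy categorical data set: two variables, six cases.
colour = ["red", "blue", "red", "green", "blue", "green"]
size = ["S", "L", "L", "S", "S", "L"]

X = np.hstack([
    encode(colour, ["red", "green", "blue"]),
    encode(size, ["S", "L"]),
])

# Ordinary PCA on the simplex-coded matrix; keep two components.
Xc = X - X.mean(axis=0)
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T
print(scores.round(3))
```

The reduced scores could then feed a distance-based retrieval step such as the CBR described above.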
Abstract:
The present study discusses retention criteria for principal components analysis (PCA) applied to Likert-scale items typical of psychological questionnaires. The main aim is to recommend that applied researchers refrain from relying only on the eigenvalue-greater-than-one criterion; alternative procedures are suggested for adjusting for sampling error. An additional objective is to add evidence on the consequences of applying this rule when PCA is used with discrete variables. The experimental conditions were studied by means of Monte Carlo sampling, including several sample sizes, different numbers of variables and answer alternatives, and four non-normal distributions. The results suggest that even when all the items, and thus the underlying dimensions, are independent, eigenvalues greater than one are frequent and can explain up to 80% of the variance in the data, meeting the empirical criterion. The consequences of using Kaiser's rule are illustrated with a clinical psychology example. The size of the eigenvalues turned out to be a function of the sample size and the number of variables, which is also the case for parallel analysis, as previous research shows. To enhance the application of alternative criteria, an R package was developed for deciding the number of principal components to retain by means of confidence intervals constructed around the eigenvalues corresponding to a lack of relationship between discrete variables.
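The pitfall the study documents is easy to reproduce: even for fully independent Likert items, sampling error alone pushes several correlation-matrix eigenvalues above one, so Kaiser's rule retains spurious components. A minimal simulation (not the authors' Monte Carlo design or R package):

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 12  # respondents, mutually independent 5-point Likert items
X = rng.integers(1, 6, size=(n, p)).astype(float)  # values 1..5

R = np.corrcoef(X, rowvar=False)       # item correlation matrix
eigvals = np.linalg.eigvalsh(R)[::-1]  # eigenvalues, descending

retained = int((eigvals > 1).sum())    # Kaiser's eigenvalue > 1 rule
explained = eigvals[eigvals > 1].sum() / p

print(f"components with eigenvalue > 1: {retained}")
print(f"variance they 'explain': {explained:.0%}")
```

Because the eigenvalues of a sample correlation matrix sum to p but spread around 1, some always exceed 1 under independence — exactly the behaviour the abstract warns about.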
Abstract:
Principal curves were defined by Hastie and Stuetzle (JASA, 1989) as smooth curves passing through the middle of a multidimensional data set. They are nonlinear generalizations of the first principal component, a characterization of which is the basis for the definition of principal curves. In this paper we propose an alternative approach based on a different property of principal components. Consider a point in the space where a multivariate normal is defined and, for each hyperplane containing that point, compute the total variance of the normal distribution conditioned to belong to that hyperplane. Now choose the hyperplane minimizing this conditional total variance and look for the corresponding conditional mean. The first principal component of the original distribution passes through this conditional mean and is orthogonal to that hyperplane. This property is easily generalized to data sets with nonlinear structure. Repeating the search from different starting points, many points analogous to conditional means are found. We call them principal oriented points. When a one-dimensional curve runs through the set of these special points, it is called a principal curve of oriented points. Successive principal curves are recursively defined from a generalization of the total variance.
Abstract:
A new drift compensation method based on Common Principal Component Analysis (CPCA) is proposed. The drift variance in the data is found as the principal components computed by CPCA. This method finds components that are common to all gases in feature space. The method is compared, on a classification task, with previously published approaches in which the drift direction is estimated through a Principal Component Analysis (PCA) of a reference gas. The proposed method, which employs no specific reference gas but rather information from all gases, has shown the same performance as the traditional approach with the best-fitted reference gas. Results are shown with data spanning 7 months, including three gases at different concentrations, for an array of 17 polymeric sensors.
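The deflation step common to both approaches — estimate a drift direction with PCA, then remove that component from every sample — can be sketched as follows. This is a synthetic illustration, not the paper's CPCA algorithm or its sensor data; here the drift direction is fitted on the whole data set, whereas the reference-gas variant would fit it on one gas only:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated sensor-array responses: noise plus a slow drift along one
# common direction (17 sensors, as in the array described above).
n_samples, n_sensors = 200, 17
drift_dir = rng.normal(size=n_sensors)
drift_dir /= np.linalg.norm(drift_dir)
t = np.linspace(0, 1, n_samples)[:, None]  # "time" of each sample
X = rng.normal(size=(n_samples, n_sensors)) + 10.0 * t * drift_dir

# Estimate the drift direction as the first principal component.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
d = Vt[0]  # unit vector, dominated by the drift

# Deflate: project the drift component out of every sample.
X_corrected = Xc - np.outer(Xc @ d, d)

print("variance along drift before:", float((Xc @ d).var()))
print("variance along drift after: ", float((X_corrected @ d).var()))
```

After deflation the variance along the estimated drift direction is (numerically) zero, which is what allows the corrected data to be used in the downstream classification task.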
Abstract:
A continuous random variable is expanded as a sum of a sequence of uncorrelated random variables. These variables are principal dimensions in continuous scaling on a distance function, as an extension of classic scaling on a distance matrix. For a particular distance, these dimensions are principal components. Some properties are then studied and an inequality is obtained. Diagonal expansions are considered from the same continuous scaling point of view, by means of the chi-square distance. The geometric dimension of a bivariate distribution is defined and illustrated with copulas. It is shown that the dimension can have the power of the continuum.
Abstract:
In this paper we examine the out-of-sample forecast performance of high-yield credit spreads with regard to real-time and revised data on employment and industrial production in the US. We evaluate models using both a point-forecast and a probability-forecast exercise. Our main findings suggest the use of a few factors obtained by pooling information from a number of sector-specific high-yield credit spreads. This can be justified by observing that, especially for employment, there is a gain from using a principal components model fitted to high-yield credit spreads compared with the predictions produced by benchmarks such as AR and ARDL models that use either the term spread or the aggregate high-yield spread as an exogenous regressor. Moreover, forecasts based on real-time data are generally comparable to forecasts based on revised data.
JEL Classification: C22; C53; E32
Keywords: Credit spreads; Principal components; Forecasting; Real-time data.
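The factor approach described here — pool many spread series, extract a few principal components, use them as regressors for the target — can be sketched on synthetic data. Nothing below relates to the paper's actual series; the panel, the target, and the factor count are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic panel: 20 sector-specific spread series over T periods,
# driven by 2 latent factors plus idiosyncratic noise.
T, N, k = 120, 20, 2
F = rng.normal(size=(T, k))
X = F @ rng.normal(size=(k, N)) + 0.1 * rng.normal(size=(T, N))

# Target series that loads on the same factors (a stand-in for, say,
# an employment growth indicator).
y = F @ np.array([1.0, -0.5]) + 0.1 * rng.normal(size=T)

# Extract k principal-component factors from the standardized panel.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
Fhat = Z @ Vt[:k].T

# Regress the target on the estimated factors (the regression at the
# heart of the exercise; the paper evaluates it out of sample).
A = np.column_stack([np.ones(T), Fhat])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta
r2 = 1 - resid.var() / y.var()
print(f"in-sample R^2 of the factor model: {r2:.2f}")
```

Because the estimated components span (nearly) the true factor space, a handful of factors summarizes the whole panel, which is the economy of the approach.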
Abstract:
Research project carried out during a stay at the Universidad Politécnica de Madrid, Spain, between September and December 2007. The aerospace and aeronautics industry currently has as a priority improving the reliability of its structures through the development of new systems for monitoring and impact detection. There are several potentially useful techniques, and their applicability in a particular situation depends critically on the defect size the structure tolerates. Any defect will change the vibrational response of the structural element, as well as the transient of the wave propagating through the elastic structure. Correlating these changes, which can be detected experimentally, with the occurrence of the defect, its location and its quantification is a very complex problem. This work explores the use of Principal Component Analysis (PCA), based on the formulation of the T2 and Q statistics, to detect and distinguish defects in the structure by correlating their changes with the vibrational response. The structure used for the study is the turbine blade of a commercial aircraft. The blade is excited at one end using a shaker, and seven PZT sensors have been attached to its surface. A known signal is applied and the responses are analysed. A PCA model is built using data from the defect-free structure. To test the model, a piece of aluminium is attached in four different positions. The data from the tests on the defective structure are projected onto the model. The principal components and the Q-residual and T2-Hotelling distances are used to analyse the incidents. The Q-residual indicates how well each sample conforms to the PCA model, since it is a measure of the difference, or residual, between the sample and its projection onto the principal components retained in the model. The T2-Hotelling distance is a measure of the variation of each sample within the PCA model, that is, the distance to the centre of the PCA model.
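The two monitoring statistics just described have standard formulas: with loadings P of the k retained components and their eigenvalues, T2 sums the squared, variance-scaled scores (distance inside the model) and Q is the squared residual left after projecting onto the model. A minimal sketch on simulated baseline data — the seven "sensors" and the injected "defect" are synthetic, not the experiment above:

```python
import numpy as np

rng = np.random.default_rng(2)

# Baseline (defect-free) responses: 7 correlated sensor features.
n, p, k = 300, 7, 3           # samples, sensors, retained components
B = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))

mu = B.mean(axis=0)
Bc = B - mu
_, s, Vt = np.linalg.svd(Bc, full_matrices=False)
P = Vt[:k].T                  # loadings of the retained components
lam = (s[:k] ** 2) / (n - 1)  # their eigenvalues (score variances)

def t2_q(x):
    """Hotelling T2 and Q-residual of one sample under the PCA model."""
    xc = x - mu
    t = P.T @ xc                    # scores on the retained components
    t2 = float(np.sum(t**2 / lam))  # variation inside the model
    resid = xc - P @ t              # part not explained by the model
    q = float(resid @ resid)        # squared residual distance
    return t2, q

x_ok = B[0]
x_damaged = B[0] + 5.0 * rng.normal(size=p)  # simulated defect
print("baseline sample:", t2_q(x_ok))
print("damaged sample: ", t2_q(x_damaged))
```

In practice, control limits on T2 and Q computed from the baseline data decide when a projected sample signals a defect.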
Abstract:
Functional Data Analysis (FDA) deals with samples where a whole function is observed for each individual. A particular case of FDA arises when the observed functions are density functions, which are also an example of infinite-dimensional compositional data. In this work we compare several methods of dimensionality reduction for this particular type of data: functional principal components analysis (PCA), with or without a previous data transformation, and multidimensional scaling (MDS) for different inter-density distances, one of them taking into account the compositional nature of density functions. The different methods are applied to both artificial and real data (household income distributions).
Abstract:
Compositional data naturally arise from the scientific analysis of the chemical composition of archaeological material such as ceramic and glass artefacts. Data of this type can be explored using a variety of techniques, from standard multivariate methods such as principal components analysis and cluster analysis to methods based upon the use of log-ratios. The general aim is to identify groups of chemically similar artefacts that could potentially be used to answer questions of provenance. This paper will demonstrate work in progress on the development of a documented library of methods, implemented using the statistical package R, for the analysis of compositional data. R is an open-source package that makes very powerful statistical facilities available at no cost. We aim to show how, with the aid of statistical software such as R, traditional exploratory multivariate analysis can easily be used alongside, or in combination with, specialist techniques of compositional data analysis. The library has been developed from a core of basic R functionality, together with purpose-written routines arising from our own research (for example, that reported at CoDaWork'03). In addition, we have included other appropriate publicly available techniques and libraries that have been implemented in R by other authors. Available functions range from standard multivariate techniques to various approaches to log-ratio analysis and zero replacement. We also discuss and demonstrate a small selection of relatively new techniques that have hitherto been little used in archaeometric applications involving compositional data. The application of the library to the analysis of data arising in archaeometry will be demonstrated; results from different analyses will be compared; and the utility of the various methods discussed.
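One of the routines mentioned above, zero replacement, is needed because log-ratio methods break on zero parts. A widely used variant is multiplicative replacement: each zero becomes a small value delta and the non-zero parts are shrunk so every row still sums to one. A sketch (the delta value is an arbitrary illustration, not the library's default):

```python
import numpy as np

def multiplicative_replacement(X, delta=1e-3):
    """Replace zeros in a closed composition (rows sum to 1) by delta,
    rescaling the non-zero parts so each row still sums to 1."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    for i, row in enumerate(X):
        zeros = row == 0
        if zeros.any():
            out[i, zeros] = delta
            out[i, ~zeros] = row[~zeros] * (1 - zeros.sum() * delta)
    return out

# A two-sample composition with one zero part.
comp = np.array([
    [0.5, 0.5, 0.0],
    [0.2, 0.3, 0.5],
])
repl = multiplicative_replacement(comp)
print(repl)
# Rows still sum to one, and log-ratios are now defined everywhere.
```

After replacement, the data can safely enter the log-ratio analyses (clr, alr, ilr) that the library provides alongside the standard multivariate tools.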
Abstract:
"compositions" is a new R package for the analysis of compositional and positive data. It contains four classes corresponding to the four different types of compositional and positive geometry (including the Aitchison geometry). It provides means for computation, plotting and high-level multivariate statistical analysis in all four geometries. These geometries are treated in a fully analogous way, based on the principle of working in coordinates and on the object-oriented programming paradigm of R. In this way, called functions automatically select the most appropriate type of analysis as a function of the geometry. The graphical capabilities include ternary diagrams and tetrahedra, various compositional plots (boxplots, barplots, pie charts) and extensive graphical tools for principal components. Afterwards, proportion lines, straight lines and ellipses in all geometries can be added to plots. The package is accompanied by a hands-on introduction, documentation for every function, demos of the graphical capabilities and plenty of usage examples. It allows direct and parallel computation in all four vector spaces and provides the beginner with a copy-and-paste style of data analysis, while letting advanced users keep the functionality and customizability they demand of R, as well as all necessary tools to add their own analysis routines. A complete example is included in the appendix.
Abstract:
Isotopic data are currently becoming an important source of information regarding the sources, evolution and mixing processes of water in hydrogeologic systems. However, it is not clear how to treat geochemical data and isotopic data together statistically. We propose to introduce the isotopic information as new parts and to apply compositional data analysis to the resulting enlarged composition. The results are equivalent to downscaling the classical isotopic delta variables, because these are already relative (as needed in the compositional framework) and isotopic variations are almost always very small. This methodology is illustrated and tested with a study of the Llobregat River Basin (Barcelona, NE Spain), where it is shown that, though very small, isotopic variations complement the geochemical principal components and help in the better identification of pollution sources.
Abstract:
This paper establishes a general framework for metric scaling of any distance measure between individuals based on a rectangular individuals-by-variables data matrix. The method allows visualization of both individuals and variables while preserving all the good properties of principal-axis methods such as principal components and correspondence analysis, based on the singular value decomposition, including the decomposition of variance into components along principal axes, which provides the numerical diagnostics known as contributions. The idea is inspired by the chi-square distance in correspondence analysis, which weights each coordinate by an amount calculated from the margins of the data table. In weighted metric multidimensional scaling (WMDS) we allow these weights to be unknown parameters which are estimated from the data to maximize the fit to the original distances. Once this extra weight-estimation step is accomplished, the procedure follows the classical path in decomposing a matrix and displaying its rows and columns in biplots.
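The "classical path" that WMDS generalizes — turn a distance matrix into principal coordinates by double-centring the squared distances and eigendecomposing — can be sketched as follows. The weight-estimation step of WMDS itself is not reproduced here; this is only the unweighted baseline:

```python
import numpy as np

def classical_mds(D, k=2):
    """Principal coordinates from a distance matrix D (classical scaling)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centring matrix
    B = -0.5 * J @ (D**2) @ J             # doubly centred squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:k] # largest eigenvalues first
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0))

# Euclidean distances between 4 points in the plane; classical scaling
# recovers a configuration with the same inter-point distances.
P = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
D = np.linalg.norm(P[:, None] - P[None, :], axis=-1)
Y = classical_mds(D, k=2)
D_rec = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
print(np.allclose(D, D_rec))
```

For Euclidean input distances the recovery is exact (up to rotation and reflection); WMDS modifies the centring with estimated weights before following this same decomposition-and-biplot route.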
Abstract:
Dual scaling of a subjects-by-objects table of dominance data (preferences, paired comparisons and successive-categories data) has been contrasted with correspondence analysis, as if the two techniques were somehow different. In this note we show that dual scaling of dominance data is equivalent to the correspondence analysis of a table which is doubled with respect to subjects. We also show that the results of both methods can be recovered from a principal components analysis of the undoubled dominance table centred with respect to subject means.