937 resultados para principal components analysis (PCA) algorithm
Resumo:
Analyzing functional data often leads to finding common factors, for which functional principal component analysis proves to be a useful tool to summarize and characterize the random variation in a function space. The representation in terms of eigenfunctions is optimal in the sense of L-2 approximation. However, the eigenfunctions are not always directed towards an interesting and interpretable direction in the context of functional data and thus could obscure the underlying structure. To overcome such difficulty, an alternative to functional principal component analysis is proposed that produces directed components which may be more informative and easier to interpret. These structural components are similar to principal components, but are adapted to situations in which the domain of the function may be decomposed into disjoint intervals such that there is effectively independence between intervals and positive correlation within intervals. The approach is demonstrated with synthetic examples as well as real data. Properties for special cases are also studied.
Resumo:
This paper aims to provide empirical support for the use of the principal-agent framework in the analysis of public sector and public policies. After reviewing the different conditions to be met for a relevant analysis of the relationship between population and government using the principal-agent theory, our paper focuses on the assumption of conflicting goals between the principal and the agent. A principal-agent analysis assumes in effect that inefficiencies may arise because principal and agent pursue different goals. Using data collected during an amalgamation project of two Swiss municipalities, we show the existence of a gap between the goals of the population and those of the government. Consequently, inefficiencies as predicted by the principal-agent model may arise during the implementation of a public policy, i.e. an amalgamation project. In a context of direct democracy where policies are regularly subjected to referendum, the conflict of objectives may even lead to a total failure of the policy at the polls.
Resumo:
Principal curves have been defined Hastie and Stuetzle (JASA, 1989) assmooth curves passing through the middle of a multidimensional dataset. They are nonlinear generalizations of the first principalcomponent, a characterization of which is the basis for the principalcurves definition.In this paper we propose an alternative approach based on a differentproperty of principal components. Consider a point in the space wherea multivariate normal is defined and, for each hyperplane containingthat point, compute the total variance of the normal distributionconditioned to belong to that hyperplane. Choose now the hyperplaneminimizing this conditional total variance and look for thecorresponding conditional mean. The first principal component of theoriginal distribution passes by this conditional mean and it isorthogonal to that hyperplane. This property is easily generalized todata sets with nonlinear structure. Repeating the search from differentstarting points, many points analogous to conditional means are found.We call them principal oriented points. When a one-dimensional curveruns the set of these special points it is called principal curve oforiented points. Successive principal curves are recursively definedfrom a generalization of the total variance.
Resumo:
We consider the joint visualization of two matrices which have common rowsand columns, for example multivariate data observed at two time pointsor split accord-ing to a dichotomous variable. Methods of interest includeprincipal components analysis for interval-scaled data, or correspondenceanalysis for frequency data or ratio-scaled variables on commensuratescales. A simple result in matrix algebra shows that by setting up thematrices in a particular block format, matrix sum and difference componentscan be visualized. The case when we have more than two matrices is alsodiscussed and the methodology is applied to data from the InternationalSocial Survey Program.
Resumo:
We construct a weighted Euclidean distance that approximates any distance or dissimilarity measure between individuals that is based on a rectangular cases-by-variables data matrix. In contrast to regular multidimensional scaling methods for dissimilarity data, the method leads to biplots of individuals and variables while preserving all the good properties of dimension-reduction methods that are based on the singular-value decomposition. The main benefits are the decomposition of variance into components along principal axes, which provide the numerical diagnostics known as contributions, and the estimation of nonnegative weights for each variable. The idea is inspired by the distance functions used in correspondence analysis and in principal component analysis of standardized data, where the normalizations inherent in the distances can be considered as differential weighting of the variables. In weighted Euclidean biplots we allow these weights to be unknown parameters, which are estimated from the data to maximize the fit to the chosen distances or dissimilarities. These weights are estimated using a majorization algorithm. Once this extra weight-estimation step is accomplished, the procedure follows the classical path in decomposing the matrix and displaying its rows and columns in biplots.
Resumo:
The spatial variability of soil and plant properties exerts great influence on the yeld of agricultural crops. This study analyzed the spatial variability of the fertility of a Humic Rhodic Hapludox with Arabic coffee, using principal component analysis, cluster analysis and geostatistics in combination. The experiment was carried out in an area under Coffea arabica L., variety Catucai 20/15 - 479. The soil was sampled at a depth 0.20 m, at 50 points of a sampling grid. The following chemical properties were determined: P, K+, Ca2+, Mg2+, Na+, S, Al3+, pH, H + Al, SB, t, T, V, m, OM, Na saturation index (SSI), remaining phosphorus (P-rem), and micronutrients (Zn, Fe, Mn, Cu and B). The data were analyzed with descriptive statistics, followed by principal component and cluster analyses. Geostatistics were used to check and quantify the degree of spatial dependence of properties, represented by principal components. The principal component analysis allowed a dimensional reduction of the problem, providing interpretable components, with little information loss. Despite the characteristic information loss of principal component analysis, the combination of this technique with geostatistical analysis was efficient for the quantification and determination of the structure of spatial dependence of soil fertility. In general, the availability of soil mineral nutrients was low and the levels of acidity and exchangeable Al were high.
Resumo:
Abstract : This work is concerned with the development and application of novel unsupervised learning methods, having in mind two target applications: the analysis of forensic case data and the classification of remote sensing images. First, a method based on a symbolic optimization of the inter-sample distance measure is proposed to improve the flexibility of spectral clustering algorithms, and applied to the problem of forensic case data. This distance is optimized using a loss function related to the preservation of neighborhood structure between the input space and the space of principal components, and solutions are found using genetic programming. Results are compared to a variety of state-of--the-art clustering algorithms. Subsequently, a new large-scale clustering method based on a joint optimization of feature extraction and classification is proposed and applied to various databases, including two hyperspectral remote sensing images. The algorithm makes uses of a functional model (e.g., a neural network) for clustering which is trained by stochastic gradient descent. Results indicate that such a technique can easily scale to huge databases, can avoid the so-called out-of-sample problem, and can compete with or even outperform existing clustering algorithms on both artificial data and real remote sensing images. This is verified on small databases as well as very large problems. Résumé : Ce travail de recherche porte sur le développement et l'application de méthodes d'apprentissage dites non supervisées. Les applications visées par ces méthodes sont l'analyse de données forensiques et la classification d'images hyperspectrales en télédétection. Dans un premier temps, une méthodologie de classification non supervisée fondée sur l'optimisation symbolique d'une mesure de distance inter-échantillons est proposée. Cette mesure est obtenue en optimisant une fonction de coût reliée à la préservation de la structure de voisinage d'un point entre l'espace des variables initiales et l'espace des composantes principales. Cette méthode est appliquée à l'analyse de données forensiques et comparée à un éventail de méthodes déjà existantes. En second lieu, une méthode fondée sur une optimisation conjointe des tâches de sélection de variables et de classification est implémentée dans un réseau de neurones et appliquée à diverses bases de données, dont deux images hyperspectrales. Le réseau de neurones est entraîné à l'aide d'un algorithme de gradient stochastique, ce qui rend cette technique applicable à des images de très haute résolution. Les résultats de l'application de cette dernière montrent que l'utilisation d'une telle technique permet de classifier de très grandes bases de données sans difficulté et donne des résultats avantageusement comparables aux méthodes existantes.
Resumo:
This study describes the use of Principal Component Analysis to evaluate the chemical composition of water produced from eight oil wells in three different production areas. A total of 609 samples of produced water, and a reference sample of seawater, were characterized according to their levels of salinity, calcium, magnesium, strontium, barium and sulphate (mg L-1) contents, and analyzed by using PCA with autoscaled data. The method allowed the identification of variables salinity, calcium and strontium as tracers for formation water, and variables magnesium and sulphate as tracers for seawater.
Resumo:
The penetration resistance (PR) is a soil attribute that allows identifies areas with restrictions due to compaction, which results in mechanical impedance for root growth and reduced crop yield. The aim of this study was to characterize the PR of an agricultural soil by geostatistical and multivariate analysis. Sampling was done randomly in 90 points up to 0.60 m depth. It was determined spatial distribution models of PR, and defined areas with mechanical impedance for roots growth. The PR showed a random distribution to 0.55 and 0.60 m depth. PR in other depths analyzed showed spatial dependence, with adjustments to exponential and spherical models. The cluster analysis that considered sampling points allowed establishing areas with compaction problem identified in the maps by kriging interpolation. The analysis with main components identified three soil layers, where the middle layer showed the highest values of PR.
Resumo:
The aim of this study was to investigate the effect of pre-slaughter handling on the occurrence of PSE (Pale, Soft, and Exudative) meat in swine slaughtered at a commercial slaughterhouse located in the metropolitan region of Dourados, Mato Grosso do Sul, Brazil. Based on the database (n=1,832 carcasses), it was possible to apply the integrated multivariate analysis for the purpose of identifying, among the selected variables, those of greatest relevance to this study. Results of the Principal Component Analysis showed that the first five components explained 89.28% of total variance. In the Factor Analysis, the first factor represented the thermal stress and fatiguing conditions for swine during pre-slaughter handling. In general, this study indicated the importance of the pre-slaughter handling stages, evidencing those of greatest stress and threat to animal welfare and pork quality, which are transport time, resting period, lairage time before unloading, unloading time, and ambience.
Resumo:
ABSTRACT This study aimed to develop a methodology based on multivariate statistical analysis of principal components and cluster analysis, in order to identify the most representative variables in studies of minimum streamflow regionalization, and to optimize the identification of the hydrologically homogeneous regions for the Doce river basin. Ten variables were used, referring to the river basin climatic and morphometric characteristics. These variables were individualized for each of the 61 gauging stations. Three dependent variables that are indicative of minimum streamflow (Q7,10, Q90 and Q95). And seven independent variables that concern to climatic and morphometric characteristics of the basin (total annual rainfall – Pa; total semiannual rainfall of the dry and of the rainy season – Pss and Psc; watershed drainage area – Ad; length of the main river – Lp; total length of the rivers – Lt; and average watershed slope – SL). The results of the principal component analysis pointed out that the variable SL was the least representative for the study, and so it was discarded. The most representative independent variables were Ad and Psc. The best divisions of hydrologically homogeneous regions for the three studied flow characteristics were obtained using the Mahalanobis similarity matrix and the complete linkage clustering method. The cluster analysis enabled the identification of four hydrologically homogeneous regions in the Doce river basin.
Resumo:
The objective of this research was to use the technique of Exploratory Factor Analysis (EFA) for the adequacy of a tool for the assessment of fish consumption and the characteristics involved in this process. Data were collected during a campaign to encourage fish consumption in Brazil with the voluntarily participation of members of a university community. An assessment instrument consisting of multiple-choice questions and a five-point Likert scale was designed and used to measure the importance of certain attributes that influence the choice and consumption of fish. This study sample was composed of of 224 individuals, the majority were women (65.6%). With regard to the frequency of fish consumption, 37.67% of the volunteers interviewed said they consume the product two or three times a month, and 29.6% once a week. The Exploratory Factor Analysis (EFA) was used to group the variables; the extraction was made using the principal components and the rotation using the Quartimax method. The results show clusters in two main constructs, quality and consumption with Cronbach Alpha coefficients of 0.75 and 0.69, respectively, indicating good internal consistency.
Resumo:
We study the workings of the factor analysis of high-dimensional data using artificial series generated from a large, multi-sector dynamic stochastic general equilibrium (DSGE) model. The objective is to use the DSGE model as a laboratory that allow us to shed some light on the practical benefits and limitations of using factor analysis techniques on economic data. We explain in what sense the artificial data can be thought of having a factor structure, study the theoretical and finite sample properties of the principal components estimates of the factor space, investigate the substantive reason(s) for the good performance of di¤usion index forecasts, and assess the quality of the factor analysis of highly dissagregated data. In all our exercises, we explain the precise relationship between the factors and the basic macroeconomic shocks postulated by the model.
Resumo:
Dans un premier temps, nous avons modélisé la structure d’une famille d’ARN avec une grammaire de graphes afin d’identifier les séquences qui en font partie. Plusieurs autres méthodes de modélisation ont été développées, telles que des grammaires stochastiques hors-contexte, des modèles de covariance, des profils de structures secondaires et des réseaux de contraintes. Ces méthodes de modélisation se basent sur la structure secondaire classique comparativement à nos grammaires de graphes qui se basent sur les motifs cycliques de nucléotides. Pour exemplifier notre modèle, nous avons utilisé la boucle E du ribosome qui contient le motif Sarcin-Ricin qui a été largement étudié depuis sa découverte par cristallographie aux rayons X au début des années 90. Nous avons construit une grammaire de graphes pour la structure du motif Sarcin-Ricin et avons dérivé toutes les séquences qui peuvent s’y replier. La pertinence biologique de ces séquences a été confirmée par une comparaison des séquences d’un alignement de plus de 800 séquences ribosomiques bactériennes. Cette comparaison a soulevée des alignements alternatifs pour quelques unes des séquences que nous avons supportés par des prédictions de structures secondaires et tertiaires. Les motifs cycliques de nucléotides ont été observés par les membres de notre laboratoire dans l'ARN dont la structure tertiaire a été résolue expérimentalement. Une étude des séquences et des structures tertiaires de chaque cycle composant la structure du Sarcin-Ricin a révélé que l'espace des séquences dépend grandement des interactions entre tous les nucléotides à proximité dans l’espace tridimensionnel, c’est-à-dire pas uniquement entre deux paires de bases adjacentes. Le nombre de séquences générées par la grammaire de graphes est plus petit que ceux des méthodes basées sur la structure secondaire classique. Cela suggère l’importance du contexte pour la relation entre la séquence et la structure, d’où l’utilisation d’une grammaire de graphes contextuelle plus expressive que les grammaires hors-contexte. Les grammaires de graphes que nous avons développées ne tiennent compte que de la structure tertiaire et négligent les interactions de groupes chimiques spécifiques avec des éléments extra-moléculaires, comme d’autres macromolécules ou ligands. Dans un deuxième temps et pour tenir compte de ces interactions, nous avons développé un modèle qui tient compte de la position des groupes chimiques à la surface des structures tertiaires. L’hypothèse étant que les groupes chimiques à des positions conservées dans des séquences prédéterminées actives, qui sont déplacés dans des séquences inactives pour une fonction précise, ont de plus grandes chances d’être impliqués dans des interactions avec des facteurs. En poursuivant avec l’exemple de la boucle E, nous avons cherché les groupes de cette boucle qui pourraient être impliqués dans des interactions avec des facteurs d'élongation. Une fois les groupes identifiés, on peut prédire par modélisation tridimensionnelle les séquences qui positionnent correctement ces groupes dans leurs structures tertiaires. Il existe quelques modèles pour adresser ce problème, telles que des descripteurs de molécules, des matrices d’adjacences de nucléotides et ceux basé sur la thermodynamique. Cependant, tous ces modèles utilisent une représentation trop simplifiée de la structure d’ARN, ce qui limite leur applicabilité. Nous avons appliqué notre modèle sur les structures tertiaires d’un ensemble de variants d’une séquence d’une instance du Sarcin-Ricin d’un ribosome bactérien. L’équipe de Wool à l’université de Chicago a déjà étudié cette instance expérimentalement en testant la viabilité de 12 variants. Ils ont déterminé 4 variants viables et 8 létaux. Nous avons utilisé cet ensemble de 12 séquences pour l’entraînement de notre modèle et nous avons déterminé un ensemble de propriétés essentielles à leur fonction biologique. Pour chaque variant de l’ensemble d’entraînement nous avons construit des modèles de structures tertiaires. Nous avons ensuite mesuré les charges partielles des atomes exposés sur la surface et encodé cette information dans des vecteurs. Nous avons utilisé l’analyse des composantes principales pour transformer les vecteurs en un ensemble de variables non corrélées, qu’on appelle les composantes principales. En utilisant la distance Euclidienne pondérée et l’algorithme du plus proche voisin, nous avons appliqué la technique du « Leave-One-Out Cross-Validation » pour choisir les meilleurs paramètres pour prédire l’activité d’une nouvelle séquence en la faisant correspondre à ces composantes principales. Finalement, nous avons confirmé le pouvoir prédictif du modèle à l’aide d’un nouvel ensemble de 8 variants dont la viabilité à été vérifiée expérimentalement dans notre laboratoire. En conclusion, les grammaires de graphes permettent de modéliser la relation entre la séquence et la structure d’un élément structural d’ARN, comme la boucle E contenant le motif Sarcin-Ricin du ribosome. Les applications vont de la correction à l’aide à l'alignement de séquences jusqu’au design de séquences ayant une structure prédéterminée. Nous avons également développé un modèle pour tenir compte des interactions spécifiques liées à une fonction biologique donnée, soit avec des facteurs environnants. Notre modèle est basé sur la conservation de l'exposition des groupes chimiques qui sont impliqués dans ces interactions. Ce modèle nous a permis de prédire l’activité biologique d’un ensemble de variants de la boucle E du ribosome qui se lie à des facteurs d'élongation.
Resumo:
Learning Disability (LD) is a neurological condition that affects a child’s brain and impairs his ability to carry out one or many specific tasks. LD affects about 15 % of children enrolled in schools. The prediction of LD is a vital and intricate job. The aim of this paper is to design an effective and powerful tool, using the two intelligent methods viz., Artificial Neural Network and Adaptive Neuro-Fuzzy Inference System, for measuring the percentage of LD that affected in school-age children. In this study, we are proposing some soft computing methods in data preprocessing for improving the accuracy of the tool as well as the classifier. The data preprocessing is performed through Principal Component Analysis for attribute reduction and closest fit algorithm is used for imputing missing values. The main idea in developing the LD prediction tool is not only to predict the LD present in children but also to measure its percentage along with its class like low or minor or major. The system is implemented in Mathworks Software MatLab 7.10. The results obtained from this study have illustrated that the designed prediction system or tool is capable of measuring the LD effectively