901 results for Techniques of data analysis
Abstract:
Examples of compositional data. The simplex, a suitable sample space for compositional data and Aitchison's geometry. R, a free language and environment for statistical computing and graphics
Abstract:
"compositions" is a new R package for the analysis of compositional and positive data. It contains four classes corresponding to the four different types of compositional and positive geometry (including the Aitchison geometry). It provides means for computation, plotting and high-level multivariate statistical analysis in all four geometries. These geometries are treated in a fully analogous way, based on the principle of working in coordinates and the object-oriented programming paradigm of R. In this way, called functions automatically select the most appropriate type of analysis as a function of the geometry. The graphical capabilities include ternary diagrams and tetrahedrons, various compositional plots (boxplots, barplots, piecharts) and extensive graphical tools for principal components. Afterwards, portion and proportion lines, straight lines and ellipses in all geometries can be added to plots. The package is accompanied by a hands-on introduction, documentation for every function, demos of the graphical capabilities and plenty of usage examples. It allows direct and parallel computation in all four vector spaces and provides the beginner with a copy-and-paste style of data analysis, while letting advanced users keep the functionality and customizability they demand of R, as well as all necessary tools to add their own analysis routines. A complete example is included in the appendix.
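The package described above is R software; as a language-neutral illustration of the "working in coordinates" principle it relies on, the short Python sketch below (data and helper names invented for the example) closes a composition to a constant sum and computes centered log-ratio (clr) coordinates, in which ordinary Euclidean operations correspond to operations of the Aitchison geometry on the simplex.

```python
import numpy as np

def closure(x, total=1.0):
    """Rescale each row to a constant sum (the simplex representation)."""
    x = np.asarray(x, dtype=float)
    return total * x / x.sum(axis=1, keepdims=True)

def clr(x):
    """Centered log-ratio transform: log(part) minus the row-wise mean log."""
    lx = np.log(closure(x))
    return lx - lx.mean(axis=1, keepdims=True)

# hypothetical 3-part compositions (e.g. sand/silt/clay fractions)
comp = np.array([[60.0, 30.0, 10.0],
                 [20.0, 50.0, 30.0],
                 [10.0, 10.0, 80.0]])

z = clr(comp)
# Euclidean distance between clr coordinates = Aitchison distance on the simplex
d01 = np.linalg.norm(z[0] - z[1])
print("clr coordinates:\n", z.round(3))
print("Aitchison distance between samples 0 and 1:", round(d01, 3))
```

The R package exposes the same arithmetic through its compositional classes and log-ratio transforms; the sketch only mirrors the underlying idea.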
Abstract:
Isotopic data are currently becoming an important source of information on the sources, evolution and mixing processes of water in hydrogeologic systems. However, it is not clear how to treat geochemical and isotopic data together statistically. We propose to introduce the isotopic information as new parts and to apply compositional data analysis to the resulting augmented composition. The results are equivalent to downscaling the classical isotopic delta variables, because these are already relative (as required in the compositional framework) and isotopic variations are almost always very small. The methodology is illustrated and tested with a study of the Llobregat River Basin (Barcelona, NE Spain), where it is shown that, though very small, isotopic variations complement the geochemical principal components and help to better identify pollution sources.
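The abstract does not spell out the exact recipe for turning delta values into compositional parts, so the following Python sketch is only a hedged illustration: it uses the fact that delta + 1000 equals 1000*(R/Rstd), a positive, scale-free quantity, appends that as an extra part to a hypothetical major-ion composition, and re-closes before a clr-based analysis. All numbers and the offset choice are illustrative, not taken from the Llobregat study.

```python
import numpy as np

def closure(x, total=1.0):
    x = np.asarray(x, dtype=float)
    return total * x / x.sum(axis=1, keepdims=True)

def clr(x):
    lx = np.log(closure(x))
    return lx - lx.mean(axis=1, keepdims=True)

# hypothetical major-ion concentrations (mg/L) for three water samples
ions = np.array([[120., 35., 60., 15.],
                 [ 80., 20., 95., 10.],
                 [200., 50., 40., 25.]])

# hypothetical delta-18O values (permil); delta = 1000*(R/Rstd - 1), so
# delta + 1000 is proportional to the (positive) isotope ratio R/Rstd
d18o = np.array([-8.2, -7.9, -6.5])
iso_part = d18o + 1000.0

augmented = np.column_stack([ions, iso_part])   # composition with the isotopic part added
z = clr(augmented)                              # compositional analysis on the augmented data
print(z.round(3))
```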
Abstract:
The first discussion of compositional data analysis is attributable to Karl Pearson, in 1897. However, notwithstanding the recent developments on the algebraic structure of the simplex, more than twenty years after Aitchison's idea of log-transformations of closed data, the scientific literature is again full of statistical treatments of this type of data using traditional methodologies. This is particularly true in environmental geochemistry, where besides the problem of closure, the spatial structure (dependence) of the data has to be considered. In this work we propose the use of log-contrast values, obtained by a simplicial principal component analysis, as indicators of given environmental conditions. The investigation of the log-contrast frequency distributions allows pointing out the statistical laws able to generate the values and to govern their variability. The changes, compared, for example, with the mean values of the random variables assumed as models, or with other reference parameters, allow defining monitors to be used to assess the extent of possible environmental contamination. A case study on running and ground waters from Chiavenna Valley (Northern Italy), using Na+, K+, Ca2+, Mg2+, HCO3-, SO42- and Cl- concentrations, is illustrated.
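As a rough sketch of how log-contrast indicators can be obtained from a simplicial principal component analysis, the Python fragment below clr-transforms a simulated seven-part hydrochemical composition and takes the first principal direction; because clr rows sum to zero, the resulting loadings also sum (numerically) to zero and therefore define a log-contrast whose sample values could be screened against a reference distribution. The data are synthetic, and the whole fragment is an assumption about the general workflow, not the Chiavenna Valley analysis.

```python
import numpy as np

def clr(x):
    lx = np.log(np.asarray(x, dtype=float))
    return lx - lx.mean(axis=1, keepdims=True)

# hypothetical concentrations for Na+, K+, Ca2+, Mg2+, HCO3-, SO4^2-, Cl- (mg/L)
rng = np.random.default_rng(0)
conc = rng.lognormal(mean=2.0, sigma=0.4, size=(50, 7))

z = clr(conc)
z_c = z - z.mean(axis=0)
# PCA via SVD of the centred clr data ("simplicial" PCA in clr coordinates)
u, s, vt = np.linalg.svd(z_c, full_matrices=False)
loadings = vt[0]            # sums to ~0, hence defines a log-contrast of the parts
scores = z_c @ loadings     # log-contrast values, candidate environmental indicators
print("first log-contrast loadings:", loadings.round(3))
print("example indicator values:", scores[:5].round(3))
```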
Abstract:
One of the disadvantages of old age is that there is more past than future: this, however, may be turned into an advantage if the wealth of experience and, hopefully, wisdom gained in the past can be reflected upon and throw some light on possible future trends. To an extent, then, this talk is necessarily personal, certainly nostalgic, but also self-critical and inquisitive about our understanding of the discipline of statistics. A number of almost philosophical themes will run through the talk: search for appropriate modelling in relation to the real problem envisaged, emphasis on sensible balances between simplicity and complexity, the relative roles of theory and practice, the nature of communication of inferential ideas to the statistical layman, the inter-related roles of teaching, consultation and research. A list of keywords might be: identification of sample space and its mathematical structure, choices between transform and stay, the role of parametric modelling, the role of a sample space metric, the underused hypothesis lattice, the nature of compositional change, particularly in relation to the modelling of processes. While the main theme will be relevance to compositional data analysis, we shall point to substantial implications for general multivariate analysis arising from experience of the development of compositional data analysis…
Abstract:
We consider the application of normal-theory methods to the estimation and testing of a general type of multivariate regression model with errors in variables, in the case where various data sets are merged into a single analysis and the observable variables possibly deviate from normality. The various samples to be merged can differ in the set of observable variables available. We show that there is a convenient way to parameterize the model so that, despite the possible non-normality of the data, normal-theory methods yield correct inferences for the parameters of interest and for the goodness-of-fit test. The theory described encompasses both the functional and structural model cases, and can be implemented using standard software for structural equation models, such as LISREL, EQS and LISCOMP, among others. An illustration with Monte Carlo data is presented.
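The paper's contribution is a parameterization for structural-equation software; the Python sketch below does not reproduce that, it only simulates the basic errors-in-variables phenomenon the model addresses: a regressor observed with noise attenuates the naive least-squares slope, and a simple method-of-moments correction (assuming the error variance is known) recovers it. All values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
x_true = rng.normal(size=n)                       # latent (error-free) regressor
y = 2.0 * x_true + rng.normal(scale=0.5, size=n)  # true slope = 2
x_obs = x_true + rng.normal(scale=1.0, size=n)    # regressor observed with measurement error

C = np.cov(x_obs, y)
beta_naive = C[0, 1] / C[0, 0]                    # attenuated towards zero

sigma_e2 = 1.0                                    # measurement-error variance, assumed known here
reliability = (C[0, 0] - sigma_e2) / C[0, 0]
beta_corrected = beta_naive / reliability         # method-of-moments correction

print(f"naive slope {beta_naive:.2f}, corrected {beta_corrected:.2f}, true 2.00")
```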
Abstract:
The application of correspondence analysis to square asymmetric tables is often unsuccessful because of the strong role played by the diagonal entries of the matrix, obscuring the data off the diagonal. A simple modification of the centering of the matrix, coupled with the corresponding change in row and column masses and row and column metrics, allows the table to be decomposed into symmetric and skew-symmetric components, which can then be analyzed separately. The symmetric and skew-symmetric analyses can be performed using a simple correspondence analysis program if the data are set up in a special block format.
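The central algebraic step, splitting a square table into a symmetric and a skew-symmetric component that can then be analyzed separately, is easy to show in a few lines of Python. The table below is invented, and the sketch stops short of the block-format correspondence analysis itself; it only verifies the decomposition and inspects each component with an SVD.

```python
import numpy as np

# hypothetical square asymmetric table, e.g. flows from row category to column category
N = np.array([[50., 10.,  4.],
              [ 3., 40., 12.],
              [ 8.,  2., 30.]])

S = (N + N.T) / 2.0          # symmetric component
K = (N - N.T) / 2.0          # skew-symmetric component (zero diagonal, K = -K.T)
assert np.allclose(N, S + K)

# each component can then be decomposed separately, e.g. via an SVD;
# the skew-symmetric part yields pairs of equal singular values
print("singular values of S:", np.round(np.linalg.svd(S, compute_uv=False), 2))
print("singular values of K:", np.round(np.linalg.svd(K, compute_uv=False), 2))
```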
Abstract:
The class of Schoenberg transformations, embedding Euclidean distances into higher dimensional Euclidean spaces, is presented, and derived from theorems on positive definite and conditionally negative definite matrices. Original results on the arc lengths, angles and curvature of the transformations are proposed, and visualized on artificial data sets by classical multidimensional scaling. A distance-based discriminant algorithm and a robust multidimensional centroid estimate illustrate the theory, closely connected to the Gaussian kernels of Machine Learning.
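A minimal Python sketch of the idea, under the assumption that phi(t) = 1 - exp(-lambda*t), applied to squared Euclidean distances, is one admissible member of the Schoenberg family (it is the transformation induced by a Gaussian kernel): transform the squared distances, then embed the result with classical multidimensional scaling. The data set and the bandwidth lambda are arbitrary choices for the example.

```python
import numpy as np

def classical_mds(d2, k=2):
    """Classical (Torgerson) MDS from a matrix of squared distances."""
    n = d2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ d2 @ J                      # double-centred Gram matrix
    w, v = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:k]
    return v[:, idx] * np.sqrt(np.clip(w[idx], 0, None))

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared Euclidean distances

# one Schoenberg transformation: phi(t) = 1 - exp(-lam * t) applied to squared
# distances yields squared distances embeddable in a Euclidean space; lam is
# an illustrative bandwidth
lam = 0.5
d2_new = 1.0 - np.exp(-lam * d2)

Y = classical_mds(d2_new, k=2)                 # visualize the transformed configuration
print(Y[:3].round(3))
```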
Abstract:
The graphical representation of spatial soil properties in a digital environment is complex because it requires a conversion of data collected in a discrete form onto a continuous surface. The objective of this study was to apply three-dimensional interpolation and visualization techniques to soil texture and fertility properties and establish relationships with pedogenetic factors and processes in a slope area. The GRASS Geographic Information System was used to generate three-dimensional models and ParaView software to visualize soil volumes. Samples of the A, AB, BA, and B horizons were collected in a regular 122-point grid in an area of 13 ha, in Pinhais, PR, in southern Brazil. Geoprocessing and graphic computing techniques were effective in identifying and delimiting soil volumes of distinct ranges of fertility properties confined within the soil matrix. Both the three-dimensional interpolation and the visualization tool facilitated interpretation in a continuous space (volumes) of the cause-effect relationships between soil texture and fertility properties and pedological factors and processes, such as higher clay contents following the drainage lines of the area. The flattest part with more weathered soils (Oxisols) had the highest pH values and lower Al3+ concentrations. These techniques of data interpolation and visualization have great potential for use in diverse areas of soil science, such as identification of soil volumes occurring side-by-side but that exhibit different physical, chemical, and mineralogical conditions for plant root growth, and monitoring of plumes of organic and inorganic pollutants in soils and sediments, among other applications. The methodological details for interpolation and a three-dimensional view of soil data are presented here.
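The study's workflow used GRASS GIS and ParaView; as a generic stand-in (not the authors' procedure), the Python sketch below interpolates scattered soil observations onto a regular voxel grid with SciPy, which is the kind of discrete-to-continuous conversion the abstract refers to. Sample locations, depths and clay contents are simulated.

```python
import numpy as np
from scipy.interpolate import griddata

rng = np.random.default_rng(3)
# hypothetical sample locations (x, y in metres, depth z in metres) and clay content (%)
pts = rng.uniform([0, 0, 0], [360, 360, 1.2], size=(122, 3))
clay = 20 + 0.05 * pts[:, 0] + 15 * pts[:, 2] + rng.normal(scale=2, size=122)

# regular voxel grid over the sampled volume
gx, gy, gz = np.meshgrid(np.linspace(0, 360, 40),
                         np.linspace(0, 360, 40),
                         np.linspace(0, 1.2, 10), indexing="ij")
grid = np.column_stack([gx.ravel(), gy.ravel(), gz.ravel()])

# linear interpolation inside the convex hull, nearest-neighbour as a fallback outside it
vol = griddata(pts, clay, grid, method="linear")
vol = np.where(np.isnan(vol), griddata(pts, clay, grid, method="nearest"), vol)
vol = vol.reshape(gx.shape)    # a 3-D array that could be exported for volume rendering
print(vol.shape, vol.min().round(1), vol.max().round(1))
```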
Abstract:
Aim Species distribution models (SDMs) based on current species ranges underestimate the potential distribution when projected in time and/or space. A multi-temporal model calibration approach has been suggested as an alternative, and we evaluate this using 13,000 years of data. Location Europe. Methods We used fossil-based records of presence for Picea abies, Abies alba and Fagus sylvatica and six climatic variables for the period 13,000 to 1000 yr BP. To measure the contribution of each 1000-year time step to the total niche of each species (the niche measured by pooling all the data), we employed a principal components analysis (PCA) calibrated with data over the entire range of possible climates. Then we projected both the total niche and the partial niches from single time frames into the PCA space, and tested whether the partial niches were more similar to the total niche than expected at random. Using an ensemble forecasting approach, we calibrated SDMs for each time frame and for the pooled database. We projected each model to current climate and evaluated the results against current pollen data. We also projected all models into the future. Results Niche similarity between the partial and the total SDMs was almost always statistically significant and increased through time. SDMs calibrated from single time frames gave different results when projected to current climate, providing evidence of a change in the species' realized niches through time. Moreover, they predicted limited climate suitability when compared with the total SDMs. The same results were obtained when projected to future climates. Main conclusions The realized climatic niche of species differed for current and future climates when SDMs were calibrated considering different past climates. Building the niche as an ensemble through time represents a way forward to a better understanding of a species' range and its ecology in a changing climate.
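As a very loose sketch of the PCA-based niche comparison described above (centroid distance is used here as a crude similarity proxy, which is an assumption; the paper's actual test is not specified in the abstract beyond "more similar than random"), the Python fragment below calibrates a PCA on a pooled climate background, projects the pooled and single-time-frame presences, and compares the observed centroid shift with a null distribution from random subsets. All data are simulated.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# hypothetical background climate (6 variables) and presence records for two data sets
climate = rng.normal(size=(2000, 6))
pres_total = climate[rng.choice(2000, 400, replace=False)]   # pooled ("total") niche
pres_t1 = pres_total[:80]                                    # one 1000-year time slice

pca = PCA(n_components=2).fit(climate)        # calibrated on the full climate range

def centroid(x):
    return pca.transform(x).mean(axis=0)

obs = np.linalg.norm(centroid(pres_t1) - centroid(pres_total))
# null distribution: random subsets of the total niche of the same size
null = [np.linalg.norm(centroid(pres_total[rng.choice(len(pres_total), len(pres_t1),
                                                      replace=False)])
                       - centroid(pres_total)) for _ in range(200)]
print("observed centroid shift:", round(obs, 3),
      " null 95th percentile:", round(float(np.percentile(null, 95)), 3))
```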
Abstract:
A prior project, HR-388 (which was entitled "Total Cost of Transportation analysis of road and highway issues"), explored the use of a total economic cost basis for the evaluation of road-based transportation issues. It was conducted as a proof-of-concept effort between 1996 and 2002, with the final report presented in May 2002. TR-477 rebuilt the analytical model using current data, then performed general, system-level, county-level, and road-segment-level analyses. The results are presented herein and will be distributed to all county engineers for information and local use.
Abstract:
Machine Learning for geospatial data: algorithms, software tools and case studies. The thesis is devoted to the analysis, modeling and visualisation of spatial environmental data using machine learning algorithms. In a broad sense, machine learning can be considered a subfield of artificial intelligence; it mainly concerns the development of techniques and algorithms that allow computers to learn from data. In this thesis, machine learning algorithms are adapted to learn from spatial environmental data and to make spatial predictions. Why machine learning? In a few words, most machine learning algorithms are universal, adaptive, nonlinear, robust and efficient modeling tools. They can find solutions for classification, regression and probability density modeling problems in high-dimensional geo-feature spaces, composed of geographical space and additional relevant spatially referenced features. They are well suited to be implemented as predictive engines in decision support systems, for the purposes of environmental data mining including pattern recognition, modeling and prediction as well as automatic data mapping. Their efficiency is competitive with geostatistical models in low-dimensional geographical spaces, but they are indispensable in high-dimensional geo-feature spaces. The most important and popular machine learning algorithms and models of interest for geo- and environmental sciences are presented in detail, from a theoretical description of the concepts to the software implementation. The main algorithms and models considered are the following: the multi-layer perceptron (a workhorse of machine learning), general regression neural networks, probabilistic neural networks, self-organising (Kohonen) maps, Gaussian mixture models, radial basis function networks, and mixture density networks. This set of models covers machine learning tasks such as classification, regression and density estimation. Exploratory data analysis (EDA) is the initial and a very important part of any data analysis.
In this thesis the concepts of exploratory spatial data analysis (ESDA) are considered using both the traditional geostatistical approach, namely experimental variography, and machine learning. Experimental variography is a basic tool for the geostatistical analysis of anisotropic spatial correlations which helps to understand the presence of spatial patterns, at least those described by two-point statistics. A machine learning approach to ESDA is presented by applying the k-nearest neighbors (k-NN) method, which is simple and has very good interpretation and visualization properties. An important part of the thesis deals with a current hot topic, namely the automatic mapping of geospatial data. The general regression neural network (GRNN) is proposed as an efficient model to solve this task. The performance of the GRNN model is demonstrated on Spatial Interpolation Comparison (SIC) 2004 data, where it significantly outperformed all other approaches, especially in the case of emergency conditions. The thesis consists of four chapters and has the following structure: theory, applications, software tools, and how-to-do-it examples. An important part of the work is a collection of software tools, Machine Learning Office. The Machine Learning Office tools were developed over the last 15 years and have been used both for many teaching courses, including international workshops in China, France, Italy, Ireland and Switzerland, and for carrying out fundamental and applied research projects. The case studies considered cover a wide spectrum of real-life low- and high-dimensional geo- and environmental problems, such as air, soil and water pollution by radionuclides and heavy metals, classification of soil types and hydro-geological units, decision-oriented mapping with uncertainties, and natural hazard (landslide, avalanche) assessment and susceptibility mapping. Complementary tools useful for exploratory data analysis and visualisation were developed as well. The software is user friendly and easy to use.
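A GRNN is, in essence, Nadaraya-Watson kernel regression with a Gaussian kernel; the Python sketch below implements that core idea for a simulated 2-D mapping task. It is a bare-bones illustration of the kind of automatic-mapping engine discussed in the thesis, not the Machine Learning Office implementation; the bandwidth sigma would normally be tuned, for example by cross-validation.

```python
import numpy as np

def grnn_predict(x_train, y_train, x_query, sigma):
    """GRNN prediction = Gaussian-kernel-weighted average of the training targets."""
    d2 = ((x_query[:, None, :] - x_train[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    return (w @ y_train) / w.sum(axis=1)

rng = np.random.default_rng(5)
# hypothetical spatial measurements: coordinates in km and a measured value
xy = rng.uniform(0, 100, size=(200, 2))
val = np.sin(xy[:, 0] / 15.0) + 0.1 * rng.normal(size=200)

# predict on a regular mapping grid
gx, gy = np.meshgrid(np.linspace(0, 100, 50), np.linspace(0, 100, 50))
grid = np.column_stack([gx.ravel(), gy.ravel()])
pred = grnn_predict(xy, val, grid, sigma=5.0).reshape(50, 50)
print(pred.shape, pred.min().round(2), pred.max().round(2))
```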
Abstract:
Performance measurements of computer system components and software provide information that can be used to improve performance and to support hardware procurement decisions. This work examines performance measurement and measurement programs, i.e. so-called benchmark software. Different types of freely available benchmark programs suitable for analysing the performance of a Linux computing cluster were surveyed and evaluated. The benchmarks were grouped and assessed by testing their features on a Linux cluster. The work also discusses the challenges of making measurements and of parallel computing. Benchmarks were found for many purposes, and they proved to vary in quality and scope. They have also been assembled into software suites so that a broader picture of hardware performance can be obtained than is possible with a single program. It is essential to understand the speed at which data can be transferred to the processor from main memory, from disk systems and from other compute nodes. A typical benchmark program contains a computationally intensive mathematical algorithm of the kind used in scientific software. Depending on the benchmark, understanding and exploiting the results can be challenging.
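To make the point about data-transfer speed concrete, here is a very rough Python micro-benchmark in the spirit of the STREAM triad, estimating effective main-memory bandwidth; it is a toy sketch (NumPy temporaries inflate the real traffic), not a replacement for the benchmark suites discussed in the thesis.

```python
import time
import numpy as np

def triad_bandwidth(n=20_000_000, repeats=5):
    """Rough STREAM-triad-style estimate of main-memory bandwidth in GB/s."""
    a = np.zeros(n)
    b = np.random.random(n)
    c = np.random.random(n)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        np.add(b, 2.0 * c, out=a)      # a = b + 2*c, streaming through three large arrays
        best = min(best, time.perf_counter() - t0)
    moved = 3 * n * a.itemsize         # bytes nominally read/written (ignores the 2*c temporary)
    return moved / best / 1e9

print(f"estimated memory bandwidth: {triad_bandwidth():.1f} GB/s")
```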
Abstract:
To enable a mathematically and physically sound execution of the fatigue test and a correct interpretation of its results, statistical evaluation methods are used to assist in the analysis of fatigue testing data. The main objective of this work is to develop step-by-step instructions for the statistical analysis of laboratory fatigue data. The scope of this project is to provide practical cases answering the several questions raised in the treatment of test data, applying the methods and formulae of the document IIW-XIII-2138-06 (Best Practice Guide on the Statistical Analysis of Fatigue Data). Generally, the questions in the data sheets involve several aspects: estimation of the necessary sample size, verification of the statistical equivalence of the collated sets of data, and determination of characteristic curves in different cases. The series of comprehensive examples given in this thesis demonstrates the various statistical methods and develops a sound procedure for creating reliable calculation rules for fatigue analysis.
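As a hedged illustration of one of the tasks listed above, deriving a characteristic S-N curve from laboratory fatigue data, the Python sketch below fits a Basquin-type line in log-log space and shifts it down by k standard deviations of the residuals. The data are invented and k = 2 is only a placeholder; in practice k depends on the sample size and the chosen survival probability.

```python
import numpy as np

# hypothetical fatigue test results: stress range (MPa) and cycles to failure
S = np.array([200., 180., 160., 150., 140., 120., 110., 100.])
N = np.array([1.2e5, 2.0e5, 3.5e5, 4.8e5, 6.5e5, 1.4e6, 2.1e6, 3.5e6])

x, y = np.log10(S), np.log10(N)
slope, intercept = np.polyfit(x, y, 1)          # Basquin-type fit: logN = a + m*logS
resid_sd = np.sqrt(np.sum((y - (intercept + slope * x)) ** 2) / (len(x) - 2))

# characteristic (design) curve: mean curve shifted down by k standard deviations
k = 2.0                                         # placeholder; depends on sample size in practice
char_intercept = intercept - k * resid_sd
print(f"mean curve: logN = {intercept:.2f} {slope:+.2f}*logS, "
      f"characteristic intercept: {char_intercept:.2f}")
```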