920 resultados para Multivariate data analysis
Resumo:
Résumé Cette thèse est consacrée à l'analyse, la modélisation et la visualisation de données environnementales à référence spatiale à l'aide d'algorithmes d'apprentissage automatique (Machine Learning). L'apprentissage automatique peut être considéré au sens large comme une sous-catégorie de l'intelligence artificielle qui concerne particulièrement le développement de techniques et d'algorithmes permettant à une machine d'apprendre à partir de données. Dans cette thèse, les algorithmes d'apprentissage automatique sont adaptés pour être appliqués à des données environnementales et à la prédiction spatiale. Pourquoi l'apprentissage automatique ? Parce que la majorité des algorithmes d'apprentissage automatiques sont universels, adaptatifs, non-linéaires, robustes et efficaces pour la modélisation. Ils peuvent résoudre des problèmes de classification, de régression et de modélisation de densité de probabilités dans des espaces à haute dimension, composés de variables informatives spatialisées (« géo-features ») en plus des coordonnées géographiques. De plus, ils sont idéaux pour être implémentés en tant qu'outils d'aide à la décision pour des questions environnementales allant de la reconnaissance de pattern à la modélisation et la prédiction en passant par la cartographie automatique. Leur efficacité est comparable au modèles géostatistiques dans l'espace des coordonnées géographiques, mais ils sont indispensables pour des données à hautes dimensions incluant des géo-features. Les algorithmes d'apprentissage automatique les plus importants et les plus populaires sont présentés théoriquement et implémentés sous forme de logiciels pour les sciences environnementales. Les principaux algorithmes décrits sont le Perceptron multicouches (MultiLayer Perceptron, MLP) - l'algorithme le plus connu dans l'intelligence artificielle, le réseau de neurones de régression généralisée (General Regression Neural Networks, GRNN), le réseau de neurones probabiliste (Probabilistic Neural Networks, PNN), les cartes auto-organisées (SelfOrganized Maps, SOM), les modèles à mixture Gaussiennes (Gaussian Mixture Models, GMM), les réseaux à fonctions de base radiales (Radial Basis Functions Networks, RBF) et les réseaux à mixture de densité (Mixture Density Networks, MDN). Cette gamme d'algorithmes permet de couvrir des tâches variées telle que la classification, la régression ou l'estimation de densité de probabilité. L'analyse exploratoire des données (Exploratory Data Analysis, EDA) est le premier pas de toute analyse de données. Dans cette thèse les concepts d'analyse exploratoire de données spatiales (Exploratory Spatial Data Analysis, ESDA) sont traités selon l'approche traditionnelle de la géostatistique avec la variographie expérimentale et selon les principes de l'apprentissage automatique. La variographie expérimentale, qui étudie les relations entre pairs de points, est un outil de base pour l'analyse géostatistique de corrélations spatiales anisotropiques qui permet de détecter la présence de patterns spatiaux descriptible par une statistique. L'approche de l'apprentissage automatique pour l'ESDA est présentée à travers l'application de la méthode des k plus proches voisins qui est très simple et possède d'excellentes qualités d'interprétation et de visualisation. Une part importante de la thèse traite de sujets d'actualité comme la cartographie automatique de données spatiales. Le réseau de neurones de régression généralisée est proposé pour résoudre cette tâche efficacement. Les performances du GRNN sont démontrées par des données de Comparaison d'Interpolation Spatiale (SIC) de 2004 pour lesquelles le GRNN bat significativement toutes les autres méthodes, particulièrement lors de situations d'urgence. La thèse est composée de quatre chapitres : théorie, applications, outils logiciels et des exemples guidés. Une partie importante du travail consiste en une collection de logiciels : Machine Learning Office. Cette collection de logiciels a été développée durant les 15 dernières années et a été utilisée pour l'enseignement de nombreux cours, dont des workshops internationaux en Chine, France, Italie, Irlande et Suisse ainsi que dans des projets de recherche fondamentaux et appliqués. Les cas d'études considérés couvrent un vaste spectre de problèmes géoenvironnementaux réels à basse et haute dimensionnalité, tels que la pollution de l'air, du sol et de l'eau par des produits radioactifs et des métaux lourds, la classification de types de sols et d'unités hydrogéologiques, la cartographie des incertitudes pour l'aide à la décision et l'estimation de risques naturels (glissements de terrain, avalanches). Des outils complémentaires pour l'analyse exploratoire des données et la visualisation ont également été développés en prenant soin de créer une interface conviviale et facile à l'utilisation. Machine Learning for geospatial data: algorithms, software tools and case studies Abstract The thesis is devoted to the analysis, modeling and visualisation of spatial environmental data using machine learning algorithms. In a broad sense machine learning can be considered as a subfield of artificial intelligence. It mainly concerns with the development of techniques and algorithms that allow computers to learn from data. In this thesis machine learning algorithms are adapted to learn from spatial environmental data and to make spatial predictions. Why machine learning? In few words most of machine learning algorithms are universal, adaptive, nonlinear, robust and efficient modeling tools. They can find solutions for the classification, regression, and probability density modeling problems in high-dimensional geo-feature spaces, composed of geographical space and additional relevant spatially referenced features. They are well-suited to be implemented as predictive engines in decision support systems, for the purposes of environmental data mining including pattern recognition, modeling and predictions as well as automatic data mapping. They have competitive efficiency to the geostatistical models in low dimensional geographical spaces but are indispensable in high-dimensional geo-feature spaces. The most important and popular machine learning algorithms and models interesting for geo- and environmental sciences are presented in details: from theoretical description of the concepts to the software implementation. The main algorithms and models considered are the following: multi-layer perceptron (a workhorse of machine learning), general regression neural networks, probabilistic neural networks, self-organising (Kohonen) maps, Gaussian mixture models, radial basis functions networks, mixture density networks. This set of models covers machine learning tasks such as classification, regression, and density estimation. Exploratory data analysis (EDA) is initial and very important part of data analysis. In this thesis the concepts of exploratory spatial data analysis (ESDA) is considered using both traditional geostatistical approach such as_experimental variography and machine learning. Experimental variography is a basic tool for geostatistical analysis of anisotropic spatial correlations which helps to understand the presence of spatial patterns, at least described by two-point statistics. A machine learning approach for ESDA is presented by applying the k-nearest neighbors (k-NN) method which is simple and has very good interpretation and visualization properties. Important part of the thesis deals with a hot topic of nowadays, namely, an automatic mapping of geospatial data. General regression neural networks (GRNN) is proposed as efficient model to solve this task. Performance of the GRNN model is demonstrated on Spatial Interpolation Comparison (SIC) 2004 data where GRNN model significantly outperformed all other approaches, especially in case of emergency conditions. The thesis consists of four chapters and has the following structure: theory, applications, software tools, and how-to-do-it examples. An important part of the work is a collection of software tools - Machine Learning Office. Machine Learning Office tools were developed during last 15 years and was used both for many teaching courses, including international workshops in China, France, Italy, Ireland, Switzerland and for realizing fundamental and applied research projects. Case studies considered cover wide spectrum of the real-life low and high-dimensional geo- and environmental problems, such as air, soil and water pollution by radionuclides and heavy metals, soil types and hydro-geological units classification, decision-oriented mapping with uncertainties, natural hazards (landslides, avalanches) assessments and susceptibility mapping. Complementary tools useful for the exploratory data analysis and visualisation were developed as well. The software is user friendly and easy to use.
Resumo:
The present paper advocates for the creation of a federated, hybrid database in the cloud, integrating law data from all available public sources in one single open access system - adding, in the process, relevant meta-data to the indexed documents, including the identification of social and semantic entities and the relationships between them, using linked open data techniques and standards such as RDF. Examples of potential benefits and applications of this approach are also provided, including, among others, experiences from of our previous research, in which data integration, graph databases and social and semantic networks analysis were used to identify power relations, litigation dynamics and cross-references patterns both intra and inter-institutionally, covering most of the World international economic courts.
Resumo:
The paper presents some contemporary approaches to spatial environmental data analysis. The main topics are concentrated on the decision-oriented problems of environmental spatial data mining and modeling: valorization and representativity of data with the help of exploratory data analysis, spatial predictions, probabilistic and risk mapping, development and application of conditional stochastic simulation models. The innovative part of the paper presents integrated/hybrid model-machine learning (ML) residuals sequential simulations-MLRSS. The models are based on multilayer perceptron and support vector regression ML algorithms used for modeling long-range spatial trends and sequential simulations of the residuals. NIL algorithms deliver non-linear solution for the spatial non-stationary problems, which are difficult for geostatistical approach. Geostatistical tools (variography) are used to characterize performance of ML algorithms, by analyzing quality and quantity of the spatially structured information extracted from data with ML algorithms. Sequential simulations provide efficient assessment of uncertainty and spatial variability. Case study from the Chernobyl fallouts illustrates the performance of the proposed model. It is shown that probability mapping, provided by the combination of ML data driven and geostatistical model based approaches, can be efficiently used in decision-making process. (C) 2003 Elsevier Ltd. All rights reserved.
Resumo:
Linezolid is used off-label to treat multidrug-resistant tuberculosis (MDR-TB) in absence of systematic evidence. We performed a systematic review and meta-analysis on efficacy, safety and tolerability of linezolid-containing regimes based on individual data analysis. 12 studies (11 countries from three continents) reporting complete information on safety, tolerability, efficacy of linezolid-containing regimes in treating MDR-TB cases were identified based on Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines. Meta-analysis was performed using the individual data of 121 patients with a definite treatment outcome (cure, completion, death or failure). Most MDR-TB cases achieved sputum smear (86 (92.5%) out of 93) and culture (100 (93.5%) out of 107) conversion after treatment with individualised regimens containing linezolid (median (inter-quartile range) times for smear and culture conversions were 43.5 (21-90) and 61 (29-119) days, respectively) and 99 (81.8%) out of 121 patients were successfully treated. No significant differences were detected in the subgroup efficacy analysis (daily linezolid dosage ≤600 mg versus >600 mg). Adverse events were observed in 63 (58.9%) out of 107 patients, of which 54 (68.4%) out of 79 were major adverse events that included anaemia (38.1%), peripheral neuropathy (47.1%), gastro-intestinal disorders (16.7%), optic neuritis (13.2%) and thrombocytopenia (11.8%). The proportion of adverse events was significantly higher when the linezolid daily dosage exceeded 600 mg. The study results suggest an excellent efficacy but also the necessity of caution in the prescription of linezolid.
Resumo:
ABSTRACT Dual-trap optical tweezers are often used in high-resolution measurements in single-molecule biophysics. Such measurements can be hindered by the presence of extraneous noise sources, the most prominent of which is the coupling of fluctuations along different spatial directions, which may affect any optical tweezers setup. In this article, we analyze, both from the theoretical and the experimental points of view, the most common source for these couplings in dual-trap optical-tweezers setups: the misalignment of traps and tether. We give criteria to distinguish different kinds of misalignment, to estimate their quantitative relevance and to include them in the data analysis. The experimental data is obtained in a, to our knowledge, novel dual-trap optical-tweezers setup that directly measures forces. In the case in which misalignment is negligible, we provide a method to measure the stiffness of traps and tether based on variance analysis. This method can be seen as a calibration technique valid beyond the linear trap region. Our analysis is then employed to measure the persistence length of dsDNA tethers of three different lengths spanning two orders of magnitude. The effective persistence length of such tethers is shown to decrease with the contour length, in accordance with previous studies.
Resumo:
Sport betting is a lucrative business for bookmakers, for the lucky (or wise) punters, but also for governments and for sport. While not new or even recent, the deviances linked to sport betting, primarily match-fixing, have gained increased media exposure in the past decade. This exploratory study is a qualitative content analysis of the press coverage of sport betting-related deviances in football in two countries (UK and France), using in each case two leading national publications over a period of five years. Data analysis indicates a mounting coverage of sport betting scandals, with teams, players and criminals increasingly framed as culprits, while authorities and federations primarily assume a positive role. As for the origin of sport betting deviances, French newspapers tend to blame the system (in an abstract way); British newspapers, in contrast, focus more on individual weaknesses, notably greed. This article contributed to the growing body of literature on the importance of these deviances and on the way they are perceived by sport organizations, legislators and the public at large.
Resumo:
The agricultural sector has always been characterized by a predominance of small firms. International competition and the consequent need for restraining costs are permanent challenges for farms. This paper performs an empirical investigation of cost behavior in agriculture using panel data analysis. Our results show that transactions caused by complexity influence farm costs with opposite effects for specific and indirect costs. While transactions allow economies of scale in specific costs, they significantly increase indirect costs. However, the main driver for farm costs is volume. In addition, important differences exist for small and big farms, since transactional variables significantly influence the former but not the latter. While sophisticated management tools, such ABC, could provide only limited complementary useful information but no essential allocation bases for farms, they seem inappropriate for small farms
Resumo:
The agricultural sector has always been characterized by a predominance of small firms. International competition and the consequent need for restraining costs are permanent challenges for farms. This paper performs an empirical investigation of cost behavior in agriculture using panel data analysis. Our results show that transactions caused by complexity influence farm costs with opposite effects for specific and indirect costs. While transactions allow economies of scale in specific costs, they significantly increase indirect costs. However, the main driver for farm costs is volume. In addition, important differences exist for small and big farms, since transactional variables significantly influence the former but not the latter. While sophisticated management tools, such ABC, could provide only limited complementary useful information but no essential allocation bases for farms, they seem inappropriate for small farms
Resumo:
We present a participant study that compares biological data exploration tasks using volume renderings of laser confocal microscopy data across three environments that vary in level of immersion: a desktop, fishtank, and cave system. For the tasks, data, and visualization approach used in our study, we found that subjects qualitatively preferred and quantitatively performed better in the cave compared with the fishtank and desktop. Subjects performed real-world biological data analysis tasks that emphasized understanding spatial relationships including characterizing the general features in a volume, identifying colocated features, and reporting geometric relationships such as whether clusters of cells were coplanar. After analyzing data in each environment, subjects were asked to choose which environment they wanted to analyze additional data sets in - subjects uniformly selected the cave environment.
Resumo:
The synthesis and characterization of crosslinked chitosan microbeads and their application in the removal of Cr(VI) are described. New kinetic and thermodynamic parameters of Cr(VI) adsorptions processes were found using continuous isothermal calorimetry. All adsorption processes are exothermic in nature. However, a multivariate statistical analysis have pointed out that adsorption enthalpies were affected by important binary interactions of the initial Cr(VI) in solution and temperature. The adsorption energetic data were well fitted to a kinetic exponential model, which have indicated fractionary adsorption kinetic orders.
Resumo:
Raw measurement data does not always immediately convey useful information, but applying mathematical statistical analysis tools into measurement data can improve the situation. Data analysis can offer benefits like acquiring meaningful insight from the dataset, basing critical decisions on the findings, and ruling out human bias through proper statistical treatment. In this thesis we analyze data from an industrial mineral processing plant with the aim of studying the possibility of forecasting the quality of the final product, given by one variable, with a model based on the other variables. For the study mathematical tools like Qlucore Omics Explorer (QOE) and Sparse Bayesian regression (SB) are used. Later on, linear regression is used to build a model based on a subset of variables that seem to have most significant weights in the SB model. The results obtained from QOE show that the variable representing the desired final product does not correlate with other variables. For SB and linear regression, the results show that both SB and linear regression models built on 1-day averaged data seriously underestimate the variance of true data, whereas the two models built on 1-month averaged data are reliable and able to explain a larger proportion of variability in the available data, making them suitable for prediction purposes. However, it is concluded that no single model can fit well the whole available dataset and therefore, it is proposed for future work to make piecewise non linear regression models if the same available dataset is used, or the plant to provide another dataset that should be collected in a more systematic fashion than the present data for further analysis.
Resumo:
Honey produced by three stingless bee species (Melipona flavolineata, M. fasciculata and Apis mellifera) from different regions of the Amazon was analyzed by separating phenolic acids and flavonoids using the HPLC technique. Data were subjected to multivariate statistical analysis (PCA, HCA and DA). Results showed the three species of honey samples could be distinguished by phenolic composition. Antioxidant activity of the honeys was determined by studying the capacity of inhibiting radicals using DPPH assay. Honeys with higher phenolic compound contents had greater antioxidant capacity and darker color.
Resumo:
The objective of this work was to develop a free access exploratory data analysis software application for academic use that is easy to install and can be handled without user-level programming due to extensive use of chemometrics and its association with applications that require purchased licenses or routines. The developed software, called Chemostat, employs Hierarchical Cluster Analysis (HCA), Principal Component Analysis (PCA), intervals Principal Component Analysis (iPCA), as well as correction methods, data transformation and outlier detection. The data can be imported from the clipboard, text files, ASCII or FT-IR Perkin-Elmer “.sp” files. It generates a variety of charts and tables that allow the analysis of results that can be exported in several formats. The main features of the software were tested using midinfrared and near-infrared spectra in vegetable oils and digital images obtained from different types of commercial diesel. In order to validate the software results, the same sets of data were analyzed using Matlab© and the results in both applications matched in various combinations. In addition to the desktop version, the reuse of algorithms allowed an online version to be provided that offers a unique experience on the web. Both applications are available in English.