Biblioteca Digital

90 resultados para Data anonymization and sanitization

em Université de Lausanne, Switzerland

Phylogeny and circumscription of Sapindaceae revisited: molecular sequence data, morphology and biogeography support recognition of a new family, Xanthoceraceae

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Background and aims Recent studies have adopted a broad definition of Sapindaceae that includes taxa traditionally placed in Aceraceae and Hippocastanaceae, achieving monophyly but yielding a family difficult to characterize and for which no obvious morphological synapomorphy exists. This expanded circumscription was necessitated by the finding that the monotypic, temperate Asian genus Xanthoceras, historically placed in Sapindaceae tribe Harpullieae, is basal within the group. Here we seek to clarify the relationships of Xanthoceras based on phylogenetic analyses using a dataset encompassing nearly 3/4 of sapindaceous genera, comparing the results with information from morphology and biogeography, in particular with respect to the other taxa placed in Harpullieae. We then re-examine the appropriateness of maintaining the current broad, morphologically heterogeneous definition of Sapindaceae and explore the advantages of an alternative family circumscription. Methods Using 243 samples representing 104 of the 142 currently recognized genera of Sapindaceae s. lat. (including all in Harpullieae), sequence data were analyzed for nuclear (ITS) and plastid (matK, rpoB, trnD-trnT, trnK-matK, trnL-trnF and trnS-trnG) markers, adopting the methodology of a recent family-wide study, performing single-gene and total evidence analyses based on maximum likelihood (ML) and maximum parsimony (MP) criteria, and applying heuristic searches developed for large datasets, viz, a new strategy implemented in RAxML (for ML) and the parsimony ratchet (for MP). Bootstrap analyses were performed for each method to test for congruence between markers. Key results Our findings support earlier suggestions that Harpullieae are polyphyletic: Xanthoceras is confirmed as sister to all other sampled taxa of Sapindaceae s. lat.; the remaining members belong to three other clades within Sapindaceae s. lat., two of which correspond respectively to the groups traditionally treated as Aceraceae and Hippocastanaceae, together forming a clade sister to the largely tropical Sapindaceae s. str., which is monophyletic and morphologically coherent provided Xanthoceras is excluded. Conclusion To overcome the difficulties of a broadly circumscribed Sapindaceae, we resurrect the historically recognized temperate families Aceraceae and Hippocastanaceae, and describe a new family, Xanthoceraceae, thus adopting a monophyletic and easily characterized circumscription of Sapindaceae nearly identical to that used for over a century.

Truncated robust distance for clinical laboratory safety data monitoring and assessment.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Laboratory safety data are routinely collected in clinical studies for safety monitoring and assessment. We have developed a truncated robust multivariate outlier detection method for identifying subjects with clinically relevant abnormal laboratory measurements. The proposed method can be applied to historical clinical data to establish a multivariate decision boundary that can then be used for future clinical trial laboratory safety data monitoring and assessment. Simulations demonstrate that the proposed method has the ability to detect relevant outliers while automatically excluding irrelevant outliers. Two examples from actual clinical studies are used to illustrate the use of this method for identifying clinically relevant outliers.

What is "clinical data"? Why and how can they be collected during field surveys on medicinal plants?

Relevância:

100.00% 100.00%

Publicador:

Resumo:

ETHNOPHARMACOLOGICAL RELEVANCE: "Reverse pharmacology", also called "bedside-to-bench" or "field to pharmacy" approach, is a research process starting with documentation of clinical outcome as observed by patients with different therapeutic regimens. The treatment most significantly associated with cure is selected for future studies: first, clinical safety and efficacy; then in vivo and vitro studies. Some clinical data, i.e. details on patient status and progress, can be collected during ethnobotanical surveys; they will help clinical researchers and, once effectiveness and safety are established, will also help users of traditional medicine make safer and more effective choices. To gather clinical data successfully, ethnopharmacologists need to be backed by an appropriate team of specialists in medicine and epidemiology. Ethnopharmacologists can also gather important data on traditional medicine safety. MATERIALS AND METHODS: The first step is to create a consensus on the meaning of "clinical data", their interest and importance. An understanding of why "a cure is not a proof of effectiveness" is a starting point to avoid faulty interpretation of the clinical observations. RESULTS: Experience showed that, with the "bedside-to-bench" approach, a treatment derived from traditional recipe can be scientifically validated (in terms of safety and effectiveness) with a cost of less than a million euros, thus providing an end-product that is affordable, available and sustainable. CONCLUSIONS: With rigorous clinical study results, medicinal plant users gain the possibility to refine heath strategies. The field surveyor may gain a better relationship with the population, once she/he is seen as bringing information useful for the quality of care in the community.

The Anarak, Jandaq and Posht-e-Badam metamorphic complexes in central Iran: New geological data, relationships and tectonic implications

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The Anarak, Jandaq and Posht-e-Badam metamorphic complexes occupy the NW part of the Central-East Iranian Microcontinent and are juxtaposed with the Great Kavir block and Sanandaj-Sirjan zone. Our recent findings redefine the origin of these complexes, so far attributed to the Precambrian-Early Paleozoic orogenic episodes, and now directly related to the tectonic evolution of the Paleo-Tethys Ocean. This tectonic evolution was initiated by Late Ordovician-Early Devonian rifting events and terminated in the Triassic by the Eocimmerian collision event due to the docking of the Cimmerian blocks with the Asiatic Turan block. The ``Variscan accretionary complex'' is a new name we proposed for the most widely distributed metamorphic rocks connected to the Anarak and Jandaq complexes. This accretionary complex exposed from SW of Jandaq to the Anarak and Kabudan areas is a thick and fine grain siliciclastic sequence accompanied by marginal-sea ophiolitic remnants, including gabbro-basalts with a supra-subduction-geochemical signature. New Ar-40/Ar-39 ages are obtained as 333-320 Ma for the metamorphism of this sequence under greenschist to amphibolite facies. Moreover, the limy intercalations in the volcano-sedimentary part of this complex in Godar-e-Siah yielded Upper Devonian-Tournaisian conodonts. The northeastern part of this complex in the Jandaq area was intruded by 215 +/- 15 Ma arc to collisional granite and pegmatites dated by ID-TIMS and its metamorphic rocks are characterized by Some Ar-40/Ar-39 radiometric ages of 163-156 Ma. The ``Variscan'' accretionary complex was northwardly accreted to the Airekan granitic terrane dated at 549 +/- 15 Ma. Later, from the Late Carboniferous to Triassic, huge amounts of oceanic material were accreted to its southern side and penetrated by several seamounts such as the Anarak and Kabudan. This new period of accretion is supported by the 280-230 Ma Ar-40/Ar-39 ages for the Anarak mild high-pressure metamorphic rocks and a 262 Ma U-Pb age for the trondhjemite-rhyolite association of that area. The Triassic Bayazeh flysch filled the foreland basin during the final closure of the Paleo-Tethys Ocean and was partly deposited and/or thrusted onto the Cimmerian Yazd block. The Paleo-Tethys magmatic arc products have been well-preserved in the Late Devonian-Carboniferous Godar-e-Siah intra-arc deposits and the Triassic Nakhlak fore-arc succession. On the passive margin of the Cimmerian block, in the Yazd region, the nearly continuous Upper Paleozoic platform-type deposition was totally interrupted during the Middle to Late Triassic. Local erosion, down to Lower Paleozoic levels, may be related to flexural bulge erosion. The platform was finally unconformably covered by Liassic continental molassic deposits of the Shemshak. One of the extensional periods related to Neo-Tethyan back-arc rifting in Late Cretaceous time finally separated parts of the Eocimmerian collisional domain from the Eurasian Turan domain. The opening and closing of this new ocean, characterized by the Nain and Sabzevar ophiolitic melanges, finally transported the Anarak-Jandaq composite terrane to Central Iran, accompanied by large scale rotation of the Central-East Iranian Microcontinent (CEIM). Due to many similarities between the Posht-e-Badam metamorphic complex and the Anarak-Jandaq composite terrane, the former could be part of the latter, if it was transported further south during Tertiary time. (C) 2007 Elsevier B.V. All rights reserved.

How many diagnosis fields are needed to capture safety events in administrative data? Findings and recommendations from the WHO ICD-11 Topic Advisory Group on Quality and Safety

Relevância:

100.00% 100.00%

Publicador:

Resumo:

OBJECTIVE: As part of the WHO ICD-11 development initiative, the Topic Advisory Group on Quality and Safety explores meta-features of morbidity data sets, such as the optimal number of secondary diagnosis fields. DESIGN: The Health Care Quality Indicators Project of the Organization for Economic Co-Operation and Development collected Patient Safety Indicator (PSI) information from administrative hospital data of 19-20 countries in 2009 and 2011. We investigated whether three countries that expanded their data systems to include more secondary diagnosis fields showed increased PSI rates compared with six countries that did not. Furthermore, administrative hospital data from six of these countries and two American states, California (2011) and Florida (2010), were analysed for distributions of coded patient safety events across diagnosis fields. RESULTS: Among the participating countries, increasing the number of diagnosis fields was not associated with any overall increase in PSI rates. However, high proportions of PSI-related diagnoses appeared beyond the sixth secondary diagnosis field. The distribution of three PSI-related ICD codes was similar in California and Florida: 89-90% of central venous catheter infections and 97-99% of retained foreign bodies and accidental punctures or lacerations were captured within 15 secondary diagnosis fields. CONCLUSIONS: Six to nine secondary diagnosis fields are inadequate for comparing complication rates using hospital administrative data; at least 15 (and perhaps more with ICD-11) are recommended to fully characterize clinical outcomes. Increasing the number of fields should improve the international and intra-national comparability of data for epidemiologic and health services research, utilization analyses and quality of care assessment.

Design, data management, and population baseline characteristics of the PERFORM magnetic resonance imaging project.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Quantitative information from magnetic resonance imaging (MRI) may substantiate clinical findings and provide additional insight into the mechanism of clinical interventions in therapeutic stroke trials. The PERFORM study is exploring the efficacy of terutroban versus aspirin for secondary prevention in patients with a history of ischemic stroke. We report on the design of an exploratory longitudinal MRI follow-up study that was performed in a subgroup of the PERFORM trial. An international multi-centre longitudinal follow-up MRI study was designed for different MR systems employing safety and efficacy readouts: new T2 lesions, new DWI lesions, whole brain volume change, hippocampal volume change, changes in tissue microstructure as depicted by mean diffusivity and fractional anisotropy, vessel patency on MR angiography, and the presence of and development of new microbleeds. A total of 1,056 patients (men and women ≥ 55 years) were included. The data analysis included 3D reformation, image registration of different contrasts, tissue segmentation, and automated lesion detection. This large international multi-centre study demonstrates how new MRI readouts can be used to provide key information on the evolution of cerebral tissue lesions and within the macrovasculature after atherothrombotic stroke in a large sample of patients.

Environmental data monitoring and modelling using gis and remote sensing technologies. Case study: Aral sea region

Relevância:

100.00% 100.00%

Publicador:

ProteomeXchange provides globally coordinated proteomics data submission and dissemination.

Relevância:

100.00% 100.00%

Publicador:

Environmental data mining and modeling based on machine learning algorithms and geostatistics

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The paper presents some contemporary approaches to spatial environmental data analysis. The main topics are concentrated on the decision-oriented problems of environmental spatial data mining and modeling: valorization and representativity of data with the help of exploratory data analysis, spatial predictions, probabilistic and risk mapping, development and application of conditional stochastic simulation models. The innovative part of the paper presents integrated/hybrid model-machine learning (ML) residuals sequential simulations-MLRSS. The models are based on multilayer perceptron and support vector regression ML algorithms used for modeling long-range spatial trends and sequential simulations of the residuals. NIL algorithms deliver non-linear solution for the spatial non-stationary problems, which are difficult for geostatistical approach. Geostatistical tools (variography) are used to characterize performance of ML algorithms, by analyzing quality and quantity of the spatially structured information extracted from data with ML algorithms. Sequential simulations provide efficient assessment of uncertainty and spatial variability. Case study from the Chernobyl fallouts illustrates the performance of the proposed model. It is shown that probability mapping, provided by the combination of ML data driven and geostatistical model based approaches, can be efficiently used in decision-making process. (C) 2003 Elsevier Ltd. All rights reserved.

A modular approach for integrative analysis of large-scale gene-expression and drug-response data

Relevância:

100.00% 100.00%

Publicador:

Resumo:

High-throughput technologies are now used to generate more than one type of data from the same biological samples. To properly integrate such data, we propose using co-modules, which describe coherent patterns across paired data sets, and conceive several modular methods for their identification. We first test these methods using in silico data, demonstrating that the integrative scheme of our Ping-Pong Algorithm uncovers drug-gene associations more accurately when considering noisy or complex data. Second, we provide an extensive comparative study using the gene-expression and drug-response data from the NCI-60 cell lines. Using information from the DrugBank and the Connectivity Map databases we show that the Ping-Pong Algorithm predicts drug-gene associations significantly better than other methods. Co-modules provide insights into possible mechanisms of action for a wide range of drugs and suggest new targets for therapy

Updating and validating the Charlson comorbidity index and score for risk adjustment in hospital discharge abstracts using data from 6 countries.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

With advances in the effectiveness of treatment and disease management, the contribution of chronic comorbid diseases (comorbidities) found within the Charlson comorbidity index to mortality is likely to have changed since development of the index in 1984. The authors reevaluated the Charlson index and reassigned weights to each condition by identifying and following patients to observe mortality within 1 year after hospital discharge. They applied the updated index and weights to hospital discharge data from 6 countries and tested for their ability to predict in-hospital mortality. Compared with the original Charlson weights, weights generated from the Calgary, Alberta, Canada, data (2004) were 0 for 5 comorbidities, decreased for 3 comorbidities, increased for 4 comorbidities, and did not change for 5 comorbidities. The C statistics for discriminating in-hospital mortality between the new score generated from the 12 comorbidities and the Charlson score were 0.825 (new) and 0.808 (old), respectively, in Australian data (2008), 0.828 and 0.825 in Canadian data (2008), 0.878 and 0.882 in French data (2004), 0.727 and 0.723 in Japanese data (2008), 0.831 and 0.836 in New Zealand data (2008), and 0.869 and 0.876 in Swiss data (2008). The updated index of 12 comorbidities showed good-to-excellent discrimination in predicting in-hospital mortality in data from 6 countries and may be more appropriate for use with more recent administrative data.

BIO-INSPIRED COMPUTATIONAL TECHNIQUES APPLIED TO THE CLUSTERING AND VISUALIZATION OF SPATIO-TEMPORAL GEOSPATIAL DATA

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The coverage and volume of geo-referenced datasets are extensive and incessantly¦growing. The systematic capture of geo-referenced information generates large volumes¦of spatio-temporal data to be analyzed. Clustering and visualization play a key¦role in the exploratory data analysis and the extraction of knowledge embedded in¦these data. However, new challenges in visualization and clustering are posed when¦dealing with the special characteristics of this data. For instance, its complex structures,¦large quantity of samples, variables involved in a temporal context, high dimensionality¦and large variability in cluster shapes.¦The central aim of my thesis is to propose new algorithms and methodologies for¦clustering and visualization, in order to assist the knowledge extraction from spatiotemporal¦geo-referenced data, thus improving making decision processes.¦I present two original algorithms, one for clustering: the Fuzzy Growing Hierarchical¦Self-Organizing Networks (FGHSON), and the second for exploratory visual data analysis:¦the Tree-structured Self-organizing Maps Component Planes. In addition, I present¦methodologies that combined with FGHSON and the Tree-structured SOM Component¦Planes allow the integration of space and time seamlessly and simultaneously in¦order to extract knowledge embedded in a temporal context.¦The originality of the FGHSON lies in its capability to reflect the underlying structure¦of a dataset in a hierarchical fuzzy way. A hierarchical fuzzy representation of¦clusters is crucial when data include complex structures with large variability of cluster¦shapes, variances, densities and number of clusters. The most important characteristics¦of the FGHSON include: (1) It does not require an a-priori setup of the number¦of clusters. (2) The algorithm executes several self-organizing processes in parallel.¦Hence, when dealing with large datasets the processes can be distributed reducing the¦computational cost. (3) Only three parameters are necessary to set up the algorithm.¦In the case of the Tree-structured SOM Component Planes, the novelty of this algorithm¦lies in its ability to create a structure that allows the visual exploratory data analysis¦of large high-dimensional datasets. This algorithm creates a hierarchical structure¦of Self-Organizing Map Component Planes, arranging similar variables' projections in¦the same branches of the tree. Hence, similarities on variables' behavior can be easily¦detected (e.g. local correlations, maximal and minimal values and outliers).¦Both FGHSON and the Tree-structured SOM Component Planes were applied in¦several agroecological problems proving to be very efficient in the exploratory analysis¦and clustering of spatio-temporal datasets.¦In this thesis I also tested three soft competitive learning algorithms. Two of them¦well-known non supervised soft competitive algorithms, namely the Self-Organizing¦Maps (SOMs) and the Growing Hierarchical Self-Organizing Maps (GHSOMs); and the¦third was our original contribution, the FGHSON. Although the algorithms presented¦here have been used in several areas, to my knowledge there is not any work applying¦and comparing the performance of those techniques when dealing with spatiotemporal¦geospatial data, as it is presented in this thesis.¦I propose original methodologies to explore spatio-temporal geo-referenced datasets¦through time. Our approach uses time windows to capture temporal similarities and¦variations by using the FGHSON clustering algorithm. The developed methodologies¦are used in two case studies. In the first, the objective was to find similar agroecozones¦through time and in the second one it was to find similar environmental patterns¦shifted in time.¦Several results presented in this thesis have led to new contributions to agroecological¦knowledge, for instance, in sugar cane, and blackberry production.¦Finally, in the framework of this thesis we developed several software tools: (1)¦a Matlab toolbox that implements the FGHSON algorithm, and (2) a program called¦BIS (Bio-inspired Identification of Similar agroecozones) an interactive graphical user¦interface tool which integrates the FGHSON algorithm with Google Earth in order to¦show zones with similar agroecological characteristics.

Genetic association tests: a method for the joint analysis of family and case-control data.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

With the trend in molecular epidemiology towards both genome-wide association studies and complex modelling, the need for large sample sizes to detect small effects and to allow for the estimation of many parameters within a model continues to increase. Unfortunately, most methods of association analysis have been restricted to either a family-based or a case-control design, resulting in the lack of synthesis of data from multiple studies. Transmission disequilibrium-type methods for detecting linkage disequilibrium from family data were developed as an effective way of preventing the detection of association due to population stratification. Because these methods condition on parental genotype, however, they have precluded the joint analysis of family and case-control data, although methods for case-control data may not protect against population stratification and do not allow for familial correlations. We present here an extension of a family-based association analysis method for continuous traits that will simultaneously test for, and if necessary control for, population stratification. We further extend this method to analyse binary traits (and therefore family and case-control data together) and accurately to estimate genetic effects in the population, even when using an ascertained family sample. Finally, we present the power of this binary extension for both family-only and joint family and case-control data, and demonstrate the accuracy of the association parameter and variance components in an ascertained family sample.

Analysis and spatial modeling of the indoor radon Swiss data

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The present research deals with an important public health threat, which is the pollution created by radon gas accumulation inside dwellings. The spatial modeling of indoor radon in Switzerland is particularly complex and challenging because of many influencing factors that should be taken into account. Indoor radon data analysis must be addressed from both a statistical and a spatial point of view. As a multivariate process, it was important at first to define the influence of each factor. In particular, it was important to define the influence of geology as being closely associated to indoor radon. This association was indeed observed for the Swiss data but not probed to be the sole determinant for the spatial modeling. The statistical analysis of data, both at univariate and multivariate level, was followed by an exploratory spatial analysis. Many tools proposed in the literature were tested and adapted, including fractality, declustering and moving windows methods. The use of Quan-tité Morisita Index (QMI) as a procedure to evaluate data clustering in function of the radon level was proposed. The existing methods of declustering were revised and applied in an attempt to approach the global histogram parameters. The exploratory phase comes along with the definition of multiple scales of interest for indoor radon mapping in Switzerland. The analysis was done with a top-to-down resolution approach, from regional to local lev¬els in order to find the appropriate scales for modeling. In this sense, data partition was optimized in order to cope with stationary conditions of geostatistical models. Common methods of spatial modeling such as Κ Nearest Neighbors (KNN), variography and General Regression Neural Networks (GRNN) were proposed as exploratory tools. In the following section, different spatial interpolation methods were applied for a par-ticular dataset. A bottom to top method complexity approach was adopted and the results were analyzed together in order to find common definitions of continuity and neighborhood parameters. Additionally, a data filter based on cross-validation was tested with the purpose of reducing noise at local scale (the CVMF). At the end of the chapter, a series of test for data consistency and methods robustness were performed. This lead to conclude about the importance of data splitting and the limitation of generalization methods for reproducing statistical distributions. The last section was dedicated to modeling methods with probabilistic interpretations. Data transformation and simulations thus allowed the use of multigaussian models and helped take the indoor radon pollution data uncertainty into consideration. The catego-rization transform was presented as a solution for extreme values modeling through clas-sification. Simulation scenarios were proposed, including an alternative proposal for the reproduction of the global histogram based on the sampling domain. The sequential Gaussian simulation (SGS) was presented as the method giving the most complete information, while classification performed in a more robust way. An error measure was defined in relation to the decision function for data classification hardening. Within the classification methods, probabilistic neural networks (PNN) show to be better adapted for modeling of high threshold categorization and for automation. Support vector machines (SVM) on the contrary performed well under balanced category conditions. In general, it was concluded that a particular prediction or estimation method is not better under all conditions of scale and neighborhood definitions. Simulations should be the basis, while other methods can provide complementary information to accomplish an efficient indoor radon decision making.

Machine learning for geospatial data : algorithms, software tools and case studies

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Résumé Cette thèse est consacrée à l'analyse, la modélisation et la visualisation de données environnementales à référence spatiale à l'aide d'algorithmes d'apprentissage automatique (Machine Learning). L'apprentissage automatique peut être considéré au sens large comme une sous-catégorie de l'intelligence artificielle qui concerne particulièrement le développement de techniques et d'algorithmes permettant à une machine d'apprendre à partir de données. Dans cette thèse, les algorithmes d'apprentissage automatique sont adaptés pour être appliqués à des données environnementales et à la prédiction spatiale. Pourquoi l'apprentissage automatique ? Parce que la majorité des algorithmes d'apprentissage automatiques sont universels, adaptatifs, non-linéaires, robustes et efficaces pour la modélisation. Ils peuvent résoudre des problèmes de classification, de régression et de modélisation de densité de probabilités dans des espaces à haute dimension, composés de variables informatives spatialisées (« géo-features ») en plus des coordonnées géographiques. De plus, ils sont idéaux pour être implémentés en tant qu'outils d'aide à la décision pour des questions environnementales allant de la reconnaissance de pattern à la modélisation et la prédiction en passant par la cartographie automatique. Leur efficacité est comparable au modèles géostatistiques dans l'espace des coordonnées géographiques, mais ils sont indispensables pour des données à hautes dimensions incluant des géo-features. Les algorithmes d'apprentissage automatique les plus importants et les plus populaires sont présentés théoriquement et implémentés sous forme de logiciels pour les sciences environnementales. Les principaux algorithmes décrits sont le Perceptron multicouches (MultiLayer Perceptron, MLP) - l'algorithme le plus connu dans l'intelligence artificielle, le réseau de neurones de régression généralisée (General Regression Neural Networks, GRNN), le réseau de neurones probabiliste (Probabilistic Neural Networks, PNN), les cartes auto-organisées (SelfOrganized Maps, SOM), les modèles à mixture Gaussiennes (Gaussian Mixture Models, GMM), les réseaux à fonctions de base radiales (Radial Basis Functions Networks, RBF) et les réseaux à mixture de densité (Mixture Density Networks, MDN). Cette gamme d'algorithmes permet de couvrir des tâches variées telle que la classification, la régression ou l'estimation de densité de probabilité. L'analyse exploratoire des données (Exploratory Data Analysis, EDA) est le premier pas de toute analyse de données. Dans cette thèse les concepts d'analyse exploratoire de données spatiales (Exploratory Spatial Data Analysis, ESDA) sont traités selon l'approche traditionnelle de la géostatistique avec la variographie expérimentale et selon les principes de l'apprentissage automatique. La variographie expérimentale, qui étudie les relations entre pairs de points, est un outil de base pour l'analyse géostatistique de corrélations spatiales anisotropiques qui permet de détecter la présence de patterns spatiaux descriptible par une statistique. L'approche de l'apprentissage automatique pour l'ESDA est présentée à travers l'application de la méthode des k plus proches voisins qui est très simple et possède d'excellentes qualités d'interprétation et de visualisation. Une part importante de la thèse traite de sujets d'actualité comme la cartographie automatique de données spatiales. Le réseau de neurones de régression généralisée est proposé pour résoudre cette tâche efficacement. Les performances du GRNN sont démontrées par des données de Comparaison d'Interpolation Spatiale (SIC) de 2004 pour lesquelles le GRNN bat significativement toutes les autres méthodes, particulièrement lors de situations d'urgence. La thèse est composée de quatre chapitres : théorie, applications, outils logiciels et des exemples guidés. Une partie importante du travail consiste en une collection de logiciels : Machine Learning Office. Cette collection de logiciels a été développée durant les 15 dernières années et a été utilisée pour l'enseignement de nombreux cours, dont des workshops internationaux en Chine, France, Italie, Irlande et Suisse ainsi que dans des projets de recherche fondamentaux et appliqués. Les cas d'études considérés couvrent un vaste spectre de problèmes géoenvironnementaux réels à basse et haute dimensionnalité, tels que la pollution de l'air, du sol et de l'eau par des produits radioactifs et des métaux lourds, la classification de types de sols et d'unités hydrogéologiques, la cartographie des incertitudes pour l'aide à la décision et l'estimation de risques naturels (glissements de terrain, avalanches). Des outils complémentaires pour l'analyse exploratoire des données et la visualisation ont également été développés en prenant soin de créer une interface conviviale et facile à l'utilisation. Machine Learning for geospatial data: algorithms, software tools and case studies Abstract The thesis is devoted to the analysis, modeling and visualisation of spatial environmental data using machine learning algorithms. In a broad sense machine learning can be considered as a subfield of artificial intelligence. It mainly concerns with the development of techniques and algorithms that allow computers to learn from data. In this thesis machine learning algorithms are adapted to learn from spatial environmental data and to make spatial predictions. Why machine learning? In few words most of machine learning algorithms are universal, adaptive, nonlinear, robust and efficient modeling tools. They can find solutions for the classification, regression, and probability density modeling problems in high-dimensional geo-feature spaces, composed of geographical space and additional relevant spatially referenced features. They are well-suited to be implemented as predictive engines in decision support systems, for the purposes of environmental data mining including pattern recognition, modeling and predictions as well as automatic data mapping. They have competitive efficiency to the geostatistical models in low dimensional geographical spaces but are indispensable in high-dimensional geo-feature spaces. The most important and popular machine learning algorithms and models interesting for geo- and environmental sciences are presented in details: from theoretical description of the concepts to the software implementation. The main algorithms and models considered are the following: multi-layer perceptron (a workhorse of machine learning), general regression neural networks, probabilistic neural networks, self-organising (Kohonen) maps, Gaussian mixture models, radial basis functions networks, mixture density networks. This set of models covers machine learning tasks such as classification, regression, and density estimation. Exploratory data analysis (EDA) is initial and very important part of data analysis. In this thesis the concepts of exploratory spatial data analysis (ESDA) is considered using both traditional geostatistical approach such as_experimental variography and machine learning. Experimental variography is a basic tool for geostatistical analysis of anisotropic spatial correlations which helps to understand the presence of spatial patterns, at least described by two-point statistics. A machine learning approach for ESDA is presented by applying the k-nearest neighbors (k-NN) method which is simple and has very good interpretation and visualization properties. Important part of the thesis deals with a hot topic of nowadays, namely, an automatic mapping of geospatial data. General regression neural networks (GRNN) is proposed as efficient model to solve this task. Performance of the GRNN model is demonstrated on Spatial Interpolation Comparison (SIC) 2004 data where GRNN model significantly outperformed all other approaches, especially in case of emergency conditions. The thesis consists of four chapters and has the following structure: theory, applications, software tools, and how-to-do-it examples. An important part of the work is a collection of software tools - Machine Learning Office. Machine Learning Office tools were developed during last 15 years and was used both for many teaching courses, including international workshops in China, France, Italy, Ireland, Switzerland and for realizing fundamental and applied research projects. Case studies considered cover wide spectrum of the real-life low and high-dimensional geo- and environmental problems, such as air, soil and water pollution by radionuclides and heavy metals, soil types and hydro-geological units classification, decision-oriented mapping with uncertainties, natural hazards (landslides, avalanches) assessments and susceptibility mapping. Complementary tools useful for the exploratory data analysis and visualisation were developed as well. The software is user friendly and easy to use.

«
1
2
3
4
5
6
»