975 results for probabilistic topic models
Abstract:
SUMMARY Species distribution models (SDMs) are now an essential tool in ecology and conservation biology. By combining observations of species occurrence or abundance with information on the environmental characteristics of the observation sites, they can provide information on the ecology of species, predict their distributions across the landscape or extrapolate them to other spatial or temporal frames. The advent of SDMs, supported by geographic information systems (GIS), new developments in statistical models and constantly increasing computational capacities, has revolutionized the way ecologists can comprehend species distributions in their environment. SDMs provide a tool for describing the realized niches of species across a multivariate environmental space and for predicting their spatial distribution. Predictions, in the form of probabilistic maps showing the potential distribution of the species, are an irreplaceable means of informing every single unit of a territory about its biodiversity potential. SDMs and the corresponding spatial predictions can be used to plan conservation actions for particular species, to design field surveys, to assess the risks related to the spread of invasive species, to select reserve locations and design reserve networks, and ultimately, to forecast distributional changes according to scenarios of climate and/or land use change. By assessing the effect of several factors on model performance and on the accuracy of spatial predictions, this thesis aims at improving techniques and data available for distribution modelling and at providing the best possible information to conservation managers to support their decisions and action plans for the conservation of biodiversity in Switzerland and beyond. 
Several monitoring programs have been put in place from the national to the global scale, and different sources of data now exist and are becoming available to researchers who want to model species distributions. However, because of a lack of means, data are often not gathered at an appropriate resolution, are sampled only over limited areas, are not spatially explicit or do not provide sound biological information. A typical example of this is data on 'habitat' (sensu biota). Even though this is essential information for effective conservation planning, it often has to be approximated from land use, the closest available information. Moreover, data are often not sampled according to an established sampling design, which can lead to biased samples and consequently to spurious modelling results. Understanding the sources of variability linked to the different phases of the modelling process and their importance is crucial in order to evaluate the final distribution maps that are to be used for conservation purposes. The research presented in this thesis was essentially conducted within the framework of the Landspot Project, a project supported by the Swiss National Science Foundation. The main goal of the project was to assess the possible contribution of pre-modelled 'habitat' units to modelling the distribution of animal species, in particular butterfly species, across Switzerland. While pursuing this goal, different aspects of data quality, sampling design and the modelling process were addressed and improved, and implications for conservation discussed. The main 'habitat' units considered in this thesis are grassland and forest communities of natural and anthropogenic origin as defined in the typology of habitats for Switzerland. These communities are mainly defined at the phytosociological level of the alliance. For the time being, no comprehensive map of such communities is available at the national scale and at fine resolution. 
As a first step, it was therefore necessary to create distribution models and maps for these communities across Switzerland and thus to collect the necessary data. In order to reach this first objective, several new developments were necessary, such as the definition of expert models, the classification of the Swiss territory into environmental domains, the design of an environmentally stratified sampling of the target vegetation units across Switzerland, the development of a database integrating a decision-support system assisting in the classification of the relevés, and the downscaling of the land use/cover data from 100 m to 25 m resolution. The main contributions of this thesis to the discipline of species distribution modelling (SDM) are assembled in four main scientific papers. In the first, published in the Journal of Biogeography, different issues related to the modelling process itself are investigated. First, the effect of five different stepwise selection methods on model performance, stability and parsimony is assessed, using data from the forest inventory of the State of Vaud. The same paper also assesses: the effect of weighting absences to ensure a prevalence of 0.5 prior to model calibration; the effect of limiting absences beyond the environmental envelope defined by presences; four different methods for incorporating spatial autocorrelation; and finally, the effect of integrating predictor interactions. The results made it possible to enhance the GRASP tool (Generalized Regression Analysis and Spatial Predictions), which now incorporates new selection methods and the possibility of dealing with interactions among predictors as well as spatial autocorrelation. The contribution of different sources of remotely sensed information to species distribution models was also assessed. 
The second paper (to be submitted) explores the combined effects of sample size and data post-stratification on the accuracy of models, using data on grassland distribution across Switzerland collected within the framework of the Landspot project and supplemented with other important vegetation databases. For the stratification of the data, different spatial frameworks were compared. In particular, environmental stratification by Swiss Environmental Domains was compared to geographical stratification either by biogeographic regions or by political states (cantons). The third paper (to be submitted) assesses the contribution of pre-modelled vegetation communities to the modelling of fauna. It is a two-step approach that combines the disciplines of community ecology and spatial ecology and integrates their corresponding concepts of habitat. Vegetation communities are modelled first, and these 'habitat' units are then used to model animal species habitat. A case study is presented with grassland communities and butterfly species. Different ways of integrating vegetation information in the models of butterfly distribution were also evaluated. Finally, a glimpse of climate change is given in the fourth paper, recently published in Ecological Modelling. This paper proposes a conceptual framework for analysing range shifts, namely a catalogue of the possible patterns of change in the distribution of a species along elevational or other environmental gradients and an improved quantitative methodology to identify and objectively describe these patterns. The methodology was developed using data from the Swiss national common breeding bird survey, and the article presents results concerning the observed shifts in the elevational distribution of breeding birds in Switzerland. The overall objective of this thesis is to improve species distribution models as potential inputs for different conservation tools (e.g. 
red lists, ecological networks, risk assessment of the spread of invasive species, vulnerability assessment in the context of climate change). While no conservation issues or tools are directly tested in this thesis, the importance of the proposed improvements made in species distribution modelling is discussed in the context of the selection of reserve networks. RESUME Species distribution models (SDMs) are today an essential tool in the research fields of ecology and conservation biology. By combining observations of species presence or abundance with information on the environmental characteristics of the observation sites, these models can provide information on the ecology of species, predict their distribution across the landscape or extrapolate it in space and time. The deployment of SDMs, supported by geographic information systems (GIS), new developments in statistical models and the constant increase in computing capacity, has revolutionized the way ecologists can understand the distribution of species in their environment. SDMs have provided the tool that makes it possible to describe the realized niche of species in a multivariate environmental space and to predict their spatial distribution. Predictions, in the form of probabilistic maps showing the potential distribution of the species, are an irreplaceable means of informing each unit of the territory about its potential biodiversity. 
SDMs and the corresponding spatial predictions can be used to plan conservation measures for particular species, to design sampling plans, to assess the risks related to the spread of invasive species, to choose the location of reserves and link them into networks, and finally, to forecast changes in distribution under scenarios of climate and/or land use change. By assessing the effect of several factors on model performance and on the accuracy of spatial predictions, this thesis aims to improve the techniques and data available for species distribution modelling and to provide the best possible information to managers to support their decisions and action plans for the conservation of biodiversity in Switzerland and beyond. Several monitoring programs have been set up from the national to the global scale, and different sources of data are now available to researchers who want to model species distributions. However, owing to a lack of means, data are often collected at an inappropriate resolution, are sampled over limited areas, are not spatially explicit or do not provide sufficient ecological information. A typical example is data on 'habitat' (sensu biota). Even though this information is essential for effective conservation measures, it is often approximated by land use, the closest available information. Moreover, data are often not sampled according to an established sampling design, which biases the samples and consequently the modelling results. 
Understanding the sources of variability linked to the different phases of the modelling process is crucial for evaluating the use of predicted distribution maps for conservation purposes. The research presented in this thesis was essentially conducted within the framework of the Landspot project, a project supported by the Swiss National Science Foundation. The main goal of this project was to assess the contribution of pre-modelled 'habitat' units to modelling the distribution of animal species, in particular butterflies, across Switzerland. While pursuing this goal, different aspects of data quality, sampling design and the modelling process are addressed and improved, and their implications for species conservation discussed. The main 'habitats' considered in this thesis are grassland and forest communities of natural and anthropogenic origin as defined in the typology of habitats of Switzerland. These communities are mainly defined at the phytosociological level of the alliance. For the time being, no map of the distribution of these communities is available at the national scale and at fine resolution. As a first step, it was therefore necessary to create distribution models of these communities across Switzerland and consequently to collect the necessary data. To reach this first objective, several new developments were necessary, such as the definition of expert models, the classification of the Swiss territory into environmental domains, the design of an environmentally stratified sampling of the target vegetation units throughout Switzerland, the creation of a database integrating a decision-support system for the classification of relevés, and the downscaling of land cover data from 100 m to 25 m resolution. 
The main contributions of this thesis to the discipline of species distribution modelling (SDM) are gathered in four scientific papers. In the first paper, published in the Journal of Biogeography, different issues related to the modelling process are studied using data from the forest inventory of the State of Vaud. First, the effects of five stepwise selection methods on model performance, stability and parsimony are assessed. The same paper also assesses: the effect of weighting absences to ensure a prevalence of 0.5 during model calibration; the effect of limiting absences beyond the envelope defined by presences; four different methods for integrating spatial autocorrelation; and finally, the effect of integrating interactions between factors. The results presented in this paper made it possible to improve the GRASP tool, which now integrates new selection methods and the possibility of handling interactions between explanatory variables as well as spatial autocorrelation. The contribution of different sources of remotely sensed data was also assessed. The second paper (to be submitted) explores the combined effects of sample size and post-stratification on model accuracy. The data used here concern the distribution of grasslands in Switzerland, collected within the framework of the Landspot project and supplemented by other sources. For the stratification of the data, different spatial frameworks were compared. In particular, environmental stratification by the environmental domains of Switzerland was compared with geographical stratification by biogeographic regions or by cantons. The third paper (to be submitted) assesses the contribution of pre-modelled plant communities to the modelling of fauna. 
It is a two-step approach that combines the disciplines of community ecology and spatial ecology by integrating their respective concepts of 'habitat'. Plant communities are modelled first, and these 'habitat' units are then used to model animal species. A case study is presented with grassland communities and butterfly species. Different ways of integrating vegetation information into the models of butterfly distribution are evaluated. Finally, the last paper, published in Ecological Modelling, takes a brief look at climate change. This paper proposes a conceptual framework for analysing changes in species distributions, which notably includes a catalogue of the different possible forms of change along an elevational or other environmental gradient, and an improved quantitative method to identify and describe these shifts. This methodology was developed using data from the monitoring of common breeding birds, and the paper presents results concerning the observed shifts in the elevational distribution of breeding birds in Switzerland. The overall objective of this thesis is to improve species distribution models as a possible source of information for different conservation tools (for example, red lists, ecological networks, risk assessment of the spread of invasive species, vulnerability assessment of species in the context of climate change). Although these conservation issues are not directly tested in this thesis, the importance of the proposed improvements to species distribution modelling is discussed at the end of this work in the context of the selection of reserve networks.
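The first paper of the thesis above weights absences so that the calibration data have a prevalence of 0.5. The abstract does not say how the weights were computed, so the following is only a minimal sketch of one common way to do it: presences keep a weight of 1 and every absence receives the ratio of presences to absences. The function names `prevalence_weights` and `weighted_prevalence` are hypothetical and are not part of GRASP.

```python
def prevalence_weights(labels):
    """Return one weight per observation so that the weighted prevalence is 0.5.

    labels: sequence of 1 (presence) and 0 (absence).
    Assumption: presences keep weight 1.0 and each absence gets
    n_presences / n_absences, so both classes carry the same total weight.
    """
    n_pres = sum(1 for y in labels if y == 1)
    n_abs = sum(1 for y in labels if y == 0)
    if n_pres == 0 or n_abs == 0:
        raise ValueError("need at least one presence and one absence")
    w_abs = n_pres / n_abs
    return [1.0 if y == 1 else w_abs for y in labels]


def weighted_prevalence(labels, weights):
    """Share of the total weight that is carried by presences."""
    total = sum(weights)
    return sum(w for y, w in zip(labels, weights) if y == 1) / total


# Toy data: 2 presences and 6 absences (raw prevalence 0.25).
labels = [1, 1, 0, 0, 0, 0, 0, 0]
weights = prevalence_weights(labels)
```

Weights of this kind would typically be passed to the weight argument of a regression routine; whether the thesis used exactly this scheme is not stated in the abstract.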
Abstract:
Leishmaniasis causes significant morbidity and mortality, constituting an important global health problem for which there are few effective drugs. Given the urgent need to identify a safe and effective Leishmania vaccine to help prevent the two million new cases of human leishmaniasis worldwide each year, all reasonable efforts to achieve this goal should be made. This includes the use of animal models that are as close to leishmanial infection in humans as is practical and feasible. Old World monkey species (macaques, baboons, mandrills, etc.) have the closest evolutionary relatedness to humans among the approachable animal models. The Asian rhesus macaques (Macaca mulatta) are quite susceptible to leishmanial infection, develop a human-like disease, exhibit antibodies to Leishmania and parasite-specific T-cell mediated immune responses both in vivo and in vitro, and can be protected effectively by vaccination. Results from macaque vaccine studies could also prove useful in guiding the design of human vaccine trials. This review summarizes our current knowledge on this topic and proposes potential approaches that may result in the more effective use of the macaque model to maximize its potential to help the development of an effective vaccine for human leishmaniasis.
Abstract:
Background: The 'database search problem', that is, the strengthening of a case - in terms of probative value - against an individual who is found as a result of a database search, has been approached during the last two decades with substantial mathematical analyses, accompanied by lively debate and sharply opposing conclusions. This represents a challenging obstacle in teaching but also hinders a balanced and coherent discussion of the topic within the wider scientific and legal community. This paper revisits and tracks the associated mathematical analyses in terms of Bayesian networks. Their derivation and discussion for capturing probabilistic arguments that explain the database search problem are outlined in detail. The resulting Bayesian networks offer a distinct view on the main debated issues, along with further clarity. Methods: As a general framework for representing and analyzing formal arguments in probabilistic reasoning about uncertain target propositions (that is, whether or not a given individual is the source of a crime stain), this paper relies on graphical probability models, in particular, Bayesian networks. This graphical probability modeling approach is used to capture, within a single model, a series of key variables, such as the number of individuals in a database, the size of the population of potential crime stain sources, and the rarity of the corresponding analytical characteristics in a relevant population. Results: This paper demonstrates the feasibility of deriving Bayesian network structures for analyzing, representing, and tracking the database search problem. The output of the proposed models can be shown to agree with existing but exclusively formulaic approaches. Conclusions: The proposed Bayesian networks allow one to capture and analyze the currently most well-supported but reputedly counter-intuitive and difficult solution to the database search problem in a way that goes beyond the traditional, purely formulaic expressions. 
The method's graphical environment, along with its computational and probabilistic architectures, represents a rich package that offers analysts and discussants additional modes of interaction, concise representation, and coherent communication.
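The key variables named in the abstract above (database size, size of the population of potential sources, rarity of the profile) can be combined in a few lines of direct Bayesian enumeration. The sketch below is not the paper's Bayesian network. It is a simplified illustration under stated assumptions: a uniform prior over N potential sources, a database of n of them, exactly one database member matching and the other n - 1 excluded, independent matches with probability gamma for non-sources, and no typing errors. The function name `posterior_suspect_is_source` is hypothetical.

```python
from fractions import Fraction


def posterior_suspect_is_source(N, n, gamma):
    """Posterior probability that the single matching database member is the source.

    N     : size of the population of potential sources (uniform prior 1/N each)
    n     : number of individuals in the database (all belong to the population)
    gamma : probability that a non-source matches the crime stain, 0 < gamma < 1
    Evidence: the suspect matches; the other n - 1 database members do not.
    """
    gamma = Fraction(gamma)
    if not (0 < gamma < 1) or not (1 <= n <= N):
        raise ValueError("require 0 < gamma < 1 and 1 <= n <= N")
    prior = Fraction(1, N)
    others_excluded = (1 - gamma) ** (n - 1)
    # Suspect is the source: the suspect matches with certainty.
    like_suspect = others_excluded
    # Source is one of the N - n people outside the database:
    # the suspect matches only by chance.
    like_outside = gamma * others_excluded
    # Source is another database member: impossible, they were excluded.
    numerator = prior * like_suspect
    denominator = numerator + (N - n) * prior * like_outside
    return numerator / denominator


small_db = posterior_suspect_is_source(1000, 100, Fraction(1, 100))
large_db = posterior_suspect_is_source(1000, 500, Fraction(1, 100))
```

Under these assumptions the enumeration reduces to 1 / (1 + (N - n) * gamma), so the posterior grows as the searched database grows: excluding more people strengthens the case against the one who matches.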
Abstract:
Chagas disease, a neglected illness, affects an estimated 12-14 million people in endemic areas of Latin America. Although the occurrence of acute cases has declined sharply due to Southern Cone Initiative efforts to control vector transmission, serious challenges remain, including the maintenance of sustainable public policies for Chagas disease control and the urgent need for better drugs to treat chagasic patients. Since the introduction of benznidazole and nifurtimox approximately 40 years ago, many natural and synthetic compounds have been assayed against Trypanosoma cruzi, yet only a few compounds have advanced to clinical trials. This reflects, at least in part, the lack of consensus regarding appropriate in vitro and in vivo screening protocols as well as the lack of biomarkers for treating parasitaemia. The development of more effective drugs requires (i) the identification and validation of parasite targets, (ii) compounds to be screened against the targets or the whole parasite and (iii) a panel of minimum standardised procedures to advance leading compounds to clinical trials. This third aim was the topic of the workshop entitled Experimental Models in Drug Screening and Development for Chagas Disease, held in Rio de Janeiro, Brazil, on the 25th and 26th of November 2008 by the Fiocruz Program for Research and Technological Development on Chagas Disease and Drugs for Neglected Diseases Initiative. During the meeting, the minimum steps, requirements and decision gates for the determination of the efficacy of novel drugs for T. cruzi control were evaluated by interdisciplinary experts and an in vitro and in vivo flowchart was designed to serve as a general and standardised protocol for screening potential drugs for the treatment of Chagas disease.
Abstract:
In the forensic examination of DNA mixtures, the question of how to set the total number of contributors (N) is a topic of ongoing interest. Part of the discussion gravitates around issues of bias, in particular when assessments of the number of contributors are not made prior to considering the genotypic configuration of potential donors. Further complications may stem from the observation that, in some cases, there may be numbers of contributors that are incompatible with the set of alleles seen in the profile of a mixed crime stain, given the genotype of a potential contributor. In such situations, procedures that take a single and fixed number of contributors as their output can lead to inferential impasses. Assessing the number of contributors within a probabilistic framework can help avoid such complications. Using elements of decision theory, this paper analyses two strategies for inference on the number of contributors. One procedure is deterministic and focuses on the minimum number of contributors required to 'explain' an observed set of alleles. The other procedure is probabilistic, using Bayes' theorem, and provides a probability distribution for a set of numbers of contributors, based on the set of observed alleles as well as their respective rates of occurrence. The discussion concentrates on mixed stains of varying quality (i.e., different numbers of loci for which genotyping information is available). A so-called qualitative interpretation is pursued since quantitative information such as peak area and height data is not taken into account. The competing procedures are compared using a standard scoring rule that penalizes the degree of divergence between a given agreed value for N, that is the number of contributors, and the actual value taken by N. 
Using only modest assumptions and a discussion with reference to a casework example, this paper reports on analyses using simulation techniques and graphical models (i.e., Bayesian networks) to point out that setting the number of contributors to a mixed crime stain in probabilistic terms is, for the conditions assumed in this study, preferable to a decision policy that uses categoric assumptions about N.
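The two strategies contrasted in the abstract above can be illustrated with a short sketch. It is not the paper's Bayesian network or scoring rule. The deterministic rule takes the smallest N that can 'explain' the largest allele count at any locus. The probabilistic rule applies Bayes' theorem with a likelihood obtained by inclusion-exclusion. The assumptions are ours: allele copies are drawn independently from the population frequencies, loci are independent, there is no drop-out or drop-in, and peak heights are ignored. All function names are hypothetical.

```python
from itertools import combinations
from math import ceil


def min_contributors(allele_counts):
    """Deterministic rule: each contributor carries at most two alleles per locus."""
    return max(1, max(ceil(k / 2) for k in allele_counts))


def prob_exact_allele_set(observed_freqs, n_contrib):
    """P(the 2N allele copies of N contributors show exactly the observed alleles).

    observed_freqs: population frequencies of the alleles seen at one locus.
    Inclusion-exclusion over subsets of the observed alleles.
    """
    k = len(observed_freqs)
    total = 0.0
    for size in range(k + 1):
        sign = (-1) ** (k - size)
        for subset in combinations(observed_freqs, size):
            total += sign * sum(subset) ** (2 * n_contrib)
    return max(total, 0.0)  # clamp tiny negative rounding errors


def posterior_contributors(loci, prior):
    """Bayes' theorem over candidate numbers of contributors.

    loci : list of per-locus lists of observed allele frequencies
    prior: dict mapping candidate N to prior probability
    """
    joint = {}
    for n_contrib, p in prior.items():
        likelihood = 1.0
        for observed_freqs in loci:
            likelihood *= prob_exact_allele_set(observed_freqs, n_contrib)
        joint[n_contrib] = p * likelihood
    norm = sum(joint.values())
    return {n_contrib: v / norm for n_contrib, v in joint.items()}


# Illustrative (made-up) frequencies: one locus showing three alleles.
loci = [[0.1, 0.2, 0.3]]
prior = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}
posterior = posterior_contributors(loci, prior)
```

With three alleles at a locus, a single contributor is impossible, so the posterior mass for N = 1 vanishes and is spread over N = 2, 3 and 4 according to the allele frequencies, whereas the deterministic rule would simply report N = 2.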
Abstract:
Gene expression data from microarrays are being applied to predict preclinical and clinical endpoints, but the reliability of these predictions has not been established. In the MAQC-II project, 36 independent teams analyzed six microarray data sets to generate predictive models for classifying a sample with respect to one of 13 endpoints indicative of lung or liver toxicity in rodents, or of breast cancer, multiple myeloma or neuroblastoma in humans. In total, >30,000 models were built using many combinations of analytical methods. The teams generated predictive models without knowing the biological meaning of some of the endpoints and, to mimic clinical reality, tested the models on data that had not been used for training. We found that model performance depended largely on the endpoint and team proficiency and that different approaches generated models of similar performance. The conclusions and recommendations from MAQC-II should be useful for regulatory agencies, study committees and independent investigators that evaluate methods for global gene expression analysis.
Abstract:
Abstract "Sitting between your past and your future doesn't mean you are in the present." (Dakota Skye) Complex systems science is an interdisciplinary field grouping under the same umbrella dynamical phenomena from the social, natural and mathematical sciences. The emergence of a higher-order organization or behavior, transcending that expected from the linear addition of the parts, is a key factor shared by all these systems. Most complex systems can be modeled as networks that represent the interactions amongst the system's components. In addition to the actual nature of the parts' interactions, the intrinsic topological structure of the underlying network is believed to play a crucial role in the remarkable emergent behaviors exhibited by the systems. Moreover, the topology is also a key factor in explaining the extraordinary flexibility and resilience to perturbations when applied to transmission and diffusion phenomena. In this work, we study the effect of different network structures on the performance and on the fault tolerance of systems in two different contexts. In the first part, we study cellular automata, which are a simple paradigm for distributed computation. Cellular automata are made of basic Boolean computational units, the cells, which rely on simple rules and information from the surrounding cells to perform a global task. The limited visibility of the cells can be modeled as a network, where interactions amongst cells are governed by an underlying structure, usually a regular one. In order to increase the performance of cellular automata, we chose to change their topology. We applied computational principles inspired by Darwinian evolution, called evolutionary algorithms, to alter the system's topological structure starting from either a regular or a random one. 
The outcome is remarkable, as the resulting topologies share properties of both regular and random networks, and display similarities with the Watts-Strogatz small-world networks found in social systems. Moreover, the performance and tolerance to probabilistic faults of our small-world-like cellular automata surpass those of regular ones. In the second part, we use the context of biological genetic regulatory networks and, in particular, Kauffman's random Boolean networks model. In some ways, this model is close to cellular automata, although it is not expected to perform any task. Instead, it simulates the time-evolution of genetic regulation within living organisms under strict conditions. The original model, though very attractive in its simplicity, suffered from important shortcomings unveiled by recent advances in genetics and biology. We propose to use these new discoveries to improve the original model. Firstly, we have used artificial topologies believed to be closer to that of gene regulatory networks. We have also studied actual biological organisms, and used parts of their genetic regulatory networks in our models. Secondly, we have addressed the improbable full synchronicity of the events taking place on Boolean networks and proposed a more biologically plausible cascading scheme. Finally, we tackled the actual Boolean functions of the model, i.e. the specifics of how genes activate according to the activity of upstream genes, and presented a new update function that takes into account the actual promoting and repressing effects of one gene on another. Our improved models demonstrate the expected, biologically sound behavior of previous GRN models, yet with superior resistance to perturbations. We believe they are one step closer to the biological reality.
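For readers unfamiliar with Kauffman's original model, which the thesis above takes as its starting point, the following is a minimal sketch of a classical random Boolean network: every node reads K randomly chosen nodes, applies a random Boolean function, and all nodes update synchronously. It does not implement the thesis's improved topologies, cascading update scheme or new update function. The function names and parameter values are illustrative assumptions.

```python
import random


def make_rbn(n_nodes, k_inputs, rng):
    """Wire each node to k_inputs random nodes and give it a random truth table."""
    inputs = [rng.sample(range(n_nodes), k_inputs) for _ in range(n_nodes)]
    tables = [[rng.randint(0, 1) for _ in range(2 ** k_inputs)]
              for _ in range(n_nodes)]
    return inputs, tables


def step(state, inputs, tables):
    """Synchronous update: every node reads the previous state of its inputs."""
    new_state = []
    for node_inputs, table in zip(inputs, tables):
        index = 0
        for src in node_inputs:
            index = (index << 1) | state[src]  # build the truth-table row index
        new_state.append(table[index])
    return tuple(new_state)


def find_attractor(state, inputs, tables):
    """Iterate until a state repeats; return (transient length, cycle length).

    Terminates because the state space is finite and the dynamics deterministic.
    """
    seen = {}
    t = 0
    while state not in seen:
        seen[state] = t
        state = step(state, inputs, tables)
        t += 1
    transient = seen[state]
    return transient, t - transient


rng = random.Random(0)
inputs, tables = make_rbn(n_nodes=10, k_inputs=2, rng=rng)
initial = tuple(rng.randint(0, 1) for _ in range(10))
transient, period = find_attractor(initial, inputs, tables)
```

The attractor cycles found this way are usually interpreted as stable expression patterns. Replacing `step` with an asynchronous or cascading update is where a model of the kind described above would depart from this baseline.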
Abstract:
This paper analyses and discusses arguments that emerge from a recent discussion about the proper assessment of the evidential value of correspondences observed between the characteristics of a crime stain and those of a sample from a suspect when (i) this latter individual is found as a result of a database search and (ii) remaining database members are excluded as potential sources (because of different analytical characteristics). Using a graphical probability approach (i.e., Bayesian networks), the paper here intends to clarify that there is no need to (i) introduce a correction factor equal to the size of the searched database (i.e., to reduce a likelihood ratio), nor to (ii) adopt a propositional level not directly related to the suspect matching the crime stain (i.e., a proposition of the kind 'some person in (outside) the database is the source of the crime stain' rather than 'the suspect (some other person) is the source of the crime stain'). The present research thus confirms existing literature on the topic that has repeatedly demonstrated that the latter two requirements (i) and (ii) should not be a cause of concern.
Abstract:
Models are presented for the optimal location of hubs in airline networks that take congestion effects into consideration. Hubs, which are the most congested airports, are modeled as M/D/c queuing systems, that is, Poisson arrivals, deterministic service time, and c servers. A formula is derived for the probability of a given number of customers in the system, which is later used to propose a probabilistic constraint. This constraint limits the probability of b airplanes in queue to be less than a value α. Due to the computational complexity of the formulation, the model is solved using a metaheuristic based on tabu search. Computational experience is presented.
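The abstract above does not reproduce the derived formula, so it is not restated here. As a hedged alternative, the sketch below estimates the quantity that the probabilistic constraint bounds, the probability of at least b airplanes waiting in an M/D/c queue, by simulation. It assumes first-come-first-served service and a stable queue (arrival rate times service time below c). Because arrivals are Poisson, the fraction of arrivals that find at least b customers waiting estimates the time-stationary probability (the PASTA property). The function name and all parameter values are illustrative.

```python
import heapq
import random
from collections import deque


def simulate_mdc_queue_tail(lam, service_time, c, b, n_customers=20_000, seed=0):
    """Estimate P(at least b customers waiting) in an M/D/c FIFO queue.

    lam          : Poisson arrival rate
    service_time : deterministic service duration
    c            : number of servers (e.g. runways at a hub)
    b            : queue-length threshold used in the constraint
    """
    rng = random.Random(seed)
    free_at = [0.0] * c          # min-heap of times at which servers become free
    heapq.heapify(free_at)
    pending_starts = deque()     # service start times of earlier customers, sorted
    t = 0.0
    hits = 0
    for _ in range(n_customers):
        t += rng.expovariate(lam)
        # Earlier customers whose service has started are no longer waiting.
        while pending_starts and pending_starts[0] <= t:
            pending_starts.popleft()
        if len(pending_starts) >= b:
            hits += 1
        # FIFO: the arriving customer takes the server that frees up first.
        start = max(t, heapq.heappop(free_at))
        heapq.heappush(free_at, start + service_time)
        pending_starts.append(start)
    return hits / n_customers


alpha = 0.05
p_queue = simulate_mdc_queue_tail(lam=2.5, service_time=1.0, c=3, b=3)
constraint_satisfied = p_queue < alpha
```

In a location model of this kind, such a check would be evaluated for each candidate hub given the traffic routed through it. A closed-form expression, as derived in the paper, avoids the simulation noise and cost that this sketch incurs.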
Abstract:
Uncertainty quantification of petroleum reservoir models is a current challenge, which is usually approached with a wide range of geostatistical tools linked with statistical optimisation and/or inference algorithms. The paper considers a data-driven approach to modelling uncertainty in spatial predictions. The proposed semi-supervised Support Vector Regression (SVR) model has demonstrated its capability to represent realistic features and describe stochastic variability and non-uniqueness of spatial properties. It is able to capture and preserve key spatial dependencies such as connectivity, which is often difficult to achieve with two-point geostatistical models. Semi-supervised SVR is designed to integrate various kinds of conditioning data and learn dependencies from them. A stochastic semi-supervised SVR model is integrated into a Bayesian framework to quantify uncertainty with multiple models fitted to dynamic observations. The developed approach is illustrated with a reservoir case study. The resulting probabilistic production forecasts are described by uncertainty envelopes.
Resumo:
The efficient use of geothermal systems, the sequestration of CO2 to limit climate change, and the prevention of saltwater intrusion into coastal aquifers are only a few examples that demonstrate our need for new technologies to monitor the evolution of subsurface processes from the surface. A major challenge is to ensure the characterization and optimization of the performance of these technologies at different spatial and temporal scales. Plane-wave electromagnetic (EM) methods are sensitive to the electrical conductivity of the subsurface and, consequently, to the electrical conductivity of the fluids saturating the rock, to the presence of connected fractures, to temperature, and to geological materials. These methods are governed by equations that are valid over wide frequency ranges, which makes it possible to study in analogous ways processes ranging from a few meters below the surface to several kilometers in depth. Nevertheless, these methods suffer a loss of resolution with depth because of the diffusive properties of the electromagnetic field. For this reason, the estimation of subsurface models with these methods must take a priori information into account in order to constrain the models as much as possible and to allow the uncertainties of these models to be quantified appropriately. In this thesis, I develop approaches for the static and dynamic characterization of the subsurface using plane EM waves. In the first part, I present a deterministic approach for performing time-lapse inversions of plane-wave EM data in two dimensions. This strategy is based on incorporating into the algorithm a priori information on the expected changes in the electrical conductivity model. 
This is achieved by integrating a stochastic regularization and flexible constraints on the range of expected changes using Lagrange multipliers. I use norms other than the l2 norm to constrain the model structure and obtain abrupt transitions between the regions of the model that change over time and those that do not. I also incorporate a strategy to eliminate systematic errors in time-lapse data. This work demonstrated an improved characterization of temporal changes compared with classical approaches, which perform independent inversions at each time step and compare the models. In the second part of this thesis, I adopt a Bayesian formalism and test the possibility of quantifying the uncertainties on the model parameters in plane-wave EM inversion. To do so, I present a probabilistic inversion strategy based on two-dimensional pixels for separate and joint inversions of plane-wave EM data and electrical resistivity tomography (ERT) data. I compare the uncertainties of the model parameters when considering different types of a priori information on the model structure and different likelihood functions to describe the data errors. The results indicate that model regularization is necessary when dealing with a large number of parameters, because it accelerates the convergence of the chains and yields more realistic models. However, these constraints lead to smaller uncertainty estimates, which implies posterior distributions that do not contain the true model in regions where the method has limited sensitivity. This situation can be improved by combining plane-wave EM methods with other complementary methods such as ERT. 
In addition, I show that the regularization weight of the parameters and the standard deviation of the data errors can be retrieved by a probabilistic inversion. Finally, I evaluate the possibility of characterizing the three-dimensional distribution of a saline tracer plume injected into the subsurface by performing a three-dimensional time-lapse probabilistic inversion of plane-wave EM data. Since probabilistic inversions are computationally very expensive when the parameter space is high-dimensional, I propose a model reduction strategy in which the coefficients of a Legendre moment decomposition of the injected tracer plume, as well as its position, are estimated. This requires a base resistivity model, which can be obtained before the time-lapse experiment. A synthetic test shows that the methodology works well when the base resistivity model is correctly characterized. The methodology is also applied to a tracer test involving the injection of a saline solution and acids in a geothermal system in Australia, and then compared with a three-dimensional time-lapse inversion carried out with a deterministic approach. The probabilistic inversion constrains the saline tracer plume better thanks to the large amount of a priori information included in the algorithm. Nevertheless, the conductivity changes needed to explain the observed changes in the data are larger than our current knowledge of the physical phenomena can explain. This problem may be linked to the limited quality of the base resistivity model used, indicating that greater efforts will be needed in the future to obtain good-quality base models before carrying out dynamic experiments. 
The studies described in this thesis show that plane-wave EM methods are very useful for characterizing and monitoring temporal variations of the subsurface over wide ranges of scales. The approaches presented improve the appraisal of the models obtained, both in terms of the incorporation of a priori information and in terms of the quantification of posterior uncertainties. In addition, the strategies developed can be applied to other geophysical methods and offer great flexibility for incorporating additional information when it is available. -- The efficient use of geothermal systems, the sequestration of CO2 to mitigate climate change, and the prevention of seawater intrusion in coastal aquifers are only some examples that demonstrate the need for novel technologies to monitor subsurface processes from the surface. A main challenge is to ensure optimal performance of such technologies at different temporal and spatial scales. Plane-wave electromagnetic (EM) methods are sensitive to subsurface electrical conductivity and consequently to fluid conductivity, fracture connectivity, temperature, and rock mineralogy. These methods have governing equations that are the same over a large range of frequencies, thus allowing processes to be studied in an analogous manner on scales ranging from a few meters close to the surface down to depths of several hundred kilometers. Unfortunately, they suffer from a significant resolution loss with depth due to the diffusive nature of the electromagnetic fields. Therefore, estimations of subsurface models that use these methods should incorporate a priori information to better constrain the models, and provide appropriate measures of model uncertainty. During my thesis, I have developed approaches to improve the static and dynamic characterization of the subsurface with plane-wave EM methods. 
In the first part of this thesis, I present a two-dimensional deterministic approach to perform time-lapse inversion of plane-wave EM data. The strategy is based on the incorporation of prior information into the inversion algorithm regarding the expected temporal changes in electrical conductivity. This is done by incorporating a flexible stochastic regularization and constraints regarding the expected ranges of the changes by using Lagrange multipliers. I use non-l2 norms to penalize the model update in order to obtain sharp transitions between regions that experience temporal changes and regions that do not. I also incorporate a time-lapse differencing strategy to remove systematic errors in the time-lapse inversion. This work presents improvements in the characterization of temporal changes with respect to the classical approach of performing separate inversions and computing differences between the models. In the second part of this thesis, I adopt a Bayesian framework and use Markov chain Monte Carlo (MCMC) simulations to quantify model parameter uncertainty in plane-wave EM inversion. For this purpose, I present a two-dimensional pixel-based probabilistic inversion strategy for separate and joint inversions of plane-wave EM and electrical resistivity tomography (ERT) data. I compare the uncertainties of the model parameters when considering different types of prior information on the model structure and different likelihood functions to describe the data errors. The results indicate that model regularization is necessary when dealing with a large number of model parameters because it helps to accelerate the convergence of the chains and leads to more realistic models. These constraints also lead to smaller uncertainty estimates, which imply posterior distributions that do not include the true underlying model in regions where the method has limited sensitivity. 
This situation can be improved by combining plane-wave EM methods with complementary geophysical methods such as ERT. In addition, I show that an appropriate regularization weight and the standard deviation of the data errors can be retrieved by the MCMC inversion. Finally, I evaluate the possibility of characterizing the three-dimensional distribution of an injected water plume by performing three-dimensional time-lapse MCMC inversion of plane-wave EM data. Since MCMC inversion involves a significant computational burden in high parameter dimensions, I propose a model reduction strategy where the coefficients of a Legendre moment decomposition of the injected water plume and its location are estimated. For this purpose, a base resistivity model is needed, which is obtained prior to the time-lapse experiment. A synthetic test shows that the methodology works well when the base resistivity model is correctly characterized. The methodology is also applied to an injection experiment performed in a geothermal system in Australia, and compared to a three-dimensional time-lapse inversion performed within a deterministic framework. The MCMC inversion better constrains the water plumes due to the larger amount of prior information that is included in the algorithm. However, the conductivity changes needed to explain the time-lapse data are much larger than what is physically possible based on present-day understanding. This issue may be related to the base resistivity model used, indicating that more effort should be devoted to obtaining high-quality base models prior to dynamic experiments. The studies described herein give clear evidence that plane-wave EM methods are useful to characterize and monitor the subsurface at a wide range of scales. The presented approaches contribute to an improved appraisal of the obtained models, both in terms of the incorporation of prior information in the algorithms and the posterior uncertainty quantification. 
In addition, the developed strategies can be applied to other geophysical methods, and offer great flexibility to incorporate additional information when available.
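The Bayesian part of the thesis can be illustrated with a minimal Metropolis sampler. The sketch below is a toy under stated assumptions and not the thesis's inversion code: the plane-wave EM forward model is replaced by a hypothetical one-parameter linear model d = m * x, the prior is uniform, the data errors are Gaussian with known standard deviation, and all names and values are invented for illustration.

```python
import math
import random


def make_data(m_true=2.0, sigma=0.5, seed=0):
    """Synthetic observations from the toy forward model d = m * x + noise."""
    rng = random.Random(seed)
    xs = [float(i) for i in range(1, 11)]
    ds = [m_true * x + rng.gauss(0.0, sigma) for x in xs]
    return xs, ds


def log_posterior(m, xs, ds, sigma, m_min=0.0, m_max=10.0):
    """Uniform prior on [m_min, m_max] plus a Gaussian log-likelihood."""
    if not (m_min <= m <= m_max):
        return -math.inf
    misfit = sum((d - m * x) ** 2 for x, d in zip(xs, ds))
    return -0.5 * misfit / sigma ** 2


def metropolis(xs, ds, sigma, n_steps=5000, step=0.05, m0=5.0, seed=1):
    """Random-walk Metropolis chain for the single model parameter m."""
    rng = random.Random(seed)
    m, logp = m0, log_posterior(m0, xs, ds, sigma)
    chain = []
    for _ in range(n_steps):
        cand = m + rng.gauss(0.0, step)
        logp_cand = log_posterior(cand, xs, ds, sigma)
        # Accept with probability min(1, posterior ratio); 1 - U lies in (0, 1].
        if math.log(1.0 - rng.random()) < logp_cand - logp:
            m, logp = cand, logp_cand
        chain.append(m)
    return chain


if __name__ == "__main__":
    xs, ds = make_data()
    samples = metropolis(xs, ds, sigma=0.5)[1000:]  # discard burn-in
    mean = sum(samples) / len(samples)
    spread = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
    print("posterior mean:", mean, "posterior std:", spread)
```

The spread of the retained samples plays the role of the parameter uncertainty discussed above. In a pixel-based inversion the same accept/reject rule applies, but the forward model is far more expensive to evaluate, which is what motivates the model reduction strategy.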
Resumo:
Radioactive soil-contamination mapping and risk assessment is a vital issue for decision makers. Traditional approaches for mapping the spatial concentration of radionuclides employ various regression-based models, which usually provide a single-value prediction realization accompanied (in some cases) by estimation error. Such approaches do not provide the capability for rigorous uncertainty quantification or probabilistic mapping. Machine learning is a recent and fast-developing approach based on learning patterns and information from data. Artificial neural networks for prediction mapping have been especially powerful in combination with spatial statistics. A data-driven approach provides the opportunity to integrate additional relevant information about spatial phenomena into a prediction model for more accurate spatial estimates and associated uncertainty. Machine-learning algorithms can also be used for a wider spectrum of problems than before: classification, probability density estimation, and so forth. Stochastic simulations are used to model spatial variability and uncertainty. Unlike regression models, they provide multiple realizations of a particular spatial pattern that allow uncertainty and risk quantification. This paper reviews the most recent methods of spatial data analysis, prediction, and risk mapping, based on machine learning and stochastic simulations in comparison with more traditional regression models. The radioactive fallout from the Chernobyl Nuclear Power Plant accident is used to illustrate the application of the models for prediction and classification problems. This fallout is a unique case study that provides the challenging task of analyzing huge amounts of data ('hard' direct measurements, as well as supplementary information and expert estimates) and solving particular decision-oriented problems.
Resumo:
Unlike the evaluation of single items of scientific evidence, the formal study and analysis of the joint evaluation of several distinct items of forensic evidence has to date received some occasional, rather than systematic, attention. Questions about the (i) relationships among a set of (usually unobservable) propositions and a set of (observable) items of scientific evidence, (ii) the joint probative value of a collection of distinct items of evidence as well as (iii) the contribution of each individual item within a given group of pieces of evidence still represent fundamental areas of research. To some degree, this is remarkable since both forensic science theory and practice, as well as many daily inference tasks, require the consideration of multiple items, if not masses, of evidence. A recurrent and particular complication that arises in such settings is that the application of probability theory, i.e. the reference method for reasoning under uncertainty, becomes increasingly demanding. The present paper takes this as a starting point and discusses graphical probability models, i.e. Bayesian networks, as a framework within which the joint evaluation of scientific evidence can be approached in some viable way. Based on a review of the main existing contributions in this area, the article aims at presenting instances of real case studies from the author's institution in order to point out the usefulness and capacities of Bayesian networks for the probabilistic assessment of the probative value of multiple and interrelated items of evidence. A main emphasis is placed on underlying general patterns of inference, their representation as well as their graphical probabilistic analysis. Attention is also drawn to inferential interactions, such as redundancy, synergy and directional change. These distinguish the joint evaluation of evidence from assessments of isolated items of evidence. 
Together, these topics present aspects of interest to both domain experts and recipients of expert information, because they have a bearing on how multiple items of evidence are meaningfully and appropriately set into context.
Resumo:
Summary This thesis is devoted to the analysis, modeling and visualization of spatially referenced environmental data using machine learning algorithms. Machine learning can be considered, in a broad sense, as a subfield of artificial intelligence that is particularly concerned with the development of techniques and algorithms that allow a machine to learn from data. In this thesis, machine learning algorithms are adapted for application to environmental data and to spatial prediction. Why machine learning? Because most machine learning algorithms are universal, adaptive, nonlinear, robust and efficient for modeling. They can solve classification, regression and probability density modeling problems in high-dimensional spaces composed of spatially referenced informative variables ("geo-features") in addition to the geographical coordinates. Moreover, they are ideal for implementation as decision-support tools for environmental questions ranging from pattern recognition to modeling and prediction, including automatic mapping. Their efficiency is comparable to that of geostatistical models in the space of geographical coordinates, but they are indispensable for high-dimensional data that include geo-features. The most important and most popular machine learning algorithms are presented theoretically and implemented as software for the environmental sciences. 
The main algorithms described are the multilayer perceptron (MLP), the best-known algorithm in artificial intelligence; general regression neural networks (GRNN); probabilistic neural networks (PNN); self-organized maps (SOM); Gaussian mixture models (GMM); radial basis function networks (RBF); and mixture density networks (MDN). This range of algorithms covers varied tasks such as classification, regression and probability density estimation. Exploratory data analysis (EDA) is the first step of any data analysis. In this thesis, the concepts of exploratory spatial data analysis (ESDA) are treated both with the traditional geostatistical approach, through experimental variography, and according to the principles of machine learning. Experimental variography, which studies the relations between pairs of points, is a basic tool for the geostatistical analysis of anisotropic spatial correlations and makes it possible to detect the presence of spatial patterns that can be described by a statistic. The machine learning approach to ESDA is presented through the application of the k-nearest neighbors method, which is very simple and has excellent interpretation and visualization qualities. An important part of the thesis deals with topical subjects such as the automatic mapping of spatial data. The general regression neural network is proposed to solve this task efficiently. 
The performance of the GRNN is demonstrated on the Spatial Interpolation Comparison (SIC) 2004 data, for which the GRNN significantly outperforms all other methods, particularly in emergency situations. The thesis is composed of four chapters: theory, applications, software tools and guided examples. An important part of the work consists of a collection of software: Machine Learning Office. This software collection has been developed over the last 15 years and has been used for teaching numerous courses, including international workshops in China, France, Italy, Ireland and Switzerland, as well as in fundamental and applied research projects. The case studies considered cover a wide spectrum of real low- and high-dimensional geo-environmental problems, such as air, soil and water pollution by radioactive products and heavy metals, the classification of soil types and hydrogeological units, uncertainty mapping for decision support, and the assessment of natural hazards (landslides, avalanches). Complementary tools for exploratory data analysis and visualization were also developed, with care taken to create a user-friendly, easy-to-use interface. Machine Learning for geospatial data: algorithms, software tools and case studies Abstract The thesis is devoted to the analysis, modeling and visualisation of spatial environmental data using machine learning algorithms. In a broad sense machine learning can be considered as a subfield of artificial intelligence. It is mainly concerned with the development of techniques and algorithms that allow computers to learn from data. In this thesis machine learning algorithms are adapted to learn from spatial environmental data and to make spatial predictions. Why machine learning? 
In short, most machine learning algorithms are universal, adaptive, nonlinear, robust and efficient modeling tools. They can find solutions to classification, regression, and probability density modeling problems in high-dimensional geo-feature spaces, composed of geographical space and additional relevant spatially referenced features. They are well suited to be implemented as predictive engines in decision support systems, for the purposes of environmental data mining including pattern recognition, modeling and prediction as well as automatic data mapping. Their efficiency is competitive with that of geostatistical models in low-dimensional geographical spaces, but they are indispensable in high-dimensional geo-feature spaces. The most important and popular machine learning algorithms and models of interest for geo- and environmental sciences are presented in detail, from the theoretical description of the concepts to the software implementation. The main algorithms and models considered are the following: multi-layer perceptron (a workhorse of machine learning), general regression neural networks, probabilistic neural networks, self-organising (Kohonen) maps, Gaussian mixture models, radial basis function networks, and mixture density networks. This set of models covers machine learning tasks such as classification, regression, and density estimation. Exploratory data analysis (EDA) is an initial and very important part of data analysis. In this thesis the concepts of exploratory spatial data analysis (ESDA) are considered using both the traditional geostatistical approach, such as experimental variography, and machine learning. Experimental variography is a basic tool for the geostatistical analysis of anisotropic spatial correlations, which helps to understand the presence of spatial patterns, at least those described by two-point statistics. 
A machine learning approach to ESDA is presented by applying the k-nearest neighbors (k-NN) method, which is simple and has very good interpretation and visualization properties. An important part of the thesis deals with a currently hot topic, namely the automatic mapping of geospatial data. General regression neural networks (GRNN) are proposed as an efficient model to solve this task. The performance of the GRNN model is demonstrated on the Spatial Interpolation Comparison (SIC) 2004 data, where the GRNN model significantly outperformed all other approaches, especially under emergency conditions. The thesis consists of four chapters and has the following structure: theory, applications, software tools, and how-to-do-it examples. An important part of the work is a collection of software tools, Machine Learning Office. The Machine Learning Office tools were developed during the last 15 years and were used both for many teaching courses, including international workshops in China, France, Italy, Ireland and Switzerland, and for carrying out fundamental and applied research projects. The case studies considered cover a wide spectrum of real-life low- and high-dimensional geo- and environmental problems, such as air, soil and water pollution by radionuclides and heavy metals, classification of soil types and hydro-geological units, decision-oriented mapping with uncertainties, and natural hazard (landslides, avalanches) assessment and susceptibility mapping. Complementary tools useful for exploratory data analysis and visualisation were developed as well. The software is user-friendly and easy to use.
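For readers unfamiliar with the GRNN, its prediction step can be written in a few lines. The sketch below is not the Machine Learning Office implementation. It assumes the standard formulation of the GRNN as a Gaussian-kernel weighted average of the training targets (the Nadaraya-Watson form) with a single isotropic bandwidth sigma; the function name and sample values are hypothetical.

```python
import math


def grnn_predict(query, train_x, train_y, sigma):
    """Predict at `query` as a kernel-weighted average of training targets.

    query   : coordinates (or geo-features) of the prediction point
    train_x : list of training points with the same dimension as `query`
    train_y : measured values at the training points
    sigma   : kernel bandwidth, the model's only tuning parameter here
    """
    sq_dists = [sum((q - x) ** 2 for q, x in zip(query, point))
                for point in train_x]
    # Subtracting the smallest distance leaves the ratio unchanged but keeps
    # all weights from underflowing to zero far away from the data.
    d_min = min(sq_dists)
    weights = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d in sq_dists]
    return sum(w * y for w, y in zip(weights, train_y)) / sum(weights)


if __name__ == "__main__":
    # Hypothetical measurements at three locations.
    points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    values = [1.0, 3.0, 2.0]
    print(grnn_predict((0.5, 0.5), points, values, sigma=0.5))
```

A small sigma makes the prediction follow the nearest measurements, while a large sigma pulls it toward the global mean. Because the bandwidth is the only free parameter in this form, it can be tuned automatically (for example by cross-validation), which is what makes the model attractive for automatic mapping.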