119 results for Data Driven Modeling


Relevance:

80.00%

Publisher:

Abstract:

Résumé (translated from the French): Following recent technological advances, digital image archives have grown in quality and quantity at an unprecedented rate. Despite the enormous possibilities they offer, these advances raise new questions about how to process the masses of data acquired. This question is the basis of this Thesis: problems of processing digital information at very high spatial and/or spectral resolution are addressed using statistical learning approaches, namely kernel methods. This Thesis studies image classification problems, that is, the categorization of pixels into a reduced number of classes reflecting the spectral and contextual properties of the objects they represent. The emphasis is on the efficiency of the algorithms and on their simplicity, so as to increase their potential for implementation by users. Moreover, the challenge of this Thesis is to remain close to the concrete problems of satellite image users without losing sight of the interest of the proposed methods for the machine learning community from which they originate. In this sense, the work is deliberately transdisciplinary, maintaining a strong link between the two sciences in all the developments proposed. Four models are proposed: the first addresses the problem of high dimensionality and data redundancy with a model that optimizes classification performance by adapting to the particularities of the image. This is made possible by a ranking system for the variables (the bands) that is optimized at the same time as the base model: in this way, only the variables important for solving the problem are used by the classifier.
The lack of labeled information and the uncertainty about its relevance to the problem motivate the next two models, based respectively on active learning and semi-supervised methods: the former improves the quality of a training set through direct interaction between the user and the machine, while the latter uses unlabeled pixels to improve the description of the available data and the robustness of the model. Finally, the last model proposed considers the more theoretical question of structure among the outputs: integrating this source of information, never before considered in remote sensing, opens new research challenges. Advanced kernel methods for remote sensing image classification. Devis Tuia, Institut de Géomatique et d'Analyse du Risque, September 2009. Abstract: The technical developments in recent years have brought the quantity and quality of digital information to an unprecedented level, as enormous archives of satellite images are available to the users. However, even if these advances open more and more possibilities in the use of digital imagery, they also raise several problems of storage and processing. The latter is considered in this Thesis: the processing of very high spatial and spectral resolution images is treated with approaches based on data-driven algorithms relying on kernel methods. In particular, the problem of image classification, i.e. the categorization of the image's pixels into a reduced number of classes reflecting spectral and contextual properties, is studied through the different models presented. The emphasis is put on algorithmic efficiency and the simplicity of the approaches proposed, to avoid overly complex models that users would not adopt.
The major challenge of the Thesis is to remain close to concrete remote sensing problems, without losing the methodological interest from the machine learning viewpoint: in this sense, this work aims at building a bridge between the machine learning and remote sensing communities, and all the models proposed have been developed keeping in mind the need for such a synergy. Four models are proposed: first, an adaptive model learning the relevant image features addresses the problem of high dimensionality and collinearity of the image features. This model automatically provides an accurate classifier and a ranking of the relevance of the single features. The scarcity and unreliability of labeled information were the common root of the second and third models proposed: when confronted with such problems, the user can either construct the labeled set iteratively by direct interaction with the machine or use the unlabeled data to increase the robustness and quality of the description of the data. Both solutions have been explored, resulting in two methodological contributions, based respectively on active learning and semi-supervised learning. Finally, the more theoretical issue of structured outputs has been considered in the last model, which, by integrating output similarity into a model, opens new challenges and opportunities for remote sensing image processing.

Relevance:

80.00%

Publisher:

Abstract:

This paper presents multiple kernel learning (MKL) regression as an exploratory spatial data analysis and modelling tool. The MKL approach is introduced as an extension of support vector regression, where MKL uses dedicated kernels to divide a given task into sub-problems and to treat them separately in an effective way. It provides better interpretability to non-linear robust kernel regression at the cost of a more complex numerical optimization. In particular, we investigate the use of MKL as a tool that allows us to avoid using ad-hoc topographic indices as covariates in statistical models in complex terrains. Instead, MKL learns these relationships from the data in a non-parametric fashion. A study on data simulated from real terrain features confirms the ability of MKL to enhance the interpretability of data-driven models and to aid feature selection without degrading predictive performance. Here we examine the stability of the MKL algorithm with respect to the number of training data samples and to the presence of noise. The results of a real case study are also presented, where MKL is able to exploit a large set of terrain features computed at multiple spatial scales when predicting mean wind speed in an Alpine region.
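
As a rough illustration of the kernel combination that MKL relies on, the sketch below builds one RBF kernel per group of features and mixes them with non-negative weights. The feature groups, bandwidth and weights are hypothetical and not taken from the paper; in MKL the weights are estimated jointly with the regression, which this sketch does not attempt.

```python
import math

def rbf_kernel(x, y, idx, gamma):
    """RBF kernel restricted to the feature indices in idx."""
    sq_dist = sum((x[i] - y[i]) ** 2 for i in idx)
    return math.exp(-gamma * sq_dist)

def combined_kernel(x, y, groups, weights, gamma=1.0):
    """Weighted sum of per-group kernels; weights must be non-negative."""
    if any(w < 0 for w in weights):
        raise ValueError("kernel weights must be non-negative")
    return sum(w * rbf_kernel(x, y, idx, gamma)
               for idx, w in zip(groups, weights))

# Hypothetical layout: features 0-1 are coordinates, 2-3 are terrain features.
groups = [(0, 1), (2, 3)]
weights = [0.7, 0.3]  # assumed values; MKL would estimate these from data

a = [0.0, 0.0, 1.0, 2.0]
b = [1.0, 0.0, 1.0, 2.0]
k_ab = combined_kernel(a, b, groups, weights)
k_aa = combined_kernel(a, a, groups, weights)
```

A weight that shrinks toward zero marks a feature group the model does not need, which is how the combination supports interpretation and feature selection.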

Relevance:

80.00%

Publisher:

Abstract:

Mountain ecosystems will likely be affected by global warming during the 21st century, with substantial biodiversity loss predicted by species distribution models (SDMs). Depending on the geographic extent, elevation range and spatial resolution of data used in making these models, different rates of habitat loss have been predicted, with associated risk of species extinction. Few coordinated cross-scale comparisons have been made using data of different resolution and geographic extent. Here, we assess whether climate-change-induced habitat losses predicted at the European scale (10x10' grid cells) are also predicted from local-scale data and modeling (25x25m grid cells) in two regions of the Swiss Alps. We show that local-scale models predict persistence of suitable habitats in up to 100% of species that were predicted by a European-scale model to lose all their suitable habitats in the area. The proportion of habitat loss depends on the climate change scenario and study area. We find good agreement between the mismatch in predictions between scales and the fine-grain elevation range within 10x10' cells. The greatest prediction discrepancy for alpine species occurs in the area with the largest nival zone. Our results suggest elevation range as the main driver for the observed prediction discrepancies. Local-scale projections may better reflect the possibility for species to track their climatic requirements toward higher elevations.

Relevance:

80.00%

Publisher:

Abstract:

The hydrological and biogeochemical processes that operate in catchments influence the ecological quality of freshwater systems through delivery of fine sediment, nutrients and organic matter. Most models that seek to characterise the delivery of diffuse pollutants from land to water are reductionist. The multitude of processes that are parameterised in such models to ensure generic applicability make them complex and difficult to test on available data. Here, we outline an alternative - data-driven - inverse approach. We apply SCIMAP, a parsimonious risk-based model that has an explicit treatment of hydrological connectivity. We take a Bayesian approach to the inverse problem of determining the risk that must be assigned to different land uses in a catchment in order to explain the spatial patterns of measured in-stream nutrient concentrations. We apply the model to identify the key sources of nitrogen (N) and phosphorus (P) diffuse pollution risk in eleven UK catchments covering a range of landscapes. The model results show that: 1) some land uses generate a consistently high or low risk of diffuse nutrient pollution; but 2) the risks associated with different land uses vary both between catchments and between nutrients; and 3) the dominant sources of P and N risk in the catchment are often a function of the spatial configuration of land uses. Taken on a case-by-case basis, this type of inverse approach may be used to help prioritise the focus of interventions to reduce diffuse pollution risk for freshwater ecosystems. (C) 2012 Elsevier B.V. All rights reserved.

Relevance:

80.00%

Publisher:

Abstract:

Interactions between stimuli's acoustic features and experience-based internal models of the environment enable listeners to compensate for the disruptions in auditory streams that are regularly encountered in noisy environments. However, whether auditory gaps are filled in predictively or restored a posteriori remains unclear. The current lack of positive statistical evidence that internal models can actually shape brain activity as real sounds would precludes accepting predictive accounts of the filling-in phenomenon. We investigated the neurophysiological effects of internal models by testing whether single-trial electrophysiological responses to omitted sounds in a rule-based sequence of tones with varying pitch could be decoded from the responses to real sounds and by analyzing the ERPs to the omissions with data-driven electrical neuroimaging methods. The decoding of the brain responses to different expected, but omitted, tones in both passive and active listening conditions was above chance based on the responses to the real sound in active listening conditions. Topographic ERP analyses and electrical source estimations revealed that, in the absence of any stimulation, experience-based internal models elicit an electrophysiological activity different from noise and that the temporal dynamics of this activity depend on attention. We further found that the expected change in pitch direction of omitted tones modulated the activity of left posterior temporal areas 140-200 msec after the onset of omissions. Collectively, our results indicate that, even in the absence of any stimulation, internal models modulate brain activity as do real sounds, indicating that auditory filling-in can be accounted for by predictive activity.

Relevance:

50.00%

Publisher:

Abstract:

Despite their limited proliferation capacity, regulatory T cells (Tregs) constitute a population maintained over the entire lifetime of a human organism. The means by which Tregs sustain a stable pool in vivo are controversial. Using a mathematical model, we address this issue by evaluating several biological scenarios of the origins and the proliferation capacity of two subsets of Tregs: precursor CD4+CD25+CD45RO− and mature CD4+CD25+CD45RO+ cells. The lifelong dynamics of Tregs are described by a set of ordinary differential equations, driven by a stochastic process representing the major immune reactions involving these cells. The model dynamics are validated using data from human donors of different ages. Analysis of the data led to the identification of two properties of the dynamics: (1) the equilibrium in the CD4+CD25+FoxP3+ Treg population is maintained over both precursor and mature Treg pools together, and (2) the ratio between precursor and mature Tregs is inverted in the early years of adulthood. Then, using the model, we identified three biologically relevant scenarios that have the above properties: (1) the unique source of mature Tregs is the antigen-driven differentiation of precursors that acquire the mature profile in the periphery, and the proliferation of Tregs is essential for the development and the maintenance of the pool; there exist other sources of mature Tregs, such as (2) a homeostatic density-dependent regulation or (3) thymus- or effector-derived Tregs, and in both cases, antigen-induced proliferation is not necessary for the development of a stable pool of Tregs. This is the first time that a mathematical model built to describe the in vivo dynamics of regulatory T cells has been validated using human data.
The application of this model provides an invaluable tool for estimating the amount of regulatory T cells as a function of time in the blood of patients who have received a solid organ transplant or are suffering from an autoimmune disease.

Relevance:

40.00%

Publisher:

Abstract:

Odds ratios for head and neck cancer increase with greater cigarette and alcohol use and lower body mass index (BMI; weight (kg)/height² (m²)). Using data from the International Head and Neck Cancer Epidemiology Consortium, the authors conducted a formal analysis of BMI as a modifier of smoking- and alcohol-related effects. Analysis of never and current smokers included 6,333 cases, while analysis of never drinkers and consumers of ≤10 drinks/day included 8,452 cases. There were 8,000 or more controls, depending on the analysis. Odds ratios for all sites increased with lower BMI, greater smoking, and greater drinking. In polytomous regression, odds ratios for BMI (P = 0.65), smoking (P = 0.52), and drinking (P = 0.73) were homogeneous for oral cavity and pharyngeal cancers. Odds ratios for BMI and drinking were greater for oral cavity/pharyngeal cancer (P < 0.01), while smoking odds ratios were greater for laryngeal cancer (P < 0.01). Lower BMI enhanced smoking- and drinking-related odds ratios for oral cavity/pharyngeal cancer (P < 0.01), while BMI did not modify smoking and drinking odds ratios for laryngeal cancer. The increased odds ratios for all sites with low BMI may suggest related carcinogenic mechanisms; however, BMI modification of smoking and drinking odds ratios for cancer of the oral cavity/pharynx but not larynx cancer suggests additional factors specific to oral cavity/pharynx cancer.
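
The two quantities the abstract relies on can be written out as a short sketch: BMI as defined above, and a crude odds ratio from a 2x2 table of cases and controls. The counts below are invented for illustration only, and the study itself reports regression-based odds ratios rather than this unadjusted form.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2

def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Crude (unadjusted) odds ratio from a 2x2 case-control table."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)

example_bmi = bmi(70.0, 1.75)

# Hypothetical counts, not data from the consortium.
example_or = odds_ratio(exposed_cases=20, unexposed_cases=80,
                        exposed_controls=10, unexposed_controls=90)
```

Effect modification by BMI would then show up as odds ratios for smoking or drinking that differ between BMI strata.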

Relevance:

40.00%

Publisher:

Abstract:

Summary: Global warming has led to an average earth surface temperature increase of about 0.7 °C in the 20th century, according to the 2007 IPCC report. In Switzerland, the temperature increase in the same period was even higher: 1.3 °C in the Northern Alps and 1.7 °C in the Southern Alps. The impacts of this warming on ecosystems, especially on climatically sensitive systems like the treeline ecotone, are already visible today. Alpine treeline species show increased growth rates, more establishment of young trees in forest gaps is observed in many locations and treelines are migrating upwards. With the forecasted warming, this globally visible phenomenon is expected to continue. This PhD thesis aimed to develop a set of methods and models to investigate current and future climatic treeline positions and treeline shifts in the Swiss Alps in a spatial context. The focus was therefore on: 1) the quantification of current treeline dynamics and its potential causes, 2) the evaluation and improvement of temperature-based treeline indicators and 3) the spatial analysis and projection of past, current and future climatic treeline positions and their respective elevational shifts. The methods used involved a combination of field temperature measurements, statistical modeling and spatial modeling in a geographical information system. To determine treeline shifts and assign the respective drivers, neighborhood relationships between forest patches were analyzed using moving window algorithms. Time series regression modeling was used in the development of an air-to-soil temperature transfer model to calculate thermal treeline indicators. The indicators were then applied spatially to delineate the climatic treeline, based on interpolated temperature data. Observation of recent forest dynamics in the Swiss treeline ecotone showed that changes were mainly due to forest in-growth, but also partly to upward altitudinal shifts.
The recent reduction in agricultural land-use was found to be the dominant driver of these changes. Climate-driven changes were identified only at the uppermost limits of the treeline ecotone. Seasonal mean temperature indicators were found to be the best for predicting climatic treelines. Applying dynamic seasonal delimitations and the air-to-soil temperature transfer model improved the indicators' applicability for spatial modeling. Reproducing the climatic treelines of the past 45 years revealed regionally different altitudinal shifts, the largest being located near the highest mountain mass. Modeling climatic treelines based on two IPCC climate warming scenarios predicted major shifts in treeline altitude. However, the currently observed treeline is not expected to reach this limit easily, due to lagged reaction, possible climate feedback effects and other limiting factors. Résumé (translated from the French): According to the 2007 IPCC report, global warming induced an average increase in the Earth's surface temperature of 0.7 °C over the 20th century. In Switzerland, the increase over the same period was larger: 1.3 °C in the Northern Alps and 1.7 °C in the Southern Alps. The impacts of this warming on ecosystems, in particular on sensitive systems such as the treeline ecotone, are already visible today. Alpine treeline species show higher growth rates, an increase in the number of young trees establishing in forest gaps is observed in many places, and the treeline is migrating upwards. Given the forecast warming, this globally visible phenomenon is expected to persist. This PhD thesis aimed to develop a set of methods and models to study, in a spatial context, the present and future position of the climatic treeline, as well as its shifts, within the Swiss Alps.
The study therefore focused on: 1) the quantification of current treeline dynamics and its potential causes, 2) the evaluation and improvement of temperature-based treeline indicators and 3) the spatial analysis and projection of the past, present and future climatic treeline position and of the altitudinal shifts of that position. The methods used combine field temperature measurements, statistical modeling and spatial modeling with a geographic information system. Neighborhood relationships between forest patches were analyzed with moving-window algorithms in order to measure treeline shifts and determine their causes. An air-to-soil temperature transfer model, based on time series regression models, was developed to calculate thermal treeline indicators. The indicators were then applied spatially to delineate the climatic treeline on the basis of interpolated temperature data. Observation of recent forest dynamics in the Swiss treeline ecotone showed that the changes were mainly due to the closing of forest gaps, but also partly to shifts toward higher elevations. The recent abandonment of agricultural land was shown to be the main cause of these changes. Climate-driven changes were identified only at the upper limits of the treeline ecotone. Seasonal mean temperature indicators proved best suited to predicting the climatic treeline. Applying dynamic seasonal delimitations and the air-to-soil temperature transfer model improved the applicability of the indicators for spatial modeling.
Reproducing the climatic treelines of the past 45 years revealed regionally different changes in altitude, the largest being located near the highest mountain massif. Modeling the climatic treelines under two IPCC climate warming scenarios predicted major changes in treeline altitude. However, the currently observed treeline is not expected to reach this limit easily, because of delayed reaction, climate feedback effects and other limiting factors.

Relevance:

40.00%

Publisher:

Abstract:

Natural selection is typically exerted at some specific life stages. If natural selection takes place before a trait can be measured, using conventional models can lead to incorrect inference about population parameters. When the missing data process relates to the trait of interest, valid inference requires explicit modeling of the missing process. We propose a joint modeling approach, a shared parameter model, to account for nonrandom missing data. It consists of an animal model for the phenotypic data and a logistic model for the missing process, linked by the additive genetic effects. A Bayesian approach is taken and inference is made using integrated nested Laplace approximations. From a simulation study we find that wrongly assuming that missing data are missing at random can result in severely biased estimates of additive genetic variance. Using real data from a wild population of Swiss barn owls (Tyto alba), our model indicates that the missing individuals would display large black spots, and we conclude that genes affecting this trait are already under selection before it is expressed. Our model is a tool to correctly estimate the magnitude of both natural selection and additive genetic variance.
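
The effect of nonrandom missingness can be seen in a small simulation. The sketch below is not the shared parameter model or the INLA inference described above. It only draws a standard normal trait, removes individuals with a logistic probability that rises with the trait value, and compares naive summaries of the survivors with the full population. The slope, sample size and seed are assumptions chosen for illustration.

```python
import math
import random

def simulate(n=20000, slope=2.0, seed=0):
    """Draw a trait and drop individuals with probability logistic(slope * trait)."""
    rng = random.Random(seed)
    trait = [rng.gauss(0.0, 1.0) for _ in range(n)]
    observed = []
    for z in trait:
        p_missing = 1.0 / (1.0 + math.exp(-slope * z))
        if rng.random() >= p_missing:  # individual survives to be measured
            observed.append(z)
    return trait, observed

def mean(values):
    return sum(values) / len(values)

def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

trait, observed = simulate()
true_mean, naive_mean = mean(trait), mean(observed)
true_var, naive_var = variance(trait), variance(observed)
```

Because high trait values go missing more often, both the naive mean and the naive variance understate the population values, which is the kind of bias a missing-at-random assumption leaves uncorrected.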

Relevance:

40.00%

Publisher:

Abstract:

Imaging mass spectrometry (IMS) represents an innovative tool in the cancer research pipeline, which is increasingly being used in clinical and pharmaceutical applications. The unique properties of the technique, especially the amount of data generated, make the handling of data from multiple IMS acquisitions challenging. This work presents a histology-driven IMS approach aiming to identify discriminant lipid signatures from the simultaneous mining of IMS data sets from multiple samples. The feasibility of the developed workflow is evaluated on a set of three human colorectal cancer liver metastasis (CRCLM) tissue sections. Lipid IMS on tissue sections was performed using MALDI-TOF/TOF MS in both negative and positive ionization modes after 1,5-diaminonaphthalene matrix deposition by sublimation. The positive and negative acquisition results were combined during data mining to simplify the process and interrogate a larger lipidome in a single analysis. To reduce the complexity of the IMS data sets, a sub data set was generated by randomly selecting a fixed number of spectra from a histologically defined region of interest, resulting in a 10-fold data reduction. Principal component analysis confirmed that the molecular selectivity of the regions of interest is maintained after data reduction. Partial least-squares and heat map analyses demonstrated a selective signature of the CRCLM, revealing lipids that are significantly up- and down-regulated in the tumor region. This comprehensive approach is thus of interest for defining disease signatures directly from IMS data sets by the use of combinatory data mining, opening novel routes of investigation for addressing the demands of the clinical setting.
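
The data-reduction step described above, drawing a fixed number of spectra at random from one region of interest, can be sketched as follows. The function name, the spectrum indices and the region size are hypothetical; only the fixed-size random draw and the 10-fold reduction come from the abstract.

```python
import random

def subsample_roi(spectrum_ids, n_keep, seed=0):
    """Randomly keep n_keep spectra (without replacement) from one region of interest."""
    if n_keep > len(spectrum_ids):
        raise ValueError("cannot keep more spectra than the region contains")
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(spectrum_ids, n_keep)

# Hypothetical region of interest holding 5000 spectrum indices.
roi = list(range(5000))
subset = subsample_roi(roi, n_keep=len(roi) // 10)  # 10-fold reduction
```

Using the same `n_keep` for every region gives each histological class equal weight in the downstream multivariate analysis.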

Relevance:

40.00%

Publisher:

Abstract:

Crystallographic data on the T-Cell Receptor - peptide - major histocompatibility complex class I (TCRpMHC) interaction have revealed extremely diverse TCR binding modes triggering antigen recognition. Understanding the molecular basis that governs TCR orientation over pMHC is still a considerable challenge. We present a simplified rigid approach applied to all non-redundant TCRpMHC crystal structures available. The CHARMM force field in combination with the FACTS implicit solvation model is used to study the role of long-distance interactions between the TCR and pMHC. We demonstrate that the sum of the Coulomb interactions and the electrostatic solvation energies is sufficient to identify two orientations corresponding to energetic minima at 0° and 180° from the native orientation. Interestingly, these results are shown to be robust to small structural variations of the TCR, such as changes induced by Molecular Dynamics simulations, suggesting that shape complementarity is not required to obtain a reliable signal. Accurate energy minima are also identified by confronting unbound TCR crystal structures with pMHC. Furthermore, we decompose the electrostatic energy into residue contributions to estimate their role in the overall orientation. Results show that most of the driving force leading to the formation of the complex is defined by CDR1,2/MHC interactions. This long-distance contribution appears to be independent of the binding process itself, since it is reliably identified without considering either short-range energy terms or CDR induced fit upon binding. Ultimately, we present an attempt to predict the TCR/pMHC binding mode for a TCR structure obtained by homology modeling. The simplicity of the approach and the absence of any fitted parameters also make it easily applicable to other types of macromolecular protein complexes.
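
A rigid orientation scan of the kind described above can be illustrated with a toy system. The sketch below sums pairwise Coulomb energies between two small sets of point charges while rotating one set about the z axis. It leaves out the solvation term entirely, and the coordinates, charges and the rounded conversion constant are assumptions for illustration, not CHARMM or FACTS parameters.

```python
import math

# Approximate conversion constant in kcal*Angstrom/(mol*e^2); assumed rounded value.
COULOMB_K = 332.06

def coulomb_energy(set_a, set_b):
    """Sum of pairwise Coulomb energies between two sets of (x, y, z, charge)."""
    total = 0.0
    for xa, ya, za, qa in set_a:
        for xb, yb, zb, qb in set_b:
            r = math.dist((xa, ya, za), (xb, yb, zb))
            total += COULOMB_K * qa * qb / r
    return total

def rotate_z(points, degrees):
    """Rigidly rotate charged points about the z axis."""
    t = math.radians(degrees)
    c, s = math.cos(t), math.sin(t)
    return [(c * x - s * y, s * x + c * y, z, q) for x, y, z, q in points]

# Toy dipoles: the "receptor" lies in the plane z = 0, the "ligand" 3 Angstrom above it.
receptor = [(1.0, 0.0, 0.0, +1.0), (-1.0, 0.0, 0.0, -1.0)]
ligand = [(1.0, 0.0, 3.0, -1.0), (-1.0, 0.0, 3.0, +1.0)]

scan = {angle: coulomb_energy(receptor, rotate_z(ligand, angle))
        for angle in range(0, 360, 10)}
best_angle = min(scan, key=scan.get)
```

In this toy case the only minimum is the starting orientation. The second minimum at 180° reported in the abstract arises from the real charge distributions and the solvation term, neither of which this sketch reproduces.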

Relevance:

40.00%

Publisher:

Abstract:

The ability to determine the location and relative strength of all transcription-factor binding sites in a genome is important both for a comprehensive understanding of gene regulation and for effective promoter engineering in biotechnological applications. Here we present a bioinformatically driven experimental method to accurately define the DNA-binding sequence specificity of transcription factors. A generalized profile was used as a predictive quantitative model for binding sites, and its parameters were estimated from in vitro-selected ligands using standard hidden Markov model training algorithms. Computer simulations showed that several thousand low- to medium-affinity sequences are required to generate a profile of desired accuracy. To produce data on this scale, we applied high-throughput genomics methods to the biochemical problem addressed here. A method combining systematic evolution of ligands by exponential enrichment (SELEX) and serial analysis of gene expression (SAGE) protocols was coupled to an automated quality-controlled sequence extraction procedure based on Phred quality scores. This allowed the sequencing of a database of more than 10,000 potential DNA ligands for the CTF/NFI transcription factor. The resulting binding-site model defines the sequence specificity of this protein with a high degree of accuracy not achieved earlier and thereby makes it possible to identify previously unknown regulatory sequences in genomic DNA. A covariance analysis of the selected sites revealed non-independent base preferences at different nucleotide positions, providing insight into the binding mechanism.
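
To make the idea of a quantitative binding-site model concrete, the sketch below builds a simple position weight matrix from aligned sites and scores candidate sequences by log-odds against a uniform background. This is a deliberately simpler stand-in for the generalized profile and hidden Markov model training used in the paper, and the sequences are toy examples, not ligands from the study.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log2-odds weight matrix from equal-length aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for pos in range(length):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / background) for b in BASES})
    return pwm

def score(pwm, seq):
    """Sum of per-position log-odds; higher means closer to the model."""
    return sum(column[base] for column, base in zip(pwm, seq))

# Toy aligned sites, invented for illustration.
sites = ["TTGGC", "TTGGC", "TTGGA", "CTGGC"]
pwm = build_pwm(sites)
consensus_score = score(pwm, "TTGGC")
other_score = score(pwm, "AACCT")
```

A weight matrix treats positions as independent, so it cannot express the non-independent base preferences that the covariance analysis above detected.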

Relevance:

40.00%

Publisher:

Abstract:

The present research deals with an important public health threat, which is the pollution created by radon gas accumulation inside dwellings. The spatial modeling of indoor radon in Switzerland is particularly complex and challenging because of the many influencing factors that should be taken into account. Indoor radon data analysis must be addressed from both a statistical and a spatial point of view. As a multivariate process, it was important at first to define the influence of each factor. In particular, it was important to define the influence of geology as being closely associated with indoor radon. This association was indeed observed for the Swiss data but was not proven to be the sole determinant for the spatial modeling. The statistical analysis of data, at both the univariate and multivariate level, was followed by an exploratory spatial analysis. Many tools proposed in the literature were tested and adapted, including fractality, declustering and moving windows methods. The use of the Quantité Morisita Index (QMI) as a procedure to evaluate data clustering as a function of the radon level was proposed. The existing methods of declustering were revised and applied in an attempt to approach the global histogram parameters. The exploratory phase comes along with the definition of multiple scales of interest for indoor radon mapping in Switzerland. The analysis was done with a top-down resolution approach, from regional to local levels, in order to find the appropriate scales for modeling. In this sense, data partition was optimized in order to cope with the stationarity conditions of geostatistical models. Common methods of spatial modeling such as K Nearest Neighbors (KNN), variography and General Regression Neural Networks (GRNN) were proposed as exploratory tools. In the following section, different spatial interpolation methods were applied to a particular dataset.
A bottom-up approach in terms of method complexity was adopted and the results were analyzed together in order to find common definitions of continuity and neighborhood parameters. Additionally, a data filter based on cross-validation (the CVMF) was tested with the purpose of reducing noise at the local scale. At the end of the chapter, a series of tests for data consistency and method robustness were performed. This led to conclusions about the importance of data splitting and the limitations of generalization methods for reproducing statistical distributions. The last section was dedicated to modeling methods with probabilistic interpretations. Data transformation and simulations thus allowed the use of multigaussian models and helped take the uncertainty of the indoor radon pollution data into consideration. The categorization transform was presented as a solution for extreme values modeling through classification. Simulation scenarios were proposed, including an alternative proposal for the reproduction of the global histogram based on the sampling domain. Sequential Gaussian simulation (SGS) was presented as the method giving the most complete information, while classification performed in a more robust way. An error measure was defined in relation to the decision function for data classification hardening. Among the classification methods, probabilistic neural networks (PNN) were shown to be better adapted for modeling of high threshold categorization and for automation. Support vector machines (SVM), on the contrary, performed well under balanced category conditions. In general, it was concluded that no particular prediction or estimation method is better under all conditions of scale and neighborhood definitions. Simulations should be the basis, while other methods can provide complementary information to support efficient decision making on indoor radon.
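
As a small illustration of using K Nearest Neighbors as an exploratory spatial tool, the sketch below predicts a value at a location as the mean of its k nearest samples and uses leave-one-out cross-validation to compare neighborhood sizes. The coordinates and values are invented for illustration, and this is a generic KNN sketch rather than the thesis's own procedure or its CVMF filter.

```python
import math

def knn_predict(samples, x, y, k):
    """Mean value of the k samples closest to (x, y); samples are (x, y, value)."""
    nearest = sorted(samples, key=lambda s: math.hypot(s[0] - x, s[1] - y))[:k]
    return sum(s[2] for s in nearest) / len(nearest)

def loo_rmse(samples, k):
    """Leave-one-out root mean squared error for a given neighborhood size k."""
    sq_err = 0.0
    for i, (x, y, value) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]  # hold out sample i
        sq_err += (knn_predict(rest, x, y, k) - value) ** 2
    return math.sqrt(sq_err / len(samples))

# Hypothetical measurements: two spatial clusters with different levels.
samples = [(0, 0, 10.0), (1, 0, 12.0), (0, 1, 11.0),
           (1, 1, 13.0), (5, 5, 40.0), (6, 5, 42.0)]

errors = {k: loo_rmse(samples, k) for k in (1, 2, 3)}
best_k = min(errors, key=errors.get)
```

The neighborhood size with the lowest cross-validation error gives a first indication of the spatial scale at which the data are continuous.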