974 results for Data Interpretation, Statistical
Abstract:
Summary This thesis is devoted to the analysis, modeling and visualisation of spatially referenced environmental data using machine learning algorithms. Machine learning can be considered, in a broad sense, as a subfield of artificial intelligence concerned in particular with the development of techniques and algorithms that allow a machine to learn from data. In this thesis, machine learning algorithms are adapted for application to environmental data and to spatial prediction. Why machine learning? Because most machine learning algorithms are universal, adaptive, nonlinear, robust and efficient modeling tools. They can solve classification, regression and probability density modeling problems in high-dimensional spaces composed of spatially referenced informative variables ("geo-features") in addition to geographical coordinates. Moreover, they are well suited to implementation as decision-support tools for environmental questions ranging from pattern recognition to modeling and prediction, including automatic mapping. Their efficiency is comparable to that of geostatistical models in the space of geographical coordinates, but they are indispensable for high-dimensional data that include geo-features. The most important and most popular machine learning algorithms are presented theoretically and implemented as software for the environmental sciences.
The main algorithms described are the multilayer perceptron (MLP), the best-known algorithm in artificial intelligence; the general regression neural network (GRNN); the probabilistic neural network (PNN); self-organising maps (SOM); Gaussian mixture models (GMM); radial basis function networks (RBF); and mixture density networks (MDN). This range of algorithms covers varied tasks such as classification, regression and probability density estimation. Exploratory data analysis (EDA) is the first step of any data analysis. In this thesis the concepts of exploratory spatial data analysis (ESDA) are treated both with the traditional geostatistical approach, using experimental variography, and according to the principles of machine learning. Experimental variography, which studies the relationships between pairs of points, is a basic tool for the geostatistical analysis of anisotropic spatial correlations and makes it possible to detect the presence of spatial patterns that can be described by a statistic. The machine learning approach to ESDA is presented through the application of the k-nearest neighbours method, which is very simple and has excellent interpretation and visualisation qualities. An important part of the thesis deals with topical subjects such as the automatic mapping of spatial data. The general regression neural network is proposed to solve this task efficiently.
The performance of the GRNN is demonstrated on the Spatial Interpolation Comparison (SIC) 2004 data, for which the GRNN significantly outperforms all other methods, particularly in emergency situations. The thesis consists of four chapters: theory, applications, software tools and guided examples. An important part of the work consists of a collection of software: Machine Learning Office. This software collection was developed over the past 15 years and has been used for teaching numerous courses, including international workshops in China, France, Italy, Ireland and Switzerland, as well as in fundamental and applied research projects. The case studies considered cover a broad spectrum of real low- and high-dimensional geo-environmental problems, such as air, soil and water pollution by radioactive products and heavy metals, the classification of soil types and hydrogeological units, uncertainty mapping for decision support, and the assessment of natural hazards (landslides, avalanches). Complementary tools for exploratory data analysis and visualisation were also developed, with care taken to provide a user-friendly, easy-to-use interface. Machine Learning for geospatial data: algorithms, software tools and case studies Abstract The thesis is devoted to the analysis, modeling and visualisation of spatial environmental data using machine learning algorithms. In a broad sense, machine learning can be considered a subfield of artificial intelligence. It is mainly concerned with the development of techniques and algorithms that allow computers to learn from data. In this thesis machine learning algorithms are adapted to learn from spatial environmental data and to make spatial predictions. Why machine learning? 
In a few words, most machine learning algorithms are universal, adaptive, nonlinear, robust and efficient modeling tools. They can find solutions for classification, regression, and probability density modeling problems in high-dimensional geo-feature spaces, composed of geographical space and additional relevant spatially referenced features. They are well suited to be implemented as predictive engines in decision support systems, for the purposes of environmental data mining including pattern recognition, modeling and predictions as well as automatic data mapping. Their efficiency is competitive with that of geostatistical models in low-dimensional geographical spaces, but they are indispensable in high-dimensional geo-feature spaces. The most important and popular machine learning algorithms and models of interest for geo- and environmental sciences are presented in detail: from theoretical description of the concepts to the software implementation. The main algorithms and models considered are the following: multi-layer perceptron (a workhorse of machine learning), general regression neural networks, probabilistic neural networks, self-organising (Kohonen) maps, Gaussian mixture models, radial basis function networks, and mixture density networks. This set of models covers machine learning tasks such as classification, regression, and density estimation. Exploratory data analysis (EDA) is an initial and very important part of data analysis. In this thesis the concepts of exploratory spatial data analysis (ESDA) are considered using both the traditional geostatistical approach, such as experimental variography, and machine learning. Experimental variography is a basic tool for the geostatistical analysis of anisotropic spatial correlations, which helps to understand the presence of spatial patterns, at least those described by two-point statistics. 
A machine learning approach to ESDA is presented by applying the k-nearest neighbors (k-NN) method, which is simple and has very good interpretation and visualization properties. An important part of the thesis deals with a current hot topic, namely the automatic mapping of geospatial data. The general regression neural network (GRNN) is proposed as an efficient model to solve this task. The performance of the GRNN model is demonstrated on the Spatial Interpolation Comparison (SIC) 2004 data, where the GRNN model significantly outperformed all other approaches, especially under emergency conditions. The thesis consists of four chapters and has the following structure: theory, applications, software tools, and how-to-do-it examples. An important part of the work is a collection of software tools, Machine Learning Office. The Machine Learning Office tools were developed during the last 15 years and have been used both for many teaching courses, including international workshops in China, France, Italy, Ireland and Switzerland, and for carrying out fundamental and applied research projects. The case studies considered cover a wide spectrum of real-life low- and high-dimensional geo- and environmental problems, such as air, soil and water pollution by radionuclides and heavy metals, classification of soil types and hydro-geological units, decision-oriented mapping with uncertainties, and natural hazard (landslides, avalanches) assessment and susceptibility mapping. Complementary tools useful for exploratory data analysis and visualisation were developed as well. The software is user-friendly and easy to use.
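To make the GRNN idea in the abstract above concrete, here is a minimal sketch of a general regression neural network used as a spatial interpolator. A GRNN prediction is a Gaussian-kernel-weighted average of the training values (the Nadaraya-Watson estimator). The function name, the isotropic kernel and the single width `sigma` are assumptions made for illustration; the thesis software may use anisotropic kernels and tune the width by cross-validation.

```python
import math


def grnn_predict(train_xy, train_z, query_xy, sigma):
    """Predict a value at query_xy as a Gaussian-weighted mean of train_z.

    train_xy : list of (x, y) coordinates of the measurements
    train_z  : list of measured values, same length as train_xy
    query_xy : (x, y) location where a prediction is wanted
    sigma    : kernel width (hypothetical; normally tuned, e.g. by cross-validation)
    """
    numerator = 0.0
    denominator = 0.0
    for (x, y), z in zip(train_xy, train_z):
        squared_distance = (x - query_xy[0]) ** 2 + (y - query_xy[1]) ** 2
        weight = math.exp(-squared_distance / (2.0 * sigma ** 2))
        numerator += weight * z
        denominator += weight
    # If every weight underflows to zero the prediction is undefined.
    return numerator / denominator if denominator > 0.0 else float("nan")


if __name__ == "__main__":
    points = [(0.0, 0.0), (1.0, 0.0)]
    values = [0.0, 2.0]
    # Midway between two points the weights are equal, so the result is their mean.
    print(grnn_predict(points, values, (0.5, 0.0), sigma=0.3))
```

A small `sigma` reproduces the nearest measurement, while a large `sigma` smooths toward the global mean, which is why the width is the key parameter to tune for automatic mapping.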
Abstract:
New stratigraphic data along a profile from the Helvetic Gotthard massif to the remnants of the North Penninic basin in eastern Ticino and Graubünden are presented. The stratigraphic record, together with existing geochemical and structural data, motivates a new interpretation of the fossil European distal margin. We introduce a new group of Triassic facies, the North-Penninic-Triassic (NPT), which is characterised by the Ladinian "dolomie bicolori". The NPT was located between the Briançonnais carbonate platform and the Helvetic lands. The observed horizontal transition, coupled with the stratigraphic superposition of a Helvetic Liassic on a Briançonnais Triassic in the Luzzone-Terri nappe, links, prior to Jurassic rifting, the Briançonnais paleogeographic domain to the Helvetic margin, south of the Gotthard. Our observations suggest that the Jurassic rifting separated the Briançonnais domain from the Helvetic margin by complex and protracted extension. The syn-rift stratigraphic record in the Adula nappe and surroundings suggests the presence of a diffuse rising area with only moderately subsiding basins above a thinned continental and proto-oceanic crust. Strong subsidence occurred in a second phase following protracted extension and the resulting delamination of the rising area. The stratigraphic coherency in the Adula's Mesozoic questions the idea of a lithospheric mélange in the eclogitic Adula nappe, which is more likely to be a coherent Alpine tectonic unit. The structural and stratigraphic observations in the Piz Terri-Lunschania zone suggest the activity of syn-rift detachments. During the Alpine collision these faults were reactivated (and inverted) and played a major role in allowing the Adula subduction, the "Penninic Thrust" above it and in creating the structural complexity of the Central Alps. © 2012 Elsevier B.V. All rights reserved.
Abstract:
In this paper, we present some steganalytic techniques designed to detect the existence of messages hidden using histogram shifting methods. First, we suggest some techniques to identify specific histogram shifting methods, based on visible marks on the histogram or abnormal statistical distributions. Then, we present a general technique capable of detecting all the histogram shifting techniques analyzed. This technique is based on the effect of histogram shifting methods on the "volatility" of the histogram of differences and the study of its reduction whenever new data are hidden.
Abstract:
BACKGROUND: PCR has the potential to detect and precisely quantify specific DNA sequences, but it is not yet often used as a fully quantitative method. A number of data collection and processing strategies have been described for the implementation of quantitative PCR. However, they can be experimentally cumbersome, their relative performances have not been evaluated systematically, and they often remain poorly validated statistically and/or experimentally. In this study, we evaluated the performance of known methods, and compared them with newly developed data processing strategies in terms of resolution, precision and robustness. RESULTS: Our results indicate that simple methods that do not rely on the estimation of the efficiency of the PCR amplification may provide reproducible and sensitive data, but that they do not quantify DNA with precision. Other evaluated methods based on sigmoidal or exponential curve fitting were generally of both poor resolution and precision. A statistical analysis of the parameters that influence efficiency indicated that it depends mostly on the selected amplicon and to a lesser extent on the particular biological sample analyzed. Thus, we devised various strategies based on individual or averaged efficiency values, which were used to assess the regulated expression of several genes in response to a growth factor. CONCLUSION: Overall, qPCR data analysis methods differ significantly in their performance, and this analysis identifies methods that provide DNA quantification estimates of high precision, robustness and reliability. These methods allow reliable estimations of relative expression ratio of two-fold or higher, and our analysis provides an estimation of the number of biological samples that have to be analyzed to achieve a given precision.
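As an illustration of why the amplification efficiency matters for quantification, the sketch below computes an efficiency-corrected relative expression ratio from threshold cycles (Ct). This is the widely used efficiency-corrected ratio model, shown here as an assumption; it is not claimed to be one of the specific strategies devised in the study. All function and parameter names are hypothetical, and efficiencies are expressed as the amplification factor per cycle (2.0 means perfect doubling).

```python
def relative_expression_ratio(e_target, e_ref,
                              ct_target_control, ct_target_treated,
                              ct_ref_control, ct_ref_treated):
    """Efficiency-corrected expression ratio of a target gene, treated vs control.

    e_target, e_ref : amplification factor per cycle for each amplicon
                      (individual or averaged values; 2.0 = 100% efficiency)
    ct_*            : threshold cycles measured for each gene and condition
    """
    delta_ct_target = ct_target_control - ct_target_treated
    delta_ct_ref = ct_ref_control - ct_ref_treated
    # Normalise the target change by the change seen in the reference gene.
    return (e_target ** delta_ct_target) / (e_ref ** delta_ct_ref)


if __name__ == "__main__":
    # Target Ct drops by one cycle, reference gene is unchanged.
    print(relative_expression_ratio(2.0, 2.0, 25.0, 24.0, 20.0, 20.0))
    # Same data, but assuming a lower efficiency for the target amplicon.
    print(relative_expression_ratio(1.9, 2.0, 25.0, 24.0, 20.0, 20.0))
```

Because the efficiency is raised to the power of the Ct difference, small errors in its estimate grow with larger expression changes, which is consistent with the abstract's emphasis on how efficiency values are obtained.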
Abstract:
Genotypic frequencies at codominant marker loci in population samples convey information on mating systems. A classical way to extract this information is to measure heterozygote deficiencies (FIS) and obtain the selfing rate s from FIS = s/(2 - s), assuming inbreeding equilibrium. A major drawback is that heterozygote deficiencies are often present without selfing, owing largely to technical artefacts such as null alleles or partial dominance. We show here that, in the absence of gametic disequilibrium, the multilocus structure can be used to derive estimates of s independent of FIS and free of technical biases. Their statistical power and precision are comparable to those of FIS, although they are sensitive to certain types of gametic disequilibria, a bias shared with progeny-array methods but not FIS. We analyse four real data sets spanning a range of mating systems. In two examples, we obtain s = 0 despite positive FIS, strongly suggesting that the latter are artefactual. In the remaining examples, all estimates are consistent. All the computations have been implemented in an open-access and user-friendly software package called rmes (robust multilocus estimate of selfing), available at http://ftp.cefe.cnrs.fr, which can be used on any multilocus data. Being able to extract reliable information from imperfect data, our method opens the way to making use of the ever-growing number of published population genetic studies, in addition to the more demanding progeny-array approaches, to investigate selfing rates.
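The classical relation quoted above, FIS = s/(2 - s), can be inverted to s = 2 FIS/(1 + FIS). The short sketch below implements both directions. It covers only this classical FIS-based estimate, not the multilocus estimator implemented in rmes, and the function names are hypothetical.

```python
def fis_from_selfing(s):
    """Expected FIS at inbreeding equilibrium for a selfing rate s in [0, 1]."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("selfing rate must lie in [0, 1]")
    return s / (2.0 - s)


def selfing_from_fis(fis):
    """Classical selfing-rate estimate obtained by inverting FIS = s / (2 - s)."""
    if fis <= -1.0:
        raise ValueError("FIS must be greater than -1")
    return 2.0 * fis / (1.0 + fis)


if __name__ == "__main__":
    # A selfing rate of 0.5 gives FIS = 1/3, and the inverse recovers 0.5.
    fis = fis_from_selfing(0.5)
    print(fis, selfing_from_fis(fis))
```

Any positive FIS maps to a positive selfing estimate under this formula, which is exactly why heterozygote deficiencies caused by null alleles would be misread as selfing.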
Abstract:
Background: Combining different sources of information to improve the available biological knowledge is a current challenge in bioinformatics. Kernel-based methods are among the most powerful approaches for integrating heterogeneous data types. Kernel-based data integration approaches consist of two basic steps: first, the right kernel is chosen for each data set; second, the kernels from the different data sources are combined to give a complete representation of the available data for a given statistical task. Results: We analyze the integration of data from several sources of information using kernel PCA, from the point of view of reducing dimensionality. Moreover, we improve the interpretability of kernel PCA by adding to the plot the representation of the input variables that belong to any dataset. In particular, for each input variable or linear combination of input variables, we can represent the direction of maximum growth locally, which allows us to identify those samples with higher/lower values of the variables analyzed. Conclusions: The integration of different datasets and the simultaneous representation of samples and variables together give us a better understanding of biological knowledge.
Abstract:
Peer-reviewed
Abstract:
The R package EasyStrata facilitates the evaluation and visualization of stratified genome-wide association meta-analyses (GWAMAs) results. It provides (i) statistical methods to test and account for between-strata difference as a means to tackle gene-strata interaction effects and (ii) extended graphical features tailored for stratified GWAMA results. The software provides further features also suitable for general GWAMAs including functions to annotate, exclude or highlight specific loci in plots or to extract independent subsets of loci from genome-wide datasets. It is freely available and includes a user-friendly scripting interface that simplifies data handling and allows for combining statistical and graphical functions in a flexible fashion. AVAILABILITY: EasyStrata is available for free (under the GNU General Public License v3) from our Web site www.genepi-regensburg.de/easystrata and from the CRAN R package repository cran.r-project.org/web/packages/EasyStrata/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
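As a rough illustration of what a test for between-strata difference involves, the sketch below compares the effect estimates of one variant in two strata with a z statistic. It assumes the two strata are independent samples (no covariance term) and is written in Python for illustration only; it does not reproduce the EasyStrata scripting interface, and the function name is hypothetical.

```python
import math


def strata_difference_test(beta1, se1, beta2, se2):
    """Two-sided z test for a difference between two stratum-specific effects.

    beta1, beta2 : effect estimates in stratum 1 and stratum 2
    se1, se2     : their standard errors
    Assumes the strata are independent, so the variances simply add.
    """
    z = (beta1 - beta2) / math.sqrt(se1 ** 2 + se2 ** 2)
    # Two-sided normal p-value: P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


if __name__ == "__main__":
    z, p = strata_difference_test(0.30, 0.03, 0.10, 0.04)
    print(f"z = {z:.3f}, p = {p:.3g}")
```

In a genome-wide setting this test would be applied to every variant, so the resulting p-values need a multiple-testing threshold chosen by the analyst.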
Abstract:
There is currently a considerable diversity of quantitative measures available for summarizing the results in single-case studies. Given that the interpretation of some of them is difficult due to the lack of established benchmarks, the current paper proposes an approach for obtaining further numerical evidence on the importance of the results, complementing the substantive criteria, visual analysis, and primary summary measures. This additional evidence consists of obtaining the statistical significance of the outcome when referred to the corresponding sampling distribution. This sampling distribution is formed by the values of the outcomes (expressed as data nonoverlap, R-squared, etc.) in case the intervention is ineffective. The approach proposed here is intended to offer the outcome's probability of being as extreme when there is no treatment effect, without the need for some assumptions that cannot be checked with guarantees. Following this approach, researchers would compare their outcomes to reference values rather than constructing the sampling distributions themselves. The integration of single-case studies is problematic when different metrics are used across primary studies and not all raw data are available. Via the approach for assigning p values it is possible to combine the results of similar studies regardless of the primary effect size indicator. The alternatives for combining probabilities are discussed in the context of single-case studies, pointing out two potentially useful methods: one based on a weighted average and the other on the binomial test.
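One way to read the binomial option mentioned above is sketched here: count how many studies reach a chosen significance level and ask how likely at least that many "significant" results would be if no intervention were effective. This is a generic binomial combination of p values under the assumption of independent studies; the exact procedure and the weighted-average alternative in the paper may differ. The function name and the default `alpha` are hypothetical.

```python
from math import comb


def binomial_combined_p(p_values, alpha=0.05):
    """Probability of observing at least r of k p-values <= alpha under the null.

    p_values : p-values from k independent single-case studies
    alpha    : per-study significance level (assumed value, chosen by the analyst)
    """
    k = len(p_values)
    r = sum(1 for p in p_values if p <= alpha)
    # Upper tail of a Binomial(k, alpha) distribution, from r to k successes.
    return sum(comb(k, j) * alpha ** j * (1.0 - alpha) ** (k - j)
               for j in range(r, k + 1))


if __name__ == "__main__":
    # Two of four studies are significant at the 0.05 level.
    print(binomial_combined_p([0.01, 0.20, 0.03, 0.60]))
```

The method only uses whether each p value falls below `alpha`, so it works regardless of which effect size indicator each primary study reported.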
Abstract:
The main objective of this study was to perform a statistical analysis of ecological type from optical satellite data, using Tipping's sparse Bayesian algorithm. This thesis uses the Relevance Vector Machine algorithm for ecological classification between forestland and wetland. Further, this binary classification technique was used to classify many different species of trees and to produce a hierarchical classification of entire subclasses given as a target class. We also attempted to use an airborne image of the same forest area. Combining it with image analysis, using different image processing operations, we tried to extract good features and later used them to classify forestland and wetland.
Abstract:
Dissolved organic matter (DOM) is a complex mixture of organic compounds, ubiquitous in marine and freshwater systems. Fluorescence spectroscopy, by means of Excitation-Emission Matrices (EEM), has become an indispensable tool to study DOM sources, transport and fate in aquatic ecosystems. However, the statistical treatment of large and heterogeneous EEM data sets still represents an important challenge for biogeochemists. Recently, Self-Organising Maps (SOM) have been proposed as a tool to explore patterns in large EEM data sets. SOM is a pattern recognition method which clusters input EEMs and reduces their dimensionality without relying on any assumption about the data structure. In this paper, we show how SOM, coupled with a correlation analysis of the component planes, can be used both to explore patterns among samples and to identify individual fluorescence components. We analysed a large and heterogeneous EEM data set, including samples from a river catchment collected under a range of hydrological conditions, along a 60-km downstream gradient, and under the influence of different degrees of anthropogenic impact. According to our results, chemical industry effluents appeared to have unique and distinctive spectral characteristics. On the other hand, river samples collected under flash flood conditions showed homogeneous EEM shapes. The correlation analysis of the component planes suggested the presence of four fluorescence components, consistent with DOM components previously described in the literature. A remarkable strength of this methodology was that outlier samples appeared naturally integrated in the analysis. We conclude that SOM coupled with a correlation analysis procedure is a promising tool for studying large and heterogeneous EEM data sets.
Abstract:
BACKGROUND: Worldwide data for cancer survival are scarce. We aimed to initiate worldwide surveillance of cancer survival by central analysis of population-based registry data, as a metric of the effectiveness of health systems, and to inform global policy on cancer control. METHODS: Individual tumour records were submitted by 279 population-based cancer registries in 67 countries for 25·7 million adults (age 15-99 years) and 75 000 children (age 0-14 years) diagnosed with cancer during 1995-2009 and followed up to Dec 31, 2009, or later. We looked at cancers of the stomach, colon, rectum, liver, lung, breast (women), cervix, ovary, and prostate in adults, and adult and childhood leukaemia. Standardised quality control procedures were applied; errors were corrected by the registry concerned. We estimated 5-year net survival, adjusted for background mortality in every country or region by age (single year), sex, and calendar year, and by race or ethnic origin in some countries. Estimates were age-standardised with the International Cancer Survival Standard weights. FINDINGS: 5-year survival from colon, rectal, and breast cancers has increased steadily in most developed countries. For patients diagnosed during 2005-09, survival for colon and rectal cancer reached 60% or more in 22 countries around the world; for breast cancer, 5-year survival rose to 85% or higher in 17 countries worldwide. Liver and lung cancer remain lethal in all nations: for both cancers, 5-year survival is below 20% everywhere in Europe, in the range 15-19% in North America, and as low as 7-9% in Mongolia and Thailand. Striking rises in 5-year survival from prostate cancer have occurred in many countries: survival rose by 10-20% between 1995-99 and 2005-09 in 22 countries in South America, Asia, and Europe, but survival still varies widely around the world, from less than 60% in Bulgaria and Thailand to 95% or more in Brazil, Puerto Rico, and the USA. 
For cervical cancer, national estimates of 5-year survival range from less than 50% to more than 70%; regional variations are much wider, and improvements between 1995-99 and 2005-09 have generally been slight. For women diagnosed with ovarian cancer in 2005-09, 5-year survival was 40% or higher only in Ecuador, the USA, and 17 countries in Asia and Europe. 5-year survival for stomach cancer in 2005-09 was high (54-58%) in Japan and South Korea, compared with less than 40% in other countries. By contrast, 5-year survival from adult leukaemia in Japan and South Korea (18-23%) is lower than in most other countries. 5-year survival from childhood acute lymphoblastic leukaemia is less than 60% in several countries, but as high as 90% in Canada and four European countries, which suggests major deficiencies in the management of a largely curable disease. INTERPRETATION: International comparison of survival trends reveals very wide differences that are likely to be attributable to differences in access to early diagnosis and optimum treatment. Continuous worldwide surveillance of cancer survival should become an indispensable source of information for cancer patients and researchers and a stimulus for politicians to improve health policy and health-care systems. FUNDING: Canadian Partnership Against Cancer (Toronto, Canada), Cancer Focus Northern Ireland (Belfast, UK), Cancer Institute New South Wales (Sydney, Australia), Cancer Research UK (London, UK), Centers for Disease Control and Prevention (Atlanta, GA, USA), Swiss Re (London, UK), Swiss Cancer Research foundation (Bern, Switzerland), Swiss Cancer League (Bern, Switzerland), and University of Kentucky (Lexington, KY, USA).
Abstract:
BACKGROUND: Artemether-lumefantrine is the most widely used artemisinin-based combination therapy for malaria, although treatment failures occur in some regions. We investigated the effect of dosing strategy on efficacy in a pooled analysis from trials done in a wide range of malaria-endemic settings. METHODS: We searched PubMed for clinical trials that enrolled and treated patients with artemether-lumefantrine and were published from 1960 to December, 2012. We merged individual patient data from these trials by use of standardised methods. The primary endpoint was the PCR-adjusted risk of Plasmodium falciparum recrudescence by day 28. Secondary endpoints consisted of the PCR-adjusted risk of P falciparum recurrence by day 42, PCR-unadjusted risk of P falciparum recurrence by day 42, early parasite clearance, and gametocyte carriage. Risk factors for PCR-adjusted recrudescence were identified using Cox's regression model with frailty shared across the study sites. FINDINGS: We included 61 studies done between January, 1998, and December, 2012, and included 14 327 patients in our analyses. The PCR-adjusted therapeutic efficacy was 97·6% (95% CI 97·4-97·9) at day 28 and 96·0% (95·6-96·5) at day 42. After controlling for age and parasitaemia, patients prescribed a higher dose of artemether had a lower risk of having parasitaemia on day 1 (adjusted odds ratio [OR] 0·92, 95% CI 0·86-0·99 for every 1 mg/kg increase in daily artemether dose; p=0·024), but not on day 2 (p=0·69) or day 3 (0·087). In Asia, children weighing 10-15 kg who received a total lumefantrine dose less than 60 mg/kg had the lowest PCR-adjusted efficacy (91·7%, 95% CI 86·5-96·9). In Africa, the risk of treatment failure was greatest in malnourished children aged 1-3 years (PCR-adjusted efficacy 94·3%, 95% CI 92·3-96·3). 
A higher artemether dose was associated with a lower gametocyte presence within 14 days of treatment (adjusted OR 0·92, 95% CI 0·85-0·99; p=0·037 for every 1 mg/kg increase in total artemether dose). INTERPRETATION: The recommended dose of artemether-lumefantrine provides reliable efficacy in most patients with uncomplicated malaria. However, therapeutic efficacy was lowest in young children from Asia and young underweight children from Africa; a higher dose regimen should be assessed in these groups. FUNDING: Bill & Melinda Gates Foundation.
Abstract:
This study deals with the statistical properties of a randomization test applied to an ABAB design in cases where the desirable random assignment of the points of change in phase is not possible. In order to obtain information about each possible data division we carried out a conditional Monte Carlo simulation with 100,000 samples for each systematically chosen triplet. Robustness and power are studied under several experimental conditions: different autocorrelation levels and different effect sizes, as well as different phase lengths determined by the points of change. Type I error rates were distorted by the presence of autocorrelation for the majority of data divisions. Satisfactory Type II error rates were obtained only for large treatment effects. The relationship between the lengths of the four phases appeared to be an important factor for the robustness and the power of the randomization test.
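The sketch below shows the basic mechanics of a randomization test for an ABAB design: every admissible triplet of phase change points defines one data division, the test statistic is computed for each, and the p value is the share of divisions at least as extreme as the observed one. The choice of statistic (mean of B phases minus mean of A phases), the minimum phase length and all names are assumptions made for illustration; the simulation study itself may use other settings.

```python
from itertools import combinations


def admissible_triplets(n, min_len):
    """All (b1, b2, b3) start indices of phases B1, A2, B2 with each phase >= min_len."""
    triplets = []
    for b1, b2, b3 in combinations(range(1, n), 3):
        phase_lengths = (b1, b2 - b1, b3 - b2, n - b3)
        if min(phase_lengths) >= min_len:
            triplets.append((b1, b2, b3))
    return triplets


def abab_statistic(data, triplet):
    """Mean of the two B phases minus mean of the two A phases."""
    b1, b2, b3 = triplet
    a_values = data[:b1] + data[b2:b3]
    b_values = data[b1:b2] + data[b3:]
    return sum(b_values) / len(b_values) - sum(a_values) / len(a_values)


def randomization_p_value(data, observed_triplet, min_len=3):
    """One-sided p-value: share of data divisions with a statistic >= the observed one."""
    data = list(data)
    triplets = admissible_triplets(len(data), min_len)
    observed = abab_statistic(data, observed_triplet)
    # Small tolerance so that ties with the observed value are counted.
    extreme = sum(1 for t in triplets if abab_statistic(data, t) >= observed - 1e-12)
    return extreme / len(triplets)


if __name__ == "__main__":
    series = [0] * 3 + [1] * 4 + [0] * 3 + [1] * 4
    print(randomization_p_value(series, (3, 7, 10), min_len=3))
```

With 14 observations and a minimum phase length of 3 there are only 10 admissible divisions, so the smallest attainable p value is 0.1. This illustrates why the phase lengths matter for the power of the test.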