948 resultados para STATISTICAL-METHODS


Relevância:

100.00% 100.00%

Publicador:

Resumo:

Microarray platforms have been around for many years and while there is a rise of new technologies in laboratories, microarrays are still prevalent. When it comes to the analysis of microarray data to identify differentially expressed (DE) genes, many methods have been proposed and modified for improvement. However, the most popular methods such as Significance Analysis of Microarrays (SAM), samroc, fold change, and rank product are far from perfect. When it comes down to choosing which method is most powerful, it comes down to the characteristics of the sample and distribution of the gene expressions. The most practiced method is usually SAM or samroc but when the data tends to be skewed, the power of these methods decrease. With the concept that the median becomes a better measure of central tendency than the mean when the data is skewed, the tests statistics of the SAM and fold change methods are modified in this thesis. This study shows that the median modified fold change method improves the power for many cases when identifying DE genes if the data follows a lognormal distribution.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Constant technology advances have caused data explosion in recent years. Accord- ingly modern statistical and machine learning methods must be adapted to deal with complex and heterogeneous data types. This phenomenon is particularly true for an- alyzing biological data. For example DNA sequence data can be viewed as categorical variables with each nucleotide taking four different categories. The gene expression data, depending on the quantitative technology, could be continuous numbers or counts. With the advancement of high-throughput technology, the abundance of such data becomes unprecedentedly rich. Therefore efficient statistical approaches are crucial in this big data era.

Previous statistical methods for big data often aim to find low dimensional struc- tures in the observed data. For example in a factor analysis model a latent Gaussian distributed multivariate vector is assumed. With this assumption a factor model produces a low rank estimation of the covariance of the observed variables. Another example is the latent Dirichlet allocation model for documents. The mixture pro- portions of topics, represented by a Dirichlet distributed variable, is assumed. This dissertation proposes several novel extensions to the previous statistical methods that are developed to address challenges in big data. Those novel methods are applied in multiple real world applications including construction of condition specific gene co-expression networks, estimating shared topics among newsgroups, analysis of pro- moter sequences, analysis of political-economics risk data and estimating population structure from genotype data.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

A certain type of bacterial inclusion, known as a bacterial microcompartment, was recently identified and imaged through cryo-electron tomography. A reconstructed 3D object from single-axis limited angle tilt-series cryo-electron tomography contains missing regions and this problem is known as the missing wedge problem. Due to missing regions on the reconstructed images, analyzing their 3D structures is a challenging problem. The existing methods overcome this problem by aligning and averaging several similar shaped objects. These schemes work well if the objects are symmetric and several objects with almost similar shapes and sizes are available. Since the bacterial inclusions studied here are not symmetric, are deformed, and show a wide range of shapes and sizes, the existing approaches are not appropriate. This research develops new statistical methods for analyzing geometric properties, such as volume, symmetry, aspect ratio, polyhedral structures etc., of these bacterial inclusions in presence of missing data. These methods work with deformed and non-symmetric varied shaped objects and do not necessitate multiple objects for handling the missing wedge problem. The developed methods and contributions include: (a) an improved method for manual image segmentation, (b) a new approach to 'complete' the segmented and reconstructed incomplete 3D images, (c) a polyhedral structural distance model to predict the polyhedral shapes of these microstructures, (d) a new shape descriptor for polyhedral shapes, named as polyhedron profile statistic, and (e) the Bayes classifier, linear discriminant analysis and support vector machine based classifiers for supervised incomplete polyhedral shape classification. Finally, the predicted 3D shapes for these bacterial microstructures belong to the Johnson solids family, and these shapes along with their other geometric properties are important for better understanding of their chemical and biological characteristics.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This dissertation proposes statistical methods to formulate, estimate and apply complex transportation models. Two main problems are part of the analyses conducted and presented in this dissertation. The first method solves an econometric problem and is concerned with the joint estimation of models that contain both discrete and continuous decision variables. The use of ordered models along with a regression is proposed and their effectiveness is evaluated with respect to unordered models. Procedure to calculate and optimize the log-likelihood functions of both discrete-continuous approaches are derived, and difficulties associated with the estimation of unordered models explained. Numerical approximation methods based on the Genz algortithm are implemented in order to solve the multidimensional integral associated with the unordered modeling structure. The problems deriving from the lack of smoothness of the probit model around the maximum of the log-likelihood function, which makes the optimization and the calculation of standard deviations very difficult, are carefully analyzed. A methodology to perform out-of-sample validation in the context of a joint model is proposed. Comprehensive numerical experiments have been conducted on both simulated and real data. In particular, the discrete-continuous models are estimated and applied to vehicle ownership and use models on data extracted from the 2009 National Household Travel Survey. The second part of this work offers a comprehensive statistical analysis of free-flow speed distribution; the method is applied to data collected on a sample of roads in Italy. A linear mixed model that includes speed quantiles in its predictors is estimated. Results show that there is no road effect in the analysis of free-flow speeds, which is particularly important for model transferability. A very general framework to predict random effects with few observations and incomplete access to model covariates is formulated and applied to predict the distribution of free-flow speed quantiles. The speed distribution of most road sections is successfully predicted; jack-knife estimates are calculated and used to explain why some sections are poorly predicted. Eventually, this work contributes to the literature in transportation modeling by proposing econometric model formulations for discrete-continuous variables, more efficient methods for the calculation of multivariate normal probabilities, and random effects models for free-flow speed estimation that takes into account the survey design. All methods are rigorously validated on both real and simulated data.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Purpose: To develop an effective method for evaluating the quality of Cortex berberidis from different geographical origins. Methods: A simple, precise and accurate high performance liquid chromatography (HPLC) method was first developed for simultaneous quantification of four active alkaloids (magnoflorine, jatrorrhizine, palmatine, and berberine) in Cortex berberidis obtained from Qinghai, Tibet and Sichuan Provinces of China. Method validation was performed in terms of precision, repeatability, stability, accuracy, and linearity. Besides, partial least squares discriminant analysis (PLS-DA) and one-way analysis of variance (ANOVA) were applied to study the quality variations of Cortex berberidis from various geographical origins. Results: The proposed HPLC method showed good linearity, precision, repeatability, and accuracy. The four alkaloids were detected in all samples of Cortex berberidis. Among them, magnoflorine (36.46 - 87.30 mg/g) consistently showed the highest amounts in all the samples, followed by berberine (16.00 - 37.50 mg/g). The content varied in the range of 0.66 - 4.57 mg/g for palmatine and 1.53 - 16.26 mg/g for jatrorrhizine, respectively. The total content of the four alkaloids ranged from 67.62 to 114.79 mg/g. Moreover, the results obtained by the PLS-DA and ANOVA showed that magnoflorine level and the total content of these four alkaloids in Qinghai and Tibet samples were significantly higher (p < 0.01) than those in Sichuan samples. Conclusion: Quantification of multi-ingredients by HPLC combined with statistical methods provide an effective approach for achieving origin discrimination and quality evaluation of Cortex berberidis. The quality of Cortex berberidis closely correlates to the geographical origin of the samples, with Cortex berberidis samples from Qinghai and Tibet exhibiting superior qualities to those from Sichuan.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

In the scope of the European project Hydroptimet, INTERREG IIIB-MEDOCC programme, limited area model (LAM) intercomparison of intense events that produced many damages to people and territory is performed. As the comparison is limited to single case studies, the work is not meant to provide a measure of the different models' skill, but to identify the key model factors useful to give a good forecast on such a kind of meteorological phenomena. This work focuses on the Spanish flash-flood event, also known as "Montserrat-2000" event. The study is performed using forecast data from seven operational LAMs, placed at partners' disposal via the Hydroptimet ftp site, and observed data from Catalonia rain gauge network. To improve the event analysis, satellite rainfall estimates have been also considered. For statistical evaluation of quantitative precipitation forecasts (QPFs), several non-parametric skill scores based on contingency tables have been used. Furthermore, for each model run it has been possible to identify Catalonia regions affected by misses and false alarms using contingency table elements. Moreover, the standard "eyeball" analysis of forecast and observed precipitation fields has been supported by the use of a state-of-the-art diagnostic method, the contiguous rain area (CRA) analysis. This method allows to quantify the spatial shift forecast error and to identify the error sources that affected each model forecasts. High-resolution modelling and domain size seem to have a key role for providing a skillful forecast. Further work is needed to support this statement, including verification using a wider observational data set.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

The supervised pattern recognition methods K-Nearest Neighbors (KNN), stepwise discriminant analysis (SDA), and soft independent modelling of class analogy (SIMCA) were employed in this work with the aim to investigate the relationship between the molecular structure of 27 cannabinoid compounds and their analgesic activity. Previous analyses using two unsupervised pattern recognition methods (PCA-principal component analysis and HCA-hierarchical cluster analysis) were performed and five descriptors were selected as the most relevants for the analgesic activity of the compounds studied: R (3) (charge density on substituent at position C(3)), Q (1) (charge on atom C(1)), A (surface area), log P (logarithm of the partition coefficient) and MR (molecular refractivity). The supervised pattern recognition methods (SDA, KNN, and SIMCA) were employed in order to construct a reliable model that can be able to predict the analgesic activity of new cannabinoid compounds and to validate our previous study. The results obtained using the SDA, KNN, and SIMCA methods agree perfectly with our previous model. Comparing the SDA, KNN, and SIMCA results with the PCA and HCA ones we could notice that all multivariate statistical methods classified the cannabinoid compounds studied in three groups exactly in the same way: active, moderately active, and inactive.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Beyond the classical statistical approaches (determination of basic statistics, regression analysis, ANOVA, etc.) a new set of applications of different statistical techniques has increasingly gained relevance in the analysis, processing and interpretation of data concerning the characteristics of forest soils. This is possible to be seen in some of the recent publications in the context of Multivariate Statistics. These new methods require additional care that is not always included or refered in some approaches. In the particular case of geostatistical data applications it is necessary, besides to geo-reference all the data acquisition, to collect the samples in regular grids and in sufficient quantity so that the variograms can reflect the spatial distribution of soil properties in a representative manner. In the case of the great majority of Multivariate Statistics techniques (Principal Component Analysis, Correspondence Analysis, Cluster Analysis, etc.) despite the fact they do not require in most cases the assumption of normal distribution, they however need a proper and rigorous strategy for its utilization. In this work, some reflections about these methodologies and, in particular, about the main constraints that often occur during the information collecting process and about the various linking possibilities of these different techniques will be presented. At the end, illustrations of some particular cases of the applications of these statistical methods will also be presented.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Submitted in partial fulfillment for the Requirements for the Degree of PhD in Mathematics, in the Speciality of Statistics in the Faculdade de Ciências e Tecnologia

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Univariate statistical control charts, such as the Shewhart chart, do not satisfy the requirements for process monitoring on a high volume automated fuel cell manufacturing line. This is because of the number of variables that require monitoring. The risk of elevated false alarms, due to the nature of the process being high volume, can present problems if univariate methods are used. Multivariate statistical methods are discussed as an alternative for process monitoring and control. The research presented is conducted on a manufacturing line which evaluates the performance of a fuel cell. It has three stages of production assembly that contribute to the final end product performance. The product performance is assessed by power and energy measurements, taken at various time points throughout the discharge testing of the fuel cell. The literature review performed on these multivariate techniques are evaluated using individual and batch observations. Modern techniques using multivariate control charts on Hotellings T2 are compared to other multivariate methods, such as Principal Components Analysis (PCA). The latter, PCA, was identified as the most suitable method. Control charts such as, scores, T2 and DModX charts, are constructed from the PCA model. Diagnostic procedures, using Contribution plots, for out of control points that are detected using these control charts, are also discussed. These plots enable the investigator to perform root cause analysis. Multivariate batch techniques are compared to individual observations typically seen on continuous processes. Recommendations, for the introduction of multivariate techniques that would be appropriate for most high volume processes, are also covered.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Nowadays, the joint exploitation of images acquired daily by remote sensing instruments and of images available from archives allows a detailed monitoring of the transitions occurring at the surface of the Earth. These modifications of the land cover generate spectral discrepancies that can be detected via the analysis of remote sensing images. Independently from the origin of the images and of type of surface change, a correct processing of such data implies the adoption of flexible, robust and possibly nonlinear method, to correctly account for the complex statistical relationships characterizing the pixels of the images. This Thesis deals with the development and the application of advanced statistical methods for multi-temporal optical remote sensing image processing tasks. Three different families of machine learning models have been explored and fundamental solutions for change detection problems are provided. In the first part, change detection with user supervision has been considered. In a first application, a nonlinear classifier has been applied with the intent of precisely delineating flooded regions from a pair of images. In a second case study, the spatial context of each pixel has been injected into another nonlinear classifier to obtain a precise mapping of new urban structures. In both cases, the user provides the classifier with examples of what he believes has changed or not. In the second part, a completely automatic and unsupervised method for precise binary detection of changes has been proposed. The technique allows a very accurate mapping without any user intervention, resulting particularly useful when readiness and reaction times of the system are a crucial constraint. In the third, the problem of statistical distributions shifting between acquisitions is studied. Two approaches to transform the couple of bi-temporal images and reduce their differences unrelated to changes in land cover are studied. The methods align the distributions of the images, so that the pixel-wise comparison could be carried out with higher accuracy. Furthermore, the second method can deal with images from different sensors, no matter the dimensionality of the data nor the spectral information content. This opens the doors to possible solutions for a crucial problem in the field: detecting changes when the images have been acquired by two different sensors.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Trees are a great bank of data, named sometimes for this reason as the "silentwitnesses" of the past. Due to annual formation of rings, which is normally influenced directly by of climate parameters (generally changes in temperature and moisture or precipitation) and other environmental factors; these changes, occurred in the past, are"written" in the tree "archives" and can be "decoded" in order to interpret what hadhappened before, mainly applied for the past climate reconstruction.Using dendrochronological methods for obtaining samples of Pinus nigra fromthe Catalonian PrePirineous region, the cores of 15 trees with total time spine of about 100 - 250 years were analyzed for the tree ring width (TRW) patterns and had quite high correlation between them (0.71 ¿ 0.84), corresponding to a common behaviour for the environmental changes in their annual growth.After different trials with raw TRW data for standardization in order to take outthe negative exponential growth curve dependency, the best method of doubledetrending (power transformation and smoothing line of 32 years) were selected for obtaining the indexes for further analysis.Analyzing the cross-correlations between obtained tree ring width indexes andclimate data, significant correlations (p<0.05) were observed in some lags, as forexample, annual precipitation in lag -1 (previous year) had negative correlation with TRW growth in the Pallars region. Significant correlation coefficients are between 0.27- 0.51 (with positive or negative signs) for many cases; as for recent (but very short period) climate data of Seu d¿Urgell meteorological station, some significant correlation coefficients were observed, of the order of 0.9.These results confirm the hypothesis of using dendrochronological data as aclimate signal for further analysis, such as reconstruction of climate in the past orprediction in the future for the same locality.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Background: Molecular tools may help to uncover closely related and still diverging species from a wide variety of taxa and provide insight into the mechanisms, pace and geography of marine speciation. There is a certain controversy on the phylogeography and speciation modes of species-groups with an Eastern Atlantic-Western Indian Ocean distribution, with previous studies suggesting that older events (Miocene) and/or more recent (Pleistocene) oceanographic processes could have influenced the phylogeny of marine taxa. The spiny lobster genus Palinurus allows for testing among speciation hypotheses, since it has a particular distribution with two groups of three species each in the Northeastern Atlantic (P. elephas, P. mauritanicus and P. charlestoni) and Southeastern Atlantic and Southwestern Indian Oceans (P. gilchristi, P. delagoae and P. barbarae). In the present study, we obtain a more complete understanding of the phylogenetic relationships among these species through a combined dataset with both nuclear and mitochondrial markers, by testing alternative hypotheses on both the mutation rate and tree topology under the recently developed approximate Bayesian computation (ABC) methods. Results Our analyses support a North-to-South speciation pattern in Palinurus with all the South-African species forming a monophyletic clade nested within the Northern Hemisphere species. Coalescent-based ABC methods allowed us to reject the previously proposed hypothesis of a Middle Miocene speciation event related with the closure of the Tethyan Seaway. Instead, divergence times obtained for Palinurus species using the combined mtDNA-microsatellite dataset and standard mutation rates for mtDNA agree with known glaciation-related processes occurring during the last 2 my. Conclusion The Palinurus speciation pattern is a typical example of a series of rapid speciation events occurring within a group, with very short branches separating different species. Our results support the hypothesis that recent climate change-related oceanographic processes have influenced the phylogeny of marine taxa, with most Palinurus species originating during the last two million years. The present study highlights the value of new coalescent-based statistical methods such as ABC for testing different speciation hypotheses using molecular data.