899 resultados para two-Gaussian mixture model
Resumo:
An important and common problem in microarray experiments is the detection of genes that are differentially expressed in a given number of classes. As this problem concerns the selection of significant genes from a large pool of candidate genes, it needs to be carried out within the framework of multiple hypothesis testing. In this paper, we focus on the use of mixture models to handle the multiplicity issue. With this approach, a measure of the local FDR (false discovery rate) is provided for each gene. An attractive feature of the mixture model approach is that it provides a framework for the estimation of the prior probability that a gene is not differentially expressed, and this probability can subsequently be used in forming a decision rule. The rule can also be formed to take the false negative rate into account. We apply this approach to a well-known publicly available data set on breast cancer, and discuss our findings with reference to other approaches.
Resumo:
An important and common problem in microarray experiments is the detection of genes that are differentially expressed in a given number of classes. As this problem concerns the selection of significant genes from a large pool of candidate genes, it needs to be carried out within the framework of multiple hypothesis testing. In this paper, we focus on the use of mixture models to handle the multiplicity issue. With this approach, a measure of the local false discovery rate is provided for each gene, and it can be implemented so that the implied global false discovery rate is bounded as with the Benjamini-Hochberg methodology based on tail areas. The latter procedure is too conservative, unless it is modified according to the prior probability that a gene is not differentially expressed. An attractive feature of the mixture model approach is that it provides a framework for the estimation of this probability and its subsequent use in forming a decision rule. The rule can also be formed to take the false negative rate into account.
Resumo:
QTL detection experiments in livestock species commonly use the half-sib design. Each male is mated to a number of females, each female producing a limited number of progeny. Analysis consists of attempting to detect associations between phenotype and genotype measured on the progeny. When family sizes are limiting experimenters may wish to incorporate as much information as possible into a single analysis. However, combining information across sires is problematic because of incomplete linkage disequilibrium between the markers and the QTL in the population. This study describes formulae for obtaining MLEs via the expectation maximization (EM) algorithm for use in a multiple-trait, multiple-family analysis. A model specifying a QTL with only two alleles, and a common within sire error variance is assumed. Compared to single-family analyses, power can be improved up to fourfold with multi-family analyses. The accuracy and precision of QTL location estimates are also substantially improved. With small family sizes, the multi-family, multi-trait analyses reduce substantially, but not totally remove, biases in QTL effect estimates. In situations where multiple QTL alleles are segregating the multi-family analysis will average out the effects of the different QTL alleles.
Resumo:
Quantitatively predicting mass transport rates for chemical mixtures in porous materials is important in applications of materials such as adsorbents, membranes, and catalysts. Because directly assessing mixture transport experimentally is challenging, theoretical models that can predict mixture diffusion coefficients using Only single-component information would have many uses. One such model was proposed by Skoulidas, Sholl, and Krishna (Langmuir, 2003, 19, 7977), and applications of this model to a variety of chemical mixtures in nanoporous materials have yielded promising results. In this paper, the accuracy of this model for predicting mixture diffusion coefficients in materials that exhibit a heterogeneous distribution of local binding energies is examined. To examine this issue, single-component and binary mixture diffusion coefficients are computed using kinetic Monte Carlo for a two-dimensional lattice model over a wide range of lattice occupancies and compositions. The approach suggested by Skoulidas, Sholl, and Krishna is found to be accurate in situations where the spatial distribution of binding site energies is relatively homogeneous, but is considerably less accurate for strongly heterogeneous energy distributions.
Resumo:
An important and common problem in microarray experiments is the detection of genes that are differentially expressed in a given number of classes. As this problem concerns the selection of significant genes from a large pool of candidate genes, it needs to be carried out within the framework of multiple hypothesis testing. In this paper, we focus on the use of mixture models to handle the multiplicity issue. With this approach, a measure of the local FDR (false discovery rate) is provided for each gene. An attractive feature of the mixture model approach is that it provides a framework for the estimation of the prior probability that a gene is not differentially expressed, and this probability can subsequently be used in forming a decision rule. The rule can also be formed to take the false negative rate into account. We apply this approach to a well-known publicly available data set on breast cancer, and discuss our findings with reference to other approaches.
Resumo:
This paper considers a model-based approach to the clustering of tissue samples of a very large number of genes from microarray experiments. It is a nonstandard problem in parametric cluster analysis because the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. Frequently in practice, there are also clinical data available on those cases on which the tissue samples have been obtained. Here we investigate how to use the clinical data in conjunction with the microarray gene expression data to cluster the tissue samples. We propose two mixture model-based approaches in which the number of components in the mixture model corresponds to the number of clusters to be imposed on the tissue samples. One approach specifies the components of the mixture model to be the conditional distributions of the microarray data given the clinical data with the mixing proportions also conditioned on the latter data. Another takes the components of the mixture model to represent the joint distributions of the clinical and microarray data. The approaches are demonstrated on some breast cancer data, as studied recently in van't Veer et al. (2002).
Resumo:
Time-course experiments with microarrays are often used to study dynamic biological systems and genetic regulatory networks (GRNs) that model how genes influence each other in cell-level development of organisms. The inference for GRNs provides important insights into the fundamental biological processes such as growth and is useful in disease diagnosis and genomic drug design. Due to the experimental design, multilevel data hierarchies are often present in time-course gene expression data. Most existing methods, however, ignore the dependency of the expression measurements over time and the correlation among gene expression profiles. Such independence assumptions violate regulatory interactions and can result in overlooking certain important subject effects and lead to spurious inference for regulatory networks or mechanisms. In this paper, a multilevel mixed-effects model is adopted to incorporate data hierarchies in the analysis of time-course data, where temporal and subject effects are both assumed to be random. The method starts with the clustering of genes by fitting the mixture model within the multilevel random-effects model framework using the expectation-maximization (EM) algorithm. The network of regulatory interactions is then determined by searching for regulatory control elements (activators and inhibitors) shared by the clusters of co-expressed genes, based on a time-lagged correlation coefficients measurement. The method is applied to two real time-course datasets from the budding yeast (Saccharomyces cerevisiae) genome. It is shown that the proposed method provides clusters of cell-cycle regulated genes that are supported by existing gene function annotations, and hence enables inference on regulatory interactions for the genetic network.
Resumo:
Among the Solar System’s bodies, Moon, Mercury and Mars are at present, or have been in the recent years, object of space missions aimed, among other topics, also at improving our knowledge about surface composition. Between the techniques to detect planet’s mineralogical composition, both from remote and close range platforms, visible and near-infrared reflectance (VNIR) spectroscopy is a powerful tool, because crystal field absorption bands are related to particular transitional metals in well-defined crystal structures, e.g., Fe2+ in M1 and M2 sites of olivine or pyroxene (Burns, 1993). Thanks to the improvements in the spectrometers onboard the recent missions, a more detailed interpretation of the planetary surfaces can now be delineated. However, quantitative interpretation of planetary surface mineralogy could not always be a simple task. In fact, several factors such as the mineral chemistry, the presence of different minerals that absorb in a narrow spectral range, the regolith with a variable particle size range, the space weathering, the atmosphere composition etc., act in unpredictable ways on the reflectance spectra on a planetary surface (Serventi et al., 2014). One method for the interpretation of reflectance spectra of unknown materials involves the study of a number of spectra acquired in the laboratory under different conditions, such as different mineral abundances or different particle sizes, in order to derive empirical trends. This is the methodology that has been followed in this PhD thesis: the single factors previously listed have been analyzed, creating, in the laboratory, a set of terrestrial analogues with well-defined composition and size. The aim of this work is to provide new tools and criteria to improve the knowledge of the composition of planetary surfaces. In particular, mixtures composed with different content and chemistry of plagioclase and mafic minerals have been spectroscopically analyzed at different particle sizes and with different mineral relative percentages. The reflectance spectra of each mixture have been analyzed both qualitatively (using the software ORIGIN®) and quantitatively applying the Modified Gaussian Model (MGM, Sunshine et al., 1990) algorithm. In particular, the spectral parameter variations of each absorption band have been evaluated versus the volumetric FeO% content in the PL phase and versus the PL modal abundance. This delineated calibration curves of composition vs. spectral parameters and allow implementation of spectral libraries. Furthermore, the trends derived from terrestrial analogues here analyzed and from analogues in the literature have been applied for the interpretation of hyperspectral images of both plagioclase-rich (Moon) and plagioclase-poor (Mars) bodies.
Resumo:
Principal component analysis (PCA) is one of the most popular techniques for processing, compressing and visualising data, although its effectiveness is limited by its global linearity. While nonlinear variants of PCA have been proposed, an alternative paradigm is to capture data complexity by a combination of local linear PCA projections. However, conventional PCA does not correspond to a probability density, and so there is no unique way to combine PCA models. Previous attempts to formulate mixture models for PCA have therefore to some extent been ad hoc. In this paper, PCA is formulated within a maximum-likelihood framework, based on a specific form of Gaussian latent variable model. This leads to a well-defined mixture model for probabilistic principal component analysers, whose parameters can be determined using an EM algorithm. We discuss the advantages of this model in the context of clustering, density modelling and local dimensionality reduction, and we demonstrate its application to image compression and handwritten digit recognition.
Resumo:
It is well known that one of the obstacles to effective forecasting of exchange rates is heteroscedasticity (non-stationary conditional variance). The autoregressive conditional heteroscedastic (ARCH) model and its variants have been used to estimate a time dependent variance for many financial time series. However, such models are essentially linear in form and we can ask whether a non-linear model for variance can improve results just as non-linear models (such as neural networks) for the mean have done. In this paper we consider two neural network models for variance estimation. Mixture Density Networks (Bishop 1994, Nix and Weigend 1994) combine a Multi-Layer Perceptron (MLP) and a mixture model to estimate the conditional data density. They are trained using a maximum likelihood approach. However, it is known that maximum likelihood estimates are biased and lead to a systematic under-estimate of variance. More recently, a Bayesian approach to parameter estimation has been developed (Bishop and Qazaz 1996) that shows promise in removing the maximum likelihood bias. However, up to now, this model has not been used for time series prediction. Here we compare these algorithms with two other models to provide benchmark results: a linear model (from the ARIMA family), and a conventional neural network trained with a sum-of-squares error function (which estimates the conditional mean of the time series with a constant variance noise model). This comparison is carried out on daily exchange rate data for five currencies.
Resumo:
Investigations into the modelling techniques that depict the transport of discrete phases (gas bubbles or solid particles) and model biochemical reactions in a bubble column reactor are discussed here. The mixture model was used to calculate gas-liquid, solid-liquid and gasliquid-solid interactions. Multiphase flow is a difficult phenomenon to capture, particularly in bubble columns where the major driving force is caused by the injection of gas bubbles. The gas bubbles cause a large density difference to occur that results in transient multi-dimensional fluid motion. Standard design procedures do not account for the transient motion, due to the simplifying assumptions of steady plug flow. Computational fluid dynamics (CFD) can assist in expanding the understanding of complex flows in bubble columns by characterising the flow phenomena for many geometrical configurations. Therefore, CFD has a role in the education of chemical and biochemical engineers, providing the examples of flow phenomena that many engineers may not experience, even through experimentation. The performance of the mixture model was investigated for three domains (plane, rectangular and cylindrical) and three flow models (laminar, k-e turbulence and the Reynolds stresses). mThis investigation raised many questions about how gas-liquid interactions are captured numerically. To answer some of these questions the analogy between thermal convection in a cavity and gas-liquid flow in bubble columns was invoked. This involved modelling the buoyant motion of air in a narrow cavity for a number of turbulence schemes. The difference in density was caused by a temperature gradient that acted across the width of the cavity. Multiple vortices were obtained when the Reynolds stresses were utilised with the addition of a basic flow profile after each time step. To implement the three-phase models an alternative mixture model was developed and compared against a commercially available mixture model for three turbulence schemes. The scheme where just the Reynolds stresses model was employed, predicted the transient motion of the fluids quite well for both mixture models. Solid-liquid and then alternative formulations of gas-liquid-solid model were compared against one another. The alternative form of the mixture model was found to perform particularly well for both gas and solid phase transport when calculating two and three-phase flow. The improvement in the solutions obtained was a result of the inclusion of the Reynolds stresses model and differences in the mixture models employed. The differences between the alternative mixture models were found in the volume fraction equation (flux and deviatoric stress tensor terms) and the viscosity formulation for the mixture phase.
Resumo:
Direct quantile regression involves estimating a given quantile of a response variable as a function of input variables. We present a new framework for direct quantile regression where a Gaussian process model is learned, minimising the expected tilted loss function. The integration required in learning is not analytically tractable so to speed up the learning we employ the Expectation Propagation algorithm. We describe how this work relates to other quantile regression methods and apply the method on both synthetic and real data sets. The method is shown to be competitive with state of the art methods whilst allowing for the leverage of the full Gaussian process probabilistic framework.
Resumo:
Single-molecule manipulation experiments of molecular motors provide essential information about the rate and conformational changes of the steps of the reaction located along the manipulation coordinate. This information is not always sufficient to define a particular kinetic cycle. Recent single-molecule experiments with optical tweezers showed that the DNA unwinding activity of a Phi29 DNA polymerase mutant presents a complex pause behavior, which includes short and long pauses. Here we show that different kinetic models, considering different connections between the active and the pause states, can explain the experimental pause behavior. Both the two independent pause model and the two connected pause model are able to describe the pause behavior of a mutated Phi29 DNA polymerase observed in an optical tweezers single-molecule experiment. For the two independent pause model all parameters are fixed by the observed data, while for the more general two connected pause model there is a range of values of the parameters compatible with the observed data (which can be expressed in terms of two of the rates and their force dependencies). This general model includes models with indirect entry and exit to the long-pause state, and also models with cycling in both directions. Additionally, assuming that detailed balance is verified, which forbids cycling, this reduces the ranges of the values of the parameters (which can then be expressed in terms of one rate and its force dependency). The resulting model interpolates between the independent pause model and the indirect entry and exit to the long-pause state model
Resumo:
Bayesian nonparametric models, such as the Gaussian process and the Dirichlet process, have been extensively applied for target kinematics modeling in various applications including environmental monitoring, traffic planning, endangered species tracking, dynamic scene analysis, autonomous robot navigation, and human motion modeling. As shown by these successful applications, Bayesian nonparametric models are able to adjust their complexities adaptively from data as necessary, and are resistant to overfitting or underfitting. However, most existing works assume that the sensor measurements used to learn the Bayesian nonparametric target kinematics models are obtained a priori or that the target kinematics can be measured by the sensor at any given time throughout the task. Little work has been done for controlling the sensor with bounded field of view to obtain measurements of mobile targets that are most informative for reducing the uncertainty of the Bayesian nonparametric models. To present the systematic sensor planning approach to leaning Bayesian nonparametric models, the Gaussian process target kinematics model is introduced at first, which is capable of describing time-invariant spatial phenomena, such as ocean currents, temperature distributions and wind velocity fields. The Dirichlet process-Gaussian process target kinematics model is subsequently discussed for modeling mixture of mobile targets, such as pedestrian motion patterns.
Novel information theoretic functions are developed for these introduced Bayesian nonparametric target kinematics models to represent the expected utility of measurements as a function of sensor control inputs and random environmental variables. A Gaussian process expected Kullback Leibler divergence is developed as the expectation of the KL divergence between the current (prior) and posterior Gaussian process target kinematics models with respect to the future measurements. Then, this approach is extended to develop a new information value function that can be used to estimate target kinematics described by a Dirichlet process-Gaussian process mixture model. A theorem is proposed that shows the novel information theoretic functions are bounded. Based on this theorem, efficient estimators of the new information theoretic functions are designed, which are proved to be unbiased with the variance of the resultant approximation error decreasing linearly as the number of samples increases. Computational complexities for optimizing the novel information theoretic functions under sensor dynamics constraints are studied, and are proved to be NP-hard. A cumulative lower bound is then proposed to reduce the computational complexity to polynomial time.
Three sensor planning algorithms are developed according to the assumptions on the target kinematics and the sensor dynamics. For problems where the control space of the sensor is discrete, a greedy algorithm is proposed. The efficiency of the greedy algorithm is demonstrated by a numerical experiment with data of ocean currents obtained by moored buoys. A sweep line algorithm is developed for applications where the sensor control space is continuous and unconstrained. Synthetic simulations as well as physical experiments with ground robots and a surveillance camera are conducted to evaluate the performance of the sweep line algorithm. Moreover, a lexicographic algorithm is designed based on the cumulative lower bound of the novel information theoretic functions, for the scenario where the sensor dynamics are constrained. Numerical experiments with real data collected from indoor pedestrians by a commercial pan-tilt camera are performed to examine the lexicographic algorithm. Results from both the numerical simulations and the physical experiments show that the three sensor planning algorithms proposed in this dissertation based on the novel information theoretic functions are superior at learning the target kinematics with
little or no prior knowledge
Resumo:
Due to the variability and stochastic nature of wind power system, accurate wind power forecasting has an important role in developing reliable and economic power system operation and control strategies. As wind variability is stochastic, Gaussian Process regression has recently been introduced to capture the randomness of wind energy. However, the disadvantages of Gaussian Process regression include its computation complexity and incapability to adapt to time varying time-series systems. A variant Gaussian Process for time series forecasting is introduced in this study to address these issues. This new method is shown to be capable of reducing computational complexity and increasing prediction accuracy. It is further proved that the forecasting result converges as the number of available data approaches innite. Further, a teaching learning based optimization (TLBO) method is used to train the model and to accelerate
the learning rate. The proposed modelling and optimization method is applied to forecast both the wind power generation of Ireland and that from a single wind farm to show the eectiveness of the proposed method.