7 resultados para Hidden Genes
em Massachusetts Institute of Technology
Resumo:
This report studies when and why two Hidden Markov Models (HMMs) may represent the same stochastic process. HMMs are characterized in terms of equivalence classes whose elements represent identical stochastic processes. This characterization yields polynomial time algorithms to detect equivalent HMMs. We also find fast algorithms to reduce HMMs to essentially unique and minimal canonical representations. The reduction to a canonical form leads to the definition of 'Generalized Markov Models' which are essentially HMMs without the positivity constraint on their parameters. We discuss how this generalization can yield more parsimonious representations of stochastic processes at the cost of the probabilistic interpretation of the model parameters.
Resumo:
Graphical techniques for modeling the dependencies of randomvariables have been explored in a variety of different areas includingstatistics, statistical physics, artificial intelligence, speech recognition, image processing, and genetics.Formalisms for manipulating these models have been developedrelatively independently in these research communities. In this paper weexplore hidden Markov models (HMMs) and related structures within the general framework of probabilistic independencenetworks (PINs). The paper contains a self-contained review of the basic principles of PINs.It is shown that the well-known forward-backward (F-B) and Viterbialgorithms for HMMs are special cases of more general inference algorithms forarbitrary PINs. Furthermore, the existence of inference and estimationalgorithms for more general graphical models provides a set of analysistools for HMM practitioners who wish to explore a richer class of HMMstructures.Examples of relatively complex models to handle sensorfusion and coarticulationin speech recognitionare introduced and treated within the graphical model framework toillustrate the advantages of the general approach.
Resumo:
We present a framework for learning in hidden Markov models with distributed state representations. Within this framework, we derive a learning algorithm based on the Expectation--Maximization (EM) procedure for maximum likelihood estimation. Analogous to the standard Baum-Welch update rules, the M-step of our algorithm is exact and can be solved analytically. However, due to the combinatorial nature of the hidden state representation, the exact E-step is intractable. A simple and tractable mean field approximation is derived. Empirical results on a set of problems suggest that both the mean field approximation and Gibbs sampling are viable alternatives to the computationally expensive exact algorithm.
Resumo:
This paper introduces a probability model, the mixture of trees that can account for sparse, dynamically changing dependence relationships. We present a family of efficient algorithms that use EMand the Minimum Spanning Tree algorithm to find the ML and MAP mixtureof trees for a variety of priors, including the Dirichlet and the MDL priors.
Resumo:
This paper introduces a probability model, the mixture of trees that can account for sparse, dynamically changing dependence relationships. We present a family of efficient algorithms that use EM and the Minimum Spanning Tree algorithm to find the ML and MAP mixture of trees for a variety of priors, including the Dirichlet and the MDL priors. We also show that the single tree classifier acts like an implicit feature selector, thus making the classification performance insensitive to irrelevant attributes. Experimental results demonstrate the excellent performance of the new model both in density estimation and in classification.
Resumo:
Array technologies have made it possible to record simultaneously the expression pattern of thousands of genes. A fundamental problem in the analysis of gene expression data is the identification of highly relevant genes that either discriminate between phenotypic labels or are important with respect to the cellular process studied in the experiment: for example cell cycle or heat shock in yeast experiments, chemical or genetic perturbations of mammalian cell lines, and genes involved in class discovery for human tumors. In this paper we focus on the task of unsupervised gene selection. The problem of selecting a small subset of genes is particularly challenging as the datasets involved are typically characterized by a very small sample size ?? the order of few tens of tissue samples ??d by a very large feature space as the number of genes tend to be in the high thousands. We propose a model independent approach which scores candidate gene selections using spectral properties of the candidate affinity matrix. The algorithm is very straightforward to implement yet contains a number of remarkable properties which guarantee consistent sparse selections. To illustrate the value of our approach we applied our algorithm on five different datasets. The first consists of time course data from four well studied Hematopoietic cell lines (HL-60, Jurkat, NB4, and U937). The other four datasets include three well studied treatment outcomes (large cell lymphoma, childhood medulloblastomas, breast tumors) and one unpublished dataset (lymph status). We compared our approach both with other unsupervised methods (SOM,PCA,GS) and with supervised methods (SNR,RMB,RFE). The results clearly show that our approach considerably outperforms all the other unsupervised approaches in our study, is competitive with supervised methods and in some case even outperforms supervised approaches.
Resumo:
In the field of biologics production, productivity and stability of the transfected gene of interest are two very important attributes that dictate if a production process is viable. To further understand and improve these two traits, we would need to further our understanding of the factors affecting them. These would include integration site of the gene, gene copy number, cell phenotypic variation and cell environment. As these factors play different parts in the development process, they lead to variable productivity and stability of the transfected gene between clones, the well-known phenomenon of “clonal variation”. A study of this phenomenon and how the various factors contribute to it will thus shed light on strategies to improve productivity and stability in the production cell line. Of the four factors, the site of gene integration appears to be one of the most important. Hence, it is proposed that work is done on studying how different integration sites affect the productivity and stability of transfected genes in the development process. For the study to be more industrially relevant, it is proposed that the Chinese Hamster Ovary dhfr-deficient cell line, CHO-DG44, is used as the model system.