4 resultados para Large Data

em National Center for Biotechnology Information - NCBI


Relevância:

70.00% 70.00%

Publicador:

Resumo:

We introduce a method of functionally classifying genes by using gene expression data from DNA microarray hybridization experiments. The method is based on the theory of support vector machines (SVMs). SVMs are considered a supervised computer learning method because they exploit prior knowledge of gene function to identify unknown genes of similar function from expression data. SVMs avoid several problems associated with unsupervised clustering methods, such as hierarchical clustering and self-organizing maps. SVMs have many mathematical features that make them attractive for gene expression analysis, including their flexibility in choosing a similarity function, sparseness of solution when dealing with large data sets, the ability to handle large feature spaces, and the ability to identify outliers. We test several SVMs that use different similarity metrics, as well as some other supervised learning methods, and find that the SVMs best identify sets of genes with a common function using expression data. Finally, we use SVMs to predict functional roles for uncharacterized yeast ORFs based on their expression data.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Simple phylogenetic tests were applied to a large data set of nucleotide sequences from two nuclear genes and a region of the mitochondrial genome of Trypanosoma cruzi, the agent of Chagas' disease. Incongruent gene genealogies manifest genetic exchange among distantly related lineages of T. cruzi. Two widely distributed isoenzyme types of T. cruzi are hybrids, their genetic composition being the likely result of genetic exchange between two distantly related lineages. The data show that the reference strain for the T. cruzi genome project (CL Brener) is a hybrid. Well-supported gene genealogies show that mitochondrial and nuclear gene sequences from T. cruzi cluster, respectively, in three or four distinct clades that do not fully correspond to the two previously defined major lineages of T. cruzi. There is clear genetic differentiation among the major groups of sequences, but genetic diversity within each major group is low. We estimate that the major extant lineages of T. cruzi have diverged during the Miocene or early Pliocene (3–16 million years ago).

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Two issues in the evolution of the intron/exon structure of genes are the role of exon shuffling and the origin of introns. Using a large data base of eukaryotic intron-containing genes, we have found that there are correlations between intron phases leading to an excess of symmetric exons and symmetric exon sets. We interpret these excesses as manifestations of exon shuffling and make a conservative estimate that at least 19% of the exons in the data base were involved in exon shuffling, suggesting an important role for exon shuffling in evolution. Furthermore, these excesses of symmetric exons appear also in those regions of eukaryotic genes that are homologous to prokaryotic genes: the ancient conserved regions. This last fact cannot be explained in terms of the insertional theory of introns but rather supports the concept that some of the introns were ancient, the exon theory of genes.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

A statistical modeling approach is proposed for use in searching large microarray data sets for genes that have a transcriptional response to a stimulus. The approach is unrestricted with respect to the timing, magnitude or duration of the response, or the overall abundance of the transcript. The statistical model makes an accommodation for systematic heterogeneity in expression levels. Corresponding data analyses provide gene-specific information, and the approach provides a means for evaluating the statistical significance of such information. To illustrate this strategy we have derived a model to depict the profile expected for a periodically transcribed gene and used it to look for budding yeast transcripts that adhere to this profile. Using objective criteria, this method identifies 81% of the known periodic transcripts and 1,088 genes, which show significant periodicity in at least one of the three data sets analyzed. However, only one-quarter of these genes show significant oscillations in at least two data sets and can be classified as periodic with high confidence. The method provides estimates of the mean activation and deactivation times, induced and basal expression levels, and statistical measures of the precision of these estimates for each periodic transcript.