948 results for Clustering Analysis
Abstract:
We introduce a method of functionally classifying genes by using gene expression data from DNA microarray hybridization experiments. The method is based on the theory of support vector machines (SVMs). SVMs are considered a supervised computer learning method because they exploit prior knowledge of gene function to identify unknown genes of similar function from expression data. SVMs avoid several problems associated with unsupervised clustering methods, such as hierarchical clustering and self-organizing maps. SVMs have many mathematical features that make them attractive for gene expression analysis, including their flexibility in choosing a similarity function, sparseness of solution when dealing with large data sets, the ability to handle large feature spaces, and the ability to identify outliers. We test several SVMs that use different similarity metrics, as well as some other supervised learning methods, and find that the SVMs best identify sets of genes with a common function using expression data. Finally, we use SVMs to predict functional roles for uncharacterized yeast ORFs based on their expression data.
Abstract:
While genome sequencing projects are advancing rapidly, EST sequencing and analysis remains a primary research tool for the identification and categorization of gene sequences in a wide variety of species and an important resource for annotation of genomic sequence. The TIGR Gene Indices (http://www.tigr.org/tdb/tgi.shtml) are a collection of species-specific databases that use a highly refined protocol to analyze EST sequences in an attempt to identify the genes represented by that data and to provide additional information regarding those genes. Gene Indices are constructed by first clustering, then assembling EST and annotated gene sequences from GenBank for the targeted species. This process produces a set of unique, high-fidelity virtual transcripts, or Tentative Consensus (TC) sequences. The TC sequences can be used to provide putative genes with functional annotation, to link the transcripts to mapping and genomic sequence data, to provide links between orthologous and paralogous genes and as a resource for comparative sequence analysis.
Abstract:
Olfactory receptor (OR) genes represent ≈1% of genomic coding sequence in mammals, and these genes are clustered on multiple chromosomes in both the mouse and human genomes. We have taken a comparative genomics approach to identify features that may be involved in the dynamic evolution of this gene family and in the transcriptional control that results in a single OR gene expressed per olfactory neuron. We sequenced ≈350 kb of the murine P2 OR cluster and used synteny, gene linkage, and phylogenetic analysis to identify and sequence ≈111 kb of an orthologous cluster in the human genome. In total, 18 mouse and 8 human OR genes were identified, including 7 orthologs that appear to be functional in both species. Noncoding homology is evident between orthologs and generally is confined within the transcriptional unit. We find no evidence for common regulatory features shared among paralogs, and promoter regions generally do not contain strong promoter motifs. We discuss these observations, as well as OR clustering, in the context of evolutionary expansion and transcriptional regulation of OR repertoires.
Abstract:
Whole genome linkage analysis of type 1 diabetes using affected sib pair families and semi-automated genotyping and data capture procedures has shown how type 1 diabetes is inherited. A major proportion of clustering of the disease in families can be accounted for by sharing of alleles at susceptibility loci in the major histocompatibility complex on chromosome 6 (IDDM1) and at a minimum of 11 other loci on nine chromosomes. Primary etiological components of IDDM1, the HLA-DQB1 and -DRB1 class II immune response genes, and of IDDM2, the minisatellite repeat sequence in the 5' regulatory region of the insulin gene on chromosome 11p15, have been identified. Identification of the other loci will involve linkage disequilibrium mapping and sequencing of candidate genes in regions of linkage.
Abstract:
Automated human behaviour analysis has been, and still remains, a challenging problem. It has been approached from different points of view, from primitive actions to human interaction recognition. This paper focuses on trajectory analysis, which allows a simple high-level understanding of complex human behaviour. A novel representation method for trajectory data is proposed, called the Activity Description Vector (ADV), based on the number of occurrences of a person at a specific point of the scenario and the local movements performed there. The ADV is calculated for each cell into which the scenario is spatially sampled, providing a cue for different clustering methods. The ADV representation has been tested as the input to several classic classifiers and compared to other approaches using CAVIAR dataset sequences, obtaining high accuracy in recognising the behaviour of people in a shopping centre.
Abstract:
Optimal currency area theory suggests that business cycle co-movement is a sufficient condition for monetary union, particularly if there are low levels of labour mobility between potential members of the monetary union. Previous studies of co-movement of business cycle variables (mainly authored by Artis and Zhang in the late 1990s) found that there was a core of member states in the EU that could be grouped together as having similar business cycle co-movements, but these studies always used Germany as the country against which to compare. In this study, the analysis of Artis and Zhang is extended and updated by correlating against both German and euro area macroeconomic aggregates and by using more recent techniques in cluster analysis, namely model-based clustering techniques.
Abstract:
This master's thesis presents methods for the intelligent analysis and visualization of 3D ECG signals, with the aim of increasing the efficiency of ECG analysis by extracting additional data. Visualization is treated as one of the signal analysis tasks; imaging techniques and their mathematical description are considered. Algorithms for calculating and visualizing the signal attributes were developed and are described using mathematical methods and signal mining tools. A pattern-search model was constructed to compare the accuracy of the methods, problems of clustering and classification of data were solved, and a data visualization program was developed. The work confirms that this approach gives the highest accuracy in the intelligent analysis task. The visualization and analysis techniques considered are also applicable to multi-dimensional signals of other kinds.
Abstract:
"May 1980."
Abstract:
Originally presented as the author's thesis (M.A.), University of Illinois at Urbana-Champaign.
Abstract:
This thesis is an analysis of consumption in Brazil, based on data from the Consumer Expenditure Survey for 2008 to 2009, collected by the Brazilian Institute of Geography and Statistics. The main aim of the thesis was to identify differences and similarities in consumption among Brazilian households, and to estimate the importance of demographic and geographic characteristics. Initially, households belonging to different social classes and geographical regions were compared based on their consumption. For further insights, two cluster analyses were conducted. Firstly, households were grouped according to the absolute values of expenditures. Five clusters were discovered; cluster membership showed larger spending in all of the expense categories for households having higher income, and a substantial association with particular demographic variables, including region, neighborhood, race and education. Secondly, cluster analysis was performed on the proportionate distribution of total spending by every household. Five groups of households were revealed: Basic Consumers, the largest group, which spends only on fundamental goods; Limited Spenders, which additionally purchase alcohol, tobacco, literature and telecommunication technologies; Mainstream Buyers, characterized by spending on clothing, personal care, entertainment and transport; Advanced Consumers, which have high relative expenses on financial and legal services, healthcare and education; and Exclusive Spenders, households distinguished by spending on vehicles, real estate and travelling.
Abstract:
Cluster analysis via a finite mixture model approach is considered. With this approach to clustering, the data can be partitioned into a specified number of clusters g by first fitting a mixture model with g components. An outright clustering of the data is then obtained by assigning an observation to the component to which it has the highest estimated posterior probability of belonging; that is, the ith cluster consists of those observations assigned to the ith component (i = 1,..., g). The focus is on the use of mixtures of normal components for the cluster analysis of data that can be regarded as being continuous. But attention is also given to the case of mixed data, where the observations consist of both continuous and discrete variables.
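The procedure described in this abstract (fit a mixture with g components, then assign each observation to the component with the highest estimated posterior probability) can be sketched as follows. This is only a minimal illustration, assuming univariate data, normal components fitted by EM, and a crude quantile-based initialisation; the function names `fit_mixture`, `posteriors` and `assign` are hypothetical and do not come from the abstract, and the data are made up.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def posteriors(x, weights, means, variances):
    """Estimated posterior probability that x belongs to each component."""
    dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(dens)
    return [d / total for d in dens]


def fit_mixture(data, g, iterations=200):
    """Fit a g-component univariate normal mixture by EM (illustrative only)."""
    n = len(data)
    ordered = sorted(data)
    # Assumption: start the means at evenly spaced quantiles of the data.
    means = [ordered[(2 * i + 1) * n // (2 * g)] for i in range(g)]
    grand_mean = sum(data) / n
    grand_var = sum((x - grand_mean) ** 2 for x in data) / n
    variances = [grand_var] * g
    weights = [1.0 / g] * g
    for _ in range(iterations):
        # E-step: posterior probabilities of component membership.
        post = [posteriors(x, weights, means, variances) for x in data]
        # M-step: update mixing proportions, means and variances.
        for i in range(g):
            n_i = sum(p[i] for p in post)
            weights[i] = n_i / n
            means[i] = sum(p[i] * x for p, x in zip(post, data)) / n_i
            var_i = sum(p[i] * (x - means[i]) ** 2 for p, x in zip(post, data)) / n_i
            variances[i] = max(var_i, 1e-6)  # guard against a collapsing component
    return weights, means, variances


def assign(x, weights, means, variances):
    """Outright clustering: index of the component with the highest posterior."""
    post = posteriors(x, weights, means, variances)
    return max(range(len(post)), key=post.__getitem__)


# Made-up data: two well-separated groups.
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(50)] + [rng.gauss(10.0, 1.0) for _ in range(50)]
fitted = fit_mixture(sample, g=2)
clusters = [assign(x, *fitted) for x in sample]
```

The multivariate and mixed-data cases mentioned in the abstract need component covariance matrices and discrete-variable densities, which this sketch does not attempt.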
Abstract:
The number of mammalian transcripts identified by full-length cDNA projects and genome sequencing projects is increasing remarkably. Clustering them into a strictly nonredundant and comprehensive set provides a platform for functional analysis of the transcriptome and proteome, but the quality of the clustering and predictive usefulness have previously required manual curation to identify truncated transcripts and inappropriate clustering of closely related sequences. A Representative Transcript and Protein Sets (RTPS) pipeline was previously designed to identify the nonredundant and comprehensive set of mouse transcripts based on clustering of a large mouse full-length cDNA set (FANTOM2). Here we propose an alternative method that is more robust, requires less manual curation, and is applicable to other organisms in addition to mouse. RTPSs of human, mouse, and rat have been produced by this method and used for validation. Their comprehensiveness and quality are discussed by comparison with other clustering approaches. The RTPSs are available at ftp://fantom2.gsc.riken.go.jp/RTPS/.
Abstract:
Normal mixture models are often used to cluster continuous data. However, conventional approaches for fitting these models will have problems in producing nonsingular estimates of the component-covariance matrices when the dimension of the observations is large relative to the number of observations. In this case, methods such as principal components analysis (PCA) and the mixture of factor analyzers model can be adopted to avoid these estimation problems. We examine these approaches applied to the Cabernet wine data set of Ashenfelter (1999), considering the clustering of both the wines and the judges, and comparing our results with another analysis. The mixture of factor analyzers model proves particularly effective in clustering the wines, accurately classifying many of the wines by location.
Abstract:
Objectives: The objectives of this study were to examine the extent of clustering of smoking, high levels of television watching, overweight, and high blood pressure among adolescents and whether this clustering varies by socioeconomic position and cognitive function. Methods: This study was a cross-sectional analysis of 3613 (1742 females) participants of an Australian birth cohort who were examined at age 14. Results: Three hundred fifty-three (9.8%) of the participants had co-occurrence of three or four risk factors. Risk factors clustered in these adolescents: more participants than would be predicted under assumptions of independence had no risk factors, and more had three or four risk factors. The extent of clustering tended to be greater in those from lower-income families and among those with lower cognitive function. The age-adjusted ratio of observed to expected co-occurrence of three or four risk factors was 2.70 (95% confidence interval [CI], 1.80-4.06) among those from low-income families and 1.70 (95% CI, 1.34-2.16) among those from more affluent families. The ratio among those with low Raven's scores (nonverbal reasoning) was 2.36 (95% CI, 1.69-3.30) and among those with higher scores was 1.51 (95% CI, 1.19-1.92); similar results for the WRAT 3 score (reading ability) were 2.69 (95% CI, 1.85-3.94) and 1.68 (95% CI, 1.34-2.11). Clustering did not differ by sex. Conclusion: Among adolescents, coronary heart disease risk factors cluster, and there is some evidence that this clustering is greater among those from families with low income and those who have lower cognitive function.
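The observed-to-expected ratio reported in this abstract can be illustrated with a small sketch. It assumes each participant is recorded as a tuple of 0/1 indicators, one per risk factor, and computes the expected proportion under independence from the marginal prevalences. Unlike the published figures, it is not age-adjusted and gives no confidence interval; the function names and the example records are hypothetical.

```python
from itertools import product


def expected_prob_at_least(prevalences, k):
    """Probability of having at least k factors if the factors were independent."""
    total = 0.0
    for pattern in product([0, 1], repeat=len(prevalences)):
        if sum(pattern) >= k:
            p = 1.0
            for present, prev in zip(pattern, prevalences):
                p *= prev if present else 1.0 - prev
            total += p
    return total


def observed_to_expected(records, k):
    """Ratio of the observed proportion with >= k factors to that expected under independence."""
    n = len(records)
    n_factors = len(records[0])
    # Marginal prevalence of each risk factor in the sample.
    prevalences = [sum(r[j] for r in records) / n for j in range(n_factors)]
    observed = sum(1 for r in records if sum(r) >= k) / n
    return observed / expected_prob_at_least(prevalences, k)


# Made-up records: half the sample has all four factors, half has none.
clustered = [(1, 1, 1, 1)] * 5 + [(0, 0, 0, 0)] * 5
ratio = observed_to_expected(clustered, k=3)
```

A ratio above 1 indicates more co-occurrence than independence would predict, which is the sense of "clustering" used in the abstract.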
Abstract:
Motivation: The clustering of gene profiles across some experimental conditions of interest contributes significantly to the elucidation of unknown gene function, the validation of gene discoveries and the interpretation of biological processes. However, this clustering problem is not straightforward as the profiles of the genes are not all independently distributed and the expression levels may have been obtained from an experimental design involving replicated arrays. Ignoring the dependence between the gene profiles and the structure of the replicated data can result in important sources of variability in the experiments being overlooked in the analysis, with the consequent possibility of misleading inferences being made. We propose a random-effects model that provides a unified approach to the clustering of genes with correlated expression levels measured in a wide variety of experimental situations. Our model is an extension of the normal mixture model to account for the correlations between the gene profiles and to enable covariate information to be incorporated into the clustering process. Hence the model is applicable to longitudinal studies with or without replication, for example, time-course experiments by using time as a covariate, and to cross-sectional experiments by using categorical covariates to represent the different experimental classes. Results: We show that our random-effects model can be fitted by maximum likelihood via the EM algorithm, for which the E (expectation) and M (maximization) steps can be implemented in closed form. Hence our model can be fitted deterministically without the need for time-consuming Monte Carlo approximations. The effectiveness of our model-based procedure for the clustering of correlated gene profiles is demonstrated on three real datasets, representing typical microarray experimental designs, covering time-course, repeated-measurement and cross-sectional data. In these examples, relevant clusters of the genes are obtained, which are supported by existing gene-function annotation. A synthetic dataset is also considered.