767 results for Unsupervised clustering
Abstract:
This paper considers the goal of identifying disease subgroups based on differences in observed symptom profiles. Commonly referred to as phenotype identification, this task is often addressed with unsupervised clustering techniques. We investigate the application of a Dirichlet Process mixture (DPM) model to it. The model is defined by placing the Dirichlet Process (DP) on the unknown components of a mixture model, which allows uncertainty about the partitioning of observed data into homogeneous subgroups to be expressed. To exemplify this approach, an application to phenotype identification in Parkinson’s disease (PD) is considered, with symptom profiles collected using the Unified Parkinson’s Disease Rating Scale (UPDRS). Keywords: Clustering, Dirichlet Process mixture, Parkinson’s disease, UPDRS.
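As a rough illustration of how a Dirichlet Process expresses uncertainty about a partition, the sketch below draws a random partition from the Chinese restaurant process, the distribution over partitions that a DP induces. It is not the model from the abstract above: the function name `crp_partition`, the concentration parameter `alpha`, and the use of Python are assumptions made for illustration only.

```python
import random

def crp_partition(n, alpha, seed=None):
    """Draw cluster labels for n items from a Chinese restaurant process.

    alpha (> 0) is the concentration parameter: larger values tend to
    produce more clusters. Hypothetical helper, for illustration only.
    """
    rng = random.Random(seed)
    labels = []
    counts = []  # counts[k] = number of items currently in cluster k
    for _ in range(n):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is chosen with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[k] += 1     # join an existing cluster
        labels.append(k)
    return labels
```

Repeated calls with different seeds give different partitions and different numbers of clusters, which is the sense in which the number of subgroups is left uncertain rather than fixed in advance.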
Abstract:
Training data for supervised learning neural networks can be clustered such that the input/output pairs in each cluster are redundant. Redundant training data can adversely affect training time. In this paper we apply two clustering algorithms, ART2-A and the Generalized Equality Classifier, to identify training data clusters and thus reduce the training data and training time. The approach is demonstrated for a high dimensional nonlinear continuous time mapping. The demonstration shows a six-fold decrease in training time at little or no loss of accuracy in the handling of evaluation data.
Abstract:
The XML Document Mining track was launched for exploring two main ideas: (1) identifying key problems and new challenges of the emerging field of mining semi-structured documents, and (2) studying and assessing the potential of Machine Learning (ML) techniques for dealing with generic ML tasks in the structured domain, i.e., classification and clustering of semi-structured documents. This track has run for six editions during INEX 2005, 2006, 2007, 2008, 2009 and 2010. The first five editions have been summarized in previous editions and we focus here on the 2010 edition. INEX 2010 included two tasks in the XML Mining track: (1) unsupervised clustering task and (2) semi-supervised classification task where documents are organized in a graph. The clustering task requires the participants to group the documents into clusters without any knowledge of category labels using an unsupervised learning algorithm. On the other hand, the classification task requires the participants to label the documents in the dataset into known categories using a supervised learning algorithm and a training set. This report gives the details of clustering and classification tasks.
Abstract:
The microarray technique is powerful, as it allows up to thousands of genes to be tested at a time, but it produces an overwhelming set of data files containing huge amounts of data, which are difficult to pre-process, separate, classify and correlate so that interesting conclusions can be extracted. Modern machine learning, data mining and clustering techniques based on information theory are needed to read and interpret the information content buried in those large data sets. The Independent Component Analysis method can be used to correct data affected by corruption processes or to filter out uncorrectable data, and clustering methods can then group similar genes or classify samples. In this paper a hybrid approach is used to obtain a two-way unsupervised clustering of corrected microarray data.
Abstract:
In this paper, we propose a novel method for the unsupervised clustering of graphs in the context of the constellation approach to object recognition. The method is an EM central clustering algorithm which builds prototypical graphs on the basis of fast matching with graph transformations. Our experiments, both with random graphs and in realistic situations (visual localization), show that our prototypes improve on the set median graphs and also on the prototypes derived from our previous incremental method. We also discuss how the method scales with a growing number of images.
Abstract:
2000 Mathematics Subject Classification: 62H30
Abstract:
Mixture models are a flexible tool for unsupervised clustering that has found popularity in a vast array of research areas. In studies of medicine, the use of mixtures holds the potential to greatly enhance our understanding of patient responses through the identification of clinically meaningful clusters that, given the complexity of many data sources, may otherwise be intangible. Furthermore, when developed in the Bayesian framework, mixture models provide a natural means for capturing and propagating uncertainty in different aspects of a clustering solution, arguably resulting in richer analyses of the population under study. This thesis aims to investigate the use of Bayesian mixture models in analysing varied and detailed sources of patient information collected in the study of complex disease. The first aim of this thesis is to showcase the flexibility of mixture models in modelling markedly different types of data. In particular, we examine three common variants on the mixture model, namely, finite mixtures, Dirichlet Process mixtures and hidden Markov models. Beyond the development and application of these models to different sources of data, this thesis also focuses on modelling different aspects relating to uncertainty in clustering. Examples of clustering uncertainty considered are uncertainty in a patient’s true cluster membership and accounting for uncertainty in the true number of clusters present. Finally, this thesis aims to address and propose solutions to the task of comparing clustering solutions, whether this be comparing patients or observations assigned to different subgroups or comparing clustering solutions over multiple datasets. To address these aims, we consider a case study in Parkinson’s disease (PD), a complex and commonly diagnosed neurodegenerative disorder. In particular, two commonly collected sources of patient information are considered.
The first source of data concerns symptoms associated with PD, recorded using the Unified Parkinson’s Disease Rating Scale (UPDRS); its analysis constitutes the first half of this thesis. The second half of this thesis is dedicated to the analysis of microelectrode recordings collected during Deep Brain Stimulation (DBS), a popular palliative treatment for advanced PD. Analysis of this second source of data centers on the problems of unsupervised detection and sorting of action potentials or "spikes" in recordings of multiple cell activity, providing valuable information on real time neural activity in the brain.
Abstract:
In this paper, we propose a semi-supervised approach to anomaly detection in Online Social Networks. The social network is modeled as a graph and its features are extracted to detect anomalies. A clustering algorithm is then used to group users based on these features, and fuzzy logic is applied to assign a degree of anomalous behavior to the users of these clusters. Empirical analysis shows the effectiveness of this method.
Abstract:
Recent studies suggest that genetic and environmental factors do not account for all the schizophrenia risk and that epigenetics also plays a role in disease susceptibility. DNA methylation is a heritable epigenetic modification that can regulate gene expression. Genome-wide DNA methylation analysis was performed on post-mortem human brain tissue from 24 patients with schizophrenia and 24 unaffected controls. DNA methylation was assessed at over 485 000 CpG sites using the Illumina Infinium Human Methylation450 Bead Chip. After adjusting for age and post-mortem interval (PMI), 4 641 probes corresponding to 2 929 unique genes were found to be differentially methylated. Of those genes, 1 291 were located in a CpG island and 817 were in a promoter region. These include NOS1, AKT1, DTNBP1, DNMT1, PPP3CC and SOX10, which have previously been associated with schizophrenia. More than 100 of these genes overlap with a previous DNA methylation study of peripheral blood from schizophrenia patients in which 27 000 CpG sites were analysed. Unsupervised clustering analysis of the top 3 000 most variable probes revealed two distinct groups, with significantly more people with schizophrenia than controls in cluster one (p = 1.74 × 10⁻⁴). The first cluster was composed of 88% patients with schizophrenia and only 12% controls, while the second cluster was composed of 27% patients with schizophrenia and 73% controls. These results strongly suggest that differential DNA methylation is important in schizophrenia etiology and add support for the use of DNA methylation profiles as a future prognostic indicator of schizophrenia.
Abstract:
Germline mutations in fumarate hydratase (FH) cause hereditary leiomyomatosis and renal cell cancer (HLRCC). FH is a nuclear encoded enzyme which functions in the Krebs tricarboxylic acid cycle, and homozygous mutations in FH lead to severe developmental defects. Both uterine and cutaneous leiomyomas are components of the HLRCC phenotype. Most of these tumours show loss of the wild-type allele and, also, the mutations reduce FH enzyme activity, which indicates that FH is a tumour suppressor gene. The renal cell cancers associated with HLRCC are of rare papillary type 2 histology. Other genes involved in the Krebs cycle that are also implicated in neoplasia are those encoding 3 of the 4 subunits of succinate dehydrogenase (SDH); mutations in SDHB, SDHC, and SDHD predispose to paraganglioma and phaeochromocytoma. Although uterine leiomyomas (or fibroids) are very common, with estimates of affected women ranging from 25% to 77%, not much is known about their genetic background. Cytogenetic studies have revealed that rearrangements involving chromosomes 6, 7, 12 and 14 are most commonly seen in fibroids. Deletions on the long arm of chromosome 7 have been reported to be involved in about 17 to 34% of leiomyomas, and the small commonly deleted region on 7q22 suggests that there might be an underlying tumour suppressor gene in that region. The purpose of this study was to investigate the genetic mechanisms behind the development of tumours associated with HLRCC, both renal cell cancer and uterine fibroids. Firstly, a database search at the Finnish cancer registry was conducted in order to identify new families with early-onset RCC and to test if the family history was compatible with HLRCC. Secondly, sporadic uterine fibroids were tested for deletions on 7q in order to define the minimal deleted 7q-region, followed by mutation analysis of the candidate genes.
Thirdly, oligonucleotide chips were utilised to study the global gene expression profiles of uterine fibroids in order to test whether 7q-deletions and FH mutations significantly affected fibroid biology. In the screen for early-onset RCC, 214 families were identified. Subsequently, the pedigrees were constructed and clinical data obtained. One of the index cases (RCC at the age of 28) had a mother who had been diagnosed with a heart tumour, which in further investigation turned out to be a paraganglioma. This led to an alternative hypothesis that SDH, instead of FH, could be involved. SDHA, SDHB, SDHC and SDHD were sequenced from these individuals; a germline SDHB R27X mutation was detected with loss of the wild-type allele in both tumours. These results suggest that germline mutations in the SDHB gene predispose to early-onset RCC, establishing a novel form of hereditary RCC. This has immediate clinical implications in the surveillance of patients suffering from early-onset RCC and phaeochromocytoma/paraganglioma. For the studies on sporadic uterine fibroids, a set of 166 fibroids from 51 individuals was collected. The 7q LOH mapping defined a commonly deleted region of about 3.2 megabases in 11 of the 166 tumours. The deletion was consistent with previously reported allelotyping studies of leiomyomas and it therefore suggested the presence of a tumour suppressor gene in the deleted region. Furthermore, the high-resolution aCGH-chip analysis refined the deleted region to only 2.79 Mb. When combined with previous data, the commonly deleted region was only 2.3 Mb. The mutation screening of the known genes within the commonly deleted region did not reveal pathogenic mutations, however. The expression microarray analysis revealed that FH-deficient fibroids, both sporadic and familial, had their own distinct gene expression profile, as they formed their own group in the unsupervised clustering.
On the other hand, the presence or absence of 7q-deletions did not significantly alter the global gene expression pattern of fibroids, suggesting that these two groups do not have different biological backgrounds. Multiple differentially expressed genes were identified between FH wild-type and FH-mutant fibroids, and the most significant increase was seen in the expression of carbohydrate metabolism-related and hypoxia inducible factor (HIF) target genes.
Abstract:
Bacteria play an important role in many ecological systems. The molecular characterization of bacteria using either cultivation-dependent or cultivation-independent methods reveals the large scale of bacterial diversity in natural communities, and the vastness of subpopulations within a species or genus. Understanding how bacterial diversity varies across different environments and also within populations should provide insights into many important questions of bacterial evolution and population dynamics. This thesis presents novel statistical methods for analyzing bacterial diversity using widely employed molecular fingerprinting techniques. The first objective of this thesis was to develop Bayesian clustering models to identify bacterial population structures. Bacterial isolates were identified using multilocus sequence typing (MLST), and Bayesian clustering models were used to explore the evolutionary relationships among isolates. Our method involves the inference of genetic population structures via an unsupervised clustering framework where the dependence between loci is represented using graphical models. The population dynamics that generate such a population stratification were investigated using a stochastic model, in which homologous recombination between subpopulations can be quantified within a gene flow network. The second part of the thesis focuses on cluster analysis of community compositional data produced by two different cultivation-independent analyses: terminal restriction fragment length polymorphism (T-RFLP) analysis, and fatty acid methyl ester (FAME) analysis. The cluster analysis aims to group bacterial communities that are similar in composition, which is an important step for understanding the overall influences of environmental and ecological perturbations on bacterial diversity.
A common feature of T-RFLP and FAME data is zero-inflation, which indicates that the observation of a zero value is much more frequent than would be expected, for example, from a Poisson distribution in the discrete case, or a Gaussian distribution in the continuous case. We provided two strategies for modeling zero-inflation in the clustering framework, which were validated on both synthetic and empirical complex data sets. We show in the thesis that our model, which takes into account dependencies between loci in MLST data, can produce better clustering results than methods which assume independent loci. Furthermore, computer algorithms that are efficient in analyzing large scale data were adopted to meet the increasing computational need. Our method that detects homologous recombination in subpopulations may provide a theoretical criterion for defining bacterial species. The clustering of bacterial community data, including T-RFLP and FAME, provides an initial effort toward discovering the evolutionary dynamics that structure and maintain bacterial diversity in the natural environment.
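To make the notion of zero-inflation relative to a Poisson distribution concrete, here is a minimal sketch of the textbook zero-inflated Poisson probability mass function. The abstract does not describe the thesis's two strategies, so this should not be read as either of them; the function name `zip_pmf` and the parameter names `lam` (Poisson rate) and `pi` (extra probability of a structural zero) are assumptions.

```python
import math

def zip_pmf(y, lam, pi):
    """Zero-inflated Poisson pmf (textbook form, for illustration only).

    With probability pi the observation is a structural zero; otherwise
    it is drawn from a Poisson distribution with rate lam.
    """
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    if y == 0:
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson
```

For any `pi` greater than zero, `zip_pmf(0, lam, pi)` exceeds the plain Poisson probability `math.exp(-lam)`, which is the excess of zeros the abstract refers to.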
Abstract:
This paper describes a new method of color text localization from generic scene images containing text of different scripts and with arbitrary orientations. A representative set of colors is first identified using the edge information to initiate an unsupervised clustering algorithm. Text components are identified from each color layer using a combination of a support vector machine and a neural network classifier trained on a set of low-level features derived from the geometric, boundary, stroke and gradient information. Experiments on camera-captured images that contain variable fonts, size, color, irregular layout, non-uniform illumination and multiple scripts illustrate the robustness of the method. The proposed method yields precision and recall of 0.8 and 0.86 respectively on a database of 100 images. The method is also compared with others in the literature using the ICDAR 2003 robust reading competition dataset.
Abstract:
A novel approach for estimating articulated body posture and motion from monocular video sequences is proposed. Human pose is defined as the instantaneous two dimensional configuration (i.e., the projection onto the image plane) of a single articulated body in terms of the position of a predetermined set of joints. First, statistical segmentation of the human bodies from the background is performed and low-level visual features are found given the segmented body shape. The goal is to be able to map these, generally low level, visual features to body configurations. The system estimates different mappings, each one associated with a specific cluster in the visual feature space. Given a set of body motion sequences for training, unsupervised clustering is obtained via the Expectation Maximization algorithm. Then, for each of the clusters, a function is estimated to build the mapping from low-level features to 3D pose. Currently this mapping is modeled by a neural network. Given new visual features, a mapping from each cluster is performed to yield a set of possible poses. From this set, the system selects the most likely pose given the learned probability distribution and the visual feature similarity between hypothesis and input. Performance of the proposed approach is characterized using a new set of known body postures, showing promising results.
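The clustering step named in the abstract above, Expectation Maximization, can be sketched on a toy problem. The code below fits a one-dimensional Gaussian mixture; the abstract does not state the mixture family, the feature dimension, or the initialization used, so all of those (and the names `normal_pdf` and `em_gmm_1d`) are assumptions made for illustration.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, n_iter=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture by EM, starting from the given initial means."""
    k = len(means)
    means = list(means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300  # guard against division by zero
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0:
                continue  # leave an empty component unchanged
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, min_var)  # floor keeps the density finite
    return weights, means, variances
```

A hard cluster assignment for a point can then be taken as the component with the largest responsibility; a real feature space would need the multivariate form with covariance matrices.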
Abstract:
In this paper, we introduce the Generalized Equality Classifier (GEC) for use as an unsupervised clustering algorithm in categorizing analog data. GEC is based on a formal definition of inexact equality originally developed for voting in fault tolerant software applications. GEC is defined using a metric space framework. The only parameter in GEC is a scalar threshold which defines the approximate equality of two patterns. Here, we compare the characteristics of GEC to the ART2-A algorithm (Carpenter, Grossberg, and Rosen, 1991). In particular, we show that GEC with the Hamming distance performs the same optimization as ART2. Moreover, GEC has lower computational requirements than ART2 on serial machines.
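The idea of a single scalar threshold defining approximate equality in a metric space can be sketched as follows. This is a generic threshold-based ("leader" style) clustering written for illustration, not the GEC algorithm as published; the names `hamming` and `threshold_cluster` and the first-match assignment rule are assumptions.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def threshold_cluster(patterns, threshold, distance=hamming):
    """Group patterns that are 'approximately equal' to a cluster representative.

    A pattern joins the first cluster whose representative lies within
    `threshold` of it; otherwise it starts a new cluster and becomes
    that cluster's representative.
    """
    representatives = []
    labels = []
    for p in patterns:
        for k, rep in enumerate(representatives):
            if distance(p, rep) <= threshold:
                labels.append(k)
                break
        else:
            # No representative was close enough: open a new cluster.
            representatives.append(p)
            labels.append(len(representatives) - 1)
    return labels, representatives
```

For example, `threshold_cluster([(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 1, 1), (1, 1, 1, 0)], 1)` groups the first two patterns together and the last two together. Any function satisfying the metric axioms could be passed as `distance` in place of the Hamming distance.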