812 resultados para Hier-archical clustering
Resumo:
Automatic labeling of white matter fibres in diffusion-weighted brain MRI is vital for comparing brain integrity and connectivity across populations, but is challenging. Whole brain tractography generates a vast set of fibres throughout the brain, but it is hard to cluster them into anatomically meaningful tracts, due to wide individual variations in the trajectory and shape of white matter pathways. We propose a novel automatic tract labeling algorithm that fuses information from tractography and multiple hand-labeled fibre tract atlases. As streamline tractography can generate a large number of false positive fibres, we developed a top-down approach to extract tracts consistent with known anatomy, based on a distance metric to multiple hand-labeled atlases. Clustering results from different atlases were fused, using a multi-stage fusion scheme. Our "label fusion" method reliably extracted the major tracts from 105-gradient HARDI scans of 100 young normal adults. © 2012 Springer-Verlag.
Resumo:
We introduce a framework for population analysis of white matter tracts based on diffusion-weighted images of the brain. The framework enables extraction of fibers from high angular resolution diffusion images (HARDI); clustering of the fibers based partly on prior knowledge from an atlas; representation of the fiber bundles compactly using a path following points of highest density (maximum density path; MDP); and registration of these paths together using geodesic curve matching to find local correspondences across a population. We demonstrate our method on 4-Tesla HARDI scans from 565 young adults to compute localized statistics across 50 white matter tracts based on fractional anisotropy (FA). Experimental results show increased sensitivity in the determination of genetic influences on principal fiber tracts compared to the tract-based spatial statistics (TBSS) method. Our results show that the MDP representation reveals important parts of the white matter structure and considerably reduces the dimensionality over comparable fiber matching approaches.
Resumo:
Environmental acoustic recordings can be used to perform avian species richness surveys, whereby a trained ornithologist can observe the species present by listening to the recording. This could be made more efficient by using computational methods for iteratively selecting the richest parts of a long recording for the human observer to listen to, a process known as “smart sampling”. This allows scaling up to much larger ecological datasets. In this paper we explore computational approaches based on information and diversity of selected samples. We propose to use an event detection algorithm to estimate the amount of information present in each sample. We further propose to cluster the detected events for a better estimate of this amount of information. Additionally, we present a time dispersal approach to estimating diversity between iteratively selected samples. Combinations of approaches were evaluated on seven 24-hour recordings that have been manually labeled by bird watchers. The results show that on average all the methods we have explored would allow annotators to observe more new species in fewer minutes compared to a baseline of random sampling at dawn.
Resumo:
Multicentric carpotarsal osteolysis (MCTO) is a rare skeletal dysplasia characterized by aggressive osteolysis, particularly affecting the carpal and tarsal bones, and is frequently associated with progressive renal failure. Using exome capture and next-generation sequencing in five unrelated simplex cases of MCTO, we identified previously unreported missense mutations clustering within a 51 base pair region of the single exon of MAFB, validated by Sanger sequencing. A further six unrelated simplex cases with MCTO were also heterozygous for previously unreported mutations within this same region, as were affected members of two families with autosomal-dominant MCTO. MAFB encodes a transcription factor that negatively regulates RANKL-induced osteoclastogenesis and is essential for normal renal development. Identification of this gene paves the way for development of novel therapeutic approaches for this crippling disease and provides insight into normal bone and kidney development.
Resumo:
(The American Journal of Human Genetics, 90, 494–501; March 9, 2012) In the published version of this article, the amino acid alteration caused by c.161C>T should have been notated as p.Ser54Leu and not p.Pro54Leu. The wild-type amino acid is incorrectly notated in the main text, in Table 2, and in Figure 4. The authors regret this error. Additionally, The Journal regrets that this erratum, originally requested in 2012, was not published in a timely fashion.
Resumo:
This thesis has investigated how to cluster a large number of faces within a multi-media corpus in the presence of large session variation. Quality metrics are used to select the best faces to represent a sequence of faces; and session variation modelling improves clustering performance in the presence of wide variations across videos. Findings from this thesis contribute to improving the performance of both face verification systems and the fully automated clustering of faces from a large video corpus.
Resumo:
The work reported hen was motivated by a desire to verify the existence of structure - specifically MP-rich clusters induced by sodium bromide (NaBr) in the ternary liquid mixture 3-methylpyridine (Mf) + water(W) + NaBr. We present small-angle X-ray scattering (SAXS) measurements in this mixture. These measurements were obtained at room temperature (similar to 298 K) in the one-phase region (below the relevant lower consolute points, T(L)s) at different values of X (i.e., X = 0.02 - 0.17), where X is the weight fraction of NaBr in the mixture. Cluster-size distribution, estimated on the assumption that the clusters are spherical, shows systematic behaviour in that the peak of the distribution shifts rewards larger values of cluster radius as X increases. The largest spatial extent of the clusters (similar to 4.5 nm) is seen at X = 0.17. Data analysis assuming arbitrary shapes and sizes of clusters gives a limiting value of cluster size (- 4.5 nm) that is not very sensitive to X. It is suggested that the cluster size determined may not be the same as the usual critical-point fluctuations far removed from the critical point (T-L). The influence of the additional length scale due to clustering is discussed from the standpoint of crossover from Ising to mean-field critical behaviour, when moving away from the T-L.
Resumo:
Document clustering is one of the prominent methods for mining important information from the vast amount of data available on the web. However, document clustering generally suffers from the curse of dimensionality. Providentially in high dimensional space, data points tend to be more concentrated in some areas of clusters. We take advantage of this phenomenon by introducing a novel concept of dynamic cluster representation named as loci. Clusters’ loci are efficiently calculated using documents’ ranking scores generated from a search engine. We propose a fast loci-based semi-supervised document clustering algorithm that uses clusters’ loci instead of conventional centroids for assigning documents to clusters. Empirical analysis on real-world datasets shows that the proposed method produces cluster solutions with promising quality and is substantially faster than several benchmarked centroid-based semi-supervised document clustering methods.
Resumo:
n this paper, a multistage evolutionary scheme is proposed for clustering in a large data base, like speech data. This is achieved by clustering a small subset of the entire sample set in each stage and treating the cluster centroids so obtained as samples, together with another subset of samples not considered previously, as input data to the next stage. This is continued till the whole sample set is exhausted. The clustering is accomplished by constructing a fuzzy similarity matrix and using the fuzzy techniques proposed here. The technique is illustrated by an efficient scheme for voiced-unvoiced-silence classification of speech.
Resumo:
This paper addresses the following predictive business process monitoring problem: Given the execution trace of an ongoing case,and given a set of traces of historical (completed) cases, predict the most likely outcome of the ongoing case. In this context, a trace refers to a sequence of events with corresponding payloads, where a payload consists of a set of attribute-value pairs. Meanwhile, an outcome refers to a label associated to completed cases, like, for example, a label indicating that a given case completed “on time” (with respect to a given desired duration) or “late”, or a label indicating that a given case led to a customer complaint or not. The paper tackles this problem via a two-phased approach. In the first phase, prefixes of historical cases are encoded using complex symbolic sequences and clustered. In the second phase, a classifier is built for each of the clusters. To predict the outcome of an ongoing case at runtime given its (uncompleted) trace, we select the closest cluster(s) to the trace in question and apply the respective classifier(s), taking into account the Euclidean distance of the trace from the center of the clusters. We consider two families of clustering algorithms – hierarchical clustering and k-medoids – and use random forests for classification. The approach was evaluated on four real-life datasets.
Resumo:
In this paper the notion of conceptual cohesiveness is precised and used to group objects semantically, based on a knowledge structure called ‘cohesion forest’. A set of axioms is proposed which should be satisfied to make the generated clusters meaningful.
Resumo:
A computationally efficient agglomerative clustering algorithm based on multilevel theory is presented. Here, the data set is divided randomly into a number of partitions. The samples of each such partition are clustered separately using hierarchical agglomerative clustering algorithm to form sub-clusters. These are merged at higher levels to get the final classification. This algorithm leads to the same classification as that of hierarchical agglomerative clustering algorithm when the clusters are well separated. The advantages of this algorithm are short run time and small storage requirement. It is observed that the savings, in storage space and computation time, increase nonlinearly with the sample size.
Resumo:
K-means algorithm is a well known nonhierarchical method for clustering data. The most important limitations of this algorithm are that: (1) it gives final clusters on the basis of the cluster centroids or the seed points chosen initially, and (2) it is appropriate for data sets having fairly isotropic clusters. But this algorithm has the advantage of low computation and storage requirements. On the other hand, hierarchical agglomerative clustering algorithm, which can cluster nonisotropic (chain-like and concentric) clusters, requires high storage and computation requirements. This paper suggests a new method for selecting the initial seed points, so that theK-means algorithm gives the same results for any input data order. This paper also describes a hybrid clustering algorithm, based on the concepts of multilevel theory, which is nonhierarchical at the first level and hierarchical from second level onwards, to cluster data sets having (i) chain-like clusters and (ii) concentric clusters. It is observed that this hybrid clustering algorithm gives the same results as the hierarchical clustering algorithm, with less computation and storage requirements.
Resumo:
Objective To investigate the epidemic characteristics of human cutaneous anthrax (CA) in China, detect the spatiotemporal clusters at the county level for preemptive public health interventions, and evaluate the differences in the epidemiological characteristics within and outside clusters. Methods CA cases reported during 2005–2012 from the national surveillance system were evaluated at the county level using space-time scan statistic. Comparative analysis of the epidemic characteristics within and outside identified clusters was performed using using the χ2 test or Kruskal-Wallis test. Results The group of 30–39 years had the highest incidence of CA, and the fatality rate increased with age, with persons ≥70 years showing a fatality rate of 4.04%. Seasonality analysis showed that most of CA cases occurred between May/June and September/October of each year. The primary spatiotemporal cluster contained 19 counties from June 2006 to May 2010, and it was mainly located straddling the borders of Sichuan, Gansu, and Qinghai provinces. In these high-risk areas, CA cases were predominantly found among younger, local, males, shepherds, who were living on agriculture and stockbreeding and characterized with high morbidity, low mortality and a shorter period from illness onset to diagnosis. Conclusion CA was geographically and persistently clustered in the Southwestern China during 2005–2012, with notable differences in the epidemic characteristics within and outside spatiotemporal clusters; this demonstrates the necessity for CA interventions such as enhanced surveillance, health education, mandatory and standard decontamination or disinfection procedures to be geographically targeted to the areas identified in this study.
Resumo:
The family of location and scale mixtures of Gaussians has the ability to generate a number of flexible distributional forms. The family nests as particular cases several important asymmetric distributions like the Generalized Hyperbolic distribution. The Generalized Hyperbolic distribution in turn nests many other well known distributions such as the Normal Inverse Gaussian. In a multivariate setting, an extension of the standard location and scale mixture concept is proposed into a so called multiple scaled framework which has the advantage of allowing different tail and skewness behaviours in each dimension with arbitrary correlation between dimensions. Estimation of the parameters is provided via an EM algorithm and extended to cover the case of mixtures of such multiple scaled distributions for application to clustering. Assessments on simulated and real data confirm the gain in degrees of freedom and flexibility in modelling data of varying tail behaviour and directional shape.