996 resultados para CLUSTER VALIDATION


Relevância:

70.00% 70.00%

Publicador:

Resumo:

The evaluation and comparison of internal cluster validity indices is a critical problem in the clustering area. The methodology used in most of the evaluations assumes that the clustering algorithms work correctly. We propose an alternative methodology that does not make this often false assumption. We compared 7 internal cluster validity indices with both methodologies and concluded that the results obtained with the proposed methodology are more representative of the actual capabilities of the compared indices.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Three classification techniques, namely, K-means Cluster Analysis (KCA), Fuzzy Cluster Analysis (FCA), and Kohonen Neural Networks (KNN) were employed to group 25 microwatersheds of Kherthal watershed, Rajasthan into homogeneous groups for formulating the basis for suitable conservation and management practices. Ten parameters, mainly, morphological, namely, drainage density (D-d), bifurcation ratio (R-b), stream frequency (F-u), length of overland flow (L-o), form factor (R-f), shape factor (B-s), elongation ratio (R-e), circulatory ratio (R-c), compactness coefficient (C-c) and texture ratio (T) are used for the classification. Optimal number of groups is chosen, based on two cluster validation indices Davies-Bouldin and Dunn's. Comparative analysis of various clustering techniques revealed that 13 microwatersheds out of 25 are commonly suggested by KCA, FCA and KNN i.e., 52%; 17 microwatersheds out of 25 i.e., 68% are commonly suggested by KCA and FCA whereas these are 16 out of 25 in FCA and KNN (64%) and 15 out of 25 in KNN and CA (60%). It is observed from KNN sensitivity analysis that effect of various number of epochs (1000, 3000, 5000) and learning rates (0.01, 0.1-0.9) on total squared error values is significant even though no fixed trend is observed. Sensitivity analysis studies revealed that microwatershecls have occupied all the groups even though their number in each group is different in case of further increase in the number of groups from 5 to 6, 7 and 8. (C) 2010 International Association of Hydro-environment Engineering and Research, Asia Pacific Division. Published by Elsevier B.V. All rights reserved.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Clustering techniques are used in regional flood frequency analysis (RFFA) to partition watersheds into natural groups or regions with similar hydrologic responses. The linear Kohonen's self‐organizing feature map (SOFM) has been applied as a clustering technique for RFFA in several recent studies. However, it is seldom possible to interpret clusters from the output of an SOFM, irrespective of its size and dimensionality. In this study, we demonstrate that SOFMs may, however, serve as a useful precursor to clustering algorithms. We present a two‐level. SOFM‐based clustering approach to form regions for FFA. In the first level, the SOFM is used to form a two‐dimensional feature map. In the second level, the output nodes of SOFM are clustered using Fuzzy c‐means algorithm to form regions. The optimal number of regions is based on fuzzy cluster validation measures. Effectiveness of the proposed approach in forming homogeneous regions for FFA is illustrated through application to data from watersheds in Indiana, USA. Results show that the performance of the proposed approach to form regions is better than that based on classical SOFM.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Clustering quality or validation indices allow the evaluation of the quality of clustering in order to support the selection of a specific partition or clustering structure in its natural unsupervised environment, where the real solution is unknown or not available. In this paper, we investigate the use of quality indices mostly based on the concepts of clusters` compactness and separation, for the evaluation of clustering results (partitions in particular). This work intends to offer a general perspective regarding the appropriate use of quality indices for the purpose of clustering evaluation. After presenting some commonly used indices, as well as indices recently proposed in the literature, key issues regarding the practical use of quality indices are addressed. A general methodological approach is presented which considers the identification of appropriate indices thresholds. This general approach is compared with the simple use of quality indices for evaluating a clustering solution.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Data clustering is applied to various fields such as data mining, image processing and pattern recognition technique. Clustering algorithms splits a data set into clusters such that elements within the same cluster have a high degree of similarity, while elements belonging to different clusters have a high degree of dissimilarity. The Fuzzy C-Means Algorithm (FCM) is a fuzzy clustering algorithm most used and discussed in the literature. The performance of the FCM is strongly affected by the selection of the initial centers of the clusters. Therefore, the choice of a good set of initial cluster centers is very important for the performance of the algorithm. However, in FCM, the choice of initial centers is made randomly, making it difficult to find a good set. This paper proposes three new methods to obtain initial cluster centers, deterministically, the FCM algorithm, and can also be used in variants of the FCM. In this work these initialization methods were applied in variant ckMeans.With the proposed methods, we intend to obtain a set of initial centers which are close to the real cluster centers. With these new approaches startup if you want to reduce the number of iterations to converge these algorithms and processing time without affecting the quality of the cluster or even improve the quality in some cases. Accordingly, cluster validation indices were used to measure the quality of the clusters obtained by the modified FCM and ckMeans algorithms with the proposed initialization methods when applied to various data sets

Relevância:

60.00% 60.00%

Publicador:

Resumo:

The attributes describing a data set may often be arranged in meaningful subsets, each of which corresponds to a different aspect of the data. An unsupervised algorithm (SCAD) that simultaneously performs fuzzy clustering and aspects weighting was proposed in the literature. However, SCAD may fail and halt given certain conditions. To fix this problem, its steps are modified and then reordered to reduce the number of parameters required to be set by the user. In this paper we prove that each step of the resulting algorithm, named ASCAD, globally minimizes its cost-function with respect to the argument being optimized. The asymptotic analysis of ASCAD leads to a time complexity which is the same as that of fuzzy c-means. A hard version of the algorithm and a novel validity criterion that considers aspect weights in order to estimate the number of clusters are also described. The proposed method is assessed over several artificial and real data sets.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Due to the rapid advances in computing and sensing technologies, enormous amounts of data are being generated everyday in various applications. The integration of data mining and data visualization has been widely used to analyze these massive and complex data sets to discover hidden patterns. For both data mining and visualization to be effective, it is important to include the visualization techniques in the mining process and to generate the discovered patterns for a more comprehensive visual view. In this dissertation, four related problems: dimensionality reduction for visualizing high dimensional datasets, visualization-based clustering evaluation, interactive document mining, and multiple clusterings exploration are studied to explore the integration of data mining and data visualization. In particular, we 1) propose an efficient feature selection method (reliefF + mRMR) for preprocessing high dimensional datasets; 2) present DClusterE to integrate cluster validation with user interaction and provide rich visualization tools for users to examine document clustering results from multiple perspectives; 3) design two interactive document summarization systems to involve users efforts and generate customized summaries from 2D sentence layouts; and 4) propose a new framework which organizes the different input clusterings into a hierarchical tree structure and allows for interactive exploration of multiple clustering solutions.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Microsecond long Molecular Dynamics (MD) trajectories of biomolecular processes are now possible due to advances in computer technology. Soon, trajectories long enough to probe dynamics over many milliseconds will become available. Since these timescales match the physiological timescales over which many small proteins fold, all atom MD simulations of protein folding are now becoming popular. To distill features of such large folding trajectories, we must develop methods that can both compress trajectory data to enable visualization, and that can yield themselves to further analysis, such as the finding of collective coordinates and reduction of the dynamics. Conventionally, clustering has been the most popular MD trajectory analysis technique, followed by principal component analysis (PCA). Simple clustering used in MD trajectory analysis suffers from various serious drawbacks, namely, (i) it is not data driven, (ii) it is unstable to noise and change in cutoff parameters, and (iii) since it does not take into account interrelationships amongst data points, the separation of data into clusters can often be artificial. Usually, partitions generated by clustering techniques are validated visually, but such validation is not possible for MD trajectories of protein folding, as the underlying structural transitions are not well understood. Rigorous cluster validation techniques may be adapted, but it is more crucial to reduce the dimensions in which MD trajectories reside, while still preserving their salient features. PCA has often been used for dimension reduction and while it is computationally inexpensive, being a linear method, it does not achieve good data compression. In this thesis, I propose a different method, a nonmetric multidimensional scaling (nMDS) technique, which achieves superior data compression by virtue of being nonlinear, and also provides a clear insight into the structural processes underlying MD trajectories. I illustrate the capabilities of nMDS by analyzing three complete villin headpiece folding and six norleucine mutant (NLE) folding trajectories simulated by Freddolino and Schulten [1]. Using these trajectories, I make comparisons between nMDS, PCA and clustering to demonstrate the superiority of nMDS. The three villin headpiece trajectories showed great structural heterogeneity. Apart from a few trivial features like early formation of secondary structure, no commonalities between trajectories were found. There were no units of residues or atoms found moving in concert across the trajectories. A flipping transition, corresponding to the flipping of helix 1 relative to the plane formed by helices 2 and 3 was observed towards the end of the folding process in all trajectories, when nearly all native contacts had been formed. However, the transition occurred through a different series of steps in all trajectories, indicating that it may not be a common transition in villin folding. The trajectories showed competition between local structure formation/hydrophobic collapse and global structure formation in all trajectories. Our analysis on the NLE trajectories confirms the notion that a tight hydrophobic core inhibits correct 3-D rearrangement. Only one of the six NLE trajectories folded, and it showed no flipping transition. All the other trajectories get trapped in hydrophobically collapsed states. The NLE residues were found to be buried deeply into the core, compared to the corresponding lysines in the villin headpiece, thereby making the core tighter and harder to undo for 3-D rearrangement. Our results suggest that the NLE may not be a fast folder as experiments suggest. The tightness of the hydrophobic core may be a very important factor in the folding of larger proteins. It is likely that chaperones like GroEL act to undo the tight hydrophobic core of proteins, after most secondary structure elements have been formed, so that global rearrangement is easier. I conclude by presenting facts about chaperone-protein complexes and propose further directions for the study of protein folding.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

OBJECTIVES: To develop and validate a wandering typology. ---------- DESIGN: Cross-sectional, correlational descriptive design. ---------- SETTING:: Twenty-two nursing homes and six assisted living facilities. ---------- PARTICIPANTS: One hundred forty-two residents with dementia who spoke English, met Diagnostic and Statistical Manual for Mental Disorders, Fourth Edition, criteria for dementia, scored less than 24 on the Mini-Mental State Examination (MMSE), were ambulatory (with or without assistive device), and maintained a stable regime of psychotropic medications were studied. ---------- MEASUREMENTS: Data on wandering were collected using direct observations, plotted serially according to rate and duration to yield 21 parameters, and reduced through factor analysis to four components: high rate, high duration, low to moderate rate and duration, and time of day. Other measures included the MMSE, Minimum Data Set 2.0 mobility items, Cumulative Illness Rating Scale—Geriatric, and tympanic body temperature readings. ---------- RESULTS: Three groups of wanderers were identified through cluster analysis: classic, moderate, and subclinical. MMSE, mobility, and cardiac and upper and lower gastrointestinal problems differed between groups of wanderers and in comparison with nonwanderers. ---------- CONCLUSION: Results have implications for improving identification of wanderers and treatment of possible contributing factors.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Microbial pollution in water periodically affects human health in Australia, particularly in times of drought and flood. There is an increasing need for the control of waterborn microbial pathogens. Methods, allowing the determination of the origin of faecal contamination in water, are generally referred to as Microbial Source Tracking (MST). Various approaches have been evaluated as indicatorsof microbial pathogens in water samples, including detection of different microorganisms and various host-specific markers. However, until today there have been no universal MST methods that could reliably determine the source (human or animal) of faecal contamination. Therefore, the use of multiple approaches is frequently advised. MST is currently recognised as a research tool, rather than something to be included in routine practices. The main focus of this research was to develop novel and universally applicable methods to meet the demands for MST methods in routine testing of water samples. Escherichia coli was chosen initially as the object organism for our studies as, historically and globally, it is the standard indicator of microbial contamination in water. In this thesis, three approaches are described: single nucleotide polymorphism (SNP) genotyping, clustered regularly interspaced short palindromic repeats (CRISPR) screening using high resolution melt analysis (HRMA) methods and phage detection development based on CRISPR types. The advantage of the combination SNP genotyping and CRISPR genes has been discussed in this study. For the first time, a highly discriminatory single nucleotide polymorphism interrogation of E. coli population was applied to identify the host-specific cluster. Six human and one animal-specific SNP profile were revealed. SNP genotyping was successfully applied in the field investigations of the Coomera watershed, South-East Queensland, Australia. Four human profiles [11], [29], [32] and [45] and animal specific SNP profile [7] were detected in water. Two human-specific profiles [29] and [11] were found to be prevalent in the samples over a time period of years. The rainfall (24 and 72 hours), tide height and time, general land use (rural, suburban), seasons, distance from the river mouth and salinity show a lack of relashionship with the diversity of SNP profiles present in the Coomera watershed (p values > 0.05). Nevertheless, SNP genotyping method is able to identify and distinquish between human- and non-human specific E. coli isolates in water sources within one day. In some samples, only mixed profiles were detected. To further investigate host-specificity in these mixed profiles CRISPR screening protocol was developed, to be used on the set of E. coli, previously analysed for SNP profiles. CRISPR loci, which are the pattern of previous DNA coliphages attacks, were considered to be a promising tool for detecting host-specific markers in E. coli. Spacers in CRISPR loci could also reveal the dynamics of virulence in E. coli as well in other pathogens in water. Despite the fact that host-specificity was not observed in the set of E. coli analysed, CRISPR alleles were shown to be useful in detection of the geographical site of sources. HRMA allows determination of ‘different’ and ‘same’ CRISPR alleles and can be introduced in water monitoring as a cost-effective and rapid method. Overall, we show that the identified human specific SNP profiles [11], [29], [32] and [45] can be useful as marker genotypes globally for identification of human faecal contamination in water. Developed in the current study, the SNP typing approach can be used in water monitoring laboratories as an inexpensive, high-throughput and easy adapted protocol. The unique approach based on E. coli spacers for the search for unknown phage was developed to examine the host-specifity in phage sequences. Preliminary experiments on the recombinant plasmids showed the possibility of using this method for recovering phage sequences. Future studies will determine the host-specificity of DNA phage genotyping as soon as first reliable sequences can be acquired. No doubt, only implication of multiple approaches in MST will allow identification of the character of microbial contamination with higher confidence and readability.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Breast cancer incidence and mortality rates are increasing despite our current knowledge on the disease. Ninety-five percent of breast cancer cases correspond to sporadic forms of the disease and are believed to involve an interaction between environmental and genetic determinants. The microRNA 17–92 cluster host gene (MIR17HG) has been shown to regulate expression of genes involved in breast cancer development and progression. Study of single-nucleotide polymorphisms (SNPs) located in this cluster gene could help provide a further understanding of its role in breast cancer. Therefore, this study investigated six SNPs in the MIR17HG using two independent Australian Caucasian case–control populations (GRC-BC and GU-CCQ BB populations) to determine association to breast cancer susceptibility. Genotyping was undertaken using chip-based matrix assisted laser desorption ionisation time-of-flight (MALDI-TOF) mass spectrometry (MS). We found significant association between rs4824505 and breast cancer at the allelic level in both study cohorts (GRC-BC p = 0.01 and GU-CCQ BB p = 0.03). Furthermore, haplotypic analysis of results from our combined population determined a significant association between rs4824505/rs7336610 and breast cancer susceptibility (p = 5 × 10−4). Our study is the first to show that the A allele of rs4824505 and the AC haplotype of rs4824505/rs7336610 are associated with risk of breast cancer development. However, definitive validation of this finding requires larger cohorts or populations in different ethnical backgrounds. Finally, functional studies of these SNPs could provide a deeper understanding of the role that MIR17HG plays in the pathophysiology of breast cancer.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Le travail de modélisation a été réalisé à travers EGSnrc, un logiciel développé par le Conseil National de Recherche Canada.