998 resultados para cluster validation
Resumo:
Clustering quality or validation indices allow the evaluation of the quality of clustering in order to support the selection of a specific partition or clustering structure in its natural unsupervised environment, where the real solution is unknown or not available. In this paper, we investigate the use of quality indices mostly based on the concepts of clusters` compactness and separation, for the evaluation of clustering results (partitions in particular). This work intends to offer a general perspective regarding the appropriate use of quality indices for the purpose of clustering evaluation. After presenting some commonly used indices, as well as indices recently proposed in the literature, key issues regarding the practical use of quality indices are addressed. A general methodological approach is presented which considers the identification of appropriate indices thresholds. This general approach is compared with the simple use of quality indices for evaluating a clustering solution.
Resumo:
Data clustering is applied to various fields such as data mining, image processing and pattern recognition technique. Clustering algorithms splits a data set into clusters such that elements within the same cluster have a high degree of similarity, while elements belonging to different clusters have a high degree of dissimilarity. The Fuzzy C-Means Algorithm (FCM) is a fuzzy clustering algorithm most used and discussed in the literature. The performance of the FCM is strongly affected by the selection of the initial centers of the clusters. Therefore, the choice of a good set of initial cluster centers is very important for the performance of the algorithm. However, in FCM, the choice of initial centers is made randomly, making it difficult to find a good set. This paper proposes three new methods to obtain initial cluster centers, deterministically, the FCM algorithm, and can also be used in variants of the FCM. In this work these initialization methods were applied in variant ckMeans.With the proposed methods, we intend to obtain a set of initial centers which are close to the real cluster centers. With these new approaches startup if you want to reduce the number of iterations to converge these algorithms and processing time without affecting the quality of the cluster or even improve the quality in some cases. Accordingly, cluster validation indices were used to measure the quality of the clusters obtained by the modified FCM and ckMeans algorithms with the proposed initialization methods when applied to various data sets
Resumo:
The attributes describing a data set may often be arranged in meaningful subsets, each of which corresponds to a different aspect of the data. An unsupervised algorithm (SCAD) that simultaneously performs fuzzy clustering and aspects weighting was proposed in the literature. However, SCAD may fail and halt given certain conditions. To fix this problem, its steps are modified and then reordered to reduce the number of parameters required to be set by the user. In this paper we prove that each step of the resulting algorithm, named ASCAD, globally minimizes its cost-function with respect to the argument being optimized. The asymptotic analysis of ASCAD leads to a time complexity which is the same as that of fuzzy c-means. A hard version of the algorithm and a novel validity criterion that considers aspect weights in order to estimate the number of clusters are also described. The proposed method is assessed over several artificial and real data sets.
Resumo:
Due to the rapid advances in computing and sensing technologies, enormous amounts of data are being generated everyday in various applications. The integration of data mining and data visualization has been widely used to analyze these massive and complex data sets to discover hidden patterns. For both data mining and visualization to be effective, it is important to include the visualization techniques in the mining process and to generate the discovered patterns for a more comprehensive visual view. In this dissertation, four related problems: dimensionality reduction for visualizing high dimensional datasets, visualization-based clustering evaluation, interactive document mining, and multiple clusterings exploration are studied to explore the integration of data mining and data visualization. In particular, we 1) propose an efficient feature selection method (reliefF + mRMR) for preprocessing high dimensional datasets; 2) present DClusterE to integrate cluster validation with user interaction and provide rich visualization tools for users to examine document clustering results from multiple perspectives; 3) design two interactive document summarization systems to involve users efforts and generate customized summaries from 2D sentence layouts; and 4) propose a new framework which organizes the different input clusterings into a hierarchical tree structure and allows for interactive exploration of multiple clustering solutions.
Resumo:
Microsecond long Molecular Dynamics (MD) trajectories of biomolecular processes are now possible due to advances in computer technology. Soon, trajectories long enough to probe dynamics over many milliseconds will become available. Since these timescales match the physiological timescales over which many small proteins fold, all atom MD simulations of protein folding are now becoming popular. To distill features of such large folding trajectories, we must develop methods that can both compress trajectory data to enable visualization, and that can yield themselves to further analysis, such as the finding of collective coordinates and reduction of the dynamics. Conventionally, clustering has been the most popular MD trajectory analysis technique, followed by principal component analysis (PCA). Simple clustering used in MD trajectory analysis suffers from various serious drawbacks, namely, (i) it is not data driven, (ii) it is unstable to noise and change in cutoff parameters, and (iii) since it does not take into account interrelationships amongst data points, the separation of data into clusters can often be artificial. Usually, partitions generated by clustering techniques are validated visually, but such validation is not possible for MD trajectories of protein folding, as the underlying structural transitions are not well understood. Rigorous cluster validation techniques may be adapted, but it is more crucial to reduce the dimensions in which MD trajectories reside, while still preserving their salient features. PCA has often been used for dimension reduction and while it is computationally inexpensive, being a linear method, it does not achieve good data compression. In this thesis, I propose a different method, a nonmetric multidimensional scaling (nMDS) technique, which achieves superior data compression by virtue of being nonlinear, and also provides a clear insight into the structural processes underlying MD trajectories. I illustrate the capabilities of nMDS by analyzing three complete villin headpiece folding and six norleucine mutant (NLE) folding trajectories simulated by Freddolino and Schulten [1]. Using these trajectories, I make comparisons between nMDS, PCA and clustering to demonstrate the superiority of nMDS. The three villin headpiece trajectories showed great structural heterogeneity. Apart from a few trivial features like early formation of secondary structure, no commonalities between trajectories were found. There were no units of residues or atoms found moving in concert across the trajectories. A flipping transition, corresponding to the flipping of helix 1 relative to the plane formed by helices 2 and 3 was observed towards the end of the folding process in all trajectories, when nearly all native contacts had been formed. However, the transition occurred through a different series of steps in all trajectories, indicating that it may not be a common transition in villin folding. The trajectories showed competition between local structure formation/hydrophobic collapse and global structure formation in all trajectories. Our analysis on the NLE trajectories confirms the notion that a tight hydrophobic core inhibits correct 3-D rearrangement. Only one of the six NLE trajectories folded, and it showed no flipping transition. All the other trajectories get trapped in hydrophobically collapsed states. The NLE residues were found to be buried deeply into the core, compared to the corresponding lysines in the villin headpiece, thereby making the core tighter and harder to undo for 3-D rearrangement. Our results suggest that the NLE may not be a fast folder as experiments suggest. The tightness of the hydrophobic core may be a very important factor in the folding of larger proteins. It is likely that chaperones like GroEL act to undo the tight hydrophobic core of proteins, after most secondary structure elements have been formed, so that global rearrangement is easier. I conclude by presenting facts about chaperone-protein complexes and propose further directions for the study of protein folding.
Resumo:
This journal provides immediate open access to its content on the principle that making research freely available to the public supports a greater global exchange of knowledge.
Resumo:
International Scientific Forum, ISF 2013, ISF 2013, 12-14 December 2013, Tirana.
Resumo:
While Cluster-Tree network topologies look promising for WSN applications with timeliness and energy-efficiency requirements, we are yet to witness its adoption in commercial and academic solutions. One of the arguments that hinder the use of these topologies concerns the lack of flexibility in adapting to changes in the network, such as in traffic flows. This paper presents a solution to enable these networks with the ability to self-adapt their clusters’ duty-cycle and scheduling, to provide increased quality of service to multiple traffic flows. Importantly, our approach enables a network to change its cluster scheduling without requiring long inaccessibility times or the re-association of the nodes. We show how to apply our methodology to the case of IEEE 802.15.4/ZigBee cluster-tree WSNs without significant changes to the protocol. Finally, we analyze and demonstrate the validity of our methodology through a comprehensive simulation and experimental validation using commercially available technology on a Structural Health Monitoring application scenario.
Resumo:
In this work, cluster analysis is applied to a real dataset of biological features of several Portuguese reservoirs. All the statistical analysis is done using R statistical software. Several metrics and methods were explored, as well as the combination of Euclidean metric and the hierarchical Ward method. Although it did not present the best combination in terms of internal and stability validation, it was still a good solution and presented good results in terms of interpretation of the problem at hand.
Resumo:
BACKGROUND: Adequate pain assessment is critical for evaluating the efficacy of analgesic treatment in clinical practice and during the development of new therapies. Yet the currently used scores of global pain intensity fail to reflect the diversity of pain manifestations and the complexity of underlying biological mechanisms. We have developed a tool for a standardized assessment of pain-related symptoms and signs that differentiates pain phenotypes independent of etiology. METHODS AND FINDINGS: Using a structured interview (16 questions) and a standardized bedside examination (23 tests), we prospectively assessed symptoms and signs in 130 patients with peripheral neuropathic pain caused by diabetic polyneuropathy, postherpetic neuralgia, or radicular low back pain (LBP), and in 57 patients with non-neuropathic (axial) LBP. A hierarchical cluster analysis revealed distinct association patterns of symptoms and signs (pain subtypes) that characterized six subgroups of patients with neuropathic pain and two subgroups of patients with non-neuropathic pain. Using a classification tree analysis, we identified the most discriminatory assessment items for the identification of pain subtypes. We combined these six interview questions and ten physical tests in a pain assessment tool that we named Standardized Evaluation of Pain (StEP). We validated StEP for the distinction between radicular and axial LBP in an independent group of 137 patients. StEP identified patients with radicular pain with high sensitivity (92%; 95% confidence interval [CI] 83%-97%) and specificity (97%; 95% CI 89%-100%). The diagnostic accuracy of StEP exceeded that of a dedicated screening tool for neuropathic pain and spinal magnetic resonance imaging. In addition, we were able to reproduce subtypes of radicular and axial LBP, underscoring the utility of StEP for discerning distinct constellations of symptoms and signs. CONCLUSIONS: We present a novel method of identifying pain subtypes that we believe reflect underlying pain mechanisms. We demonstrate that this new approach to pain assessment helps separate radicular from axial back pain. Beyond diagnostic utility, a standardized differentiation of pain subtypes that is independent of disease etiology may offer a unique opportunity to improve targeted analgesic treatment.
Resumo:
Natural fluctuations in soil microbial communities are poorly documented because of the inherent difficulty to perform a simultaneous analysis of the relative abundances of multiple populations over a long time period. Yet, it is important to understand the magnitudes of community composition variability as a function of natural influences (e.g., temperature, plant growth, or rainfall) because this forms the reference or baseline against which external disturbances (e.g., anthropogenic emissions) can be judged. Second, definition of baseline fluctuations in complex microbial communities may help to understand at which point the systems become unbalanced and cannot return to their original composition. In this paper, we examined the seasonal fluctuations in the bacterial community of an agricultural soil used for regular plant crop production by using terminal restriction fragment length polymorphism profiling (T-RFLP) of the amplified 16S ribosomal ribonucleic acid (rRNA) gene diversity. Cluster and statistical analysis of T-RFLP data showed that soil bacterial communities fluctuated very little during the seasons (similarity indices between 0.835 and 0.997) with insignificant variations in 16S rRNA gene richness and diversity indices. Despite overall insignificant fluctuations, between 8 and 30% of all terminal restriction fragments changed their relative intensity in a significant manner among consecutive time samples. To determine the magnitude of community variations induced by external factors, soil samples were subjected to either inoculation with a pure bacterial culture, addition of the herbicide mecoprop, or addition of nutrients. All treatments resulted in statistically measurable changes of T-RFLP profiles of the communities. Addition of nutrients or bacteria plus mecoprop resulted in bacteria composition, which did not return to the original profile within 14 days. We propose that at less than 70% similarity in T-RFLP, the bacterial communities risk to drift apart to inherently different states.
Resumo:
Background Chronic obstructive pulmonary disease (COPD) is increasingly considered a heterogeneous condition. It was hypothesised that COPD, as currently defined, includes different clinically relevant subtypes. Methods To identify and validate COPD subtypes, 342 subjects hospitalised for the first time because of a COPD exacerbation were recruited. Three months after discharge, when clinically stable, symptoms and quality of life, lung function, exercise capacity, nutritional status, biomarkers of systemic and bronchial inflammation, sputum microbiology, CT of the thorax and echocardiography were assessed. COPD groups were identified by partitioning cluster analysis and validated prospectively against cause-specific hospitalisations and all-cause mortality during a 4 year follow-up. Results Three COPD groups were identified: group 1 (n ¼ 126, 67 years) was characterised by severe airflow limitation (postbronchodilator forced expiratory volume in 1 s (FEV 1 ) 38% predicted) and worse performance in most of the respiratory domains of the disease; group 2 (n ¼ 125, 69 years) showed milder airflow limitation (FEV 1 63% predicted); and group 3 (n ¼ 91, 67 years) combined a similarly milder airflow limitation (FEV 1 58% predicted) with a high proportion of obesity, cardiovascular disorders, iabetes and systemic inflammation. During follow-up, group 1 had more frequent hospitalisations due to COPD (HR 3.28, p < 0.001) and higher all-cause mortality (HR 2.36, p ¼ 0.018) than the other two groups, whereas group 3 had more admissions due to cardiovascular disease (HR 2.87, p ¼ 0.014). Conclusions In patients with COPD recruited at their first hospitalisation, three different COPD subtypes were identified and prospectively validated:"severe respiratory COPD","moderate respiratory COPD", and"systemic COPD'
Resumo:
Background Chronic obstructive pulmonary disease (COPD) is increasingly considered a heterogeneous condition. It was hypothesised that COPD, as currently defined, includes different clinically relevant subtypes. Methods To identify and validate COPD subtypes, 342 subjects hospitalised for the first time because of a COPD exacerbation were recruited. Three months after discharge, when clinically stable, symptoms and quality of life, lung function, exercise capacity, nutritional status, biomarkers of systemic and bronchial inflammation, sputum microbiology, CT of the thorax and echocardiography were assessed. COPD groups were identified by partitioning cluster analysis and validated prospectively against cause-specific hospitalisations and all-cause mortality during a 4 year follow-up. Results Three COPD groups were identified: group 1 (n ¼ 126, 67 years) was characterised by severe airflow limitation (postbronchodilator forced expiratory volume in 1 s (FEV 1 ) 38% predicted) and worse performance in most of the respiratory domains of the disease; group 2 (n ¼ 125, 69 years) showed milder airflow limitation (FEV 1 63% predicted); and group 3 (n ¼ 91, 67 years) combined a similarly milder airflow limitation (FEV 1 58% predicted) with a high proportion of obesity, cardiovascular disorders, iabetes and systemic inflammation. During follow-up, group 1 had more frequent hospitalisations due to COPD (HR 3.28, p < 0.001) and higher all-cause mortality (HR 2.36, p ¼ 0.018) than the other two groups, whereas group 3 had more admissions due to cardiovascular disease (HR 2.87, p ¼ 0.014). Conclusions In patients with COPD recruited at their first hospitalisation, three different COPD subtypes were identified and prospectively validated:"severe respiratory COPD","moderate respiratory COPD", and"systemic COPD'
Resumo:
Le travail de modélisation a été réalisé à travers EGSnrc, un logiciel développé par le Conseil National de Recherche Canada.
Resumo:
Resumen tomado de la publicación. Con el apoyo económico del departamento MIDE de la UNED