925 results for Data clustering. Fuzzy C-Means. Cluster centers initialization. Validation indices


Relevance:

50.00% 50.00%

Publisher:

Abstract:

Exascale systems are the next frontier in high-performance computing and are expected to deliver a performance of the order of 10^18 operations per second using massive multicore processors. Very large- and extreme-scale parallel systems pose critical algorithmic challenges, especially related to concurrency, locality and the need to avoid global communication patterns. This work investigates a novel protocol for dynamic group communication that can be used to remove the global communication requirement and to reduce the communication cost in parallel formulations of iterative data mining algorithms. The protocol is used to provide a communication-efficient parallel formulation of the k-means algorithm for cluster analysis. The approach is based on a collective communication operation for dynamic groups of processes and exploits non-uniform data distributions. Non-uniform data distributions can be either found in real-world distributed applications or induced by means of multidimensional binary search trees. The analysis of the proposed dynamic group communication protocol has shown that it does not introduce significant communication overhead. The parallel clustering algorithm has also been extended to accommodate an approximation error, which allows a further reduction of the communication costs. The effectiveness of the exact and approximate methods has been tested in a parallel computing system with 64 processors and in simulations with 1024 processing elements.

Relevance:

50.00% 50.00%

Publisher:

Abstract:

The global communication requirements and load imbalance of some parallel data mining algorithms are the major obstacles to exploiting the computational power of large-scale systems. This work investigates how non-uniform data distributions can be exploited to remove the global communication requirement and to reduce the communication cost in iterative parallel data mining algorithms. In particular, the analysis focuses on one of the most influential and popular data mining methods, the k-means algorithm for cluster analysis. The straightforward parallel formulation of the k-means algorithm requires a global reduction operation at each iteration step, which hinders its scalability. This work studies a different parallel formulation of the algorithm in which the requirement of global communication can be relaxed while still providing the exact solution of the centralised k-means algorithm. The proposed approach exploits a non-uniform data distribution, which can either be found in real-world distributed applications or be induced by means of multidimensional binary search trees. The approach can also be extended to accommodate an approximation error, which allows a further reduction of the communication costs.
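
The "straightforward parallel formulation" mentioned in this abstract can be illustrated with a minimal single-machine sketch. It only simulates the pattern: each partition stands in for one process, and the hypothetical `global_reduce` function stands in for the global reduction performed at every iteration. It is not the communication-relaxed protocol studied in the work, and all function names are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def nearest(point: Point, centroids: Sequence[Point]) -> int:
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])),
    )


def local_stats(partition: Sequence[Point], centroids: Sequence[Point]):
    """Per-process work: partial coordinate sums and point counts per cluster."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in partition:
        j = nearest(point, centroids)
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += point[d]
    return sums, counts


def global_reduce(stats):
    """Stand-in for the global reduction: element-wise sum over all processes."""
    all_sums, all_counts = zip(*stats)
    k, dim = len(all_counts[0]), len(all_sums[0][0])
    sums = [[sum(s[j][d] for s in all_sums) for d in range(dim)] for j in range(k)]
    counts = [sum(c[j] for c in all_counts) for j in range(k)]
    return sums, counts


def kmeans_step(partitions: Sequence[Sequence[Point]],
                centroids: Sequence[Point]) -> List[Point]:
    """One iteration: local statistics, global reduction, centroid update."""
    sums, counts = global_reduce([local_stats(p, centroids) for p in partitions])
    # An empty cluster keeps its previous centroid (one of several conventions).
    return [
        tuple(v / n for v in sums[j]) if n else tuple(centroids[j])
        for j, n in enumerate(counts)
    ]
```

Because the reduction sums exact per-cluster statistics, one step over several partitions gives the same centroids as one step over the merged data, which is the sense in which the parallel formulation reproduces the centralised algorithm.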

Relevance:

50.00% 50.00%

Publisher:

Abstract:

Background: The validity of ensemble averaging on event-related potential (ERP) data has been questioned because it assumes that the ERP is identical across trials. Thus, there is a need for preliminary testing for cluster structure in the data. New method: We propose a complete pipeline for the cluster analysis of ERP data. To increase the signal-to-noise ratio (SNR) of the raw single trials, we used a denoising method based on Empirical Mode Decomposition (EMD). Next, we used a bootstrap-based method to determine the number of clusters, through a measure called the Stability Index (SI). We then used a clustering algorithm based on a Genetic Algorithm (GA) to define initial cluster centroids for subsequent k-means clustering. Finally, we visualised the clustering results through a scheme based on Principal Component Analysis (PCA). Results: After validating the pipeline on simulated data, we tested it on data from two experiments: a P300 speller paradigm on a single subject and a language processing study on 25 subjects. Results revealed evidence for the existence of 6 clusters in one experimental condition from the language processing study. Further, a two-way chi-square test revealed an influence of subject on cluster membership.

Relevance:

50.00% 50.00%

Publisher:

Abstract:

Clustering is a difficult task: there is no single cluster definition and the data can have more than one underlying structure. Pareto-based multi-objective genetic algorithms (e.g., MOCK, Multi-Objective Clustering with automatic K-determination, and MOCLE, Multi-Objective Clustering Ensemble) were proposed to tackle these problems. However, the output of such algorithms can often contain a large number of partitions, making it difficult for an expert to analyze all of them manually. To deal with this problem, we present two selection strategies, both based on the corrected Rand index, to choose a subset of solutions. To test them, they are applied to the sets of solutions produced by MOCK and MOCLE on several datasets. The study was also extended to select a reduced set of partitions from the initial population of MOCLE. These analyses show that both versions of the proposed selection strategy are very effective. They can significantly reduce the number of solutions and, at the same time, preserve the quality and the diversity of the partitions in the original set of solutions. (C) 2010 Elsevier B.V. All rights reserved.
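
The corrected Rand index used to compare partitions in this abstract can be computed as in the sketch below, which follows the standard Hubert-Arabie (adjusted Rand) formulation. The abstract does not describe the selection strategies themselves, so only the index is shown; the function name `corrected_rand` is an assumption.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def corrected_rand(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Corrected (adjusted) Rand index between two partitions of the same objects."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same objects")
    n = len(labels_a)
    if n < 2:
        return 1.0  # convention: no pairs to disagree on

    # Pairs of objects placed together in both partitions (contingency cells).
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each partition separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / comb(n, 2)  # value expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions have a single cluster
    return (together_both - expected) / (maximum - expected)
```

Identical partitions score 1 regardless of how the clusters are named, while values near 0 indicate agreement no better than chance, which is what makes the index usable as a diversity measure among candidate partitions.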

Relevance:

50.00% 50.00%

Publisher:

Abstract:

This paper proposes a filter-based algorithm for feature selection. The filter is based on the partitioning of the set of features into clusters. The number of clusters, and consequently the cardinality of the subset of selected features, is automatically estimated from data. The computational complexity of the proposed algorithm is also investigated. A variant of this filter that considers feature-class correlations is also proposed for classification problems. Empirical results involving ten datasets illustrate the performance of the developed algorithm, which in general has obtained competitive results in terms of classification accuracy when compared to state-of-the-art algorithms that find clusters of features. We show that, if computational efficiency is an important issue, then the proposed filter may be preferred over its counterparts, thus becoming eligible to join a pool of feature selection algorithms to be used in practice. As an additional contribution of this work, a theoretical framework is used to formally analyze some properties of feature selection methods that rely on finding clusters of features. (C) 2011 Elsevier Inc. All rights reserved.

Relevance:

50.00% 50.00%

Publisher:

Abstract:

A conceptual problem that appears in different contexts of clustering analysis is that of measuring the degree of compatibility between two sequences of numbers. This problem is usually addressed by means of numerical indexes referred to as sequence correlation indexes. This paper elaborates on why some specific sequence correlation indexes may not be good choices, depending on the application scenario at hand. A variant of the Product-Moment correlation coefficient and a weighted formulation of the Goodman-Kruskal and Kendall's indexes are derived that may be more appropriate for some particular application scenarios. The proposed and existing indexes are analyzed from different perspectives, such as their sensitivity to the ranks and magnitudes of the sequences under evaluation, among other relevant aspects of the problem. The results help to suggest scenarios within the context of clustering analysis that are possibly more appropriate for the application of each index. (C) 2008 Elsevier Inc. All rights reserved.
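
For orientation, the classical (unweighted) Goodman-Kruskal gamma and Kendall tau-a can be sketched as follows. These are the standard rank-based indexes, not the weighted formulations derived in the paper, which the abstract does not specify; the function names are assumptions.

```python
from itertools import combinations
from typing import Sequence, Tuple


def concordance_counts(x: Sequence[float], y: Sequence[float]) -> Tuple[int, int]:
    """Count concordant and discordant pairs; tied pairs count as neither."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return concordant, discordant


def goodman_kruskal_gamma(x: Sequence[float], y: Sequence[float]) -> float:
    """(C - D) / (C + D): tied pairs are ignored entirely."""
    c, d = concordance_counts(x, y)
    if c + d == 0:
        raise ValueError("gamma is undefined when every pair is tied")
    return (c - d) / (c + d)


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """(C - D) / (n(n-1)/2): tied pairs remain in the denominator."""
    c, d = concordance_counts(x, y)
    n = len(x)
    if n < 2:
        raise ValueError("at least two observations are required")
    return (c - d) / (n * (n - 1) / 2)
```

Both indexes depend only on the ordering of the values, not on their magnitudes, and they differ in how ties are treated: for `x = [1, 2, 2, 3]` and `y = [1, 2, 3, 3]`, gamma is 1 while tau-a is 4/6. Sensitivity to ranks versus magnitudes is one of the aspects the abstract says the paper analyzes.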

Relevance:

50.00% 50.00%

Publisher:

Abstract:

The frequency of adenine mononucleotides (A), dinucleotides (AA) and clusters, and the positions of clusters, were studied in 502 molecules of 5S rRNA. All frequencies were reduced in the evolutionary lineages of vertebrates, plants and fungi, in parallel with increasing organismic complexity. No change was observed in invertebrates. All frequencies were increased in mitochondria, plastids and mycoplasmas. The presumed relatives of the ancestors of the organelles, Rhodobacteria alpha and Cyanobacteria, showed intermediate values relative to the eubacterial averages. Firmibacteria showed a very high number of cluster sites. Clusters were more frequent in single-stranded regions in all organisms. The evolutionary routes of organelles and mycoplasmas accumulated clusters at faster rates in double-stranded regions. Rates of change were higher for AA and clusters than for A in plants, vertebrates and organelles, higher for cluster sites and A in mycoplasmas, and higher for AA and A in fungi. These data indicated that selection pressures acted more strongly on adenine clustering than on adenine frequency. It is proposed that AA and clusters, as sites of lower informational content, have the property of tolerating positional variation in the sites of other molecules (or other regions of the same molecule) that interact with the adenines. This reasoning was consistent with the degrees of genic polymorphism, low in plants and vertebrates and high in invertebrates. In the eubacteria that are endosymbiotic or parasitic to eukaryotes, the more tolerant RNA would be better adapted to interactions with the homologous nucleus-derived ribosomal proteins; the intermediate values observed in their precursors were interpreted as preadaptive. Among other groups, only the Deinococcus-Thermus eubacteria showed excessive AA and cluster contents, possibly related to their peculiar tolerance to mutagens, and the Ciliates showed excessive AA contents, indicative of retention of primitive characters.

Relevance:

50.00% 50.00%

Publisher:

Abstract:

A methodology for pipeline leakage detection using a combination of clustering and classification tools for fault detection is presented here. A fuzzy system is used to classify the running mode and identify the operational and process transients. The relationship between these transients and the mass balance deviation is discussed. This strategy allows for better identification of the leakage because the thresholds are adjusted by the fuzzy system as a function of the running mode and the classified transient level. The fuzzy system is initially trained off-line with a modified data set that includes simulated leakages. The methodology is applied to a small-scale LPG pipeline monitoring case where portability, robustness and reliability are amongst the most important criteria for the detection system. The results are very encouraging, with relatively low levels of false alarms and increased leakage detection at low computational cost. (c) 2005 Elsevier B.V. All rights reserved.

Relevance:

50.00% 50.00%

Publisher:

Abstract:

A total of 2400 samples of commercial Brazilian C gasoline were collected over a 6-month period from different gas stations in São Paulo state, Brazil, and analysed with respect to 12 physicochemical parameters according to regulation 309 of the Brazilian Government Petroleum, Natural Gas and Biofuels Agency (ANP). The percentages (v/v) of hydrocarbons (olefins, aromatics and saturated) were also determined. Hierarchical cluster analysis (HCA) was employed to select 150 representative samples that exhibited the least similarity on the basis of their physicochemical parameters and hydrocarbon compositions. The chromatographic profiles of the selected samples were measured by gas chromatography with flame ionisation detection and analysed using the soft independent modelling of class analogy (SIMCA) method in order to create a classification scheme that identifies gasolines conforming to ANP regulation 309. Following the optimisation of the SIMCA algorithm, it was possible to classify correctly 96% of the commercial gasoline samples in the training set of 100. In order to check the quality of the model, an external group of 50 gasoline samples (the prediction set) was analysed, and the developed SIMCA model classified 94% of these correctly. The developed chemometric method is recommended for screening commercial gasoline quality and detecting potential adulteration. (c) 2007 Elsevier B.V. All rights reserved.

Relevance:

50.00% 50.00%

Publisher:

Abstract:

Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq)

Relevance:

50.00% 50.00%

Publisher:

Abstract:

Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)

Relevance:

50.00% 50.00%

Publisher:

Abstract:

Background: Transcript enumeration methods such as SAGE, MPSS, and sequencing-by-synthesis EST "digital northern" are important high-throughput techniques for digital gene expression measurement. Like other counting or voting processes, these measurements constitute compositional data exhibiting properties particular to the simplex space, where the summation of the components is constrained. These properties are not present in regular Euclidean spaces, on which hybridization-based microarray data is often modeled. Therefore, pattern recognition methods commonly used for microarray data analysis may be non-informative for the data generated by transcript enumeration techniques, since they ignore certain fundamental properties of this space. Results: Here we present a software tool, Simcluster, designed to perform clustering analysis for data on the simplex space. We present Simcluster as a stand-alone command-line C package and as a user-friendly on-line tool. Both versions are available at: http://xerad.systemsbiology.net/simcluster. Conclusion: Simcluster is designed in accordance with a well-established mathematical framework for compositional data analysis, which provides principled procedures for dealing with the simplex space, and is thus applicable in a number of contexts, including enumeration-based gene expression data.
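
To illustrate why count data on the simplex needs its own geometry, the sketch below computes a distance between two count vectors through the centred log-ratio (clr) transform, assuming the Aitchison framework for compositional data. The abstract does not state which transform or distance Simcluster implements, so this is an illustrative assumption rather than a description of the tool, and the function names are hypothetical.

```python
from math import log, sqrt
from typing import List, Sequence


def closure(counts: Sequence[float]) -> List[float]:
    """Rescale a vector of positive counts so that its parts sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def clr(composition: Sequence[float]) -> List[float]:
    """Centred log-ratio transform; every part must be strictly positive."""
    logs = [log(p) for p in composition]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def aitchison_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between the clr coordinates of two compositions."""
    if len(a) != len(b):
        raise ValueError("compositions must have the same number of parts")
    return sqrt(sum((u - v) ** 2 for u, v in zip(clr(closure(a)), clr(closure(b)))))
```

The distance depends only on the relative proportions, so two libraries sequenced to different depths but with the same composition are at distance zero; a plain Euclidean distance on raw counts would separate them. Zero counts need separate handling (for example a replacement strategy) before taking logarithms, which this sketch does not attempt.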

Relevance:

50.00% 50.00%

Publisher:

Abstract:

The literature offers different ways to perform cluster analysis of categorical data, and the choice among them depends strongly on the aim of the researcher, setting aside time and economic constraints. The main approaches to clustering are usually divided into model-based and distance-based methods: the former assume that objects belonging to the same class are similar in the sense that their observed values come from the same probability distribution, whose parameters are unknown and need to be estimated; the latter evaluate distances among objects using a defined dissimilarity measure and, based on it, allocate units to the closest group. In clustering, one may be interested in classifying similar objects into groups, or in finding observations that come from the same true homogeneous distribution. But do both of these aims lead to the same clustering? And how good are clustering methods designed to fulfil one of these aims in terms of the other? To answer these questions, two approaches, namely a latent class model (mixture of multinomial distributions) and partitioning around medoids, are evaluated and compared using the Adjusted Rand Index, Average Silhouette Width and Pearson-Gamma indexes in a fairly wide simulation study. Simulation outcomes are plotted in two-dimensional graphs via Multidimensional Scaling; the size of each point is proportional to the number of points that overlap, and different colours are used according to cluster membership.

Relevance:

50.00% 50.00%

Publisher:

Abstract:

Cytochrome c oxidase (CcO), complex IV of the respiratory chain, is one of the heme-copper oxidases and plays an important role in cell metabolism. The enzyme contains four prosthetic groups and is located in the inner membrane of mitochondria and in the cell membrane of some aerobic bacteria. CcO catalyses electron transfer (ET) from cytochrome c to O2, with the actual reaction taking place at the binuclear centre (CuB-heme a3). The reduction of O2 to two H2O consumes four protons. In addition, four protons are transported across the membrane, which creates an electrochemical potential difference of these ions between the matrix and the intermembrane space. Despite their importance, membrane proteins such as CcO have still been little studied, which is why the mechanism of the respiratory chain has not yet been fully elucidated. The aim of this work is to contribute to the understanding of the function of CcO. To this end, CcO from Rhodobacter sphaeroides was bound in a defined orientation to a functionalised metal electrode via a His-tag attached to the C-terminus of subunit II. In this orientation the first electron acceptor, CuA, lies closest to the metal surface. A lipid bilayer was then inserted in situ between the bound proteins, yielding the so-called protein-tethered bilayer lipid membrane (ptBLM). The optimal surface concentration of the bound proteins had to be determined for this purpose. Electrochemical impedance spectroscopy (EIS), surface plasmon resonance spectroscopy (SPR) and cyclic voltammetry (CV) were used to characterise the activity of CcO as a function of packing density. The main part of the work concerns the investigation of direct ET to CcO under anaerobic conditions. The combination of time-resolved surface-enhanced infrared absorption spectroscopy (tr-SEIRAS) and electrochemistry proved particularly suitable for this. In a first study, ET was investigated by fast-scan CV, with CVs of non-activated and activated CcO measured at different scan rates. The activated form was obtained after catalytic turnover of the protein in the presence of O2. A four-ET model was developed to analyse the CVs. The method makes it possible to distinguish between the sequential and the independent mechanism of ET to the four centres CuA, heme a, heme a3 and CuB. In addition, the standard redox potentials and the kinetic coefficients of ET can be determined. In a second study, tr-SEIRAS was applied in step-scan mode. Rectangular potential pulses were applied to the CcO, and SEIRAS in ATR mode was used to record spectra at defined time slices. From these spectra, individual bands were isolated that show changes in the vibrational modes of amino acids and peptide groups as a function of the redox state of the centres. On the basis of assignments from the literature, obtained by potentiometric titration of CcO, the bands could be tentatively assigned to the redox centres. The band areas plotted against time then reflect the redox kinetics of the centres and were in turn evaluated with the four-ET model. The results of both studies support the conclusion that ET to CcO in a ptBLM most probably follows the sequential mechanism, which corresponds to the natural ET from cytochrome c to CcO.