65 results for Data clustering. Fuzzy C-Means. Cluster centers initialization. Validation indices


Relevance:

60.00%

Abstract:

The taxonomy of the N2-fixing bacteria belonging to the genus Bradyrhizobium is still poorly refined, mainly due to conflicting results obtained by the analysis of phenotypic and genotypic properties. This paper presents an application of a method aimed at identifying possible new clusters within a Brazilian collection of 119 Bradyrhizobium strains showing phenotypic characteristics of B. japonicum and B. elkanii. Stability was studied as a function of the number of restriction enzymes used in the RFLP-PCR analysis of three ribosomal regions, with three restriction enzymes per region. The proposed method uses clustering algorithms with distances calculated by average-linkage clustering. The stability analysis is performed by introducing perturbations through sub-sampling techniques. The method proved effective in grouping the species B. japonicum and B. elkanii. Furthermore, two new clusters were clearly defined, indicating possible new species, as well as sub-clusters within each detected cluster. (C) 2008 Elsevier B.V. All rights reserved.
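
As a rough illustration of the kind of sub-sampling stability analysis this abstract describes, the sketch below clusters a distance matrix with average linkage, re-clusters random sub-samples, and measures how often pairs of items keep the same co-membership. It is not the authors' method: the function names (`average_linkage`, `rand_agreement`, `stability`), the 80% sub-sample fraction, the number of rounds, and the pairwise agreement score are all assumptions made for illustration.

```python
import random
from itertools import combinations


def average_linkage(dist, items, k):
    """Merge clusters of `items` until k remain, using mean pairwise distance."""
    clusters = [[i] for i in items]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
            d /= len(clusters[a]) * len(clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best  # a < b, so deleting b leaves index a valid
        clusters[a].extend(clusters[b])
        del clusters[b]
    return {i: c for c, members in enumerate(clusters) for i in members}


def rand_agreement(labels_1, labels_2, items):
    """Fraction of item pairs on which two labelings agree (together or apart)."""
    pairs = list(combinations(items, 2))
    same = sum(
        (labels_1[i] == labels_1[j]) == (labels_2[i] == labels_2[j])
        for i, j in pairs
    )
    return same / len(pairs)


def stability(dist, k, n_rounds=20, fraction=0.8, seed=0):
    """Mean agreement between the full clustering and sub-sampled clusterings.

    Assumes the data set has more than k items (hypothetical helper).
    """
    rng = random.Random(seed)
    items = list(range(len(dist)))
    reference = average_linkage(dist, items, k)
    scores = []
    for _ in range(n_rounds):
        size = max(k + 1, int(fraction * len(items)))
        sample = rng.sample(items, size)
        perturbed = average_linkage(dist, sample, k)
        scores.append(rand_agreement(reference, perturbed, sample))
    return sum(scores) / len(scores)
```

A score close to 1 for a given number of clusters suggests that the partition survives the perturbation; comparing scores across several values of `k` is one common way to use such a measure.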

Relevance:

60.00%

Abstract:

A large amount of biological data has been produced in recent years. Important knowledge can be extracted from these data through data analysis techniques. Clustering plays an important role in data analysis by organizing similar objects from a dataset into meaningful groups. Several clustering algorithms have been proposed in the literature. However, each algorithm has its own bias and is better suited to particular datasets. This paper presents a mathematical formulation to support the creation of consistent clusters for biological data. Moreover, it presents a clustering algorithm that solves this formulation using GRASP (Greedy Randomized Adaptive Search Procedure). We compared the proposed algorithm with three other well-known algorithms. The proposed algorithm produced the best clustering results, as confirmed statistically. (C) 2009 Elsevier Ltd. All rights reserved.

Relevance:

60.00%

Abstract:

In this paper, we present an algorithm for cluster analysis that integrates aspects of cluster ensembles and multi-objective clustering. The algorithm is based on a Pareto-based multi-objective genetic algorithm with a special crossover operator, which uses clustering validation measures as objective functions. The proposed algorithm can deal with data sets presenting different types of clusters, without requiring expertise in cluster analysis. Its result is a concise set of partitions representing alternative trade-offs among the objective functions. We compare the results obtained with our algorithm, in the context of gene expression data sets, to those achieved with Multi-Objective Clustering with automatic K-determination (MOCK), the algorithm most closely related to ours. (C) 2009 Elsevier B.V. All rights reserved.

Relevance:

50.00%

Abstract:

Today, several different unsupervised classification algorithms are commonly used to cluster similar patterns in a data set based only on its statistical properties. Especially in image data applications, self-organizing methods for unsupervised classification have been successfully applied to cluster pixels or groups of pixels in order to perform segmentation tasks. The first important contribution of this paper is the development of a self-organizing method for data classification, named Enhanced Independent Component Analysis Mixture Model (EICAMM), which was built by proposing some modifications to the Independent Component Analysis Mixture Model (ICAMM). These improvements were proposed by considering some of the model's limitations and by analyzing how it could be made more efficient. Moreover, a pre-processing methodology was also proposed, which combines Sparse Code Shrinkage (SCS) for image denoising with the Sobel edge detector. In the experiments of this work, EICAMM and other self-organizing models were applied to segment images in their original and pre-processed versions. A comparative analysis showed satisfactory and competitive image segmentation results for the proposals presented herein. (C) 2008 Published by Elsevier B.V.

Relevance:

50.00%

Abstract:

The identification, modeling, and analysis of interactions between nodes of neural systems in the human brain have become the focus of many studies in neuroscience. The complex neural network structure and its correlations with brain functions have played a role in all areas of neuroscience, including the comprehension of cognitive and emotional processing. Indeed, understanding how information is stored, retrieved, processed, and transmitted is one of the ultimate challenges in brain research. In this context, in functional neuroimaging, connectivity analysis is a major tool for the exploration and characterization of the information flow between specialized brain regions. In most functional magnetic resonance imaging (fMRI) studies, connectivity analysis is carried out by first selecting regions of interest (ROI) and then calculating an average BOLD time series (across the voxels in each cluster). Some studies have shown that the average may not be a good choice and have suggested, as an alternative, the use of principal component analysis (PCA) to extract the principal eigen-time series from the ROI(s). In this paper, we introduce a novel approach called cluster Granger analysis (CGA) to study connectivity between ROIs. The main aim of this method was to employ multiple eigen-time series in each ROI to avoid temporal information loss during identification of Granger causality. Such information loss is inherent in averaging (e.g., to yield a single "representative" time series per ROI). This, in turn, may lead to a lack of power in detecting connections. The proposed approach is based on multivariate statistical analysis and integrates PCA and partial canonical correlation in a framework of Granger causality for clusters (sets) of time series. We also describe an algorithm for statistical significance testing based on bootstrapping.
By using Monte Carlo simulations, we show that the proposed approach outperforms conventional Granger causality analysis (i.e., using representative time series extracted by signal averaging or first principal components estimation from ROIs). The usefulness of the CGA approach in real fMRI data is illustrated in an experiment using human faces expressing emotions. With this data set, the proposed approach suggested the presence of significantly more connections between the ROIs than were detected using a single representative time series in each ROI. (c) 2010 Elsevier Inc. All rights reserved.

Relevance:

50.00%

Abstract:

Clustering is a difficult task: there is no single cluster definition, and the data can have more than one underlying structure. Pareto-based multi-objective genetic algorithms (e.g., MOCK, Multi-Objective Clustering with automatic K-determination, and MOCLE, Multi-Objective Clustering Ensemble) were proposed to tackle these problems. However, the output of such algorithms can often contain a large number of partitions, making it difficult for an expert to analyze all of them manually. To deal with this problem, we present two selection strategies, based on the corrected Rand index, to choose a subset of solutions. To test them, they are applied to the sets of solutions produced by MOCK and MOCLE for several datasets. The study was also extended to select a reduced set of partitions from the initial population of MOCLE. These analyses show that both versions of the proposed selection strategy are very effective. They can significantly reduce the number of solutions while keeping the quality and the diversity of the partitions in the original set of solutions. (C) 2010 Elsevier B.V. All rights reserved.
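
For readers unfamiliar with the corrected (adjusted) Rand index mentioned in this abstract, the sketch below computes it from the contingency counts of two partitions. The abstract does not describe the two selection strategies themselves, so `select_diverse` is only a hypothetical greedy filter added to show how such an index could be used to thin out near-duplicate partitions; its name and the 0.9 threshold are assumptions.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected Rand index between two labelings of the same objects."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same objects")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two objects: nothing to disagree on
    # Pairs placed together in both partitions, in A only, and in B only.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_rows * sum_cols / total_pairs
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are one cluster
    return (sum_cells - expected) / (maximum - expected)


def select_diverse(partitions, threshold=0.9):
    """Hypothetical greedy filter: drop partitions too similar to a kept one."""
    kept = []
    for candidate in partitions:
        if all(adjusted_rand_index(candidate, q) < threshold for q in kept):
            kept.append(candidate)
    return kept
```

The index equals 1 when two partitions are identical up to a renaming of cluster labels and is close to 0 for partitions that agree no more than chance would predict.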

Relevance:

50.00%

Abstract:

This paper proposes a filter-based algorithm for feature selection. The filter is based on the partitioning of the set of features into clusters. The number of clusters, and consequently the cardinality of the subset of selected features, is automatically estimated from data. The computational complexity of the proposed algorithm is also investigated. A variant of this filter that considers feature-class correlations is also proposed for classification problems. Empirical results involving ten datasets illustrate the performance of the developed algorithm, which in general obtained competitive results in terms of classification accuracy when compared to state-of-the-art algorithms that find clusters of features. We show that, if computational efficiency is an important issue, then the proposed filter may be preferred over its counterparts, thus becoming eligible to join a pool of feature selection algorithms to be used in practice. As an additional contribution of this work, a theoretical framework is used to formally analyze some properties of feature selection methods that rely on finding clusters of features. (C) 2011 Elsevier Inc. All rights reserved.

Relevance:

50.00%

Abstract:

A conceptual problem that appears in different contexts of clustering analysis is that of measuring the degree of compatibility between two sequences of numbers. This problem is usually addressed by means of numerical indexes referred to as sequence correlation indexes. This paper elaborates on why some specific sequence correlation indexes may not be good choices depending on the application scenario at hand. A variant of the Product-Moment correlation coefficient and a weighted formulation for the Goodman-Kruskal and Kendall's indexes are derived that may be more appropriate for some particular application scenarios. The proposed and existing indexes are analyzed from different perspectives, such as their sensitivity to the ranks and magnitudes of the sequences under evaluation, among other relevant aspects of the problem. The results help suggest scenarios within the context of clustering analysis that are possibly more appropriate for the application of each index. (C) 2008 Elsevier Inc. All rights reserved.
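
To make the notion of a rank-based sequence correlation index concrete, the sketch below computes the standard (unweighted) Goodman-Kruskal gamma and Kendall tau-a from concordant and discordant pairs. These are the textbook definitions, not the weighted variants the paper derives, and the function names are chosen here for illustration only.

```python
from itertools import combinations


def concordance_counts(x, y):
    """Count pairs ordered the same way (concordant) or oppositely (discordant)."""
    concordant = discordant = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        sign = (xi - xj) * (yi - yj)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # sign == 0 means a tie in x or y; such pairs are counted in neither group
    return concordant, discordant


def goodman_kruskal_gamma(x, y):
    """(C - D) / (C + D): tied pairs are ignored entirely."""
    c, d = concordance_counts(x, y)
    if c + d == 0:
        raise ValueError("gamma is undefined when every pair is tied")
    return (c - d) / (c + d)


def kendall_tau_a(x, y):
    """(C - D) / (n choose 2): tied pairs still count in the denominator."""
    c, d = concordance_counts(x, y)
    n = len(x)
    return (c - d) / (n * (n - 1) / 2)
```

For the sequences `[1, 2, 3, 4]` and `[1, 2, 2, 4]`, five of the six pairs are concordant and one is tied, so gamma is 1 while tau-a is 5/6. Both indexes depend only on the ordering of the values, which is one reason the choice of index matters when the magnitudes of the sequences are also relevant.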

Relevance:

40.00%

Abstract:

A population-based survey was conducted in the urban population of all state capitals and the Federal District of Brazil, between 2005 and 2009, to provide information on the prevalence of viral hepatitis and its risk factors. This article describes the design and methodology of the study, which involved the population aged 5 to 19 years for hepatitis A and 10 to 69 years for hepatitis B and C. Interviews and blood samples were obtained through household visits, and the sample was selected by stratified multistage (cluster) sampling with equal probability for each study domain (region and age group). Nationally, 19,280 households and ~31,000 individuals were selected. The sample size was sufficient to detect a prevalence of around 0.1% and to assess risk factors by region. The methodology proved feasible for distinguishing between different epidemiological patterns of hepatitis A, B, and C. These data will be valuable for evaluating vaccination policies and for designing control strategies.

Relevance:

40.00%

Abstract:

Gene clustering is a useful exploratory technique to group together genes with similar expression levels under distinct cell cycle phases or distinct conditions. It helps the biologist to identify potentially meaningful relationships between genes. In this study, we propose a clustering method based on multivariate normal mixture models, where the number of clusters is predicted via sequential hypothesis tests: at each step, the method considers a mixture model of m components (m = 2 in the first step) and tests if in fact it should be m - 1. If the hypothesis is rejected, m is increased and a new test is carried out. The method continues (increasing m) until the hypothesis is accepted. The theoretical core of the method is the full Bayesian significance test, an intuitive Bayesian approach, which needs no model complexity penalization nor positive probabilities for sharp hypotheses. Numerical experiments were based on a cDNA microarray dataset consisting of expression levels of 205 genes belonging to four functional categories, for 10 distinct strains of Saccharomyces cerevisiae. To analyze the method's sensitivity to data dimension, we performed principal components analysis on the original dataset and predicted the number of classes using 2 to 10 principal components. Compared to Mclust (model-based clustering), our method shows more consistent results.

Relevance:

40.00%

Abstract:

Macro- and microarrays are well-established technologies to determine gene functions through repeated measurements of transcript abundance. We constructed a chicken skeletal muscle-associated array based on a muscle-specific EST database, which was used to generate a tissue expression dataset of ~4500 chicken genes across 5 adult tissues (skeletal muscle, heart, liver, brain, and skin). Only a small number of ESTs were sufficiently well characterized by BLAST searches to determine their probable cellular functions. Evidence of a particular tissue-characteristic expression can be considered an indication that the transcript is likely to be functionally significant. The skeletal muscle macroarray platform was first used to search for evidence of tissue-specific expression, focusing on the biological function of genes/transcripts, since gene expression profiles generated across tissues were found to be reliable and consistent. Hierarchical clustering analysis revealed consistent clustering among genes assigned to 'developmental growth', such as the ontology genes and germ layers. Accuracy of the expression data was supported by comparing information from known transcripts and the tissue from which each transcript was derived with macroarray data. Hybridization assays resulted in consistent tissue expression profiles, which will be useful to dissect tissue-regulatory networks and to predict functions of novel genes identified after extensive sequencing of the genomes of model organisms. Screening our skeletal-muscle platform using 5 chicken adult tissues allowed us to identify 43 'tissue-specific' transcripts and 112 co-expressed uncharacterized transcripts with 62 putative motifs. This platform also represents an important tool for functional investigation of novel genes, for determining expression patterns according to developmental stages, for evaluating differences in muscular growth potential between chicken lines, and for identifying tissue-specific genes.

Relevance:

40.00%

Abstract:

Background: Population antimicrobial use may influence resistance emergence. Resistance is an ecological phenomenon due to potential transmissibility. We investigated spatial and temporal patterns of ciprofloxacin (CIP) population consumption related to E. coli resistance emergence and dissemination in a major Brazilian city. A total of 4,372 urinary tract infection E. coli cases, of which 723 were CIP resistant, were identified in 2002 from two outpatient centres. Case addresses were geocoded in a digital map. Raw CIP consumption data were transformed into usage density in DDDs by determining the influence zones of CIP selling points. A stochastic model coupled with a Geographical Information System was applied to relate resistance to usage density and to detect city areas of high/low resistance risk. Results: E. coli CIP resistant cluster emergence was detected and significantly related to usage density at a level of 5 to 9 CIP DDDs. There were clustered hot-spots and a significant global spatial variation in the residual resistance risk after allowing for usage density. Conclusions: There were clustered hot-spots and a significant global spatial variation in the residual resistance risk after allowing for usage density. The usage density of 5-9 CIP DDDs per 1,000 inhabitants within the same influence zone was the resistance triggering level. This level led to E. coli resistance clustering, proving that individual resistance emergence and dissemination was affected by antimicrobial population consumption.

Relevance:

40.00%

Abstract:

The A1763 superstructure at z = 0.23 contains the first galaxy filament to be directly detected using mid-infrared observations. Our previous work has shown that the frequency of starbursting galaxies, as characterized by 24 μm emission, is much higher within the filament than at either the center of the rich galaxy cluster, or the field surrounding the system. New Very Large Array and XMM-Newton data are presented here. We use the radio and X-ray data to examine the fraction and location of active galaxies, both active galactic nuclei (AGNs) and starbursts (SBs). The radio far-infrared correlation, X-ray point source location, IRAC colors, and quasar positions are all used to gain an understanding of the presence of dominant AGNs. We find very few MIPS-selected galaxies that are clearly dominated by AGN activity. Most radio-selected members within the filament are SBs. Within the supercluster, three of eight spectroscopic members detected both in the radio and in the mid-infrared are radio-bright AGNs. They are found at or near the core of A1763. The five SBs are located further along the filament. We calculate the physical properties of the known wide angle tail (WAT) source which is the brightest cluster galaxy of A1763. A second double lobe source is found along the filament well outside of the virial radius of either cluster. The velocity offset of the WAT from the X-ray centroid and the bend of the WAT in the intracluster medium are both consistent with ram pressure stripping, indicative of streaming motions along the direction of the filament. We consider this as further evidence of the cluster-feeding nature of the galaxy filament.

Relevance:

40.00%

Abstract:

We discuss the properties of homogeneous and isotropic flat cosmologies in which the present accelerating stage is powered only by the gravitationally induced creation of cold dark matter (CCDM) particles ($\Omega_m = 1$). For some matter creation rates proposed in the literature, we show that the main cosmological functions such as the scale factor of the universe, the Hubble expansion rate, the growth factor, and the cluster formation rate are analytically defined. The best CCDM scenario has only one free parameter and our joint analysis involving baryonic acoustic oscillations + cosmic microwave background (CMB) + SNe Ia data yields $\tilde{\Omega}_m = 0.28 \pm 0.01$ ($1\sigma$), where $\tilde{\Omega}_m$ is the observed matter density parameter. In particular, this implies that the model has no dark energy but the part of the matter that is effectively clustering is in good agreement with the latest determinations from the large-scale structure. The growth of perturbation and the formation of galaxy clusters in such scenarios are also investigated. Despite the fact that both scenarios may share the same Hubble expansion, we find that matter creation cosmologies predict stronger small scale dynamics which implies a faster growth rate of perturbations with respect to the usual $\Lambda$CDM cosmology. Such results point to the possibility of a crucial observational test confronting CCDM with $\Lambda$CDM scenarios through a more detailed analysis involving CMB, weak lensing, as well as the large-scale structure.

Relevance:

40.00%

Abstract:

Context. The cosmic time around the z ~ 1 redshift range appears crucial in the cluster and galaxy evolution, since it is probably the epoch of the first mature galaxy clusters. Our knowledge of the properties of the galaxy populations in these clusters is limited because only a handful of z ~ 1 clusters are presently known. Aims. In this framework, we report the discovery of a z ~ 0.87 cluster and study its properties at various wavelengths. Methods. We gathered X-ray and optical data (imaging and spectroscopy), and near and far infrared data (imaging) in order to confirm the cluster nature of our candidate, to determine its dynamical state, and to give insight on its galaxy population evolution. Results. Our candidate structure appears to be a massive z ~ 0.87 dynamically young cluster with an atypically high X-ray temperature as compared to its X-ray luminosity. It exhibits a significant percentage (~90%) of galaxies that are also detected in the 24 μm band. Conclusions. The cluster RXJ1257.2+4738 appears to be still in the process of collapsing. Its relatively high temperature is probably the consequence of significant energy input into the intracluster medium besides the regular gravitational infall contribution. A significant part of its galaxies are red objects that are probably dusty with on-going star formation.