873 resultados para agglomerative clustering
Resumo:
Background: High-throughput molecular approaches for gene expression profiling, such as Serial Analysis of Gene Expression (SAGE), Massively Parallel Signature Sequencing (MPSS) or Sequencing-by-Synthesis (SBS) represent powerful techniques that provide global transcription profiles of different cell types through sequencing of short fragments of transcripts, denominated sequence tags. These techniques have improved our understanding about the relationships between these expression profiles and cellular phenotypes. Despite this, more reliable datasets are still necessary. In this work, we present a web-based tool named S3T: Score System for Sequence Tags, to index sequenced tags in accordance with their reliability. This is made through a series of evaluations based on a defined rule set. S3T allows the identification/selection of tags, considered more reliable for further gene expression analysis. Results: This methodology was applied to a public SAGE dataset. In order to compare data before and after filtering, a hierarchical clustering analysis was performed in samples from the same type of tissue, in distinct biological conditions, using these two datasets. Our results provide evidences suggesting that it is possible to find more congruous clusters after using S3T scoring system. Conclusion: These results substantiate the proposed application to generate more reliable data. This is a significant contribution for determination of global gene expression profiles. The library analysis with S3T is freely available at http://gdm.fmrp.usp.br/s3t/.S3T source code and datasets can also be downloaded from the aforementioned website.
Resumo:
We discuss the dynamics of the Universe within the framework of the massive graviton cold dark matter scenario (MGCDM) in which gravitons are geometrically treated as massive particles. In this modified gravity theory, the main effect of the gravitons is to alter the density evolution of the cold dark matter component in such a way that the Universe evolves to an accelerating expanding regime, as presently observed. Tight constraints on the main cosmological parameters of the MGCDM model are derived by performing a joint likelihood analysis involving the recent supernovae type Ia data, the cosmic microwave background shift parameter, and the baryonic acoustic oscillations as traced by the Sloan Digital Sky Survey red luminous galaxies. The linear evolution of small density fluctuations is also analyzed in detail. It is found that the growth factor of the MGCDM model is slightly different (similar to 1-4%) from the one provided by the conventional flat Lambda CDM cosmology. The growth rate of clustering predicted by MGCDM and Lambda CDM models are confronted to the observations and the corresponding best fit values of the growth index (gamma) are also determined. By using the expectations of realistic future x-ray and Sunyaev-Zeldovich cluster surveys we derive the dark matter halo mass function and the corresponding redshift distribution of cluster-size halos for the MGCDM model. Finally, we also show that the Hubble flow differences between the MGCDM and the Lambda CDM models provide a halo redshift distribution departing significantly from the those predicted by other dark energy models. These results suggest that the MGCDM model can observationally be distinguished from Lambda CDM and also from a large number of dark energy models recently proposed in the literature.
Resumo:
We discuss the properties of homogeneous and isotropic flat cosmologies in which the present accelerating stage is powered only by the gravitationally induced creation of cold dark matter (CCDM) particles (Omega(m) = 1). For some matter creation rates proposed in the literature, we show that the main cosmological functions such as the scale factor of the universe, the Hubble expansion rate, the growth factor, and the cluster formation rate are analytically defined. The best CCDM scenario has only one free parameter and our joint analysis involving baryonic acoustic oscillations + cosmic microwave background (CMB) + SNe Ia data yields (Omega) over tilde = 0.28 +/- 0.01 (1 sigma), where (Omega) over tilde (m) is the observed matter density parameter. In particular, this implies that the model has no dark energy but the part of the matter that is effectively clustering is in good agreement with the latest determinations from the large- scale structure. The growth of perturbation and the formation of galaxy clusters in such scenarios are also investigated. Despite the fact that both scenarios may share the same Hubble expansion, we find that matter creation cosmologies predict stronger small scale dynamics which implies a faster growth rate of perturbations with respect to the usual Lambda CDM cosmology. Such results point to the possibility of a crucial observational test confronting CCDM with Lambda CDM scenarios through a more detailed analysis involving CMB, weak lensing, as well as the large-scale structure.
Resumo:
Context. We study galaxy evolution and spatial patterns in the surroundings of a sample of 2dF groups. Aims. Our aim is to find evidence of galaxy evolution and clustering out to 10 times the virial radius of the groups and so redefine their properties according to the spatial patterns in the fields and relate them to galaxy evolution. Methods. Group members and interlopers were redefined after the identification of gaps in the redshift distribution. We then used exploratory spatial statistics based on the the second moment of the Ripley function to probe the anisotropy in the galaxy distribution around the groups. Results. We found an important anticorrelation between anisotropy around groups and the fraction of early-type galaxies in these fields. Our results illustrate how the dynamical state of galaxy groups can be ascertained by the systematic study of their neighborhoods. This is an important achievement, since the correct estimate of the extent to which galaxies are affected by the group environment and follow large-scale filamentary structure is relevant to understanding the process of galaxy clustering and evolution in the Universe.
Resumo:
A great part of the interest in complex networks has been motivated by the presence of structured, frequently nonuniform, connectivity. Because diverse connectivity patterns tend to result in distinct network dynamics, and also because they provide the means to identify and classify several types of complex network, it becomes important to obtain meaningful measurements of the local network topology. In addition to traditional features such as the node degree, clustering coefficient, and shortest path, motifs have been introduced in the literature in order to provide complementary descriptions of the network connectivity. The current work proposes a different type of motif, namely, chains of nodes, that is, sequences of connected nodes with degree 2. These chains have been subdivided into cords, tails, rings, and handles, depending on the type of their extremities (e.g., open or connected). A theoretical analysis of the density of such motifs in random and scale-free networks is described, and an algorithm for identifying these motifs in general networks is presented. The potential of considering chains for network characterization has been illustrated with respect to five categories of real-world networks including 16 cases. Several interesting findings were obtained, including the fact that several chains were observed in real-world networks, especially the world wide web, books, and the power grid. The possibility of chains resulting from incompletely sampled networks is also investigated.
Resumo:
We numerically study the dynamics of a discrete spring-block model introduced by Olami, Feder, and Christensen (OFC) to mimic earthquakes and investigate to what extent this simple model is able to reproduce the observed spatiotemporal clustering of seismicity. Following a recently proposed method to characterize such clustering by networks of recurrent events [J. Davidsen, P. Grassberger, and M. Paczuski, Geophys. Res. Lett. 33, L11304 (2006)], we find that for synthetic catalogs generated by the OFC model these networks have many nontrivial statistical properties. This includes characteristic degree distributions, very similar to what has been observed for real seismicity. There are, however, also significant differences between the OFC model and earthquake catalogs, indicating that this simple model is insufficient to account for certain aspects of the spatiotemporal clustering of seismicity.
Resumo:
Large-scale cortical networks exhibit characteristic topological properties that shape communication between brain regions and global cortical dynamics. Analysis of complex networks allows the description of connectedness, distance, clustering, and centrality that reveal different aspects of how the network's nodes communicate. Here, we focus on a novel analysis of complex walks in a series of mammalian cortical networks that model potential dynamics of information flow between individual brain regions. We introduce two new measures called absorption and driftness. Absorption is the average length of random walks between any two nodes, and takes into account all paths that may diffuse activity throughout the network. Driftness is the ratio between absorption and the corresponding shortest path length. For a given node of the network, we also define four related measurements, namely in-and out-absorption as well as in-and out-driftness, as the averages of the corresponding measures from all nodes to that node, and from that node to all nodes, respectively. We find that the cat thalamo-cortical system incorporates features of two classic network topologies, Erdos-Renyi graphs with respect to in-absorption and in-driftness, and configuration models with respect to out-absorption and out-driftness. Moreover, taken together these four measures separate the network nodes based on broad functional roles (visual, auditory, somatomotor, and frontolimbic).
Resumo:
Biological neuronal networks constitute a special class of dynamical systems, as they are formed by individual geometrical components, namely the neurons. In the existing literature, relatively little attention has been given to the influence of neuron shape on the overall connectivity and dynamics of the emerging networks. The current work addresses this issue by considering simplified neuronal shapes consisting of circular regions (soma/axons) with spokes (dendrites). Networks are grown by placing these patterns randomly in the two-dimensional (2D) plane and establishing connections whenever a piece of dendrite falls inside an axon. Several topological and dynamical properties of the resulting graph are measured, including the degree distribution, clustering coefficients, symmetry of connections, size of the largest connected component, as well as three hierarchical measurements of the local topology. By varying the number of processes of the individual basic patterns, we can quantify relationships between the individual neuronal shape and the topological and dynamical features of the networks. Integrate-and-fire dynamics on these networks is also investigated with respect to transient activation from a source node, indicating that long-range connections play an important role in the propagation of avalanches.
Resumo:
Efficient automatic protein classification is of central importance in genomic annotation. As an independent way to check the reliability of the classification, we propose a statistical approach to test if two sets of protein domain sequences coming from two families of the Pfam database are significantly different. We model protein sequences as realizations of Variable Length Markov Chains (VLMC) and we use the context trees as a signature of each protein family. Our approach is based on a Kolmogorov-Smirnov-type goodness-of-fit test proposed by Balding et at. [Limit theorems for sequences of random trees (2008), DOI: 10.1007/s11749-008-0092-z]. The test statistic is a supremum over the space of trees of a function of the two samples; its computation grows, in principle, exponentially fast with the maximal number of nodes of the potential trees. We show how to transform this problem into a max-flow over a related graph which can be solved using a Ford-Fulkerson algorithm in polynomial time on that number. We apply the test to 10 randomly chosen protein domain families from the seed of Pfam-A database (high quality, manually curated families). The test shows that the distributions of context trees coming from different families are significantly different. We emphasize that this is a novel mathematical approach to validate the automatic clustering of sequences in any context. We also study the performance of the test via simulations on Galton-Watson related processes.
Resumo:
The k(0)-method instrumental neutron activation analysis (k(0)-INAA) was employed for determining chemical elements in bird feathers. A collection was obtained taking into account several bird species from wet ecosystems in diverse regions of Brazil. For comparison reason, feathers were actively sampled in a riparian forest from the Marins Stream, Piracicaba, Sao Paulo State, using mist nets specific for capturing birds. Biological certified reference materials were used for assessing the quality of analytical procedure. Quantification of chemical elements was performed using the k(0)-INAA Quantu Software. Sixteen chemical elements, including macro and micronutrients, and trace elements, have been quantified in feathers, in which analytical uncertainties varied from 2% to 40% depending on the chemical element mass fraction. Results indicated high mass fractions of Br (max=7.9 mgkg(-1)), Co (max= 0.47 mg kg(-1)), Cr (max =68 mg kg(-1)), Hg (max =2.79 mg kg(-1)), Sb (max= 0.20 mg kg(-1)), Se (max=1.3 mg kg(-1)) and Zn (max =192 mg kg(-1)) in bird feathers, probably associated with the degree of pollution of the areas evaluated. In order to corroborate the use of k(0)-INAA results in biomonitoring studies using avian community, different factor analysis methods were used to check chemical element source apportionment and locality clustering based on feather chemical composition. (C) 2010 Elsevier B.V. All rights reserved.
Resumo:
Various molecular systems are available for epidemiological, genetic, evolutionary, taxonomic and systematic studies of innumerable fungal infections, especially those caused by the opportunistic pathogen C. albicans. A total of 75 independent oral isolates were selected in order to compare Multilocus Enzyme Electrophoresis (MLEE), Electrophoretic Karyotyping (EK) and Microsatellite Markers (Simple Sequence Repeats - SSRs), in their abilities to differentiate and group C. albicans isolates (discriminatory power), and also, to evaluate the concordance and similarity of the groups of strains determined by cluster analysis for each fingerprinting method. Isoenzyme typing was performed using eleven enzyme systems: Adh, Sdh, M1p, Mdh, Idh, Gdh, G6pdh, Asd, Cat, Po, and Lap (data previously published). The EK method consisted of chromosomal DNA separation by pulsed-field gel electrophoresis using a CHEF system. The microsatellite markers were investigated by PCR using three polymorphic loci: EF3, CDC3, and HIS3. Dendrograms were generated by the SAHN method and UPGMA algorithm based on similarity matrices (S(SM)). The discriminatory power of the three methods was over 95%, however a paired analysis among them showed a parity of 19.7-22.4% in the identification of strains. Weak correlation was also observed among the genetic similarity matrices (S(SM)(MLEE) x S(SM)(EK) x S(SM)(SSRs)). Clustering analyses showed a mean of 9 +/- 12.4 isolates per cluster (3.8 +/- 8 isolates/taxon) for MLEE, 6.2 +/- 4.9 isolates per cluster (4 +/- 4.5 isolates/taxon) for SSRs, and 4.1 +/- 2.3 isolates per cluster (2.6 +/- 2.3 isolates/taxon) for EK. A total of 45 (13%), 39(11.2%), 5 (1.4%) and 3 (0.9%) clusters pairs from 347 showed similarity (Si) of 0.1-10%, 10.1-20%, 20.1-30% and 30.1-40%, respectively. Clinical and molecular epidemiological correlation involving the opportunistic pathogen C. albicans may be attributed dependently of each method of genotyping (i.e., MLEE, EK, and SSRs) supplemented with similarity and grouping analysis. Therefore, the use of genotyping systems that give results which offer minimum disparity, or the combination of the results of these systems, can provide greater security and consistency in the determination of strains and their genetic relationships. (C) 2010 Elsevier B.V. All rights reserved.
Resumo:
Today several different unsupervised classification algorithms are commonly used to cluster similar patterns in a data set based only on its statistical properties. Specially in image data applications, self-organizing methods for unsupervised classification have been successfully applied for clustering pixels or group of pixels in order to perform segmentation tasks. The first important contribution of this paper refers to the development of a self-organizing method for data classification, named Enhanced Independent Component Analysis Mixture Model (EICAMM), which was built by proposing some modifications in the Independent Component Analysis Mixture Model (ICAMM). Such improvements were proposed by considering some of the model limitations as well as by analyzing how it should be improved in order to become more efficient. Moreover, a pre-processing methodology was also proposed, which is based on combining the Sparse Code Shrinkage (SCS) for image denoising and the Sobel edge detector. In the experiments of this work, the EICAMM and other self-organizing models were applied for segmenting images in their original and pre-processed versions. A comparative analysis showed satisfactory and competitive image segmentation results obtained by the proposals presented herein. (C) 2008 Published by Elsevier B.V.
Resumo:
The taxonomy of the N(2)-fixing bacteria belonging to the genus Bradyrhizobium is still poorly refined, mainly due to conflicting results obtained by the analysis of the phenotypic and genotypic properties. This paper presents an application of a method aiming at the identification of possible new clusters within a Brazilian collection of 119 Bradryrhizobium strains showing phenotypic characteristics of B. japonicum and B. elkanii. The stability was studied as a function of the number of restriction enzymes used in the RFLP-PCR analysis of three ribosomal regions with three restriction enzymes per region. The method proposed here uses Clustering algorithms with distances calculated by average-linkage clustering. Introducing perturbations using sub-sampling techniques makes the stability analysis. The method showed efficacy in the grouping of the species B. japonicum and B. elkanii. Furthermore, two new clusters were clearly defined, indicating possible new species, and sub-clusters within each detected cluster. (C) 2008 Elsevier B.V. All rights reserved.
Resumo:
This paper analyses the presence of financial constraint in the investment decisions of 367 Brazilian firms from 1997 to 2004, using a Bayesian econometric model with group-varying parameters. The motivation for this paper is the use of clustering techniques to group firms in a totally endogenous form. In order to classify the firms we used a hybrid clustering method, that is, hierarchical and non-hierarchical clustering techniques jointly. To estimate the parameters a Bayesian approach was considered. Prior distributions were assumed for the parameters, classifying the model in random or fixed effects. Ordinate predictive density criterion was used to select the model providing a better prediction. We tested thirty models and the better prediction considers the presence of 2 groups in the sample, assuming the fixed effect model with a Student t distribution with 20 degrees of freedom for the error. The results indicate robustness in the identification of financial constraint when the firms are classified by the clustering techniques. (C) 2010 Elsevier B.V. All rights reserved.
Resumo:
In this paper, a framework for detection of human skin in digital images is proposed. This framework is composed of a training phase and a detection phase. A skin class model is learned during the training phase by processing several training images in a hybrid and incremental fuzzy learning scheme. This scheme combines unsupervised-and supervised-learning: unsupervised, by fuzzy clustering, to obtain clusters of color groups from training images; and supervised to select groups that represent skin color. At the end of the training phase, aggregation operators are used to provide combinations of selected groups into a skin model. In the detection phase, the learned skin model is used to detect human skin in an efficient way. Experimental results show robust and accurate human skin detection performed by the proposed framework.