853 results for Unsupervised clustering
Abstract:
Using the random network generation models from Gustedt (2009)[23], we simulate and analyze several characteristics (such as the number of components, the degree distribution and the clustering coefficient) of the generated networks. This is done for a variety of distributions (fixed value, Bernoulli, Poisson, binomial) that are used to control the parameters of the generation process. These parameters are, in particular, the size of newly appearing sets of objects, the number of contexts in which new elements initially appear, the number of objects that are shared with 'parent' contexts, and the time period within which a context may serve as a parent context (aging). The results show that these models allow the generation process to be fine-tuned so that the graphs adopt properties found in real-world graphs. (C) 2011 Elsevier B.V. All rights reserved.
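As a rough illustration of one of the characteristics named above, the sketch below computes the local and average clustering coefficient of a small undirected graph. It is not the authors' code: the adjacency-dictionary representation, the function names and the toy graph are assumptions made for this example.

```python
from itertools import combinations


def local_clustering(adj, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # fewer than two neighbours: no pair to close a triangle
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)


# Hypothetical toy graph: a triangle a-b-c with a pendant node d attached to c.
triangle_with_tail = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
```

In this toy graph, nodes a and b sit in a closed triangle, node c has only one of its three neighbour pairs linked, and node d has a single neighbour.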
Abstract:
Survival models involving frailties are commonly applied in studies where correlated event time data arise due to natural or artificial clustering. In this paper we present an application of such models in the animal breeding field. Specifically, a mixed survival model with a multivariate correlated frailty term is proposed for the analysis of data from over 3611 Brazilian Nellore cattle. The primary aim is to evaluate parental genetic effects on the number of days that their progeny need to achieve a commercially specified standard weight gain. This trait is not measured directly but can be estimated from growth data. Results point to the importance of genetic effects and suggest that these models constitute a valuable data analysis tool for beef cattle breeding.
Abstract:
Target region amplification polymorphism (TRAP) markers were used to estimate the genetic similarity (GS) among 53 sugarcane varieties and five species of the Saccharum complex. Seven fixed primers designed from candidate genes involved in sucrose metabolism and three from those involved in drought response metabolism were used in combination with three arbitrary primers. The clustering of the genotypes for sucrose metabolism and drought response was similar, but the GS based on Jaccard's coefficient changed. The GS based on polymorphism in sucrose genes estimated in a set of 46 Brazilian varieties, all of which belong to the three Brazilian breeding programs, ranged from 0.52 to 0.9, and that based on drought data ranged from 0.44 to 0.95. The results suggest that genetic variability in the evaluated genes was lower in the sucrose metabolism genes than in the drought response metabolism ones.
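For readers unfamiliar with Jaccard's coefficient, the following minimal sketch shows how it could be computed from marker presence/absence profiles. The variety names and band scores are invented for illustration and are not data from the study.

```python
def jaccard_similarity(a, b):
    """Jaccard coefficient between two 0/1 band-presence profiles.

    Shared bands divided by bands present in at least one profile;
    bands absent from both profiles are ignored.
    """
    if len(a) != len(b):
        raise ValueError("profiles must score the same set of bands")
    shared = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return shared / either if either else 0.0


# Hypothetical profiles: 1 = band present, 0 = band absent.
profiles = {
    "variety_A": [1, 1, 0, 1, 0],
    "variety_B": [1, 0, 0, 1, 0],
    "variety_C": [0, 1, 1, 0, 1],
}
```

A pairwise matrix of such values is the usual input to a hierarchical clustering step, although the abstract does not say which clustering algorithm was applied.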
Abstract:
The rhizosphere constitutes a complex niche that may be exploited by a wide variety of bacteria. Bacterium-plant interactions in this niche can be influenced by factors such as the expression of heterologous genes in the plant. The objective of this work was to describe the bacterial communities associated with the rhizosphere and rhizoplane regions of tobacco plants, and to compare communities from transgenic tobacco lines (CAB1, CAB2 and TRP) with those found in wild-type (WT) plants. Samples were collected at two stages of plant development, the vegetative and flowering stages (1 and 3 months after germination). The diversity of the culturable microbial community was assessed by isolation and further characterization of isolates by amplified ribosomal RNA gene restriction analysis (ARDRA) and 16S rRNA sequencing. These analyses revealed the presence of fairly common rhizosphere organisms with the main groups Alphaproteobacteria, Betaproteobacteria, Actinobacteria and Bacilli. Analysis of the total bacterial communities using PCR-DGGE (denaturing gradient gel electrophoresis) revealed that shifts in bacterial communities occurred during early plant development, but the reestablishment of original community structure was observed over time. The effects were smaller in rhizosphere than in rhizoplane samples, where selection of specific bacterial groups by the different plant lines was demonstrated. Clustering patterns and principal components analysis (PCA) were used to distinguish the plant lines according to the fingerprint of their associated bacterial communities. Bands differentially detected in plant lines were found to be affiliated with the genera Pantoea, Bacillus and Burkholderia in WT, CAB and TRP plants, respectively. 
The data revealed that, although rhizosphere/rhizoplane microbial communities can be affected by the cultivation of transgenic plants, soil resilience may be able to restore the original bacterial diversity after one cycle of plant cultivation.
Abstract:
The bacterial diversity present in sediments of a well-preserved mangrove in Ilha do Cardoso, located in the extreme south of the São Paulo State coastline, Brazil, was assessed using culture-independent molecular approaches (denaturing gradient gel electrophoresis (DGGE) and analysis of 166 sequences from a clone library). The data revealed a bacterial community dominated by Alphaproteobacteria (40.36% of clones), Gammaproteobacteria (19.28% of clones) and Acidobacteria (27.71% of clones), while minor components of the assemblage were affiliated to Betaproteobacteria, Deltaproteobacteria, Firmicutes, Actinobacteria and Bacteroidetes. The clustering and redundancy analysis (RDA) based on DGGE were used to determine factors that modulate the diversity of bacterial communities in mangroves, such as depth, seasonal fluctuations, and locations over a transect area from the sea to the land. Profiles of specific DGGE gels showed that both dominant ('universal' Bacteria and Alphaproteobacteria) and low-density bacterial communities (Betaproteobacteria and Actinobacteria) are responsive to shifts in environmental factors. The location within the mangrove was a determining factor for all fractions of the community studied, whereas season was significant for Bacteria, Alphaproteobacteria, and Betaproteobacteria and sample depth determined the diversity of Alphaproteobacteria and Actinobacteria.
Abstract:
The supervised pattern recognition methods K-Nearest Neighbors (KNN), stepwise discriminant analysis (SDA), and soft independent modelling of class analogy (SIMCA) were employed in this work to investigate the relationship between the molecular structure of 27 cannabinoid compounds and their analgesic activity. Previous analyses using two unsupervised pattern recognition methods, principal component analysis (PCA) and hierarchical cluster analysis (HCA), were performed and five descriptors were selected as the most relevant for the analgesic activity of the compounds studied: R(3) (charge density on substituent at position C(3)), Q(1) (charge on atom C(1)), A (surface area), log P (logarithm of the partition coefficient) and MR (molecular refractivity). The supervised pattern recognition methods (SDA, KNN, and SIMCA) were employed in order to construct a reliable model that can predict the analgesic activity of new cannabinoid compounds and to validate our previous study. The results obtained using the SDA, KNN, and SIMCA methods agree perfectly with our previous model. Comparing the SDA, KNN, and SIMCA results with the PCA and HCA ones, we observed that all multivariate statistical methods classified the cannabinoid compounds studied into three groups in exactly the same way: active, moderately active, and inactive.
Abstract:
Rectangular dropshafts, commonly used in sewers and storm water systems, are characterised by significant flow aeration. New detailed air-water flow measurements were conducted in a near-full-scale dropshaft at large discharges. In the shaft pool and outflow channel, the results demonstrated the complexity of different competitive air entrainment mechanisms. Bubble size measurements showed a broad range of entrained bubble sizes. Analysis of streamwise distributions of bubbles further suggested some clustering process in the bubbly flow although, in the outflow channel, bubble chords were on average smaller than in the shaft pool. A robust hydrophone was tested to measure bubble acoustic spectra and to assess its field application potential. The acoustic results accurately characterised the order of magnitude of entrained bubble sizes, but the transformation from acoustic frequencies to bubble radii did not correctly predict the probability distribution functions of bubble sizes.
Abstract:
In an open channel, a hydraulic jump is the rapid transition from super- to sub-critical flow associated with strong turbulence and air bubble entrainment in the mixing layer. New experiments were performed at relatively large Reynolds numbers using phase-detection probes. Some new signal analysis provided characteristic air-water time and length scales of the vortical structures advecting the air bubbles in the developing shear flow. An analysis of the longitudinal air-water flow structure suggested little bubble clustering in the mixing layer, although an interparticle arrival time analysis showed some preferential bubble clustering for small bubbles with chord times below 3 ms. Correlation analyses yielded longitudinal air-water time scales Txx*V1/d1 of about 0.8 on average. The transverse integral length scale Z/d1 of the eddies advecting entrained bubbles was typically between 0.25 and 0.4, irrespective of the inflow conditions within the range of the investigations. Overall the findings highlighted the complicated nature of the air-water flow
Abstract:
A combination of deductive reasoning, clustering, and inductive learning is given as an example of a hybrid system for exploratory data analysis. Visualization is replaced by a dialogue with the data.
Abstract:
In the context of cancer diagnosis and treatment, we consider the problem of constructing an accurate prediction rule on the basis of a relatively small number of tumor tissue samples of known type containing the expression data on very many (possibly thousands) genes. Recently, results have been presented in the literature suggesting that it is possible to construct a prediction rule from only a few genes such that it has a negligible prediction error rate. However, in these results the test error or the leave-one-out cross-validated error is calculated without allowance for the selection bias. There is no allowance either because the rule is tested on tissue samples that were used in the first instance to select the genes being used in the rule or because the cross-validation of the rule is not external to the selection process; that is, gene selection is not performed in training the rule at each stage of the cross-validation process. We describe how in practice the selection bias can be assessed and corrected for by either performing a cross-validation or applying the bootstrap external to the selection process. We recommend using 10-fold rather than leave-one-out cross-validation, and concerning the bootstrap, we suggest using the so-called .632+ bootstrap error estimate designed to handle overfitted prediction rules. Using two published data sets, we demonstrate that when correction is made for the selection bias, the cross-validated error is no longer zero for a subset of only a few genes.
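The distinction between internal and external cross-validation can be sketched as follows. This is only an illustrative toy under stated assumptions, not the procedure from the paper: the gene-ranking score (absolute difference of class means), the nearest-centroid classifier, the pure-noise data generator and all function names are choices made for this example.

```python
import random


def select_genes(X, y, k):
    """Rank genes by absolute difference of class means; keep the top k."""
    def score(j):
        g0 = [row[j] for row, label in zip(X, y) if label == 0]
        g1 = [row[j] for row, label in zip(X, y) if label == 1]
        return abs(sum(g0) / len(g0) - sum(g1) / len(g1))
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]


def nearest_centroid_predict(X_train, y_train, genes, x):
    """Assign x to the class whose centroid (on the selected genes) is closest."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        centroid = [sum(r[j] for r in rows) / len(rows) for j in genes]
        dist = sum((x[j] - c) ** 2 for j, c in zip(genes, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def cv_error(X, y, k, n_folds=10, external=True, seed=0):
    """Cross-validated error rate.

    external=True  : genes are re-selected inside every training fold.
    external=False : genes are selected once on ALL samples (biased variant).
    """
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    folds = [order[i::n_folds] for i in range(n_folds)]
    genes_all = select_genes(X, y, k)  # used only by the biased variant
    errors = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in order if i not in held_out]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        genes = select_genes(X_tr, y_tr, k) if external else genes_all
        errors += sum(
            nearest_centroid_predict(X_tr, y_tr, genes, X[i]) != y[i]
            for i in fold
        )
    return errors / len(X)


def make_noise_data(n=20, p=500, seed=1):
    """Hypothetical data: expression values are pure noise, labels alternate."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]
    return X, y


if __name__ == "__main__":
    X, y = make_noise_data()
    print("internal (biased) CV error:", cv_error(X, y, k=5, external=False))
    print("external CV error:", cv_error(X, y, k=5, external=True))
```

Because the labels here are unrelated to the data, no rule can truly do better than chance. The internal estimate will nevertheless tend to look optimistic, since the held-out samples already influenced which genes were chosen; the external estimate does not share that leak.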
Abstract:
Multiple sclerosis and idiopathic dilated cardiomyopathy are two conditions in which an autoimmune process is implicated in the pathogenesis. There is evidence to support clustering of autoimmune diseases in patients with multiple sclerosis and their families. To our knowledge, this is the first report of idiopathic dilated cardiomyopathy occurring in a patient with multiple sclerosis.
Abstract:
Liver samples from rabbits killed by RHDV, collected from five States in Australia in 1996 and 1997, were analysed by RT-PCR. A 398 bp fragment of the capsid protein (VP60) gene was amplified by PCR and directly sequenced. The alignment of the nucleotide and amino acid sequences and their comparison with the original strain of the virus released in Australia indicated that genetic changes after two years have been small, with 98.2% to 100% identity. The constructed phylogenetic tree suggests slight differences in nucleotide substitutions in various States but there is no clear evidence of clustering of sequences according to their geographic origin. In practical terms, sequencing of viral RNA provides a means of testing the efficacy of further releases and subsequent spread of the virus if such a strategy is employed as a means of enhancing RHD as a biological control of the wild rabbit in Australia.
Abstract:
Cylindrospermopsis raciborskii is a toxic-bloom-forming cyanobacterium that is commonly found in tropical to subtropical climatic regions worldwide, but it is also recognized as a common component of cyanobacterial communities in temperate climates. Genetic profiles of C. raciborskii were examined in 19 cultured isolates originating from geographically diverse regions of Australia and represented by two distinct morphotypes. A 609-bp region of rpoC1, a DNA-dependent RNA polymerase gene, was amplified by PCR from these isolates with cyanobacterium-specific primers. Sequence analysis revealed that all isolates belonged to the same species, including morphotypes with straight or coiled trichomes. Additional rpoC1 gene sequences obtained for a range of cyanobacteria highlighted clustering of C. raciborskii with other heterocyst-producing cyanobacteria (orders Nostocales and Stigonematales). In contrast, randomly amplified polymorphic DNA and short tandemly repeated repetitive sequence profiles revealed a greater level of genetic heterogeneity among C. raciborskii isolates than did rpoC1 gene analysis, and unique band profiles were also found among each of the cyanobacterial genera examined. A PCR test targeting a region of the rpoC1 gene unique to C. raciborskii was developed for the specific identification of C. raciborskii from both purified genomic DNA and environmental samples. The PCR was evaluated with a number of cyanobacterial isolates, but a PCR-positive result was only achieved with C. raciborskii. This method provides an accurate alternative to traditional morphological identification of C. raciborskii.
Abstract:
Normal mixture models are being increasingly used to model the distributions of a wide variety of random phenomena and to cluster sets of continuous multivariate data. However, for a set of data containing a group or groups of observations with longer than normal tails or atypical observations, the use of normal components may unduly affect the fit of the mixture model. In this paper, we consider a more robust approach by modelling the data by a mixture of t distributions. The use of the ECM algorithm to fit this t mixture model is described and examples of its use are given in the context of clustering multivariate data in the presence of atypical observations in the form of background noise.
Abstract:
This paper develops an interactive approach for exploratory spatial data analysis. Measures of attribute similarity and spatial proximity are combined in a clustering model to support the identification of patterns in spatial information. Relationships between the developed clustering approach, spatial data mining and choropleth display are discussed. Analysis of property crime rates in Brisbane, Australia is presented. A surprising finding in this research is that there are substantial inconsistencies in standard choropleth display options found in two widely used commercial geographical information systems, both in terms of definition and performance. The comparative results demonstrate the usefulness and appeal of the developed approach in a geographical information system environment for exploratory spatial data analysis.