930 resultados para HIERARCHICAL CLUSTER ANALYSIS
Resumo:
Rigid adherence to pre-specified thresholds and static graphical representations can lead to incorrect decisions on merging of clusters. As an alternative to existing automated or semi-automated methods, we developed a visual analytics approach for performing hierarchical clustering analysis of short time-series gene expression data. Dynamic sliders control parameters such as the similarity threshold at which clusters are merged and the level of relative intra-cluster distinctiveness, which can be used to identify "weak-edges" within clusters. An expert user can drill down to further explore the dendrogram and detect nested clusters and outliers. This is done by using the sliders and by pointing and clicking on the representation to cut the branches of the tree in multiple-heights. A prototype of this tool has been developed in collaboration with a small group of biologists for analysing their own datasets. Initial feedback on the tool has been positive.
Resumo:
The general purpose of this work is to describe and analyse the financing phenomenon of crowdfunding and to investigate the relations among crowdfunders, project creators and crowdfunding websites. More specifically, it also intends to describe the profile differences between major crowdfunding platforms, such as Kickstarter and Indiegogo. The findings are supported by literature, gathered from different scientific research papers. In the empirical part, data about Kickstarter and Indiegogo was collected from their websites and also complemented with further data from other statistical websites. For finding out specific information, such as satisfaction of entrepreneurs from both platforms, a satisfaction survey was applied among 200 entrepreneurs from different countries. To identify the profile of users of the Kickstarter and of the Indiegogo platforms, a multivariate analysis was performed, using a Hierarchical Clusters Analysis for each platform under study. Descriptive analysis was used for exploring information about popularity of platforms, average cost and the most popular area of projects, profile of users and future opportunities of platforms. To assess differences between groups, association between variables, and answering to the research hypothesis, an inferential analysis it was applied. The results showed that the Kickstarter and Indiegogo are one of the most popular crowdfunding platforms. Both of them have thousands of users and they are generally satisfied. Each of them uses individual approach for crowdfunders. Despite this, they both could benefit from further improving their services. Furthermore, according the results it was possible to observe that there is a direct and positive relationship between the money needed for the projects and the money collected from the investors for the projects, per platform.
Resumo:
Twenty areas from eight Brazilian states were compared according to a list of 224 species of Poaceae. In order to determinate affinity patterns between the areas, a binary matrix was submitted to cluster and ordination analysis. The patterns found were then faced to climate and geographic position. The scores corresponding to the areas obtained from the cluster analysis showed a strong correlation to temperature. The scores corresponding to the species suggest a gradient that associates distribution patterns to the photosynthetic pathway (C3 or C4). The current results suggest that the traditional classification of the Southern American grasslands might require some modification in order to be broadly applicable in the Brazilian context.
Resumo:
Macro- and microarrays are well-established technologies to determine gene functions through repeated measurements of transcript abundance. We constructed a chicken skeletal muscle-associated array based on a muscle-specific EST database, which was used to generate a tissue expression dataset of similar to 4500 chicken genes across 5 adult tissues (skeletal muscle, heart, liver, brain, and skin). Only a small number of ESTs were sufficiently well characterized by BLAST searches to determine their probable cellular functions. Evidence of a particular tissue-characteristic expression can be considered an indication that the transcript is likely to be functionally significant. The skeletal muscle macroarray platform was first used to search for evidence of tissue-specific expression, focusing on the biological function of genes/transcripts, since gene expression profiles generated across tissues were found to be reliable and consistent. Hierarchical clustering analysis revealed consistent clustering among genes assigned to 'developmental growth', such as the ontology genes and germ layers. Accuracy of the expression data was supported by comparing information from known transcripts and tissue from which the transcript was derived with macroarray data. Hybridization assays resulted in consistent tissue expression profile, which will be useful to dissect tissue-regulatory networks and to predict functions of novel genes identified after extensive sequencing of the genomes of model organisms. Screening our skeletal-muscle platform using 5 chicken adult tissues allowed us identifying 43 'tissue-specific' transcripts, and 112 co-expressed uncharacterized transcripts with 62 putative motifs. This platform also represents an important tool for functional investigation of novel genes; to determine expression pattern according to developmental stages; to evaluate differences in muscular growth potential between chicken lines, and to identify tissue-specific genes.
Resumo:
This paper aims to find relations between the socioeconomic characteristics, activity participation, land use patterns and travel behavior of the residents in the Sao Paulo Metropolitan Area (SPMA) by using Exploratory Multivariate Data Analysis (EMDA) techniques. The variables influencing travel pattern choices are investigated using: (a) Cluster Analysis (CA), grouping and characterizing the Traffic Zones (17), proposing the independent variable called Origin Cluster and, (b) Decision Tree (DT) to find a priori unknown relations among socioeconomic characteristics, land use attributes of the origin TZ and destination choices. The analysis was based on the origin-destination home-interview survey carried out in SPMA in 1997. The DT application revealed the variables of greatest influence on the travel pattern choice. The most important independent variable considered by DT is car ownership, followed by the Use of Transportation ""credits"" for Transit tariff, and, finally, activity participation variables and Origin Cluster. With these results, it was possible to analyze the influence of a family income, car ownership, position of the individual in the family, use of transportation ""credits"" for transit tariff (mainly for travel mode sequence choice), activities participation (activity sequence choice) and Origin Cluster (destination/travel distance choice). (c) 2010 Elsevier Ltd. All rights reserved.
Wavelet correlation between subjects: A time-scale data driven analysis for brain mapping using fMRI
Resumo:
Functional magnetic resonance imaging (fMRI) based on BOLD signal has been used to indirectly measure the local neural activity induced by cognitive tasks or stimulation. Most fMRI data analysis is carried out using the general linear model (GLM), a statistical approach which predicts the changes in the observed BOLD response based on an expected hemodynamic response function (HRF). In cases when the task is cognitively complex or in cases of diseases, variations in shape and/or delay may reduce the reliability of results. A novel exploratory method using fMRI data, which attempts to discriminate between neurophysiological signals induced by the stimulation protocol from artifacts or other confounding factors, is introduced in this paper. This new method is based on the fusion between correlation analysis and the discrete wavelet transform, to identify similarities in the time course of the BOLD signal in a group of volunteers. We illustrate the usefulness of this approach by analyzing fMRI data from normal subjects presented with standardized human face pictures expressing different degrees of sadness. The results show that the proposed wavelet correlation analysis has greater statistical power than conventional GLM or time domain intersubject correlation analysis. (C) 2010 Elsevier B.V. All rights reserved.
Resumo:
The Canoparmelia texana epiphytic lichenized fungi was used to monitor atmospheric pollution in the Sao Paulo metropolitan region, SP, Brazil. The cluster analysis applied to the element concentration values confirmed the site groups of different levels of pollution due to industrial and vehicular emissions. In the distribution maps of element concentrations, higher concentrations of Ba and Mn were observed in the vicinity of industries and of a petrochemical complex. The highest concentration of Co found in lichens from the Sao Miguel Paulista site is due to the emissions from a metallurgical processing plant that produces this element. For Br and Zn, the highest concentrations could be associated both to vehicular and industrial emissions. Exploratory analyses revealed that the accumulation of toxic elements in C. texana may be of use in evaluating the human risk of cardiopulmonary mortality due to prolonged exposure to ambient levels of air pollution. (c) 2007 Elsevier Ltd. All rights reserved.
Resumo:
Existing procedures for the generation of polymorphic DNA markers are not optimal for insect studies in which the organisms are often tiny and background molecular Information is often non-existent. We have used a new high throughput DNA marker generation protocol called randomly amplified DNA fingerprints (RAF) to analyse the genetic variability In three separate strains of the stored grain pest, Rhyzopertha dominica. This protocol is quick, robust and reliable even though it requires minimal sample preparation, minute amounts of DNA and no prior molecular analysis of the organism. Arbitrarily selected oligonucleotide primers routinely produced similar to 50 scoreable polymorphic DNA markers, between individuals of three Independent field isolates of R. dominica. Multivariate cluster analysis using forty-nine arbitrarily selected polymorphisms generated from a single primer reliably separated individuals into three clades corresponding to their geographical origin. The resulting clades were quite distinct, with an average genetic difference of 37.5 +/- 6.0% between clades and of 21.0 +/- 7.1% between individuals within clades. As a prelude to future gene mapping efforts, we have also assessed the performance of RAF under conditions commonly used in gene mapping. In this analysis, fingerprints from pooled DNA samples accurately and reproducibly reflected RAF profiles obtained from Individual DNA samples that had been combined to create the bulked samples.
Resumo:
Lucerne (Medicago sativa L.) is autotetraploid, and predominantly allogamous. This complex breeding structure maximises the genetic diversity within lucerne populations making it difficult to genetically discriminate between populations. The objective of this study was to evaluate the level of random genetic diversity within and between a selection of Australian-grown lucerne cultivars, with tetraploid M. falcata included as a possible divergent control source. This diversity was evaluated using random amplified polymorphic DNA (RAPDs). Nineteen plants from each of 10 cultivars were analysed. Using 11 RAPD primers, 96 polymorphic bands were scored as present or absent across the 190 individuals. Genetic similarity estimates (GSEs) of all pair-wise comparisons were calculated from these data. Mean GSEs within cultivars ranged from 0.43 to 0.51. Cultivar Venus (0.43) had the highest level of intra-population genetic diversity and cultivar Sequel HR (0.51) had the lowest level of intra-population genetic diversity. Mean GSEs between cultivars ranged from 0.31 to 0.49, which overlapped with values obtained for within-cultivar GSE, thus not allowing separation of the cultivars. The high level of intra- and inter-population diversity that was detected is most likely due to the breeding of synthetic cultivars using parents derived from a number of diverse sources. Cultivar-specific polymorphisms were only identified in the M. falcata source, which like M. sativa, is outcrossing and autotetraploid. From a cluster analysis and a principal components analysis, it was clear that M. falcata was distinct from the other cultivars. The results indicate that the M. falcata accession tested has not been widely used in Australian lucerne breeding programs, and offers a means of introducing new genetic diversity into the lucerne gene pool. This provides a means of maximising heterozygosity, which is essential to maximising productivity in lucerne.
Resumo:
In microarray studies, the application of clustering techniques is often used to derive meaningful insights into the data. In the past, hierarchical methods have been the primary clustering tool employed to perform this task. The hierarchical algorithms have been mainly applied heuristically to these cluster analysis problems. Further, a major limitation of these methods is their inability to determine the number of clusters. Thus there is a need for a model-based approach to these. clustering problems. To this end, McLachlan et al. [7] developed a mixture model-based algorithm (EMMIX-GENE) for the clustering of tissue samples. To further investigate the EMMIX-GENE procedure as a model-based -approach, we present a case study involving the application of EMMIX-GENE to the breast cancer data as studied recently in van 't Veer et al. [10]. Our analysis considers the problem of clustering the tissue samples on the basis of the genes which is a non-standard problem because the number of genes greatly exceed the number of tissue samples. We demonstrate how EMMIX-GENE can be useful in reducing the initial set of genes down to a more computationally manageable size. The results from this analysis also emphasise the difficulty associated with the task of separating two tissue groups on the basis of a particular subset of genes. These results also shed light on why supervised methods have such a high misallocation error rate for the breast cancer data.
Resumo:
Dissertação de Mestrado em Gestão de Empresas/MBA.
Resumo:
TPM Vol. 21, No. 4, December 2014, 435-447 – Special Issue © 2014 Cises.
Resumo:
Beyond the classical statistical approaches (determination of basic statistics, regression analysis, ANOVA, etc.) a new set of applications of different statistical techniques has increasingly gained relevance in the analysis, processing and interpretation of data concerning the characteristics of forest soils. This is possible to be seen in some of the recent publications in the context of Multivariate Statistics. These new methods require additional care that is not always included or refered in some approaches. In the particular case of geostatistical data applications it is necessary, besides to geo-reference all the data acquisition, to collect the samples in regular grids and in sufficient quantity so that the variograms can reflect the spatial distribution of soil properties in a representative manner. In the case of the great majority of Multivariate Statistics techniques (Principal Component Analysis, Correspondence Analysis, Cluster Analysis, etc.) despite the fact they do not require in most cases the assumption of normal distribution, they however need a proper and rigorous strategy for its utilization. In this work, some reflections about these methodologies and, in particular, about the main constraints that often occur during the information collecting process and about the various linking possibilities of these different techniques will be presented. At the end, illustrations of some particular cases of the applications of these statistical methods will also be presented.
Resumo:
This study aims to optimize the water quality monitoring of a polluted watercourse (Leça River, Portugal) through the principal component analysis (PCA) and cluster analysis (CA). These statistical methodologies were applied to physicochemical, bacteriological and ecotoxicological data (with the marine bacterium Vibrio fischeri and the green alga Chlorella vulgaris) obtained with the analysis of water samples monthly collected at seven monitoring sites and during five campaigns (February, May, June, August, and September 2006). The results of some variables were assigned to water quality classes according to national guidelines. Chemical and bacteriological quality data led to classify Leça River water quality as “bad” or “very bad”. PCA and CA identified monitoring sites with similar pollution pattern, giving to site 1 (located in the upstream stretch of the river) a distinct feature from all other sampling sites downstream. Ecotoxicity results corroborated this classification thus revealing differences in space and time. The present study includes not only physical, chemical and bacteriological but also ecotoxicological parameters, which broadens new perspectives in river water characterization. Moreover, the application of PCA and CA is very useful to optimize water quality monitoring networks, defining the minimum number of sites and their location. Thus, these tools can support appropriate management decisions.
Resumo:
Cork stopper manufacturing process includes an operation, known as stabilisation, by which humid cork slabs are extensively colonised by fungi. The effects of fungal growth on cork are yet to be completely understood and are considered to be involved in the so called “cork taint” of bottled wine. It is essential to identify environmental constraints which define the appearance of the colonising fungal species and to trace their origin to the forest and/or as residents in the manufacturing space. The present article correlates two sets of data, from consecutive years and the same season, of systematic biologic sampling of two manufacturing units, located in the North and South of Portugal. Chrysonilia sitophila dominance was identified, followed by a high diversity of Penicillium species. Penicillium glabrum, found in all samples, was the most frequent isolated species. P. glabrum intra-species variability was investigated using DNA fingerprinting techniques revealing highly discriminative polymorphic markers in the genome. Cluster analysis of P. glabrum data was discussed in relation to the geographical location of strains, and results suggest that P. glabrum arise from predominantly the manufacturing space, although cork resident fungi can also contrib