263 results for Hierarchical clustering model
in Biblioteca Digital da Produção Intelectual da Universidade de São Paulo (BDPI/USP)
Abstract:
A continuous version of the hierarchical spherical model at dimension d = 4 is investigated. Two limit distributions of the block spin variable X(gamma), normalized with exponents gamma = d + 2 and gamma = d at and above the critical temperature, are established. These results are proven by solving certain evolution equations corresponding to the renormalization group (RG) transformation of the O(N) hierarchical spin model of block size L(d) in the limit L ↓ 1 and N → ∞. Starting far away from the stationary Gaussian fixed point, the trajectories of this dynamical system pass through two different regimes with distinguishable crossover behavior. An interpretation of these trajectories is given by the geometric theory of functions, which describes precisely the motion of the Lee-Yang zeroes. The large-N limit of the RG transformation with L(d) fixed equal to 2, at criticality, has recently been investigated in both weak and strong (coupling) regimes by Watanabe (J. Stat. Phys. 115:1669-1713, 2004). Although our analysis deals only with the N = ∞ case, it complements various aspects of that work.
Abstract:
Macro- and microarrays are well-established technologies to determine gene functions through repeated measurements of transcript abundance. We constructed a chicken skeletal muscle-associated array based on a muscle-specific EST database, which was used to generate a tissue expression dataset of ~4500 chicken genes across 5 adult tissues (skeletal muscle, heart, liver, brain, and skin). Only a small number of ESTs were sufficiently well characterized by BLAST searches to determine their probable cellular functions. Evidence of a particular tissue-characteristic expression can be considered an indication that the transcript is likely to be functionally significant. The skeletal muscle macroarray platform was first used to search for evidence of tissue-specific expression, focusing on the biological function of genes/transcripts, since gene expression profiles generated across tissues were found to be reliable and consistent. Hierarchical clustering analysis revealed consistent clustering among genes assigned to 'developmental growth', such as the ontology genes and germ layers. Accuracy of the expression data was supported by comparing information from known transcripts and the tissue from which each transcript was derived with macroarray data. Hybridization assays resulted in consistent tissue expression profiles, which will be useful to dissect tissue-regulatory networks and to predict functions of novel genes identified after extensive sequencing of the genomes of model organisms. Screening our skeletal-muscle platform using 5 chicken adult tissues allowed us to identify 43 'tissue-specific' transcripts and 112 co-expressed uncharacterized transcripts with 62 putative motifs. This platform also represents an important tool for functional investigation of novel genes, for determining expression patterns according to developmental stages, for evaluating differences in muscular growth potential between chicken lines, and for identifying tissue-specific genes.
Abstract:
Online music databases have increased significantly as a consequence of the rapid growth of the Internet and digital audio, requiring the development of faster and more efficient tools for music content analysis. Musical genres are widely used to organize music collections. In this paper, the problem of automatic single and multi-label music genre classification is addressed by exploring rhythm-based features obtained from a respective complex network representation. A Markov model is built in order to analyse the temporal sequence of rhythmic notation events. Feature analysis is performed by using two multi-variate statistical approaches: principal components analysis (unsupervised) and linear discriminant analysis (supervised). Similarly, two classifiers are applied in order to identify the category of rhythms: parametric Bayesian classifier under the Gaussian hypothesis (supervised) and agglomerative hierarchical clustering (unsupervised). Qualitative results obtained by using the kappa coefficient and the obtained clusters corroborated the effectiveness of the proposed method.
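As a rough illustration of the agglomerative hierarchical clustering mentioned in the abstract above, the following minimal sketch merges the two closest clusters repeatedly until a target number remains. The average-linkage rule, the Euclidean distance, the function name `agglomerative`, and the toy points are all assumptions made for illustration; the abstract does not state which linkage or features the authors used.

```python
from itertools import combinations
from math import dist


def agglomerative(points, n_clusters):
    """Average-linkage agglomerative clustering (illustrative sketch).

    Returns a list of clusters, each a list of indices into `points`.
    """
    # Start with every point in its own cluster.
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Average pairwise Euclidean distance between two clusters.
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        # Merge cluster j into cluster i (i < j, so deleting j is safe).
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical 2-D feature vectors forming two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
result = agglomerative(points, 2)
print(result)
```

This brute-force version recomputes every linkage at each merge, which is acceptable for a handful of points but would be too slow for a real feature set.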
Abstract:
This paper analyses the presence of financial constraint in the investment decisions of 367 Brazilian firms from 1997 to 2004, using a Bayesian econometric model with group-varying parameters. The motivation for this paper is the use of clustering techniques to group firms in a totally endogenous form. In order to classify the firms we used a hybrid clustering method, that is, hierarchical and non-hierarchical clustering techniques jointly. To estimate the parameters a Bayesian approach was considered. Prior distributions were assumed for the parameters, classifying the model as random or fixed effects. The ordinate predictive density criterion was used to select the model providing the better prediction. We tested thirty models, and the best prediction considers the presence of 2 groups in the sample, assuming the fixed effect model with a Student t distribution with 20 degrees of freedom for the error. The results indicate robustness in the identification of financial constraint when the firms are classified by the clustering techniques. (C) 2010 Elsevier B.V. All rights reserved.
Abstract:
Clustering is a difficult task: there is no single cluster definition and the data can have more than one underlying structure. Pareto-based multi-objective genetic algorithms (e.g., MOCK, Multi-Objective Clustering with automatic K-determination, and MOCLE, Multi-Objective Clustering Ensemble) were proposed to tackle these problems. However, the output of such algorithms can often contain a high number of partitions, making it difficult for an expert to manually analyze all of them. In order to deal with this problem, we present two selection strategies, which are based on the corrected Rand, to choose a subset of solutions. To test them, they are applied to the set of solutions produced by MOCK and MOCLE in the context of several datasets. The study was also extended to select a reduced set of partitions from the initial population of MOCLE. These analyses show that both versions of the proposed selection strategy are very effective. They can significantly reduce the number of solutions and, at the same time, keep the quality and the diversity of the partitions in the original set of solutions. (C) 2010 Elsevier B.V. All rights reserved.
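For readers unfamiliar with the corrected Rand index on which the selection strategies above are based, the sketch below computes it for two partitions given as label lists. It follows the standard Hubert-Arabie pair-counting formula; the function name `corrected_rand` and the toy labelings are illustrative assumptions, and the sketch does not reproduce the selection strategies themselves, which the abstract does not describe in detail.

```python
from collections import Counter
from math import comb


def corrected_rand(labels_a, labels_b):
    """Corrected (adjusted) Rand index between two partitions of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Pairs of items placed together in both partitions.
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of items placed together in each partition separately.
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = pairs_a * pairs_b / comb(n, 2)  # agreement expected by chance
    maximum = (pairs_a + pairs_b) / 2          # largest attainable agreement
    if maximum == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (pairs_joint - expected) / (maximum - expected)


# Same grouping under different label names: index is 1.
same = corrected_rand([0, 0, 1, 1], [1, 1, 0, 0])
# Groupings that never agree on a pair: index falls below 0.
crossed = corrected_rand([0, 0, 1, 1], [0, 1, 0, 1])
print(same, crossed)
```

Values near 1 indicate near-identical partitions, so one plausible use is to discard solutions that are nearly redundant with one already kept; whether the paper's strategies work that way is not stated in the abstract.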
Abstract:
Background: High-throughput molecular approaches for gene expression profiling, such as Serial Analysis of Gene Expression (SAGE), Massively Parallel Signature Sequencing (MPSS) or Sequencing-by-Synthesis (SBS), represent powerful techniques that provide global transcription profiles of different cell types through sequencing of short fragments of transcripts, called sequence tags. These techniques have improved our understanding of the relationships between these expression profiles and cellular phenotypes. Despite this, more reliable datasets are still necessary. In this work, we present a web-based tool named S3T: Score System for Sequence Tags, to index sequenced tags in accordance with their reliability. This is done through a series of evaluations based on a defined rule set. S3T allows the identification/selection of tags considered more reliable for further gene expression analysis. Results: This methodology was applied to a public SAGE dataset. In order to compare data before and after filtering, a hierarchical clustering analysis was performed on samples from the same type of tissue, in distinct biological conditions, using these two datasets. Our results provide evidence suggesting that it is possible to find more congruous clusters after using the S3T scoring system. Conclusion: These results substantiate the proposed application to generate more reliable data. This is a significant contribution to the determination of global gene expression profiles. The library analysis with S3T is freely available at http://gdm.fmrp.usp.br/s3t/. S3T source code and datasets can also be downloaded from the aforementioned website.
Abstract:
In breast cancer patients, primary chemotherapy is associated with the same survival benefits as adjuvant chemotherapy. Residual tumors represent a clinical challenge, as they may be resistant to additional cycles of the same drugs. Our aim was to identify differential transcripts expressed in residual tumors, after neoadjuvant chemotherapy, that might be related to tumor resistance. Hence, 16 patients with paired tumor samples, collected before and after treatment (4 cycles doxorubicin/cyclophosphamide, AC), had their gene expression evaluated on cDNA microarray slides containing 4,608 genes. Three hundred and eighty-nine genes were differentially expressed (paired Student's t-test, pFDR<0.01) between pre- and post-chemotherapy samples, and among the regulated functions were the JNK cascade and cell death. Unsupervised hierarchical clustering identified one branch comprising exclusively eight pre-chemotherapy samples and another branch including the corresponding eight post-chemotherapy samples and the other 16 paired pre/post-chemotherapy samples. No differences in clinical and tumor parameters could explain this clustering. Another group of 11 patients with paired samples had expression of selected genes determined by real-time RT-PCR, and CTGF and DUSP1 were confirmed to be more highly expressed in post- as compared to pre-chemotherapy samples. After neoadjuvant chemotherapy some residual samples may retain their molecular signature while others present significant changes in their gene expression, probably induced by the treatment. CTGF and DUSP1 overexpression in residual samples may be a reflection of resistance to further administration of the AC regimen.
Abstract:
Objective. To explore the relationship between biomarkers of pulmonary arterial hypertension (PAH), interferon (IFN)-regulated gene expression, and the alternative activation pathway in systemic sclerosis (SSc). Methods. Peripheral blood mononuclear cells (PBMCs) were purified from healthy controls, patients with idiopathic PAH, and SSc patients (classified as having diffuse cutaneous SSc, limited cutaneous SSc [lcSSc] without PAH, and lcSSc with PAH). IFN-regulated and "PAH biomarker" genes were compared after supervised hierarchical clustering. Messenger RNA levels of selected IFN-regulated genes (Siglec1 and MX1), biomarker genes (IL13RA1, CCR1, and JAK2), and the alternative activation marker gene (MRC1) were analyzed on PBMCs and on CD14- and CD14+ cell populations. Interleukin-13 (IL-13) and IL-4 concentrations were measured in plasma by immunoassay. CD14, MRC1, and IL13RA1 surface expression was analyzed by flow cytometry. Results. Increased PBMC expression of both IFN-regulated and biomarker genes distinguished SSc patients from healthy controls. Expression of genes in the biomarker cluster, but not in the IFN-regulated cluster, distinguished lcSSc with PAH from lcSSc without PAH. The genes CCR1 (P < 0.001) and JAK2 (P < 0.001) were expressed more highly in lcSSc patients with PAH compared with controls and mainly by CD14+ cells. MRC1 expression was increased exclusively in lcSSc patients with PAH (P < 0.001) and correlated strongly with pulmonary artery pressure (r = 0.52, P = 0.03) and higher mortality (P = 0.02). MRC1 expression was higher in CD14+ cells and was greatly increased by stimulation with IL-13. IL-13 concentrations in plasma were most highly increased in lcSSc patients with PAH (P < 0.001). Conclusion. IFN-regulated and biomarker genes represent distinct, although related, clusters in lcSSc patients with PAH. MRC1, a marker for the effect of IL-13 on alternative monocyte/macrophage activation, is associated with this severe complication and is related to mortality.
Abstract:
The expression of peripheral tissue antigens (PTAs) in the thymus by medullary thymic epithelial cells (mTECs) is essential for central self-tolerance in the generation of the T cell repertoire. Due to the heterogeneity of autoantigen representation, this phenomenon has been termed promiscuous gene expression (PGE), in which the autoimmune regulator (Aire) gene plays a key role as a transcription factor for part of these genes. Here we used a microarray strategy to assess PGE in the cultured murine CD80(+) 3.10 mTEC line. Hierarchical clustering of the data showed that PTA genes were differentially expressed, making it possible to identify their respective induced or repressed mRNAs. To further investigate the control of PGE, we tested the hypothesis that genes involved in this phenomenon might also be modulated by a transcriptional network. We then reconstructed such a network based on the microarray expression data, featuring the guanylate cyclase 2d (Gucy2d) gene as a main node. Under this condition, we established 167 positive and negative interactions with downstream PTA genes. When Aire was silenced by RNA interference, Gucy2d, although downregulated, established a larger number (355) of interactions with PTA genes. T- and G-boxes corresponding to AIRE protein binding sites located upstream of the ATG codon of Gucy2d support this effect. These findings provide evidence that Aire plays a role in association with Gucy2d, which is connected to several PTA genes and establishes a cascade-like transcriptional control of promiscuous gene expression in mTEC cells. (C) 2009 Elsevier Ltd. All rights reserved.
Abstract:
Urinary bladder cancer is the fourth most common malignancy in the Western world. Transitional cell carcinoma (TCC) is the most common subtype, accounting for about 90% of all bladder cancers. The TP53 gene plays an essential role in the regulation of the cell cycle and apoptosis and therefore contributes to cellular transformation and malignancy; however, little is known about the differential gene expression patterns in human tumors that present with the wild-type or mutated TP53 gene. Therefore, because gene profiling can provide new insights into the molecular biology of bladder cancer, the present study aimed to compare the molecular profiles of bladder cancer cell lines with different TP53 alleles, including the wild type (RT4) and two mutants (5637, with mutations in codons 280 and 72; and T24, a TP53 allele encoding an in-frame deletion of tyrosine 126). Unsupervised hierarchical clustering and gene networks were constructed based on data generated by cDNA microarrays using mRNA from the three cell lines. Differentially expressed genes related to the cell cycle, cell division, cell death, and cell proliferation were observed in the three cell lines. However, the cDNA microarray data did not cluster cell lines based on their TP53 allele. The gene profiles of the RT4 cells were more similar to those of T24 than to those of the 5637 cells. While the deregulation of both the cell cycle and the apoptotic pathways was particularly related to TCC, these alterations were not associated with the TP53 status.
Abstract:
Mesenchymal stem cells (MSC) are multipotent cells which can be obtained from several adult and fetal tissues including human umbilical cord units. We have recently shown that umbilical cord tissue (UC) is richer in MSC than umbilical cord blood (UCB), but their origin and characteristics in blood as compared to the cord remain unknown. Here we compared, for the first time, the exonic protein-coding and intronic noncoding RNA (ncRNA) expression profiles of MSC from match-paired UC and UCB samples, harvested from the same donors, processed simultaneously and under the same culture conditions. The patterns of intronic ncRNA expression in MSC from UC and UCB paired units were highly similar, indicative of their common donor origin. The respective exonic protein-coding transcript expression profiles, however, were significantly different. Hierarchical clustering based on protein-coding expression similarities grouped MSC according to their tissue location rather than original donor. Genes related to systems development, osteogenesis and immune system were expressed at higher levels in UCB, whereas genes related to cell adhesion, morphogenesis, secretion, angiogenesis and neurogenesis were more highly expressed in UC cells. These molecular differences observed in tissue-specific MSC gene expression may reflect functional activities influenced by distinct niches and should be considered when developing clinical protocols involving MSC from different sources. In addition, these findings reinforce our previous suggestion on the importance of banking the whole umbilical cord unit for research or future therapeutic use.
Abstract:
This work presents a Bayesian semiparametric approach for dealing with regression models where the covariate is measured with error. Given that (1) the error normality assumption is very restrictive, and (2) assuming a specific elliptical distribution for errors (Student-t for example), may be somewhat presumptuous; there is need for more flexible methods, in terms of assuming only symmetry of errors (admitting unknown kurtosis). In this sense, the main advantage of this extended Bayesian approach is the possibility of considering generalizations of the elliptical family of models by using Dirichlet process priors in dependent and independent situations. Conditional posterior distributions are implemented, allowing the use of Markov Chain Monte Carlo (MCMC), to generate the posterior distributions. An interesting result shown is that the Dirichlet process prior is not updated in the case of the dependent elliptical model. Furthermore, an analysis of a real data set is reported to illustrate the usefulness of our approach, in dealing with outliers. Finally, semiparametric proposed models and parametric normal model are compared, graphically with the posterior distribution density of the coefficients. (C) 2009 Elsevier Inc. All rights reserved.
Abstract:
Gene clustering is a useful exploratory technique to group together genes with similar expression levels under distinct cell cycle phases or distinct conditions. It helps the biologist to identify potentially meaningful relationships between genes. In this study, we propose a clustering method based on multivariate normal mixture models, where the number of clusters is predicted via sequential hypothesis tests: at each step, the method considers a mixture model of m components (m = 2 in the first step) and tests if in fact it should be m - 1. If the hypothesis is rejected, m is increased and a new test is carried out. The method continues (increasing m) until the hypothesis is accepted. The theoretical core of the method is the full Bayesian significance test, an intuitive Bayesian approach, which needs no model complexity penalization nor positive probabilities for sharp hypotheses. Numerical experiments were based on a cDNA microarray dataset consisting of expression levels of 205 genes belonging to four functional categories, for 10 distinct strains of Saccharomyces cerevisiae. To analyze the method's sensitivity to data dimension, we performed principal components analysis on the original dataset and predicted the number of classes using 2 to 10 principal components. Compared to Mclust (model-based clustering), our method shows more consistent results.
Abstract:
We numerically study the dynamics of a discrete spring-block model introduced by Olami, Feder, and Christensen (OFC) to mimic earthquakes and investigate to what extent this simple model is able to reproduce the observed spatiotemporal clustering of seismicity. Following a recently proposed method to characterize such clustering by networks of recurrent events [J. Davidsen, P. Grassberger, and M. Paczuski, Geophys. Res. Lett. 33, L11304 (2006)], we find that for synthetic catalogs generated by the OFC model these networks have many nontrivial statistical properties. This includes characteristic degree distributions, very similar to what has been observed for real seismicity. There are, however, also significant differences between the OFC model and earthquake catalogs, indicating that this simple model is insufficient to account for certain aspects of the spatiotemporal clustering of seismicity.
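To make the spring-block dynamics concrete, here is a minimal sketch of the commonly described OFC update rule on a small square lattice: every site is driven uniformly until one reaches the threshold, then each unstable site is reset to zero and passes a fraction `alpha` of its force to each of its four neighbors. The lattice size, `alpha = 0.2`, open boundaries, and the sequential (rather than synchronous) relaxation order are assumptions chosen for illustration; the abstract does not give the parameters used in the study, and the network analysis of recurrent events is not reproduced here.

```python
import random


def ofc_avalanche(forces, alpha=0.2, threshold=1.0):
    """Drive the lattice to its next instability, relax it, return the avalanche size."""
    n = len(forces)

    # Uniform drive: raise every site by the gap between the largest force
    # and the threshold, so that exactly the most loaded site becomes unstable.
    i0, j0 = max(((i, j) for i in range(n) for j in range(n)),
                 key=lambda s: forces[s[0]][s[1]])
    gap = threshold - forces[i0][j0]
    for row in forces:
        for j in range(n):
            row[j] += gap
    forces[i0][j0] = threshold  # guard against floating-point rounding

    unstable = [(i, j) for i in range(n) for j in range(n)
                if forces[i][j] >= threshold]
    size = 0
    while unstable:
        i, j = unstable.pop()
        f = forces[i][j]
        if f < threshold:
            continue  # already relaxed through an earlier entry
        forces[i][j] = 0.0
        size += 1
        # Redistribute to the four nearest neighbors; force passed beyond
        # the open boundary is lost.
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < n and 0 <= nj < n:
                forces[ni][nj] += alpha * f
                if forces[ni][nj] >= threshold:
                    unstable.append((ni, nj))
    return size


random.seed(0)
lattice = [[random.random() for _ in range(8)] for _ in range(8)]
sizes = [ofc_avalanche(lattice) for _ in range(200)]
print(max(sizes), sum(sizes) / len(sizes))
```

With `alpha` below 0.25 the redistribution is dissipative, so every avalanche terminates. A synthetic catalog in the sense of the abstract would record the time, location, and size of each such event.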
Abstract:
The benefits of breastfeeding for children's health have been highlighted in many studies. The innovative aspect of the present study lies in its use of a multilevel model, a technique that has rarely been applied to studies on breastfeeding. The data reported were collected from a larger study, the Family Budget Survey (Pesquisa de Orçamentos Familiares), carried out between 2002 and 2003 in Brazil, which involved a sample of 48 470 households. A representative national sample of 1477 infants aged 0-6 months was used. The statistical analysis was performed using a multilevel model, with two levels grouped by region. In Brazil, breastfeeding prevalence was 58%. The factors that bore a negative influence on breastfeeding were over four residents living in the same household [odds ratio (OR) = 0.68, 90% confidence interval (CI) = 0.51-0.89] and mothers aged 30 years or more (OR = 0.68, 90% CI = 0.53-0.89). The factors that positively influenced breastfeeding were the following: higher socio-economic levels (OR = 1.37, 90% CI = 1.01-1.88), families with over two infants under 5 years (OR = 1.25, 90% CI = 1.00-1.58) and being a resident in rural areas (OR = 1.25, 90% CI = 1.00-1.58). Although the majority of the mothers were aware of the value of maternal milk and breastfed their babies, the prevalence of breastfeeding remains lower than the rate advised by the World Health Organization, and the number of residents living in the same household along with mothers aged 30 years or older were both factors associated with early cessation of infant breastfeeding before 6 months.