783 resultados para grid, clustering, statistical, clustering
Resumo:
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)
Resumo:
Foi realizado o inventário da araneofauna do Parque Nacional de Sete Cidades (municípios de Brasileira e Piracuruca, Piauí), utilizando amostragem padronizada para permitir comparações entre as assembléias de aranhas de seis fitofisionomias existentes na área de estudo e obter estimativas de riqueza. Utilizaram-se dados oriundos amostragem com armadilhas de queda (PTF), extratores de Winkler (WIN), guarda-chuva entomológico (GCE), rede de varredura (RV) e coletas manuais noturnas (MN), totalizando 1386 amostras; além do exame de todos os demais espécimes já coletados na área de estudo (n=1166). As análises estatísticas foram realizadas utilizando-se os dados obtidos com GCE, RV e MN. Ao todo, foram coletados 14.890 indivíduos (4491 adultos), segregados em 364 espécies. Destas, 72 foram determinadas a nível específico, 62 são novos registros para a área de estudo, 2 são novos registros para o Brasil e 48 foram reconhecidas como espécies novas por especialistas. A aplicação dos métodos GCE, RV e MN resultou em 11.085 aranhas, pertencentes a 303 espécies. As estimativas de riqueza variaram entre 355 (Bootstrap) e 467 (Jack 2). Entretanto o estimador que apresentou maior tendência a atingir a assíntota foi Chao 2 (403 spp.). A riqueza observada foi maior na mata seca semi-decídua (131 spp.), seguida pela mata de galeria (104 spp.), campo limpo (102 spp.), cerradão (91 spp.), cerrado típico (88 spp.) e cerrado rupestre (78 spp.). A eficiência dos métodos de coleta exibiu variação de acordo com a fitofisionomia onde o método foi aplicado, destacando-se a elevada eficiência da rede de varredura em áreas abertas. A composição de espécies variou entre as fitofisionomias e pode ser, em parte, explicada pela complexidade estrutural das áreas em questão. Os resultados das análises de agrupamento sugerem que em condições de dominância elevada, estes testes sejam realizados com coeficientes que utilizem dados qualitativos, a fim de anular-se o efeito da escolha do coeficiente e/ou a necessidade de transformação dos dados. De maneira geral, a araneofauna do Parque Nacional de Sete Cidades não segue padrões de agrupamento como sugerido para as análises botânicas, em que fitofisionomias campestre, savânicas e florestais são agrupadas.
Resumo:
Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq)
Resumo:
Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)
Resumo:
Pós-graduação em Agronomia (Horticultura) - FCA
Resumo:
Pós-graduação em Engenharia Mecânica - FEIS
Resumo:
Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)
Resumo:
Pós-graduação em Saúde Coletiva - FMB
Resumo:
Background: Large gene expression studies, such as those conducted using DNA arrays, often provide millions of different pieces of data. To address the problem of analyzing such data, we describe a statistical method, which we have called ‘gene shaving’. The method identifies subsets of genes with coherent expression patterns and large variation across conditions. Gene shaving differs from hierarchical clustering and other widely used methods for analyzing gene expression studies in that genes may belong to more than one cluster, and the clustering may be supervised by an outcome measure. The technique can be ‘unsupervised’, that is, the genes and samples are treated as unlabeled, or partially or fully supervised by using known properties of the genes or samples to assist in finding meaningful groupings. Results: We illustrate the use of the gene shaving method to analyze gene expression measurements made on samples from patients with diffuse large B-cell lymphoma. The method identifies a small cluster of genes whose expression is highly predictive of survival. Conclusions: The gene shaving method is a potentially useful tool for exploration of gene expression data and identification of interesting clusters of genes worth further investigation.
Discriminating Different Classes of Biological Networks by Analyzing the Graphs Spectra Distribution
Resumo:
The brain's structural and functional systems, protein-protein interaction, and gene networks are examples of biological systems that share some features of complex networks, such as highly connected nodes, modularity, and small-world topology. Recent studies indicate that some pathologies present topological network alterations relative to norms seen in the general population. Therefore, methods to discriminate the processes that generate the different classes of networks (e. g., normal and disease) might be crucial for the diagnosis, prognosis, and treatment of the disease. It is known that several topological properties of a network (graph) can be described by the distribution of the spectrum of its adjacency matrix. Moreover, large networks generated by the same random process have the same spectrum distribution, allowing us to use it as a "fingerprint". Based on this relationship, we introduce and propose the entropy of a graph spectrum to measure the "uncertainty" of a random graph and the Kullback-Leibler and Jensen-Shannon divergences between graph spectra to compare networks. We also introduce general methods for model selection and network model parameter estimation, as well as a statistical procedure to test the nullity of divergence between two classes of complex networks. Finally, we demonstrate the usefulness of the proposed methods by applying them to (1) protein-protein interaction networks of different species and (2) on networks derived from children diagnosed with Attention Deficit Hyperactivity Disorder (ADHD) and typically developing children. We conclude that scale-free networks best describe all the protein-protein interactions. Also, we show that our proposed measures succeeded in the identification of topological changes in the network while other commonly used measures (number of edges, clustering coefficient, average path length) failed.
Resumo:
We propose an alternative, nonsingular, cosmic scenario based on gravitationally induced particle production. The model is an attempt to evade the coincidence and cosmological constant problems of the standard model (Lambda CDM) and also to connect the early and late time accelerating stages of the Universe. Our space-time emerges from a pure initial de Sitter stage thereby providing a natural solution to the horizon problem. Subsequently, due to an instability provoked by the production of massless particles, the Universe evolves smoothly to the standard radiation dominated era thereby ending the production of radiation as required by the conformal invariance. Next, the radiation becomes subdominant with the Universe entering in the cold dark matter dominated era. Finally, the negative pressure associated with the creation of cold dark matter (CCDM model) particles accelerates the expansion and drives the Universe to a final de Sitter stage. The late time cosmic expansion history of the CCDM model is exactly like in the standard Lambda CDM model; however, there is no dark energy. The model evolves between two limiting (early and late time) de Sitter regimes. All the stages are also discussed in terms of a scalar field description. This complete scenario is fully determined by two extreme energy densities, or equivalently, the associated de Sitter Hubble scales connected by rho(I)/rho(f) = (H-I/H-f)(2) similar to 10(122), a result that has no correlation with the cosmological constant problem. We also study the linear growth of matter perturbations at the final accelerating stage. It is found that the CCDM growth index can be written as a function of the Lambda growth index, gamma(Lambda) similar or equal to 6/11. In this framework, we also compare the observed growth rate of clustering with that predicted by the current CCDM model. Performing a chi(2) statistical test we show that the CCDM model provides growth rates that match sufficiently well with the observed growth rate of structure.
Resumo:
Abstract Background To understand the molecular mechanisms underlying important biological processes, a detailed description of the gene products networks involved is required. In order to define and understand such molecular networks, some statistical methods are proposed in the literature to estimate gene regulatory networks from time-series microarray data. However, several problems still need to be overcome. Firstly, information flow need to be inferred, in addition to the correlation between genes. Secondly, we usually try to identify large networks from a large number of genes (parameters) originating from a smaller number of microarray experiments (samples). Due to this situation, which is rather frequent in Bioinformatics, it is difficult to perform statistical tests using methods that model large gene-gene networks. In addition, most of the models are based on dimension reduction using clustering techniques, therefore, the resulting network is not a gene-gene network but a module-module network. Here, we present the Sparse Vector Autoregressive model as a solution to these problems. Results We have applied the Sparse Vector Autoregressive model to estimate gene regulatory networks based on gene expression profiles obtained from time-series microarray experiments. Through extensive simulations, by applying the SVAR method to artificial regulatory networks, we show that SVAR can infer true positive edges even under conditions in which the number of samples is smaller than the number of genes. Moreover, it is possible to control for false positives, a significant advantage when compared to other methods described in the literature, which are based on ranks or score functions. By applying SVAR to actual HeLa cell cycle gene expression data, we were able to identify well known transcription factor targets. Conclusion The proposed SVAR method is able to model gene regulatory networks in frequent situations in which the number of samples is lower than the number of genes, making it possible to naturally infer partial Granger causalities without any a priori information. In addition, we present a statistical test to control the false discovery rate, which was not previously possible using other gene regulatory network models.
Resumo:
[EN] Breast cancer patients show a wide variation in normal tissue reactions after radiotherapy. The individual sensitivity to x-rays limits the efficiency of the therapy. Prediction of individual sensitivity to radiotherapy could help to select the radiation protocol and to improve treatment results. The aim of this study was to assess the relationship between gene expression profiles of ex vivo un-irradiated and irradiated lymphocytes and the development of toxicity due to high-dose hyperfractionated radiotherapy in patients with locally advanced breast cancer. Raw data from microarray experiments were uploaded to the Gene Expression Omnibus Database http://www.ncbi.nlm.nih.gov/geo/ (GEO accession GSE15341). We obtained a small group of 81 genes significantly regulated by radiotherapy, lumped in 50 relevant pathways. Using ANOVA and t-test statistical tools we found 20 and 26 constitutive genes (0 Gy) that segregate patients with and without acute and late toxicity, respectively. Non-supervised hierarchical clustering was used for the visualization of results. Six and 9 pathways were significantly regulated respectively. Concerning to irradiated lymphocytes (2 Gy), we founded 29 genes that separate patients with acute toxicity and without it. Those genes were gathered in 4 significant pathways. We could not identify a set of genes that segregates patients with and without late toxicity. In conclusion, we have found an association between the constitutive gene expression profile of peripheral blood lymphocytes and the development of acute and late toxicity in consecutive, unselected patients. These observations suggest the possibility of predicting normal tissue response to irradiation in high-dose non-conventional radiation therapy regimens. Prospective studies with higher number of patients are needed to validate these preliminary results.
Resumo:
In the past decade, the advent of efficient genome sequencing tools and high-throughput experimental biotechnology has lead to enormous progress in the life science. Among the most important innovations is the microarray tecnology. It allows to quantify the expression for thousands of genes simultaneously by measurin the hybridization from a tissue of interest to probes on a small glass or plastic slide. The characteristics of these data include a fair amount of random noise, a predictor dimension in the thousand, and a sample noise in the dozens. One of the most exciting areas to which microarray technology has been applied is the challenge of deciphering complex disease such as cancer. In these studies, samples are taken from two or more groups of individuals with heterogeneous phenotypes, pathologies, or clinical outcomes. these samples are hybridized to microarrays in an effort to find a small number of genes which are strongly correlated with the group of individuals. Eventhough today methods to analyse the data are welle developed and close to reach a standard organization (through the effort of preposed International project like Microarray Gene Expression Data -MGED- Society [1]) it is not unfrequant to stumble in a clinician's question that do not have a compelling statistical method that could permit to answer it.The contribution of this dissertation in deciphering disease regards the development of new approaches aiming at handle open problems posed by clinicians in handle specific experimental designs. In Chapter 1 starting from a biological necessary introduction, we revise the microarray tecnologies and all the important steps that involve an experiment from the production of the array, to the quality controls ending with preprocessing steps that will be used into the data analysis in the rest of the dissertation. While in Chapter 2 a critical review of standard analysis methods are provided stressing most of problems that In Chapter 3 is introduced a method to adress the issue of unbalanced design of miacroarray experiments. In microarray experiments, experimental design is a crucial starting-point for obtaining reasonable results. In a two-class problem, an equal or similar number of samples it should be collected between the two classes. However in some cases, e.g. rare pathologies, the approach to be taken is less evident. We propose to address this issue by applying a modified version of SAM [2]. MultiSAM consists in a reiterated application of a SAM analysis, comparing the less populated class (LPC) with 1,000 random samplings of the same size from the more populated class (MPC) A list of the differentially expressed genes is generated for each SAM application. After 1,000 reiterations, each single probe given a "score" ranging from 0 to 1,000 based on its recurrence in the 1,000 lists as differentially expressed. The performance of MultiSAM was compared to the performance of SAM and LIMMA [3] over two simulated data sets via beta and exponential distribution. The results of all three algorithms over low- noise data sets seems acceptable However, on a real unbalanced two-channel data set reagardin Chronic Lymphocitic Leukemia, LIMMA finds no significant probe, SAM finds 23 significantly changed probes but cannot separate the two classes, while MultiSAM finds 122 probes with score >300 and separates the data into two clusters by hierarchical clustering. We also report extra-assay validation in terms of differentially expressed genes Although standard algorithms perform well over low-noise simulated data sets, multi-SAM seems to be the only one able to reveal subtle differences in gene expression profiles on real unbalanced data. In Chapter 4 a method to adress similarities evaluation in a three-class prblem by means of Relevance Vector Machine [4] is described. In fact, looking at microarray data in a prognostic and diagnostic clinical framework, not only differences could have a crucial role. In some cases similarities can give useful and, sometimes even more, important information. The goal, given three classes, could be to establish, with a certain level of confidence, if the third one is similar to the first or the second one. In this work we show that Relevance Vector Machine (RVM) [2] could be a possible solutions to the limitation of standard supervised classification. In fact, RVM offers many advantages compared, for example, with his well-known precursor (Support Vector Machine - SVM [3]). Among these advantages, the estimate of posterior probability of class membership represents a key feature to address the similarity issue. This is a highly important, but often overlooked, option of any practical pattern recognition system. We focused on Tumor-Grade-three-class problem, so we have 67 samples of grade I (G1), 54 samples of grade 3 (G3) and 100 samples of grade 2 (G2). The goal is to find a model able to separate G1 from G3, then evaluate the third class G2 as test-set to obtain the probability for samples of G2 to be member of class G1 or class G3. The analysis showed that breast cancer samples of grade II have a molecular profile more similar to breast cancer samples of grade I. Looking at the literature this result have been guessed, but no measure of significance was gived before.
Resumo:
The purpose of this Thesis is to develop a robust and powerful method to classify galaxies from large surveys, in order to establish and confirm the connections between the principal observational parameters of the galaxies (spectral features, colours, morphological indices), and help unveil the evolution of these parameters from $z \sim 1$ to the local Universe. Within the framework of zCOSMOS-bright survey, and making use of its large database of objects ($\sim 10\,000$ galaxies in the redshift range $0 < z \lesssim 1.2$) and its great reliability in redshift and spectral properties determinations, first we adopt and extend the \emph{classification cube method}, as developed by Mignoli et al. (2009), to exploit the bimodal properties of galaxies (spectral, photometric and morphologic) separately, and then combining together these three subclassifications. We use this classification method as a test for a newly devised statistical classification, based on Principal Component Analysis and Unsupervised Fuzzy Partition clustering method (PCA+UFP), which is able to define the galaxy population exploiting their natural global bimodality, considering simultaneously up to 8 different properties. The PCA+UFP analysis is a very powerful and robust tool to probe the nature and the evolution of galaxies in a survey. It allows to define with less uncertainties the classification of galaxies, adding the flexibility to be adapted to different parameters: being a fuzzy classification it avoids the problems due to a hard classification, such as the classification cube presented in the first part of the article. The PCA+UFP method can be easily applied to different datasets: it does not rely on the nature of the data and for this reason it can be successfully employed with others observables (magnitudes, colours) or derived properties (masses, luminosities, SFRs, etc.). The agreement between the two classification cluster definitions is very high. ``Early'' and ``late'' type galaxies are well defined by the spectral, photometric and morphological properties, both considering them in a separate way and then combining the classifications (classification cube) and treating them as a whole (PCA+UFP cluster analysis). Differences arise in the definition of outliers: the classification cube is much more sensitive to single measurement errors or misclassifications in one property than the PCA+UFP cluster analysis, in which errors are ``averaged out'' during the process. This method allowed us to behold the \emph{downsizing} effect taking place in the PC spaces: the migration between the blue cloud towards the red clump happens at higher redshifts for galaxies of larger mass. The determination of $M_{\mathrm{cross}}$ the transition mass is in significant agreement with others values in literature.