912 results for hierarchical clustering techniques


Relevance: 80.00%

Abstract:

Graduate Program in Agronomy (Plant Production) - FCAV

Relevance: 80.00%

Abstract:

Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)

Relevance: 80.00%

Abstract:

The proliferation of new electronic devices has considerably increased the amount of spatial data being collected, and these data are becoming more and more widely used. As with conventional data, spatial data need to be analyzed so that interesting information can be retrieved from them. Data clustering techniques can therefore be used to extract clusters from a set of spatial data. However, current approaches do not consider the implicit semantics that exist between a region and an object’s attributes. This paper presents an approach that enhances the spatial data mining process so that it can use the semantics that exist within a region. A framework, OntoSDM, was developed that enables spatial data mining algorithms to communicate with ontologies in order to enhance the algorithms’ results. The experiments demonstrated a semantically improved result, generating more interesting clusters and therefore reducing the manual analysis work of an expert.

Relevance: 80.00%

Abstract:

Background: Large gene expression studies, such as those conducted using DNA arrays, often provide millions of different pieces of data. To address the problem of analyzing such data, we describe a statistical method, which we have called ‘gene shaving’. The method identifies subsets of genes with coherent expression patterns and large variation across conditions. Gene shaving differs from hierarchical clustering and other widely used methods for analyzing gene expression studies in that genes may belong to more than one cluster, and the clustering may be supervised by an outcome measure. The technique can be ‘unsupervised’, that is, the genes and samples are treated as unlabeled, or partially or fully supervised by using known properties of the genes or samples to assist in finding meaningful groupings. Results: We illustrate the use of the gene shaving method to analyze gene expression measurements made on samples from patients with diffuse large B-cell lymphoma. The method identifies a small cluster of genes whose expression is highly predictive of survival. Conclusions: The gene shaving method is a potentially useful tool for exploration of gene expression data and identification of interesting clusters of genes worth further investigation.

Relevance: 80.00%

Abstract:

Biogeography has been difficult to apply as a methodological approach because organismic biology is incomplete at levels where the process of formulating comparisons and analogies is complex. The study of insect biogeography became necessary because insects possess numerous evolutionary traits and play an important role as pollinators. Among insects, the euglossine bees, or orchid bees, attract interest because the study of their biology allows us to explain important steps in the evolution of social behavior and many other adaptive tradeoffs. We analyzed the distribution of morphological characteristics in Colombian orchid bees from an ecological perspective. The aim of this study was to observe the distribution of these attributes on a regional basis. Data corresponding to Colombian euglossine species were ordered with a correspondence analysis followed by hierarchical clustering. Then, based on community properties, we compared the resulting hierarchical model with the collection localities to identify a biogeographic classification pattern. From this analysis, we derived a model that classifies the territory of Colombia into 11 biogeographic units or natural clusters. Ecological assumptions in concordance with the derived classification levels suggest that species characteristics associated with flight performance, nectar uptake, and social behavior are the factors that produced the current geographical structure.

Relevance: 80.00%

Abstract:

Alzheimer's disease (AD) is the most common cause of dementia in the human population, characterized by a spectrum of neuropathological abnormalities that results in memory impairment and loss of other cognitive processes as well as the presence of non-cognitive symptoms. Transcriptomic analyses provide an important approach to elucidating the pathogenesis of complex diseases like AD, helping to identify both pre-clinical markers of susceptible patients and the early pathogenic mechanisms that could serve as therapeutic targets. This study provides the gene expression profile of postmortem brain tissue from subjects with clinicopathological AD (Braak IV, V, or VI and CERAD B or C; and CDR >= 1), preclinical AD (Braak IV, V, or VI and CERAD B or C; and CDR = 0), and healthy older individuals (Braak <= II and CERAD 0 or A; and CDR = 0) in order to establish genes related to both AD neuropathology and the clinical emergence of dementia. Based on differential gene expression, hierarchical clustering and network analysis, genes involved in energy metabolism, oxidative stress, DNA damage/repair, senescence, and transcriptional regulation were implicated in the neuropathology of AD. A transcriptional profile related to the clinical manifestation of AD could not be reliably detected using differential gene expression analysis, although genes involved in synaptic plasticity and the cell cycle seem to have a role, as revealed by a gene classifier. In conclusion, the present data suggest gene expression profile changes secondary to the development of AD-related pathology, and some genes that appear to be related to the clinical manifestation of dementia in subjects with significant AD pathology. Further investigations are needed to better understand the bearing of these transcriptional findings on the pathogenesis and clinical emergence of AD.

Relevance: 80.00%

Abstract:

Background: To understand the molecular mechanisms underlying important biological processes, a detailed description of the gene product networks involved is required. In order to define and understand such molecular networks, several statistical methods have been proposed in the literature to estimate gene regulatory networks from time-series microarray data. However, several problems still need to be overcome. Firstly, information flow needs to be inferred, in addition to the correlation between genes. Secondly, we usually try to identify large networks from a large number of genes (parameters) originating from a smaller number of microarray experiments (samples). Due to this situation, which is rather frequent in bioinformatics, it is difficult to perform statistical tests using methods that model large gene-gene networks. In addition, most of the models are based on dimension reduction using clustering techniques; the resulting network is therefore not a gene-gene network but a module-module network. Here, we present the Sparse Vector Autoregressive (SVAR) model as a solution to these problems. Results: We have applied the SVAR model to estimate gene regulatory networks based on gene expression profiles obtained from time-series microarray experiments. Through extensive simulations, by applying the SVAR method to artificial regulatory networks, we show that SVAR can infer true positive edges even under conditions in which the number of samples is smaller than the number of genes. Moreover, it is possible to control for false positives, a significant advantage when compared to other methods described in the literature, which are based on ranks or score functions. By applying SVAR to actual HeLa cell cycle gene expression data, we were able to identify well-known transcription factor targets. Conclusion: The proposed SVAR method is able to model gene regulatory networks in frequent situations in which the number of samples is lower than the number of genes, making it possible to naturally infer partial Granger causalities without any a priori information. In addition, we present a statistical test to control the false discovery rate, which was not previously possible using other gene regulatory network models.
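
The abstract above does not spell out the model. As a rough orientation only, a sparse vector autoregression is often written as follows; the lag order p, the coefficient matrices A_l, the L1 penalty and the tuning parameter lambda are assumptions made for this sketch and are not taken from the abstract.

```latex
% Assumed VAR(p) form: x_t is the vector of expression levels of all genes at time t.
\mathbf{x}_t = \sum_{l=1}^{p} A_l \, \mathbf{x}_{t-l} + \boldsymbol{\varepsilon}_t

% Assumed sparse estimate: least squares plus an L1 penalty with tuning parameter \lambda.
\hat{A}_1, \dots, \hat{A}_p = \arg\min_{A_1, \dots, A_p}
  \sum_{t=p+1}^{T} \Bigl\| \mathbf{x}_t - \sum_{l=1}^{p} A_l \, \mathbf{x}_{t-l} \Bigr\|_2^2
  + \lambda \sum_{l=1}^{p} \| A_l \|_1
```

Under this reading, a nonzero entry (i, j) in some A_l would be interpreted as gene j Granger-causing gene i, which is how a gene-gene network (rather than a module-module network) could be read off the fitted coefficients.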

Relevance: 80.00%

Abstract:

Background: Oral squamous cell carcinoma (OSCC) is a frequent neoplasm, which is usually aggressive and has unpredictable biological behavior and unfavorable prognosis. The comprehension of the molecular basis of this variability should lead to the development of targeted therapies as well as to improvements in specificity and sensitivity of diagnosis. Results: Samples of primary OSCCs and their corresponding surgical margins were obtained from male patients during surgery and their gene expression profiles were screened using whole-genome microarray technology. Hierarchical clustering and Principal Components Analysis were used for data visualization and One-way Analysis of Variance was used to identify differentially expressed genes. Samples clustered mostly according to disease subsite, suggesting molecular heterogeneity within tumor stages. In order to corroborate our results, two publicly available datasets of microarray experiments were assessed. We found significant molecular differences between OSCC anatomic subsites concerning groups of genes presently or potentially important for drug development, including mRNA processing, cytoskeleton organization and biogenesis, metabolic process, cell cycle and apoptosis. Conclusion: Our results corroborate literature data on molecular heterogeneity of OSCCs. Differences between disease subsites and among samples belonging to the same TNM class highlight the importance of gene expression-based classification and challenge the development of targeted therapies.

Relevance: 80.00%

Abstract:

Background: Regardless of the regulatory function of microRNAs (miRNA), their differential expression pattern has been used to define miRNA signatures and to disclose disease biomarkers. To address the question of whether patients presenting the different types of diabetes mellitus could be distinguished on the basis of their miRNA and mRNA expression profiling, we obtained peripheral blood mononuclear cell (PBMC) RNAs from 7 type 1 (T1D), 7 type 2 (T2D), and 6 gestational diabetes (GDM) patients, which were hybridized to Agilent miRNA and mRNA microarrays. Data quantification and quality control were performed using the Feature Extraction software, and the data distribution was normalized using the quantile function implemented in the Aroma light package. Differentially expressed miRNAs/mRNAs were identified using Rank products, comparing T1D vs. GDM, T2D vs. GDM and T1D vs. T2D. Hierarchical clustering was performed using the average linkage criterion with uncentered Pearson distance as the metric. Results: The use of the same microarray platform permitted the identification of sets of shared or specific miRNA/mRNA interactions for each type of diabetes. Nine miRNAs (hsa-miR-126, hsa-miR-1307, hsa-miR-142-3p, hsa-miR-142-5p, hsa-miR-144, hsa-miR-199a-5p, hsa-miR-27a, hsa-miR-29b, and hsa-miR-342-3p) were shared among T1D, T2D and GDM, and additional specific miRNAs were identified for T1D (20 miRNAs), T2D (14) and GDM (19) patients. ROC curves allowed the identification of specific and relevant (greater AUC values) miRNAs for each type of diabetes, including: i) hsa-miR-1274a, hsa-miR-1274b and hsa-let-7f for T1D; ii) hsa-miR-222, hsa-miR-30e and hsa-miR-140-3p for T2D; and iii) hsa-miR-181a and hsa-miR-1268 for GDM. Many of these miRNAs targeted mRNAs associated with diabetes pathogenesis. Conclusions: These results indicate that PBMC can be used as reporter cells to characterize the miRNA expression profiling disclosed by the different diabetes mellitus manifestations. Shared miRNAs may characterize diabetes as a metabolic and inflammatory disorder, whereas specific miRNAs may represent biological markers for each type of diabetes, deserving further attention.
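
The clustering set-up named in the abstract above (average linkage with an uncentered Pearson distance) can be illustrated with a small, standard-library-only sketch. It assumes the usual reading of "uncentered Pearson distance" as one minus the cosine similarity; the toy profiles and the function names are hypothetical and do not come from the study.

```python
import math


def uncentered_pearson_distance(x, y):
    """1 - uncentered correlation (cosine similarity); assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def average_linkage(profiles, n_clusters):
    """Agglomerative clustering: repeatedly merge the two clusters whose
    mean pairwise distance is smallest, until n_clusters remain."""
    clusters = [[i] for i in range(len(profiles))]
    dist = [[uncentered_pearson_distance(p, q) for q in profiles] for p in profiles]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [dist[i][j] for i in clusters[a] for j in clusters[b]]
                d = sum(pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical toy expression profiles (rows = samples, columns = genes).
profiles = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.1],
    [3.0, 1.0, 0.5],
    [6.0, 2.1, 1.0],
]
print(average_linkage(profiles, 2))
```

Because the distance ignores vector length, samples 0 and 1 (and likewise 2 and 3) are grouped together even though their absolute expression levels differ by roughly a factor of two.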

Relevance: 80.00%

Abstract:

Breast cancer patients show a wide variation in normal tissue reactions after radiotherapy. The individual sensitivity to x-rays limits the efficiency of the therapy. Prediction of individual sensitivity to radiotherapy could help to select the radiation protocol and to improve treatment results. The aim of this study was to assess the relationship between gene expression profiles of ex vivo un-irradiated and irradiated lymphocytes and the development of toxicity due to high-dose hyperfractionated radiotherapy in patients with locally advanced breast cancer. Raw data from microarray experiments were uploaded to the Gene Expression Omnibus Database http://www.ncbi.nlm.nih.gov/geo/ (GEO accession GSE15341). We obtained a small group of 81 genes significantly regulated by radiotherapy, grouped into 50 relevant pathways. Using ANOVA and t-test statistical tools we found 20 and 26 constitutive genes (0 Gy) that segregate patients with and without acute and late toxicity, respectively. Non-supervised hierarchical clustering was used for the visualization of results. Six and 9 pathways, respectively, were significantly regulated. Concerning irradiated lymphocytes (2 Gy), we found 29 genes that separate patients with acute toxicity from those without it. Those genes were gathered in 4 significant pathways. We could not identify a set of genes that segregates patients with and without late toxicity. In conclusion, we have found an association between the constitutive gene expression profile of peripheral blood lymphocytes and the development of acute and late toxicity in consecutive, unselected patients. These observations suggest the possibility of predicting normal tissue response to irradiation in high-dose non-conventional radiation therapy regimens. Prospective studies with larger numbers of patients are needed to validate these preliminary results.

Relevance: 80.00%

Abstract:

In the past decade, the advent of efficient genome sequencing tools and high-throughput experimental biotechnology has led to enormous progress in the life sciences. Among the most important innovations is microarray technology. It makes it possible to quantify the expression of thousands of genes simultaneously by measuring the hybridization from a tissue of interest to probes on a small glass or plastic slide. The characteristics of these data include a fair amount of random noise, a predictor dimension in the thousands, and a sample size in the dozens. One of the most exciting areas to which microarray technology has been applied is the challenge of deciphering complex diseases such as cancer. In these studies, samples are taken from two or more groups of individuals with heterogeneous phenotypes, pathologies, or clinical outcomes. These samples are hybridized to microarrays in an effort to find a small number of genes that are strongly correlated with the group of individuals. Even though methods to analyse the data are now well developed and close to reaching a standard organization (through the effort of proposed international projects such as the Microarray Gene Expression Data (MGED) Society [1]), it is not infrequent to come across a clinician's question for which no compelling statistical method is available. The contribution of this dissertation to deciphering disease is the development of new approaches aimed at handling open problems posed by clinicians in specific experimental designs. In Chapter 1, starting from a necessary biological introduction, we review microarray technologies and all the important steps involved in an experiment, from the production of the array to quality control, ending with the preprocessing steps that are used in the data analysis in the rest of the dissertation.
In Chapter 2, a critical review of standard analysis methods is provided, stressing most of the problems that ... In Chapter 3, a method is introduced to address the issue of unbalanced design in microarray experiments. In microarray experiments, experimental design is a crucial starting point for obtaining reasonable results. In a two-class problem, an equal or similar number of samples should be collected for the two classes. However, in some cases, e.g. rare pathologies, the approach to be taken is less evident. We propose to address this issue by applying a modified version of SAM [2]. MultiSAM consists of a reiterated application of a SAM analysis, comparing the less populated class (LPC) with 1,000 random samplings of the same size from the more populated class (MPC). A list of the differentially expressed genes is generated for each SAM application. After 1,000 reiterations, each single probe is given a "score" ranging from 0 to 1,000 based on its recurrence in the 1,000 lists as differentially expressed. The performance of MultiSAM was compared to the performance of SAM and LIMMA [3] over two data sets simulated via beta and exponential distributions. The results of all three algorithms over low-noise data sets seem acceptable. However, on a real unbalanced two-channel data set regarding Chronic Lymphocytic Leukemia, LIMMA finds no significant probe, SAM finds 23 significantly changed probes but cannot separate the two classes, while MultiSAM finds 122 probes with score >300 and separates the data into two clusters by hierarchical clustering. We also report extra-assay validation in terms of differentially expressed genes. Although standard algorithms perform well over low-noise simulated data sets, MultiSAM seems to be the only one able to reveal subtle differences in gene expression profiles on real unbalanced data. In Chapter 4, a method to address similarity evaluation in a three-class problem by means of the Relevance Vector Machine [4] is described.
Indeed, when microarray data are viewed in a prognostic and diagnostic clinical framework, differences are not the only thing that can play a crucial role. In some cases similarities can give useful, and sometimes even more important, information. The goal, given three classes, could be to establish, with a certain level of confidence, whether the third one is similar to the first or to the second. In this work we show that the Relevance Vector Machine (RVM) [2] could be a possible solution to the limitations of standard supervised classification. In fact, RVM offers many advantages compared, for example, with its well-known precursor, the Support Vector Machine (SVM) [3]. Among these advantages, the estimate of the posterior probability of class membership represents a key feature for addressing the similarity issue. This is a highly important, but often overlooked, option of any practical pattern recognition system. We focused on a three-class tumor grade problem, with 67 samples of grade 1 (G1), 54 samples of grade 3 (G3) and 100 samples of grade 2 (G2). The goal is to find a model able to separate G1 from G3, and then to evaluate the third class, G2, as a test set to obtain the probability of each G2 sample being a member of class G1 or class G3. The analysis showed that breast cancer samples of grade 2 have a molecular profile more similar to breast cancer samples of grade 1. This result had been conjectured in the literature, but no measure of significance had been given before.
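
The resampling idea behind MultiSAM described in this abstract (compare the less populated class with repeated same-size random draws from the more populated class, then score each probe by how often it is called) can be sketched as below. This is not the authors' implementation: a plain Welch t-statistic with a hypothetical cutoff stands in for the SAM analysis, and the function names and toy data are invented for illustration.

```python
import random
import statistics


def welch_t(a, b):
    """Welch t-statistic; a stand-in for SAM, which is not reimplemented here."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se


def recurrence_scores(lpc, mpc, n_iter=1000, t_cutoff=3.0, seed=0):
    """Score each probe by how many of n_iter balanced comparisons call it.

    lpc, mpc: lists of samples, each sample a list of probe values.
    t_cutoff is a hypothetical threshold chosen only for this sketch.
    """
    rng = random.Random(seed)
    n_probes = len(lpc[0])
    scores = [0] * n_probes
    for _ in range(n_iter):
        # Draw a subset of the more populated class, same size as the LPC.
        subset = rng.sample(mpc, len(lpc))
        for p in range(n_probes):
            a = [sample[p] for sample in lpc]
            b = [sample[p] for sample in subset]
            if abs(welch_t(a, b)) > t_cutoff:
                scores[p] += 1
    return scores


# Hypothetical toy data: probe 0 differs between classes, probe 1 does not.
lpc = [[10.0, 1.0], [10.5, 1.2], [9.5, 0.8], [10.2, 1.1]]
mpc = [[(k % 5) * 0.2 - 0.4, 1.0 + (k % 3) * 0.1 - 0.1] for k in range(12)]
scores = recurrence_scores(lpc, mpc)
print(scores)
```

Probes could then be kept when their score exceeds a chosen recurrence threshold (the abstract reports results for score >300) and passed on to hierarchical clustering.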

Relevance: 80.00%

Abstract:

The vast majority of known proteins have not yet been experimentally characterized and little is known about their function. The design and implementation of computational tools can provide insight into the function of proteins based on their sequence, their structure, their evolutionary history and their association with other proteins. Knowledge of the three-dimensional (3D) structure of a protein can lead to a deep understanding of its mode of action and interaction, but currently the structures of <1% of sequences have been experimentally solved. For this reason, it became urgent to develop new methods that are able to computationally extract relevant information from protein sequence and structure. The starting point of my work has been the study of the properties of contacts between protein residues, since they constrain protein folding and characterize different protein structures. Prediction of residue contacts in proteins is an interesting problem whose solution may be useful in protein folding recognition and de novo design. The prediction of these contacts requires the study of the protein inter-residue distances related to the specific type of amino acid pair that are encoded in the so-called contact map. An interesting new way of analyzing those structures came out when network studies were introduced, with pivotal papers demonstrating that protein contact networks also exhibit small-world behavior. In order to highlight constraints for the prediction of protein contact maps and for applications in the field of protein structure prediction and/or reconstruction from experimentally determined contact maps, I studied to which extent the characteristic path length and clustering coefficient of the protein contacts network are values that reveal characteristic features of protein contact maps. 
Provided that residue contacts are known for a protein sequence, the major features of its 3D structure could be deduced by combining this knowledge with correctly predicted motifs of secondary structure. In the second part of my work I focused on a particular protein structural motif, the coiled-coil, known to mediate a variety of fundamental biological interactions. Coiled-coils are found in a variety of structural forms and in a wide range of proteins including, for example, small units such as leucine zippers that drive the dimerization of many transcription factors or more complex structures such as the family of viral proteins responsible for virus-host membrane fusion. The coiled-coil structural motif is estimated to account for 5-10% of the protein sequences in the various genomes. Given their biological importance, in my work I introduced a Hidden Markov Model (HMM) that exploits the evolutionary information derived from multiple sequence alignments, to predict coiled-coil regions and to discriminate coiled-coil sequences. The results indicate that the new HMM outperforms all the existing programs and can be adopted for the coiled-coil prediction and for large-scale genome annotation. Genome annotation is a key issue in modern computational biology, being the starting point towards the understanding of the complex processes involved in biological networks. The rapid growth in the number of protein sequences and structures available poses new fundamental problems that still deserve an interpretation. Nevertheless, these data are at the basis of the design of new strategies for tackling problems such as the prediction of protein structure and function. Experimental determination of the functions of all these proteins would be a hugely time-consuming and costly task and, in most instances, has not been carried out. As an example, currently, approximately only 20% of annotated proteins in the Homo sapiens genome have been experimentally characterized. 
A commonly adopted procedure for annotating protein sequences relies on "inheritance through homology", based on the notion that similar sequences share similar functions and structures. This procedure consists of assigning sequences to a specific group of functionally related sequences that have been grouped through clustering techniques. The clustering procedure is based on suitable similarity rules, since predicting protein structure and function from sequence largely depends on the value of sequence identity. However, additional levels of complexity are due to multi-domain proteins, to proteins that share common domains but do not necessarily share the same function, and to the finding that different combinations of shared domains can lead to different biological roles. In the last part of this study I developed and validated a system that contributes to sequence annotation by taking advantage of a validated transfer-through-inheritance procedure for molecular functions and structural templates. After a cross-genome comparison with the BLAST program, clusters were built on the basis of two stringent constraints on sequence identity and coverage of the alignment. The adopted measure explicitly addresses the problem of multi-domain protein annotation and allows a fine-grained division of the whole set of proteomes used, which ensures cluster homogeneity in terms of sequence length. A high level of coverage of structure templates over the length of protein sequences within clusters ensures that multi-domain proteins, when present, can be templates for sequences of similar length. This annotation procedure includes the possibility of reliably transferring statistically validated functions and structures to sequences, considering the information available in the present databases of molecular functions and structures.
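
The two network measures mentioned in the first part of this abstract, the characteristic path length and the clustering coefficient of a protein contact network, can be computed as in the following sketch. It assumes an undirected, unweighted, connected graph whose nodes are residues and whose edges are contacts; the toy contact list is hypothetical and the helper names are not from the thesis.

```python
from collections import deque


def contact_graph(contacts, n_residues):
    """Build an undirected adjacency map from a list of residue-pair contacts."""
    adj = {i: set() for i in range(n_residues)}
    for i, j in contacts:
        adj[i].add(j)
        adj[j].add(i)
    return adj


def clustering_coefficient(adj):
    """Mean, over nodes, of the fraction of neighbour pairs that are linked.
    Nodes with fewer than two neighbours contribute 0."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        nb = list(nbrs)
        links = sum(1 for a in range(k) for b in range(a + 1, k) if nb[b] in adj[nb[a]])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def characteristic_path_length(adj):
    """Mean shortest-path length over all reachable node pairs (BFS per node)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


# Hypothetical contact list for a 5-residue toy chain with two extra contacts.
contacts = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 2), (1, 3)]
graph = contact_graph(contacts, 5)
print(characteristic_path_length(graph), clustering_coefficient(graph))
```

A short characteristic path length combined with a high clustering coefficient, relative to a random graph of the same size, is the usual signature of small-world behavior.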

Relevance: 80.00%

Abstract:

Because of its aberrant activation, the PI3K/Akt/mTOR signaling pathway represents a pharmacological target in blast cells from patients with acute myelogenous leukemia (AML). Using Reverse Phase Protein Microarrays (RPMA), we analyzed 20 phosphorylated epitopes of the PI3K/Akt/mTOR signaling pathway in peripheral blood and bone marrow specimens from 84 patients with newly diagnosed AML. Fresh blast cells were grown for 2 h, 4 h or 20 h untreated or treated with a panel of phase I or phase II Akt allosteric inhibitors, either alone or in combination with the mTOR kinase inhibitor Torin1 or the broad RTK inhibitor Sunitinib. By unsupervised hierarchical clustering, strong phosphorylation/activity of most of the sampled members of the PI3K/Akt/mTOR pathway was observed in 70% of samples from AML patients. Remarkably, however, we observed that inhibition of Akt phosphorylation, as well as of its substrates, was transient, and recovered or even increased far above the basal level after 20 h in 60% of samples. We demonstrated that inhibition of Akt induces FOXO-dependent insulin receptor expression and IRS-1 activation, attenuating the effect of drug treatment by reactivation of PI3K/Akt. Consistent with this model, we found that combined inhibition of Akt and RTKs is much more effective than either alone, revealing the adaptive capabilities of signaling networks in blast cells and highlighting the limitations of these drugs if used as monotherapy.

Relevance: 80.00%

Abstract:

Intelligent Transport Systems (ITS) consist of the application of ICT to transport in order to offer new and improved services for the mobility of people and freight. While using ITS, travellers produce large quantities of data that can be collected and analysed to study their behaviour and to provide information to decision makers and planners. The thesis proposes innovative deployments of classification algorithms for Intelligent Transport Systems with the aim of supporting decisions on traffic rerouting, bus transport demand and the behaviour of two-wheeled vehicles. The first part of this work provides an overview and a classification of a selection of clustering algorithms that can be implemented for the analysis of ITS data. The first contribution of this thesis is an innovative use of the agglomerative hierarchical clustering algorithm to classify similar trips in terms of their origin and destination, together with a proposed methodology for analysing drivers’ route choice behaviour using GPS coordinates and optimal alternatives. The clusters of repetitive trips made by a sample of drivers are then analysed to compare observed route choices with the modelled alternatives. The results of the analysis show that drivers select routes that are more reliable but more expensive in terms of travel time. Subsequently, different types of users of a service that provides real-time information on bus arrivals at stops are classified using Support Vector Machines. The results show that the classification of different types of bus transport users can be used to update or complement the census on bus transport flows. Finally, the problem of classifying accidents involving two-wheeled vehicles is presented, together with possible future applications of clustering methodologies aimed at identifying and classifying the different types of accidents.

Relevance: 80.00%

Abstract:

The purpose of this study was to determine the role of saliva-derived biomarkers and periodontal pathogens during periodontal disease progression (PDP). One hundred human participants were recruited into a 12-month investigation. They were seen bi-monthly for saliva and clinical measures and bi-annually for subtraction radiography, serum and plaque biofilm assessments. Saliva and serum were analyzed with protein arrays for 14 pro-inflammatory and bone turnover markers, while qPCR was used for detection of biofilm. A hierarchical clustering algorithm was used to group study participants based on clinical, microbiological, salivary/serum biomarkers, and PDP. Eighty-three individuals completed the six-month monitoring phase, with 39 exhibiting PDP, while 44 demonstrated stability. Participants assembled into three clusters based on periodontal pathogens, serum and salivary biomarkers. Cluster 1 members displayed high salivary biomarkers and biofilm; 71% of these individuals were undergoing PDP. Cluster 2 members displayed low biofilm and biomarker levels; 76% of these individuals were stable. Cluster 3 members were not discriminated by PDP status; however, cluster stratification followed groups 1 and 2 based on thresholds of salivary biomarkers and biofilm pathogens. The association of cluster membership to PDP was highly significant (p < 0.0007). The use of salivary and biofilm biomarkers offers potential for the identification of PDP or stability (ClinicalTrials.gov number, CT00277745).