992 resultados para Gene clustering
Resumo:
The amount of biological data has grown exponentially in recent decades. Modern biotechnologies, such as microarrays and next-generation sequencing, are capable to produce massive amounts of biomedical data in a single experiment. As the amount of the data is rapidly growing there is an urgent need for reliable computational methods for analyzing and visualizing it. This thesis addresses this need by studying how to efficiently and reliably analyze and visualize high-dimensional data, especially that obtained from gene expression microarray experiments. First, we will study the ways to improve the quality of microarray data by replacing (imputing) the missing data entries with the estimated values for these entries. Missing value imputation is a method which is commonly used to make the original incomplete data complete, thus making it easier to be analyzed with statistical and computational methods. Our novel approach was to use curated external biological information as a guide for the missing value imputation. Secondly, we studied the effect of missing value imputation on the downstream data analysis methods like clustering. We compared multiple recent imputation algorithms against 8 publicly available microarray data sets. It was observed that the missing value imputation indeed is a rational way to improve the quality of biological data. The research revealed differences between the clustering results obtained with different imputation methods. On most data sets, the simple and fast k-NN imputation was good enough, but there were also needs for more advanced imputation methods, such as Bayesian Principal Component Algorithm (BPCA). Finally, we studied the visualization of biological network data. Biological interaction networks are examples of the outcome of multiple biological experiments such as using the gene microarray techniques. Such networks are typically very large and highly connected, thus there is a need for fast algorithms for producing visually pleasant layouts. A computationally efficient way to produce layouts of large biological interaction networks was developed. The algorithm uses multilevel optimization within the regular force directed graph layout algorithm.
Resumo:
Chronic hepatitis B (HBV) and C (HCV) virus infections are the most important factors associated with hepatocellular carcinoma (HCC), but tumor prognosis remains poor due to the lack of diagnostic biomarkers. In order to identify novel diagnostic markers and therapeutic targets, the gene expression profile associated with viral and non-viral HCC was assessed in 9 tumor samples by oligo-microarrays. The differentially expressed genes were examined using a z-score and KEGG pathway for the search of ontological biological processes. We selected a non-redundant set of 15 genes with the lowest P value for clustering samples into three groups using the non-supervised algorithm k-means. Fisher’s linear discriminant analysis was then applied in an exhaustive search of trios of genes that could be used to build classifiers for class distinction. Different transcriptional levels of genes were identified in HCC of different etiologies and from different HCC samples. When comparing HBV-HCC vs HCV-HCC, HBV-HCC/HCV-HCC vs non-viral (NV)-HCC, HBC-HCC vs NV-HCC, and HCV-HCC vs NV-HCC of the 58 non-redundant differentially expressed genes, only 6 genes (IKBKβ, CREBBP, WNT10B, PRDX6, ITGAV, and IFNAR1) were found to be associated with hepatic carcinogenesis. By combining trios, classifiers could be generated, which correctly classified 100% of the samples. This expression profiling may provide a useful tool for research into the pathophysiology of HCC. A detailed understanding of how these distinct genes are involved in molecular pathways is of fundamental importance to the development of effective HCC chemoprevention and treatment.
Resumo:
Naïvement perçu, le processus d’évolution est une succession d’événements de duplication et de mutations graduelles dans le génome qui mènent à des changements dans les fonctions et les interactions du protéome. La famille des hydrolases de guanosine triphosphate (GTPases) similaire à Ras constitue un bon modèle de travail afin de comprendre ce phénomène fondamental, car cette famille de protéines contient un nombre limité d’éléments qui diffèrent en fonctionnalité et en interactions. Globalement, nous désirons comprendre comment les mutations singulières au niveau des GTPases affectent la morphologie des cellules ainsi que leur degré d’impact sur les populations asynchrones. Mon travail de maîtrise vise à classifier de manière significative différents phénotypes de la levure Saccaromyces cerevisiae via l’analyse de plusieurs critères morphologiques de souches exprimant des GTPases mutées et natives. Notre approche à base de microscopie et d’analyses bioinformatique des images DIC (microscopie d’interférence différentielle de contraste) permet de distinguer les phénotypes propres aux cellules natives et aux mutants. L’emploi de cette méthode a permis une détection automatisée et une caractérisation des phénotypes mutants associés à la sur-expression de GTPases constitutivement actives. Les mutants de GTPases constitutivement actifs Cdc42 Q61L, Rho5 Q91H, Ras1 Q68L et Rsr1 G12V ont été analysés avec succès. En effet, l’implémentation de différents algorithmes de partitionnement, permet d’analyser des données qui combinent les mesures morphologiques de population native et mutantes. Nos résultats démontrent que l’algorithme Fuzzy C-Means performe un partitionnement efficace des cellules natives ou mutantes, où les différents types de cellules sont classifiés en fonction de plusieurs facteurs de formes cellulaires obtenus à partir des images DIC. Cette analyse démontre que les mutations Cdc42 Q61L, Rho5 Q91H, Ras1 Q68L et Rsr1 G12V induisent respectivement des phénotypes amorphe, allongé, rond et large qui sont représentés par des vecteurs de facteurs de forme distincts. Ces distinctions sont observées avec différentes proportions (morphologie mutante / morphologie native) dans les populations de mutants. Le développement de nouvelles méthodes automatisées d’analyse morphologique des cellules natives et mutantes s’avère extrêmement utile pour l’étude de la famille des GTPases ainsi que des résidus spécifiques qui dictent leurs fonctions et réseau d’interaction. Nous pouvons maintenant envisager de produire des mutants de GTPases qui inversent leur fonction en ciblant des résidus divergents. La substitution fonctionnelle est ensuite détectée au niveau morphologique grâce à notre nouvelle stratégie quantitative. Ce type d’analyse peut également être transposé à d’autres familles de protéines et contribuer de manière significative au domaine de la biologie évolutive.
Resumo:
Computational Biology is the research are that contributes to the analysis of biological data through the development of algorithms which will address significant research problems.The data from molecular biology includes DNA,RNA ,Protein and Gene expression data.Gene Expression Data provides the expression level of genes under different conditions.Gene expression is the process of transcribing the DNA sequence of a gene into mRNA sequences which in turn are later translated into proteins.The number of copies of mRNA produced is called the expression level of a gene.Gene expression data is organized in the form of a matrix. Rows in the matrix represent genes and columns in the matrix represent experimental conditions.Experimental conditions can be different tissue types or time points.Entries in the gene expression matrix are real values.Through the analysis of gene expression data it is possible to determine the behavioral patterns of genes such as similarity of their behavior,nature of their interaction,their respective contribution to the same pathways and so on. Similar expression patterns are exhibited by the genes participating in the same biological process.These patterns have immense relevance and application in bioinformatics and clinical research.Theses patterns are used in the medical domain for aid in more accurate diagnosis,prognosis,treatment planning.drug discovery and protein network analysis.To identify various patterns from gene expression data,data mining techniques are essential.Clustering is an important data mining technique for the analysis of gene expression data.To overcome the problems associated with clustering,biclustering is introduced.Biclustering refers to simultaneous clustering of both rows and columns of a data matrix. Clustering is a global whereas biclustering is a local model.Discovering local expression patterns is essential for identfying many genetic pathways that are not apparent otherwise.It is therefore necessary to move beyond the clustering paradigm towards developing approaches which are capable of discovering local patterns in gene expression data.A biclusters is a submatrix of the gene expression data matrix.The rows and columns in the submatrix need not be contiguous as in the gene expression data matrix.Biclusters are not disjoint.Computation of biclusters is costly because one will have to consider all the combinations of columans and rows in order to find out all the biclusters.The search space for the biclustering problem is 2 m+n where m and n are the number of genes and conditions respectively.Usually m+n is more than 3000.The biclustering problem is NP-hard.Biclustering is a powerful analytical tool for the biologist.The research reported in this thesis addresses the problem of biclustering.Ten algorithms are developed for the identification of coherent biclusters from gene expression data.All these algorithms are making use of a measure called mean squared residue to search for biclusters.The objective here is to identify the biclusters of maximum size with the mean squared residue lower than a given threshold. All these algorithms begin the search from tightly coregulated submatrices called the seeds.These seeds are generated by K-Means clustering algorithm.The algorithms developed can be classified as constraint based,greedy and metaheuristic.Constarint based algorithms uses one or more of the various constaints namely the MSR threshold and the MSR difference threshold.The greedy approach makes a locally optimal choice at each stage with the objective of finding the global optimum.In metaheuristic approaches particle Swarm Optimization(PSO) and variants of Greedy Randomized Adaptive Search Procedure(GRASP) are used for the identification of biclusters.These algorithms are implemented on the Yeast and Lymphoma datasets.Biologically relevant and statistically significant biclusters are identified by all these algorithms which are validated by Gene Ontology database.All these algorithms are compared with some other biclustering algorithms.Algorithms developed in this work overcome some of the problems associated with the already existing algorithms.With the help of some of the algorithms which are developed in this work biclusters with very high row variance,which is higher than the row variance of any other algorithm using mean squared residue, are identified from both Yeast and Lymphoma data sets.Such biclusters which make significant change in the expression level are highly relevant biologically.
Resumo:
Biclustering is simultaneous clustering of both rows and columns of a data matrix. A measure called Mean Squared Residue (MSR) is used to simultaneously evaluate the coherence of rows and columns within a submatrix. In this paper a novel algorithm is developed for biclustering gene expression data using the newly introduced concept of MSR difference threshold. In the first step high quality bicluster seeds are generated using K-Means clustering algorithm. Then more genes and conditions (node) are added to the bicluster. Before adding a node the MSR X of the bicluster is calculated. After adding the node again the MSR Y is calculated. The added node is deleted if Y minus X is greater than MSR difference threshold or if Y is greater than MSR threshold which depends on the dataset. The MSR difference threshold is different for gene list and condition list and it depends on the dataset also. Proper values should be identified through experimentation in order to obtain biclusters of high quality. The results obtained on bench mark dataset clearly indicate that this algorithm is better than many of the existing biclustering algorithms
Resumo:
A series of vectors for the over-expression of tagged proteins in Dictyostelium were designed, constructed and tested. These vectors allow the addition of an N- or C-terminal tag (GFP, RFP, 3xFLAG, 3xHA, 6xMYC and TAP) with an optimized polylinker sequence and no additional amino acid residues at the N or C terminus. Different selectable markers (Blasticidin and gentamicin) are available as well as an extra chromosomal version; these allow copy number and thus expression level to be controlled, as well as allowing for more options with regard to complementation, co- and super-transformation. Finally, the vectors share standardized cloning sites, allowing a gene of interest to be easily transfered between the different versions of the vectors as experimental requirements evolve. The organisation and dynamics of the Dictyostelium nucleus during the cell cycle was investigated. The centromeric histone H3 (CenH3) variant serves to target the kinetochore to the centromeres and thus ensures correct chromosome segregation during mitosis and meiosis. A number of Dictyostelium histone H3-domain containing proteins as GFP-tagged fusions were expressed and it was found that one of them functions as CenH3 in this species. Like CenH3 from some other species, Dictyostelium CenH3 has an extended N-terminal domain with no similarity to any other known proteins. The targeting domain, comprising α-helix 2 and loop 1 of the histone fold is required for targeting CenH3 to centromeres. Compared to the targeting domain of other known and putative CenH3 species, Dictyostelium CenH3 has a shorter loop 1 region. The localisation of a variety of histone modifications and histone modifying enzymes was examined. Using fluorescence in situ hybridisation (FISH) and CenH3 chromatin-immunoprecipitation (ChIP) it was shown that the six telocentric centromeres contain all of the DIRS-1 and most of the DDT-A and skipper transposons. During interphase the centromeres remain attached to the centrosome resulting in a single CenH3 cluster which also contains the putative histone H3K9 methyltransferase SuvA, H3K9me3 and HP1 (heterochromatin protein 1). Except for the centromere cluster and a number of small foci at the nuclear periphery opposite the centromeres, the rest of the nucleus is largely devoid of transposons and heterochromatin associated histone modifications. At least some of the small foci correspond to the distal telomeres, suggesting that the chromosomes are organised in a Rabl-like manner. It was found that in contrast to metazoans, loading of CenH3 onto Dictyostelium centromeres occurs in late G2 phase. Transformation of Dictyostelium with vectors carrying the G418 resistance cassette typically results in the vector integrating into the genome in one or a few tandem arrays of approximately a hundred copies. In contrast, plasmids containing a Blasticidin resistance cassette integrate as single or a few copies. The behaviour of transgenes in the nucleus was examined by FISH, and it was found that low copy transgenes show apparently random distribution within the nucleus, while transgenes with more than approximately 10 copies cluster at or immediately adjacent to the centromeres in interphase cells regardless of the actual integration site along the chromosome. During mitosis the transgenes show centromere-like behaviour, and ChIP experiments show that transgenes contain the heterochromatin marker H3K9me2 and the centromeric histone variant H3v1. This clustering, and centromere-like behaviour was not observed on extrachromosomal transgenes, nor on a line where the transgene had integrated into the extrachromosomal rDNA palindrome. This suggests that it is the repetitive nature of the transgenes that causes the centromere-like behaviour. A Dictyostelium homolog of DET1, a protein largely restricted to multicellular eukaryotes where it has a role in developmental regulation was identified. As in other species Dictyostelium DET1 is nuclear localised. In ChIP experiments DET1 was found to bind the promoters of a number of developmentally regulated loci. In contrast to other species where it is an essential protein, loss of DET1 is not lethal in Dictyostelium, although viability is greatly reduced. Loss of DET1 results in delayed and abnormal development with enlarged aggregation territories. Mutant slugs displayed apparent cell type patterning with a bias towards pre-stalk cell types.
Resumo:
In this work a new method for clustering and building a topographic representation of a bacteria taxonomy is presented. The method is based on the analysis of stable parts of the genome, the so-called “housekeeping genes”. The proposed method generates topographic maps of the bacteria taxonomy, where relations among different type strains can be visually inspected and verified. Two well known DNA alignement algorithms are applied to the genomic sequences. Topographic maps are optimized to represent the similarity among the sequences according to their evolutionary distances. The experimental analysis is carried out on 147 type strains of the Gammaprotebacteria class by means of the 16S rRNA housekeeping gene. Complete sequences of the gene have been retrieved from the NCBI public database. In the experimental tests the maps show clusters of homologous type strains and present some singular cases potentially due to incorrect classification or erroneous annotations in the database.
A genetic linkage map of microsatellite, gene-specific and morphological markers in diploid Fragaria
Resumo:
Diploid Fragaria provide a potential model for genomic studies in the Rosaceae. To develop a genetic linkage map of diploid Fragaria, we scored 78 markers (68 microsatellites, one sequence-characterised amplified region, six gene-specific markers and three morphological traits) in an interspecific F2 population of 94 plants generated from a cross of F.vesca f. semperflorens × F. nubicola. Co-segregation analysis arranged 76 markers into seven discrete linkage groups covering 448 cM, with linkage group sizes ranging from 100.3 cM to 22.9 cM. Marker coverage was generally good; however some clustering of markers was observed on six of the seven linkage groups. Segregation distortion was observed at a high proportion of loci (54%), which could reflect the interspecific nature of the progeny and, in some cases, the self-incompatibility of F. nubicola. Such distortion may also account for some of the marker clustering observed in the map. One of the morphological markers, pale-green leaf (pg) has not previously been mapped in Fragaria and was located to the mid-point of linkage group VI. The transferable nature of the markers used in this study means that the map will be ideal for use as a framework for additional marker incorporation aimed at enhancing and resolving map coverage of the diploid Fragaria genome. The map also provides a sound basis for linkage map transfer to the cultivated octoploid strawberry.
Resumo:
Background Oocytes mature in ovarian follicles surrounded by granulosa cells. During follicle growth, granulosa cells replicate and secrete hormones, particularly steroids close to ovulation. However, most follicles cease growing and undergo atresia or regression instead of ovulating. To investigate the effects of stimulatory (follicle-stimulating hormone; FSH) and inhibitory (tumour necrosis factor alpha; TNFα) factors on the granulosa cell transcriptome, bovine ovaries were obtained from a local abattoir and pools of granulosa cells were cultured in vitro for six days under defined serum-free conditions with treatments present on days 3–6. Initially dose–response experiments (n = 4) were performed to determine the optimal concentrations of FSH (0.33 ng/ml) and TNFα (10 ng/ml) to be used for the microarray experiments. For array experiments cells were cultured under control conditions, with FSH, with TNFα, or with FSH plus TNFα (n = 4 per group) and RNA was harvested for microarray analyses. Results Statistical analysis showed primary clustering of the arrays into two groups, control/FSH and TNFα/TNFα plus FSH. The effect of TNFα on gene expression dominated that of FSH, with substantially more genes differentially regulated, and the pathways and genes regulated by TNFα being similar to those of FSH plus TNFα treatment. TNFα treatment reduced the endocrine activity of granulosa cells with reductions in expression of FST, INHA, INBA and AMH. The top-ranked canonical pathways and GO biological terms for the TNFα treatments included antigen presentation, inflammatory response and other pathways indicative of innate immune function and fibrosis. The two most significant networks also reflect this, containing molecules which are present in the canonical pathways of hepatic fibrosis/hepatic stellate cell activation and transforming growth factor β signalling, and these were up regulated. Upstream regulator analyses also predicted TNF, interferons γ and β1 and interleukin 1β. Conclusions In vitro, the transcriptome of granulosa cells responded minimally to FSH compared with the response to TNFα. The response to TNFα indicated an active process akin to tissue remodelling as would occur upon atresia. Additionally there was reduction in endocrine function and induction of an inflammatory response to TNFα that displays features similar to immune cells.
Resumo:
Abstract Background: The amount and structure of genetic diversity in dessert apple germplasm conserved at a European level is mostly unknown, since all diversity studies conducted in Europe until now have been performed on regional or national collections. Here, we applied a common set of 16 SSR markers to genotype more than 2,400 accessions across 14 collections representing three broad European geographic regions (North+East, West and South) with the aim to analyze the extent, distribution and structure of variation in the apple genetic resources in Europe. Results: A Bayesian model-based clustering approach showed that diversity was organized in three groups, although these were only moderately differentiated (FST=0.031). A nested Bayesian clustering approach allowed identification of subgroups which revealed internal patterns of substructure within the groups, allowing a finer delineation of the variation into eight subgroups (FST=0.044). The first level of stratification revealed an asymmetric division of the germplasm among the three groups, and a clear association was found with the geographical regions of origin of the cultivars. The substructure revealed clear partitioning of genetic groups among countries, but also interesting associations between subgroups and breeding purposes of recent cultivars or particular usage such as cider production. Additional parentage analyses allowed us to identify both putative parents of more than 40 old and/or local cultivars giving interesting insights in the pedigree of some emblematic cultivars. Conclusions: The variation found at group and sub-group levels may reflect a combination of historical processes of migration/selection and adaptive factors to diverse agricultural environments that, together with genetic drift, have resulted in extensive genetic variation but limited population structure. The European dessert apple germplasm represents an important source of genetic diversity with a strong historical and patrimonial value. The present work thus constitutes a decisive step in the field of conservation genetics. Moreover, the obtained data can be used for defining a European apple core collection useful for further identification of genomic regions associated with commercially important horticultural traits in apple through genome-wide association studies.
Resumo:
Mesenchymal stem cells (MSC) are multipotent cells which can be obtained from several adult and fetal tissues including human umbilical cord units. We have recently shown that umbilical cord tissue (UC) is richer in MSC than umbilical cord blood (UCB) but their origin and characteristics in blood as compared to the cord remains unknown. Here we compared, for the first time, the exonic protein-coding and intronic noncoding RNA (ncRNA) expression profiles of MSC from match-paired UC and UCB samples, harvested from the same donors, processed simultaneously and under the same culture conditions. The patterns of intronic ncRNA expression in MSC from UC and UCB paired units were highly similar, indicative of their common donor origin. The respective exonic protein-coding transcript expression profiles, however, were significantly different. Hierarchical clustering based on protein-coding expression similarities grouped MSC according to their tissue location rather than original donor. Genes related to systems development, osteogenesis and immune system were expressed at higher levels in UCB, whereas genes related to cell adhesion, morphogenesis, secretion, angiogenesis and neurogenesis were more expressed in UC cells. These molecular differences verified in tissue-specific MSC gene expression may reflect functional activities influenced by distinct niches and should be considered when developing clinical protocols involving MSC from different sources. In addition, these findings reinforce our previous suggestion on the importance of banking the whole umbilical cord unit for research or future therapeutic use.
Resumo:
Trypanosoma (Megatrypanum) theileri from cattle and trypanosomes of other artiodactyls form a clade of closely related species in analyses using ribosomal sequences. Analysis of polymorphic sequences of a larger number of trypanosomes from broader geographical origins is required to evaluate the Clustering of isolates as suggested by previous studies. Here, we determined the sequences of the spliced leader (SL) genes of 21 isolates from cattle and 2 from water buffalo from distant regions of Brazil. Analysis of SL gene repeats revealed that the 5S rRNA gene is inserted within the intergenic region. Phylogeographical patterns inferred using SL sequences showed at least 5 major genotypes of T. theileri distributed in 2 strongly divergent lineages. Lineage TthI comprises genotypes IA and IB from buffalo and cattle, respectively, from the Southeast and Central regions, whereas genotype IC is restricted to cattle from the Southern region. Lineage Tth II includes cattle genotypes IIA, which is restricted to the North and Northeast, and IIB, found in the Centre, West, North and Northeast. PCR-RFLP of SL genes revealed valuable markers for genotyping T. theileri. The results of this study emphasize the genetic complexity and corroborate the geographical structuring of T. theileri genotypes found in cattle.
Resumo:
This paper is concerned with the computational efficiency of fuzzy clustering algorithms when the data set to be clustered is described by a proximity matrix only (relational data) and the number of clusters must be automatically estimated from such data. A fuzzy variant of an evolutionary algorithm for relational clustering is derived and compared against two systematic (pseudo-exhaustive) approaches that can also be used to automatically estimate the number of fuzzy clusters in relational data. An extensive collection of experiments involving 18 artificial and two real data sets is reported and analyzed. (C) 2011 Elsevier B.V. All rights reserved.
Resumo:
Clustering is a difficult task: there is no single cluster definition and the data can have more than one underlying structure. Pareto-based multi-objective genetic algorithms (e.g., MOCK Multi-Objective Clustering with automatic K-determination and MOCLE-Multi-Objective Clustering Ensemble) were proposed to tackle these problems. However, the output of such algorithms can often contains a high number of partitions, becoming difficult for an expert to manually analyze all of them. In order to deal with this problem, we present two selection strategies, which are based on the corrected Rand, to choose a subset of solutions. To test them, they are applied to the set of solutions produced by MOCK and MOCLE in the context of several datasets. The study was also extended to select a reduced set of partitions from the initial population of MOCLE. These analysis show that both versions of selection strategy proposed are very effective. They can significantly reduce the number of solutions and, at the same time, keep the quality and the diversity of the partitions in the original set of solutions. (C) 2010 Elsevier B.V. All rights reserved.
Resumo:
A large amount of biological data has been produced in the last years. Important knowledge can be extracted from these data by the use of data analysis techniques. Clustering plays an important role in data analysis, by organizing similar objects from a dataset into meaningful groups. Several clustering algorithms have been proposed in the literature. However, each algorithm has its bias, being more adequate for particular datasets. This paper presents a mathematical formulation to support the creation of consistent clusters for biological data. Moreover. it shows a clustering algorithm to solve this formulation that uses GRASP (Greedy Randomized Adaptive Search Procedure). We compared the proposed algorithm with three known other algorithms. The proposed algorithm presented the best clustering results confirmed statistically. (C) 2009 Elsevier Ltd. All rights reserved.