18 results for Molecular Sequence Data
in Helda - Digital Repository of University of Helsinki
Abstract:
Evolutionary genetics incorporates traditional population genetics and studies of the origins of genetic variation by mutation and recombination, and the molecular evolution of genomes. Among the primary forces that have the potential to affect genetic variation within and among populations, including those that may lead to adaptation and speciation, are genetic drift, gene flow, mutation and natural selection. The main challenges in understanding the genetic basis of evolutionary change are to distinguish the adaptive selective forces that give rise to existing DNA sequence variants and to identify the nucleotide differences responsible for the observed phenotypic variation. To understand the effects of these forces, the interpretation of gene sequence variation has been the principal basis of many evolutionary genetic studies. The main aim of this thesis was to assess different forms of teleost gene sequence polymorphisms in evolutionary genetic studies of Atlantic salmon (Salmo salar) and other species. Firstly, the extent to which Darwinian adaptive evolution affected coding regions of the growth hormone (GH) gene during teleost evolution was investigated based on the sequence data existing in public databases. Secondly, a target gene approach was used to identify within-population variation in the growth hormone 1 (GH1) gene in salmon. Then, a new strategy for single nucleotide polymorphism (SNP) discovery in salmonid fishes was introduced, and, finally, the usefulness of a limited number of SNP markers as molecular tools in several applications of population genetics in Atlantic salmon was assessed. This thesis showed that the gene sequences in databases can be utilized to perform comparative studies of molecular evolution, and some putative evidence of the existence of Darwinian selection during teleost GH evolution was presented.
In addition, existing sequence data were exploited to investigate GH1 gene variation within Atlantic salmon populations throughout the species' range. Purifying selection is suggested to be the predominant evolutionary force controlling the genetic variation of this gene in salmon, and some support for gene flow between continents was also observed. The novel approach to SNP discovery in species with duplicated genome fragments introduced here proved to be an effective method, and this may have several applications in evolutionary genetics with different species - e.g. when developing gene-targeted markers to investigate quantitative genetic variation. The thesis also demonstrated that only a few SNPs produced signals highly similar to those of microsatellite markers in some of the population genetic analyses. This may have useful applications when estimating genetic diversity in genes having a potential role in ecological and conservation issues, or when using hard biological samples in genetic studies, as SNPs can be applied to relatively highly degraded DNA.
Abstract:
This study addressed the large-scale molecular zoogeography of two brackish water bivalve molluscs, Macoma balthica and Cerastoderma glaucum, and the genetic signatures of their postglacial colonization of Northern Europe. The traditional view holds that M. balthica in the Baltic, White and Barents seas (i.e. marginal seas) represent direct postglacial descendants of the adjacent Northeast Atlantic populations, but this has recently been challenged by observations of close genetic affinities between these marginal populations and those of the Northeast Pacific. The primary aim of the thesis was to verify, quantify and characterize the Pacific genetic contribution across North European populations of M. balthica and to resolve the phylogeographic histories of the two bivalve taxa in range-wide studies using information from mitochondrial DNA (mtDNA) and nuclear allozyme polymorphisms. The presence of recent Pacific genetic influence in M. balthica of the Baltic, White and Barents seas, along with an Atlantic element, was confirmed by mtDNA sequence data. On a broader temporal and geographical scale, altogether four independent trans-Arctic invasions of Macoma from the Pacific since the Miocene seem to have been involved in generating the current North Atlantic lineage diversity. The latest trans-Arctic invasion that affected the current Baltic, White and Barents Sea populations probably took place in the early post-glacial. The nuclear genetic compositions of these marginal sea populations are intermediate between those of pure Pacific and Atlantic subspecies. In the marginal sea populations of mixed ancestry (Barents, White and Northern Baltic seas), the Pacific and Atlantic components are now randomly associated in the genomes of individual clams, which indicates both pervasive historical interbreeding between the previously long-isolated lineages (subspecies), and current isolation of these populations from the adjacent pure Atlantic populations.
These mixed populations can be characterized as self-supporting hybrid swarms, and they arguably represent the most extensive marine animal hybrid swarms so far documented. Each of the three swarms still has a distinct genetic composition, and the relative Pacific contributions vary from 30 to 90 % in local populations. This diversity highlights the potential of introgressive hybridization to rapidly give rise to new evolutionarily and ecologically significant units in the marine realm. South of the Danish straits and in the Southern Baltic Sea, a broad genetic transition zone links the pure North Sea subspecies M. balthica rubra to the inner Baltic hybrid swarm, which has about 60 % of Pacific contribution in its genome. This transition zone has no regular smooth clinal structure, but its populations show strong genotypic disequilibria typical of a hybrid zone maintained by the interplay of selection and gene flow by dispersing pelagic larvae. The structure of the genetic transition is partly in line with features of Baltic water circulation and salinity stratification, with greater penetration of Atlantic genes on the Baltic south coast and in deeper water populations. In all, the scenarios of historical isolation and secondary contact that arise from the phylogeographic studies of both Macoma and Cerastoderma shed light on the more general but enigmatic patterns seen in marine phylogeography, where deep genetic breaks are often seen in species with high dispersal potential.
Abstract:
Phylogenetic analyses of the Hypnales usually show the same picture of poorly resolved trees with a large number of polyphyletic taxa and low support for the few reconstructed clades. One odd clade, however, consisting of three genera that are currently treated either within the Leskeaceae (Miyabea) or Neckeraceae (Homaliadelphus and Bissetia), was retrieved in a previously published phylogeny based on chloroplast rbcL. In order to assess the reliability of the observed Homaliadelphus-Miyabea-Bissetia clade (HMB-clade) and to reveal its phylogenetic relationships, a molecular study based on a representative set of hypnalean taxa was performed. Sequence data from all three genomes, namely the ITS1 and 2 (nuclear), the trnS-rps4-trnT-trnL-trnF cluster (plastid) and the nad5 intron (mitochondrial), were analyzed. Although the phylogenetic reconstruction of the combined data set was not fully resolved regarding the backbone, it clearly indicated the polyphyletic nature of various hypnalean families, such as the Leskeaceae, Hypnaceae, Hylocomiaceae, Neckeraceae, Leptodontaceae and Anomodontaceae, with respect to the included taxa. In addition, the results favor the inclusion of the Leptodontaceae and Thamnobryaceae in the Neckeraceae. The maximally supported HMB-clade consisting of the three genera Homaliadelphus (2-3 species), Miyabea (3 species) and Bissetia (1 species) is resolved as sister to a so far unnamed clade comprising Taxiphyllum aomoriense, Glossadelphus ogatae and Leptopterigynandrum. The well-resolved and supported HMB-clade, here formally described as the Miyabeaceae, fam. nov., is additionally supported by morphological characters such as strongly incrassate, porose leaf cells, a relatively weak and diffuse costa and the presence of dwarf males. The latter are absent in the Neckeraceae and the Leskeaceae. It is essentially an East Asian family, with one species occurring in North America.
Abstract:
Earlier phylogenetic studies, including species belonging to the Neckeraceae, have indicated that this pleurocarpous moss family shares a strongly supported sister group relationship with the Lembophyllaceae, but the family delimitation of the former needs adjustment. To test the monophyly of the Neckeraceae, as well as to redefine the family circumscription and to pinpoint its phylogenetic position in a larger context, a phylogenetic study based on molecular data was carried out. Sequence data were compiled, combining data from all three genomes: nuclear ITS1 and 2, plastid trnS-rps4-trnT-trnL-trnF and rpl16, and mitochondrial nad5 intron. The Neckeraceae have sometimes been divided into two families, Neckeraceae and Thamnobryaceae, a division rejected here. Both parsimony and Bayesian analyses of molecular data revealed that the family concept of the Neckeraceae needs several further adjustments, such as the exclusion of some individual species and smaller genera as well as the inclusion of the Leptodontaceae. Within the family three well-supported clades (A, B and C) can be distinguished. Members of clade A are mainly non-Asiatic and nontropical. Most species have a weak costa and immersed capsules with reduced peristomes (mainly Neckera spp.) and the teeth at the leaf margins are usually unicellular. Clade B members are also mainly non-Asiatic. They are typically fairly robust and distinctly stipitate, with a single, at least relatively strong costa, long setae (capsules exserted), and peristomes that are well developed or only somewhat reduced. Members of clade C are essentially Asiatic and tropical. The species of this clade usually have a strong costa and a long seta, the seta often being mammillose in its upper part. The peristome types in this clade are mixed, since both reduced and unreduced types are found. Several neckeraceous genera that were recognised on a morphological basis are polyphyletic (e.g. Neckera, Homalia, Thamnobryum, Porotrichum).
Ancestral state reconstructions revealed that currently used diagnostic traits, such as the leaf asymmetry and costa strength are highly homoplastic. Similarly, the reconstructions revealed that the 'reduced' sporophyte features have evolved independently in each of the three clades.
Abstract:
This dissertation is focused on the taxonomy, phylogeny, and ecology of the vagrant, erratic and allied terricolous and saxicolous species of the genera Aspicilia A. Massal. and Circinaria Link (Megasporaceae), particularly those traditionally referred to as manna lichens. The group has previously been defined on the basis of few morphological characters. The phylogeny of the family Megasporaceae is inferred from the combined dataset of nuLSU and mtSSU sequences. Five genera, Aspicilia, Circinaria, Lobothallia, Megaspora, and Sagedia, are recognized. Lobothallia is sister to the four other genera, while Aspicilia and Sagedia form the next clade. All these genera have small asci with eight spores. Circinaria is a sister genus of Megaspora, and these two have in common asci with (1–4) 6–8 large spores. Circinaria forms a monophyletic group and sphaerothallioid species form a monophyletic group within Circinaria. The presence of certain morphological characters such as pseudocyphellae, thickness of cortex and medulla layers, as well as ecological differences in sphaerothallioid species distinguish them from some other crustose species, especially those containing aspicilin and characterised by thin cortex and medulla layers, conidium length c. 6–12 µm and absence of pseudocyphellae. If sphaerothallioid species are accepted as a distinct genus, the rest of the Circinaria species would remain as a paraphyletic assemblage. Currently, the genus Circinaria includes all the sphaerothallioid species and its generic position is confirmed and accepted. Thus, it is proposed as the correct generic name also for the manna lichens described originally in other genera. Phylogeny at the species level was studied using nrITS sequence data. Traditionally, morphological characters have been used for the recognition of species. They were re-evaluated in the light of molecular data.
Since characters such as vagrant, erratic and crustose growth forms proved to be misleading for the recognition of some species, a combination of several characters (including molecular data) is recommended. Vagrant growth form seems to have evolved several times among the distantly related lineages and even within a single population. The reasons behind the high plasticity in the external morphology of the sphaerothallioid Circinaria remain, however, unknown. Six new species are recognized: Aspicilia tibetica, Circinaria arida, C. digitata nom. provis., C. gyrosa nom. provis., C. rogeri nom. provis., and C. rostamii nom. provis. Based on an analysis of the nrITS dataset, three new erratic, vagrant and crustose species were also recognized, but these require additional study. The results also reveal that C. elmorei and C. hispida are not monophyletic as currently understood. In addition, 13 new combinations in the genus Circinaria are proposed.
Abstract:
Microbes in natural and artificial environments as well as in the human body are a key part of the functional properties of these complex systems. The presence or absence of certain microbial taxa correlates with functional status, such as the risk of disease or the course of metabolic processes in a microbial community. As microbes are highly diverse and mostly not cultivable, molecular markers like gene sequences are a potential basis for detection and identification of key types. The goal of this thesis was to study molecular methods for identification of microbial DNA in order to develop a tool for analysis of environmental and clinical DNA samples. Particular emphasis was placed on specificity of detection, which is a major challenge when analyzing complex microbial communities. The approach taken in this study was the application and optimization of enzymatic ligation of DNA probes coupled with microarray read-out for high-throughput microbial profiling. The results show that fungal phylotypes and human papillomavirus genotypes could be accurately identified from pools of PCR amplicons generated from purified sample DNA. Approximately 1 ng/μl of sample DNA was needed for representative PCR amplification as measured by comparisons between clone sequencing and microarray. A minimum of 0.25 amol/μl of PCR amplicons was detectable from amongst 5 ng/μl of background DNA, suggesting that the detection limit of the test, consisting of a ligation reaction followed by microarray read-out, was approximately 0.04%. Detection from sample DNA directly was shown to be feasible with probes forming a circular molecule upon ligation followed by PCR amplification of the probe. In this approach, the minimum detectable relative amount of target genome was found to be 1% of all genomes in the sample as estimated from 454 deep sequencing results.
The signal-to-noise ratio of contact printed microarrays could be improved by using an internal microarray hybridization control oligonucleotide probe together with a computational algorithm. The algorithm was based on identification of a bias in the microarray data and correction of the bias as shown by simulated and real data. The results further suggest semiquantitative detection to be possible by ligation detection, allowing estimation of target abundance in a sample. However, in practice, comprehensive sequence information of full-length rRNA genes is needed to support probe design with complex samples. This study shows that DNA microarrays have the potential to serve as an accurate microbial diagnostic platform that takes advantage of increasing sequence data and replaces traditional, less efficient methods that still dominate routine testing in laboratories. The data suggest that a ligation reaction based microarray assay can be optimized to a degree that allows good signal-to-noise and semiquantitative detection.
Abstract:
This thesis, which consists of an introduction and four peer-reviewed original publications, studies the problems of haplotype inference (haplotyping) and local alignment significance. The problems studied here belong to the broad area of bioinformatics and computational biology. The presented solutions are computationally fast and accurate, which makes them practical in high-throughput sequence data analysis. Haplotype inference is a computational problem where the goal is to estimate haplotypes from a sample of genotypes as accurately as possible. This problem is important as the direct measurement of haplotypes is difficult, whereas the genotypes are easier to quantify. Haplotypes are the key players when studying, for example, the genetic causes of diseases. In this thesis, three methods are presented for the haplotype inference problem, referred to as HaploParser, HIT, and BACH. HaploParser is based on a combinatorial mosaic model and hierarchical parsing that together mimic recombinations and point-mutations in a biologically plausible way. In this mosaic model, the current population is assumed to have evolved from a small founder population. Thus, the haplotypes of the current population are recombinations of the (implicit) founder haplotypes with some point-mutations. HIT (Haplotype Inference Technique) uses a hidden Markov model for haplotypes and efficient algorithms are presented to learn this model from genotype data. The model structure of HIT is analogous to the mosaic model of HaploParser with founder haplotypes. Therefore, it can be seen as a probabilistic model of recombinations and point-mutations. BACH (Bayesian Context-based Haplotyping) utilizes a context tree weighting algorithm to efficiently sum over all variable-length Markov chains to evaluate the posterior probability of a haplotype configuration. Algorithms are presented that find haplotype configurations with high posterior probability.
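To make the haplotype inference problem stated above concrete, the following minimal Python sketch enumerates every haplotype pair that is consistent with a single genotype. It is not HaploParser, HIT or BACH. The 0/1/2 genotype coding (count of the alternative allele per site) and the function name `compatible_haplotype_pairs` are assumptions introduced here purely for illustration.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with one genotype.

    genotype: sequence of 0, 1 or 2, the count of the alternative allele at
    each site (assumed coding: 0 and 2 are homozygous, 1 is heterozygous).
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous sites are fixed on both haplotypes (0 -> 0, 2 -> 1);
    # heterozygous sites get a placeholder 0 that is overwritten below.
    base = [g // 2 for g in genotype]
    if not het_sites:
        return [(tuple(base), tuple(base))]

    pairs = []
    # Pin the first heterozygous site to allele 0 on haplotype A so that
    # (A, B) and (B, A) are not both reported.
    first, rest = het_sites[0], het_sites[1:]
    for bits in product((0, 1), repeat=len(rest)):
        hap_a, hap_b = list(base), list(base)
        hap_a[first], hap_b[first] = 0, 1
        for site, bit in zip(rest, bits):
            hap_a[site], hap_b[site] = bit, 1 - bit
        pairs.append((tuple(hap_a), tuple(hap_b)))
    return pairs


if __name__ == "__main__":
    for hap_a, hap_b in compatible_haplotype_pairs([1, 2, 1, 0]):
        print(hap_a, hap_b)
```

A genotype with k heterozygous sites has 2^(k-1) candidate pairs, so the genotype alone does not determine the haplotypes; a population model is needed to rank the candidates.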
BACH is the most accurate method presented in this thesis and has comparable performance to the best available software for haplotype inference. Local alignment significance is a computational problem where one is interested in whether the local similarities in two sequences are due to the fact that the sequences are related or just due to chance. Similarity of sequences is measured by their best local alignment score and from that, a p-value is computed. This p-value is the probability of picking two sequences from the null model whose best local alignment score is at least as good. Local alignment significance is used routinely, for example, in homology searches. In this thesis, a general framework is sketched that allows one to compute a tight upper bound for the p-value of a local pairwise alignment score. Unlike the previous methods, the presented framework is not affected by so-called edge-effects and can handle gaps (deletions and insertions) without troublesome sampling and curve fitting.
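The p-value defined above can be illustrated with a naive Monte Carlo estimate. Note that this is the kind of sampling approach the thesis says its framework avoids, so the sketch below only illustrates the quantity being bounded, not the thesis method. The scoring parameters, the uniform i.i.d. null model over `ACGT`, and both function names are assumptions made for this example.

```python
import random


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty (assumed scores)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def empirical_p_value(a, b, n_samples=200, alphabet="ACGT", seed=0):
    """Estimate P(null score >= observed score) by sampling random pairs.

    Null model (an assumption): letters drawn independently and uniformly
    from `alphabet`, with the same lengths as the observed sequences.
    """
    rng = random.Random(seed)
    observed = smith_waterman_score(a, b)
    hits = 0
    for _ in range(n_samples):
        rand_a = "".join(rng.choice(alphabet) for _ in a)
        rand_b = "".join(rng.choice(alphabet) for _ in b)
        if smith_waterman_score(rand_a, rand_b) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_samples + 1)


if __name__ == "__main__":
    seq = "ACGTTGCAAGCTTAGGCTAACGTTAGCATG"
    print(empirical_p_value(seq, seq))
```

The estimate cannot fall below 1 / (n_samples + 1), which is one reason sampling is impractical for the very small p-values that matter in homology searches.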
Abstract:
Phylogenetic studies of cyanobacterial lichens
Lichens are symbiotic assemblages between fungi (mycobiont) and green algae (phycobiont) and/or cyanobacteria (cyanobiont). Fossil records show that lichen-like symbioses occurred already 600 million years ago. Lichen symbiosis has since then become an important life strategy for the Fungi, particularly for species in the phylum Ascomycota, as approximately 98% of the lichenized fungal species are ascomycetes. The taxonomy of lichen associations is based on the mycobiont. We reconstructed, using DNA sequence data, hypotheses of phylogenetic relationships of lichen-forming fungi that include species associated with cyanobacteria. These hypotheses of phylogeny should form the basis for the taxonomy. They also allowed studies of the origin and the evolution of specific symbioses. Genetic diversity and phylogenetic relationships of symbiotic cyanobionts were also studied in order to examine selectivity of cyanobionts and mycobionts as well as possible co-evolution between partners involved in lichen associations. The suggested circumscription of the family Stereocaulaceae to include Stereocaulon and Lepraria is supported. The recently described crustose Stereocaulon species seem to be correctly placed in the genus, although Stereocaulon traditionally included only fruticose species. The monospecific crustose genus Muhria is also shown to be best placed in Stereocaulon. The family Lobariaceae as currently delimited is monophyletic. Within Lobariaceae, the genus Sticta, including Dendriscocaulon dendroides, forms a monophyletic group, while the genera Lobaria and Pseudocyphellaria are non-monophyletic. A new classification of Lobariaceae is clearly needed. Further studies are, however, required before a final proposal for a new classification can be made. Our results show that the cyanobacterial symbiotic state has been gained repeatedly in the Ascomycota while losses of symbiotic cyanobacteria appear to be rare.
The symbiosis with green algae is confirmed to have been gained repeatedly in Ascomycota but also repeatedly lost. Cyanobacterial symbioses therefore seem to be more stable than green algal associations. Cyanobacteria are perhaps more beneficial for the lichen fungi and therefore maintained. The results indicate a dynamic association of the lichen symbiosis. This evolutionary instability will perhaps be important for the lichen fungi as the utilization of options will perhaps enable lichens to colonize new substrates and survive environmental changes. Some cyanobacterial lichen genera seem to be highly selective towards the cyanobiont while others form symbioses with a broad spectrum of cyanobacteria. No evidence of co-evolution between fungi and cyanobacteria in cyanolichens could be demonstrated.
Abstract:
Lead contamination in the environment is of particular concern, as it is a known toxin. Until recently, however, much less attention has been given to the local contamination caused by activities at shooting ranges compared to large-scale industrial contamination. In Finland, more than 500 tons of Pb is produced each year for shotgun ammunition. The contaminant threatens various organisms, ground water and the health of human populations. However, the forest at shooting ranges usually shows no visible sign of stress compared to nearby clean environments. The aboveground biota normally reflects the belowground ecosystem. Thus, the soil microbial communities appear to bear strong resistance to contamination, despite the influence of lead. The studies forming this thesis investigated a shooting range site at Hälvälä in Southern Finland, which is heavily contaminated by lead pellets. Previously it was experimentally shown that the growth of grasses and degradation of litter are retarded. Measurements of acute toxicity of the contaminated soil or soil extracts gave conflicting results, as enchytraeid worms used as toxicity reporters were strongly affected, while reporter bacteria showed no or very minor decreases in viability. Measurements using sensitive inducible luminescent reporter bacteria suggested that the bioavailability of lead in the soil is indeed low, and this notion was supported by the very low water extractability of the lead. Nevertheless, the frequency of lead-resistant cultivable bacteria was elevated based on the isolation of cultivable strains. The bacterial and fungal diversity in heavily lead contaminated shooting sectors were compared with those of pristine sections of the shooting range area. The bacterial 16S rRNA gene and fungal ITS rRNA gene were amplified, cloned and sequenced using total DNA extracted from the soil humus layer as the template. 
Altogether, 917 sequenced bacterial clones and 649 sequenced fungal clones revealed a high soil microbial diversity. No effect of lead contamination was found on bacterial richness or diversity, while fungal richness and diversity significantly differed between lead contaminated and clean control areas. However, even in the case of fungi, genera that were deemed sensitive were not totally absent from the contaminated area: only their relative frequency was significantly reduced. Some operational taxonomic units (OTUs) assigned to Basidiomycota were clearly affected, and were much rarer in the lead contaminated areas. The studies of this thesis surveyed EcM sporocarps, analyzed morphotyped EcM root tips by direct sequencing, and 454-pyrosequenced fungal communities in in-growth bags. A total of 32 EcM fungi that formed conspicuous sporocarps, 27 EcM fungal OTUs from 294 root tips, and 116 EcM fungal OTUs from a total of 8 194 ITS2 454 sequences were recorded. The ordination analyses by non-parametric multidimensional scaling (NMS) indicated that Pb enrichment induced a shift in the EcM community composition. This was visible as indicative trends in the sporocarp and root tip datasets, but explicitly clear in the communities observed in the in-growth bags. The compositional shift in the EcM community was mainly attributable to an increase in the frequencies of OTUs assigned to the genus Thelephora, and to a decrease in the OTUs assigned to Pseudotomentella, Suillus and Tylospora in Pb-contaminated areas when compared to the control. The enrichment of Thelephora in contaminated areas was also observed when examining the total fungal communities in soil using DNA cloning and sequencing technology. While the compositional shifts are clear, their functional consequences for the dominant trees or soil ecosystem remain undetermined. The results indicate that at the Hälvälä shooting range, lead influences the fungal communities but not the bacterial communities. 
The forest ecosystem shows apparent functional redundancy, since no significant effects were seen on forest trees. With recent 454 pyrosequencing, the number of sequences in a single analysis run can be up to one million. It has been applied in microbial ecology studies to characterize microbial communities. The handling of sequence data with traditional programs is becoming difficult and exceedingly time consuming, and novel tools are needed to handle the vast amounts of data being generated. The field of microbial ecology has recently benefited from the availability of a number of tools for describing and comparing microbial communities using robust statistical methods. However, although these programs provide methods for rapid calculation, it has become necessary to make them more amenable to larger datasets and numbers of samples from pyrosequencing. As part of this thesis, a new program, MuSSA (Multi-Sample Sequence Analyser), was developed to handle sequence data from novel high-throughput sequencing approaches in microbial community analyses. The greatest advantage of the program is that large volumes of sequence data can be manipulated, and general OTU series with a frequency value can be calculated among a large number of samples.
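The idea of an OTU series with a frequency value per sample can be sketched as a small table computation. This is not MuSSA's implementation, whose internals the abstract does not describe. The function name `otu_frequency_table`, the input layout (one list of OTU labels per sample) and the toy read counts are all assumptions made up for illustration; only the genus names are borrowed from the text above.

```python
from collections import Counter


def otu_frequency_table(sample_to_otus):
    """Return {otu: {sample: relative frequency}} for every OTU and sample.

    sample_to_otus maps a sample name to the list of OTU labels assigned to
    its reads (hypothetical input layout).
    """
    counts = {sample: Counter(otus) for sample, otus in sample_to_otus.items()}
    all_otus = sorted(set().union(*counts.values()))
    table = {}
    for otu in all_otus:
        table[otu] = {}
        for sample, counter in counts.items():
            total = sum(counter.values())
            # An empty sample contributes frequency 0.0 for every OTU.
            table[otu][sample] = counter[otu] / total if total else 0.0
    return table


if __name__ == "__main__":
    # Invented toy reads, not data from the thesis.
    reads = {
        "control": ["Suillus", "Suillus", "Thelephora", "Tylospora"],
        "lead": ["Thelephora", "Thelephora", "Thelephora", "Suillus"],
    }
    for otu, row in otu_frequency_table(reads).items():
        print(otu, row)
```

Using relative rather than absolute counts keeps samples with different sequencing depths comparable, which matters when many samples are analysed together.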
Abstract:
Understanding the overwhelming diversity of life calls for complex organisational schemes. The field of systematics may thus be seen as the cornerstone of evolutionary biology. In the last few decades, systematics has been rejuvenated through the introduction of molecular methods such as DNA barcoding and multi-gene phylogenetic approaches. These methods may shed new light on established taxonomic ideas and problems. For example, the classification of ants has aroused much debate due to reinterpretation of morphological characters or contradictions between molecular data and morphology. Only in the last few years a consensus was reached regarding the phylogeny of ant subfamilies. However, the situation remains deplorable for lower taxonomic ranks such as subfamilies, tribes and genera. This thesis describes the systematics and evolution of the Holarctic ant genus Myrmica and the tribe to which it belongs, Myrmicini. Using barcoding, molecular-phylogenetic data and divergence time estimations, it addresses questions regarding the taxonomy, morphology and biogeography of this group. Furthermore, the interrelationships between socially parasitic Myrmica species and their hosts (other species in the genus) were inferred. The phylogeny suggests that social parasitism evolved several times in Myrmica. Finally, this thesis investigated whether coevolution shaped the phylogeny of socially parasitic Maculinea butterflies that live inside Myrmica colonies. No evidence was found for coevolution.
Abstract:
Social behaviour affects dispersal of animals and is an important modifier of genetic population structures. The female sex is often philopatric, which maintains coancestry within the breeding groups and promotes cooperative behaviours. This enables also inclusive fitness returns from altruism and explains why some individuals sacrifice personal reproduction for the good of others in social insects such as ants. However, reduced dispersal and population substructuring at the level of colonies may also entail inbreeding, loss of genetic diversity, and vulnerability. In addition, the most vulnerable ants are species that are evolved to parasitize colonies of other ants, and which compromise between abilities to disperse and the efficiency to parasitize the host. On the other hand, certain social organisations of ant colonies may facilitate a species to disperse outside its natural range and become a pest. Altogether, knowledge on genetic structuring of ant populations, as well as the evolution of their life histories can contribute to conservation biology and population management. The aim of this thesis was to investigate population structures and phylogenetic evolution of the ant Plagiolepis pygmaea and its two obligatory, workerless social parasites (inquilines) P. xene and P. grassei with genetic markers and DNA sequence data. The results support the general assumption that populations of inquiline parasites are highly fragmented and genetically vulnerable. Comparison of the two parasites suggests that differences in their relative abundance may follow from their interaction with the host, i.e. how well the species is adapted to reproduce in the host colonies. The results also indicate that the most recent free living ancestor to these two parasite species is their common host. This is considered to provide evidence for the controversial issue of sympatric speciation. 
Further, given that the level of adaptation to a parasitic life history depends on the evolutionary time since the free-living ancestor, the results establish a link between species rarity and evolutionary age. The populations of the host species P. pygmaea displayed significantly reduced dispersal among both the females (queens) and males, and high levels of inbreeding, which may enhance worker altruism. In addition, the queens were found to mate with multiple males. Given the high relatedness between the queens and their mates, this probably occurs for non-genetic reasons, i.e. without the benefits associated with genetically more diverse offspring. The results hence caution that the contribution of non-genetic factors to the prevailing mating patterns and genetic population structures should not be underestimated.
Abstract:
Gene expression is one of the most critical factors influencing the phenotype of a cell. As a result of several technological advances, measuring gene expression levels has become one of the most common molecular biological measurements to study the behaviour of cells. The scientific community has produced an enormous and constantly increasing collection of gene expression data from various human cells, both from healthy and pathological conditions. However, while each of these studies is informative and enlightening in its own context and research setup, diverging methods and terminologies make it very challenging to integrate existing gene expression data into a more comprehensive view of human transcriptome function. On the other hand, bioinformatic science advances only through data integration and synthesis. The aim of this study was to develop biological and mathematical methods to overcome these challenges and to construct an integrated database of the human transcriptome as well as to demonstrate its usage. Methods developed in this study can be divided into two distinct parts. First, the biological and medical annotation of the existing gene expression measurements needed to be encoded by systematic vocabularies. There was no single existing biomedical ontology or vocabulary suitable for this purpose. Thus, new annotation terminology was developed as a part of this work. The second part was to develop mathematical methods for correcting the noise and systematic differences/errors in the data caused by various array generations. Additionally, there was a need to develop suitable computational methods for sample collection and archiving, unique sample identification, database structures, data retrieval and visualization. Bioinformatic methods were developed to analyze gene expression levels and putative functional associations of human genes by using the integrated gene expression data.
A method to interpret individual gene expression profiles across all the healthy and pathological tissues of the reference database was also developed. As a result of this work, 9783 human gene expression samples measured by Affymetrix microarrays were integrated to form a unique human transcriptome resource, GeneSapiens. This makes it possible to analyse the expression levels of 17330 genes across 175 types of healthy and pathological human tissues. Application of this resource to interpret individual gene expression measurements allowed identification of the tissue of origin with 92.0% accuracy among 44 healthy tissue types. A systematic analysis of the transcriptional activity levels of 459 kinase genes was performed across 44 healthy and 55 pathological tissue types, and a genome-wide analysis of kinase gene co-expression networks was carried out. This analysis revealed biologically and medically interesting data on putative kinase gene functions in health and disease. Finally, we developed a method for the alignment of gene expression profiles (AGEP) to analyse individual patient samples and pinpoint gene- and pathway-specific changes in the test sample in relation to the reference transcriptome database. We also showed how large-scale gene expression data resources can be used to quantitatively characterize changes in the transcriptomic program of differentiating stem cells. Taken together, these studies indicate the power of systematic bioinformatic analyses to infer biological and medical insights from existing published datasets as well as to facilitate the interpretation of new molecular profiling data from individual patients.
Abstract:
Advances in analysis techniques have led to a rapid accumulation of biological data in databases. Such data are often in the form of sequences of observations, examples including DNA sequences and the amino acid sequences of proteins. The scale and quality of the data promise to answer various biologically relevant questions in more detail than has been possible before. For example, one may wish to identify regions in an amino acid sequence that are important for the function of the corresponding protein, or investigate how characteristics at the level of the DNA sequence affect the adaptation of a bacterial species to its environment. Many of the interesting questions are intimately associated with the understanding of the evolutionary relationships among the items under consideration. The aim of this work is to develop novel statistical models and computational techniques to meet the challenge of deriving meaning from the increasing amounts of data. Our main focus is on modeling the evolutionary relationships based on the observed molecular data. We operate within a Bayesian statistical framework, which allows a probabilistic quantification of the uncertainties related to a particular solution. As the basis of our modeling approach we utilize a partition model, which is used to describe the structure of data by appropriately dividing the data items into clusters of related items. Generalizations and modifications of the partition model are developed and applied to various problems. Large-scale data sets also pose a computational challenge. The models used to describe the data must be realistic enough to capture the essential features of the current modeling task but, at the same time, simple enough to make it possible to carry out the inference in practice. The partition model fulfills these two requirements. The problem-specific features can be taken into account by modifying the prior probability distributions of the model parameters.
The computational efficiency stems from the ability to integrate out the parameters of the partition model analytically, which enables the use of efficient stochastic search algorithms.
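The idea of integrating out cluster parameters analytically can be illustrated with a minimal sketch. It is not the model of the thesis: it assumes DNA sequences of equal length, independent sites, a symmetric Dirichlet prior (the hypothetical parameter `alpha`) on the per-site nucleotide frequencies of each cluster, and a uniform prior over partitions. Under these assumptions the marginal likelihood of a partition has a closed form, so a simple stochastic search (here a hypothetical hill-climbing routine, `stochastic_search`) can score candidate partitions without sampling any parameters.

```python
import math
import random
from collections import Counter

ALPHABET = "ACGT"  # assumption: plain DNA sequences without gaps


def log_marginal_cluster(seqs, alpha=0.25):
    """Log marginal likelihood of one cluster of equal-length sequences.

    Per-site nucleotide frequencies are integrated out analytically
    under a symmetric Dirichlet(alpha) prior (Dirichlet-multinomial).
    """
    k = len(ALPHABET)
    total = 0.0
    for column in zip(*seqs):  # iterate over alignment columns
        counts = Counter(column)
        n = len(column)
        total += math.lgamma(k * alpha) - math.lgamma(k * alpha + n)
        for a in ALPHABET:
            total += math.lgamma(alpha + counts[a]) - math.lgamma(alpha)
    return total


def log_marginal_partition(data, partition, alpha=0.25):
    """Sum of cluster scores; clusters are lists of indices into data."""
    return sum(
        log_marginal_cluster([data[i] for i in cluster], alpha)
        for cluster in partition
    )


def stochastic_search(data, n_iter=2000, seed=0, alpha=0.25):
    """Hill-climbing over partitions by moving one item at a time.

    Starts from singletons and accepts a random move only if it
    increases the analytically computed marginal likelihood.
    """
    rng = random.Random(seed)
    labels = list(range(len(data)))  # one cluster per item

    def score(lbls):
        clusters = {}
        for i, lab in enumerate(lbls):
            clusters.setdefault(lab, []).append(i)
        return log_marginal_partition(data, list(clusters.values()), alpha)

    best = score(labels)
    for _ in range(n_iter):
        i = rng.randrange(len(data))
        new_label = rng.randrange(len(data))
        if new_label == labels[i]:
            continue
        proposal = labels[:]
        proposal[i] = new_label
        s = score(proposal)
        if s > best:
            labels, best = proposal, s
    return labels, best


if __name__ == "__main__":
    toy = ["AAAA", "AAAA", "AAAT", "CCCC", "CCCC", "CCCG"]
    labels, best = stochastic_search(toy)
    print(labels, best)
```

Because each candidate partition is scored with a handful of log-gamma evaluations, many proposals can be examined cheaply; a practical method would typically use richer move types and a non-uniform prior over partitions.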
Abstract:
Inorganic pyrophosphatases (PPases, EC 3.6.1.1) hydrolyse pyrophosphate in a reaction that provides the thermodynamic 'push' for many reactions in the cell, including DNA and protein synthesis. Soluble PPases can be classified into two families that differ completely in both sequence and structure. While Family I PPases are found in all kingdoms, Family II PPases occur only in certain prokaryotes. The enzyme from baker's yeast (Saccharomyces cerevisiae) is very well characterised both kinetically and structurally, but the exact mechanism has remained elusive. The enzyme uses divalent cations as cofactors; in vivo the metal is magnesium. Two metals are permanently bound to the enzyme, while two come with the substrate. The reaction cycle involves the activation of the nucleophilic oxygen and allows different pathways for product release. In this thesis I have solved the crystal structures of wild-type yeast PPase and seven active site variants in the presence of the native cofactor magnesium. These structures explain the effects of the mutations and have allowed me to describe each intermediate along the catalytic pathway with a structure. Although establishing the 'choreography' of the heavy atoms is an important step in understanding the mechanism, the hydrogen atoms are also crucial to it. The most unambiguous method to determine the positions of these hydrogen atoms is neutron crystallography. In order to determine the neutron structure of yeast PPase, I perdeuterated the enzyme and grew large crystals of it. Since the crystals were not stable at ambient temperature, a cooling device was developed to allow neutron data collection. In order to investigate the structural changes during the reaction in real time by time-resolved crystallography, a photolysable substrate precursor is needed. I synthesised a candidate molecule and characterised its photolysis kinetics, but unfortunately it is hydrolysed by both yeast and Thermotoga maritima PPases.
The mechanism of Family II PPases is subtly different from that of Family I. The native metal cofactor is manganese instead of magnesium, but the metal activation is more complex because the metal ions that arrive with the substrate are magnesium, and thus differ from those permanently bound to the enzyme. I determined the crystal structures of wild-type Bacillus subtilis PPase with the inhibitor imidodiphosphate and of an inactive H98Q variant with the substrate pyrophosphate. These structures revealed a new trimetal site that activates the nucleophile. Using anomalous X-ray scattering, I also determined that the metal ion sites were partially occupied by manganese and iron.