911 results for Molecular Sequence Data.
Abstract:
Cichlid fishes are famous for large, diverse and replicated adaptive radiations in the Great Lakes of East Africa. To understand the molecular mechanisms underlying cichlid phenotypic diversity, we sequenced the genomes and transcriptomes of five lineages of African cichlids: the Nile tilapia (Oreochromis niloticus), an ancestral lineage with low diversity; and four members of the East African lineage: Neolamprologus brichardi/pulcher (older radiation, Lake Tanganyika), Metriaclima zebra (recent radiation, Lake Malawi), Pundamilia nyererei (very recent radiation, Lake Victoria), and Astatotilapia burtoni (riverine species around Lake Tanganyika). We found an excess of gene duplications in the East African lineage compared to tilapia and other teleosts, an abundance of non-coding element divergence, accelerated coding sequence evolution, expression divergence associated with transposable element insertions, and regulation by novel microRNAs. In addition, we analysed sequence data from sixty individuals representing six closely related species from Lake Victoria, and show genome-wide diversifying selection on coding and regulatory variants, some of which were recruited from ancient polymorphisms. We conclude that a number of molecular mechanisms shaped East African cichlid genomes, and that amassing of standing variation during periods of relaxed purifying selection may have been important in facilitating subsequent evolutionary diversification.
Abstract:
BACKGROUND: A cost-effective strategy to increase the density of available markers within a population is to sequence a small proportion of the population and impute whole-genome sequence data for the remaining population. Increased densities of typed markers are advantageous for genome-wide association studies (GWAS) and genomic predictions. METHODS: We obtained genotypes for 54 602 single nucleotide polymorphisms (SNPs) in 1077 Franches-Montagnes (FM) horses and Illumina paired-end whole-genome sequencing data for 30 FM horses and 14 Warmblood horses. After variant calling, the sequence-derived SNP genotypes (~13 million SNPs) were used for genotype imputation with the software programs Beagle, Impute2 and FImpute. RESULTS: The mean imputation accuracy of FM horses using Impute2 was 92.0%. Imputation accuracy using Beagle and FImpute was 74.3% and 77.2%, respectively. In addition, for Impute2 we determined the imputation accuracy of all individual horses in the validation population, which ranged from 85.7% to 99.8%. The subsequent inclusion of Warmblood sequence data further increased the correlation between true and imputed genotypes for most horses, especially for horses with a high level of admixture. The final imputation accuracy of the horses ranged from 91.2% to 99.5%. CONCLUSIONS: Using Impute2, the imputation accuracy was higher than 91% for all horses in the validation population, which indicates that direct imputation of 50k SNP-chip data to sequence level genotypes is feasible in the FM population. The individual imputation accuracy depended mainly on the applied software and the level of admixture.
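The abstract above reports imputation accuracy and a correlation between true and imputed genotypes without giving the formulas. The following is a minimal sketch of two common ways such figures can be computed, assuming genotypes are coded as 0/1/2 allele counts. The encoding and the function names `concordance` and `pearson_r` are illustrative assumptions, not the authors' pipeline.

```python
from math import sqrt

def concordance(true_gt, imputed_gt):
    """Fraction of genotypes that were imputed exactly right."""
    if len(true_gt) != len(imputed_gt) or not true_gt:
        raise ValueError("genotype lists must be non-empty and of equal length")
    matches = sum(t == i for t, i in zip(true_gt, imputed_gt))
    return matches / len(true_gt)

def pearson_r(x, y):
    """Pearson correlation between true and imputed allele counts."""
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)

# Hypothetical genotypes for one horse at eight SNPs (0/1/2 = copies of the alternate allele).
true_gt = [0, 1, 2, 1, 0, 2, 1, 1]
imputed_gt = [0, 1, 2, 2, 0, 2, 1, 0]
print(concordance(true_gt, imputed_gt))
print(pearson_r(true_gt, imputed_gt))
```

Concordance treats every wrong call the same, while the correlation penalizes a 0-versus-2 error more than a 0-versus-1 error, so the two measures can rank horses differently.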
Abstract:
Phylogenetic reconstruction of the evolutionary history of closely related organisms may be difficult because of the presence of unsorted lineages and of a relatively high proportion of heterozygous sites that are usually not handled well by phylogenetic programs. Genomic data may provide enough fixed polymorphisms to resolve phylogenetic trees, but the diploid nature of sequence data remains analytically challenging. Here, we performed a phylogenomic reconstruction of the evolutionary history of the common vole (Microtus arvalis) with a focus on the influence of heterozygosity on the estimation of intraspecific divergence times. We used genome-wide sequence information from 15 voles distributed across the European range. We provide a novel approach to integrate heterozygous information in existing phylogenetic programs by repeated random haplotype sampling from sequences with multiple unphased heterozygous sites. We evaluated the impact of the use of full, partial, or no heterozygous information for tree reconstructions on divergence time estimates. All results consistently showed four deep and strongly supported evolutionary lineages in the vole data. These lineages undergoing divergence processes split only at the end or after the last glacial maximum based on calibration with radiocarbon-dated paleontological material. However, the incorporation of information from heterozygous sites had a significant impact on absolute and relative branch length estimations. Ignoring heterozygous information led to an overestimation of divergence times between the evolutionary lineages of M. arvalis. We conclude that the exclusion of heterozygous sites from evolutionary analyses may cause biased and misleading divergence time estimates in closely related taxa.
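The repeated random haplotype sampling described in the abstract above can be illustrated with a short sketch. It assumes that unphased heterozygous sites are written as two-base IUPAC ambiguity codes and that each replicate resolves every such site independently at random. The function names and the independence assumption are illustrative, and the authors' actual procedure may differ.

```python
import random

# Standard two-base IUPAC ambiguity codes for heterozygous sites.
IUPAC_HET = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}

def sample_haplotype(genotype_seq, rng):
    """Resolve each heterozygous site to one of its two alleles at random."""
    return "".join(
        rng.choice(IUPAC_HET[base]) if base in IUPAC_HET else base
        for base in genotype_seq
    )

def sample_replicates(genotype_seq, n_replicates, seed=0):
    """Draw n_replicates random haplotypes from one unphased diploid sequence."""
    rng = random.Random(seed)  # seeded so the replicates are reproducible
    return [sample_haplotype(genotype_seq, rng) for _ in range(n_replicates)]

# Hypothetical sequence with two heterozygous sites (R and Y).
for haplotype in sample_replicates("ACRTYG", 5, seed=1):
    print(haplotype)
```

Each replicate alignment can then be passed to a phylogenetic program that expects haploid sequences, and the spread of the resulting estimates reflects the uncertainty introduced by the unknown phase.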
Abstract:
Academic and industrial research in the late 1990s has brought about an explosion of DNA sequence data. Automated expert systems are being created to help biologists extract patterns, trends and links from this ever-deepening ocean of information. Two such systems, aimed at retrieving and subsequently utilizing phylogenetically relevant information, have been developed in this dissertation, the major objective of which was to automate the often difficult and confusing phylogenetic reconstruction process. Popular phylogenetic reconstruction methods, such as distance-based methods, attempt to find an optimal tree topology (one that reflects the relationships among related sequences and their evolutionary history) by searching through the topology space. Various compromises between fast (but incomplete) and exhaustive (but computationally prohibitive) search heuristics have been suggested. The second chapter of this dissertation describes an intelligent compromise algorithm that relies on a flexible “beam” search principle from the Artificial Intelligence domain and uses pre-computed local topology reliability information to adjust the beam search space continuously. However, sometimes even a (virtually) complete distance-based method is inferior to the significantly more elaborate (and computationally expensive) maximum likelihood (ML) method. In fact, depending on the nature of the sequence data in question, either method might prove to be superior. Therefore, it is difficult (even for an expert) to tell a priori which phylogenetic reconstruction method, whether distance-based, ML, or maximum parsimony (MP), should be chosen for any particular data set. A number of factors, often hidden, influence the performance of a method. For example, it is generally understood that for a phylogenetically “difficult” data set, more sophisticated methods (e.g., ML) tend to be more effective and thus should be chosen. However, it is the interplay of many factors that one needs to consider in order to avoid choosing an inferior method (potentially a costly mistake, both in terms of computational expense and in terms of reconstruction accuracy). Chapter III of this dissertation details a phylogenetic reconstruction expert system that selects an appropriate method automatically. It uses a classifier (a Decision Tree-inducing algorithm) to map a new data set to the proper phylogenetic reconstruction method.
Abstract:
Spanish wheat (Triticum spp.) landraces have a considerable polymorphism, containing many unique alleles, relative to other collections. The existence of a core collection is a favored approach for breeders to efficiently explore novel variation and enhance the use of germplasm. In this study, the Spanish durum wheat (Triticum turgidum L.) core collection (CC) was created using a population structure–based method, grouping accessions by subspecies and allocating the number of genotypes among populations according to the diversity of simple sequence repeat (SSR) markers. The CC of 94 genotypes was established, which accounted for 17% of the accessions in the entire collection. An alternative core collection (CH), with the same number of genotypes per subspecies and maximizing the coverage of SSR alleles, was assembled with the Core Hunter software. The quality of both core collections was compared with a random core collection and evaluated using geographic, agromorphological, and molecular marker data not previously used in the selection of genotypes. Both core collections had a high genetic representativeness, which validated their sampling strategies. Geographic and agromorphological variation, phenotypic correlations, and gliadin alleles of the original collection were more accurately depicted by the CC. Diversity arrays technology (DArT) markers revealed that the CC included genotypes less similar than the CH. Although more SSR alleles were retained by the CH (94%) than by the CC (91%), the results showed that the CC was better than CH for breeding purposes.
Abstract:
The Mycetozoa include the cellular (dictyostelid), acellular (myxogastrid), and protostelid slime molds. However, available molecular data are in disagreement on both the monophyly and phylogenetic position of the group. Ribosomal RNA trees show the myxogastrid and dictyostelid slime molds as unrelated early branching lineages, but actin and β-tubulin trees place them together as a single coherent (monophyletic) group, closely related to the animal–fungal clade. We have sequenced the elongation factor-1α genes from one member of each division of the Mycetozoa, including Dictyostelium discoideum, for which cDNA sequences were previously available. Phylogenetic analyses of these sequences strongly support a monophyletic Mycetozoa, with the myxogastrid and dictyostelid slime molds most closely related to each other. All phylogenetic methods used also place this coherent Mycetozoan assemblage as emerging among the multicellular eukaryotes, tentatively supported as more closely related to animals + fungi than are green plants. With our data there are now three proteins that consistently support a monophyletic Mycetozoa and at least four that place these taxa within the “crown” of the eukaryote tree. We suggest that ribosomal RNA data should be more closely examined with regard to these questions, and we emphasize the importance of developing multiple sequence data sets.
Abstract:
Molecular, sequence-based environmental surveys of microorganisms have revealed a large degree of previously uncharacterized diversity. However, nearly all studies of the human endogenous bacterial flora have relied on cultivation and biochemical characterization of the resident organisms. We used molecular methods to characterize the breadth of bacterial diversity within the human subgingival crevice by comparing 264 small subunit rDNA sequences from 21 clone libraries created with products amplified directly from subgingival plaque, with sequences obtained from bacteria that were cultivated from the same specimen, as well as with sequences available in public databases. The majority (52.5%) of the directly amplified 16S rRNA sequences were <99% identical to sequences within public databases. In contrast, only 21.4% of the sequences recovered from cultivated bacteria showed this degree of variability. The 16S rDNA sequences recovered by direct amplification were also more deeply divergent; 13.5% of the amplified sequences were more than 5% nonidentical to any known sequence, a level of dissimilarity that is often found between members of different genera. None of the cultivated sequences exhibited this degree of sequence dissimilarity. Finally, direct amplification of 16S rDNA yielded a more diverse view of the subgingival bacterial flora than did cultivation. Our data suggest that a significant proportion of the resident human bacterial flora remain poorly characterized, even within this well studied and familiar microbial environment.
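The identity thresholds quoted in the abstract above (for example, sequences less than 99% identical to any database entry) rest on a pairwise percent-identity calculation. The following is a minimal sketch, assuming the two sequences are already aligned to equal length and that columns containing a gap (`-`) in either sequence are ignored. Both choices and the function names are assumptions for illustration, not the study's stated method.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of gap-free alignment columns where the two sequences agree."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue  # assumption: gapped columns are excluded from the count
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * matches / compared

def fraction_below(identities, threshold=99.0):
    """Share of sequences whose best database identity is below the threshold."""
    if not identities:
        raise ValueError("identities must be non-empty")
    return sum(value < threshold for value in identities) / len(identities)

# Hypothetical best-hit identities for four clones.
print(fraction_below([99.5, 98.0, 94.0, 100.0]))
```

In practice the best-hit identity for each clone would come from an alignment search against a reference database; only the final tallying step is shown here.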
Abstract:
The function of many of the uncharacterized open reading frames discovered by genomic sequencing can be determined at the level of expressed gene products, the proteome. However, identifying the cognate gene from minute amounts of protein has been one of the major problems in molecular biology. Using yeast as an example, we demonstrate here that mass spectrometric protein identification is a general solution to this problem given a completely sequenced genome. As a first screen, our strategy uses automated laser desorption ionization mass spectrometry of the peptide mixtures produced by in-gel tryptic digestion of a protein. Up to 90% of proteins are identified by searching sequence data bases by lists of peptide masses obtained with high accuracy. The remaining proteins are identified by partially sequencing several peptides of the unseparated mixture by nanoelectrospray tandem mass spectrometry followed by data base searching with multiple peptide sequence tags. In blind trials, the method led to unambiguous identification in all cases. In the largest individual protein identification project to date, a total of 150 gel spots—many of them at subpicomole amounts—were successfully analyzed, greatly enlarging a yeast two-dimensional gel data base. More than 32 proteins were novel and matched to previously uncharacterized open reading frames in the yeast genome. This study establishes that mass spectrometry provides the required throughput, the certainty of identification, and the general applicability to serve as the method of choice to connect genome and proteome.
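The first-pass screen in the abstract above, which searches a sequence database with a list of peptide masses, can be sketched as an in silico tryptic digest followed by mass matching. The sketch assumes cleavage after K or R except before P, no missed cleavages, unmodified residues, and neutral monoisotopic masses. The function names, the tolerance, and the toy sequences are hypothetical and do not reproduce the authors' software.

```python
# Monoisotopic residue masses in daltons for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide

def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides

def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER

def count_mass_matches(observed, protein, tol=0.01):
    """Number of observed masses that match a theoretical tryptic peptide."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(protein)]
    return sum(any(abs(m - t) <= tol for t in theoretical) for m in observed)

def best_match(observed, database, tol=0.01):
    """Name of the database entry that explains the most observed masses."""
    return max(database, key=lambda name: count_mass_matches(observed, database[name], tol))

# Toy database of two made-up open reading frames.
database = {"orf_a": "AGKRPSTKLL", "orf_b": "GGGGKAAAR"}
observed = [peptide_mass(p) for p in tryptic_digest(database["orf_a"])]
print(best_match(observed, database))
```

Ranking by a simple match count is the most basic scoring choice; a real search would also account for measurement error, missed cleavages, and the chance of random matches in a large genome.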
Abstract:
Chromosome 7q22 has been the focus of many cytogenetic and molecular studies aimed at delineating regions commonly deleted in myeloid leukemias and myelodysplastic syndromes. We have compared a gene-dense, GC-rich sub-region of 7q22 with the orthologous region on mouse chromosome 5. A physical map of 640 kb of genomic DNA from mouse chromosome 5 was derived from a series of overlapping bacterial artificial chromosomes. A 296 kb segment from the physical map, spanning Ache to Tfr2, was compared with 267 kb of human sequence. We identified a conserved linkage of 12 genes including an open reading frame flanked by Ache and Asr2, a novel cation-chloride cotransporter interacting protein Cip1, Ephb4, Zan and Perq1. While some of these genes have been previously described, in each case we present new data derived from our comparative sequence analysis. Adjacent unfinished sequence data from the mouse contains an orthologous block of 10 additional genes including three novel cDNA sequences that we subsequently mapped to human 7q22. Methods for displaying comparative genomic information, including unfinished sequence data, are becoming increasingly important. We supplement our printed comparative analysis with a new, Web-based program called Laj (local alignments with java). Laj provides interactive access to archived pairwise sequence alignments via the WWW. It displays synchronized views of a dot-plot, a percent identity plot, a nucleotide-level local alignment and a variety of relevant annotations. Our mouse–human comparison can be viewed at http://web.uvic.ca/~bioweb/laj.html. Laj is available at http://bio.cse.psu.edu/, along with online documentation and additional examples of annotated genomic regions.
Abstract:
When many protein sequences are available for estimating the time of divergence between two species, it is customary to estimate the time for each protein separately and then use the average for all proteins as the final estimate. However, it can be shown that this estimate generally has an upward bias, and that an unbiased estimate is obtained by using distances based on concatenated sequences. We have shown that two concatenation-based distances, i.e., average gamma distance weighted with sequence length (d2) and multiprotein gamma distance (d3), generally give more satisfactory results than other concatenation-based distances. Using these two distance measures for 104 protein sequences, we estimated the time of divergence between mice and rats to be approximately 33 million years ago. Similarly, the time of divergence between humans and rodents was estimated to be approximately 96 million years ago. We also investigated the dependency of time estimates on statistical methods and various assumptions made by using sequence data from eubacteria, protists, plants, fungi, and animals. Our best estimates of the times of divergence between eubacteria and eukaryotes, between protists and other eukaryotes, and between plants, fungi, and animals were 3, 1.7, and 1.3 billion years ago, respectively. However, estimates of ancient divergence times are subject to a substantial amount of error caused by uncertainty of the molecular clock, horizontal gene transfer, errors in sequence alignments, etc.
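The contrast in the abstract above between averaging per-protein estimates and weighting them by sequence length can be made concrete with a small sketch. It assumes the textbook amino acid gamma distance d = a[(1 - p)^(-1/a) - 1], where p is the proportion of differing sites and a is the gamma shape parameter. The default a = 2.0, the function names, and the input values are illustrative assumptions, and the multiprotein gamma distance (d3) is not implemented.

```python
def gamma_distance(p, a=2.0):
    """Gamma distance for a proportion p of differing amino acid sites."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    if a <= 0:
        raise ValueError("shape parameter a must be positive")
    return a * ((1.0 - p) ** (-1.0 / a) - 1.0)

def simple_average(distances):
    """Unweighted mean of per-protein distances."""
    return sum(distances) / len(distances)

def length_weighted_average(distances, lengths):
    """Mean of per-protein distances weighted by sequence length."""
    if len(distances) != len(lengths) or not distances:
        raise ValueError("distances and lengths must be non-empty and of equal length")
    total_length = sum(lengths)
    return sum(d * n for d, n in zip(distances, lengths)) / total_length

# Hypothetical proportions of differing sites and lengths for three proteins.
p_values = [0.05, 0.20, 0.10]
lengths = [150, 900, 400]
distances = [gamma_distance(p) for p in p_values]
print(simple_average(distances))
print(length_weighted_average(distances, lengths))
```

Short proteins give noisy distance estimates, so an unweighted average lets them influence the result as much as long ones; weighting by length reduces that effect.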
Abstract:
There is a need for faster and more sensitive algorithms for sequence similarity searching in view of the rapidly increasing amounts of genomic sequence data available. Parallel processing capabilities in the form of single instruction, multiple data (SIMD) technology are now available in common microprocessors and enable a single microprocessor to perform many operations in parallel. The ParAlign algorithm has been specifically designed to take advantage of this technology. The new algorithm initially exploits parallelism to perform a very rapid computation of the exact optimal ungapped alignment score for all diagonals in the alignment matrix. Then, a novel heuristic is employed to compute an approximate score of a gapped alignment by combining the scores of several diagonals. This approximate score is used to select the most interesting database sequences for a subsequent Smith–Waterman alignment, which is also parallelised. The resulting method represents a substantial improvement compared to existing heuristics. The sensitivity and specificity of ParAlign were found to be as good as those of Smith–Waterman implementations when the same method for computing the statistical significance of the matches was used. In terms of speed, only the significantly less sensitive NCBI BLAST 2 program was found to outperform the new approach. Online searches are available at http://dna.uio.no/search/
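The first stage described in the abstract above, the optimal ungapped alignment score on every diagonal of the alignment matrix, can be sketched in scalar form. This is not the ParAlign implementation: it uses a plain Python loop instead of SIMD instructions and an assumed match/mismatch score instead of a substitution matrix, and the function name is hypothetical.

```python
def diagonal_scores(query, target, match=1, mismatch=-1):
    """Best ungapped local segment score on each diagonal d = j - i."""
    scores = {}
    for d in range(-(len(query) - 1), len(target)):
        i = max(0, -d)  # start row on this diagonal
        j = i + d       # start column on this diagonal
        best = 0
        running = 0
        while i < len(query) and j < len(target):
            step = match if query[i] == target[j] else mismatch
            # Restart the segment whenever the running score drops below zero.
            running = max(0, running + step)
            best = max(best, running)
            i += 1
            j += 1
        scores[d] = best
    return scores

# Toy example: the query occurs in the target at an offset of two positions.
scores = diagonal_scores("ACGT", "TTACGT")
best_diagonal = max(scores, key=scores.get)
print(best_diagonal, scores[best_diagonal])
```

Each diagonal is independent of the others, which is why this step lends itself to processing many diagonals at once in a SIMD register.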
Abstract:
Carbonic anhydrase (CA) (EC 4.2.1.1) enzymes catalyze the reversible hydration of CO2, a reaction that is important in many physiological processes. We have cloned and sequenced a full-length cDNA encoding an intracellular β-CA from the unicellular green alga Coccomyxa. Nucleotide sequence data show that the isolated cDNA contains an open reading frame encoding a polypeptide of 227 amino acids. The predicted polypeptide is similar to β-type CAs from Escherichia coli and higher plants, with an identity of 26% to 30%. The Coccomyxa cDNA was overexpressed in E. coli, and the enzyme was purified and biochemically characterized. The mature protein is a homotetramer with an estimated molecular mass of 100 kD. The CO2-hydration activity of the Coccomyxa enzyme is comparable with that of the pea homolog. However, the activity of Coccomyxa CA is largely insensitive to oxidative conditions, in contrast to similar enzymes from most higher plants. Fractionation studies further showed that Coccomyxa CA is extrachloroplastic.
Abstract:
We report new evidence that bears decisively on a long-standing controversy in primate systematics. DNA sequence data for the complete cytochrome b gene, combined with an expanded morphological data set, confirm the results of a previous study and again indicate that all extant Malagasy lemurs originated from a single common ancestor. These results, as well as those from other genetic studies, call for a revision of primate classifications in which the dwarf and mouse lemurs are placed within the Afro-Asian lorisiforms. The phylogenetic results, in agreement with paleocontinental data, indicate an African origin for the common ancestor of lemurs and lorises (the Strepsirrhini). The molecular data further suggest the surprising conclusion that lemurs began evolving independently by the early Eocene at the latest. This indicates that the Malagasy primate lineage is more ancient than generally thought and places the split between the two strepsirrhine lineages well before the appearance of known Eocene fossil primates. We conclude that primate origins were marked by rapid speciation and diversification sometime before the late Paleocene.
Abstract:
Using partial amino acid sequence data derived from porcine methionyl aminopeptidase (MetAP; methionine aminopeptidase, peptidase M; EC 3.4.11.18), a full-length clone of the homologous human enzyme has been obtained. The cDNA sequence contains 2569 nt with a single open reading frame corresponding to a protein of 478 amino acids. The C-terminal portion representing the catalytic domain shows limited identity with MetAP sequences from various prokaryotes and yeast, while the N terminus is rich in charged amino acids, including extended strings of basic and acidic residues. These highly polar stretches likely result in the spuriously high observed molecular mass (67 kDa). This cDNA sequence is highly similar to a rat protein, termed p67, which was identified as an inhibitor of phosphorylation of initiation factor eIF2 alpha and was previously predicted to be a metallopeptidase based on limited sequence homology. Model building established that human MetAP (p67) could be readily accommodated into the Escherichia coli MetAP structure and that the Co2+ ligands were fully preserved. However, human MetAP was found to be much more similar to a yeast open reading frame that differed markedly from the previously reported yeast MetAP. A similar partial sequence from Methanothermus fervidus suggests that this p67-like sequence is also found in prokaryotes. These findings suggest that there are two cobalt-dependent MetAP families, presently composed of the prokaryote and yeast sequences (and represented by the E. coli structure) (type I), on the one hand, and by human MetAP, the yeast open reading frame, and the partial prokaryotic sequence (type II), on the other.
Abstract:
Sugarcane moth borers are a diverse group of species occurring in several genera, but predominantly within the Noctuidae and Pyraloidea. They cause economic loss in sugarcane and other crops through damage to stems and stalks by larval boring. Partial sequence data from two mitochondrial genes, COII and 16S, were used to construct a molecular phylogeny based on 26 species from ten genera and six tribes. The Noctuidae were found to be monophyletic, providing molecular support for the taxonomy within this subfamily. However, the Pyraloidea are paraphyletic, with the noctuids splitting Galleriinae and Schoenobiinae from the Crambinae. This supports the separation of the Pyralidae and Crambinae, but does not support the concept of the incorporation of the Schoenobiinae in the Crambidae. Of the three crambine genera examined, Diatraea was monophyletic, Chilo paraphyletic, and Eoreuma was basal to the other two genera. Within the Noctuidae, Sesamia and Bathytricha were monophyletic, with Busseola basal to Bathytricha. Many species in this study (both noctuids and pyraloids) had different biotypes within collection localities and across their distribution; however, the individual biotypes were not phylogenetically informative. These data highlight the need for taxonomic revisions at all taxon levels and provide a basis for the development of DNA-based diagnostics for rapidly identifying many species at any developmental stage. This ability is vital, as the species are an incursion threat to Australia and have the potential to cause significant losses to the sugar industry.