903 results for SEQUENCE DATA
Abstract:
Prior studies of phylogenetic relationships among phocoenids based on morphology and molecular sequence data conflict and yield unresolved relationships among species. This study evaluates a comprehensive set of cranial, postcranial, and soft anatomical characters to infer interrelationships among extant species and several well-known fossil phocoenids, using two different methods to analyze polymorphic data: polymorphic coding and frequency step matrix. Our phylogenetic results confirmed phocoenid monophyly. The division of Phocoenidae into two subfamilies previously proposed was rejected, as well as the alliance of the two extinct genera Salumiphocaena and Piscolithax with Phocoena dioptrica and Phocoenoides dalli. Extinct phocoenids are basal to all extant species. We also examined the origin and distribution of porpoises within the context of this phylogenetic framework. Phocoenid phylogeny together with available geologic evidence suggests that the early history of phocoenids was centered in the North Pacific during the middle Miocene, with subsequent dispersal into the southern hemisphere in the middle Pliocene. A cooling period in the Pleistocene allowed dispersal of the southern ancestor of Phocoena sinus into the North Pacific (Gulf of California).
Abstract:
Iphisa elegans Gray, 1851 is a ground-dwelling lizard widespread over Amazonia that displays a broadly conserved external morphology over its range. This wide geographical distribution and conservation of body form contrasts with the expected poor dispersal ability of the species, the tumultuous past of Amazonia, and the previously documented prevalence of cryptic species in widespread terrestrial organisms in this region. Here we investigate this homogeneity by examining hemipenial morphology and conducting phylogenetic analyses of mitochondrial (CYTB) and nuclear (C-MOS) DNA sequence data from 49 individuals sampled across Amazonia. We detected remarkable variation in hemipenial morphology within this species, with multiple cases of sympatric occurrence of distinct hemipenial morphotypes. Phylogenetic analyses revealed highly divergent lineages corroborating the patterns suggested by the hemipenial morphotypes, including co-occurrence of different lineages. The degrees of genetic and morphological distinctness, as well as instances of sympatry among mtDNA lineages/morphotypes without nuDNA allele sharing, suggest that I. elegans is a complex of cryptic species. An extensive and integrative taxonomic revision of the I. elegans complex throughout its wide geographical range is needed. © 2012 The Linnean Society of London, Zoological Journal of the Linnean Society, 2012, 166, 361–376.
Abstract:
Background: Plasmodium vivax is the most widely distributed human malaria, responsible for 70–80 million clinical cases each year and large socio-economic burdens for countries such as Brazil where it is the most prevalent species. Unfortunately, due to the impossibility of growing this parasite in continuous in vitro culture, research on P. vivax remains largely neglected. Methods: A pilot survey of expressed sequence tags (ESTs) from the asexual blood stages of P. vivax was performed. To do so, 1,184 clones from a cDNA library constructed with parasites obtained from 10 different human patients in the Brazilian Amazon were sequenced. Sequences were automatically processed to remove contaminants and low quality reads. A total of 806 sequences with an average length of 586 bp met such criteria and their clustering revealed 666 distinct events. The consensus sequence of each cluster and the unique sequences of the singlets were used in similarity searches against different databases that included P. vivax, Plasmodium falciparum, Plasmodium yoelii, Plasmodium knowlesi, Apicomplexa and the GenBank non-redundant database. An E-value of <10^-30 was used to define a significant database match. ESTs were manually assigned a gene ontology (GO) terminology. Results: A total of 769 ESTs could be assigned a putative identity based upon sequence similarity to known proteins in GenBank. Moreover, 292 ESTs were annotated and a GO terminology was assigned to 164 of them. Conclusion: These are the first ESTs reported for P. vivax and, as such, they represent a valuable resource to assist in the annotation of the P. vivax genome currently being sequenced. Moreover, since the GC-content of the P. vivax genome is strikingly different from that of P. falciparum, these ESTs will help in the validation of gene predictions for P. vivax and to create a gene index of this malaria parasite.
Abstract:
BACKGROUND: Several approaches can be used to determine the order of loci on chromosomes and hence develop maps of the genome. However, all mapping approaches are prone to errors either arising from technical deficiencies or lack of statistical support to distinguish between alternative orders of loci. The accuracy of the genome maps could be improved, in principle, if information from different sources was combined to produce integrated maps. The publicly available bovine genomic sequence assembly with 6x coverage (Btau_2.0) is based on whole genome shotgun sequence data and limited mapping data; however, it is recognised that this assembly is a draft that contains errors. Correcting the sequence assembly requires extensive additional mapping information to improve the reliability of the ordering of sequence scaffolds on chromosomes. The radiation hybrid (RH) map described here has been contributed to the international sequencing project to aid this process. RESULTS: An RH map for the 30 bovine chromosomes is presented. The map was built using the Roslin 3000-rad RH panel (BovGen RH map) and contains 3966 markers including 2473 new loci in addition to 262 amplified fragment-length polymorphisms (AFLP) and 1231 markers previously published with the first generation RH map. Sequences of the mapped loci were aligned with published bovine genome maps to identify inconsistencies. In addition to differences in the order of loci, several cases were observed where the chromosomal assignment of loci differed between maps. All the chromosome maps were aligned with the current 6x bovine assembly (Btau_2.0) and 2898 loci were unambiguously located in the bovine sequence. The order of loci on the RH map for BTA 5, 7, 16, 22, 25 and 29 differed substantially from the assembled bovine sequence. From the 2898 loci unambiguously identified in the bovine sequence assembly, 131 mapped to different chromosomes in the BovGen RH map.
CONCLUSION: Alignment of the BovGen RH map with other published RH and genetic maps showed higher consistency in marker order and chromosome assignment than with the current 6x sequence assembly. This suggests that the bovine sequence assembly could be significantly improved by incorporating additional independent mapping information.
Abstract:
Lyme disease Borrelia can infect humans and animals for months to years, despite the presence of an active host immune response. The vls antigenic variation system, which expresses the surface-exposed lipoprotein VlsE, plays a major role in B. burgdorferi immune evasion. Gene conversion between vls silent cassettes and the vlsE expression site occurs at high frequency during mammalian infection, resulting in sequence variation in the VlsE product. In this study, we examined vlsE sequence variation in B. burgdorferi B31 during mouse infection by analyzing 1,399 clones isolated from bladder, heart, joint, ear, and skin tissues of mice infected for 4 to 365 days. The median number of codon changes increased progressively in C3H/HeN mice from 4 to 28 days post infection, and no clones retained the parental vlsE sequence at 28 days. In contrast, the decrease in the number of clones with the parental vlsE sequence and the increase in the number of sequence changes occurred more gradually in severe combined immunodeficiency (SCID) mice. Clones containing a stop codon were isolated, indicating that continuous expression of full-length VlsE is not required for survival in vivo; also, these clones continued to undergo vlsE recombination. Analysis of clones with apparent single recombination events indicated that recombinations into vlsE are nonselective with regard to the silent cassette utilized, as well as the length and location of the recombination event. Sequence changes as small as one base pair were common. Fifteen percent of recovered vlsE variants contained "template-independent" sequence changes, which clustered in the variable regions of vlsE. We hypothesize that the increased frequency and complexity of vlsE sequence changes observed in clones recovered from immunocompetent mice (as compared with SCID mice) is due to rapid clearance of relatively invariant clones by variable region-specific anti-VlsE antibody responses.
Abstract:
We report the complete genome sequence of bovine pestivirus strain PG-2. The sequence data from this virus showed that PG-2 is closely related to the giraffe pestivirus strain H138. PG-2 and H138 belong to one pestivirus species that should be considered an approved member of the genus Pestivirus.
Abstract:
BACKGROUND: A cost-effective strategy to increase the density of available markers within a population is to sequence a small proportion of the population and impute whole-genome sequence data for the remaining population. Increased densities of typed markers are advantageous for genome-wide association studies (GWAS) and genomic predictions. METHODS: We obtained genotypes for 54 602 SNPs (single nucleotide polymorphisms) in 1077 Franches-Montagnes (FM) horses and Illumina paired-end whole-genome sequencing data for 30 FM horses and 14 Warmblood horses. After variant calling, the sequence-derived SNP genotypes (~13 million SNPs) were used for genotype imputation with the software programs Beagle, Impute2 and FImpute. RESULTS: The mean imputation accuracy of FM horses using Impute2 was 92.0%. Imputation accuracy using Beagle and FImpute was 74.3% and 77.2%, respectively. In addition, for Impute2 we determined the imputation accuracy of all individual horses in the validation population, which ranged from 85.7% to 99.8%. The subsequent inclusion of Warmblood sequence data further increased the correlation between true and imputed genotypes for most horses, especially for horses with a high level of admixture. The final imputation accuracy of the horses ranged from 91.2% to 99.5%. CONCLUSIONS: Using Impute2, the imputation accuracy was higher than 91% for all horses in the validation population, which indicates that direct imputation of 50k SNP-chip data to sequence level genotypes is feasible in the FM population. The individual imputation accuracy depended mainly on the applied software and the level of admixture.
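The abstract above reports imputation accuracy and the correlation between true and imputed genotypes. As a rough illustration of how such per-animal figures could be computed, the sketch below scores one individual by genotype concordance and Pearson correlation. The 0/1/2 allele-dosage coding, the function name `imputation_accuracy` and the toy genotypes are assumptions made for illustration; the abstract does not state which accuracy definition the authors used.

```python
import math

def imputation_accuracy(true_genotypes, imputed_genotypes):
    """Return (concordance, correlation) for one individual.

    Genotypes are assumed to be allele dosages coded 0, 1 or 2
    (an illustrative convention, not taken from the study).
    """
    if len(true_genotypes) != len(imputed_genotypes):
        raise ValueError("genotype vectors must have the same length")
    n = len(true_genotypes)

    # Concordance: fraction of markers where the imputed call is exactly right.
    matches = sum(t == g for t, g in zip(true_genotypes, imputed_genotypes))
    concordance = matches / n

    # Pearson correlation between true and imputed dosages.
    mean_t = sum(true_genotypes) / n
    mean_g = sum(imputed_genotypes) / n
    cov = sum((t - mean_t) * (g - mean_g)
              for t, g in zip(true_genotypes, imputed_genotypes))
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_g = sum((g - mean_g) ** 2 for g in imputed_genotypes)
    if var_t == 0 or var_g == 0:
        correlation = float("nan")  # undefined when either vector is constant
    else:
        correlation = cov / math.sqrt(var_t * var_g)
    return concordance, correlation

# Toy data: eight markers, one wrong call at the sixth marker.
true_g = [0, 1, 2, 1, 0, 2, 1, 0]
imputed_g = [0, 1, 2, 1, 0, 1, 1, 0]
concordance, correlation = imputation_accuracy(true_g, imputed_g)
print(concordance, correlation)
```

Concordance treats every wrong call equally, whereas the correlation is less sensitive to allele frequency, which is one reason both measures are often reported side by side.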
Abstract:
Academic and industrial research in the late 1990s brought about an exponential explosion of DNA sequence data. Automated expert systems are being created to help biologists extract patterns, trends and links from this ever-deepening ocean of information. Two such systems, aimed at retrieving and subsequently utilizing phylogenetically relevant information, have been developed in this dissertation, the major objective of which was to automate the often difficult and confusing phylogenetic reconstruction process.
Popular phylogenetic reconstruction methods, such as distance-based methods, attempt to find an optimal tree topology (that reflects the relationships among related sequences and their evolutionary history) by searching through the topology space. Various compromises between the fast (but incomplete) and exhaustive (but computationally prohibitive) search heuristics have been suggested. An intelligent compromise algorithm that relies on a flexible “beam” search principle from the Artificial Intelligence domain and uses the pre-computed local topology reliability information to adjust the beam search space continuously is described in the second chapter of this dissertation.
However, sometimes even a (virtually) complete distance-based method is inferior to the significantly more elaborate (and computationally expensive) maximum likelihood (ML) method. In fact, depending on the nature of the sequence data in question, either method might prove to be superior. Therefore, it is difficult (even for an expert) to tell a priori which phylogenetic reconstruction method (distance-based, ML or maybe maximum parsimony (MP)) should be chosen for any particular data set.
A number of factors, often hidden, influence the performance of a method. For example, it is generally understood that for a phylogenetically “difficult” data set more sophisticated methods (e.g., ML) tend to be more effective and thus should be chosen. However, it is the interplay of many factors that one needs to consider in order to avoid choosing an inferior method (potentially a costly mistake, both in terms of computational expenses and in terms of reconstruction accuracy).
Chapter III of this dissertation details a phylogenetic reconstruction expert system that selects the proper method automatically. It uses a classifier (a Decision Tree-inducing algorithm) to map a new data set to the proper phylogenetic reconstruction method.
Abstract:
The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/) is maintained at the European Bioinformatics Institute (EBI) in an international collaboration with the DNA Data Bank of Japan (DDBJ) and GenBank at the NCBI (USA). Data is exchanged amongst the collaborating databases on a daily basis. The major contributors to the EMBL database are individual authors and genome project groups. Webin is the preferred web-based submission system for individual submitters, whilst automatic procedures allow incorporation of sequence data from large-scale genome sequencing centres and from the European Patent Office (EPO). Database releases are produced quarterly. Network services allow free access to the most up-to-date data collection via ftp, email and World Wide Web interfaces. EBI’s Sequence Retrieval System (SRS), a network browser for databanks in molecular biology, integrates and links the main nucleotide and protein databases plus many specialized databases. For sequence similarity searching a variety of tools (e.g. Blitz, Fasta, BLAST) are available which allow external users to compare their own sequences against the latest data in the EMBL Nucleotide Sequence Database and SWISS-PROT.
Abstract:
There is a need for faster and more sensitive algorithms for sequence similarity searching in view of the rapidly increasing amounts of genomic sequence data available. Parallel processing capabilities in the form of the single instruction, multiple data (SIMD) technology are now available in common microprocessors and enable a single microprocessor to perform many operations in parallel. The ParAlign algorithm has been specifically designed to take advantage of this technology. The new algorithm initially exploits parallelism to perform a very rapid computation of the exact optimal ungapped alignment score for all diagonals in the alignment matrix. Then, a novel heuristic is employed to compute an approximate score of a gapped alignment by combining the scores of several diagonals. This approximate score is used to select the most interesting database sequences for a subsequent Smith–Waterman alignment, which is also parallelised. The resulting method represents a substantial improvement compared to existing heuristics. The sensitivity and specificity of ParAlign were found to be as good as those of Smith–Waterman implementations when the same method for computing the statistical significance of the matches was used. In terms of speed, only the significantly less sensitive NCBI BLAST 2 program was found to outperform the new approach. Online searches are available at http://dna.uio.no/search/
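The first stage described above, the optimal ungapped alignment score on every diagonal of the alignment matrix, can be illustrated with a plain scalar sketch. This is not ParAlign's SIMD implementation: the simple match/mismatch scoring, the function name `diagonal_scores` and the toy sequences are assumptions for illustration, and a real search would use a substitution matrix.

```python
def diagonal_scores(query, subject, match=1, mismatch=-1):
    """Best ungapped local alignment score on every diagonal.

    Returns a dict mapping the diagonal offset (j - i) to the highest
    score of any contiguous run of aligned positions on that diagonal.
    """
    m, n = len(query), len(subject)
    scores = {}
    for offset in range(-(m - 1), n):
        # Starting cell of this diagonal in the m x n alignment matrix.
        i = max(0, -offset)
        j = i + offset
        best = running = 0
        while i < m and j < n:
            step = match if query[i] == subject[j] else mismatch
            # Maximum-subarray recurrence: restart when the run drops below 0.
            running = max(0, running + step)
            best = max(best, running)
            i += 1
            j += 1
        scores[offset] = best
    return scores

scores = diagonal_scores("ACGT", "TACGTA")
best_offset = max(scores, key=scores.get)
print(best_offset, scores[best_offset])
```

Each diagonal is independent of the others, which is what makes this stage a natural fit for processing many diagonals at once with SIMD instructions.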
Abstract:
Background: Protein tertiary structure can be partly characterized via each amino acid's contact number measuring how residues are spatially arranged. The contact number of a residue in a folded protein is a measure of its exposure to the local environment, and is defined as the number of C-beta atoms in other residues within a sphere around the C-beta atom of the residue of interest. Contact number is partly conserved between protein folds and thus is useful for protein fold and structure prediction. In turn, each residue's contact number can be partially predicted from primary amino acid sequence, assisting tertiary fold analysis from sequence data. In this study, we provide a more accurate contact number prediction method from protein primary sequence. Results: We predict contact number from protein sequence using a novel support vector regression algorithm. Using protein local sequences with multiple sequence alignments (PSI-BLAST profiles), we demonstrate a correlation coefficient between predicted and observed contact numbers of 0.70, which outperforms previously achieved accuracies. Including additional information about sequence weight and amino acid composition further improves prediction accuracies significantly with the correlation coefficient reaching 0.73. If residues are classified as being either contacted or non-contacted, the prediction accuracies are all greater than 77%, regardless of the choice of classification thresholds. Conclusion: The successful application of support vector regression to the prediction of protein contact number reported here, together with previous applications of this approach to the prediction of protein accessible surface area and B-factor profile, suggests that a support vector regression approach may be very useful for determining the structure-function relation between primary sequence and higher order consecutive protein structural and functional properties.
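The definition of contact number given above (the number of C-beta atoms of other residues within a sphere around a residue's own C-beta atom) translates directly into code. The sketch below is a minimal version; the 12 Å default radius, the function name `contact_numbers` and the toy coordinates are assumptions for illustration, since the abstract does not state the radius used.

```python
import math

def contact_numbers(cbeta_coords, radius=12.0):
    """Count, for each residue, the other residues whose C-beta atom lies
    within `radius` of its own C-beta atom.

    cbeta_coords: list of (x, y, z) tuples, one per residue.
    radius: sphere radius in the same units as the coordinates
            (the default is an assumed, illustrative value).
    """
    counts = []
    for i, centre in enumerate(cbeta_coords):
        count = 0
        for j, other in enumerate(cbeta_coords):
            # Skip the residue itself; count neighbours inside the sphere.
            if i != j and math.dist(centre, other) <= radius:
                count += 1
        counts.append(count)
    return counts

# Toy example: three residues close together on a line and one far away.
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
numbers = contact_numbers(coords)
print(numbers)
```

The double loop is quadratic in the number of residues, which is fine for single proteins; a spatial index would only matter for very large structures.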
Abstract:
The tropical abalone, Haliotis asinina, is an ideal species to investigate the molecular mechanisms that control development, growth, reproduction and shell formation in all cultured haliotids. Here we describe the analysis of 232 expressed sequence tags (EST) obtained from a developmental H. asinina cDNA library intended for future microarray studies. From this data set we identified 183 unique gene clusters. Of these, 90 clusters showed significant homology with sequences lodged in GenBank, ranging in function from general housekeeping to signal transduction, gene regulation and cell-cell communication. Seventy-one clusters possessed completely novel ORFs greater than 50 codons in length, highlighting the paucity of sequence data from molluscs and other lophotrochozoans. This study of developmental gene expression in H. asinina provides the foundation for further detailed analyses of abalone growth, development and reproduction.
Abstract:
The complete genome sequence of the Australian 1-2 heat-tolerant Newcastle disease virus (NDV) vaccine (master seed stocks) was determined and compared to the sequence of the parent virus from which it had been derived after exposure of the parent stock at 56 °C for 30 min. Nucleotide changes were observed at a number of positions, with synonymous mutations outnumbering non-synonymous mutations. Sequence data for the HN gene of a parental culture of V4 and two heat-tolerant variants of V4 were obtained. These were compared with the data for the 1-2 viruses and with published sequences for parental V4 and for a number of ND vaccine strains. Sequence analyses did not reveal the ARG 303 deletion in the HN protein, previously claimed to be responsible for the thermostable phenotype. No consistent changes were detected that would indicate involvement of the HN protein in heat resistance. The majority of alterations were observed in the L protein of the virus and it is proposed that these alterations were responsible for the heat-tolerant phenotype of the 1-2 NDV vaccine. © 2005 Elsevier B.V. All rights reserved.
Abstract:
The primary goal of this dissertation is the study of patterns of viral evolution inferred from serially-sampled sequence data, i.e., sequence data obtained from strains isolated at consecutive time points from a single patient or host. RNA viral populations have an extremely high genetic variability, largely due to their astronomical population sizes within host systems, high replication rate, and short generation time. It is this aspect of their evolution that demands special attention and a different approach when studying the evolutionary relationships of serially-sampled sequence data. New methods that analyze serially-sampled data were developed shortly after a groundbreaking HIV-1 study of several patients from which viruses were isolated at recurring intervals over a period of 10 or more years. These methods assume a tree-like evolutionary model, while many RNA viruses have the capacity to exchange genetic material with one another using a process called recombination.
A genealogy involving recombination is best described by a network structure. A more general approach was implemented in a new computational tool, Sliding MinPD, one that is mindful of the sampling times of the input sequences and that reconstructs the viral evolutionary relationships in the form of a network structure with implicit representations of recombination events. The underlying network organization reveals unique patterns of viral evolution and could help explain the emergence of disease-associated mutants and drug-resistant strains, with implications for patient prognosis and treatment strategies. In order to comprehensively test the developed methods and to carry out comparison studies with other methods, synthetic data sets are critical. Therefore, appropriate sequence generators were also developed to simulate the evolution of serially-sampled recombinant viruses, new and more thorough evaluation criteria for recombination detection methods were established, and three major comparison studies were performed. The newly developed tools were also applied to "real" HIV-1 sequence data and it was shown that the results represented within an evolutionary network structure can be interpreted in biologically meaningful ways.
Abstract:
Constant technological advances have caused a data explosion in recent years. Accordingly, modern statistical and machine learning methods must be adapted to deal with complex and heterogeneous data types. This is particularly true for the analysis of biological data. For example, DNA sequence data can be viewed as categorical variables, with each nucleotide taking one of four categories. The gene expression data, depending on the quantitative technology, could be continuous numbers or counts. With the advancement of high-throughput technology, the abundance of such data becomes unprecedentedly rich. Therefore, efficient statistical approaches are crucial in this big data era.
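To make the remark about DNA as categorical data concrete, the sketch below one-hot encodes a sequence so that each position becomes a four-level categorical variable. The A/C/G/T column order, the function name `one_hot_encode` and the example sequence are assumptions for illustration; the dissertation abstract does not prescribe an encoding.

```python
NUCLEOTIDES = "ACGT"  # assumed column order for the indicator vectors

def one_hot_encode(sequence):
    """Encode a DNA string as a list of 4-element indicator vectors.

    Each position is treated as one categorical variable with four levels;
    ambiguous symbols such as N are rejected in this minimal sketch.
    """
    rows = []
    for base in sequence.upper():
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected symbol: {base!r}")
        rows.append([1 if base == level else 0 for level in NUCLEOTIDES])
    return rows

encoded = one_hot_encode("GATTACA")
for base, row in zip("GATTACA", encoded):
    print(base, row)
```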
Previous statistical methods for big data often aim to find low dimensional structures in the observed data. For example, in a factor analysis model a latent Gaussian distributed multivariate vector is assumed. With this assumption a factor model produces a low rank estimation of the covariance of the observed variables. Another example is the latent Dirichlet allocation model for documents, in which the mixture proportions of topics are represented by a Dirichlet distributed variable. This dissertation proposes several novel extensions to the previous statistical methods that are developed to address challenges in big data. Those novel methods are applied in multiple real world applications including construction of condition specific gene co-expression networks, estimating shared topics among newsgroups, analysis of promoter sequences, analysis of political-economics risk data and estimating population structure from genotype data.
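The low-rank covariance mentioned for the factor analysis model can be written out in the standard textbook form. The symbols below (x for the p observed variables, z for the k latent factors, W for the loading matrix, Ψ for the diagonal noise covariance) are assumed notation for illustration and are not taken from the dissertation.

```latex
\begin{aligned}
  x &= \mu + W z + \varepsilon, \qquad
  z \sim \mathcal{N}(0, I_k), \qquad
  \varepsilon \sim \mathcal{N}(0, \Psi), \\
  \operatorname{Cov}(x) &= W W^{\top} + \Psi, \qquad
  \operatorname{rank}\!\left(W W^{\top}\right) \le k \ll p .
\end{aligned}
```

With W of size p × k, the term W Wᵀ carries all covariance shared between variables and has rank at most k, which is the sense in which the estimate is low rank.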