972 results for RBCL SEQUENCE DATA
Abstract:
Background: Plasmodium vivax is the most widely distributed human malaria, responsible for 70–80 million clinical cases each year and large socio-economic burdens for countries such as Brazil, where it is the most prevalent species. Unfortunately, because this parasite cannot be grown in continuous in vitro culture, research on P. vivax remains largely neglected. Methods: A pilot survey of expressed sequence tags (ESTs) from the asexual blood stages of P. vivax was performed. To do so, 1,184 clones from a cDNA library constructed with parasites obtained from 10 different human patients in the Brazilian Amazon were sequenced. Sequences were automatically processed to remove contaminants and low-quality reads. A total of 806 sequences with an average length of 586 bp met these criteria, and their clustering revealed 666 distinct events. The consensus sequence of each cluster and the unique sequences of the singlets were used in similarity searches against different databases that included P. vivax, Plasmodium falciparum, Plasmodium yoelii, Plasmodium knowlesi, Apicomplexa and the GenBank non-redundant database. An E-value of <10^-30 was used to define a significant database match. ESTs were manually assigned a gene ontology (GO) terminology. Results: A total of 769 ESTs could be assigned a putative identity based upon sequence similarity to known proteins in GenBank. Moreover, 292 ESTs were annotated and a GO terminology was assigned to 164 of them. Conclusion: These are the first ESTs reported for P. vivax and, as such, they represent a valuable resource to assist in the annotation of the P. vivax genome currently being sequenced. Moreover, since the GC-content of the P. vivax genome is strikingly different from that of P. falciparum, these ESTs will help in the validation of gene predictions for P. vivax and in the creation of a gene index for this malaria parasite.
Abstract:
I investigated the systematics, phylogeny and biogeographical history of Juncaginaceae, a small family of the early-diverging monocot order Alismatales which comprises about 30 species of annual and perennial herbs. A wide range of methods, from classical taxonomy to molecular systematic and biogeographic approaches, was used.
In Chapter 1, a phylogenetic analysis of the family and members of Alismatales was conducted to clarify the circumscription of Juncaginaceae and intrafamilial relationships. For the first time, all accepted genera and those associated with the family in the past were analysed together. Phylogenetic analysis of three molecular markers (rbcL, matK, and atpA) showed that Juncaginaceae are not monophyletic. As a consequence, the family is re-circumscribed to exclude Maundia, which is proposed to belong to a separate family, Maundiaceae, reducing Juncaginaceae to Tetroncium, Cycnogeton and Triglochin. Tetroncium is weakly supported as sister to the rest of the family. The reinstated Cycnogeton (formerly included in Triglochin) is highly supported as sister to Triglochin s. str. Lilaea is nested within Triglochin s. str. and highly supported as sister to the T. bulbosa complex. The results of the molecular analysis are discussed in combination with morphological characters, a key to the genera of the family is given, and several new combinations are made.
In Chapter 2, phylogenetic relationships in Triglochin were investigated. A species-level phylogeny was constructed based on molecular data obtained from nuclear (ITS, internal transcribed spacer) and chloroplast (psbA-trnH, matK) sequences. Based on the phylogeny of the group, divergence times were estimated and ancestral distribution areas reconstructed. The monophyly of Triglochin is confirmed and relationships between the major lineages of the genus were resolved. A clade comprising the Mediterranean/African T. bulbosa complex and the American T. scilloides (= Lilaea s.) is sister to the rest of the genus, which contains two main clades. In the first, the widespread T. striata is sister to a clade comprising annual Triglochin species from Australia. The second clade comprises T. palustris as sister to the T. maritima complex, which is further divided into a Eurasian and an American subclade. Diversification in Triglochin began in the Miocene or Oligocene, and most disjunctions in Triglochin were dated to the Miocene. Taxonomic diversity in some clades is strongly linked to habitat shifts and cannot be observed in old but ecologically invariable lineages such as the non-monophyletic T. maritima.
Chapter 3 is a collaborative revision of the Triglochin bulbosa complex, a monophyletic group from the Mediterranean region and Africa. One new species, Triglochin buchenaui, and two new subspecies, T. bulbosa subsp. calcicola and subsp. quarcicola, from South Africa were described. Furthermore, two taxa were elevated to species rank and two reinstated. Altogether, seven species and four subspecies are recognised. An identification key, detailed descriptions and accounts of the ecology and distribution of the taxa are provided. An IUCN conservation status is proposed for each taxon.
Chapter 4 deals with the monotypic Tetroncium from southern South America. Tetroncium magellanicum is the only dioecious species in the family. The taxonomic history of the species is described, type material is traced, and a lectotype for the name is designated. Based on an extensive study of herbarium specimens and literature, a detailed description of the species and notes on its ecology and conservation status are provided. A detailed map showing the known distribution area of T. magellanicum is presented.
In Chapter 5, the flower structure of the rare Australian endemic Maundia triglochinoides (Maundiaceae, see Chapter 1) was studied in a collaborative project. As the morphology of Maundia is poorly known and some characters were described differently in the literature, inflorescences, flowers and fruits were studied using serial microtome sections and scanning electron microscopy. The phylogenetic placement, affinities to other taxa, and the evolution of certain characters are discussed. As Maundia exhibits a mosaic of characters of other families of tepaloid core Alismatales, its segregation as a separate family seems plausible.
Abstract:
BACKGROUND: Several approaches can be used to determine the order of loci on chromosomes and hence develop maps of the genome. However, all mapping approaches are prone to errors arising either from technical deficiencies or from a lack of statistical support to distinguish between alternative orders of loci. The accuracy of the genome maps could be improved, in principle, if information from different sources was combined to produce integrated maps. The publicly available bovine genomic sequence assembly with 6x coverage (Btau_2.0) is based on whole genome shotgun sequence data and limited mapping data; however, it is recognised that this assembly is a draft that contains errors. Correcting the sequence assembly requires extensive additional mapping information to improve the reliability of the ordering of sequence scaffolds on chromosomes. The radiation hybrid (RH) map described here has been contributed to the international sequencing project to aid this process. RESULTS: An RH map for the 30 bovine chromosomes is presented. The map was built using the Roslin 3000-rad RH panel (BovGen RH map) and contains 3966 markers including 2473 new loci in addition to 262 amplified fragment-length polymorphisms (AFLP) and 1231 markers previously published with the first generation RH map. Sequences of the mapped loci were aligned with published bovine genome maps to identify inconsistencies. In addition to differences in the order of loci, several cases were observed where the chromosomal assignment of loci differed between maps. All the chromosome maps were aligned with the current 6x bovine assembly (Btau_2.0) and 2898 loci were unambiguously located in the bovine sequence. The order of loci on the RH map for BTA 5, 7, 16, 22, 25 and 29 differed substantially from the assembled bovine sequence. Of the 2898 loci unambiguously identified in the bovine sequence assembly, 131 mapped to different chromosomes in the BovGen RH map.
CONCLUSION: Alignment of the BovGen RH map with other published RH and genetic maps showed higher consistency in marker order and chromosome assignment than with the current 6x sequence assembly. This suggests that the bovine sequence assembly could be significantly improved by incorporating additional independent mapping information.
Abstract:
Lyme disease Borrelia can infect humans and animals for months to years, despite the presence of an active host immune response. The vls antigenic variation system, which expresses the surface-exposed lipoprotein VlsE, plays a major role in B. burgdorferi immune evasion. Gene conversion between vls silent cassettes and the vlsE expression site occurs at high frequency during mammalian infection, resulting in sequence variation in the VlsE product. In this study, we examined vlsE sequence variation in B. burgdorferi B31 during mouse infection by analyzing 1,399 clones isolated from bladder, heart, joint, ear, and skin tissues of mice infected for 4 to 365 days. The median number of codon changes increased progressively in C3H/HeN mice from 4 to 28 days post infection, and no clones retained the parental vlsE sequence at 28 days. In contrast, the decrease in the number of clones with the parental vlsE sequence and the increase in the number of sequence changes occurred more gradually in severe combined immunodeficiency (SCID) mice. Clones containing a stop codon were isolated, indicating that continuous expression of full-length VlsE is not required for survival in vivo; also, these clones continued to undergo vlsE recombination. Analysis of clones with apparent single recombination events indicated that recombinations into vlsE are nonselective with regard to the silent cassette utilized, as well as the length and location of the recombination event. Sequence changes as small as one base pair were common. Fifteen percent of recovered vlsE variants contained "template-independent" sequence changes, which clustered in the variable regions of vlsE. We hypothesize that the increased frequency and complexity of vlsE sequence changes observed in clones recovered from immunocompetent mice (as compared with SCID mice) is due to rapid clearance of relatively invariant clones by variable region-specific anti-VlsE antibody responses.
Abstract:
We report the complete genome sequence of bovine pestivirus strain PG-2. The sequence data from this virus showed that PG-2 is closely related to the giraffe pestivirus strain H138. PG-2 and H138 belong to one pestivirus species that should be considered an approved member of the genus Pestivirus.
Abstract:
BACKGROUND A cost-effective strategy to increase the density of available markers within a population is to sequence a small proportion of the population and impute whole-genome sequence data for the remaining population. Increased densities of typed markers are advantageous for genome-wide association studies (GWAS) and genomic predictions. METHODS We obtained genotypes for 54 602 SNPs (single nucleotide polymorphisms) in 1077 Franches-Montagnes (FM) horses and Illumina paired-end whole-genome sequencing data for 30 FM horses and 14 Warmblood horses. After variant calling, the sequence-derived SNP genotypes (~13 million SNPs) were used for genotype imputation with the software programs Beagle, Impute2 and FImpute. RESULTS The mean imputation accuracy of FM horses using Impute2 was 92.0%. Imputation accuracy using Beagle and FImpute was 74.3% and 77.2%, respectively. In addition, for Impute2 we determined the imputation accuracy of all individual horses in the validation population, which ranged from 85.7% to 99.8%. The subsequent inclusion of Warmblood sequence data further increased the correlation between true and imputed genotypes for most horses, especially for horses with a high level of admixture. The final imputation accuracy of the horses ranged from 91.2% to 99.5%. CONCLUSIONS Using Impute2, the imputation accuracy was higher than 91% for all horses in the validation population, which indicates that direct imputation of 50k SNP-chip data to sequence level genotypes is feasible in the FM population. The individual imputation accuracy depended mainly on the applied software and the level of admixture.
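The abstract reports imputation accuracy as a percentage but does not say how it was computed. A minimal sketch of one common measure, genotype concordance, is given below; the 0/1/2 genotype coding and the function name `concordance` are assumptions made for illustration, not details taken from the study.

```python
def concordance(true_genotypes, imputed_genotypes):
    """Fraction of positions where the imputed genotype equals the true one.

    Genotypes are assumed to be coded as 0, 1 or 2 copies of the
    alternative allele (an illustrative convention, not from the abstract).
    """
    if len(true_genotypes) != len(imputed_genotypes):
        raise ValueError("genotype lists must have the same length")
    if not true_genotypes:
        raise ValueError("no genotypes to compare")
    matches = sum(t == i for t, i in zip(true_genotypes, imputed_genotypes))
    return matches / len(true_genotypes)


# Toy example for one horse: one of four imputed genotypes is wrong.
accuracy = concordance([0, 1, 2, 1], [0, 1, 1, 1])
```

Per-horse values like the 85.7% to 99.8% range quoted above would come from applying such a measure to each individual in the validation population separately.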
Abstract:
Academic and industrial research in the late 1990s brought about an exponential explosion of DNA sequence data. Automated expert systems are being created to help biologists extract patterns, trends and links from this ever-deepening ocean of information. Two such systems, aimed at retrieving and subsequently utilizing phylogenetically relevant information, have been developed in this dissertation, the major objective of which was to automate the often difficult and confusing phylogenetic reconstruction process.
Popular phylogenetic reconstruction methods, such as distance-based methods, attempt to find an optimal tree topology (one that reflects the relationships among related sequences and their evolutionary history) by searching through the topology space. Various compromises between fast (but incomplete) and exhaustive (but computationally prohibitive) search heuristics have been suggested. The second chapter of this dissertation describes an intelligent compromise algorithm that relies on a flexible “beam” search principle from the Artificial Intelligence domain and uses pre-computed local topology reliability information to adjust the beam search space continuously.
However, sometimes even a (virtually) complete distance-based method is inferior to the significantly more elaborate (and computationally expensive) maximum likelihood (ML) method. In fact, depending on the nature of the sequence data in question, either method might prove to be superior. Therefore, it is difficult (even for an expert) to tell a priori which phylogenetic reconstruction method (distance-based, ML or maybe maximum parsimony, MP) should be chosen for any particular data set.
A number of factors, often hidden, influence the performance of a method. For example, it is generally understood that for a phylogenetically “difficult” data set more sophisticated methods (e.g., ML) tend to be more effective and thus should be chosen. However, it is the interplay of many factors that one needs to consider in order to avoid choosing an inferior method (potentially a costly mistake, both in terms of computational expense and in terms of reconstruction accuracy).
Chapter III of this dissertation details a phylogenetic reconstruction expert system that automatically selects the most suitable method. It uses a classifier (a Decision Tree-inducing algorithm) to map a new data set to the proper phylogenetic reconstruction method.
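The “beam” search principle mentioned above can be illustrated with a generic sketch. This is a plain fixed-width beam search, not the dissertation's algorithm, which adjusts the beam continuously using topology reliability information; the callbacks `expand`, `score` and `is_complete` and the toy problem are hypothetical.

```python
import heapq


def beam_search(start, expand, score, is_complete, beam_width=3):
    """Generic beam search: keep only the `beam_width` best states per step.

    `expand(state)` returns successor states, `score(state)` returns a cost
    (lower is better) and `is_complete(state)` marks finished states.
    `expand` is assumed to return at least one successor for every
    incomplete state.
    """
    beam = [start]
    while not all(is_complete(s) for s in beam):
        candidates = []
        for state in beam:
            if is_complete(state):
                candidates.append(state)          # carry finished states over
            else:
                candidates.extend(expand(state))  # grow unfinished ones
        # Prune: this is the compromise between greedy and exhaustive search.
        beam = heapq.nsmallest(beam_width, candidates, key=score)
    return min(beam, key=score)


# Toy problem: build a 3-tuple from the values 1..3 with minimal sum.
best = beam_search(
    start=(),
    expand=lambda s: [s + (v,) for v in (1, 2, 3)],
    score=sum,
    is_complete=lambda s: len(s) == 3,
    beam_width=2,
)
```

With `beam_width=1` this degenerates to a greedy search, and with an unbounded width it becomes exhaustive, which is the trade-off the abstract describes for topology searches.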
Abstract:
The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/) is maintained at the European Bioinformatics Institute (EBI) in an international collaboration with the DNA Data Bank of Japan (DDBJ) and GenBank at the NCBI (USA). Data is exchanged amongst the collaborating databases on a daily basis. The major contributors to the EMBL database are individual authors and genome project groups. Webin is the preferred web-based submission system for individual submitters, whilst automatic procedures allow incorporation of sequence data from large-scale genome sequencing centres and from the European Patent Office (EPO). Database releases are produced quarterly. Network services allow free access to the most up-to-date data collection via ftp, email and World Wide Web interfaces. EBI’s Sequence Retrieval System (SRS), a network browser for databanks in molecular biology, integrates and links the main nucleotide and protein databases plus many specialized databases. For sequence similarity searching a variety of tools (e.g. Blitz, Fasta, BLAST) are available which allow external users to compare their own sequences against the latest data in the EMBL Nucleotide Sequence Database and SWISS-PROT.
Abstract:
There is a need for faster and more sensitive algorithms for sequence similarity searching in view of the rapidly increasing amounts of genomic sequence data available. Parallel processing capabilities in the form of the single instruction, multiple data (SIMD) technology are now available in common microprocessors and enable a single microprocessor to perform many operations in parallel. The ParAlign algorithm has been specifically designed to take advantage of this technology. The new algorithm initially exploits parallelism to perform a very rapid computation of the exact optimal ungapped alignment score for all diagonals in the alignment matrix. Then, a novel heuristic is employed to compute an approximate score of a gapped alignment by combining the scores of several diagonals. This approximate score is used to select the most interesting database sequences for a subsequent Smith–Waterman alignment, which is also parallelised. The resulting method represents a substantial improvement compared to existing heuristics. The sensitivity and specificity of ParAlign were found to be as good as those of Smith–Waterman implementations when the same method for computing the statistical significance of the matches was used. In terms of speed, only the significantly less sensitive NCBI BLAST 2 program was found to outperform the new approach. Online searches are available at http://dna.uio.no/search/
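To make the first step concrete, here is a minimal scalar sketch of computing the best ungapped local score on every diagonal of the alignment matrix. It is only an illustration of the idea: it uses no SIMD, and the simple match/mismatch scoring and the function name `diagonal_scores` are assumptions, not ParAlign's actual scoring scheme or code.

```python
def diagonal_scores(query, target, match=2, mismatch=-1):
    """Return {d: best ungapped local score on diagonal d}.

    Diagonal d contains the matrix cells (i, j) with j - i == d.
    The match/mismatch values are illustrative assumptions.
    """
    scores = {}
    for d in range(-(len(query) - 1), len(target)):
        best = running = 0
        i = max(0, -d)   # first query position on this diagonal
        j = i + d        # matching target position
        while i < len(query) and j < len(target):
            step = match if query[i] == target[j] else mismatch
            # Restart the segment whenever the running score drops below 0.
            running = max(0, running + step)
            best = max(best, running)
            i += 1
            j += 1
        scores[d] = best
    return scores


# Toy example: the query occurs exactly in the target, starting at offset 2.
scores = diagonal_scores("ACGT", "TTACGTAA")
```

Because each diagonal is scored independently of the others, the per-diagonal loops are natural candidates for the parallel execution the abstract describes.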
Abstract:
Background: Protein tertiary structure can be partly characterized via each amino acid's contact number measuring how residues are spatially arranged. The contact number of a residue in a folded protein is a measure of its exposure to the local environment, and is defined as the number of C-beta atoms in other residues within a sphere around the C-beta atom of the residue of interest. Contact number is partly conserved between protein folds and thus is useful for protein fold and structure prediction. In turn, each residue's contact number can be partially predicted from primary amino acid sequence, assisting tertiary fold analysis from sequence data. In this study, we provide a more accurate contact number prediction method from protein primary sequence. Results: We predict contact number from protein sequence using a novel support vector regression algorithm. Using protein local sequences with multiple sequence alignments (PSI-BLAST profiles), we demonstrate a correlation coefficient between predicted and observed contact numbers of 0.70, which outperforms previously achieved accuracies. Including additional information about sequence weight and amino acid composition further improves prediction accuracies significantly with the correlation coefficient reaching 0.73. If residues are classified as being either contacted or non-contacted, the prediction accuracies are all greater than 77%, regardless of the choice of classification thresholds. Conclusion: The successful application of support vector regression to the prediction of protein contact number reported here, together with previous applications of this approach to the prediction of protein accessible surface area and B-factor profile, suggests that a support vector regression approach may be very useful for determining the structure-function relation between primary sequence and higher order consecutive protein structural and functional properties.
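The definition of contact number given above translates directly into a short computation. The sketch below counts, for each residue, the C-beta atoms of other residues inside a sphere centred on its own C-beta atom; the 10.0 Å default radius and the function name `contact_numbers` are assumptions, since the abstract does not state the radius used.

```python
import math


def contact_numbers(cb_coords, radius=10.0):
    """Contact number per residue from C-beta coordinates.

    `cb_coords` is a list of (x, y, z) tuples, one per residue.
    `radius` is the sphere radius in the same units (assumed value).
    """
    counts = []
    for i, centre in enumerate(cb_coords):
        n = 0
        for j, other in enumerate(cb_coords):
            # Count other residues only, never the residue itself.
            if i != j and math.dist(centre, other) <= radius:
                n += 1
        counts.append(n)
    return counts


# Toy example: two residues close together, one far away.
toy = contact_numbers([(0, 0, 0), (3, 0, 0), (20, 0, 0)], radius=5.0)
```

These observed values are the regression targets that a sequence-based predictor, such as the support vector regression described here, would try to reproduce.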
Abstract:
The tropical abalone, Haliotis asinina, is an ideal species to investigate the molecular mechanisms that control development, growth, reproduction and shell formation in all cultured haliotids. Here we describe the analysis of 232 expressed sequence tags (ESTs) obtained from a developmental H. asinina cDNA library intended for future microarray studies. From this data set we identified 183 unique gene clusters. Of these, 90 clusters showed significant homology with sequences lodged in GenBank, ranging in function from general housekeeping to signal transduction, gene regulation and cell-cell communication. Seventy-one clusters possessed completely novel ORFs greater than 50 codons in length, highlighting the paucity of sequence data from molluscs and other lophotrochozoans. This study of developmental gene expression in H. asinina provides the foundation for further detailed analyses of abalone growth, development and reproduction.
Abstract:
The complete genome sequence of the Australian 1-2 heat-tolerant Newcastle disease virus (NDV) vaccine (master seed stocks) was determined and compared to the sequence of the parent virus from which it had been derived after exposure of the parent stock at 56 °C for 30 min. Nucleotide changes were observed at a number of positions, with synonymous mutations outnumbering non-synonymous mutations. Sequence data for the HN gene of a parental culture of V4 and two heat-tolerant variants of V4 were obtained. These were compared with the data for the 1-2 viruses and with published sequences for parental V4 and for a number of ND vaccine strains. Sequence analyses did not reveal the ARG 303 deletion in the HN protein, previously claimed to be responsible for the thermostable phenotype. No consistent changes were detected that would indicate involvement of the HN protein in heat resistance. The majority of alterations were observed in the L protein of the virus and it is proposed that these alterations were responsible for the heat-tolerant phenotype of the 1-2 NDV vaccine. (c) 2005 Elsevier B.V. All rights reserved.
Abstract:
The primary goal of this dissertation is the study of patterns of viral evolution inferred from serially-sampled sequence data, i.e., sequence data obtained from strains isolated at consecutive time points from a single patient or host. RNA viral populations have an extremely high genetic variability, largely due to their astronomical population sizes within host systems, high replication rate, and short generation time. It is this aspect of their evolution that demands special attention and a different approach when studying the evolutionary relationships of serially-sampled sequence data. New methods that analyze serially-sampled data were developed shortly after a groundbreaking HIV-1 study of several patients from which viruses were isolated at recurring intervals over a period of 10 or more years. These methods assume a tree-like evolutionary model, while many RNA viruses have the capacity to exchange genetic material with one another using a process called recombination.
A genealogy involving recombination is best described by a network structure. A more general approach was implemented in a new computational tool, Sliding MinPD, one that is mindful of the sampling times of the input sequences and that reconstructs the viral evolutionary relationships in the form of a network structure with implicit representations of recombination events. The underlying network organization reveals unique patterns of viral evolution and could help explain the emergence of disease-associated mutants and drug-resistant strains, with implications for patient prognosis and treatment strategies. In order to comprehensively test the developed methods and to carry out comparison studies with other methods, synthetic data sets are critical. Therefore, appropriate sequence generators were also developed to simulate the evolution of serially-sampled recombinant viruses, new and more thorough evaluation criteria for recombination detection methods were established, and three major comparison studies were performed. The newly developed tools were also applied to "real" HIV-1 sequence data and it was shown that the results represented within an evolutionary network structure can be interpreted in biologically meaningful ways.
Abstract:
Constant technological advances have caused a data explosion in recent years. Accordingly, modern statistical and machine learning methods must be adapted to deal with complex and heterogeneous data types. This is particularly true for the analysis of biological data. For example, DNA sequence data can be viewed as categorical variables, with each nucleotide taking one of four categories. Gene expression data, depending on the quantification technology, can be continuous numbers or counts. With the advancement of high-throughput technology, such data have become unprecedentedly abundant. Efficient statistical approaches are therefore crucial in this big data era.
Previous statistical methods for big data often aim to find low-dimensional structures in the observed data. For example, a factor analysis model assumes a latent multivariate vector with a Gaussian distribution. With this assumption, a factor model produces a low-rank estimate of the covariance of the observed variables. Another example is the latent Dirichlet allocation model for documents, which assumes that the mixture proportions of topics are represented by a Dirichlet-distributed variable. This dissertation proposes several novel extensions to these statistical methods, developed to address challenges in big data. The novel methods are applied in multiple real-world applications, including construction of condition-specific gene co-expression networks, estimation of shared topics among newsgroups, analysis of promoter sequences, analysis of political-economic risk data and estimation of population structure from genotype data.
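The low-rank covariance produced by a factor model can be written out under the standard textbook formulation. The symbols below (loading matrix \Lambda, latent factor z, noise covariance \Psi, dimensions p and k) are assumed notation for illustration and do not come from the dissertation itself.

```latex
% Standard factor analysis model (assumed notation):
% x is the observed p-vector, z the latent k-vector with k << p.
\begin{align*}
  x &= \Lambda z + \varepsilon, \qquad
  z \sim \mathcal{N}(0, I_k), \qquad
  \varepsilon \sim \mathcal{N}(0, \Psi), \quad \Psi \text{ diagonal}, \\
  \operatorname{Cov}(x) &= \Lambda \operatorname{Cov}(z) \Lambda^{\top} + \Psi
                         = \Lambda \Lambda^{\top} + \Psi .
\end{align*}
```

The term \Lambda \Lambda^{\top} has rank at most k, so the covariance of the p observed variables is described by a rank-k part plus a diagonal part, which is the low-dimensional structure the paragraph refers to.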
Abstract:
G-quadruplexes are secondary structures present in DNA and RNA molecules, which are formed by stacking of G-quartets (i.e., the interaction of four guanines (G-tracts) bonded by Hoogsteen hydrogen bonds). Human PAX9 intron 1 has a putative G-quadruplex-forming region located near exon 1, which is present in all placental mammals sequenced to date. Using circular dichroism (CD) analysis and CD melting, we showed that these sequences are able to form highly stable quadruplex structures. Due to the proximity of the quadruplex structure to the exon-intron boundary, we used a validated double-reporter splicing assay and qPCR to analyze its role in splicing efficiency. The human quadruplex was shown to have a key role in the splicing efficiency of PAX9 intron 1, as a mutation that abolished quadruplex formation dramatically decreased the splicing efficiency of human PAX9 intron 1. The less stable rat quadruplex showed less efficient splicing than the human sequence. Additionally, treatment with 360A, a strong ligand that stabilizes quadruplex structures, further increased the splicing efficiency of human PAX9 intron 1. Altogether, these results provide evidence that G-quadruplex structures are involved in the splicing efficiency of PAX9 intron 1.