10 results for Transcriptone Sequence Data
in Consorci de Serveis Universitaris de Catalunya (CSUC), Spain
Abstract:
A number of experimental methods have been reported for estimating the number of genes in a genome, or the closely related coding density of a genome, defined as the fraction of base pairs in codons. Recently, DNA sequence data representative of the genome as a whole have become available for several organisms, making the problem of estimating coding density amenable to sequence analytic methods. Estimates of coding density for a single genome vary widely, so that methods with characterized error bounds have become increasingly desirable. We present a method to estimate the protein coding density in a corpus of DNA sequence data, in which a ‘coding statistic’ is calculated for a large number of windows of the sequence under study, and the distribution of the statistic is decomposed into two normal distributions, assumed to be the distributions of the coding statistic in the coding and noncoding fractions of the sequence windows. The accuracy of the method is evaluated using known data and application is made to the yeast chromosome III sequence and to C. elegans cosmid sequences. It can also be applied to fragmentary data, for example a collection of short sequences determined in the course of STS mapping.
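The decomposition step described above can be sketched as a two-component normal mixture fit. This is a minimal illustration under stated assumptions, not the authors' procedure: the abstract does not say how the decomposition is fitted, so expectation-maximisation (EM) is used here as one common choice, the window values are synthetic stand-ins for an unspecified coding statistic, and the function names and the convention that coding windows score higher are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_normals(values, iterations=100):
    """Fit a two-component normal mixture by EM.

    Returns (weight_high, (mean_low, sd_low), (mean_high, sd_high)), where
    weight_high is the mixing proportion of the component with the larger mean.
    """
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    # Start from the lower and upper halves of the sorted data.
    means = [sum(xs[:half]) / half, sum(xs[half:]) / (n - half)]
    grand_mean = sum(xs) / n
    spread = max(math.sqrt(sum((x - grand_mean) ** 2 for x in xs) / n), 1e-6)
    sds = [spread, spread]
    weights = [0.5, 0.5]

    for _ in range(iterations):
        # E-step: posterior probability that each window belongs to component 1.
        resp = []
        for x in xs:
            p0 = weights[0] * normal_pdf(x, means[0], sds[0])
            p1 = weights[1] * normal_pdf(x, means[1], sds[1])
            total = p0 + p1
            resp.append(p1 / total if total > 0.0 else 0.5)
        # M-step: re-estimate weight, mean and standard deviation per component.
        for k in (0, 1):
            r = resp if k == 1 else [1.0 - v for v in resp]
            mass = sum(r)
            if mass < 1e-9:
                continue  # component has collapsed; keep its previous parameters
            weights[k] = mass / n
            means[k] = sum(ri * x for ri, x in zip(r, xs)) / mass
            var = sum(ri * (x - means[k]) ** 2 for ri, x in zip(r, xs)) / mass
            sds[k] = max(math.sqrt(var), 1e-6)

    low, high = (0, 1) if means[0] <= means[1] else (1, 0)
    return weights[high], (means[low], sds[low]), (means[high], sds[high])


def simulate_windows(n_windows=2000, coding_fraction=0.3, seed=0):
    """Synthetic window statistics (assumption: coding windows score higher)."""
    rng = random.Random(seed)
    n_coding = round(n_windows * coding_fraction)
    values = [rng.gauss(3.0, 0.5) for _ in range(n_coding)]
    values += [rng.gauss(0.0, 0.5) for _ in range(n_windows - n_coding)]
    return values


if __name__ == "__main__":
    weight, noncoding, coding = fit_two_normals(simulate_windows())
    print(f"estimated fraction of coding windows: {weight:.3f}")
```

Under these assumptions, the mixing weight of the higher-scoring component plays the role of the estimated coding fraction of the windows. Real coding statistics may overlap far more than the synthetic values used here, which is where characterised error bounds matter.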
Abstract:
Background: Single Nucleotide Polymorphisms, among other types of sequence variants, constitute key elements in genetic epidemiology and pharmacogenomics. While sequence data about genetic variation is found in databases such as dbSNP, clues about the functional and phenotypic consequences of the variations are generally found in biomedical literature. The identification of the relevant documents and the extraction of the information from them are hampered by the large size of literature databases and the lack of widely accepted standard notation for biomedical entities. Thus, automatic systems for the identification of citations of allelic variants of genes in biomedical texts are required. Results: Our group has previously reported the development of OSIRIS, a system aimed at the retrieval of literature about allelic variants of genes (http://ibi.imim.es/osirisform.html). Here we describe the development of a new version of OSIRIS (OSIRISv1.2, http://ibi.imim.es/OSIRISv1.2.html) which incorporates a new entity recognition module and is built on top of a local mirror of the MEDLINE collection and HgenetInfoDB: a database that collects data on human gene sequence variations. The new entity recognition module is based on a pattern-based search algorithm for the identification of variation terms in the texts and their mapping to dbSNP identifiers. The performance of OSIRISv1.2 was evaluated on a manually annotated corpus, resulting in 99% precision, 82% recall, and an F-score of 0.89. As an example, the application of the system for collecting literature citations for the allelic variants of genes related to the diseases intracranial aneurysm and breast cancer is presented. Conclusion: OSIRISv1.2 can be used to link literature references to dbSNP database entries with high accuracy, and therefore is suitable for collecting current knowledge on gene sequence variations and supporting the functional annotation of variation databases. 
The application of OSIRISv1.2 in combination with controlled vocabularies like MeSH provides a way to identify associations of biomedical interest, such as those that relate SNPs with diseases.
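To make the idea of pattern-based recognition of variation terms concrete, here is a small sketch using regular expressions. It is an illustration only: the three patterns, the category names and the function name are hypothetical and are not the OSIRIS rule set, and the step that maps a recognised term to a dbSNP identifier is not covered.

```python
import re

# Three-letter amino acid codes, used in protein-level substitution terms.
AMINO_ACIDS = (
    "Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
    "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
)

# Hypothetical patterns for illustration only; not the OSIRIS rule set.
PATTERNS = {
    "rsid": re.compile(r"\brs\d+\b"),                   # e.g. rs123
    "nucleotide": re.compile(r"\b\d+[ACGT]>[ACGT]\b"),  # e.g. 123A>G
    "protein": re.compile(rf"\b(?:{AMINO_ACIDS})\d+(?:{AMINO_ACIDS})\b"),
}


def find_variation_terms(text):
    """Return (kind, matched_text) pairs in order of appearance in text."""
    hits = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), kind, match.group(0)))
    hits.sort()
    return [(kind, term) for _, kind, term in hits]


if __name__ == "__main__":
    # Made-up sentence with placeholder identifiers, for demonstration only.
    sentence = "Carriers of 123A>G (rs0000000, Thr41Ala) were compared with controls."
    for kind, term in find_variation_terms(sentence):
        print(kind, term)
```

A real system needs many more patterns, since the abstract itself notes the lack of a widely accepted standard notation, which is why precision and recall have to be measured against a manually annotated corpus.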
Abstract:
DnaSP, DNA Sequence Polymorphism, is a software package for the analysis of nucleotide polymorphism from aligned DNA sequence data. DnaSP can estimate several measures of DNA sequence variation within and between populations (in noncoding, synonymous or nonsynonymous sites, or in various sorts of codon positions), as well as linkage disequilibrium, recombination, gene flow and gene conversion parameters. DnaSP can also carry out several tests of neutrality: Hudson, Kreitman and Aguadé (1987), Tajima (1989), McDonald and Kreitman (1991), Fu and Li (1993), and Fu (1997) tests. Additionally, DnaSP can estimate the confidence intervals of some test-statistics by the coalescent. The results of the analyses are displayed in tabular and graphic form.
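As an illustration of the kind of statistic listed above, the following sketch computes the number of segregating sites, the mean number of pairwise differences and Tajima's D from a toy alignment, using the standard formulas from Tajima (1989). It is not DnaSP's implementation: the function names are hypothetical, and the code assumes equal-length sequences with no gaps, ambiguity codes or missing data.

```python
import math
from itertools import combinations


def segregating_sites(seqs):
    """Number of alignment columns with more than one distinct base (S)."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def mean_pairwise_differences(seqs):
    """Average number of differing sites over all pairs of sequences (k)."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return total / len(pairs)


def tajimas_d(seqs):
    """Tajima's D for equal-length, gap-free aligned sequences."""
    n = len(seqs)
    s = segregating_sites(seqs)
    if n < 4:
        raise ValueError("the variance term is zero for fewer than 4 sequences")
    if s == 0:
        raise ValueError("no segregating sites; D is undefined")
    k = mean_pairwise_differences(seqs)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)
    # D compares k with Watterson's estimator S / a1, scaled by its std. dev.
    return (k - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


if __name__ == "__main__":
    # Toy alignment invented for this example.
    alignment = ["ACGTACGT", "ACGTACGA", "ACGAACGT", "ACGTACGT"]
    k = mean_pairwise_differences(alignment)
    print("S =", segregating_sites(alignment))
    print("k =", k)
    print("pi per site =", k / len(alignment[0]))
    print("Tajima's D =", tajimas_d(alignment))
```

In the toy alignment both variable sites are singletons, so the mean pairwise difference falls below Watterson's estimate and D comes out negative.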
Abstract:
In recent years, new analytical tools have allowed researchers to extract historical information contained in molecular data, which has fundamentally transformed our understanding of processes ruling biological invasions. However, the use of these new analytical tools has been largely restricted to studies of terrestrial organisms despite the growing recognition that the sea contains ecosystems that are amongst the most heavily affected by biological invasions, and that marine invasion histories are often remarkably complex. Here, we studied the routes of invasion and colonisation histories of an invasive marine invertebrate Microcosmus squamiger (Ascidiacea) using microsatellite loci, mitochondrial DNA sequence data and 11 worldwide populations. Discriminant analysis of principal components, clustering methods and approximate Bayesian computation (ABC) methods showed that the most likely source of the introduced populations was a single admixture event that involved populations from two genetically differentiated ancestral regions - the western and eastern coasts of Australia. The ABC analyses revealed that colonisation of the introduced range of M. squamiger consisted of a series of non-independent introductions along the coastlines of Africa, North America and Europe. Furthermore, we inferred that the sequence of colonisation across continents was in line with historical taxonomic records - first the Mediterranean Sea and South Africa from an unsampled ancestral population, followed by sequential introductions in California and, more recently, the NE Atlantic Ocean. We revealed the most likely invasion history for world populations of M. squamiger, which is broadly characterized by the presence of multiple ancestral sources and non-independent introductions within the introduced range. 
The results presented here illustrate the complexity of marine invasion routes and identify a cause-effect relationship between human-mediated transport and the success of widespread marine non-indigenous species, which benefit from stepping-stone invasions and admixture processes involving different sources for the spread and expansion of their range.
Abstract:
Mesoamerica, defined as the broad linguistic and cultural area from middle southern Mexico to Costa Rica, might have played a pivotal role during the colonization of the American continent. It has been suggested that the Mesoamerican isthmus could have played an important role in severely restricting prehistoric gene flow between North and South America. Although the Native American component has already been described in admixed Mexican populations, few studies have been carried out in native Mexican populations. In this study we present mitochondrial DNA (mtDNA) sequence data for the first hypervariable region (HVR-I) in 477 unrelated individuals belonging to eleven different native populations from Mexico. Almost all the Native Mexican mtDNAs could be classified into the four pan-Amerindian haplogroups (A2, B2, C1 and D1); only three of them could be allocated to the rare Native American lineage D4h3. Their haplogroup phylogenies are clearly star-like, as expected from relatively young populations that have experienced diverse episodes of genetic drift (e.g. extensive isolation, genetic drift and founder effects) and subsequent population expansions. In agreement with this observation is the fact that Native Mexican populations show a high degree of heterogeneity in their patterns of haplogroup frequencies. Haplogroup X2a was absent in our samples, supporting previous observations where this clade was only detected in the northernmost areas of the American continent. The search for identical sequences in the American continent shows that, although Native Mexican populations seem to show a closer relationship to North American populations, they cannot be related to a single geographical region within the continent. 
Finally, we did not find significant population structure on the maternal lineages when considering the four main and distinct linguistic groups represented in our Mexican samples (Oto-Manguean, Uto-Aztecan, Tarascan, and Mayan), suggesting that genetic divergence predates linguistic diversification in Mexico.
Abstract:
Several approaches have been developed to estimate both the relative and absolute rates of speciation and extinction within clades based on molecular phylogenetic reconstructions of evolutionary relationships, according to an underlying model of diversification. However, the macroevolutionary models established for eukaryotes have scarcely been used with prokaryotes. We have investigated the rate and pattern of cladogenesis in the genus Aeromonas (γ-Proteobacteria, Proteobacteria, Bacteria) using the sequences of five housekeeping genes and an uncorrelated relaxed-clock approach. To our knowledge, until now this analysis has never been applied to all the species described in a bacterial genus and thus opens up the possibility of establishing models of speciation from sequence data commonly used in phylogenetic studies of prokaryotes. Our results suggest that the genus Aeromonas began to diverge between 248 and 266 million years ago, exhibiting a constant divergence rate through the Phanerozoic, which could be described as a pure birth process.
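For readers unfamiliar with the pure birth (Yule) model mentioned above, the sketch below shows the simplest rate calculation it allows: with no extinction and two lineages at the crown node, the expected number of lineages after time t is 2·exp(λt), which gives the method-of-moments estimate λ = ln(n/2)/t. This is a textbook simplification, not the relaxed-clock analysis used in the study. The function names are hypothetical, and the species count in the usage example is a placeholder, since the abstract does not state it; only the two crown ages come from the abstract.

```python
import math


def yule_rate_from_crown_age(n_species, crown_age):
    """Method-of-moments speciation rate under pure birth from a crown node."""
    if n_species < 2 or crown_age <= 0:
        raise ValueError("need at least 2 species and a positive crown age")
    return math.log(n_species / 2.0) / crown_age


def expected_lineages(rate, time, initial=2):
    """Expected number of lineages after `time` under a pure birth process."""
    return initial * math.exp(rate * time)


if __name__ == "__main__":
    # Placeholder species count (assumption; not given in the abstract).
    n_species_placeholder = 30
    # Bounds on the onset of divergence reported in the abstract, in Myr.
    for crown_age in (248.0, 266.0):
        rate = yule_rate_from_crown_age(n_species_placeholder, crown_age)
        print(f"crown age {crown_age} Myr: rate {rate:.5f} per lineage per Myr")
```

A constant rate of this kind corresponds to a straight line in a log-scaled lineages-through-time plot, which is one way a pure birth pattern is usually recognised.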
Abstract:
Gene turnover rates and the evolution of gene family sizes are important aspects of genome evolution. Here, we use curated sequence data of the major chemosensory gene families from Drosophila (the gustatory receptor, odorant receptor, ionotropic receptor, and odorant-binding protein families) to conduct a comparative analysis among families, exploring different methods to estimate gene birth and death rates, including an ad hoc simulation study. Remarkably, we found that the state-of-the-art methods may produce very different rate estimates, which may lead to disparate conclusions regarding the evolution of chemosensory gene family sizes in Drosophila. Among biological factors, we found that a peculiarity of D. sechellia's gene turnover rates was a major source of bias in global estimates, whereas gene conversion had negligible effects for the families analyzed herein. Turnover rates vary considerably among families, subfamilies, and ortholog groups, although all analyzed families were quite dynamic in terms of gene turnover. Computer simulations showed that the methods that use ortholog group information appear to be the most accurate for the Drosophila chemosensory families. Most importantly, these results reveal the potential of rate heterogeneity among lineages to severely bias some turnover rate estimation methods and the need to further evaluate the performance of these methods in a more diverse sampling of gene families and phylogenetic contexts. Using branch-specific codon substitution models, we find further evidence of positive selection in recently duplicated genes, which attests to a nonneutral aspect of the gene birth-and-death process.
Abstract:
Despite the successful retrieval of genomes from past remains, the prospects for human palaeogenomics remain unclear because of the difficulty of distinguishing contaminant from endogenous DNA sequences. Previous sequence data generated on high-throughput sequencing platforms indicate that fragmentation of ancient DNA sequences is a characteristic trait primarily arising due to depurination processes that create abasic sites leading to DNA breaks.
Abstract:
Previous genetic studies have demonstrated that natal homing shapes the stock structure of marine turtle nesting populations. However, widespread sharing of common haplotypes based on short segments of the mitochondrial control region often limits resolution of the demographic connectivity of populations. Recent studies employing longer control region sequences to resolve haplotype sharing have focused on regional assessments of genetic structure and phylogeography. Here we synthesize available control region sequences for loggerhead turtles from the Mediterranean Sea, Atlantic, and western Indian Ocean basins. These data represent six of the nine globally significant regional management units (RMUs) for the species and include novel sequence data from Brazil, Cape Verde, South Africa and Oman. Genetic tests of differentiation among 42 rookeries represented by short sequences (380 bp haplotypes from 3,486 samples) and 40 rookeries represented by long sequences (~800 bp haplotypes from 3,434 samples) supported the distinction of the six RMUs analyzed as well as recognition of at least 18 demographically independent management units (MUs) with respect to female natal homing. A total of 59 haplotypes were resolved. These haplotypes belonged to two highly divergent global lineages, with haplogroup I represented primarily by CC-A1, CC-A4, and CC-A11 variants and haplogroup II represented by CC-A2 and derived variants. Geographic distribution patterns of haplogroup II haplotypes and the nested position of CC-A11.6 from Oman among the Atlantic haplotypes invoke recent colonization of the Indian Ocean from the Atlantic for both global lineages. The haplotypes we confirmed for western Indian Ocean RMUs allow reinterpretation of previous mixed stock analysis and further suggest that contemporary migratory connectivity between the Indian and Atlantic Oceans occurs on a broader scale than previously hypothesized. 
This study represents a valuable model for conducting comprehensive international cooperative data management and research in marine ecology.