59 results for SEQUENCING DATA

at Université de Lausanne, Switzerland


Relevance:

100.00%

Abstract:

BACKGROUND: Solexa/Illumina short-read ultra-high throughput DNA sequencing technology produces millions of short tags (up to 36 bases) by parallel sequencing-by-synthesis of DNA colonies. The processing and statistical analysis of such high-throughput data pose new challenges; currently a fair proportion of the tags are routinely discarded because they cannot be matched to a reference sequence, which reduces the effective throughput of the technology. RESULTS: We propose a novel base calling algorithm using model-based clustering and probability theory to identify ambiguous bases and code them with IUPAC symbols. We also select optimal sub-tags using a score based on information content to remove uncertain bases towards the ends of the reads. CONCLUSION: We show that the method improves genome coverage and number of usable tags as compared with Solexa's data processing pipeline by an average of 15%. An R package is provided which allows fast and accurate base calling of Solexa's fluorescence intensity files and the production of informative diagnostic plots.
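The idea of coding ambiguous bases with IUPAC symbols can be illustrated with a minimal Python sketch. It is not the paper's model-based clustering: it assumes per-base probabilities are already available and applies a hypothetical rule (the `min_mass` threshold and the function names are assumptions made for illustration) that keeps the smallest set of most probable bases whose total probability reaches the threshold.

```python
# Standard IUPAC nucleotide codes, keyed by the set of bases each one represents.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def call_base(probs, min_mass=0.9):
    """Return an IUPAC symbol for one sequencing cycle.

    probs maps each of A, C, G, T to its probability. The threshold rule
    is an illustrative assumption, not the published algorithm.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    chosen, mass = [], 0.0
    for base, p in ranked:
        chosen.append(base)
        mass += p
        if mass >= min_mass:
            break
    return IUPAC[frozenset(chosen)]


def call_read(prob_rows, min_mass=0.9):
    """Call every cycle of a read and join the symbols into one tag."""
    return "".join(call_base(row, min_mass) for row in prob_rows)
```

A confident cycle yields a plain base, while a cycle split between A and G yields `R`, so the tag can still be matched against a reference by an aligner that understands ambiguity codes.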

Relevance:

100.00%

Abstract:

Genes underlying mutant phenotypes can be isolated by combining marker discovery, genetic mapping and resequencing, but a more straightforward strategy for mapping mutations would be the direct comparison of mutant and wild-type genomes. Applying such an approach, however, is hampered by the need for reference sequences and by mutational loads that confound the unambiguous identification of causal mutations. Here we introduce NIKS (needle in the k-stack), a reference-free algorithm based on comparing k-mers in whole-genome sequencing data for precise discovery of homozygous mutations. We applied NIKS to eight mutants induced in nonreference rice cultivars and to two mutants of the nonmodel species Arabis alpina. In both species, comparing pooled F2 individuals selected for mutant phenotypes revealed small sets of mutations including the causal changes. Moreover, comparing M3 seedlings of two allelic mutants unambiguously identified the causal gene. Thus, for any species amenable to mutagenesis, NIKS enables forward genetics without requiring segregating populations, genetic maps and reference sequences.
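The core idea of comparing k-mers between two read sets without a reference can be sketched in a few lines of Python. This is a simplified illustration, not the NIKS implementation: the function names, the `min_count` cutoff and the use of canonical (strand-independent) k-mers are assumptions chosen for the example.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Count canonical k-mers over all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts


def sample_specific_kmers(sample_reads, other_reads, k, min_count=2):
    """K-mers seen at least min_count times in one sample and never in the other.

    min_count is a hypothetical cutoff that discards k-mers likely to stem
    from sequencing errors.
    """
    sample = count_kmers(sample_reads, k)
    other = count_kmers(other_reads, k)
    return {kmer for kmer, n in sample.items() if n >= min_count and kmer not in other}
```

A single substituted base changes every k-mer that overlaps it, so a homozygous point mutation shows up as a small cluster of up to k sample-specific k-mers in each of the two genomes being compared.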

Relevance:

70.00%

Abstract:

MOTIVATION: High-throughput sequencing technologies enable the genome-wide analysis of the impact of genetic variation on molecular phenotypes at unprecedented resolution. However, although powerful, these technologies can also introduce unexpected artifacts. RESULTS: We investigated the impact of library amplification bias on the identification of allele-specific (AS) molecular events from high-throughput sequencing data derived from chromatin immunoprecipitation assays (ChIP-seq). Putative AS DNA binding activity for RNA polymerase II was determined using ChIP-seq data derived from lymphoblastoid cell lines of two parent-daughter trios. We found that, at high sequencing depth, many significant AS binding sites suffered from an amplification bias, as evidenced by a larger number of clonal reads representing one of the two alleles. To alleviate this bias, we devised an amplification bias detection strategy, which filters out sites with low read complexity and sites featuring a significant excess of clonal reads. This method will be useful for AS analyses involving ChIP-seq and other functional sequencing assays.
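How clonal reads can create a spurious allele-specific signal is easy to show with a small Python sketch. It is not the authors' detection strategy: it assumes that reads sharing allele, start position and strand are amplification copies, and it uses a plain exact binomial test against a 50:50 allelic ratio. The function names and the tuple layout are hypothetical.

```python
from math import comb


def allele_counts(reads, collapse_clonal=False):
    """Count reference and alternative reads at one heterozygous site.

    reads is a list of (allele, start, strand) tuples with allele in
    {"ref", "alt"}. With collapse_clonal=True, reads sharing allele,
    start and strand are counted once.
    """
    if collapse_clonal:
        reads = set(reads)
    counts = {"ref": 0, "alt": 0}
    for allele, _start, _strand in reads:
        counts[allele] += 1
    return counts["ref"], counts["alt"]


def binomial_two_sided(k, n):
    """Exact two-sided binomial p-value for k successes in n trials, p = 0.5."""
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def allelic_p_value(reads, collapse_clonal=False):
    """P-value for allelic imbalance, optionally after collapsing clonal reads."""
    ref, alt = allele_counts(reads, collapse_clonal)
    return binomial_two_sided(ref, ref + alt)
```

A site that looks significantly imbalanced on raw counts but loses significance once clonal reads are collapsed is a candidate for the kind of amplification artifact the abstract describes.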

Relevance:

70.00%

Abstract:

BACKGROUND: There is an ever-increasing volume of data on host genes that are modulated during HIV infection, influence disease susceptibility or carry genetic variants that impact HIV infection. We created GuavaH (Genomic Utility for Association and Viral Analyses in HIV, http://www.GuavaH.org), a public resource that supports multipurpose analysis of genome-wide genetic variation and gene expression profiles across multiple phenotypes relevant to HIV biology. FINDINGS: We included original data from 8 genome and transcriptome studies addressing viral and host responses in and ex vivo. These studies cover phenotypes such as HIV acquisition, plasma viral load, disease progression, viral replication cycle, latency and viral-host genome interaction. This represents genome-wide association data from more than 4,000 individuals, exome sequencing data from 392 individuals, in vivo transcriptome microarray data from 127 patients/conditions, and 60 sets of RNA-seq data. Additionally, GuavaH allows visualization of protein variation in ~8,000 individuals from the general population. The publicly available GuavaH framework supports queries on (i) unique single nucleotide polymorphisms across different HIV-related phenotypes, (ii) gene structure and variation, (iii) in vivo gene expression in the setting of human infection (CD4+ T cells), and (iv) in vitro gene expression data in models of permissive infection, latency and reactivation. CONCLUSIONS: The complexity of the analysis of host genetic influences on HIV biology and pathogenesis calls for comprehensive search engines over curated data. The tool developed here allows queries and supports validation of the rapidly growing body of host genomic information pertinent to HIV research.

Relevance:

60.00%

Abstract:

Microsatellite instability (MSI) occurs in 10-20% of colorectal tumours and is associated with good prognosis. Here we describe the development and validation of a genomic signature that identifies, with high accuracy, colorectal cancer patients with MSI caused by DNA mismatch repair deficiency. Microsatellite status was determined for 276 stage II and III colorectal tumours. Full-genome expression data were used to identify genes that correlate with MSI status. A subset of these samples (n = 73) had sequencing data for 615 genes available. An MSI gene signature of 64 genes was developed and validated in two independent validation sets: the first consisting of frozen samples from 132 stage II patients; and the second consisting of FFPE samples from the PETACC-3 trial (n = 625). The 64-gene MSI signature identified MSI patients in the first validation set with a sensitivity of 90.3% and an overall accuracy of 84.8%, with an AUC of 0.942 (95% CI, 0.888-0.975). In the second validation, the signature also showed excellent performance, with a sensitivity of 94.3% and an overall accuracy of 90.6%, with an AUC of 0.965 (95% CI, 0.943-0.988). Besides correctly identifying MSI patients, the gene signature identified a group of MSI-like patients that were MSS by standard assessment but MSI by signature assessment. The MSI signature could be linked to a deficient MMR phenotype, as both MSI and MSI-like patients showed a high mutation frequency (8.2% and 6.4% of 615 genes assayed, respectively) as compared to patients classified as MSS (1.6% mutation frequency). The MSI signature showed prognostic power in stage II patients (n = 215) with a hazard ratio of 0.252 (p = 0.0145). Patients with an MSI-like phenotype also had improved survival when compared to MSS patients. The MSI signature was translated to a diagnostic microarray and technically and clinically validated in FFPE and frozen samples.
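For readers less familiar with the reported metrics, the following Python sketch shows how sensitivity, specificity and overall accuracy follow from a comparison of signature calls against reference MSI status. The function name and the boolean encoding (True for MSI, False for MSS) are assumptions for illustration; no study data are reproduced here.

```python
def classification_metrics(truth, predicted):
    """Return (sensitivity, specificity, accuracy) for paired boolean labels.

    truth and predicted are equal-length sequences where True means MSI
    and False means MSS. Both classes must occur in truth.
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in truth")
    sensitivity = tp / (tp + fn)          # MSI cases correctly called MSI
    specificity = tn / (tn + fp)          # MSS cases correctly called MSS
    accuracy = (tp + tn) / len(truth)     # all correct calls
    return sensitivity, specificity, accuracy
```

Under this encoding, the MSI-like group described in the abstract would count as false positives against standard assessment, which lowers specificity and overall accuracy without affecting sensitivity.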

Relevance:

60.00%

Abstract:

Recently, the introduction of second-generation sequencing and further advancements in confocal microscopy have enabled system-level studies for the functional characterization of genes. The complexity intrinsic to these approaches requires the development of bioinformatics methodologies and computational models for extracting meaningful biological knowledge from the enormous amount of experimental data that is continuously generated. This PhD thesis presents several novel bioinformatics methods and computational models to address specific biological questions in plant biology, using the plant Arabidopsis thaliana as a model system. First, a spatio-temporal qualitative analysis of quantitative transcript and protein profiles is applied to show the role of the BREVIS RADIX (BRX) protein in the auxin-cytokinin crosstalk for root meristem growth. The core of this PhD work is the functional characterization of the interplay between the BRX protein and the plant hormone auxin in the root meristem, using a computational model based on experimental evidence. Hypotheses generated by the model led to the discovery of a differential endocytosis pattern in the root meristem that splits the auxin transcriptional response via the plasma membrane to nucleus partitioning of BRX. This positional information system creates an auxin transcriptional pattern that deviates from the canonical auxin response and is necessary to sustain the expression of a subset of BRX-dependent auxin-responsive genes to drive root meristem growth. In the second part of this PhD thesis, we characterized the genome-wide impact of large-scale deletions on four divergent Arabidopsis natural strains, through the integration of Ultra-High Throughput Sequencing data with data from genomic hybridizations on tiling arrays. Analysis of the identified deletions revealed that a considerable portion of protein-coding genes was affected and supported a history of genomic rearrangements shaped by evolution. In the last part of the thesis, we showed that the VIP3 gene in Arabidopsis has an evolutionarily conserved role in the 3' to 5' mRNA degradation machinery, by applying a novel approach for the analysis of mRNA-Seq data from random-primed mRNA. Altogether, this PhD research contains major advancements in the study of natural genomic variation in plants and in the application of computational morphodynamics models for the functional characterization of biological pathways essential for the plant.

Plant development is written in the plants' genetic code. To understand how plants are able to adapt to environmental change, it is essential to study how their genes govern their formation. The more we try to understand how a plant works, the more we appreciate the complexity of its biological mechanisms, to the point that mathematical tools and models become indispensable. In this work, using the model plant Arabidopsis thaliana, we solved specific biological problems by developing and applying concrete computational methods. First, we investigated how the BREVIS RADIX (BRX) gene regulates root development by controlling the response to two hormones, auxin and cytokinin. We applied a statistical analysis to quantitative measurements of transcripts and gene products to demonstrate that BRX plays an antagonistic role in the crosstalk between these two hormones. When this molecular dialogue is disrupted, the length of the primary root is drastically reduced. To understand how BRX responds to auxin, we developed a computational model based on experimental results. Successive simulations led to the discovery of a positional signal that controls the root's response to auxin by regulating the intracellular movement of BRX. In the second part of this thesis, we analysed the whole genomes of four natural Arabidopsis strains and found that a large share of their genes were missing relative to the reference strain. This result indicates that the history of genomic modifications driven by evolution determines a differential availability of functional genes in these plants. In the last part of this work, we analysed transcriptome data from plants in which the VIP3 gene was nonfunctional. This allowed us to discover the dual role of VIP3 in regulating transcription initiation and in transcript degradation. This dual role had previously been demonstrated only in humans. This doctoral work supports the development and application of computational methodologies as invaluable tools for resolving the complexity of biological problems in plant research. The integration of plant biology and computer science has become increasingly important for advancing our knowledge of how plants function and develop.

Relevance:

60.00%

Abstract:

The drivers of species diversification and persistence are of great interest to current biogeography, especially in those 'global biodiversity hotspots' harbouring most of Earth's animal and plant life. Classical multispecies biogeographical work has yielded fascinating insights into broad-scale patterns of diversification, and DNA-based intraspecific phylogeographical studies have started to complement this picture at much finer temporal and spatial scales. The advent of novel next-generation sequencing (NGS) technologies provides the opportunity to greatly scale up the numbers of individuals, populations and species sampled, potentially merging intraspecific and interspecific approaches to biogeographical inference. Here, we outline these prospects and issues by using the example of an undisputed hotspot, the Cape of southern Africa. We outline the current state of knowledge on the biogeography of species diversification within the Cape, review the literature for phylogeographical evidence of its likely drivers and mechanisms, and suggest possible ways forward based on NGS approaches. We demonstrate the potential of these methods and current bioinformatic issues with the help of restriction-site-associated DNA (RAD) sequencing data for three highly divergent species of the Restionaceae, an important plant radiation in the Cape. A thorough understanding of the mechanisms that facilitate species diversification and persistence in spatially structured, species-rich environments will require the adoption of novel genomic and bioinformatic tools in biogeographical studies.

Relevance:

60.00%

Abstract:

BACKGROUND: DNA sequence integrity, mRNA concentrations and protein-DNA interactions have been subject to genome-wide analyses based on microarrays with ever-increasing efficiency and reliability over the past fifteen years. However, very recently novel technologies for Ultra High-Throughput DNA Sequencing (UHTS) have been harnessed to study these phenomena with unprecedented precision. As a consequence, the extensive bioinformatics environment available for array data management, analysis, interpretation and publication must be extended to include these novel sequencing data types. DESCRIPTION: MIMAS was originally conceived as a simple, convenient and local Microarray Information Management and Annotation System focused on GeneChips for expression profiling studies. MIMAS 3.0 enables users to manage data from high-density oligonucleotide SNP Chips, expression arrays (both 3'UTR and tiling) and promoter arrays, BeadArrays as well as UHTS data using MIAME-compliant standardized vocabulary. Importantly, researchers can export data in MAGE-TAB format and upload them to the EBI's ArrayExpress certified data repository using a one-step procedure. CONCLUSION: We have vastly extended the capability of the system such that it processes the data output of six types of GeneChips (Affymetrix), two different BeadArrays for mRNA and miRNA (Illumina) and the Genome Analyzer (a popular Ultra-High Throughput DNA Sequencer, Illumina), without compromising its flexibility and user-friendliness. MIMAS, appropriately renamed Multiomics Information Management and Annotation System, is currently used by scientists working in approximately 50 academic laboratories and genomics platforms in Switzerland and France. MIMAS 3.0 is freely available via http://multiomics.sourceforge.net/.

Relevance:

60.00%

Abstract:

There have been reports of increasing numbers of cases of malaria among migrants and travelers. Although microscopic examination of blood smears remains the "gold standard" in diagnosis, this method suffers from insufficient sensitivity and requires considerable expertise. To improve diagnosis, a multiplex real-time PCR was developed. One set of generic primers targeting a highly conserved region of the 18S rRNA gene of the genus Plasmodium was designed; the amplified region was polymorphic enough internally to design four species-specific probes for P. falciparum, P. vivax, P. malariae, and P. ovale. Real-time PCR with species-specific probes detected one plasmid copy of P. falciparum, P. vivax, P. malariae, and P. ovale specifically. The same sensitivity was achieved for all species with real-time PCR with the 18S screening probe. Ninety-seven blood samples were investigated. For 66 of them (60 patients), microscopy and real-time PCR results were compared and had a crude agreement of 86% for the detection of plasmodia. Discordant results were reevaluated with clinical, molecular, and sequencing data to resolve them, including a second microscopic and molecular expert assessment (laboratories in Geneva and at the Swiss Tropical Institute in Basel). All nine discordances between 18S screening PCR and microscopy were resolved in favor of the molecular method, as were eight of nine discordances at the species level for the species-specific PCR among the 31 samples positive by both methods. The other 31 blood samples were tested to monitor the antimalaria treatment in seven patients. The number of parasites measured by real-time PCR fell rapidly for six out of seven patients in parallel to parasitemia determined microscopically. This suggests a role of quantitative PCR for the monitoring of patients receiving antimalaria therapy.

Relevance:

60.00%

Abstract:

We sequenced 1077 bp of the mitochondrial cytochrome b gene and 511 bp of the nuclear Apolipoprotein B gene in bicoloured shrew (Crocidura leucodon, Soricidae) populations ranging from France to Georgia. The aims of the study were to identify the main genetic clades within this species and the influence of Pleistocene climatic variations on the respective clades. The mitochondrial analyses revealed a European clade distributed from France eastwards to north-western Turkey and a Near East clade distributed from Georgia to Romania; the two clades separated during the Middle Pleistocene. We clearly identified a population expansion after a bottleneck for the European clade based on mitochondrial and nuclear sequencing data; this expansion was not observed for the eastern clade. We hypothesize that the western population was confined to a small Italo-Balkanic refugium, whereas the eastern population subsisted in several refugia along the southern coast of the Black Sea.

Relevance:

60.00%

Abstract:

In otherwise successful gene therapy trials, insertional mutagenesis has resulted in leukemia. The identification of new short synthetic genetic insulator elements (GIE) which would both prevent such activation effects and shield the transgene from silencing is a main challenge. Previous attempts, e.g. with β-globin HS4, have met with poor efficacy and genetic instability. We have investigated potential improvement with two new candidate synthetic GIEs in SIN-gamma and lentiviral vectors. With each construct, two internal promoters have been tested: either the strong Fr-MuLV-U3 or the housekeeping hPGK. We could identify a specific combination of insulator 2 repeats which translates into best functional activity, high titers and boundary effect in both gammaretro and lentivectors. In target cells a dramatic shift of expression is observed, with a homogeneous profile the level of which strictly depends on the promoter strength. These data remain stable both in HeLa cells over three months and in cord blood HSCs for two months, irrespective of the multiplicity of infection (MOI). In comparison, expression levels of control native and SIN vectors are heterogeneous, depend on the MOI and prove unstable. We have undertaken genotoxicity assessment in comparing integration patterns ingenuity in human target cells sampled over three months using high-throughput pyrosequencing. Data will be presented. Further genotoxicity assessment will include in vivo studies. We have established insulated vectors which harbour both boundary and enhancer-blocking effects and remain stable in prolonged in vitro culture conditions. Work performed with support of EC-DG research FP6-NoE, CLINIGENE: LSHB-CT-2006-018933.

Relevance:

60.00%

Abstract:

The transcriptome is the readout of the genome. Identifying common features in it across distant species can reveal fundamental principles. To this end, the ENCODE and modENCODE consortia have generated large amounts of matched RNA-sequencing data for human, worm and fly. Uniform processing and comprehensive annotation of these data allow comparison across metazoan phyla, extending beyond earlier within-phylum transcriptome comparisons and revealing ancient, conserved features. Specifically, we discover co-expression modules shared across animals, many of which are enriched in developmental genes. Moreover, we use expression patterns to align the stages in worm and fly development and find a novel pairing between worm embryo and fly pupae, in addition to the embryo-to-embryo and larvae-to-larvae pairings. Furthermore, we find that the extent of non-canonical, non-coding transcription is similar in each organism, per base pair. Finally, we find in all three organisms that the gene-expression levels, both coding and non-coding, can be quantitatively predicted from chromatin features at the promoter using a 'universal model' based on a single set of organism-independent parameters.

Relevance:

60.00%

Abstract:

The increase of publicly available sequencing data has allowed for rapid progress in our understanding of genome composition. As new information becomes available we should constantly be updating and reanalyzing existing and newly acquired data. In this report we focus on transposable elements (TEs) which make up a significant portion of nearly all sequenced genomes. Our ability to accurately identify and classify these sequences is critical to understanding their impact on host genomes. At the same time, as we demonstrate in this report, problems with existing classification schemes have led to significant misunderstandings of the evolution of both TE sequences and their host genomes. In a pioneering publication Finnegan (1989) proposed classifying all TE sequences into two classes based on transposition mechanisms and structural features: the retrotransposons (class I) and the DNA transposons (class II). We have retraced how ideas regarding TE classification and annotation in both prokaryotic and eukaryotic scientific communities have changed over time. This has led us to observe that: (1) a number of TEs have convergent structural features and/or transposition mechanisms that have led to misleading conclusions regarding their classification, (2) the evolution of TEs is similar to that of viruses by having several unrelated origins, (3) there might be at least 8 classes and 12 orders of TEs including 10 novel orders. In an effort to address these classification issues we propose: (1) the outline of a universal TE classification, (2) a set of methods and classification rules that could be used by all scientific communities involved in the study of TEs, and (3) a 5-year schedule for the establishment of an International Committee for Taxonomy of Transposable Elements (ICTTE).

Relevance:

30.00%

Abstract:

Restriction site-associated DNA sequencing (RADseq) provides researchers with the ability to record genetic polymorphism across thousands of loci for nonmodel organisms, potentially revolutionizing the field of molecular ecology. However, as with other genotyping methods, RADseq is prone to a number of sources of error that may have consequential effects for population genetic inferences, and these have received only limited attention in terms of the estimation and reporting of genotyping error rates. Here we use individual sample replicates, under the expectation of identical genotypes, to quantify genotyping error in the absence of a reference genome. We then use sample replicates to (i) optimize de novo assembly parameters within the program Stacks, by minimizing error and maximizing the retrieval of informative loci; and (ii) quantify error rates for loci, alleles and single-nucleotide polymorphisms. As an empirical example, we use a double-digest RAD data set of a nonmodel plant species, Berberis alpina, collected from high-altitude mountains in Mexico.
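The replicate-based reasoning can be sketched in Python as follows. This is a simplified illustration with hypothetical definitions, not the error metrics of the paper or of Stacks: each replicate is assumed to be a dictionary mapping a locus name to a tuple of alleles, and two rates are reported, the share of loci recovered in only one replicate and the share of shared loci whose genotypes disagree.

```python
def replicate_error(rep_a, rep_b):
    """Compare two replicate genotype sets of the same individual.

    rep_a and rep_b map locus name -> tuple of alleles (or None if the
    locus was not genotyped). Returns (missing_rate, mismatch_rate):
      missing_rate  = loci called in only one replicate / all loci seen
      mismatch_rate = shared loci with different genotypes / shared loci
    """
    loci = set(rep_a) | set(rep_b)
    shared = [
        locus for locus in loci
        if rep_a.get(locus) is not None and rep_b.get(locus) is not None
    ]
    missing_rate = (len(loci) - len(shared)) / len(loci) if loci else 0.0
    # Allele order carries no meaning, so genotypes are compared after sorting.
    mismatched = sum(
        1 for locus in shared if sorted(rep_a[locus]) != sorted(rep_b[locus])
    )
    mismatch_rate = mismatched / len(shared) if shared else 0.0
    return missing_rate, mismatch_rate
```

Running such a comparison for each candidate set of assembly parameters gives a simple criterion for choosing among them: prefer settings that keep both rates low while retaining many shared loci.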

Relevance:

30.00%

Abstract:

Among the largest resources for biological sequence data is the large amount of expressed sequence tags (ESTs) available in public and proprietary databases. ESTs provide information on transcripts but for technical reasons they often contain sequencing errors. Therefore, when analyzing EST sequences computationally, such errors must be taken into account. Earlier attempts to model error prone coding regions have shown good performance in detecting and predicting these while correcting sequencing errors using codon usage frequencies. In the research presented here, we improve the detection of translation start and stop sites by integrating a more complex mRNA model with codon usage bias based error correction into one hidden Markov model (HMM), thus generalizing this error correction approach to more complex HMMs. We show that our method maintains the performance in detecting coding sequences.