978 results for COMPARATIVE GENOME MAPS


Relevance: 30.00%

Abstract:

Determination of copy number variants (CNVs) inferred in genome wide single nucleotide polymorphism arrays has shown increasing utility in genetic variant disease associations. Several CNV detection methods are available, but differences in CNV call thresholds and characteristics exist. We evaluated the relative performance of seven methods: circular binary segmentation, CNVFinder, cnvPartition, gain and loss of DNA, Nexus algorithms, PennCNV and QuantiSNP. Tested data included real and simulated Illumina HumHap 550 data from the Singapore cohort study of the risk factors for Myopia (SCORM) and simulated data from Affymetrix 6.0 and platform-independent distributions. The normalized singleton ratio (NSR) is proposed as a metric for parameter optimization before enacting full analysis. We used 10 SCORM samples for optimizing parameter settings for each method and then evaluated method performance at optimal parameters using 100 SCORM samples. The statistical power, false positive rates, and receiver operating characteristic (ROC) curve residuals were evaluated by simulation studies. Optimal parameters, as determined by NSR and ROC curve residuals, were consistent across datasets. QuantiSNP outperformed other methods based on ROC curve residuals over most datasets. Nexus Rank and SNPRank have low specificity and high power. Nexus Rank calls oversized CNVs. PennCNV detects one of the fewest numbers of CNVs.
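As a rough illustration of the evaluation quantities named in this abstract, the sketch below computes statistical power (sensitivity) and false positive rate for a set of CNV calls against simulated truth. The per-probe boolean layout, the function name `power_and_fpr`, and the toy data are assumptions made for illustration and are not taken from the study. The NSR metric is not defined in the abstract, so it is not reproduced here.

```python
from typing import Sequence, Tuple


def power_and_fpr(truth: Sequence[bool], called: Sequence[bool]) -> Tuple[float, float]:
    """Return (power, false positive rate) for per-probe CNV calls.

    truth[i]  -- True if probe i lies inside a simulated CNV (assumed layout)
    called[i] -- True if the method called a CNV at probe i
    """
    if len(truth) != len(called):
        raise ValueError("truth and called must have the same length")
    tp = sum(t and c for t, c in zip(truth, called))
    fn = sum(t and not c for t, c in zip(truth, called))
    fp = sum((not t) and c for t, c in zip(truth, called))
    tn = sum((not t) and (not c) for t, c in zip(truth, called))
    # Guard against empty classes so the toy function never divides by zero.
    power = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return power, fpr


# Toy example: 10 probes, with a simulated CNV covering probes 2-5.
truth = [False, False, True, True, True, True, False, False, False, False]
called = [False, True, True, True, True, False, False, False, False, False]
power, fpr = power_and_fpr(truth, called)
print(f"power={power:.2f}, FPR={fpr:.3f}")
```

Sweeping a method's call threshold and recording one (FPR, power) pair per setting would trace out a ROC curve of the kind the abstract refers to.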

Relevance: 30.00%

Abstract:

Ferns are one of the few remaining major clades of land plants for which a complete genome sequence is lacking. Knowledge of genome space in ferns will enable broad-scale comparative analyses of land plant genes and genomes, provide insights into genome evolution across green plants, and shed light on genetic and genomic features that characterize ferns, such as their high chromosome numbers and large genome sizes. As part of an initial exploration into fern genome space, we used a whole genome shotgun sequencing approach to obtain low-density coverage (∼0.4X to 2X) for six fern species from the Polypodiales (Ceratopteris, Pteridium, Polypodium, Cystopteris), Cyatheales (Plagiogyria), and Gleicheniales (Dipteris). We explore these data to characterize the proportion of the nuclear genome represented by repetitive sequences (including DNA transposons, retrotransposons, ribosomal DNA, and simple repeats) and protein-coding genes, and to extract chloroplast and mitochondrial genome sequences. Such initial sweeps of fern genomes can provide information useful for selecting a promising candidate fern species for whole genome sequencing. We also describe variation of genomic traits across our sample and highlight some differences and similarities in repeat structure between ferns and seed plants.

Relevance: 30.00%

Abstract:

BACKGROUND: Parrots belong to a group of behaviorally advanced vertebrates and have an advanced ability of vocal learning relative to other vocal-learning birds. They can imitate human speech, synchronize their body movements to a rhythmic beat, and understand complex concepts of referential meaning to sounds. However, little is known about the genetics of these traits. Elucidating the genetic bases would require whole genome sequencing and a robust assembly of a parrot genome. FINDINGS: We present a genomic resource for the budgerigar, an Australian Parakeet (Melopsittacus undulatus) -- the most widely studied parrot species in neuroscience and behavior. We present genomic sequence data that includes over 300× raw read coverage from multiple sequencing technologies and chromosome optical maps from a single male animal. The reads and optical maps were used to create three hybrid assemblies representing some of the largest genomic scaffolds to date for a bird; two of which were annotated based on similarities to reference sets of non-redundant human, zebra finch and chicken proteins, and budgerigar transcriptome sequence assemblies. The sequence reads for this project were in part generated and used for both the Assemblathon 2 competition and the first de novo assembly of a giga-scale vertebrate genome utilizing PacBio single-molecule sequencing. CONCLUSIONS: Across several quality metrics, these budgerigar assemblies are comparable to or better than the chicken and zebra finch genome assemblies built from traditional Sanger sequencing reads, and are sufficient to analyze regions that are difficult to sequence and assemble, including those not yet assembled in prior bird genomes, and promoter regions of genes differentially regulated in vocal learning brain regions. This work provides valuable data and material for genome technology development and for investigating the genomics of complex behavioral traits.

Relevance: 30.00%

Abstract:

Transcriptional regulation has been studied intensively in recent decades. One important aspect of this regulation is the interaction between regulatory proteins, such as transcription factors (TF) and nucleosomes, and the genome. Different high-throughput techniques have been invented to map these interactions genome-wide, including ChIP-based methods (ChIP-chip, ChIP-seq, etc.), nuclease digestion methods (DNase-seq, MNase-seq, etc.), and others. However, a single experimental technique often only provides partial and noisy information about the whole picture of protein-DNA interactions. Therefore, the overarching goal of this dissertation is to provide computational developments for jointly modeling different experimental datasets to achieve a holistic inference on the protein-DNA interaction landscape.

We first present a computational framework that can incorporate the protein binding information in MNase-seq data into a thermodynamic model of protein-DNA interaction. We use a correlation-based objective function to model the MNase-seq data and a Markov chain Monte Carlo method to maximize the function. Our results show that the inferred protein-DNA interaction landscape is concordant with the MNase-seq data and provides a mechanistic explanation for the experimentally collected MNase-seq fragments. Our framework is flexible and can easily incorporate other data sources. To demonstrate this flexibility, we use prior distributions to integrate experimentally measured protein concentrations.

We also study the ability of DNase-seq data to position nucleosomes. Traditionally, DNase-seq has only been widely used to identify DNase hypersensitive sites, which tend to be open chromatin regulatory regions devoid of nucleosomes. We reveal for the first time that DNase-seq datasets also contain substantial information about nucleosome translational positioning, and that existing DNase-seq data can be used to infer nucleosome positions with high accuracy. We develop a Bayes-factor-based nucleosome scoring method to position nucleosomes using DNase-seq data. Our approach utilizes several effective strategies to extract nucleosome positioning signals from the noisy DNase-seq data, including jointly modeling data points across the nucleosome body and explicitly modeling the quadratic and oscillatory DNase I digestion pattern on nucleosomes. We show that our DNase-seq-based nucleosome map is highly consistent with previous high-resolution maps. We also show that the oscillatory DNase I digestion pattern is useful in revealing the nucleosome rotational context around TF binding sites.
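To make the idea of a Bayes-factor-style nucleosome score concrete, here is a minimal sketch that compares two fixed Poisson models of DNase I cut counts over a window: a "nucleosome" model whose rate follows a quadratic profile (low in the centre, higher toward the edges) and a flat background model. With fixed rates the Bayes factor reduces to a likelihood ratio. The Poisson assumption, the specific rate values, and the function name `log_bayes_factor` are all hypothetical. The oscillatory component mentioned in the abstract is omitted, and this is not the dissertation's actual model.

```python
import math
from typing import Sequence


def log_bayes_factor(counts: Sequence[int],
                     nuc_rates: Sequence[float],
                     bg_rates: Sequence[float]) -> float:
    """log P(counts | nucleosome) - log P(counts | background).

    Each position is modelled as an independent Poisson count (an assumption).
    """
    if not (len(counts) == len(nuc_rates) == len(bg_rates)):
        raise ValueError("all inputs must have the same length")
    total = 0.0
    for k, lam1, lam0 in zip(counts, nuc_rates, bg_rates):
        # The log(k!) terms are identical under both models and cancel.
        total += k * (math.log(lam1) - math.log(lam0)) - (lam1 - lam0)
    return total


# Hypothetical 5-position window with offsets -2..2 from the candidate centre.
offsets = [-2, -1, 0, 1, 2]
nuc_rates = [0.5 + 0.5 * x * x for x in offsets]  # quadratic: protected centre
bg_rates = [1.5] * len(offsets)                   # flat, same mean as nuc_rates

protected = [3, 1, 0, 1, 3]  # few cuts in the centre, more at the edges
exposed = [0, 1, 3, 1, 0]    # the opposite pattern

lbf_protected = log_bayes_factor(protected, nuc_rates, bg_rates)
lbf_exposed = log_bayes_factor(exposed, nuc_rates, bg_rates)
print(lbf_protected, lbf_exposed)
```

A positive score favours the nucleosome model at that centre and a negative score favours the background. Summing over all positions in the window is one simple way to model data points across the nucleosome body jointly.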

Finally, we present a state-space model (SSM) for jointly modeling different kinds of genomic data to provide an accurate view of the protein-DNA interaction landscape. We also provide an efficient expectation-maximization algorithm to learn model parameters from data. We first show in simulation studies that the SSM can effectively recover underlying true protein binding configurations. We then apply the SSM to model real genomic data (both DNase-seq and MNase-seq data). Through incrementally increasing the types of genomic data in the SSM, we show that different data types can contribute complementary information for the inference of protein binding landscape and that the most accurate inference comes from modeling all available datasets.

This dissertation provides a foundation for future research by taking a step toward the genome-wide inference of protein-DNA interaction landscape through data integration.

Relevance: 30.00%

Abstract:

Bacterioplankton of the SAR11 clade are the most abundant microorganisms in marine systems, usually representing 25% or more of the total bacterial cells in seawater worldwide. SAR11 is divided into subclades with distinct spatiotemporal distributions (ecotypes), some of which appear to be specific to deep water. Here we examine the genomic basis for deep ocean distribution of one SAR11 bathytype (depth-specific ecotype), subclade Ic. Four single-cell Ic genomes, with estimated completeness of 55%-86%, were isolated from 770 m at station ALOHA and compared with eight SAR11 surface genomes and metagenomic datasets. Subclade Ic genomes dominated metagenomic fragment recruitment below the euphotic zone. They had similar COG distributions, high local synteny and shared a large number (69%) of orthologous clusters with SAR11 surface genomes, yet were distinct at the 16S rRNA gene and amino-acid level, and formed a separate, monophyletic group in phylogenetic trees. Subclade Ic genomes were enriched in genes associated with membrane/cell wall/envelope biosynthesis and showed evidence of unique phage defenses. The majority of subclade Ic-specific genes were hypothetical, and some were highly abundant in deep ocean metagenomic data, potentially masking mechanisms for niche differentiation. However, the evidence suggests these organisms have a similar metabolism to their surface counterparts, and that subclade Ic adaptations to the deep ocean do not involve large variations in gene content, but rather more subtle differences previously observed in deep ocean genomic data, like preferential amino-acid substitutions, larger coding regions among SAR11 clade orthologs, larger intergenic regions and larger estimated average genome size.

Relevance: 30.00%

Abstract:

Reliable population DNA molecular markers are difficult to develop for molluscs, the reasons for which are largely unknown. Identical protocols for microsatellite marker development were implemented in three gastropods. Success rates were lower for Gibbula cineraria compared to Littorina littorea and L. saxatilis. Comparative genomic analysis of 47.2 kb of microsatellite-containing sequences (MCS) revealed a high incidence of cryptic repetitive DNA in their flanking regions. The majority of these were novel, and could be grouped into DNA families based upon sequence similarities. Significant inter-specific variation in abundance of cryptic repetitive DNA and DNA families was observed. Repbase scans show that a large proportion of cryptic repetitive DNA was identified as transposable elements (TEs). We argue that a large number of TEs and their transpositional activity may be linked to differential rates of DNA multiplication and recombination. This is likely to be an important factor explaining inter-specific variation in genome stability and hence microsatellite marker development success rates. Gastropods also differed significantly in the types of TE classes (autonomous vs non-autonomous) observed. We propose that dissimilar transpositional mechanisms differentiate the TE classes in terms of their propensity for transposition, fixation and/or silencing. Consequently, the phylogenetic conservation of non-autonomous TEs, such as CvA, suggests that dispersal of these elements may have behaved as microsatellite-inducing elements. Results seem to indicate that, compared to autonomous TEs, non-autonomous TEs may have a more active role in genome rearrangement processes. The implications of the findings for genomic rearrangement, stability and marker development are discussed.

Relevance: 30.00%

Abstract:

The nematodes Trichinella spiralis and Trichinella pseudospiralis are both intracellular parasites of skeletal muscle cells and induce profound alterations in the host cell resulting in a re-alignment of muscle-specific gene expression. While T. spiralis induces the production of a collagen capsule surrounding the host-parasite complex, T. pseudospiralis exists in a non-encapsulated form and is also characterised by suppression of the host inflammatory response in the muscle. These observed differences between the two species are thought to be due to variation in the proteins excreted or secreted (ES proteins) by the muscle larva. In this study, we use a global proteomics approach to compare the ES protein profiles from both species and to identify individual T. pseudospiralis proteins that complement earlier studies with T. spiralis. Following two-dimensional gel electrophoresis, tandem mass spectrometry was used to identify the peptide spots. In many cases identification was aided by the determination of partial peptide sequence from selected mass ions. The T. pseudospiralis spots identified included the major secreted glycoproteins and the secreted 5'-nucleotidase. Furthermore, two major groups of T. spiralis-specific proteins and several T. pseudospiralis-specific proteins were identified. Our results demonstrate the value of proteomics as a tool for the identification of ES proteins that are differentially expressed between Trichinella species and as an aid to identifying key parasite proteins that are involved in the host-parasite interaction. The value of this approach will be further enhanced by data arising out of the current T. spiralis genome sequencing project.

Relevance: 30.00%

Abstract:

The plant actin cytoskeleton is a highly dynamic, fibrous structure essential in many cellular processes including cell division and cytoplasmic streaming. This structure is stimulus-responsive, being affected by internal stimuli and by biotic and abiotic stresses, with responses mediated in signal transduction pathways by actin-binding proteins. The completion of the Arabidopsis genome sequence has allowed a comparative identification of many actin-binding proteins. However, not all are conserved in plants, which possibly reflects the differences in the processes involved in morphogenesis between plant and other cells. Here we have searched for the Arabidopsis equivalents of 67 animal/fungal actin-binding proteins and show that 36 are not conserved in plants. One protein that is conserved across phylogeny is actin-depolymerizing factor or cofilin, and we describe our work on the activity of vegetative tissue and pollen-specific isoforms of this protein in plant cells, concluding that they are functionally distinct.

Relevance: 30.00%

Abstract:

Background: Members of the genus Cronobacter are causes of rare but severe illness in neonates and preterm infants following the ingestion of contaminated infant formula. Seven species have been described and two of the species genomes were subsequently published. In this study, we performed comparative genomics on eight strains of Cronobacter, including six that we sequenced (representing six of the seven species) and two previously published, closed genomes.

Results: We identified and characterized the features associated with the core and pan genome of the genus Cronobacter in an attempt to understand the evolution of these bacteria and the genetic content of each species. We identified 84 genomic regions that are present in two or more Cronobacter genomes, along with 45 unique genomic regions. Many potentially horizontally transferred genes, such as lysogenic prophages, were also identified. Most notable among these were several type six secretion system gene clusters, transposons that carried tellurium, copper and/or silver resistance genes, and a novel integrative conjugative element.

Conclusions: Cronobacter have diverged into two clusters, one consisting of C. dublinensis and C. muytjensii (Cdub-Cmuy) and the other comprising C. sakazakii, C. malonaticus, C. universalis, and C. turicensis (Csak-Cmal-Cuni-Ctur), from the most recent common ancestral species. While several genetic determinants for plant-association and human virulence could be found in the core genome of Cronobacter, the four Cdub-Cmuy clade genomes contained several accessory genomic regions important for survival in a plant-associated environmental niche, while the Csak-Cmal-Cuni-Ctur clade genomes harbored numerous virulence-related genetic traits.
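The core and pan genome concepts used in this abstract can be sketched with plain set operations: the core genome is the intersection of the gene-family sets of all strains, and the pan genome is their union. The strain names, gene-family labels, and the function name `core_and_pan` below are hypothetical placeholders. Real input would come from an ortholog-clustering step that the abstract does not describe.

```python
from typing import Dict, Set, Tuple


def core_and_pan(genomes: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (core, pan) gene-family sets from a strain -> gene-family mapping."""
    if not genomes:
        return set(), set()
    gene_sets = list(genomes.values())
    core = set.intersection(*gene_sets)  # families present in every genome
    pan = set.union(*gene_sets)          # families present in at least one genome
    return core, pan


# Hypothetical gene-family labels for three toy strains.
toy = {
    "strain_A": {"g1", "g2", "g3"},
    "strain_B": {"g1", "g2", "g4"},
    "strain_C": {"g1", "g2", "g3", "g5"},
}
core, pan = core_and_pan(toy)
accessory = pan - core  # families missing from at least one genome
print(sorted(core), sorted(accessory))
```

Families found in exactly one strain (here `g4` and `g5`) would correspond loosely to the "unique" regions the abstract counts, while those shared by two or more but not all strains are the shared accessory fraction.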

Relevance: 30.00%

Abstract:

Clade V nematodes comprise several parasitic species that include the cyathostomins, primary helminth pathogens of horses. Next generation transcriptome datasets are available for eight parasitic clade V nematodes, although no equine parasites are included in this group. Here, we report next generation transcriptome sequencing analysis for the common cyathostomin species, Cylicostephanus goldi. A cDNA library was generated from RNA extracted from 17 C. goldi male and female adult parasites. Following sequencing using a 454 GS FLX pyrosequencer, a total of 475,215 sequencing reads were generated, which were assembled into 26,910 contigs. Using Gene Ontology and Kyoto Encyclopedia of Genes and Genomes databases, 27% of the transcriptome was annotated. Further in-depth analysis was carried out by comparing the C. goldi dataset with the next generation transcriptomes and genomes of other clade V nematodes, with the Oesophagostomum dentatum transcriptome and the Haemonchus contortus genome showing the highest levels of sequence identity with the cyathostomin dataset (45%). The C. goldi transcriptome was mined for genes associated with anthelmintic mode of action and/or resistance. Sequences encoding proteins previously associated with the three major anthelmintic classes used in horses were identified, with the exception of the P-glycoprotein group. Targeted resequencing of the glutamate gated chloride channel α4 subunit (glc-3), one of the primary targets of the macrocyclic lactone anthelmintics, was performed for several cyathostomin species. We believe this study reports the first transcriptome dataset for an equine helminth parasite, providing the opportunity for in-depth analysis of these important parasites at the molecular level. Sequences encoding enzymes involved in key processes and genes associated with levamisole/pyrantel and macrocyclic lactone resistance, in particular the glutamate gated chloride channels, were identified. 
These novel data will inform cyathostomin biology and anthelmintic resistance studies in the future.

Relevance: 30.00%

Abstract:

BACKGROUND: Klebsiella pneumoniae strains are pathogenic to animals and humans, in which they are both a frequent cause of nosocomial infections and a re-emerging cause of severe community-acquired infections. K. pneumoniae isolates of the capsular serotype K2 are among the most virulent. In order to identify novel putative virulence factors that may account for the severity of K2 infections, the genome sequence of the K2 reference strain Kp52.145 was determined and compared to two K1 and K2 strains of low virulence and to the reference strains MGH 78578 and NTUH-K2044.

RESULTS: In addition to diverse functions related to host colonization and virulence encoded in genomic regions common to the four strains, four genomic islands specific for Kp52.145 were identified. These regions encoded genes for the synthesis of colibactin toxin, a putative cytotoxin outer membrane protein, secretion systems, nucleases and eukaryotic-like proteins. In addition, an insertion within a type VI secretion system locus included sel1 domain containing proteins and a phospholipase D family protein (PLD1). The pld1 mutant was avirulent in a pneumonia model in mouse. The pld1 mRNA was expressed in vivo and the pld1 gene was associated with K. pneumoniae isolates from severe infections. Analysis of lipid composition of a defective E. coli strain complemented with pld1 suggests an involvement of PLD1 in cardiolipin metabolism.

CONCLUSIONS: Determination of the complete genome of the K2 reference strain identified several genomic islands comprising putative elements of pathogenicity. The role of PLD1 in pathogenesis was demonstrated for the first time and suggests that lipid metabolism is a novel virulence mechanism of K. pneumoniae.

Relevance: 30.00%

Abstract:

The European sea bass, Dicentrarchus labrax, is one of the most important marine species cultivated in Southern Europe and has not benefited from selective breeding. One of the major goals in the sea bass (D. labrax) aquaculture industry is to understand and control the complexity of growth-associated traits. The aim of the methodology developed for the studies reported in the thesis was not only to establish genetic and genomic resources for sea bass, but also to develop a conceptual strategy to efficiently create knowledge in a research environment that can easily be transferred to the aquaculture industry. The strategy involved: i) establishing an annotated sea bass transcriptome and then using it to ii) identify new genetic markers for target QTL regions so that iii) new QTL analysis could be performed and the marker-based resolution of the DNA regions of interest increased, and then iv) merging the linkage map and the physical map in order to map the QTL confidence intervals to the sea bass genome and identify genes underlying the targeted traits. Finally, to test whether genes in the QTL regions that are candidates for divergent growth phenotypes have modified patterns of transcription that reflect the modified whole-organism physiology, SuperSAGE-SOLiD4 gene expression analysis was used with sea bass showing high growth heterogeneity. The SuperSAGE analysis significantly increased the transcriptome information for sea bass muscle, brain and liver and also led to the identification of putative candidate genes lying in the genomic regions of growth-related QTL. Lastly, all differentially expressed transcripts in brain, liver and muscle of European sea bass with divergent specific growth rates (SGR) were mapped to gene pathways and networks; the most affected regulatory pathways were identified, establishing the tissue-specific changes underlying the divergent SGR. Owing to the importance of European sea bass to Mediterranean aquaculture and to the genomic resources developed in the present thesis and in other studies, it should be possible to implement genetic selection programs using marker-assisted selection.

Relevance: 30.00%

Abstract:

Sequence repeats are an important phenomenon in the human genome, playing important roles in genomic alteration, often with phenotypic consequences. The two major types of repeat elements in the human genome are tandem repeats (TRs), including microsatellites, minisatellites, and satellites, and transposable elements (TEs). So far, very little has been known about the relationship between these two types of repeats. In this study, we identified TRs that are derived from TEs based on either sequence similarity or overlapping genomic positions. We then analyzed the distribution of these TRs among TE families/subfamilies. Our study shows that at least 7,276 TRs, or 23% of all minisatellites/satellites, are derived from TEs, contributing ∼0.32% of the human genome. TRs seem more likely to be generated from younger/more active TEs, and once initiated they are expanded with time via local duplication of the repeat units. The currently postulated mechanisms for the origin of TRs can explain only 6% of all TE-derived TRs, indicating the presence of one or more yet-to-be-identified mechanisms for the initiation of such repeats. Our results suggest that TEs are contributing to genome expansion and alteration not only by transposition but also by generating tandem repeats.
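The positional criterion mentioned in this abstract (TRs whose genomic positions overlap TEs) can be sketched as a simple interval-overlap test. The coordinate convention (0-based, half-open), the tuple layout, the function names, and the toy coordinates are assumptions for illustration. The study's second criterion, sequence similarity, is not covered, and the quadratic scan is chosen for clarity rather than speed.

```python
from typing import List, Tuple

# (chromosome, start, end) with 0-based, half-open coordinates (assumed convention).
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two intervals on the same chromosome share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def te_overlapping_trs(trs: List[Interval], tes: List[Interval]) -> List[Interval]:
    """Return the tandem repeats that overlap at least one transposable element."""
    return [tr for tr in trs if any(overlaps(tr, te) for te in tes)]


# Toy annotations (hypothetical coordinates).
trs = [("chr1", 100, 150), ("chr1", 500, 560), ("chr2", 100, 150)]
tes = [("chr1", 120, 400), ("chr2", 150, 300)]

hits = te_overlapping_trs(trs, tes)
fraction = len(hits) / len(trs)
print(hits, f"{fraction:.2f}")
```

Under the half-open convention, the `chr2` repeat ending at 150 and the element starting at 150 are adjacent rather than overlapping, so only the first `chr1` repeat is reported. For genome-scale annotations, sorting both lists and sweeping once, or using an interval tree, would replace the nested scan.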

Relevance: 30.00%

Abstract:

Genome sequence varies in numerous ways among individuals, although the gross architecture is fixed for all humans. Retrotransposons create one of the most abundant types of structural variants in the human genome and are divided into many families, with certain members in some families, e.g., L1, Alu, SVA, and HERV-K, remaining active for transposition. Along with other types of genomic variants, retrotransposon-derived variants contribute to the whole spectrum of genome variants in humans. With the advancement of sequencing techniques, many human genomes are being sequenced at the individual level, fueling comparative research on these variants among individuals. In this thesis, the evolution and functional impact of structural variations are examined, focusing primarily on retrotransposons in the context of human evolution. The thesis comprises three different studies, presented in three data chapters. First, the recent evolution of all human-specific AluYb members, representing the second most active subfamily of Alus, was tracked to identify their source/master copy using a novel approach. All human-specific AluYb elements from the reference genome were extracted and aligned with one another to construct clusters of similar copies, and each cluster was analyzed to infer the evolutionary relationships between its members. The approach resulted in the identification of one major driver copy of all human-specific Yb8 elements and the source copy of the Yb9 lineage. Three new subfamilies within the AluYb family (Yb8a1, Yb10 and Yb11) were also identified, with Yb11 being the youngest and most polymorphic. Second, an attempt to establish a relationship between transposable elements (TEs) and tandem repeats (TRs) was made at a genome-wide scale for the first time. Upon sequence comparison, positional cross-checking and other relevant analyses, it was observed that over 20% of all TRs are derived from TEs. This result established the first connection between these two types of repetitive elements, and extends our appreciation of the impact of TEs on genomes. Furthermore, only 6% of these TE-derived TRs follow the already postulated initiation and expansion mechanisms, suggesting that the others are likely to follow a yet-unidentified mechanism. Third, by combining multiple computational approaches involving all types of genetic variation published so far, including transposable elements, the first whole genome sequence of the most recent common ancestor of all modern human populations, which diverged into different populations around 125,000-100,000 years ago, was constructed. The study shows that the current reference genome sequence is 8.89 million base pairs larger than our common ancestor's genome, a difference contributed by a whole spectrum of genetic mechanisms. The use of this ancestral reference genome to facilitate the analysis of personal genomes was demonstrated using an example genome and through more insightful recent evolutionary analyses involving the Neanderthal genome. The three data chapters presented in this thesis conclude that tandem repeats and transposable elements are not two entirely distinct, isolated types of element, as over 20% of TRs are actually derived from TEs. Certain subfamilies of TEs are themselves still evolving, with the generation of newer subfamilies. The evolutionary analyses of all TEs along with other genomic variants helped to construct the genome sequence of the most recent common ancestor of all modern human populations, which provides a better alternative to the human reference genome and can be a useful resource for the study of personal genomics, population genetics, and human and primate evolution.

Relevance: 30.00%

Abstract:

The complete genome of an Erwinia amylovora bacteriophage, vB_EamM_Ea35-70 (Ea35-70), is 271,084 bp, encodes 318 putative proteins, and contains one tRNA. Comparative analysis with other Myoviridae genomes suggests that Ea35-70 is related to the Phikzlikevirus genus within the family Myoviridae, since 26% of Ea35-70 proteins share homology to proteins in Pseudomonas phage φKZ.