926 results for Genome Segmentation
Abstract:
Eukaryotic genomes display segmental patterns of variation in various properties, including GC content and degree of evolutionary conservation. DNA segmentation algorithms are aimed at identifying statistically significant boundaries between such segments. Such algorithms may provide a means of discovering new classes of functional elements in eukaryotic genomes. This paper presents a model and an algorithm for Bayesian DNA segmentation and considers the feasibility of using it to segment whole eukaryotic genomes. The algorithm is tested on a range of simulated and real DNA sequences, and the following conclusions are drawn. Firstly, the algorithm correctly identifies non-segmented sequence, and can thus be used to reject the null hypothesis of uniformity in the property of interest. Secondly, estimates of the number and locations of change-points produced by the algorithm are robust to variations in algorithm parameters and initial starting conditions and correspond to real features in the data. Thirdly, the algorithm is successfully used to segment human chromosome 1 according to GC content, thus demonstrating the feasibility of Bayesian segmentation of eukaryotic genomes. The software described in this paper is available from the author's website (www.uq.edu.au/~uqjkeith/) or upon request to the author.
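To make the idea of a Bayesian change-point in GC content concrete, here is a minimal sketch. It is not the algorithm of the paper above: it assumes exactly one change-point, a uniform prior on its position, and a Beta(1, 1) prior on the GC probability of each segment. The function names and the toy sequence are hypothetical.

```python
import math

def log_beta(a, b):
    """Log of the Beta function, computed with log-gamma for stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal(gc, at, a=1.0, b=1.0):
    """Log marginal likelihood of a segment with `gc` G/C bases and `at`
    other bases, integrating out the GC probability under Beta(a, b)."""
    return log_beta(a + gc, b + at) - log_beta(a, b)

def changepoint_posterior(seq):
    """Posterior over the position k of a single change-point, where the
    first segment is seq[:k]. Entry i of the result corresponds to k = i + 1.
    Assumption: any character other than G or C is counted as A/T."""
    is_gc = [1 if base in "GC" else 0 for base in seq.upper()]
    n = len(is_gc)
    total = sum(is_gc)
    log_post = []
    left = 0
    for k in range(1, n):
        left += is_gc[k - 1]
        right = total - left
        log_post.append(
            log_marginal(left, k - left)
            + log_marginal(right, (n - k) - right)
        )
    # Normalise in log space to avoid underflow.
    m = max(log_post)
    weights = [math.exp(v - m) for v in log_post]
    z = sum(weights)
    return [w / z for w in weights]

# Toy sequence: an AT-only block followed by a GC-only block.
seq = "AT" * 30 + "GC" * 30
post = changepoint_posterior(seq)
best_k = max(range(len(post)), key=post.__getitem__) + 1
print(best_k, round(post[best_k - 1], 3))
```

A whole-genome method has to handle an unknown number of change-points and far longer sequences, which this sketch does not attempt.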
Abstract:
Background: Complete mitochondrial (mt) genomes offer many advantages for the accurate reconstruction of phylogenetic relationships in Metazoa. Although over one thousand metazoan genomes have been sequenced, the taxonomic sampling is highly biased, leaving many phyla without a single complete mitochondrial genome. Sipuncula (peanut worms or star worms) is a small taxon of worm-like marine organisms with an uncertain phylogenetic position. In this report, we present the mitochondrial genome sequence of Phascolosoma esculenta, the first complete mitochondrial genome of the phylum. Results: The mitochondrial genome of P. esculenta is 15,494 bp in length. The coding strand consists of 32.1% A, 21.5% C, 13.0% G, and 33.4% T bases (AT = 65.5%; AT skew = -0.019; GC skew = -0.248). It contains thirteen protein-coding genes (PCGs) with 3,709 codons in total, twenty-two transfer RNA genes, two ribosomal RNA genes and a non-coding AT-rich region (AT = 74.2%). All of the 37 identified genes are transcribed from the same DNA strand. Compared with the typical metazoan mt gene set, the sipunculid genome lacks trnR but has an additional trnM. Maximum Likelihood and Bayesian analyses of the protein sequences show that Myzostomida, Sipuncula and Annelida (including echiurans and pogonophorans) form a monophyletic group, which supports a closer relationship between Sipuncula and Annelida than between Sipuncula and Mollusca, Brachiopoda, or some other lophotrochozoan groups. Conclusion: This is the first report of a complete mitochondrial genome from the phylum Sipuncula. It shares many features with the four known annelid mtDNAs and the one known echiuran mtDNA. Firstly, sipunculans and annelids share quite similar gene order in the mitochondrial genome, with all 37 genes located on the same strand; secondly, phylogenetic analyses based on the concatenated protein sequences also strongly support the sipunculan + annelid clade (including echiurans and pogonophorans). Hence annelid "key-characters" including segmentation may be more labile than previously assumed.
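The AT and GC skew values quoted in the abstract above can be approximated from the reported base composition. The sketch below assumes the widely used definitions AT skew = (A - T)/(A + T) and GC skew = (G - C)/(G + C); the abstract does not state which formula the authors used, and the function names are hypothetical. Because the inputs are the rounded percentages, the results are close to the reported values but not identical.

```python
from collections import Counter

def base_composition(seq):
    """Fraction of each of A, C, G, T in a sequence (other symbols ignored)."""
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in "ACGT")
    return {b: counts[b] / total for b in "ACGT"}

def skews(comp):
    """Return (AT skew, GC skew) from a base-composition mapping.
    Assumed definitions: (A - T)/(A + T) and (G - C)/(G + C)."""
    at_skew = (comp["A"] - comp["T"]) / (comp["A"] + comp["T"])
    gc_skew = (comp["G"] - comp["C"]) / (comp["G"] + comp["C"])
    return at_skew, gc_skew

# Rounded coding-strand composition as reported in the abstract.
reported = {"A": 0.321, "C": 0.215, "G": 0.130, "T": 0.334}
at_skew, gc_skew = skews(reported)
print(f"AT skew ~ {at_skew:.3f}, GC skew ~ {gc_skew:.3f}")
```

Given a full sequence string, `skews(base_composition(seq))` would avoid the rounding in the percentages.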
Abstract:
While genome-wide gene expression data are generated at an increasing rate, the repertoire of approaches for pattern discovery in these data is still limited. Identifying subtle patterns of interest in large amounts of data (tens of thousands of profiles) associated with a certain level of noise remains a challenge. A microarray time series was recently generated to study the transcriptional program of the mouse segmentation clock, a biological oscillator associated with the periodic formation of the segments of the body axis. A method related to Fourier analysis, the Lomb-Scargle periodogram, was used to detect periodic profiles in the dataset, leading to the identification of a novel set of cyclic genes associated with the segmentation clock. Here, we applied to the same microarray time series dataset four distinct mathematical methods to identify significant patterns in gene expression profiles. These methods are called: Phase consistency, Address reduction, Cyclohedron test and Stable persistence, and are based on different conceptual frameworks that are either hypothesis- or data-driven. Some of the methods, unlike Fourier transforms, are not dependent on the assumption of periodicity of the pattern of interest. Remarkably, these methods identified blindly the expression profiles of known cyclic genes as the most significant patterns in the dataset. Many candidate genes predicted by more than one approach appeared to be true positive cyclic genes and will be of particular interest for future research. In addition, these methods predicted novel candidate cyclic genes that were consistent with previous biological knowledge and experimental validation in mouse embryos. 
Our results demonstrate the utility of these novel pattern detection strategies, notably for detection of periodic profiles, and suggest that combining several distinct mathematical approaches to analyze microarray datasets is a valuable strategy for identifying genes that exhibit novel, interesting transcriptional patterns.
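The Lomb-Scargle periodogram mentioned above estimates spectral power from unevenly sampled time points. Below is a minimal pure-Python sketch of the classical, unnormalized form. It is not the analysis pipeline of the study: the sampling times, the synthetic expression profile and its 2-unit period are all made-up values for illustration, and the function name is hypothetical.

```python
import math

def lomb_scargle(times, values, freqs):
    """Classical (unnormalized) Lomb-Scargle power at each frequency.
    `freqs` are ordinary frequencies (cycles per time unit), all > 0."""
    mean = sum(values) / len(values)
    y = [v - mean for v in values]
    power = []
    for f in freqs:
        w = 2.0 * math.pi * f
        # Time offset tau makes the sine and cosine terms orthogonal.
        s2 = sum(math.sin(2.0 * w * t) for t in times)
        c2 = sum(math.cos(2.0 * w * t) for t in times)
        tau = math.atan2(s2, c2) / (2.0 * w)
        cos_terms = [math.cos(w * (t - tau)) for t in times]
        sin_terms = [math.sin(w * (t - tau)) for t in times]
        yc = sum(a * b for a, b in zip(y, cos_terms))
        ys = sum(a * b for a, b in zip(y, sin_terms))
        cc = sum(c * c for c in cos_terms)
        ss = sum(s * s for s in sin_terms)
        p = 0.0
        if cc > 1e-12:
            p += yc * yc / cc
        if ss > 1e-12:
            p += ys * ys / ss
        power.append(0.5 * p)
    return power

# Hypothetical, unevenly spaced sampling times and a noise-free profile
# oscillating with period 2 (frequency 0.5).
times = [0.0, 0.3, 0.9, 1.4, 2.1, 2.6, 3.3, 3.8, 4.4,
         5.1, 5.5, 6.2, 6.9, 7.3, 8.0, 8.6, 9.1, 9.7]
values = [math.sin(2.0 * math.pi * 0.5 * t) for t in times]
freqs = [round(0.1 + 0.05 * i, 2) for i in range(19)]  # 0.10 .. 1.00

power = lomb_scargle(times, values, freqs)
peak_freq = freqs[max(range(len(power)), key=power.__getitem__)]
print("peak frequency:", peak_freq)
```

In a real screen, the peak power of each gene's profile would be compared against a significance threshold; that step is omitted here.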
Abstract:
Motivation: Array CGH technologies enable the simultaneous measurement of DNA copy number for thousands of sites on a genome. We developed the circular binary segmentation (CBS) algorithm to divide the genome into regions of equal copy number (Olshen et al., 2004). The algorithm tests for change-points using a maximal t-statistic with a permutation reference distribution to obtain the corresponding p-value. The number of computations required for the maximal test statistic is O(N^2), where N is the number of markers. This makes the full permutation approach computationally prohibitive for the newer arrays that contain tens of thousands of markers and highlights the need for a faster algorithm. Results: We present a hybrid approach to obtain the p-value of the test statistic in linear time. We also introduce a rule for stopping early when there is strong evidence for the presence of a change. We show through simulations that the hybrid approach provides a substantial gain in speed with only a negligible loss in accuracy and that the stopping rule further increases speed. We also present the analysis of array CGH data from a breast cancer cell line to show the impact of the new approaches on the analysis of real data. Availability: An R (R Development Core Team, 2006) version of the CBS algorithm has been implemented in the "DNAcopy" package of the Bioconductor project (Gentleman et al., 2004). The proposed hybrid method for the p-value is available in version 1.2.1 or higher and the stopping rule for declaring a change early is available in version 1.5.1 or higher.
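The quadratic cost described above comes from scanning every pair of markers (i, j) for the arc whose mean differs most from the rest, then repeating that scan for each permutation. The following sketch illustrates that brute-force baseline only. It is not the DNAcopy implementation and does not include the hybrid p-value or the early stopping rule. As simplifying assumptions, the statistic is left unscaled by the standard deviation (a constant factor that does not change the permutation p-value) and only non-wrapping arcs are scanned. All names and the toy data are hypothetical.

```python
import random
from itertools import accumulate

def max_arc_statistic(x):
    """Scan all arcs x[i:j] and return (max |Z|, (i, j)), where Z compares
    the mean inside the arc with the mean outside it. O(N^2) arcs."""
    n = len(x)
    s = [0.0] + list(accumulate(x))  # prefix sums: s[j] - s[i] = sum(x[i:j])
    best, best_arc = 0.0, (0, n)
    for i in range(n):
        for j in range(i + 1, n + 1):
            k = j - i
            if k == n:
                continue  # the arc must leave something outside
            inside = (s[j] - s[i]) / k
            outside = (s[n] - s[j] + s[i]) / (n - k)
            z = abs(inside - outside) / (1.0 / k + 1.0 / (n - k)) ** 0.5
            if z > best:
                best, best_arc = z, (i, j)
    return best, best_arc

def permutation_p_value(x, n_perm=200, seed=0):
    """Full permutation reference distribution for the maximal statistic."""
    rng = random.Random(seed)
    observed, arc = max_arc_statistic(x)
    shuffled = list(x)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if max_arc_statistic(shuffled)[0] >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1), arc

# Toy log-ratio profile: a gained region at markers 20..29 plus small noise.
noise = random.Random(1)
data = [(1.0 if 20 <= m < 30 else 0.0) + noise.gauss(0.0, 0.1)
        for m in range(50)]
p_value, arc = permutation_p_value(data)
print("arc:", arc, "p-value:", round(p_value, 4))
```

With N markers and P permutations this costs on the order of P times N^2 operations, which is the bottleneck the hybrid approach in the abstract is designed to avoid.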