To understand the molecular basis of gene targeting, we have studied interactions of nucleoprotein filaments comprised of single-stranded DNA and RecA protein with chromatin templates reconstituted from linear duplex DNA and histones. We observed that for the chromatin templates with histone/DNA mass ratios of 0.8 and 1.6, the efficiency of homologous pairing was indistinguishable from that of naked duplex DNA but strand exchange was repressed. In contrast, the chromatin templates with a histone/DNA mass ratio of 9.0 supported neither homologous pairing nor strand exchange. The addition of histone H1, in stoichiometric amounts, to chromatin templates quells homologous pairing. The pairing of chromatin templates with nucleoprotein filaments of RecA protein-single-stranded DNA proceeded without the production of detectable networks of DNA, suggesting that coaggregates are unlikely to be the intermediates in homologous pairing. The application of these observations to strategies for gene targeting and their implications for models of genetic recombination are discussed.


The compaction level of arrays of nucleosomes may be understood in terms of the balance between the self-repulsion of DNA (principally linker DNA) and countering factors including the ionic strength and composition of the medium, the highly basic N termini of the core histones, and linker histones. However, the structural principles that come into play during the transition from a loose chain of nucleosomes to a compact 30-nm chromatin fiber have been difficult to establish, and the arrangement of nucleosomes and linker DNA in condensed chromatin fibers has never been fully resolved. Based on images of the solution conformation of native chromatin and fully defined chromatin arrays obtained by electron cryomicroscopy, we report a linker histone-dependent architectural motif beyond the level of the nucleosome core particle that takes the form of a stem-like organization of the entering and exiting linker DNA segments. DNA completes ≈1.7 turns on the histone octamer in the presence and absence of linker histone. When linker histone is present, the two linker DNA segments become juxtaposed ≈8 nm from the nucleosome center and remain apposed for 3–5 nm before diverging. We propose that this stem motif directs the arrangement of nucleosomes and linker DNA within the chromatin fiber, establishing a unique three-dimensional zigzag folding pattern that is conserved during compaction. Such an arrangement with peripherally arranged nucleosomes and internal linker DNA segments is fully consistent with observations in intact nuclei and also allows dramatic changes in compaction level to occur without a concomitant change in topology.


Archaea contain histones that have primary sequences in common with eukaryal nucleosome core histones and a three-dimensional structure that is essentially only the histone fold. Here we report the results of experiments that document that archaeal histones compact DNA in vivo into structures similar to the structure formed by the histone (H3+H4)2 tetramer at the center of the eukaryal nucleosome. After formaldehyde cross-linking in vivo, these archaeal nucleosomes have been isolated from Methanobacterium thermoautotrophicum and Methanothermus fervidus, visualized by electron microscopy on plasmid and genomic DNAs, and shown by immunogold labeling, SDS/PAGE, and immunoblotting to contain archaeal histones, cross-linked into tetramers. Archaeal nucleosomes protect ≈60 bp of DNA and multiples of ≈60 bp from micrococcal nuclease digestion, and immunoprecipitation has demonstrated that most, but not all, M. fervidus genomic DNA sequences are associated in vivo with archaeal histones.


Nucleosomes, the basic structural elements of chromosomes, consist of 146 bp of DNA coiled around an octamer of histone proteins, and their presence can strongly influence gene expression. Considerations of the anisotropic flexibility of nucleotide triplets containing 3 cytosines or guanines suggested that a [5'-(G/C)3NN-3']n motif might resist wrapping around a histone octamer. To test this, DNAs were constructed containing a 5'-CCGNN-3' pentanucleotide repeat with the Ns varied. In vitro nucleosome reconstitution and electron microscopy showed that a plasmid with 48 contiguous CCGNN repeats strongly excluded nucleosomes in the repeat region. Competitive reconstitution gel retardation experiments using DNA fragments containing 12, 24, or 48 CCGNN repeats showed that the propensity to exclude nucleosomes increased with the length of the repeat. Analysis showed that a 268-bp DNA containing a (CCGNN)48 block is 4.9 ± 0.6-fold less efficient in nucleosome assembly than a similar length pUC19 fragment and approximately 78-fold less efficient than a similar length (CTG)n sequence, based on results from previous studies. Computer searches against the GenBank database for matches with a [(G/C)3NN]48 sequence revealed numerous examples that frequently were present in the control regions of "TATA-less" genes, including the human ETS-2 and human dihydrofolate reductase genes. In both cases the (G/C)3NN repeat, present in the promoter region, co-maps with loci previously shown to be nuclease hypersensitive sites.
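The kind of sequence search described in this abstract can be illustrated with a short pattern match. This is a minimal sketch and not the authors' search procedure: it assumes a plain uppercase DNA string as input and scans the given strand only. The function name `find_gc3nn_runs` and the `min_repeats` threshold are hypothetical.

```python
import re


def find_gc3nn_runs(sequence, min_repeats=12):
    """Return (start, end, repeat_count) for each run of (G/C)3NN pentamers.

    Illustrative only: `sequence` is assumed to be an uppercase DNA string,
    and `min_repeats` is a hypothetical threshold, not a value from the study.
    """
    # Three G or C bases followed by any two bases, repeated min_repeats or more times.
    pattern = re.compile(r"(?:[GC]{3}[ACGT]{2}){%d,}" % min_repeats)
    runs = []
    for match in pattern.finditer(sequence):
        length = match.end() - match.start()
        runs.append((match.start(), match.end(), length // 5))
    return runs


# Toy sequence: 12 CCGAT pentamers between AT-rich flanks.
toy = "ATATAT" + "CCGAT" * 12 + "TTTTT"
runs = find_gc3nn_runs(toy)
```

Raising `min_repeats` to 48 would restrict the hits to runs as long as the (CCGNN)48 block, at the cost of missing the shorter 12- and 24-repeat tracts.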


Inheritance of each chromosome depends upon its centromere. A histone H3 variant, centromere protein A (CENP-A), is essential for epigenetically marking centromere location. We find that CENP-A is quantitatively retained at the centromere upon which it is initially assembled. CENP-C binds to CENP-A nucleosomes and is a prime candidate to stabilize centromeric chromatin. Using purified components, we find that CENP-C reshapes the octameric histone core of CENP-A nucleosomes, rigidifies both surface and internal nucleosome structure, and modulates terminal DNA to match the loose wrap that is found on native CENP-A nucleosomes at functional human centromeres. Thus, CENP-C affects nucleosome shape and dynamics in a manner analogous to allosteric regulation of enzymes. CENP-C depletion leads to rapid removal of CENP-A from centromeres, indicating their collaboration in maintaining centromere identity.


The field of epigenetics looks at changes in the chromosomal structure that affect gene expression without altering DNA sequence. A large-scale modelling project to better understand these mechanisms is gaining momentum. Early advances in genetics led to the all-genetic paradigm: phenotype (an organism's characteristics/behaviour) is determined by genotype (its genetic make-up). This was later amended and expressed by the well-known formula P = G + E, encompassing the notion that the visible characteristics of a living organism (the phenotype, P) are a combination of hereditary genetic factors (the genotype, G) and environmental factors (E). However, this formula fails to explain why in diseases such as schizophrenia we still observe differences between identical twins. Furthermore, the identification of environmental factors (such as smoking and air quality for lung cancer) is relatively rare. The formula also fails to explain cell differentiation from a single fertilized cell. In the wake of early work by Waddington, more recent results have emphasized that the expression of the genotype can be altered without any change in the DNA sequence. This phenomenon has been tagged as epigenetics. To form the chromosome, DNA strands wrap around nucleosomes, each of which is a cluster of nine proteins (histones), as detailed in Figure 1. Epigenetic mechanisms involve inherited alterations in these two structures, e.g. through attachment of a functional group (methyl, acetyl or phosphate) to the amino acids. These 'stable alterations' arise during development and cell proliferation and persist through cell division. While information within the genetic material is not changed, instructions for its assembly and interpretation may be. Modelling this new paradigm, P = G + E + EpiG, is the object of our study.


The current explosion of DNA sequence information has generated increasing evidence for the claim that noncoding repetitive DNA sequences present within and around different genes could play an important role in genetic control processes, although the precise role and mechanism by which these sequences function are poorly understood. Several of the simple repetitive sequences which occur in a large number of loci throughout the human and other eukaryotic genomes satisfy the sequence criteria for forming non-B DNA structures in vitro. We have summarized some of the features of three different types of simple repeats that highlight the importance of repetitive DNA in the control of gene expression and chromatin organization. (i) (TG/CA)n repeats are widespread and conserved in many loci. These sequences are associated with nucleosomes of varying linker length and may play a role in chromatin organization. These Z-potential sequences can help absorb superhelical stress during transcription and aid in recombination. (ii) Human telomeric repeat (TTAGGG)n adopts a novel quadruplex structure and exhibits unusual chromatin organization. This unusual structural motif could explain chromosome pairing and stability. (iii) Intragenic amplification of (CTG)n/(CAG)n trinucleotide repeat, which is now known to be associated with several genetic disorders, could down-regulate gene expression in vivo. The overall implications of these findings vis-à-vis repetitive sequences in the genome are summarized.


The incorporation of DNA into nucleosomes and higher-order forms of chromatin in vivo creates difficulties with respect to its accessibility for cellular functions such as transcription, replication, repair and recombination. To understand the role of chromatin structure in the process of homologous recombination, we have studied the interaction of nucleoprotein filaments, comprised of RecA protein and ssDNA, with minichromosomes. Using this paradigm, we have addressed how chromatin structure affects the search for homologous DNA sequences, and attempted to distinguish between two mutually exclusive models of DNA-DNA pairing mechanisms. Paradoxically, we found that the search for homologous sequences, as monitored by unwinding of homologous or heterologous duplex DNA, was facilitated by nucleosomes, with no discernible effect on homologous pairing. More importantly, unwinding of minichromosomes required the interaction of nucleoprotein filaments and led to the accumulation of circular duplex DNA sensitive to nuclease P1. Competition experiments indicated that chromatin templates and naked DNA served as equally efficient targets for homologous pairing. These and other findings suggest that nucleosomes do not impede but rather facilitate the search for homologous sequences and establish, in accordance with one proposed model, that unwinding of duplex DNA precedes alignment of homologous sequences at the level of chromatin. The potential application of this model to investigate the role of chromosomal proteins in the alignment of homologous sequences in the context of cellular recombination is considered.


Background: A nucleosome is the fundamental repeating unit of the eukaryotic chromosome. It has been shown that the positioning of a majority of nucleosomes is primarily controlled by factors other than the intrinsic preference of the DNA sequence. One of the key questions in this context is the role, if any, that can be played by the variability of nucleosomal DNA structure. Results: In this study, we have addressed this question by analysing the variability at the dinucleotide and trinucleotide as well as longer length scales in a dataset of nucleosome X-ray crystal structures. We observe that the nucleosome structure displays remarkable local-level structural versatility within the B-DNA family. The nucleosomal DNA also incorporates a large number of kinks. Conclusions: Based on our results, we propose that the local- and global-level versatility of B-DNA structure may be a significant factor modulating the formation of nucleosomes in the vicinity of high-plasticity genes, and in varying the probability of binding by regulatory proteins. Hence, these factors should be incorporated in the prediction algorithms and there may not be a unique 'template' for predicting putative nucleosome sequences. In addition, the multimodal distribution of dinucleotide parameters for some steps and the presence of a large number of kinks in the nucleosomal DNA structure indicate that the linear elastic model, used by several algorithms to predict the energetic cost of nucleosome formation, may lead to incorrect results.


Histone variants and their modifications have significant roles in many cellular processes. In this study, we identified and characterized the histone H2A variant h2af1o in fish and revealed its oocyte-specific expression pattern during oogenesis and embryogenesis. Moreover, a posttranslational modification of H2af1o resulting from phosphorylation was observed during oocyte maturation. To understand the binding dynamics of the novel core histone variant H2af1o in nucleosomes, we cloned ubiquitous gibel carp h2afx as a conventional histone control and investigated the difference in dynamic exchange in chromatin by fluorescence recovery after photobleaching. H2af1o has significantly higher mobility in nucleosomes than ubiquitous H2afx. Compared with ubiquitous H2afx, H2af1o has a tightly binding C-terminus and a weakly binding N-terminus. These data indicate that fish oocytes have a novel H2A variant that destabilizes nucleosomes by protruding its N-terminal tail and stabilizes core particles by contracting its C-terminal tail. Our findings suggest that H2af1o may have an intrinsic ability to modify chromatin properties during fish oogenesis, oocyte maturation, and early cleavage.


Recent work has identified a novel RSC-nucleosome complex that both strongly phases flanking nucleosomes and presents regulatory sites for ready access. These results challenge several widely held views.


Cellular stresses activate the tumor suppressor p53 protein leading to selective binding to DNA response elements (REs) and gene transactivation from a large pool of potential p53 REs (p53REs). To elucidate how p53RE sequences and local chromatin context interact to affect p53 binding and gene transactivation, we mapped genome-wide binding localizations of p53 and H3K4me3 in untreated and doxorubicin (DXR)-treated human lymphoblastoid cells. We examined the relationships among p53 occupancy, gene expression, H3K4me3, chromatin accessibility (DNase I hypersensitivity, DHS), ENCODE chromatin states, p53RE sequence, and evolutionary conservation. We observed that the inducible expression of p53-regulated genes was associated with the steady-state chromatin status of the cell. Most highly inducible p53-regulated genes were suppressed at baseline and marked by repressive histone modifications or displayed CTCF binding. Comparison of p53RE sequences residing in different chromatin contexts demonstrated that weaker p53REs resided in open promoters, while stronger p53REs were located within enhancers and repressed chromatin. p53 occupancy was strongly correlated with similarity of the target DNA sequences to the p53RE consensus, but surprisingly, inversely correlated with pre-existing nucleosome accessibility (DHS) and evolutionary conservation at the p53RE. Occupancy by p53 of REs that overlapped transposable element (TE) repeats was significantly higher (p < 10^-7) and correlated with stronger p53RE sequences (p < 10^-110) relative to nonTE-associated p53REs, particularly for MLT1H, LTR10B, and Mer61 TEs. However, binding at these elements was generally not associated with transactivation of adjacent genes. Occupied p53REs located in L2-like TEs were unique in displaying highly negative PhyloP scores (predicted fast-evolving) and being associated with altered H3K4me3 and DHS levels.
These results underscore the systematic interaction between chromatin status and p53RE context in the induced transactivation response. This p53 regulated response appears to have been tuned via evolutionary processes that may have led to repression and/or utilization of p53REs originating from primate-specific transposon elements.


Transcriptional regulation has been studied intensively in recent decades. One important aspect of this regulation is the interaction between regulatory proteins, such as transcription factors (TF) and nucleosomes, and the genome. Different high-throughput techniques have been invented to map these interactions genome-wide, including ChIP-based methods (ChIP-chip, ChIP-seq, etc.), nuclease digestion methods (DNase-seq, MNase-seq, etc.), and others. However, a single experimental technique often only provides partial and noisy information about the whole picture of protein-DNA interactions. Therefore, the overarching goal of this dissertation is to provide computational developments for jointly modeling different experimental datasets to achieve a holistic inference on the protein-DNA interaction landscape.

We first present a computational framework that can incorporate the protein binding information in MNase-seq data into a thermodynamic model of protein-DNA interaction. We use a correlation-based objective function to model the MNase-seq data and a Markov chain Monte Carlo method to maximize the function. Our results show that the inferred protein-DNA interaction landscape is concordant with the MNase-seq data and provides a mechanistic explanation for the experimentally collected MNase-seq fragments. Our framework is flexible and can easily incorporate other data sources. To demonstrate this flexibility, we use prior distributions to integrate experimentally measured protein concentrations.
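The idea of a correlation-based objective explored by Markov chain Monte Carlo can be sketched on a toy problem. The code below is an illustration under strong assumptions, not the dissertation's framework: a single fixed-width footprint stands in for the thermodynamic model, the "observed" coverage is synthetic, and the names `footprint`, `fit_position`, and `temperature` are hypothetical. A Metropolis acceptance rule with a uniform (symmetric) proposal is used, and the best-scoring position visited is reported.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def footprint(position, width, length):
    """Predicted occupancy: 1 inside the bound footprint, 0 elsewhere (toy model)."""
    return [1.0 if position <= i < position + width else 0.0 for i in range(length)]


def fit_position(observed, width, n_steps=2000, temperature=0.05, seed=0):
    """Metropolis search for the footprint position that best correlates with the data."""
    rng = random.Random(seed)
    length = len(observed)
    n_positions = length - width + 1
    current = 0
    current_score = pearson(footprint(current, width, length), observed)
    best, best_score = current, current_score
    for _ in range(n_steps):
        proposal = rng.randrange(n_positions)  # uniform, hence symmetric, proposal
        score = pearson(footprint(proposal, width, length), observed)
        # Always accept improvements; accept worse moves with Metropolis probability.
        if score >= current_score or rng.random() < math.exp((score - current_score) / temperature):
            current, current_score = proposal, score
            if score > best_score:
                best, best_score = proposal, score
    return best, best_score


# Synthetic coverage: elevated over a 10-bp protected region starting at position 20.
observed = [5.0 if 20 <= i < 30 else 1.0 for i in range(60)]
best_position, best_score = fit_position(observed, width=10)
```

A correlation objective is insensitive to the overall scale of the coverage, which is one reason it is convenient when sequencing depth varies between experiments.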

We also study the ability of DNase-seq data to position nucleosomes. Traditionally, DNase-seq has only been widely used to identify DNase hypersensitive sites, which tend to be open chromatin regulatory regions devoid of nucleosomes. We reveal for the first time that DNase-seq datasets also contain substantial information about nucleosome translational positioning, and that existing DNase-seq data can be used to infer nucleosome positions with high accuracy. We develop a Bayes-factor-based nucleosome scoring method to position nucleosomes using DNase-seq data. Our approach utilizes several effective strategies to extract nucleosome positioning signals from the noisy DNase-seq data, including jointly modeling data points across the nucleosome body and explicitly modeling the quadratic and oscillatory DNase I digestion pattern on nucleosomes. We show that our DNase-seq-based nucleosome map is highly consistent with previous high-resolution maps. We also show that the oscillatory DNase I digestion pattern is useful in revealing the nucleosome rotational context around TF binding sites.
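A Bayes-factor score for a candidate nucleosome position can be sketched as a log-likelihood ratio between a nucleosome model and a background model. The example below is a simplified illustration, not the method developed in the dissertation: it assumes Poisson cut counts with fixed, hand-picked rates (so the Bayes factor reduces to a likelihood ratio between two point hypotheses), it ignores the quadratic and oscillatory digestion pattern, and the template, rates, and function names are hypothetical.

```python
import math


def log_bayes_factor(counts, nucleosome_rates, background_rate):
    """log[P(counts | nucleosome) / P(counts | background)] for Poisson counts.

    Both hypotheses use fixed rates (an assumption), so the log(k!) terms cancel.
    """
    total = 0.0
    for k, lam in zip(counts, nucleosome_rates):
        total += k * math.log(lam / background_rate) - (lam - background_rate)
    return total


def best_nucleosome_offset(cut_counts, nucleosome_rates, background_rate=1.0):
    """Score every window jointly across the template and return the best offset."""
    width = len(nucleosome_rates)
    scores = [
        log_bayes_factor(cut_counts[s:s + width], nucleosome_rates, background_rate)
        for s in range(len(cut_counts) - width + 1)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy template: frequent cuts in the flanking linkers, few cuts over the protected body.
template = [3.0, 3.0] + [0.1] * 6 + [3.0, 3.0]
# Toy data: uniform background with one nucleosome-like window starting at offset 5.
cuts = [1] * 5 + [3, 3] + [0] * 6 + [3, 3] + [1] * 5
offset, scores = best_nucleosome_offset(cuts, template)
```

Scoring the whole window at once, rather than each base separately, is what lets weak per-base signals add up to a clear positional call.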

Finally, we present a state-space model (SSM) for jointly modeling different kinds of genomic data to provide an accurate view of the protein-DNA interaction landscape. We also provide an efficient expectation-maximization algorithm to learn model parameters from data. We first show in simulation studies that the SSM can effectively recover underlying true protein binding configurations. We then apply the SSM to model real genomic data (both DNase-seq and MNase-seq data). By incrementally increasing the types of genomic data in the SSM, we show that different data types can contribute complementary information for the inference of the protein binding landscape and that the most accurate inference comes from modeling all available datasets.
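Joint modeling of two data types in a discrete state-space model can be sketched with a two-state hidden Markov model. This is a toy illustration and not the dissertation's SSM: the states (unbound, bound), the binarized tracks, and all probabilities are assumptions chosen for the example, and the parameters are fixed by hand rather than learned by expectation-maximization. The two tracks are treated as conditionally independent given the hidden state, and the forward-backward algorithm returns the posterior probability of the bound state at each position.

```python
def posterior_bound(track_a, track_b, p_stay=0.9, p_hit_bound=0.8, p_hit_free=0.2):
    """Posterior P(bound) per position from two binary tracks (forward-backward).

    State 0 = unbound, state 1 = bound. All parameter values are illustrative.
    """
    n = len(track_a)
    trans = [[p_stay, 1.0 - p_stay], [1.0 - p_stay, p_stay]]

    def emit(state, a, b):
        # Tracks are assumed independent given the state, so probabilities multiply.
        p = p_hit_bound if state == 1 else p_hit_free
        return (p if a else 1.0 - p) * (p if b else 1.0 - p)

    # Forward pass, normalized at each step to avoid underflow.
    fwd = []
    prev = [0.5, 0.5]
    for t in range(n):
        if t == 0:
            cur = [prev[s] * emit(s, track_a[0], track_b[0]) for s in (0, 1)]
        else:
            cur = [
                sum(prev[r] * trans[r][s] for r in (0, 1)) * emit(s, track_a[t], track_b[t])
                for s in (0, 1)
            ]
        z = sum(cur)
        prev = [c / z for c in cur]
        fwd.append(prev)

    # Backward pass, also normalized.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        nxt = [
            sum(trans[s][r] * emit(r, track_a[t + 1], track_b[t + 1]) * bwd[t + 1][r] for r in (0, 1))
            for s in (0, 1)
        ]
        z = sum(nxt)
        bwd[t] = [x / z for x in nxt]

    posterior = []
    for t in range(n):
        w = [fwd[t][s] * bwd[t][s] for s in (0, 1)]
        posterior.append(w[1] / (w[0] + w[1]))
    return posterior


# Toy tracks: both data types signal binding over positions 5 to 9.
track_a = [0] * 5 + [1] * 5 + [0] * 5
track_b = [0] * 5 + [1] * 5 + [0] * 5
posterior = posterior_bound(track_a, track_b)
decoded = [1 if p > 0.5 else 0 for p in posterior]
```

Adding a further data type in this sketch amounts to multiplying one more factor into `emit`, which is the sense in which each dataset contributes its own evidence to the same hidden state.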

This dissertation provides a foundation for future research by taking a step toward the genome-wide inference of the protein-DNA interaction landscape through data integration.