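Conserved sequences are genomic sequences that are conserved across species, and the vast majority of them do not encode proteins. Conserved sequences play an important role in human genetic diseases. A subset of conserved sequences can fold into secondary structures. Some of the conserved secondary structures identified so far encode RNA molecules, such as microRNAs, RNA editing sequences, and the stem-loop structure in the 3' untranslated region of histone mRNAs. For the vast majority of conserved secondary structures, however, the biological functions and the evolutionary forces acting on them remain unknown.

Population SNP data are highly effective for analyzing the evolutionary forces acting on sequences. The frequency of a SNP in a population varies with the evolutionary forces acting on it, regardless of whether it lies in a mutational hotspot of the genome. SNPs under purifying selection generally have a lower derived allele frequency (DAF) than neutral SNPs. Using bioinformatic methods, we identified 746 SNPs within conserved secondary structures of the human genome. These 746 SNPs do not differ significantly in mutation pattern from SNPs in other regions of the genome, and mutational hotspots also exist within conserved secondary structures. Comparison with the distribution of SNPs in flanking sequences showed that SNP density in conserved secondary structures is about two-thirds of that in their flanking sequences. A higher proportion of SNPs in conserved secondary structures have low DAF values than SNPs in flanking sequences. These results suggest that many SNPs in conserved secondary structures have been eliminated from modern human populations by purifying selection. The differences in SNP density and DAF between conserved secondary structures and flanking sequences are greater than those between conserved and non-conserved sequences, suggesting that conserved secondary structures are the class of conserved sequences under the most stringent purifying selection. We found that the strength of purifying selection also varies within conserved secondary structures. Stem regions have a lower SNP density than loop regions, and a higher proportion of stem SNPs have low DAF values. This result suggests that purifying selection on conserved secondary structures acts mainly on sites in the stem regions. We speculate that this is because mutations in stems tend to have a greater effect on the secondary structure than mutations in loops.

We also analyzed the role of conserved secondary structures in transcriptional regulatory networks by searching for overlaps between conserved secondary structures and the binding sites of the transcription factors SOX2, OCT4, NANOG, SUZ12 and C-MYC. The results show that many conserved secondary structures serve as transcription factor binding sites that regulate the expression of many development-related genes encoding transcription factors. The binding patterns between transcription factors and conserved secondary structures are highly complex: several transcription factors can bind to the same conserved secondary structure, and a transcription factor can bind to conserved secondary structures associated with its own coding gene. Binding of different transcription factors to conserved secondary structures can determine the specific expression pattern of target genes: when most of the associated conserved secondary structures are bound by SUZ12, gene expression is repressed, whereas when most are not bound by SUZ12, gene expression is activated. Within the transcriptional regulatory network, about 30% of conserved secondary structures regulate gene expression as promoters. Because SOX2, OCT4, NANOG, SUZ12 and C-MYC bind to only a small fraction of conserved secondary structures, additional transcription factors probably also bind to them. The transcriptional regulatory network mediated by conserved secondary structures is therefore much more complex than currently known.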
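The two summary statistics used in this abstract, SNP density and the proportion of SNPs with a low derived allele frequency (DAF), can be illustrated with a minimal Python sketch. The function names, the 0.1 cutoff for "low DAF", and all input numbers are hypothetical choices for illustration; the abstract does not state the threshold or the region lengths actually used.

```python
def derived_allele_frequency(derived_count, total_chromosomes):
    """DAF = copies of the derived allele / chromosomes sampled."""
    if total_chromosomes <= 0:
        raise ValueError("total_chromosomes must be positive")
    return derived_count / total_chromosomes


def snp_density_per_kb(n_snps, region_length_bp):
    """Number of SNPs per kilobase of sequence."""
    if region_length_bp <= 0:
        raise ValueError("region_length_bp must be positive")
    return n_snps * 1000 / region_length_bp


def low_daf_fraction(dafs, threshold=0.1):
    """Fraction of SNPs whose DAF is below the (assumed) threshold."""
    if not dafs:
        raise ValueError("dafs must not be empty")
    return sum(1 for daf in dafs if daf < threshold) / len(dafs)


if __name__ == "__main__":
    # Hypothetical toy inputs, not data from the study.
    structure_density = snp_density_per_kb(20, 10_000)
    flank_density = snp_density_per_kb(30, 10_000)
    print("density ratio (structure / flank):", structure_density / flank_density)

    structure_dafs = [0.05, 0.02, 0.50, 0.08]
    flank_dafs = [0.05, 0.40, 0.60, 0.30]
    print("low-DAF fraction, structure:", low_daf_fraction(structure_dafs))
    print("low-DAF fraction, flank:", low_daf_fraction(flank_dafs))
```

A density ratio below 1 together with a higher low-DAF fraction in the structured regions is the pattern the abstract interprets as evidence of purifying selection; a real analysis would also need a statistical test of each difference.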


Phytoene desaturase (PDS) is one of the most important enzymes required for carotenoid biosynthesis in some cyanobacteria, green algae and plants. In this study, the genomic DNA and cDNA of pds were cloned from the unicellular green alga Haematococcus pluvialis strain 323 using PCR and RT-PCR. The cDNA was cloned into plasmid pET-28a and efficiently expressed in Escherichia coli BL21. The complete genomic PDS gene of H. pluvialis, 3.3 kb in size, comprises eight exons and seven introns. To locate transcriptional regulatory elements, an approximately 1 kb 5'-flanking region was isolated by genome walking. Bioinformatic analysis identified several putative cis-elements in the 5'-flanking region of pds, including the ABRE motif (abscisic acid responsive element), the C-repeat/DRE (dehydration responsive element) motif and the GCN4 motif. Phylogenetic analyses reveal that PDS genes from different sources each form a separate clade with 100% bootstrap support. Moreover, a maximum likelihood approach was employed to detect evidence of positive selection in the evolution of PDS genes. Branch-site model analysis suggests that 7.9% of sites along the green algal branch are under positive selection, and that the PDS gene in green algae exhibits a different evolutionary pattern from its counterparts in cyanobacteria and plants.


BACKGROUND: Recent advances in genome sequencing suggest a remarkable conservation in the gene content of mammalian organisms. The similarity in gene repertoire across organisms has increased interest in studying the regulatory mechanisms of gene expression, with the aim of elucidating differences in phenotype. In particular, a proximal promoter region contains a large number of regulatory elements that control the expression of its downstream gene. Although many studies have focused on identifying these elements, a broader picture of the complexity of transcriptional regulation across biological processes has not been established in mammals. Regulatory complexity may correlate strongly with gene function, as different evolutionary forces must act on regulatory systems under different biological conditions. We investigate this hypothesis by comparing the conservation of promoters upstream of genes classified in different functional categories. RESULTS: By conducting a rank correlation analysis between functional annotation and upstream sequence alignment scores obtained by human-mouse and human-dog comparison, we found significantly greater conservation of the upstream sequences of genes involved in development, cell communication, neural functions and signaling processes than of those involved in more basic processes shared with unicellular organisms, such as metabolism and ribosomal function. This observation persists after controlling for G+C content. Considering conservation as a functional signature, we hypothesize a higher density of cis-regulatory elements upstream of genes participating in complex and adaptive processes. CONCLUSION: We identified a class of functions that are associated with either high or low promoter conservation in mammals. We detected a significant tendency for complex and adaptive processes to be associated with higher promoter conservation, despite the fact that they emerged relatively recently during evolution. We describe and contrast several hypotheses that provide deeper insight into how transcriptional complexity might have emerged during evolution.
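The rank correlation between functional annotation and upstream alignment score described in this abstract can be sketched as follows. This is a minimal illustration that assumes a Spearman-type correlation (the abstract says only "rank correlation") between a binary category-membership indicator and a per-gene conservation score; the gene scores are invented for the example.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical data: 1 = gene annotated to the category, 0 = not.
    in_category = [1, 1, 0, 0]
    alignment_score = [0.9, 0.8, 0.3, 0.2]
    print(spearman(in_category, alignment_score))
```

A positive coefficient means that genes in the category tend to have more conserved upstream sequences. Assessing significance, and controlling for G+C content as the abstract describes, would require additional steps not shown here.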


The gastrointestinal tract (GIT) is a diverse ecosystem colonised by a wide array of bacteria, of which bifidobacteria are a significant component. Bifidobacteria are Gram-positive, saccharolytic, non-motile, non-sporulating, anaerobic, Y-shaped bacteria with a high-GC genome. Certain bifidobacteria are able to produce conjugated linoleic acid (CLA) from linoleic acid (LA) by a biochemical pathway that is hypothesised to involve a linoleic acid isomerase. In Chapter 2 of this thesis it was found that the MCRA-specifying gene is not involved in CLA production in B. breve NCFB 2258, and that this gene specifies an oleate hydratase involved in the conversion of oleic acid into 10-hydroxystearic acid. Prebiotics are defined as non-digestible food ingredients that beneficially affect the host by selectively stimulating the growth and/or activity of one or a limited number of bacteria in the colon. Key to the development of novel prebiotics is an understanding of which carbohydrates support the growth of bifidobacteria and how such carbohydrates are metabolised. In Chapter 3 of this thesis we describe the identification and characterisation of two neighbouring gene clusters involved in the metabolism of raffinose-containing carbohydrates (plus the related carbohydrate melibiose) and melezitose by Bifidobacterium breve UCC2003. Chapter 4 describes the analysis of the transcriptional regulation of the raf and mel clusters. In the final experimental chapter, two putative rep genes, designated repA7017 and repB7017, are identified on the megaplasmid pBb7017 of B. breve JCM 7017, the first bifidobacterial megaplasmid to be reported. One of these, repA7017, was subjected to an in-depth characterisation. The work described in this thesis has resulted in an improved understanding of bifidobacterial fatty acid and carbohydrate metabolism. Furthermore, attempts were made to develop novel genetic tools.


Cytokine-driven signalling shapes immune homeostasis and guides inflammatory responses mainly through the induction of specific gene expression programmes both within and outside the immune cell compartment. These transcriptional outputs are often amplified via cytokine synergy, which sets a stimulatory threshold that safeguards against exacerbated inflammation and immunopathology. In this study, we investigated the molecular mechanisms underpinning synergy between two pivotal Th1 cytokines, IFN-γ and TNF-α, in human intestinal epithelial cells. These two proinflammatory mediators induce a unique state of signalling and transcriptional synergy implicated in processes such as antiviral and antitumour immunity, intestinal barrier dysfunction and pancreatic β-cell dysfunction. More than 30 years after its discovery, however, this biological phenomenon remains only partially defined. Here, using a functional genomics approach including RNAi perturbation screens and small-molecule inhibitors, we identified two new regulators of IFN-γ/TNF-α-induced chemokine and antiviral gene and protein expression: the Bcl-2 family protein BCL-G and the histone demethylase UTX. We also discovered that IFN-γ and TNF-α synergise to trigger a coordinated shutdown of the expression of major receptor tyrosine kinases in colon cancer cells. Together, these findings extend our current understanding of how IFN-γ/TNF-α synergy elicits qualitatively and quantitatively distinct outputs in the intestinal epithelium. Given the well-documented role of this synergistic state in the immunopathology of various disorders, our results may help to inform the identification of high-quality, biologically relevant druggable targets for diseases characterised by an IFN-γ/TNF-α-high immune signature.


Bifidobacteria are Gram-positive, anaerobic, typically Y-shaped bacteria that are naturally found in the digestive tract of certain mammals, birds and insects. Bifidobacterium breve strains are numerically prevalent among the gut microbiota of many healthy breast-fed infants. The prototypical B. breve strain UCC2003 has previously been shown to utilise numerous carbohydrates of plant origin. This thesis describes various aspects of host-derived carbohydrate metabolism in this bacterium. Chapter II describes B. breve UCC2003 utilisation of sialic acid, a nine-carbon monosaccharide found in human milk oligosaccharides (HMOs) and the mucin glycoprotein. B. breve UCC2003 was also shown to cross-feed on sialic acid released from 3' sialyllactose, a prominent HMO, by the extracellular sialidase activity of Bifidobacterium bifidum PRL2010. Chapter III reports on the transcriptional regulation of sialic acid metabolism in B. breve UCC2003 by a transcriptional repressor encoded by the nanR gene. NanR belongs to the GntR family of transcriptional regulators and represents the first bifidobacterial member of this family to be characterised. Chapter IV investigates B. breve UCC2003 utilisation of mucin. B. breve UCC2003 was shown to be incapable of degrading mucin; however, when grown in co-culture with B. bifidum PRL2010 it exhibits enhanced growth and survival properties. A number of methods were used to investigate and identify the mucin components supporting this enhanced growth/viability phenotype. Chapter V describes the characterisation of two sulfatase-encoding gene clusters from B. breve UCC2003. The transcriptional regulation of both sulfatase-encoding gene clusters was also investigated. The work presented in this thesis provides new information on the metabolism of host-derived carbohydrates in bifidobacteria, thus increasing our understanding of how these gut commensals are able to colonise and persist in the gastrointestinal tract.


Animals must coordinate development with fluctuating nutrient availability. Nutrient availability governs post-embryonic development in Caenorhabditis elegans: larvae that hatch in the absence of food do not initiate post-embryonic development but instead enter "L1 arrest" (or "L1 diapause"). Arrested larvae can survive starvation for weeks and rapidly resume normal development once fed. Insulin-like signaling (IIS) has been shown to be a key regulator of L1 arrest and recovery. However, the C. elegans genome encodes 40 insulin-like peptides (ILPs), and it is unknown which peptides participate in the nutritional control of L1 arrest and recovery. Work in other contexts has identified putative receptor agonists and antagonists, but the extent of specificity versus redundancy is unclear beyond this distinction.

We measured mRNA expression dynamics with high temporal resolution for all 40 insulin-like genes during entry into and recovery from L1 arrest. Nutrient availability influences the expression of the majority of insulin-like genes, with variable dynamics suggesting complex regulation. We identified 13 candidate agonists and 8 candidate antagonists based on expression in response to nutrient availability. We selected ten candidate agonists (daf-28, ins-3, ins-4, ins-5, ins-6, ins-7, ins-9, ins-26, ins-33 and ins-35) for further characterization in L1 stage larvae. We used destabilized reporter genes to determine spatial expression patterns. Expression of candidate agonists was largely overlapping in L1 stage larvae, suggesting a role for the intestine, the chemosensory neurons ASI and ASJ, and the interneuron PVT in systemic control of L1 development. Transcriptional regulation of candidate agonists was most significant in the intestine, consistent with nutrient uptake being a more important influence on transcription than sensory perception. Scanning the 5' upstream promoter regions of all 40 ILPs, we found that putative PQM-1 and GATA transcription factor binding sites are depleted in the promoter regions of antagonists. A novel motif was also found to be over-represented in ILP promoters.

Phenotypic analysis of single and compound deletion mutants did not reveal effects on L1 recovery or developmental dynamics. However, simultaneous disruption of ins-4 and daf-28 extended survival during L1 arrest without enhancing thermal tolerance, while overexpression of ins-4, ins-6 or daf-28 shortened L1 survival. Simultaneous disruption of several ILPs produced a temperature-independent, transient dauer phenotype. These results reveal the relative redundancy and specificity among agonistic ILPs.

TGF-β and steroid hormone (SH) signaling have been reported to control dauer formation along with IIS. Our preliminary results suggest that they may also mediate IIS control of L1 arrest and recovery: the expression of several key components of the TGF-β and SH signaling pathways is negatively regulated by DAF-16, and loss of function of these genes partially represses the daf-16 null phenotype in L1 arrest and retards L1 development.

In summary, my dissertation study focused on the IIS, characterized the dynamics and sites of ILPs expression in response to nutrient availability, revealed the function of specific agonistic ILPs in L1 arrest, and suggested potential cross-regulation among IIS, TGF-β signaling and SH signaling in controlling L1 arrest and recovery. These findings provide insights into how post-embryonic development is governed by insulin-like signaling and nutrient availability.


Nutrient availability profoundly influences gene expression. Many animal genes encode multiple transcript isoforms, yet the effect of nutrient availability on transcript isoform expression has not been studied in genome-wide fashion. When Caenorhabditis elegans larvae hatch without food, they arrest development in the first larval stage (L1 arrest). Starved larvae can survive L1 arrest for weeks, but growth and post-embryonic development are rapidly initiated in response to feeding. We used RNA-seq to characterize the transcriptome during L1 arrest and over time after feeding. Twenty-seven percent of detectable protein-coding genes were differentially expressed during recovery from L1 arrest, with the majority of changes initiating within the first hour, demonstrating widespread, acute effects of nutrient availability on gene expression. We used two independent approaches to track expression of individual exons and mRNA isoforms, and we connected changes in expression to functional consequences by mining a variety of databases. These two approaches identified an overlapping set of genes with alternative isoform expression, and they converged on common functional patterns. Genes affecting mRNA splicing and translation are regulated by alternative isoform expression, revealing post-transcriptional consequences of nutrient availability on gene regulation. We also found that phosphorylation sites are often alternatively expressed, revealing a common mode by which alternative isoform expression modifies protein function and signal transduction. Our results detail rich changes in C. elegans gene expression as larvae initiate growth and post-embryonic development, and they provide an excellent resource for ongoing investigation of transcriptional regulation and developmental physiology.


Despite an emerging understanding of the genetic alterations giving rise to various tumors, the mechanisms whereby most oncogenes are overexpressed remain unclear. Here we have utilized an integrated approach of genomewide regulatory element mapping via DNase-seq followed by conventional reporter assays and transcription factor binding site discovery to characterize the transcriptional regulation of the medulloblastoma oncogene Orthodenticle Homeobox 2 (OTX2). Through these studies we have revealed that OTX2 is differentially regulated in medulloblastoma at the level of chromatin accessibility, which is in part mediated by DNA methylation. In cell lines exhibiting chromatin accessibility of OTX2 regulatory regions, we found that autoregulation maintains OTX2 expression. Comparison of medulloblastoma regulatory elements with those of the developing brain reveals that these tumors engage a developmental regulatory program to drive OTX2 transcription. Finally, we have identified a transcriptional regulatory element mediating retinoid-induced OTX2 repression in these tumors. This work characterizes for the first time the mechanisms of OTX2 overexpression in medulloblastoma. Furthermore, this study establishes proof of principle for applying ENCODE datasets towards the characterization of upstream trans-acting factors mediating expression of individual genes.


Transcriptional regulation has been studied intensively in recent decades. One important aspect of this regulation is the interaction between regulatory proteins, such as transcription factors (TF) and nucleosomes, and the genome. Different high-throughput techniques have been invented to map these interactions genome-wide, including ChIP-based methods (ChIP-chip, ChIP-seq, etc.), nuclease digestion methods (DNase-seq, MNase-seq, etc.), and others. However, a single experimental technique often only provides partial and noisy information about the whole picture of protein-DNA interactions. Therefore, the overarching goal of this dissertation is to provide computational developments for jointly modeling different experimental datasets to achieve a holistic inference on the protein-DNA interaction landscape.

We first present a computational framework that can incorporate the protein binding information in MNase-seq data into a thermodynamic model of protein-DNA interaction. We use a correlation-based objective function to model the MNase-seq data and a Markov chain Monte Carlo method to maximize the function. Our results show that the inferred protein-DNA interaction landscape is concordant with the MNase-seq data and provides a mechanistic explanation for the experimentally collected MNase-seq fragments. Our framework is flexible and can easily incorporate other data sources. To demonstrate this flexibility, we use prior distributions to integrate experimentally measured protein concentrations.
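The idea of maximizing a correlation-based objective with a Markov chain Monte Carlo search can be illustrated with a toy Python sketch. It is not the thermodynamic model of the dissertation. As a simplifying assumption, the "model" here has a single unknown parameter (the center of one Gaussian-shaped footprint), the objective is the Pearson correlation between predicted and observed profiles, and a fixed-temperature Metropolis rule explores candidate centers. All names, the profile shape, the temperature and the step set are hypothetical.

```python
import math
import random


def gaussian_profile(center, length, width=5.0):
    """Toy predicted signal: one Gaussian-shaped footprint on a track."""
    return [math.exp(-0.5 * ((i - center) / width) ** 2) for i in range(length)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def metropolis_maximize(observed, n_steps=5000, temperature=0.01,
                        margin=15, start=20, seed=0):
    """Search for the footprint center that maximizes the correlation."""
    rng = random.Random(seed)
    length = len(observed)
    low, high = margin, length - 1 - margin  # keep the footprint off the edges
    current = start
    current_score = pearson(gaussian_profile(current, length), observed)
    best, best_score = current, current_score
    for _ in range(n_steps):
        # Propose a small move, clamped to the allowed range.
        proposal = min(high, max(low, current + rng.choice([-2, -1, 1, 2])))
        score = pearson(gaussian_profile(proposal, length), observed)
        delta = score - current_score
        # Metropolis rule: always accept improvements, sometimes accept worse.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = proposal, score
        if current_score > best_score:
            best, best_score = current, current_score
    return best, best_score


if __name__ == "__main__":
    # Hypothetical noise-free "observed" track with the footprint at 60.
    observed = gaussian_profile(60, 100)
    print(metropolis_maximize(observed))
```

Accepting occasional downhill moves lets the chain cross the flat, uninformative stretch of the objective far from the true footprint, which a purely greedy search could not do. A realistic application would sample many parameters jointly and would score noisy data.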

We also study the ability of DNase-seq data to position nucleosomes. Traditionally, DNase-seq has only been widely used to identify DNase hypersensitive sites, which tend to be open chromatin regulatory regions devoid of nucleosomes. We reveal for the first time that DNase-seq datasets also contain substantial information about nucleosome translational positioning, and that existing DNase-seq data can be used to infer nucleosome positions with high accuracy. We develop a Bayes-factor-based nucleosome scoring method to position nucleosomes using DNase-seq data. Our approach utilizes several effective strategies to extract nucleosome positioning signals from the noisy DNase-seq data, including jointly modeling data points across the nucleosome body and explicitly modeling the quadratic and oscillatory DNase I digestion pattern on nucleosomes. We show that our DNase-seq-based nucleosome map is highly consistent with previous high-resolution maps. We also show that the oscillatory DNase I digestion pattern is useful in revealing the nucleosome rotational context around TF binding sites.

Finally, we present a state-space model (SSM) for jointly modeling different kinds of genomic data to provide an accurate view of the protein-DNA interaction landscape. We also provide an efficient expectation-maximization algorithm to learn model parameters from data. We first show in simulation studies that the SSM can effectively recover underlying true protein binding configurations. We then apply the SSM to model real genomic data (both DNase-seq and MNase-seq data). Through incrementally increasing the types of genomic data in the SSM, we show that different data types can contribute complementary information for the inference of protein binding landscape and that the most accurate inference comes from modeling all available datasets.

This dissertation provides a foundation for future research by taking a step toward the genome-wide inference of protein-DNA interaction landscape through data integration.


The cascade that culminates in macrometastases is thought to be mediated by phenotypic plasticity, including epithelial-mesenchymal and mesenchymal-epithelial transitions (EMT and MET). Although there is substantial support for the role of EMT in driving cancer cell invasion and dissemination, much less is known about the importance of MET in the later steps of metastatic colonization. We created novel reporters, which integrate transcriptional and post-transcriptional regulation, to test whether MET is required for metastasis in multiple in vivo cancer models. In a model of carcinosarcoma, metastasis occurred via an MET-dependent pathway; however, in two prostate carcinoma models, metastatic colonization was MET independent. Our results provide evidence for both MET-dependent and MET-independent metastatic pathways.


Epidemiological studies have identified psychological stress as a significant risk factor in breast cancer. The stress response is regulated by the HPA axis in the brain and is mediated by glucocorticoid receptor (GR) signalling. It has been found that early life events can affect epigenetic programming of GR, and methylation of the GR promoter has been reported in colorectal tumourigenesis. Decreased GR expression has also been observed in breast cancer. In addition, it has been previously demonstrated that unliganded GR can serve as a direct activator of the BRCA1 promoter in mammary epithelial cells. We propose a model whereby methylation of the GR promoter in the breast significantly lowers GR expression, resulting in insufficient BRCA1 promoter activation and an increased risk of developing cancer. Antibody-based methylated DNA enrichment was followed by qPCR analysis (MeDIP-qPCR) in a novel assay developed to detect locus-specific methylation levels. It was found that 13% of primary breast tumours were hypermethylated at the GR proximal promoter whereas no methylation was detected in normal tissue. RT-PCR and 5’ RACE analysis identified exon 1B as the predominant alternative first exon in the breast. Tumours methylated near exon 1B had decreased GR expression compared to unmethylated samples, suggesting that this region is important for transcriptional regulation of GR. It was also determined that GR and BRCA1 expression was decreased in breast tumour compared to normal tissue. Furthermore, the relative expression of GR and BRCA1 measured by qRT-PCR was correlated in normal tissue but this association was not found in tumour tissue. From this, it appears that lower GR levels with associated decreased BRCA1 expression in tissues may be a predisposing factor for breast cancer. 
Based on these results we propose a role for GR as a potential tumour suppressor gene in the breast due to its association with BRCA1, also a tumour suppressor gene, as well as its consistently decreased expression in breast tumours and methylation of its proximal promoter in a subset of cancer patients.


The hypoxia-inducible factor (HIF) transcription complex, which is activated by low oxygen tension, controls a diverse range of cellular processes including angiogenesis and erythropoiesis. Under normoxic conditions, the alpha subunit of HIF is rapidly degraded in a manner dependent on hydroxylation of two conserved proline residues at positions 402 and 564 in HIF-1alpha in the oxygen-dependent degradation (ODD) domain. This allows subsequent recognition by the von Hippel-Lindau (VHL) tumor suppressor protein, which targets HIF for degradation by the ubiquitin-proteasome pathway. Under hypoxic conditions, prolyl hydroxylation of HIF is inhibited, allowing it to escape VHL-mediated degradation. The transcriptional regulation of the erythropoietin gene by HIF raises the possibility that HIF may play a role in disorders of erythropoiesis, such as idiopathic erythrocytosis (IE).


BRCA1 is a tumor suppressor gene implicated in transcriptional regulation. We have generated cell lines with inducible expression of BRCA1 as a tool to identify downstream targets that may be important mediators of BRCA1 function. Oligonucleotide array-based expression profiling identified 11 previously described interferon regulated genes that were up-regulated following inducible expression of BRCA1. Northern blot analysis revealed that a subset of the identified targets including IRF-7, MxA, and ISG-54 were synergistically up-regulated by BRCA1 in the presence of interferon gamma (IFN-gamma) but not interferons alpha or beta. Importantly, IFN-gamma-mediated induction of IRF-7 and MxA was attenuated in the BRCA1 mutant cell line HCC1937, an effect that was rescued following reconstitution of exogenous wild type BRCA1 in these cells. Furthermore, reconstituted BRCA1 sensitized HCC1937 cells to IFN-gamma-induced apoptotic cell death. This study identifies BRCA1 as a component of the IFN-gamma-regulated signaling pathway and suggests that BRCA1 may play a role in the regulation of IFN-gamma-mediated apoptosis.