13 resultados para genome analysis

em AMS Tesi di Dottorato - Alm@DL - Università di Bologna


Relevância:

60.00% 60.00%

Publicador:

Resumo:

Epigenetic variability is a new mechanism for the study of human microevolution, because it creates both phenotypic diversity within an individual and within population. This mechanism constitutes an important reservoir for adaptation in response to new stimuli and recent studies have demonstrated that selective pressures shape not only the genetic code but also DNA methylation profiles. The aim of this thesis is the study of the role of DNA methylation changes in human adaptive processes, considering the Italian peninsula and macro-geographical areas. A whole-genome analysis of DNA methylation profile across the Italian penisula identified some genes whose methylation levels differ between individuals of different Italian districts (South, Centre and North of Italy). These genes are involved in nitrogen compound metabolism and genes involved in pathogens response. Considering individuals with different macro-geographical origins (individuals of Asians, European and African ancestry) more significant DMRs (differentially methylated regions) were identified and are located in genes involved in glucoronidation, in immune response as well as in cell comunication processes. A "profile" of each ancestry (African, Asian and European) was described. Moreover a deepen analysis of three candidate genes (KRTCAP3, MAD1L and BRSK2) in a cohort of individuals of different countries (Morocco, Nigeria, China and Philippines) living in Bologna, was performed in order to explore genetic and epigenetic diversity. Moreover this thesis have paved the way for the application of DNA methylation for the study of hystorical remains and in particular for the age-estimation of individuals starting from biological samples (such as teeth or blood). Noteworthy, a mathematical model that considered methylation values of DNA extracted from cementum and pulp of living individuals can estimate chronological age with high accuracy (median absolute difference between age estimated from DNA methylation and chronological age was 1.2 years).

Relevância:

40.00% 40.00%

Publicador:

Resumo:

The objective of this work is to characterize the genome of the chromosome 1 of A.thaliana, a small flowering plants used as a model organism in studies of biology and genetics, on the basis of a recent mathematical model of the genetic code. I analyze and compare different portions of the genome: genes, exons, coding sequences (CDS), introns, long introns, intergenes, untranslated regions (UTR) and regulatory sequences. In order to accomplish the task, I transformed nucleotide sequences into binary sequences based on the definition of the three different dichotomic classes. The descriptive analysis of binary strings indicate the presence of regularities in each portion of the genome considered. In particular, there are remarkable differences between coding sequences (CDS and exons) and non-coding sequences, suggesting that the frame is important only for coding sequences and that dichotomic classes can be useful to recognize them. Then, I assessed the existence of short-range dependence between binary sequences computed on the basis of the different dichotomic classes. I used three different measures of dependence: the well-known chi-squared test and two indices derived from the concept of entropy i.e. Mutual Information (MI) and Sρ, a normalized version of the “Bhattacharya Hellinger Matusita distance”. The results show that there is a significant short-range dependence structure only for the coding sequences whose existence is a clue of an underlying error detection and correction mechanism. No doubt, further studies are needed in order to assess how the information carried by dichotomic classes could discriminate between coding and noncoding sequence and, therefore, contribute to unveil the role of the mathematical structure in error detection and correction mechanisms. Still, I have shown the potential of the approach presented for understanding the management of genetic information.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

The aim of this work was to identify markers associated with production traits in the pig genome using different approaches. We focused the attention on Italian Large White pig breed using Genome Wide Association Studies (GWAS) and applying a selective genotyping approach to increase the power of the analyses. Furthermore, we searched the pig genome using Next Generation Sequencing (NSG) Ion Torrent Technology to combine selective genotyping approach and deep sequencing for SNP discovery. Other two studies were carried on with a different approach. Allele frequency changes for SNPs affecting candidate genes and at Genome Wide level were analysed to identify selection signatures driven by selection program during the last 20 years. This approach confirmed that a great number of markers may affect production traits and that they are captured by the classical selection programs. GWAS revealed 123 significant or suggestively significant SNP associated with Back Fat Thickenss and 229 associated with Average Daily Gain. 16 Copy Number Variant Regions resulted more frequent in lean or fat pigs and showed that different copies of those region could have a limited impact on fat. These often appear to be involved in food intake and behavior, beside affecting genes involved in metabolic pathways and their expression. By combining NGS sequencing with selective genotyping approach, new variants where discovered and at least 54 are worth to be analysed in association studies. The study of groups of pigs undergone to stringent selection showed that allele frequency of some loci can drastically change if they are close to traits that are interesting for selection schemes. These approaches could be, in future, integrated in genomic selection plans.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The continuous increase of genome sequencing projects produced a huge amount of data in the last 10 years: currently more than 600 prokaryotic and 80 eukaryotic genomes are fully sequenced and publically available. However the sole sequencing process of a genome is able to determine just raw nucleotide sequences. This is only the first step of the genome annotation process that will deal with the issue of assigning biological information to each sequence. The annotation process is done at each different level of the biological information processing mechanism, from DNA to protein, and cannot be accomplished only by in vitro analysis procedures resulting extremely expensive and time consuming when applied at a this large scale level. Thus, in silico methods need to be used to accomplish the task. The aim of this work was the implementation of predictive computational methods to allow a fast, reliable, and automated annotation of genomes and proteins starting from aminoacidic sequences. The first part of the work was focused on the implementation of a new machine learning based method for the prediction of the subcellular localization of soluble eukaryotic proteins. The method is called BaCelLo, and was developed in 2006. The main peculiarity of the method is to be independent from biases present in the training dataset, which causes the over‐prediction of the most represented examples in all the other available predictors developed so far. This important result was achieved by a modification, made by myself, to the standard Support Vector Machine (SVM) algorithm with the creation of the so called Balanced SVM. BaCelLo is able to predict the most important subcellular localizations in eukaryotic cells and three, kingdom‐specific, predictors were implemented. In two extensive comparisons, carried out in 2006 and 2008, BaCelLo reported to outperform all the currently available state‐of‐the‐art methods for this prediction task. BaCelLo was subsequently used to completely annotate 5 eukaryotic genomes, by integrating it in a pipeline of predictors developed at the Bologna Biocomputing group by Dr. Pier Luigi Martelli and Dr. Piero Fariselli. An online database, called eSLDB, was developed by integrating, for each aminoacidic sequence extracted from the genome, the predicted subcellular localization merged with experimental and similarity‐based annotations. In the second part of the work a new, machine learning based, method was implemented for the prediction of GPI‐anchored proteins. Basically the method is able to efficiently predict from the raw aminoacidic sequence both the presence of the GPI‐anchor (by means of an SVM), and the position in the sequence of the post‐translational modification event, the so called ω‐site (by means of an Hidden Markov Model (HMM)). The method is called GPIPE and reported to greatly enhance the prediction performances of GPI‐anchored proteins over all the previously developed methods. GPIPE was able to predict up to 88% of the experimentally annotated GPI‐anchored proteins by maintaining a rate of false positive prediction as low as 0.1%. GPIPE was used to completely annotate 81 eukaryotic genomes, and more than 15000 putative GPI‐anchored proteins were predicted, 561 of which are found in H. sapiens. In average 1% of a proteome is predicted as GPI‐anchored. A statistical analysis was performed onto the composition of the regions surrounding the ω‐site that allowed the definition of specific aminoacidic abundances in the different considered regions. Furthermore the hypothesis that compositional biases are present among the four major eukaryotic kingdoms, proposed in literature, was tested and rejected. All the developed predictors and databases are freely available at: BaCelLo http://gpcr.biocomp.unibo.it/bacello eSLDB http://gpcr.biocomp.unibo.it/esldb GPIPE http://gpcr.biocomp.unibo.it/gpipe

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Motivation An actual issue of great interest, both under a theoretical and an applicative perspective, is the analysis of biological sequences for disclosing the information that they encode. The development of new technologies for genome sequencing in the last years, opened new fundamental problems since huge amounts of biological data still deserve an interpretation. Indeed, the sequencing is only the first step of the genome annotation process that consists in the assignment of biological information to each sequence. Hence given the large amount of available data, in silico methods became useful and necessary in order to extract relevant information from sequences. The availability of data from Genome Projects gave rise to new strategies for tackling the basic problems of computational biology such as the determination of the tridimensional structures of proteins, their biological function and their reciprocal interactions. Results The aim of this work has been the implementation of predictive methods that allow the extraction of information on the properties of genomes and proteins starting from the nucleotide and aminoacidic sequences, by taking advantage of the information provided by the comparison of the genome sequences from different species. In the first part of the work a comprehensive large scale genome comparison of 599 organisms is described. 2,6 million of sequences coming from 551 prokaryotic and 48 eukaryotic genomes were aligned and clustered on the basis of their sequence identity. This procedure led to the identification of classes of proteins that are peculiar to the different groups of organisms. Moreover the adopted similarity threshold produced clusters that are homogeneous on the structural point of view and that can be used for structural annotation of uncharacterized sequences. The second part of the work focuses on the characterization of thermostable proteins and on the development of tools able to predict the thermostability of a protein starting from its sequence. By means of Principal Component Analysis the codon composition of a non redundant database comprising 116 prokaryotic genomes has been analyzed and it has been showed that a cross genomic approach can allow the extraction of common determinants of thermostability at the genome level, leading to an overall accuracy in discriminating thermophilic coding sequences equal to 95%. This result outperform those obtained in previous studies. Moreover, we investigated the effect of multiple mutations on protein thermostability. This issue is of great importance in the field of protein engineering, since thermostable proteins are generally more suitable than their mesostable counterparts in technological applications. A Support Vector Machine based method has been trained to predict if a set of mutations can enhance the thermostability of a given protein sequence. The developed predictor achieves 88% accuracy.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The vast majority of known proteins have not yet been experimentally characterized and little is known about their function. The design and implementation of computational tools can provide insight into the function of proteins based on their sequence, their structure, their evolutionary history and their association with other proteins. Knowledge of the three-dimensional (3D) structure of a protein can lead to a deep understanding of its mode of action and interaction, but currently the structures of <1% of sequences have been experimentally solved. For this reason, it became urgent to develop new methods that are able to computationally extract relevant information from protein sequence and structure. The starting point of my work has been the study of the properties of contacts between protein residues, since they constrain protein folding and characterize different protein structures. Prediction of residue contacts in proteins is an interesting problem whose solution may be useful in protein folding recognition and de novo design. The prediction of these contacts requires the study of the protein inter-residue distances related to the specific type of amino acid pair that are encoded in the so-called contact map. An interesting new way of analyzing those structures came out when network studies were introduced, with pivotal papers demonstrating that protein contact networks also exhibit small-world behavior. In order to highlight constraints for the prediction of protein contact maps and for applications in the field of protein structure prediction and/or reconstruction from experimentally determined contact maps, I studied to which extent the characteristic path length and clustering coefficient of the protein contacts network are values that reveal characteristic features of protein contact maps. Provided that residue contacts are known for a protein sequence, the major features of its 3D structure could be deduced by combining this knowledge with correctly predicted motifs of secondary structure. In the second part of my work I focused on a particular protein structural motif, the coiled-coil, known to mediate a variety of fundamental biological interactions. Coiled-coils are found in a variety of structural forms and in a wide range of proteins including, for example, small units such as leucine zippers that drive the dimerization of many transcription factors or more complex structures such as the family of viral proteins responsible for virus-host membrane fusion. The coiled-coil structural motif is estimated to account for 5-10% of the protein sequences in the various genomes. Given their biological importance, in my work I introduced a Hidden Markov Model (HMM) that exploits the evolutionary information derived from multiple sequence alignments, to predict coiled-coil regions and to discriminate coiled-coil sequences. The results indicate that the new HMM outperforms all the existing programs and can be adopted for the coiled-coil prediction and for large-scale genome annotation. Genome annotation is a key issue in modern computational biology, being the starting point towards the understanding of the complex processes involved in biological networks. The rapid growth in the number of protein sequences and structures available poses new fundamental problems that still deserve an interpretation. Nevertheless, these data are at the basis of the design of new strategies for tackling problems such as the prediction of protein structure and function. Experimental determination of the functions of all these proteins would be a hugely time-consuming and costly task and, in most instances, has not been carried out. As an example, currently, approximately only 20% of annotated proteins in the Homo sapiens genome have been experimentally characterized. A commonly adopted procedure for annotating protein sequences relies on the "inheritance through homology" based on the notion that similar sequences share similar functions and structures. This procedure consists in the assignment of sequences to a specific group of functionally related sequences which had been grouped through clustering techniques. The clustering procedure is based on suitable similarity rules, since predicting protein structure and function from sequence largely depends on the value of sequence identity. However, additional levels of complexity are due to multi-domain proteins, to proteins that share common domains but that do not necessarily share the same function, to the finding that different combinations of shared domains can lead to different biological roles. In the last part of this study I developed and validate a system that contributes to sequence annotation by taking advantage of a validated transfer through inheritance procedure of the molecular functions and of the structural templates. After a cross-genome comparison with the BLAST program, clusters were built on the basis of two stringent constraints on sequence identity and coverage of the alignment. The adopted measure explicity answers to the problem of multi-domain proteins annotation and allows a fine grain division of the whole set of proteomes used, that ensures cluster homogeneity in terms of sequence length. A high level of coverage of structure templates on the length of protein sequences within clusters ensures that multi-domain proteins when present can be templates for sequences of similar length. This annotation procedure includes the possibility of reliably transferring statistically validated functions and structures to sequences considering information available in the present data bases of molecular functions and structures.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This PhD Thesis is the result of my research activity in the last three years. My main research interest was centered on the evolution of mitochondrial genome (mtDNA), and on its usefulness as a phylogeographic and phylogenetic marker at different taxonomic levels in different taxa of Metazoa. From a methodological standpoint, my main effort was dedicated to the sequencing of complete mitochondrial genomes, and the approach to whole-genome sequencing was based on the application of Long-PCR and shotgun sequences. Moreover, this research project is a part of a bigger sequencing project of mtDNAs in many different Metazoans’ taxa, and I mostly dedicated myself to sequence and analyze mtDNAs in selected taxa of bivalves and hexapods (Insecta). Sequences of bivalve mtDNAs are particularly limited, and my study contributed to extend the sampling. Moreover, I used the bivalve Musculista senhousia as model taxon to investigate the molecular mechanisms and the evolutionary significance of their aberrant mode of mitochondrial inheritance (Doubly Uniparental Inheritance, see below). In Insects, I focused my attention on the Genus Bacillus (Insecta Phasmida). A detailed phylogenetic analysis was performed in order to assess phylogenetic relationships within the genus, and to investigate the placement of Phasmida in the phylogenetic tree of Insecta. The main goal of this part of my study was to add to the taxonomic coverage of sequenced mtDNAs in basal insects, which were only partially analyzed.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Animal neocentromeres are defined as ectopic centromeres that have formed in non-centromeric locations and avoid some of the features, like the DNA satellite sequence, that normally characterize canonical centromeres. Despite this, they are stable functional centromeres inherited through generations. The only existence of neocentromeres provide convincing evidence that centromere specification is determined by epigenetic rather than sequence-specific mechanisms. For all this reasons, we used them as simplified models to investigate the molecular mechanisms that underlay the formation and the maintenance of functional centromeres. We collected human cell lines carrying neocentromeres in different positions. To investigate the region involved in the process at the DNA sequence level we applied a recent technology that integrates Chromatin Immuno-Precipitation and DNA microarrays (ChIP-on-chip) using rabbit polyclonal antibodies directed against CENP-A or CENP-C human centromeric proteins. These DNA binding-proteins are required for kinetochore function and are exclusively targeted to functional centromeres. Thus, the immunoprecipitation of DNA bound by these proteins allows the isolation of centromeric sequences, including those of the neocentromeres. Neocentromeres arise even in protein-coding genes region. We further analyzed if the increased scaffold attachment sites and the corresponding tighter chromatin of the region involved in the neocentromerization process still were permissive or not to transcription of within encoded genes. Centromere repositioning is a phenomenon in which a neocentromere arisen without altering the gene order, followed by the inactivation of the canonical centromere, becomes fixed in population. It is a process of chromosome rearrangement fundamental in evolution, at the bases of speciation. The repeat-free region where the neocentromere initially forms, progressively acquires extended arrays of satellite tandem repeats that may contribute to its functional stability. In this view our attention focalized to the repositioned horse ECA11 centromere. ChIP-on-chip analysis was used to define the region involved and SNPs studies, mapping within the region involved into neocentromerization, were carried on. We have been able to describe the structural polymorphism of the chromosome 11 centromeric domain of Caballus population. That polymorphism was seen even between homologues chromosome of the same cells. That discovery was the first described ever. Genomic plasticity had a fundamental role in evolution. Centromeres are not static packaged region of genomes. The key question that fascinates biologists is to understand how that centromere plasticity could be combined to the stability and maintenance of centromeric function. Starting from the epigenetic point of view that underlies centromere formation, we decided to analyze the RNA content of centromeric chromatin. RNA, as well as secondary chemically modifications that involve both histones and DNA, represents a good candidate to guide somehow the centromere formation and maintenance. Many observations suggest that transcription of centromeric DNA or of other non-coding RNAs could affect centromere formation. To date has been no thorough investigation addressing the identity of the chromatin-associated RNAs (CARs) on a global scale. This prompted us to develop techniques to identify CARs in a genome-wide approach using high-throughput genomic platforms. The future goal of this study will be to focalize the attention on what strictly happens specifically inside centromere chromatin.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Neisseria meningitidis (Nm) is the major cause of septicemia and meningococcal meningitis. During the course of infection, it must adapt to different host environments as a crucial factor for survival. Despite the severity of meningococcal sepsis, little is known about how Nm adapts to permit survival and growth in human blood. A previous time-course transcriptome analysis, using an ex vivo model of human whole blood infection, showed that Nm alters the expression of nearly 30% of ORFs of the genome: major dynamic changes were observed in the expression of transcriptional regulators, transport and binding proteins, energy metabolism, and surface-exposed virulence factors. Starting from these data, mutagenesis studies of a subset of up-regulated genes were performed and the mutants were tested for the ability to survive in human whole blood; Nm mutant strains lacking the genes encoding NMB1483, NalP, Mip, NspA, Fur, TbpB, and LctP were sensitive to killing by human blood. Then, the analysis was extended to the whole Nm transcriptome in human blood, using a customized 60-mer oligonucleotide tiling microarray. The application of specifically developed software combined with this new tiling array allowed the identification of different types of regulated transcripts: small intergenic RNAs, antisense RNAs, 5’ and 3’ untranslated regions and operons. The expression of these RNA molecules was confirmed by 5’-3’RACE protocol and specific RT-PCR. Here we describe the complete transcriptome of Nm during incubation in human blood; we were able to identify new proteins important for survival in human blood and also to identify additional roles of previously known virulence factors in aiding survival in blood. In addition the tiling array analysis demonstrated that Nm expresses a set of new transcripts, not previously identified, and suggests the presence of a circuit of regulatory RNA elements used by Nm to adapt to proliferate in human blood.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The goal of many plant scientists’ research is to explain natural phenotypic variation in term of simple changes in DNA sequence. DNA-based molecular markers are extensively used for the construction of genome-wide molecular maps and to perform genetic analysis for simple and complex traits. The PhD thesis was divided into two main research lines according to the different approaches adopted. The first research line is to analyze the genetic diversity in an Italian apple germplasm collection for the identification of markers tightly linked to targeted genes by an association genetic method. This made it possible to identify synomym and homonym accessions and triploids. The fruit red skin color trait has been used to test the reliability of the genetic approaches in this species. The second line is related to the development of molecular markers closely linked to the Rvi13 and Rvi5 scab resistance genes, previously mapped on apple’s chromosome 10 and 17 respectively by using the traditional linkage mapping method. Both region have been fine-mapped with various type of markers that could be used for marker-assisted selection in future breeding programs and to isolate the two resistance genes.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Group B Streptococcus (GBS), in its transition from commensal to pathogen, will encounter diverse host environments and thus require coordinately controlling its transcriptional responses to these changes. This work was aimed at better understanding the role of two component signal transduction systems (TCS) in GBS pathophysiology through a systematic screening procedure. We first performed a complete inventory and sensory mechanism classification of all putative GBS TCS by genomic analysis. Five TCS were further investigated by the generation of knock-out strains, and in vitro transcriptome analysis identified genes regulated by these systems, ranging from 0.1-3% of the genome. Interestingly, two sugar phosphotransferase systems appeared differently regulated in the knock-out mutant of TCS-16, suggesting an involvement in monitoring carbon source availability. High throughput analysis of bacterial growth on different carbon sources showed that TCS-16 was necessary for growth of GBS on fructose-6-phosphate. Additional transcriptional analysis provided further evidence for a stimulus-response circuit where extracellular fructose-6-phosphate leads to autoinduction of TCS-16 with concomitant dramatic up-regulation of the adjacent operon encoding a phosphotransferase system. The TCS-16-deficient strain exhibited decreased persistence in a model of vaginal colonization and impaired growth/survival in the presence of vaginal mucoid components. All mutant strains were also characterized in a murine model of systemic infection, and inactivation of TCS-17 (also known as RgfAC) resulted in hypervirulence. Our data suggest a role for the previously unknown TCS-16, here named FspSR, in bacterial fitness and carbon metabolism during host colonization, and also provide experimental evidence for TCS-17/RgfAC involvement in virulence.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The recent advent of Next-generation sequencing technologies has revolutionized the way of analyzing the genome. This innovation allows to get deeper information at a lower cost and in less time, and provides data that are discrete measurements. One of the most important applications with these data is the differential analysis, that is investigating if one gene exhibit a different expression level in correspondence of two (or more) biological conditions (such as disease states, treatments received and so on). As for the statistical analysis, the final aim will be statistical testing and for modeling these data the Negative Binomial distribution is considered the most adequate one especially because it allows for "over dispersion". However, the estimation of the dispersion parameter is a very delicate issue because few information are usually available for estimating it. Many strategies have been proposed, but they often result in procedures based on plug-in estimates, and in this thesis we show that this discrepancy between the estimation and the testing framework can lead to uncontrolled first-type errors. We propose a mixture model that allows each gene to share information with other genes that exhibit similar variability. Afterwards, three consistent statistical tests are developed for differential expression analysis. We show that the proposed method improves the sensitivity of detecting differentially expressed genes with respect to the common procedures, since it is the best one in reaching the nominal value for the first-type error, while keeping elevate power. The method is finally illustrated on prostate cancer RNA-seq data.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Autism spectrum disorder (ASD) and Intellectual Disability (ID) are complex neuropsychiatric disorders characterized by extensive clinical and genetic heterogeneity and with overlapping risk factors. The aim of my project was to further investigate the role of Copy Numbers Variants (CNVs), identified through genome-wide studies performed by the Autism Geome Project (AGP) and the CHERISH consortium in large cohorts of ASD and ID cases, respectively. Specifically, I focused on four rare genic CNVs, selected on the basis of their impact on interesting ASD/ID candidate genes: a) a compound heterozygous deletion involving CTNNA3, predicted to cause the lack of functional protein; b) a 15q13.3 duplication containing CHRNA7; c) a 2q31.1 microdeletion encompassing KLHL23, SSB and METTL5; d) Lastly, I investigated the putative imprinting regulation of the CADPS2 gene, disrupted by a maternal deletion in two siblings with ASD and ID. This study provides further evidence for the role of CTNNA3, CHRNA7, KLHL23 and CADPS2 as ASD and/or ID susceptibility genes, and highlights that rare genetic variation contributes to disease risk in different ways: some rare mutations, such as those impacting CTNNA3, act in a recessive mode of inheritance, while other CNVs, such as those occurring in the 15q13.3 region, are implicated in multiple developmental and/or neurological disorders possibly interacting with other susceptibility variants elsewhere in the genome. On the other hand, the discovery of a tissue-specific monoallelic expression for the CADPS2 gene, implicates the involvement of epigenetic regulatory mechanisms as risk factors conferring susceptibility to ASD/ID.