5 resultados para computational biology

em AMS Tesi di Dottorato - Alm@DL - Università di Bologna


Relevância:

100.00% 100.00%

Publicador:

Resumo:

From the late 1980s, the automation of sequencing techniques and the computer spread gave rise to a flourishing number of new molecular structures and sequences and to proliferation of new databases in which to store them. Here are presented three computational approaches able to analyse the massive amount of publicly avalilable data in order to answer to important biological questions. The first strategy studies the incorrect assignment of the first AUG codon in a messenger RNA (mRNA), due to the incomplete determination of its 5' end sequence. An extension of the mRNA 5' coding region was identified in 477 in human loci, out of all human known mRNAs analysed, using an automated expressed sequence tag (EST)-based approach. Proof-of-concept confirmation was obtained by in vitro cloning and sequencing for GNB2L1, QARS and TDP2 and the consequences for the functional studies are discussed. The second approach analyses the codon bias, the phenomenon in which distinct synonymous codons are used with different frequencies, and, following integration with a gene expression profile, estimates the total number of codons present across all the expressed mRNAs (named here "codonome value") in a given biological condition. Systematic analyses across different pathological and normal human tissues and multiple species shows a surprisingly tight correlation between the codon bias and the codonome bias. The third approach is useful to studies the expression of human autism spectrum disorder (ASD) implicated genes. ASD implicated genes sharing microRNA response elements (MREs) for the same microRNA are co-expressed in brain samples from healthy and ASD affected individuals. The different expression of a recently identified long non coding RNA which have four MREs for the same microRNA could disrupt the equilibrium in this network, but further analyses and experiments are needed.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Motivation An actual issue of great interest, both under a theoretical and an applicative perspective, is the analysis of biological sequences for disclosing the information that they encode. The development of new technologies for genome sequencing in the last years, opened new fundamental problems since huge amounts of biological data still deserve an interpretation. Indeed, the sequencing is only the first step of the genome annotation process that consists in the assignment of biological information to each sequence. Hence given the large amount of available data, in silico methods became useful and necessary in order to extract relevant information from sequences. The availability of data from Genome Projects gave rise to new strategies for tackling the basic problems of computational biology such as the determination of the tridimensional structures of proteins, their biological function and their reciprocal interactions. Results The aim of this work has been the implementation of predictive methods that allow the extraction of information on the properties of genomes and proteins starting from the nucleotide and aminoacidic sequences, by taking advantage of the information provided by the comparison of the genome sequences from different species. In the first part of the work a comprehensive large scale genome comparison of 599 organisms is described. 2,6 million of sequences coming from 551 prokaryotic and 48 eukaryotic genomes were aligned and clustered on the basis of their sequence identity. This procedure led to the identification of classes of proteins that are peculiar to the different groups of organisms. Moreover the adopted similarity threshold produced clusters that are homogeneous on the structural point of view and that can be used for structural annotation of uncharacterized sequences. The second part of the work focuses on the characterization of thermostable proteins and on the development of tools able to predict the thermostability of a protein starting from its sequence. By means of Principal Component Analysis the codon composition of a non redundant database comprising 116 prokaryotic genomes has been analyzed and it has been showed that a cross genomic approach can allow the extraction of common determinants of thermostability at the genome level, leading to an overall accuracy in discriminating thermophilic coding sequences equal to 95%. This result outperform those obtained in previous studies. Moreover, we investigated the effect of multiple mutations on protein thermostability. This issue is of great importance in the field of protein engineering, since thermostable proteins are generally more suitable than their mesostable counterparts in technological applications. A Support Vector Machine based method has been trained to predict if a set of mutations can enhance the thermostability of a given protein sequence. The developed predictor achieves 88% accuracy.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

The vast majority of known proteins have not yet been experimentally characterized and little is known about their function. The design and implementation of computational tools can provide insight into the function of proteins based on their sequence, their structure, their evolutionary history and their association with other proteins. Knowledge of the three-dimensional (3D) structure of a protein can lead to a deep understanding of its mode of action and interaction, but currently the structures of <1% of sequences have been experimentally solved. For this reason, it became urgent to develop new methods that are able to computationally extract relevant information from protein sequence and structure. The starting point of my work has been the study of the properties of contacts between protein residues, since they constrain protein folding and characterize different protein structures. Prediction of residue contacts in proteins is an interesting problem whose solution may be useful in protein folding recognition and de novo design. The prediction of these contacts requires the study of the protein inter-residue distances related to the specific type of amino acid pair that are encoded in the so-called contact map. An interesting new way of analyzing those structures came out when network studies were introduced, with pivotal papers demonstrating that protein contact networks also exhibit small-world behavior. In order to highlight constraints for the prediction of protein contact maps and for applications in the field of protein structure prediction and/or reconstruction from experimentally determined contact maps, I studied to which extent the characteristic path length and clustering coefficient of the protein contacts network are values that reveal characteristic features of protein contact maps. Provided that residue contacts are known for a protein sequence, the major features of its 3D structure could be deduced by combining this knowledge with correctly predicted motifs of secondary structure. In the second part of my work I focused on a particular protein structural motif, the coiled-coil, known to mediate a variety of fundamental biological interactions. Coiled-coils are found in a variety of structural forms and in a wide range of proteins including, for example, small units such as leucine zippers that drive the dimerization of many transcription factors or more complex structures such as the family of viral proteins responsible for virus-host membrane fusion. The coiled-coil structural motif is estimated to account for 5-10% of the protein sequences in the various genomes. Given their biological importance, in my work I introduced a Hidden Markov Model (HMM) that exploits the evolutionary information derived from multiple sequence alignments, to predict coiled-coil regions and to discriminate coiled-coil sequences. The results indicate that the new HMM outperforms all the existing programs and can be adopted for the coiled-coil prediction and for large-scale genome annotation. Genome annotation is a key issue in modern computational biology, being the starting point towards the understanding of the complex processes involved in biological networks. The rapid growth in the number of protein sequences and structures available poses new fundamental problems that still deserve an interpretation. Nevertheless, these data are at the basis of the design of new strategies for tackling problems such as the prediction of protein structure and function. Experimental determination of the functions of all these proteins would be a hugely time-consuming and costly task and, in most instances, has not been carried out. As an example, currently, approximately only 20% of annotated proteins in the Homo sapiens genome have been experimentally characterized. A commonly adopted procedure for annotating protein sequences relies on the "inheritance through homology" based on the notion that similar sequences share similar functions and structures. This procedure consists in the assignment of sequences to a specific group of functionally related sequences which had been grouped through clustering techniques. The clustering procedure is based on suitable similarity rules, since predicting protein structure and function from sequence largely depends on the value of sequence identity. However, additional levels of complexity are due to multi-domain proteins, to proteins that share common domains but that do not necessarily share the same function, to the finding that different combinations of shared domains can lead to different biological roles. In the last part of this study I developed and validate a system that contributes to sequence annotation by taking advantage of a validated transfer through inheritance procedure of the molecular functions and of the structural templates. After a cross-genome comparison with the BLAST program, clusters were built on the basis of two stringent constraints on sequence identity and coverage of the alignment. The adopted measure explicity answers to the problem of multi-domain proteins annotation and allows a fine grain division of the whole set of proteomes used, that ensures cluster homogeneity in terms of sequence length. A high level of coverage of structure templates on the length of protein sequences within clusters ensures that multi-domain proteins when present can be templates for sequences of similar length. This annotation procedure includes the possibility of reliably transferring statistically validated functions and structures to sequences considering information available in the present data bases of molecular functions and structures.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Here I will focus on three main topics that best address and include the projects I have been working in during my three year PhD period that I have spent in different research laboratories addressing both computationally and practically important problems all related to modern molecular genomics. The first topic is the use of livestock species (pigs) as a model of obesity, a complex human dysfunction. My efforts here concern the detection and annotation of Single Nucleotide Polymorphisms. I developed a pipeline for mining human and porcine sequences. Starting from a set of human genes related with obesity the platform returns a list of annotated porcine SNPs extracted from a new set of potential obesity-genes. 565 of these SNPs were analyzed on an Illumina chip to test the involvement in obesity on a population composed by more than 500 pigs. Results will be discussed. All the computational analysis and experiments were done in collaboration with the Biocomputing group and Dr.Luca Fontanesi, respectively, under the direction of prof. Rita Casadio at the Bologna University, Italy. The second topic concerns developing a methodology, based on Factor Analysis, to simultaneously mine information from different levels of biological organization. With specific test cases we develop models of the complexity of the mRNA-miRNA molecular interaction in brain tumors measured indirectly by microarray and quantitative PCR. This work was done under the supervision of Prof. Christine Nardini, at the “CAS-MPG Partner Institute for Computational Biology” of Shangai, China (co-founded by the Max Planck Society and the Chinese Academy of Sciences jointly) The third topic concerns the development of a new method to overcome the variety of PCR technologies routinely adopted to characterize unknown flanking DNA regions of a viral integration locus of the human genome after clinical gene therapy. This new method is entirely based on next generation sequencing and it reduces the time required to detect insertion sites, decreasing the complexity of the procedure. This work was done in collaboration with the group of Dr. Manfred Schmidt at the Nationales Centrum für Tumorerkrankungen (Heidelberg, Germany) supervised by Dr. Annette Deichmann and Dr. Ali Nowrouzi. Furthermore I add as an Appendix the description of a R package for gene network reconstruction that I helped to develop for scientific usage (http://www.bioconductor.org/help/bioc-views/release/bioc/html/BUS.html).

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Background. One of the phenomena observed in human aging is the progressive increase of a systemic inflammatory state, a condition referred to as “inflammaging”, negatively correlated with longevity. A prominent mediator of inflammation is the transcription factor NF-kB, that acts as key transcriptional regulator of many genes coding for pro-inflammatory cytokines. Many different signaling pathways activated by very diverse stimuli converge on NF-kB, resulting in a regulatory network characterized by high complexity. NF-kB signaling has been proposed to be responsible of inflammaging. Scope of this analysis is to provide a wider, systemic picture of such intricate signaling and interaction network: the NF-kB pathway interactome. Methods. The study has been carried out following a workflow for gathering information from literature as well as from several pathway and protein interactions databases, and for integrating and analyzing existing data and the relative reconstructed representations by using the available computational tools. Strong manual intervention has been necessarily used to integrate data from multiple sources into mathematically analyzable networks. The reconstruction of the NF-kB interactome pursued with this approach provides a starting point for a general view of the architecture and for a deeper analysis and understanding of this complex regulatory system. Results. A “core” and a “wider” NF-kB pathway interactome, consisting of 140 and 3146 proteins respectively, were reconstructed and analyzed through a mathematical, graph-theoretical approach. Among other interesting features, the topological characterization of the interactomes shows that a relevant number of interacting proteins are in turn products of genes that are controlled and regulated in their expression exactly by NF-kB transcription factors. These “feedback loops”, not always well-known, deserve deeper investigation since they may have a role in tuning the response and the output consequent to NF-kB pathway initiation, in regulating the intensity of the response, or its homeostasis and balance in order to make the functioning of such critical system more robust and reliable. This integrated view allows to shed light on the functional structure and on some of the crucial nodes of thet NF-kB transcription factors interactome. Conclusion. Framing structure and dynamics of the NF-kB interactome into a wider, systemic picture would be a significant step toward a better understanding of how NF-kB globally regulates diverse gene programs and phenotypes. This study represents a step towards a more complete and integrated view of the NF-kB signaling system.