12 resultados para Genome annotation

em AMS Tesi di Dottorato - Alm@DL - Università di Bologna


Relevância:

100.00% 100.00%

Publicador:

Resumo:

The continuous increase of genome sequencing projects produced a huge amount of data in the last 10 years: currently more than 600 prokaryotic and 80 eukaryotic genomes are fully sequenced and publically available. However the sole sequencing process of a genome is able to determine just raw nucleotide sequences. This is only the first step of the genome annotation process that will deal with the issue of assigning biological information to each sequence. The annotation process is done at each different level of the biological information processing mechanism, from DNA to protein, and cannot be accomplished only by in vitro analysis procedures resulting extremely expensive and time consuming when applied at a this large scale level. Thus, in silico methods need to be used to accomplish the task. The aim of this work was the implementation of predictive computational methods to allow a fast, reliable, and automated annotation of genomes and proteins starting from aminoacidic sequences. The first part of the work was focused on the implementation of a new machine learning based method for the prediction of the subcellular localization of soluble eukaryotic proteins. The method is called BaCelLo, and was developed in 2006. The main peculiarity of the method is to be independent from biases present in the training dataset, which causes the over‐prediction of the most represented examples in all the other available predictors developed so far. This important result was achieved by a modification, made by myself, to the standard Support Vector Machine (SVM) algorithm with the creation of the so called Balanced SVM. BaCelLo is able to predict the most important subcellular localizations in eukaryotic cells and three, kingdom‐specific, predictors were implemented. In two extensive comparisons, carried out in 2006 and 2008, BaCelLo reported to outperform all the currently available state‐of‐the‐art methods for this prediction task. BaCelLo was subsequently used to completely annotate 5 eukaryotic genomes, by integrating it in a pipeline of predictors developed at the Bologna Biocomputing group by Dr. Pier Luigi Martelli and Dr. Piero Fariselli. An online database, called eSLDB, was developed by integrating, for each aminoacidic sequence extracted from the genome, the predicted subcellular localization merged with experimental and similarity‐based annotations. In the second part of the work a new, machine learning based, method was implemented for the prediction of GPI‐anchored proteins. Basically the method is able to efficiently predict from the raw aminoacidic sequence both the presence of the GPI‐anchor (by means of an SVM), and the position in the sequence of the post‐translational modification event, the so called ω‐site (by means of an Hidden Markov Model (HMM)). The method is called GPIPE and reported to greatly enhance the prediction performances of GPI‐anchored proteins over all the previously developed methods. GPIPE was able to predict up to 88% of the experimentally annotated GPI‐anchored proteins by maintaining a rate of false positive prediction as low as 0.1%. GPIPE was used to completely annotate 81 eukaryotic genomes, and more than 15000 putative GPI‐anchored proteins were predicted, 561 of which are found in H. sapiens. In average 1% of a proteome is predicted as GPI‐anchored. A statistical analysis was performed onto the composition of the regions surrounding the ω‐site that allowed the definition of specific aminoacidic abundances in the different considered regions. Furthermore the hypothesis that compositional biases are present among the four major eukaryotic kingdoms, proposed in literature, was tested and rejected. All the developed predictors and databases are freely available at: BaCelLo http://gpcr.biocomp.unibo.it/bacello eSLDB http://gpcr.biocomp.unibo.it/esldb GPIPE http://gpcr.biocomp.unibo.it/gpipe

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Motivation An actual issue of great interest, both under a theoretical and an applicative perspective, is the analysis of biological sequences for disclosing the information that they encode. The development of new technologies for genome sequencing in the last years, opened new fundamental problems since huge amounts of biological data still deserve an interpretation. Indeed, the sequencing is only the first step of the genome annotation process that consists in the assignment of biological information to each sequence. Hence given the large amount of available data, in silico methods became useful and necessary in order to extract relevant information from sequences. The availability of data from Genome Projects gave rise to new strategies for tackling the basic problems of computational biology such as the determination of the tridimensional structures of proteins, their biological function and their reciprocal interactions. Results The aim of this work has been the implementation of predictive methods that allow the extraction of information on the properties of genomes and proteins starting from the nucleotide and aminoacidic sequences, by taking advantage of the information provided by the comparison of the genome sequences from different species. In the first part of the work a comprehensive large scale genome comparison of 599 organisms is described. 2,6 million of sequences coming from 551 prokaryotic and 48 eukaryotic genomes were aligned and clustered on the basis of their sequence identity. This procedure led to the identification of classes of proteins that are peculiar to the different groups of organisms. Moreover the adopted similarity threshold produced clusters that are homogeneous on the structural point of view and that can be used for structural annotation of uncharacterized sequences. The second part of the work focuses on the characterization of thermostable proteins and on the development of tools able to predict the thermostability of a protein starting from its sequence. By means of Principal Component Analysis the codon composition of a non redundant database comprising 116 prokaryotic genomes has been analyzed and it has been showed that a cross genomic approach can allow the extraction of common determinants of thermostability at the genome level, leading to an overall accuracy in discriminating thermophilic coding sequences equal to 95%. This result outperform those obtained in previous studies. Moreover, we investigated the effect of multiple mutations on protein thermostability. This issue is of great importance in the field of protein engineering, since thermostable proteins are generally more suitable than their mesostable counterparts in technological applications. A Support Vector Machine based method has been trained to predict if a set of mutations can enhance the thermostability of a given protein sequence. The developed predictor achieves 88% accuracy.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

The vast majority of known proteins have not yet been experimentally characterized and little is known about their function. The design and implementation of computational tools can provide insight into the function of proteins based on their sequence, their structure, their evolutionary history and their association with other proteins. Knowledge of the three-dimensional (3D) structure of a protein can lead to a deep understanding of its mode of action and interaction, but currently the structures of <1% of sequences have been experimentally solved. For this reason, it became urgent to develop new methods that are able to computationally extract relevant information from protein sequence and structure. The starting point of my work has been the study of the properties of contacts between protein residues, since they constrain protein folding and characterize different protein structures. Prediction of residue contacts in proteins is an interesting problem whose solution may be useful in protein folding recognition and de novo design. The prediction of these contacts requires the study of the protein inter-residue distances related to the specific type of amino acid pair that are encoded in the so-called contact map. An interesting new way of analyzing those structures came out when network studies were introduced, with pivotal papers demonstrating that protein contact networks also exhibit small-world behavior. In order to highlight constraints for the prediction of protein contact maps and for applications in the field of protein structure prediction and/or reconstruction from experimentally determined contact maps, I studied to which extent the characteristic path length and clustering coefficient of the protein contacts network are values that reveal characteristic features of protein contact maps. Provided that residue contacts are known for a protein sequence, the major features of its 3D structure could be deduced by combining this knowledge with correctly predicted motifs of secondary structure. In the second part of my work I focused on a particular protein structural motif, the coiled-coil, known to mediate a variety of fundamental biological interactions. Coiled-coils are found in a variety of structural forms and in a wide range of proteins including, for example, small units such as leucine zippers that drive the dimerization of many transcription factors or more complex structures such as the family of viral proteins responsible for virus-host membrane fusion. The coiled-coil structural motif is estimated to account for 5-10% of the protein sequences in the various genomes. Given their biological importance, in my work I introduced a Hidden Markov Model (HMM) that exploits the evolutionary information derived from multiple sequence alignments, to predict coiled-coil regions and to discriminate coiled-coil sequences. The results indicate that the new HMM outperforms all the existing programs and can be adopted for the coiled-coil prediction and for large-scale genome annotation. Genome annotation is a key issue in modern computational biology, being the starting point towards the understanding of the complex processes involved in biological networks. The rapid growth in the number of protein sequences and structures available poses new fundamental problems that still deserve an interpretation. Nevertheless, these data are at the basis of the design of new strategies for tackling problems such as the prediction of protein structure and function. Experimental determination of the functions of all these proteins would be a hugely time-consuming and costly task and, in most instances, has not been carried out. As an example, currently, approximately only 20% of annotated proteins in the Homo sapiens genome have been experimentally characterized. A commonly adopted procedure for annotating protein sequences relies on the "inheritance through homology" based on the notion that similar sequences share similar functions and structures. This procedure consists in the assignment of sequences to a specific group of functionally related sequences which had been grouped through clustering techniques. The clustering procedure is based on suitable similarity rules, since predicting protein structure and function from sequence largely depends on the value of sequence identity. However, additional levels of complexity are due to multi-domain proteins, to proteins that share common domains but that do not necessarily share the same function, to the finding that different combinations of shared domains can lead to different biological roles. In the last part of this study I developed and validate a system that contributes to sequence annotation by taking advantage of a validated transfer through inheritance procedure of the molecular functions and of the structural templates. After a cross-genome comparison with the BLAST program, clusters were built on the basis of two stringent constraints on sequence identity and coverage of the alignment. The adopted measure explicity answers to the problem of multi-domain proteins annotation and allows a fine grain division of the whole set of proteomes used, that ensures cluster homogeneity in terms of sequence length. A high level of coverage of structure templates on the length of protein sequences within clusters ensures that multi-domain proteins when present can be templates for sequences of similar length. This annotation procedure includes the possibility of reliably transferring statistically validated functions and structures to sequences considering information available in the present data bases of molecular functions and structures.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Identification and genetic diversity of phytoplasmas infecting tropical plant species, selected among those most agronomically relevant in South-east Asia and Latin America were studied. Correlation between evolutionary divergence of relevant phytoplasma strains and their geographic distribution by comparison on homologous genes of phytoplasma strains detected in the same or related plant species in other geographical areas worldwide was achieved. Molecular diversity was studied on genes coding ribosomal proteins, groEL, tuf and amp besides phytoplasma 16S rRNA. Selected samples infected by phytoplasmas belonging to diverse ribosomal groups were also studied by in silico RFLP followed by phylogenetic analyses. Moreover a partial genome annotation of a ‘Ca. P. brasiliense’ strain was done towards future application for epidemiological studies. Phytoplasma presence in cassava showing frog skin (CFSD) and witches’ broom (CWB) diseases in Costa Rica - Paraguay and in Vietnam – Thailand, respectively, was evaluated. In both cases, the diseases were associated with phytoplasmas related to aster yellows, apple proliferation and “stolbur” groups, while only phytoplasma related to X-disease group in CFSD, and to hibiscus witches’ broom, elm yellows and clover proliferation groups in CWB. Variability was found among strains belonging to the same ribosomal group but having different geographic origin and associated with different disease. Additionally, a dodder transmission assay to elucidate the role of phytoplasmas in CWB disease was carried out, and resulted in typical phytoplasma symptoms in periwinkle plants associated with the presence of aster yellows-related strains. Lethal wilt disease, a severe disease of oil palm in Colombia that is spreading throughout South America was also studied. Phytoplasmas were detected in symptomatic oil palm and identified as ‘Ca. P. asteris’, ribosomal subgroup 16SrI-B, and were distinguished from other aster yellows phytoplasmas used as reference strains; in particular, from an aster yellows strain infecting corn in the same country.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Bioinformatics, in the last few decades, has played a fundamental role to give sense to the huge amount of data produced. Obtained the complete sequence of a genome, the major problem of knowing as much as possible of its coding regions, is crucial. Protein sequence annotation is challenging and, due to the size of the problem, only computational approaches can provide a feasible solution. As it has been recently pointed out by the Critical Assessment of Function Annotations (CAFA), most accurate methods are those based on the transfer-by-homology approach and the most incisive contribution is given by cross-genome comparisons. In the present thesis it is described a non-hierarchical sequence clustering method for protein automatic large-scale annotation, called “The Bologna Annotation Resource Plus” (BAR+). The method is based on an all-against-all alignment of more than 13 millions protein sequences characterized by a very stringent metric. BAR+ can safely transfer functional features (Gene Ontology and Pfam terms) inside clusters by means of a statistical validation, even in the case of multi-domain proteins. Within BAR+ clusters it is also possible to transfer the three dimensional structure (when a template is available). This is possible by the way of cluster-specific HMM profiles that can be used to calculate reliable template-to-target alignments even in the case of distantly related proteins (sequence identity < 30%). Other BAR+ based applications have been developed during my doctorate including the prediction of Magnesium binding sites in human proteins, the ABC transporters superfamily classification and the functional prediction (GO terms) of the CAFA targets. Remarkably, in the CAFA assessment, BAR+ placed among the ten most accurate methods. At present, as a web server for the functional and structural protein sequence annotation, BAR+ is freely available at http://bar.biocomp.unibo.it/bar2.0.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This PhD Thesis is the result of my research activity in the last three years. My main research interest was centered on the evolution of mitochondrial genome (mtDNA), and on its usefulness as a phylogeographic and phylogenetic marker at different taxonomic levels in different taxa of Metazoa. From a methodological standpoint, my main effort was dedicated to the sequencing of complete mitochondrial genomes, and the approach to whole-genome sequencing was based on the application of Long-PCR and shotgun sequences. Moreover, this research project is a part of a bigger sequencing project of mtDNAs in many different Metazoans’ taxa, and I mostly dedicated myself to sequence and analyze mtDNAs in selected taxa of bivalves and hexapods (Insecta). Sequences of bivalve mtDNAs are particularly limited, and my study contributed to extend the sampling. Moreover, I used the bivalve Musculista senhousia as model taxon to investigate the molecular mechanisms and the evolutionary significance of their aberrant mode of mitochondrial inheritance (Doubly Uniparental Inheritance, see below). In Insects, I focused my attention on the Genus Bacillus (Insecta Phasmida). A detailed phylogenetic analysis was performed in order to assess phylogenetic relationships within the genus, and to investigate the placement of Phasmida in the phylogenetic tree of Insecta. The main goal of this part of my study was to add to the taxonomic coverage of sequenced mtDNAs in basal insects, which were only partially analyzed.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The DNA topology is an important modifier of DNA functions. Torsional stress is generated when right handed DNA is either over- or underwound, producing structural deformations which drive or are driven by processes such as replication, transcription, recombination and repair. DNA topoisomerases are molecular machines that regulate the topological state of the DNA in the cell. These enzymes accomplish this task by either passing one strand of the DNA through a break in the opposing strand or by passing a region of the duplex from the same or a different molecule through a double-stranded cut generated in the DNA. Because of their ability to cut one or two strands of DNA they are also target for some of the most successful anticancer drugs used in standard combination therapies of human cancers. An effective anticancer drug is Camptothecin (CPT) that specifically targets DNA topoisomerase 1 (TOP 1). The research project of the present thesis has been focused on the role of human TOP 1 during transcription and on the transcriptional consequences associated with TOP 1 inhibition by CPT in human cell lines. Previous findings demonstrate that TOP 1 inhibition by CPT perturbs RNA polymerase (RNAP II) density at promoters and along transcribed genes suggesting an involvement of TOP 1 in RNAP II promoter proximal pausing site. Within the transcription cycle, promoter pausing is a fundamental step the importance of which has been well established as a means of coupling elongation to RNA maturation. By measuring nascent RNA transcripts bound to chromatin, we demonstrated that TOP 1 inhibition by CPT can enhance RNAP II escape from promoter proximal pausing site of the human Hypoxia Inducible Factor 1 (HIF-1) and c-MYC genes in a dose dependent manner. This effect is dependent from Cdk7/Cdk9 activities since it can be reversed by the kinases inhibitor DRB. Since CPT affects RNAP II by promoting the hyperphosphorylation of its Rpb1 subunit the findings suggest that TOP 1inhibition by CPT may increase the activity of Cdks which in turn phosphorylate the Rpb1 subunit of RNAP II enhancing its escape from pausing. Interestingly, the transcriptional consequences of CPT induced topological stress are wider than expected. CPT increased co-transcriptional splicing of exon1 and 2 and markedly affected alternative splicing at exon 11. Surprisingly despite its well-established transcription inhibitory activity, CPT can trigger the production of a novel long RNA (5’aHIF-1) antisense to the human HIF-1 mRNA and a known antisense RNA at the 3’ end of the gene, while decreasing mRNA levels. The effects require TOP 1 and are independent from CPT induced DNA damage. Thus, when the supercoiling imbalance promoted by CPT occurs at promoter, it may trigger deregulation of the RNAP II pausing, increased chromatin accessibility and activation/derepression of antisense transcripts in a Cdks dependent manner. A changed balance of antisense transcripts and mRNAs may regulate the activity of HIF-1 and contribute to the control of tumor progression After focusing our TOP 1 investigations at a single gene level, we have extended the study to the whole genome by developing the “Topo-Seq” approach which generates a map of genome-wide distribution of sites of TOP 1 activity sites in human cells. The preliminary data revealed that TOP 1 preferentially localizes at intragenic regions and in particular at 5’ and 3’ ends of genes. Surprisingly upon TOP 1 downregulation, which impairs protein expression by 80%, TOP 1 molecules are mostly localized around 3’ ends of genes, thus suggesting that its activity is essential at these regions and can be compensate at 5’ ends. The developed procedure is a pioneer tool for the detection of TOP 1 cleavage sites across the genome and can open the way to further investigations of the enzyme roles in different nuclear processes.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Here I will focus on three main topics that best address and include the projects I have been working in during my three year PhD period that I have spent in different research laboratories addressing both computationally and practically important problems all related to modern molecular genomics. The first topic is the use of livestock species (pigs) as a model of obesity, a complex human dysfunction. My efforts here concern the detection and annotation of Single Nucleotide Polymorphisms. I developed a pipeline for mining human and porcine sequences. Starting from a set of human genes related with obesity the platform returns a list of annotated porcine SNPs extracted from a new set of potential obesity-genes. 565 of these SNPs were analyzed on an Illumina chip to test the involvement in obesity on a population composed by more than 500 pigs. Results will be discussed. All the computational analysis and experiments were done in collaboration with the Biocomputing group and Dr.Luca Fontanesi, respectively, under the direction of prof. Rita Casadio at the Bologna University, Italy. The second topic concerns developing a methodology, based on Factor Analysis, to simultaneously mine information from different levels of biological organization. With specific test cases we develop models of the complexity of the mRNA-miRNA molecular interaction in brain tumors measured indirectly by microarray and quantitative PCR. This work was done under the supervision of Prof. Christine Nardini, at the “CAS-MPG Partner Institute for Computational Biology” of Shangai, China (co-founded by the Max Planck Society and the Chinese Academy of Sciences jointly) The third topic concerns the development of a new method to overcome the variety of PCR technologies routinely adopted to characterize unknown flanking DNA regions of a viral integration locus of the human genome after clinical gene therapy. This new method is entirely based on next generation sequencing and it reduces the time required to detect insertion sites, decreasing the complexity of the procedure. This work was done in collaboration with the group of Dr. Manfred Schmidt at the Nationales Centrum für Tumorerkrankungen (Heidelberg, Germany) supervised by Dr. Annette Deichmann and Dr. Ali Nowrouzi. Furthermore I add as an Appendix the description of a R package for gene network reconstruction that I helped to develop for scientific usage (http://www.bioconductor.org/help/bioc-views/release/bioc/html/BUS.html).

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Grape berry is considered a non climacteric fruit, but there are some evidences that ethylene plays a role in the control of berry ripening. This PhD thesis aimed to give insights in the role of ethylene and ethylene-related genes in the regulation of grape berry ripening. During this study a small increase in ethylene concentration one week before véraison has been measured in Vitis vinifera L. ‘Pinot Noir’ grapes confirming previous findings in ‘Cabernet Sauvignon’. In addition, ethylene-related genes have been identified in the grapevine genome sequence. Similarly to other species, biosynthesis and ethylene receptor genes are present in grapevine as multi-gene families and their expression appeared tissue or developmental specific. All the other elements of the ethylene signal transduction cascade were also identified in the grape genome. Among them, there were ethylene response factors (ERF) which modulate the transcription of many effector genes in response to ethylene. In this study seven grapevine ERFs have been characterized and they showed tissue and berry development specific expression profiles. Two sequences, VvERF045 and VvERF063, seemed likely involved in berry ripening control due to their expression profiles and their sequence annotation. VvERF045 was induced before véraison and was specific of the ripe berry, by sequence similarity it was likely a transcription activator. VvERF063 displayed high sequence similarity to repressors of transcription and its expression, very high in green berries, was lowest at véraison and during ripening. To functionally characterize VvERF045 and VvERF063, a stable transformation strategy was chosen. Both sequences were cloned in vectors for over-expression and silencing and transferred in grape by Agrobacterium-mediated or biolistic-mediated gene transfer. In vitro, transgenic VvERF045 over-expressing plants displayed an epinastic phenotype whose extent was correlated to the transgene expression level. Four pathogen stress response genes were significantly induced in the transgenic plants, suggesting a putative function of VvERF045 in biotic stress defense during berry ripening. Further molecular analysis on the transgenic plants will help in identifying the actual VvERF045 target genes and together with the phenotypic characterization of the adult transgenic plants, will allow to extensively define the role of VvERF045 in berry ripening.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The objective of this work is to characterize the genome of the chromosome 1 of A.thaliana, a small flowering plants used as a model organism in studies of biology and genetics, on the basis of a recent mathematical model of the genetic code. I analyze and compare different portions of the genome: genes, exons, coding sequences (CDS), introns, long introns, intergenes, untranslated regions (UTR) and regulatory sequences. In order to accomplish the task, I transformed nucleotide sequences into binary sequences based on the definition of the three different dichotomic classes. The descriptive analysis of binary strings indicate the presence of regularities in each portion of the genome considered. In particular, there are remarkable differences between coding sequences (CDS and exons) and non-coding sequences, suggesting that the frame is important only for coding sequences and that dichotomic classes can be useful to recognize them. Then, I assessed the existence of short-range dependence between binary sequences computed on the basis of the different dichotomic classes. I used three different measures of dependence: the well-known chi-squared test and two indices derived from the concept of entropy i.e. Mutual Information (MI) and Sρ, a normalized version of the “Bhattacharya Hellinger Matusita distance”. The results show that there is a significant short-range dependence structure only for the coding sequences whose existence is a clue of an underlying error detection and correction mechanism. No doubt, further studies are needed in order to assess how the information carried by dichotomic classes could discriminate between coding and noncoding sequence and, therefore, contribute to unveil the role of the mathematical structure in error detection and correction mechanisms. Still, I have shown the potential of the approach presented for understanding the management of genetic information.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Il progresso tecnologico nel campo della biologia molecolare, pone la comunità scientifica di fronte all’esigenza di dare un’interpretazione all’enormità di sequenze biologiche che a mano a mano vanno a costituire le banche dati, siano esse proteine o acidi nucleici. In questo contesto la bioinformatica gioca un ruolo di primaria importanza. Un nuovo livello di possibilità conoscitive è stato introdotto con le tecnologie di Next Generation Sequencing (NGS), per mezzo delle quali è possibile ottenere interi genomi o trascrittomi in poco tempo e con bassi costi. Tra le applicazioni del NGS più rilevanti ci sono senza dubbio quelle oncologiche che prevedono la caratterizzazione genomica di tessuti tumorali e lo sviluppo di nuovi approcci diagnostici e terapeutici per il trattamento del cancro. Con l’analisi NGS è possibile individuare il set completo di variazioni che esistono nel genoma tumorale come varianti a singolo nucleotide, riarrangiamenti cromosomici, inserzioni e delezioni. Va però sottolineato che le variazioni trovate nei geni vanno in ultima battuta osservate dal punto di vista degli effetti a livello delle proteine in quanto esse sono le responsabili più dirette dei fenotipi alterati riscontrabili nella cellula tumorale. L’expertise bioinformatica va quindi collocata sia a livello dell’analisi del dato prodotto per mezzo di NGS ma anche nelle fasi successive ove è necessario effettuare l’annotazione dei geni contenuti nel genoma sequenziato e delle relative strutture proteiche che da esso sono espresse, o, come nel caso dello studio mutazionale, la valutazione dell’effetto della variazione genomica. È in questo contesto che si colloca il lavoro presentato: da un lato lo sviluppo di metodologie computazionali per l’annotazione di sequenze proteiche e dall’altro la messa a punto di una pipeline di analisi di dati prodotti con tecnologie NGS in applicazioni oncologiche avente come scopo finale quello della individuazione e caratterizzazione delle mutazioni genetiche tumorali a livello proteico.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The aim of this work was to identify markers associated with production traits in the pig genome using different approaches. We focused the attention on Italian Large White pig breed using Genome Wide Association Studies (GWAS) and applying a selective genotyping approach to increase the power of the analyses. Furthermore, we searched the pig genome using Next Generation Sequencing (NSG) Ion Torrent Technology to combine selective genotyping approach and deep sequencing for SNP discovery. Other two studies were carried on with a different approach. Allele frequency changes for SNPs affecting candidate genes and at Genome Wide level were analysed to identify selection signatures driven by selection program during the last 20 years. This approach confirmed that a great number of markers may affect production traits and that they are captured by the classical selection programs. GWAS revealed 123 significant or suggestively significant SNP associated with Back Fat Thickenss and 229 associated with Average Daily Gain. 16 Copy Number Variant Regions resulted more frequent in lean or fat pigs and showed that different copies of those region could have a limited impact on fat. These often appear to be involved in food intake and behavior, beside affecting genes involved in metabolic pathways and their expression. By combining NGS sequencing with selective genotyping approach, new variants where discovered and at least 54 are worth to be analysed in association studies. The study of groups of pigs undergone to stringent selection showed that allele frequency of some loci can drastically change if they are close to traits that are interesting for selection schemes. These approaches could be, in future, integrated in genomic selection plans.