11 results for GENOME SEQUENCING
in AMS Tesi di Dottorato - Alm@DL - Università di Bologna
Abstract:
Legionella is a Gram-negative bacterium that represents a public health issue with a heavy social and economic impact. It is therefore essential to provide a proper environmental surveillance and risk assessment plan for Legionella control in the water distribution systems of hospital and community buildings. This thesis combines several methodologies in a single workflow for the identification of non-pneumophila Legionella species (n-pL), starting from standard methods such as culture and gene sequencing (mip and rpoB) and moving on to innovative approaches such as the MALDI-TOF MS technique and whole genome sequencing (WGS). The results obtained were compared to identify the Legionella isolates and led to the presumptive identification of four novel Legionella species. One of these four new isolates was characterized and taxonomically recognized under the name Legionella bononiensis (the 64th Legionella species). The workflow applied in this thesis helps to increase knowledge of environmental Legionella species, improving the description of the environment itself and of the events that promote the growth of Legionella in its ecological niche. The correct identification and characterization of the isolates makes it possible to prevent their spread in man-made environments and to contain the occurrence of cases, clusters, or outbreaks. The experimental work undertaken could therefore support preventive measures during environmental and clinical surveillance, improving the study of species that are often underestimated or still unknown.
Abstract:
The artisanal food chain is enriched by a wide diversity of local food productions with delightful organoleptic characteristics and valuable nutritional properties. Despite their increasing worldwide popularity and appeal, artisanal facilities face several food safety challenges because their processing conditions are less standardized. In this scenario, recent advances in molecular typing and genomic surveillance (e.g., Whole Genome Sequencing [WGS]) represent an unprecedented solution, capable of inferring sources of contamination and of contributing to food safety along the artisanal food continuum. The overall objective of this PhD thesis was to explore potential microbial hazards among different artisanal food productions of animal origin (dairy and meat-derived) typical of the food culture and heritage of Mediterranean countries. Three studies were carried out, with the following aims: 1) to compare the seasonal variability of microbiological quality and the potential occurrence of microbial hazards in two batches of Italian artisanal fermented dairy and meat productions; 2) to investigate the genetic relationships, virulome and resistome of foodborne pathogens isolated from dairy and meat-derived productions located in Italy, Spain, Portugal and Morocco; 3) to investigate the population structure, virulome, resistome and mobilome of Klebsiella spp. isolates collected in study 1, together with an extended range of public sequences.
Abstract:
The continuous increase in genome sequencing projects has produced a huge amount of data in the last 10 years: currently more than 600 prokaryotic and 80 eukaryotic genomes are fully sequenced and publicly available. However, sequencing a genome yields only raw nucleotide sequences. This is just the first step of the genome annotation process, which deals with assigning biological information to each sequence. Annotation is carried out at every level of the biological information processing mechanism, from DNA to protein, and cannot be accomplished by in vitro analysis alone, which is extremely expensive and time consuming when applied at such a large scale. Thus, in silico methods need to be used to accomplish the task. The aim of this work was the implementation of predictive computational methods to allow fast, reliable, and automated annotation of genomes and proteins starting from amino acid sequences. The first part of the work focused on the implementation of a new machine learning based method for the prediction of the subcellular localization of soluble eukaryotic proteins. The method is called BaCelLo and was developed in 2006. Its main peculiarity is that it is independent of biases present in the training dataset, which cause the over-prediction of the most represented examples in all the other predictors developed so far. This important result was achieved through a modification that I made to the standard Support Vector Machine (SVM) algorithm, creating the so-called Balanced SVM. BaCelLo is able to predict the most important subcellular localizations in eukaryotic cells, and three kingdom-specific predictors were implemented. In two extensive comparisons, carried out in 2006 and 2008, BaCelLo was reported to outperform all the currently available state-of-the-art methods for this prediction task.
BaCelLo was subsequently used to completely annotate 5 eukaryotic genomes by integrating it into a pipeline of predictors developed at the Bologna Biocomputing group by Dr. Pier Luigi Martelli and Dr. Piero Fariselli. An online database, called eSLDB, was developed by integrating, for each amino acid sequence extracted from the genome, the predicted subcellular localization with experimental and similarity-based annotations. In the second part of the work, a new machine learning based method was implemented for the prediction of GPI-anchored proteins. The method efficiently predicts, from the raw amino acid sequence, both the presence of the GPI anchor (by means of an SVM) and the position in the sequence of the post-translational modification event, the so-called ω-site (by means of a Hidden Markov Model (HMM)). The method is called GPIPE and was reported to greatly improve the prediction of GPI-anchored proteins over all previously developed methods. GPIPE was able to predict up to 88% of the experimentally annotated GPI-anchored proteins while maintaining a false positive rate as low as 0.1%. GPIPE was used to completely annotate 81 eukaryotic genomes, and more than 15000 putative GPI-anchored proteins were predicted, 561 of which are found in H. sapiens. On average, 1% of a proteome is predicted to be GPI-anchored. A statistical analysis of the composition of the regions surrounding the ω-site allowed the definition of specific amino acid abundances in the different regions considered. Furthermore, the hypothesis proposed in the literature that compositional biases are present among the four major eukaryotic kingdoms was tested and rejected. All the developed predictors and databases are freely available at: BaCelLo http://gpcr.biocomp.unibo.it/bacello eSLDB http://gpcr.biocomp.unibo.it/esldb GPIPE http://gpcr.biocomp.unibo.it/gpipe
Abstract:
Motivation: A topical issue of great interest, from both a theoretical and an applied perspective, is the analysis of biological sequences to disclose the information that they encode. The development of new technologies for genome sequencing in recent years has opened new fundamental problems, since huge amounts of biological data still await interpretation. Indeed, sequencing is only the first step of the genome annotation process, which consists in the assignment of biological information to each sequence. Hence, given the large amount of available data, in silico methods have become useful and necessary in order to extract relevant information from sequences. The availability of data from Genome Projects has given rise to new strategies for tackling the basic problems of computational biology, such as the determination of the three-dimensional structures of proteins, their biological function and their reciprocal interactions. Results: The aim of this work has been the implementation of predictive methods that allow the extraction of information on the properties of genomes and proteins starting from the nucleotide and amino acid sequences, by taking advantage of the information provided by the comparison of genome sequences from different species. In the first part of the work, a comprehensive large-scale genome comparison of 599 organisms is described. 2.6 million sequences from 551 prokaryotic and 48 eukaryotic genomes were aligned and clustered on the basis of their sequence identity. This procedure led to the identification of classes of proteins that are peculiar to the different groups of organisms. Moreover, the adopted similarity threshold produced clusters that are homogeneous from the structural point of view and that can be used for structural annotation of uncharacterized sequences.
The second part of the work focuses on the characterization of thermostable proteins and on the development of tools able to predict the thermostability of a protein starting from its sequence. By means of Principal Component Analysis, the codon composition of a non-redundant database comprising 116 prokaryotic genomes has been analyzed, and it has been shown that a cross-genomic approach can allow the extraction of common determinants of thermostability at the genome level, leading to an overall accuracy of 95% in discriminating thermophilic coding sequences. This result outperforms those obtained in previous studies. Moreover, we investigated the effect of multiple mutations on protein thermostability. This issue is of great importance in the field of protein engineering, since thermostable proteins are generally more suitable than their mesostable counterparts in technological applications. A Support Vector Machine based method has been trained to predict whether a set of mutations can enhance the thermostability of a given protein sequence. The developed predictor achieves 88% accuracy.
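As an illustration of the kind of input a codon composition analysis needs, the sketch below turns a coding sequence into a 64-dimensional vector of relative codon frequencies; vectors like this could then be passed to a PCA. This is a minimal sketch and not the thesis's actual procedure: the function name `codon_frequencies`, the fixed codon ordering and the choice to skip codons with ambiguous bases are all assumptions made for the example.

```python
from collections import Counter
from itertools import product

# Fixed ordering of the 64 codons (an assumption of this sketch).
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
CODON_SET = set(CODONS)


def codon_frequencies(cds):
    """Return the relative frequency of each of the 64 codons in a coding sequence.

    The sequence is read in frame from position 0; a trailing partial codon
    and codons containing ambiguous bases (e.g. N) are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_SET)
    total = sum(counts.values())
    return [counts[c] / total if total else 0.0 for c in CODONS]
```

Each genome or coding sequence then becomes one row of a matrix with 64 columns, which is the usual starting point for a principal component analysis.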
Abstract:
This PhD thesis is the result of my research activity over the last three years. My main research interest centered on the evolution of the mitochondrial genome (mtDNA) and on its usefulness as a phylogeographic and phylogenetic marker at different taxonomic levels in different taxa of Metazoa. From a methodological standpoint, my main effort was dedicated to the sequencing of complete mitochondrial genomes, and the approach to whole-genome sequencing was based on the application of Long-PCR and shotgun sequencing. Moreover, this research project is part of a larger mtDNA sequencing project covering many different metazoan taxa, and I mostly dedicated myself to sequencing and analyzing mtDNAs in selected taxa of bivalves and hexapods (Insecta). Sequences of bivalve mtDNAs are particularly limited, and my study contributed to extending the sampling. Moreover, I used the bivalve Musculista senhousia as a model taxon to investigate the molecular mechanisms and the evolutionary significance of its aberrant mode of mitochondrial inheritance (Doubly Uniparental Inheritance, see below). In insects, I focused my attention on the genus Bacillus (Insecta Phasmida). A detailed phylogenetic analysis was performed in order to assess phylogenetic relationships within the genus and to investigate the placement of Phasmida in the phylogenetic tree of Insecta. The main goal of this part of my study was to add to the taxonomic coverage of sequenced mtDNAs in basal insects, which had been only partially analyzed.
Abstract:
In the past decade, the advent of efficient genome sequencing tools and high-throughput experimental biotechnology has led to enormous progress in the life sciences. Among the most important innovations is microarray technology. It makes it possible to quantify the expression of thousands of genes simultaneously by measuring the hybridization from a tissue of interest to probes on a small glass or plastic slide. The characteristics of these data include a fair amount of random noise, a predictor dimension in the thousands, and a sample size in the dozens. One of the most exciting areas to which microarray technology has been applied is the challenge of deciphering complex diseases such as cancer. In these studies, samples are taken from two or more groups of individuals with heterogeneous phenotypes, pathologies, or clinical outcomes. These samples are hybridized to microarrays in an effort to find a small number of genes that are strongly correlated with the group of individuals. Even though methods to analyse the data are now well developed and close to reaching a standard organization (through the effort of dedicated international projects such as the Microarray Gene Expression Data (MGED) Society [1]), it is not infrequent to stumble upon a clinician's question for which no compelling statistical method is available. The contribution of this dissertation to deciphering disease is the development of new approaches aimed at handling open problems posed by clinicians in specific experimental designs. In Chapter 1, starting from a necessary biological introduction, we review microarray technologies and all the important steps involved in an experiment, from the production of the array to the quality controls, ending with the preprocessing steps that will be used in the data analysis in the rest of the dissertation.
Chapter 2 provides a critical review of standard analysis methods, stressing most of the problems that ... In Chapter 3, a method is introduced to address the issue of unbalanced design in microarray experiments. In microarray experiments, experimental design is a crucial starting point for obtaining reasonable results. In a two-class problem, an equal or similar number of samples should be collected for the two classes. However, in some cases, e.g. rare pathologies, the approach to be taken is less evident. We propose to address this issue by applying a modified version of SAM [2]. MultiSAM consists of a reiterated application of a SAM analysis, comparing the less populated class (LPC) with 1,000 random samplings of the same size from the more populated class (MPC). A list of the differentially expressed genes is generated for each SAM application. After 1,000 reiterations, each single probe is given a "score" ranging from 0 to 1,000 based on its recurrence in the 1,000 lists as differentially expressed. The performance of MultiSAM was compared to the performance of SAM and LIMMA [3] over two simulated data sets generated via beta and exponential distributions. The results of all three algorithms over low-noise data sets seem acceptable. However, on a real unbalanced two-channel data set regarding Chronic Lymphocytic Leukemia, LIMMA finds no significant probe, SAM finds 23 significantly changed probes but cannot separate the two classes, while MultiSAM finds 122 probes with score >300 and separates the data into two clusters by hierarchical clustering. We also report extra-assay validation in terms of differentially expressed genes. Although standard algorithms perform well over low-noise simulated data sets, MultiSAM seems to be the only one able to reveal subtle differences in gene expression profiles on real unbalanced data. In Chapter 4, a method to address the evaluation of similarities in a three-class problem by means of the Relevance Vector Machine [4] is described.
In fact, when looking at microarray data in a prognostic and diagnostic clinical framework, differences are not the only thing that can play a crucial role. In some cases similarities can give useful and sometimes even more important information. The goal, given three classes, could be to establish, with a certain level of confidence, whether the third one is similar to the first or to the second. In this work we show that the Relevance Vector Machine (RVM) [2] could be a possible solution to the limitations of standard supervised classification. In fact, RVM offers many advantages compared, for example, with its well-known precursor (Support Vector Machine - SVM [3]). Among these advantages, the estimate of the posterior probability of class membership represents a key feature for addressing the similarity issue. This is a highly important, but often overlooked, option of any practical pattern recognition system. We focused on a three-class tumor grade problem, with 67 samples of grade 1 (G1), 54 samples of grade 3 (G3) and 100 samples of grade 2 (G2). The goal is to find a model able to separate G1 from G3, and then to evaluate the third class, G2, as a test set to obtain the probability that samples of G2 are members of class G1 or class G3. The analysis showed that breast cancer samples of grade 2 have a molecular profile more similar to breast cancer samples of grade 1. This result had been conjectured in the literature, but no measure of significance had been given before.
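The resampling scheme that this abstract describes for MultiSAM can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: a Welch t statistic with a fixed cut-off stands in for the SAM test, and the names `welch_t`, `resampling_scores`, `threshold` and `seed` are hypothetical. Only the overall structure (repeated same-size random draws from the more populated class, and a per-probe count of how often the probe is called differentially expressed) follows the description in the abstract.

```python
import random
import statistics


def welch_t(a, b):
    """Welch t statistic for two samples; a stand-in here for the SAM statistic."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    if se == 0:
        return 0.0
    return (statistics.mean(a) - statistics.mean(b)) / se


def resampling_scores(lpc, mpc, n_iter=1000, threshold=3.0, seed=0):
    """Score each probe by how often it is called differentially expressed.

    lpc, mpc: lists of samples from the less and more populated class;
    each sample is a list of probe values in the same probe order.
    Returns one integer per probe, between 0 and n_iter.
    """
    rng = random.Random(seed)
    n_probes = len(lpc[0])
    lpc_by_probe = [[sample[p] for sample in lpc] for p in range(n_probes)]
    scores = [0] * n_probes
    for _ in range(n_iter):
        # Draw a random subset of the larger class with the same size as the LPC.
        subset = rng.sample(mpc, len(lpc))
        for p in range(n_probes):
            mpc_values = [sample[p] for sample in subset]
            if abs(welch_t(lpc_by_probe[p], mpc_values)) >= threshold:
                scores[p] += 1
    return scores
```

Probes can then be ranked by score, and a cut-off on the score (the abstract mentions score >300 out of 1,000) selects the probes that are retained.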
Abstract:
This thesis presents AMR phenotypic evaluation and whole genome sequencing analysis of 288 Escherichia coli strains isolated from different sources (livestock, companion animal, wildlife, food and human) in Italy. Our data reflect general resistance trends in Europe, with resistance to tetracycline, ampicillin, sulfisoxazole and aminoglycosides as the most common phenotypic AMR profile among livestock, pets, wildlife and humans. The identification of human and animal (livestock and companion animal) AMR profiles in niches with rare (fishery, mollusc) or absent (vegetable, wild animal, wild boar) direct exposure to antimicrobials suggests widespread environmental pollution with ARGs conferring resistance to these antimicrobials. Phenotypic resistance to highest priority critically important antimicrobials was mainly observed in food-producing animals and related food such as rabbit, poultry, beef and swine. Discrepancies between the AMR phenotypic pattern and the genetic profile were observed. In particular, phenotypic resistance to aminoglycosides, cephalosporins, meropenem and colistin, as well as the ESBL profile, had no genetic explanation in several cases. These data could suggest the diffusion of new genetic variants of ARGs associated with these antimicrobial classes. Generally, our collection shows a virulence profile typical of the extraintestinal pathogenic Escherichia coli (ExPEC) pathotype. Different pandemic and emerging ExPEC lineages were identified, in particular in poultry meat (ST10, ST23, ST69, ST117, ST131). Rabbit was suggested as a source of ST20-ST40 potential hybrid pathogens. Wildlife carried a high average number (10) of VAGs (mostly associated with the ExPEC pathotype) and different predominant ExPEC lineages (ST23, ST117, ST648), suggesting its possible involvement in the maintenance and diffusion of virulence determinants. In conclusion, our study provides important knowledge on the phenotypic and genetic AMR and virulence profiles circulating in E. coli in Italy.
The role of different niches in AMR dynamics has been discussed. In particular, food-producing animals are worthy of continued investigation as a source of potential zoonotic pathogens, while wildlife might contribute to the spread of VAGs.
Abstract:
Autism Spectrum Disorder (ASD) is a heterogeneous and highly heritable neurodevelopmental disorder with a complex genetic architecture, consisting of a combination of common low-risk and more penetrant rare variants. This PhD project aimed to explore the contribution of rare variants in ASD susceptibility through NGS approaches in a cohort of 106 ASD families including 125 ASD individuals. Firstly, I explored the contribution of inherited rare variants towards the ASD phenotype in a girl with a maternally inherited pathogenic NRXN1 deletion. Whole exome sequencing of the trio family identified an increased burden of deleterious variants in the proband that could modulate the CNV penetrance and determine the disease development. In the second part of the project, I investigated the role of rare variants emerging from whole genome sequencing in ASD aetiology. To properly manage and analyse sequencing data, a robust and efficient variant filtering and prioritization pipeline was developed, and by its application a stringent set of rare recessive-acting and ultra-rare variants was obtained. As a first follow-up, I performed a preliminary analysis on de novo variants, identifying the most likely deleterious variants and highlighting candidate genes for further analyses. In the third part of the project, considering the well-established involvement of calcium signalling in the molecular bases of ASD, I investigated the role of rare variants in voltage-gated calcium channels genes, that mainly regulate intracellular calcium concentration, and whose alterations have been correlated with enhanced ASD risk. Specifically, I functionally tested the effect of rare damaging variants identified in CACNA1H, showing that CACNA1H variation may be involved in ASD development by additively combining with other high risk variants. 
This project highlights the challenges in the analysis and interpretation of variants from NGS analysis in ASD, and underlines the importance of a comprehensive assessment of the genomic landscape of ASD individuals.
Abstract:
Pathogenic aberrations in homologous recombination DNA repair (HRR) genes occur in approximately 1 in 4 men with advanced prostate cancer (PCa). Treatment with PARP inhibitors (PARPi) has recently been introduced for metastatic castration-resistant PCa patients, increasing clinicians' interest in the molecular characterization of all PCa patients. The limitations of using old, low-quality tumor tissue for genetic analysis, which is very common for PCa, can be overcome by using liquid biopsy as an alternative biomarker source. In this study, we aimed to evaluate the detection of molecular alterations in HRR genes in liquid biopsy compared with tumor tissue from PCa patients. Secondarily, we explored the genomic instability score (GIS) and a broader range of gene alterations for in-depth characterization of the PCa cohort. Plasma samples were collected from 63 patients with PCa. Sophia Homologous Recombination Solution (targeting 16 HRR genes) and shallow whole genome sequencing (sWGS) were used for genomic analysis of tissue DNA and circulating tumor DNA (ctDNA). A total of 33 alterations (mainly in TP53, ATM, CHEK2, CDK12, and BRCA1/2) were identified in the plasma of 28.5% of PCa patients. By integrating the mutational and sWGS data, the HRR status of PCa patients was determined, and a concordance of 85.7% with tumor tissue was found. A median GIS of 15 was obtained, reaching a score of 63 in 2 samples with double alterations in BRCA1 and TP53. We explored the PCa mutation landscape, and the most significantly enriched pathways identified were sphingosine 1-phosphate (S1P) receptor signaling and the PI3K-AKT-mTOR pathway. HRR analysis on FFPE and liquid biopsy samples showed high concordance, demonstrating that noninvasive ctDNA-enriched plasma can be an optimal alternative source for molecular SNV and CNV analysis.
In addition, the evaluation of GIS and pathway interaction should be considered for more comprehensive molecular characterization in PCa patients.
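A concordance figure of the kind reported in this abstract is typically a percent agreement between paired calls. The sketch below shows one simple way to compute it; it is an assumption that the study used plain percent agreement, and the function name `percent_agreement` and the convention of using `None` for a missing call are hypothetical.

```python
def percent_agreement(tissue_status, plasma_status):
    """Percentage of patients whose HRR status agrees between tissue and plasma.

    Both arguments are equally ordered lists with one entry per patient:
    True (altered), False (not altered) or None (no result available).
    Pairs with a missing result on either side are excluded.
    """
    paired = [
        (t, p)
        for t, p in zip(tissue_status, plasma_status)
        if t is not None and p is not None
    ]
    if not paired:
        raise ValueError("no patient has both a tissue and a plasma result")
    matches = sum(1 for t, p in paired if t == p)
    return 100.0 * matches / len(paired)
```

Percent agreement does not correct for agreement expected by chance; a statistic such as Cohen's kappa would be the usual complement when that matters.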
Abstract:
Pediatric acute myeloid leukemia (AML) is a molecularly heterogeneous disease that arises from genetic alterations in pathways that regulate self-renewal and myeloid differentiation. While the majority of patients carry recurrent chromosomal translocations, almost 20% of childhood AML cases do not show any recognizable cytogenetic alteration and are defined as cytogenetically normal (CN)-AML. CN-AML patients have always shown great variability in response to therapy and overall outcome, underlining the presence of unknown genetic changes that are not detectable by conventional analyses but are relevant for the pathogenesis and outcome of AML. The development of novel genome-wide techniques such as next-generation sequencing has tremendously improved our ability to interrogate the cancer genome. Against this background, the aim of this research study was to investigate, through whole-transcriptome sequencing (RNA-seq), the mutational landscape of pediatric CN-AML patients negative for all the currently known somatic mutations reported in AML. RNA-seq performed on diagnostic leukemic blasts from 19 pediatric CN-AML cases revealed a considerable incidence of cryptic chromosomal rearrangements, with the identification of 21 putative fusion genes. Several of the fusion genes identified in this study are recurrent and might have prognostic and/or therapeutic relevance. A paradigm of this is the CBFA2T3-GLIS2 fusion, which has been demonstrated to be a common alteration in pediatric CN-AML, predicting poor outcome. Important findings have also been obtained in the identification of novel therapeutic targets. On the one hand, the identification of the NUP98-JARID1A fusion suggests the use of disulfiram; on the other, we describe alterations activating tyrosine kinases and provide functional data supporting the use of tyrosine kinase inhibitors to specifically inhibit leukemia cells.
This study provides new insights in the knowledge of genetic alterations underlying pediatric AML, defines novel prognostic markers and putative therapeutic targets, and prospectively ensures a correct risk stratification and risk-adapted therapy also for the “all-neg” AML subgroup.
Abstract:
The aim of this work was to identify markers associated with production traits in the pig genome using different approaches. We focused on the Italian Large White pig breed, using Genome Wide Association Studies (GWAS) and applying a selective genotyping approach to increase the power of the analyses. Furthermore, we searched the pig genome using Next Generation Sequencing (NGS) Ion Torrent technology, combining the selective genotyping approach with deep sequencing for SNP discovery. Two other studies were carried out with a different approach: allele frequency changes for SNPs affecting candidate genes, and at the genome-wide level, were analysed to identify selection signatures driven by the selection program over the last 20 years. This approach confirmed that a great number of markers may affect production traits and that they are captured by classical selection programs. GWAS revealed 123 significant or suggestively significant SNPs associated with Back Fat Thickness and 229 associated with Average Daily Gain. Sixteen Copy Number Variant Regions were more frequent in lean or fat pigs, showing that different copies of those regions could have a limited impact on fat. These often appear to be involved in food intake and behavior, besides affecting genes involved in metabolic pathways and their expression. By combining NGS with the selective genotyping approach, new variants were discovered, and at least 54 are worth analysing in association studies. The study of groups of pigs that underwent stringent selection showed that the allele frequency of some loci can change drastically if they are close to traits that are of interest for selection schemes. These approaches could, in future, be integrated into genomic selection plans.
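The allele frequency comparison mentioned in this abstract can be illustrated with a short calculation. This is a minimal sketch that assumes biallelic SNP genotypes written as two-character strings such as "AG"; the function names and the input format are hypothetical and are not taken from the thesis.

```python
def allele_frequency(genotypes, allele):
    """Frequency of one allele among diploid genotypes given as strings like 'AG'."""
    alleles = "".join(genotypes)
    if not alleles:
        raise ValueError("no genotypes supplied")
    return alleles.count(allele) / len(alleles)


def frequency_change(old_genotypes, new_genotypes, allele):
    """Change in allele frequency between an earlier and a later group of animals."""
    return allele_frequency(new_genotypes, allele) - allele_frequency(old_genotypes, allele)
```

A large change at a locus between animals sampled at the start and at the end of a selection program is the kind of signal a selection signature analysis looks for, although drift and sampling error also have to be ruled out before attributing it to selection.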