8 results for SEQUENCE DATA
in AMS Tesi di Dottorato - Alm@DL - Università di Bologna
Abstract:
Self-incompatibility (SI) systems have evolved in many flowering plants to prevent self-fertilization and thus promote outbreeding. Pear and apple, like many species belonging to the Rosaceae, exhibit RNase-mediated gametophytic self-incompatibility (GSI), a widespread system also found in the Solanaceae and Plantaginaceae. For this reason, pear orchards must contain at least two different cultivars that pollenize each other; to guarantee efficient cross-pollination, they should have overlapping flowering periods and must be genetically compatible. This compatibility is determined by the S-locus, which contains at least two genes encoding a female (pistil) and a male (pollen) determinant. The female determinant in the Rosaceae, Solanaceae and Plantaginaceae system is a stylar glycoprotein with ribonuclease activity (S-RNase) that acts as a specific cytotoxin in incompatible pollen tubes, degrading cellular RNAs. Since its identification, the S-RNase gene has been intensively studied, and the sequences of a large number of alleles are available in online databases. By contrast, the male determinant has only recently been identified as a pollen-expressed protein containing an F-box motif, called S-Locus F-box (abbreviated SLF or SFB). Since F-box proteins are best known for their participation in the SCF (Skp1 - Cullin - F-box) E3 ubiquitin ligase complex, which is involved in protein degradation through the 26S proteasome pathway, the male determinant is thought to mediate the ubiquitination of the S-RNases, targeting them for degradation in compatible pollen tubes. Attempts to clone SLF/SFB genes in the Pyrinae produced no results until very recently; in apple, the use of genomic libraries allowed the detection of two F-box genes linked to each S haplotype, called SFBB (S-locus F-Box Brothers). In Japanese pear, three SFBB genes linked to each haplotype were cloned from pollen cDNA.
The SFBB genes exhibit S haplotype-specific sequence divergence and pollen-specific expression; their multiplicity is a feature whose interpretation is unclear: it has been hypothesized that all of them participate in the S-specific interaction with the RNase, but it is also possible that only one of them is involved in this function. Moreover, even if the S locus male and female determinants are solely responsible for the specificity of pollen-pistil recognition, many other factors are thought to play a role in GSI; these are not linked to the S locus and act in an S-haplotype-independent manner. They can regulate the expression of the S determinants (group 1 factors), modulate their activity (group 2), or act downstream, in carrying out the acceptance or rejection of the pollen tube (group 3). This study aimed to elucidate the molecular mechanism of GSI in European pear (Pyrus communis) as well as in the other Pyrinae; it was divided into two parts, the first focusing on the characterization of the male determinants and the second on factors external to the S locus. The search for S locus F-box genes was primarily aimed at identifying such genes in European pear, for which sequence data were not yet available; it also made it possible to investigate the structure of the S locus in the Pyrinae. The analysis was carried out on a pool of varieties of the three species Pyrus communis (European pear), Pyrus pyrifolia (Japanese pear), and Malus × domestica (apple); varieties carrying S haplotypes whose RNases are highly similar were chosen in order to check whether the same level of similarity is also maintained between the male determinants. A total of 82 sequences were obtained, 47 of which represent the first S-locus F-box genes sequenced from European pear.
The sequence data strongly support the hypothesis that the S locus structure is conserved among the three species, and presumably among all the Pyrinae; at least five genes have homologs in the analysed S haplotypes, but the number of F-box genes surrounding the S-RNase could be even greater. The high level of sequence divergence, together with the similarity between alleles linked to highly conserved RNases, suggests a shared ancestral polymorphism for the F-box genes as well. The F-box genes identified in European pear were mapped on a segregating population of 91 individuals from the cross 'Abbé Fétel' × 'Max Red Bartlett'. All the genes were placed on linkage group 17, where the S locus has been placed in both pear and apple maps, and were strongly associated with the S-RNase gene. The linkage with the RNase was perfect for some of the F-box genes, while for others very rare single recombination events were identified. The second part of this study focused on the search for other genes involved in the SI response in pear; it aimed, on the one hand, to identify genes differentially expressed in compatible and incompatible crosses and, on the other, to clone and characterize the transglutaminase (TGase) gene, whose role may be crucial in pollen rejection. For the identification of differentially expressed genes, controlled pollinations were carried out in four combinations (self-pollination, incompatible, half-compatible and fully compatible cross-pollination), and expression profiles were compared through cDNA-AFLP. Twenty-eight fragments displaying an expression pattern related to compatibility or incompatibility were identified, cloned and sequenced; sequence analysis allowed a putative annotation to be assigned to some of them. The identified genes are involved in very different cellular processes or in defense mechanisms, suggesting a very complex change in gene expression following pollen-pistil recognition.
The pool of genes identified with this technique offers a good basis for further study toward a better understanding of how the SI response is carried out. Among the factors involved in the SI response, moreover, an important role may be played by transglutaminase (TGase), an enzyme involved both in post-translational protein modification and in protein cross-linking. The TGase activity detected in pear styles was significantly higher after incompatible pollinations than after compatible ones, suggesting a role for this enzyme in the abnormal cytoskeletal reorganization observed during the pollen rejection reaction. The aim of this part of the work was thus to identify and clone the pear TGase gene; fragments of this gene were amplified by PCR using primers designed on the alignment between the Arabidopsis TGase gene sequence and several apple EST fragments; the full-length coding sequence of the pear TGase gene was then cloned from cDNA, providing a valuable tool for further study of the in vitro and in vivo action of this enzyme.
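The four pollination combinations mentioned in this abstract can be illustrated with the standard rule of gametophytic self-incompatibility: a haploid pollen grain is rejected when its S haplotype is also carried by the pistil. The following is a minimal sketch under that assumption; the function name and the haplotype labels (S1 to S4) are hypothetical and not taken from the thesis.

```python
def classify_cross(pistil, pollen_parent):
    """Classify a cross under gametophytic self-incompatibility (GSI).

    Both arguments are diploid S-genotypes given as pairs of haplotype
    labels (hypothetical labels such as "S1", "S2"). Each pollen grain
    carries one haplotype of the pollen parent and is rejected if the
    pistil carries the same haplotype.
    """
    pistil_haplotypes = set(pistil)
    pollen_haplotypes = set(pollen_parent)
    accepted = pollen_haplotypes - pistil_haplotypes
    if not accepted:
        return "incompatible"
    if accepted == pollen_haplotypes:
        return "fully compatible"
    return "half-compatible"


# Self-pollination is the special case where both genotypes are identical.
print(classify_cross(("S1", "S2"), ("S1", "S2")))
print(classify_cross(("S1", "S2"), ("S1", "S3")))
print(classify_cross(("S1", "S2"), ("S3", "S4")))
```

The sketch only captures S-locus specificity; the non-S-locus factors discussed in the abstract (groups 1 to 3) are outside its scope.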
Abstract:
The Poxviruses are a family of double-stranded DNA (dsDNA) viruses that cause disease in many species, both vertebrate and invertebrate. Their genomes range in size from 135 to 365 kbp and show conservation in both organization and content. In particular, the central genomic regions of the chordopoxvirus subfamily (those capable of infecting vertebrates) contain 88 genes which are present in all the virus species characterised to date and which mostly occur in the same order and orientation. In contrast, the terminal regions of the genomes frequently contain genes that are species- or genus-specific and that are not essential for the growth of the virus in vitro but instead often encode factors with important roles in vivo, including modulation of the host immune response to infection and determination of the host range of the virus. The Parapoxviruses (PPV), of which orf virus is the prototypic species, represent a genus within the chordopoxvirus subfamily of Poxviridae and are characterised by their ability to infect ruminants and humans. The genus currently contains four recognised species of virus: bovine papular stomatitis virus (BPSV) and pseudocowpox virus (PCPV), both of which infect cattle; orf virus (OV), which infects sheep and goats; and parapoxvirus of red deer in New Zealand (PVNZ). The OV genome has been fully sequenced, as has that of BPSV, and is ~138 kb in length, encoding ~132 genes. The vast majority of these genes allow the virus to replicate in the cytoplasm of the infected host cell and therefore encode proteins involved in replication, transcription and metabolism of nucleic acids. These genes are well conserved between all known genera of poxviruses. There is, however, another class of genes, located at either end of the linear dsDNA genome, that encode proteins which are non-essential for replication and generally dictate host range and virulence of the virus.
The non-essential genes are often the most variable within and between species of virus and are therefore potentially useful for diagnostic purposes. Given their role in subverting the host immune response to infection, they are also targets for novel therapeutics. The function of only a relatively small number of these proteins has been elucidated, and there are several genes whose function still remains obscure, principally because there is little similarity between them and proteins of known function in current sequence databases. It is thought that by selectively removing some of the virulence genes, or at least neutralising the proteins in some way, current vaccines could be improved. The evolution of poxviruses has been proposed to be an adaptive process involving frequent events of gene gain and loss, such that the virus co-evolves with its specific host. Gene capture or horizontal gene transfer from the host to the virus is considered an important source of new viral genes, including those likely to be involved in host range and those enabling the virus to interfere with the host immune response to infection. Given the low rate of nucleotide substitution, recombination can be seen as an essential evolutionary driving force, although it is likely underestimated. Recombination in poxviruses is intimately linked to DNA replication, with both viral and cellular proteins participating in this recombination-dependent replication. It has been shown, in other poxvirus genera, that recombination between isolates and perhaps even between species does occur, thereby providing another mechanism for the acquisition of new genes and for the rapid evolution of viruses. Such events may result in viruses that have a selective advantage over others, for example in re-infections (a characteristic of the PPV), or in viruses that are able to jump the species barrier and infect new hosts.
Sequence data from viral strains isolated from goats suggest that recombination events may have occurred between OV and PCPV (Ueda et al. 2003). Recombination events are frequent during poxvirus replication, and comparative genomic analysis of several poxvirus species has revealed that recombination occurs frequently in the right terminal region. Intraspecific recombination can occur between strains of the same PPV species, but interspecific recombination can also happen, provided there is enough sequence similarity to enable recombination between distinct PPV species. The most important prerequisite for successful recombination is the coinfection of an individual host by different virus strains or species. Consequently, the following factors affecting the distribution of different viruses to shared target cells need to be considered: dose of inoculated virus, time interval between inoculation of the first and the second virus, distance between the marker mutations, and genetic homology. At present there are no available data on the replication dynamics of PPV in permissive and non-permissive hosts and, regarding co-infections, there is no information on the interference mechanisms occurring during the simultaneous replication of viruses of different species. This work was carried out to set up permissive substrates allowing the replication of different PPV species, in particular keratinocyte monolayers and organotypic skin cultures. Furthermore, a method to isolate and expand ovine skin stem cells was set up to investigate further aspects of viral cellular tropism during natural infection. The study produced important data to elucidate the replication dynamics of OV and PCPV in vitro, as well as the mechanisms of interference that can arise during co-infection with different viral species.
Moreover, the analysis carried out on the right terminal region of the PCPV 1303/05 genome contributed to a better knowledge of the viral genes involved in host interaction and pathogenesis, and helped to locate recombination breakpoints and genetic homologies between PPV species. Taken together, these data fill several crucial gaps in the study of interspecific recombination in PPVs, which is thought to be important for a better understanding of viral evolution and for improving the biosafety of antiviral therapy and PPV-based vectors.
Abstract:
Different types of proteins exist with diverse functions that are essential for living organisms. An important class of proteins is represented by transmembrane proteins, which are inserted into biological membranes and perform very important functions in the cell, such as cell communication and active transport across the membrane. Transmembrane β-barrels (TMBBs) are a sub-class of membrane proteins largely under-represented in structure databases because of the extreme difficulty of experimental structure determination. For this reason, computational tools that are able to predict the structure of TMBBs are needed. In this thesis, two computational problems related to TMBBs were addressed: the detection of TMBBs in large datasets of proteins and the prediction of the topology of TMBB proteins. Firstly, a method for TMBB detection was presented, based on a novel neural network framework for variable-length sequence classification. The proposed approach was validated on a non-redundant dataset of proteins. Furthermore, we carried out genome-wide detection using the entire Escherichia coli proteome. In both experiments, the method significantly outperformed other existing state-of-the-art approaches, reaching a very high positive predictive value (PPV, 92%) and Matthews correlation coefficient (MCC, 0.82). Secondly, a method was also introduced for TMBB topology prediction. The proposed approach is based on grammatical modelling and probabilistic discriminative models for sequence data labeling. The method was evaluated using a newly generated dataset of 38 TMBB proteins obtained from high-resolution data in the PDB. The results show that the model is able to correctly predict the topologies of 25 out of 38 protein chains in the dataset. When tested on previously released datasets, the performance of the proposed approach was comparable or superior to the current state of the art in TMBB topology prediction.
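The two detection scores quoted in this abstract, PPV and MCC, are both computed from the counts of a binary confusion matrix. The sketch below shows the standard definitions; the counts in the usage lines are invented for illustration and are not the thesis's data.

```python
import math


def ppv(tp, fp):
    """Positive predictive value (precision): TP / (TP + FP)."""
    return tp / (tp + fp)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts, chosen only to exercise the functions.
tp, tn, fp, fn = 46, 940, 4, 10
print(f"PPV = {ppv(tp, fp):.2f}")
print(f"MCC = {mcc(tp, tn, fp, fn):.2f}")
```

MCC is often reported alongside PPV for tasks such as genome-wide detection because positives are rare there, and MCC takes all four cells of the confusion matrix into account.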
Abstract:
The recent advent of next-generation sequencing technologies has revolutionized the way the genome is analyzed. This innovation makes it possible to obtain deeper information at lower cost and in less time, and provides data in the form of discrete measurements. One of the most important applications of these data is differential analysis, that is, investigating whether a gene exhibits a different expression level under two (or more) biological conditions (such as disease states, treatments received and so on). As for the statistical analysis, the final aim is statistical testing, and for modeling these data the Negative Binomial distribution is considered the most adequate, especially because it allows for "overdispersion". However, the estimation of the dispersion parameter is a very delicate issue because little information is usually available for estimating it. Many strategies have been proposed, but they often result in procedures based on plug-in estimates, and in this thesis we show that this discrepancy between the estimation and the testing framework can lead to uncontrolled type I errors. We propose a mixture model that allows each gene to share information with other genes that exhibit similar variability. Afterwards, three consistent statistical tests are developed for differential expression analysis. We show that the proposed method improves the sensitivity of detecting differentially expressed genes with respect to the common procedures, since it comes closest to the nominal value of the type I error while maintaining high power. The method is finally illustrated on prostate cancer RNA-seq data.
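To make the notion of overdispersion concrete: in the parameterization commonly used for RNA-seq counts, a Negative Binomial variable with mean mu and dispersion phi has variance mu + phi * mu^2, which reduces to the Poisson variance when phi is 0. The sketch below assumes that parameterization and shows a naive per-gene method-of-moments estimate of phi. It is meant to illustrate why the estimate is unstable with few replicates, and it is not the mixture model proposed in the thesis. All names and counts are hypothetical.

```python
from statistics import mean, variance


def nb_variance(mu, phi):
    """Variance of a Negative Binomial with mean mu and dispersion phi."""
    return mu + phi * mu ** 2


def moment_dispersion(counts):
    """Naive method-of-moments dispersion estimate for one gene.

    Solves sample_variance = m + phi * m^2 for phi and truncates at 0,
    since a negative value would imply underdispersion.
    """
    m = mean(counts)
    if m == 0:
        return 0.0
    return max((variance(counts) - m) / m ** 2, 0.0)


# With only a handful of replicates per gene (made-up counts), the
# estimate rests on very little information.
print(nb_variance(100, 0.1))
print(moment_dispersion([85, 130, 97]))
```

Because each gene contributes only a few observations, estimates like this vary widely, which motivates approaches that share information across genes.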
Abstract:
Since the end of the 19th century, geodesy has contributed greatly to the knowledge of regional tectonics and fault movement through its ability to measure, at sub-centimetre precision, the relative positions of points on the Earth’s surface. Nowadays the systematic analysis of geodetic measurements in actively deforming regions therefore represents one of the most important tools in the study of crustal deformation over different temporal scales [e.g., Dixon, 1991]. This dissertation focuses on motion that can be observed geodetically with classical terrestrial position measurements, particularly triangulation and leveling observations. The work is divided into two sections: an overview of the principal methods for estimating long-term accumulation of elastic strain from terrestrial observations, and an overview of the principal methods for rigorously inverting surface coseismic deformation fields for source geometry, with tests on synthetic deformation data sets and applications in two different tectonically active regions of the Italian peninsula. For the analysis of long-term accumulation of elastic strain, triangulation data were available from a geodetic network across the Messina Straits area (southern Italy) for the period 1971 – 2004. From the resulting angle changes, the shear strain rates as well as the orientation of the principal axes of the strain rate tensor were estimated. The computed average annual shear strain rates for the time period between 1971 and 2004 are γ̇1 = 113.89 ± 54.96 nanostrain/yr and γ̇2 = -23.38 ± 48.71 nanostrain/yr, with the orientation of the most extensional strain (θ) at N140.80° ± 19.55°E. These results suggest that the first-order strain field of the area is dominated by extension in the direction perpendicular to the trend of the Straits, supporting the hypothesis that the Messina Straits could represent an area of active concentrated deformation.
The orientation of θ agrees well with GPS deformation estimates calculated over a shorter time interval, is consistent with previous preliminary GPS estimates [D’Agostino and Selvaggi, 2004; Serpelloni et al., 2005], and is also similar to the direction of the 1908 (MW 7.1) earthquake slip vector [e.g., Boschi et al., 1989; Valensise and Pantosti, 1992; Pino et al., 2000; Amoruso et al., 2002]. Thus, the measured strain rate can be attributed to active extension across the Messina Straits, corresponding to a relative extension rate ranging between < 1 mm/yr and ~ 2 mm/yr within the portion of the Straits covered by the triangulation network. These results are consistent with the hypothesis that the Messina Straits is an important active geological boundary between the Sicilian and the Calabrian domains and support previous preliminary GPS-based estimates of strain rates across the Straits, which show that the active deformation is distributed over a greater area. Finally, the preliminary dislocation modelling has shown that, although the current geodetic measurements do not resolve the geometry of the dislocation models, they resolve well the rate of interseismic strain accumulation across the Messina Straits and give useful information about the locking depth of the shear zone. Geodetic data, namely triangulation and leveling measurements of the 1976 Friuli (NE Italy) earthquake, were available for the inversion of coseismic source parameters. From observed angle and elevation changes, the source parameters of the seismic sequence were estimated in a joint inversion using an algorithm called “simulated annealing”. The computed optimal uniform-slip elastic dislocation model consists of a 30° north-dipping shallow (depth 1.30 ± 0.75 km) fault plane with an azimuth of 273°, accommodating reverse dextral slip of about 1.8 m.
The hypocentral location and inferred fault plane of the main event are then consistent with the activation of Periadriatic overthrusts or other related thrust faults such as the Gemona-Kobarid thrust. The geodetic data set therefore excludes the source solution of Aoudia et al. [2000], Peruzza et al. [2002] and Poli et al. [2002], which considers the Susans-Tricesimo thrust as the source of the May 6 event. The best-fit source model is more consistent with the solution of Pondrelli et al. [2001], which proposed the activation of other thrusts located further north of the Susans-Tricesimo thrust, probably on Periadriatic-related thrust faults. The main characteristics of the leveling and triangulation data are fit by the optimal single-fault model; that is, these results are consistent with a first-order rupture process characterized by a progressive rupture of a single fault system. A single uniform-slip fault model does not, however, seem to reproduce some minor complexities of the observations, and some residual signals not modelled by the optimal single-fault plane solution were observed. In particular, the single-fault plane model does not reproduce some minor features of the leveling deformation field along route 36 south of the main uplift peak; a second fault seems to be necessary to reproduce these residual signals. By assuming movements along some mapped thrusts located south of the inferred optimal single-plane solution, the residual signal has been successfully modelled. In summary, the inversion results presented in this thesis are consistent with the activation of some Periadriatic-related thrusts for the main events of the sequence, and with a minor role for the southward thrust systems of the middle Tagliamento plain.
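As a small worked illustration of the Messina Straits strain-rate figures quoted in this abstract: if the two reported components are the usual engineering shear components obtained from triangulation angle changes (an assumption here, since the abstract does not state its convention), they combine into a total shear strain rate as their Euclidean norm. The abstract does not report this total, so the value computed below is only a derived illustration.

```python
import math


def total_shear_rate(gamma1, gamma2):
    """Total shear strain rate from its two components.

    Assumes gamma1 and gamma2 are the two engineering shear components
    of the horizontal strain-rate tensor, in the same units.
    """
    return math.hypot(gamma1, gamma2)


# Component values as quoted in the abstract, in nanostrain/yr.
gamma_total = total_shear_rate(113.89, -23.38)
print(f"total shear rate: {gamma_total:.1f} nanostrain/yr")
```

Uncertainty propagation is left out on purpose, since it would need the covariance between the two components, which the abstract does not give.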
Abstract:
During my PhD, starting from the original formulations proposed by Bertrand et al., 2000 and Emolo & Zollo 2005, I developed inversion methods and applied them to different earthquakes. In particular, considerable effort was devoted to the study of model resolution and to the estimation of model parameter errors. To study the kinematic source characteristics of the Christchurch earthquake, we performed a joint inversion of strong-motion, GPS and InSAR data using a non-linear inversion method. Considering the complexity highlighted by the surface deformation data, we adopted a fault model consisting of two partially overlapping segments, with dimensions 15x11 and 7x7 km², having different faulting styles. This two-fault model makes it possible to better reconstruct the complex shape of the surface deformation data. The total seismic moment resulting from the joint inversion is 3.0x10^25 dyne·cm (Mw = 6.2), with an average rupture velocity of 2.0 km/s. Errors associated with the kinematic model are estimated at around 20-30%. The 2009 L'Aquila sequence was characterized by an intense aftershock sequence that lasted several months. In this study we applied an inversion method that takes the apparent Source Time Functions (aSTFs) as data to an Mw 4.0 aftershock of the L'Aquila sequence. The aSTFs were estimated using the deconvolution method proposed by Vallée et al., 2004. The inversion results show a heterogeneous slip distribution, characterized by two main slip patches located NW of the hypocenter, and a variable rupture velocity distribution (mean value of 2.5 km/s), showing a rupture front acceleration between the two high-slip zones. Errors of about 20% characterize the final estimated parameters.
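The seismic moment and moment magnitude reported for the Christchurch earthquake in this abstract are linked by the standard moment-magnitude relation. The sketch below uses the form Mw = (2/3)(log10 M0 − 9.1) with M0 in N·m, after converting from dyne·cm. The function name is illustrative, and the abstract does not say which variant of the relation the author used.

```python
import math

# 1 N·m equals 1e7 dyne·cm.
DYNE_CM_PER_NEWTON_METRE = 1e7


def moment_magnitude(m0_dyne_cm):
    """Moment magnitude Mw from a scalar seismic moment in dyne·cm."""
    m0_newton_metre = m0_dyne_cm / DYNE_CM_PER_NEWTON_METRE
    return (2.0 / 3.0) * (math.log10(m0_newton_metre) - 9.1)


# Seismic moment quoted in the abstract.
mw = moment_magnitude(3.0e25)
print(f"Mw = {mw:.2f}")
```

For the quoted moment this gives about 6.25, in line with the reported Mw = 6.2 allowing for rounding and for small differences in the constant used.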
Abstract:
In the last few decades, bioinformatics has played a fundamental role in making sense of the huge amount of data produced. Once the complete sequence of a genome has been obtained, the major problem is to learn as much as possible about its coding regions. Protein sequence annotation is challenging and, due to the size of the problem, only computational approaches can provide a feasible solution. As recently pointed out by the Critical Assessment of Function Annotations (CAFA), the most accurate methods are those based on the transfer-by-homology approach, and the most incisive contribution is given by cross-genome comparisons. This thesis describes a non-hierarchical sequence clustering method for automatic large-scale protein annotation, called “The Bologna Annotation Resource Plus” (BAR+). The method is based on an all-against-all alignment of more than 13 million protein sequences characterized by a very stringent metric. BAR+ can safely transfer functional features (Gene Ontology and Pfam terms) inside clusters by means of a statistical validation, even in the case of multi-domain proteins. Within BAR+ clusters it is also possible to transfer the three-dimensional structure (when a template is available). This is made possible by cluster-specific HMM profiles that can be used to calculate reliable template-to-target alignments even in the case of distantly related proteins (sequence identity < 30%). Other BAR+ based applications were developed during my doctorate, including the prediction of magnesium binding sites in human proteins, the classification of the ABC transporter superfamily, and the functional prediction (GO terms) of the CAFA targets. Remarkably, in the CAFA assessment, BAR+ placed among the ten most accurate methods. At present, BAR+ is freely available as a web server for functional and structural protein sequence annotation at http://bar.biocomp.unibo.it/bar2.0.
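The "sequence identity < 30%" threshold in this abstract refers to the fraction of identical residues in a pairwise alignment. Definitions differ in what they use as the denominator. The sketch below assumes identity is measured over columns where both sequences have a residue (gaps excluded), which may not be the exact definition used by BAR+. The function name and example sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity of two already-aligned sequences.

    Counts identical residues over the columns in which neither
    sequence has a gap. Other definitions divide by the full alignment
    length or by the shorter sequence length instead.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for residue_a, residue_b in zip(aligned_a, aligned_b):
        if residue_a == gap or residue_b == gap:
            continue  # skip columns containing a gap
        compared += 1
        if residue_a == residue_b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy alignment: 5 gap-free columns, 4 of them identical.
print(percent_identity("ACDE-F", "ACDQKF"))
```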
Abstract:
In many application domains data can be naturally represented as graphs. When the application of analytical solutions to a given problem is unfeasible, machine learning techniques can be a viable way to solve it. Classical machine learning techniques are defined for data represented in vectorial form. Recently some of them have been extended to deal directly with structured data. Among those techniques, kernel methods have shown promising results in terms of both computational complexity and predictive performance. Kernel methods avoid an explicit mapping to a vectorial form by relying on kernel functions, which informally are functions calculating a similarity measure between two entities. However, the definition of good kernels for graphs is a challenging problem because of the difficulty of finding a good tradeoff between computational complexity and expressiveness. Another problem we face is learning on data streams, where a potentially unbounded sequence of data is generated by some sources. There are three main contributions in this thesis. The first contribution is the definition of a new family of kernels for graphs based on Directed Acyclic Graphs (DAGs). We analyzed two kernels from this family, achieving state-of-the-art results in terms of both computation and classification on real-world datasets. The second contribution consists in making the application of learning algorithms to streams of graphs feasible. Moreover, we defined a principled approach to memory management. The third contribution is the application of machine learning techniques for structured data to non-coding RNA function prediction. In this setting, the secondary structure is thought to carry relevant information. However, existing methods considering the secondary structure have prohibitively high computational complexity. We propose to apply kernel methods in this domain, obtaining state-of-the-art results.
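To illustrate what "a kernel function calculating a similarity measure between two entities" can look like for graphs, here is a deliberately simple vertex-label histogram kernel. It is far less expressive than the DAG-based kernels this abstract describes and is not the thesis's method. It only shows how two graphs can be compared without building explicit feature vectors. The function name and labels are hypothetical.

```python
from collections import Counter


def label_histogram_kernel(labels_a, labels_b):
    """Vertex-label histogram kernel between two labelled graphs.

    Each graph is reduced to the multiset of its vertex labels, and the
    kernel value is the dot product of the two label-count vectors.
    Edges are ignored, which keeps the kernel cheap but not expressive.
    """
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    shared = counts_a.keys() & counts_b.keys()
    return sum(counts_a[label] * counts_b[label] for label in shared)


# Two toy graphs described only by their vertex labels.
graph_1 = ["C", "C", "O"]
graph_2 = ["C", "O", "O", "N"]
print(label_histogram_kernel(graph_1, graph_2))
```

The example also shows the tradeoff mentioned in the abstract: this kernel is computed in linear time but cannot tell apart graphs that share labels and differ in structure.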