2 results for ESID ONLINE DATABASE

in AMS Tesi di Dottorato - Alm@DL - Università di Bologna


Relevance:

80.00%

Abstract:

The continuous increase in genome sequencing projects has produced a huge amount of data over the last 10 years: currently more than 600 prokaryotic and 80 eukaryotic genomes are fully sequenced and publicly available. However, sequencing a genome yields only raw nucleotide sequences. This is only the first step of the genome annotation process, which deals with assigning biological information to each sequence. Annotation is carried out at every level of the biological information processing mechanism, from DNA to protein, and cannot be accomplished by in vitro analysis alone, which is extremely expensive and time-consuming when applied at such a large scale. Thus, in silico methods need to be used to accomplish the task. The aim of this work was the implementation of predictive computational methods to allow fast, reliable, and automated annotation of genomes and proteins starting from amino acid sequences. The first part of the work focused on the implementation of a new machine-learning-based method for predicting the subcellular localization of soluble eukaryotic proteins. The method is called BaCelLo and was developed in 2006. Its main peculiarity is that it is independent of biases present in the training dataset, which cause the over‐prediction of the most represented examples in all the other predictors developed so far. This important result was achieved through a modification that I made to the standard Support Vector Machine (SVM) algorithm, creating the so-called Balanced SVM. BaCelLo is able to predict the most important subcellular localizations in eukaryotic cells, and three kingdom‐specific predictors were implemented. In two extensive comparisons, carried out in 2006 and 2008, BaCelLo was reported to outperform all the currently available state‐of‐the‐art methods for this prediction task.
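The abstract does not describe how the Balanced SVM changes the standard algorithm. As a rough illustration of the general problem it addresses, the sketch below shows one common way to counter an unbalanced training set: weighting each class inversely to its frequency. This heuristic is an assumption chosen for illustration and is not claimed to be the BaCelLo method; the function name and the label counts are invented.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return one weight per class, inversely proportional to class frequency.

    With these weights every class contributes the same total weight
    (n_samples / n_classes), so a learner that honours them is not pulled
    toward the most represented class.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Invented, deliberately unbalanced label set (not data from the thesis).
labels = ["cytoplasm"] * 6 + ["nucleus"] * 3 + ["secretory"] * 1
weights = balanced_class_weights(labels)

for cls, weight in sorted(weights.items()):
    print(f"{cls}: {weight:.3f}")
```

Weights of this kind are typically passed to a learner that accepts per-class penalties, so that errors on rare classes cost more than errors on frequent ones.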
BaCelLo was subsequently used to completely annotate 5 eukaryotic genomes by integrating it into a pipeline of predictors developed at the Bologna Biocomputing group by Dr. Pier Luigi Martelli and Dr. Piero Fariselli. An online database, called eSLDB, was developed by integrating, for each amino acid sequence extracted from the genome, the predicted subcellular localization with experimental and similarity‐based annotations. In the second part of the work, a new machine-learning-based method was implemented for the prediction of GPI‐anchored proteins. The method efficiently predicts, from the raw amino acid sequence, both the presence of the GPI‐anchor (by means of an SVM) and the position in the sequence of the post‐translational modification event, the so-called ω‐site (by means of a Hidden Markov Model (HMM)). The method is called GPIPE and was reported to greatly enhance prediction performance for GPI‐anchored proteins over all previously developed methods. GPIPE was able to predict up to 88% of the experimentally annotated GPI‐anchored proteins while maintaining a false positive prediction rate as low as 0.1%. GPIPE was used to completely annotate 81 eukaryotic genomes, and more than 15000 putative GPI‐anchored proteins were predicted, 561 of which are found in H. sapiens. On average, 1% of a proteome is predicted to be GPI‐anchored. A statistical analysis of the composition of the regions surrounding the ω‐site allowed the definition of specific amino acid abundances in the different regions considered. Furthermore, the hypothesis proposed in the literature that compositional biases are present among the four major eukaryotic kingdoms was tested and rejected. All the developed predictors and databases are freely available: BaCelLo (http://gpcr.biocomp.unibo.it/bacello), eSLDB (http://gpcr.biocomp.unibo.it/esldb), and GPIPE (http://gpcr.biocomp.unibo.it/gpipe).
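The two figures quoted for GPIPE (88% of annotated proteins recovered, 0.1% false positives) can be read as a sensitivity and a false positive rate. The sketch below shows how such rates are computed from confusion-matrix counts, assuming the abstract uses the usual definitions. The counts are invented purely so that the arithmetic reproduces the reported percentages; they are not taken from the thesis.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of real positives that the predictor recovers."""
    return true_positives / (true_positives + false_negatives)


def false_positive_rate(false_positives, true_negatives):
    """Fraction of real negatives that the predictor wrongly flags."""
    return false_positives / (false_positives + true_negatives)


# Hypothetical counts, chosen only to match the rates quoted in the abstract.
tp, fn = 880, 120    # annotated GPI-anchored proteins: found / missed
fp, tn = 10, 9990    # non-anchored proteins: wrongly flagged / correctly rejected

print(f"sensitivity         = {sensitivity(tp, fn):.1%}")
print(f"false positive rate = {false_positive_rate(fp, tn):.1%}")
```

Reporting both numbers matters at genome scale: when only about 1% of a proteome is expected to be GPI-anchored, even a small false positive rate applied to the remaining 99% can produce a noticeable share of the predicted set.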

Relevance:

30.00%

Abstract:

A Digital Scholarly Edition is a conceptually and structurally sophisticated entity. Throughout the centuries, diverse methodologies have been employed to reconstruct a text transmitted through one or multiple sources, resulting in various edition types. With the advent of digital technology in philology, these practices have undergone a significant transformation, compelling scholars to reconsider their approach in light of the web. In the digital age, philologists are expected to possess advanced, arguably excessive, technical skills to prepare interactive and enriched editions, even though, in most cases, only mechanical or documentary editions are published online. The Śivadharma Database is a web Content Management System (CMS) designed to facilitate the preparation, publication, and updating of Digital Scholarly Editions. It provides scholars with a user-friendly CRUD web application for reconstructing and annotating a text, so that they can prepare their textus with additional components such as apparatus, notes, translations, citations, and parallels. This is made possible by an annotation system based on HTML and a graph data structure. This choice was made because the text entity is multidimensional and multifaceted, even though its sequential presentation constrains it. In particular, editions of South Asian texts of the Śivadharma corpus, the case study of this research, contain a series of phenomena that are difficult to manage formally, such as overlapping hierarchies. Hence, it becomes necessary to establish the data structure best suited to represent this complexity. In the Śivadharma Database, the textus is an HTML file that can be displayed directly. Textual fragments, annotated via an interface that does not require philologists to write code and saved in the backend, form the atomic unit of multiple relationships organised in a graph database.
This approach enables the formal representation of complex and overlapping textual phenomena, allowing for good annotation expressiveness with minimal effort to learn the relevant technologies during the editing workflow.
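To make the idea of overlapping annotations concrete, the sketch below models text fragments as character ranges and annotations as labelled edges from a fragment to a target. It is a minimal in-memory illustration under assumed names (`Fragment`, `AnnotationGraph`, `HAS_NOTE`, `HAS_PARALLEL`), with an invented sample text; it does not reproduce the Śivadharma Database schema or its graph database backend.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fragment:
    """A span of the text, identified by character offsets (end is exclusive)."""
    start: int
    end: int


@dataclass
class AnnotationGraph:
    """Fragments are nodes; each annotation is an edge (fragment, relation, target)."""
    text: str
    edges: list = field(default_factory=list)

    def annotate(self, start, end, relation, target):
        fragment = Fragment(start, end)
        self.edges.append((fragment, relation, target))
        return fragment

    def annotations_at(self, position):
        """All (relation, target) pairs whose fragment covers the given offset."""
        return [(rel, tgt) for frag, rel, tgt in self.edges
                if frag.start <= position < frag.end]


def overlaps(a, b):
    """True when two fragments intersect but neither contains the other."""
    intersect = a.start < b.end and b.start < a.end
    nested = ((a.start <= b.start and b.end <= a.end)
              or (b.start <= a.start and a.end <= b.end))
    return intersect and not nested


# Invented sample text and annotations.
graph = AnnotationGraph("alpha beta gamma delta")
note = graph.annotate(0, 10, "HAS_NOTE", "note on the first two words")
parallel = graph.annotate(6, 16, "HAS_PARALLEL", "parallel passage in another source")

print(graph.text[note.start:note.end])          # fragment covered by the note
print(graph.text[parallel.start:parallel.end])  # fragment covered by the parallel
print(overlaps(note, parallel))
print(graph.annotations_at(7))
```

The two fragments share the word "beta" without one containing the other, which is exactly the case a single nested hierarchy cannot express but a set of independent edges can.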