5 resultados para Inter-project Learning
em AMS Tesi di Dottorato - Alm@DL - Università di Bologna
Resumo:
Biology is now a “Big Data Science” thanks to technological advancements allowing the characterization of the whole macromolecular content of a cell or a collection of cells. This opens interesting perspectives, but only a small portion of this data may be experimentally characterized. From this derives the demand of accurate and efficient computational tools for automatic annotation of biological molecules. This is even more true when dealing with membrane proteins, on which my research project is focused leading to the development of two machine learning-based methods: BetAware-Deep and SVMyr. BetAware-Deep is a tool for the detection and topology prediction of transmembrane beta-barrel proteins found in Gram-negative bacteria. These proteins are involved in many biological processes and primary candidates as drug targets. BetAware-Deep exploits the combination of a deep learning framework (bidirectional long short-term memory) and a probabilistic graphical model (grammatical-restrained hidden conditional random field). Moreover, it introduced a modified formulation of the hydrophobic moment, designed to include the evolutionary information. BetAware-Deep outperformed all the available methods in topology prediction and reported high scores in the detection task. Glycine myristoylation in Eukaryotes is the binding of a myristic acid on an N-terminal glycine. SVMyr is a fast method based on support vector machines designed to predict this modification in dataset of proteomic scale. It uses as input octapeptides and exploits computational scores derived from experimental examples and mean physicochemical features. SVMyr outperformed all the available methods for co-translational myristoylation prediction. In addition, it allows (as a unique feature) the prediction of post-translational myristoylation. Both the tools here described are designed having in mind best practices for the development of machine learning-based tools outlined by the bioinformatics community. Moreover, they are made available via user-friendly web servers. All this make them valuable tools for filling the gap between sequential and annotated data.
Resumo:
With the CERN LHC program underway, there has been an acceleration of data growth in the High Energy Physics (HEP) field and the usage of Machine Learning (ML) in HEP will be critical during the HL-LHC program when the data that will be produced will reach the exascale. ML techniques have been successfully used in many areas of HEP nevertheless, the development of a ML project and its implementation for production use is a highly time-consuming task and requires specific skills. Complicating this scenario is the fact that HEP data is stored in ROOT data format, which is mostly unknown outside of the HEP community. The work presented in this thesis is focused on the development of a ML as a Service (MLaaS) solution for HEP, aiming to provide a cloud service that allows HEP users to run ML pipelines via HTTP calls. These pipelines are executed by using the MLaaS4HEP framework, which allows reading data, processing data, and training ML models directly using ROOT files of arbitrary size from local or distributed data sources. Such a solution provides HEP users non-expert in ML with a tool that allows them to apply ML techniques in their analyses in a streamlined manner. Over the years the MLaaS4HEP framework has been developed, validated, and tested and new features have been added. A first MLaaS solution has been developed by automatizing the deployment of a platform equipped with the MLaaS4HEP framework. Then, a service with APIs has been developed, so that a user after being authenticated and authorized can submit MLaaS4HEP workflows producing trained ML models ready for the inference phase. A working prototype of this service is currently running on a virtual machine of INFN-Cloud and is compliant to be added to the INFN Cloud portfolio of services.
Resumo:
Machine Learning makes computers capable of performing tasks typically requiring human intelligence. A domain where it is having a considerable impact is the life sciences, allowing to devise new biological analysis protocols, develop patients’ treatments efficiently and faster, and reduce healthcare costs. This Thesis work presents new Machine Learning methods and pipelines for the life sciences focusing on the unsupervised field. At a methodological level, two methods are presented. The first is an “Ab Initio Local Principal Path” and it is a revised and improved version of a pre-existing algorithm in the manifold learning realm. The second contribution is an improvement over the Import Vector Domain Description (one-class learning) through the Kullback-Leibler divergence. It hybridizes kernel methods to Deep Learning obtaining a scalable solution, an improved probabilistic model, and state-of-the-art performances. Both methods are tested through several experiments, with a central focus on their relevance in life sciences. Results show that they improve the performances achieved by their previous versions. At the applicative level, two pipelines are presented. The first one is for the analysis of RNA-Seq datasets, both transcriptomic and single-cell data, and is aimed at identifying genes that may be involved in biological processes (e.g., the transition of tissues from normal to cancer). In this project, an R package is released on CRAN to make the pipeline accessible to the bioinformatic Community through high-level APIs. The second pipeline is in the drug discovery domain and is useful for identifying druggable pockets, namely regions of a protein with a high probability of accepting a small molecule (a drug). Both these pipelines achieve remarkable results. Lastly, a detour application is developed to identify the strengths/limitations of the “Principal Path” algorithm by analyzing Convolutional Neural Networks induced vector spaces. This application is conducted in the music and visual arts domains.
Resumo:
The rapid progression of biomedical research coupled with the explosion of scientific literature has generated an exigent need for efficient and reliable systems of knowledge extraction. This dissertation contends with this challenge through a concentrated investigation of digital health, Artificial Intelligence, and specifically Machine Learning and Natural Language Processing's (NLP) potential to expedite systematic literature reviews and refine the knowledge extraction process. The surge of COVID-19 complicated the efforts of scientists, policymakers, and medical professionals in identifying pertinent articles and assessing their scientific validity. This thesis presents a substantial solution in the form of the COKE Project, an initiative that interlaces machine reading with the rigorous protocols of Evidence-Based Medicine to streamline knowledge extraction. In the framework of the COKE (“COVID-19 Knowledge Extraction framework for next-generation discovery science”) Project, this thesis aims to underscore the capacity of machine reading to create knowledge graphs from scientific texts. The project is remarkable for its innovative use of NLP techniques such as a BERT + bi-LSTM language model. This combination is employed to detect and categorize elements within medical abstracts, thereby enhancing the systematic literature review process. The COKE project's outcomes show that NLP, when used in a judiciously structured manner, can significantly reduce the time and effort required to produce medical guidelines. These findings are particularly salient during times of medical emergency, like the COVID-19 pandemic, when quick and accurate research results are critical.
Resumo:
There are many diseases that affect the thyroid gland, and among them are carcinoma. Thyroid cancer is the most common endocrine neoplasm and the second most frequent cancer in the 0-49 age group. This thesis deals with two studies I conducted during my PhD. The first concerns the development of a Deep Learning model to be able to assist the pathologist in screening of thyroid cytology smears. This tool created in collaboration with Prof. Diciotti, affiliated with the DEI-UNIBO "Guglielmo Marconi" Department of Electrical Energy and Information Engineering, has an important clinical implication in that it allows patients to be stratified between those who should undergo surgery and those who should not. The second concerns the application of spatial transcriptomics on well-differentiated thyroid carcinomas to better understand their invasion mechanisms and thus to better comprehend which genes may be involved in the proliferation of these tumors. This project specifically was made possible through a fruitful collaboration with the Gustave Roussy Institute in Paris. Studying thyroid carcinoma deeply is essential to improve patient care, increase survival rates, and enhance the overall understanding of this prevalent cancer. It can lead to more effective prevention, early detection, and treatment strategies that benefit both patients and the healthcare system.