35 results for Machine learning technique
Abstract:
Machine Learning enables computers to perform tasks that typically require human intelligence. One domain where it is having a considerable impact is the life sciences, where it makes it possible to devise new biological analysis protocols, develop patient treatments faster and more efficiently, and reduce healthcare costs. This thesis presents new Machine Learning methods and pipelines for the life sciences, focusing on unsupervised learning. At the methodological level, two methods are presented. The first, the “Ab Initio Local Principal Path”, is a revised and improved version of an existing manifold learning algorithm. The second is an improvement of the Import Vector Domain Description (a one-class learning method) through the Kullback-Leibler divergence. It hybridizes kernel methods with Deep Learning, yielding a scalable solution, an improved probabilistic model, and state-of-the-art performance. Both methods are tested in several experiments, with a central focus on their relevance to the life sciences. The results show that they outperform their previous versions. At the application level, two pipelines are presented. The first analyzes RNA-Seq datasets, both transcriptomic and single-cell, and aims to identify genes that may be involved in biological processes (e.g., the transition of tissues from normal to cancerous). As part of this project, an R package was released on CRAN to make the pipeline accessible to the bioinformatics community through high-level APIs. The second pipeline belongs to the drug discovery domain and identifies druggable pockets, namely regions of a protein with a high probability of binding a small molecule (a drug). Both pipelines achieve remarkable results. Lastly, a side application is developed to identify the strengths and limitations of the “Principal Path” algorithm by analyzing the vector spaces induced by Convolutional Neural Networks.
This application is conducted in the music and visual arts domains.
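For readers unfamiliar with the Kullback-Leibler divergence mentioned in the abstract above, the following is a minimal sketch of its textbook definition for discrete distributions. It is not the thesis's method: the function name `kl_divergence` and the example distributions are assumptions made purely for illustration, and the abstract does not say how the divergence enters the Import Vector Domain Description.

```python
import math


def kl_divergence(p, q):
    """Return D_KL(p || q) in nats for two discrete distributions.

    Hypothetical helper for illustration; p and q are sequences of
    probabilities over the same outcomes.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # a term with p_i = 0 contributes 0 by convention
        if q_i == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_i * math.log(p_i / q_i)
    return total


# Made-up example distributions.
p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, p))  # 0.0: a distribution does not diverge from itself
print(kl_divergence(p, q))
print(kl_divergence(q, p))  # differs from the line above: not symmetric
```

The asymmetry shown by the last two lines is why the order of the arguments matters whenever the divergence is used as a training or scoring criterion.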
Abstract:
The continuous increase in genome sequencing projects has produced a huge amount of data in the last 10 years: currently more than 600 prokaryotic and 80 eukaryotic genomes are fully sequenced and publicly available. However, sequencing a genome yields only raw nucleotide sequences. This is just the first step of genome annotation, the process of assigning biological information to each sequence. Annotation is carried out at every level of biological information processing, from DNA to protein, and cannot be accomplished by in vitro analysis alone, which is extremely expensive and time consuming when applied at this scale. Thus, in silico methods are needed. The aim of this work was to implement predictive computational methods that allow fast, reliable, and automated annotation of genomes and proteins starting from amino acid sequences. The first part of the work focused on the implementation of a new machine learning based method for predicting the subcellular localization of soluble eukaryotic proteins. The method is called BaCelLo and was developed in 2006. Its distinguishing feature is that it is independent of biases in the training dataset, which cause all other predictors developed so far to over-predict the most represented classes. This result was achieved through a modification that I made to the standard Support Vector Machine (SVM) algorithm, yielding the so-called Balanced SVM. BaCelLo predicts the most important subcellular localizations in eukaryotic cells, and three kingdom-specific predictors were implemented. In two extensive comparisons, carried out in 2006 and 2008, BaCelLo was reported to outperform all the state-of-the-art methods then available for this prediction task.
BaCelLo was subsequently used to completely annotate 5 eukaryotic genomes by integrating it into a pipeline of predictors developed at the Bologna Biocomputing group by Dr. Pier Luigi Martelli and Dr. Piero Fariselli. An online database, called eSLDB, was developed by combining, for each amino acid sequence extracted from the genome, the predicted subcellular localization with experimental and similarity-based annotations. In the second part of the work, a new machine learning based method was implemented for the prediction of GPI-anchored proteins. Starting from the raw amino acid sequence, the method predicts both the presence of the GPI anchor (by means of an SVM) and the position in the sequence of the post-translational modification event, the so-called ω-site (by means of a Hidden Markov Model (HMM)). The method is called GPIPE and was reported to greatly improve the prediction of GPI-anchored proteins over all previously developed methods. GPIPE predicted up to 88% of the experimentally annotated GPI-anchored proteins while keeping the false positive rate as low as 0.1%. GPIPE was used to completely annotate 81 eukaryotic genomes, and more than 15000 putative GPI-anchored proteins were predicted, 561 of which are found in H. sapiens. On average, 1% of a proteome is predicted to be GPI-anchored. A statistical analysis of the composition of the regions surrounding the ω-site made it possible to define specific amino acid abundances in the different regions considered. Furthermore, the hypothesis proposed in the literature that compositional biases are present among the four major eukaryotic kingdoms was tested and rejected. All the developed predictors and databases are freely available: BaCelLo (http://gpcr.biocomp.unibo.it/bacello), eSLDB (http://gpcr.biocomp.unibo.it/esldb), and GPIPE (http://gpcr.biocomp.unibo.it/gpipe).
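The abstract above does not describe how the Balanced SVM is modified, so the following is only a sketch of one common, generic way to counter class imbalance in a training set: weighting each class inversely to its frequency so that every class carries the same total weight. The function name `balanced_class_weights` and the toy labels are assumptions for illustration and should not be read as the BaCelLo algorithm.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Hypothetical helper: a generic inverse-frequency heuristic,
    not the Balanced SVM described in the abstract.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up, deliberately imbalanced toy label set.
labels = ["cytoplasm"] * 6 + ["nucleus"] * 3 + ["mitochondrion"] * 1
weights = balanced_class_weights(labels)
for cls, weight in weights.items():
    print(cls, round(weight, 3))
```

With these weights, the rare class counts as much in total as the frequent one, which is the kind of correction that discourages a classifier from over-predicting the most represented class.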
Abstract:
Statistical modelling and statistical learning theory are two powerful frameworks for analyzing signals and developing efficient processing and classification algorithms. In this thesis, these frameworks are applied to the modelling and processing of biomedical signals in two different contexts: ultrasound medical imaging systems, and the analysis and modelling of primate neural activity. In the context of ultrasound medical imaging, two main applications are explored: deconvolution of signals measured from an ultrasonic transducer, and automatic segmentation and classification of prostate ultrasound scans. In the former application, a stochastic model of the radio frequency signal measured from an ultrasonic transducer is derived. This model is then used to develop, within a statistical framework, a regularized deconvolution procedure that enhances signal resolution. In the latter application, different statistical models are used to characterize images of prostate tissue and to extract different features. These features are then used to segment the images into regions of interest by means of an automatic procedure based on a statistical model of the extracted features. Finally, machine learning techniques are used to classify the different regions of interest automatically. In the context of neural activity signals, an example of a bio-inspired dynamical network was developed to support studies of motor-related processes in the primate brain. The presented model aims to mimic the abstract functionality of a cell population in the 7a parietal region of primates during the execution of learned behavioural tasks.
Abstract:
The diagnosis, grading and classification of tumours have benefited considerably from the development of DCE-MRI, which is now essential to the adequate clinical management of many tumour types because of its ability to detect active angiogenesis. Several strategies have been proposed for DCE-MRI evaluation. Visual inspection of contrast agent concentration curves versus time is a very simple yet operator-dependent procedure; more objective approaches have therefore been developed to facilitate comparison between studies. In so-called model-free approaches, descriptive or heuristic information extracted from the raw time series is used for tissue classification. The main issue with these schemes is that they have no direct interpretation in terms of the physiological properties of the tissues. Model-based investigations, on the other hand, typically involve compartmental tracer kinetic modelling and pixel-by-pixel estimation of kinetic parameters via non-linear regression, applied to regions of interest suitably selected by the physician. This approach has the advantage of providing parameters directly related to the pathophysiological properties of the tissue, such as vessel permeability, regional blood flow, extraction fraction, and the concentration gradient between plasma and the extravascular-extracellular space. However, non-linear modelling is computationally demanding, and the accuracy of the estimates can be affected by the signal-to-noise ratio and by the choice of initial solutions. The principal aim of this thesis is to investigate the use of semi-quantitative and quantitative parameters for the segmentation and classification of breast lesions.
The objectives can be subdivided as follows: to describe the principal techniques for evaluating time-intensity curves in DCE-MRI, with a focus on the kinetic models proposed in the literature; to evaluate the influence of the parametrization choice for a classic bi-compartmental kinetic model; to evaluate the performance of a method for simultaneous tracer kinetic modelling and pixel classification; and to evaluate the performance of machine learning techniques trained for the segmentation and classification of breast lesions.
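To make the idea of semi-quantitative, model-free descriptors of a time-intensity curve more concrete, here is a minimal sketch. The abstract does not list which descriptors the thesis uses, so the ones below (relative peak enhancement, time to peak, wash-in slope, relative washout), their exact definitions, the function name `curve_descriptors`, and the sample curve are all assumptions chosen for illustration; definitions vary across the literature.

```python
def curve_descriptors(times, signal, n_baseline=2):
    """Compute simple descriptors of one pixel's time-intensity curve.

    Hypothetical helper. The first n_baseline samples are assumed to be
    acquired before contrast agent arrival.
    """
    if len(times) != len(signal) or len(signal) <= n_baseline:
        raise ValueError("need matching arrays longer than the baseline")
    baseline = sum(signal[:n_baseline]) / n_baseline
    if baseline <= 0:
        raise ValueError("baseline intensity must be positive")
    peak_index = max(range(len(signal)), key=lambda i: signal[i])
    # Enhancement at the peak, relative to the pre-contrast baseline.
    peak_enhancement = (signal[peak_index] - baseline) / baseline
    # Time from the last baseline sample to the peak.
    time_to_peak = times[peak_index] - times[n_baseline - 1]
    wash_in_slope = peak_enhancement / time_to_peak if time_to_peak > 0 else 0.0
    # Relative change from the peak to the last sample (negative means washout).
    washout = (signal[-1] - signal[peak_index]) / signal[peak_index]
    return {
        "peak_enhancement": peak_enhancement,
        "time_to_peak": time_to_peak,
        "wash_in_slope": wash_in_slope,
        "washout": washout,
    }


# Made-up curve: times in seconds, signal in arbitrary units.
times = [0, 30, 60, 90, 120, 150]
signal = [100, 100, 180, 220, 200, 190]
descriptors = curve_descriptors(times, signal)
print(descriptors)
```

Descriptors of this kind are cheap to compute for every pixel, which is their appeal as classifier inputs, but as the abstract notes they lack a direct physiological interpretation.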
Abstract:
Dysfunction of the Autonomic Nervous System (ANS) is a typical feature of chronic heart failure and other cardiovascular diseases. As a simple, non-invasive technique, heart rate variability (HRV) analysis provides reliable information on the autonomic modulation of heart rate. The aim of this thesis was to research and develop automatic methods, based on ANS assessment, for evaluating risk in cardiac patients. Several feature selection and machine learning algorithms were combined to achieve this goal. Automatic assessment of disease severity in Congestive Heart Failure (CHF) patients: a completely automatic method based on long-term HRV was proposed to assess the severity of CHF, achieving a sensitivity of 93% and a specificity of 64% in discriminating severe from mild patients. Automatic identification of hypertensive patients at high risk of vascular events: a completely automatic system was proposed to identify hypertensive patients at higher risk of developing vascular events in the 12 months following the electrocardiographic recordings, achieving a sensitivity of 71% and a specificity of 86% in identifying high-risk subjects among hypertensive patients. Automatic identification of hypertensive patients with a history of falls: it was explored whether automatic identification of fallers among hypertensive patients based on HRV was feasible. The results obtained in this thesis could have implications for both clinical practice and clinical research. The system was designed and developed to be clinically feasible. Moreover, since a 5-minute ECG recording is inexpensive, easy to obtain, and non-invasive, future research will focus on the clinical applicability of the system as a screening tool in non-specialized outpatient clinics, in order to identify high-risk patients to be shortlisted for more complex investigations.
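Since the abstract above reports its results as sensitivity and specificity, the following minimal sketch shows how those two rates are computed from binary labels. The function name `sensitivity_specificity` and the label vectors are made up for illustration and do not reproduce the thesis data.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = high risk).

    Hypothetical helper for illustration.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # fraction of true positives detected
    specificity = tn / (tn + fp)  # fraction of true negatives recognized
    return sensitivity, specificity


# Made-up example: 10 high-risk and 10 low-risk subjects.
y_true = [1] * 10 + [0] * 10
y_pred = [1] * 8 + [0] * 2 + [0] * 6 + [1] * 4
sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)  # 0.8 0.6
```

For a screening tool, a high sensitivity means few high-risk patients are missed, while a lower specificity means more low-risk patients are sent on for further investigation.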