90 resultados para Machine Learning,Natural Language Processing,Descriptive Text Mining,POIROT,Transformer
em AMS Tesi di Laurea - Alm@DL - Università di Bologna
Resumo:
Poiché la nostra conoscenza collettiva continua ad essere digitalizzata e memorizzata, diventa più difficile trovare e scoprire ciò che stiamo cercando. Abbiamo bisogno di nuovi strumenti computazionali per aiutare a organizzare, rintracciare e comprendere queste vaste quantità di informazioni. I modelli di linguaggio sono potenti strumenti che possono essere impiegati per estrarre conoscenza statisticamente significativa ed interpretabile tramite apprendimento non supervisionato, testuali o nel codice sorgente. L’obiettivo di questa tesi è impiegare una metodologia di descriptive text mining, denominata POIROT, per analizzare i rapporti medici del dataset Adverse Drug Reaction (ADE). Si vogliono stabilire delle correlazioni significative che permettano di comprendere le ragioni per cui un determinato rapporto medico fornisca o meno informazioni relative a effetti collaterali dovuti all’assunzione di determinati farmaci.
Resumo:
Driven by recent deep learning breakthroughs, natural language generation (NLG) models have been at the center of steady progress in the last few years. However, since our ability to generate human-indistinguishable artificial text lags behind our capacity to assess it, it is paramount to develop and apply even better automatic evaluation metrics. To facilitate researchers to judge the effectiveness of their models broadly, we suggest NLG-Metricverse—an end-to-end open-source library for NLG evaluation based on Python. This framework provides a living collection of NLG metrics in a unified and easy- to-use environment, supplying tools to efficiently apply, analyze, compare, and visualize them. This includes (i) the extensive support of heterogeneous automatic metrics with n-arity management, (ii) the meta-evaluation upon individual performance, metric-metric and metric-human correlations, (iii) graphical interpretations for helping humans better gain score intuitions, (iv) formal categorization and convenient documentation to accelerate metrics understanding. NLG-Metricverse aims to increase the comparability and replicability of NLG research, hopefully stimulating new contributions in the area.
Resumo:
Twitter is a highly popular social media which on one hand allows information transmission in real time and on the other hand represents a source of open access homogeneous text data. We propose an analysis of the most common self-reported COVID symptoms from a dataset of Italian tweets to investigate the evolution of the pandemic in Italy from the end of September 2020 to the end of January 2021. After manually filtering tweets actually describing COVID symptoms from the database - which contains words related to fever, cough and sore throat - we discuss usefulness of such filtering. We then compare our time series with the daily data of new hospitalisations in Italy, with the aim of building a simple linear regression model that accounts for the delay which is observed from the tweets mentioning individual symptoms to new hospitalisations. We discuss both the results and limitations of linear regression given that our data suggests that the relationship between time series of symptoms tweets and of new hospitalisations changes towards the end of the acquisition.
Resumo:
The aim of TinyML is to bring the capability of Machine Learning to ultra-low-power devices, typically under a milliwatt, and with this it breaks the traditional power barrier that prevents the widely distributed machine intelligence. TinyML allows greater reactivity and privacy by conducting inference on the computer and near-sensor while avoiding the energy cost associated with wireless communication, which is far higher at this scale than that of computing. In addition, TinyML’s efficiency makes a class of smart, battery-powered, always-on applications that can revolutionize the collection and processing of data in real time. This emerging field, which is the end of a lot of innovation, is ready to speed up its growth in the coming years. In this thesis, we deploy three model on a microcontroller. For the model, datasets are retrieved from an online repository and are preprocessed as per our requirement. The model is then trained on the split of preprocessed data at its best to get the most accuracy out of it. Later the trained model is converted to C language to make it possible to deploy on the microcontroller. Finally, we take step towards incorporating the model into the microcontroller by implementing and evaluating an interface for the user to utilize the microcontroller’s sensors. In our thesis, we will have 4 chapters. The first will give us an introduction of TinyML. The second chapter will help setup the TinyML Environment. The third chapter will be about a major use of TinyML in Wake Word Detection. The final chapter will deal with Gesture Recognition in TinyML.
Resumo:
With the advent of high-performance computing devices, deep neural networks have gained a lot of popularity in solving many Natural Language Processing tasks. However, they are also vulnerable to adversarial attacks, which are able to modify the input text in order to mislead the target model. Adversarial attacks are a serious threat to the security of deep neural networks, and they can be used to craft adversarial examples that steer the model towards a wrong decision. In this dissertation, we propose SynBA, a novel contextualized synonym-based adversarial attack for text classification. SynBA is based on the idea of replacing words in the input text with their synonyms, which are selected according to the context of the sentence. We show that SynBA successfully generates adversarial examples that are able to fool the target model with a high success rate. We demonstrate three advantages of this proposed approach: (1) effective - it outperforms state-of-the-art attacks by semantic similarity and perturbation rate, (2) utility-preserving - it preserves semantic content, grammaticality, and correct types classified by humans, and (3) efficient - it performs attacks faster than other methods.
Resumo:
In this thesis we address a multi-label hierarchical text classification problem in a low-resource setting and explore different approaches to identify the best one for our case. The goal is to train a model that classifies English school exercises according to a hierarchical taxonomy with few labeled data. The experiments made in this work employ different machine learning models and text representation techniques: CatBoost with tf-idf features, classifiers based on pre-trained models (mBERT, LASER), and SetFit, a framework for few-shot text classification. SetFit proved to be the most promising approach, achieving better performance when during training only a few labeled examples per class are available. However, this thesis does not consider all the hierarchical taxonomy, but only the first two levels: to address classification with the classes at the third level further experiments should be carried out, exploring methods for zero-shot text classification, data augmentation, and strategies to exploit the hierarchical structure of the taxonomy during training.
Resumo:
In recent years, Deep Learning techniques have shown to perform well on a large variety of problems both in Computer Vision and Natural Language Processing, reaching and often surpassing the state of the art on many tasks. The rise of deep learning is also revolutionizing the entire field of Machine Learning and Pattern Recognition pushing forward the concepts of automatic feature extraction and unsupervised learning in general. However, despite the strong success both in science and business, deep learning has its own limitations. It is often questioned if such techniques are only some kind of brute-force statistical approaches and if they can only work in the context of High Performance Computing with tons of data. Another important question is whether they are really biologically inspired, as claimed in certain cases, and if they can scale well in terms of "intelligence". The dissertation is focused on trying to answer these key questions in the context of Computer Vision and, in particular, Object Recognition, a task that has been heavily revolutionized by recent advances in the field. Practically speaking, these answers are based on an exhaustive comparison between two, very different, deep learning techniques on the aforementioned task: Convolutional Neural Network (CNN) and Hierarchical Temporal memory (HTM). They stand for two different approaches and points of view within the big hat of deep learning and are the best choices to understand and point out strengths and weaknesses of each of them. CNN is considered one of the most classic and powerful supervised methods used today in machine learning and pattern recognition, especially in object recognition. CNNs are well received and accepted by the scientific community and are already deployed in large corporation like Google and Facebook for solving face recognition and image auto-tagging problems. HTM, on the other hand, is known as a new emerging paradigm and a new meanly-unsupervised method, that is more biologically inspired. It tries to gain more insights from the computational neuroscience community in order to incorporate concepts like time, context and attention during the learning process which are typical of the human brain. In the end, the thesis is supposed to prove that in certain cases, with a lower quantity of data, HTM can outperform CNN.
Resumo:
Il periodo in cui viviamo rappresenta la cuspide di una forte e rapida evoluzione nella comprensione del linguaggio naturale, raggiuntasi prevalentemente grazie allo sviluppo di modelli neurali. Nell'ambito dell'information extraction, tali progressi hanno recentemente consentito di riconoscere efficacemente relazioni semantiche complesse tra entità menzionate nel testo, quali proteine, sintomi e farmaci. Tale task -- reso possibile dalla modellazione ad eventi -- è fondamentale in biomedicina, dove la crescita esponenziale del numero di pubblicazioni scientifiche accresce ulteriormente il bisogno di sistemi per l'estrazione automatica delle interazioni racchiuse nei documenti testuali. La combinazione di AI simbolica e sub-simbolica può consentire l'introduzione di conoscenza strutturata nota all'interno di language model, rendendo quest'ultimi più robusti, fattuali e interpretabili. In tale contesto, la verbalizzazione di grafi è uno dei task su cui si riversano maggiori aspettative. Nonostante l'importanza di tali contributi (dallo sviluppo di chatbot alla formulazione di nuove ipotesi di ricerca), ad oggi, risultano assenti contributi capaci di verbalizzare gli eventi biomedici espressi in letteratura, apprendendo il legame tra le interazioni espresse in forma a grafo e la loro controparte testuale. La tesi propone il primo dataset altamente comprensivo su coppie evento-testo, includendo diverse sotto-aree biomediche, quali malattie infettive, ricerca oncologica e biologia molecolare. Il dataset introdotto viene usato come base per l'addestramento di modelli generativi allo stato dell'arte sul task di verbalizzazione, adottando un approccio text-to-text e illustrando una tecnica formale per la codifica di grafi evento mediante testo aumentato. Infine, si dimostra la validità degli eventi per il miglioramento delle capacità di comprensione dei modelli neurali su altri task NLP, focalizzandosi su single-document summarization e multi-task learning.
Resumo:
The final goal of the thesis should be a real-world application in the production test data environment. This includes the pre-processing of the data, building models and visualizing the results. To do this, different machine learning models, outlier prediction oriented, should be investigated using a real dataset. Finally, the different outlier prediction algorithms should be compared, and their performance discussed.
Resumo:
Nonostante lo scetticismo di molti studiosi circa la possibilità di prevedere l'andamento della borsa valori, esistono svariate teorie ipotizzanti la possibilità di utilizzare le informazioni conosciute per predirne i movimenti futuri. L’avvento dell’intelligenza artificiale nella seconda parte dello scorso secolo ha permesso di ottenere risultati rivoluzionari in svariati ambiti, tanto che oggi tale disciplina trova ampio impiego nella nostra vita quotidiana in molteplici forme. In particolare, grazie al machine learning, è stato possibile sviluppare sistemi intelligenti che apprendono grazie ai dati, riuscendo a modellare problemi complessi. Visto il successo di questi sistemi, essi sono stati applicati anche all’arduo compito di predire la borsa valori, dapprima utilizzando i dati storici finanziari della borsa come fonte di conoscenza, e poi, con la messa a punto di tecniche di elaborazione del linguaggio naturale umano (NLP), anche utilizzando dati in linguaggio naturale, come il testo di notizie finanziarie o l’opinione degli investitori. Questo elaborato ha l’obiettivo di fornire una panoramica sull’utilizzo delle tecniche di machine learning nel campo della predizione del mercato azionario, partendo dalle tecniche più elementari per arrivare ai complessi modelli neurali che oggi rappresentano lo stato dell’arte. Vengono inoltre formalizzati il funzionamento e le tecniche che si utilizzano per addestrare e valutare i modelli di machine learning, per poi effettuare un esperimento in cui a partire da dati finanziari e soprattutto testuali si tenterà di predire correttamente la variazione del valore dell’indice di borsa S&P 500 utilizzando un language model basato su una rete neurale.
Resumo:
Natural Language Processing (NLP) has seen tremendous improvements over the last few years. Transformer architectures achieved impressive results in almost any NLP task, such as Text Classification, Machine Translation, and Language Generation. As time went by, transformers continued to improve thanks to larger corpora and bigger networks, reaching hundreds of billions of parameters. Training and deploying such large models has become prohibitively expensive, such that only big high tech companies can afford to train those models. Therefore, a lot of research has been dedicated to reducing a model’s size. In this thesis, we investigate the effects of Vocabulary Transfer and Knowledge Distillation for compressing large Language Models. The goal is to combine these two methodologies to further compress models without significant loss of performance. In particular, we designed different combination strategies and conducted a series of experiments on different vertical domains (medical, legal, news) and downstream tasks (Text Classification and Named Entity Recognition). Four different methods involving Vocabulary Transfer (VIPI) with and without a Masked Language Modelling (MLM) step and with and without Knowledge Distillation are compared against a baseline that assigns random vectors to new elements of the vocabulary. Results indicate that VIPI effectively transfers information of the original vocabulary and that MLM is beneficial. It is also noted that both vocabulary transfer and knowledge distillation are orthogonal to one another and may be applied jointly. The application of knowledge distillation first before subsequently applying vocabulary transfer is recommended. Finally, model performance due to vocabulary transfer does not always show a consistent trend as the vocabulary size is reduced. Hence, the choice of vocabulary size should be empirically selected by evaluation on the downstream task similar to hyperparameter tuning.
Resumo:
Artificial Intelligence (AI) has substantially influenced numerous disciplines in recent years. Biology, chemistry, and bioinformatics are among them, with significant advances in protein structure prediction, paratope prediction, protein-protein interactions (PPIs), and antibody-antigen interactions. Understanding PPIs is critical since they are responsible for practically everything living and have several uses in vaccines, cancer, immunology, and inflammatory illnesses. Machine Learning (ML) offers enormous potential for effectively simulating antibody-antigen interactions and improving in-silico optimization of therapeutic antibodies for desired features, including binding activity, stability, and low immunogenicity. This research looks at the use of AI algorithms to better understand antibody-antigen interactions, and it further expands and explains several difficulties encountered in the field. Furthermore, we contribute by presenting a method that outperforms existing state-of-the-art strategies in paratope prediction from sequence data.
Resumo:
The aim of this thesis project is to automatically localize HCC tumors in the human liver and subsequently predict if the tumor will undergo microvascular infiltration (MVI), the initial stage of metastasis development. The input data for the work have been partially supplied by Sant'Orsola Hospital and partially downloaded from online medical databases. Two Unet models have been implemented for the automatic segmentation of the livers and the HCC malignancies within it. The segmentation models have been evaluated with the Intersection-over-Union and the Dice Coefficient metrics. The outcomes obtained for the liver automatic segmentation are quite good (IOU = 0.82; DC = 0.35); the outcomes obtained for the tumor automatic segmentation (IOU = 0.35; DC = 0.46) are, instead, affected by some limitations: it can be state that the algorithm is almost always able to detect the location of the tumor, but it tends to underestimate its dimensions. The purpose is to achieve the CT images of the HCC tumors, necessary for features extraction. The 14 Haralick features calculated from the 3D-GLCM, the 120 Radiomic features and the patients' clinical information are collected to build a dataset of 153 features. Now, the goal is to build a model able to discriminate, based on the features given, the tumors that will undergo MVI and those that will not. This task can be seen as a classification problem: each tumor needs to be classified either as “MVI positive” or “MVI negative”. Techniques for features selection are implemented to identify the most descriptive features for the problem at hand and then, a set of classification models are trained and compared. Among all, the models with the best performances (around 80-84% ± 8-15%) result to be the XGBoost Classifier, the SDG Classifier and the Logist Regression models (without penalization and with Lasso, Ridge or Elastic Net penalization).
Resumo:
In questa tesi si trattano lo studio e la sperimentazione di un modello generativo retrieval-augmented, basato su Transformers, per il task di Abstractive Summarization su lunghe sentenze legali. La sintesi automatica del testo (Automatic Text Summarization) è diventata un task di Natural Language Processing (NLP) molto importante oggigiorno, visto il grandissimo numero di dati provenienti dal web e banche dati. Inoltre, essa permette di automatizzare un processo molto oneroso per gli esperti, specialmente nel settore legale, in cui i documenti sono lunghi e complicati, per cui difficili e dispendiosi da riassumere. I modelli allo stato dell’arte dell’Automatic Text Summarization sono basati su soluzioni di Deep Learning, in particolare sui Transformers, che rappresentano l’architettura più consolidata per task di NLP. Il modello proposto in questa tesi rappresenta una soluzione per la Long Document Summarization, ossia per generare riassunti di lunghe sequenze testuali. In particolare, l’architettura si basa sul modello RAG (Retrieval-Augmented Generation), recentemente introdotto dal team di ricerca Facebook AI per il task di Question Answering. L’obiettivo consiste nel modificare l’architettura RAG al fine di renderla adatta al task di Abstractive Long Document Summarization. In dettaglio, si vuole sfruttare e testare la memoria non parametrica del modello, con lo scopo di arricchire la rappresentazione del testo di input da riassumere. A tal fine, sono state sperimentate diverse configurazioni del modello su diverse tipologie di esperimenti e sono stati valutati i riassunti generati con diverse metriche automatiche.
Resumo:
The dissertation starts by providing a description of the phenomena related to the increasing importance recently acquired by satellite applications. The spread of such technology comes with implications, such as an increase in maintenance cost, from which derives the interest in developing advanced techniques that favor an augmented autonomy of spacecrafts in health monitoring. Machine learning techniques are widely employed to lay a foundation for effective systems specialized in fault detection by examining telemetry data. Telemetry consists of a considerable amount of information; therefore, the adopted algorithms must be able to handle multivariate data while facing the limitations imposed by on-board hardware features. In the framework of outlier detection, the dissertation addresses the topic of unsupervised machine learning methods. In the unsupervised scenario, lack of prior knowledge of the data behavior is assumed. In the specific, two models are brought to attention, namely Local Outlier Factor and One-Class Support Vector Machines. Their performances are compared in terms of both the achieved prediction accuracy and the equivalent computational cost. Both models are trained and tested upon the same sets of time series data in a variety of settings, finalized at gaining insights on the effect of the increase in dimensionality. The obtained results allow to claim that both models, combined with a proper tuning of their characteristic parameters, successfully comply with the role of outlier detectors in multivariate time series data. Nevertheless, under this specific context, Local Outlier Factor results to be outperforming One-Class SVM, in that it proves to be more stable over a wider range of input parameter values. This property is especially valuable in unsupervised learning since it suggests that the model is keen to adapting to unforeseen patterns.