967 results for Random Forest





The main purpose of this study is to identify the set of features that best enables the automatic identification of argumentative sentences in unstructured text. As a corpus, we use case law from the European Court of Human Rights (ECHR). Three kinds of experiments are conducted: Basic Experiments, Multi Feature Experiments and Tree Kernel Experiments. These experiments are categorized according to the type of features available in the corpus. The features are extracted from the corpus, and Support Vector Machine (SVM) and Random Forest are used as the machine learning algorithms. We achieved an F1 score of 0.705 for identifying argumentative sentences, which is a promising result and can be used as the basis for a general argument-mining framework.





Surveying and analyzing the spatial distribution of soil attributes with geostatistical tools is essential if each hectare of land is to be cultivated according to its actual suitability. Synthetic aperture radar (SAR) images have great potential for estimating soil moisture, so these sensors can assist in mapping the physical and physical-hydric properties of soils. The general objective of this study was to evaluate the potential of ALOS/PALSAR radar (microwave) images for identifying soils in an area of the Botucatu Formation dominated by sandy and medium-textured soils, in the municipality of Mineiros, GO. The area covers approximately 946 ha, with relief ranging from flat to gently undulating, and its geology consists basically of sandstones of the Botucatu Formation. In this study, 84 points were sampled for calibration and 25 points for validation, collected at depths of 0-20 cm and 60-80 cm. The soil samples were analyzed to determine sand, silt, clay, field capacity (CC), permanent wilting point (PMP) and total available water (AD). Images from five dates and different polarizations were acquired, totaling 14 images, which were processed for geometric and radiometric correction using the digital elevation model (MDE). Covariates of terrain attributes were also generated: elevation (ELEV), slope (DECLIV), relative slope position (PR-DECL), vertical distance to the drainage channel (DVCD), LS factor (FATOR-LS) and Euclidean distance (D-EUCL). The soil attributes were predicted using the Random Forest (RF) and Random Forest Kriging (RFK) methods, with the radar images and the terrain attributes as predictor covariates.
Processing the ALOS/PALSAR radar images enabled the geometric and radiometric corrections, transforming the data into units of backscatter coefficient (σ°) corrected by the digital elevation model (MDE). The acquired images broadly represented the variations in σ° that occurred on different dates. The soils of the study area are predominantly sandy, with most sampled points classified as NEOSSOLOS QUARTZARÊNICOS, followed by LATOSSOLOS. The RF models used to predict the physical and physical-hydric soil attributes made it possible to analyze the contribution of the predictor covariates. The terrain attributes with the greatest influence on the prediction of the studied attributes are related to elevation. The images of 03/05/2009 (HH1, VV1, HV1 and VH1) and 26/09/2010 (HH3 and HV3), obtained in drier periods, correlated better with the soil attributes. Analysis of the semivariograms of the RF prediction residuals showed greater spatial dependence in the 60-80 cm layer. Adding Kriging to the RF model improved the prediction of sand, clay, CC and PMP. The use of ALOS/PALSAR radar images and terrain attributes as covariates in RFK models showed potential for estimating the physical attributes (sand and clay) and the physical-hydric attributes (CC and PMP), which can assist in mapping soils associated with the parent materials of the Botucatu Formation.





Customer churn refers to a customer ending their relationship with a company. Companies generally consider a customer lost when a given period of time has passed since the customer's last interaction with the company's services. Reducing the churn rate is therefore a key business objective for any company. To retain customers who are about to leave, it is necessary to predict in advance which customers will churn and to know which marketing actions will have the greatest impact on the retention of each particular customer. The objective of this thesis is the study and implementation of a customer churn prediction system for a chain of gyms; the system is built on behalf of Technogym, a leading company in the fitness market. Technogym already offers a churn-risk prediction service based on static rules. That service gives acceptable results, but it does not adapt automatically as customer characteristics change over time. This thesis exploits the potential of machine learning technologies to address the limitations of the system historically used by the company. The thesis work comprised three macro-phases. The first phase was the understanding and analysis of the historical system, with the aim of understanding the structure of the data, improving its quality and, through statistical analysis, examining its information content in relation to the features defined by the machine learning algorithms. The second phase involved the study, definition and implementation of two ML models based on the same features but using two different technologies: Random Forest Classifier and Google's AutoML Tables service. The third phase focused on a comparative evaluation of the performance of the ML models against the historical system.





Recognizing road surface conditions solely from data collected by the smartphone of a cyclist riding a bicycle is a research area that has so far been little explored. For this thesis, a dedicated application was developed which, combined with Python scripts, makes it possible to recognize different types of asphalt. The application collects the data from the motion sensors built into the smartphone, which records movements while the cyclist is riding. The smartphone is mounted in a dedicated holder fixed to the handlebar of the bicycle and records data from the gyroscope, accelerometer and magnetometer. The data are stored in CSV files, which are processed into a single dataset containing all the collected data together with the features extracted by dedicated Python scripts. Each record is assigned a cluster based on the results produced by K-means, and these results are later used to train supervised algorithms. The aim of the algorithms is to recognize the type of road surface from these data. For training, the dataset was split into two parts: the training set, from which the algorithms learn to classify the data, and the test set, on which the algorithms apply what they have learned to output the classification they consider appropriate. Comparing the algorithms' predictions with what the data actually represent gives a measure of each algorithm's accuracy.
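The cluster-then-classify workflow described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the feature matrix is synthetic, the number of clusters (3) is arbitrary, and Random Forest stands in for the supervised algorithm, which the abstract does not name. The scikit-learn calls are standard, but none of the variable names come from the thesis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for per-record sensor features (hypothetical data):
# three well-separated groups of 4-dimensional feature vectors.
rng = np.random.default_rng(0)
features = np.vstack([
    rng.normal(loc=center, scale=0.3, size=(60, 4))
    for center in (0.0, 3.0, 6.0)
])

# Step 1: K-means assigns a cluster id to every record.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(features)

# Step 2: split the labelled records into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(
    features, cluster_labels, test_size=0.25, random_state=0
)

# Step 3: train a supervised classifier on the cluster ids (assumed model),
# then compare its predictions with the held-out labels.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"accuracy on held-out records: {accuracy:.3f}")
```

Note that the accuracy measured this way reflects agreement with the K-means labels, not with an independently surveyed road surface type.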





With the advent of new technologies, it is increasingly easy to obtain data of different kinds from ever more accurate sensors that measure a wide range of physical quantities using different methodologies. Data collection thus becomes progressively more important and takes the form of archiving, cataloging, and online and offline consultation of information. Over time, the amount of data collected can become so large that it contains information that cannot easily be explored manually or with basic statistical techniques. Big Data therefore becomes the object of more advanced investigation techniques, such as Machine Learning and Deep Learning. This work describes some applications in precision livestock farming and in the study of heat stress suffered by dairy cows. Experimental Italian and German barns were involved in the training and testing of the Random Forest algorithm, which predicted milk production as a function of the microclimatic conditions of the previous days with satisfactory accuracy. Furthermore, a Robust Statistics technique was used to identify an objective method for detecting production drops relative to the Wood model, which is typically used as an analytical model of the lactation curve. Its application to some sample lactations and the results obtained give us confidence in the future use of this method.





Hematological cancers are a heterogeneous family of diseases that can be divided into leukemias, lymphomas, and myelomas, often called “liquid tumors”. Since they cannot be surgically removed, chemotherapy represents the mainstay of their treatment. However, it still faces several challenges, such as drug resistance and low response rates, and the need for new anticancer agents is compelling. The drug discovery process is long, costly, and prone to high failure rates. With the rapid expansion of biological and chemical "big data", computational techniques such as machine learning tools have been increasingly employed to speed up the whole process and reduce its cost. Machine learning algorithms can create complex models with the aim of determining the biological activity of compounds against several targets, based on their chemical properties. These models are defined as multi-target Quantitative Structure-Activity Relationship (mt-QSAR) models and can be used to virtually screen small and large chemical libraries for the identification of new molecules with anticancer activity. The aim of my Ph.D. project was to employ machine learning techniques to build an mt-QSAR classification model for the prediction of cytotoxic drugs simultaneously active against 43 hematological cancer cell lines. For this purpose, I first constructed a large and diversified dataset of molecules extracted from the ChEMBL database. Then, I compared the performance of different ML classification algorithms, and Random Forest was identified as the one returning the best predictions. Finally, I used different approaches to maximize the performance of the model, which achieved an accuracy of 88% by correctly classifying 93% of inactive molecules and 72% of active molecules in a validation set. This model was further applied to the virtual screening of a small dataset of molecules tested in our laboratory, where it showed 100% accuracy in correctly classifying all molecules.
This result is confirmed by our previous in vitro experiments.





Background: There is wide variation in the recurrence risk of Non-small-cell lung cancer (NSCLC) within the same Tumor Node Metastasis (TNM) stage, suggesting that other parameters are involved in determining this probability. Radiomics allows the extraction of quantitative information from images that can be used for clinical purposes. The primary objective of this study is to develop a radiomic prognostic model that predicts the 3-year disease-free survival (DFS) of resected Early Stage (ES) NSCLC patients. Material and Methods: 56 pre-surgery non-contrast Computed Tomography (CT) scans were retrieved from the PACS of our institution and anonymized. They were then automatically segmented with an open-access deep learning pipeline and reviewed by an experienced radiologist to obtain 3D masks of the NSCLC. Images and masks underwent resampling, normalization and discretization. From the masks, hundreds of Radiomic Features (RF) were extracted using Py-Radiomics. The RF were then reduced to select the most representative features. The remaining RF were used in combination with clinical parameters to build a DFS prediction model using Leave-one-out cross-validation (LOOCV) with Random Forest. Results and Conclusion: Poor agreement between the radiologist and the automatic segmentation algorithm (DICE score of 0.37) was found. Therefore, another experienced radiologist manually segmented the lesions, and only stable and reproducible RF were kept. 50 RF demonstrated a high correlation with the DFS, but only one was confirmed when clinicopathological covariates were added: Busyness, a Neighbouring Gray Tone Difference Matrix feature (HR 9.610). 16 clinical variables (which comprised TNM) were used to build the LOOCV model, which demonstrated a higher Area Under the Curve (AUC) when RF were included in the analysis (0.67 vs 0.60), but the difference was not statistically significant (p=0.5147).
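For readers unfamiliar with the DICE score used above to compare the automatic and manual segmentations, here is a minimal sketch of the standard definition, Dice = 2|A ∩ B| / (|A| + |B|), on flattened binary masks. The two toy masks are invented for illustration and have nothing to do with the study's CT data; returning 1.0 for two empty masks is a convention chosen here, not something stated in the abstract.

```python
def dice_score(mask_a, mask_b):
    """Dice coefficient of two binary masks given as equal-length sequences."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    # Voxels marked as lesion in both masks.
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    # Total lesion voxels across the two masks.
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# Hypothetical flattened masks (1 = lesion voxel, 0 = background).
radiologist = [0, 1, 1, 1, 0, 0, 1, 0]
automatic = [0, 0, 1, 1, 1, 0, 0, 0]
agreement = dice_score(radiologist, automatic)
print(f"Dice = {agreement:.3f}")
```

A value of 1 means the two masks coincide and 0 means they do not overlap at all, which is why the reported 0.37 is described as poor agreement.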





The cinnabar (±stibnite) deposits of Monte Amiata constitute a mining district of worldwide importance, with a total historical production exceeding 117 kt of Hg, produced between 1850 and 1982. The area of the mining district hosts the geothermal system of the same name, with 6 power plants producing 121 MW equivalent of energy. The aim of this thesis is to identify and verify correlations between the peculiar N-S distribution of the cinnabar mineralizations, the geothermal manifestations and the structural setting that characterizes the cinnabar district and the geothermal system. The correlations were identified by applying Machine Learning (ML) algorithms, using Scikit-learn, to a two-dimensional dataset built with GIS applications to contain all the geological and ore-deposit data on the Amiata district found in the literature. A three-dimensional model of the study area was built, based on four solids that group the geological formations present in the area according to their geohydrological characteristics. On the basis of the results obtained, it can be stated that ML techniques proved useful in identifying correlations among the different geological-structural factors that characterize the Monte Amiata geothermal system; that the peculiar N-S spatial distribution of the district's deposits depends on the combination of a system of faults and folds; and that the regression models based on decision trees (CatBoost and Random Forest) perform better overall and are geologically more meaningful. This work suggests that ML is a tool capable of suggesting new and little-explored relationships among the geological and ore-deposit elements of an area as complex as a geothermal system, and that it can guide subsequent phases of complex geological studies.





As a consequence of the spread of next generation sequencing techniques, metagenomics databases have become one of the most promising repositories of information about the features and behavior of microorganisms. One of the subjects that can be studied from those data is bacterial populations. Next generation sequencing techniques make it possible to study the bacterial population within an environment by sampling genetic material directly from it, without the need to culture a similar population in vitro and observe its behavior. As a drawback, it is quite complex to extract information from those data, and there is usually more than one way to do so; antimicrobial resistance (AMR) is no exception. In this study we discuss how the quantified AMR, which concerns the genotype of the bacteria, can be related to the bacterial phenotype and its actual level of resistance against a specific substance. To obtain quantitative information about the bacterial genotype, we evaluate the resistome from the read libraries by aligning them against the CARD database. With those data, we test various machine learning algorithms for predicting the bacterial phenotype. The samples that we use should resemble those that could be obtained from a natural context, but are actually produced by a read library simulation tool. In this way we are able to design populations with bacteria of known genotype, so that we can rely on a secure ground truth for training and testing our algorithms.





Day by day, machine learning is changing our lives in ways we could not have imagined just 5 years ago. ML expertise is increasingly in demand, yet only a limited number of ML engineers are available on the job market, and their knowledge is always limited by an inherent characteristic of theirs: they are human. This thesis explores the possibilities offered by meta-learning, a new field in ML that takes learning a level higher: models are trained on data about how other models were trained (features of the dataset they were trained on, inference times, obtained performance) to try to understand the relationship between a good model and the way it was obtained. The so-called metamodel was trained on data collected by OpenML, the largest ML metadata platform publicly available today. Datasets were analyzed to obtain meta-features that describe them, which were then tied to model performance in a regression task. The resulting metamodel predicts the expected performance of a given model type (e.g., a random forest) on a given ML task (e.g., classification on the UCI census dataset). This research was then integrated into a custom-made AutoML framework, to show that meta-learning is not an end in itself but can be used to further advance ML research. Encoding ML engineering expertise in a model allows better, faster, and more impactful ML applications across the whole world, while reducing the cost that is inevitably tied to human engineers.





The aim of my thesis project is to create a model able to predict the rating of the applications in the Play Store, one of the largest Android digital distribution services. To this end I used the Python language, which thanks to its libraries, simplicity and versatility is certainly one of the most widely used languages in the field of artificial intelligence. The starting point of my study was the dataset (a set of data structured in relational form) “Google Play Store Apps”, available on Kaggle at https://www.kaggle.com/datasets/lava18/google-play-store-apps, containing 10841 observations and 13 attributes. After a first part covering the loading, visualization and preparation of the data, I applied four different Machine Learning techniques to estimate the rating of the applications: Ridge, Linear Regression, Random Forest and SVR. These algorithms were applied with two different transformations (Label Encoding and One Hot Encoding) of the ‘Category’ variable, in order to analyze how these transformations affect the goodness of the model. I then compared the mean squared error (MSE), the mean absolute error (MAE) and the median absolute error (MdAE) in order to understand which algorithm is the most efficient.
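The two encodings and the three error measures named in this abstract can be illustrated with a small standard-library sketch. The category values, ratings and predictions below are invented for illustration and are not taken from the Kaggle dataset or from the thesis; the helper names are likewise hypothetical.

```python
from statistics import mean, median


def label_encode(values):
    """Label Encoding: map each distinct category to one integer."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values]


def one_hot_encode(values):
    """One Hot Encoding: one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return [[1 if v == cat else 0 for cat in categories] for v in values]


def mse(y_true, y_pred):
    """Mean squared error."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def mdae(y_true, y_pred):
    """Median absolute error, less sensitive to a few large misses."""
    return median(abs(t - p) for t, p in zip(y_true, y_pred))


# Hypothetical 'Category' column and ratings.
category = ["GAME", "TOOLS", "GAME", "FAMILY"]
labels = label_encode(category)
one_hot = one_hot_encode(category)

y_true = [4.0, 3.5, 4.5, 5.0]
y_pred = [4.5, 3.5, 4.0, 3.0]
print(labels, one_hot)
print(mse(y_true, y_pred), mae(y_true, y_pred), mdae(y_true, y_pred))
```

In this toy case a single large miss (5.0 predicted as 3.0) dominates the MSE while barely moving the MdAE, which is one reason to report the three measures side by side. Label Encoding also imposes an arbitrary order on the categories, whereas One Hot Encoding does not, at the cost of more columns.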





Emissions estimation, both during homologation and in standard driving, is one of the new challenges that the automotive industry has to face. The new European and American regulations will allow progressively lower Carbon Monoxide emissions and will require all vehicles to be able to monitor their own pollutant production. Since numerical models are too computationally expensive and too approximate, new solutions based on Machine Learning are replacing standard techniques. In this project we considered a real V12 Internal Combustion Engine and propose a novel approach that pushes Random Forests to generate meaningful predictions even in extreme cases (extrapolation, very high frequency peaks, noisy instrumentation, etc.). The present work also proposes a data preprocessing pipeline for strongly unbalanced datasets and a reinterpretation of the regression problem as a classification problem in a logarithmic quantized domain. Results have been evaluated for two different models, representing a pure interpolation scenario (the more standard one) and an extrapolation scenario, to test the out-of-bounds robustness of the model. The metrics employed take into account different aspects that can affect the homologation procedure, so the final analysis focuses on combining all the specific performances to obtain the overall conclusions.
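One possible reading of "a classification problem in a logarithmic quantized domain" is to replace the continuous target with the index of a bin whose edges are evenly spaced in log10. The sketch below illustrates that general idea only; the range (1 to 1000, in arbitrary units), the number of bins and the function names are assumptions, and the abstract does not describe the actual quantization scheme.

```python
import math


def log_step(lo, hi, n_bins):
    """Width of one bin in log10 units."""
    if lo <= 0 or hi <= lo or n_bins < 1:
        raise ValueError("need 0 < lo < hi and at least one bin")
    return (math.log10(hi) - math.log10(lo)) / n_bins


def to_class(value, lo, hi, n_bins):
    """Class index of a positive value; out-of-range values are clamped."""
    clamped = min(max(value, lo), hi)
    idx = int((math.log10(clamped) - math.log10(lo)) / log_step(lo, hi, n_bins))
    return min(idx, n_bins - 1)  # the upper edge belongs to the last bin


def from_class(idx, lo, hi, n_bins):
    """Representative value of a class: the geometric centre of its bin."""
    return 10 ** (math.log10(lo) + (idx + 0.5) * log_step(lo, hi, n_bins))


# Hypothetical emission readings mapped to 3 classes over the range 1..1000.
LO, HI, N_BINS = 1.0, 1000.0, 3
readings = [0.1, 5.0, 50.0, 500.0, 5000.0]
classes = [to_class(r, LO, HI, N_BINS) for r in readings]
print(classes)
print([round(from_class(c, LO, HI, N_BINS), 2) for c in classes])
```

A classifier such as a Random Forest would then be trained on the class indices, and a predicted class mapped back to a value with `from_class`. Log-spaced bins give small readings as much resolution as rare high peaks.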





Combinatorial decision and optimization problems arise in numerous applications, such as logistics and scheduling, and can be solved with various approaches. Boolean Satisfiability and Constraint Programming solvers are among the most widely used, and their performance is significantly influenced by the model chosen to represent a given problem. This has led to the study of model reformulation methods, one of which is tabulation, which consists of rewriting the expression of a constraint in terms of a table constraint. To apply it, one should identify which constraints can help and which can hinder the solving process. So far this has been done by hand, for example in MiniZinc, or automatically with manually designed heuristics, in Savile Row. It has been shown, however, that the performance of these heuristics differs across problems and solvers, in some cases helping and in others hindering the solving procedure. Recent work in the field of combinatorial optimization has shown that Machine Learning (ML) can be increasingly useful in the model reformulation steps. This thesis aims to design an ML approach to identify the instances for which Savile Row’s heuristics should be activated. Additionally, it is possible that the heuristics miss some good tabulation opportunities, so we perform an exploratory analysis for the creation of an ML classifier able to predict whether or not a constraint should be tabulated. The results for the first goal show that a random forest classifier leads to an increase in the performance of 4 different solvers. The experimental results for the second task show that an ML approach could improve the performance of a solver for some problem classes.





Worldwide, biodiversity is decreasing due to climate change, habitat fragmentation and agricultural intensification. Bees are essential crop pollinators, but their abundance and diversity are decreasing as well. For their conservation, it is necessary to assess the status of bee populations. Field data collection methods are expensive and time consuming, so new methods based on remote sensing have recently come into use. In this study we tested the possibility of using flower cover diversity estimated from UAV images (FCD-UAV) to assess bee diversity and abundance in 10 agricultural meadows in the Netherlands. To do so, field data on flower and bee diversity and abundance were collected during a campaign in May 2021. Furthermore, RGB images of the areas were collected using an Unmanned Aerial Vehicle (UAV) and post-processed into orthomosaics. Lastly, the Random Forest machine learning algorithm was applied to estimate the FCD of the species detected in each field. The resulting FCD was expressed with the Shannon and Simpson diversity indices, which were subsequently correlated with the bee Shannon and Simpson diversity indices, abundance and species richness. The results showed a positive relationship between FCD-UAV and in-situ data on bee diversity (evaluated with the Shannon index), abundance and species richness. The strongest relationship was found between FCD (Shannon Index) and bee abundance, with R²=0.52. Good correlations were also found with bee species richness (R²=0.39) and bee diversity (R²=0.37). The R² values of the relationship between FCD (Simpson Index) and bee abundance, species richness and diversity were slightly lower (0.45, 0.37 and 0.35, respectively). Our results suggest that the proposed method, based on coupling UAV imagery and machine learning for the assessment of flower species diversity, could be developed into a valuable tool for large-scale, standardized and cost-effective monitoring of flower cover and of habitat quality for bees.
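The two diversity indices mentioned above can be computed from per-species cover or counts as in the following sketch. It assumes the common textbook forms: Shannon H = -Σ p_i ln p_i and the Gini-Simpson form 1 - Σ p_i². The abstract does not say which Simpson variant or logarithm base the study used, and the cover values below are invented.

```python
import math


def proportions(counts):
    """Relative share of each species, ignoring species with zero cover."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("at least one species must have positive cover")
    return [c / total for c in counts if c > 0]


def shannon_index(counts):
    """Shannon diversity H = -sum(p * ln p), natural logarithm assumed."""
    return -sum(p * math.log(p) for p in proportions(counts))


def simpson_index(counts):
    """Gini-Simpson diversity 1 - sum(p^2) (assumed variant)."""
    return 1.0 - sum(p * p for p in proportions(counts))


# Hypothetical flower cover (e.g. classified pixels) per species in one meadow.
flower_cover = [50, 30, 20]
print(f"Shannon: {shannon_index(flower_cover):.3f}")
print(f"Simpson: {simpson_index(flower_cover):.3f}")
```

Both indices are 0 for a meadow with a single species and grow as cover is spread more evenly over more species, which is what makes them comparable between the flower and bee communities.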





Research on multiple classifier systems includes the creation of an ensemble of classifiers and the proper combination of their decisions. To combine the decisions given by the classifiers, methods based on fixed rules and decision templates are often used. As a result, the influence of and relationships between classifier decisions are often not considered in the combination schemes. In this paper we propose a framework to combine classifiers using a decision graph under a random field model and a game strategy approach to obtain the final decision. The results of combining Optimum-Path Forest (OPF) classifiers using the proposed model are reported, with good performance obtained in experiments using simulated and real data sets. The results encourage combining OPF ensembles with the framework to design multiple classifier systems. © 2011 Springer-Verlag.