887 resultados para random forests
Resumo:
While plants of a single species emit a diversity of volatile organic compounds (VOCs) to attract or repel interacting organisms, these specific messages may be lost in the midst of the hundreds of VOCs produced by sympatric plants of different species, many of which may have no signal content. Receivers must be able to reduce the babel or noise in these VOCs in order to correctly identify the message. For chemical ecologists faced with vast amounts of data on volatile signatures of plants in different ecological contexts, it is imperative to employ accurate methods of classifying messages, so that suitable bioassays may then be designed to understand message content. We demonstrate the utility of `Random Forests' (RF), a machine-learning algorithm, for the task of classifying volatile signatures and choosing the minimum set of volatiles for accurate discrimination, using datam from sympatric Ficus species as a case study. We demonstrate the advantages of RF over conventional classification methods such as principal component analysis (PCA), as well as data-mining algorithms such as support vector machines (SVM), diagonal linear discriminant analysis (DLDA) and k-nearest neighbour (KNN) analysis. We show why a tree-building method such as RF, which is increasingly being used by the bioinformatics, food technology and medical community, is particularly advantageous for the study of plant communication using volatiles, dealing, as it must, with abundant noise.
Resumo:
Esta dissertação apresenta um sistema de indução de classificadores fuzzy. Ao invés de utilizar a abordagem tradicional de sistemas fuzzy baseados em regras, foi utilizado o modelo de Árvore de Padrões Fuzzy(APF), que é um modelo hierárquico, com uma estrutura baseada em árvores que possuem como nós internos operadores lógicos fuzzy e as folhas são compostas pela associação de termos fuzzy com os atributos de entrada. O classificador foi obtido sintetizando uma árvore para cada classe, esta árvore será uma descrição lógica da classe o que permite analisar e interpretar como é feita a classificação. O método de aprendizado originalmente concebido para a APF foi substituído pela Programação Genética Cartesiana com o intuito de explorar melhor o espaço de busca. O classificador APF foi comparado com as Máquinas de Vetores de Suporte, K-Vizinhos mais próximos, florestas aleatórias e outros métodos Fuzzy-Genéticos em diversas bases de dados do UCI Machine Learning Repository e observou-se que o classificador APF apresenta resultados competitivos. Ele também foi comparado com o método de aprendizado original e obteve resultados comparáveis com árvores mais compactas e com um menor número de avaliações.
Resumo:
We propose a novel model for the spatio-temporal clustering of trajectories based on motion, which applies to challenging street-view video sequences of pedestrians captured by a mobile camera. A key contribution of our work is the introduction of novel probabilistic region trajectories, motivated by the non-repeatability of segmentation of frames in a video sequence. Hierarchical image segments are obtained by using a state-of-the-art hierarchical segmentation algorithm, and connected from adjacent frames in a directed acyclic graph. The region trajectories and measures of confidence are extracted from this graph using a dynamic programming-based optimisation. Our second main contribution is a Bayesian framework with a twofold goal: to learn the optimal, in a maximum likelihood sense, Random Forests classifier of motion patterns based on video features, and construct a unique graph from region trajectories of different frames, lengths and hierarchical levels. Finally, we demonstrate the use of Isomap for effective spatio-temporal clustering of the region trajectories of pedestrians. We support our claims with experimental results on new and existing challenging video sequences. © 2011 IEEE.
Resumo:
This paper tackles the novel challenging problem of 3D object phenotype recognition from a single 2D silhouette. To bridge the large pose (articulation or deformation) and camera viewpoint changes between the gallery images and query image, we propose a novel probabilistic inference algorithm based on 3D shape priors. Our approach combines both generative and discriminative learning. We use latent probabilistic generative models to capture 3D shape and pose variations from a set of 3D mesh models. Based on these 3D shape priors, we generate a large number of projections for different phenotype classes, poses, and camera viewpoints, and implement Random Forests to efficiently solve the shape and pose inference problems. By model selection in terms of the silhouette coherency between the query and the projections of 3D shapes synthesized using the galleries, we achieve the phenotype recognition result as well as a fast approximate 3D reconstruction of the query. To verify the efficacy of the proposed approach, we present new datasets which contain over 500 images of various human and shark phenotypes and motions. The experimental results clearly show the benefits of using the 3D priors in the proposed method over previous 2D-based methods. © 2011 IEEE.
Resumo:
A novel hybrid data-driven approach is developed for forecasting power system parameters with the goal of increasing the efficiency of short-term forecasting studies for non-stationary time-series. The proposed approach is based on mode decomposition and a feature analysis of initial retrospective data using the Hilbert-Huang transform and machine learning algorithms. The random forests and gradient boosting trees learning techniques were examined. The decision tree techniques were used to rank the importance of variables employed in the forecasting models. The Mean Decrease Gini index is employed as an impurity function. The resulting hybrid forecasting models employ the radial basis function neural network and support vector regression. A part from introduction and references the paper is organized as follows. The second section presents the background and the review of several approaches for short-term forecasting of power system parameters. In the third section a hybrid machine learningbased algorithm using Hilbert-Huang transform is developed for short-term forecasting of power system parameters. Fourth section describes the decision tree learning algorithms used for the issue of variables importance. Finally in section six the experimental results in the following electric power problems are presented: active power flow forecasting, electricity price forecasting and for the wind speed and direction forecasting.
Resumo:
Background: Ineffective risk stratification can delay diagnosis of serious disease in patients with hematuria. We applied a systems biology approach to analyze clinical, demographic and biomarker measurements (n = 29) collected from 157 hematuric patients: 80 urothelial cancer (UC) and 77 controls with confounding pathologies.
Methods: On the basis of biomarkers, we conducted agglomerative hierarchical clustering to identify patient and biomarker clusters. We then explored the relationship between the patient clusters and clinical characteristics using Chi-square analyses. We determined classification errors and areas under the receiver operating curve of Random Forest Classifiers (RFC) for patient subpopulations using the biomarker clusters to reduce the dimensionality of the data.
Results: Agglomerative clustering identified five patient clusters and seven biomarker clusters. Final diagnoses categories were non-randomly distributed across the five patient clusters. In addition, two of the patient clusters were enriched with patients with ‘low cancer-risk’ characteristics. The biomarkers which contributed to the diagnostic classifiers for these two patient clusters were similar. In contrast, three of the patient clusters were significantly enriched with patients harboring ‘high cancer-risk” characteristics including proteinuria, aggressive pathological stage and grade, and malignant cytology. Patients in these three clusters included controls, that is, patients with other serious disease and patients with cancers other than UC. Biomarkers which contributed to the diagnostic classifiers for the largest ‘high cancer- risk’ cluster were different than those contributing to the classifiers for the ‘low cancer-risk’ clusters. Biomarkers which contributed to subpopulations that were split according to smoking status, gender and medication were different.
Conclusions: The systems biology approach applied in this study allowed the hematuric patients to cluster naturally on the basis of the heterogeneity within their biomarker data, into five distinct risk subpopulations. Our findings highlight an approach with the promise to unlock the potential of biomarkers. This will be especially valuable in the field of diagnostic bladder cancer where biomarkers are urgently required. Clinicians could interpret risk classification scores in the context of clinical parameters at the time of triage. This could reduce cystoscopies and enable priority diagnosis of aggressive diseases, leading to improved patient outcomes at reduced costs. © 2013 Emmert-Streib et al; licensee BioMed Central Ltd.
Resumo:
Efficient identification and follow-up of astronomical transients is hindered by the need for humans to manually select promising candidates from data streams that contain many false positives. These artefacts arise in the difference images that are produced by most major ground-based time-domain surveys with large format CCD cameras. This dependence on humans to reject bogus detections is unsustainable for next generation all-sky surveys and significant effort is now being invested to solve the problem computationally. In this paper, we explore a simple machine learning approach to real-bogus classification by constructing a training set from the image data of similar to 32 000 real astrophysical transients and bogus detections from the Pan-STARRS1 Medium Deep Survey. We derive our feature representation from the pixel intensity values of a 20 x 20 pixel stamp around the centre of the candidates. This differs from previous work in that it works directly on the pixels rather than catalogued domain knowledge for feature design or selection. Three machine learning algorithms are trained (artificial neural networks, support vector machines and random forests) and their performances are tested on a held-out subset of 25 per cent of the training data. We find the best results from the random forest classifier and demonstrate that by accepting a false positive rate of 1 per cent, the classifier initially suggests a missed detection rate of around 10 per cent. However, we also find that a combination of bright star variability, nuclear transients and uncertainty in human labelling means that our best estimate of the missed detection rate is approximately 6 per cent.
Resumo:
Tese de doutoramento, Informática (Bioinformática), Universidade de Lisboa, Faculdade de Ciências, 2014
Resumo:
Dissertação de Mestrado em Gestão do Território, Especialização em Detecção Remota e Sistemas de Informação Geográfica
Predicting sense of community and participation by applying machine learning to open government data
Resumo:
Community capacity is used to monitor socio-economic development. It is composed of a number of dimensions, which can be measured to understand the possible issues in the implementation of a policy or the outcome of a project targeting a community. Measuring community capacity dimensions is usually expensive and time consuming, requiring locally organised surveys. Therefore, we investigate a technique to estimate them by applying the Random Forests algorithm on secondary open government data. This research focuses on the prediction of measures for two dimensions: sense of community and participation. The most important variables for this prediction were determined. The variables included in the datasets used to train the predictive models complied with two criteria: nationwide availability; sufficiently fine-grained geographic breakdown, i.e. neighbourhood level. The models explained 77% of the sense of community measures and 63% of participation. Due to the low geographic detail of the outcome measures available, further research is required to apply the predictive models to a neighbourhood level. The variables that were found to be more determinant for prediction were only partially in agreement with the factors that, according to the social science literature consulted, are the most influential for sense of community and participation. This finding should be further investigated from a social science perspective, in order to be understood in depth.
Resumo:
L'increment de bases de dades que cada vegada contenen imatges més difícils i amb un nombre més elevat de categories, està forçant el desenvolupament de tècniques de representació d'imatges que siguin discriminatives quan es vol treballar amb múltiples classes i d'algorismes que siguin eficients en l'aprenentatge i classificació. Aquesta tesi explora el problema de classificar les imatges segons l'objecte que contenen quan es disposa d'un gran nombre de categories. Primerament s'investiga com un sistema híbrid format per un model generatiu i un model discriminatiu pot beneficiar la tasca de classificació d'imatges on el nivell d'anotació humà sigui mínim. Per aquesta tasca introduïm un nou vocabulari utilitzant una representació densa de descriptors color-SIFT, i desprès s'investiga com els diferents paràmetres afecten la classificació final. Tot seguit es proposa un mètode par tal d'incorporar informació espacial amb el sistema híbrid, mostrant que la informació de context es de gran ajuda per la classificació d'imatges. Desprès introduïm un nou descriptor de forma que representa la imatge segons la seva forma local i la seva forma espacial, tot junt amb un kernel que incorpora aquesta informació espacial en forma piramidal. La forma es representada per un vector compacte obtenint un descriptor molt adequat per ésser utilitzat amb algorismes d'aprenentatge amb kernels. Els experiments realitzats postren que aquesta informació de forma te uns resultats semblants (i a vegades millors) als descriptors basats en aparença. També s'investiga com diferents característiques es poden combinar per ésser utilitzades en la classificació d'imatges i es mostra com el descriptor de forma proposat juntament amb un descriptor d'aparença millora substancialment la classificació. Finalment es descriu un algoritme que detecta les regions d'interès automàticament durant l'entrenament i la classificació. Això proporciona un mètode per inhibir el fons de la imatge i afegeix invariança a la posició dels objectes dins les imatges. S'ensenya que la forma i l'aparença sobre aquesta regió d'interès i utilitzant els classificadors random forests millora la classificació i el temps computacional. Es comparen els postres resultats amb resultats de la literatura utilitzant les mateixes bases de dades que els autors Aixa com els mateixos protocols d'aprenentatge i classificació. Es veu com totes les innovacions introduïdes incrementen la classificació final de les imatges.
Resumo:
In a global economy, manufacturers mainly compete with cost efficiency of production, as the price of raw materials are similar worldwide. Heavy industry has two big issues to deal with. On the one hand there is lots of data which needs to be analyzed in an effective manner, and on the other hand making big improvements via investments in cooperate structure or new machinery is neither economically nor physically viable. Machine learning offers a promising way for manufacturers to address both these problems as they are in an excellent position to employ learning techniques with their massive resource of historical production data. However, choosing modelling a strategy in this setting is far from trivial and this is the objective of this article. The article investigates characteristics of the most popular classifiers used in industry today. Support Vector Machines, Multilayer Perceptron, Decision Trees, Random Forests, and the meta-algorithms Bagging and Boosting are mainly investigated in this work. Lessons from real-world implementations of these learners are also provided together with future directions when different learners are expected to perform well. The importance of feature selection and relevant selection methods in an industrial setting are further investigated. Performance metrics have also been discussed for the sake of completion.
Resumo:
A resistência a múltiplos fármacos é um grande problema na terapia anti-cancerígena, sendo a glicoproteína-P (P-gp) uma das responsáveis por esta resistência. A realização deste trabalho incidiu principalmente no desenvolvimento de modelos matemáticos/estatísticos e “químicos”. Para os modelos matemáticos/estatísticos utilizamos métodos de Machine Learning como o Support Vector Machine (SVM) e o Random Forest, (RF) em relação aos modelos químicos utilizou-se farmacóforos. Os métodos acima mencionados foram aplicados a diversas proteínas P-gp, p53 e complexo p53-MDM2, utilizando duas famílias: as pifitrinas para a p53 e flavonóides para P-gp e, em menor medida, um grupo diversificado de moléculas de diversas famílias químicas. Nos modelos obtidos pelo SVM quando aplicados à P-gp e à família dos flavonóides, obtivemos bons valores através do kernel Radial Basis Function (RBF), com precisão de conjunto de treino de 94% e especificidade de 96%. Quanto ao conjunto de teste com previsão de 70% e especificidade de 67%, sendo que o número de falsos negativos foi o mais baixo comparativamente aos restantes kernels. Aplicando o RF à família dos flavonóides verificou-se que o conjunto de treino apresenta 86% de precisão e uma especificidade de 90%, quanto ao conjunto de teste obtivemos uma previsão de 70% e uma especificidade de 60%, existindo a particularidade de o número de falsos negativos ser o mais baixo. Repetindo o procedimento anterior (RF) e utilizando um total de 63 descritores, os resultados apresentaram valores inferiores obtendo-se para o conjunto de treino 79% de precisão e 82% de especificidade. Aplicando o modelo ao conjunto de teste obteve-se 70% de previsão e 60% de especificidade. Comparando os dois métodos, escolhemos o método SVM com o kernel RBF como modelo que nos garante os melhores resultados de classificação. Aplicamos o método SVM à P-gp e a um conjunto de moléculas não flavonóides que são transportados pela P-gp, obteve-se bons valores através do kernel RBF, com precisão de conjunto de treino de 95% e especificidade de 93%. Quanto ao conjunto de teste, obtivemos uma previsão de 70% e uma especificidade de 69%, existindo a particularidade de o número de falsos negativos ser o mais baixo. Aplicou-se o método do farmacóforo a três alvos, sendo estes, um conjunto de inibidores flavonóides e de substratos não flavonóides para a P-gp, um grupo de piftrinas para a p53 e um conjunto diversificado de estruturas para a ligação da p53-MDM2. Em cada um dos quatro modelos de farmacóforos obtidos identificou-se três características, sendo que as características referentes ao anel aromático e ao dador de ligações de hidrogénio estão presentes em todos os modelos obtidos. Realizando o rastreio em diversas bases de dados utilizando os modelos, obtivemos hits com uma grande diversidade estrutural.
Resumo:
Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)
Resumo:
There has been limited analysis of the effects of hepatocellular carcinoma (HCC) on liver metabolism and circulating endogenous metabolites. Here, we report the findings of a plasma metabolomic investigation of HCC patients by ultraperformance liquid chromatography-electrospray ionization-quadrupole time-of-flight mass spectrometry (UPLC-ESI-QTOFMS), random forests machine learning algorithm, and multivariate data analysis. Control subjects included healthy individuals as well as patients with liver cirrhosis or acute myeloid leukemia. We found that HCC was associated with increased plasma levels of glycodeoxycholate, deoxycholate 3-sulfate, and bilirubin. Accurate mass measurement also indicated upregulation of biliverdin and the fetal bile acids 7α-hydroxy-3-oxochol-4-en-24-oic acid and 3-oxochol-4,6-dien-24-oic acid in HCC patients. A quantitative lipid profiling of patient plasma was also conducted by ultraperformance liquid chromatography-electrospray ionization-triple quadrupole mass spectrometry (UPLC-ESI-TQMS). By this method, we found that HCC was also associated with reduced levels of lysophosphocholines and in 4 of 20 patients with increased levels of lysophosphatidic acid [LPA(16:0)], where it correlated with plasma α-fetoprotein levels. Interestingly, when fatty acids were quantitatively profiled by gas chromatography-mass spectrometry (GC-MS), we found that lignoceric acid (24:0) and nervonic acid (24:1) were virtually absent from HCC plasma. Overall, this investigation illustrates the power of the new discovery technologies represented in the UPLC-ESI-QTOFMS platform combined with the targeted, quantitative platforms of UPLC-ESI-TQMS and GC-MS for conducting metabolomic investigations that can engender new insights into cancer pathobiology.