930 results for Classification algorithms
Abstract:
Random Forests™ is reported to be one of the most accurate classification algorithms for complex data analysis. It performs well even when most predictors are noisy and the number of variables is much larger than the number of observations. In this thesis, Random Forests was applied to a large-scale lung cancer case-control study. A novel way of automatically selecting prognostic factors was proposed, and a synthetic positive control was used to validate the Random Forests method. Throughout this study we showed that Random Forests can deal with a large number of weak input variables without overfitting and can account for non-additive interactions between these input variables. Random Forests can also be used for variable selection without being adversely affected by collinearities. Random Forests can handle large-scale data sets without rigorous data preprocessing, and it has a robust variable importance ranking measure. We propose a novel variable selection method in the context of Random Forests that uses the data noise level as the cut-off value to determine the subset of important predictors. This new approach enhanced the ability of the Random Forests algorithm to automatically identify important predictors in complex data. The cut-off value can also be adjusted based on the results of the synthetic positive control experiments. When the data set had a high variables-to-observations ratio, Random Forests complemented the established logistic regression. This study suggests that Random Forests can be recommended for such high-dimensional data. One can use Random Forests to select the important variables and then use logistic regression or Random Forests itself to estimate the effect size of the predictors and to classify new observations. We also found that the mean decrease in accuracy is a more reliable variable ranking measure than the mean decrease in Gini.
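The noise-level cut-off described above can be illustrated with a minimal sketch. It assumes one plausible reading of the abstract: the cut-off is the largest importance score obtained by predictors known to be pure noise. The function name, variable names and scores are hypothetical and are not taken from the thesis.

```python
def select_above_noise(importances, noise_names):
    """Keep predictors whose importance exceeds the largest importance
    observed among predictors known to be noise (assumed cut-off rule)."""
    noise_names = set(noise_names)
    cutoff = max(importances[name] for name in noise_names)
    selected = [name for name, score in importances.items()
                if name not in noise_names and score > cutoff]
    # Rank the surviving predictors from most to least important.
    selected.sort(key=lambda name: importances[name], reverse=True)
    return selected, cutoff


# Hypothetical mean-decrease-in-accuracy scores, for illustration only.
scores = {"var_a": 0.041, "var_b": 0.018, "var_c": 0.004,
          "noise_1": 0.003, "noise_2": 0.005}
selected, cutoff = select_above_noise(scores, ["noise_1", "noise_2"])
print(selected, cutoff)
```

In practice the importance scores would come from a fitted forest; only the thresholding step is shown here.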
Abstract:
Cancer cell lines can be treated with a drug, and the molecular comparison of responders and non-responders may yield potential predictors that could be tested in the clinic. It is a bioinformatics challenge to apply the cell line-derived multivariable response predictors to patients who respond to therapy. Using gene expression data from 23 breast cancer cell lines, I developed three predictors of dasatinib sensitivity by selecting differentially expressed genes and applying different classification algorithms. The performance of these predictors was tested on independent cell lines with known dasatinib response. The predictor based on the weighted voting method had the best overall performance: it correctly predicted dasatinib sensitivity in 11 out of 12 (92%) breast and 17 out of 23 (74%) lung cancer cell lines. The predictors were then applied to gene expression data from 133 breast cancer patients in an attempt to predict how the patients might respond to dasatinib therapy. Two predictors identified 13 patients in common as dasatinib sensitive. Sixty-two percent of these cases are triple negative (ER-negative, HER2-negative and PR-negative) and 76% are double negative. This result is consistent with findings from other studies, which identified the triple-negative or basal breast cancer subtype as a target population for dasatinib treatment. In conclusion, we think that the cell line-derived dasatinib classifiers can be applied to human patients.
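A weighted voting predictor of the kind mentioned above can be sketched as follows. The abstract does not spell out the variant used, so this assumes the common signal-to-noise formulation, in which each gene votes in proportion to how far a sample lies from the midpoint between the two class means. All names and the toy data are hypothetical.

```python
from statistics import mean, stdev


def train_weighted_voting(expr, labels):
    """expr: list of samples, each a list of gene expression values.
    labels: parallel list of "S" (sensitive) or "R" (resistant).
    Returns one (weight, boundary) pair per gene."""
    model = []
    for g in range(len(expr[0])):
        s = [row[g] for row, lab in zip(expr, labels) if lab == "S"]
        r = [row[g] for row, lab in zip(expr, labels) if lab == "R"]
        weight = (mean(s) - mean(r)) / (stdev(s) + stdev(r))  # signal-to-noise
        boundary = (mean(s) + mean(r)) / 2                    # class midpoint
        model.append((weight, boundary))
    return model


def predict(model, sample):
    """Sum the signed votes of all genes; a positive total means sensitive."""
    total = sum(w * (x - b) for (w, b), x in zip(model, sample))
    return "S" if total > 0 else "R"


# Toy data: gene 0 is high in sensitive lines, gene 1 is high in resistant lines.
expr = [[5.0, 1.0], [6.0, 2.0], [1.0, 5.0], [2.0, 6.0]]
labels = ["S", "S", "R", "R"]
model = train_weighted_voting(expr, labels)
print(predict(model, [5.5, 1.5]), predict(model, [1.0, 6.0]))
```

The sketch needs at least two samples per class and non-zero spread in each gene; a real pipeline would first select differentially expressed genes.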
Abstract:
ZooScan with ZooProcess and Plankton Identifier (PkID) software is an integrated analysis system for acquisition and classification of digital zooplankton images from preserved zooplankton samples. Zooplankton samples are digitized by the ZooScan and processed by ZooProcess and PkID in order to detect, enumerate, measure and classify the digitized objects. Here we present a semi-automatic approach that entails automated classification of images followed by manual validation, which allows rapid and accurate classification of zooplankton and abiotic objects. We demonstrate this approach with a biweekly zooplankton time series from the Bay of Villefranche-sur-mer, France. The classification approach proposed here provides a practical compromise between a fully automatic method with varying degrees of bias and a manual but accurate classification of zooplankton. We also evaluate the appropriate number of images to include in digital learning sets and compare the accuracy of six classification algorithms. We evaluate the accuracy of the ZooScan for automated measurements of body size and present relationships between machine measures of size and C and N content of selected zooplankton taxa. We demonstrate that the ZooScan system can produce useful measures of zooplankton abundance, biomass and size spectra, for a variety of ecological studies.
Abstract:
The goal of this doctoral thesis is the development and implementation of a system to improve the extraction of the geometric information needed to document heritage assets, using data from both airborne and terrestrial laser sensors. The thesis first presents the background and problems of recording geometric information for heritage, detailing the recording and analysis systems currently in use. This analysis allows them to be compared with laser-based recording systems and yields usage suggestions for each specific case. Laser-based recording systems are then described, starting with airborne sensors and ending with a detailed analysis of terrestrial sensors in both static and mobile operation. For each, the technical characteristics and operation are described, along with the areas of application and the products generated, and the error sources that determine the achievable precision are analyzed. After the characteristics of LiDAR systems are presented, the processing steps required to generate the necessary information for the different types of scanned objects are described, with emphasis on the risks that arise in some delicate phases; the different point filtering and classification algorithms, which are fundamental to LiDAR processing, are also analyzed. Next, an alternative that optimizes existing processing models is proposed, based on new algorithms and software tools that improve performance in LiDAR data management. The implementation takes into account the particular characteristics and needs of heritage documentation, as well as the different areas of use of LiDAR, both airborne and terrestrial. The result is a flowchart of tasks from the LiDAR point cloud to the calculation of digital terrain models and digital surface models. Up to 19 different algorithms were developed to implement this proposal, covering 2.5D and 3D modeling, visualization, editing, filtering and classification of LiDAR data, incorporation of information from passive sensors, and calculation of derived raster and vector maps such as contour maps and orthophotos. Finally, to validate the proposed developments, tests were run in different scenarios that can arise in heritage documentation: projects with airborne sensors, projects with static terrestrial sensors at medium and short range, and a project with a mobile terrestrial sensor. These tests helped define the parameters needed for the proposed algorithms to work properly. In addition, the objective tests published by the ISPRS for evaluating and comparing LiDAR classification algorithms were carried out. They yielded performance and effectiveness data for the classification algorithm presented and allowed it to be compared with other well-regarded algorithms. The results confirmed that the tool works satisfactorily. This thesis is part of the Consolider-Ingenio 2010 project “Programa de investigación en tecnologías para la valoración y conservación del patrimonio cultural” (ref. CSD2007-00058), carried out by the Consejo Superior de Investigaciones Científicas and the Universidad Politécnica de Madrid.
Abstract:
Verification based on voice features is an important task for a wide variety of applications involving biometric verification systems. This project studies the feasibility of speaker verification using voice biometrics, based on supervised classification algorithms. First, the main voice features, Mel-frequency cepstral coefficients (MFCC), are extracted from a database of different speakers with 10 samples per speaker. These features are used to build the classifiers, which are then tested and used for verification. Verification uses Support Vector Machine (SVM) classifiers, which are well suited to two-class problems. The system was tested on a dataset of individuals of different genders. The results show that the system can verify that a speaker is who they claim to be by comparing them against the rest of the speakers in the database.
Abstract:
The n-tuple recognition method was tested on 11 large real-world data sets and its performance compared to 23 other classification algorithms. On 7 of these, the results show no systematic performance gap between the n-tuple method and the others. Evidence was found to support a possible explanation for why the n-tuple method yields poor results for certain datasets. Preliminary empirical results of a study of the confidence interval (the difference between the two highest scores) are also reported. These suggest a counter-intuitive correlation between the confidence interval distribution and the overall classification performance of the system.
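For readers unfamiliar with the technique, a minimal sketch of an n-tuple classifier on binary inputs follows, together with the score margin (the difference between the two highest class scores) that the abstract refers to as the confidence interval. The class name, parameters and toy data are hypothetical; the paper's own configuration is not given in the abstract.

```python
import random


class NTupleClassifier:
    def __init__(self, input_bits, n, num_tuples, seed=0):
        rng = random.Random(seed)
        # Each tuple is a fixed random selection of n bit positions.
        self.tuples = [rng.sample(range(input_bits), n) for _ in range(num_tuples)]
        self.memory = {}  # class label -> one set of seen addresses per tuple

    def _addresses(self, bits):
        return [tuple(bits[i] for i in positions) for positions in self.tuples]

    def fit(self, samples, labels):
        for bits, label in zip(samples, labels):
            mem = self.memory.setdefault(label, [set() for _ in self.tuples])
            for seen, addr in zip(mem, self._addresses(bits)):
                seen.add(addr)

    def scores(self, bits):
        # A class scores one point for every tuple whose address it has seen.
        addrs = self._addresses(bits)
        return {label: sum(a in seen for seen, a in zip(mem, addrs))
                for label, mem in self.memory.items()}

    def predict(self, bits):
        ranked = sorted(self.scores(bits).items(), key=lambda kv: kv[1], reverse=True)
        best, runner_up = ranked[0], ranked[1]
        return best[0], best[1] - runner_up[1]  # label and score margin


clf = NTupleClassifier(input_bits=8, n=3, num_tuples=6)
clf.fit([[1, 1, 1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1, 1, 1]], ["a", "b"])
print(clf.predict([1, 1, 1, 1, 0, 0, 0, 0]))
```

The sketch assumes at least two trained classes; a large margin means the winning class matched many more tuples than its closest rival.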
Abstract:
This article presents two novel approaches for incorporating sentiment prior knowledge into the topic model for weakly supervised sentiment analysis, where sentiment labels are considered as topics. One modifies the Dirichlet prior for the topic-word distribution (LDA-DP); the other augments the model objective function by adding terms that express preferences on expectations of sentiment labels of the lexicon words, using generalized expectation criteria (LDA-GE). We conducted extensive experiments on English movie review data and a multi-domain sentiment dataset, as well as Chinese product reviews about mobile phones, digital cameras, MP3 players, and monitors. The results show that while both LDA-DP and LDA-GE perform comparably to existing weakly supervised sentiment classification algorithms, they are much simpler and more computationally efficient, rendering them more suitable for online and real-time sentiment classification on the Web. We observed that LDA-GE is more effective than LDA-DP, suggesting that it should be preferred when employing the topic model for sentiment analysis. Moreover, both models are able to extract highly domain-salient polarity words from text.
Abstract:
The development of third-generation (3G) telecommunication value-added services places higher requirements on Quality of Service (QoS). Wideband Code Division Multiple Access (WCDMA) is one of three 3G standards, and enhancing QoS for the WCDMA Core Network (CN) is becoming increasingly important for users and carriers. This dissertation focuses on enhancing QoS for the WCDMA CN, with the aim of realizing the Differentiated Services (DiffServ) model of QoS for the WCDMA CN. Based on the parallelism of Network Processors (NPs), NP programming models are classified as Pool of Threads (POTs) and Hyper Task Chaining (HTC). In this study, an integrated programming model that combines the two was designed. This model is efficient and flexible, and it also solves the problems of sharing conflicts and packet ordering. We used it as the programming model to realize DiffServ QoS for the WCDMA CN. The realization of the DiffServ model mainly consists of NP-based buffer management, packet scheduling and packet classification algorithms. First, we proposed an adaptive buffer management algorithm called Packet Adaptive Fair Dropping (PAFD), which takes both fairness and throughput into consideration and has smooth service curves. Second, an improved packet scheduling algorithm called Priority-based Weighted Fair Queuing (PWFQ) was introduced to ensure fair packet scheduling and reduce the queuing time of data packets, while keeping delay and jitter within a small range. Third, a multi-dimensional packet classification algorithm called Classification Based on Network Processors (CBNPs) was designed. It effectively reduces memory accesses and storage space, and has lower time and space complexity. Lastly, an integrated hardware and software system implementing the DiffServ model of QoS for the WCDMA CN was proposed and implemented on the IXP2400 NP. According to the experimental results, the proposed system significantly enhanced QoS for the WCDMA CN. It considerably improves response-time consistency, display distortion and sound-image synchronization, and thus increases network efficiency and saves network resources.
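As background for the scheduling step, the sketch below shows plain weighted fair queuing, not the PWFQ algorithm proposed in the dissertation, whose details are not given in the abstract. It assumes all packets are already queued at time zero, so each packet's finish tag is simply the previous tag of its flow plus its size divided by the flow weight. The function and flow names are hypothetical.

```python
def wfq_order(packets, weights):
    """packets: list of (flow_id, size_bytes), all assumed queued at time 0.
    weights: {flow_id: positive weight}; a larger weight means a larger share.
    Returns the packets in the order a weighted fair queue would send them."""
    last_finish = {}
    tagged = []
    for seq, (flow, size) in enumerate(packets):
        # Finish tag grows more slowly for flows with a larger weight.
        finish = last_finish.get(flow, 0.0) + size / weights[flow]
        last_finish[flow] = finish
        tagged.append((finish, seq, flow, size))  # seq breaks ties by arrival
    return [(flow, size) for _, _, flow, size in sorted(tagged)]


queue = [("data", 100), ("voice", 100), ("data", 100), ("voice", 100)]
print(wfq_order(queue, {"voice": 4, "data": 1}))
```

With a weight of 4 against 1, both voice packets are sent before the first data packet even though a data packet arrived first.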
Abstract:
In real-life information sets, parts of the whole set are often unavailable. This problem can arise for various reasons and therefore follows different patterns. In the literature, it is known as Missing Data. It can be handled in various ways: discarding incomplete observations, guessing what the missing values originally were, or simply ignoring the fact that some values are missing. The methods used to estimate missing data are called Imputation Methods. The work presented in this thesis has two main goals. The first is to determine whether any interaction exists between Missing Data, Imputation Methods and Supervised Classification algorithms when they are applied together. For this first problem we consider a scenario in which the databases used are discrete, where discrete means that no relation between observations is assumed. These datasets underwent processes involving different combinations of the three components mentioned. The outcome showed that the missing data pattern strongly influences the outcome produced by a classifier. Also, in some cases, the complex imputation techniques investigated in the thesis obtained better results than simple ones. The second goal of this work is to propose a new imputation strategy, this time constraining the previous problem to a special kind of dataset, the multivariate Time Series. We designed new imputation techniques for this particular domain and combined them with some of the strategies tested in the previous chapter of this thesis. The time series were also subjected to processes involving missing data and imputation, in order to finally propose an overall better imputation method. In the final chapter of this work, a real-world example is presented, describing a water quality prediction problem. 
The databases that characterized this problem had their own original latent values, which provides a real-world benchmark to test the algorithms developed in this thesis.
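To make the contrast between imputation for unrelated observations and imputation for time series concrete, here is a minimal sketch of two textbook methods: mean imputation and linear interpolation. These are generic baselines chosen for illustration, not the techniques designed in the thesis; the function names are hypothetical and missing values are represented as None.

```python
def impute_mean(series):
    """Replace every missing value with the mean of the observed values."""
    observed = [v for v in series if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in series]


def impute_linear(series):
    """Fill interior gaps by linear interpolation between the nearest
    observed neighbours; leading and trailing gaps copy the nearest value."""
    out = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = series[right]
        elif right is None:
            out[i] = series[left]
        else:
            frac = (i - left) / (right - left)
            out[i] = series[left] + frac * (series[right] - series[left])
    return out


readings = [2.0, None, 4.0, None]
print(impute_mean(readings))    # ignores the ordering of the observations
print(impute_linear(readings))  # uses the temporal neighbours
```

Mean imputation treats observations as unrelated, whereas interpolation exploits the ordering that a time series provides. Both functions assume at least one observed value.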
Abstract:
Histology, the study of tissues, is one of the key areas of Biology and has enabled huge scientific advances. Because it is a demanding, meticulous and time-consuming task, it is worth using computational tools and algorithms to assist it, making the process faster and enabling the discovery of information that may not be visible at first. The main goal of this thesis is to determine whether or not an animal has ingested a xenobiotic. To that end, image processing and segmentation techniques were applied to images of kidney tissue from healthy rats and from rats that ingested the xenobiotic. Numerous features of the renal corpuscle were extracted from these images and analyzed with several classification algorithms, which showed that it is possible to tell, with an acceptable degree of certainty, whether the animal ingested the xenobiotic.
Abstract:
Monitoring agricultural crops is a vital task for the general understanding of the spatio-temporal dynamics of land use. This paper presents an approach for enhancing current crop monitoring capabilities on a regional scale, in order to allow for the analysis of environmental and socio-economic drivers and impacts of agricultural land use. This work discusses the advantages and current limitations of using 250 m VI data from the Moderate Resolution Imaging Spectroradiometer (MODIS) for this purpose, with emphasis on the difficulty of correctly analyzing pixels whose temporal responses are disturbed by sources of interference such as mixed or heterogeneous land cover. It is shown that the influence of noisy or disturbed pixels can be minimized, and a much more consistent and useful result attained, if individual agricultural fields are identified and each field's pixels are analyzed collectively. To this end, a method is proposed that uses image segmentation techniques based on MODIS temporal information to identify portions of the study area that agree with actual agricultural field borders. The pixels of each portion or segment are then analyzed individually in order to estimate the reliability of the temporal signal observed and the consequent relevance of any estimation of land use from that data. The proposed method was applied in the state of Mato Grosso, in mid-western Brazil, where extensive ground truth data was available. Experiments were carried out using several supervised classification algorithms as well as different subsets of land cover classes, in order to test the methodology comprehensively. Results show that the proposed method consistently improves classification results, not only in terms of overall accuracy but also qualitatively, by allowing a better understanding of the land use patterns detected. 
It thus provides a practical and straightforward procedure for enhancing crop-mapping capabilities using temporal series of moderate resolution remote sensing data.
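The idea of analyzing a field's pixels collectively can be sketched as a per-segment majority vote. This is a simplified stand-in, assuming per-pixel class labels and segment ids are already available; the paper's own reliability estimate is based on the temporal signal and is not reproduced here. The function name and class names are hypothetical.

```python
from collections import Counter, defaultdict


def classify_segments(pixel_labels, pixel_segments):
    """pixel_labels[i] is the class assigned to pixel i; pixel_segments[i]
    is the id of the segment (candidate field) that contains pixel i.
    Returns {segment_id: (majority_class, agreement_fraction)}."""
    by_segment = defaultdict(list)
    for label, seg in zip(pixel_labels, pixel_segments):
        by_segment[seg].append(label)
    result = {}
    for seg, labels in by_segment.items():
        winner, count = Counter(labels).most_common(1)[0]
        # The share of pixels agreeing with the winner is a crude reliability cue.
        result[seg] = (winner, count / len(labels))
    return result


labels = ["soy", "soy", "cotton", "soy", "maize", "maize"]
segments = [1, 1, 1, 1, 2, 2]
print(classify_segments(labels, segments))
```

A single disturbed pixel, such as the lone "cotton" label in segment 1, is outvoted by its neighbours in the same field.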
Abstract:
Hematological cancers are a heterogeneous family of diseases that can be divided into leukemias, lymphomas, and myelomas, often called “liquid tumors”. Since they cannot be surgically removed, chemotherapy is the mainstay of their treatment. However, it still faces several challenges, such as drug resistance and low response rates, and the need for new anticancer agents is compelling. The drug discovery process is long, costly, and prone to high failure rates. With the rapid expansion of biological and chemical "big data", computational techniques such as machine learning tools have been increasingly employed to speed up the whole process and reduce its cost. Machine learning algorithms can build complex models that predict the biological activity of compounds against several targets, based on their chemical properties. These models are known as multi-target Quantitative Structure-Activity Relationship (mt-QSAR) models and can be used to virtually screen small and large chemical libraries to identify new molecules with anticancer activity. The aim of my Ph.D. project was to employ machine learning techniques to build an mt-QSAR classification model for the prediction of cytotoxic drugs simultaneously active against 43 hematological cancer cell lines. For this purpose, I first constructed a large and diversified dataset of molecules extracted from the ChEMBL database. Then, I compared the performance of different ML classification algorithms and identified Random Forest as the one returning the best predictions. Finally, I used different approaches to maximize the performance of the model, which achieved an accuracy of 88% by correctly classifying 93% of inactive molecules and 72% of active molecules in a validation set. This model was further applied to the virtual screening of a small dataset of molecules tested in our laboratory, where it showed 100% accuracy in correctly classifying all molecules. 
This result is confirmed by our previous in vitro experiments.
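The three validation figures above are linked by simple arithmetic: overall accuracy is the class-weighted mean of the per-class rates. The sketch below, which assumes the 93% and 72% figures are per-class recall on the same validation set, solves for the share of inactive molecules that the rounded figures imply. The function names are hypothetical and the result is an inference, not a number reported in the abstract.

```python
def overall_accuracy(inactive_rate, active_rate, inactive_fraction):
    """Accuracy as the class-weighted mean of the per-class correct rates."""
    return inactive_rate * inactive_fraction + active_rate * (1 - inactive_fraction)


def implied_inactive_fraction(accuracy, inactive_rate, active_rate):
    """Solve accuracy = inactive_rate * p + active_rate * (1 - p) for p."""
    return (accuracy - active_rate) / (inactive_rate - active_rate)


p = implied_inactive_fraction(0.88, 0.93, 0.72)
print(round(p, 2))  # share of inactive molecules implied by the rounded figures
```

Because the reported percentages are rounded, the implied share is only approximate, but it indicates that inactive molecules make up the majority of the validation set, which is why overall accuracy sits closer to the inactive-class rate.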
Abstract:
Defining an efficient training set is one of the most delicate phases for the success of remote sensing image classification routines. The complexity of the problem, limited temporal and financial resources, and high intraclass variance can make an algorithm fail if it is trained with a suboptimal dataset. Active learning aims at building efficient training sets by iteratively improving the model performance through sampling. A user-defined heuristic ranks the unlabeled pixels according to a function of the uncertainty of their class membership, and the user is then asked to provide labels for the most uncertain pixels. This paper reviews and tests the main families of active learning algorithms: committee, large margin, and posterior probability-based. For each of them, the most recent advances in the remote sensing community are discussed and some heuristics are detailed and tested. Several challenging remote sensing scenarios are considered, including very high spatial resolution and hyperspectral image classification. Finally, guidelines for choosing a suitable architecture are provided for new and/or inexperienced users.
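The ranking step of the posterior probability-based family can be sketched with two common uncertainty heuristics: entropy of the class posteriors and the margin between the two most probable classes. This is a minimal illustration with hypothetical function names and made-up posteriors; it does not reproduce the specific heuristics tested in the paper.

```python
import math


def entropy(probs):
    """Shannon entropy of a class posterior; higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def margin(probs):
    """Gap between the two most probable classes; smaller means more uncertain."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def most_uncertain(posteriors, k, heuristic="entropy"):
    """posteriors: {pixel_id: [class probabilities]}.
    Returns the k pixel ids the user should label next."""
    if heuristic == "entropy":
        key = lambda pid: -entropy(posteriors[pid])  # high entropy first
    else:
        key = lambda pid: margin(posteriors[pid])    # small margin first
    return sorted(posteriors, key=key)[:k]


# Hypothetical posteriors for three unlabeled pixels and three classes.
posteriors = {"p1": [0.90, 0.05, 0.05],
              "p2": [0.40, 0.35, 0.25],
              "p3": [0.50, 0.50, 0.00]}
print(most_uncertain(posteriors, 1), most_uncertain(posteriors, 1, "margin"))
```

The two heuristics can disagree: entropy favours p2, whose probability mass is spread over all three classes, while the margin favours p3, where two classes are exactly tied.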
Abstract:
The main focus of this thesis is to evaluate the Hyperbalilearning algorithm (HBL) and compare it to other learning algorithms. In this work HBL is compared to feed-forward artificial neural networks using backpropagation learning, K-nearest neighbor and ID3 algorithms. In order to evaluate the similarity of these algorithms, we carried out three experiments using nine benchmark data sets from the UCI machine learning repository. The first experiment compares HBL to the other algorithms as the sample size of the dataset changes. The second experiment compares HBL to the other algorithms as the dimensionality of the data changes. The last experiment compares HBL to the other algorithms according to the level of agreement with the data target values. Taking classification accuracy as the measure, our observations showed that HBL generally performs as well as most ANN variants. We also found that HBL's classification accuracy exceeds that of ID3 and K-nearest neighbor on the selected data sets.
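Of the baselines named above, K-nearest neighbor is the simplest to show in full. The sketch below is a generic Euclidean-distance implementation with majority voting, given only to illustrate what HBL was compared against; the function name, the choice of k and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Return the majority label among the k training points closest to query."""
    nearest = sorted(zip(train_x, train_y),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Two well-separated toy clusters.
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train_x, train_y, (0.5, 0.5)), knn_predict(train_x, train_y, (5.5, 5.5)))
```

Because every prediction scans the whole training set, both the sample size and the dimensionality of the data directly affect cost and accuracy, which is what the first two experiments vary.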