965 results for Semi-supervised classification


Relevance: 100.00%

Abstract:

We have recently developed a principled approach to interactive non-linear hierarchical visualization [8] based on the Generative Topographic Mapping (GTM). Hierarchical plots are needed when a single visualization plot is not sufficient (e.g. when dealing with large quantities of data). In this paper we extend our system by giving the user a choice of initializing the child plots of the current plot in either interactive or automatic mode. In the interactive mode the user interactively selects "regions of interest" as in [8], whereas in the automatic mode an unsupervised minimum message length (MML)-driven construction of a mixture of GTMs is used. The latter is particularly useful when the plots are covered with dense clusters of highly overlapping data projections, making it difficult to use the interactive mode. Such a situation often arises when visualizing large data sets. We illustrate our approach on a data set of 2300 18-dimensional points and mention extension of our system to accommodate discrete data types.

Relevance: 100.00%

Abstract:

An interactive hierarchical Generative Topographic Mapping (HGTM) [HGTM] has been developed to visualise complex data sets. In this paper, we build a more general visualisation system by extending the HGTM visualisation system in three directions: (1) We generalize HGTM to noise models from the exponential family of distributions. The basic building block is the Latent Trait Model (LTM) developed in [Kabanpami]. (2) We give the user a choice of initializing the child plots of the current plot in either interactive or automatic mode. In the interactive mode the user interactively selects "regions of interest" as in [HGTM], whereas in the automatic mode an unsupervised minimum message length (MML)-driven construction of a mixture of LTMs is employed. (3) We derive general formulas for magnification factors in latent trait models. Magnification factors are a useful tool to improve our understanding of the visualisation plots, since they can highlight the boundaries between data clusters. The unsupervised construction is particularly useful when high-level plots are covered with dense clusters of highly overlapping data projections, making it difficult to use the interactive mode. Such a situation often arises when visualizing large data sets. We illustrate our approach on a toy example and apply our system to three more complex real data sets.

Relevance: 100.00%

Abstract:

Natural language understanding aims to specify a computational model that maps sentences to their semantic meaning representations. In this paper, we propose a novel framework to train statistical models without using expensive fully annotated data. In particular, the input of our framework is a set of sentences labeled with abstract semantic annotations. These annotations encode the underlying embedded semantic structural relations without explicit word/semantic tag alignment. The proposed framework can automatically induce derivation rules that map sentences to their semantic meaning representations. The learning framework is applied to two statistical models, conditional random fields (CRFs) and hidden Markov support vector machines (HM-SVMs). Our experimental results on the DARPA Communicator data show that both CRFs and HM-SVMs outperform the baseline approach, the previously proposed hidden vector state (HVS) model, which is also trained on abstract semantic annotations. In addition, the proposed framework outperforms two other baseline approaches, a hybrid framework combining HVS and HM-SVMs and discriminative training of HVS, with relative error reductions of about 25% and 15% in F-measure.

Relevance: 100.00%

Abstract:

In machine learning, the Gaussian process latent variable model (GP-LVM) has been extensively applied in the field of unsupervised dimensionality reduction. When some supervised information, e.g., pairwise constraints or labels of the data, is available, the traditional GP-LVM cannot directly utilize such supervised information to improve the performance of dimensionality reduction. In this case, it is necessary to modify the traditional GP-LVM to make it capable of handling supervised or semi-supervised learning tasks. For this purpose, we propose a new semi-supervised GP-LVM framework under pairwise constraints. By transferring the pairwise constraints in the observed space to the latent space, constrained prior information on the latent variables can be obtained. Under this constrained prior, the latent variables are optimized by the maximum a posteriori (MAP) algorithm. The effectiveness of the proposed algorithm is demonstrated with experiments on a variety of data sets. © 2010 Elsevier B.V.

Relevance: 100.00%

Abstract:

Permafrost landscapes experience different disturbances and store large amounts of organic matter, which may become a source of greenhouse gases upon permafrost degradation. We analysed the influence of terrain and geomorphic disturbances (e.g. soil creep, active-layer detachment, gullying, thaw slumping, accumulation of fluvial deposits) on soil organic carbon (SOC) and total nitrogen (TN) storage using 11 permafrost cores from Herschel Island, western Canadian Arctic. Our results indicate a strong correlation between SOC storage and the topographic wetness index. Undisturbed sites stored the majority of SOC and TN in the upper 70 cm of soil. Sites characterised by mass wasting showed significant SOC depletion and soil compaction, whereas sites characterised by the accumulation of peat and fluvial deposits store SOC and TN along the whole core. We upscaled SOC and TN to estimate total stocks using the ecological units determined from vegetation composition, slope angle and the geomorphic disturbance regime. The ecological units were delineated with a supervised classification based on RapidEye multispectral satellite imagery and slope angle. Mean SOC and TN storage for the uppermost 1 m of soil on Herschel Island are 34.8 kg C/m² and 3.4 kg N/m², respectively.
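As a rough illustration of the quantities mentioned in this abstract, the sketch below computes the topographic wetness index in its commonly used form, TWI = ln(a / tan β), a Pearson correlation, and an area-weighted stock total. The abstract does not state which TWI variant or upscaling formula the authors used, so the formula choice, the function names and the slope clamp are all assumptions made for illustration.

```python
import math

def topographic_wetness_index(specific_catchment_area_m, slope_deg, min_slope_deg=0.1):
    """TWI = ln(a / tan(beta)), with a = upslope area per unit contour width.

    The slope is clamped to min_slope_deg (an assumed value) so that flat
    cells do not cause a division by zero.
    """
    beta = math.radians(max(slope_deg, min_slope_deg))
    return math.log(specific_catchment_area_m / math.tan(beta))

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def upscale_stock(units):
    """Sum area (m^2) times mean storage (kg/m^2) over ecological units."""
    return sum(area_m2 * storage_kg_m2 for area_m2, storage_kg_m2 in units)
```

Correlating per-core SOC storage against the TWI at each core location would then be a call to `pearson_r` on the two lists.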

Relevance: 100.00%

Abstract:

Resources created at the University of Southampton for the module Remote Sensing for Earth Observation

Relevance: 100.00%

Abstract:

The assessment of different alternatives in road-corridor planning and layout design must be based on a number of well-defined territorial variables that serve as decision-making criteria, and this requires a high-quality preliminary environmental analysis of those variables. In Spain, feasibility studies for new roads and motorways are associated with a phase of the decision procedure known as the Informative Study, which establishes the physical, environmental, land-use and cultural constraints to be considered in the early stages of defining road corridor layouts. The most common methodology is to establish different levels of Territorial Carrying Capacity (TCC) in the study area in order to summarize the territorial variables on thematic maps and facilitate the tracing process of road-corridor layout alternatives. Landscape is a constraint factor that must be considered in road planning and design, and the most sustainable layouts should be sought based on aesthetic and ecological criteria.
However, this factor is not often analyzed in Informative Studies and, even if it is, baseline studies on landscape quality (aesthetic and ecological) and landforms do not usually include the recommendations of road tracing guides designed to avoid or reduce impacts on the landscape. The resolution of the landscape maps produced in this type of study does not comply with the recommended road design scale (1:5,000) in the regulations for the Informative Study procedure. Another common shortcoming in road planning is that landscape ecological connectivity is not considered during road design in order to avoid affecting wildlife corridors in the landscape. In the prior road planning stage, this issue could lead to a major barrier effect for fauna dispersal movements and to the fragmentation of their habitat due to the partial or total occupation of habitat patches of biological importance for the fauna (or focal habitats), and the interruption of wildlife corridors that concentrate fauna dispersal movements between patches. The main goal of this dissertation is to improve the study of the landscape and prevent negative effects during the road tracing process, and to facilitate the preservation of wildlife corridors (or greenways) and the location of preventive and corrective measures by selecting and quantifying suitability factors to reduce visual and ecological landscape impacts at a local scale. Specifically, the incorporation of quantitative and well-supported values in the decision-making process provides increased transparency in the road corridor and layout design process. Four specific questions were raised in this research: (1) How are territorial constraints selected and evaluated in terms of landscape by Spanish land-planning practitioners before locating a new road? (2) How can wildlife corridors be defined based on the landscape factors influencing the dispersal movements of fauna? 
(3) How can wildlife corridors be delimited and assessed to include the partially erratic movements of fauna and the barrier effect of the anthropic elements at a local scale? (4) How can road design recommendations related to landscape and landforms be included in a Geographic Information System (GIS) model to aid civil engineers during the road layout design process and support sustainable development? This doctoral thesis proposes new methodologies that improve the assessment of the visual and ecological landscape character using indicators and GIS models to obtain road layout alternatives with a lower impact on the landscape. These methodologies were tested on a case study of a heterogeneous landscape with a high density of roe deer (Capreolus capreolus L.), one of the large mammals most commonly hit by vehicles on the Spanish road network, where a new motorway is planned to pass through the middle of their distribution area. We explored the variables used in 22 road-corridor planning projects sponsored by the Ministry of Public Works between 2006 and 2008. These variables were grouped into physical, environmental, land-use and cultural constraints for the purpose of comparing the TCC values assigned to each variable in the various studies reviewed. As a prior stage in a connectivity analysis, a map of resistance to roe deer dispersal movements was created based on the literature and expert judgment. Using this research as a base, each factor selected to build the matrix was assigned a resistance value and weighted and combined with the rest of the factors using the analytic hierarchy process (AHP) and fuzzy logic operators as multicriteria assessment (MCA) methods. A GIS methodology was designed to clearly delimit the physical area of wildlife corridors according to a geometric threshold width value, and the multiple potential connections between each pair of habitat patches in the landscape. 
Processing of a Digital Surface Model Light Detection and Ranging (LiDAR) dataset and a GIS model were used to determine landscape quality (aesthetic and ecological), landforms with similar characteristics for the road layout, and the cumulative viewshed of potential drivers and observers in the area surrounding the new motorway. This research makes four main contributions to current scientific knowledge in the field of environmental impact assessment for the design of road corridors and layouts. First, the analysis of 22 Informative Studies on road planning revealed that the methods applied by practitioners for assessing the TCC were not sufficiently standardized due to the lack of uniformity in the cartographic information sources and the TCC valuation methodologies, especially in the analysis of the aesthetic and ecological quality of the landscape. Second, the analysis in this dissertation highlights the importance of multicriteria methods to structure, combine and validate factors that constrain wildlife dispersal movements in the connectivity analysis. Third, the GIS models developed, “Generator of Alternative Corridors (GAC)” and “Narrow Corridor Eraser (NCE)”, can be applied systematically and on a scientific basis in connectivity analyses to improve existing tools and understand landscape as a network composed of interconnected nodes and links. Thus, alternative corridors with similar probability of use by fauna and without bottlenecks can be obtained by iteratively running the GAC and NCE models. Fourth, our case study on the design of new motorway corridors and layouts innovatively included semi-supervised classification of landforms, filtering of LiDAR point clouds and new 3D road geometry on the Digital Surface Model (DSM). 
The combined use of LiDAR data processing and geomorphological indices and classifications can help decision-makers assess which road layouts produce lower impacts on the landscape, provide an overall insight into the most commonly applied value judgments and, in conclusion, define which corrective measures should be applied in terms of landscaping, and where.
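The weighting step described in this abstract (AHP plus fuzzy logic operators) can be sketched as follows. This is a minimal illustration, not the thesis's GIS model: the function names are hypothetical, AHP weights are approximated by power iteration on a pairwise-comparison matrix, and the fuzzy gamma operator is shown as one common fuzzy overlay operator, since the abstract does not say which operators were used.

```python
def ahp_weights(pairwise, iterations=100):
    """Approximate the principal eigenvector of a reciprocal
    pairwise-comparison matrix by power iteration; weights sum to 1."""
    n = len(pairwise)
    weights = [1.0 / n] * n
    for _ in range(iterations):
        product = [sum(pairwise[i][j] * weights[j] for j in range(n)) for i in range(n)]
        total = sum(product)
        weights = [value / total for value in product]
    return weights

def weighted_resistance(factor_values, weights):
    """Weighted linear combination of per-factor resistance values for one cell."""
    return sum(value * weight for value, weight in zip(factor_values, weights))

def fuzzy_gamma(memberships, gamma=0.5):
    """Fuzzy gamma operator: (algebraic sum)^gamma * (algebraic product)^(1 - gamma).

    gamma = 1 gives the fuzzy algebraic sum, gamma = 0 the fuzzy product.
    """
    product = 1.0
    complement = 1.0
    for m in memberships:
        product *= m
        complement *= 1.0 - m
    return ((1.0 - complement) ** gamma) * (product ** (1.0 - gamma))
```

For a perfectly consistent matrix such as `[[1, 2, 4], [0.5, 1, 2], [0.25, 0.5, 1]]`, `ahp_weights` returns weights proportional to 4 : 2 : 1. Real expert judgments are rarely consistent, which is why AHP is normally paired with a consistency check that this sketch omits.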

Relevance: 100.00%

Abstract:

Transductive SVM (TSVM) is a well known semi-supervised large margin learning method for binary text classification. In this paper we extend this method to multi-class and hierarchical classification problems. We point out that the determination of labels of unlabeled examples with fixed classifier weights is a linear programming problem. We devise an efficient technique for solving it. The method is applicable to general loss functions. We demonstrate the value of the new method using large margin loss on a number of multi-class and hierarchical classification datasets. For maxent loss we show empirically that our method is better than expectation regularization/constraint and posterior regularization methods, and competitive with the version of entropy regularization method which uses label constraints.
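To see why fixing the classifier weights makes the labelling step a linear program, one possible formulation is sketched below. The notation is assumed rather than taken from the paper: $U$ unlabeled examples, $K$ classes, $c_{ik}$ the loss of assigning class $k$ to example $x_i$ under the fixed weights $w$, and $n_k$ a hypothetical class-size constraint that keeps the assignment from collapsing into one class.

```latex
\begin{aligned}
\min_{y}\quad & \sum_{i=1}^{U}\sum_{k=1}^{K} c_{ik}\, y_{ik},
  \qquad c_{ik} = \ell\bigl(k;\, w, x_i\bigr) \\
\text{s.t.}\quad & \sum_{k=1}^{K} y_{ik} = 1 \quad (i = 1,\dots,U), \\
& \sum_{i=1}^{U} y_{ik} = n_k \quad (k = 1,\dots,K), \\
& 0 \le y_{ik} \le 1 .
\end{aligned}
```

Because $w$ is fixed, every $c_{ik}$ is a constant and the objective is linear in $y$ for any loss $\ell$. With integer $n_k$ summing to $U$, the constraints have transportation-problem structure, so an optimal vertex solution is integral and the relaxed $y_{ik}$ can be read directly as hard labels.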

Relevance: 100.00%

Abstract:

McCullagh and Yang (2006) suggest a family of classification algorithms based on Cox processes. We further investigate the log Gaussian variant which has a number of appealing properties. Conditioned on the covariates, the distribution over labels is given by a type of conditional Markov random field. In the supervised case, computation of the predictive probability of a single test point scales linearly with the number of training points and the multiclass generalization is straightforward. We show new links between the supervised method and classical nonparametric methods. We give a detailed analysis of the pairwise graph representable Markov random field, which we use to extend the model to semi-supervised learning problems, and propose an inference method based on graph min-cuts. We give the first experimental analysis on supervised and semi-supervised datasets and show good empirical performance.
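The general idea of min-cut inference on a similarity graph can be sketched as below. This is a generic binary labelling by s-t minimum cut, not the authors' Cox-process model: the function name, the edge-list input and the use of infinite-capacity links to pin labeled points are all assumptions for illustration. It also assumes finite edge weights and disjoint positive and negative sets.

```python
from collections import deque

def min_cut_labels(n, edges, positives, negatives):
    """Label n nodes (0..n-1) with 1 or 0 via an s-t minimum cut.

    edges: (u, v, weight) undirected similarity links with finite weights.
    positives / negatives: disjoint sets of labeled node indices.
    Nodes left on the source side of the cut get label 1, all others 0
    (including nodes not connected to any positive example).
    """
    source, sink = n, n + 1
    infinity = float("inf")
    cap = [dict() for _ in range(n + 2)]

    def add(u, v, c):
        cap[u][v] = cap[u].get(v, 0.0) + c
        cap[v].setdefault(u, 0.0)  # residual arc

    for u, v, w in edges:
        add(u, v, w)
        add(v, u, w)
    for p in positives:
        add(source, p, infinity)
    for q in negatives:
        add(q, sink, infinity)

    def reachable_from_source():
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    if v == sink:
                        return parent
                    queue.append(v)
        return parent

    # Edmonds-Karp: augment along shortest paths until the sink is cut off.
    parent = reachable_from_source()
    while sink in parent:
        bottleneck = infinity
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        parent = reachable_from_source()

    return [1 if i in parent else 0 for i in range(n)]
```

On a chain 0-1-2-3 with a weak middle link, labeling node 0 positive and node 3 negative cuts the weak link and splits the chain into two halves.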

Relevance: 100.00%

Abstract:

Co-training is a semi-supervised learning method that is designed to take advantage of the redundancy that is present when the object to be identified has multiple descriptions. Co-training is known to work well when the multiple descriptions are conditionally independent given the class of the object. The presence of multiple descriptions of objects in the form of text, images, audio and video in multimedia applications appears to provide redundancy in a form that may be suitable for co-training. In this paper, we investigate the suitability of utilizing text and image data from the Web for co-training. We perform measurements to find indications of conditional independence in the texts and images obtained from the Web. Our measurements suggest that conditional independence is likely to be present in the data. Our experiments, conducted within a relevance feedback framework to test whether a method that exploits the conditional independence outperforms methods that do not, also indicate that better performance can indeed be obtained by designing algorithms that exploit this form of redundancy when it is present.
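The co-training loop itself can be sketched in a few lines. The version below is a toy under stated assumptions: each "view" is a single numeric feature per object, the per-view classifier is a nearest-centroid rule, and all function names are hypothetical. The paper's text and image features and its relevance feedback setting are not modelled here.

```python
def centroid_fit(xs, ys):
    """Per-class mean of a 1-D feature (a deliberately tiny classifier)."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def centroid_predict(model, x):
    """Return (label, confidence); confidence is the distance margin
    between the nearest and second-nearest class centroids."""
    ranked = sorted(model, key=lambda y: abs(x - model[y]))
    best = ranked[0]
    if len(ranked) > 1:
        margin = abs(x - model[ranked[1]]) - abs(x - model[best])
    else:
        margin = 0.0
    return best, margin

def co_train(view_a, view_b, labels, rounds=5, per_round=1):
    """labels[i] is a class or None. In each round, a classifier trained on
    one view labels its most confident unlabeled objects, and those labels
    become training data for the classifier on the other view."""
    labels = list(labels)
    for _ in range(rounds):
        for view in (view_a, view_b):
            known = [i for i, y in enumerate(labels) if y is not None]
            unknown = [i for i, y in enumerate(labels) if y is None]
            if not unknown:
                return labels
            model = centroid_fit([view[i] for i in known], [labels[i] for i in known])
            scored = sorted(
                ((centroid_predict(model, view[i]), i) for i in unknown),
                key=lambda item: item[0][1],
                reverse=True,
            )
            for (label, _), i in scored[:per_round]:
                labels[i] = label
    return labels
```

The benefit depends on the two views being informative in different ways: if they are close to conditionally independent given the class, an object that one view labels confidently is close to a random draw from the other view's perspective, which is what makes the borrowed labels useful.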

Relevance: 100.00%

Abstract:

Given a set of images of scenes containing different object categories (e.g. grass, roads), our objective is to discover these objects in each image and to use these object occurrences to perform scene classification (e.g. beach scene, mountain scene). We achieve this by using a supervised learning algorithm able to learn from few images, to ease the user's task. We use a probabilistic model to recognise the objects, and we then classify the scene based on its object occurrences. Experimental results are shown and evaluated to demonstrate the validity of our proposal. Object recognition performance is compared to the approaches of He et al. (2004) and Marti et al. (2001) using their own datasets. Furthermore, an unsupervised method is implemented in order to evaluate the advantages and disadvantages of our supervised classification approach versus an unsupervised one.

Relevance: 100.00%

Abstract:

As a fundamental tool for network management and security, traffic classification has attracted increasing attention in recent years. A significant challenge to the robustness of classification performance comes from zero-day applications previously unknown in traffic classification systems. In this paper, we propose a new scheme of Robust statistical Traffic Classification (RTC) by combining supervised and unsupervised machine learning techniques to meet this challenge. The proposed RTC scheme has the capability of identifying the traffic of zero-day applications as well as accurately discriminating predefined application classes. In addition, we develop a new method for automating the RTC scheme parameters optimization process. The empirical study on real-world traffic data confirms the effectiveness of the proposed scheme. When zero-day applications are present, the classification performance of the new scheme is significantly better than four state-of-the-art methods: random forest, correlation-based classification, semi-supervised clustering, and one-class SVM.

Relevance: 100.00%

Abstract:

Hoy en día, con la evolución continua y rápida de las tecnologías de la información y los dispositivos de computación, se recogen y almacenan continuamente grandes volúmenes de datos en distintos dominios y a través de diversas aplicaciones del mundo real. La extracción de conocimiento útil de una cantidad tan enorme de datos no se puede realizar habitualmente de forma manual, y requiere el uso de técnicas adecuadas de aprendizaje automático y de minería de datos. La clasificación es una de las técnicas más importantes que ha sido aplicada con éxito a varias áreas. En general, la clasificación se compone de dos pasos principales: en primer lugar, aprender un modelo de clasificación o clasificador a partir de un conjunto de datos de entrenamiento, y en segundo lugar, clasificar las nuevas instancias de datos utilizando el clasificador aprendido. La clasificación es supervisada cuando todas las etiquetas están presentes en los datos de entrenamiento (es decir, datos completamente etiquetados), semi-supervisada cuando sólo algunas etiquetas son conocidas (es decir, datos parcialmente etiquetados), y no supervisada cuando todas las etiquetas están ausentes en los datos de entrenamiento (es decir, datos no etiquetados). Además, aparte de esta taxonomía, el problema de clasificación se puede categorizar en unidimensional o multidimensional en función del número de variables clase, una o más, respectivamente; o también puede ser categorizado en estacionario o cambiante con el tiempo en función de las características de los datos y de la tasa de cambio subyacente. A lo largo de esta tesis, tratamos el problema de clasificación desde tres perspectivas diferentes, a saber, clasificación supervisada multidimensional estacionaria, clasificación semisupervisada unidimensional cambiante con el tiempo, y clasificación supervisada multidimensional cambiante con el tiempo. Para llevar a cabo esta tarea, hemos usado básicamente los clasificadores Bayesianos como modelos. 
La primera contribución, dirigiéndose al problema de clasificación supervisada multidimensional estacionaria, se compone de dos nuevos métodos de aprendizaje de clasificadores Bayesianos multidimensionales a partir de datos estacionarios. Los métodos se proponen desde dos puntos de vista diferentes. El primer método, denominado CB-MBC, se basa en una estrategia de envoltura de selección de variables que es voraz y hacia delante, mientras que el segundo, denominado MB-MBC, es una estrategia de filtrado de variables con una aproximación basada en restricciones y en el manto de Markov. Ambos métodos han sido aplicados a dos problemas reales importantes, a saber, la predicción de los inhibidores de la transcriptasa inversa y de la proteasa para el problema de infección por el virus de la inmunodeficiencia humana tipo 1 (HIV-1), y la predicción del European Quality of Life-5 Dimensions (EQ-5D) a partir de los cuestionarios de la enfermedad de Parkinson con 39 ítems (PDQ-39). El estudio experimental incluye comparaciones de CB-MBC y MB-MBC con los métodos del estado del arte de la clasificación multidimensional, así como con métodos comúnmente utilizados para resolver el problema de predicción de la enfermedad de Parkinson, a saber, la regresión logística multinomial, mínimos cuadrados ordinarios, y mínimas desviaciones absolutas censuradas. En ambas aplicaciones, los resultados han sido prometedores con respecto a la precisión de la clasificación, así como en relación al análisis de las estructuras gráficas que identifican interacciones conocidas y novedosas entre las variables. La segunda contribución, referida al problema de clasificación semi-supervisada unidimensional cambiante con el tiempo, consiste en un método nuevo (CPL-DS) para clasificar flujos de datos parcialmente etiquetados. Los flujos de datos difieren de los conjuntos de datos estacionarios en su proceso de generación muy rápido y en su aspecto de cambio de concepto. 
Es decir, los conceptos aprendidos y/o la distribución subyacente están probablemente cambiando y evolucionando en el tiempo, lo que hace que el modelo de clasificación actual sea obsoleto y deba ser actualizado. CPL-DS utiliza la divergencia de Kullback-Leibler y el método de bootstrapping para cuantificar y detectar tres tipos posibles de cambio: en las predictoras, en la a posteriori de la clase o en ambas. Después, si se detecta cualquier cambio, un nuevo modelo de clasificación se aprende usando el algoritmo EM; si no, el modelo de clasificación actual se mantiene sin modificaciones. CPL-DS es general, ya que puede ser aplicado a varios modelos de clasificación. Usando dos modelos diferentes, el clasificador naive Bayes y la regresión logística, CPL-DS se ha probado con flujos de datos sintéticos y también se ha aplicado al problema real de la detección de código malware, en el cual los nuevos ficheros recibidos deben ser continuamente clasificados en malware o goodware. Los resultados experimentales muestran que nuestro método es efectivo para la detección de diferentes tipos de cambio a partir de los flujos de datos parcialmente etiquetados y también tiene una buena precisión de la clasificación. Finalmente, la tercera contribución, sobre el problema de clasificación supervisada multidimensional cambiante con el tiempo, consiste en dos métodos adaptativos, a saber, Locally Adpative-MB-MBC (LA-MB-MBC) y Globally Adpative-MB-MBC (GA-MB-MBC). Ambos métodos monitorizan el cambio de concepto a lo largo del tiempo utilizando la log-verosimilitud media como métrica y el test de Page-Hinkley. Luego, si se detecta un cambio de concepto, LA-MB-MBC adapta el actual clasificador Bayesiano multidimensional localmente alrededor de cada nodo cambiado, mientras que GA-MB-MBC aprende un nuevo clasificador Bayesiano multidimensional. El estudio experimental realizado usando flujos de datos sintéticos multidimensionales indica los méritos de los métodos adaptativos propuestos. 
ABSTRACT Nowadays, with the ongoing and rapid evolution of information technology and computing devices, large volumes of data are continuously collected and stored in different domains and through various real-world applications. Extracting useful knowledge from such a huge amount of data usually cannot be performed manually, and requires the use of adequate machine learning and data mining techniques. Classification is one of the most important of these techniques and has been successfully applied to several areas. Roughly speaking, classification consists of two main steps: first, learn a classification model or classifier from available training data, and second, classify new, unseen data instances using the learned classifier. Classification is supervised when all class values are present in the training data (i.e., fully labeled data), semi-supervised when only some class values are known (i.e., partially labeled data), and unsupervised when all class values are missing from the training data (i.e., unlabeled data). Besides this taxonomy, the classification problem can be categorized as uni-dimensional or multi-dimensional depending on whether there is one class variable or more than one, and as stationary or streaming depending on the characteristics of the data and the rate of change underlying it. In this thesis, we deal with the classification problem under three different settings, namely, supervised multi-dimensional stationary classification, semi-supervised uni-dimensional streaming classification, and supervised multi-dimensional streaming classification. To accomplish this task, we mainly used Bayesian network classifiers as models. The first contribution, addressing the supervised multi-dimensional stationary classification problem, consists of two new methods for learning multi-dimensional Bayesian network classifiers from stationary data.
They are proposed from two different points of view. The first method, named CB-MBC, is based on a wrapper greedy forward selection approach, while the second one, named MB-MBC, is a filter constraint-based approach based on Markov blankets. Both methods are applied to two important real-world problems, namely, the prediction of the human immunodeficiency virus type 1 (HIV-1) reverse transcriptase and protease inhibitors, and the prediction of the European Quality of Life-5 Dimensions (EQ-5D) from the 39-item Parkinson’s Disease Questionnaire (PDQ-39). The experimental study includes comparisons of CB-MBC and MB-MBC against state-of-the-art multi-dimensional classification methods, as well as against commonly used methods for solving the Parkinson’s disease prediction problem, namely, multinomial logistic regression, ordinary least squares, and censored least absolute deviations. For both case studies, results are promising in terms of classification accuracy as well as in the analysis of the learned MBC graphical structures, which identify known and novel interactions among variables. The second contribution, addressing the semi-supervised uni-dimensional streaming classification problem, consists of a novel method (CPL-DS) for classifying partially labeled data streams. Data streams differ from stationary data sets in their rapid generation process and their concept-drifting nature. That is, the learned concepts and/or the underlying distribution are likely to change and evolve over time, which makes the current classification model out of date and in need of updating. CPL-DS uses the Kullback-Leibler divergence and the bootstrapping method to quantify and detect three possible kinds of drift: feature, conditional, or dual. Then, if any drift occurs, a new classification model is learned using the expectation-maximization algorithm; otherwise, the current classification model is kept unchanged.
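As a rough illustration of the drift check described above for CPL-DS, the following sketch compares one categorical feature in a reference batch and a current batch using the Kullback-Leibler divergence, and uses bootstrap resampling from the pooled data to judge whether the observed divergence is larger than chance would produce. This is not the thesis's procedure, which is not specified in this abstract: the function names (`kl_divergence`, `detect_drift`), the Laplace smoothing, the single-feature setting, and the 0.05 significance level are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def kl_divergence(p_counts, q_counts, categories, alpha=1.0):
    """KL(P || Q) between two categorical samples given as count dicts.

    Laplace smoothing (alpha) is an assumption added here so that
    categories unseen in one batch do not produce a division by zero.
    """
    p_total = sum(p_counts.get(c, 0) for c in categories) + alpha * len(categories)
    q_total = sum(q_counts.get(c, 0) for c in categories) + alpha * len(categories)
    kl = 0.0
    for c in categories:
        p = (p_counts.get(c, 0) + alpha) / p_total
        q = (q_counts.get(c, 0) + alpha) / q_total
        kl += p * math.log(p / q)
    return kl


def detect_drift(reference, current, n_boot=500, significance=0.05, seed=0):
    """Flag drift when the observed KL divergence is unusually large.

    The null distribution is estimated by resampling (with replacement)
    from the pooled batches, i.e. assuming both come from one distribution.
    Returns (drift_detected, observed_kl, p_value).
    """
    rng = random.Random(seed)
    categories = sorted(set(reference) | set(current))
    observed = kl_divergence(Counter(current), Counter(reference), categories)

    pooled = list(reference) + list(current)
    n_ref = len(reference)
    exceed = 0
    for _ in range(n_boot):
        resample = [rng.choice(pooled) for _ in pooled]
        boot_ref, boot_cur = resample[:n_ref], resample[n_ref:]
        boot_kl = kl_divergence(Counter(boot_cur), Counter(boot_ref), categories)
        if boot_kl >= observed:
            exceed += 1

    p_value = (exceed + 1) / (n_boot + 1)
    return p_value < significance, observed, p_value


if __name__ == "__main__":
    stable = ["a"] * 80 + ["b"] * 20
    shifted = ["a"] * 20 + ["b"] * 80
    print(detect_drift(stable, stable)[0])   # identical batches
    print(detect_drift(stable, shifted)[0])  # strongly shifted batch
```

In a streaming setting, a check of this kind would be run on each new batch, and only a positive result would trigger relearning the classifier.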
CPL-DS is general, as it can be applied to several classification models. Using two different models, namely, the naive Bayes classifier and logistic regression, CPL-DS is tested with synthetic data streams and applied to the real-world problem of malware detection, where newly received files must be continuously classified as malware or goodware. Experimental results show that our approach is effective at detecting different kinds of drift from partially labeled data streams and also achieves good classification performance. Finally, the third contribution, addressing the supervised multi-dimensional streaming classification problem, consists of two adaptive methods, namely, Locally Adaptive-MB-MBC (LA-MB-MBC) and Globally Adaptive-MB-MBC (GA-MB-MBC). Both methods monitor concept drift over time using the average log-likelihood score and the Page-Hinkley test. Then, if a drift is detected, LA-MB-MBC adapts the current multi-dimensional Bayesian network classifier locally around each changed node, whereas GA-MB-MBC learns a new multi-dimensional Bayesian network classifier from scratch. An experimental study carried out using synthetic multi-dimensional data streams shows the merits of both proposed adaptive methods.
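The monitoring step mentioned above can be sketched with a textbook form of the Page-Hinkley test applied to a stream of average log-likelihood scores. The variant below watches for a sustained decrease in the monitored value, on the assumption that a model fits the data worse after a drift. The class name `PageHinkley` and the parameters `delta` (tolerated magnitude of change) and `threshold` (alarm level) are illustrative choices, not values taken from the thesis.

```python
class PageHinkley:
    """Page-Hinkley test for a sustained decrease in a monitored score.

    delta and threshold are tuning parameters; the defaults here are
    placeholders chosen for the example, not recommended settings.
    """

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta
        self.threshold = threshold
        self.n = 0             # number of observations seen so far
        self.mean = 0.0        # running mean of the observations
        self.cumulative = 0.0  # cumulative deviation from the running mean
        self.maximum = 0.0     # largest cumulative deviation seen so far

    def update(self, x):
        """Feed one observation; return True if a drift alarm is raised."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cumulative += x - self.mean + self.delta
        self.maximum = max(self.maximum, self.cumulative)
        return self.maximum - self.cumulative > self.threshold


if __name__ == "__main__":
    # Synthetic scores: the average log-likelihood drops after step 100.
    scores = [-1.0] * 100 + [-3.0] * 100
    detector = PageHinkley()
    alarms = [i for i, s in enumerate(scores) if detector.update(s)]
    print("first alarm at step:", alarms[0] if alarms else None)
```

On an alarm, a locally adaptive method would revise only the affected part of the model, whereas a globally adaptive one would relearn it from scratch; in either case the detector state would normally be reset afterwards.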

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Sentiment analysis or opinion mining aims to use automated tools to detect subjective information such as opinions, attitudes, and feelings expressed in text. This paper proposes a novel probabilistic modeling framework called the joint sentiment-topic (JST) model, based on latent Dirichlet allocation (LDA), which detects sentiment and topic simultaneously from text. A reparameterized version of the JST model called Reverse-JST, obtained by reversing the sequence of sentiment and topic generation in the modeling process, is also studied. Although JST is equivalent to Reverse-JST without a hierarchical prior, extensive experiments show that when sentiment priors are added, JST performs consistently better than Reverse-JST. In addition, unlike supervised approaches to sentiment classification, which often fail to produce satisfactory performance when shifted to other domains, the weakly supervised nature of JST makes it highly portable to other domains. This is verified by the experimental results on data sets from five different domains, where the JST model even outperforms existing semi-supervised approaches on some of the data sets despite using no labeled documents. Moreover, the topics and topic sentiment detected by JST are coherent and informative. We hypothesize that the JST model can readily meet the demand for large-scale sentiment analysis from the web in an open-ended fashion.