978 resultados para Supervised classification


Relevância:

70.00% 70.00%

Publicador:

Resumo:

The Gaussian process latent variable model (GP-LVM) has been identified to be an effective probabilistic approach for dimensionality reduction because it can obtain a low-dimensional manifold of a data set in an unsupervised fashion. Consequently, the GP-LVM is insufficient for supervised learning tasks (e. g., classification and regression) because it ignores the class label information for dimensionality reduction. In this paper, a supervised GP-LVM is developed for supervised learning tasks, and the maximum a posteriori algorithm is introduced to estimate positions of all samples in the latent variable space. We present experimental evidences suggesting that the supervised GP-LVM is able to use the class label information effectively, and thus, it outperforms the GP-LVM and the discriminative extension of the GP-LVM consistently. The comparison with some supervised classification methods, such as Gaussian process classification and support vector machines, is also given to illustrate the advantage of the proposed method.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Les milieux humides remplissent plusieurs fonctions écologiques d’importance et contribuent à la biodiversité de la faune et de la flore. Même s’il existe une reconnaissance croissante sur l’importante de protéger ces milieux, il n’en demeure pas moins que leur intégrité est encore menacée par la pression des activités humaines. L’inventaire et le suivi systématique des milieux humides constituent une nécessité et la télédétection est le seul moyen réaliste d’atteindre ce but. L’objectif de cette thèse consiste à contribuer et à améliorer la caractérisation des milieux humides en utilisant des données satellites acquises par des radars polarimétriques en bande L (ALOS-PALSAR) et C (RADARSAT-2). Cette thèse se fonde sur deux hypothèses (chap. 1). La première hypothèse stipule que les classes de physionomies végétales, basées sur la structure des végétaux, sont plus appropriées que les classes d’espèces végétales car mieux adaptées au contenu informationnel des images radar polarimétriques. La seconde hypothèse stipule que les algorithmes de décompositions polarimétriques permettent une extraction optimale de l’information polarimétrique comparativement à une approche multipolarisée basée sur les canaux de polarisation HH, HV et VV (chap. 3). En particulier, l’apport de la décomposition incohérente de Touzi pour l’inventaire et le suivi de milieux humides est examiné en détail. Cette décomposition permet de caractériser le type de diffusion, la phase, l’orientation, la symétrie, le degré de polarisation et la puissance rétrodiffusée d’une cible à l’aide d’une série de paramètres extraits d’une analyse des vecteurs et des valeurs propres de la matrice de cohérence. La région du lac Saint-Pierre a été sélectionnée comme site d’étude étant donné la grande diversité de ses milieux humides qui y couvrent plus de 20 000 ha. L’un des défis posés par cette thèse consiste au fait qu’il n’existe pas de système standard énumérant l’ensemble possible des classes physionomiques ni d’indications précises quant à leurs caractéristiques et dimensions. Une grande attention a donc été portée à la création de ces classes par recoupement de sources de données diverses et plus de 50 espèces végétales ont été regroupées en 9 classes physionomiques (chap. 7, 8 et 9). Plusieurs analyses sont proposées pour valider les hypothèses de cette thèse (chap. 9). Des analyses de sensibilité par diffusiogramme sont utilisées pour étudier les caractéristiques et la dispersion des physionomies végétales dans différents espaces constitués de paramètres polarimétriques ou canaux de polarisation (chap. 10 et 12). Des séries temporelles d’images RADARSAT-2 sont utilisées pour approfondir la compréhension de l’évolution saisonnière des physionomies végétales (chap. 12). L’algorithme de la divergence transformée est utilisé pour quantifier la séparabilité entre les classes physionomiques et pour identifier le ou les paramètres ayant le plus contribué(s) à leur séparabilité (chap. 11 et 13). Des classifications sont aussi proposées et les résultats comparés à une carte existante des milieux humide du lac Saint-Pierre (14). Finalement, une analyse du potentiel des paramètres polarimétrique en bande C et L est proposé pour le suivi de l’hydrologie des tourbières (chap. 15 et 16). Les analyses de sensibilité montrent que les paramètres de la 1re composante, relatifs à la portion dominante (polarisée) du signal, sont suffisants pour une caractérisation générale des physionomies végétales. Les paramètres des 2e et 3e composantes sont cependant nécessaires pour obtenir de meilleures séparabilités entre les classes (chap. 11 et 13) et une meilleure discrimination entre milieux humides et milieux secs (chap. 14). Cette thèse montre qu’il est préférable de considérer individuellement les paramètres des 1re, 2e et 3e composantes plutôt que leur somme pondérée par leurs valeurs propres respectives (chap. 10 et 12). Cette thèse examine également la complémentarité entre les paramètres de structure et ceux relatifs à la puissance rétrodiffusée, souvent ignorée et normalisée par la plupart des décompositions polarimétriques. La dimension temporelle (saisonnière) est essentielle pour la caractérisation et la classification des physionomies végétales (chap. 12, 13 et 14). Des images acquises au printemps (avril et mai) sont nécessaires pour discriminer les milieux secs des milieux humides alors que des images acquises en été (juillet et août) sont nécessaires pour raffiner la classification des physionomies végétales. Un arbre hiérarchique de classification développé dans cette thèse constitue une synthèse des connaissances acquises (chap. 14). À l’aide d’un nombre relativement réduit de paramètres polarimétriques et de règles de décisions simples, il est possible d’identifier, entre autres, trois classes de bas marais et de discriminer avec succès les hauts marais herbacés des autres classes physionomiques sans avoir recours à des sources de données auxiliaires. Les résultats obtenus sont comparables à ceux provenant d’une classification supervisée utilisant deux images Landsat-5 avec une exactitude globale de 77.3% et 79.0% respectivement. Diverses classifications utilisant la machine à vecteurs de support (SVM) permettent de reproduire les résultats obtenus avec l’arbre hiérarchique de classification. L’exploitation d’une plus forte dimensionalitée par le SVM, avec une précision globale maximale de 79.1%, ne permet cependant pas d’obtenir des résultats significativement meilleurs. Finalement, la phase de la décomposition de Touzi apparaît être le seul paramètre (en bande L) sensible aux variations du niveau d’eau sous la surface des tourbières ouvertes (chap. 16). Ce paramètre offre donc un grand potentiel pour le suivi de l’hydrologie des tourbières comparativement à la différence de phase entre les canaux HH et VV. Cette thèse démontre que les paramètres de la décomposition de Touzi permettent une meilleure caractérisation, de meilleures séparabilités et de meilleures classifications des physionomies végétales des milieux humides que les canaux de polarisation HH, HV et VV. Le regroupement des espèces végétales en classes physionomiques est un concept valable. Mais certaines espèces végétales partageant une physionomie similaire, mais occupant un milieu différent (haut vs bas marais), ont cependant présenté des différences significatives quant aux propriétés de leur rétrodiffusion.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Decision trees and self organising feature maps (SOFM) are frequently used to identify groups. This research aims to compare the similarities between any groupings found between supervised (Classification and Regression Trees - CART) and unsupervised classification (SOFM), and to identify insights into factors associated with doctor-patient stability. Although CART and SOFM uses different learning paradigms to produce groupings, both methods came up with many similar groupings. Both techniques showed that self perceived health and age are important indicators of stability. In addition, this study has indicated profiles of patients that are at risk which might be interesting to general practitioners.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

This article is devoted to experimental investigation of a novel application of a clustering technique introduced by the authors recently in order to use robust and stable consensus functions in information security, where it is often necessary to process large data sets and monitor outcomes in real time, as it is required, for example, for intrusion detection. Here we concentrate on a particular case of application to profiling of phishing websites. First, we apply several independent clustering algorithms to a randomized sample of data to obtain independent initial clusterings. Silhouette index is used to determine the number of clusters. Second, rank correlation is used to select a subset of features for dimensionality reduction. We investigate the effectiveness of the Pearson Linear Correlation Coefficient, the Spearman Rank Correlation Coefficient and the Goodman--Kruskal Correlation Coefficient in this application. Third, we use a consensus function to combine independent initial clusterings into one consensus clustering. Fourth, we train fast supervised classification algorithms on the resulting consensus clustering in order to enable them to process the whole large data set as well as new data. The precision and recall of classifiers at the final stage of this scheme are critical for the effectiveness of the whole procedure. We investigated various combinations of several correlation coefficients, consensus functions, and a variety of supervised classification algorithms.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Traditional pattern recognition techniques can not handle the classification of large datasets with both efficiency and effectiveness. In this context, the Optimum-Path Forest (OPF) classifier was recently introduced, trying to achieve high recognition rates and low computational cost. Although OPF was much faster than Support Vector Machines for training, it was slightly slower for classification. In this paper, we present the Efficient OPF (EOPF), which is an enhanced and faster version of the traditional OPF, and validate it for the automatic recognition of white matter and gray matter in magnetic resonance images of the human brain. © 2010 IEEE.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

In this work, a new approach for supervised pattern recognition is presented which improves the learning algorithm of the Optimum-Path Forest classifier (OPF), centered on detection and elimination of outliers in the training set. Identification of outliers is based on a penalty computed for each sample in the training set from the corresponding number of imputable false positive and false negative classification of samples. This approach enhances the accuracy of OPF while still gaining in classification time, at the expense of a slight increase in training time. © 2010 Springer-Verlag.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

The water column overlying the submerged aquatic vegetation (SAV) canopy presents difficulties when using remote sensing images for mapping such vegetation. Inherent and apparent water optical properties and its optically active components, which are commonly present in natural waters, in addition to the water column height over the canopy, and plant characteristics are some of the factors that affect the signal from SAV mainly due to its strong energy absorption in the near-infrared. By considering these interferences, a hypothesis was developed that the vegetation signal is better conserved and less absorbed by the water column in certain intervals of the visible region of the spectrum; as a consequence, it is possible to distinguish the SAV signal. To distinguish the signal from SAV, two types of classification approaches were selected. Both of these methods consider the hemispherical-conical reflectance factor (HCRF) spectrum shape, although one type was supervised and the other one was not. The first method adopts cluster analysis and uses the parameters of the band (absorption, asymmetry, height and width) obtained by continuum removal as the input of the classification. The spectral angle mapper (SAM) was adopted as the supervised classification approach. Both approaches tested different wavelength intervals in the visible and near-infrared spectra. It was demonstrated that the 585 to 685-nm interval, corresponding to the green, yellow and red wavelength bands, offered the best results in both classification approaches. However, SAM classification showed better results relative to cluster analysis and correctly separated all spectral curves with or without SAV. Based on this research, it can be concluded that it is possible to discriminate areas with and without SAV using remote sensing. © 2013 by the authors; licensee MDPI, Basel, Switzerland.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Semi-supervised learning is one of the important topics in machine learning, concerning with pattern classification where only a small subset of data is labeled. In this paper, a new network-based (or graph-based) semi-supervised classification model is proposed. It employs a combined random-greedy walk of particles, with competition and cooperation mechanisms, to propagate class labels to the whole network. Due to the competition mechanism, the proposed model has a local label spreading fashion, i.e., each particle only visits a portion of nodes potentially belonging to it, while it is not allowed to visit those nodes definitely occupied by particles of other classes. In this way, a "divide-and-conquer" effect is naturally embedded in the model. As a result, the proposed model can achieve a good classification rate while exhibiting low computational complexity order in comparison to other network-based semi-supervised algorithms. Computer simulations carried out for synthetic and real-world data sets provide a numeric quantification of the performance of the method.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Audio-visual documents obtained from German TV news are classified according to the IPTC topic categorization scheme. To this end usual text classification techniques are adapted to speech, video, and non-speech audio. For each of the three modalities word analogues are generated: sequences of syllables for speech, “video words” based on low level color features (color moments, color correlogram and color wavelet), and “audio words” based on low-level spectral features (spectral envelope and spectral flatness) for non-speech audio. Such audio and video words provide a means to represent the different modalities in a uniform way. The frequencies of the word analogues represent audio-visual documents: the standard bag-of-words approach. Support vector machines are used for supervised classification in a 1 vs. n setting. Classification based on speech outperforms all other single modalities. Combining speech with non-speech audio improves classification. Classification is further improved by supplementing speech and non-speech audio with video words. Optimal F-scores range between 62% and 94% corresponding to 50% - 84% above chance. The optimal combination of modalities depends on the category to be recognized. The construction of audio and video words from low-level features provide a good basis for the integration of speech, non-speech audio and video.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Machine learning techniques are used for extracting valuable knowledge from data. Nowa¬days, these techniques are becoming even more important due to the evolution in data ac¬quisition and storage, which is leading to data with different characteristics that must be exploited. Therefore, advances in data collection must be accompanied with advances in machine learning techniques to solve new challenges that might arise, on both academic and real applications. There are several machine learning techniques depending on both data characteristics and purpose. Unsupervised classification or clustering is one of the most known techniques when data lack of supervision (unlabeled data) and the aim is to discover data groups (clusters) according to their similarity. On the other hand, supervised classification needs data with supervision (labeled data) and its aim is to make predictions about labels of new data. The presence of data labels is a very important characteristic that guides not only the learning task but also other related tasks such as validation. When only some of the available data are labeled whereas the others remain unlabeled (partially labeled data), neither clustering nor supervised classification can be used. This scenario, which is becoming common nowadays because of labeling process ignorance or cost, is tackled with semi-supervised learning techniques. This thesis focuses on the branch of semi-supervised learning closest to clustering, i.e., to discover clusters using available labels as support to guide and improve the clustering process. Another important data characteristic, different from the presence of data labels, is the relevance or not of data features. Data are characterized by features, but it is possible that not all of them are relevant, or equally relevant, for the learning process. A recent clustering tendency, related to data relevance and called subspace clustering, claims that different clusters might be described by different feature subsets. This differs from traditional solutions to data relevance problem, where a single feature subset (usually the complete set of original features) is found and used to perform the clustering process. The proximity of this work to clustering leads to the first goal of this thesis. As commented above, clustering validation is a difficult task due to the absence of data labels. Although there are many indices that can be used to assess the quality of clustering solutions, these validations depend on clustering algorithms and data characteristics. Hence, in the first goal three known clustering algorithms are used to cluster data with outliers and noise, to critically study how some of the most known validation indices behave. The main goal of this work is however to combine semi-supervised clustering with subspace clustering to obtain clustering solutions that can be correctly validated by using either known indices or expert opinions. Two different algorithms are proposed from different points of view to discover clusters characterized by different subspaces. For the first algorithm, available data labels are used for searching for subspaces firstly, before searching for clusters. This algorithm assigns each instance to only one cluster (hard clustering) and is based on mapping known labels to subspaces using supervised classification techniques. Subspaces are then used to find clusters using traditional clustering techniques. The second algorithm uses available data labels to search for subspaces and clusters at the same time in an iterative process. This algorithm assigns each instance to each cluster based on a membership probability (soft clustering) and is based on integrating known labels and the search for subspaces into a model-based clustering approach. The different proposals are tested using different real and synthetic databases, and comparisons to other methods are also included when appropriate. Finally, as an example of real and current application, different machine learning tech¬niques, including one of the proposals of this work (the most sophisticated one) are applied to a task of one of the most challenging biological problems nowadays, the human brain model¬ing. Specifically, expert neuroscientists do not agree with a neuron classification for the brain cortex, which makes impossible not only any modeling attempt but also the day-to-day work without a common way to name neurons. Therefore, machine learning techniques may help to get an accepted solution to this problem, which can be an important milestone for future research in neuroscience. Resumen Las técnicas de aprendizaje automático se usan para extraer información valiosa de datos. Hoy en día, la importancia de estas técnicas está siendo incluso mayor, debido a que la evolución en la adquisición y almacenamiento de datos está llevando a datos con diferentes características que deben ser explotadas. Por lo tanto, los avances en la recolección de datos deben ir ligados a avances en las técnicas de aprendizaje automático para resolver nuevos retos que pueden aparecer, tanto en aplicaciones académicas como reales. Existen varias técnicas de aprendizaje automático dependiendo de las características de los datos y del propósito. La clasificación no supervisada o clustering es una de las técnicas más conocidas cuando los datos carecen de supervisión (datos sin etiqueta), siendo el objetivo descubrir nuevos grupos (agrupaciones) dependiendo de la similitud de los datos. Por otra parte, la clasificación supervisada necesita datos con supervisión (datos etiquetados) y su objetivo es realizar predicciones sobre las etiquetas de nuevos datos. La presencia de las etiquetas es una característica muy importante que guía no solo el aprendizaje sino también otras tareas relacionadas como la validación. Cuando solo algunos de los datos disponibles están etiquetados, mientras que el resto permanece sin etiqueta (datos parcialmente etiquetados), ni el clustering ni la clasificación supervisada se pueden utilizar. Este escenario, que está llegando a ser común hoy en día debido a la ignorancia o el coste del proceso de etiquetado, es abordado utilizando técnicas de aprendizaje semi-supervisadas. Esta tesis trata la rama del aprendizaje semi-supervisado más cercana al clustering, es decir, descubrir agrupaciones utilizando las etiquetas disponibles como apoyo para guiar y mejorar el proceso de clustering. Otra característica importante de los datos, distinta de la presencia de etiquetas, es la relevancia o no de los atributos de los datos. Los datos se caracterizan por atributos, pero es posible que no todos ellos sean relevantes, o igualmente relevantes, para el proceso de aprendizaje. Una tendencia reciente en clustering, relacionada con la relevancia de los datos y llamada clustering en subespacios, afirma que agrupaciones diferentes pueden estar descritas por subconjuntos de atributos diferentes. Esto difiere de las soluciones tradicionales para el problema de la relevancia de los datos, en las que se busca un único subconjunto de atributos (normalmente el conjunto original de atributos) y se utiliza para realizar el proceso de clustering. La cercanía de este trabajo con el clustering lleva al primer objetivo de la tesis. Como se ha comentado previamente, la validación en clustering es una tarea difícil debido a la ausencia de etiquetas. Aunque existen muchos índices que pueden usarse para evaluar la calidad de las soluciones de clustering, estas validaciones dependen de los algoritmos de clustering utilizados y de las características de los datos. Por lo tanto, en el primer objetivo tres conocidos algoritmos se usan para agrupar datos con valores atípicos y ruido para estudiar de forma crítica cómo se comportan algunos de los índices de validación más conocidos. El objetivo principal de este trabajo sin embargo es combinar clustering semi-supervisado con clustering en subespacios para obtener soluciones de clustering que puedan ser validadas de forma correcta utilizando índices conocidos u opiniones expertas. Se proponen dos algoritmos desde dos puntos de vista diferentes para descubrir agrupaciones caracterizadas por diferentes subespacios. Para el primer algoritmo, las etiquetas disponibles se usan para bus¬car en primer lugar los subespacios antes de buscar las agrupaciones. Este algoritmo asigna cada instancia a un único cluster (hard clustering) y se basa en mapear las etiquetas cono-cidas a subespacios utilizando técnicas de clasificación supervisada. El segundo algoritmo utiliza las etiquetas disponibles para buscar de forma simultánea los subespacios y las agru¬paciones en un proceso iterativo. Este algoritmo asigna cada instancia a cada cluster con una probabilidad de pertenencia (soft clustering) y se basa en integrar las etiquetas conocidas y la búsqueda en subespacios dentro de clustering basado en modelos. Las propuestas son probadas utilizando diferentes bases de datos reales y sintéticas, incluyendo comparaciones con otros métodos cuando resulten apropiadas. Finalmente, a modo de ejemplo de una aplicación real y actual, se aplican diferentes técnicas de aprendizaje automático, incluyendo una de las propuestas de este trabajo (la más sofisticada) a una tarea de uno de los problemas biológicos más desafiantes hoy en día, el modelado del cerebro humano. Específicamente, expertos neurocientíficos no se ponen de acuerdo en una clasificación de neuronas para la corteza cerebral, lo que imposibilita no sólo cualquier intento de modelado sino también el trabajo del día a día al no tener una forma estándar de llamar a las neuronas. Por lo tanto, las técnicas de aprendizaje automático pueden ayudar a conseguir una solución aceptada para este problema, lo cual puede ser un importante hito para investigaciones futuras en neurociencia.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Samples of Forsythia suspensa from raw (Laoqiao) and ripe (Qingqiao) fruit were analyzed with the use of HPLC-DAD and the EIS-MS techniques. Seventeen peaks were detected, and of these, twelve were identified. Most were related to the glucopyranoside molecular fragment. Samples collected from three geographical areas (Shanxi, Henan and Shandong Provinces), were discriminated with the use of hierarchical clustering analysis (HCA), discriminant analysis (DA), and principal component analysis (PCA) models, but only PCA was able to provide further information about the relationships between objects and loadings; eight peaks were related to the provinces of sample origin. The supervised classification models-K-nearest neighbor (KNN), least squares support vector machines (LS-SVM), and counter propagation artificial neural network (CP-ANN) methods, indicated successful classification but KNN produced 100% classification rate. Thus, the fruit were discriminated on the basis of their places of origin.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

The loss and degradation of forest cover is currently a globally recognised problem. The fragmentation of forests is further affecting the biodiversity and well-being of the ecosystems also in Kenya. This study focuses on two indigenous tropical montane forests in the Taita Hills in southeastern Kenya. The study is a part of the TAITA-project within the Department of Geography in the University of Helsinki. The study forests, Ngangao and Chawia, are studied by remote sensing and GIS methods. The main data includes black and white aerial photography from 1955 and true colour digital camera data from 2004. This data is used to produce aerial mosaics from the study areas. The land cover of these study areas is studied by visual interpretation, pixel-based supervised classification and object-oriented supervised classification. The change of the forest cover is studied with GIS methods using the visual interpretations from 1955 and 2004. Furthermore, the present state of the study forests is assessed with leaf area index and canopy closure parameters retrieved from hemispherical photographs as well as with additional, previously collected forest health monitoring data. The canopy parameters are also compared with textural parameters from digital aerial mosaics. This study concludes that the classification of forest areas by using true colour data is not an easy task although the digital aerial mosaics are proved to be very accurate. The best classifications are still achieved with visual interpretation methods as the accuracies of the pixel-based and object-oriented supervised classification methods are not satisfying. According to the change detection of the land cover in the study areas, the area of indigenous woodland in both forests has decreased in 1955 2004. However in Ngangao, the overall woodland area has grown mainly because of plantations of exotic species. In general, the land cover of both study areas is more fragmented in 2004 than in 1955. Although the forest area has decreased, forests seem to have a more optimistic future than before. This is due to the increasing appreciation of the forest areas.