778 results for Traditional clustering


Relevance:

70.00%

Publisher:

Abstract:

In this work we present a new clustering method that groups the points of a data set into classes. The method is based on an algorithm that links auxiliary clusters obtained using traditional vector quantization techniques. We describe several approaches explored during the development of the work, based on measures of distance or dissimilarity (divergence) between the auxiliary clusters. The new method uses only two pieces of a priori information: the number of auxiliary clusters Na and a threshold distance dt that is used to decide whether or not to link auxiliary clusters. The number of classes can be found automatically by the method, based on the chosen threshold distance dt, or it can be given as additional information to help choose the correct threshold. Some analyses are made and the results are compared with traditional clustering methods. Different dissimilarity metrics are analyzed and a new one is proposed based on the concept of negentropy. Besides grouping the points of a set into classes, a method is proposed for statistically modeling the classes, aiming to obtain an expression for the probability that a point belongs to one of the classes. Experiments with several values of Na and dt are carried out on test sets, and the results are analyzed to study the robustness of the method and to consider heuristics for choosing the correct threshold. The work also explores aspects of information theory applied to the calculation of the divergences, specifically the different measures of information and divergence based on the Rényi entropy. The results obtained with the different metrics are compared and discussed. The work also has an appendix presenting real applications of the proposed method.
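The linkage step described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the auxiliary cluster centroids are taken as already computed by some vector quantizer, plain Euclidean distance between centroids stands in for the divergence measures the work studies, and the function name `link_auxiliary_clusters` is hypothetical.

```python
import math


def link_auxiliary_clusters(centroids, dt):
    """Merge auxiliary clusters whose centroids lie within distance dt.

    Assumption: Euclidean centroid distance replaces the divergence
    measures discussed in the abstract. Union-find is used so that
    chains of linked clusters end up in the same class.
    """
    parent = list(range(len(centroids)))

    def find(i):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            if math.dist(centroids[i], centroids[j]) <= dt:
                parent[find(i)] = find(j)

    # Relabel the roots as consecutive class indices 0, 1, 2, ...
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(centroids))]


# Five hypothetical auxiliary centroids (Na = 5) forming two chains.
centroids = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 10.0), (11.0, 10.0)]
labels = link_auxiliary_clusters(centroids, dt=1.5)
print(labels)
```

In this sketch the number of classes is not supplied; it falls out of the chosen threshold, which mirrors the behavior the abstract attributes to dt.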

Relevance:

70.00%

Publisher:

Abstract:

Nowadays, organizations face the problem of keeping their information protected, available and trustworthy. In this context, machine learning techniques have been extensively applied to this task. Since manual labeling is very expensive, several works attempt to handle intrusion detection with traditional clustering algorithms. In this paper, we introduce a new pattern recognition technique called Optimum-Path Forest (OPF) clustering to this task. Experiments on three public datasets have shown that the OPF classifier may be a suitable tool to detect intrusions on computer networks, since it outperformed some state-of-the-art unsupervised techniques. © 2012 IEEE.

Relevance:

70.00%

Publisher:

Abstract:

This work proposes a method for data clustering based on complex networks theory. A data set is represented as a network by considering different metrics to establish the connection between each pair of objects. The clusters are obtained using five community detection algorithms. The network-based clustering approach is applied to two real-world databases and two sets of artificially generated data. The obtained results suggest that the exponential of the Minkowski distance is the most suitable metric to quantify the similarities between pairs of objects. In addition, the community identification method based on greedy optimization provides the best cluster solution. We compare the network-based clustering approach with some traditional clustering algorithms and verify that it provides the lowest classification error rate. (C) 2012 Elsevier B.V. All rights reserved.
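As a rough illustration of how a data set might be turned into a weighted network, the sketch below computes edge weights as a decaying exponential of the Minkowski distance. The exact form is an assumption: the abstract only names "the exponential of the Minkowski distance", so the negative sign, the absence of a scale factor, and the helper names `minkowski` and `similarity_network` are all hypothetical. Community detection on the resulting weights is not shown.

```python
import math


def minkowski(x, y, p=2.0):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def similarity_network(points, p=2.0):
    """Return a symmetric weight matrix with w_ij = exp(-d_p(x_i, x_j)).

    Assumption: similarity decays exponentially with distance; the
    diagonal is left at 0.0 so the network has no self-loops.
    """
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w = math.exp(-minkowski(points[i], points[j], p))
            weights[i][j] = weights[j][i] = w
    return weights


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
W = similarity_network(points, p=2.0)
```

With these weights, nearby objects are connected strongly and distant ones only weakly, which is the property a community detection algorithm would then exploit.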

Relevance:

70.00%

Publisher:

Abstract:

Machine learning techniques are used for extracting valuable knowledge from data. Nowadays, these techniques are becoming even more important due to the evolution in data acquisition and storage, which is leading to data with different characteristics that must be exploited. Therefore, advances in data collection must be accompanied by advances in machine learning techniques to solve new challenges that might arise in both academic and real applications. There are several machine learning techniques, depending on both data characteristics and purpose. Unsupervised classification, or clustering, is one of the best-known techniques when data lack supervision (unlabeled data) and the aim is to discover data groups (clusters) according to their similarity. On the other hand, supervised classification needs data with supervision (labeled data) and its aim is to make predictions about the labels of new data. The presence of data labels is a very important characteristic that guides not only the learning task but also other related tasks such as validation. When only some of the available data are labeled whereas the others remain unlabeled (partially labeled data), neither clustering nor supervised classification can be used. This scenario, which is becoming common nowadays because of the ignorance or cost of the labeling process, is tackled with semi-supervised learning techniques. This thesis focuses on the branch of semi-supervised learning closest to clustering, i.e., discovering clusters using available labels as support to guide and improve the clustering process. Another important data characteristic, different from the presence of data labels, is the relevance or not of data features. Data are characterized by features, but it is possible that not all of them are relevant, or equally relevant, for the learning process.
A recent clustering trend, related to data relevance and called subspace clustering, claims that different clusters might be described by different feature subsets. This differs from traditional solutions to the data relevance problem, where a single feature subset (usually the complete set of original features) is found and used to perform the clustering process. The proximity of this work to clustering leads to the first goal of this thesis. As noted above, clustering validation is a difficult task due to the absence of data labels. Although there are many indices that can be used to assess the quality of clustering solutions, these validations depend on clustering algorithms and data characteristics. Hence, in the first goal three well-known clustering algorithms are used to cluster data with outliers and noise, to critically study how some of the best-known validation indices behave. The main goal of this work, however, is to combine semi-supervised clustering with subspace clustering to obtain clustering solutions that can be correctly validated by using either known indices or expert opinions. Two different algorithms are proposed from different points of view to discover clusters characterized by different subspaces. In the first algorithm, available data labels are first used to search for subspaces, before searching for clusters. This algorithm assigns each instance to only one cluster (hard clustering) and is based on mapping known labels to subspaces using supervised classification techniques. Subspaces are then used to find clusters using traditional clustering techniques. The second algorithm uses available data labels to search for subspaces and clusters at the same time in an iterative process. This algorithm assigns each instance to each cluster based on a membership probability (soft clustering) and is based on integrating known labels and the search for subspaces into a model-based clustering approach.
The different proposals are tested using different real and synthetic databases, and comparisons to other methods are also included when appropriate. Finally, as an example of a real and current application, different machine learning techniques, including one of the proposals of this work (the most sophisticated one), are applied to a task from one of the most challenging biological problems nowadays, human brain modeling. Specifically, expert neuroscientists do not agree on a neuron classification for the brain cortex, which makes not only any modeling attempt but also day-to-day work impossible without a common way to name neurons. Therefore, machine learning techniques may help to reach an accepted solution to this problem, which can be an important milestone for future research in neuroscience.

Relevance:

60.00%

Publisher:

Abstract:

This paper describes the application of a new technique, rough clustering, to the problem of market segmentation. Rough clustering produces different solutions from k-means analysis because of the possibility of multiple cluster membership of objects. Traditional clustering methods generate extensional descriptions of groups, which show which objects are members of each cluster. Clustering techniques based on rough sets theory generate intensional descriptions, which outline the main characteristics of each cluster. In this study, a rough cluster analysis was conducted on a sample of 437 responses from a larger study of the relationship between shopping orientation (the general predisposition of consumers toward the act of shopping) and intention to purchase products via the Internet. The cluster analysis was based on five measures of shopping orientation: enjoyment, personalization, convenience, loyalty, and price. The rough clusters obtained provide interpretations of different shopping orientations present in the data without the restriction of attempting to fit each object into only one segment. Such descriptions can be an aid to marketers attempting to identify potential segments of consumers.
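The idea of multiple cluster membership can be illustrated with a small sketch. It is not the algorithm used in the paper, which the abstract does not specify; it borrows a distance-ratio rule of the kind used in rough k-means variants, and the function name `rough_assign` and the `threshold` parameter are hypothetical.

```python
import math


def rough_assign(point, centroids, threshold=1.3):
    """Return (candidate clusters, certain?) for one object.

    Assumption: an object may belong to every cluster whose centroid is
    at most `threshold` times as far away as the nearest centroid. With
    a single candidate the membership is certain; with several, the
    object sits in the boundary region shared by those clusters.
    """
    dists = [math.dist(point, c) for c in centroids]
    nearest = min(dists)
    candidates = [k for k, d in enumerate(dists) if d <= threshold * nearest]
    return candidates, len(candidates) == 1


centroids = [(0.0, 0.0), (4.0, 0.0)]
clear_case = rough_assign((1.0, 0.0), centroids)     # close to cluster 0 only
boundary_case = rough_assign((2.1, 0.0), centroids)  # roughly between both
```

An object near the midpoint is reported as a possible member of both clusters rather than being forced into one segment.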

Relevance:

30.00%

Publisher:

Abstract:

Different types of water bodies, including lakes, streams, and coastal marine waters, are often susceptible to fecal contamination from a range of point and nonpoint sources, and have been evaluated using fecal indicator microorganisms. The most commonly used fecal indicator is Escherichia coli, but traditional cultivation methods do not allow discrimination of the source of pollution. The use of triplex PCR offers an approach that is fast and inexpensive, and here enabled the identification of phylogroups. The phylogenetic distribution of E. coli subgroups isolated from water samples revealed higher frequencies of subgroups A1 and B23 in rivers impacted by human pollution sources, while subgroups D1 and D2 were associated with pristine sites, and subgroup B1 with domesticated animal sources, suggesting their use as a first screening for pollution source identification. A simple classification is also proposed based on phylogenetic subgroup distribution using the w-clique metric, enabling differentiation of polluted and unpolluted sites.

Relevance:

30.00%

Publisher:

Abstract:

HEMOLIA (a project under the European Community's 7th Framework Programme) is a new-generation Anti-Money Laundering (AML) intelligent multi-agent alert and investigation system which, in addition to traditional financial data, makes extensive use of modern society's huge telecom data source, thereby opening up a new dimension of capabilities to all Money Laundering fighters (FIUs, LEAs) and Financial Institutes (Banks, Insurance Companies, etc.). This Master's thesis project was carried out at AIA, one of the partners of the HEMOLIA project, in Barcelona. The objective of this thesis is to find the clusters in a network drawn from the financial data. An extensive literature survey has been carried out and several standard algorithms related to networks have been studied and implemented. The clustering problem is an NP-hard problem, and several algorithms such as K-Means and hierarchical clustering are being implemented for studying problems relating to sociology, evolution, anthropology, etc. However, these algorithms have certain drawbacks which make them very difficult to implement. The thesis suggests (a) a possible improvement to the K-Means algorithm, (b) a novel approach to the clustering problem using Genetic Algorithms and (c) a new algorithm for finding the cluster of a node using a Genetic Algorithm.

Relevance:

30.00%

Publisher:

Abstract:

We used an ethnobotanical approach to identify plant species used by the Cree to treat the symptoms of type 2 diabetes. Larix laricina du Roi (L. laricina) was recently identified as one of the best plants: it stimulated glucose transport in C2C12 cells and strongly potentiated the differentiation of 3T3-L1 cells, indicating a potentially increased sensitivity to insulin. These screening studies were carried out on ethanolic extracts (EE) using a series of in vitro bioassays. However, traditional plant preparations are often made with hot water. The aim of this doctoral thesis was to isolate the active principles of L. laricina through adipogenesis-guided fractionation, and to evaluate and compare the antidiabetic activity and mechanisms of the EE and the hot water extracts (HWE) of these 17 plants. In the fractionation of L. laricina, several known compounds were isolated and a new active cycloartane triterpene was identified, which strongly enhanced adipogenesis and was partly responsible for the adipogenic activity (potentially similar to the insulin-sensitizing effect of the glitazones) of the ethanolic extract of L. laricina bark. Regarding lipid metabolism, our results confirmed that 10 of the 17 EE increased adipocyte differentiation, whereas only 2 extracts inhibited it. The HWE showed weak adipogenic or antiadipogenic activity. The EE of R. groenlandicum and K. angustifolia stimulated PPAR γ (peroxisome proliferator-activated receptor γ), SREBP-1 (sterol regulatory element binding protein-1) and C/EBP (CCAAT-enhancer binding proteins) α, whereas those of P. balsamifera and A. incana inhibited them. The inhibitory effect of P. balsamifera was also shown to involve activation of AMP-activated protein kinase (AMPK). The EE and HWE of R. groenlandicum stimulated the same transcription factors, whereas the aqueous extracts of the other selected plants lost these effects compared with their respective ethanolic extracts. Phytochemical analysis also identified the groups of active and inactive species, particularly when the species were separated by plant family. Finally, regarding glucose homeostasis, our results confirmed that several EE stimulated muscle glucose transport and inhibited hepatic glucose-6-phosphatase (G6Pase) activity. Some of the HWE partially or completely lost these antidiabetic activities relative to the EE, while only one plant (R. groenlandicum) retained a similar potential between the EE and HWE in both assays. In muscle cells, the EE of R. groenlandicum, A. incana and S. purpurea stimulated glucose transport by activating the AMPK signaling pathway and increasing the expression level of GLUT4. Compared with the EE, the HWE of R. groenlandicum showed similar activities; the HWE of A. incana completely lost their effect on all parameters studied; and the HWE of S. purpurea activated the insulin pathway instead of the AMPK pathway to increase glucose transport. In H4IIE cells, the EE and HWE of the 5 plants activated the AMPK pathway, and in addition the EE and HWE of 2 plants activated the insulin pathway. Quercetin-3-O-galactoside and quercetin 3-O-α-L-arabinopyranoside were identified as compounds with strong antidiabetic potential and therefore responsible for the biological activity of the active plant HWE on glucose transport. In conclusion, several known compounds were isolated and a new active triterpene was identified from the fractionation of L. laricina. We also provided direct evidence for the evaluation and comparison of the insulin-like or insulin-sensitizing action of the EE and HWE of Cree medicinal plants in muscle, liver and adipose tissue. Part of their action may be linked to the stimulation of insulin-dependent and insulin-independent intracellular signaling pathways, as well as the activation of PPARγ. Our results indicate that plant species, target tissues or cells, and extraction methods are all significant determinants of the biological activity of Cree medicinal plants on carbohydrate and lipid metabolism.

Relevance:

30.00%

Publisher:

Abstract:

There is controversy about whether traditional medicine can guide drug discovery, and investment in ethnobotanically led research has fluctuated. One view is that traditionally used plants are not necessarily efficacious and there are no robust methods for distinguishing the ones that are most likely to be bioactive when selecting species for further testing. Here, we reconstruct a genus-level molecular phylogeny representing the 20,000 species found in the floras of three disparate biodiversity hotspots: Nepal, New Zealand and the Cape of South Africa. Borrowing phylogenetic methods from community ecology, we reveal significant clustering of the 1,500 traditionally used species, and provide a direct measure of the relatedness of the three medicinal floras. We demonstrate shared phylogenetic patterns across the floras: related plants from these regions are used to treat medical conditions in the same therapeutic areas. This strongly suggests independent discovery of plant efficacy, an interpretation corroborated by the presence of a significantly greater proportion of known bioactive species in these plant groups than in a random sample. Phylogenetic cross-cultural comparison can focus screening efforts on a subset of traditionally used plants that are richer in bioactive compounds, and could revitalise the use of traditional knowledge in bioprospecting.

Relevance:

30.00%

Publisher:

Abstract:

This paper presents a hierarchical clustering method for semantic Web service discovery. The method aims to improve the accuracy and efficiency of traditional service discovery based on the vector space model. The Web service is converted into a standard vector format through the Web service description document. With the help of WordNet, a semantic analysis is conducted to reduce the dimension of the term vector and to perform semantic expansion to meet the user's service request. The process and algorithm of hierarchical clustering based semantic Web service discovery are discussed. Validation is carried out on the dataset.
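The vector space model mentioned in this abstract can be sketched in a few lines. The example assumes raw term counts and cosine similarity, which the abstract does not confirm; the WordNet-based dimension reduction and the hierarchical clustering itself are not shown, and the service descriptions and helper names `term_vector` and `cosine` are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Turn a service description into a bag-of-words term vector."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty description matches nothing
    return dot / (norm_u * norm_v)


flight = term_vector("book flight ticket")
hotel = term_vector("book hotel room")
weather = term_vector("weather forecast")

sim_related = cosine(flight, hotel)      # one shared term out of three
sim_unrelated = cosine(flight, weather)  # no shared terms
```

A hierarchical clustering step would then repeatedly merge the most similar services or groups, using a similarity such as this one.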

Relevance:

30.00%

Publisher:

Abstract:

We characterized 28 new isolates of Trypanosoma cruzi IIc (TCIIc) of mammals and triatomines from Northern to Southern Brazil, confirming the widespread distribution of this lineage. Phylogenetic analyses using cytochrome b and SSU rDNA sequences clearly separated TCIIc from TCIIa according to terrestrial and arboreal ecotopes of their preferential mammalian hosts and vectors. TCIIc was more closely related to TCIId/e, followed by TCIIa, and separated by large distances from TCIIb and TCI. Despite being indistinguishable by traditional genotyping and generally being assigned to Z3, we provide evidence that TCIIa from South America and TCIIa from North America correspond to independent lineages that circulate in distinct hosts and ecological niches. Armadillos, terrestrial didelphids and rodents, and domestic dogs were found infected by TCIIc in Brazil. We believe that, in Brazil, this is the first description of TCIIc from rodents and domestic dogs. Terrestrial triatomines of genera Panstrongylus and Triatoma were confirmed as vectors of TCIIc. Together, habitat, mammalian host and vector association corroborated the link between TCIIc and terrestrial transmission cycles/ecological niches. Analysis of ITS1 rDNA sequences disclosed clusters of TCIIc isolates in accordance with their geographic origin, independent of their host species. (C) 2009 Elsevier B.V. All rights reserved.

Relevance:

30.00%

Publisher:

Abstract:

Chiral symmetry breaking at finite baryon density is usually discussed in the context of quark matter, i.e. a system of deconfined quarks. Many systems like stable nuclei and neutron stars however have quarks confined within nucleons. In this paper we construct a Fermi sea of three-quark nucleon clusters and investigate the change of the quark condensate as a function of baryon density. We study the effect of quark clustering on the in-medium quark condensate and compare results with the traditional approach of modeling hadronic matter in terms of a Fermi sea of deconfined quarks.

Relevance:

30.00%

Publisher:

Abstract:

Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq)

Relevance:

30.00%

Publisher:

Abstract:

Wireless Sensor Networks (WSN) are a special kind of ad-hoc network usually deployed in a monitoring field in order to detect some physical phenomenon. Due to the low dependability of individual nodes, small radio coverage and the large areas to be monitored, nodes are generally organized in small clusters. Moreover, a large number of WSN nodes is usually deployed in the monitoring area to increase WSN dependability. Therefore, the best cluster head positioning is a desirable characteristic in a WSN. In this paper, we propose a hybrid clustering algorithm based on community detection in complex networks and the traditional K-means clustering technique: the QK-Means algorithm. Simulation results show that QK-Means detects communities and sub-communities, thereby decreasing the lost message rate and increasing WSN coverage. © 2012 IEEE.

Relevance:

30.00%

Publisher:

Abstract:

In this paper we propose a nature-inspired approach that can boost the Optimum-Path Forest (OPF) clustering algorithm by optimizing its parameters in a discrete lattice. Experiments on two public datasets have shown that the proposed algorithm can achieve parameter values similar to those found by exhaustive search. Moreover, the proposed technique is faster than the traditional one, which makes it attractive for intrusion detection in large-scale traffic networks. © 2012 IEEE.