959 results for Clustering methods
Abstract:
Stroke is one of the most frequent causes of death, regardless of age or gender. Besides accounting for a significant share of mortality, the disease also causes long-term disabilities with long recovery times, which in turn drive up costs. However, stroke and other diseases may be prevented when evidence of illness is taken into account. Therefore, the present work starts with the development of a decision support system to assess stroke risk, centered on a formal framework based on Logic Programming for knowledge representation and reasoning, complemented with a Case Based Reasoning (CBR) approach to computing. To make the CBR cycle practical, normalization and optimization phases were introduced, and clustering methods were used, thereby reducing the search space and enhancing case retrieval. In addition, to improve the theoretical basis of CBR, the predicates' attributes were normalized to the interval 0…1, and the extensions of the predicates that match the universe of discourse were rewritten and set not only in terms of an evaluation of their Quality-of-Information (QoI), but also in terms of an assessment of a Degree-of-Confidence (DoC), a measure of one's confidence that they fit into a given interval, taking into account their domains. That is, each predicate attribute is given as a pair (QoI, DoC), a simple and elegant way to represent data or knowledge that is incomplete, self-contradictory, or even unknown.
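As a minimal sketch of the normalization step mentioned above, the following Python maps an attribute value (or an interval-valued attribute) from its domain onto 0…1. The function names, the blood pressure attribute, and its domain of 70 to 190 are illustrative assumptions, not taken from the abstract; the QoI and DoC computations are not shown because the abstract does not define them.

```python
def normalize(value, lower, upper):
    """Map value from the domain [lower, upper] onto [0, 1] (min-max scaling)."""
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    return (value - lower) / (upper - lower)


def normalize_interval(lo, hi, lower, upper):
    """Normalize an interval-valued attribute, e.g. an uncertain reading."""
    return normalize(lo, lower, upper), normalize(hi, lower, upper)


# Hypothetical attribute: systolic blood pressure with an assumed domain of 70..190.
print(normalize(130, 70, 190))                 # 0.5
print(normalize_interval(100, 160, 70, 190))   # (0.25, 0.75)
```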
Abstract:
It is well known that rib cage dimensions depend on gender and vary with the age of the individual. In this setting a computational approach to the problem is feasible, and this work therefore focuses on the development of an Artificial Intelligence grounded decision support system to predict an individual's age from such measurements. On the one hand, basic image processing techniques were used to extract the measurements (i.e., maximum rib cage width and height) from chest X-rays. On the other hand, the computational framework was built on top of a Logic Programming Case Base approach to knowledge representation and reasoning, which caters for the handling of incomplete, unknown, or even contradictory information. Furthermore, clustering methods based on similarity analysis among cases were used to distinguish and aggregate collections of historical data in order to reduce the search space, thereby enhancing case retrieval and the overall computational process. The accuracy of the proposed model is satisfactory, close to 90%.
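The idea of clustering the case base to shrink the retrieval search space can be sketched as follows. This is only an illustration under stated assumptions: Euclidean distance as the similarity measure, centroids as cluster representatives, and the tiny case base of normalized (width, height) pairs are all hypothetical and not described in the abstract.

```python
import math


def distance(a, b):
    """Euclidean distance between two feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Component-wise mean of a list of feature tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def retrieve(query, clusters):
    """Return (cluster id, most similar case), searching only the nearest cluster.

    clusters maps a cluster id to a list of (features, outcome) cases.
    """
    nearest = min(
        clusters,
        key=lambda cid: distance(query, centroid([f for f, _ in clusters[cid]])),
    )
    best = min(clusters[nearest], key=lambda case: distance(query, case[0]))
    return nearest, best


# Hypothetical normalized (width, height) measurements with known ages.
case_base = {
    "young": [((0.20, 0.25), 12), ((0.30, 0.20), 15)],
    "adult": [((0.80, 0.75), 40), ((0.70, 0.85), 52)],
}
print(retrieve((0.72, 0.80), case_base))
```

Only the cases in the cluster whose centroid is closest to the query are compared one by one, which is where the reduction in search space comes from.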
Abstract:
Plants of the genus Schinus are native to South America and were introduced to Mediterranean countries long ago. Some Schinus species have been used in folk medicine, and Essential Oils of Schinus spp. (EOs) have been reported as having antimicrobial, anti-tumoural and anti-inflammatory properties. Such properties are related to the chemical composition of the EOs, which depends largely on the species, the geographic and climatic region, and the part of the plant used. Considering the difficulty of inferring the pharmacological properties of EOs of Schinus species without extensive experimental work, this work focuses on the development of an Artificial Intelligence grounded Decision Support System to predict pharmacological properties of Schinus EOs. The computational framework was built on top of a Logic Programming Case Base approach to knowledge representation and reasoning, which caters for the handling of incomplete, unknown, or even self-contradictory information. New clustering methods centered on an analysis of attribute similarities were used to distinguish and aggregate historical data according to the context under which it was added to the Case Base, thereby enhancing the prediction process.
Abstract:
It is well known that human resources play a valuable role in sustainable organizational development. This work focuses on the development of a decision support system to assess workers' satisfaction based on factors related to human resources management practices. The framework is built on top of a Logic Programming approach to Knowledge Representation and Reasoning, complemented with a Case Based approach to computing. The proposed solution is distinctive in that it caters for the explicit treatment of incomplete, unknown, or even self-contradictory information, in either a qualitative or a quantitative setting. Furthermore, clustering methods based on similarity analysis among cases were used to distinguish and aggregate collections of historical data or knowledge in order to reduce the search space, thereby enhancing case retrieval and the overall computational process.
Abstract:
The aim of the present work was to evaluate genetic divergence and the physical and chemical characteristics of fruit from two populations of sour passion fruit in the northern region of the State of Espírito Santo, Brazil: half-sibling progenies from local accessions of a commercial crop (genotypes 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10) and the hybrid BRS Ouro Vermelho (genotypes 11, 12, 13, 14, 15, 16, 17, 18, 19 and 20). Genetic divergence was evaluated using multivariate procedures, namely the generalised Mahalanobis distance (D²) and the Tocher optimisation and UPGMA clustering methods. Genetic divergence was found between the populations under study, and the Tocher and UPGMA methods formed different groups. The characteristics related to fruit size, polar and equatorial diameter, contributed most to the genetic diversity of the genotypes. The populations of sour passion fruit studied show great genetic variability in the evaluated characteristics, making it possible to select plants of high potential for breeding purposes. The BRS Ouro Vermelho hybrid is well adapted to the local conditions.
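To illustrate how UPGMA groups genotypes from a pairwise distance matrix, here is a minimal pure-Python sketch of average-linkage agglomeration. The four genotype labels and their distances are hypothetical stand-ins for D² values, and the function name `upgma` is an assumption; computing Mahalanobis distances and the Tocher method are not shown.

```python
def upgma(labels, dist):
    """Agglomerative clustering with average linkage (UPGMA).

    labels: list of item names.
    dist: dict mapping frozenset({a, b}) to the distance between items a and b.
    Returns the merge history as (cluster_a, cluster_b, merge_distance) tuples.
    """
    sizes = {(name,): 1 for name in labels}  # cluster (tuple of names) -> size
    d = {}
    for pair, value in dist.items():
        a, b = tuple(pair)
        d[frozenset({(a,), (b,)})] = value
    merges = []
    while len(sizes) > 1:
        pair = min(d, key=d.get)             # closest pair of clusters
        a, b = sorted(pair)
        height = d[pair]
        merged = tuple(sorted(a + b))
        # Distance from the merged cluster to every other cluster is the
        # size-weighted average of the two old distances.
        for c in sizes:
            if c in (a, b):
                continue
            dac = d.pop(frozenset({a, c}))
            dbc = d.pop(frozenset({b, c}))
            d[frozenset({merged, c})] = (
                sizes[a] * dac + sizes[b] * dbc
            ) / (sizes[a] + sizes[b])
        del d[pair]
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
        merges.append((a, b, height))
    return merges


# Hypothetical pairwise distances between four genotypes.
distances = {
    frozenset({"g1", "g2"}): 2, frozenset({"g3", "g4"}): 4,
    frozenset({"g1", "g3"}): 10, frozenset({"g1", "g4"}): 12,
    frozenset({"g2", "g3"}): 14, frozenset({"g2", "g4"}): 16,
}
for step in upgma(["g1", "g2", "g3", "g4"], distances):
    print(step)
```

With these made-up distances, g1 and g2 merge first, then g3 and g4, and the two pairs join at the average of the four cross distances (13).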
Abstract:
This work is concerned with the development and application of novel unsupervised learning methods, with two target applications in mind: the analysis of forensic case data and the classification of remote sensing images. First, a method based on a symbolic optimization of the inter-sample distance measure is proposed to improve the flexibility of spectral clustering algorithms, and is applied to the problem of forensic case data. This distance is optimized using a loss function related to the preservation of neighborhood structure between the input space and the space of principal components, and solutions are found using genetic programming. Results are compared to a variety of state-of-the-art clustering algorithms. Subsequently, a new large-scale clustering method based on a joint optimization of feature extraction and classification is proposed and applied to various databases, including two hyperspectral remote sensing images. The algorithm makes use of a functional model (e.g., a neural network) for clustering, which is trained by stochastic gradient descent. Results indicate that such a technique can easily scale to huge databases, can avoid the so-called out-of-sample problem, and can compete with or even outperform existing clustering algorithms on both artificial data and real remote sensing images. This is verified on small databases as well as very large problems.
Abstract:
Although association mining has received considerable attention in recent years, the huge number of rules it generates hampers its use. To overcome this problem, many post-processing approaches have been suggested, such as clustering, which organizes the rules into groups that contain similar knowledge. Nevertheless, clustering can aid the user only if good descriptors are associated with each group. This is a relevant issue, since the labels give the user a view of the topics to be explored and help guide their search. This is useful, for example, when the user has no prior idea of where to start. Thus, the analysis of different labeling methods for association rule clustering is important. This paper therefore analyzes several labeling methods through two proposed measures. The first, Precision, measures how well a method finds labels that accurately represent the rules contained in its group; the second, Repetition Frequency, determines how the labels are distributed across the clusters. As a result, it was possible to identify the methods and the domain organizations with the best performance that can be applied to clusters of association rules.
Abstract:
Different types of water bodies, including lakes, streams, and coastal marine waters, are often susceptible to fecal contamination from a range of point and nonpoint sources, and have been evaluated using fecal indicator microorganisms. The most commonly used fecal indicator is Escherichia coli, but traditional cultivation methods do not allow discrimination of the source of pollution. The use of triplex PCR offers an approach that is fast and inexpensive, and here enabled the identification of phylogroups. The phylogenetic distribution of E. coli subgroups isolated from water samples revealed higher frequencies of subgroups A1 and B23 in rivers impacted by human pollution sources, while subgroups D1 and D2 were associated with pristine sites, and subgroup B1 with domesticated animal sources, suggesting their use as a first screening for pollution source identification. A simple classification is also proposed based on phylogenetic subgroup distribution using the w-clique metric, enabling differentiation of polluted and unpolluted sites.
Abstract:
The problem of designing spatially cohesive nature reserve systems that meet biodiversity objectives is formulated as a nonlinear integer programming problem. The multiobjective function minimises a combination of boundary length, area and failed representation of the biological attributes we are trying to conserve. The task is to reserve a subset of sites that best meet this objective. We use data on the distribution of habitats in the Northern Territory, Australia, to show how simulated annealing and a greedy heuristic algorithm can be used to generate good solutions to such large reserve design problems, and to compare the effectiveness of these methods.
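A toy version of such an objective and of a greedy heuristic can be sketched in Python as below. Everything specific here is an assumption made for illustration: sites are unit grid cells, boundary length is counted as exposed cell edges, the weights (`penalty`, `boundary_weight`) are arbitrary, and the 2×3 habitat map is invented. Simulated annealing is not shown.

```python
def objective(selected, habitat, targets, penalty=10.0, boundary_weight=1.0):
    """Area + weighted boundary length + penalty per unmet habitat target.

    selected: set of (row, col) unit cells chosen for the reserve.
    habitat: dict mapping each cell to a habitat name.
    targets: dict mapping a habitat name to the required number of cells.
    """
    area = len(selected)
    boundary = sum(
        1
        for (r, c) in selected
        for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1))
        if nb not in selected
    )
    shortfall = 0
    for name, need in targets.items():
        have = sum(1 for cell in selected if habitat[cell] == name)
        shortfall += max(0, need - have)
    return area + boundary_weight * boundary + penalty * shortfall


def greedy(habitat, targets):
    """Add the cell that lowers the objective most until nothing improves."""
    selected = set()
    best = objective(selected, habitat, targets)
    while True:
        candidates = [
            (objective(selected | {cell}, habitat, targets), cell)
            for cell in sorted(habitat)
            if cell not in selected
        ]
        if not candidates:
            break
        score, cell = min(candidates)
        if score >= best:
            break
        selected.add(cell)
        best = score
    return selected, best


# Hypothetical 2x3 map; one forest cell and one swamp cell must be reserved.
habitat = {
    (0, 0): "forest", (0, 1): "forest", (0, 2): "swamp",
    (1, 0): "grass", (1, 1): "swamp", (1, 2): "grass",
}
targets = {"forest": 1, "swamp": 1}
print(greedy(habitat, targets))                           # ({(0, 0), (0, 2)}, 10.0)
print(objective({(0, 1), (0, 2)}, habitat, targets))      # 8.0
```

On this toy map the greedy pass stops at a score of 10.0 with two disconnected cells, while the adjacent pair scores 8.0. That gap is the kind of local optimum a stochastic search such as simulated annealing is meant to escape.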
Abstract:
Motivation: This paper introduces the software EMMIX-GENE that has been developed for the specific purpose of a model-based approach to the clustering of microarray expression data, in particular, of tissue samples on a very large number of genes. The latter is a nonstandard problem in parametric cluster analysis because the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. A feasible approach is provided by first selecting a subset of the genes relevant for the clustering of the tissue samples by fitting mixtures of t distributions to rank the genes in order of increasing size of the likelihood ratio statistic for the test of one versus two components in the mixture model. The imposition of a threshold on the likelihood ratio statistic used in conjunction with a threshold on the size of a cluster allows the selection of a relevant set of genes. However, even this reduced set of genes will usually be too large for a normal mixture model to be fitted directly to the tissues, and so the use of mixtures of factor analyzers is exploited to reduce effectively the dimension of the feature space of genes. Results: The usefulness of the EMMIX-GENE approach for the clustering of tissue samples is demonstrated on two well-known data sets on colon and leukaemia tissues. For both data sets, relevant subsets of the genes are able to be selected that reveal interesting clusterings of the tissues that are either consistent with the external classification of the tissues or with background and biological knowledge of these sets.
Abstract:
In microarray studies, clustering techniques are often applied to derive meaningful insights into the data. In the past, hierarchical methods have been the primary clustering tool employed to perform this task. The hierarchical algorithms have been mainly applied heuristically to these cluster analysis problems. Further, a major limitation of these methods is their inability to determine the number of clusters. Thus there is a need for a model-based approach to these clustering problems. To this end, McLachlan et al. [7] developed a mixture model-based algorithm (EMMIX-GENE) for the clustering of tissue samples. To further investigate the EMMIX-GENE procedure as a model-based approach, we present a case study involving the application of EMMIX-GENE to the breast cancer data as studied recently in van 't Veer et al. [10]. Our analysis considers the problem of clustering the tissue samples on the basis of the genes, which is a non-standard problem because the number of genes greatly exceeds the number of tissue samples. We demonstrate how EMMIX-GENE can be useful in reducing the initial set of genes down to a more computationally manageable size. The results from this analysis also emphasise the difficulty associated with the task of separating two tissue groups on the basis of a particular subset of genes. These results also shed light on why supervised methods have such a high misallocation error rate for the breast cancer data.
Abstract:
OBJECTIVE: To estimate the incidence rate of type 1 diabetes in the urban area of Santiago, Chile, from March 21, 1997 to March 20, 1998, and to assess the spatio-temporal clustering of cases during that period. METHODS: All sixty-one incident cases were located temporally (day of diagnosis) and spatially (place of residence) in the area of study. Knox's method was used to assess spatio-temporal clustering of incident cases. RESULTS: The overall incidence rate of type 1 diabetes was 4.11 cases per 100,000 children aged less than 15 years per year (95% confidence interval: 3.06-5.14). The incidence rate seems to have increased since the last estimate, calculated for the years 1986-1992 in the metropolitan region of Santiago. Different combinations of space-time intervals were evaluated to assess spatio-temporal clustering. The smallest p-value was found for the combination of critical distances of 750 meters and 60 days (uncorrected p-value = 0.048). CONCLUSIONS: Although these are preliminary results regarding space-time clustering in Santiago, exploratory analysis of the data suggests a possible aggregation of incident cases in space-time coordinates.
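The core of Knox's method is a count of case pairs that are close in both space and time. The Python sketch below computes that count and, as an assumption, estimates a p-value by randomly permuting diagnosis dates across residences; the abstract does not say how its p-values were obtained, and the four cases used here are invented for illustration. Only the critical distances (750 meters, 60 days) come from the abstract.

```python
import math
import random


def knox_statistic(cases, max_dist, max_days):
    """Count case pairs that are close in both space and time.

    cases: list of (x, y, day) tuples, with x and y in meters.
    """
    count = 0
    for i in range(len(cases)):
        for j in range(i + 1, len(cases)):
            xi, yi, ti = cases[i]
            xj, yj, tj = cases[j]
            close_in_space = math.hypot(xi - xj, yi - yj) <= max_dist
            close_in_time = abs(ti - tj) <= max_days
            if close_in_space and close_in_time:
                count += 1
    return count


def knox_p_value(cases, max_dist, max_days, permutations=999, seed=0):
    """Monte Carlo p-value: shuffle dates across locations to break any link."""
    rng = random.Random(seed)
    observed = knox_statistic(cases, max_dist, max_days)
    days = [t for _, _, t in cases]
    hits = 0
    for _ in range(permutations):
        rng.shuffle(days)
        shuffled = [(x, y, t) for (x, y, _), t in zip(cases, days)]
        if knox_statistic(shuffled, max_dist, max_days) >= observed:
            hits += 1
    return (hits + 1) / (permutations + 1)


# Hypothetical cases: (x in meters, y in meters, day of diagnosis).
cases = [(0, 0, 1), (500, 0, 30), (5000, 5000, 35), (5200, 5000, 300)]
print(knox_statistic(cases, max_dist=750, max_days=60))   # 1
print(knox_p_value(cases, max_dist=750, max_days=60))
```

Because several distance and time thresholds are usually tried, the smallest p-value is optimistic unless corrected for multiple comparisons, which is why the abstract labels its value as uncorrected.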
Abstract:
TPM Vol. 21, No. 4, December 2014, 435-447 – Special Issue © 2014 Cises.