998 results for data partitions
Abstract:
There is a family of well-known external clustering validity indexes to measure the degree of compatibility or similarity between two hard partitions of a given data set, including partitions with different numbers of categories. A unified, fully equivalent set-theoretic formulation for an important class of such indexes was derived and extended to the fuzzy domain in a previous work by the author [Campello, R.J.G.B., 2007. A fuzzy extension of the Rand index and other related indexes for clustering and classification assessment. Pattern Recognition Lett., 28, 833-841]. However, the proposed fuzzy set-theoretic formulation is not valid as a general approach for comparing two fuzzy partitions of data. Instead, it is an approach for comparing a fuzzy partition against a hard referential partition of the data into mutually disjoint categories. In this paper, generalized external indexes for comparing two data partitions with overlapping categories are introduced. These indexes can be used as general measures for comparing two partitions of the same data set into overlapping categories. An important issue that is seldom touched on in the literature is also addressed in the paper, namely, how to compare two partitions of different subsamples of data. A number of pedagogical examples and three simulation experiments are presented and analyzed in detail. A review of recent related work compiled from the literature is also provided. (c) 2010 Elsevier B.V. All rights reserved.
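As a point of reference for the hard-partition case mentioned above, the classical Rand index can be sketched by counting object pairs on which two partitions agree. This is a minimal illustration only: the function name `rand_index` and the label-list representation are assumptions made here, and the sketch does not implement the fuzzy or overlapping-category generalizations the paper introduces.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of object pairs on which two hard partitions agree.

    labels_a[i] and labels_b[i] are the category labels of object i in
    each partition. The two partitions may use different numbers of
    categories. (Hypothetical helper, not taken from the paper.)
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same objects")
    if len(labels_a) < 2:
        raise ValueError("need at least two objects to form a pair")

    agreements = 0
    pairs = 0
    for i, j in combinations(range(len(labels_a)), 2):
        together_in_a = labels_a[i] == labels_a[j]
        together_in_b = labels_b[i] == labels_b[j]
        # A pair agrees when both partitions join it or both split it.
        if together_in_a == together_in_b:
            agreements += 1
        pairs += 1
    return agreements / pairs
```

For example, `rand_index([0, 0, 1, 1], [1, 1, 0, 0])` evaluates to 1.0, because the index depends only on which objects are grouped together, not on the label names.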
Abstract:
This paper characterizes medium voltage (MV) electric power consumers using a data clustering approach. The aim is to identify typical load profiles by selecting the best partition of a power consumption database from a pool of data partitions produced by several clustering algorithms. The best partition is selected using several cluster validity indices. These methods are intended to be used in a smart grid environment to extract useful knowledge about customers’ behavior. The data-mining-based methodology presented throughout the paper consists of several steps, namely data pre-processing, application of the clustering algorithms, and evaluation of the quality of the partitions. To validate our approach, a case study with a real database of 1,022 MV consumers was used.
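The selection step described above can be sketched as scoring every candidate partition with a validity index and keeping the best one. The abstract does not name the indices used, so the Davies-Bouldin index below is an assumption chosen for illustration, and the names `davies_bouldin` and `select_best_partition` are hypothetical. The sketch also assumes no two clusters share a centroid, since that would cause a division by zero.

```python
import math


def davies_bouldin(points, labels):
    """Davies-Bouldin index of one partition (lower is better)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")

    # Centroid and mean member-to-centroid distance for each cluster.
    centroids = {
        k: tuple(sum(col) / len(members) for col in zip(*members))
        for k, members in clusters.items()
    }
    scatter = {
        k: sum(math.dist(p, centroids[k]) for p in members) / len(members)
        for k, members in clusters.items()
    }

    # For each cluster, take its worst (largest) similarity to another one.
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
        for i in clusters
    ]
    return sum(worst) / len(worst)


def select_best_partition(points, candidate_partitions, index=davies_bouldin):
    """Return (position of the best partition, list of all scores)."""
    scores = [index(points, labels) for labels in candidate_partitions]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

In practice each point would be a load profile (a vector of consumption values) and each candidate would come from a different clustering algorithm. Using several indices, as the paper does, would mean repeating the selection with other `index` functions and reconciling the results.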
Abstract:
Phylogenetic analyses of chloroplast DNA sequences, morphology, and combined data have provided consistent support for many of the major branches within the angiosperm clade Dipsacales. Here we use sequences from three mitochondrial loci to test the existing broad-scale phylogeny and to attempt to resolve several relationships that have remained uncertain. Parsimony, maximum likelihood, and Bayesian analyses of a combined mitochondrial data set recover trees broadly consistent with previous studies, although resolution and support are lower than in the largest chloroplast analyses. Combining chloroplast and mitochondrial data results in a generally well-resolved and very strongly supported topology, but the previously recognized problem areas remain. To investigate why these relationships have been difficult to resolve, we conducted a series of experiments using different data partitions and heterogeneous substitution models. More complex modeling schemes are usually favored regardless of the partitions recognized, but model choice had little effect on topology or support values. In contrast, there are consistent but weakly supported differences in the topologies recovered from coding and non-coding matrices. These conflicts directly correspond to relationships that were poorly resolved in analyses of the full combined chloroplast-mitochondrial data set. We suggest incongruent signal has contributed to our inability to confidently resolve these problem areas. (c) 2007 Elsevier Inc. All rights reserved.
Abstract:
In simultaneous analyses of multiple data partitions, the trees relevant when measuring support for a clade are the optimal tree and the best tree lacking the clade (i.e., the most reasonable alternative). The parsimony-based method of partitioned branch support (PBS) forces each data set to arbitrate between the two relevant trees. This value is the amount each data set contributes to clade support in the combined analysis, and can differ greatly from the support apparent in separate analyses. The approach used in PBS can also be employed in likelihood: a simultaneous analysis of all data retrieves the maximum likelihood tree, and the best tree without the clade of interest is also found. Each data set is fitted to the two trees and the log-likelihood difference calculated, giving partitioned likelihood support (PLS) for each data set. These calculations can be performed regardless of the complexity of the ML model adopted. The significance of PLS can be evaluated using a variety of resampling methods, such as the Kishino-Hasegawa test, the Shimodaira-Hasegawa test, or likelihood weights, although the appropriateness and assumptions of these tests remain debated.
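The PLS calculation described above reduces to a per-partition difference of log-likelihoods on the two trees. The sketch below assumes those log-likelihoods have already been obtained from some phylogenetics program. The function name, the dictionary layout, and the numbers in the usage example are all hypothetical and serve only to illustrate the arithmetic.

```python
def partitioned_likelihood_support(lnl_on_ml_tree, lnl_on_alt_tree):
    """Per-partition log-likelihood difference between two trees.

    lnl_on_ml_tree:  partition name -> log-likelihood of that partition
                     fitted to the combined-data maximum likelihood tree.
    lnl_on_alt_tree: partition name -> log-likelihood of that partition
                     fitted to the best tree lacking the clade of interest.
    A positive value means the partition favours the clade; a negative
    value means it favours the alternative tree.
    """
    if lnl_on_ml_tree.keys() != lnl_on_alt_tree.keys():
        raise ValueError("both trees must be scored on the same partitions")
    return {
        name: lnl_on_ml_tree[name] - lnl_on_alt_tree[name]
        for name in lnl_on_ml_tree
    }


# Made-up log-likelihoods for two hypothetical partitions.
pls = partitioned_likelihood_support(
    {"gene_A": -1200.5, "gene_B": -800.25},
    {"gene_A": -1204.0, "gene_B": -799.0},
)
total_support = sum(pls.values())
```

Here `gene_A` supports the clade (3.5 log-likelihood units) while `gene_B` conflicts with it (-1.25), and the values sum to the support for the clade in the combined analysis (2.25). Assessing significance would still require one of the resampling tests named in the abstract.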
Abstract:
Agapophytinae subf.n. is a highly diverse lineage of Australasian Therevidae, comprising eight described and two new genera: Agapophytus Guerin-Meneville, Acupalpa Krober, Acraspisa Krober, Belonalys Krober, Bonjeania Irwin & Lyneborg, Parapsilocephala Krober, Acatopygia Krober, Laxotela Winterton & Irwin, Pipinnipons gen.n. and Patanothrix gen.n. A genus-level cladistic analysis of the subfamily was undertaken using sixty-eight adult morphological characters and c. 1000 base pairs of the elongation factor-1 alpha (EF-1 alpha) protein coding gene. The morphological data partition produced three most parsimonious cladograms, whereas the molecular data partition gave a single most parsimonious cladogram, which did not match any of the cladograms found in the morphological analysis. The level of congruence between the data partitions was determined using the partition homogeneity test (HTF) and the Wilcoxon signed ranks test. Despite being significantly incongruent in at least one of the incongruence tests, the partitions were combined in a simultaneous analysis. The combined data yielded a single cladogram that was better supported than that of the individual partitions analysed separately. The relative contributions of the data partitions to support for individual nodes on the combined cladogram were investigated using Partitioned Bremer Support. The level of support for many nodes on the combined cladogram was non-additive and often greater than the sum of support for the respective nodes on individual partitions. This synergistic interaction between incongruent data partitions indicates a common phylogenetic signal in both partitions. It also suggests that criteria for partition combination based solely on incongruence may be misleading. The phylogenetic relationships of the genera are discussed using the combined data. A key to genera of Agapophytinae is presented, with genera diagnosed and figured. Two new genera are described: Patanothrix with a new species (Pat. skevingtoni) and Pat. wilsoni (Mann) transferred from Parapsilocephala, and Pipinnipons with a new species (Pip. kroeberi). Pipinnipons fascipennis (Krober) is transferred from Squamopygin Krober and Pip. imitans (Mann) is transferred from Agapophytus. Agapophytus bicolor (Krober) is transferred from Parapsilocephala. Agapophytus varipennis Mann is synonymised with Aga. queenslandi Krober and Aga. flavicornis Mann is synonymised with Aga. pallidicornis (Krober).
Abstract:
Medium voltage (MV) load diagrams were defined using the knowledge discovery in databases process. Clustering techniques were used to help agents in electric power retail markets obtain specific knowledge of their customers’ consumption habits. Each customer class resulting from the clustering operation is represented by its load diagram. The Two-step clustering algorithm and the WEACS approach, based on evidence accumulation (EAC), were applied to electricity consumption data from a utility client’s database in order to form the customer classes and to find a set of representative consumption patterns. The WEACS approach is a clustering ensemble combination approach that uses subsampling and weights the partitions differently in the co-association matrix. As a complementary step to the WEACS approach, all the final data partitions produced by the different variations of the method are combined and the Ward Link algorithm is used to obtain the final data partition. Experimental results showed that the WEACS approach led to better accuracy than many other clustering approaches. In this paper, the WEACS approach separates the customer population better than the Two-step clustering algorithm.
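The weighted co-association matrix at the heart of an evidence accumulation ensemble can be sketched as follows. This is a simplified illustration, not the WEACS method itself: it assumes every partition labels every object (no subsampling), and the per-partition weights are supplied by the caller rather than derived as in the paper. The function name `co_association` is hypothetical.

```python
def co_association(partitions, weights=None):
    """Weighted fraction of partitions that place each pair together.

    partitions: list of label lists, one per clustering result, all
                labelling the same n objects in the same order.
    weights:    optional non-negative weight per partition
                (defaults to equal weights).
    Returns an n x n matrix as nested lists with entries in [0, 1].
    """
    n = len(partitions[0])
    if weights is None:
        weights = [1.0] * len(partitions)
    if len(weights) != len(partitions):
        raise ValueError("need exactly one weight per partition")
    total = sum(weights)

    matrix = [[0.0] * n for _ in range(n)]
    for labels, weight in zip(partitions, weights):
        if len(labels) != n:
            raise ValueError("every partition must label all objects")
        share = weight / total
        for i in range(n):
            for j in range(n):
                # Each partition votes for the pairs it clusters together.
                if labels[i] == labels[j]:
                    matrix[i][j] += share
    return matrix
```

The resulting matrix acts as a similarity measure between objects. A hierarchical algorithm such as the Ward linkage mentioned in the abstract could then be run on it (for instance on `1 - matrix[i][j]` as a dissimilarity) to extract the final partition.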
Abstract:
This paper presents the characterization of high voltage (HV) electric power consumers based on a data clustering approach. The typical load profiles (TLP) are obtained by selecting the best partition of a power consumption database from a pool of data partitions produced by several clustering algorithms. The choice of the best partition is supported by several cluster validity indices. The proposed data-mining (DM) based methodology, which includes all the steps of the knowledge discovery in databases (KDD) process, includes an automatic data treatment application that preprocesses the initial database, saving time and improving accuracy during this phase. These methods are intended to be used in a smart grid environment to extract useful knowledge about customers’ consumption behavior. To validate our approach, a case study with a real database of 185 HV consumers was used.
Abstract:
K-Means is a popular clustering algorithm which adopts an iterative refinement procedure to determine data partitions and to compute their associated centres of mass, called centroids. The straightforward implementation of the algorithm is often referred to as 'brute force' since it computes a proximity measure from each data point to each centroid at every iteration of the K-Means process. Efficient implementations of the K-Means algorithm have been predominantly based on multi-dimensional binary search trees (KD-Trees). A combination of an efficient data structure and geometrical constraints reduces the number of distance computations required at each iteration. In this work we present a general space partitioning approach for improving the efficiency and the scalability of the K-Means algorithm. We propose to adopt approximate hierarchical clustering methods to generate binary space partitioning trees, in contrast to KD-Trees. In the experimental analysis, we have tested the performance of the proposed Binary Space Partitioning K-Means (BSP-KM) when a divisive clustering algorithm is used. We have carried out extensive experimental tests to compare the proposed approach to the one based on KD-Trees (KD-KM) over a wide range of the parameter space. BSP-KM is more scalable than KD-KM, while keeping the deterministic nature of the 'brute force' algorithm. In particular, the proposed space partitioning approach has been shown to overcome the well-known limitation of KD-Trees in high-dimensional spaces and can also be adopted to improve the efficiency of other algorithms in which KD-Trees have been used.
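The 'brute force' baseline described above can be sketched in a few lines, which makes the cost that tree-based variants try to avoid visible: every iteration computes a distance from each point to each centroid. This is a minimal illustration of the standard iterative refinement with caller-supplied initial centroids. It is not the BSP-KM or KD-KM method, and the function name is hypothetical.

```python
def kmeans_brute_force(points, initial_centroids, max_iter=100):
    """Plain K-Means: returns (labels, centroids).

    points and initial_centroids are sequences of equal-length numeric
    tuples. Given the same initial centroids the result is deterministic.
    """
    if max_iter < 1:
        raise ValueError("max_iter must be at least 1")
    centroids = [tuple(c) for c in initial_centroids]

    for _ in range(max_iter):
        # Assignment step: n * k squared-distance computations.
        labels = [
            min(
                range(len(centroids)),
                key=lambda k: sum(
                    (x - c) ** 2 for x, c in zip(point, centroids[k])
                ),
            )
            for point in points
        ]

        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[k])

        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids

    return labels, centroids
```

A space-partitioning variant keeps the same two steps but uses a tree over the data to rule out distant centroids for whole groups of points at once, instead of testing every point against every centroid.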
Abstract:
Broad-scale phylogenetic analyses of the angiosperms and of the Asteridae have failed to confidently resolve relationships among the major lineages of the campanulid Asteridae (i.e., the euasterid II of APG II, 2003). To address this problem we assembled presently available sequences for a core set of 50 taxa, representing the diversity of the four largest lineages (Apiales, Aquifoliales, Asterales, Dipsacales) as well as the smaller "unplaced" groups (e.g., Bruniaceae, Paracryphiaceae, Columelliaceae). We constructed four data matrices for phylogenetic analysis: a chloroplast coding matrix (atpB, matK, ndhF, rbcL), a chloroplast non-coding matrix (rps16 intron, trnT-F region, trnV-atpE IGS), a combined chloroplast dataset (all seven chloroplast regions), and a combined genome matrix (seven chloroplast regions plus 18S and 26S rDNA). Bayesian analyses of these datasets using mixed substitution models often produced well-resolved and well-supported trees. Consistent with more weakly supported results from previous studies, our analyses support the monophyly of the four major clades and the relationships among them. Most importantly, Asterales are inferred to be sister to a clade containing Apiales and Dipsacales. Paracryphiaceae is consistently placed sister to the Dipsacales. However, the exact relationships of Bruniaceae, Columelliaceae, and an Escallonia clade depended upon the dataset. Areas of poor resolution in combined analyses may be partly explained by conflict between the coding and non-coding data partitions. We discuss the implications of these results for our understanding of campanulid phylogeny and evolution, paying special attention to how our findings bear on character evolution and biogeography in Dipsacales.
Abstract:
Bayesian networks are a widely used model for the representation of conditional dependence relationships among variables in multivariate data. The task of learning them from a data set or from experts has been studied in depth since their conception. However, situations arise in which a consensus model must be obtained from several data partitions or from a set of experts. This situation is referred to as model fusion or aggregation. Results on Bayesian network aggregation, although rich in variety, are scarce compared with those on the learning task. In this context, two methods are proposed for the aggregation of Gaussian Bayesian networks, that is, Bayesian networks whose underlying modelled distribution is a multivariate Gaussian. Both methods are effective and precise, and produce networks with fewer parameters than the models obtained by individual learning. They constitute a novel approach in that they incorporate notions traditionally explored separately in the state of the art. Future applications in scalable computing environments make these methods especially attractive, given their simplicity and the gain in sparsity of the resulting model.
Abstract:
Almost half of the 4547 described bee flies (Bombyliidae: Diptera) in the world belong to the subfamily Anthracinae, with most of the world's diversity in three cosmopolitan tribes: Villini, Anthracini and Exoprosopini. Molecular data from 815 base pairs of 16S mitochondrial DNA and morphological characters from species-groups of these tribes in Australia were analysed cladistically. The results show that the relationships between the anthracine tribes reflect those found in a previous morphological analysis. The genera of the Anthracinae in Australia are monophyletic, except for Ligyra Newman, and are assigned to tribes. Although simultaneous analysis of the combined molecular and morphological data produced clades found in both separate analyses, the different data sources are significantly incongruent. We use phylogenetic measures to examine support for the relationships among the Australian Anthracinae inferred by the molecular and morphological data.
Abstract:
Many seemingly disparate approaches for marginal modeling have been developed in recent years. We demonstrate that many current approaches for marginal modeling of correlated binary outcomes produce likelihoods that are equivalent to the copula-based models proposed herein. These general copula models of underlying latent threshold random variables yield likelihood-based models for marginal fixed effects estimation and interpretation in the analysis of correlated binary data. Moreover, we propose a nomenclature and set of model relationships that substantially elucidate the complex area of marginalized models for binary data. A diverse collection of didactic mathematical and numerical examples is given to illustrate concepts.