21 resultados para Data clustering. Fuzzy C-Means. Cluster centers initialization. Validation indices

em Repositório Científico do Instituto Politécnico de Lisboa - Portugal


Relevância:

100.00% 100.00%

Publicador:

Resumo:

Research on cluster analysis for categorical data continues to develop, new clustering algorithms being proposed. However, in this context, the determination of the number of clusters is rarely addressed. We propose a new approach in which clustering and the estimation of the number of clusters is done simultaneously for categorical data. We assume that the data originate from a finite mixture of multinomial distributions and use a minimum message length criterion (MML) to select the number of clusters (Wallace and Bolton, 1986). For this purpose, we implement an EM-type algorithm (Silvestre et al., 2008) based on the (Figueiredo and Jain, 2002) approach. The novelty of the approach rests on the integration of the model estimation and selection of the number of clusters in a single algorithm, rather than selecting this number based on a set of pre-estimated candidate models. The performance of our approach is compared with the use of Bayesian Information Criterion (BIC) (Schwarz, 1978) and Integrated Completed Likelihood (ICL) (Biernacki et al., 2000) using synthetic data. The obtained results illustrate the capacity of the proposed algorithm to attain the true number of cluster while outperforming BIC and ICL since it is faster, which is especially relevant when dealing with large data sets.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

In data clustering, the problem of selecting the subset of most relevant features from the data has been an active research topic. Feature selection for clustering is a challenging task due to the absence of class labels for guiding the search for relevant features. Most methods proposed for this goal are focused on numerical data. In this work, we propose an approach for clustering and selecting categorical features simultaneously. We assume that the data originate from a finite mixture of multinomial distributions and implement an integrated expectation-maximization (EM) algorithm that estimates all the parameters of the model and selects the subset of relevant features simultaneously. The results obtained on synthetic data illustrate the performance of the proposed approach. An application to real data, referred to official statistics, shows its usefulness.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Audiometer systems provide enormous amounts of detailed TV watching data. Several relevant and interdependent factors may influence TV viewers' behavior. In this work we focus on the time factor and derive Temporal Patterns of TV watching, based on panel data. Clustering base attributes are originated from 1440 binary minute-related attributes, capturing the TV watching status (watch/not watch). Since there are around 2500 panel viewers a data reduction procedure is first performed. K-Means algorithm is used to obtain daily clusters of viewers. Weekly patterns are then derived which rely on daily patterns. The obtained solutions are tested for consistency and stability. Temporal TV watching patterns provide new insights concerning Portuguese TV viewers' behavior.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Dissertação apresentada à Escola Superior de Educação de Lisboa para obtenção de grau de mestre em Ciências da Educação - Especialidade Educação Especial

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Dissertação apresentada à escola Superior de Educação de Lisboa para obtenção de grau de mestre em Educação Matemática na Educação Pré-Escolar e nos 1º e 2º Ciclos do Ensino Básico

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Research on the problem of feature selection for clustering continues to develop. This is a challenging task, mainly due to the absence of class labels to guide the search for relevant features. Categorical feature selection for clustering has rarely been addressed in the literature, with most of the proposed approaches having focused on numerical data. In this work, we propose an approach to simultaneously cluster categorical data and select a subset of relevant features. Our approach is based on a modification of a finite mixture model (of multinomial distributions), where a set of latent variables indicate the relevance of each feature. To estimate the model parameters, we implement a variant of the expectation-maximization algorithm that simultaneously selects the subset of relevant features, using a minimum message length criterion. The proposed approach compares favourably with two baseline methods: a filter based on an entropy measure and a wrapper based on mutual information. The results obtained on synthetic data illustrate the ability of the proposed expectation-maximization method to recover ground truth. An application to real data, referred to official statistics, shows its usefulness.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

This paper proposes a novel demand response model using a fuzzy subtractive cluster approach. The model development provides support to domestic consumer decisions on controllable loads management, considering consumers' consumption needs and the appropriate load shape or rescheduling in order to achieve possible economic benefits. The model based on fuzzy subtractive clustering method considers clusters of domestic consumption covering an adequate consumption range. Analysis of different scenarios is presented considering available electric power and electric energy prices. Simulation results are presented and conclusions of the proposed demand response model are discussed. (C) 2016 Elsevier Ltd. All rights reserved.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

We calculate the equilibrium thermodynamic properties, percolation threshold, and cluster distribution functions for a model of associating colloids, which consists of hard spherical particles having on their surfaces three short-ranged attractive sites (sticky spots) of two different types, A and B. The thermodynamic properties are calculated using Wertheim's perturbation theory of associating fluids. This also allows us to find the onset of self-assembly, which can be quantified by the maxima of the specific heat at constant volume. The percolation threshold is derived, under the no-loop assumption, for the correlated bond model: In all cases it is two percolated phases that become identical at a critical point, when one exists. Finally, the cluster size distributions are calculated by mapping the model onto an effective model, characterized by a-state-dependent-functionality (f) over bar and unique bonding probability (p) over bar. The mapping is based on the asymptotic limit of the cluster distributions functions of the generic model and the effective parameters are defined through the requirement that the equilibrium cluster distributions of the true and effective models have the same number-averaged and weight-averaged sizes at all densities and temperatures. We also study the model numerically in the case where BB interactions are missing. In this limit, AB bonds either provide branching between A-chains (Y-junctions) if epsilon(AB)/epsilon(AA) is small, or drive the formation of a hyperbranched polymer if epsilon(AB)/epsilon(AA) is large. We find that the theoretical predictions describe quite accurately the numerical data, especially in the region where Y-junctions are present. There is fairly good agreement between theoretical and numerical results both for the thermodynamic (number of bonds and phase coexistence) and the connectivity properties of the model (cluster size distributions and percolation locus).

Relevância:

40.00% 40.00%

Publicador:

Resumo:

The aim of this paper is to develop models for experimental open-channel water delivery systems and assess the use of three data-driven modeling tools toward that end. Water delivery canals are nonlinear dynamical systems and thus should be modeled to meet given operational requirements while capturing all relevant dynamics, including transport delays. Typically, the derivation of first principle models for open-channel systems is based on the use of Saint-Venant equations for shallow water, which is a time-consuming task and demands for specific expertise. The present paper proposes and assesses the use of three data-driven modeling tools: artificial neural networks, composite local linear models and fuzzy systems. The canal from Hydraulics and Canal Control Nucleus (A parts per thousand vora University, Portugal) will be used as a benchmark: The models are identified using data collected from the experimental facility, and then their performances are assessed based on suitable validation criterion. The performance of all models is compared among each other and against the experimental data to show the effectiveness of such tools to capture all significant dynamics within the canal system and, therefore, provide accurate nonlinear models that can be used for simulation or control. The models are available upon request to the authors.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Cluster analysis for categorical data has been an active area of research. A well-known problem in this area is the determination of the number of clusters, which is unknown and must be inferred from the data. In order to estimate the number of clusters, one often resorts to information criteria, such as BIC (Bayesian information criterion), MML (minimum message length, proposed by Wallace and Boulton, 1968), and ICL (integrated classification likelihood). In this work, we adopt the approach developed by Figueiredo and Jain (2002) for clustering continuous data. They use an MML criterion to select the number of clusters and a variant of the EM algorithm to estimate the model parameters. This EM variant seamlessly integrates model estimation and selection in a single algorithm. For clustering categorical data, we assume a finite mixture of multinomial distributions and implement a new EM algorithm, following a previous version (Silvestre et al., 2008). Results obtained with synthetic datasets are encouraging. The main advantage of the proposed approach, when compared to the above referred criteria, is the speed of execution, which is especially relevant when dealing with large data sets.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

This paper focus on a demand response model analysis in a smart grid context considering a contingency scenario. A fuzzy clustering technique is applied on the developed demand response model and an analysis is performed for the contingency scenario. Model considerations and architecture are described. The demand response developed model aims to support consumers decisions regarding their consumption needs and possible economic benefits.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Mestrado em Controlo de Gestão e dos Negócios

Relevância:

40.00% 40.00%

Publicador:

Resumo:

This paper focus on a demand response model analysis in a smart grid context considering a contingency scenario. A fuzzy clustering technique is applied on the developed demand response model and an analysis is performed for the contingency scenario. Model considerations and architecture are described. The demand response developed model aims to support consumers decisions regarding their consumption needs and possible economic benefits.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

In cluster analysis, it can be useful to interpret the partition built from the data in the light of external categorical variables which are not directly involved to cluster the data. An approach is proposed in the model-based clustering context to select a number of clusters which both fits the data well and takes advantage of the potential illustrative ability of the external variables. This approach makes use of the integrated joint likelihood of the data and the partitions at hand, namely the model-based partition and the partitions associated to the external variables. It is noteworthy that each mixture model is fitted by the maximum likelihood methodology to the data, excluding the external variables which are used to select a relevant mixture model only. Numerical experiments illustrate the promising behaviour of the derived criterion. © 2014 Springer-Verlag Berlin Heidelberg.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Conferência: CONTROLO’2012 - 16-18 July 2012 - Funchal