991 results for CATEGORICAL-DATA
Abstract:
Research on feature selection for clustering continues to develop. The task is challenging, mainly because there are no class labels to guide the search for relevant features. Categorical feature selection for clustering has rarely been addressed in the literature; most proposed approaches focus on numerical data. In this work, we propose an approach that simultaneously clusters categorical data and selects a subset of relevant features. Our approach is based on a modification of a finite mixture model (of multinomial distributions), in which a set of latent variables indicates the relevance of each feature. To estimate the model parameters, we implement a variant of the expectation-maximization algorithm that simultaneously selects the subset of relevant features, using a minimum message length criterion. The proposed approach compares favourably with two baseline methods: a filter based on an entropy measure and a wrapper based on mutual information. The results obtained on synthetic data illustrate the ability of the proposed expectation-maximization method to recover the ground truth. An application to real data from official statistics shows its usefulness.
Abstract:
Research on cluster analysis for categorical data continues to develop, with new clustering algorithms being proposed. In this context, however, the determination of the number of clusters is rarely addressed. We propose a new approach in which clustering and the estimation of the number of clusters are done simultaneously for categorical data. We assume that the data originate from a finite mixture of multinomial distributions and use a minimum message length (MML) criterion to select the number of clusters (Wallace and Boulton, 1968). For this purpose, we implement an EM-type algorithm (Silvestre et al., 2008) based on the approach of Figueiredo and Jain (2002). The novelty of the approach rests on integrating model estimation and selection of the number of clusters in a single algorithm, rather than selecting this number from a set of pre-estimated candidate models. The performance of our approach is compared with that of the Bayesian Information Criterion (BIC) (Schwarz, 1978) and the Integrated Completed Likelihood (ICL) (Biernacki et al., 2000) using synthetic data. The results illustrate the capacity of the proposed algorithm to attain the true number of clusters while outperforming BIC and ICL in speed, which is especially relevant when dealing with large data sets.
Abstract:
Cluster analysis for categorical data has been an active area of research. A well-known problem in this area is the determination of the number of clusters, which is unknown and must be inferred from the data. To estimate the number of clusters, one often resorts to information criteria, such as BIC (Bayesian information criterion), MML (minimum message length, proposed by Wallace and Boulton, 1968), and ICL (integrated classification likelihood). In this work, we adopt the approach developed by Figueiredo and Jain (2002) for clustering continuous data. They use an MML criterion to select the number of clusters and a variant of the EM algorithm to estimate the model parameters. This EM variant seamlessly integrates model estimation and selection in a single algorithm. For clustering categorical data, we assume a finite mixture of multinomial distributions and implement a new EM algorithm, following a previous version (Silvestre et al., 2008). Results obtained with synthetic datasets are encouraging. The main advantage of the proposed approach over the criteria mentioned above is its speed of execution, which is especially relevant when dealing with large data sets.
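The candidate-model strategy that this abstract contrasts with its integrated algorithm can be sketched as follows. This is a minimal illustration, not the authors' MML-based EM variant: it runs a plain EM for a mixture of independent categorical features and scores each pre-estimated candidate with BIC. The function names, the random initialisation, the fixed iteration count and the small smoothing constant are all assumptions made for the example.

```python
import math
import random

def em_categorical_mixture(data, n_levels, k, n_iter=100, seed=0):
    """Plain EM for a k-component mixture of independent categorical features.

    data     : list of rows, each a list of integer codes 0 .. n_levels[j]-1
    n_levels : number of categories of each feature
    Returns (weights, theta, loglik); theta[c][j][v] is the probability of
    level v of feature j in component c, and loglik is the log-likelihood
    evaluated in the last E-step.
    """
    rng = random.Random(seed)
    n, d = len(data), len(n_levels)
    weights = [1.0 / k] * k
    theta = []
    for _ in range(k):  # random start, normalised per component and feature
        comp = []
        for j in range(d):
            raw = [rng.random() + 0.5 for _ in range(n_levels[j])]
            total = sum(raw)
            comp.append([r / total for r in raw])
        theta.append(comp)

    loglik = -math.inf
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each row.
        resp, loglik = [], 0.0
        for row in data:
            joint = [weights[c] * math.prod(theta[c][j][row[j]] for j in range(d))
                     for c in range(k)]
            total = sum(joint)
            loglik += math.log(total)
            resp.append([p / total for p in joint])
        # M-step: update mixing weights and level probabilities.
        for c in range(k):
            weights[c] = sum(r[c] for r in resp) / n
            for j in range(d):
                counts = [1e-6] * n_levels[j]  # assumed smoothing, avoids zeros
                for row, r in zip(data, resp):
                    counts[row[j]] += r[c]
                total = sum(counts)
                theta[c][j] = [x / total for x in counts]
    return weights, theta, loglik

def bic(loglik, k, n_levels, n):
    """BIC = -2 log L + (number of free parameters) * log n."""
    n_params = (k - 1) + k * sum(m - 1 for m in n_levels)
    return -2.0 * loglik + n_params * math.log(n)

# Toy data: two perfectly separated groups on three binary features.
data = [[0, 0, 0]] * 30 + [[1, 1, 1]] * 30
levels = [2, 2, 2]
for k in (1, 2, 3):
    _, _, ll = em_categorical_mixture(data, levels, k)
    print(k, round(bic(ll, k, levels, len(data)), 1))
```

Each candidate k requires a full EM run before the criterion can be compared, which is the cost the integrated approach described in the abstract is designed to avoid.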
Abstract:
We compare correspondence analysis to the logratio approach based on compositional data. We also compare correspondence analysis and an alternative approach using Hellinger distance, for representing categorical data in a contingency table. We propose a coefficient which globally measures the similarity between these approaches. This coefficient can be decomposed into several components, one component for each principal dimension, indicating the contribution of the dimensions to the difference between the two representations. These three methods of representation can produce quite similar results. One illustrative example is given.
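To make the comparison concrete, the sketch below computes the two inter-row distances that underlie two of these representations: the chi-square distance between row profiles, which correspondence analysis displays, and the Hellinger distance between the same profiles. It is only an illustration under assumptions: the helper names and the small table are invented, the Hellinger distance is taken without the 1/sqrt(2) factor that some definitions include, and the similarity coefficient proposed in the abstract is not reproduced.

```python
import math

def row_profiles(table):
    """Divide each row of a contingency table by its row total."""
    return [[x / sum(row) for x in row] for row in table]

def column_masses(table):
    """Column totals divided by the grand total."""
    grand = sum(sum(row) for row in table)
    return [sum(row[j] for row in table) / grand for j in range(len(table[0]))]

def chi_square_distance(p, q, masses):
    """Chi-square distance between two row profiles (the metric used in CA)."""
    return math.sqrt(sum((a - b) ** 2 / m for a, b, m in zip(p, q, masses)))

def hellinger_distance(p, q):
    """Hellinger distance between two profiles, without the 1/sqrt(2) factor."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

# Hypothetical 3 x 3 contingency table.
table = [[20, 30, 50],
         [25, 25, 50],
         [60, 30, 10]]
profiles = row_profiles(table)
masses = column_masses(table)
for i in range(len(table)):
    for k in range(i + 1, len(table)):
        print(i, k,
              round(chi_square_distance(profiles[i], profiles[k], masses), 3),
              round(hellinger_distance(profiles[i], profiles[k]), 3))
```

Both distances rank the third row as far from the first two in this toy table, which is the kind of agreement between representations that the abstract reports.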
Abstract:
Compositional random vectors are fundamental tools in the Bayesian analysis of categorical data. Many of the issues that are discussed with reference to the statistical analysis of compositional data have a natural counterpart in the construction of a Bayesian statistical model for categorical data. This note builds on the idea of cross-fertilization of the two areas recommended by Aitchison (1986) in his seminal book on compositional data. Particular emphasis is put on the problem of what parameterization to use.
Abstract:
The proportional odds model provides a powerful tool for analysing ordered categorical data and setting sample size, although for many clinical trials its validity is questionable. The purpose of this paper is to present a new class of constrained odds models which includes the proportional odds model. The efficient score and Fisher's information are derived from the profile likelihood for the constrained odds model. These results are new even for the special case of proportional odds where the resulting statistics define the Mann-Whitney test. A strategy is described involving selecting one of these models in advance, requiring assumptions as strong as those underlying proportional odds, but allowing a choice of such models. The accuracy of the new procedure and its power are evaluated.
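The defining property of the proportional odds special case can be checked numerically. The sketch below assumes the common parameterisation logit P(Y <= j | x) = alpha_j - beta * x; the sign convention, the function names and the numeric values are assumptions for illustration. It does not implement the constrained odds models or the score statistics derived in the paper.

```python
import math

def cumulative_probabilities(cutpoints, beta, x):
    """P(Y <= j | x) for every cutpoint, with logit P(Y <= j | x) = alpha_j - beta * x."""
    return [1.0 / (1.0 + math.exp(-(a - beta * x))) for a in cutpoints]

def category_probabilities(cutpoints, beta, x):
    """P(Y = j | x) for all categories, as differences of cumulative probabilities."""
    cum = cumulative_probabilities(cutpoints, beta, x) + [1.0]
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]

def cumulative_odds_ratios(cutpoints, beta, x0, x1):
    """Odds of Y <= j at x0 divided by the same odds at x1, for every cutpoint."""
    ratios = []
    for p0, p1 in zip(cumulative_probabilities(cutpoints, beta, x0),
                      cumulative_probabilities(cutpoints, beta, x1)):
        ratios.append((p0 / (1.0 - p0)) / (p1 / (1.0 - p1)))
    return ratios

# Hypothetical four-category outcome, control (x = 0) versus treatment (x = 1).
cutpoints = [-1.0, 0.0, 1.5]   # increasing alpha_j
beta = 0.7
print(category_probabilities(cutpoints, beta, 0))
print(category_probabilities(cutpoints, beta, 1))
print(cumulative_odds_ratios(cutpoints, beta, 0, 1))
```

Under this parameterisation every cumulative odds ratio equals exp(beta), whichever cutpoint is used. That single-parameter structure is the assumption whose validity the abstract questions for many clinical trials.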
Abstract:
In the context of either Bayesian or classical sensitivity analyses of over-parametrized models for incomplete categorical data, it is well known that the dependence of posterior inferences about nonidentifiable parameters on the prior, or the use of overly parsimonious over-parametrized models, may lead to erroneous conclusions. Nevertheless, some authors either pay no attention to which parameters are nonidentifiable or do not appropriately account for possible prior dependence. We review the literature on this topic and consider simple examples to emphasize that, in both inferential frameworks, the subjective components can influence results in nontrivial ways, irrespective of the sample size. Specifically, we show that prior distributions commonly regarded as slightly informative or noninformative may actually be too informative for nonidentifiable parameters, and that the choice of over-parametrized models may drastically impact the results, suggesting that their effects should be examined carefully before drawing conclusions.
Summary: Whether in a Bayesian or a classical framework, it is well known that over-parametrization in models for incomplete categorical data can lead to erroneous conclusions. Nevertheless, some authors continue to neglect the problems caused by nonidentified parameters. We review the literature in this area and consider some simple over-parametrized examples in which subjective elements influence the results in a non-negligible way, regardless of sample size. More precisely, we show how priors regarded as weakly informative or noninformative can prove extremely informative about nonidentified parameters, and that the use of over-parametrized models can have a considerable impact on the final conclusions. This suggests a very careful examination of the potential impact of the priors.
Abstract:
We review some issues related to the implications of different missing data mechanisms on statistical inference for contingency tables and consider simulation studies to compare the results obtained under such models to those where the units with missing data are disregarded. We confirm that although, in general, analyses under the correct missing at random and missing completely at random models are more efficient even for small sample sizes, there are exceptions where they may not improve the results obtained by ignoring the partially classified data. We show that under the missing not at random (MNAR) model, estimates on the boundary of the parameter space as well as lack of identifiability of the parameters of saturated models may be associated with undesirable asymptotic properties of maximum likelihood estimators and likelihood ratio tests; even in standard cases the bias of the estimators may be low only for very large samples. We also show that the probability of a boundary solution obtained under the correct MNAR model may be large even for large samples and that, consequently, we may not always conclude that an MNAR model is misspecified because the estimate is on the boundary of the parameter space.
Abstract:
The literature offers different ways to cluster categorical data, and the choice among them is strongly related to the aim of the researcher, leaving aside time and economic constraints. The main approaches to clustering are usually divided into model-based and distance-based methods: the former assume that objects belonging to the same class are similar in the sense that their observed values come from the same probability distribution, whose parameters are unknown and need to be estimated; the latter evaluate distances among objects by a defined dissimilarity measure and, based on it, allocate units to the closest group. In clustering, one may be interested in classifying similar objects into groups, or in finding observations that come from the same true homogeneous distribution. But do both of these aims lead to the same clustering? And how good are clustering methods designed to fulfil one of these aims in terms of the other? To answer these questions, two approaches, namely a latent class model (mixture of multinomial distributions) and partitioning around medoids, are evaluated and compared by the Adjusted Rand Index, Average Silhouette Width and Pearson-Gamma index in a fairly wide simulation study. Simulation outcomes are plotted in two-dimensional graphs via Multidimensional Scaling; the size of each point is proportional to the number of points that overlap, and different colours are used according to cluster membership.
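Of the three indexes named above, the Adjusted Rand Index is the one that compares a clustering with a reference partition. A minimal sketch of its standard pair-counting form follows; the function name and the toy label vectors are assumptions, and the silhouette and Pearson-Gamma indexes, which need a dissimilarity matrix, are not covered.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two partitions of the same objects."""
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to compare; treated as perfect agreement
    # Pairs of objects placed together in both partitions, in A, and in B.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2.0
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions have one cluster
    return (together_both - expected) / (maximum - expected)

truth = [0, 0, 0, 1, 1, 1]
model_based = [1, 1, 1, 0, 0, 0]      # same grouping, different label names
distance_based = [0, 0, 1, 1, 2, 2]   # a different grouping
print(adjusted_rand_index(truth, model_based))
print(adjusted_rand_index(truth, distance_based))
```

The index depends only on which objects are grouped together, not on the label names, so it can compare the output of a latent class model and of a medoid-based method against the same simulated truth.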
Abstract:
The need for timely population data for health planning and indicators of need has increased the demand for population estimates. The data required to produce estimates are difficult to obtain and the process is time consuming. Estimation methods that require less effort and fewer data are needed. The structure preserving estimator (SPREE) is a promising technique not previously used to estimate county population characteristics. This study first uses traditional regression estimation techniques to produce estimates of county population totals. Then the structure preserving estimator, using the results produced in the first phase as constraints, is evaluated.
Regression methods are among the most frequently used demographic methods for estimating populations. These methods use symptomatic indicators to predict population change. This research evaluates three regression methods to determine which will produce the best estimates based on the 1970 to 1980 indicators of population change. Strategies for stratifying data to improve the ability of the methods to predict change were tested. Difference-correlation using PMSA strata produced the equation which fit the data the best. Regression diagnostics were used to evaluate the residuals.
The second phase of this study is to evaluate use of the structure preserving estimator in making estimates of population characteristics. The SPREE estimation approach uses existing data (the association structure) to establish the relationship between the variable of interest and the associated variable(s) at the county level. Marginals at the state level (the allocation structure) supply the current relationship between the variables. The full allocation structure model uses current estimates of county population totals to limit the magnitude of county estimates. The limited full allocation structure model has no constraints on county size. The 1970 county census age-gender population provides the association structure; the allocation structure is the 1980 state age-gender distribution.
The full allocation model produces good estimates of the 1980 county age-gender populations. An unanticipated finding of this research is that the limited full allocation model produces estimates of county population totals that are superior to those produced by the regression methods. The full allocation model is used to produce estimates of 1986 county population characteristics.
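The difference between the two allocation models can be illustrated with iterative proportional fitting, the procedure commonly used to compute structure preserving estimates. The sketch below is an assumption about the mechanics, not the study's implementation: the function names and all numbers are invented, the "full" variant rakes the old county by age-gender table to both current county totals and current state age-gender totals, and the "limited" variant scales to the state age-gender totals only.

```python
def rake_full(association, county_totals, state_totals, n_iter=100, tol=1e-9):
    """Iterative proportional fitting to both row and column targets.

    association   : old county x age-gender counts (the association structure)
    county_totals : current county totals (row targets)
    state_totals  : current state age-gender totals (column targets)
    The two sets of targets must add up to the same grand total.
    """
    table = [row[:] for row in association]
    for _ in range(n_iter):
        for i, target in enumerate(county_totals):       # match county totals
            s = sum(table[i])
            table[i] = [x * target / s for x in table[i]]
        for j, target in enumerate(state_totals):        # match state margins
            s = sum(row[j] for row in table)
            for row in table:
                row[j] *= target / s
        gap = max(abs(sum(row) - t) for row, t in zip(table, county_totals))
        if gap < tol:
            break
    return table

def rake_limited(association, state_totals):
    """Scale each age-gender column to its current state total; counties unconstrained."""
    col_sums = [sum(row[j] for row in association) for j in range(len(state_totals))]
    return [[x * state_totals[j] / col_sums[j] for j, x in enumerate(row)]
            for row in association]

# Hypothetical data: 2 counties x 2 age-gender groups.
census_1970 = [[40.0, 60.0],
               [30.0, 70.0]]
county_1980 = [120.0, 110.0]   # assumed county totals, e.g. from a regression phase
state_1980 = [90.0, 140.0]     # assumed state age-gender totals

full = rake_full(census_1970, county_1980, state_1980)
limited = rake_limited(census_1970, state_1980)
print([[round(x, 1) for x in row] for row in full])
print([round(sum(row), 1) for row in limited])  # implied county totals
```

Both variants only rescale rows and columns, so the cross-product ratios of the old table are carried over unchanged. The limited variant yields county totals as a by-product, which is the quantity the abstract compares with the regression estimates.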