991 results for CATEGORICAL-DATA


Relevance:

60.00%

Publisher:

Abstract:

Many modern applications fall into the category of "large-scale" statistical problems, in which both the number of observations n and the number of features or parameters p may be large. Many existing methods focus on point estimation, yet uncertainty quantification remains centrally important in the sciences, where the number of parameters to estimate often exceeds the sample size despite the huge increases in n typically seen in many fields. The tendency in some areas of industry to dispense with traditional statistical analysis on the grounds that "n = all" is thus of little relevance outside of certain narrow applications. The main result of the Big Data revolution in most fields has instead been to make computation much harder without reducing the importance of uncertainty quantification. Bayesian methods excel at uncertainty quantification but often scale poorly relative to alternatives. This conflict between the statistical advantages of Bayesian procedures and their substantial computational disadvantages is perhaps the greatest challenge facing modern Bayesian statistics, and it is the primary motivation for the work presented here.

Two general strategies for scaling Bayesian inference are considered. The first is the development of methods that lend themselves to faster computation, and the second is the design and characterization of computational algorithms that scale better in n or p. In the first case, the focus is on joint inference beyond the standard problem of multivariate continuous data that has been the major focus of previous theoretical work in this area. In the second, we pursue strategies for improving the speed of Markov chain Monte Carlo algorithms and for characterizing their performance in large-scale settings. Throughout, the focus is on rigorous theoretical evaluation combined with empirical demonstrations of performance and concordance with the theory.

One topic we consider is modeling the joint distribution of multivariate categorical data, often summarized in a contingency table. Contingency table analysis routinely relies on log-linear models, with latent structure analysis providing a common alternative. Latent structure models lead to a reduced rank tensor factorization of the probability mass function for multivariate categorical data, while log-linear models achieve dimensionality reduction through sparsity. Little is known about the relationship between these notions of dimensionality reduction in the two paradigms. In Chapter 2, we derive several results relating the support of a log-linear model to nonnegative ranks of the associated probability tensor. Motivated by these findings, we propose a new collapsed Tucker class of tensor decompositions, which bridge existing PARAFAC and Tucker decompositions, providing a more flexible framework for parsimoniously characterizing multivariate categorical data. Taking a Bayesian approach to inference, we illustrate empirical advantages of the new decompositions.
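For orientation, the two factorizations can be written as follows (a standard formulation in our own notation, not taken verbatim from the thesis). PARAFAC expresses the probability mass function as a finite mixture of product-multinomial kernels indexed by a single latent class h, while Tucker allows a separate latent index per variable, coupled through a core tensor:

```latex
% Sketch in our notation, not the thesis's.
% PARAFAC: rank-k mixture with one shared latent index
\Pr(y_1 = c_1, \ldots, y_p = c_p)
  \;=\; \sum_{h=1}^{k} \nu_h \prod_{j=1}^{p} \lambda^{(j)}_{h c_j},
  \qquad \nu_h \ge 0, \quad \sum_{h=1}^{k} \nu_h = 1.

% Tucker: one latent index per variable, coupled by a core tensor \phi
\Pr(y_1 = c_1, \ldots, y_p = c_p)
  \;=\; \sum_{h_1=1}^{k_1} \cdots \sum_{h_p=1}^{k_p}
    \phi_{h_1 \cdots h_p} \prod_{j=1}^{p} \lambda^{(j)}_{h_j c_j}.
```

The collapsed Tucker class described above bridges these two extremes.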

Latent class models for the joint distribution of multivariate categorical data, such as the PARAFAC decomposition, play an important role in the analysis of population structure. In this context, the number of latent classes is interpreted as the number of genetically distinct subpopulations of an organism, an important factor in the analysis of evolutionary processes and conservation status. Existing methods focus on point estimates of the number of subpopulations and lack robust uncertainty quantification. Moreover, whether the number of latent classes in these models is even an identified parameter is an open question. In Chapter 3, we show that when the model is properly specified, the correct number of subpopulations can be recovered almost surely. We then propose an alternative method for estimating the number of latent subpopulations that provides good quantification of uncertainty, and we provide a simple procedure for verifying that the proposed method is consistent for the number of subpopulations. The performance of the model in estimating the number of subpopulations and in other common population structure inference problems is assessed in simulations and a real data application.

In contingency table analysis, sparse data are frequently encountered even for modest numbers of variables, resulting in non-existence of maximum likelihood estimates. A common solution is to obtain regularized estimates of the parameters of a log-linear model. Bayesian methods provide a coherent approach to regularization but are often computationally intensive. Conjugate priors ease computational demands, but the conjugate Diaconis--Ylvisaker priors for the parameters of log-linear models do not give rise to closed-form credible regions, complicating posterior inference. In Chapter 4, we derive the optimal Gaussian approximation to the posterior for log-linear models with Diaconis--Ylvisaker priors, and we provide convergence rates and finite-sample bounds for the Kullback-Leibler divergence between the exact posterior and the optimal Gaussian approximation. We demonstrate empirically, in simulations and a real data application, that the approximation is highly accurate, even in relatively small samples. The proposed approximation provides a computationally scalable and principled approach to regularized estimation and approximate Bayesian inference for log-linear models.
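For concreteness, one standard way to define the "optimal" Gaussian approximation (our notation; the thesis may differ in details) is as the KL-closest Gaussian to the posterior, which under this direction of the divergence is the Gaussian matching the posterior's first two moments:

```latex
% One common definition (our notation), shown as a sketch only.
\hat{q} \;=\; \operatorname*{arg\,min}_{q \,=\, \mathcal{N}(\mu,\,\Sigma)}
  D_{\mathrm{KL}}\!\left(\pi_n \,\|\, q\right)
  \;=\; \mathcal{N}\!\big(\mathbb{E}_{\pi_n}[\theta],\; \operatorname{Cov}_{\pi_n}(\theta)\big),
\qquad
D_{\mathrm{KL}}(\pi_n \,\|\, q) \;=\; \int \pi_n(\theta)\,
  \log \frac{\pi_n(\theta)}{q(\theta)}\, d\theta,
```

where $\pi_n$ denotes the exact posterior.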

Another challenging and somewhat non-standard joint modeling problem is inference on tail dependence in stochastic processes. In applications where extreme dependence is of interest, data are almost always time-indexed. Existing methods for inference and modeling in this setting often cluster extreme events or choose window sizes with the goal of preserving temporal information. In Chapter 5, we propose an alternative paradigm for inference on tail dependence in stochastic processes with arbitrary temporal dependence structure in the extremes, based on the idea that the information on strength of tail dependence and the temporal structure in this dependence are both encoded in waiting times between exceedances of high thresholds. We construct a class of time-indexed stochastic processes with tail dependence obtained by endowing the support points in de Haan's spectral representation of max-stable processes with velocities and lifetimes. We extend Smith's model to these max-stable velocity processes and obtain the distribution of waiting times between extreme events at multiple locations. Motivated by this result, a new definition of tail dependence is proposed that is a function of the distribution of waiting times between threshold exceedances, and an inferential framework is constructed for estimating the strength of extremal dependence and quantifying uncertainty in this paradigm. The method is applied to climatological, financial, and electrophysiology data.
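For readers unfamiliar with the construction being extended, Smith's model is the max-stable process (standard form; notation ours)

```latex
% Smith (1990) storm-profile model, sketched in our notation.
Z(s) \;=\; \max_{i \ge 1}\; \zeta_i \,\varphi(s - u_i;\, \Sigma),
\qquad
\{(\zeta_i, u_i)\}_{i \ge 1} \;\sim\; \mathrm{PPP}\!\left(\zeta^{-2}\, d\zeta \times du\right)
\ \text{on}\ (0, \infty) \times \mathbb{R}^d,
```

where $\varphi(\cdot;\Sigma)$ is the $\mathcal{N}(0,\Sigma)$ density, so each Poisson point acts as a "storm" centred at $u_i$ with intensity $\zeta_i$; the velocity processes of Chapter 5 additionally endow each support point with a velocity and a lifetime.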

The remainder of this thesis focuses on posterior computation by Markov chain Monte Carlo. The Markov Chain Monte Carlo method is the dominant paradigm for posterior computation in Bayesian analysis. It has long been common to control computation time by making approximations to the Markov transition kernel. Comparatively little attention has been paid to convergence and estimation error in these approximating Markov Chains. In Chapter 6, we propose a framework for assessing when to use approximations in MCMC algorithms, and how much error in the transition kernel should be tolerated to obtain optimal estimation performance with respect to a specified loss function and computational budget. The results require only ergodicity of the exact kernel and control of the kernel approximation accuracy. The theoretical framework is applied to approximations based on random subsets of data, low-rank approximations of Gaussian processes, and a novel approximating Markov chain for discrete mixture models.
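As a minimal illustration of the first class of approximations (random subsets of data), here is a sketch of a Metropolis-Hastings sampler whose log-likelihood is estimated on a fresh subsample each iteration. This is a generic noisy-MCMC construction on a toy Gaussian model, not the specific framework of Chapter 6:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n observations from N(theta_true, 1); flat prior on theta assumed.
n = 100_000
theta_true = 1.5
x = rng.normal(theta_true, 1.0, size=n)

def loglik_subset(theta, idx):
    """Estimate the full-data log-likelihood from a random subset,
    scaling the subset sum by n / |subset|."""
    return (n / idx.size) * np.sum(-0.5 * (x[idx] - theta) ** 2)

def approx_mh(n_iter=5000, m=1000, step=0.02):
    """Random-walk Metropolis-Hastings with an approximate transition kernel:
    the accept ratio uses a log-likelihood estimated on a fresh subset of
    size m each iteration (evaluated at both states on the same subset)."""
    theta = 0.0
    draws = np.empty(n_iter)
    for t in range(n_iter):
        prop = theta + step * rng.normal()
        idx = rng.choice(n, size=m, replace=False)
        if np.log(rng.uniform()) < loglik_subset(prop, idx) - loglik_subset(theta, idx):
            theta = prop
        draws[t] = theta
    return draws

draws = approx_mh()
print(draws[2000:].mean())   # sits near theta_true = 1.5
```

Larger subsample sizes m make the approximate kernel closer to the exact one at higher per-iteration cost, which is exactly the accuracy-versus-budget trade-off the framework formalizes.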

Data augmentation Gibbs samplers are arguably the most popular class of algorithms for approximately sampling from the posterior distribution of the parameters of generalized linear models. The truncated normal and Polya-Gamma data augmentation samplers are standard examples for probit and logit links, respectively. Motivated by an important problem in quantitative advertising, in Chapter 7 we consider the application of these algorithms to modeling rare events. We show that when the sample size is large but the observed number of successes is small, these data augmentation samplers mix very slowly, with a spectral gap that converges to zero at a rate at least proportional to the reciprocal of the square root of the sample size, up to a log factor. In simulation studies, moderate sample sizes result in high autocorrelations and small effective sample sizes. Similar empirical results are observed for related data augmentation samplers for multinomial logit and probit models. When applied to a real quantitative advertising dataset, the data augmentation samplers mix very poorly. Conversely, Hamiltonian Monte Carlo and a type of independence chain Metropolis algorithm show good mixing on the same dataset.
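For reference, a compact sketch of the truncated normal (Albert-Chib) data augmentation sampler for probit regression, on simulated rare-event data (toy setup with a flat prior on beta; all names ours):

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(1)

# Toy probit data in which successes are rare (invented parameters).
n, p = 5000, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-2.5, 0.5])          # intercept makes y = 1 rare
y = (X @ beta_true + rng.normal(size=n) > 0).astype(int)

XtX_inv = np.linalg.inv(X.T @ X)           # flat prior on beta
L = np.linalg.cholesky(XtX_inv)

def gibbs_probit(n_iter=2000):
    """Albert-Chib data augmentation Gibbs sampler for probit regression."""
    beta = np.zeros(p)
    out = np.empty((n_iter, p))
    for t in range(n_iter):
        mu = X @ beta
        # z_i | beta, y_i ~ N(mu_i, 1) truncated to (0, inf) if y_i = 1, else (-inf, 0)
        lo = np.where(y == 1, -mu, -np.inf)
        hi = np.where(y == 1, np.inf, -mu)
        z = mu + truncnorm.rvs(lo, hi, size=n, random_state=rng)
        # beta | z ~ N((X'X)^{-1} X'z, (X'X)^{-1})
        beta = XtX_inv @ (X.T @ z) + L @ rng.normal(size=p)
        out[t] = beta
    return out

draws = gibbs_probit()
# Slow mixing shows up as high lag-1 autocorrelation of the intercept draws.
b0 = draws[500:, 0]
print(np.corrcoef(b0[:-1], b0[1:])[0, 1])
```

With a rare outcome, the lag-1 autocorrelation printed at the end is typically close to 1, consistent with the slow mixing described above.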

Relevance:

60.00%

Publisher:

Abstract:

Surveys can collect important data that inform policy decisions and drive social science research. Large government surveys collect information from the U.S. population on a wide range of topics, including demographics, education, employment, and lifestyle. Analysis of survey data presents unique challenges. In particular, one needs to account for missing data, for complex sampling designs, and for measurement error. Conceptually, a survey organization could devote substantial resources to obtaining high-quality responses from a simple random sample, resulting in survey data that are easy to analyze. However, this scenario often is not realistic. To address these practical issues, survey organizations can leverage the information available from other sources of data. For example, in longitudinal studies that suffer from attrition, they can use the information from refreshment samples to correct for potential attrition bias. They can use information from known marginal distributions or from the survey design to improve inferences. They can use information from gold standard sources to correct for measurement error.

This thesis presents novel approaches to combining information from multiple sources that address the three problems described above.

The first method addresses nonignorable unit nonresponse and attrition in a panel survey with a refreshment sample. Panel surveys typically suffer from attrition, which can lead to biased inference when basing analysis only on cases that complete all waves of the panel. Unfortunately, the panel data alone cannot inform the extent of the bias due to attrition, so analysts must make strong and untestable assumptions about the missing data mechanism. Many panel studies also include refreshment samples, which are data collected from a random sample of new individuals during some later wave of the panel. Refreshment samples offer information that can be utilized to correct for biases induced by nonignorable attrition while reducing reliance on strong assumptions about the attrition process. To date, these bias correction methods have not dealt with two key practical issues in panel studies: unit nonresponse in the initial wave of the panel and in the refreshment sample itself. As we illustrate, nonignorable unit nonresponse can significantly compromise the analyst's ability to use the refreshment samples for attrition bias correction. Thus, it is crucial for analysts to assess how sensitive their inferences---corrected for panel attrition---are to different assumptions about the nature of the unit nonresponse. We present an approach that facilitates such sensitivity analyses, both for suspected nonignorable unit nonresponse in the initial wave and in the refreshment sample. We illustrate the approach using simulation studies and an analysis of data from the 2007-2008 Associated Press/Yahoo News election panel study.

The second method incorporates informative prior beliefs about marginal probabilities into Bayesian latent class models for categorical data. The basic idea is to append synthetic observations to the original data such that (i) the empirical distributions of the desired margins match those of the prior beliefs, and (ii) the values of the remaining variables are left missing. The degree of prior uncertainty is controlled by the number of augmented records. Posterior inferences can be obtained via typical MCMC algorithms for latent class models, tailored to deal efficiently with the missing values in the concatenated data.
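A minimal sketch of the augmentation step (variable names and margins are invented for illustration; the latent class MCMC itself is not shown):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Original categorical survey data (toy example; variable names are ours).
data = pd.DataFrame({
    "education": rng.choice(["HS", "BA", "grad"], size=500),
    "employed":  rng.choice(["yes", "no"], size=500),
})

def augment_with_margin(data, var, probs, n_aug):
    """Append n_aug synthetic records whose values of `var` follow the
    prior marginal `probs`; all other variables are left missing (NaN)."""
    levels, p = zip(*probs.items())
    synth = pd.DataFrame({var: rng.choice(levels, size=n_aug, p=p)})
    synth = synth.reindex(columns=data.columns)   # other columns -> NaN
    return pd.concat([data, synth], ignore_index=True)

# Hypothetical prior belief: education margin is 40% HS, 45% BA, 15% grad.
augmented = augment_with_margin(data, "education",
                                {"HS": 0.40, "BA": 0.45, "grad": 0.15},
                                n_aug=200)
print(augmented.tail())   # synthetic rows: education set, employed = NaN
```

The number of synthetic records n_aug plays the role of a prior sample size: more records express stronger prior confidence in the margin.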

We illustrate the approach using a variety of simulations based on data from the American Community Survey, including an example of how augmented records can be used to fit latent class models to data from stratified samples.

The third method leverages the information from a gold standard survey to model reporting error. Survey data are subject to reporting error when respondents misunderstand the question or accidentally select the wrong response. Sometimes survey respondents knowingly select the wrong response, for example, by reporting a higher level of education than they actually have attained. We present an approach that allows an analyst to model reporting error by incorporating information from a gold standard survey. The analyst can specify various reporting error models and assess how sensitive their conclusions are to different assumptions about the reporting error process. We illustrate the approach using simulations based on data from the 1993 National Survey of College Graduates. We use the method to impute error-corrected educational attainments in the 2010 American Community Survey using the 2010 National Survey of College Graduates as the gold standard survey.

Relevance:

60.00%

Publisher:

Abstract:

Aim The aim of this study was to explore parental preparedness for discharge and parents' experiences of going home with their infant after first-stage surgery for a functionally univentricular heart. Background Technological advances worldwide have improved outcomes for infants with a functionally univentricular heart over the last three decades; however, concern remains regarding mortality in the period between the first and second stages of surgery. The implementation of home monitoring programmes for this group of infants has improved this initial inter-stage survival; however, little is known about parents' experiences of going home, their preparedness for discharge, and their recognition of deterioration in their fragile infant. Method This study was conducted in 2011–2013; eight sets of parents were consulted in the research planning stage in September, 2011, and 22 parents with children aged 0–2 years responded to an online survey during November, 2012–March, 2013. Descriptive analysis of the categorical data and deductive thematic analysis of the open-ended questions were undertaken. Results Not all parents were taught signs of deterioration or given written information specific to their baby. Three themes emerged from the qualitative data: mixed emotions about going home, knowledge and preparedness, and support systems. Conclusions Parents are not adequately prepared for discharge and are not well equipped to recognise deterioration in their child. There is a role for greater parental education through the development of an early warning tool to address the gap in parents' understanding of signs of deterioration, enabling appropriate contact and earlier management by clinicians.

Relevance:

60.00%

Publisher:

Abstract:

Background: Sexually transmitted infections (STIs) remain one of the commonest health problems affecting women of reproductive age. The knowledge and practices regarding STIs among susceptible populations, such as women of reproductive age living in slums like Katanga in Kampala, Uganda, need to be established. Methods: This was a cross-sectional study with 339 participants in Katanga slum. Data were collected using an interviewer-administered questionnaire, then entered and analysed using SPSS version 17.0. Data were summarized using frequencies for categorical data and medians for continuous data. Results: The majority of participants (71.9%) were ≥25 years old, with a mean age of 28.0 (SD ±7.0) years. The commonest symptoms known to the participants were genital itching (60%) and genital rash (14.5%). Most mentioned multiple partners (63.7%) and unprotected sex (50.7%) as predisposing factors for STIs. Knowledge of prevention methods was high (92.3%); however, 18.8% of participants were found positive for STIs using the syndromic approach, and 82% reported having suffered from an STI more than once in the past 6 months. Conclusion: Most participants did not know about the systemic effects of STIs on their health and did not follow appropriate behaviour patterns, despite being knowledgeable about the various methods of STI prevention.

Relevance:

60.00%

Publisher:

Abstract:

Background and Purpose—Vascular prevention trials mostly count “yes/no” (binary) outcome events, eg, stroke/no stroke. Analysis of ordered categorical vascular events (eg, fatal stroke/nonfatal stroke/no stroke) is clinically relevant and could be more powerful statistically. Although this is not a novel idea in the statistical community, ordinal outcomes have not been applied to stroke prevention trials in the past. Methods—Summary data on stroke, myocardial infarction, combined vascular events, and bleeding were obtained by treatment group from published vascular prevention trials. Data were analyzed using 10 statistical approaches which allow comparison of 2 treatment groups with ordinal or binary data. The results of each statistical test for each trial were then compared using Friedman 2-way analysis of variance with multiple comparison procedures. Results—Across 85 trials (335 305 subjects) the test results differed substantially, such that approaches which used the ordinal nature of stroke events (fatal/nonfatal/no stroke) were more efficient than those which combined the data to form 2 groups (P<0.0001). The most efficient tests were bootstrapping the difference in mean rank, the Mann–Whitney U test, and ordinal logistic regression; 4- and 5-level data were more efficient still. Similar findings were obtained for myocardial infarction, combined vascular outcomes, and bleeding. The findings were consistent across different types, designs, and sizes of trial, and for the different types of intervention. Conclusions—When analyzing vascular events from prevention trials, statistical tests which use ordered categorical data are more efficient and are more likely to yield reliable results than binary tests. This approach gives additional information on treatment effects by severity of event and will allow trials to be smaller. (Stroke. 2008;39:000-000.)
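The contrast between the ordinal and binary analyses is easy to reproduce on simulated trial data (illustrative numbers only, not the trial data analysed in the paper); a sketch:

```python
import numpy as np
from scipy.stats import mannwhitneyu, chi2_contingency

rng = np.random.default_rng(3)

# Ordinal outcome per subject: 0 = no stroke, 1 = nonfatal stroke, 2 = fatal stroke.
control   = rng.choice([0, 1, 2], size=2000, p=[0.90, 0.07, 0.03])
treatment = rng.choice([0, 1, 2], size=2000, p=[0.92, 0.06, 0.02])

# Ordinal analysis: Mann-Whitney U test on the 3-level outcome.
u, p_ordinal = mannwhitneyu(treatment, control, alternative="two-sided")

# Binary analysis: collapse to stroke / no stroke, then chi-square.
table = np.array([
    [np.sum(treatment == 0), np.sum(treatment > 0)],
    [np.sum(control == 0),   np.sum(control > 0)],
])
chi2, p_binary, _, _ = chi2_contingency(table)

print(f"ordinal  P = {p_ordinal:.4f}")
print(f"binary   P = {p_binary:.4f}")
```

On data simulated this way, the 3-level Mann-Whitney U test typically attains a smaller P value than the collapsed binary comparison, reflecting the efficiency gain reported above.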

Relevance:

60.00%

Publisher:

Abstract:

Over the past 15 years, the number of international development projects aimed at combating global poverty has increased significantly. Within the water and sanitation sector, however, and despite heightened global attention and an increase in the number of infrastructure projects, over 800 million people remain without access to appropriate water and sanitation facilities. The majority of donor aid in the water supply and sanitation sector of developing countries is delivered through standalone projects. The quality of projects at the design and preparation stage is a critical determinant in meeting project objectives. The quality of projects at the early stage of design, widely referred to as quality at entry (QAE), however, remains unquantified and largely subjective. This research argues that water and sanitation infrastructure projects in the developing world tend to be designed in the absence of a specific set of actions that ensure high QAE, and consequently have relatively high rates of failure. This research analyzes 32 cases of water and sanitation infrastructure projects implemented with partial or full World Bank financing globally from 2000 to 2010. The research uses categorical data analysis, regression analysis, and descriptive analysis to examine perceived linkages between project QAE and project development outcomes, and it determines which upstream project design factors are likely to impact the QAE of international development projects in water supply and sanitation. The research proposes a number of specific design-stage actions that can be incorporated into the formal review process of water and sanitation projects financed by the World Bank or other international development partners.

Relevance:

60.00%

Publisher:

Abstract:

Introduction Prognostic scoring systems have been developed to measure disease severity and patient prognosis in the intensive care unit. These measures are useful for clinical decision making, the standardization of research, and comparisons of the quality of critical patient care. Materials and methods This was an observational analytic cohort study reviewing the medical records of 283 oncology patients admitted to the intensive care unit (ICU) between January 2014 and January 2016, whose probability of mortality was estimated with the APACHE IV and MPM II prognostic scores. Logistic regression was performed with the predictor variables used to derive each model in its original study; calibration and discrimination were assessed, and the Akaike (AIC) and Bayesian (BIC) information criteria were computed. Results In the performance evaluation, APACHE IV showed greater predictive ability (AUC = 0.95) than MPM II (AUC = 0.78); both models showed adequate calibration according to the Hosmer-Lemeshow statistic (APACHE IV, p = 0.39; MPM II, p = 0.99). The ΔBIC of 2.9 constitutes positive evidence against APACHE IV. The AIC was lower for APACHE IV, indicating that it is the model with the best fit to the data. Conclusions APACHE IV performs well in predicting the mortality of critically ill patients, including oncology patients. It is therefore a useful tool for clinicians in their daily work, allowing them to identify patients with a high probability of mortality.
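A sketch of this kind of model-comparison workflow in code (simulated stand-in data and invented predictors, not the APACHE IV or MPM II variables):

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)

# Simulated stand-in for the ICU cohort (hypothetical predictors).
n = 283
X = sm.add_constant(rng.normal(size=(n, 3)))
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([-1.0, 1.2, 0.8, -0.5])))))

def evaluate(X):
    """Fit a logistic model; report discrimination (AUC) and AIC/BIC."""
    res = sm.Logit(y, X).fit(disp=False)
    auc = roc_auc_score(y, res.predict(X))
    return auc, res.aic, res.bic

# Compare a richer model against a reduced one, mimicking APACHE IV vs MPM II.
for name, cols in [("model A (all predictors)", slice(None)),
                   ("model B (fewer predictors)", slice(0, 2))]:
    auc, aic, bic = evaluate(X[:, cols])
    print(f"{name}: AUC = {auc:.2f}, AIC = {aic:.1f}, BIC = {bic:.1f}")
```

Discrimination (AUC), calibration, and the information criteria can disagree, as they do in the abstract above, which is why all three are reported.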

Relevance:

40.00%

Publisher:

Abstract:

There are many situations in which input feature vectors are incomplete, and methods to tackle the problem have been studied for a long time. A commonly used procedure is to replace each missing value with an imputation. This paper presents a method to perform categorical missing data imputation from numerical and categorical variables. The imputations are based on Simpson's fuzzy min-max neural networks, in which the input variables for learning and classification are numerical only. The proposed method extends the input to categorical variables by introducing new fuzzy sets, a new operation, and a new architecture. The procedure is tested and compared with others using opinion poll data.

Relevance:

40.00%

Publisher:

Abstract:

The paper investigates a Bayesian hierarchical model for the analysis of categorical longitudinal data from a large social survey of immigrants to Australia. Data for each subject are observed on three separate occasions, or waves, of the survey. One of the features of the data set is that observations for some variables are missing for at least one wave. A model for the employment status of immigrants is developed by introducing, at the first stage of a hierarchical model, a multinomial model for the response and then subsequent terms are introduced to explain wave and subject effects. To estimate the model, we use the Gibbs sampler, which allows missing data for both the response and the explanatory variables to be imputed at each iteration of the algorithm, given some appropriate prior distributions. After accounting for significant covariate effects in the model, results show that the relative probability of remaining unemployed diminished with time following arrival in Australia.

Relevance:

40.00%

Publisher:

Abstract:

Many variables that are of interest in social science research are nominal variables with two or more categories, such as employment status, occupation, political preference, or self-reported health status. With longitudinal survey data it is possible to analyse the transitions of individuals between different employment states or occupations (for example). In the statistical literature, models for analysing categorical dependent variables with repeated observations belong to the family of models known as generalized linear mixed models (GLMMs). The specific GLMM for a dependent variable with three or more categories is the multinomial logit random effects model. For these models, the marginal distribution of the response does not have a closed-form solution, and hence numerical integration must be used to obtain maximum likelihood estimates for the model parameters, as sketched below. Techniques for implementing the numerical integration are available but are computationally intensive, requiring a large amount of computer processing time that increases with the number of clusters (or individuals) in the data, and they are not always readily accessible to the practitioner in standard software. For the purposes of analysing categorical response data from a longitudinal social survey, there is clearly a need to evaluate the existing procedures for estimating multinomial logit random effects models in terms of accuracy, efficiency, and computing time. Computing time has significant implications for which approach researchers will prefer. In this paper we evaluate statistical software procedures that utilise adaptive Gaussian quadrature and MCMC methods, with specific application to modelling the employment status of women over three waves of the HILDA survey using a GLMM.
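Concretely, the integral in question and its quadrature approximation look like this (our notation; a single scalar random effect is shown for simplicity):

```latex
% Marginal likelihood contribution of subject i (sketch, our notation).
L_i(\beta, \sigma^2) \;=\; \int_{-\infty}^{\infty}
  \prod_{t=1}^{T_i} \Pr\!\big(y_{it} \mid x_{it}, \beta, u\big)\,
  \varphi(u;\, 0, \sigma^2)\, du
\;\approx\; \sum_{q=1}^{Q} w_q
  \prod_{t=1}^{T_i} \Pr\!\big(y_{it} \mid x_{it}, \beta, u_q\big),
```

where $(u_q, w_q)$ are Gauss-Hermite nodes and weights (recentred and rescaled around each subject's modal $u$ in the adaptive version). The per-subject cost grows as $Q^r$ for $r$ correlated random effects, which is one source of the heavy computing times discussed above.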

Relevance:

30.00%

Publisher:

Abstract:

Data mining is the process of identifying valid, implicit, previously unknown, potentially useful, and understandable information from large databases. It is an important step in the process of knowledge discovery in databases (Olaru & Wehenkel, 1999). In a data mining process, input data can be structured, semi-structured, or unstructured. Data can take the form of text, categorical values, or numerical values. One of the important characteristics of data mining is its ability to deal with data that are large in volume, distributed, time-variant, noisy, and high-dimensional. A large number of data mining algorithms have been developed for different applications. For example, association rules mining can be useful for market basket problems, clustering algorithms can be used to discover trends in unsupervised learning problems, classification algorithms can be applied in decision-making problems, and sequential and time series mining algorithms can be used in predicting events, fault detection, and other supervised learning problems (Vapnik, 1999). Classification is among the most important tasks in data mining, particularly for data mining applications in engineering fields. Together with regression, classification is mainly used for predictive modelling. So far, there have been a number of classification algorithms in practice. According to Sebastiani (2002), the main classification algorithms can be categorized as: decision tree and rule-based approaches such as C4.5 (Quinlan, 1996); probability methods such as the Bayesian classifier (Lewis, 1998); on-line methods such as Winnow (Littlestone, 1988) and CVFDT (Hulten, 2001); neural network methods (Rumelhart, Hinton & Williams, 1986); example-based methods such as k-nearest neighbours (Duda & Hart, 1973); and SVM (Cortes & Vapnik, 1995). Other important techniques for classification tasks include Associative Classification (Liu et al., 1998) and Ensemble Classification (Tumer, 1996).

Relevance:

30.00%

Publisher:

Abstract:

In data clustering, the problem of selecting the subset of most relevant features from the data has been an active research topic. Feature selection for clustering is a challenging task due to the absence of class labels to guide the search for relevant features. Most methods proposed for this goal focus on numerical data. In this work, we propose an approach for clustering and selecting categorical features simultaneously. We assume that the data originate from a finite mixture of multinomial distributions and implement an integrated expectation-maximization (EM) algorithm that estimates all the parameters of the model and selects the subset of relevant features simultaneously. The results obtained on synthetic data illustrate the performance of the proposed approach. An application to real data from official statistics shows its usefulness.
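A minimal EM sketch for a finite mixture of independent multinomials, omitting the feature-selection step of the integrated algorithm (toy data; all names ours):

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy multivariate categorical data: n observations, d features, C levels each.
n, d, C, K = 1000, 4, 3, 2
z_true = rng.integers(0, K, size=n)
theta_true = rng.dirichlet(np.ones(C), size=(K, d))        # (K, d, C)
X = np.array([[rng.choice(C, p=theta_true[z_true[i], j]) for j in range(d)]
              for i in range(n)])                           # values 0..C-1

# Parameters: mixing weights pi (K,) and per-cluster category probs theta (K, d, C).
pi = np.full(K, 1.0 / K)
theta = rng.dirichlet(np.ones(C), size=(K, d))

for _ in range(100):
    # E-step: responsibilities r[i, k] ∝ pi_k * prod_j theta[k, j, X[i, j]]
    log_r = np.log(pi) + np.stack(
        [np.log(theta[k, np.arange(d), X]).sum(axis=1) for k in range(K)],
        axis=1)
    log_r -= log_r.max(axis=1, keepdims=True)               # stabilize
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)

    # M-step: weighted frequencies (tiny smoothing avoids log(0))
    pi = r.mean(axis=0)
    for k in range(K):
        for j in range(d):
            counts = np.array([r[X[:, j] == c, k].sum() for c in range(C)])
            theta[k, j] = (counts + 1e-9) / (counts.sum() + C * 1e-9)

print("estimated mixing weights:", pi)
```

The integrated algorithm of the paper additionally carries per-feature relevance indicators inside the same EM loop; the sketch above shows only the standard mixture backbone.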

Relevance:

30.00%

Publisher:

Abstract:

In cluster analysis, it can be useful to interpret the partition built from the data in the light of external categorical variables which are not directly involved in clustering the data. An approach is proposed in the model-based clustering context to select a number of clusters which both fits the data well and takes advantage of the potential illustrative ability of the external variables. This approach makes use of the integrated joint likelihood of the data and the partitions at hand, namely the model-based partition and the partitions associated with the external variables. It is noteworthy that each mixture model is fitted to the data by the maximum likelihood methodology, excluding the external variables, which are used only to select a relevant mixture model. Numerical experiments illustrate the promising behaviour of the derived criterion.
