991 results for CATEGORICAL-DATA


Relevance:

30.00%

Publisher:

Abstract:

When continuous data are coded to categorical variables, two types of coding are possible: crisp coding in the form of indicator, or dummy, variables with values either 0 or 1; or fuzzy coding where each observation is transformed to a set of "degrees of membership" between 0 and 1, using so-called membership functions. It is well known that the correspondence analysis of crisp coded data, namely multiple correspondence analysis, yields principal inertias (eigenvalues) that considerably underestimate the quality of the solution in a low-dimensional space. Since the crisp data only code the categories to which each individual case belongs, an alternative measure of fit is simply to count how well these categories are predicted by the solution. Another approach is to consider multiple correspondence analysis equivalently as the analysis of the Burt matrix (i.e., the matrix of all two-way cross-tabulations of the categorical variables), and then perform a joint correspondence analysis to fit just the off-diagonal tables of the Burt matrix; the measure of fit is then computed as the quality of explaining these tables only. The correspondence analysis of fuzzy coded data, called "fuzzy multiple correspondence analysis", suffers from the same problem, albeit attenuated. Again, one can count how many correct predictions are made of the categories which have the highest degree of membership. But here one can also defuzzify the results of the analysis to obtain estimated values of the original data, and then calculate a measure of fit in the familiar percentage form, thanks to the resultant orthogonal decomposition of variance. Furthermore, if one thinks of fuzzy multiple correspondence analysis as explaining the two-way associations between variables, a fuzzy Burt matrix can be computed and the same strategy as in the crisp case can be applied to analyse the off-diagonal part of this matrix.
In this paper these alternative measures of fit are defined and applied to a data set of continuous meteorological variables, which are coded crisply and fuzzily into three categories. Measuring the fit is further discussed when the data set consists of a mixture of discrete and continuous variables.
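The difference between crisp and fuzzy coding into three categories can be sketched in a few lines of Python. This is only an illustration: the triangular membership functions, the hinge points (low, mid, high) and the cut points are assumptions chosen for the example, not values taken from the paper.

```python
def crisp_code(x, cut1, cut2):
    """Crisp (dummy) coding: exactly one of three indicators is 1."""
    if x < cut1:
        return (1, 0, 0)
    if x < cut2:
        return (0, 1, 0)
    return (0, 0, 1)


def fuzzy_code(x, low, mid, high):
    """Fuzzy coding with triangular membership functions.

    Returns three degrees of membership in [0, 1] that sum to 1.
    The hinge points low < mid < high are assumed, e.g. the minimum,
    median and maximum of the variable.
    """
    if x <= low:
        return (1.0, 0.0, 0.0)
    if x >= high:
        return (0.0, 0.0, 1.0)
    if x <= mid:
        w = (x - low) / (mid - low)
        return (1.0 - w, w, 0.0)
    w = (x - mid) / (high - mid)
    return (0.0, 1.0 - w, w)


def defuzzify(memberships, low, mid, high):
    """Map membership degrees back to the original scale."""
    return memberships[0] * low + memberships[1] * mid + memberships[2] * high


# Hypothetical temperature of 15 degrees with hinges at 10, 20 and 40.
m = fuzzy_code(15.0, 10.0, 20.0, 40.0)
print(crisp_code(15.0, 17.0, 28.0), m, defuzzify(m, 10.0, 20.0, 40.0))
```

With triangular functions the defuzzified value reproduces the original observation inside the hinge range, which is what makes a variance-based percentage of fit possible after the analysis.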

Relevance:

30.00%

Publisher:

Abstract:

Introduction: The treatment offered to chronic kidney disease (CKD) patients before starting hemodialysis (HD) impacts prognosis. Objective: We sought differences among incident HD patients according to the distance between home and the dialysis center. Methods: We included 179 CKD patients undergoing HD. Patients were stratified into two groups: "living near the dialysis center" (patients whose hometown was up to 100 km from the dialysis center) or "living far from the dialysis center" (patients whose hometown was more than 100 km from the dialysis center). Socioeconomic status, laboratory results, awareness of CKD before starting HD, consultation with a nephrologist before the first HD session, and type of vascular access when starting HD were compared between the two groups. Comparisons of continuous and categorical variables were performed using Student's t-test and the Chi-square test, respectively. Results: Ninety (50.3%) patients were classified as "living near the dialysis center" and 89 (49.7%) as "living far from the dialysis center". Patients living near the dialysis center were more likely to be aware of their CKD than those living far from the dialysis center (46.6% versus 28.0%; p = 0.015). Although the differences were not statistically significant, patients living near the dialysis center more often had a previous consultation with a nephrologist (55.5% versus 42.6%; p = 0.116) and a first HD session through a fistula (30.0% versus 19.1%; p = 0.128) than those living far from the dialysis center. Conclusion: There are potential advantages in CKD awareness, referral to nephrologists and starting HD through a fistula among patients living near the dialysis center.

Relevance:

30.00%

Publisher:

Abstract:

Video on how to enter data into Excel. This includes setting up column headings, changing sheet names, using lists for categorical items, using validation on columns of data, plus how to check validation is operating correctly.

Relevance:

30.00%

Publisher:

Abstract:

Genetic association analyses of family-based studies with ordered categorical phenotypes are often conducted using methods either for quantitative or for binary traits, which can lead to suboptimal analyses. Here we present an alternative likelihood-based method of analysis for single nucleotide polymorphism (SNP) genotypes and ordered categorical phenotypes in nuclear families of any size. Our approach, which extends our previous work for binary phenotypes, permits straightforward inclusion of covariate, gene-gene and gene-covariate interaction terms in the likelihood, incorporates a simple model for ascertainment and allows for family-specific effects in the hypothesis test. Additionally, our method produces interpretable parameter estimates and valid confidence intervals. We assess the proposed method using simulated data, and apply it to a polymorphism in the C-reactive protein (CRP) gene typed in families collected to investigate human systemic lupus erythematosus. By including sex interactions in the analysis, we show that the polymorphism is associated with anti-nuclear autoantibody (ANA) production in females, while there appears to be no effect in males.

Relevance:

30.00%

Publisher:

Abstract:

Cross-bred cow adoption is an important and potent policy variable precipitating subsistence household entry into emerging milk markets. This paper focuses on the problem of designing policies that encourage and sustain milk-market expansion among a sample of subsistence households in the Ethiopian highlands. In this context it is desirable to measure households’ ‘proximity’ to market in terms of the level of deficiency of essential inputs. This problem is compounded by four factors. One is the existence of cross-bred cow numbers (count data) as an important, endogenous decision by the household; second is the lack of a multivariate generalization of the Poisson regression model; third is the censored nature of the milk sales data (sales from non-participating households are, essentially, censored at zero); and fourth is an important simultaneity that exists between the decision to adopt a cross-bred cow, the decision about how much milk to produce, the decision about how much milk to consume and the decision to market that milk which is produced but not consumed internally by the household. Routine application of Gibbs sampling and data augmentation overcomes these problems in a relatively straightforward manner. We model the count data from two sites close to Addis Ababa in a latent, categorical-variable setting with known bin boundaries. The single-equation model is then extended to a multivariate system that accommodates the covariance between cross-bred-cow adoption, milk-output, and milk-sales equations. The latent-variable procedure proves tractable in extension to the multivariate setting and provides important information for policy formation in emerging-market settings.

Relevance:

30.00%

Publisher:

Abstract:

Evaluations of measurement invariance provide essential construct validity evidence. However, the quality of such evidence is partly dependent upon the validity of the resulting statistical conclusions. The presence of Type I or Type II errors can render measurement invariance conclusions meaningless. The purpose of this study was to determine the effects of categorization and censoring on the behavior of the chi-square/likelihood ratio test statistic and two alternative fit indices (CFI and RMSEA) in the context of evaluating measurement invariance. Monte Carlo simulation was used to examine Type I error and power rates for the (a) overall test statistic/fit indices, and (b) change in test statistic/fit indices. Data were generated according to a multiple-group single-factor CFA model across 40 conditions that varied by sample size, strength of item factor loadings, and categorization thresholds. Seven different combinations of model estimators (ML, Yuan-Bentler scaled ML, and WLSMV) and specified measurement scales (continuous, censored, and categorical) were used to analyze each of the simulation conditions. As hypothesized, non-normality increased Type I error rates for the continuous scale of measurement and did not affect error rates for the categorical scale of measurement. Maximum likelihood estimation combined with a categorical scale of measurement resulted in more correct statistical conclusions than the other analysis combinations. For the continuous and censored scales of measurement, the Yuan-Bentler scaled ML resulted in more correct conclusions than normal-theory ML. The censored measurement scale did not offer any advantages over the continuous measurement scale. Comparing across fit statistics and indices, the chi-square-based test statistics were preferred over the alternative fit indices, and ΔRMSEA was preferred over ΔCFI. Results from this study should be used to inform the modeling decisions of applied researchers.
However, no single analysis combination can be recommended for all situations. Therefore, it is essential that researchers consider the context and purpose of their analyses.

Relevance:

30.00%

Publisher:

Abstract:

The study describes brain areas involved in medial temporal lobe (mTL) seizures of 12 patients. All patients showed so-called oro-alimentary behavior within the first 20 s of clinical seizure manifestation characteristic of mTL seizures. Single photon emission computed tomography (SPECT) images of regional cerebral blood flow (rCBF) were acquired from the patients in ictal and interictal phases and from normal volunteers. Image analysis employed categorical comparisons with statistical parametric mapping and principal component analysis (PCA) to assess functional connectivity. PCA supplemented the findings of the categorical analysis by decomposing the covariance matrix containing images of patients and healthy subjects into distinct component images of independent variance, including areas not identified by the categorical analysis. Two principal components (PCs) discriminated the subject groups: patients with right or left mTL seizures and normal volunteers, indicating distinct neuronal networks implicated by the seizure. Both PCs were correlated with seizure duration, one positively and the other negatively, confirming their physiological significance. The independence of the two PCs yielded a clear clustering of subject groups. The local pattern within the temporal lobe describes critical relay nodes which are the counterpart of oro-alimentary behavior: (1) right mesial temporal zone and ipsilateral anterior insula in right mTL seizures, and (2) temporal poles on both sides that are densely interconnected by the anterior commissure. Regions remote from the temporal lobe may be related to seizure propagation and include positively and negatively loaded areas. These patterns, the covarying areas of the temporal pole and occipito-basal visual association cortices, for example, are related to known anatomic paths.

Relevance:

30.00%

Publisher:

Abstract:

The discrete-time Markov chain is commonly used to describe changes of health states for chronic diseases in a longitudinal study. Statistical inferences on comparing treatment effects or on finding determinants of disease progression usually require estimation of transition probabilities. In many situations, when the outcome data have some missing observations or the variable of interest (called a latent variable) cannot be measured directly, the estimation of transition probabilities becomes more complicated. In the latter case, a surrogate variable that is easier to access and can gauge the characteristics of the latent one is usually used for data analysis. This dissertation research proposes methods to analyze longitudinal data (1) that have a categorical outcome with missing observations or (2) that use complete or incomplete surrogate observations to analyze the categorical latent outcome. For (1), different missing mechanisms were considered for empirical studies using methods that include the EM algorithm, Monte Carlo EM and a procedure that is not a data augmentation method. For (2), the hidden Markov model with the forward-backward procedure was applied for parameter estimation. This method was also extended to cover the computation of standard errors. The proposed methods were demonstrated with the schizophrenia example. The relevance to public health, the strengths and limitations, and possible future research were also discussed.
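As a rough illustration of case (2), the forward pass of the forward-backward procedure computes the likelihood of a sequence of surrogate observations under a hidden Markov model. The sketch below is not the dissertation's implementation: the two latent health states, the probabilities and the function name forward_likelihood are all hypothetical, and no scaling is applied, so it is only suitable for short sequences.

```python
def forward_likelihood(init, trans, emit, observations):
    """Likelihood of an observed sequence under a discrete hidden Markov model.

    init[s]     : probability that the latent chain starts in state s
    trans[r][s] : transition probability from latent state r to s
    emit[s][o]  : probability of surrogate observation o in latent state s
    """
    n_states = len(init)
    # alpha[s] = P(observations so far, current latent state = s)
    alpha = [init[s] * emit[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][obs]
            for s in range(n_states)
        ]
    return sum(alpha)


# Hypothetical two-state example (0 = stable, 1 = relapse).
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
print(forward_likelihood(init, trans, emit, [0, 1]))
```

Parameter estimation would maximize this likelihood over the transition and emission probabilities, for example with an EM-type update that also uses the backward pass.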

Relevance:

30.00%

Publisher:

Abstract:

Ordinal outcomes are frequently employed in diagnosis and clinical trials. Clinical trials of Alzheimer's disease (AD) treatments are a case in point, using the status of mild, moderate or severe disease as outcome measures. As in many other outcome-oriented studies, the disease status may be misclassified. This study estimates the extent of misclassification in an ordinal outcome such as disease status. It also estimates the extent of misclassification of a predictor variable such as genotype status. An ordinal logistic regression model is commonly used to model the relationship between disease status, the effect of treatment, and other predictive factors. A simulation study was done. First, data based on a set of hypothetical parameters and hypothetical rates of misclassification were created. Next, the maximum likelihood method was employed to generate likelihood equations accounting for misclassification. The Nelder-Mead Simplex method was used to solve for the misclassification and model parameters. Finally, this method was applied to an AD dataset to detect the amount of misclassification present. The estimates of the ordinal regression model parameters were close to the hypothetical parameters: β1 was hypothesized at 0.50 and the mean estimate was 0.488; β2 was hypothesized at 0.04 and the mean of the estimates was 0.04. Although the estimates for the rates of misclassification of X1 were not as close as those for β1 and β2, they validate this method. The 0-1 misclassification rate of X1 was hypothesized at 2.98% and the mean of the simulated estimates was 1.54%; in the best case, the misclassification of k from high to medium was hypothesized at 4.87% and had a sample mean of 3.62%. In the AD dataset, the estimated odds ratio for X1 (having both copies of the APOE 4 allele) changed from 1.377 to 1.418, demonstrating that the estimates of the odds ratio changed when the analysis included adjustment for misclassification.
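One way to see how misclassification enters such a likelihood is to compute the category probabilities of a proportional-odds model and then pass them through a misclassification matrix. The sketch below is a simplified assumption, not the study's code: the thresholds and the misclassification matrix are hypothetical, and only the slope 0.50 echoes the hypothesized β1.

```python
import math


def category_probs(thresholds, beta, x):
    """True-category probabilities under a proportional-odds model.

    P(Y <= k | x) = 1 / (1 + exp(-(thresholds[k] - beta * x)))
    """
    cumulative = [1.0 / (1.0 + math.exp(-(t - beta * x))) for t in thresholds]
    cumulative.append(1.0)
    probs = [cumulative[0]]
    for k in range(1, len(cumulative)):
        probs.append(cumulative[k] - cumulative[k - 1])
    return probs


def observed_probs(true_probs, misclass):
    """Probabilities of the recorded category.

    misclass[t][o] = P(recorded category o | true category t); rows sum to 1.
    """
    n = len(true_probs)
    return [sum(true_probs[t] * misclass[t][o] for t in range(n)) for o in range(n)]


# Hypothetical thresholds and misclassification rates for mild/moderate/severe.
true_p = category_probs([-1.0, 1.0], beta=0.50, x=1.0)
misclass = [[0.95, 0.05, 0.00],
            [0.03, 0.94, 0.03],
            [0.00, 0.05, 0.95]]
print(true_p, observed_probs(true_p, misclass))
```

The log-likelihood is then the sum of the logarithms of the observed-category probabilities over subjects, which a derivative-free optimizer such as Nelder-Mead can maximize jointly over the model and misclassification parameters.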

Relevance:

30.00%

Publisher:

Abstract:

Mixture modeling is commonly used to model categorical latent variables that represent subpopulations in which population membership is unknown but can be inferred from the data. In relatively recent years, finite mixture models have been applied to time-to-event data. However, the commonly used survival mixture model assumes that the effects of the covariates involved in failure times differ across latent classes, but that the covariate distribution is homogeneous. The aim of this dissertation is to develop a method to examine time-to-event data in the presence of unobserved heterogeneity under a framework of mixture modeling. A joint model is developed to incorporate the latent survival trajectory along with the observed information for the joint analysis of a time-to-event variable, its discrete and continuous covariates, and a latent class variable. It is assumed that the effects of covariates on survival times and the distribution of covariates vary across different latent classes. The unobservable survival trajectories are identified through estimating the probability that a subject belongs to a particular class based on observed information. We applied this method to a Hodgkin lymphoma study with long-term follow-up and observed four distinct latent classes in terms of long-term survival and distributions of prognostic factors. Our results from simulation studies and from the Hodgkin lymphoma study demonstrated the superiority of our joint model compared with the conventional survival model. This flexible inference method provides more accurate estimation and accommodates unobservable heterogeneity among individuals while taking involved interactions between covariates into consideration.

Relevance:

30.00%

Publisher:

Abstract:

mgof computes goodness-of-fit tests for the distribution of a discrete (categorical, multinomial) variable. The default is to perform classical large sample chi-squared approximation tests based on Pearson's X² statistic and the log likelihood ratio (G²) statistic or a statistic from the Cressie-Read family. Alternatively, mgof computes exact tests using Monte Carlo methods or exhaustive enumeration. A Kolmogorov-Smirnov test for discrete data is also provided. The moremata package, also available from SSC, is required.
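The statistics named above belong to one family, which a short Python sketch can make concrete. This is an illustration of the Cressie-Read power-divergence formula only, not mgof's code or syntax; the function name power_divergence and the example counts are assumptions.

```python
import math


def power_divergence(observed, probs, lam):
    """Cressie-Read power-divergence statistic for multinomial counts.

    lam = 1 gives Pearson's X^2, lam = 0 gives the likelihood ratio G^2.
    lam = -1 is not handled in this sketch.
    """
    n = sum(observed)
    expected = [n * p for p in probs]
    if lam == 0:
        # Limit of the general formula; empty cells contribute zero.
        return 2.0 * sum(o * math.log(o / e)
                         for o, e in zip(observed, expected) if o > 0)
    return 2.0 / (lam * (lam + 1.0)) * sum(
        o * ((o / e) ** lam - 1.0) for o, e in zip(observed, expected))


# Hypothetical counts tested against probabilities 0.25, 0.5, 0.25.
counts = [30, 50, 20]
null = [0.25, 0.5, 0.25]
for lam in (1.0, 0.0, 2.0 / 3.0):
    print(lam, power_divergence(counts, null, lam))
```

Under the null hypothesis each statistic is compared with a chi-squared distribution with (number of categories minus 1) degrees of freedom in the large-sample approximation; the exact tests replace that reference distribution by simulation or enumeration.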

Relevance:

30.00%

Publisher:

Abstract:

The article presents abstracts of papers for a conference on research methods including "On the Folly of Rewarding A While Hoping for B: A Critical Assessment of Theory Development," "All That Jazz: A Methodological Story of Stories," and "An Accounting of Counting: Universalism, Particularism, and the Counting of Qualitative Data."

Relevance:

30.00%

Publisher:

Abstract:

2000 Mathematics Subject Classification: 62P10, 62H30

Relevance:

30.00%

Publisher:

Abstract:

Constant technological advances have caused a data explosion in recent years. Accordingly, modern statistical and machine learning methods must be adapted to deal with complex and heterogeneous data types. This is particularly true for the analysis of biological data. For example, DNA sequence data can be viewed as categorical variables, with each nucleotide taking one of four categories. Gene expression data, depending on the quantification technology, could be continuous numbers or counts. With the advancement of high-throughput technology, such data have become unprecedentedly abundant. Efficient statistical approaches are therefore crucial in this big data era.

Previous statistical methods for big data often aim to find low-dimensional structures in the observed data. For example, a factor analysis model assumes a latent multivariate vector with a Gaussian distribution. With this assumption, a factor model produces a low-rank estimate of the covariance of the observed variables. Another example is the latent Dirichlet allocation model for documents, in which the mixture proportions of topics are assumed to be represented by a Dirichlet-distributed variable. This dissertation proposes several novel extensions to the previous statistical methods that are developed to address challenges in big data. These novel methods are applied in multiple real-world applications, including construction of condition-specific gene co-expression networks, estimation of shared topics among newsgroups, analysis of promoter sequences, analysis of political-economics risk data and estimation of population structure from genotype data.

Relevance:

30.00%

Publisher:

Abstract:

Continuous variables are one of the major data types collected by survey organizations. They can be incomplete, so that the data collectors need to fill in the missing values, or they can contain sensitive information that needs protection from re-identification. One approach to protecting continuous microdata is to sum them up according to different cells of features. In this thesis, I present novel methods of multiple imputation (MI) that can be applied to impute missing values and synthesize confidential values for continuous and magnitude data.

The first method is for limiting the disclosure risk of the continuous microdata whose marginal sums are fixed. The motivation for developing such a method comes from the magnitude tables of non-negative integer values in economic surveys. I present approaches based on a mixture of Poisson distributions to describe the multivariate distribution so that the marginals of the synthetic data are guaranteed to sum to the original totals. At the same time, I present methods for assessing disclosure risks in releasing such synthetic magnitude microdata. The illustration on a survey of manufacturing establishments shows that the disclosure risks are low while the information loss is acceptable.
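A standard fact helps explain why Poisson-based models can preserve totals: independent Poisson counts, conditional on their sum, follow a multinomial distribution with probabilities proportional to the Poisson rates. The sketch below uses that fact for a single Poisson component and a single fixed total; it is a simplified assumption, not the thesis's mixture model, and the rates and the function name synthesize_fixed_total are hypothetical.

```python
import random
from collections import Counter


def synthesize_fixed_total(total, rates, rng):
    """Draw synthetic non-negative integer counts that sum exactly to total.

    Equivalent to one multinomial draw with probabilities rates / sum(rates),
    which is the distribution of independent Poisson counts given their sum.
    """
    cells = rng.choices(range(len(rates)), weights=rates, k=total)
    tally = Counter(cells)
    return [tally.get(i, 0) for i in range(len(rates))]


# Hypothetical margin: 120 units spread over four establishments.
rng = random.Random(0)
synthetic = synthesize_fixed_total(120, [5.0, 20.0, 60.0, 35.0], rng)
print(synthetic, sum(synthetic))
```

Each synthetic dataset drawn this way reproduces the published total exactly while the individual cell values vary from draw to draw.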

The second method is for releasing synthetic continuous microdata by a nonstandard MI method. Traditionally, MI fits a model on the confidential values and then generates multiple synthetic datasets from this model. Its disclosure risk tends to be high, especially when the original data contain extreme values. I present a nonstandard MI approach conditioned on the protective intervals. Its basic idea is to estimate the model parameters from these intervals rather than the confidential values. The encouraging results of simple simulation studies suggest the potential of this new approach in limiting the posterior disclosure risk.

The third method is for imputing missing values in continuous and categorical variables. It extends a hierarchically coupled mixture model with local dependence. However, the new method separates the variables into non-focused (e.g., almost-fully-observed) and focused (e.g., missing-a-lot) ones. The sub-model structure of focused variables is more complex than that of non-focused ones. At the same time, their cluster indicators are linked together by tensor factorization, and the focused continuous variables depend locally on non-focused values. The model properties suggest that moving the strongly associated non-focused variables to the side of focused ones can help to improve estimation accuracy, which is examined in several simulation studies. The method is applied to data from the American Community Survey.