3 results for categorical and mix datasets
at Duke University
Abstract:
Purpose: To build a model that predicts survival time for patients treated with stereotactic radiosurgery for brain metastases, using support vector machine (SVM) regression.
Methods and Materials: This study used data from 481 patients, who were randomly divided equally into training and validation datasets. The SVM model used a Gaussian RBF kernel function, along with parameters such as the size of the epsilon-insensitive region and the cost parameter (C), which control the amount of error tolerated by the model. The predictor variables for the SVM model consisted of the actual survival time of the patient, the number of brain metastases, the graded prognostic assessment (GPA) and Karnofsky Performance Scale (KPS) scores, the prescription dose, and the largest planning target volume (PTV). The response of the model was the survival time of the patient. The resulting survival time predictions were analyzed against the actual survival times by single-parameter classification and two-parameter classification. The predicted mean survival times within each classification were compared with the actual values to obtain the confidence interval associated with the model’s predictions. In addition to visualizing the data on plots using the means and error bars, the correlation coefficients between the actual and predicted means of the survival times were calculated at each step of the classification.
Results: The number of metastases and the KPS score were consistently shown to be the strongest predictors in the single-parameter classification, and were subsequently used as first classifiers in the two-parameter classification. When the survival times were analyzed with the number of metastases as the first classifier, the best correlation was obtained for patients with 3 metastases, while patients with 4 or 5 metastases had significantly worse results. When the KPS score was used as the first classifier, patients with a KPS score of 60 and those with a score of 90/100 had similarly strong correlations. These mixed results are likely due to the limited data available for patients with more than 3 metastases or KPS scores of 60 or less.
Conclusions: The number of metastases and the KPS score were both shown to be strong predictors of patient survival time. The model was less accurate for patients with more metastases and for certain KPS scores because of the lack of training data.
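The roles of the kernel, the epsilon-insensitive region, and the cost parameter named in the Methods can be illustrated with a minimal Python sketch. This is not the study's code: the function names, the `gamma` kernel width, and the toy survival times are assumptions made purely for illustration.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian RBF kernel: exp(-gamma * ||x - z||^2). `gamma` is an assumed width."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def eps_insensitive_loss(actual, predicted, epsilon):
    """Zero for errors inside the epsilon region, linear in the excess outside it."""
    return max(0.0, abs(actual - predicted) - epsilon)

def error_penalty(actuals, predictions, epsilon, cost):
    """Cost-weighted sum of losses: the error term of the SVM regression objective."""
    losses = (eps_insensitive_loss(a, p, epsilon) for a, p in zip(actuals, predictions))
    return cost * sum(losses)

# Hypothetical survival times in months (not data from the study).
actual = [6.0, 12.0, 20.0]
predicted = [7.0, 9.0, 20.5]
penalty = error_penalty(actual, predicted, epsilon=1.0, cost=10.0)
```

Only the second prediction misses by more than `epsilon`, so it alone contributes to the penalty. A larger `cost` makes such misses more expensive, and a wider `epsilon` lets more of them go unpenalized.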
Abstract:
This paper introduces two new datasets on national-level elections from 1975 to 2004. The data are grouped into two separate datasets, the Quality of Elections Data and the Data on International Election Monitoring. Together these datasets provide original information on elections, election observation and election quality, and will enable researchers to study a variety of research questions. The datasets will be publicly available and are maintained at a project website.
Abstract:
Continuous variables are one of the major data types collected by survey organizations. They can be incomplete, so that data collectors need to fill in the missing values, or they can contain sensitive information that needs protection from re-identification. One approach to protecting continuous microdata is to sum them up within cells defined by different features. In this thesis, I present novel methods of multiple imputation (MI) that can be applied to impute missing values and synthesize confidential values for continuous and magnitude data.
The first method is for limiting the disclosure risk of continuous microdata whose marginal sums are fixed. The motivation for developing such a method comes from the magnitude tables of non-negative integer values in economic surveys. I present approaches based on a mixture of Poisson distributions to describe the multivariate distribution so that the marginals of the synthetic data are guaranteed to sum to the original totals. I also present methods for assessing disclosure risks in releasing such synthetic magnitude microdata. An illustration using a survey of manufacturing establishments shows that the disclosure risks are low while the information loss is acceptable.
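One standard fact that makes fixed totals compatible with Poisson models is that independent Poisson counts, conditioned on their sum, follow a multinomial distribution with probabilities proportional to the Poisson rates. The sketch below uses that fact with a single set of hypothetical rates, not the thesis's mixture model, to draw synthetic non-negative integer counts that sum exactly to a given total. The function name, rates, and total are assumptions for illustration.

```python
import random

def synthesize_fixed_total(rates, total, rng):
    """Draw non-negative integer counts that sum exactly to `total`.

    Independent Poisson(rate_i) counts conditioned on their sum are
    multinomial(total, rate_i / sum(rates)), so we sample that directly
    by assigning each of the `total` units to one cell.
    """
    counts = [0] * len(rates)
    cells = range(len(rates))
    for cell in rng.choices(cells, weights=rates, k=total):
        counts[cell] += 1
    return counts

rng = random.Random(0)      # fixed seed so the draw is reproducible
rates = [5.0, 1.0, 0.5]     # hypothetical Poisson rates, one per cell
original_total = 40         # hypothetical total to preserve
synthetic = synthesize_fixed_total(rates, original_total, rng)
```

Whatever values are drawn, the synthetic counts preserve the total by construction, which is the property the paragraph above requires of the marginals.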
The second method is for releasing synthetic continuous microdata by a nonstandard MI method. Traditionally, MI fits a model on the confidential values and then generates multiple synthetic datasets from this model. Its disclosure risk tends to be high, especially when the original data contain extreme values. I present a nonstandard MI approach conditioned on protective intervals. Its basic idea is to estimate the model parameters from these intervals rather than from the confidential values. The encouraging results of simple simulation studies suggest the potential of this new approach in limiting the posterior disclosure risk.
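To make the idea of estimating parameters from intervals concrete, here is a simplified sketch that swaps in a maximum likelihood fit of a normal model on a coarse grid in place of the thesis's MI procedure. Each confidential value is represented only by a hypothetical interval (lo, hi), and the likelihood uses only the probability the model assigns to each interval. The normal model, the grid search, and all names and numbers are assumptions for illustration.

```python
import math

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of Normal(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def interval_log_likelihood(intervals, mu, sigma):
    """Sum of log P(lo < X < hi); exact values never enter the likelihood."""
    total = 0.0
    for lo, hi in intervals:
        mass = normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)
        total += math.log(max(mass, 1e-300))  # floor avoids log(0) underflow
    return total

def fit_from_intervals(intervals, mu_grid, sigma_grid):
    """Grid search for the (mu, sigma) pair with the highest interval likelihood."""
    candidates = ((mu, sigma) for mu in mu_grid for sigma in sigma_grid)
    return max(candidates, key=lambda p: interval_log_likelihood(intervals, p[0], p[1]))

# Hypothetical protective intervals, arranged symmetrically around 10.
intervals = [(2, 6), (6, 10), (8, 12), (10, 14), (14, 18)]
mu_grid = [m / 2 for m in range(0, 41)]      # 0.0 to 20.0 in steps of 0.5
sigma_grid = [s / 2 for s in range(1, 21)]   # 0.5 to 10.0 in steps of 0.5
mu_hat, sigma_hat = fit_from_intervals(intervals, mu_grid, sigma_grid)
```

Because the fit sees only the intervals, an extreme confidential value influences the estimates no more than any other value falling in the same interval.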
The third method is for imputing missing values in continuous and categorical variables. It extends a hierarchically coupled mixture model with local dependence. However, the new method separates the variables into non-focused (e.g., almost-fully-observed) and focused (e.g., missing-a-lot) ones. The sub-model structure of the focused variables is more complex than that of the non-focused ones. Their cluster indicators are linked together by tensor factorization, and the focused continuous variables depend locally on non-focused values. The model properties suggest that moving the strongly associated non-focused variables to the side of the focused ones can help improve estimation accuracy, which is examined in several simulation studies. The method is applied to data from the American Community Survey.