910 resultados para Missing Data
Resumo:
The paper investigates a Bayesian hierarchical model for the analysis of categorical longitudinal data from a large social survey of immigrants to Australia. Data for each subject are observed on three separate occasions, or waves, of the survey. One of the features of the data set is that observations for some variables are missing for at least one wave. A model for the employment status of immigrants is developed by introducing, at the first stage of a hierarchical model, a multinomial model for the response and then subsequent terms are introduced to explain wave and subject effects. To estimate the model, we use the Gibbs sampler, which allows missing data for both the response and the explanatory variables to be imputed at each iteration of the algorithm, given some appropriate prior distributions. After accounting for significant covariate effects in the model, results show that the relative probability of remaining unemployed diminished with time following arrival in Australia.
Resumo:
Objective: An estimation of cut-off points for the diagnosis of diabetes mellitus (DM) based on individual risk factors. Methods: A subset of the 1991 Oman National Diabetes Survey is used, including all patients with a 2h post glucose load >= 200 mg/dl (278 subjects) and a control group of 286 subjects. All subjects previously diagnosed as diabetic and all subjects with missing data values were excluded. The data set was analyzed by use of the SPSS Clementine data mining system. Decision Tree Learners (C5 and CART) and a method for mining association rules (the GRI algorithm) are used. The fasting plasma glucose (FPG), age, sex, family history of diabetes and body mass index (BMI) are input risk factors (independent variables), while diabetes onset (the 2h post glucose load >= 200 mg/dl) is the output (dependent variable). All three techniques used were tested by use of crossvalidation (89.8%). Results: Rules produced for diabetes diagnosis are: A- GRI algorithm (1) FPG>=108.9 mg/dl, (2) FPG>=107.1 and age>39.5 years. B- CART decision trees: FPG >=110.7 mg/dl. C- The C5 decision tree learner: (1) FPG>=95.5 and 54, (2) FPG>=106 and 25.2 kg/m2. (3) FPG>=106 and =133 mg/dl. The three techniques produced rules which cover a significant number of cases (82%), with confidence between 74 and 100%. Conclusion: Our approach supports the suggestion that the present cut-off value of fasting plasma glucose (126 mg/dl) for the diagnosis of diabetes mellitus needs revision, and the individual risk factors such as age and BMI should be considered in defining the new cut-off value.
Resumo:
Visualising data for exploratory analysis is a major challenge in many applications. Visualisation allows scientists to gain insight into the structure and distribution of the data, for example finding common patterns and relationships between samples as well as variables. Typically, visualisation methods like principal component analysis and multi-dimensional scaling are employed. These methods are favoured because of their simplicity, but they cannot cope with missing data and it is difficult to incorporate prior knowledge about properties of the variable space into the analysis; this is particularly important in the high-dimensional, sparse datasets typical in geochemistry. In this paper we show how to utilise a block-structured correlation matrix using a modification of a well known non-linear probabilistic visualisation model, the Generative Topographic Mapping (GTM), which can cope with missing data. The block structure supports direct modelling of strongly correlated variables. We show that including prior structural information it is possible to improve both the data visualisation and the model fit. These benefits are demonstrated on artificial data as well as a real geochemical dataset used for oil exploration, where the proposed modifications improved the missing data imputation results by 3 to 13%.
Resumo:
Heterogeneous and incomplete datasets are common in many real-world visualisation applications. The probabilistic nature of the Generative Topographic Mapping (GTM), which was originally developed for complete continuous data, can be extended to model heterogeneous (i.e. containing both continuous and discrete values) and missing data. This paper describes and assesses the resulting model on both synthetic and real-world heterogeneous data with missing values.
Resumo:
A certain type of bacterial inclusion, known as a bacterial microcompartment, was recently identified and imaged through cryo-electron tomography. A reconstructed 3D object from single-axis limited angle tilt-series cryo-electron tomography contains missing regions and this problem is known as the missing wedge problem. Due to missing regions on the reconstructed images, analyzing their 3D structures is a challenging problem. The existing methods overcome this problem by aligning and averaging several similar shaped objects. These schemes work well if the objects are symmetric and several objects with almost similar shapes and sizes are available. Since the bacterial inclusions studied here are not symmetric, are deformed, and show a wide range of shapes and sizes, the existing approaches are not appropriate. This research develops new statistical methods for analyzing geometric properties, such as volume, symmetry, aspect ratio, polyhedral structures etc., of these bacterial inclusions in presence of missing data. These methods work with deformed and non-symmetric varied shaped objects and do not necessitate multiple objects for handling the missing wedge problem. The developed methods and contributions include: (a) an improved method for manual image segmentation, (b) a new approach to 'complete' the segmented and reconstructed incomplete 3D images, (c) a polyhedral structural distance model to predict the polyhedral shapes of these microstructures, (d) a new shape descriptor for polyhedral shapes, named as polyhedron profile statistic, and (e) the Bayes classifier, linear discriminant analysis and support vector machine based classifiers for supervised incomplete polyhedral shape classification. Finally, the predicted 3D shapes for these bacterial microstructures belong to the Johnson solids family, and these shapes along with their other geometric properties are important for better understanding of their chemical and biological characteristics.
Resumo:
Estimates of HIV prevalence are important for policy in order to establish the health status of a country's population and to evaluate the effectiveness of population-based interventions and campaigns. However, participation rates in testing for surveillance conducted as part of household surveys, on which many of these estimates are based, can be low. HIV positive individuals may be less likely to participate because they fear disclosure, in which case estimates obtained using conventional approaches to deal with missing data, such as imputation-based methods, will be biased. We develop a Heckman-type simultaneous equation approach which accounts for non-ignorable selection, but unlike previous implementations, allows for spatial dependence and does not impose a homogeneous selection process on all respondents. In addition, our framework addresses the issue of separation, where for instance some factors are severely unbalanced and highly predictive of the response, which would ordinarily prevent model convergence. Estimation is carried out within a penalized likelihood framework where smoothing is achieved using a parametrization of the smoothing criterion which makes estimation more stable and efficient. We provide the software for straightforward implementation of the proposed approach, and apply our methodology to estimating national and sub-national HIV prevalence in Swaziland, Zimbabwe and Zambia.
Resumo:
ABSTRACT Researchers frequently have to analyze scales in which some participants have failed to respond to some items. In this paper we focus on the exploratory factor analysis of multidimensional scales (i.e., scales that consist of a number of subscales) where each subscale is made up of a number of Likert-type items, and the aim of the analysis is to estimate participants' scores on the corresponding latent traits. We propose a new approach to deal with missing responses in such a situation that is based on (1) multiple imputation of non-responses and (2) simultaneous rotation of the imputed datasets. We applied the approach in a real dataset where missing responses were artificially introduced following a real pattern of non-responses, and a simulation study based on artificial datasets. The results show that our approach (specifically, Hot-Deck multiple imputation followed of Consensus Promin rotation) was able to successfully compute factor score estimates even for participants that have missing data.
Resumo:
Hot and cold temperatures significantly increase mortality rates around the world, but which measure of temperature is the best predictor of mortality is not known. We used mortality data from 107 US cities for the years 1987–2000 and examined the association between temperature and mortality using Poisson regression and modelled a non-linear temperature effect and a non-linear lag structure. We examined mean, minimum and maximum temperature with and without humidity, and apparent temperature and the Humidex. The best measure was defined as that with the minimum cross-validated residual. We found large differences in the best temperature measure between age groups, seasons and cities, and there was no one temperature measure that was superior to the others. The strong correlation between different measures of temperature means that, on average, they have the same predictive ability. The best temperature measure for new studies can be chosen based on practical concerns, such as choosing the measure with the least amount of missing data.
Resumo:
The aim of this study was to design and validate an interviewer-administered pelvic floor questionnaire that integrates bladder, bowel and sexual function, pelvic organ prolapse, severity, bothersomeness and condition-specific quality of life. Validation testing of the questionnaire was performed using data from 106 urogynaecological patients and a separately sampled community cohort of 49 women. Missing data did not exceed 2% for any question. It distinguished community and urogynaecological populations regarding pelvic floor dysfunction. The bladder domain correlated with the short version of the Urogenital Distress Inventory, bowel function with an established bowel questionnaire and prolapse symptoms with the International Continence Society prolapse quantification. Sexual function assessment reflected scores on the McCoy Female Sexuality Questionnaire. Cronbach’s α coefficients were acceptable in all domains. Kappa coefficients of agreement for the test–retest analyses varied from 0.5 to 1.0. The interviewer-administered pelvic floor questionnaire assessed pelvic floor function in a reproducible and valid fashion in a typical urogynaecological clinic.
Resumo:
Introduction and hypothesis: The aim of this study was to validate a self-administered version of the already validated interviewer-administered Australian pelvic floor questionnaire. Methods: The questionnaire was completed by 163 women attending an urogynecological clinic. Face and convergent validity was assessed. Reliability testing and comparison with the interviewer-administered version was performed in a subset of 105 patients. Responsiveness was evaluated in a subset of 73 women. Results: Missing data did not exceed 4% for any question. Cronbach’s alpha coefficients were acceptable in all domains. Kappa coefficients for the test–retest analyses varied from 0.64–1.0. Prolapse symptoms correlated significantly with the pelvic organ prolapse quantification. Urodynamics confirmed the reported symptom stress incontinence in 70%. The self and interviewer administered questionnaires demonstrated equivalence. Effect sizes ranged from 0.6 to 1.4. Conclusions: This self-administered pelvic floor questionnaire assessed pelvic floor function in a reproducible and valid fashion and due to its responsiveness, can be used for routine clinical assessment and outcome research.
Resumo:
This paper sets out to examine from published literature and crash data analyses whether alcohol in bicycle crashes is an issue about which we should be concerned. It discusses factors that have the potential to increase the number of bicycle crashes in which alcohol is involved (such growth in the size and diversity of the cyclist population, and balance and coordination demands) and factors which may reduce the importance of alcohol in bicycle crashes (such as time of data factors and child riders). It also examines data availability issues that contribute to difficulties in determining the true magnitude of the issue. Methods: This paper reviews previous research and reports analyses of data from Queensland, Australia, that examine the role of alcohol in Police-reported road crashes. In Queensland it is an offence to ride a bicycle or drive a motor vehicle with a BAC exceeding 0.05% (or lower for novice and professional drivers). Results: In the five years 2003-2007, alcohol was reported as involved in 165 bicycle crashes (4%). The bicycle rider was coded as “under the influence” or “over the prescribed BAC limit” in 15 were single unit crashes (12%). In multi-vehicle bicycle crashes, alcohol involvement was reported for 16 cyclists (0.4%) and 110 operators of other vehicles (3%). Additional analyses including characteristics of the cyclist crashes involving alcohol and the importance of missing data will be discussed in the paper. Conclusion: The increase in participation in cycling and the vulnerability of cyclists to injuries support the need to examine the role of alcohol in bicycle crashes. Current data suggest that alcohol on the part of the vehicle driver is a larger concern than alcohol on the part of the cyclist, but improvements in data collection are needed before more precise conclusions can be drawn.
Resumo:
Objectives: The purpose of this study was to describe the use, as well as perceived effectiveness, of mainstream and complementary and alternative medicine (CAM) therapies in the treatment of lymphedema following breast or gynecological cancer. Further, the study assessed the relationship between the characteristics of lymphedema (including type, severity, stability, and duration), and the use of CAM and/or mainstream treatment. Methods: This was a cross-sectional study using a convenience sample of women with lymphedema following breast and gynecological cancers. A self-administered questionnaire was sent to 247 potentially eligible women. Of those returned (50%), 23 were ineligible and 6 were excluded due to level of missing data. Results: In the previous 12 months, the majority of women (90%) had used mainstream treatments to treat their lymphedema, with massage being the most commonly used (86%). One (1) in 2 women had used CAM to treat their lymphedema, and 98% of those using CAM were also using mainstream treatments. Over 27 types of CAM were reported, with use of a chi machine, vitamin E supplements, yoga, and meditation being the most commonly reported forms. The perceived effectiveness ratings (1–7 with 7 = completely effective) of mainstream(mean – standard deviation (SD): 5.3 – 1.5) and CAM therapies (mean – SD: 5.2 + 1.6) were considered high. Conclusions: These results demonstrate that mainstream and CAM treatment use is common, varied, and considered to be effective among women with lymphedema following breast or gynecological cancer. Furthermore, it highlights the immediate need for larger prospective studies assessing the inter-relationship between the use of mainstream and CAM therapies for treatment success.
Resumo:
This article reviews the literature on the outcome of flapless surgery for dental implants in the posterior maxilla. The literature search was carried out in using the keywords: flapless, dental implants and maxilla. A hand search and Medline search were carried out on studies published between 1971 and 2011. The authors included research involving a minimum of 15 dental implants with a followup period of 1 year, an outcome measurement of implant survival, but excluded studies involving multiple simultaneous interventions, and studies with missing data. The Cochrane approach for cohort studies and Oxford Centre for Evidence- Based Medicine were applied. Of the 56 published papers selected, 14 papers on the flapless technique showed high overall implant survival rates. The prospective studies yielded 97.01% (95% CI: 90.72–99.0) while retrospective studies or case series illustrated 95.08% (95% CI: 91.0–97.93) survival. The average of intraoperative complications was 6.55% using the flapless procedure. The limited data obtained showed that flapless surgery in posterior maxilla areas could be a viable and predictable treatment method for implant placement. Flapless surgery tends to be more applicable in this area of the mouth. Further long-term clinical controlled studies are needed.
Resumo:
The ability to estimate the asset reliability and the probability of failure is critical to reducing maintenance costs, operation downtime, and safety hazards. Predicting the survival time and the probability of failure in future time is an indispensable requirement in prognostics and asset health management. In traditional reliability models, the lifetime of an asset is estimated using failure event data, alone; however, statistically sufficient failure event data are often difficult to attain in real-life situations due to poor data management, effective preventive maintenance, and the small population of identical assets in use. Condition indicators and operating environment indicators are two types of covariate data that are normally obtained in addition to failure event and suspended data. These data contain significant information about the state and health of an asset. Condition indicators reflect the level of degradation of assets while operating environment indicators accelerate or decelerate the lifetime of assets. When these data are available, an alternative approach to the traditional reliability analysis is the modelling of condition indicators and operating environment indicators and their failure-generating mechanisms using a covariate-based hazard model. The literature review indicates that a number of covariate-based hazard models have been developed. All of these existing covariate-based hazard models were developed based on the principle theory of the Proportional Hazard Model (PHM). However, most of these models have not attracted much attention in the field of machinery prognostics. Moreover, due to the prominence of PHM, attempts at developing alternative models, to some extent, have been stifled, although a number of alternative models to PHM have been suggested. The existing covariate-based hazard models neglect to fully utilise three types of asset health information (including failure event data (i.e. observed and/or suspended), condition data, and operating environment data) into a model to have more effective hazard and reliability predictions. In addition, current research shows that condition indicators and operating environment indicators have different characteristics and they are non-homogeneous covariate data. Condition indicators act as response variables (or dependent variables) whereas operating environment indicators act as explanatory variables (or independent variables). However, these non-homogenous covariate data were modelled in the same way for hazard prediction in the existing covariate-based hazard models. The related and yet more imperative question is how both of these indicators should be effectively modelled and integrated into the covariate-based hazard model. This work presents a new approach for addressing the aforementioned challenges. The new covariate-based hazard model, which termed as Explicit Hazard Model (EHM), explicitly and effectively incorporates all three available asset health information into the modelling of hazard and reliability predictions and also drives the relationship between actual asset health and condition measurements as well as operating environment measurements. The theoretical development of the model and its parameter estimation method are demonstrated in this work. EHM assumes that the baseline hazard is a function of the both time and condition indicators. Condition indicators provide information about the health condition of an asset; therefore they update and reform the baseline hazard of EHM according to the health state of asset at given time t. Some examples of condition indicators are the vibration of rotating machinery, the level of metal particles in engine oil analysis, and wear in a component, to name but a few. Operating environment indicators in this model are failure accelerators and/or decelerators that are included in the covariate function of EHM and may increase or decrease the value of the hazard from the baseline hazard. These indicators caused by the environment in which an asset operates, and that have not been explicitly identified by the condition indicators (e.g. Loads, environmental stresses, and other dynamically changing environment factors). While the effects of operating environment indicators could be nought in EHM; condition indicators could emerge because these indicators are observed and measured as long as an asset is operational and survived. EHM has several advantages over the existing covariate-based hazard models. One is this model utilises three different sources of asset health data (i.e. population characteristics, condition indicators, and operating environment indicators) to effectively predict hazard and reliability. Another is that EHM explicitly investigates the relationship between condition and operating environment indicators associated with the hazard of an asset. Furthermore, the proportionality assumption, which most of the covariate-based hazard models suffer from it, does not exist in EHM. According to the sample size of failure/suspension times, EHM is extended into two forms: semi-parametric and non-parametric. The semi-parametric EHM assumes a specified lifetime distribution (i.e. Weibull distribution) in the form of the baseline hazard. However, for more industry applications, due to sparse failure event data of assets, the analysis of such data often involves complex distributional shapes about which little is known. Therefore, to avoid the restrictive assumption of the semi-parametric EHM about assuming a specified lifetime distribution for failure event histories, the non-parametric EHM, which is a distribution free model, has been developed. The development of EHM into two forms is another merit of the model. A case study was conducted using laboratory experiment data to validate the practicality of the both semi-parametric and non-parametric EHMs. The performance of the newly-developed models is appraised using the comparison amongst the estimated results of these models and the other existing covariate-based hazard models. The comparison results demonstrated that both the semi-parametric and non-parametric EHMs outperform the existing covariate-based hazard models. Future research directions regarding to the new parameter estimation method in the case of time-dependent effects of covariates and missing data, application of EHM in both repairable and non-repairable systems using field data, and a decision support model in which linked to the estimated reliability results, are also identified.
Resumo:
Background Non-fatal health outcomes from diseases and injuries are a crucial consideration in the promotion and monitoring of individual and population health. The Global Burden of Disease (GBD) studies done in 1990 and 2000 have been the only studies to quantify non-fatal health outcomes across an exhaustive set of disorders at the global and regional level. Neither effort quantified uncertainty in prevalence or years lived with disability (YLDs). Methods Of the 291 diseases and injuries in the GBD cause list, 289 cause disability. For 1160 sequelae of the 289 diseases and injuries, we undertook a systematic analysis of prevalence, incidence, remission, duration, and excess mortality. Sources included published studies, case notification, population-based cancer registries, other disease registries, antenatal clinic serosurveillance, hospital discharge data, ambulatory care data, household surveys, other surveys, and cohort studies. For most sequelae, we used a Bayesian meta-regression method, DisMod-MR, designed to address key limitations in descriptive epidemiological data, including missing data, inconsistency, and large methodological variation between data sources. For some disorders, we used natural history models, geospatial models, back-calculation models (models calculating incidence from population mortality rates and case fatality), or registration completeness models (models adjusting for incomplete registration with health-system access and other covariates). Disability weights for 220 unique health states were used to capture the severity of health loss. YLDs by cause at age, sex, country, and year levels were adjusted for comorbidity with simulation methods. We included uncertainty estimates at all stages of the analysis. Findings Global prevalence for all ages combined in 2010 across the 1160 sequelae ranged from fewer than one case per 1 million people to 350 000 cases per 1 million people. Prevalence and severity of health loss were weakly correlated (correlation coefficient −0·37). In 2010, there were 777 million YLDs from all causes, up from 583 million in 1990. The main contributors to global YLDs were mental and behavioural disorders, musculoskeletal disorders, and diabetes or endocrine diseases. The leading specific causes of YLDs were much the same in 2010 as they were in 1990: low back pain, major depressive disorder, iron-deficiency anaemia, neck pain, chronic obstructive pulmonary disease, anxiety disorders, migraine, diabetes, and falls. Age-specific prevalence of YLDs increased with age in all regions and has decreased slightly from 1990 to 2010. Regional patterns of the leading causes of YLDs were more similar compared with years of life lost due to premature mortality. Neglected tropical diseases, HIV/AIDS, tuberculosis, malaria, and anaemia were important causes of YLDs in sub-Saharan Africa. Interpretation Rates of YLDs per 100 000 people have remained largely constant over time but rise steadily with age. Population growth and ageing have increased YLD numbers and crude rates over the past two decades. Prevalences of the most common causes of YLDs, such as mental and behavioural disorders and musculoskeletal disorders, have not decreased. Health systems will need to address the needs of the rising numbers of individuals with a range of disorders that largely cause disability but not mortality. Quantification of the burden of non-fatal health outcomes will be crucial to understand how well health systems are responding to these challenges. Effective and affordable strategies to deal with this rising burden are an urgent priority for health systems in most parts of the world. Funding Bill & Melinda Gates Foundation.