804 results for missing values
Abstract:
Attrition in longitudinal studies can lead to biased results. The study is motivated by the unexpected observation that alcohol consumption decreased despite increased availability, which may be due to sample attrition of heavy drinkers. Several imputation methods have been proposed, but they have rarely been compared in longitudinal studies of alcohol consumption. The imputation of consumption level measurements is computationally particularly challenging because alcohol consumption is a semi-continuous variable (dichotomous drinking status and continuous volume among drinkers) and the continuous part is non-normal. Data come from a longitudinal study in Denmark with four waves (2003-2006) and 1771 individuals at baseline. Five techniques for missing data are compared: last value carried forward (LVCF) as a single imputation method, and Hotdeck, Heckman modelling, multivariate imputation by chained equations (MICE), and a Bayesian approach as multiple imputation methods. Predictive mean matching was used to account for non-normality: instead of imputing regression estimates, "real" observed values from similar cases are imputed. Methods were also compared by means of a simulated dataset. The simulation showed that the Bayesian approach yielded the least biased estimates for imputation. The finding of no increase in consumption levels despite a higher availability remained unaltered. Copyright (C) 2011 John Wiley & Sons, Ltd.
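As a rough illustration of two of the techniques named in this abstract, the sketch below shows last value carried forward and a simplified predictive mean matching step. It is not the authors' implementation: the function names `lvcf` and `pmm_impute`, the single-predictor regression, and the donor count `k` are assumptions made for the example, and a full PMM (as in MICE) would also draw the regression parameters from their posterior before matching.

```python
import random
from typing import List, Optional, Sequence


def lvcf(waves: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Last value carried forward: fill each missing wave (None) with the
    most recent observed value. Leading gaps stay missing."""
    filled: List[Optional[float]] = []
    last: Optional[float] = None
    for value in waves:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def pmm_impute(
    x_obs: Sequence[float],
    y_obs: Sequence[float],
    x_mis: Sequence[float],
    k: int = 3,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Simplified predictive mean matching with one predictor (hypothetical helper).

    Fits y ~ x by least squares on the observed cases, then, for each case
    with missing y, picks one of the k observed cases whose predicted value
    is closest and imputes that donor's *observed* y.
    """
    rng = rng or random.Random(0)
    n = len(x_obs)
    mean_x = sum(x_obs) / n
    mean_y = sum(y_obs) / n
    sxx = sum((x - mean_x) ** 2 for x in x_obs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_obs, y_obs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    pred_obs = [intercept + slope * x for x in x_obs]
    imputed: List[float] = []
    for x in x_mis:
        target = intercept + slope * x
        # Indices of the k observed cases with the closest predicted means.
        donors = sorted(range(n), key=lambda i: abs(pred_obs[i] - target))[:k]
        imputed.append(y_obs[rng.choice(donors)])
    return imputed
```

Because PMM only ever returns values that were actually observed, imputed consumption volumes cannot fall outside the observed range, which is why it is attractive for skewed, non-normal data.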
Abstract:
Genome-wide association studies (GWAS) are conducted with the promise to discover novel genetic variants associated with diverse traits. For most traits, associated markers individually explain just a modest fraction of the phenotypic variation, but their number can well be in the hundreds. We developed a maximum likelihood method that allows us to infer the distribution of associated variants even when many of them were missed by chance. Compared to previous approaches, the novelty of our method is that it (a) does not require having an independent (unbiased) estimate of the effect sizes; (b) makes use of the complete distribution of P-values while allowing for the false discovery rate; (c) takes into account allelic heterogeneity and the SNP pruning strategy. We applied our method to the latest GWAS meta-analysis results of the GIANT consortium. It revealed that while the explained variance of genome-wide (GW) significant SNPs is around 1% for waist-hip ratio (WHR), the observed P-values provide evidence for the existence of variants explaining 10% (CI=[8.5-11.5%]) of the phenotypic variance in total. Similarly, the total explained variance likely to exist for height is estimated to be 29% (CI=[28-30%]), three times higher than what the observed GW significant SNPs give rise to. This methodology also enables us to predict the benefit of future GWA studies that aim to reveal more associated genetic markers via increased sample size.
Abstract:
Nearly all chemistry–climate models (CCMs) have a systematic bias of a delayed springtime breakdown of the Southern Hemisphere (SH) stratospheric polar vortex, implying insufficient stratospheric wave drag. In this study the Canadian Middle Atmosphere Model (CMAM) and the CMAM Data Assimilation System (CMAM-DAS) are used to investigate the cause of this bias. Zonal wind analysis increments from CMAM-DAS reveal systematic negative values in the stratosphere near 60°S in winter and early spring. These are interpreted as indicating a bias in the model physics, namely, missing gravity wave drag (GWD). The negative analysis increments remain at a nearly constant height during winter and descend as the vortex weakens, much like orographic GWD. This region is also where current orographic GWD parameterizations have a gap in wave drag, which is suggested to be unrealistic because of missing effects in those parameterizations. These findings motivate a pair of free-running CMAM simulations to assess the impact of extra orographic GWD at 60°S. The control simulation exhibits the cold-pole bias and delayed vortex breakdown seen in the CCMs. In the simulation with extra GWD, the cold-pole bias is significantly reduced and the vortex breaks down earlier. Changes in resolved wave drag in the stratosphere also occur in response to the extra GWD, which reduce stratospheric SH polar-cap temperature biases in late spring and early summer. Reducing the dynamical biases, however, results in degraded Antarctic column ozone. This suggests that CCMs that obtain realistic column ozone in the presence of an overly strong and persistent vortex may be doing so through compensating errors.
Abstract:
When missing data occur in studies designed to compare the accuracy of diagnostic tests, a common, though naive, practice is to base the comparison of sensitivity, specificity, as well as of positive and negative predictive values on some subset of the data that fits into methods implemented in standard statistical packages. Such methods are usually valid only under the strong missing completely at random (MCAR) assumption and may generate biased and less precise estimates. We review some models that use the dependence structure of the completely observed cases to incorporate the information of the partially categorized observations into the analysis and show how they may be fitted via a two-stage hybrid process involving maximum likelihood in the first stage and weighted least squares in the second. We indicate how computational subroutines written in R may be used to fit the proposed models and illustrate the different analysis strategies with observational data collected to compare the accuracy of three distinct non-invasive diagnostic methods for endometriosis. The results indicate that even when the MCAR assumption is plausible, the naive partial analyses should be avoided.
Abstract:
Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq)
Abstract:
This Letter describes the search for an enhanced production rate of events with a charged lepton and a neutrino in high-energy pp collisions at the LHC. The analysis uses data collected with the CMS detector, with an integrated luminosity of 5.0 fb⁻¹ at √s = 7 TeV, and a further 3.7 fb⁻¹ at √s = 8 TeV. No evidence is found for an excess. The results are interpreted in terms of limits on a heavy charged gauge boson (W′) in the sequential standard model, a split universal extra dimension model, and contact interactions in the helicity-nonconserving model. For the last, values of the binding energy below 10.5 (8.8) TeV in the electron (muon) channel are excluded at a 95% confidence level. Interpreting the ℓν final state in terms of a heavy W′ with standard model couplings, masses below 2.90 TeV are excluded. © 2013 CERN.
Abstract:
A search for diphoton events with large missing transverse energy is presented. The data were collected with the ATLAS detector in proton-proton collisions at √s = 7 TeV at the CERN Large Hadron Collider and correspond to an integrated luminosity of 3.1 pb⁻¹. No excess of such events is observed above the standard model background prediction. In the context of a specific model with one universal extra dimension with compactification radius R and gravity-induced decays, values of 1/R < 729 GeV are excluded at 95% CL, providing the most sensitive limit on this model to date.
Abstract:
A search for supersymmetry (SUSY) in events with large missing transverse momentum, jets, at least one hadronically decaying tau lepton and zero or one additional light leptons (electron/muon) has been performed using 20.3 fb⁻¹ of proton-proton collision data at √s = 8 TeV recorded with the ATLAS detector at the Large Hadron Collider. No excess above the Standard Model background expectation is observed in the various signal regions, and 95% confidence level upper limits on the visible cross section for new phenomena are set. The results of the analysis are interpreted in several SUSY scenarios, significantly extending previous limits obtained in the same final states. In the framework of minimal gauge-mediated SUSY breaking models, values of the SUSY breaking scale Λ below 63 TeV are excluded, independently of tan β. Exclusion limits are also derived for an mSUGRA/CMSSM model, in both the R-parity-conserving and R-parity-violating case. A further interpretation is presented in a framework of natural gauge mediation, in which the gluino is assumed to be the only light coloured sparticle and gluino masses below 1090 GeV are excluded.
Abstract:
The purpose of this study is to investigate the effects of predictor variable correlations and patterns of missingness with dichotomous and/or continuous data in small samples when missing data are multiply imputed. Missing data of predictor variables are multiply imputed under three different multivariate models: the multivariate normal model for continuous data, the multinomial model for dichotomous data and the general location model for mixed dichotomous and continuous data. Subsequent to the multiple imputation process, Type I error rates of the regression coefficients obtained with logistic regression analysis are estimated under various conditions of correlation structure, sample size, type of data and patterns of missing data. The distributional properties of average mean, variance and correlations among the predictor variables are assessed after the multiple imputation process. For continuous predictor data under the multivariate normal model, Type I error rates are generally within the nominal values with samples of size n = 100. Smaller samples of size n = 50 resulted in more conservative estimates (i.e., lower than the nominal value). Correlation and variance estimates of the original data are retained after multiple imputation with less than 50% missing continuous predictor data. For dichotomous predictor data under the multinomial model, Type I error rates are generally conservative, which in part is due to the sparseness of the data. The correlation structure for the predictor variables is not well retained on multiply-imputed data from small samples with more than 50% missing data with this model. For mixed continuous and dichotomous predictor data, the results are similar to those found under the multivariate normal model for continuous data and under the multinomial model for dichotomous data.
With all data types, a fully-observed variable included with variables subject to missingness in the multiple imputation process and subsequent statistical analysis provided liberal (larger than nominal values) Type I error rates under a specific pattern of missing data. It is suggested that future studies focus on the effects of multiple imputation in multivariate settings with more realistic data characteristics and a variety of multivariate analyses, assessing both Type I error and power.
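The abstract does not say how the regression coefficients from the multiply imputed datasets were combined. A common choice is Rubin's rules, sketched below under that assumption; the function name `pool_rubin` and its inputs are hypothetical and not taken from the study.

```python
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_rubin(
    estimates: Sequence[float], variances: Sequence[float]
) -> Tuple[float, float]:
    """Pool one coefficient across m imputed datasets with Rubin's rules.

    estimates: the coefficient estimated in each imputed dataset
    variances: the squared standard error of that coefficient in each dataset
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    q_bar = mean(estimates)            # pooled point estimate
    u_bar = mean(variances)            # within-imputation variance
    b = variance(estimates)            # between-imputation variance (m - 1 denominator)
    total = u_bar + (1 + 1 / m) * b    # total variance
    return q_bar, total
```

The between-imputation term is what carries the extra uncertainty due to missingness into the standard error; ignoring it would make tests of the coefficient too liberal.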
Abstract:
With most clinical trials, missing data present a statistical problem in evaluating a treatment's efficacy. There are many methods commonly used to address missing data; however, these methods leave room for bias to enter the study. This thesis was a secondary analysis of data taken from TIME, a phase 2 randomized clinical trial conducted to evaluate the safety and effect of the administration timing of bone marrow mononuclear cells (BMMNC) for subjects with acute myocardial infarction (AMI). We evaluated the effect of missing data by comparing the variance inflation factor (VIF) of the effect of therapy between all subjects and only subjects with complete data. Through the general linear model, an unbiased solution was derived for the VIF of the treatment's efficacy using the weighted least squares method to incorporate missing data. Two groups were identified from the TIME data: 1) all subjects and 2) subjects with complete data (baseline and follow-up measurements). After the general solution was found for the VIF, it was migrated to Excel 2010 to evaluate data from TIME. The resulting numerical values from the two groups were compared to assess the effect of missing data. The VIF values from the TIME study were considerably lower in the group with missing data. By design, we varied the correlation factor in order to evaluate the VIFs of both groups. As the correlation factor increased, the VIF values increased at a faster rate in the group with only complete data. Furthermore, while varying the correlation factor, the number of subjects with missing data was also varied to see how missing data affect the VIF. When the number of subjects with only baseline data was increased, we saw a significant rate increase in VIF values in the group with only complete data, while the group with missing data saw a steady and consistent increase in the VIF. The same was seen when we varied the group with follow-up only data.
This essentially showed that the VIFs increase steadily when missing data are not ignored. When missing data are ignored, as with our comparison group, the VIF values increase sharply as correlation increases.
Abstract:
The fuzzy min–max neural network classifier is a supervised learning method. This classifier takes the hybrid neural networks and fuzzy systems approach. All input variables in the network are required to correspond to continuously valued variables, and this can be a significant constraint in many real-world situations where there are not only quantitative but also categorical data. The usual way of dealing with this type of variable is to replace the categorical values with numerical ones and treat them as if they were continuously valued. However, this method implicitly defines a possibly unsuitable metric for the categories. A number of different procedures have been proposed to tackle the problem. In this article, we present a new method. The procedure extends the fuzzy min–max neural network input to categorical variables by introducing new fuzzy sets, a new operation, and a new architecture. This provides for greater flexibility and wider application. The proposed method is then applied to missing data imputation in voting intention polls. The micro data (the set of the respondents' individual answers to the questions) of this type of poll are especially suited for evaluating the method since they include a large number of numerical and categorical attributes.
Abstract:
Objective: An estimation of cut-off points for the diagnosis of diabetes mellitus (DM) based on individual risk factors. Methods: A subset of the 1991 Oman National Diabetes Survey is used, including all patients with a 2 h post glucose load >= 200 mg/dl (278 subjects) and a control group of 286 subjects. All subjects previously diagnosed as diabetic and all subjects with missing data values were excluded. The data set was analyzed by use of the SPSS Clementine data mining system. Decision Tree Learners (C5 and CART) and a method for mining association rules (the GRI algorithm) are used. The fasting plasma glucose (FPG), age, sex, family history of diabetes and body mass index (BMI) are input risk factors (independent variables), while diabetes onset (the 2 h post glucose load >= 200 mg/dl) is the output (dependent variable). All three techniques used were tested by use of cross-validation (89.8%). Results: Rules produced for diabetes diagnosis are: A- GRI algorithm (1) FPG>=108.9 mg/dl, (2) FPG>=107.1 and age>39.5 years. B- CART decision trees: FPG >=110.7 mg/dl. C- The C5 decision tree learner: (1) FPG>=95.5 and 54, (2) FPG>=106 and 25.2 kg/m². (3) FPG>=106 and =133 mg/dl. The three techniques produced rules which cover a significant number of cases (82%), with confidence between 74 and 100%. Conclusion: Our approach supports the suggestion that the present cut-off value of fasting plasma glucose (126 mg/dl) for the diagnosis of diabetes mellitus needs revision, and that individual risk factors such as age and BMI should be considered in defining the new cut-off value.
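To make the reported rules concrete, the sketch below encodes the GRI and CART thresholds quoted above as simple predicates. The function names are hypothetical, and treating GRI rules (1) and (2) as alternatives (either one flags a case) is an assumption. The C5 rules are not encoded because their conditions are incomplete in the text.

```python
def gri_positive(fpg_mg_dl: float, age_years: float) -> bool:
    """GRI rules as quoted: (1) FPG >= 108.9 mg/dl, or
    (2) FPG >= 107.1 mg/dl and age > 39.5 years.
    Combining them with OR is an assumption of this sketch."""
    return fpg_mg_dl >= 108.9 or (fpg_mg_dl >= 107.1 and age_years > 39.5)


def cart_positive(fpg_mg_dl: float) -> bool:
    """CART rule as quoted: FPG >= 110.7 mg/dl."""
    return fpg_mg_dl >= 110.7
```

For example, a 45-year-old with an FPG of 107.5 mg/dl would be flagged by the GRI rules but not by the CART rule, and both thresholds sit well below the 126 mg/dl cut-off the abstract argues should be revised.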
Abstract:
We present a new method for ecologically sustainable land use planning within multiple land use schemes. Our aims were (1) to develop a method that can be used to locate important areas based on their ecological values; (2) to evaluate the quality, quantity, availability, and usability of existing ecological data sets; and (3) to demonstrate the use of the method in Eastern Finland, where there are requirements for the simultaneous development of nature conservation, tourism, and recreation. We compiled all available ecological data sets from the study area, complemented the missing data using habitat suitability modeling, calculated the total ecological score (TES) for each 1 ha grid cell in the study area, and finally, demonstrated the use of TES in assessing the success of nature conservation in covering ecologically valuable areas and locating ecologically sustainable areas for tourism and recreational infrastructure. The method operated quite well at the level required for regional and local scale planning. The quality, quantity, availability, and usability of existing data sets were generally high, and they could be further complemented by modeling. There are still constraints that limit the use of the method in practical land use planning. However, as increasing data become available and open access, and modeling tools improve, the usability and applicability of the method will increase.
Abstract:
Current data indicate that the size of high-density lipoprotein (HDL) may be considered an important marker for cardiovascular disease risk. We established reference values of mean HDL size and volume in an asymptomatic representative Brazilian population sample (n=590) and their associations with metabolic parameters by gender. Size and volume were determined in HDL isolated from plasma by polyethylene glycol precipitation of apoB-containing lipoproteins and measured using the dynamic light scattering (DLS) technique. Although the gender and age distributions agreed with other studies, the mean HDL size reference value was slightly lower than in some other populations. Both HDL size and volume were influenced by gender and varied according to age. HDL size was associated with age and HDL-C (total population); non-white ethnicity and CETP inversely (females); HDL-C and PLTP mass (males). On the other hand, HDL volume was determined only by HDL-C (total population and in both genders) and by PLTP mass (males). The reference values for mean HDL size and volume using the DLS technique were established in an asymptomatic and representative Brazilian population sample, as well as their related metabolic factors. HDL-C was a major determinant of HDL size and volume, which were differently modulated in females and in males.