987 results for missing value imputation
Abstract:
Obtaining attribute values of non-chosen alternatives in a revealed preference context is challenging because non-chosen alternative attributes are unobserved by choosers, chooser perceptions of attribute values may not reflect reality, existing methods for imputing these values suffer from shortcomings, and obtaining non-chosen attribute values is resource intensive. This paper presents a unique Bayesian (multiple) Imputation Multinomial Logit model that imputes unobserved travel times and distances of non-chosen travel modes based on random draws from the conditional posterior distribution of missing values. The calibrated Bayesian (multiple) Imputation Multinomial Logit model imputes non-chosen time and distance values that convincingly replicate observed choice behavior. Although network skims were used for calibration, more realistic data such as supplemental geographically referenced surveys or stated preference data may be preferred. The model is ideally suited for imputing variation in intrazonal non-chosen mode attributes and for assessing the marginal impacts of travel policies, programs, or prices within traffic analysis zones.
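Imputation by random draws from a conditional posterior can be illustrated with a deliberately simplified sketch: assume (purely for illustration, not the paper's actual model) that a missing travel time has a normal conditional distribution estimated from observed values, and take m multiple-imputation draws:

```python
import random
import statistics

# Observed car travel times in minutes (made-up illustrative values)
observed = [22.0, 18.5, 25.0, 30.0, 21.0]

# Assumed normal conditional distribution for the missing value
mu = statistics.mean(observed)
sigma = statistics.stdev(observed)

# m multiple-imputation draws for one missing travel time
random.seed(0)
m = 5
imputations = [random.gauss(mu, sigma) for _ in range(m)]
```

Each draw would complete one copy of the dataset; analyses are then run on every completed copy and their results pooled.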
Abstract:
Purpose Potential positive associations between youth physical activity and wellness scores could emphasize the value of youth physical activity engagement and promotion interventions, beyond the many established physiological and psychological benefits of increased physical activity. The purpose of this study was to explore the associations between adolescents' self-reported physical activity and wellness. Methods This investigation included 493 adolescents (165 males and 328 females) aged between 12 and 15 years. The participants were recruited from six secondary schools of varying socioeconomic status within a metropolitan area. Students were administered the Five-Factor Wellness Inventory and the International Physical Activity Questionnaire for Adolescents to assess wellness and physical activity, respectively. Results The data indicated significant associations between physical activity and wellness. Self-reported physical activity was positively associated with four dimensions (friendship, gender identity, spirituality, and exercise), the higher-order factor physical self, and total wellness, and negatively associated with self-care, self-worth, love, and cultural identity. Conclusion This study suggests that relationships exist between self-reported physical activity and various elements of wellness. Future research should use controlled trials of physical activity and wellness to establish causal links among youth populations. Understanding the nature of these relationships, including causality, has implications for the justification of youth physical activity promotion interventions and the development of youth physical activity engagement programs.
Abstract:
One cannot help but be impressed by the inroads that digital oilfield technologies have made into the exploration and production (E&P) industry in the past decade. Today’s production systems can be monitored by “smart” sensors that allow engineers to observe almost any aspect of performance in real time. Our understanding of how reservoirs are behaving has improved considerably since the dawn of this revolution, and the industry has been able to move away from point answers to more holistic “big picture” integrated solutions. Indeed, the industry has already reaped the rewards of many of these kinds of investments. Many billions of dollars of value have been delivered by this heightened awareness of what is going on within our assets and the world around them (Van Den Berg et al. 2010).
Abstract:
A new database called the World Resource Table is constructed in this study. Missing values are known to produce complications when constructing global databases. This study provides a solution by applying multiple imputation techniques and estimates the global environmental Kuznets curve (EKC) for CO2, SO2, PM10, and BOD. Policy implications for each type of emission are derived based on the results of the EKC using WRI. Finally, we predict the future emissions trend and regional share of CO2 emissions. We find that East Asia and South Asia will increase their emissions share while other major CO2 emitters will still produce large shares of the total global emissions.
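Once the multiply imputed datasets are analyzed, the per-dataset estimates are typically combined with Rubin's rules: the pooled point estimate is the mean of the m estimates, and the total variance adds the between-imputation variance to the average within-imputation variance. A minimal stdlib sketch with made-up numbers (not this study's estimates):

```python
# Per-imputation point estimates and their variances from m = 5
# completed datasets (illustrative values only)
estimates = [1.10, 1.25, 1.05, 1.18, 1.12]
variances = [0.04, 0.05, 0.04, 0.06, 0.05]

m = len(estimates)
q_bar = sum(estimates) / m                      # pooled point estimate
w = sum(variances) / m                          # within-imputation variance
b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation
t = w + (1 + 1 / m) * b                         # total variance (Rubin's rules)
```

The factor (1 + 1/m) inflates the between-imputation component to account for using a finite number of imputations.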
Abstract:
We propose a novel second-order cone programming formulation for designing robust classifiers that can handle uncertainty in observations. Similar formulations are also derived for designing regression functions that are robust to uncertainties in the regression setting. The proposed formulations are independent of the underlying distribution, requiring only the existence of second-order moments. These formulations are then specialized to the case of missing values in observations for both classification and regression problems. Experiments show that the proposed formulations outperform imputation-based approaches.
Abstract:
BACKGROUND: Dropouts and missing data are nearly ubiquitous in obesity randomized controlled trials, threatening the validity and generalizability of conclusions. Herein, we meta-analytically evaluate the extent of missing data, the frequency with which various analytic methods are employed to accommodate dropouts, and the performance of multiple statistical methods. METHODOLOGY/PRINCIPAL FINDINGS: We searched PubMed and Cochrane databases (2000-2006) for articles published in English and manually searched bibliographic references. Articles of pharmaceutical randomized controlled trials with weight loss or weight gain prevention as major endpoints were included. Two authors independently reviewed each publication for inclusion. 121 articles met the inclusion criteria. Two authors independently extracted treatment, sample size, drop-out rates, study duration, and the statistical method used to handle missing data from all articles and resolved disagreements by consensus. In the meta-analysis, drop-out rates were substantial, with the survival (non-dropout) rates being approximated by an exponential decay curve e^(-λt), where λ was estimated to be 0.0088 (95% bootstrap confidence interval: 0.0076 to 0.0100) and t represents time in weeks. The estimated drop-out rate at 1 year was 37%. Most studies used last observation carried forward as the primary analytic method to handle missing data. We also obtained 12 raw obesity randomized controlled trial datasets for empirical analyses. Analyses of raw randomized controlled trial data suggested that both mixed models and multiple imputation performed well, but that multiple imputation may be more robust when missing data are extensive. CONCLUSION/SIGNIFICANCE: Our analysis offers an equation for predicting dropout rates that is useful for future study planning.
Our raw data analyses suggest that multiple imputation is better than other methods for handling missing data in obesity randomized controlled trials, followed closely by mixed models. We suggest these methods supplant last observation carried forward as the primary method of analysis.
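The fitted decay curve gives a quick planning estimate; as a check, plugging the reported λ = 0.0088 and t = 52 weeks into e^(-λt) reproduces the 1-year figure from the abstract:

```python
import math

# Survival (non-dropout) curve from the meta-analysis: S(t) = e^(-lambda * t)
lam = 0.0088           # estimated weekly rate (95% bootstrap CI: 0.0076-0.0100)
t = 52                 # one year, in weeks

survival = math.exp(-lam * t)
dropout = 1 - survival
print(round(dropout, 2))  # ≈ 0.37, the reported 1-year drop-out rate
```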
Abstract:
Retrospective clinical datasets are often characterized by a relatively small sample size and substantial missing data. In this case, a common way of handling the missingness consists in discarding from the analysis patients with missing covariates, further reducing the sample size. Alternatively, if the mechanism that generated the missingness allows, incomplete data can be imputed on the basis of the observed data, avoiding the reduction of the sample size and allowing complete-data methods to be applied later on. Moreover, methodologies for data imputation might depend on the particular purpose and might achieve better results by considering specific characteristics of the domain. The problem of missing data treatment is studied in the context of survival tree analysis for the estimation of a prognostic patient stratification. Survival tree methods usually address this problem by using surrogate splits, that is, splitting rules that use other variables yielding results similar to the original ones. Instead, our methodology consists in modeling the dependencies among the clinical variables with a Bayesian network, which is then used to perform data imputation, thus allowing the survival tree to be applied to the completed dataset. The Bayesian network is learned directly from the incomplete data using a structural expectation-maximization (EM) procedure in which the maximization step is performed with an exact anytime method, so that the only source of approximation is the EM formulation itself. On both simulated and real data, our proposed methodology usually outperformed several existing methods for data imputation, and the imputation so obtained improved the stratification estimated by the survival tree (especially with respect to using surrogate splits).
Adjusting HIV Prevalence Estimates for Non-participation: an Application to Demographic Surveillance
Abstract:
Introduction: HIV testing is a cornerstone of efforts to combat the HIV epidemic, and testing conducted as part of surveillance provides invaluable data on the spread of infection and the effectiveness of campaigns to reduce the transmission of HIV. However, participation in HIV testing can be low, and if respondents systematically select not to be tested because they know or suspect they are HIV positive (and fear disclosure), standard approaches to deal with missing data will fail to remove selection bias. We implemented Heckman-type selection models, which can be used to adjust for missing data that are not missing at random, and established the extent of selection bias in a population-based HIV survey in an HIV hyperendemic community in rural South Africa.
Methods: We used data from a population-based HIV survey carried out in 2009 in rural KwaZulu-Natal, South Africa. In this survey, 5565 women (35%) and 2567 men (27%) provided blood for an HIV test. We accounted for missing data using interviewer identity as a selection variable which predicted consent to HIV testing but was unlikely to be independently associated with HIV status. Our approach involved using this selection variable to examine the HIV status of residents who would ordinarily refuse to test, except that they were allocated a persuasive interviewer. Our copula model allows for flexibility when modelling the dependence structure between HIV survey participation and HIV status.
Results: For women, our selection model generated an HIV prevalence estimate of 33% (95% CI 27–40) for all people eligible to consent to HIV testing in the survey. This estimate is higher than the estimate of 24% generated when only information from respondents who participated in testing is used in the analysis, and the estimate of 27% when imputation analysis is used to predict missing data on HIV status. For men, we found an HIV prevalence of 25% (95% CI 15–35) using the selection model, compared to 16% among those who participated in testing, and 18% estimated with imputation. We provide new confidence intervals that correct for the fact that the relationship between testing and HIV status is unknown and requires estimation.
Conclusions: We confirm the feasibility and value of adopting selection models to account for missing data in population-based HIV surveys and surveillance systems. Elements of survey design, such as interviewer identity, present the opportunity to adopt this approach in routine applications. Where non-participation is high, true confidence intervals are much wider than those generated by standard approaches to dealing with missing data suggest.
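Heckman-type selection models of the kind described above correct the outcome equation with the inverse Mills ratio computed from a first-stage participation (probit) model. The full two-step estimator, and the bivariate/copula variant needed for a binary outcome like HIV status, are beyond a short snippet, but the correction term itself is simple to compute with the standard library:

```python
import math

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def inverse_mills(z):
    """Inverse Mills ratio lambda(z) = phi(z) / Phi(z): the selection
    correction added to the outcome equation in Heckman's two-step."""
    return norm_pdf(z) / norm_cdf(z)

# z would be the fitted index from the first-stage probit of consent to test
print(round(inverse_mills(0.0), 4))  # 0.7979
```

A significant coefficient on this term in the second stage is evidence that the data are not missing at random.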
Abstract:
As stated in Aitchison (1986), a proper study of relative variation in a compositional data set should be based on logratios, and dealing with logratios excludes dealing with zeros. Nevertheless, it is clear that zero observations might be present in real data sets, either because the corresponding part is completely absent –essential zeros– or because it is below detection limit –rounded zeros. Because the second kind of zeros is usually understood as “a trace too small to measure”, it seems reasonable to replace them by a suitable small value, and this has been the traditional approach. As stated, e.g. by Tauber (1999) and by Martín-Fernández, Barceló-Vidal, and Pawlowsky-Glahn (2000), the principal problem in compositional data analysis is related to rounded zeros. One should be careful to use a replacement strategy that does not seriously distort the general structure of the data. In particular, the covariance structure of the involved parts –and thus the metric properties– should be preserved, as otherwise further analysis on subpopulations could be misleading. Following this point of view, a non-parametric imputation method is introduced in Martín-Fernández, Barceló-Vidal, and Pawlowsky-Glahn (2000). This method is analyzed in depth by Martín-Fernández, Barceló-Vidal, and Pawlowsky-Glahn (2003) where it is shown that the theoretical drawbacks of the additive zero replacement method proposed in Aitchison (1986) can be overcome using a new multiplicative approach on the non-zero parts of a composition. The new approach has reasonable properties from a compositional point of view. In particular, it is “natural” in the sense that it recovers the “true” composition if replacement values are identical to the missing values, and it is coherent with the basic operations on the simplex. This coherence implies that the covariance structure of subcompositions with no zeros is preserved. 
As a generalization of the multiplicative replacement, the same paper introduces a substitution method for missing values in compositional data sets.
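The multiplicative replacement can be sketched in a few lines. In this hedged illustration, δ is the small value substituted for each rounded zero, c is the closure constant the composition sums to, and the example composition is made up:

```python
def multiplicative_replacement(x, delta=1e-5, c=1.0):
    """Replace rounded zeros by delta and shrink the non-zero parts
    multiplicatively so the composition still sums to c (a sketch of the
    multiplicative approach of Martin-Fernandez et al., 2003)."""
    total_delta = delta * sum(1 for xi in x if xi == 0)
    return [delta if xi == 0 else xi * (1 - total_delta / c) for xi in x]

r = multiplicative_replacement([0.5, 0.3, 0.2, 0.0])
print(abs(sum(r) - 1.0) < 1e-9)  # closure is preserved
```

Because only the non-zero parts are rescaled, by a common factor, ratios among them are unchanged, which is why the covariance structure of zero-free subcompositions is preserved.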
Abstract:
1. The establishment of grassy strips at the margins of arable fields is an agri-environment scheme that aims to provide resources for native flora and fauna and thus increase farmland biodiversity. These margins can be managed to target certain groups, such as farmland birds and pollinators, but the impact of such management on the soil fauna has been poorly studied. This study assessed the effect of seed mix and management on the biodiversity, conservation and functional value of field margins for soil macrofauna. 2. Experimental margin plots were established in 2001 in a winter wheat field in Cambridgeshire, UK, using a factorial design of three seed mixes and three management practices [spring cut, herbicide application and soil disturbance (scarification)]. In spring and autumn 2005, soil cores taken from the margin plots and the crop were hand-sorted for soil macrofauna. The Lumbricidae, Isopoda, Chilopoda, Diplopoda, Carabidae and Staphylinidae were identified to species and classified according to feeding type. 3. Diversity in the field margins was generally higher than in the crop, with the Lumbricidae, Isopoda and Coleoptera having significantly more species and/or higher abundances in the margins. Within the margins, management had a significant effect on the soil macrofauna, with scarified plots containing lower abundances and fewer species of Isopods. The species composition of the scarified plots was similar to that of the crop. 4. Scarification also reduced soil- and litter-feeder abundances and predator species densities, although populations appeared to recover by the autumn, probably as a result of dispersal from neighbouring plots and boundary features. The implications of the responses of these feeding groups for ecosystem services are discussed. 5. Synthesis and applications. This study shows that the management of agri-environment schemes can significantly influence their value for soil macrofauna. 
In order to encourage the litter-dwelling invertebrates that tend to be missing from arable systems, agri-environment schemes should aim to minimize soil cultivation and develop a substantial surface litter layer. However, this may conflict with other aims of these schemes, such as enhancing floristic and pollinator diversity.
Abstract:
When missing data occur in studies designed to compare the accuracy of diagnostic tests, a common, though naive, practice is to base the comparison of sensitivity, specificity, as well as of positive and negative predictive values on some subset of the data that fits into methods implemented in standard statistical packages. Such methods are usually valid only under the strong missing completely at random (MCAR) assumption and may generate biased and less precise estimates. We review some models that use the dependence structure of the completely observed cases to incorporate the information of the partially categorized observations into the analysis and show how they may be fitted via a two-stage hybrid process involving maximum likelihood in the first stage and weighted least squares in the second. We indicate how computational subroutines written in R may be used to fit the proposed models and illustrate the different analysis strategies with observational data collected to compare the accuracy of three distinct non-invasive diagnostic methods for endometriosis. The results indicate that even when the MCAR assumption is plausible, the naive partial analyses should be avoided.
Abstract:
Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq)
Abstract:
Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq)
Abstract:
Many things have been said about literature after postmodernism, but one point there seems to be some agreement on is that it does not turn its back radically on its postmodernist forerunner, but rather generally continues to heed and value its insights. There seems to be something strikingly non-oedipal about the recent aesthetic shift. It is a project of reconstruction that remains deeply rooted in postmodernist tenets. Such an essentially non-oedipal attitude, I would argue, is central to the nature of the reconstructive shift. This, however, also raises questions about the wider cultural context from which such an aesthetic stance arises. If postmodernism was nurtured by the revolutionary spirits of the late 1960s, reconstruction faces a different world with different strategies. Instead of the postmodernist urge to subvert, expose and undermine, reconstruction yearns towards tentative and fragile intersubjective understanding, towards responsibility and community. Instead of revolt and rebellion it explores reconciliation and compromise. One instance in which this becomes visible in reconstructive narratives is the recurring figure of the lost father. Missing father figures abound in recent novels by authors like Mark Z. Danielewski, Dave Eggers, Yann Martel, David Mitchell etc. It almost seems like a younger generation is yearning for the fathers which postmodernism has struggled hard to do away with. My paper will focus on one particularly striking example to explore the implications of this development: Daniel Wallace's novel Big Fish and Tim Burton's well-known film adaptation of the same. In their negotiation of fact and fiction, of doubt and belief, of freedom and responsibility, all of which converge in a father-son relationship, they serve well to illustrate central characteristics and concerns of recent attempts to leave postmodernism behind.
Abstract:
Missing outcome data are common in clinical trials, and despite a well-designed study protocol, some randomized participants may leave the trial early without providing some or all of their data, or may be excluded after randomization. Premature discontinuation causes loss of information, potentially resulting in attrition bias and problems in the interpretation of trial findings. The causes of information loss in a trial, known as mechanisms of missingness, may influence the credibility of the trial results. Analysis of trials with missing outcome data should ideally be handled with intention-to-treat (ITT) rather than per-protocol (PP) analysis. However, true ITT analysis requires appropriate assumptions and imputation of missing data. Using a worked example from a published dental study, we highlight the key issues associated with missing outcome data in clinical trials, describe the most recognized approaches to handling missing outcome data, and explain the principles of ITT and PP analysis.
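One of the most recognized (though increasingly discouraged) single-imputation approaches for enabling an ITT analysis is last observation carried forward. A minimal sketch, with made-up outcome values and None marking missed visits:

```python
def locf(series):
    """Last observation carried forward: replace each missing value (None)
    with the most recent observed one. Assumes the baseline is observed."""
    filled, last = [], None
    for v in series:
        if v is not None:
            last = v
        filled.append(last)
    return filled

# Hypothetical per-visit outcome measurements with two missed visits
print(locf([5.2, 5.0, None, None, 4.8]))  # [5.2, 5.0, 5.0, 5.0, 4.8]
```

LOCF implicitly assumes the outcome stays flat after dropout, which is rarely plausible; the approaches above (mixed models, multiple imputation) avoid that assumption.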