18 results for Missing-data
Abstract:
This paper proposes a probabilistic principal component analysis (PCA) approach to islanding detection based on wide-area PMU data. According to many power system operators, the increasing probability of uncontrolled islanding operation is one of the biggest concerns with a large penetration of distributed renewable generation. Traditional islanding detection methods, such as RoCoF and vector shift, are however extremely sensitive and may result in many unwanted trips. The proposed probabilistic PCA aims to improve islanding detection accuracy and reduce the risk of unwanted tripping based on PMU measurements, while addressing the practical issue of missing data. The reliability and accuracy of the proposed approach are demonstrated using real data recorded in the UK power system by the OpenPMU project. The results show that the proposed method detects islanding accurately, without being falsely triggered by generation trips, even in the presence of missing values.
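The abstract does not spell out the algorithm, so the following is a minimal sketch of one common way to fit a principal subspace in the presence of missing values: EM-style iterative PCA imputation in numpy. The function name, rank, and synthetic "PMU-like" data are illustrative assumptions, not the paper's method.

```python
import numpy as np

def pca_impute(X, n_components=2, n_iter=100, tol=1e-8):
    """EM-style iterative PCA imputation: start from column means, then
    alternate fitting a rank-k subspace and refilling the missing entries
    from the low-rank reconstruction."""
    X = np.array(X, dtype=float)
    missing = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.broadcast_to(col_means, X.shape)[missing]
    for _ in range(n_iter):
        mu = X.mean(axis=0)
        U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
        recon = mu + (U[:, :n_components] * s[:n_components]) @ Vt[:n_components]
        # Convergence check: how much the filled entries moved this round
        delta = np.max(np.abs(X[missing] - recon[missing])) if missing.any() else 0.0
        X[missing] = recon[missing]
        if delta < tol:
            break
    return X

# Illustrative use: rank-2 "PMU-like" measurements with 10% missing entries
rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 6))
noisy = clean + 0.05 * rng.normal(size=clean.shape)
noisy[rng.random(noisy.shape) < 0.10] = np.nan
completed = pca_impute(noisy, n_components=2)
```

In an islanding-detection setting one would then, under the same caveat that this is only a sketch, monitor the reconstruction residual of incoming samples against the fitted subspace and flag sudden jumps.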
Abstract:
Retrospective clinical datasets are often characterized by a relatively small sample size and many missing data. In this case, a common way of handling the missingness is to discard patients with missing covariates from the analysis, further reducing the sample size. Alternatively, if the mechanism that generated the missingness allows, incomplete data can be imputed on the basis of the observed data, avoiding the reduction in sample size and allowing complete-data methods to be applied afterwards. Moreover, methodologies for data imputation may depend on the particular purpose and may achieve better results by exploiting specific characteristics of the domain. The problem of missing-data treatment is studied in the context of survival tree analysis for the estimation of a prognostic patient stratification. Survival tree methods usually address this problem with surrogate splits, that is, splitting rules that use other variables to yield results similar to the original ones. Instead, our methodology models the dependencies among the clinical variables with a Bayesian network, which is then used to perform data imputation, thus allowing the survival tree to be applied to the completed dataset. The Bayesian network is learned directly from the incomplete data using a structural expectation–maximization (EM) procedure in which the maximization step is performed with an exact anytime method, so that the only source of approximation is the EM formulation itself. On both simulated and real data, our proposed methodology usually outperformed several existing methods for data imputation, and the resulting imputation improved the stratification estimated by the survival tree (especially with respect to using surrogate splits).
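Structural EM over a full clinical network is involved; as a toy illustration of the E-step/M-step idea behind Bayesian-network imputation, here is a sketch with a fixed two-node structure A → B (structure learning omitted). All names and data are synthetic assumptions.

```python
import numpy as np

# Toy fixed-structure network A -> B, both binary; the paper learns the
# structure with structural EM, whereas here the structure is given and
# plain EM estimates the CPT P(B=1 | A) from incomplete data.
rng = np.random.default_rng(1)
n = 500
A = rng.integers(0, 2, size=n)
p_true = np.array([0.2, 0.8])                  # true P(B=1 | A=a)
B = (rng.random(n) < p_true[A]).astype(float)
B[rng.random(n) < 0.3] = np.nan                # 30% of B missing

p = np.array([0.5, 0.5])                       # initial CPT estimate
for _ in range(100):
    # E-step: expected value of each missing B given its parent A
    w = np.where(np.isnan(B), p[A], B)
    # M-step: re-estimate the CPT from expected counts
    p_next = np.array([w[A == a].mean() for a in (0, 1)])
    if np.max(np.abs(p_next - p)) < 1e-10:
        break
    p = p_next

# Impute the missing values from the learned conditional distribution
B_imputed = np.where(np.isnan(B), (p[A] >= 0.5).astype(float), B)
```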
Abstract:
IMPORTANCE Systematic reviews and meta-analyses of individual participant data (IPD) aim to collect, check, and reanalyze individual-level data from all studies addressing a particular research question and are therefore considered a gold standard approach to evidence synthesis. They are likely to be used with increasing frequency as current initiatives to share clinical trial data gain momentum and may be particularly important in reviewing controversial therapeutic areas.
OBJECTIVE To develop PRISMA-IPD as a stand-alone extension to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) Statement, tailored to the specific requirements of reporting systematic reviews and meta-analyses of IPD. Although developed primarily for reviews of randomized trials, many items will apply in other contexts, including reviews of diagnosis and prognosis.
DESIGN Development of PRISMA-IPD followed the EQUATOR Network framework guidance and used the existing standard PRISMA Statement as a starting point to draft additional relevant material. A web-based survey informed discussion at an international workshop that included researchers, clinicians, methodologists experienced in conducting systematic reviews and meta-analyses of IPD, and journal editors. The statement was drafted and iterative refinements were made by the project, advisory, and development groups. The PRISMA-IPD Development Group reached agreement on the PRISMA-IPD checklist and flow diagram by consensus.
FINDINGS Compared with standard PRISMA, the PRISMA-IPD checklist includes 3 new items that address (1) methods of checking the integrity of the IPD (such as pattern of randomization, data consistency, baseline imbalance, and missing data), (2) reporting any important issues that emerge, and (3) exploring variation (such as whether certain types of individual benefit more from the intervention than others). A further item was created by reorganizing standard PRISMA items relating to interpreting results. Wording was modified in 23 items to reflect the IPD approach.
CONCLUSIONS AND RELEVANCE PRISMA-IPD provides guidelines for reporting systematic reviews and meta-analyses of IPD.
Abstract:
Estimates of HIV prevalence are important for policy in order to establish the health status of a country's population and to evaluate the effectiveness of population-based interventions and campaigns. However, participation rates in testing for surveillance conducted as part of household surveys, on which many of these estimates are based, can be low. HIV positive individuals may be less likely to participate because they fear disclosure, in which case estimates obtained using conventional approaches to deal with missing data, such as imputation-based methods, will be biased. We develop a Heckman-type simultaneous equation approach which accounts for non-ignorable selection, but unlike previous implementations, allows for spatial dependence and does not impose a homogeneous selection process on all respondents. In addition, our framework addresses the issue of separation, where for instance some factors are severely unbalanced and highly predictive of the response, which would ordinarily prevent model convergence. Estimation is carried out within a penalized likelihood framework where smoothing is achieved using a parametrization of the smoothing criterion which makes estimation more stable and efficient. We provide the software for straightforward implementation of the proposed approach, and apply our methodology to estimating national and sub-national HIV prevalence in Swaziland, Zimbabwe and Zambia.
Abstract:
Objectives: This study examined the validity of a latent class typology of adolescent drinking based on four alcohol dimensions: frequency of drinking, quantity consumed, frequency of binge drinking and the number of alcohol-related problems encountered. Method: Data were from the age-sixteen follow-up of the 1970 British Cohort Study. Partial or complete responses to the selected alcohol measures were provided by 6,516 cohort members. The data were collected via a series of postal questionnaires. Results: A five-class LCA typology was constructed. Around 12% of the sample were classified as 'hazardous drinkers', reporting frequent drinking, high levels of alcohol consumed, frequent binge drinking and multiple alcohol-related problems. Multinomial logistic regression, with multiple imputation for missing data, was used to assess the covariates of adolescent drinking patterns. Hazardous drinking was associated with being white, being male, having heavy-drinking parents (in particular fathers), smoking, illicit drug use, and minor and violent offending behaviour. Non-significant associations were found between drinking patterns and general mental health and attention deficit disorder. Conclusion: The latent class typology exhibited concurrent validity in terms of its ability to distinguish respondents across a number of alcohol and non-alcohol indicators. Notwithstanding a number of limitations, latent class analysis offers an alternative data reduction method for the construction of drinking typologies that addresses known weaknesses inherent in more traditional classification methods.
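As a hedged sketch of the "multiple imputation, then regression" workflow this abstract describes (with a binary rather than multinomial outcome for brevity, and scikit-learn's IterativeImputer standing in for whatever imputation model the authors used):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 1000
X = rng.normal(size=(n, 3))                    # illustrative covariates
logit = X @ np.array([1.0, -0.5, 0.3])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)
X_obs = X.copy()
X_obs[rng.random(X.shape) < 0.2] = np.nan      # 20% of covariate values missing

coefs = []
for m in range(5):                             # M = 5 imputed datasets
    X_imp = IterativeImputer(sample_posterior=True,
                             random_state=m).fit_transform(X_obs)
    coefs.append(LogisticRegression(max_iter=1000).fit(X_imp, y).coef_[0])
pooled = np.mean(coefs, axis=0)                # Rubin's rules point estimate
print(pooled)
```

Rubin's rules also combine within- and between-imputation variance to get standard errors; only the pooled point estimate is shown here.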
Abstract:
Background: This study assessed the association between adolescent ecstasy use and depressive symptoms in adolescence. Methods: The Belfast Youth Development Study surveyed a cohort annually from age 11 to 16 years. Gender, Strengths and Difficulties Questionnaire emotional subscale, living arrangements, parental affluence, parent and peer attachment, and tobacco, alcohol, cannabis and ecstasy use were investigated as predictors of Short Mood and Feelings Questionnaire (SMFQ) outcome. Results: Of 5371 respondents, 301 (5.6%) had an SMFQ score > 15, and 1620 (30.2%) had missing SMFQ data. Around 8% of the cohort had used ecstasy by the end of follow-up. Of the non-drug users, ∼2% showed symptoms of depression, compared with 6% of those who had used alcohol, 6% of cannabis users, 6% of ecstasy users and 7% of frequent ecstasy users. Without adjustment, ecstasy users showed around a 4-fold increased odds of depressive symptoms compared with non-drug users [odds ratio (OR) = 0.26; 95% confidence interval (CI) = 0.10, 0.68]. Further adjustment for living arrangements, peer and parental attachment attenuated the association to under a 3-fold increase (OR = 0.37; 95% CI = 0.15, 0.94). There were no differences by frequency of use. Conclusions: Ecstasy use during adolescence may be associated with poorer mental health; however, this association can be explained by the confounding social influence of family dynamics. These findings could be used to aid effective evidence-based drug policies, which concentrate criminal justice and public health resources on reducing harm.
Abstract:
We present TANC, a tree-augmented naive (TAN) classifier based on imprecise probabilities. TANC models prior near-ignorance via the Extreme Imprecise Dirichlet Model (EDM). A first contribution of this paper is the experimental comparison between the EDM and the global Imprecise Dirichlet Model (IDM) using the naive credal classifier (NCC), with the aim of showing that the EDM is a sensible approximation of the global IDM. TANC is able to deal with missing data in a conservative manner by considering all possible completions (without assuming them to be missing-at-random), while avoiding an exponential increase of the computational time. Through experiments on real data sets, we show that TANC is more reliable than the Bayesian TAN and that it provides better performance compared to previous TANs based on imprecise probabilities. Yet, TANC is sometimes outperformed by NCC because the learned TAN structures are too complex; this calls for novel algorithms for learning TAN structures that are better suited to an imprecise probability classifier.
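Here is a brute-force sketch of the conservative treatment of missing data: evaluate every possible completion of the training set and report an interval of posteriors rather than a point value. It uses a single smoothed prior instead of the EDM/IDM machinery, and its cost is exponential in the number of missing values, which is exactly what TANC avoids; everything below is an illustrative assumption.

```python
from itertools import product

# Toy training set of (feature, class) pairs; None marks a missing value.
train = [(1, 1), (1, 1), (0, 0), (0, 0), (None, 1), (1, 0)]

def posterior_c1(rows, x=1, s=1.0):
    """Smoothed naive-Bayes-style posterior P(class=1 | feature=x)."""
    n1 = sum(1 for _, c in rows if c == 1)
    n0 = len(rows) - n1
    f1 = sum(1 for f, c in rows if c == 1 and f == x)
    f0 = sum(1 for f, c in rows if c == 0 and f == x)
    joint1 = (n1 + s) / (len(rows) + 2 * s) * (f1 + s) / (n1 + 2 * s)
    joint0 = (n0 + s) / (len(rows) + 2 * s) * (f0 + s) / (n0 + 2 * s)
    return joint1 / (joint1 + joint0)

# Conservative inference: enumerate all completions of the missing values
# and report the resulting interval of posteriors instead of a point value.
missing = [i for i, (f, _) in enumerate(train) if f is None]
posts = []
for completion in product([0, 1], repeat=len(missing)):
    rows = list(train)
    for i, v in zip(missing, completion):
        rows[i] = (v, rows[i][1])
    posts.append(posterior_c1(rows))
print(min(posts), max(posts))   # lower and upper posterior for class 1
```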
Abstract:
In this paper we present TANC, a tree-augmented naive credal classifier based on imprecise probabilities; it models prior near-ignorance via the Extreme Imprecise Dirichlet Model (EDM) (Cano et al., 2007) and deals conservatively with missing data in the training set, without assuming them to be missing-at-random. The EDM is an approximation of the global Imprecise Dirichlet Model (IDM), which considerably simplifies the computation of upper and lower probabilities; yet, having been only recently introduced, the quality of the approximation it provides still needs to be verified. As a first contribution, we extensively compare the output of the naive credal classifier (one of the few cases in which the global IDM can be exactly implemented) when learned with the EDM and with the global IDM; the output of the classifier is identical in the vast majority of cases, thus supporting the adoption of the EDM in real classification problems. Then, we show by experiments that TANC is more reliable than the precise TAN (learned with a uniform prior), and that it provides better performance compared to a previous TAN model based on imprecise probabilities (Zaffalon, 2003). TANC treats missing data by considering all possible completions of the training set, while avoiding an exponential increase of the computational time; finally, we present some preliminary results with missing data.
Abstract:
Some reasons for registering trials might be considered self-serving, such as satisfying the requirements of a journal in which the researchers wish to publish their eventual findings or publicising the trial to boost recruitment. Registry entries also help others, including systematic reviewers, to know about ongoing or unpublished studies and contribute to reducing research waste by making it clear what studies are ongoing. Other sources of research waste include inconsistency in outcome measurement across trials in the same area, missing data on important outcomes from some trials, and selective reporting of outcomes. One way to reduce this waste is through the use of core outcome sets: standardised sets of outcomes for research in specific areas of health and social care. These do not restrict the outcomes that will be measured, but provide the minimum to include if a trial is to be of the most use to potential users. We propose that trial registries, such as ISRCTN, encourage researchers to note their use of a core outcome set in their entry. This will help people searching for trials and those worried about selective reporting in closed trials. Trial registries can facilitate these efforts to make new trials as useful as possible and reduce waste. The outcomes section in the entry could prompt the researcher to consider using a core outcome set, and could facilitate the specification of that core outcome set and its component outcomes by linking to the original. In doing this, registries will contribute to the global effort to ensure that trials answer important uncertainties, can be brought together in systematic reviews, and better serve their ultimate aim of improving health and well-being through improving health and social care.
Abstract:
Background: Selection bias in HIV prevalence estimates occurs if non-participation in testing is correlated with HIV status. Longitudinal data suggest that individuals who know or suspect they are HIV positive are less likely to participate in testing in HIV surveys, in which case methods to correct for missing data that are based on imputation and observed characteristics will produce biased results. Methods: The identity of the HIV survey interviewer is typically associated with HIV testing participation, but is unlikely to be correlated with HIV status. Interviewer identity can thus be used as a selection variable, allowing estimation of Heckman-type selection models. These models produce asymptotically unbiased HIV prevalence estimates, even when non-participation is correlated with unobserved characteristics such as knowledge of HIV status. We introduce a new random-effects method for these selection models which overcomes non-convergence caused by collinearity, small-sample bias, and incorrect inference in existing approaches. Our method is easy to implement in standard statistical software, and allows the construction of bootstrapped standard errors which adjust for the fact that the relationship between testing and HIV status is uncertain and needs to be estimated. Results: Using nationally representative data from the Demographic and Health Surveys, we illustrate our approach with new point estimates and confidence intervals (CI) for HIV prevalence among men in Ghana (2003) and Zambia (2007). In Ghana, we find little evidence of selection bias, as our selection model gives an HIV prevalence estimate of 1.4% (95% CI 1.2%–1.6%), compared to 1.6% among those with a valid HIV test. In Zambia, our selection model gives an HIV prevalence estimate of 16.3% (95% CI 11.0%–18.4%), compared to 12.1% among those with a valid HIV test. Therefore, those who decline to test in Zambia are found to be more likely to be HIV positive. Conclusions: Our approach corrects for selection bias in HIV prevalence estimates, is possible to implement even when HIV prevalence or non-participation is very high or very low, and provides a practical solution to account for both sampling and parameter uncertainty in the estimation of confidence intervals. The wide confidence intervals estimated in an example with high HIV prevalence indicate that it is difficult to correct statistically for the bias that may occur when a large proportion of people refuse to test.
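The papers in this family estimate a binary HIV outcome by maximum likelihood; as a simpler illustration of the selection-correction idea with an interviewer exclusion restriction, here is the classic Heckman two-step for a continuous outcome on synthetic data (all names, coefficients, and the error correlation are assumptions):

```python
import numpy as np
from scipy.stats import norm
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 5000
interviewer = rng.normal(size=n)   # selection variable, excluded from outcome
x = rng.normal(size=n)
# Correlated errors make selection non-ignorable (rho = 0.6)
e = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=n)
tested = (0.5 * interviewer + 0.3 * x + e[:, 0] > 0)   # consent to test
y = 1.0 + 0.5 * x + e[:, 1]                            # outcome, observed only if tested

# Step 1: probit for selection, using the interviewer exclusion restriction
Z = sm.add_constant(np.column_stack([interviewer, x]))
probit = sm.Probit(tested.astype(int), Z).fit(disp=0)
xb = Z @ probit.params
mills = norm.pdf(xb) / norm.cdf(xb)                    # inverse Mills ratio

# Step 2: outcome model on the tested subsample; the IMR term absorbs
# the selection effect, correcting the other coefficients
Xo = sm.add_constant(np.column_stack([x[tested], mills[tested]]))
print(sm.OLS(y[tested], Xo).fit().params)  # const, x, IMR coefficient
```

A naive regression on the tested subsample alone would give a biased intercept (and hence a biased prevalence-type estimate); the IMR term recovers it.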
Adjusting HIV Prevalence Estimates for Non-participation: An Application to Demographic Surveillance
Abstract:
Introduction: HIV testing is a cornerstone of efforts to combat the HIV epidemic, and testing conducted as part of surveillance provides invaluable data on the spread of infection and the effectiveness of campaigns to reduce the transmission of HIV. However, participation in HIV testing can be low, and if respondents systematically select not to be tested because they know or suspect they are HIV positive (and fear disclosure), standard approaches to deal with missing data will fail to remove selection bias. We implemented Heckman-type selection models, which can be used to adjust for missing data that are not missing at random, and established the extent of selection bias in a population-based HIV survey in an HIV hyperendemic community in rural South Africa.
Methods: We used data from a population-based HIV survey carried out in 2009 in rural KwaZulu-Natal, South Africa. In this survey, 5565 women (35%) and 2567 men (27%) provided blood for an HIV test. We accounted for missing data using interviewer identity as a selection variable which predicted consent to HIV testing but was unlikely to be independently associated with HIV status. Our approach involved using this selection variable to examine the HIV status of residents who would ordinarily refuse to test, except that they were allocated a persuasive interviewer. Our copula model allows for flexibility when modelling the dependence structure between HIV survey participation and HIV status.
Results: For women, our selection model generated an HIV prevalence estimate of 33% (95% CI 27–40) for all people eligible to consent to HIV testing in the survey. This estimate is higher than the estimate of 24% generated when only information from respondents who participated in testing is used in the analysis, and the estimate of 27% when imputation analysis is used to predict missing data on HIV status. For men, we found an HIV prevalence of 25% (95% CI 15–35) using the selection model, compared to 16% among those who participated in testing, and 18% estimated with imputation. We provide new confidence intervals that correct for the fact that the relationship between testing and HIV status is unknown and requires estimation.
Conclusions: We confirm the feasibility and value of adopting selection models to account for missing data in population-based HIV surveys and surveillance systems. Elements of survey design, such as interviewer identity, present the opportunity to adopt this approach in routine applications. Where non-participation is high, true confidence intervals are much wider than those generated by standard approaches to dealing with missing data suggest.
Abstract:
How can we correlate neural activity in the human brain, as it responds to words, with behavioral data expressed as answers to questions about these same words? In short, we want to find latent variables that explain both the brain activity and the behavioral responses. We show that this is an instance of the Coupled Matrix-Tensor Factorization (CMTF) problem. We propose Scoup-SMT, a novel, fast, and parallel algorithm that solves the CMTF problem and produces a sparse latent low-rank subspace of the data. In our experiments, we find that Scoup-SMT is 50-100 times faster than a state-of-the-art algorithm for CMTF, along with a 5-fold increase in sparsity. Moreover, we extend Scoup-SMT to handle missing data without degradation of performance. We apply Scoup-SMT to BrainQ, a dataset consisting of a (nouns, brain voxels, human subjects) tensor and a (nouns, properties) matrix, coupled along the nouns dimension. Scoup-SMT is able to find meaningful latent variables, as well as to predict brain activity with competitive accuracy. Finally, we demonstrate the generality of Scoup-SMT by applying it to a Facebook dataset (users, friends, wall-postings); there, Scoup-SMT spots spammer-like anomalies.
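To make the coupled objective concrete, here is a dense, single-threaded alternating-least-squares baseline for CMTF (not the sparse, parallel Scoup-SMT, and without missing-value masking); shapes, rank, and names are illustrative assumptions.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product; row (u, v) holds U[u] * V[v]."""
    r = U.shape[1]
    return (U[:, None, :] * V[None, :, :]).reshape(-1, r)

def cmtf_als(X, Y, rank=3, n_iter=200, seed=0):
    """Plain ALS for a CP tensor X (I,J,K) coupled with a matrix Y (I,P)
    through the shared first-mode factor A."""
    I, J, K = X.shape
    rng = np.random.default_rng(seed)
    A, B, C = (rng.normal(size=(d, rank)) for d in (I, J, K))
    D = rng.normal(size=(Y.shape[1], rank))
    X1 = X.reshape(I, -1)                      # mode-1 unfolding
    X2 = X.transpose(1, 0, 2).reshape(J, -1)   # mode-2 unfolding
    X3 = X.transpose(2, 0, 1).reshape(K, -1)   # mode-3 unfolding
    for _ in range(n_iter):
        # The shared factor A sees both the tensor and the coupled matrix
        G = (B.T @ B) * (C.T @ C) + D.T @ D
        A = np.linalg.solve(G, (X1 @ khatri_rao(B, C) + Y @ D).T).T
        B = np.linalg.solve((A.T @ A) * (C.T @ C), (X2 @ khatri_rao(A, C)).T).T
        C = np.linalg.solve((A.T @ A) * (B.T @ B), (X3 @ khatri_rao(A, B)).T).T
        D = np.linalg.solve(A.T @ A, (Y.T @ A).T).T
    return A, B, C, D

# Illustrative use on synthetic low-rank data coupled along the first mode
rng = np.random.default_rng(4)
A0, B0, C0, D0 = (rng.normal(size=(d, 3)) for d in (20, 15, 10, 8))
X = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
Y = A0 @ D0.T
A, B, C, D = cmtf_als(X, Y, rank=3)
```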
Abstract:
Background: Prospective investigations of the association between impaired orthostatic blood pressure (BP) regulation and cognitive decline in older adults are limited, and findings to date have been mixed. The aim of this study was to determine whether impaired recovery of orthostatic BP was associated with change in cognitive function over a 2-year period in a population-based sample of community-dwelling older adults.
Methods: Data from the first two waves of the Irish Longitudinal Study on Ageing were analysed. Orthostatic BP was measured during a lying-to-standing orthostatic stress protocol at wave 1 using beat-to-beat digital plethysmography, and impaired recovery of BP at 40 s post-stand was investigated. Cognitive function was assessed at wave 1 and wave 2 (2 years later) using the Mini-Mental State Exam (MMSE), verbal fluency and word recall tasks.
Results: After adjustment for measured potential confounders, and with multiple imputation for missing data, the change in the number of errors between waves on the MMSE was 10% higher [IRR (95% CI) = 1.10 (0.96, 1.26)] in those with impaired recovery at 40 s. However, this was not statistically significant (p = 0.17). Impaired BP recovery was not associated with change in performance on any of the other cognitive measures.
Conclusions: There was no clear evidence for an association between impaired recovery of orthostatic BP and change in cognition over a 2-year period in this nationally representative cohort of older adults. Longer follow-up and more detailed cognitive testing would be advantageous to further investigate the relationship between orthostatic BP and cognitive decline.
Abstract:
Baited cameras are often used for abundance estimation wherever alternative techniques are precluded, e.g. in abyssal systems and areas such as reefs. This method has thus far used models of the arrival process that are deterministic and, therefore, permit no estimate of precision. Furthermore, errors due to multiple counting of fish and missing those not seen by the camera have restricted the technique to using only the time of first arrival, leaving a lot of data redundant. Here, we reformulate the arrival process using a stochastic model, which allows the precision of abundance estimates to be quantified. Assuming a non-gregarious, cross-current-scavenging fish, we show that prediction of abundance from first arrival time is extremely uncertain. Using example data, we show that simple regression-based prediction from the initial (rising) slope of numbers at the bait gives good precision, accepting certain assumptions. The most precise abundance estimates were obtained by including the declining phase of the time series, using a simple model of departures, and taking account of scavengers beyond the camera's view, using a hidden Markov model.
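A stylized simulation (not the paper's model) of why first-arrival-time estimates are so uncertain compared with the rising-slope estimator: with Poisson arrivals at a rate proportional to abundance, 1/t1 is heavy-tailed, while a least-squares slope through the cumulative counts is comparatively tight. The rate, window, and deployment count are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
rate = 2.0                 # true arrival rate, proportional to abundance
n_deploy = 2000            # simulated camera deployments

# First-arrival estimator: t1 ~ Exponential(rate), so estimate rate as 1/t1
t1 = rng.exponential(1.0 / rate, size=n_deploy)
rate_first = 1.0 / t1

# Rising-slope estimator: per-minute arrivals over 10 minutes, then the
# least-squares slope of the cumulative count through the origin
t = np.arange(1, 11)
cum = np.cumsum(rng.poisson(rate, size=(n_deploy, t.size)), axis=1)
rate_slope = cum @ t / (t @ t)

# Spread of the two estimators: first-arrival is heavy-tailed, slope is tight
print(np.percentile(rate_first, [25, 50, 75]))
print(np.percentile(rate_slope, [25, 50, 75]))
```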