910 resultados para Missing data
Resumo:
Amplitude demodulation is an ill-posed problem and so it is natural to treat it from a Bayesian viewpoint, inferring the most likely carrier and envelope under probabilistic constraints. One such treatment is Probabilistic Amplitude Demodulation (PAD), which, whilst computationally more intensive than traditional approaches, offers several advantages. Here we provide methods for estimating the uncertainty in the PAD-derived envelopes and carriers, and for learning free-parameters like the time-scale of the envelope. We show how the probabilistic approach can naturally handle noisy and missing data. Finally, we indicate how to extend the model to signals which contain multiple modulators and carriers.
Resumo:
Natural sounds are structured on many time-scales. A typical segment of speech, for example, contains features that span four orders of magnitude: Sentences ($\sim1$s); phonemes ($\sim10$−$1$ s); glottal pulses ($\sim 10$−$2$s); and formants ($\sim 10$−$3$s). The auditory system uses information from each of these time-scales to solve complicated tasks such as auditory scene analysis [1]. One route toward understanding how auditory processing accomplishes this analysis is to build neuroscience-inspired algorithms which solve similar tasks and to compare the properties of these algorithms with properties of auditory processing. There is however a discord: Current machine-audition algorithms largely concentrate on the shorter time-scale structures in sounds, and the longer structures are ignored. The reason for this is two-fold. Firstly, it is a difficult technical problem to construct an algorithm that utilises both sorts of information. Secondly, it is computationally demanding to simultaneously process data both at high resolution (to extract short temporal information) and for long duration (to extract long temporal information). The contribution of this work is to develop a new statistical model for natural sounds that captures structure across a wide range of time-scales, and to provide efficient learning and inference algorithms. We demonstrate the success of this approach on a missing data task.
Resumo:
完整性是数据质量的一个重要维度,由于数据本身固有的不确定性、采集的随机性及不准确性,导致现实应用中产生了大量具有如下特点的数据集:1)数据规模庞大;2)数据往往是不完整、不准确的.因此将大规模数据集分段到不同的数据窗口中处理是数据处理的重要方法,但缺失数据估算的相关研究大都忽视了数据集的特点和窗口的应用,而且回定大小的数据窗17容易造成算法的准确性和性能受窗口大小及窗口内数据值分布的影响.假设数据满足一定的领域相关的约束,首先提出了一种新的基于时间的动态自适应数据窗口检测算法,并基于此窗口提出了一种改进的模糊k-均值聚类算法来进行不完整数据的缺失数据估算.实验表明较之其他算法,不仅能更适应数据集的特点,具有较好的性能,而且能够保证准确性.
Resumo:
The dissertation addressed the problems of signals reconstruction and data restoration in seismic data processing, which takes the representation methods of signal as the main clue, and take the seismic information reconstruction (signals separation and trace interpolation) as the core. On the natural bases signal representation, I present the ICA fundamentals, algorithms and its original applications to nature earth quake signals separation and survey seismic signals separation. On determinative bases signal representation, the paper proposed seismic dada reconstruction least square inversion regularization methods, sparseness constraints, pre-conditioned conjugate gradient methods, and their applications to seismic de-convolution, Radon transformation, et. al. The core contents are about de-alias uneven seismic data reconstruction algorithm and its application to seismic interpolation. Although the dissertation discussed two cases of signal representation, they can be integrated into one frame, because they both deal with the signals or information restoration, the former reconstructing original signals from mixed signals, the later reconstructing whole data from sparse or irregular data. The goal of them is same to provide pre-processing methods and post-processing method for seismic pre-stack depth migration. ICA can separate the original signals from mixed signals by them, or abstract the basic structure from analyzed data. I surveyed the fundamental, algorithms and applications of ICA. Compared with KL transformation, I proposed the independent components transformation concept (ICT). On basis of the ne-entropy measurement of independence, I implemented the FastICA and improved it by covariance matrix. By analyzing the characteristics of the seismic signals, I introduced ICA into seismic signal processing firstly in Geophysical community, and implemented the noise separation from seismic signal. Synthetic and real data examples show the usability of ICA to seismic signal processing and initial effects are achieved. The application of ICA to separation quake conversion wave from multiple in sedimentary area is made, which demonstrates good effects, so more reasonable interpretation of underground un-continuity is got. The results show the perspective of application of ICA to Geophysical signal processing. By virtue of the relationship between ICA and Blind Deconvolution , I surveyed the seismic blind deconvolution, and discussed the perspective of applying ICA to seismic blind deconvolution with two possible solutions. The relationship of PC A, ICA and wavelet transform is claimed. It is proved that reconstruction of wavelet prototype functions is Lie group representation. By the way, over-sampled wavelet transform is proposed to enhance the seismic data resolution, which is validated by numerical examples. The key of pre-stack depth migration is the regularization of pre-stack seismic data. As a main procedure, seismic interpolation and missing data reconstruction are necessary. Firstly, I review the seismic imaging methods in order to argue the critical effect of regularization. By review of the seismic interpolation algorithms, I acclaim that de-alias uneven data reconstruction is still a challenge. The fundamental of seismic reconstruction is discussed firstly. Then sparseness constraint on least square inversion and preconditioned conjugate gradient solver are studied and implemented. Choosing constraint item with Cauchy distribution, I programmed PCG algorithm and implement sparse seismic deconvolution, high resolution Radon Transformation by PCG, which is prepared for seismic data reconstruction. About seismic interpolation, dealias even data interpolation and uneven data reconstruction are very good respectively, however they can not be combined each other. In this paper, a novel Fourier transform based method and a algorithm have been proposed, which could reconstruct both uneven and alias seismic data. I formulated band-limited data reconstruction as minimum norm least squares inversion problem where an adaptive DFT-weighted norm regularization term is used. The inverse problem is solved by pre-conditional conjugate gradient method, which makes the solutions stable and convergent quickly. Based on the assumption that seismic data are consisted of finite linear events, from sampling theorem, alias events can be attenuated via LS weight predicted linearly from low frequency. Three application issues are discussed on even gap trace interpolation, uneven gap filling, high frequency trace reconstruction from low frequency data trace constrained by few high frequency traces. Both synthetic and real data numerical examples show the proposed method is valid, efficient and applicable. The research is valuable to seismic data regularization and cross well seismic. To meet 3D shot profile depth migration request for data, schemes must be taken to make the data even and fitting the velocity dataset. The methods of this paper are used to interpolate and extrapolate the shot gathers instead of simply embedding zero traces. So, the aperture of migration is enlarged and the migration effect is improved. The results show the effectiveness and the practicability.
Resumo:
An increasing number of applications, such as distributed interactive simulation, live auctions, distributed games and collaborative systems, require the network to provide a reliable multicast service. This service enables one sender to reliably transmit data to multiple receivers. Reliability is traditionally achieved by having receivers send negative acknowledgments (NACKs) to request from the sender the retransmission of lost (or missing) data packets. However, this Automatic Repeat reQuest (ARQ) approach results in the well-known NACK implosion problem at the sender. Many reliable multicast protocols have been recently proposed to reduce NACK implosion. But, the message overhead due to NACK requests remains significant. Another approach, based on Forward Error Correction (FEC), requires the sender to encode additional redundant information so that a receiver can independently recover from losses. However, due to the lack of feedback from receivers, it is impossible for the sender to determine how much redundancy is needed. In this paper, we propose a new reliable multicast protocol, called ARM for Adaptive Reliable Multicast. Our protocol integrates ARQ and FEC techniques. The objectives of ARM are (1) reduce the message overhead due to NACK requests, (2) reduce the amount of data transmission, and (3) reduce the time it takes for all receivers to receive the data intact (without loss). During data transmission, the sender periodically informs the receivers of the number of packets that are yet to be transmitted. Based on this information, each receiver predicts whether this amount is enough to recover its losses. Only if it is not enough, that the receiver requests the sender to encode additional redundant packets. Using ns simulations, we show the superiority of our hybrid ARQ-FEC protocol over the well-known Scalable Reliable Multicast (SRM) protocol.
Resumo:
The combinatorial Dirichlet problem is formulated, and an algorithm for solving it is presented. This provides an effective method for interpolating missing data on weighted graphs of arbitrary connectivity. Image processing examples are shown, and the relation to anistropic diffusion is discussed.
Resumo:
Technological advances in genotyping have given rise to hypothesis-based association studies of increasing scope. As a result, the scientific hypotheses addressed by these studies have become more complex and more difficult to address using existing analytic methodologies. Obstacles to analysis include inference in the face of multiple comparisons, complications arising from correlations among the SNPs (single nucleotide polymorphisms), choice of their genetic parametrization and missing data. In this paper we present an efficient Bayesian model search strategy that searches over the space of genetic markers and their genetic parametrization. The resulting method for Multilevel Inference of SNP Associations, MISA, allows computation of multilevel posterior probabilities and Bayes factors at the global, gene and SNP level, with the prior distribution on SNP inclusion in the model providing an intrinsic multiplicity correction. We use simulated data sets to characterize MISA's statistical power, and show that MISA has higher power to detect association than standard procedures. Using data from the North Carolina Ovarian Cancer Study (NCOCS), MISA identifies variants that were not identified by standard methods and have been externally "validated" in independent studies. We examine sensitivity of the NCOCS results to prior choice and method for imputing missing data. MISA is available in an R package on CRAN.
Resumo:
OBJECTIVES: To compare the predictive performance and potential clinical usefulness of risk calculators of the European Randomized Study of Screening for Prostate Cancer (ERSPC RC) with and without information on prostate volume. METHODS: We studied 6 cohorts (5 European and 1 US) with a total of 15,300 men, all biopsied and with pre-biopsy TRUS measurements of prostate volume. Volume was categorized into 3 categories (25, 40, and 60 cc), to reflect use of digital rectal examination (DRE) for volume assessment. Risks of prostate cancer were calculated according to a ERSPC DRE-based RC (including PSA, DRE, prior biopsy, and prostate volume) and a PSA + DRE model (including PSA, DRE, and prior biopsy). Missing data on prostate volume were completed by single imputation. Risk predictions were evaluated with respect to calibration (graphically), discrimination (AUC curve), and clinical usefulness (net benefit, graphically assessed in decision curves). RESULTS: The AUCs of the ERSPC DRE-based RC ranged from 0.61 to 0.77 and were substantially larger than the AUCs of a model based on only PSA + DRE (ranging from 0.56 to 0.72) in each of the 6 cohorts. The ERSPC DRE-based RC provided net benefit over performing a prostate biopsy on the basis of PSA and DRE outcome in five of the six cohorts. CONCLUSIONS: Identifying men at increased risk for having a biopsy detectable prostate cancer should consider multiple factors, including an estimate of prostate volume.
Resumo:
BACKGROUND: Administrative or quality improvement registries may or may not contain the elements needed for investigations by trauma researchers. International Classification of Diseases Program for Injury Categorisation (ICDPIC), a statistical program available through Stata, is a powerful tool that can extract injury severity scores from ICD-9-CM codes. We conducted a validation study for use of the ICDPIC in trauma research. METHODS: We conducted a retrospective cohort validation study of 40,418 patients with injury using a large regional trauma registry. ICDPIC-generated AIS scores for each body region were compared with trauma registry AIS scores (gold standard) in adult and paediatric populations. A separate analysis was conducted among patients with traumatic brain injury (TBI) comparing the ICDPIC tool with ICD-9-CM embedded severity codes. Performance in characterising overall injury severity, by the ISS, was also assessed. RESULTS: The ICDPIC tool generated substantial correlations in thoracic and abdominal trauma (weighted κ 0.87-0.92), and in head and neck trauma (weighted κ 0.76-0.83). The ICDPIC tool captured TBI severity better than ICD-9-CM code embedded severity and offered the advantage of generating a severity value for every patient (rather than having missing data). Its ability to produce an accurate severity score was consistent within each body region as well as overall. CONCLUSIONS: The ICDPIC tool performs well in classifying injury severity and is superior to ICD-9-CM embedded severity for TBI. Use of ICDPIC demonstrates substantial efficiency and may be a preferred tool in determining injury severity for large trauma datasets, provided researchers understand its limitations and take caution when examining smaller trauma datasets.
Resumo:
Objectives: This study examined the validity of a latent class typology of adolescent drinking based on four alcohol dimensions; frequency of drinking, quantity consumed, frequency of binge drinking and the number of alcohol related problems encountered. Method: Data used were from the 1970 British Cohort Study sixteen-year-old follow-up. Partial or complete responses to the selected alcohol measures were provided by 6,516 cohort members. The data were collected via a series of postal questionnaires. Results: A five class LCA typology was constructed. Around 12% of the sample were classified as �hazardous drinkers� reporting frequent drinking, high levels of alcohol consumed, frequent binge drinking and multiple alcohol related problems. Multinomial logistic regression, with multiple imputation for missing data, was used to assess the covariates of adolescent drinking patterns. Hazardous drinking was associated with being white, being male, having heavy drinking parents (in particular fathers), smoking, illicit drug use, and minor and violent offending behaviour. Non-significant associations were found between drinking patterns and general mental health and attention deficient disorder. Conclusion: The latent class typology exhibited concurrent validity in terms of its ability to distinguish respondents across a number of alcohol and non-alcohol indicators. Notwithstanding a number of limitations, latent class analysis offers an alternative data reduction method for the construction of drinking typologies that addresses known weaknesses inherent in more tradition classification methods.
Resumo:
Background: This study assessed the association between adolescent ecstasy use and depressive symptoms in adolescence. Methods: The Belfast Youth Development Study surveyed a cohort annually from age 11 to 16 years. Gender, Strengths and Difficulties Questionnaire emotional subscale, living arrangements, parental affluence, parent and peer attachment, tobacco, alcohol, cannabis and ecstasy use were investigated as predictors of Short Mood and Feelings Questionnaire (SMFQ) outcome. Results: Of 5371 respondents, 301 (5.6%) had an SMFQ > 15, and 1620 (30.2) had missing data for SMFQ. Around 8% of the cohort had used ecstasy by the end of follow-up. Of the non-drug users, ∼2% showed symptoms of depression, compared with 6% of those who had used alcohol, 6% of cannabis users, 6% of ecstasy users and 7% of frequent ecstasy users. Without adjustment, ecstasy users showed around a 4-fold increased odds of depressive symptoms compared with non-drug users [odds ratio (OR) = 0.26; 95% confidence interval (CI) = 0.10, 0.68]. Further adjustment for living arrangements, peer and parental attachment attenuated the association to under a 3-fold increase (OR = 0.37; 95% CI = 0.15, 0.94). There were no differences by frequency of use. Conclusions: Ecstasy use during adolescence may be associated with poorer mental health; however, this association can be explained by the confounding social influence of family dynamics. These findings could be used to aid effective evidence-based drug policies, which concentrate criminal justice and public health resources on reducing harm.
Resumo:
We present TANC, a TAN classifier (tree-augmented naive) based on imprecise probabilities. TANC models prior near-ignorance via the Extreme Imprecise Dirichlet Model (EDM). A first contribution of this paper is the experimental comparison between EDM and the global Imprecise Dirichlet Model using the naive credal classifier (NCC), with the aim of showing that EDM is a sensible approximation of the global IDM. TANC is able to deal with missing data in a conservative manner by considering all possible completions (without assuming them to be missing-at-random), but avoiding an exponential increase of the computational time. By experiments on real data sets, we show that TANC is more reliable than the Bayesian TAN and that it provides better performance compared to previous TANs based on imprecise probabilities. Yet, TANC is sometimes outperformed by NCC because the learned TAN structures are too complex; this calls for novel algorithms for learning the TAN structures, better suited for an imprecise probability classifier.
Resumo:
In this paper we present TANC, i.e., a tree-augmented naive credal classifier based on imprecise probabilities; it models prior near-ignorance via the Extreme Imprecise Dirichlet Model (EDM) (Cano et al., 2007) and deals conservatively with missing data in the training set, without assuming them to be missing-at-random. The EDM is an approximation of the global Imprecise Dirichlet Model (IDM), which considerably simplifies the computation of upper and lower probabilities; yet, having been only recently introduced, the quality of the provided approximation needs still to be verified. As first contribution, we extensively compare the output of the naive credal classifier (one of the few cases in which the global IDM can be exactly implemented) when learned with the EDM and the global IDM; the output of the classifier appears to be identical in the vast majority of cases, thus supporting the adoption of the EDM in real classification problems. Then, by experiments we show that TANC is more reliable than the precise TAN (learned with uniform prior), and also that it provides better performance compared to a previous (Zaffalon, 2003) TAN model based on imprecise probabilities. TANC treats missing data by considering all possible completions of the training set, but avoiding an exponential increase of the computational times; eventually, we present some preliminary results with missing data.
Resumo:
Some reasons for registering trials might be considered as self-serving, such as satisfying the requirements of a journal in which the researchers wish to publish their eventual findings or publicising the trial to boost recruitment. Registry entries also help others, including systematic reviewers, to know about ongoing or unpublished studies and contribute to reducing research waste by making it clear what studies are ongoing. Other sources of research waste include inconsistency in outcome measurement across trials in the same area, missing data on important outcomes from some trials, and selective reporting of outcomes. One way to reduce this waste is through the use of core outcome sets: standardised sets of outcomes for research in specific areas of health and social care. These do not restrict the outcomes that will be measured, but provide the minimum to include if a trial is to be of the most use to potential users. We propose that trial registries, such as ISRCTN, encourage researchers to note their use of a core outcome set in their entry. This will help people searching for trials and those worried about selective reporting in closed trials. Trial registries can facilitate these efforts to make new trials as useful as possible and reduce waste. The outcomes section in the entry could prompt the researcher to consider using a core outcome set and facilitate the specification of that core outcome set and its component outcomes through linking to the original core outcome set. In doing this, registries will contribute to the global effort to ensure that trials answer important uncertainties, can be brought together in systematic reviews, and better serve their ultimate aim of improving health and well-being through improving health and social care.
Resumo:
Background: Selection bias in HIV prevalence estimates occurs if non-participation in testing is correlated with HIV status. Longitudinal data suggests that individuals who know or suspect they are HIV positive are less likely to participate in testing in HIV surveys, in which case methods to correct for missing data which are based on imputation and observed characteristics will produce biased results. Methods: The identity of the HIV survey interviewer is typically associated with HIV testing participation, but is unlikely to be correlated with HIV status. Interviewer identity can thus be used as a selection variable allowing estimation of Heckman-type selection models. These models produce asymptotically unbiased HIV prevalence estimates, even when non-participation is correlated with unobserved characteristics, such as knowledge of HIV status. We introduce a new random effects method to these selection models which overcomes non-convergence caused by collinearity, small sample bias, and incorrect inference in existing approaches. Our method is easy to implement in standard statistical software, and allows the construction of bootstrapped standard errors which adjust for the fact that the relationship between testing and HIV status is uncertain and needs to be estimated. Results: Using nationally representative data from the Demographic and Health Surveys, we illustrate our approach with new point estimates and confidence intervals (CI) for HIV prevalence among men in Ghana (2003) and Zambia (2007). In Ghana, we find little evidence of selection bias as our selection model gives an HIV prevalence estimate of 1.4% (95% CI 1.2% – 1.6%), compared to 1.6% among those with a valid HIV test. In Zambia, our selection model gives an HIV prevalence estimate of 16.3% (95% CI 11.0% - 18.4%), compared to 12.1% among those with a valid HIV test. Therefore, those who decline to test in Zambia are found to be more likely to be HIV positive. Conclusions: Our approach corrects for selection bias in HIV prevalence estimates, is possible to implement even when HIV prevalence or non-participation is very high or very low, and provides a practical solution to account for both sampling and parameter uncertainty in the estimation of confidence intervals. The wide confidence intervals estimated in an example with high HIV prevalence indicate that it is difficult to correct statistically for the bias that may occur when a large proportion of people refuse to test.