77 results for robust estimation statistics
in University of Queensland eSpace - Australia
Abstract:
We present a novel nonparametric density estimator and a new data-driven bandwidth selection method with excellent properties. The approach is inspired by the principles of the generalized cross entropy method. The proposed density estimation procedure has numerous advantages over the traditional kernel density estimator methods. Firstly, for the first time in the nonparametric literature, the proposed estimator allows for a genuine incorporation of prior information in the density estimation procedure. Secondly, the approach provides the first data-driven bandwidth selection method that is guaranteed to provide a unique bandwidth for any data. Lastly, simulation examples suggest the proposed approach outperforms the current state of the art in nonparametric density estimation in terms of accuracy and reliability.
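For context, the traditional kernel density estimator that this abstract compares against can be sketched as follows. This is only the standard Gaussian-kernel baseline with a normal-reference rule-of-thumb bandwidth, not the proposed cross-entropy estimator or its bandwidth selector; the function names are hypothetical.

```python
import math

def rule_of_thumb_bandwidth(data):
    """Normal-reference bandwidth h = 1.06 * sd * n^(-1/5) (an assumed baseline choice)."""
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return 1.06 * sd * n ** (-0.2)

def gaussian_kde(data, bandwidth):
    """Return a function x -> estimated density using a Gaussian kernel."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per observation, each centred on that observation.
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density

# Usage: estimate the density of a small sample at one point.
sample = [0.1, 0.5, 0.9, 1.3, 2.0]
h = rule_of_thumb_bandwidth(sample)
f_hat = gaussian_kde(sample, h)
print(f_hat(1.0))
```

The estimate depends strongly on the bandwidth, which is why a data-driven selector with a guaranteed unique solution, as the abstract claims, is of practical interest.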
Abstract:
Normal mixture models are being increasingly used to model the distributions of a wide variety of random phenomena and to cluster sets of continuous multivariate data. However, for a set of data containing a group or groups of observations with longer than normal tails or atypical observations, the use of normal components may unduly affect the fit of the mixture model. In this paper, we consider a more robust approach by modelling the data by a mixture of t distributions. The use of the ECM algorithm to fit this t mixture model is described and examples of its use are given in the context of clustering multivariate data in the presence of atypical observations in the form of background noise.
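The robustness of t components comes from the way the EM iterations downweight atypical observations. The sketch below illustrates this for a single univariate t component with fixed degrees of freedom; it is a simplified illustration, not the multivariate ECM mixture algorithm described in the paper, and the function name and default `nu` are assumptions.

```python
def fit_t_location_scale(data, nu=3.0, n_iter=200):
    """EM for the location and scale of a univariate t distribution with fixed nu.

    Assumes the data are not all identical (otherwise the scale collapses to zero).
    Returns (mu, sigma2, weights), where small weights flag atypical observations.
    """
    n = len(data)
    mu = sum(data) / n
    sigma2 = sum((x - mu) ** 2 for x in data) / n
    weights = [1.0] * n
    for _ in range(n_iter):
        # E-step: observations far from mu (in scaled distance) get small weights.
        weights = [(nu + 1.0) / (nu + (x - mu) ** 2 / sigma2) for x in data]
        # M-step: weighted mean, then weighted scale using the updated mean.
        mu = sum(w * x for w, x in zip(weights, data)) / sum(weights)
        sigma2 = sum(w * (x - mu) ** 2 for w, x in zip(weights, data)) / n
    return mu, sigma2, weights

# Usage: one gross outlier barely moves the fitted location.
data = [0.1, -0.2, 0.05, 0.0, -0.1, 0.15, 50.0]
mu, sigma2, weights = fit_t_location_scale(data)
print(mu, weights[-1])
```

With normal components the same outlier would pull the mean to about 7.1; under the t model its weight shrinks towards zero instead.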
Abstract:
A significant problem in the collection of responses to potentially sensitive questions, such as those relating to illegal, immoral or embarrassing activities, is non-sampling error due to refusal to respond or false responses. Eichhorn & Hayre (1983) suggested the use of scrambled responses to reduce this form of bias. This paper considers a linear regression model in which the dependent variable is unobserved but its sum or product with a scrambling random variable of known distribution is observed. The performance of two likelihood-based estimators is investigated: a Bayesian estimator obtained through a Markov chain Monte Carlo (MCMC) sampling scheme, and a classical maximum-likelihood estimator. These two estimators and an estimator suggested by Singh, Joarder & King (1996) are compared. Monte Carlo results show that the Bayesian estimator outperforms the classical estimators in almost all cases, and the relative performance of the Bayesian estimator improves as the responses become more scrambled.
Abstract:
Binning and truncation of data are common in data analysis and machine learning. This paper addresses the problem of fitting mixture densities to multivariate binned and truncated data. The EM approach proposed by McLachlan and Jones (Biometrics, 44: 2, 571-578, 1988) for the univariate case is generalized to multivariate measurements. The multivariate solution requires the evaluation of multidimensional integrals over each bin at each iteration of the EM procedure. Naive implementation of the procedure can lead to computationally inefficient results. To reduce the computational cost a number of straightforward numerical techniques are proposed. Results on simulated data indicate that the proposed methods can achieve significant computational gains with no loss in the accuracy of the final parameter estimates. Furthermore, experimental results suggest that with a sufficient number of bins and data points it is possible to estimate the true underlying density almost as well as if the data were not binned. The paper concludes with a brief description of an application of this approach to diagnosis of iron deficiency anemia, in the context of binned and truncated bivariate measurements of volume and hemoglobin concentration from an individual's red blood cells.
Abstract:
We present a novel maximum-likelihood-based algorithm for estimating the distribution of alignment scores from the scores of unrelated sequences in a database search. Using a new method for measuring the accuracy of p-values, we show that our maximum-likelihood-based algorithm is more accurate than existing regression-based and lookup table methods. We explore a more sophisticated way of modeling and estimating the score distributions (using a two-component mixture model and expectation maximization), but conclude that this does not improve significantly over simply ignoring scores with small E-values during estimation. Finally, we measure the classification accuracy of p-values estimated in different ways and observe that inaccurate p-values can, somewhat paradoxically, lead to higher classification accuracy. We explain this paradox and argue that statistical accuracy, not classification accuracy, should be the primary criterion in comparisons of similarity search methods that return p-values that adjust for target sequence length.
Abstract:
This article presents Monte Carlo techniques for estimating network reliability. For highly reliable networks, techniques based on graph evolution models provide very good performance. However, they are known to have significant simulation cost. An existing hybrid scheme (based on partitioning the time space) is available to speed up the simulations; however, there are difficulties with optimizing the important parameter associated with this scheme. To overcome these difficulties, a new hybrid scheme (based on partitioning the edge set) is proposed in this article. The proposed scheme shows orders of magnitude improvement of performance over the existing techniques in certain classes of network. It also provides reliability bounds with little overhead.
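As a point of reference, the simplest Monte Carlo estimator of network reliability samples edge failures directly and checks connectivity. The sketch below is this crude baseline for two-terminal reliability with independent, identically reliable edges; it is not the graph-evolution or hybrid scheme from the article, and all names and the independence assumption are illustrative.

```python
import random

def is_connected(n_nodes, edges, source, target):
    """Depth-first search: can target be reached from source over the given edges?"""
    adjacency = {i: [] for i in range(n_nodes)}
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    seen = {source}
    stack = [source]
    while stack:
        u = stack.pop()
        if u == target:
            return True
        for v in adjacency[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return False

def crude_mc_reliability(n_nodes, edges, p_up, source, target, n_samples, seed=0):
    """Fraction of sampled networks in which source and target stay connected."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        # Each edge is operational independently with probability p_up.
        alive = [e for e in edges if rng.random() < p_up]
        if is_connected(n_nodes, alive, source, target):
            hits += 1
    return hits / n_samples

# Usage: two edges in series, so the exact reliability is 0.9 * 0.9 = 0.81.
print(crude_mc_reliability(3, [(0, 1), (1, 2)], 0.9, 0, 2, 20000))
```

For highly reliable networks failures are rare events, so this estimator needs a very large number of samples to see any failure at all, which is the regime the more specialised techniques in the article target.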
Abstract:
Objective: To compare percentage body fat (%BF) for a given body mass index (BMI) among New Zealand European, Maori and Pacific Island children. To develop prediction equations based on bioimpedance measurements for the estimation of fat-free mass (FFM) appropriate to children in these three ethnic groups. Design: Cross-sectional study. Purposive sampling of schoolchildren aimed at recruiting three children of each sex and ethnicity for each year of age. Double cross-validation of FFM prediction equations developed by multiple regression. Setting: Local schools in Auckland. Subjects: Healthy European, Maori and Pacific Island children (n = 172, 83 M, 89 F, mean age 9.4 ± 2.8 (s.d.), range 5-14 y). Measurements: Height, weight, age, sex and ethnicity were recorded. FFM was derived from measurements of total body water by deuterium dilution and resistance and reactance were measured by bioimpedance analysis. Results: For fixed BMI, the Maori and Pacific Island girls averaged 3.7% lower %BF than European girls. For boys a similar relation was not found since BMI did not significantly influence %BF of European boys (P = 0.18). Based on bioimpedance measurements a single prediction equation was developed for all children: FFM (kg) = 0.622 height (cm)^2 / resistance + 0.234 weight (kg) + 1.166, R^2 = 0.96, s.e.e. = 2.44 kg. Ethnicity, age and sex were not significant predictors. Conclusions: A robust equation for estimation of FFM in New Zealand European, Maori and Pacific Island children in the 5-14 y age range that is more suitable than BMI for the determination of body fatness in field studies has been developed. Sponsorship: Maurice and Phyllis Paykel Trust, Auckland University of Technology Contestable Grants Fund and the Ministry of Health.
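The prediction equation reported in this abstract is simple enough to apply directly. The sketch below evaluates it and derives %BF as the share of body weight that is not fat-free mass; the function names are hypothetical, the resistance unit is assumed to be ohms (the abstract does not state it), and the example child is invented for illustration.

```python
def fat_free_mass_kg(height_cm, resistance, weight_kg):
    """FFM (kg) = 0.622 * height^2 / resistance + 0.234 * weight + 1.166.

    Coefficients are taken from the abstract; resistance is assumed to be in ohms.
    """
    return 0.622 * height_cm ** 2 / resistance + 0.234 * weight_kg + 1.166

def percent_body_fat(weight_kg, ffm_kg):
    """%BF as fat mass (weight minus FFM) relative to total weight."""
    return 100.0 * (weight_kg - ffm_kg) / weight_kg

# Usage with made-up measurements: 140 cm, 700 ohm, 35 kg.
ffm = fat_free_mass_kg(140.0, 700.0, 35.0)
print(round(ffm, 3), round(percent_body_fat(35.0, ffm), 1))
```

Note that the abstract reports a standard error of estimate of 2.44 kg, so individual predictions carry that order of uncertainty.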
Abstract:
Genetic assignment methods use genotype likelihoods to draw inference about where individuals were or were not born, potentially allowing direct, real-time estimates of dispersal. We used simulated data sets to test the power and accuracy of Monte Carlo resampling methods in generating statistical thresholds for identifying F0 immigrants in populations with ongoing gene flow, and hence for providing direct, real-time estimates of migration rates. The identification of accurate critical values required that resampling methods preserved the linkage disequilibrium deriving from recent generations of immigrants and reflected the sampling variance present in the data set being analysed. A novel Monte Carlo resampling method taking into account these aspects was proposed and its efficiency was evaluated. Power and error were relatively insensitive to the frequency assumed for missing alleles. Power to identify F0 immigrants was improved by using large sample size (up to about 50 individuals) and by sampling all populations from which migrants may have originated. A combination of plotting genotype likelihoods and calculating mean genotype likelihood ratios (D_LR) appeared to be an effective way to predict whether F0 immigrants could be identified for a particular pair of populations using a given set of markers.
Abstract:
A generic method for the estimation of parameters for Stochastic Ordinary Differential Equations (SODEs) is introduced and developed. This algorithm, called the GePERs method, utilises a genetic optimisation algorithm to minimise a stochastic objective function based on the Kolmogorov-Smirnov (KS) statistic. Numerical simulations are utilised to form the KS statistic. Some of the factors that improve the precision of the estimates are also examined. This method is used to estimate parameters of diffusion equations and jump-diffusion equations. It is also applied to the problem of model selection for the Queensland electricity market.
Abstract:
In this paper we investigate a Bayesian procedure for the estimation of a flexible generalised distribution, notably the MacGillivray adaptation of the g-and-k distribution. This distribution, described through its inverse cdf or quantile function, generalises the standard normal through extra parameters which together describe skewness and kurtosis. The standard quantile-based methods for estimating the parameters of generalised distributions are often arbitrary and do not rely on computation of the likelihood. MCMC, however, provides a simulation-based alternative for obtaining the maximum likelihood estimates of parameters of these distributions or for deriving posterior estimates of the parameters through a Bayesian framework. In this paper we adopt the latter approach. The proposed methodology is illustrated through an application in which the parameter of interest is slightly skewed.
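Because the g-and-k distribution is defined through its quantile function, evaluating that function and sampling from it are straightforward even though the density has no closed form. The sketch below uses the commonly cited parameterisation with location `a`, scale `b`, skewness `g`, kurtosis `k` and the conventional constant `c = 0.8`; the exact form used in the paper may differ, and the function names are hypothetical.

```python
import math
import random

def gk_quantile(z, a, b, g, k, c=0.8):
    """g-and-k quantile function evaluated at a standard normal quantile z.

    Q(z) = a + b * (1 + c * tanh(g * z / 2)) * (1 + z^2)^k * z,
    where tanh(g*z/2) equals (1 - exp(-g*z)) / (1 + exp(-g*z)).
    """
    skew_term = 1.0 + c * math.tanh(g * z / 2.0)
    return a + b * skew_term * (1.0 + z * z) ** k * z

def gk_sample(n, a, b, g, k, seed=0):
    """Draw n values by transforming standard normal draws through the quantile function."""
    rng = random.Random(seed)
    return [gk_quantile(rng.gauss(0.0, 1.0), a, b, g, k) for _ in range(n)]

# With g = 0 and k = 0 the distribution reduces to a normal with mean a and sd b.
print(gk_quantile(1.0, 3.0, 2.0, 0.0, 0.0))  # 5.0
draws = gk_sample(5, 3.0, 2.0, 0.5, 0.1)
```

Simulation being this cheap while the likelihood is awkward to compute is what makes simulation-based inference such as MCMC attractive for this family.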
Abstract:
Testing for simultaneous vicariance across comparative phylogeographic data sets is a notoriously difficult problem hindered by mutational variance, the coalescent variance, and variability across pairs of sister taxa in parameters that affect genetic divergence. We simulate vicariance to characterize the behaviour of several commonly used summary statistics across a range of divergence times, and to characterize this behaviour in comparative phylogeographic datasets having multiple taxon-pairs. We found Tajima's D to be relatively uncorrelated with other summary statistics across divergence times, and using simple hypothesis testing of simultaneous vicariance given variable population sizes, we counter-intuitively found that the variance across taxon pairs in Nei and Li's net nucleotide divergence (π_net), a common measure of population divergence, is often inferior to using the variance in Tajima's D across taxon pairs as a test statistic to distinguish ancient simultaneous vicariance from variable vicariance histories. The opposite and more intuitive pattern is found for testing more recent simultaneous vicariance, and overall we found that depending on the timing of vicariance, one of these two test statistics can achieve high statistical power for rejecting simultaneous vicariance, given a reasonable number of intron loci (> 5 loci, 400 bp) and a range of conditions. These results suggest that components of these two composite summary statistics should be used in future simulation-based methods which can simultaneously use a pool of summary statistics to test the comparative phylogeographic hypotheses we consider here.
Abstract:
All muscle contractions are dependent on the functioning of motor units. In diseases such as amyotrophic lateral sclerosis (ALS), progressive loss of motor units leads to gradual paralysis. A major difficulty in the search for a treatment for these diseases has been the lack of a reliable measure of disease progression. One possible measure would be an estimate of the number of surviving motor units. Despite over 30 years of motor unit number estimation (MUNE), all proposed methods have been met with practical and theoretical objections. Our aim is to develop a method of MUNE that overcomes these objections. We record the compound muscle action potential (CMAP) from a selected muscle in response to a graded electrical stimulation applied to the nerve. As the stimulus increases, the threshold of each motor unit is exceeded, and the size of the CMAP increases until a maximum response is obtained. However, the threshold potential required to excite an axon is not a precise value but fluctuates over a small range leading to probabilistic activation of motor units in response to a given stimulus. When the threshold ranges of motor units overlap, there may be alternation where the number of motor units that fire in response to the stimulus is variable. This means that increments in the value of the CMAP correspond to the firing of different combinations of motor units. At a fixed stimulus, variability in the CMAP, measured as variance, can be used to conduct MUNE using the "statistical" or the "Poisson" method. However, this method relies on the assumptions that the numbers of motor units that are firing probabilistically have the Poisson distribution and that all single motor unit action potentials (MUAP) have a fixed and identical size. These assumptions are not necessarily correct. We propose to develop a Bayesian statistical methodology to analyze electrophysiological data to provide an estimate of motor unit numbers. 
Our method of MUNE incorporates the variability of the threshold, the variability between and within single MUAPs, and baseline variability. Our model not only gives the most probable number of motor units but also provides information about both the population of units and individual units. We use Markov chain Monte Carlo to obtain information about the characteristics of individual motor units and about the population of motor units and the Bayesian information criterion for MUNE. We test our method of MUNE on three subjects. Our method provides a reproducible estimate for a patient with stable but severe ALS. In a serial study, we demonstrate a decline in the number of motor units in a patient with rapidly advancing disease. Finally, with our last patient, we show that our method has the capacity to estimate a larger number of motor units.
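The "statistical" or "Poisson" method that this abstract criticises can be written down in a few lines, which also makes its assumptions explicit. If the number of units firing at a fixed stimulus is Poisson and every MUAP has the same size s, the response has mean s·λ and variance s²·λ, so the variance-to-mean ratio recovers s. The sketch below illustrates only that older method, not the Bayesian approach proposed by the authors; function names, the baseline handling and the example numbers are assumptions.

```python
from statistics import mean, variance

def estimate_muap_size(responses, baseline=0.0):
    """Single MUAP size as variance / mean of baseline-corrected responses.

    Valid only under the Poisson and identical-size assumptions named in the abstract.
    """
    adjusted = [r - baseline for r in responses]
    m = mean(adjusted)
    if m <= 0:
        raise ValueError("mean response must be positive")
    return variance(adjusted) / m

def poisson_mune(responses, max_cmap, baseline=0.0):
    """Motor unit number estimate: maximal CMAP divided by the estimated MUAP size."""
    return max_cmap / estimate_muap_size(responses, baseline)

# Usage with invented responses recorded at one fixed stimulus level.
responses = [2.0, 4.0, 6.0, 8.0, 10.0]
print(poisson_mune(responses, max_cmap=100.0))
```

When MUAP sizes differ between units or the firing counts are not Poisson, the variance-to-mean ratio no longer equals a single unit size, which is the objection the abstract raises.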