940 results for biased sampling
Abstract:
We consider estimating the total load from frequent flow data but less frequent concentration data. There are numerous load estimation methods available, some of which are captured in various online tools. However, most estimators are subject to large statistical biases, and their associated uncertainties are often not reported. This makes interpretation difficult and makes it impossible to assess trends or to determine optimal sampling regimes. In this paper, we first propose two indices for measuring the extent of sampling bias, and then provide steps for obtaining reliable load estimates that minimize the biases and make use of informative predictive variables. The key step in this approach is the development of an appropriate predictive model for concentration. This is achieved using a generalized rating-curve approach with additional predictors that capture unique features in the flow data, such as the concept of the first flush, the location of the event on the hydrograph (e.g. rise or fall) and the discounted flow. The latter may be thought of as a measure of constituent exhaustion occurring during flood events. Incorporating this additional information can significantly improve the predictability of concentration, and ultimately the precision with which the pollutant load is estimated. We also provide a measure of the standard error of the load estimate which incorporates model, spatial and/or temporal errors. This method also has the capacity to incorporate measurement error incurred through the sampling of flow. We illustrate this approach for two rivers delivering to the Great Barrier Reef, Queensland, Australia. One is a data set from the Burdekin River, consisting of total suspended sediment (TSS) and nitrogen oxide (NOx) concentrations and gauged flow for 1997. The other dataset is from the Tully River, for the period of July 2000 to June 2008. For NOx in the Burdekin, the new estimates are very similar to the ratio estimates even when there is no relationship between the concentration and the flow. However, for the Tully dataset, by incorporating the additional predictive variables, namely the discounted flow and flow phase (rising or receding), we substantially improved the model fit, and thus the certainty with which the load is estimated.
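For orientation, the load calculation implied by this kind of approach can be written compactly as below; the notation and the particular predictor set shown are illustrative rather than taken verbatim from the paper.

\[
\hat{L} \;=\; \sum_{j} \hat{c}_j \,\hat{q}_j \,\Delta t ,
\qquad
\log \hat{c}_j \;=\; \beta_0 + \beta_1 \log \hat{q}_j
      + \beta_2\,\mathrm{phase}_j + \beta_3\, d_j + \varepsilon_j ,
\]

where \(\hat{q}_j\) is the predicted flow at regular time step \(j\), \(\mathrm{phase}_j\) indicates the rising or falling limb of the hydrograph, \(d_j\) is a discounted-flow term acting as an exhaustion proxy, and \(\Delta t\) is the length of the time step.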
Abstract:
Using six lattice types (4×4, 5×5, and 6×6 square lattices; a 3×3×3 cubic lattice; and 2+3+4+3+2 and 4+5+6+5+4 triangular lattices), three alphabet sizes (HP, HNUP, and 20 letters), and two energy functions, the designability of protein structures is calculated based on random sampling of structures and common biased sampling (CBS) of protein sequence space. Three quantities of a structure, namely stability (average energy gap), foldability, and partnum, which are defined to elucidate the designability, are then calculated. The authors find that, whatever the lattice type, alphabet size, and energy function used, highly designable (preferred) structures emerge. In all cases considered, local interactions reduce degeneracy and increase the designability. The designability is sensitive to the lattice type, alphabet size, energy function, and the method used to sample sequence space. Compared with random sampling, both the CBS and the Metropolis Monte Carlo sampling methods yield higher designability. The correlation coefficients between designability, stability, and foldability are mostly larger than 0.5, demonstrating a strong correlation among them. The correlation between designability and partnum is weaker, because partnum is independent of the energy. The results are useful for practical applications of the designability principle, such as predicting protein tertiary structure.
Abstract:
We investigate the utility to computational Bayesian analyses of a particular family of recursive marginal likelihood estimators characterized by the (equivalent) algorithms known as "biased sampling" or "reverse logistic regression" in the statistics literature and "the density of states" in physics. Through a pair of numerical examples (including mixture modeling of the well-known galaxy dataset) we highlight the remarkable diversity of sampling schemes amenable to such recursive normalization, as well as the notable efficiency of the resulting pseudo-mixture distributions for gauging prior-sensitivity in the Bayesian model selection context. Our key theoretical contributions are to introduce a novel heuristic ("thermodynamic integration via importance sampling") for qualifying the role of the bridging sequence in this procedure, and to reveal various connections between these recursive estimators and the nested sampling technique.
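As a reference point, the "biased sampling" / reverse-logistic-regression estimator of normalizing constants that this family of recursive estimators builds on can be sketched as follows. This is the standard formulation commonly associated with Geyer's reverse logistic regression and Vardi's biased sampling models, not necessarily the paper's own notation.

\[
\hat{\zeta} \;=\; \arg\max_{\zeta}\;
\sum_{k=1}^{K}\sum_{i=1}^{n_k}
\log \frac{n_k \, q_k(x_{ki})\, e^{-\zeta_k}}
         {\sum_{j=1}^{K} n_j \, q_j(x_{ki})\, e^{-\zeta_j}} ,
\]

where \(x_{k1},\dots,x_{k n_k}\) are draws from the \(k\)-th bridging distribution with unnormalized density \(q_k\), and \(\zeta_k = \log Z_k\) is its unknown log normalizing constant, identifiable only up to a common additive shift (so one \(\zeta_k\) is fixed, e.g. \(\zeta_1 = 0\)). In the marginal-likelihood setting, the ratio of the end-point normalizing constants is the quantity of interest.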
Abstract:
There are numerous load estimation methods available, some of which are captured in various online tools. However, most estimators are subject to large statistical biases, and their associated uncertainties are often not reported. This makes interpretation difficult and makes it impossible to assess trends or to determine optimal sampling regimes. In this paper, we first propose two indices for measuring the extent of sampling bias, and then provide steps for obtaining reliable load estimates by minimizing the biases and making use of possible predictive variables. The load estimation procedure can be summarized by the following four steps (sketched in code after this abstract):
- (i) output the flow rates at regular time intervals (e.g. 10 minutes) using a time series model that captures all the peak flows;
- (ii) output the predicted flow rates as in (i) at the concentration sampling times, if the corresponding flow rates are not collected;
- (iii) establish a predictive model for the concentration data, which incorporates all possible predictor variables, and output the predicted concentrations at the regular time intervals as in (i); and
- (iv) obtain the sum of all the products of the predicted flow and the predicted concentration over the regular time intervals to represent an estimate of the load.
The key step in this approach is the development of an appropriate predictive model for concentration. This is achieved using a generalized regression (rating-curve) approach with additional predictors that capture unique features in the flow data, namely the concept of the first flush, the location of the event on the hydrograph (e.g. rise or fall) and cumulative discounted flow. The latter may be thought of as a measure of constituent exhaustion occurring during flood events. The model also has the capacity to accommodate autocorrelation in model errors resulting from intensive sampling during floods. Incorporating this additional information can significantly improve the predictability of concentration, and ultimately the precision with which the pollutant load is estimated. We also provide a measure of the standard error of the load estimate which incorporates model, spatial and/or temporal errors. This method also has the capacity to incorporate measurement error incurred through the sampling of flow. We illustrate this approach using the concentrations of total suspended sediment (TSS) and nitrogen oxide (NOx) and gauged flow data from the Burdekin River, a catchment delivering to the Great Barrier Reef. The sampling biases for NOx concentrations range from a factor of 2 to 10, indicating severe bias. As expected, the traditional average and extrapolation methods produce much higher estimates than those obtained when the sampling bias is taken into account.
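The four steps above can be sketched on synthetic data as follows. Column names, the discount rate, and the regression form used here are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# (i) flow predicted on a regular 10-minute grid (here: a synthetic flow record)
grid = pd.date_range("2000-07-01", periods=6 * 24 * 30, freq="10min")
flow = pd.Series(np.exp(rng.normal(2, 1, len(grid))), index=grid, name="flow")

# (ii) flow at the (sparser) concentration sampling times
sample_times = grid[::144]                      # e.g. daily grab samples
flow_s = flow.loc[sample_times]

# toy "observed" concentrations with a rating-curve-like dependence on flow
conc_s = np.exp(0.5 + 0.4 * np.log(flow_s) + rng.normal(0, 0.3, len(flow_s)))

# (iii) rating-curve model for concentration with an extra predictor
#       (an exponentially discounted flow as a crude exhaustion measure)
disc = flow.ewm(halflife=6 * 24).mean()
X_s = np.column_stack([np.ones(len(flow_s)), np.log(flow_s), np.log(disc.loc[sample_times])])
beta, *_ = np.linalg.lstsq(X_s, np.log(conc_s), rcond=None)

X_all = np.column_stack([np.ones(len(flow)), np.log(flow), np.log(disc)])
conc_hat = np.exp(X_all @ beta)                 # predicted concentration on the full grid

# (iv) load = sum of (predicted flow x predicted concentration x time step)
dt_seconds = 600.0
load = float((flow.values * conc_hat * dt_seconds).sum())
print(f"estimated load: {load:.3g} (flow x concentration x s, toy units)")
```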
Abstract:
Troxel, Lipsitz, and Brennan (1997, Biometrics 53, 857-869) considered parameter estimation from survey data with nonignorable nonresponse and proposed weighted estimating equations to remove the biases in the complete-case analysis that ignores missing observations. This paper suggests two alternative modifications for unbiased estimation of regression parameters when a binary outcome is potentially observed at successive time points. The weighting approach of Robins, Rotnitzky, and Zhao (1995, Journal of the American Statistical Association 90, 106-121) is also modified to obtain unbiased estimating functions. The suggested estimating functions are unbiased only when the missingness probability is correctly specified, and misspecification of the missingness model will result in biases in the estimates. Simulation studies are carried out to assess the performance of different methods when the covariate is binary or normal. For the simulation models used, the relative efficiency of the two new methods to the weighting methods is about 3.0 for the slope parameter and about 2.0 for the intercept parameter when the covariate is continuous and the missingness probability is correctly specified. All methods produce substantial biases in the estimates when the missingness model is misspecified or underspecified. Analysis of data from a medical survey illustrates the use and possible differences of these estimating functions.
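For context, the weighting idea referred to above, in which complete cases are reweighted by the inverse of their estimated probability of being observed, can be written generically as

\[
\sum_{i=1}^{n}\sum_{t=1}^{T}
\frac{R_{it}}{\hat{\pi}_{it}}\; U_{it}(\beta) \;=\; 0 ,
\]

where \(R_{it}\) indicates whether subject \(i\)'s binary outcome is observed at time \(t\), \(\hat{\pi}_{it}\) is the modeled probability of being observed (which may depend on the outcome itself under nonignorable nonresponse), and \(U_{it}(\beta)\) is the usual complete-data estimating function. This is the generic inverse-probability-weighted form rather than the paper's specific modified estimating functions, and it is unbiased only when \(\pi_{it}\) is correctly specified, which is exactly the sensitivity the simulations examine.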
Abstract:
The paper studies stochastic approximation as a technique for bias reduction. The proposed method does not require approximating the bias explicitly, nor does it rely on having independent identically distributed (i.i.d.) data. The method always removes the leading bias term, under very mild conditions, as long as auxiliary samples from distributions with given parameters are available. Expectation and variance of the bias-corrected estimate are given. Examples in sequential clinical trials (non-i.i.d. case), curved exponential models (i.i.d. case) and length-biased sampling (where the estimates are inconsistent) are used to illustrate the applications of the proposed method and its small sample properties.
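The role played by auxiliary samples drawn at the fitted parameter can be illustrated with a simple one-step, parametric-bootstrap-style correction; the paper's stochastic-approximation scheme is iterative and more general, so the sketch below shows only the underlying idea, on a toy problem where the plug-in estimator of exp(mu) is biased upward.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.0, 1.0, 20
x = rng.normal(mu, sigma, n)

theta_hat = np.exp(x.mean())                 # plug-in estimate of exp(mu); biased upward

# auxiliary samples: re-generate data from the model at the estimated parameters
B = 2000
aux = rng.normal(x.mean(), x.std(ddof=1), size=(B, n))
theta_aux = np.exp(aux.mean(axis=1))

bias_hat = theta_aux.mean() - theta_hat      # Monte Carlo estimate of the leading bias term
theta_corrected = theta_hat - bias_hat       # equivalently 2*theta_hat - theta_aux.mean()

print(theta_hat, theta_corrected, np.exp(mu))
```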
Abstract:
A method is given for proving efficiency of the NPMLE, directly linked to empirical process theory. The conditions in general are appropriate consistency of the NPMLE, differentiability of the model, differentiability of the parameter of interest, local convexity of the parameter space, and a Donsker class condition for the class of efficient influence functions obtained by varying the parameters. For the case that the model is linear in the parameter and the parameter space is convex, as with most nonparametric missing data models, we show that the method leads to an identity for the NPMLE which almost says that the NPMLE is efficient and provides us straightforwardly with a consistency and efficiency proof. This identity is extended to an almost linear class of models which contain biased sampling models. To illustrate, the method is applied to the univariate censoring model, random truncation models, the interval censoring case I model, the class of parametric models, and a class of semiparametric models.
Abstract:
Prevalent sampling is an efficient and focused approach to the study of the natural history of disease. Right-censored time-to-event data observed from prospective prevalent cohort studies are often subject to left-truncated sampling. Left-truncated samples are not randomly selected from the population of interest and carry a selection bias. Extensive studies have focused on estimating the unbiased distribution given left-truncated samples. However, in many applications, the exact date of disease onset is not observed. For example, in an HIV infection study, the exact HIV infection time is not observable; it is only known that the infection occurred between two observable dates. Meeting these challenges motivated our study. We propose parametric models to estimate the unbiased distribution of left-truncated, right-censored time-to-event data with uncertain onset times. We first consider data from length-biased sampling, a special case of left-truncated sampling, and then extend the proposed method to general left-truncated sampling. With a parametric model, we construct the full likelihood given a biased sample with unobservable onset of disease. The parameters are estimated by maximizing the constructed likelihood, adjusting for the selection bias and the unobservable exact onset. Simulations are conducted to evaluate the finite-sample performance of the proposed methods. We apply the proposed method to an HIV infection study, estimating the unbiased survival function and covariate coefficients.
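For reference, the selection effect in the length-biased special case takes the standard form

\[
g(t) \;=\; \frac{t\, f(t)}{\mu},
\qquad
\mu \;=\; \int_0^{\infty} u\, f(u)\, du ,
\]

where \(f\) is the population density of the onset-to-event time and \(g\) is the density actually observed under length-biased sampling, so long durations are over-represented in the prevalent cohort. The paper's full likelihood additionally integrates over the interval-censored onset time and accounts for right censoring, so this display is only the building block.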
Abstract:
A distribution of tumor size at detection is derived within the framework of a mechanistic model of carcinogenesis, with the object of estimating biologically meaningful parameters of tumor latency. Its limiting form appears to be a generalization of the distribution that arises in length-biased sampling from stationary point processes. The model renders the associated estimation problems tractable. The usefulness of the proposed approach is illustrated with an application to clinical data on premenopausal breast cancer.
Abstract:
The North American Breeding Bird Survey (BBS) is the principal source of data to inform researchers about the status and trends of boreal forest birds. Unfortunately, little BBS coverage is available in the boreal forest, where increasing concern over the status of species breeding there has increased interest in northward expansion of the BBS. However, high disturbance rates in the boreal forest may complicate roadside monitoring. If the roadside sampling frame does not capture variation in disturbance rates because of either road placement or the use of roads for resource extraction, biased trend estimates might result. In this study, we examined roadside bias in the proportional representation of habitat disturbance via spatial data on forest "loss," forest fires, and anthropogenic disturbance. For each of 455 BBS routes, the area disturbed within multiple buffers at increasing distances from the road was calculated and compared against the area disturbed in degree blocks and BBS strata. We found a nonlinear relationship between bias and distance from the road, suggesting forest loss and forest fires were underrepresented below 75 and 100 m, respectively. In contrast, anthropogenic disturbance was overrepresented at distances below 500 m and underrepresented thereafter. After accounting for distance from road, BBS routes were reasonably representative of the degree blocks they were within, with only a few strata showing biased representation. In general, anthropogenic disturbance is overrepresented in southern strata, and forest fires are underrepresented in almost all strata. Similar biases exist when comparing the entire road network and the subset sampled by BBS routes against the amount of disturbance within BBS strata; however, the magnitude of biases differed. Based on our results, we recommend that spatial stratification and rotating panel designs be used to spread limited BBS and off-road sampling effort in an unbiased fashion and that new BBS routes be established where sufficient road coverage exists.
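The buffer-based representation check described above can be illustrated with toy geometries as below; all geometries, distances, and the single disturbance polygon are made up, whereas the real analysis uses mapped forest loss, fire, and anthropogenic disturbance layers per BBS route and stratum.

```python
import geopandas as gpd
from shapely.geometry import LineString, box

route = gpd.GeoSeries([LineString([(0, 0), (10000, 0)])])       # a 10 km "roadside route"
block = gpd.GeoSeries([box(-2000, -12000, 12000, 12000)])       # surrounding "degree block"
disturbance = gpd.GeoSeries([box(2000, 200, 6000, 3000)])       # disturbed patch near the road

# proportion of the degree block that is disturbed (the benchmark)
p_block = disturbance.intersection(block).area.sum() / block.area.sum()

for dist in (75, 100, 500, 1000):
    buf = route.buffer(dist)                                    # roadside buffer at this distance
    p_road = disturbance.intersection(buf).area.sum() / buf.area.sum()
    # ratio > 1: disturbance over-represented near the road; < 1: under-represented
    print(dist, round(p_road / p_block, 2))
```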
Abstract:
Historically a significant gap between male and female wages has existed in the Australian labour market. Indeed, this wage differential was institutionalised in the 1912 arbitration decision, which determined that the basic female wage would be set at between 54 and 66 per cent of the male wage. More recently, however, the 1969 and 1972 Equal Pay Cases determined that male/female wage relativities should be based upon the premise of equal pay for work of equal value. It is important to note that the mere observation that average wages differ between males and females is not in itself evidence of sex discrimination. Economists restrict the definition of wage discrimination to cases where two distinct groups receive different average remuneration for reasons unrelated to differences in productivity characteristics. This paper extends previous studies of wage discrimination in Australia (Chapman and Mulvey, 1986; Haig, 1982) by correcting the estimated male/female wage differential for the existence of non-random sampling. Previous Australian estimates of male/female human-capital-based wage specifications, together with estimates of the corresponding wage differential, all suffer from a failure to address this issue. If the sample of females observed to be working does not represent a random sample, then the estimates of the male/female wage differential will be both biased and inconsistent.
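Corrections for non-random selection into employment of the kind alluded to above are commonly implemented as a Heckman-style two-step estimator. The sketch below is a minimal illustration on synthetic data (all variable names and coefficients are hypothetical), not the paper's exact specification.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 2000
educ = rng.normal(12, 2, n)                  # productivity characteristic
kids = rng.binomial(1, 0.4, n)               # exclusion restriction (affects selection only)
u_sel, u_wage = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], n).T
works = (0.5 + 0.1 * educ - 1.0 * kids + u_sel > 0).astype(int)
log_wage = 1.0 + 0.08 * educ + u_wage        # observed only if works == 1

# Step 1: probit selection equation for the probability of being observed working.
Z = sm.add_constant(np.column_stack([educ, kids]))
probit = sm.Probit(works, Z).fit(disp=0)
xb = probit.fittedvalues                     # linear predictor Z'gamma
mills = norm.pdf(xb) / norm.cdf(xb)          # inverse Mills ratio

# Step 2: wage equation on workers only, augmented with the Mills ratio.
sel = works == 1
X = sm.add_constant(np.column_stack([educ[sel], mills[sel]]))
ols = sm.OLS(log_wage[sel], X).fit()
print(ols.params)                            # selection-corrected returns to education
```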
Abstract:
Standard approaches for ellipse fitting are based on the minimization of algebraic or geometric distance between the given data and a template ellipse. When the data are noisy and come from a partial ellipse, the state-of-the-art methods tend to produce biased ellipses. We rely on the sampling structure of the underlying signal and show that the x- and y-coordinate functions of an ellipse are finite-rate-of-innovation (FRI) signals, and that their parameters are estimable from partial data. We consider both uniform and nonuniform sampling scenarios in the presence of noise and show that the data can be modeled as a sum of random amplitude-modulated complex exponentials. A low-pass filter is used to suppress noise and approximate the data as a sum of weighted complex exponentials. The annihilating filter used in FRI approaches is applied to estimate the sampling interval in closed form. We perform experiments on simulated and real data, and assess both objective and subjective performance in comparison with the state-of-the-art ellipse fitting methods. The proposed method produces ellipses with less bias, and the mean-squared error is lower by about 2 to 10 dB. We show applications of ellipse fitting to iris images, starting from partial edge contours, and to free-hand ellipses drawn on a touch-screen tablet.
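The annihilating filter mentioned above is the core FRI tool. The sketch below shows the basic, noiseless version on made-up parameters (it is not the paper's full pipeline): recover the exponents of K complex exponentials from uniform samples by finding the null vector of a small Toeplitz system and rooting it.

```python
import numpy as np

K = 2
u_true = np.exp(1j * 2 * np.pi * np.array([0.11, 0.27]))   # hypothetical normalized frequencies
a_true = np.array([1.0, 0.6])
n = np.arange(8)
x = (a_true[None, :] * u_true[None, :] ** n[:, None]).sum(axis=1)

# The annihilating filter h (length K+1) satisfies (h * x)[m] = 0 for all valid m.
N = len(x)
T = np.array([x[i + K - np.arange(K + 1)] for i in range(N - K)])   # rows: x[i+K], ..., x[i]
_, _, Vh = np.linalg.svd(T)
h = Vh[-1].conj()                    # null-space vector = annihilating filter taps

# The roots of h recover the exponentials u_k; frequencies follow from their angles.
u_est = np.roots(h)
print(np.sort(np.angle(u_est) / (2 * np.pi)))
print(np.sort(np.angle(u_true) / (2 * np.pi)))
```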
Abstract:
A two-stage sampling design is used to estimate the variances of the numbers of yellowfin in different age groups caught in the eastern Pacific Ocean. For purse seiners, the primary sampling unit (n) is a brine well containing fish from a month-area stratum; the fish lengths (m) measured from each well are the secondary units. The fish cannot be selected at random from the wells because of practical limitations. The effects of different sampling methods and other factors on the reliability and precision of statistics derived from the length-frequency data were therefore examined. Modifications are recommended where necessary. Lengths of fish measured during the unloading of six test wells revealed two forms of inherent size stratification: 1) short-term disruptions of the existing pattern of sizes, and 2) transition zones between long-term trends in sizes. To some degree, all wells exhibited cyclic changes in mean size and variance during unloading. In half of the wells, it was observed that size selection by the unloaders induced a change in mean size. As a result of stratification, the sequence of sizes removed from all wells was non-random, regardless of whether a well contained fish from a single set or from more than one set. The number of modal sizes in a well was not related to the number of sets. In an additional well composed of fish from several sets, an experiment on vertical mixing indicated that a representative sample of the contents may be restricted to the bottom half of the well. The contents of the test wells were used to generate 25 simulated wells and to compare the results of three sampling methods applied to them. The methods were: (1) random sampling (also used as a standard), (2) protracted sampling, in which the selection process was extended over a large portion of a well, and (3) measuring fish consecutively during removal from the well. Repeated sampling by each method and different combinations of n and m indicated that, because the principal source of size variation occurred among primary units, increasing n was the most effective way to reduce the variance estimates of both the age-group sizes and the total number of fish in the landings. Protracted sampling largely circumvented the effects of size stratification, and its performance was essentially comparable to that of random sampling. Sampling by this method is recommended. Consecutive-fish sampling produced more biased estimates with greater variances. Analysis of the 1988 length-frequency samples indicated that, for age groups that appear most frequently in the catch, a minimum sampling frequency of one primary unit in six for each month-area stratum would reduce the coefficients of variation (CV) of their size estimates to approximately 10 percent or less. Additional stratification of samples by set type, rather than month-area alone, further reduced the CV's of scarce age groups, such as the recruits, and potentially improved their accuracy. The CV's of recruitment estimates for completely-fished cohorts during the 1981-84 period were in the vicinity of 3 to 8 percent. Recruitment estimates and their variances were also relatively insensitive to changes in the individual quarterly catches and variances, respectively, of which they were composed.
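The conclusion that increasing the number of primary units n is the most effective way to reduce variance follows from the standard two-stage variance decomposition sketched below (finite-population corrections omitted; this is the textbook form, not necessarily the report's exact estimator):

\[
\widehat{\operatorname{Var}}(\bar{y})
\;\approx\; \frac{s_1^2}{n} \;+\; \frac{s_2^2}{n\,m},
\]

where \(s_1^2\) is the variance among primary-unit (well) means and \(s_2^2\) is the variance among fish within wells. When the between-well component \(s_1^2\) dominates, as found here, sampling more wells (larger n) reduces the variance far more than measuring additional fish per well (larger m).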
Abstract:
Light traps and channel nets are fixed-position devices that involve active and passive sampling, respectively, in the collection of settlement-stage larvae of coral-reef fishes. We compared the abundance, taxonomic composition, and size of such larvae caught by each device deployed simultaneously near two sites that differed substantially in current velocity. Light traps were more selective taxonomically, and the two sampling devices differed significantly in the abundance but not size of taxa caught. Most importantly, light traps and channel nets differed greatly in their catch efficiency between sites: light traps were ineffective in collecting larvae at the relatively high-current site, and channel nets were less efficient in collecting larvae at the low-current site. Use of only one of these sampling methods would clearly result in biased and inaccurate estimates of the spatial variation in larval abundance among locations that differ in current velocity. When selecting a larval sampling device, one must consider not only how well a particular taxon may be represented, but also the environmental conditions under which the device will be deployed.
Abstract:
Considerable attention has been focused on the properties of graphs derived from Internet measurements. Router-level topologies collected via traceroute studies have led some authors to conclude that the router graph of the Internet is a scale-free graph, or more generally a power-law random graph. In such a graph, the degree distribution of nodes follows a distribution with a power-law tail. In this paper we argue that the evidence to date for this conclusion is at best insufficient. We show that graphs appearing to have power-law degree distributions can arise surprisingly easily, when sampling graphs whose true degree distribution is not at all like a power-law. For example, given a classical Erdös-Rényi sparse, random graph, the subgraph formed by a collection of shortest paths from a small set of random sources to a larger set of random destinations can easily appear to show a degree distribution remarkably like a power-law. We explore the reasons for how this effect arises, and show that in such a setting, edges are sampled in a highly biased manner. This insight allows us to distinguish measurements taken from the Erdös-Rényi graphs from those taken from power-law random graphs. When we apply this distinction to a number of well-known datasets, we find that the evidence for sampling bias in these datasets is strong.
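The sampling effect described above is easy to reproduce. The sketch below (with illustrative parameters, not those of the paper) builds a sparse Erdös-Rényi graph, keeps only edges on shortest paths from a few "sources" to many "destinations" in the traceroute style, and compares degree histograms: the full graph's degrees are Poisson-like, while the sampled subgraph's distribution is far more skewed, with most nodes at degree 1 or 2 and a few much higher.

```python
from collections import Counter

import networkx as nx
import numpy as np

rng = np.random.default_rng(1)
G = nx.gnp_random_graph(10000, 3.0 / 10000, seed=1)          # sparse ER graph, mean degree ~3
G = G.subgraph(max(nx.connected_components(G), key=len)).copy()

nodes = list(G.nodes)
sources = rng.choice(nodes, size=5, replace=False)
dests = rng.choice(nodes, size=1000, replace=False)

# union of one shortest path per (source, destination) pair, as a traceroute study would see
sampled = nx.Graph()
for s in sources:
    paths = nx.single_source_shortest_path(G, s)
    for d in dests:
        if d in paths:
            nx.add_path(sampled, paths[d])

def degree_hist(g):
    return sorted(Counter(d for _, d in g.degree()).items())

print("ER graph degree histogram:        ", degree_hist(G)[:12])
print("sampled subgraph degree histogram:", degree_hist(sampled)[:12])
```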