920 results for Random imputation


Relevance: 20.00%

Abstract:

BACKGROUND A cost-effective strategy to increase the density of available markers within a population is to sequence a small proportion of the population and impute whole-genome sequence data for the remaining population. Increased densities of typed markers are advantageous for genome-wide association studies (GWAS) and genomic predictions. METHODS We obtained genotypes for 54 602 SNPs (single nucleotide polymorphisms) in 1077 Franches-Montagnes (FM) horses and Illumina paired-end whole-genome sequencing data for 30 FM horses and 14 Warmblood horses. After variant calling, the sequence-derived SNP genotypes (~13 million SNPs) were used for genotype imputation with the software programs Beagle, Impute2 and FImpute. RESULTS The mean imputation accuracy of FM horses using Impute2 was 92.0%. Imputation accuracy using Beagle and FImpute was 74.3% and 77.2%, respectively. In addition, for Impute2 we determined the imputation accuracy of all individual horses in the validation population, which ranged from 85.7% to 99.8%. The subsequent inclusion of Warmblood sequence data further increased the correlation between true and imputed genotypes for most horses, especially for horses with a high level of admixture. The final imputation accuracy of the horses ranged from 91.2% to 99.5%. CONCLUSIONS Using Impute2, the imputation accuracy was higher than 91% for all horses in the validation population, which indicates that direct imputation of 50k SNP-chip data to sequence-level genotypes is feasible in the FM population. The individual imputation accuracy depended mainly on the applied software and the level of admixture.
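
The accuracy values above are essentially the agreement between true and imputed genotypes in a validation set where dense genotypes are masked and re-imputed. A minimal sketch of computing per-individual accuracy as the correlation between true and imputed allele dosages (the arrays are hypothetical, not the study's data or pipeline):

```python
import numpy as np

# Hypothetical data: rows = individuals, columns = SNPs, values = allele dosages (0/1/2).
rng = np.random.default_rng(0)
true_genotypes = rng.integers(0, 3, size=(5, 1000)).astype(float)
# Pretend these came back from an imputation run (here: true values plus random error).
imputed_dosages = true_genotypes + rng.normal(0, 0.3, size=true_genotypes.shape)

def per_individual_accuracy(true_g, imputed_g):
    """Correlation between true and imputed dosages, one value per individual."""
    return np.array([np.corrcoef(t, i)[0, 1] for t, i in zip(true_g, imputed_g)])

print(per_individual_accuracy(true_genotypes, imputed_dosages))
```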

Relevance: 20.00%

Abstract:

We study existence of random elements with partially specified distributions. The technique relies on the existence of a positive extension for linear functionals, accompanied by additional conditions that ensure the regularity of the extension needed for interpreting it as a probability measure. It is shown in which cases the extension can be chosen to possess some invariance properties. The results are applied to the existence of point processes with given correlation measure and random closed sets with given two-point covering function or contact distribution function. It is shown that the regularity condition can be efficiently checked in many cases in order to ensure that the obtained point processes are indeed locally finite and random sets have closed realisations.

Relevance: 20.00%

Abstract:

We study pathwise invariances and degeneracies of random fields, with motivating applications in Gaussian process modelling. The key idea is that a number of structural properties one may wish to impose a priori on functions boil down to degeneracy properties under well-chosen linear operators. We first show, in a second-order set-up, that almost sure degeneracy of random field paths under some class of linear operators defined in terms of signed measures can be controlled through the first two moments. A special focus is then put on the Gaussian case, where these results are revisited and extended to further linear operators thanks to state-of-the-art representations. Several degeneracy properties are tackled, including random fields with symmetric paths, centred paths, harmonic paths, or sparse paths. The proposed approach delivers a number of promising results and perspectives in Gaussian process modelling. In a first numerical experiment, it is shown that dedicated kernels can be used to infer an axis of symmetry. Our second numerical experiment deals with conditional simulations of a solution to the heat equation, and it is found that adapted kernels notably enable improved predictions of non-linear functionals of the field such as its maximum.
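
One concrete instance of the "dedicated kernels" idea is that averaging a base kernel over a reflection yields Gaussian process sample paths that are symmetric about the corresponding axis. A minimal one-dimensional sketch, where the squared-exponential base kernel, length-scale and symmetry axis are illustrative assumptions rather than the paper's choices:

```python
import numpy as np

def rbf(x, y, lengthscale=0.5):
    """Base squared-exponential kernel on 1-D inputs."""
    return np.exp(-((x[:, None] - y[None, :]) ** 2) / (2 * lengthscale**2))

def symmetric_kernel(x, y, axis=0.0, lengthscale=0.5):
    """Kernel whose GP sample paths satisfy f(x) = f(2*axis - x)."""
    sx, sy = 2 * axis - x, 2 * axis - y
    return 0.25 * (rbf(x, y, lengthscale) + rbf(sx, y, lengthscale)
                   + rbf(x, sy, lengthscale) + rbf(sx, sy, lengthscale))

# Draw one sample path and check the symmetry numerically.
grid = np.linspace(-1, 1, 201)
K = symmetric_kernel(grid, grid) + 1e-8 * np.eye(grid.size)   # small jitter for Cholesky
path = np.linalg.cholesky(K) @ np.random.default_rng(1).standard_normal(grid.size)
print(np.max(np.abs(path - path[::-1])))   # small (up to the jitter): path is symmetric about 0
```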

Relevance: 20.00%

Abstract:

Theory on plant succession predicts a temporal increase in the complexity of spatial community structure and of competitive interactions: initially random occurrences of early colonising species shift towards spatially and competitively structured plant associations in later successional stages. Here we use long-term data on early plant succession in a German post-mining area to disentangle the importance of random colonisation, habitat filtering, and competition on the temporal and spatial development of plant community structure. We used species co-occurrence analysis and a recently developed method for assessing competitive strength and hierarchies (transitive versus intransitive competitive orders) in multispecies communities. We found that species turnover decreased through time within interaction neighbourhoods, but increased through time outside interaction neighbourhoods. Successional change did not lead to modular community structure. After accounting for species richness effects, the strength of competitive interactions and the proportion of transitive competitive hierarchies increased through time. Although effects of habitat filtering were weak, random colonisation and subsequent competitive interactions had strong effects on community structure. Because competitive strength and transitivity were poorly correlated with soil characteristics, there was little evidence for context-dependent competitive strength associated with intransitive competitive hierarchies.
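
The species co-occurrence analysis referred to above is typically carried out by comparing an observed pairwise co-occurrence statistic against a randomization null model. A minimal sketch using the C-score of Stone and Roberts (1990) with a simple fixed-row-totals shuffle; the data, metric and null model here are illustrative and may differ from the study's actual procedure:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical presence-absence matrix: rows = species, columns = plots.
matrix = (rng.random((12, 30)) < 0.3).astype(int)

def c_score(m):
    """Mean number of 'checkerboard units' over all species pairs."""
    shared = m @ m.T                      # plots shared by each species pair
    totals = m.sum(axis=1)                # occurrences per species
    n = m.shape[0]
    cu = [(totals[i] - shared[i, j]) * (totals[j] - shared[i, j])
          for i in range(n) for j in range(i + 1, n)]
    return np.mean(cu)

def null_c_scores(m, n_iter=999):
    """Null distribution: shuffle each species' occurrences among plots (fixed row totals)."""
    return np.array([c_score(np.array([rng.permutation(row) for row in m]))
                     for _ in range(n_iter)])

observed = c_score(matrix)
null = null_c_scores(matrix)
p_value = (np.sum(null >= observed) + 1) / (null.size + 1)
print(f"observed C-score {observed:.2f}, one-sided p = {p_value:.3f}")
```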

Relevance: 20.00%

Abstract:

The purpose of this study is to investigate the effects of predictor variable correlations and patterns of missingness with dichotomous and/or continuous data in small samples when missing data are multiply imputed. Missing data on predictor variables are multiply imputed under three different multivariate models: the multivariate normal model for continuous data, the multinomial model for dichotomous data, and the general location model for mixed dichotomous and continuous data. Subsequent to the multiple imputation process, Type I error rates of the regression coefficients obtained with logistic regression analysis are estimated under various conditions of correlation structure, sample size, type of data and patterns of missing data. The distributional properties of average mean, variance and correlations among the predictor variables are assessed after the multiple imputation process.

For continuous predictor data under the multivariate normal model, Type I error rates are generally within the nominal values with samples of size n = 100. Smaller samples of size n = 50 resulted in more conservative estimates (i.e., lower than the nominal value). Correlation and variance estimates of the original data are retained after multiple imputation with less than 50% missing continuous predictor data. For dichotomous predictor data under the multinomial model, Type I error rates are generally conservative, which in part is due to the sparseness of the data. The correlation structure for the predictor variables is not well retained on multiply-imputed data from small samples with more than 50% missing data with this model. For mixed continuous and dichotomous predictor data, the results are similar to those found under the multivariate normal model for continuous data and under the multinomial model for dichotomous data. With all data types, a fully-observed variable included with variables subject to missingness in the multiple imputation process and subsequent statistical analysis provided liberal (larger than nominal values) Type I error rates under a specific pattern of missing data. It is suggested that future studies focus on the effects of multiple imputation in multivariate settings with more realistic data characteristics and a variety of multivariate analyses, assessing both Type I error and power.
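
As background for the Type I error evaluation described above, the workflow is: impose missingness on a predictor, impute it repeatedly under a normal model, fit the logistic regression on each completed data set, and pool the results with Rubin's rules. A simplified sketch of that loop (the data are simulated, the imputer is a basic conditional-normal draw rather than a full multivariate normal or general location model, and this is not the dissertation's simulation code):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n, m_imputations = 100, 20

# Hypothetical data: two correlated continuous predictors, null effect on a binary outcome.
x1 = rng.standard_normal(n)
x2 = 0.5 * x1 + np.sqrt(1 - 0.25) * rng.standard_normal(n)
y = rng.binomial(1, 0.5, size=n)              # outcome independent of predictors (Type I error setting)
x2_miss = x2.copy()
x2_miss[rng.random(n) < 0.3] = np.nan          # 30% missing completely at random
obs = ~np.isnan(x2_miss)

# Simplified normal-model imputation: regress x2 on x1 among observed cases,
# then draw missing values from the estimated conditional normal.
beta, intercept = np.polyfit(x1[obs], x2_miss[obs], 1)
resid_sd = np.std(x2_miss[obs] - (intercept + beta * x1[obs]), ddof=2)

estimates, variances = [], []
for _ in range(m_imputations):
    x2_imp = x2_miss.copy()
    x2_imp[~obs] = intercept + beta * x1[~obs] + rng.normal(0, resid_sd, size=(~obs).sum())
    X = sm.add_constant(np.column_stack([x1, x2_imp]))
    fit = sm.Logit(y, X).fit(disp=0)
    estimates.append(fit.params[2])            # coefficient of the imputed predictor
    variances.append(fit.bse[2] ** 2)

# Rubin's rules: pooled estimate and total variance.
q_bar = np.mean(estimates)
total_var = np.mean(variances) + (1 + 1 / m_imputations) * np.var(estimates, ddof=1)
print(f"pooled coefficient {q_bar:.3f}, pooled SE {np.sqrt(total_var):.3f}")
```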

Relevance: 20.00%

Abstract:

This paper examines how preference correlation and intercorrelation combine to influence the length of a decentralized matching market's path to stability. In simulated experiments, marriage markets with various preference specifications begin at an arbitrary matching of couples and proceed toward stability via the random mechanism proposed by Roth and Vande Vate (1990). The results of these experiments reveal that fundamental preference characteristics are critical in predicting how long the market will take to reach a stable matching. In particular, intercorrelation and correlation are shown to have an exponential impact on the number of blocking pairs that must be randomly satisfied before stability is attained. The magnitude of the impact is dramatically different, however, depending on whether preferences are positively or negatively intercorrelated.
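
The random mechanism of Roth and Vande Vate (1990) repeatedly picks a blocking pair at random and "satisfies" it (the pair match with each other and their former partners become single) until no blocking pair remains; the number of such steps is the path length studied above. A small self-contained sketch with uniformly random (uncorrelated) preferences; the market size and initial matching are illustrative assumptions:

```python
import random

def blocking_pairs(partner, men_prefs, women_prefs):
    """All (man, woman) pairs who both prefer each other to their current partners."""
    pairs = []
    for m, prefs in men_prefs.items():
        for w in prefs:
            if partner[m] == w:
                break                               # m ranks every remaining woman below his partner
            rank = women_prefs[w].index
            if partner[w] is None or rank(m) < rank(partner[w]):
                pairs.append((m, w))
    return pairs

def random_path_to_stability(men_prefs, women_prefs, rng):
    """Satisfy uniformly chosen blocking pairs until no blocking pair remains."""
    men, women = list(men_prefs), list(women_prefs)
    partner = {a: None for a in men + women}
    for m, w in zip(men, rng.sample(women, len(women))):   # arbitrary initial matching of couples
        partner[m], partner[w] = w, m
    steps = 0
    while True:
        pairs = blocking_pairs(partner, men_prefs, women_prefs)
        if not pairs:
            return partner, steps
        m, w = rng.choice(pairs)
        for old in (partner[m], partner[w]):                # former partners become single
            if old is not None:
                partner[old] = None
        partner[m], partner[w] = w, m                       # satisfy the blocking pair
        steps += 1

rng = random.Random(4)
n = 6
men = [f"m{i}" for i in range(n)]
women = [f"w{i}" for i in range(n)]
men_prefs = {m: rng.sample(women, n) for m in men}          # uncorrelated random preferences
women_prefs = {w: rng.sample(men, n) for w in women}
matching, steps = random_path_to_stability(men_prefs, women_prefs, rng)
print(f"stable matching reached after satisfying {steps} blocking pairs")
```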

Relevance: 20.00%

Abstract:

We present a framework for fitting multiple random walks to animal movement paths consisting of ordered sets of step lengths and turning angles. Each step and turn is assigned to one of a number of random walks, each characteristic of a different behavioral state. Behavioral state assignments may be inferred purely from movement data or may include the habitat type in which the animals are located. Switching between different behavioral states may be modeled explicitly using a state transition matrix estimated directly from data, or switching probabilities may take into account the proximity of animals to landscape features. Model fitting is undertaken within a Bayesian framework using the WinBUGS software. These methods allow for identification of different movement states using several properties of observed paths and lead naturally to the formulation of movement models. Analysis of relocation data from elk released in east-central Ontario, Canada, suggests a biphasic movement behavior: elk are either in an "encamped" state in which step lengths are small and turning angles are high, or in an "exploratory" state, in which daily step lengths are several kilometers and turning angles are small. Animals encamp in open habitat (agricultural fields and opened forest), but the exploratory state is not associated with any particular habitat type.
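
A compact way to see the biphasic pattern described above is to simulate a two-state random walk in which each behavioral state has its own step-length and turning-angle distributions and states switch according to a transition matrix. The sketch below uses exponential step lengths and von Mises turning angles with made-up parameter values; it is a forward simulation for intuition, not the Bayesian model fitting done in WinBUGS:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical two-state movement model (parameter values are illustrative, not the elk estimates):
# state 0 = "encamped":    short steps, poorly directed turns
# state 1 = "exploratory": long steps, nearly straight-line movement
step_scale = [0.05, 3.0]             # mean step length in km (exponential distribution)
turn_kappa = [0.5, 8.0]              # von Mises concentration: low = wide turns, high = small turns
transition = np.array([[0.9, 0.1],   # P(next state | current state)
                       [0.2, 0.8]])

def simulate_path(n_steps=500):
    state, heading, position = 0, 0.0, np.zeros(2)
    states, positions = [], [position.copy()]
    for _ in range(n_steps):
        state = rng.choice(2, p=transition[state])
        heading += rng.vonmises(0.0, turn_kappa[state])
        step = rng.exponential(step_scale[state])
        position = position + step * np.array([np.cos(heading), np.sin(heading)])
        states.append(state)
        positions.append(position.copy())
    return np.array(states), np.array(positions)

states, positions = simulate_path()
steps = np.linalg.norm(np.diff(positions, axis=0), axis=1)
for s, label in enumerate(["encamped", "exploratory"]):
    print(f"{label}: mean step {steps[states == s].mean():.2f} km")
```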

Relevance: 20.00%

Abstract:

Objective. To determine the accuracy of the urine protein:creatinine ratio (pr:cr) in predicting 300 mg of protein in a 24-hour urine collection in pregnant patients with suspected preeclampsia.

Methods. A systematic review was performed. Articles were identified through electronic databases, and relevant citations were found by hand-searching textbooks and review articles. Included studies evaluated patients for suspected preeclampsia with a 24-hour urine sample and a pr:cr. Only English-language articles were included. Studies of patients with chronic illness such as chronic hypertension, diabetes mellitus or renal impairment were excluded from the review. Two researchers extracted accuracy data for pr:cr relative to a gold standard of 300 mg of protein in a 24-hour sample, as well as population and study characteristics. The data were analyzed and summarized in tabular and graphical form.

Results. Sixteen studies were identified and only three studies met our inclusion criteria, with 510 total patients. The studies evaluated different cut-points for positivity of pr:cr, from 130 mg/g to 700 mg/g. Sensitivities and specificities for a pr:cr of 130-150 mg/g were 90-93% and 33-65%, respectively; for a pr:cr of 300 mg/g, 81-95% and 52-80%, respectively; and for a pr:cr of 600-700 mg/g, 85-87% and 96-97%, respectively.

Conclusion. The value of a random pr:cr to exclude pre-eclampsia is limited because even low levels of pr:cr (130-150 mg/g) may miss up to 10% of patients with significant proteinuria. A pr:cr of more than 600 mg/g may obviate a 24-hour collection.
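
The conclusion about low cut-points follows directly from the reported sensitivities: the fraction of truly proteinuric patients missed is one minus sensitivity. A small worked example using the lower bound of each sensitivity range quoted above:

```python
def miss_rate(sensitivity):
    """Fraction of patients with significant proteinuria whose pr:cr falls below the cut-point."""
    return 1.0 - sensitivity

# Lower bounds of the reported sensitivity ranges for each pr:cr cut-point.
for cut_point, sens in [("130-150 mg/g", 0.90), ("300 mg/g", 0.81), ("600-700 mg/g", 0.85)]:
    print(f"pr:cr cut-point {cut_point}: misses about {miss_rate(sens):.0%} of proteinuric patients")
```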

Relevance: 20.00%

Abstract:

Random Forests™ is reported to be one of the most accurate classification algorithms for complex data analysis. It shows excellent performance even when most predictors are noisy and the number of variables is much larger than the number of observations. In this thesis, Random Forests was applied to a large-scale lung cancer case-control study. A novel way of automatically selecting prognostic factors was proposed, and a synthetic positive control was used to validate the Random Forests method. Throughout this study we showed that Random Forests can deal with a large number of weak input variables without overfitting and can account for non-additive interactions between these input variables. Random Forests can also be used for variable selection without being adversely affected by collinearities.

Random Forests can deal with large-scale data sets without rigorous data preprocessing and has a robust variable-importance ranking measure. A novel variable selection method is proposed in the context of Random Forests that uses the data noise level as the cut-off value to determine the subset of important predictors. This new approach enhanced the ability of the Random Forests algorithm to automatically identify important predictors in complex data. The cut-off value can also be adjusted based on the results of the synthetic positive control experiments.

When the data set had a high variables-to-observations ratio, Random Forests complemented the established logistic regression. This study suggested that Random Forests is recommended for such high-dimensional data: one can use Random Forests to select the important variables and then use logistic regression, or Random Forests itself, to estimate the effect size of the predictors and to classify new observations.

We also found that the mean decrease in accuracy is a more reliable variable ranking measure than the mean decrease in Gini.
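
A plausible reading of the noise-level cut-off idea: add a control variable that is known to carry no signal (for example, a permuted copy of a real predictor), rank variables by a permutation-based (mean-decrease-in-accuracy style) importance, and keep those ranked above the control. The sketch below uses scikit-learn on simulated data; the specific control construction and cut-off rule are assumptions for illustration, not the thesis's exact procedure:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(6)
n, p_signal, p_noise = 300, 3, 47

# Hypothetical case-control data: a few informative predictors buried among noise variables.
X_signal = rng.standard_normal((n, p_signal))
X_noise = rng.standard_normal((n, p_noise))
y = (X_signal.sum(axis=1) + rng.standard_normal(n) > 0).astype(int)

# Synthetic control: a permuted copy of an informative predictor carries no signal by
# construction, so its importance estimates the "data noise level" used as a cut-off.
control = rng.permutation(X_signal[:, 0])
X = np.column_stack([X_signal, X_noise, control])

forest = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0).fit(X, y)
imp = permutation_importance(forest, X, y, n_repeats=20, random_state=0).importances_mean

cutoff = imp[-1]                        # importance of the synthetic control variable
selected = np.where(imp[:-1] > cutoff)[0]
print(f"OOB accuracy: {forest.oob_score_:.2f}")
print(f"variables ranked above the noise-level cut-off: {selected}")
```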

Relevance: 20.00%

Abstract:

Gastroesophageal reflux disease is a common condition affecting 25 to 40% of the population and causes significant morbidity in the U.S., accounting for at least 9 million office visits to physicians with estimated annual costs of $10 billion. Previous research has not clearly established whether infection with Helicobacter pylori, a known cause of peptic ulcer, atrophic gastritis and non-cardia adenocarcinoma of the stomach, is associated with gastroesophageal reflux disease. This study is a secondary analysis of data collected in a cross-sectional study of a random sample of adult residents of Ciudad Juarez, Mexico, that was conducted in 2004 (Prevalence and Determinants of Chronic Atrophic Gastritis Study or CAG study, Dr. Victor M. Cardenas, Principal Investigator). In this study, the presence of gastroesophageal reflux disease was based on responses to the previously validated Spanish Language Dyspepsia Questionnaire. Responses to this questionnaire indicating the presence of gastroesophageal reflux symptoms and disease were compared with the presence of H. pylori infection as measured by culture, histology and rapid urease test, and with findings of upper endoscopy (i.e., hiatus hernia and erosive and atrophic esophagitis). The prevalence ratio was calculated using bivariate, stratified and multivariate negative binomial logistic regression analyses in order to assess the relation between active H. pylori infection and the prevalence of gastroesophageal reflux typical syndrome and disease, while controlling for known risk factors of gastroesophageal reflux disease such as obesity. In a random sample of 174 adults, 48 (27.6%) of the study participants had typical reflux syndrome and only 5% (9/174) had gastroesophageal reflux disease per se according to the Montreal consensus, which defines reflux syndromes and disease based on whether the symptoms are perceived as troublesome by the subject. There was no association between H. pylori infection and typical reflux syndrome or gastroesophageal reflux disease. However, we found that in this Northern Mexican population there was a moderate association (prevalence ratio = 2.5; 95% CI = 1.3, 4.7) between obesity (≥ 30 kg/m²) and typical reflux syndrome. Management and prevention of obesity will significantly curb the growing numbers of persons affected by gastroesophageal reflux symptoms and disease in Northern Mexico.
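
For reference, a common way to estimate a prevalence ratio for a binary outcome in cross-sectional data is a log-link GLM; the sketch below uses Poisson regression with robust standard errors on simulated data as a generic stand-in, since the abstract's exact modelling choice is not reproduced here:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 174

# Hypothetical cross-sectional data mirroring the setting above (not the Ciudad Juarez data).
obese = rng.binomial(1, 0.35, size=n)
reflux_prob = np.where(obese == 1, 0.45, 0.20)       # built-in prevalence ratio of ~2.25
reflux = rng.binomial(1, reflux_prob)

# Poisson regression with robust (sandwich) standard errors is a standard generic approach
# for estimating a prevalence ratio with a binary outcome.
X = sm.add_constant(obese)
fit = sm.GLM(reflux, X, family=sm.families.Poisson()).fit(cov_type="HC1")
pr = np.exp(fit.params[1])
ci = np.exp(fit.conf_int()[1])
print(f"prevalence ratio {pr:.2f} (95% CI {ci[0]:.2f}, {ci[1]:.2f})")
```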

Relevance: 20.00%

Abstract:

Logistic regression is one of the most important tools in the analysis of epidemiological and clinical data. Such data often contain missing values for one or more variables. Common practice is to eliminate all individuals for whom any information is missing. This deletion approach does not make efficient use of available information and often introduces bias.

Two methods were developed to estimate logistic regression coefficients for mixed dichotomous and continuous covariates, including partially observed binary covariates. The data were assumed missing at random (MAR). One method (PD) used the predictive distribution as a weight to calculate the average of the logistic regressions performed over all possible values of the missing observations, and the second method (RS) used a variant of a resampling technique. Seven additional methods were compared with these two approaches in a simulation study: (1) analysis based on only the complete cases; (2) substituting the mean of the observed values for the missing value; (3) an imputation technique based on the proportions of observed data; (4) regressing the partially observed covariates on the remaining continuous covariates; (5) regressing the partially observed covariates on the remaining continuous covariates conditional on the response variable; (6) regressing the partially observed covariates on the remaining continuous covariates and the response variable; and (7) the EM algorithm. Both proposed methods showed smaller standard errors (s.e.) for the coefficient involving the partially observed covariate and for the other coefficients as well. However, both methods, especially PD, are computationally demanding; thus, for analysis of large data sets with partially observed covariates, further refinement of these approaches is needed.
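
A simplified sketch in the spirit of the PD method: estimate the predictive distribution of the partially observed binary covariate from the complete cases, expand each incomplete record into its two possible covariate values weighted by those predictive probabilities, and fit a weighted logistic regression. The data are simulated, the expansion-plus-weights shortcut is an approximation of averaging over all possible values, and this single weighted fit does not reproduce the method's properly propagated standard errors:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(8)
n = 200

# Hypothetical data: one continuous covariate z and one binary covariate x that is partly missing.
z = rng.standard_normal(n)
x = rng.binomial(1, 1 / (1 + np.exp(-z)))                     # x depends on z
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * z + 0.8 * x))))   # outcome depends on both
x_part = x.astype(float)
x_part[rng.random(n) < 0.3] = np.nan                          # ~30% of x missing at random
obs = ~np.isnan(x_part)

# Step 1: predictive distribution P(x = 1 | z), estimated from the complete cases.
pd_model = LogisticRegression().fit(z[obs].reshape(-1, 1), x_part[obs].astype(int))
p_x1 = pd_model.predict_proba(z[~obs].reshape(-1, 1))[:, 1]

# Step 2: expand each incomplete record into its two possible x values, weighted by the
# predictive probabilities, and fit a weighted logistic regression for y.
n_miss = (~obs).sum()
z_full = np.concatenate([z[obs], z[~obs], z[~obs]])
x_full = np.concatenate([x_part[obs], np.ones(n_miss), np.zeros(n_miss)])
y_full = np.concatenate([y[obs], y[~obs], y[~obs]])
w_full = np.concatenate([np.ones(obs.sum()), p_x1, 1 - p_x1])

# Note: scikit-learn applies mild L2 regularization by default; this is only a sketch.
outcome_model = LogisticRegression().fit(np.column_stack([z_full, x_full]), y_full,
                                         sample_weight=w_full)
print("coefficients for z and x:", outcome_model.coef_[0])
```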

Relevance: 20.00%

Abstract:

Maximizing data quality may be especially difficult in trauma-related clinical research. Strategies are needed to improve data quality and to assess the impact of data quality on clinical predictive models. This study had two objectives. The first was to compare missing data between two multi-center trauma transfusion studies: a retrospective study (RS) using medical chart data with minimal data quality review, and the PRospective Observational Multi-center Major Trauma Transfusion (PROMMTT) study with standardized quality assurance. The second objective was to assess the impact of missing data on clinical prediction algorithms by evaluating blood transfusion prediction models using PROMMTT data.

RS (2005-06) and PROMMTT (2009-10) investigated trauma patients receiving ≥ 1 unit of red blood cells (RBC) from ten Level I trauma centers. Missing data were compared for 33 variables collected in both studies using mixed-effects logistic regression (including random intercepts for study site). Massive transfusion (MT) patients received ≥ 10 RBC units within 24 h of admission. Correct classification percentages for three MT prediction models were evaluated using complete case analysis and multiple imputation based on the multivariate normal distribution. A sensitivity analysis for missing data was conducted to estimate the upper and lower bounds of correct classification using assumptions about the missing data under best- and worst-case scenarios.

Most variables (17/33 = 52%) had <1% missing data in RS and PROMMTT. Of the remaining variables, 50% demonstrated less missingness in PROMMTT, 25% had less missingness in RS, and 25% were similar between studies. Missing percentages for MT prediction variables in PROMMTT ranged from 2.2% (heart rate) to 45% (respiratory rate). For variables missing >1%, study site was associated with missingness (all p ≤ 0.021). Survival time predicted missingness for 50% of RS and 60% of PROMMTT variables. Complete-case proportions for the MT models ranged from 41% to 88%. Complete case analysis and multiple imputation demonstrated similar correct classification results. Sensitivity analysis upper-lower bound ranges for the three MT models were 59-63%, 36-46%, and 46-58%.

Prospective collection of ten-fold more variables with data quality assurance reduced overall missing data. Study site and patient survival were associated with missingness, suggesting that data were not missing completely at random and that complete case analysis may lead to biased results. Evaluating clinical prediction model accuracy may be misleading in the presence of missing data, especially with many predictor variables. The proposed sensitivity analysis, which estimates correct classification under upper (best-case scenario) and lower (worst-case scenario) bounds, may be more informative than multiple imputation, which provided results similar to complete case analysis.
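
The sensitivity analysis described above can be illustrated with a small calculation: patients who cannot be scored because of missing predictor data are assumed to be classified correctly in the best case and incorrectly in the worst case, which brackets the correct classification percentage. A sketch on simulated data (the accuracy and missingness rates are made up):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 500

# Hypothetical validation data for a massive-transfusion prediction rule.
truth = rng.binomial(1, 0.25, size=n)                          # 1 = massive transfusion
prediction = np.where(rng.random(n) < 0.8, truth, 1 - truth)   # rule is right ~80% of the time
has_complete_data = rng.random(n) > 0.3                        # ~30% missing a needed predictor

complete_correct = (prediction == truth) & has_complete_data

# Complete-case accuracy uses only patients the rule can actually score.
complete_case_accuracy = complete_correct.sum() / has_complete_data.sum()

# Sensitivity-analysis bounds: assume every unscorable patient would have been classified
# correctly (best case) or incorrectly (worst case).
n_missing = (~has_complete_data).sum()
upper = (complete_correct.sum() + n_missing) / n
lower = complete_correct.sum() / n

print(f"complete-case correct classification: {complete_case_accuracy:.0%}")
print(f"sensitivity-analysis bounds: {lower:.0%} (worst case) to {upper:.0%} (best case)")
```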