9 results for "Missing values"

in the Biblioteca Digital da Produção Intelectual da Universidade de São Paulo (BDPI/USP)


Relevance:

60.00%

Publisher:

Abstract:

The substitution of missing values, also called imputation, is an important data preparation task for many domains. Ideally, the substitution of missing values should not insert biases into the dataset. This aspect has usually been assessed by measures of the prediction capability of imputation methods. Such measures assume the simulation of missing entries for some attributes whose values are actually known. These artificially missing values are imputed and then compared with the original values. Although this evaluation is useful, it does not allow the influence of imputed values on the ultimate modelling task (e.g. classification) to be inferred. We argue that imputation cannot be properly evaluated apart from the modelling task. Thus, alternative approaches are needed. This article elaborates on the influence of imputed values in classification. In particular, a practical procedure for estimating the inserted bias is described. As an additional contribution, we have used such a procedure to empirically illustrate the performance of three imputation methods (majority, naive Bayes and Bayesian networks) on three datasets. Three classifiers (decision tree, naive Bayes and nearest neighbours) have been used as modelling tools in our experiments. The results illustrate a variety of situations that can arise in data preparation practice.
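As an illustration of the two evaluation views the abstract contrasts, the minimal sketch below artificially deletes known values of one attribute, imputes them, and then compares both the imputed values against the originals and the downstream classifier's accuracy with and without imputation. The iris dataset, scikit-learn utilities and the 20% missingness rate are stand-in choices; this is not the authors' bias-estimation procedure.

```python
# Minimal sketch of evaluating imputation through a downstream classifier,
# not the authors' exact bias-estimation procedure. Uses scikit-learn.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)

# Artificially delete 20% of the entries of one attribute whose values are known.
X_miss = X.copy()
mask = rng.random(len(X)) < 0.2
X_miss[mask, 0] = np.nan

# "Majority" imputation: replace missing entries with the most frequent value.
imputer = SimpleImputer(strategy="most_frequent")
X_imp = imputer.fit_transform(X_miss)

# Prediction-capability view: how close are imputed values to the originals?
print("mean absolute imputation error:", np.abs(X_imp[mask, 0] - X[mask, 0]).mean())

# Modelling-task view: does the classifier degrade when trained on imputed data?
clf = DecisionTreeClassifier(random_state=0)
print("accuracy, original data:", cross_val_score(clf, X, y, cv=5).mean())
print("accuracy, imputed data:", cross_val_score(clf, X_imp, y, cv=5).mean())
```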

Relevance:

30.00%

Publisher:

Abstract:

When missing data occur in studies designed to compare the accuracy of diagnostic tests, a common, though naive, practice is to base the comparison of sensitivity, specificity, as well as of positive and negative predictive values on some subset of the data that fits into methods implemented in standard statistical packages. Such methods are usually valid only under the strong missing completely at random (MCAR) assumption and may generate biased and less precise estimates. We review some models that use the dependence structure of the completely observed cases to incorporate the information of the partially categorized observations into the analysis and show how they may be fitted via a two-stage hybrid process involving maximum likelihood in the first stage and weighted least squares in the second. We indicate how computational subroutines written in R may be used to fit the proposed models and illustrate the different analysis strategies with observational data collected to compare the accuracy of three distinct non-invasive diagnostic methods for endometriosis. The results indicate that even when the MCAR assumption is plausible, the naive partial analyses should be avoided.
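The naive complete-case estimates the abstract argues against can be written in a few lines, as the sketch below shows on hypothetical counts; the two-stage maximum-likelihood plus weighted-least-squares fit, and the R subroutines the authors mention, are not reproduced here.

```python
# Toy illustration of the naive complete-case estimates the abstract warns about;
# the hypothetical data below are for illustration only.
import numpy as np

# Disease status D (1/0) and test result T (1/0), with T missing (np.nan)
# for some verified subjects.
D = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
T = np.array([1, 1, 0, np.nan, 0, 0, 1, np.nan, 1, 0])

complete = ~np.isnan(T)          # naive step: keep fully observed cases only
d, t = D[complete], T[complete]

sens = np.mean(t[d == 1] == 1)   # P(T=1 | D=1) among complete cases
spec = np.mean(t[d == 0] == 0)   # P(T=0 | D=0) among complete cases
ppv  = np.mean(d[t == 1] == 1)   # P(D=1 | T=1)
npv  = np.mean(d[t == 0] == 0)   # P(D=0 | T=0)
print(sens, spec, ppv, npv)      # valid only under MCAR-type assumptions
```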

Relevance:

20.00%

Publisher:

Abstract:

Objective: The combination of two anthropometric parameters is more appropriate for assessing body composition and proportions in children, with special attention to the Body Mass Index (BMI), as it relates weight and length. However, BMI values for the neonatal period have not yet been determined. This study presents the BMI for newborns at different gestational ages, represented as normal smoothed percentile curves. Methods: Retrospective study including 2,406 appropriate-for-gestational-age newborns according to the Alexander et al. (1996) curve, from 29 to 42 weeks of gestational age. Weight and length were measured following standard procedures. For the construction of the normal smoothed percentile curves, the 3rd, 5th, 10th, 25th, 50th, 75th, 90th and 95th percentiles were determined, and a statistical procedure based on a "sinusoidal fit" mathematical model was applied to establish a curve that estimates biological growth parameters. Results: The Body Mass Index values for gestational age in all percentiles show a steady increase up to 38 weeks, level off up to the 40th week, and then decrease slightly to the 42nd week in both genders. Conclusion: The results show a direct correlation between gestational age and Body Mass Index for both genders in the nine percentiles, and can provide a useful reference for assessing intrauterine proportional growth.
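A rough sketch of the percentile-construction step, on simulated records rather than the study's 2,406 newborns; the generating relations are hypothetical and the sinusoidal-fit smoothing is not reproduced.

```python
# Sketch of computing BMI percentiles by gestational age on hypothetical data;
# the authors' "sinusoidal fit" smoothing step is not reproduced here.
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical newborn records: gestational age (weeks), weight (kg), length (m).
ga = rng.integers(29, 43, size=2000)
weight = 0.09 * ga + rng.normal(0, 0.3, size=2000)
length = 0.012 * ga + rng.normal(0, 0.01, size=2000)

bmi = weight / length**2                      # BMI = weight / length^2
percentiles = [3, 5, 10, 25, 50, 75, 90, 95]

for week in range(29, 43):
    values = bmi[ga == week]
    print(week, np.percentile(values, percentiles).round(2))
```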

Relevance:

20.00%

Publisher:

Abstract:

Purpose: Peak expiratory flow (PEF) was measured in healthy children aged five to ten years in order to provide baseline values and to determine correlations between PEF and factors such as gender, age and type of school. Methods: After approval by the Research Ethics Committee of the School of Medicine of ABC (FMABC), PEF and height were measured in 1,942 children between five and ten years old from nine public schools and nine private schools throughout the city of São Bernardo do Campo. PEF was measured using the Mini-Wright Peak Flow Meter (Clement Clarke International Ltd.) and height was measured using a Sanny professional stadiometer. Results: Significant differences were found in PEF values: higher values were seen in older students than in younger students, in males than in females, and in students from private schools than in students from public schools, with average values ranging from 206 L/min to 248 L/min. Linear correlations were seen between PEF and both height and age (Spearman coefficient). Conclusions: Differences in PEF were seen between genders and between types of school, and a linear correlation was seen between PEF and both age and height in healthy children five to ten years old.
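A minimal sketch of the reported correlation analysis on simulated values, using SciPy's Spearman coefficient as named in the abstract; the generating relations below are hypothetical, not the study's data.

```python
# Spearman correlation of PEF with age and height on simulated children.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(2)
age = rng.uniform(5, 10, size=200)                    # years
height = 70 + 8 * age + rng.normal(0, 5, size=200)    # cm, hypothetical relation
pef = 40 * age + rng.normal(0, 20, size=200)          # L/min, hypothetical relation

print("PEF vs age:   ", spearmanr(pef, age))
print("PEF vs height:", spearmanr(pef, height))
```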

Relevance:

20.00%

Publisher:

Abstract:

Data obtained during routine diagnosis of human T-cell lymphotropic virus type 1 (HTLV-1) and 2 (HTLV-2) in "at-risk" individuals from São Paulo, Brazil, using signal-to-cutoff (S/C) values obtained with first, second, and third generation enzyme immunoassay (EIA) kits, were compared. The highest S/C values were obtained with third generation EIA kits, but no correlation was detected between these values and specific antibody reactivity to HTLV-1, HTLV-2, or untyped HTLV (p = 0.302). In addition, use of these third generation kits resulted in HTLV-1/2 false-positive samples. In contrast, first and second generation EIA kits showed high specificity, and the second generation EIA kits showed the highest efficiency, despite lower S/C values. Using first and second generation EIA kits, significant differences in specific antibody detection of HTLV-1, relative to HTLV-2 (p = 0.019 for first generation and p < 0.001 for second generation EIA kits) and relative to untyped HTLV (p = 0.025 for first generation EIA kits), were observed. These results are explained by the composition and format of the assays. In addition, receiver operating characteristic (ROC) analysis showed that a slight adjustment in cutoff values for third generation EIA kits improved their specificity; the adjusted cutoffs should be used when HTLV "at-risk" populations from this geographic area are evaluated.
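The cutoff adjustment via ROC analysis can be sketched as below, here on simulated S/C values with Youden's J as the selection criterion; the abstract does not state which criterion the authors used, so this is an illustrative choice.

```python
# Sketch of adjusting an assay cutoff via ROC analysis (Youden's J);
# the S/C values below are simulated, not the study's data.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(3)
# Simulated signal-to-cutoff (S/C) values: true positives higher on average.
labels = np.r_[np.ones(100), np.zeros(300)]
sc = np.r_[rng.normal(4.0, 1.5, 100), rng.normal(0.8, 0.4, 300)]

fpr, tpr, thresholds = roc_curve(labels, sc)
best = np.argmax(tpr - fpr)          # Youden's J = sensitivity + specificity - 1
print("adjusted cutoff:", thresholds[best])
print("sensitivity:", tpr[best], "specificity:", 1 - fpr[best])
```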

Relevance:

20.00%

Publisher:

Abstract:

Support vector machines (SVMs) were originally formulated for the solution of binary classification problems. In multiclass problems, a decomposition approach is often employed, in which the multiclass problem is divided into multiple binary subproblems, whose results are combined. Generally, the performance of SVM classifiers is affected by the selection of values for their parameters. This paper investigates the use of genetic algorithms (GAs) to tune the parameters of the binary SVMs in common multiclass decompositions. The developed GA may search for a set of parameter values common to all binary classifiers or for differentiated values for each binary classifier.
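A toy version of the idea: a small genetic algorithm searching for (C, gamma) values shared by all binary SVMs of a one-vs-rest decomposition. The population size, mutation scheme, and dataset are illustrative choices, not the paper's method.

```python
# Minimal GA sketch for tuning shared (C, gamma) values of the binary SVMs
# in a one-vs-rest decomposition.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X, y = load_wine(return_X_y=True)

def fitness(genes):
    # genes = (log2 C, log2 gamma), shared by all binary classifiers.
    clf = OneVsRestClassifier(SVC(C=2.0 ** genes[0], gamma=2.0 ** genes[1]))
    return cross_val_score(clf, X, y, cv=3).mean()

pop = rng.uniform(-5, 5, size=(10, 2))          # initial population
for generation in range(15):
    scores = np.array([fitness(g) for g in pop])
    parents = pop[np.argsort(scores)[-5:]]      # keep the 5 fittest
    children = parents[rng.integers(0, 5, 5)] + rng.normal(0, 0.5, (5, 2))
    pop = np.vstack([parents, children])        # elitism + Gaussian mutation

best = pop[np.argmax([fitness(g) for g in pop])]
print("best (C, gamma):", 2.0 ** best)
```

Searching differentiated values per binary classifier, as the paper also allows, would simply enlarge the chromosome to one (C, gamma) pair per subproblem.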

Relevance:

20.00%

Publisher:

Abstract:

We discuss the applicability, within random matrix theory, of a perturbative treatment of symmetry breaking to experimental data on flip symmetry breaking in a quartz crystal. We found that the values of the parameter that measures this breaking differ for the spacing distribution as compared with those for the spectral rigidity. We consider both two-fold and three-fold symmetries. The latter was found to account better for the spectral rigidity than the former. Both cases, however, underestimate the experimental spectral rigidity at large L. This discrepancy can be resolved if an appropriate number of eigenfrequencies is considered to be missing in the sample. Our findings are relevant for symmetry violation studies in general.
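For readers unfamiliar with the statistics involved, the sketch below computes the nearest-neighbour spacing distribution of a simulated Gaussian Orthogonal Ensemble (GOE) spectrum after a crude polynomial unfolding; the quartz data, the spectral rigidity fits, and the symmetry-breaking parameter estimates are not reproduced.

```python
# Nearest-neighbour spacings of a simulated GOE spectrum after unfolding.
import numpy as np

rng = np.random.default_rng(5)
n = 500
A = rng.normal(size=(n, n))
H = (A + A.T) / np.sqrt(2)                 # GOE random matrix
eig = np.sort(np.linalg.eigvalsh(H))

# Unfold the spectrum so the mean level spacing is 1
# (crude polynomial fit to the counting function).
coeffs = np.polyfit(eig, np.arange(n), deg=7)
unfolded = np.polyval(coeffs, eig)
s = np.diff(unfolded)

print("mean spacing:", s.mean())           # ~1 after unfolding
print("P(s < 0.5):", np.mean(s < 0.5))     # level repulsion: small for GOE
```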

Relevance:

20.00%

Publisher:

Abstract:

Predictors of random effects are usually based on the popular mixed effects (ME) model developed under the assumption that the sample is obtained from a conceptual infinite population; such predictors are employed even when the actual population is finite. Two alternatives that incorporate the finite nature of the population are obtained from the superpopulation model proposed by Scott and Smith (1969. Estimation in multi-stage surveys. J. Amer. Statist. Assoc. 64, 830-840) or from the finite population mixed model recently proposed by Stanek and Singer (2004. Predicting random effects from finite population clustered samples with response error. J. Amer. Statist. Assoc. 99, 1119-1130). Predictors derived under the latter model with the additional assumptions that all variance components are known and that within-cluster variances are equal have smaller mean squared error (MSE) than the competitors based on either the ME or Scott and Smith's models. As population variances are rarely known, we propose method of moment estimators to obtain empirical predictors and conduct a simulation study to evaluate their performance. The results suggest that the finite population mixed model empirical predictor is more stable than its competitors since, in terms of MSE, it is either the best or the second best and when second best, its performance lies within acceptable limits. When both cluster and unit intra-class correlation coefficients are very high (e.g., 0.95 or more), the performance of the empirical predictors derived under the three models is similar.
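The method-of-moments step can be illustrated with the classical ANOVA estimators for a one-way random effects model; the shrinkage predictor below is the textbook empirical version under the infinite population ME model, not the finite population mixed model predictor itself.

```python
# ANOVA-type method-of-moments variance components and the resulting
# empirical (shrinkage) predictor of a cluster mean, on simulated data.
import numpy as np

rng = np.random.default_rng(6)
k, m = 20, 8                                  # clusters, units per cluster
b = rng.normal(0, 2.0, size=k)                # random cluster effects
y = 10 + b[:, None] + rng.normal(0, 1.0, size=(k, m))

cluster_means = y.mean(axis=1)
grand_mean = y.mean()
msb = m * np.sum((cluster_means - grand_mean) ** 2) / (k - 1)    # between MS
msw = np.sum((y - cluster_means[:, None]) ** 2) / (k * (m - 1))  # within MS

sigma2_e = msw                                # method-of-moments estimates
sigma2_b = max((msb - msw) / m, 0.0)

# Empirical predictor: shrink each cluster mean toward the grand mean.
w = sigma2_b / (sigma2_b + sigma2_e / m)
predictor = grand_mean + w * (cluster_means - grand_mean)
print("shrinkage weight:", w)
```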

Relevance:

20.00%

Publisher:

Abstract:

We review some issues related to the implications of different missing data mechanisms for statistical inference in contingency tables and use simulation studies to compare the results obtained under such models to those obtained when the units with missing data are disregarded. We confirm that although, in general, analyses under the correct missing at random (MAR) and missing completely at random (MCAR) models are more efficient even for small sample sizes, there are exceptions where they may not improve on the results obtained by ignoring the partially classified data. We show that under the missing not at random (MNAR) model, estimates on the boundary of the parameter space, as well as lack of identifiability of the parameters of saturated models, may be associated with undesirable asymptotic properties of maximum likelihood estimators and likelihood ratio tests; even in standard cases, the bias of the estimators may be low only for very large samples. We also show that the probability of a boundary solution obtained under the correct MNAR model may be large even for large samples, and that, consequently, we may not always conclude that an MNAR model is misspecified because the estimate is on the boundary of the parameter space.
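To make the role of the partially classified units concrete, here is a toy EM fit for a 2x2 table in which one margin is only partially classified, under a MAR mechanism; the hypothetical counts are illustrative, and the MNAR boundary behaviour the abstract analyses is deliberately not reproduced.

```python
# Toy EM for a 2x2 contingency table with Y missing for some units (MAR).
import numpy as np

n_full = np.array([[40., 10.], [15., 35.]])  # fully classified counts (X by Y)
n_xonly = np.array([20., 25.])               # Y missing; X observed (rows)

p = np.full((2, 2), 0.25)                    # initial cell probabilities
for _ in range(200):
    # E-step: allocate partially classified counts across the missing margin.
    frac = p / p.sum(axis=1, keepdims=True)  # current estimate of P(Y | X)
    filled = n_full + n_xonly[:, None] * frac
    # M-step: re-estimate the cell probabilities from the completed table.
    p = filled / filled.sum()

print(np.round(p, 4))
```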