100 resultados para Predictive mean matching imputation
Resumo:
The substitution of missing values, also called imputation, is an important data preparation task for many domains. Ideally, the substitution of missing values should not insert biases into the dataset. This aspect has been usually assessed by some measures of the prediction capability of imputation methods. Such measures assume the simulation of missing entries for some attributes whose values are actually known. These artificially missing values are imputed and then compared with the original values. Although this evaluation is useful, it does not allow the influence of imputed values in the ultimate modelling task (e.g. in classification) to be inferred. We argue that imputation cannot be properly evaluated apart from the modelling task. Thus, alternative approaches are needed. This article elaborates on the influence of imputed values in classification. In particular, a practical procedure for estimating the inserted bias is described. As an additional contribution, we have used such a procedure to empirically illustrate the performance of three imputation methods (majority, naive Bayes and Bayesian networks) in three datasets. Three classifiers (decision tree, naive Bayes and nearest neighbours) have been used as modelling tools in our experiments. The achieved results illustrate a variety of situations that can take place in the data preparation practice.
Resumo:
Radial transport in the tokamap, which has been proposed as a simple model for the motion in a stochastic plasma, is investigated. A theory for previous numerical findings is presented. The new results are stimulated by the fact that the radial diffusion coefficients is space-dependent. The space-dependence of the transport coefficient has several interesting effects which have not been elucidated so far. Among the new findings are the analytical predictions for the scaling of the mean radial displacement with time and the relation between the Fokker-Planck diffusion coefficient and the diffusion coefficient from the mean square displacement. The applicability to other systems is also discussed. (c) 2009 WILEY-VCH GmbH & Co. KGaA, Weinheim
Resumo:
The study of pharmacokinetic properties (PK) is of great importance in drug discovery and development. In the present work, PK/DB (a new freely available database for PK) was designed with the aim of creating robust databases for pharmacokinetic studies and in silico absorption, distribution, metabolism and excretion (ADME) prediction. Comprehensive, web-based and easy to access, PK/DB manages 1203 compounds which represent 2973 pharmacokinetic measurements, including five models for in silico ADME prediction (human intestinal absorption, human oral bioavailability, plasma protein binding, bloodbrain barrier and water solubility).
Resumo:
2D electrophoresis is a well-known method for protein separation which is extremely useful in the field of proteomics. Each spot in the image represents a protein accumulation and the goal is to perform a differential analysis between pairs of images to study changes in protein content. It is thus necessary to register two images by finding spot correspondences. Although it may seem a simple task, generally, the manual processing of this kind of images is very cumbersome, especially when strong variations between corresponding sets of spots are expected (e.g. strong non-linear deformations and outliers). In order to solve this problem, this paper proposes a new quadratic assignment formulation together with a correspondence estimation algorithm based on graph matching which takes into account the structural information between the detected spots. Each image is represented by a graph and the task is to find a maximum common subgraph. Successful experimental results using real data are presented, including an extensive comparative performance evaluation with ground-truth data. (C) 2010 Elsevier B.V. All rights reserved.
Resumo:
Canalizing genes possess such broad regulatory power, and their action sweeps across a such a wide swath of processes that the full set of affected genes are not highly correlated under normal conditions. When not active, the controlling gene will not be predictable to any significant degree by its subject genes, either alone or in groups, since their behavior will be highly varied relative to the inactive controlling gene. When the controlling gene is active, its behavior is not well predicted by any one of its targets, but can be very well predicted by groups of genes under its control. To investigate this question, we introduce in this paper the concept of intrinsically multivariate predictive (IMP) genes, and present a mathematical study of IMP in the context of binary genes with respect to the coefficient of determination (CoD), which measures the predictive power of a set of genes with respect to a target gene. A set of predictor genes is said to be IMP for a target gene if all properly contained subsets of the predictor set are bad predictors of the target but the full predictor set predicts the target with great accuracy. We show that logic of prediction, predictive power, covariance between predictors, and the entropy of the joint probability distribution of the predictors jointly affect the appearance of IMP genes. In particular, we show that high-predictive power, small covariance among predictors, a large entropy of the joint probability distribution of predictors, and certain logics, such as XOR in the 2-predictor case, are factors that favor the appearance of IMP. The IMP concept is applied to characterize the behavior of the gene DUSP1, which exhibits control over a central, process-integrating signaling pathway, thereby providing preliminary evidence that IMP can be used as a criterion for discovery of canalizing genes.
Resumo:
In this paper, a simple relation between the Leimkuhler curve and the mean residual life is established. The result is illustrated with several models commonly used in informetrics, such as exponential, Pareto and lognormal. Finally, relationships with some other reliability concepts are also presented. (C) 2010 Elsevier Ltd. All rights reserved.
Resumo:
We discuss the estimation of the expected value of the quality-adjusted survival, based on multistate models. We generalize an earlier work, considering the sojourn times in health states are not identically distributed, for a given vector of covariates. Approaches based on semiparametric and parametric (exponential and Weibull distributions) methodologies are considered. A simulation study is conducted to evaluate the performance of the proposed estimator and the jackknife resampling method is used to estimate the variance of such estimator. An application to a real data set is also included.
Resumo:
Regression models for the mean quality-adjusted survival time are specified from hazard functions of transitions between two states and the mean quality-adjusted survival time may be a complex function of covariates. We discuss a regression model for the mean quality-adjusted survival (QAS) time based on pseudo-observations, which has the advantage of directly modeling the effect of covariates in the QAS time. Both Monte Carlo Simulations and a real data set are studied. Copyright (C) 2009 John Wiley & Sons, Ltd.
Resumo:
In clinical trials, it may be of interest taking into account physical and emotional well-being in addition to survival when comparing treatments. Quality-adjusted survival time has the advantage of incorporating information about both survival time and quality-of-life. In this paper, we discuss the estimation of the expected value of the quality-adjusted survival, based on multistate models for the sojourn times in health states. Semiparametric and parametric (with exponential distribution) approaches are considered. A simulation study is presented to evaluate the performance of the proposed estimator and the jackknife resampling method is used to compute bias and variance of the estimator. (C) 2007 Elsevier B.V. All rights reserved.
Resumo:
Objective: To identify social, demographic and clinical characteristics that influence survival of patients with systemic lupus erythematosus (SLE). Methods: Sixty-three patients with a diagnosis of SLE were studied at our medical services in 1999 and then reviewed in 2005. We utilized a protocol to obtain demographic and clinical traits, activity and damage indices, and health-related quality of life via the SF-36. All statistical tests were performed using a significance level of 5%. Results: Out of the 63 patients examined in 1999, six died, four were lost for the follow-up and the previous protocol was applied to the remaining 53 patients. The six patients who died presented the worst recorded health-related quality of fife, in all aspects. The most important observed predictor of death was a mean lower score in the Role-Emotional Domain of the mental health component of the SF-36 (p<0.01). Conclusion: Health-related quality of life may be used as possible predictive factor of mortality among patients with SLE.