879 resultados para Predictive regression
Resumo:
Credit scoring modelling comprises one of the leading formal tools for supporting the granting of credit. Its core objective consists of the generation of a score by means of which potential clients can be listed in the order of the probability of default. A critical factor is whether a credit scoring model is accurate enough in order to provide correct classification of the client as a good or bad payer. In this context the concept of bootstraping aggregating (bagging) arises. The basic idea is to generate multiple classifiers by obtaining the predicted values from the fitted models to several replicated datasets and then combining them into a single predictive classification in order to improve the classification accuracy. In this paper we propose a new bagging-type variant procedure, which we call poly-bagging, consisting of combining predictors over a succession of resamplings. The study is derived by credit scoring modelling. The proposed poly-bagging procedure was applied to some different artificial datasets and to a real granting of credit dataset up to three successions of resamplings. We observed better classification accuracy for the two-bagged and the three-bagged models for all considered setups. These results lead to a strong indication that the poly-bagging approach may promote improvement on the modelling performance measures, while keeping a flexible and straightforward bagging-type structure easy to implement. (C) 2011 Elsevier Ltd. All rights reserved.
Resumo:
In this paper, the generalized log-gamma regression model is modified to allow the possibility that long-term survivors may be present in the data. This modification leads to a generalized log-gamma regression model with a cure rate, encompassing, as special cases, the log-exponential, log-Weibull and log-normal regression models with a cure rate typically used to model such data. The models attempt to simultaneously estimate the effects of explanatory variables on the timing acceleration/deceleration of a given event and the surviving fraction, that is, the proportion of the population for which the event never occurs. The normal curvatures of local influence are derived under some usual perturbation schemes and two martingale-type residuals are proposed to assess departures from the generalized log-gamma error assumption as well as to detect outlying observations. Finally, a data set from the medical area is analyzed.
Resumo:
In survival analysis applications, the failure rate function may frequently present a unimodal shape. In such case, the log-normal or log-logistic distributions are used. In this paper, we shall be concerned only with parametric forms, so a location-scale regression model based on the Burr XII distribution is proposed for modeling data with a unimodal failure rate function as an alternative to the log-logistic regression model. Assuming censored data, we consider a classic analysis, a Bayesian analysis and a jackknife estimator for the parameters of the proposed model. For different parameter settings, sample sizes and censoring percentages, various simulation studies are performed and compared to the performance of the log-logistic and log-Burr XII regression models. Besides, we use sensitivity analysis to detect influential or outlying observations, and residual analysis is used to check the assumptions in the model. Finally, we analyze a real data set under log-Buff XII regression models. (C) 2008 Published by Elsevier B.V.
Resumo:
Model trees are a particular case of decision trees employed to solve regression problems. They have the advantage of presenting an interpretable output, helping the end-user to get more confidence in the prediction and providing the basis for the end-user to have new insight about the data, confirming or rejecting hypotheses previously formed. Moreover, model trees present an acceptable level of predictive performance in comparison to most techniques used for solving regression problems. Since generating the optimal model tree is an NP-Complete problem, traditional model tree induction algorithms make use of a greedy top-down divide-and-conquer strategy, which may not converge to the global optimal solution. In this paper, we propose a novel algorithm based on the use of the evolutionary algorithms paradigm as an alternate heuristic to generate model trees in order to improve the convergence to globally near-optimal solutions. We call our new approach evolutionary model tree induction (E-Motion). We test its predictive performance using public UCI data sets, and we compare the results to traditional greedy regression/model trees induction algorithms, as well as to other evolutionary approaches. Results show that our method presents a good trade-off between predictive performance and model comprehensibility, which may be crucial in many machine learning applications. (C) 2010 Elsevier Inc. All rights reserved.
Resumo:
The study of pharmacokinetic properties (PK) is of great importance in drug discovery and development. In the present work, PK/DB (a new freely available database for PK) was designed with the aim of creating robust databases for pharmacokinetic studies and in silico absorption, distribution, metabolism and excretion (ADME) prediction. Comprehensive, web-based and easy to access, PK/DB manages 1203 compounds which represent 2973 pharmacokinetic measurements, including five models for in silico ADME prediction (human intestinal absorption, human oral bioavailability, plasma protein binding, bloodbrain barrier and water solubility).
Resumo:
Canalizing genes possess such broad regulatory power, and their action sweeps across a such a wide swath of processes that the full set of affected genes are not highly correlated under normal conditions. When not active, the controlling gene will not be predictable to any significant degree by its subject genes, either alone or in groups, since their behavior will be highly varied relative to the inactive controlling gene. When the controlling gene is active, its behavior is not well predicted by any one of its targets, but can be very well predicted by groups of genes under its control. To investigate this question, we introduce in this paper the concept of intrinsically multivariate predictive (IMP) genes, and present a mathematical study of IMP in the context of binary genes with respect to the coefficient of determination (CoD), which measures the predictive power of a set of genes with respect to a target gene. A set of predictor genes is said to be IMP for a target gene if all properly contained subsets of the predictor set are bad predictors of the target but the full predictor set predicts the target with great accuracy. We show that logic of prediction, predictive power, covariance between predictors, and the entropy of the joint probability distribution of the predictors jointly affect the appearance of IMP genes. In particular, we show that high-predictive power, small covariance among predictors, a large entropy of the joint probability distribution of predictors, and certain logics, such as XOR in the 2-predictor case, are factors that favor the appearance of IMP. The IMP concept is applied to characterize the behavior of the gene DUSP1, which exhibits control over a central, process-integrating signaling pathway, thereby providing preliminary evidence that IMP can be used as a criterion for discovery of canalizing genes.
Resumo:
In this article, we compare three residuals based on the deviance component in generalised log-gamma regression models with censored observations. For different parameter settings, sample sizes and censoring percentages, various simulation studies are performed and the empirical distribution of each residual is displayed and compared with the standard normal distribution. For all cases studied, the empirical distributions of the proposed residuals are in general symmetric around zero, but only a martingale-type residual presented negligible kurtosis for the majority of the cases studied. These studies suggest that the residual analysis usually performed in normal linear regression models can be straightforwardly extended for the martingale-type residual in generalised log-gamma regression models with censored data. A lifetime data set is analysed under log-gamma regression models and a model checking based on the martingale-type residual is performed.
Resumo:
We obtain adjustments to the profile likelihood function in Weibull regression models with and without censoring. Specifically, we consider two different modified profile likelihoods: (i) the one proposed by Cox and Reid [Cox, D.R. and Reid, N., 1987, Parameter orthogonality and approximate conditional inference. Journal of the Royal Statistical Society B, 49, 1-39.], and (ii) an approximation to the one proposed by Barndorff-Nielsen [Barndorff-Nielsen, O.E., 1983, On a formula for the distribution of the maximum likelihood estimator. Biometrika, 70, 343-365.], the approximation having been obtained using the results by Fraser and Reid [Fraser, D.A.S. and Reid, N., 1995, Ancillaries and third-order significance. Utilitas Mathematica, 47, 33-53.] and by Fraser et al. [Fraser, D.A.S., Reid, N. and Wu, J., 1999, A simple formula for tail probabilities for frequentist and Bayesian inference. Biometrika, 86, 655-661.]. We focus on point estimation and likelihood ratio tests on the shape parameter in the class of Weibull regression models. We derive some distributional properties of the different maximum likelihood estimators and likelihood ratio tests. The numerical evidence presented in the paper favors the approximation to Barndorff-Nielsen`s adjustment.
Resumo:
Most studies involving statistical time series analysis rely on assumptions of linearity, which by its simplicity facilitates parameter interpretation and estimation. However, the linearity assumption may be too restrictive for many practical applications. The implementation of nonlinear models in time series analysis involves the estimation of a large set of parameters, frequently leading to overfitting problems. In this article, a predictability coefficient is estimated using a combination of nonlinear autoregressive models and the use of support vector regression in this model is explored. We illustrate the usefulness and interpretability of results by using electroencephalographic records of an epileptic patient.
Resumo:
We propose two new residuals for the class of beta regression models, and numerically evaluate their behaviour relative to the residuals proposed by Ferrari and Cribari-Neto. Monte Carlo simulation results and empirical applications using real and simulated data are provided. The results favour one of the residuals we propose.
Resumo:
The class of symmetric linear regression models has the normal linear regression model as a special case and includes several models that assume that the errors follow a symmetric distribution with longer-than-normal tails. An important member of this class is the t linear regression model, which is commonly used as an alternative to the usual normal regression model when the data contain extreme or outlying observations. In this article, we develop second-order asymptotic theory for score tests in this class of models. We obtain Bartlett-corrected score statistics for testing hypotheses on the regression and the dispersion parameters. The corrected statistics have chi-squared distributions with errors of order O(n(-3/2)), n being the sample size. The corrections represent an improvement over the corresponding original Rao`s score statistics, which are chi-squared distributed up to errors of order O(n(-1)). Simulation results show that the corrected score tests perform much better than their uncorrected counterparts in samples of small or moderate size.
Resumo:
The Birnbaum-Saunders distribution has been used quite effectively to model times to failure for materials subject to fatigue and for modeling lifetime data. In this paper we obtain asymptotic expansions, up to order n(-1/2) and under a sequence of Pitman alternatives, for the non-null distribution functions of the likelihood ratio, Wald, score and gradient test statistics in the Birnbaum-Saunders regression model. The asymptotic distributions of all four statistics are obtained for testing a subset of regression parameters and for testing the shape parameter. Monte Carlo simulation is presented in order to compare the finite-sample performance of these tests. We also present two empirical applications. (C) 2010 Elsevier B.V. All rights reserved.
Resumo:
The Birnbaum-Saunders regression model is becoming increasingly popular in lifetime analyses and reliability studies. In this model, the signed likelihood ratio statistic provides the basis for testing inference and construction of confidence limits for a single parameter of interest. We focus on the small sample case, where the standard normal distribution gives a poor approximation to the true distribution of the statistic. We derive three adjusted signed likelihood ratio statistics that lead to very accurate inference even for very small samples. Two empirical applications are presented. (C) 2010 Elsevier B.V. All rights reserved.
Resumo:
When missing data occur in studies designed to compare the accuracy of diagnostic tests, a common, though naive, practice is to base the comparison of sensitivity, specificity, as well as of positive and negative predictive values on some subset of the data that fits into methods implemented in standard statistical packages. Such methods are usually valid only under the strong missing completely at random (MCAR) assumption and may generate biased and less precise estimates. We review some models that use the dependence structure of the completely observed cases to incorporate the information of the partially categorized observations into the analysis and show how they may be fitted via a two-stage hybrid process involving maximum likelihood in the first stage and weighted least squares in the second. We indicate how computational subroutines written in R may be used to fit the proposed models and illustrate the different analysis strategies with observational data collected to compare the accuracy of three distinct non-invasive diagnostic methods for endometriosis. The results indicate that even when the MCAR assumption is plausible, the naive partial analyses should be avoided.
Resumo:
We consider the issue of performing accurate small-sample likelihood-based inference in beta regression models, which are useful for modelling continuous proportions that are affected by independent variables. We derive small-sample adjustments to the likelihood ratio statistic in this class of models. The adjusted statistics can be easily implemented from standard statistical software. We present Monte Carlo simulations showing that inference based on the adjusted statistics we propose is much more reliable than that based on the usual likelihood ratio statistic. A real data example is presented.