923 results for model selection in binary regression


Relevance: 100.00%

Abstract:

The objective of this paper is to model variations in test-day milk yields of first lactations of Holstein cows by random regression (RR) using B-spline functions and Bayesian inference, in order to fit adequate and parsimonious models for the estimation of genetic parameters. The authors used 152,145 test-day milk yield records from 7317 first lactations of Holstein cows. The model included additive genetic, permanent environmental and residual random effects. In addition, contemporary group and the linear and quadratic effects of the age of cow at calving were included as fixed effects. The authors modeled the average lactation curve of the population with a fourth-order orthogonal Legendre polynomial. They concluded that a cubic B-spline with seven random regression coefficients for both the additive genetic and permanent environmental effects was the best according to residual mean square and residual variance estimates. Moreover, they suggested that a lower-order model (quadratic B-spline with seven random regression coefficients for both random effects) could be adopted because it yielded practically the same genetic parameter estimates with greater parsimony. (C) 2012 Elsevier B.V. All rights reserved.

Relevance: 100.00%

Abstract:

Suppose that we are interested in establishing simple, but reliable rules for predicting future t-year survivors via censored regression models. In this article, we present inference procedures for evaluating such binary classification rules based on various prediction precision measures quantified by the overall misclassification rate, sensitivity and specificity, and positive and negative predictive values. Specifically, under various working models we derive consistent estimators for the above measures via substitution and cross validation estimation procedures. Furthermore, we provide large sample approximations to the distributions of these nonsmooth estimators without assuming that the working model is correctly specified. Confidence intervals, for example, for the difference of the precision measures between two competing rules can then be constructed. All the proposals are illustrated with two real examples and their finite sample properties are evaluated via a simulation study.
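
The precision measures named in this abstract can be illustrated with a minimal sketch. It assumes fully observed 0/1 outcomes, so it deliberately ignores the censoring that the paper's estimators are designed to handle; the function name, the label coding and the toy data are illustrative assumptions, not taken from the article.

```python
def precision_measures(y_true, y_pred):
    """Return misclassification rate, sensitivity, specificity, PPV and NPV.

    y_true and y_pred are equal-length sequences of 0/1 labels
    (1 = t-year survivor; this coding is an assumption for illustration).
    Censoring is NOT handled here.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "misclassification": ratio(fp + fn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Toy data, invented for the example.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
measures = precision_measures(y_true, y_pred)
```

Comparing two competing rules then amounts to computing these measures for each rule on the same subjects and taking differences; the interval estimates for those differences are the subject of the article and are not reproduced here.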

Relevance: 100.00%

Abstract:

Most studies on selection in plants estimate female fitness components and neglect male mating success, although the latter might also be fundamental to understand adaptive evolution. Information from molecular genetic markers can be used to assess determinants of male mating success through parentage analyses. We estimated paternal selection gradients on floral traits in a large natural population of the herb Mimulus guttatus using a paternity probability model and maximum likelihood methods. This analysis revealed more significant selection gradients than a previous analysis based on regression of estimated male fertilities on floral traits. There were differences between results of univariate and multivariate analyses most likely due to the underlying covariance structure of the traits. Multivariate analysis, which corrects for the covariance structure of the traits, indicated that male mating success declined with distance from and depended on the direction to the mother plants. Moreover, there was directional selection for plants with fewer open flowers which have smaller corollas, a smaller anther-stigma separation, more red dots on the corolla and a larger fluctuating asymmetry therein. For most of these traits, however, there was also stabilizing selection indicating that there are intermediate optima for these traits. The large number of significant selection gradients in this study shows that even in relatively large natural populations where not all males can be sampled, it is possible to detect significant paternal selection gradients, and that such studies can give us valuable information required to better understand adaptive plant evolution.

Relevance: 100.00%

Abstract:

OBJECTIVES: This paper is concerned with checking goodness-of-fit of binary logistic regression models. For the practitioners of data analysis, the broad classes of procedures for checking goodness-of-fit available in the literature are described. The challenges of model checking in the context of binary logistic regression are reviewed. As a viable solution, a simple graphical procedure for checking goodness-of-fit is proposed. METHODS: The graphical procedure proposed relies on pieces of information available from any logistic analysis; the focus is on combining and presenting these in an informative way. RESULTS: The information gained using this approach is presented with three examples. In the discussion, the proposed method is put into context and compared with other graphical procedures for checking goodness-of-fit of binary logistic models available in the literature. CONCLUSION: A simple graphical method can significantly improve the understanding of any logistic regression analysis and help to prevent faulty conclusions.

Relevance: 100.00%

Abstract:

Strategies are compared for the development of a linear regression model with stochastic (multivariate normal) regressor variables and the subsequent assessment of its predictive ability. Bias and mean squared error of four estimators of predictive performance are evaluated in simulated samples of 32 population correlation matrices. Models including all of the available predictors are compared with those obtained using selected subsets. The subset selection procedures investigated include two stopping rules, $C_p$ and $S_p$, each combined with an 'all possible subsets' or 'forward selection' of variables. The estimators of performance utilized include parametric ($\mathrm{MSEP}_m$) and non-parametric (PRESS) assessments in the entire sample, and two data splitting estimates restricted to a random or balanced (Snee's DUPLEX) 'validation' half sample. The simulations were performed as a designed experiment, with population correlation matrices representing a broad range of data structures.

The techniques examined for subset selection do not generally result in improved predictions relative to the full model. Approaches using 'forward selection' result in slightly smaller prediction errors and less biased estimators of predictive accuracy than 'all possible subsets' approaches, but no differences are detected between the performances of $C_p$ and $S_p$. In every case, prediction errors of models obtained by subset selection in either of the half splits exceed those obtained using all predictors and the entire sample.

Only the random split estimator is conditionally (on $\beta$) unbiased; however, $\mathrm{MSEP}_m$ is unbiased on average and PRESS is nearly so in unselected (fixed form) models. When subset selection techniques are used, $\mathrm{MSEP}_m$ and PRESS always underestimate prediction errors, by as much as 27 percent (on average) in small samples. Despite their bias, the mean squared errors (MSE) of these estimators are at least 30 percent less than that of the unbiased random split estimator. The DUPLEX split estimator suffers from large MSE as well as bias, and seems of little value within the context of stochastic regressor variables.

To maximize predictive accuracy while retaining a reliable estimate of that accuracy, it is recommended that the entire sample be used for model development, and a leave-one-out statistic (e.g. PRESS) be used for assessment.
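
As a rough illustration of the leave-one-out statistic recommended above, the sketch below computes PRESS for an ordinary least squares line with a single predictor, a simplification of the multi-predictor setting studied in the abstract. It relies on the standard identity that the leave-one-out residual equals the ordinary residual divided by one minus the leverage, and checks it against explicit refitting. All function names and data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - slope * x_bar, slope


def press(x, y):
    """PRESS from a single fit: sum of (residual / (1 - leverage))^2."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    intercept, slope = fit_line(x, y)
    total = 0.0
    for xi, yi in zip(x, y):
        residual = yi - (intercept + slope * xi)
        leverage = 1.0 / n + (xi - x_bar) ** 2 / sxx  # hat-matrix diagonal
        total += (residual / (1.0 - leverage)) ** 2
    return total


def press_by_refitting(x, y):
    """Same quantity computed the slow way, refitting without each point."""
    total = 0.0
    for i in range(len(x)):
        intercept, slope = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - (intercept + slope * x[i])) ** 2
    return total


# Toy data, invented for the example.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.8, 5.3, 5.9]
press_fast = press(x, y)
press_slow = press_by_refitting(x, y)
```

The two routes agree up to rounding, which is why PRESS can be obtained from one fit on the entire sample rather than from n separate fits.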

Relevance: 100.00%

Abstract:

The problem of analyzing data with updated measurements in the time-dependent proportional hazards model arises frequently in practice. One available option is to reduce the number of intervals (or updated measurements) to be included in the Cox regression model. We empirically investigated the bias of the estimator of the coefficient of the time-dependent covariate while varying the failure rate, sample size, true values of the parameters and the number of intervals. We also evaluated how often a time-dependent covariate needs to be collected and assessed the effect of sample size and failure rate on the power of testing a time-dependent effect.

A time-dependent proportional hazards model with two binary covariates was considered. The time axis was partitioned into k intervals. The baseline hazard was assumed to be 1 so that the failure times were exponentially distributed in the ith interval. A type II censoring model was adopted to characterize the failure rate. The factors of interest were sample size (500, 1000), type II censoring with failure rates of 0.05, 0.10, and 0.20, and three values for each of the non-time-dependent and time-dependent covariates (1/4, 1/2, 3/4).

The mean of the bias of the estimator of the coefficient of the time-dependent covariate decreased as sample size and number of intervals increased, whereas the mean of the bias increased as failure rate and true values of the covariates increased. The mean of the bias of the estimator of the coefficient was smallest when all of the updated measurements were used in the model, compared with two models that used selected measurements of the time-dependent covariate. For the model that included all the measurements, the coverage rates of the estimator of the coefficient of the time-dependent covariate were in most cases 90% or more, except when the failure rate was high (0.20). The power associated with testing a time-dependent effect was highest when all of the measurements of the time-dependent covariate were used. An example from the Systolic Hypertension in the Elderly Program Cooperative Research Group is presented.

Relevance: 100.00%

Abstract:

Species selection for forest restoration is often supported by expert knowledge on local distribution patterns of native tree species. This approach is not applicable to largely deforested regions unless enough data on pre-human tree species distribution is available. In such regions, ecological niche models may provide essential information to support species selection in the framework of forest restoration planning. In this study we used ecological niche models to predict habitat suitability for native tree species in the "Tierra de Campos" region, an almost totally deforested area of the Duero Basin (Spain). Previously available models provide habitat suitability predictions for dominant native tree species, but including non-dominant tree species in forest restoration planning may be desirable to promote biodiversity, especially in largely deforested areas where nearby seed sources are not expected. We used the Forest Map of Spain as the source of species occurrence data to maximize the number of modeled tree species. Penalized logistic regression was used to train models using climate and lithological predictors. Using model predictions, a set of tools was developed to support species selection in forest restoration planning. Model predictions were used to build ordered lists of suitable species for each cell of the study area. The suitable species lists were summarized by drawing maps that showed the two most suitable species for each cell. Additionally, potential distribution maps of the suitable species for the study area were drawn. For a scenario with two dominant species, the models predicted a mixed forest (Quercus ilex and a coniferous tree species) for almost one half of the study area. According to the models, 22 non-dominant native tree species are suitable for the study area, with up to six suitable species per cell. 
The model predictions pointed to Crataegus monogyna, Juniperus communis, J. oxycedrus and J. phoenicea as the most suitable non-dominant native tree species in the study area. Our results encourage further use of ecological niche models for forest restoration planning in largely deforested regions.

Relevance: 100.00%

Abstract:

We consider the problem of interaction neighborhood estimation from the partial observation of a finite number of realizations of a random field. We introduce a model selection rule to choose estimators of conditional probabilities among natural candidates. Our main result is an oracle inequality satisfied by the resulting estimator. We then use this selection rule in a two-step procedure to evaluate the interacting neighborhoods. The selection rule selects a small prior set of possible interacting points, and a cutting step removes the irrelevant points from this prior set. We also prove that the Ising models satisfy the assumptions of the main theorems, without restrictions on the temperature, on the structure of the interacting graph or on the range of the interactions. They therefore provide a large class of applications for our results. We give a computationally efficient procedure in these models. Finally, we show the practical efficiency of our approach in a simulation study.

Relevance: 100.00%

Abstract:

We introduce the log-beta Weibull regression model based on the beta Weibull distribution (Famoye et al., 2005; Lee et al., 2007). We derive expansions for the moment generating function which do not depend on complicated functions. The new regression model represents a parametric family of models that includes as sub-models several widely known regression models that can be applied to censored survival data. We employ a frequentist analysis, a jackknife estimator, and a parametric bootstrap for the parameters of the proposed model. We derive the appropriate matrices for assessing local influences on the parameter estimates under different perturbation schemes and present some ways to assess global influences. Further, several simulations are performed for different parameter settings, sample sizes, and censoring percentages. In addition, the empirical distribution of some modified residuals is displayed and compared with the standard normal distribution. These studies suggest that the residual analysis usually performed in normal linear regression models can be extended to a modified deviance residual in the proposed regression model applied to censored data. We define martingale and deviance residuals to evaluate the model assumptions. The extended regression model is very useful for the analysis of real data and could give more realistic fits than other special regression models.

Relevance: 100.00%

Abstract:

We are concerned with providing more empirical evidence on forecast failure, developing forecast models, and examining the impact of events such as audit reports. A joint consideration of classic financial ratios and relevant external indicators leads us to build a basic prediction model focused on non-financial Galician SMEs. Explanatory variables are relevant financial indicators from the viewpoint of financial logic and financial failure theory. The paper explores three mathematical models: discriminant analysis (MDA), Logit, and linear multivariate regression. We conclude that, even though both offer high explanatory and predictive abilities, Logit and MDA models should be used and interpreted jointly.

Relevance: 100.00%

Abstract:

The role of land cover change as a significant component of global change has become increasingly recognized in recent decades. Large databases measuring land cover change, and the data which can potentially be used to explain the observed changes, are also becoming more commonly available. When developing statistical models to investigate observed changes, it is important to be aware that the chosen sampling strategy and modelling techniques can influence results. We present a comparison of three sampling strategies and two forms of grouped logistic regression models (multinomial and ordinal) in the investigation of patterns of successional change after agricultural land abandonment in Switzerland. Results indicated that both ordinal and nominal transitional change occurs in the landscape and that the use of different sampling regimes and modelling techniques as investigative tools yields different results. Synthesis and applications. Our multimodel inference successfully identified a set of consistently selected indicators of land cover change, which can be used to predict further change, including annual average temperature, the number of already overgrown neighbouring areas of land and distance to historically destructive avalanche sites. This allows for more reliable decision making and planning with respect to landscape management. Although both model approaches gave similar results, ordinal regression yielded more parsimonious models that identified the important predictors of land cover change more efficiently. Thus, this approach is preferable where the land cover change pattern can be interpreted as an ordinal process. Otherwise, multinomial logistic regression is a viable alternative.

Relevance: 100.00%

Abstract:

Background: Multiple logistic regression is precluded from many practical applications in ecology that aim to predict the geographic distributions of species because it requires absence data, which are rarely available or are unreliable. In order to use multiple logistic regression, many studies have simulated "pseudo-absences" through a number of strategies, but it is unknown how the choice of strategy influences models and their geographic predictions of species. In this paper we evaluate the effect of several prevailing pseudo-absence strategies on the predictions of the geographic distribution of a virtual species whose "true" distribution and relationship to three environmental predictors were predefined. We evaluated the effect of using a) real absences, b) pseudo-absences selected randomly from the background, and c) two-step approaches: pseudo-absences selected from low suitability areas predicted by either Ecological Niche Factor Analysis (ENFA) or BIOCLIM. We compared how the choice of pseudo-absence strategy affected model fit, predictive power, and information-theoretic model selection results. Results: Models built with true absences had the best predictive power and the best discriminatory power, and the "true" model (the one that contained the correct predictors) was supported by the data according to AIC, as expected. Models based on random pseudo-absences had among the lowest fit, but yielded the second highest AUC value (0.97), and the "true" model was also supported by the data. Models based on two-step approaches had intermediate fit and the lowest predictive power, and the "true" model was not supported by the data. Conclusion: If ecologists wish to build parsimonious GLM models that will allow them to make robust predictions, a reasonable approach is to use a large number of randomly selected pseudo-absences, and perform model selection based on an information-theoretic approach. However, the resulting models can be expected to have limited fit.
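
The two evaluation quantities mentioned in this abstract, AUC and AIC, can be sketched in a few lines. This is a minimal illustration under stated assumptions: AUC is computed as the proportion of (presence, absence) pairs ranked correctly, with ties counted as one half, and AIC uses the Bernoulli log-likelihood of fitted probabilities. The function names and the scores are hypothetical and unrelated to the study's data.

```python
import math


def auc(scores_presence, scores_absence):
    """Share of (presence, absence) pairs where the presence scores higher."""
    wins = 0.0
    for p in scores_presence:
        for q in scores_absence:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(scores_presence) * len(scores_absence))


def aic(y, p_hat, n_params):
    """AIC = 2k - 2 log L for a binary model with fitted probabilities p_hat."""
    log_lik = sum(
        math.log(p) if yi == 1 else math.log(1.0 - p)
        for yi, p in zip(y, p_hat)
    )
    return 2.0 * n_params - 2.0 * log_lik


# Invented suitability scores for presences and (pseudo-)absences.
presence_scores = [0.9, 0.8, 0.6, 0.55]
pseudo_absence_scores = [0.7, 0.4, 0.3, 0.2]
auc_value = auc(presence_scores, pseudo_absence_scores)

# Two candidate models would be compared by their AIC on the same data;
# the lower value is preferred.
aic_value = aic([1, 0], [0.5, 0.5], n_params=1)
```

Because AUC depends only on ranks while AIC depends on the fitted probabilities, a model can discriminate well and still fit poorly, which is consistent with the pattern the abstract reports for random pseudo-absences.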

Relevance: 100.00%

Abstract:

Over thirty years ago, Leamer (1983) - among many others - expressed doubts about the quality and usefulness of empirical analyses for the economic profession by stating that "hardly anyone takes data analyses seriously. Or perhaps more accurately, hardly anyone takes anyone else's data analyses seriously" (p. 37). Improvements in data quality, more robust estimation methods and the evolution of better research designs seem to make that assertion no longer justifiable (see Angrist and Pischke (2010) for a recent response to Leamer's essay). The economic profession and policy makers alike often rely on empirical evidence as a means to investigate policy-relevant questions. The approach of using scientifically rigorous and systematic evidence to identify policies and programs that are capable of improving policy-relevant outcomes is known under the increasingly popular notion of evidence-based policy. Evidence-based economic policy often relies on randomized or quasi-natural experiments in order to identify causal effects of policies. These can require relatively strong assumptions or raise concerns of external validity. In the context of this thesis, potential concerns are for example endogeneity of policy reforms with respect to the business cycle in the first chapter, the trade-off between precision and bias in the regression-discontinuity setting in chapter 2 or non-representativeness of the sample due to self-selection in chapter 3. While the identification strategies are very useful to gain insights into the causal effects of specific policy questions, transforming the evidence into concrete policy conclusions can be challenging. Policy development should therefore rely on the systematic evidence of a whole body of research on a specific policy question rather than on a single analysis. 
In this sense, this thesis cannot and should not be viewed as a comprehensive analysis of specific policy issues but rather as a first step towards a better understanding of certain aspects of a policy question. The thesis applies new and innovative identification strategies to policy-relevant and topical questions in the fields of labor economics and behavioral environmental economics. Each chapter relies on a different identification strategy. In the first chapter, we employ a difference-in-differences approach to exploit the quasi-experimental change in the entitlement of the maximum unemployment benefit duration to identify the medium-run effects of reduced benefit durations on post-unemployment outcomes. Shortening benefit duration carries a double dividend: it generates fiscal benefits without deteriorating the quality of job matches. On the contrary, shortened benefit durations improve medium-run earnings and employment, possibly through containing the negative effects of skill depreciation or stigmatization. While the first chapter provides only indirect evidence on the underlying behavioral channels, in the second chapter I develop a novel approach that makes it possible to learn about the relative importance of the two key margins of job search - reservation wage choice and search effort. In the framework of a standard non-stationary job search model, I show how the exit rate from unemployment can be decomposed in a way that is informative on reservation wage movements over the unemployment spell. The empirical analysis relies on a sharp discontinuity in unemployment benefit entitlement, which can be exploited in a regression-discontinuity approach to identify the effects of extended benefit durations on unemployment and survivor functions. I find evidence that calls for an important role of reservation wage choices for job search behavior. This can have direct implications for the optimal design of unemployment insurance policies. 
The third chapter - while thematically detached from the other chapters - addresses one of the major policy challenges of the 21st century: climate change and resource consumption. Many governments have recently put energy efficiency on top of their agendas. While pricing instruments aimed at regulating the energy demand have often been found to be short-lived and difficult to enforce politically, the focus of energy conservation programs has shifted towards behavioral approaches - such as provision of information or social norm feedback. The third chapter describes a randomized controlled field experiment in which we discuss the effectiveness of different types of feedback on residential electricity consumption. We find that detailed and real-time feedback caused persistent electricity reductions on the order of 3 to 5% of daily electricity consumption. Social norm information can also generate substantial electricity savings when designed appropriately. The findings suggest that behavioral approaches constitute an effective and relatively cheap way of improving residential energy efficiency.

Relevance: 100.00%

Abstract:

In this paper we study the relevance of multiple kernel learning (MKL) for the automatic selection of time series inputs. Recently, MKL has gained great attention in the machine learning community due to its flexibility in modelling complex patterns and performing feature selection. In general, MKL constructs the kernel as a weighted linear combination of basis kernels, exploiting different sources of information. An efficient algorithm wrapping a Support Vector Regression model for optimizing the MKL weights, named SimpleMKL, is used for the analysis. In this sense, MKL performs feature selection by discarding inputs/kernels with low or null weights. The approach proposed is tested with simulated linear and nonlinear time series (AutoRegressive, Henon and Lorenz series).
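
The idea of building the kernel as a weighted linear combination of basis kernels, and reading input selection off the weights, can be sketched as follows. This is not the SimpleMKL optimizer: the weights are simply given, and the choice of one Gaussian basis kernel per lagged input, the `gamma` value and all names are assumptions made for illustration.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian basis kernel on a single scalar input."""
    return math.exp(-gamma * (a - b) ** 2)


def combined_kernel(u, v, weights, gamma=1.0):
    """K(u, v) = sum_m d_m * k_m(u_m, v_m), one basis kernel per input m."""
    return sum(
        d * rbf_kernel(um, vm, gamma) for d, um, vm in zip(weights, u, v)
    )


def selected_inputs(weights, tol=1e-8):
    """Indices of inputs whose kernel weight is not (numerically) zero."""
    return [m for m, d in enumerate(weights) if d > tol]


# Hypothetical weights, assumed non-negative and summing to one.
weights = [0.7, 0.0, 0.3]
u = [0.2, 1.5, -0.4]   # three lagged values of a series at one time point
v = [0.1, -2.0, -0.4]  # the same lags at another time point

k_uv = combined_kernel(u, v, weights)
kept = selected_inputs(weights)
```

An input whose weight is zero has no influence on the combined kernel, so changing its value leaves `k_uv` unchanged; that is the sense in which discarding low-weight kernels acts as feature selection.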

Relevance: 100.00%

Abstract:

Aim This study used data from temperate forest communities to assess: (1) five different stepwise selection methods with generalized additive models, (2) the effect of weighting absences to ensure a prevalence of 0.5, (3) the effect of limiting absences beyond the environmental envelope defined by presences, (4) four different methods for incorporating spatial autocorrelation, and (5) the effect of integrating an interaction factor defined by a regression tree on the residuals of an initial environmental model. Location State of Vaud, western Switzerland. Methods Generalized additive models (GAMs) were fitted using the grasp package (generalized regression analysis and spatial predictions, http://www.cscf.ch/grasp). Results Model selection based on cross-validation appeared to be the best compromise between model stability and performance (parsimony) among the five methods tested. Weighting absences returned models that perform better than models fitted with the original sample prevalence. This appeared to be mainly due to the impact of very low prevalence values on evaluation statistics. Removing zeroes beyond the range of presences on main environmental gradients changed the set of selected predictors, and potentially their response curve shape. Moreover, removing zeroes slightly improved model performance and stability when compared with the baseline model on the same data set. Incorporating a spatial trend predictor improved model performance and stability significantly. Even better models were obtained when including local spatial autocorrelation. A novel approach to include interactions proved to be an efficient way to account for interactions between all predictors at once. 
Main conclusions Models and spatial predictions of 18 forest communities were significantly improved by using either: (1) cross-validation as a model selection method, (2) weighted absences, (3) limited absences, (4) predictors accounting for spatial autocorrelation, or (5) a factor variable accounting for interactions between all predictors. The final choice of model strategy should depend on the nature of the available data and the specific study aims. Statistical evaluation is useful in searching for the best modelling practice. However, one should not neglect to consider the shapes and interpretability of response curves, as well as the resulting spatial predictions in the final assessment.
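
One simple way to weight absences so that the weighted prevalence equals 0.5 is sketched below. It assumes presences keep weight 1 and every absence receives the same weight, the ratio of presences to absences; the abstract does not state the exact scheme used, so this is an illustrative assumption, and the function names are hypothetical.

```python
def absence_weights(y):
    """Case weights giving a weighted prevalence of 0.5.

    y is a sequence of 0/1 values (1 = presence). Presences keep weight 1;
    each absence gets n_presences / n_absences.
    """
    n_presences = sum(y)
    n_absences = len(y) - n_presences
    if n_presences == 0 or n_absences == 0:
        raise ValueError("need at least one presence and one absence")
    w_absence = n_presences / n_absences
    return [1.0 if yi == 1 else w_absence for yi in y]


def weighted_prevalence(y, w):
    """Weighted share of presences."""
    return sum(wi for yi, wi in zip(y, w) if yi == 1) / sum(w)


# Toy sample with a raw prevalence of 0.2.
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
w = absence_weights(y)
prevalence = weighted_prevalence(y, w)
```

Weights of this kind would then be passed as case weights to the model-fitting routine, so that presences and absences contribute equally to the fit regardless of how rare the community is in the sample.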