923 resultados para model selection in binary regression


Relevância:

100.00% 100.00%

Publicador:

Resumo:

Many of the most interesting questions ecologists ask lead to analyses of spatial data. Yet, perhaps confused by the large number of statistical models and fitting methods available, many ecologists seem to believe this is best left to specialists. Here, we describe the issues that need consideration when analysing spatial data and illustrate these using simulation studies. Our comparative analysis involves using methods including generalized least squares, spatial filters, wavelet revised models, conditional autoregressive models and generalized additive mixed models to estimate regression coefficients from synthetic but realistic data sets, including some which violate standard regression assumptions. We assess the performance of each method using two measures and using statistical error rates for model selection. Methods that performed well included generalized least squares family of models and a Bayesian implementation of the conditional auto-regressive model. Ordinary least squares also performed adequately in the absence of model selection, but had poorly controlled Type I error rates and so did not show the improvements in performance under model selection when using the above methods. Removing large-scale spatial trends in the response led to poor performance. These are empirical results; hence extrapolation of these findings to other situations should be performed cautiously. Nevertheless, our simulation-based approach provides much stronger evidence for comparative analysis than assessments based on single or small numbers of data sets, and should be considered a necessary foundation for statements of this type in future.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The objective of this work was to compare the relative efficiency of initial selection and genetic parameter estimation, using augmented blocks design (ABD), augmented blocks twice replicated design (DABD) and group of randomised block design experiments with common treatments (ERBCT), by simulations, considering fixed effect model and mixed model with regular treatment effects as random. For the simulations, eight different conditions (scenarios) were considered. From the 600 simulations in each scenario, the mean percentage selection coincidence, the Pearsons´s correlation estimates between adjusted means for the fixed effects model, and the heritability estimates for the mixed model were evaluated. DABD and ERBCT were very similar in their comparisons and slightly superior to ABD. Considering the initial stages of selection in a plant breeding program, ABD is a good alternative for selecting superior genotypes, although none of the designs had been effective to estimate heritability in all the different scenarios evaluated.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

BACKGROUND: The bacterial flagellum is the most important organelle of motility in bacteria and plays a key role in many bacterial lifestyles, including virulence. The flagellum also provides a paradigm of how hierarchical gene regulation, intricate protein-protein interactions and controlled protein secretion can result in the assembly of a complex multi-protein structure tightly orchestrated in time and space. As if to stress its importance, plants and animals produce receptors specifically dedicated to the recognition of flagella. Aside from motility, the flagellum also moonlights as an adhesion and has been adapted by humans as a tool for peptide display. Flagellar sequence variation constitutes a marker with widespread potential uses for studies of population genetics and phylogeny of bacterial species. RESULTS: We sequenced the complete flagellin gene (flaA) in 18 different species and subspecies of Aeromonas. Sequences ranged in size from 870 (A. allosaccharophila) to 921 nucleotides (A. popoffii). The multiple alignment displayed 924 sites, 66 of which presented alignment gaps. The phylogenetic tree revealed the existence of two groups of species exhibiting different FlaA flagellins (FlaA1 and FlaA2). Maximum likelihood models of codon substitution were used to analyze flaA sequences. Likelihood ratio tests suggested a low variation in selective pressure among lineages, with an omega ratio of less than 1 indicating the presence of purifying selection in almost all cases. Only one site under potential diversifying selection was identified (isoleucine in position 179). However, 17 amino acid positions were inferred as sites that are likely to be under positive selection using the branch-site model. Ancestral reconstruction revealed that these 17 amino acids were among the amino acid changes detected in the ancestral sequence. CONCLUSION: The models applied to our set of sequences allowed us to determine the possible evolutionary pathway followed by the flaA gene in Aeromonas, suggesting that this gene have probably been evolving independently in the two groups of Aeromonas species since the divergence of a distant common ancestor after one or several episodes of positive selection. REVIEWERS: This article was reviewed by Alexey Kondrashov, John Logsdon and Olivier Tenaillon (nominated by Laurence D Hurst).

Relevância:

100.00% 100.00%

Publicador:

Resumo:

In general, models of ecological systems can be broadly categorized as ’top-down’ or ’bottom-up’ models, based on the hierarchical level that the model processes are formulated on. The structure of a top-down, also known as phenomenological, population model can be interpreted in terms of population characteristics, but it typically lacks an interpretation on a more basic level. In contrast, bottom-up, also known as mechanistic, population models are derived from assumptions and processes on a more basic level, which allows interpretation of the model parameters in terms of individual behavior. Both approaches, phenomenological and mechanistic modelling, can have their advantages and disadvantages in different situations. However, mechanistically derived models might be better at capturing the properties of the system at hand, and thus give more accurate predictions. In particular, when models are used for evolutionary studies, mechanistic models are more appropriate, since natural selection takes place on the individual level, and in mechanistic models the direct connection between model parameters and individual properties has already been established. The purpose of this thesis is twofold. Firstly, a systematical way to derive mechanistic discrete-time population models is presented. The derivation is based on combining explicitly modelled, continuous processes on the individual level within a reproductive period with a discrete-time maturation process between reproductive periods. Secondly, as an example of how evolutionary studies can be carried out in mechanistic models, the evolution of the timing of reproduction is investigated. Thus, these two lines of research, derivation of mechanistic population models and evolutionary studies, are complementary to each other.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Machine learning provides tools for automated construction of predictive models in data intensive areas of engineering and science. The family of regularized kernel methods have in the recent years become one of the mainstream approaches to machine learning, due to a number of advantages the methods share. The approach provides theoretically well-founded solutions to the problems of under- and overfitting, allows learning from structured data, and has been empirically demonstrated to yield high predictive performance on a wide range of application domains. Historically, the problems of classification and regression have gained the majority of attention in the field. In this thesis we focus on another type of learning problem, that of learning to rank. In learning to rank, the aim is from a set of past observations to learn a ranking function that can order new objects according to how well they match some underlying criterion of goodness. As an important special case of the setting, we can recover the bipartite ranking problem, corresponding to maximizing the area under the ROC curve (AUC) in binary classification. Ranking applications appear in a large variety of settings, examples encountered in this thesis include document retrieval in web search, recommender systems, information extraction and automated parsing of natural language. We consider the pairwise approach to learning to rank, where ranking models are learned by minimizing the expected probability of ranking any two randomly drawn test examples incorrectly. The development of computationally efficient kernel methods, based on this approach, has in the past proven to be challenging. Moreover, it is not clear what techniques for estimating the predictive performance of learned models are the most reliable in the ranking setting, and how the techniques can be implemented efficiently. The contributions of this thesis are as follows. First, we develop RankRLS, a computationally efficient kernel method for learning to rank, that is based on minimizing a regularized pairwise least-squares loss. In addition to training methods, we introduce a variety of algorithms for tasks such as model selection, multi-output learning, and cross-validation, based on computational shortcuts from matrix algebra. Second, we improve the fastest known training method for the linear version of the RankSVM algorithm, which is one of the most well established methods for learning to rank. Third, we study the combination of the empirical kernel map and reduced set approximation, which allows the large-scale training of kernel machines using linear solvers, and propose computationally efficient solutions to cross-validation when using the approach. Next, we explore the problem of reliable cross-validation when using AUC as a performance criterion, through an extensive simulation study. We demonstrate that the proposed leave-pair-out cross-validation approach leads to more reliable performance estimation than commonly used alternative approaches. Finally, we present a case study on applying machine learning to information extraction from biomedical literature, which combines several of the approaches considered in the thesis. The thesis is divided into two parts. Part I provides the background for the research work and summarizes the most central results, Part II consists of the five original research articles that are the main contribution of this thesis.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Objective: To establish a prediction model of the degree of disability in adults with Spinal CordInjury (SCI ) based on the use of the WHO-DAS II . Methods: The disability degree was correlatedwith three variable groups: clinical, sociodemographic and those related with rehabilitation services.A model of multiple linear regression was built to predict disability. 45 people with sci exhibitingdiverse etiology, neurological level and completeness participated. Patients were older than 18 andthey had more than a six-month post-injury. The WHO-DAS II and the ASIA impairment scale(AIS ) were used. Results: Variables that evidenced a significant relationship with disability were thefollowing: occupational situation, type of affiliation to the public health care system, injury evolutiontime, neurological level, partial preservation zone, ais motor and sensory scores and number ofclinical complications during the last year. Complications significantly associated to disability werejoint pain, urinary infections, intestinal problems and autonomic disreflexia. None of the variablesrelated to rehabilitation services showed significant association with disability. The disability degreeexhibited significant differences in favor of the groups that received the following services: assistivedevices supply and vocational, job or educational counseling. Conclusions: The best predictiondisability model in adults with sci with more than six months post-injury was built with variablesof injury evolution time, AIS sensory score and injury-related unemployment.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

A physically motivated statistical model is used to diagnose variability and trends in wintertime ( October - March) Global Precipitation Climatology Project (GPCP) pentad (5-day mean) precipitation. Quasi-geostrophic theory suggests that extratropical precipitation amounts should depend multiplicatively on the pressure gradient, saturation specific humidity, and the meridional temperature gradient. This physical insight has been used to guide the development of a suitable statistical model for precipitation using a mixture of generalized linear models: a logistic model for the binary occurrence of precipitation and a Gamma distribution model for the wet day precipitation amount. The statistical model allows for the investigation of the role of each factor in determining variations and long-term trends. Saturation specific humidity q(s) has a generally negative effect on global precipitation occurrence and with the tropical wet pentad precipitation amount, but has a positive relationship with the pentad precipitation amount at mid- and high latitudes. The North Atlantic Oscillation, a proxy for the meridional temperature gradient, is also found to have a statistically significant positive effect on precipitation over much of the Atlantic region. Residual time trends in wet pentad precipitation are extremely sensitive to the choice of the wet pentad threshold because of increasing trends in low-amplitude precipitation pentads; too low a choice of threshold can lead to a spurious decreasing trend in wet pentad precipitation amounts. However, for not too small thresholds, it is found that the meridional temperature gradient is an important factor for explaining part of the long-term trend in Atlantic precipitation.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Nonlinear adjustment toward long-run price equilibrium relationships in the sugar-ethanol-oil nexus in Brazil is examined. We develop generalized bivariate error correction models that allow for cointegration between sugar, ethanol, and oil prices, where dynamic adjustments are potentially nonlinear functions of the disequilibrium errors. A range of models are estimated using Bayesian Monte Carlo Markov Chain algorithms and compared using Bayesian model selection methods. The results suggest that the long-run drivers of Brazilian sugar prices are oil prices and that there are nonlinearities in the adjustment processes of sugar and ethanol prices to oil price but linear adjustment between ethanol and sugar prices.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

A physically motivated statistical model is used to diagnose variability and trends in wintertime ( October - March) Global Precipitation Climatology Project (GPCP) pentad (5-day mean) precipitation. Quasi-geostrophic theory suggests that extratropical precipitation amounts should depend multiplicatively on the pressure gradient, saturation specific humidity, and the meridional temperature gradient. This physical insight has been used to guide the development of a suitable statistical model for precipitation using a mixture of generalized linear models: a logistic model for the binary occurrence of precipitation and a Gamma distribution model for the wet day precipitation amount. The statistical model allows for the investigation of the role of each factor in determining variations and long-term trends. Saturation specific humidity q(s) has a generally negative effect on global precipitation occurrence and with the tropical wet pentad precipitation amount, but has a positive relationship with the pentad precipitation amount at mid- and high latitudes. The North Atlantic Oscillation, a proxy for the meridional temperature gradient, is also found to have a statistically significant positive effect on precipitation over much of the Atlantic region. Residual time trends in wet pentad precipitation are extremely sensitive to the choice of the wet pentad threshold because of increasing trends in low-amplitude precipitation pentads; too low a choice of threshold can lead to a spurious decreasing trend in wet pentad precipitation amounts. However, for not too small thresholds, it is found that the meridional temperature gradient is an important factor for explaining part of the long-term trend in Atlantic precipitation.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The note proposes an efficient nonlinear identification algorithm by combining a locally regularized orthogonal least squares (LROLS) model selection with a D-optimality experimental design. The proposed algorithm aims to achieve maximized model robustness and sparsity via two effective and complementary approaches. The LROLS method alone is capable of producing a very parsimonious model with excellent generalization performance. The D-optimality design criterion further enhances the model efficiency and robustness. An added advantage is that the user only needs to specify a weighting for the D-optimality cost in the combined model selecting criterion and the entire model construction procedure becomes automatic. The value of this weighting does not influence the model selection procedure critically and it can be chosen with ease from a wide range of values.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The paper introduces an efficient construction algorithm for obtaining sparse linear-in-the-weights regression models based on an approach of directly optimizing model generalization capability. This is achieved by utilizing the delete-1 cross validation concept and the associated leave-one-out test error also known as the predicted residual sums of squares (PRESS) statistic, without resorting to any other validation data set for model evaluation in the model construction process. Computational efficiency is ensured using an orthogonal forward regression, but the algorithm incrementally minimizes the PRESS statistic instead of the usual sum of the squared training errors. A local regularization method can naturally be incorporated into the model selection procedure to further enforce model sparsity. The proposed algorithm is fully automatic, and the user is not required to specify any criterion to terminate the model construction procedure. Comparisons with some of the existing state-of-art modeling methods are given, and several examples are included to demonstrate the ability of the proposed algorithm to effectively construct sparse models that generalize well.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

An automatic nonlinear predictive model-construction algorithm is introduced based on forward regression and the predicted-residual-sums-of-squares (PRESS) statistic. The proposed algorithm is based on the fundamental concept of evaluating a model's generalisation capability through crossvalidation. This is achieved by using the PRESS statistic as a cost function to optimise model structure. In particular, the proposed algorithm is developed with the aim of achieving computational efficiency, such that the computational effort, which would usually be extensive in the computation of the PRESS statistic, is reduced or minimised. The computation of PRESS is simplified by avoiding a matrix inversion through the use of the orthogonalisation procedure inherent in forward regression, and is further reduced significantly by the introduction of a forward-recursive formula. Based on the properties of the PRESS statistic, the proposed algorithm can achieve a fully automated procedure without resort to any other validation data set for iterative model evaluation. Numerical examples are used to demonstrate the efficacy of the algorithm.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

In this paper we propose an efficient two-level model identification method for a large class of linear-in-the-parameters models from the observational data. A new elastic net orthogonal forward regression (ENOFR) algorithm is employed at the lower level to carry out simultaneous model selection and elastic net parameter estimation. The two regularization parameters in the elastic net are optimized using a particle swarm optimization (PSO) algorithm at the upper level by minimizing the leave one out (LOO) mean square error (LOOMSE). Illustrative examples are included to demonstrate the effectiveness of the new approaches.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

As the calibration and evaluation of flood inundation models are a prerequisite for their successful application, there is a clear need to ensure that the performance measures that quantify how well models match the available observations are fit for purpose. This paper evaluates the binary pattern performance measures that are frequently used to compare flood inundation models with observations of flood extent. This evaluation considers whether these measures are able to calibrate and evaluate model predictions in a credible and consistent way, i.e. identifying the underlying model behaviour for a number of different purposes such as comparing models of floods of different magnitudes or on different catchments. Through theoretical examples, it is shown that the binary pattern measures are not consistent for floods of different sizes, such that for the same vertical error in water level, a model of a flood of large magnitude appears to perform better than a model of a smaller magnitude flood. Further, the commonly used Critical Success Index (usually referred to as F<2 >) is biased in favour of overprediction of the flood extent, and is also biased towards correctly predicting areas of the domain with smaller topographic gradients. Consequently, it is recommended that future studies consider carefully the implications of reporting conclusions using these performance measures. Additionally, future research should consider whether a more robust and consistent analysis could be achieved by using elevation comparison methods instead.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

An efficient two-level model identification method aiming at maximising a model׳s generalisation capability is proposed for a large class of linear-in-the-parameters models from the observational data. A new elastic net orthogonal forward regression (ENOFR) algorithm is employed at the lower level to carry out simultaneous model selection and elastic net parameter estimation. The two regularisation parameters in the elastic net are optimised using a particle swarm optimisation (PSO) algorithm at the upper level by minimising the leave one out (LOO) mean square error (LOOMSE). There are two elements of original contributions. Firstly an elastic net cost function is defined and applied based on orthogonal decomposition, which facilitates the automatic model structure selection process with no need of using a predetermined error tolerance to terminate the forward selection process. Secondly it is shown that the LOOMSE based on the resultant ENOFR models can be analytically computed without actually splitting the data set, and the associate computation cost is small due to the ENOFR procedure. Consequently a fully automated procedure is achieved without resort to any other validation data set for iterative model evaluation. Illustrative examples are included to demonstrate the effectiveness of the new approaches.