923 results for model selection in binary regression
Abstract:
Latent class regression models are useful tools for assessing associations between covariates and latent variables. However, evaluation of key model assumptions cannot be performed using methods from standard regression models due to the unobserved nature of latent outcome variables. This paper presents graphical diagnostic tools to evaluate whether or not latent class regression models adhere to standard assumptions of the model: conditional independence and non-differential measurement. An integral part of these methods is the use of a Markov Chain Monte Carlo estimation procedure. Unlike standard maximum likelihood implementations for latent class regression model estimation, the MCMC approach allows us to calculate posterior distributions and point estimates of any functions of parameters. It is this convenience that allows us to provide the diagnostic methods that we introduce. As a motivating example we present an analysis focusing on the association between depression and socioeconomic status, using data from the Epidemiologic Catchment Area study. We consider a latent class regression analysis investigating the association between depression and socioeconomic status measures, where the latent variable depression is regressed on education and income indicators, in addition to age, gender, and marital status variables. While the fitted latent class regression model yields interesting results, the model parameters are found to be invalid due to the violation of model assumptions. The violation of these assumptions is clearly identified by the presented diagnostic plots. These methods can be applied to standard latent class and latent class regression models, and the general principle can be extended to evaluate model assumptions in other types of models.
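As a rough illustration of the convenience the abstract highlights — that MCMC output yields the posterior of any function of the parameters — the sketch below takes hypothetical posterior draws and summarizes a derived contrast. The two-class structure, draw counts, and the endorsement-probability difference are illustrative assumptions, not output from the paper's analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical MCMC output: 4000 posterior draws of the probability that a
# symptom item is endorsed, for each of two latent classes (stand-ins for
# real sampler output, not the paper's results).
draws_class1 = rng.beta(2, 8, size=4000)
draws_class2 = rng.beta(6, 4, size=4000)

# Because each draw is a full parameter vector, any function of the
# parameters has a posterior obtained by applying the function draw by
# draw -- here, the between-class difference in endorsement probability.
diff = draws_class2 - draws_class1

posterior_mean = diff.mean()
ci_low, ci_high = np.quantile(diff, [0.025, 0.975])
print(f"difference: {posterior_mean:.3f} (95% CrI {ci_low:.3f}, {ci_high:.3f})")
```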
Abstract:
Background and Aim In patients with cystic fibrosis (CF) the architecture of the developing lungs and the ventilation of lung units are progressively affected, influencing intrapulmonary gas mixing and gas exchange. We examined the long-term course of blood gas measurements in relation to characteristics of lung function and the influence of different CFTR genotypes on this process. Methods Serial annual measurements of PaO2 and PaCO2, assessed in relation to lung function parameters comprising functional residual capacity (FRCpleth), lung clearance index (LCI), trapped gas (VTG), airway resistance (sReff), and forced expiratory indices (FEV1, FEF50), were collected in 178 children (88 males; 90 females) with CF over an age range of 5 to 18 years. Linear mixed model analysis and binary logistic regression analysis were used to identify the predominant lung function parameters influencing oxygenation and carbon dioxide elimination. Results PaO2 decreased linearly from age 5 to 18 years and was mainly associated with FRCpleth (p < 0.0001), FEV1 (p < 0.001), FEF50 (p < 0.002), and LCI (p < 0.002), indicating that oxygenation was associated with the degree of pulmonary hyperinflation, ventilation inhomogeneities, and impaired airway function. PaCO2 showed a transitory phase of low values, mainly during the age range of 5 to 12 years. Both PaO2 and PaCO2 followed different progression slopes within specific CFTR genotypes. Conclusion In the long-term evaluation of gas exchange, an association with distinct lung function patterns was found that was closely related to specific genotypes. Early examination of blood gases may reveal hypocarbia, presumably reflecting compensatory mechanisms to improve oxygenation.
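A minimal sketch of the kind of linear mixed model analysis described above, with a random intercept per child for the serial annual measurements. The synthetic data, variable names (pao2, frc_pleth, fev1), and coefficients are assumptions for illustration only, not the study data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Hypothetical long-format data: repeated annual measurements per child.
n_children, n_visits = 40, 6
df = pd.DataFrame({
    "child": np.repeat(np.arange(n_children), n_visits),
    "age": np.tile(np.arange(5, 5 + n_visits), n_children),
    "frc_pleth": rng.normal(120, 20, n_children * n_visits),   # % predicted
    "fev1": rng.normal(85, 15, n_children * n_visits),         # % predicted
})
df["pao2"] = (95 - 0.5 * df["age"] - 0.05 * df["frc_pleth"]
              + 0.08 * df["fev1"] + rng.normal(0, 3, len(df)))

# Random intercept per child captures within-subject correlation of the
# serial measurements; fixed effects mirror the covariates named above.
model = smf.mixedlm("pao2 ~ age + frc_pleth + fev1", df, groups=df["child"])
print(model.fit().summary())
```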
Abstract:
Background: Speciation reversal, the erosion of species differentiation via an increase in introgressive hybridization due to the weakening of previously divergent selection regimes, is thought to be an important yet poorly understood driver of biodiversity loss. Our study system, the Alpine whitefish (Coregonus spp.) species complex, is a classic example of a recent postglacial adaptive radiation, forming an array of endemic lake flocks with independent origination of similar ecotypes among flocks. However, many of the lakes of the Alpine radiation have been seriously impacted by anthropogenic nutrient enrichment, resulting in a collapse of neutral genetic and phenotypic differentiation within the most polluted lakes. Here we investigate the effects of eutrophication on the selective forces that have shaped this radiation, using population genomics. We studied eight sympatric species assemblages belonging to five independent parallel adaptive radiations, and one species pair in secondary contact. We used AFLP markers and applied FST outlier (BAYESCAN, DFDIST) and logistic regression analyses (MATSAM) to identify candidate regions for disruptive selection in the genome and their associations with adaptive traits within each lake flock. The number of outlier and adaptive-trait-associated loci identified per lake was then regressed against two variables representing the strength of eutrophication: historical phosphorus concentration and contemporary oxygen concentration. Results: While we identify disruptive selection candidate regions in all lake flocks, we find consistent trends across analysis methods towards fewer disruptive selection candidate regions and fewer adaptive trait/candidate locus associations in the more polluted lakes. Conclusions: Weakened disruptive selection and a concomitant breakdown in reproductive isolating mechanisms in more polluted lakes has led to increased gene flow between coexisting Alpine whitefish species. We hypothesize that the resulting higher rates of interspecific recombination reduce either the number or the extent of genomic islands of divergence surrounding loci evolving under disruptive natural selection. This produces the negative trend in the number of selection candidate loci recovered during genome scans of whitefish species flocks with increasing levels of anthropogenic eutrophication, as the likelihood decreases that AFLP restriction sites will fall within regions of heightened genomic divergence and therefore be classified as FST outlier loci. This study explores for the first time the potential effects of human-mediated relaxation of disruptive selection on heterogeneous genomic divergence between coexisting species.
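A minimal sketch of the final regression step described above — relating per-lake counts of selection-candidate loci to the strength of eutrophication. The lake values and variable names are hypothetical, a Poisson GLM is used here only as one reasonable choice for count data (the paper's exact regression specification may differ), and the outlier and association scans themselves (BAYESCAN, DFDIST, MATSAM) are not reproduced.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical per-lake summary: number of disruptive-selection candidate
# loci versus two proxies for the strength of eutrophication.
lakes = pd.DataFrame({
    "n_outlier_loci": [24, 19, 15, 12, 9, 7, 5, 4],
    "hist_phosphorus": [15, 25, 40, 60, 90, 120, 160, 210],   # historical peak, ug/L
    "oxygen": [9.5, 8.8, 7.9, 7.1, 6.0, 5.2, 4.3, 3.6],       # contemporary, mg/L
})

# Counts of candidate loci regressed on eutrophication strength (log link).
fit = smf.glm("n_outlier_loci ~ hist_phosphorus + oxygen", data=lakes,
              family=sm.families.Poisson()).fit()
print(fit.summary())
```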
Abstract:
The maintenance of genetic variation in a spatially heterogeneous environment has been one of the main research themes in theoretical population genetics. Despite considerable progress in understanding the consequences of spatially structured environments on genetic variation, many problems remain unsolved. One of them concerns the relationship between the number of demes, the degree of dominance, and the maximum number of alleles that can be maintained by selection in a subdivided population. In this work, we study the potential for maintaining genetic variation in a two-deme model with a deme-independent degree of intermediate dominance, which includes the absence of G x E interaction as a special case. We present a thorough numerical analysis of a two-deme three-allele model, which allows us to identify dominance and selection patterns that harbor the potential for stable triallelic equilibria. The information gained by this approach is then used to construct an example in which existence and asymptotic stability of a fully polymorphic equilibrium can be proved analytically. Notably, in this example the parameter range in which three alleles can coexist is maximized for intermediate migration rates. Our results can be interpreted in a specialist-generalist context and, among other things, show when two specialists can coexist with a generalist in two demes if the degree of dominance is deme independent and intermediate. The dominance relation between the generalist allele and the specialist alleles plays a decisive role. We also discuss linear selection on a quantitative trait and show that G x E interaction is not necessary for the maintenance of more than two alleles in two demes.
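A bare-bones numerical sketch of a two-deme, three-allele viability-selection model with symmetric migration, the kind of recursion analyzed numerically in such work. The fitness matrices, migration rate, and starting frequencies below are illustrative assumptions and do not reproduce the paper's parameter ranges or dominance scheme.

```python
import numpy as np

# Illustrative genotype fitness matrices w_ij, one per deme (selection acts
# in opposite directions in the two demes), and a symmetric migration rate.
W = [np.array([[1.00, 0.95, 0.90],
               [0.95, 0.90, 0.85],
               [0.90, 0.85, 0.80]]),
     np.array([[0.80, 0.85, 0.90],
               [0.85, 0.90, 0.95],
               [0.90, 0.95, 1.00]])]
m = 0.1

p = np.array([[0.4, 0.3, 0.3],     # initial allele frequencies, deme 1
              [0.2, 0.3, 0.5]])    # initial allele frequencies, deme 2

for _ in range(5000):
    # Viability selection within each deme under random mating:
    # p_i' = p_i * wbar_i / wbar, with wbar_i = sum_j w_ij p_j.
    after_sel = np.array([p[d] * (W[d] @ p[d]) / (p[d] @ W[d] @ p[d])
                          for d in range(2)])
    # Symmetric migration exchanges a fraction m between the two demes.
    p = (1 - m) * after_sel + m * after_sel[::-1]

print("allele frequencies per deme after iteration:\n", p.round(4))
```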
Abstract:
Parameter estimates from commonly used multivariable parametric survival regression models do not directly quantify differences in years of life expectancy. Gaussian linear regression models give results in terms of absolute mean differences, but are not appropriate for modeling life expectancy, because in many situations time to death has a negatively skewed distribution. A regression approach using a skew-normal distribution is an alternative to parametric survival models for modeling life expectancy, because parameter estimates can be interpreted in terms of survival time differences while allowing for skewness of the distribution. In this paper we show how to use skew-normal regression so that censored and left-truncated observations are accounted for. With this approach we model differences in life expectancy using data from the Swiss National Cohort Study and from official life expectancy estimates, and compare the results with those derived from commonly used survival regression models. We conclude that a censored skew-normal survival regression approach for left-truncated observations can be used to model differences in life expectancy across covariates of interest.
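A rough sketch of a censored, left-truncated skew-normal regression likelihood, assuming scipy's skewnorm parameterization: observed deaths contribute the log density, right-censored subjects the log survival function, and each subject's contribution is conditioned on surviving past the entry (truncation) age. The simulated cohort, covariate, and starting values are illustrative; the paper's actual model and data are not reproduced.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(2)

# Hypothetical data: age at death or censoring with one binary covariate,
# left-truncated at the age of study entry (a stand-in for cohort data).
n = 500
x = rng.integers(0, 2, n)                      # e.g. an exposure indicator
age_death = stats.skewnorm.rvs(a=-4, loc=82 - 3 * x, scale=10, random_state=3)
entry = rng.uniform(40, 60, n)                 # left-truncation (entry) ages
cens_time = rng.uniform(70, 95, n)             # administrative censoring ages
time = np.minimum(age_death, cens_time)
event = (age_death <= cens_time).astype(float)
keep = time > entry                            # only subjects observed after entry
x, time, event, entry = x[keep], time[keep], event[keep], entry[keep]

def negloglik(theta):
    b0, b1, log_scale, shape = theta
    loc, scale = b0 + b1 * x, np.exp(log_scale)
    ll_event = stats.skewnorm.logpdf(time, shape, loc, scale)   # observed deaths
    ll_cens = stats.skewnorm.logsf(time, shape, loc, scale)     # right-censored
    ll_trunc = stats.skewnorm.logsf(entry, shape, loc, scale)   # left truncation
    return -np.sum(event * ll_event + (1 - event) * ll_cens - ll_trunc)

fit = optimize.minimize(negloglik, x0=[80.0, 0.0, 2.3, -1.0], method="Nelder-Mead")
print("estimated difference in location (years):", round(fit.x[1], 2))
```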
Abstract:
Objectives. This paper seeks to assess the effect of regression model misspecification on statistical power in a variety of situations. Methods and results. The effect of misspecification in regression can be approximated by evaluating the correlation between the correct specification and the misspecification of the outcome variable (Harris 2010). In this paper, three misspecified models (linear, categorical and fractional polynomial) were considered. In the first section, the mathematical method of calculating the correlation between correct and misspecified models with simple mathematical forms was derived and demonstrated. In the second section, data from the National Health and Nutrition Examination Survey (NHANES 2007-2008) were used to examine such correlations. Our study shows that, compared with linear or categorical models, the fractional polynomial models, which had the higher correlations, provided a better approximation of the true relationship, as illustrated by LOESS regression. In the third section, we present the results of simulation studies demonstrating that misspecification in regression can produce marked decreases in power with small sample sizes. However, the categorical model had the greatest power, ranging from 0.877 to 0.936 depending on sample size and outcome variable used. The power of the fractional polynomial model was close to that of the linear model, ranging from 0.69 to 0.83, and appeared to be affected by the increased degrees of freedom of this model. Conclusion. Correlations between alternative model specifications can be used to provide a good approximation of the effect of misspecification on statistical power when the sample size is large. When model specifications have known simple mathematical forms, such correlations can be calculated mathematically. Actual public health data from NHANES 2007-2008 were used as examples to demonstrate situations with an unknown or complex correct model specification. Simulation of power for misspecified models confirmed the results based on correlation methods, but also illustrated the effect of model degrees of freedom on power.
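A small simulation sketch in the spirit of the third section: power under alternative specifications of a predictor, plus the correlation between a correct and a misspecified form of that predictor. The quadratic data-generating model, cut-points, fractional-polynomial powers, and simulation sizes are assumptions, so the resulting power ordering need not match the NHANES-based findings.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)

def power(spec, n=150, n_sims=500, alpha=0.05):
    """Rejection rate for a covariate whose true effect is quadratic,
    analysed under the predictor specification given by `spec`."""
    hits = 0
    for _ in range(n_sims):
        x = rng.uniform(0, 3, n)
        y = 0.4 * x**2 + rng.normal(0, 1, n)          # true (quadratic) relationship
        res = sm.OLS(y, sm.add_constant(spec(x))).fit()
        hits += res.f_pvalue < alpha                  # F-test of all non-intercept terms
    return hits / n_sims

linear = lambda x: x[:, None]
categorical = lambda x: (np.digitize(x, [1.0, 2.0])[:, None] == [1, 2]).astype(float)
fracpoly = lambda x: np.column_stack([np.sqrt(x), x**2])

for name, spec in [("linear", linear), ("categorical", categorical),
                   ("fractional polynomial", fracpoly)]:
    print(f"{name:22s} power = {power(spec):.3f}")

# Correlation between a correct and a misspecified form of the predictor,
# used in the paper's framework to approximate the loss of power.
x = rng.uniform(0, 3, 100_000)
print("corr(x, x^2) =", round(float(np.corrcoef(x, x**2)[0, 1]), 3))
```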
Abstract:
This paper studies feature subset selection in classification using a multiobjective estimation of distribution algorithm. We consider six functions, namely area under ROC curve, sensitivity, specificity, precision, F1 measure and Brier score, for evaluation of feature subsets and as the objectives of the problem. One of the characteristics of these objective functions is the existence of noise in their values that should be appropriately handled during optimization. Our proposed algorithm consists of two major techniques which are specially designed for the feature subset selection problem. The first one is a solution ranking method based on interval values to handle the noise in the objectives of this problem. The second one is a model estimation method for learning a joint probabilistic model of objectives and variables which is used to generate new solutions and advance through the search space. To simplify model estimation, l1 regularized regression is used to select a subset of problem variables before model learning. The proposed algorithm is compared with a well-known ranking method for interval-valued objectives and a standard multiobjective genetic algorithm. Particularly, the effects of the two new techniques are experimentally investigated. The experimental results show that the proposed algorithm is able to obtain comparable or better performance on the tested datasets.
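The l1-regularized pre-selection step mentioned above might look roughly like the following. The synthetic dataset, the choice of an l1-penalized logistic regression, and the penalty strength are assumptions for illustration, not the paper's implementation of the estimation-of-distribution algorithm.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 50 candidate variables, few of them informative.
X, y = make_classification(n_samples=300, n_features=50, n_informative=8,
                           random_state=0)

# l1 penalty drives most coefficients to exactly zero, so the surviving
# variables form the subset passed on to the joint probabilistic model.
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
selected = np.flatnonzero(l1.coef_[0])
print(f"{selected.size} of {X.shape[1]} variables kept:", selected)
```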
Abstract:
In the past few years IT outsourcing has gained considerable importance in the market; the IT services outsourcing market, for example, continues to grow every year. Now more than ever, organizations are increasingly acquiring needed capabilities by obtaining products and services from suppliers, developing fewer and fewer of these capabilities in-house. IT supplier selection is a complex and opaque decision problem. Managers facing a decision about IT supplier selection have difficulty framing what needs to be thought about in their deliberations. According to a study from the SEI (Software Engineering Institute) [40], 20 to 25 percent of large information technology (IT) acquisition projects fail within two years and 50 percent fail within five years. Mismanagement, poor requirements definition, lack of the comprehensive evaluations that could identify the best candidates for outsourcing, inadequate supplier selection and contracting processes, insufficient technology selection procedures, and uncontrolled requirements changes are factors that contribute to project failure.
The majority of project failures could be avoided if the acquirer learned to understand the decision problems, perform better decision analysis, and exercise good judgment. The main objective of this work is the development of a decision model for IT supplier selection that aims to reduce the number of failures seen in client-supplier relationships, most of which are caused by poor selection of the supplier on the client's side. Beyond the problems described above, the motivation for this work is the absence of any decision model based on a multi-model approach (a mixture of acquisition models and decision methods) for the problem of IT supplier selection. In the case study, nine Spanish companies were analyzed using the IT supplier selection decision model developed in this work. Two software products were used in this case study: Expert Choice and D-Sight.
Abstract:
Despite the critical role that terrestrial vegetation plays in the Earth's carbon cycle, very little is known about the potential evolutionary responses of plants to anthropogenically induced increases in concentrations of atmospheric CO2. We present experimental evidence that rising CO2 concentration may have a direct impact on the genetic composition and diversity of plant populations but is unlikely to result in selection favoring genotypes that exhibit increased productivity in a CO2-enriched atmosphere. Experimental populations of an annual plant (Abutilon theophrasti, velvetleaf) and a temperate forest tree (Betula alleghaniensis, yellow birch) displayed responses to increased CO2 that were both strongly density-dependent and genotype-specific. In competitive stands, a higher concentration of CO2 resulted in pronounced shifts in genetic composition, even though overall CO2-induced productivity enhancements were small. For the annual species, quantitative estimates of response to selection under competition were 3 times higher at the elevated CO2 level. However, genotypes that displayed the highest growth responses to CO2 when grown in the absence of competition did not have the highest fitness in competitive stands. We suggest that increased CO2 intensified interplant competition and that selection favored genotypes with a greater ability to compete for resources other than CO2. Thus, while increased CO2 may enhance rates of selection in populations of competing plants, it is unlikely to result in the evolution of increased CO2 responsiveness or to operate as an important feedback in the global carbon cycle. However, the increased intensity of selection and drift driven by rising CO2 levels may have an impact on the genetic diversity in plant populations.
Abstract:
Phase equilibrium data regression is an essential step in obtaining appropriate parameter values for any model used in separation equipment design for chemical process simulation and optimization. The accuracy of this process depends on several factors, such as the quality of the experimental data, the selected model, and the calculation algorithm. The present paper summarizes the results and conclusions achieved in our research on the capabilities and limitations of existing GE models and on strategies that can be included in correlation algorithms to improve convergence and avoid inconsistencies. The NRTL model has been selected as a representative local composition model. New capabilities of this model, but also several relevant limitations, have been identified, and examples of the application of a modified NRTL equation are discussed. Furthermore, a regression algorithm has been developed that allows the recommended simultaneous regression of all the condensed-phase equilibrium regions present in ternary systems at constant T and P. It includes specific strategies designed to avoid some of the pitfalls frequently found in commercial regression tools for phase equilibrium calculations. Most of the proposed strategies are based on the geometrical interpretation of the lowest common tangent plane equilibrium criterion, which allows an unambiguous understanding of the behavior of the mixtures. The paper aims to present this work as a whole in order to reveal the effort that must still be devoted to overcoming the difficulties that remain in the phase equilibrium data regression problem.
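For orientation, the standard (unmodified) binary NRTL activity-coefficient equations are sketched below; the modified NRTL equation and the simultaneous multi-region regression algorithm developed in the paper are not reproduced, and the interaction parameters are illustrative values, not fitted data.

```python
import numpy as np

def nrtl_binary(x1, tau12, tau21, alpha=0.3):
    """Activity coefficients (gamma1, gamma2) from the standard binary NRTL
    equations; parameter values are illustrative, not regressed data."""
    x2 = 1.0 - x1
    G12, G21 = np.exp(-alpha * tau12), np.exp(-alpha * tau21)
    lng1 = x2**2 * (tau21 * (G21 / (x1 + x2 * G21))**2
                    + tau12 * G12 / (x2 + x1 * G12)**2)
    lng2 = x1**2 * (tau12 * (G12 / (x2 + x1 * G12))**2
                    + tau21 * G21 / (x1 + x2 * G21)**2)
    return np.exp(lng1), np.exp(lng2)

x1 = np.linspace(0.01, 0.99, 5)
g1, g2 = nrtl_binary(x1, tau12=1.2, tau21=0.8)
print(np.column_stack([x1, g1, g2]).round(3))
```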
Abstract:
PURPOSE To identify the prevalence and progression of macular atrophy (MA) in neovascular age-related macular degeneration (AMD) patients under long-term anti-vascular endothelial growth factor (VEGF) therapy and to determine risk factors. METHOD This retrospective study included patients with neovascular AMD and ≥30 anti-VEGF injections. MA was measured using near-infrared imaging and spectral-domain optical coherence tomography (SD-OCT). Yearly growth rate was estimated using a square-root transformation to adjust for baseline area and allow for linearization of the growth rate. Multiple regression with the Akaike information criterion (AIC) as model selection criterion was used to estimate the influence of various parameters on MA area. RESULTS Forty-nine eyes (47 patients, mean age 77 ± 14 years) were included, with a mean of 48 ± 13 intravitreal anti-VEGF injections (ranibizumab: 37 ± 11, aflibercept: 11 ± 6; mean number of injections/year 8 ± 2.1) over a mean treatment period of 6.2 ± 1.3 years (range 4-8.5). Mean best-corrected visual acuity improved from 57 ± 17 letters at baseline (= treatment start) to 60 ± 16 letters at last follow-up. The MA prevalence within and outside the choroidal neovascularization (CNV) border at initial measurement was 45% and increased to 74%. Mean MA area increased from 1.8 ± 2.7 mm² within and 0.5 ± 0.98 mm² outside the CNV boundary to 2.7 ± 3.4 mm² and 1.7 ± 1.8 mm², respectively. Multivariate regression identified posterior vitreous detachment (PVD) and presence/development of intraretinal cysts (IRCs) as significant factors for total MA size (R² = 0.16, p = 0.02). MA area outside the CNV border was best explained by the presence of reticular pseudodrusen (RPD) and IRC (R² = 0.24, p = 0.02). CONCLUSION A majority of patients show MA after long-term anti-VEGF treatment. RPD, IRC and PVD, but not the number of injections or treatment duration, appear to be associated with MA size.
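A schematic of the two analysis steps named in the abstract, square-root-transformed growth rates and AIC-based comparison of candidate regressions, run on synthetic data. The variable names, candidate formulas, and generated values are assumptions, not the study data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)

# Hypothetical per-eye data; variable names mirror factors named in the abstract.
n = 49
df = pd.DataFrame({
    "ma_area_base": rng.gamma(2.0, 1.0, n),          # MA area at baseline, mm^2
    "years": rng.uniform(4.0, 8.5, n),               # follow-up duration
    "pvd": rng.integers(0, 2, n),
    "irc": rng.integers(0, 2, n),
    "rpd": rng.integers(0, 2, n),
    "n_injections": rng.poisson(48, n),
})
growth = np.clip(0.15 + 0.10 * df["pvd"] + 0.12 * df["irc"]
                 + rng.normal(0, 0.1, n), 0, None)
df["ma_area_last"] = (np.sqrt(df["ma_area_base"]) + growth * df["years"]) ** 2

# Square-root transformation linearises area growth and removes the
# dependence of the raw growth rate on baseline lesion size.
df["sqrt_growth"] = (np.sqrt(df["ma_area_last"])
                     - np.sqrt(df["ma_area_base"])) / df["years"]

# Compare candidate regression models by AIC (lower is better).
for formula in ("sqrt_growth ~ pvd + irc",
                "sqrt_growth ~ rpd + irc",
                "sqrt_growth ~ n_injections + years"):
    print(f"{formula:38s} AIC = {smf.ols(formula, data=df).fit().aic:6.1f}")
```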
Abstract:
A dual resistance model with distribution of either barrier or pore diffusional activation energy is proposed in this work for gas transport in carbon molecular sieve (CMS) micropores. This is a novel approach in which the equilibrium is homogeneous, but the kinetics is heterogeneous. The model seems to provide a possible explanation for the concentration dependence of the thermodynamically corrected barrier and pore diffusion coefficients observed in previous studies from this laboratory on gas diffusion in CMS.(1,2) The energy distribution is assumed to follow the gamma distribution function. It is shown that the energy distribution model can fully capture the behavior described by the empirical model established in earlier studies to account for the concentration dependence of thermodynamically corrected barrier and pore diffusion coefficients. A methodology is proposed for extracting energy distribution parameters, and it is further shown that the extracted energy distribution parameters can effectively predict integral uptake and column breakthrough profiles over a wide range of operating pressures.
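The general idea of averaging an Arrhenius-type barrier term over a gamma distribution of activation energies can be sketched as below; the rate expression, distribution parameters, and temperature are illustrative assumptions and do not represent the paper's dual resistance model or its parameter-extraction methodology.

```python
import numpy as np
from scipy import stats, integrate

R = 8.314          # J / (mol K)
T = 298.15         # K
k0 = 1.0e-3        # pre-exponential factor, illustrative units (1/s)

# Illustrative gamma distribution of barrier activation energies (J/mol).
shape, scale = 4.0, 5.0e3          # mean = shape * scale = 20 kJ/mol
energy_dist = stats.gamma(a=shape, scale=scale)

# Effective coefficient: the Arrhenius term averaged over the distribution.
k_eff, _ = integrate.quad(
    lambda E: k0 * np.exp(-E / (R * T)) * energy_dist.pdf(E), 0, np.inf)

# The same average in closed form, via the gamma moment-generating function.
k_closed = k0 * (1 + scale / (R * T)) ** (-shape)
print(k_eff, k_closed)
```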
Abstract:
A structurally based quasi-chemical viscosity model for fully liquid slags in the Al2O3-CaO-'FeO'-MgO-SiO2 system has been developed. The model links the slag viscosities to the internal structures of the melts through the concentrations of the various Si0.5O, Me(n+)2/nO and Me(n+)1/nSi0.25O viscous flow structural units. The concentrations of these structural units are derived from a quasi-chemical thermodynamic model of the system. The model described in this series of papers enables the viscosities of liquid slags to be predicted within experimental uncertainties over the whole range of temperatures and compositions in the Al2O3-CaO-MgO-SiO2 system.
Abstract:
A theoretical model is presented which describes selection in a genetic algorithm (GA) under a stochastic fitness measure and correctly accounts for finite population effects. Although this model describes a number of selection schemes, we only consider Boltzmann selection in detail here as results for this form of selection are particularly transparent when fitness is corrupted by additive Gaussian noise. Finite population effects are shown to be of fundamental importance in this case, as the noise has no effect in the infinite population limit. In the limit of weak selection we show how the effects of any Gaussian noise can be removed by increasing the population size appropriately. The theory is tested on two closely related problems: the one-max problem corrupted by Gaussian noise and generalization in a perceptron with binary weights. The averaged dynamics can be accurately modelled for both problems using a formalism which describes the dynamics of the GA using methods from statistical mechanics. The second problem is a simple example of a learning problem and by considering this problem we show how the accurate characterization of noise in the fitness evaluation may be relevant in machine learning. The training error (negative fitness) is the number of misclassified training examples in a batch and can be considered as a noisy version of the generalization error if an independent batch is used for each evaluation. The noise is due to the finite batch size and in the limit of large problem size and weak selection we show how the effect of this noise can be removed by increasing the population size. This allows the optimal batch size to be determined, which minimizes computation time as well as the total number of training examples required.
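A toy version of the noisy one-max setting with Boltzmann selection in a finite population, as a concrete illustration of the setup described above; the population size, selection strength, noise level, and mutation-only reproduction are illustrative assumptions and do not reproduce the paper's statistical-mechanics formalism.

```python
import numpy as np

rng = np.random.default_rng(6)

pop_size, chrom_len, beta, sigma = 200, 64, 0.5, 2.0   # illustrative settings

# One-max population: true fitness is the number of ones in the bit string.
pop = rng.integers(0, 2, size=(pop_size, chrom_len))

for gen in range(50):
    true_fitness = pop.sum(axis=1)
    # Fitness evaluation corrupted by additive Gaussian noise.
    noisy_fitness = true_fitness + rng.normal(0, sigma, pop_size)

    # Boltzmann selection: sampling probability proportional to exp(beta * f).
    w = np.exp(beta * (noisy_fitness - noisy_fitness.max()))   # numerically stabilised
    parents = pop[rng.choice(pop_size, size=pop_size, p=w / w.sum())]

    # Bit-flip mutation; crossover omitted to keep the sketch short.
    flips = rng.random(parents.shape) < (1.0 / chrom_len)
    pop = np.where(flips, 1 - parents, parents)

print("mean true fitness after 50 generations:", pop.sum(axis=1).mean())
```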
Abstract:
We discuss aggregation of data from neuropsychological patients and the process of evaluating models using data from a series of patients. We argue that aggregation can be misleading but not aggregating can also result in information loss. The basis for combining data needs to be theoretically defined, and the particular method of aggregation depends on the theoretical question and characteristics of the data. We present examples, often drawn from our own research, to illustrate these points. We also argue that statistical models and formal methods of model selection are a useful way to test theoretical accounts using data from several patients in multiple-case studies or case series. Statistical models can often measure fit in a way that explicitly captures what a theory allows; the parameter values that result from model fitting often measure theoretically important dimensions and can lead to more constrained theories or new predictions; and model selection allows the strength of evidence for models to be quantified without forcing this into the artificial binary choice that characterizes hypothesis testing methods. Methods that aggregate and then formally model patient data, however, are not automatically preferred to other methods. Which method is preferred depends on the question to be addressed, characteristics of the data, and practical issues like availability of suitable patients, but case series, multiple-case studies, single-case studies, statistical models, and process models should be complementary methods when guided by theory development.