31 results for Sample selection model
in Biblioteca Digital da Produção Intelectual da Universidade de São Paulo
Abstract:
Statistical methods have been widely employed to assess the capabilities of credit scoring classification models in order to reduce the risk of wrong decisions when granting credit facilities to clients. The predictive quality of a classification model can be evaluated based on measures such as sensitivity, specificity, predictive values, accuracy, correlation coefficients and information theoretical measures, such as relative entropy and mutual information. In this paper we analyze the performance of a naive logistic regression model (Hosmer & Lemeshow, 1989) and a logistic regression with state-dependent sample selection model (Cramer, 2004) applied to simulated data. Also, as a case study, the methodology is illustrated on a data set extracted from a Brazilian bank portfolio. Our simulation results revealed that there is no statistically significant difference in terms of predictive capacity between the naive logistic regression models and the logistic regression with state-dependent sample selection models. However, there is a strong difference between the distributions of the estimated default probabilities from these two statistical modeling techniques, with the naive logistic regression models always underestimating such probabilities, particularly in the presence of balanced samples. (C) 2012 Elsevier Ltd. All rights reserved.
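The state-dependent sampling issue can be made concrete with a standard prior-correction of the logistic intercept. This is a minimal sketch assuming numpy arrays, statsmodels, and a known population default rate pi_pop (the function and variable names are hypothetical); it illustrates the generic choice-based-sampling adjustment rather than the exact Cramer (2004) estimator.

```python
import numpy as np
import statsmodels.api as sm

def prior_corrected_probabilities(X, y, X_new, pi_pop):
    """Fit a logistic model on a state-dependent (e.g. balanced) sample and
    shift its intercept back towards the population default rate pi_pop.
    Standard prior-correction for choice-based sampling, shown only as an
    illustration, not as the exact Cramer (2004) procedure."""
    fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)

    pi_sample = y.mean()
    # difference between sample and population log-odds of default
    offset = np.log(pi_sample / (1 - pi_sample)) - np.log(pi_pop / (1 - pi_pop))

    linpred = sm.add_constant(X_new) @ fit.params - offset
    return 1.0 / (1.0 + np.exp(-linpred))
```

With a balanced development sample and a low population default rate, the correction mainly shifts the intercept, so it changes the level of the estimated probabilities rather than the ranking of clients.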
Abstract:
Compartmentalization of self-replicating molecules (templates) in protocells is a necessary step towards the evolution of modern cells. However, coexistence between distinct template types inside a protocell can be achieved only if there is a selective pressure favoring protocells with a mixed template composition. Here we study analytically a group selection model for the coexistence between two template types using the diffusion approximation of population genetics. The model combines competition at the template and protocell levels as well as genetic drift inside protocells. At the steady state, we find a continuous phase transition separating the coexistence and segregation regimes, with the order parameter vanishing linearly with the distance to the critical point. In addition, we derive explicit analytical expressions for the critical steady-state probability density of protocell compositions.
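Although the paper's treatment is analytical (diffusion approximation), the interplay it describes between template-level competition, protocell-level selection for mixed composition, and drift can be illustrated with a toy Wright-Fisher-style simulation. This is a minimal sketch with hypothetical parameters (template advantage s, group-level benefit sigma for mixed protocells), not the model analysed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n_cells=1000, n_templates=20, s=0.05, sigma=0.3, generations=500):
    """Toy group-selection dynamics for two template types, A and B.

    Inside each protocell, type A replicates with advantage s (template-level
    selection plus drift via binomial resampling). Protocells then reproduce in
    proportion to a group fitness that rewards mixed composition (sigma), the
    ingredient favouring coexistence."""
    # k[i] = number of A templates in protocell i
    k = rng.integers(0, n_templates + 1, size=n_cells)
    for _ in range(generations):
        # within-protocell replication: selection on A plus genetic drift
        p = k / n_templates
        p_sel = (1 + s) * p / ((1 + s) * p + (1 - p))
        k = rng.binomial(n_templates, p_sel)
        # protocell-level selection: mixed groups (p near 1/2) reproduce more
        x = k / n_templates
        group_fitness = 1 + sigma * 4 * x * (1 - x)
        idx = rng.choice(n_cells, size=n_cells, p=group_fitness / group_fitness.sum())
        k = k[idx]
    return k / n_templates

freqs = simulate()
print("mean frequency of A:", freqs.mean(),
      "| fraction of mixed protocells:", np.mean((freqs > 0) & (freqs < 1)))
```

Varying sigma relative to s in this toy setting moves the population between segregated and mixed outcomes, the qualitative counterpart of the coexistence/segregation transition studied analytically in the paper.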
Abstract:
Neurofeedback (NF) is a training to enhance self-regulatory capacity over brain activity patterns and consequently over brain mental states. Recent findings suggest that NF is a promising alternative for the treatment of attention-deficit/hyperactivity disorder (ADHD). We comprehensively reviewed the literature searching for studies on the effectiveness and specificity of NF for the treatment of ADHD. In addition, clinically informative evidence-based data are discussed. We found 3 systematic reviews on the use of NF for ADHD and 6 randomized controlled trials that had not been included in these reviews. Most nonrandomized controlled trials found positive results with medium-to-large effect sizes, but the evidence for effectiveness is less robust when only randomized controlled studies are considered. The direct comparison of NF and sham-NF in 3 published studies found no group differences; nevertheless, methodological caveats, such as the quality of the training protocol used, sample size, and sample selection, may have contributed to the negative results. Further data on specificity come from electrophysiological studies reporting that NF effectively changes brain activity patterns. No safety issues have emerged from clinical trials, and NF seems to be well tolerated and accepted. Follow-up studies support long-term effects of NF. Currently there are no available data to guide clinicians on the predictors of response to NF or on the optimal treatment protocol. In conclusion, NF is a valid option for the treatment of ADHD, but further evidence is required to guide its use.
Abstract:
In this study, we analyzed the relationship between household expenditure on computer purchases and the demographic and socioeconomic characteristics of Brazilian households. Microdata from two Household Budget Surveys (Pesquisas de Orçamentos Familiares, POF) produced by the Brazilian Institute of Geography and Statistics (IBGE) were used: 2002-2003 and 2008-2009. These databases made it possible to use total per capita expenditure as the variable defining the household's purchasing power. An econometric approach suited to the nature of this type of analysis was adopted, namely the Heckman selection model, which involves two stages. In the first stage, the factors associated with the probability of the expenditure occurring were analyzed; in the second, the factors associated with the amount spent were assessed. The main results indicated that the profile of the household head (gender and age), the composition of the household, and the schooling of its residents are relevant factors both for the decision to spend and for the decision on how much to spend. The reduction in the elasticity relating computer expenditure to household purchasing power (0.56763 in 2002-2003, falling to 0.41546 in 2008-2009) can be explained by the drop in computer prices and by the increase in families' purchasing power.
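The two-stage Heckman procedure described above can be sketched with statsmodels: a probit selection equation, the inverse Mills ratio computed from it, and an outcome regression that adds this ratio as a regressor. Variable names (Z, d, X, y) are hypothetical placeholders for the selection covariates, the indicator of positive computer expenditure, the expenditure covariates, and (log) expenditure; numpy arrays are assumed.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

def heckman_two_step(Z, d, X, y):
    """Two-step Heckman selection estimator (illustrative sketch).

    Z : covariates of the selection equation (does the household spend?)
    d : 1 if a positive computer expenditure is observed, else 0
    X : covariates of the expenditure equation
    y : observed (log) expenditure, meaningful only where d == 1
    """
    # Step 1: probit for the probability that any expenditure occurs
    probit = sm.Probit(d, sm.add_constant(Z)).fit(disp=0)
    xb = sm.add_constant(Z) @ probit.params
    imr = norm.pdf(xb) / norm.cdf(xb)          # inverse Mills ratio

    # Step 2: OLS on the selected sample, adding the inverse Mills ratio
    sel = d == 1
    X2 = sm.add_constant(np.column_stack([X[sel], imr[sel]]))
    ols = sm.OLS(y[sel], X2).fit()
    return probit, ols
```

The coefficient on the inverse Mills ratio (the last column of X2) estimates the selection term; a value far from zero signals that ignoring the first stage would bias the expenditure equation.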
Abstract:
The starting point of this article is the question "How to retrieve fingerprints of rhythm in written texts?" We address this problem in the case of Brazilian and European Portuguese. These two dialects of Modern Portuguese share the same lexicon and most of the sentences they produce are superficially identical. Yet they are conjectured, on linguistic grounds, to implement different rhythms. We show that this linguistic question can be formulated as a problem of model selection in the class of variable length Markov chains. To carry out this approach, we compare texts from European and Brazilian Portuguese. These texts are previously encoded according to some basic rhythmic features of the sentences which can be automatically retrieved. This is an entirely new approach from the linguistic point of view. Our statistical contribution is the introduction of the smallest maximizer criterion, which is a constant-free procedure for model selection. As a by-product, this provides a solution for the problem of optimal choice of the penalty constant when using the BIC to select a variable length Markov chain. Besides proving the consistency of the smallest maximizer criterion when the sample size diverges, we also perform a simulation study comparing our approach with both the standard BIC selection and the Peres-Shields order estimation. Applied to the linguistic sample constituted for our case study, the smallest maximizer criterion assigns different context-tree models to the two dialects of Portuguese. The features of the selected models are compatible with current conjectures discussed in the linguistic literature.
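The penalty-choice issue mentioned above can be made concrete with a much simpler relative: BIC selection of the order of a fixed-order Markov chain over a finite alphabet. This minimal sketch does not implement variable length Markov chains or the smallest maximizer criterion; the encoded rhythmic sequence is assumed to be a Python string over a small alphabet.

```python
import math
from collections import Counter

def bic_markov_order(sequence, alphabet, max_order=4):
    """Return the Markov chain order (0..max_order) minimizing BIC.

    Illustrative fixed-order version only; the paper selects variable length
    Markov chains (context trees) via the smallest maximizer criterion."""
    n = len(sequence)
    best = None
    for k in range(max_order + 1):
        counts = Counter(sequence[i:i + k + 1] for i in range(n - k))
        ctx_counts = Counter(sequence[i:i + k] for i in range(n - k))
        # maximized log-likelihood of the order-k chain
        loglik = sum(c * math.log(c / ctx_counts[w[:-1]]) for w, c in counts.items())
        n_params = (len(alphabet) ** k) * (len(alphabet) - 1)
        bic = -2 * loglik + n_params * math.log(n)
        if best is None or bic < best[1]:
            best = (k, bic)
    return best[0]

# Example: order selection on a toy binary rhythmic encoding
print(bic_markov_order("0100101001010010100101010" * 40, alphabet="01"))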
Abstract:
A data set from a commercial Nellore beef cattle selection program was used to compare breeding models that either included or excluded marker effects when estimating breeding values, for the situation in which a reduced number of animals have phenotypic, genotypic and pedigree information available. The complete data set of this herd was composed of 83,404 animals measured for weaning weight (WW), post-weaning gain (PWG), scrotal circumference (SC) and muscle score (MS), corresponding to 116,652 animals in the relationship matrix. Single-trait analyses were performed with the MTDFREML software to estimate fixed and random effects solutions using this complete data set. The estimated additive effects were taken as the reference breeding values for those animals. The individual observed phenotype of each trait was adjusted for fixed and random effects solutions, except for direct additive effects. The adjusted phenotype, composed of the additive and residual parts of the observed phenotype, was used as the dependent variable for model comparison. Among all measured animals of this herd, only 3160 animals were genotyped for 106 SNP markers. Three models were compared in terms of changes in animals' rank, global fit and predictive ability. Model 1 included only polygenic effects, model 2 included only marker effects and model 3 included both polygenic and marker effects. Bayesian inference via Markov chain Monte Carlo methods, performed with the TM software, was used to analyze the data for model comparison. Two different priors were adopted for marker effects in models 2 and 3: the first prior was a uniform distribution (U) and the second assumed that marker effects were normally distributed (N). Higher rank correlation coefficients were observed for models 3_U and 3_N, indicating a greater similarity between these models' animal rankings and the ranking based on the reference breeding values. Model 3_N presented a better global fit, as demonstrated by its low DIC. The best models in terms of predictive ability were models 1 and 3_N. Differences due to the prior assumed for marker effects in models 2 and 3 could be attributed to the better ability of the normal prior to handle collinear effects. Models 2_U and 2_N presented the worst performance, indicating that this small set of markers should not be used to genetically evaluate animals with no data, since its predictive ability is restricted. In conclusion, model 3_N presented a slight superiority when a reduced number of animals have phenotypic, genotypic and pedigree information. This could be attributed to the variation retained by the markers and polygenic effects assumed together and to the normal prior assumed for marker effects, which deals better with the collinearity between markers. (C) 2012 Elsevier B.V. All rights reserved.
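The advantage the abstract attributes to the normal prior on marker effects, namely better behaviour with collinear markers, can be illustrated with the ridge-type (SNP-BLUP) estimator that a normal prior implies. This is a minimal sketch on simulated genotypes with hypothetical dimensions and variance components; it is not the TM software's MCMC analysis, and the polygenic (pedigree) component of models 1 and 3 is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated genotypes (0/1/2) for 300 animals and 106 markers (toy data)
n_animals, n_markers = 300, 106
M = rng.integers(0, 3, size=(n_animals, n_markers)).astype(float)
true_effects = rng.normal(0, 0.3, size=n_markers)
y = M @ true_effects + rng.normal(0, 1.0, size=n_animals)   # adjusted phenotype

def normal_prior_marker_effects(M, y, var_e=1.0, var_marker=0.09):
    """Posterior mean of marker effects under iid normal priors
    (equivalent to ridge regression / SNP-BLUP with ratio var_e/var_marker).
    A flat prior, the 'uniform' alternative in the abstract, corresponds to
    ordinary least squares, which is unstable when markers are collinear."""
    lam = var_e / var_marker
    A = M.T @ M + lam * np.eye(M.shape[1])
    return np.linalg.solve(A, M.T @ y)

ghat = normal_prior_marker_effects(M, y)
print("correlation with simulated effects:", np.corrcoef(ghat, true_effects)[0, 1])
```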
Abstract:
Item response theory (IRT) comprises a set of statistical models which are useful in many fields, especially when there is an interest in studying latent variables (or latent traits). Usually such latent traits are assumed to be random variables and a convenient distribution is assigned to them. A very common choice for such a distribution has been the standard normal. Recently, Azevedo et al. [Bayesian inference for a skew-normal IRT model under the centred parameterization, Comput. Stat. Data Anal. 55 (2011), pp. 353-365] proposed a skew-normal distribution under the centred parameterization (SNCP), as had been studied in [R. B. Arellano-Valle and A. Azzalini, The centred parametrization for the multivariate skew-normal distribution, J. Multivariate Anal. 99(7) (2008), pp. 1362-1382], to model the latent trait distribution. This approach allows one to represent any asymmetric behaviour concerning the latent trait distribution. Also, they developed a Metropolis-Hastings within Gibbs sampling (MHWGS) algorithm based on the density of the SNCP. They showed that the algorithm recovers all parameters properly. Their results indicated that, in the presence of asymmetry, the proposed model and the estimation algorithm perform better than the usual model and estimation methods. Our main goal in this paper is to propose another type of MHWGS algorithm based on a stochastic representation (hierarchical structure) of the SNCP studied in [N. Henze, A probabilistic representation of the skew-normal distribution, Scand. J. Statist. 13 (1986), pp. 271-275]. Our algorithm has only one Metropolis-Hastings step, in contrast to the algorithm developed by Azevedo et al., which has two such steps. This not only makes the implementation easier but also reduces the number of proposal densities to be used, which can be a problem in the implementation of MHWGS algorithms, as can be seen in [R.J. Patz and B.W. Junker, A straightforward approach to Markov Chain Monte Carlo methods for item response models, J. Educ. Behav. Stat. 24(2) (1999), pp. 146-178; R. J. Patz and B. W. Junker, The applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses, J. Educ. Behav. Stat. 24(4) (1999), pp. 342-366; A. Gelman, G.O. Roberts, and W.R. Gilks, Efficient Metropolis jumping rules, Bayesian Stat. 5 (1996), pp. 599-607]. Moreover, we consider a modified beta prior (which generalizes the one considered in [3]) and a Jeffreys prior for the asymmetry parameter. Furthermore, we study the sensitivity of such priors as well as the use of different kernel densities for this parameter. Finally, we assess the impact of the number of examinees, the number of items and the asymmetry level on parameter recovery. Results of the simulation study indicated that our approach performs as well as that in [3] in terms of parameter recovery, mainly when using the Jeffreys prior. They also indicated that the asymmetry level has the highest impact on parameter recovery, even though it is relatively small. A real data analysis is considered jointly with the development of model fitting assessment tools. The results are compared with the ones obtained by Azevedo et al. The results indicate that using the hierarchical approach allows us to implement MCMC algorithms more easily, facilitates diagnosis of convergence, and can be very useful to fit more complex skew IRT models.
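The stochastic representation from Henze (1986) that the proposed algorithm builds on can be written, for the direct parameterization with shape parameter λ, as follows (moving to the centred parameterization of the SNCP requires the usual reparameterization of location, scale and skewness, not reproduced here):

```latex
% Henze (1986) representation: a half-normal plus an independent normal term
Z = \delta\,|U_0| + \sqrt{1-\delta^{2}}\,U_1,
\qquad U_0, U_1 \overset{iid}{\sim} N(0,1),
\qquad \delta = \frac{\lambda}{\sqrt{1+\lambda^{2}}},
\quad\Longrightarrow\quad Z \sim \mathrm{SN}(\lambda).
```

Conditionally on the half-normal variable |U_0|, Z is normally distributed; this is the hierarchical structure referred to in the abstract.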
Abstract:
The purpose of this study was to examine the reliability, validity and classification accuracy of the South Oaks Gambling Screen (SOGS) in a sample of the Brazilian population. Participants in this study were drawn from three sources: 71 men and women from the general population interviewed at a metropolitan train station; 116 men and women encountered at a bingo venue; and 54 men and women undergoing treatment for gambling. The SOGS and a DSM-IV-based instrument were applied by trained researchers. The internal consistency of the SOGS was 0.75 according to the Cronbach's alpha model, and construct validity was good. A significant difference among groups was demonstrated by ANOVA (F(2, 238) = 221.3, P < 0.001). The SOGS items and DSM-IV symptoms were highly correlated (r = 0.854, P < 0.01). The SOGS also presented satisfactory psychometric properties: sensitivity (100), specificity (74.7), positive predictive rate (60.7), negative predictive rate (100) and misclassification rate (0.18). However, a cut-off score of eight improved classification accuracy and reduced the rate of false positives: sensitivity (95.4), specificity (89.8), positive predictive rate (78.5), negative predictive rate (98) and misclassification rate (0.09). Thus, the SOGS was found to be reliable and valid in the Brazilian population.
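The screening measures quoted above all follow from a 2x2 table of SOGS classifications against the DSM-IV-based diagnosis at a given cut-off. Below is a minimal sketch of how such measures are computed; the counts in the example call are made up for illustration and are not the study data.

```python
def screening_measures(tp, fp, fn, tn):
    """Sensitivity, specificity, predictive values and misclassification rate
    from a 2x2 table of screen-positive/negative vs. true diagnosis."""
    n = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "positive_predictive_value": tp / (tp + fp),
        "negative_predictive_value": tn / (tn + fn),
        "misclassification_rate": (fp + fn) / n,
    }

# Hypothetical counts at some cut-off (not the SOGS validation data)
print(screening_measures(tp=52, fp=14, fn=3, tn=172))
```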
Abstract:
Sugarcane-breeding programs take at least 12 years to develop new commercial cultivars. Molecular markers offer a possibility to study the genetic architecture of quantitative traits in sugarcane, and they may be used in marker-assisted selection to speed up artificial selection. Although the performance of sugarcane progenies in breeding programs is commonly evaluated across a range of locations and harvest years, many of the QTL detection methods ignore two- and three-way interactions between QTL, harvest, and location. In this work, a strategy for QTL detection in multi-harvest-location trial data, based on interval mapping and mixed models, is proposed and applied to map QTL effects on a segregating progeny from a biparental cross of pre-commercial Brazilian cultivars, evaluated at two locations and three consecutive harvest years for cane yield (tonnes per hectare), sugar yield (tonnes per hectare), fiber percent, and sucrose content. In the mixed model, we have included appropriate (co)variance structures for modeling heterogeneity and correlation of genetic effects and non-genetic residual effects. Forty-six QTLs were found: 13 QTLs for cane yield, 14 for sugar yield, 11 for fiber percent, and 8 for sucrose content. In addition, QTL by harvest, QTL by location, and QTL by harvest by location interaction effects were significant for all evaluated traits (30 QTLs showed some interaction, and 16 none). Our results contribute to a better understanding of the genetic architecture of complex traits related to biomass production and sucrose content in sugarcane.
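The QTL-by-environment tests described above can be illustrated, in a deliberately simplified fixed-effects form, by scanning markers with and without marker-by-location and marker-by-harvest interaction terms and comparing nested fits. Column names (yield_t_ha, location, harvest) and the marker coding are hypothetical, and the paper's mixed-model interval mapping with structured (co)variances is not reproduced.

```python
import pandas as pd
import statsmodels.formula.api as smf

def qtl_interaction_scan(df, marker_cols):
    """For each marker, compare (via an F-test on nested OLS fits) a model with
    only a QTL main effect against one adding QTL x location and QTL x harvest
    interactions. Simplified fixed-effects illustration only."""
    results = []
    for m in marker_cols:
        base = smf.ols(f"yield_t_ha ~ {m} + C(location) * C(harvest)", data=df).fit()
        full = smf.ols(
            f"yield_t_ha ~ {m} * (C(location) + C(harvest)) + C(location) * C(harvest)",
            data=df,
        ).fit()
        f_stat, p_value, _ = full.compare_f_test(base)
        results.append({"marker": m, "interaction_F": f_stat, "interaction_p": p_value})
    return pd.DataFrame(results)
```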
Abstract:
Background & aims: The boundaries between the categories of body composition provided by vectorial analysis of bioimpedance are not well defined. In this paper, fuzzy sets theory was used for modeling such uncertainty. Methods: An Italian database with 179 cases aged 18-70 years was divided randomly into development (n = 20) and testing (n = 159) samples. Of the 159 registries in the testing sample, 99 had an unequivocal diagnosis. Resistance/height and reactance/height were the input variables in the model. Output variables were the seven categories of body composition of vectorial analysis. For each case the linguistic model estimated the membership degree of each impedance category. To compare these results with the previously established diagnoses, the Kappa statistic was used. This demanded singling out one among the output set of seven categories of membership degrees. This procedure (defuzzification rule) established that the category with the highest membership degree should be the most likely category for the case. Results: The fuzzy model showed a good fit to the development sample. Excellent agreement was achieved between the defuzzified impedance diagnoses and the clinical diagnoses in the testing sample (Kappa = 0.85, p < 0.001). Conclusions: The fuzzy linguistic model was found to be in good agreement with clinical diagnoses. If the whole model output is considered, information on the extent to which each BIVA category is present can better inform clinical practice, with an enlarged nosological framework and diverse therapeutic strategies. (C) 2012 Elsevier Ltd and European Society for Clinical Nutrition and Metabolism. All rights reserved.
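The defuzzification rule described above (report all membership degrees, then pick the category with the highest one) can be sketched with generic triangular membership functions. The category names, the use of only three categories, and the breakpoints below are hypothetical; the paper's linguistic model was built from its own development sample and covers seven BIVA categories.

```python
import numpy as np

def triangular(x, a, b, c):
    """Triangular membership function with support [a, c] and peak at b."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

# Hypothetical memberships over normalized resistance/height and reactance/height
CATEGORIES = {
    "dehydration":    lambda r, xc: triangular(r, 0.6, 0.9, 1.2) * triangular(xc, 0.5, 0.8, 1.1),
    "normal":         lambda r, xc: triangular(r, 0.3, 0.5, 0.7) * triangular(xc, 0.3, 0.5, 0.7),
    "fluid_overload": lambda r, xc: triangular(r, 0.0, 0.2, 0.4) * triangular(xc, 0.0, 0.2, 0.4),
}

def classify(resistance_h, reactance_h):
    """Return all membership degrees plus the defuzzified (argmax) category."""
    degrees = {name: float(f(resistance_h, reactance_h)) for name, f in CATEGORIES.items()}
    return degrees, max(degrees, key=degrees.get)

print(classify(0.55, 0.48))
```

Keeping the full vector of membership degrees, rather than only the argmax, is what the conclusion refers to when it argues that the whole model output is more informative for clinical practice.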
Abstract:
We present a photometric catalogue of compact groups of galaxies (p2MCGs) automatically extracted from the Two-Micron All Sky Survey (2MASS) extended source catalogue. A total of 262 p2MCGs are identified, following the criteria defined by Hickson, of which 230 survive visual inspection (given occasional galaxy fragmentation and blends in the 2MASS parent catalogue). Only one quarter of these 230 groups were previously known compact groups (CGs). Among the 144 p2MCGs that have all their galaxies with known redshifts, 85 (59 per cent) have four or more accordant galaxies. This v2MCG sample of velocity-filtered p2MCGs constitutes the largest sample of CGs (with N >= 4) catalogued to date, with both well-defined selection criteria and velocity filtering, and is the first CG sample selected by stellar mass. It is fairly complete up to K_group ~ 9 and radial velocity of ~6000 km s^-1. We compared the properties of the 78 v2MCGs with median velocities greater than 3000 km s^-1 with the properties of other CG samples, as well as those (mvCGs) extracted from the semi-analytical model (SAM) of Guo et al. run on the high-resolution Millennium-II simulation. This mvCG sample is similar (i.e. with 2/3 of physically dense CGs) to those we had previously extracted from three other SAMs run on the Millennium simulation with 125 times worse spatial and mass resolutions. The space density of v2MCGs within 6000 km s^-1 is 8.0 x 10^-5 h^3 Mpc^-3, i.e. four times that of the Hickson sample [Hickson Compact Group (HCG)] up to the same distance and with the same criteria used in this work, but still 40 per cent less than that of mvCGs. The v2MCG constitutes the first group catalogue to show a statistically large first-second ranked galaxy magnitude gap according to Tremaine-Richstone statistics, as expected if the first ranked group members tend to be the products of galaxy mergers, and as confirmed in the mvCGs. The v2MCG is also the first observed sample to show that first-ranked galaxies tend to be centrally located, again consistent with the predictions obtained from mvCGs. We found no significant correlation between group apparent elongation and velocity dispersion in the quartets among the v2MCGs, and the velocity dispersions of apparently round quartets are not significantly larger than those of chain-like ones, in contrast to what has been previously reported in HCGs. By virtue of its automatic selection with the popular Hickson criteria, its size, its selection on stellar mass, and its statistical signs of mergers and centrally located brightest galaxies, the v2MCG catalogue appears to be the laboratory of choice to study physically dense groups of four or more galaxies of comparable luminosity.
Abstract:
In this paper we propose a hybrid hazard regression model with threshold stress which includes the proportional hazards and the accelerated failure time models as particular cases. To express the behavior of lifetimes, the generalized-gamma distribution is assumed and an inverse power law model with a threshold stress is considered. For parameter estimation we develop a sampling-based posterior inference procedure based on Markov chain Monte Carlo techniques. We assume proper but vague priors for the parameters of interest. A simulation study investigates the frequentist properties of the proposed estimators obtained under the assumption of vague priors. Further, some discussion of model selection criteria is given. The methodology is illustrated on simulated and real lifetime data sets.
Abstract:
A thin-layer electrochemical flow cell coupled to capillary electrophoresis with contactless conductivity detection (EC-CE-C4D) was applied for the first time to the derivatization and quantification of neutral species, using aliphatic alcohols as model compounds. The simultaneous electrooxidation of four alcohols (ethanol, 1-propanol, 1-butanol, and 1-pentanol) to the corresponding carboxylates was carried out on a platinum working electrode in acid medium. The derivatization step required 1 min at 1.6 V vs. Ag/AgCl under stopped-flow conditions, which was preceded by a 10 s activation at 0 V. The solution close to the electrode surface was then hydrodynamically injected into the capillary, and a 2.5 min electrophoretic separation was carried out. The fully automated flow system operated at a frequency of 12 analyses per hour. Simultaneous determination of the four alcohols presented detection limits of about 5 x 10^-5 mol. As a practical application with a complex matrix, ethanol concentrations were determined in diluted pale lager beer and in nonalcoholic beer. No statistically significant difference was observed between the EC-CE-C4D and gas chromatography with flame ionization detection (GC-FID) results for these samples. The derivatization efficiency remained constant over several hours of continuous operation with lager beer samples (n = 40).
Abstract:
The purpose of this paper is to develop a Bayesian analysis for right-censored survival data when immune or cured individuals may be present in the population from which the data are taken. In our approach, the number of competing causes of the event of interest follows the Conway-Maxwell-Poisson distribution, which generalizes the Poisson distribution. Markov chain Monte Carlo (MCMC) methods are used to develop a Bayesian procedure for the proposed model. Also, some discussion of model selection and an illustration with a real data set are provided.
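The Conway-Maxwell-Poisson assumption for the number of competing causes N induces a long-term (cure) survival structure in the usual promotion-time way. The sketch below assumes each of the N latent causes produces an event time with common survival function S(t) and follows the generic formulation, so its parameterization may differ in detail from the paper's:

```latex
% Conway-Maxwell-Poisson distribution for the number of competing causes
P(N = n) = \frac{1}{Z(\lambda,\nu)}\,\frac{\lambda^{n}}{(n!)^{\nu}},
\qquad Z(\lambda,\nu) = \sum_{j=0}^{\infty} \frac{\lambda^{j}}{(j!)^{\nu}},
\qquad n = 0, 1, 2, \dots

% Population survival when each of the N causes has latent survival S(t)
S_{\mathrm{pop}}(t) = E\!\left[S(t)^{N}\right]
  = \frac{Z\!\left(\lambda\,S(t),\,\nu\right)}{Z(\lambda,\nu)},
\qquad
\text{cure fraction} = \lim_{t\to\infty} S_{\mathrm{pop}}(t) = \frac{1}{Z(\lambda,\nu)}.
```

Setting ν = 1 recovers the Poisson promotion-time cure model, S_pop(t) = exp{-λ F(t)}.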
Abstract:
The log-Burr XII regression model for grouped survival data is evaluated in the presence of many ties. The methodology for grouped survival data is based on life tables, where the times are grouped into k intervals, and we fit discrete lifetime regression models to the data. The model parameters are estimated by maximum likelihood and jackknife methods. To detect influential observations in the proposed model, diagnostic measures based on case deletion, so-called global influence, and influence measures based on small perturbations in the data or in the model, referred to as local influence, are used. In addition to these measures, the total local influence and influential estimates are also used. We conduct Monte Carlo simulation studies to assess the finite-sample behavior of the maximum likelihood estimators of the proposed model for grouped survival data. A real data set is analyzed using a regression model for grouped data.
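The life-table setup described above leads to a likelihood built from conditional failure probabilities over the k intervals. This is a minimal sketch of that grouped-data log-likelihood for a generic parametric survivor function, with a Weibull stand-in and hypothetical toy counts; the paper's log-Burr XII specification, covariate effects and jackknife step are not reproduced.

```python
import numpy as np
from scipy.optimize import minimize

def grouped_survival_loglik(theta, cutpoints, events, at_risk, survivor):
    """Log-likelihood for grouped survival data on intervals
    [t_0, t_1), ..., [t_{k-1}, t_k): in interval j, each subject at risk
    fails with conditional probability 1 - S(t_{j+1}) / S(t_j)."""
    S = survivor(np.asarray(cutpoints), theta)
    q = 1.0 - S[1:] / S[:-1]                    # conditional failure probabilities
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return np.sum(events * np.log(q) + (at_risk - events) * np.log(1.0 - q))

# Example with a Weibull survivor function (stand-in for log-Burr XII)
weibull = lambda t, th: np.exp(-(t / np.exp(th[0])) ** np.exp(th[1]))
cutpoints = [0.0, 1.0, 2.0, 3.0, 4.0]
events  = np.array([5, 8, 6, 3])                # deaths per interval (toy data)
at_risk = np.array([100, 95, 87, 81])           # subjects entering each interval
fit = minimize(lambda th: -grouped_survival_loglik(th, cutpoints, events, at_risk, weibull),
               x0=[1.0, 0.0], method="Nelder-Mead")
print("estimated scale, shape:", np.exp(fit.x))
```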