955 resultados para Variable-selection Problems
Resumo:
A combinatorial protocol (CP) is introduced here to interface it with the multiple linear regression (MLR) for variable selection. The efficiency of CP-MLR is primarily based on the restriction of entry of correlated variables to the model development stage. It has been used for the analysis of Selwood et al data set [16], and the obtained models are compared with those reported from GFA [8] and MUSEUM [9] approaches. For this data set CP-MLR could identify three highly independent models (27, 28 and 31) with Q2 value in the range of 0.632-0.518. Also, these models are divergent and unique. Even though, the present study does not share any models with GFA [8], and MUSEUM [9] results, there are several descriptors common to all these studies, including the present one. Also a simulation is carried out on the same data set to explain the model formation in CP-MLR. The results demonstrate that the proposed method should be able to offer solutions to data sets with 50 to 60 descriptors in reasonable time frame. By carefully selecting the inter-parameter correlation cutoff values in CP-MLR one can identify divergent models and handle data sets larger than the present one without involving excessive computer time.
Resumo:
Random Forests™ is reported to be one of the most accurate classification algorithms in complex data analysis. It shows excellent performance even when most predictors are noisy and the number of variables is much larger than the number of observations. In this thesis Random Forests was applied to a large-scale lung cancer case-control study. A novel way of automatically selecting prognostic factors was proposed. Also, synthetic positive control was used to validate Random Forests method. Throughout this study we showed that Random Forests can deal with large number of weak input variables without overfitting. It can account for non-additive interactions between these input variables. Random Forests can also be used for variable selection without being adversely affected by collinearities. ^ Random Forests can deal with the large-scale data sets without rigorous data preprocessing. It has robust variable importance ranking measure. Proposed is a novel variable selection method in context of Random Forests that uses the data noise level as the cut-off value to determine the subset of the important predictors. This new approach enhanced the ability of the Random Forests algorithm to automatically identify important predictors for complex data. The cut-off value can also be adjusted based on the results of the synthetic positive control experiments. ^ When the data set had high variables to observations ratio, Random Forests complemented the established logistic regression. This study suggested that Random Forests is recommended for such high dimensionality data. One can use Random Forests to select the important variables and then use logistic regression or Random Forests itself to estimate the effect size of the predictors and to classify new observations. ^ We also found that the mean decrease of accuracy is a more reliable variable ranking measurement than mean decrease of Gini. ^
Resumo:
This paper addresses the question of maximizing classifier accuracy for classifying task-related mental activity from Magnetoencelophalography (MEG) data. We propose the use of different sources of information and introduce an automatic channel selection procedure. To determine an informative set of channels, our approach combines a variety of machine learning algorithms: feature subset selection methods, classifiers based on regularized logistic regression, information fusion, and multiobjective optimization based on probabilistic modeling of the search space. The experimental results show that our proposal is able to improve classification accuracy compared to approaches whose classifiers use only one type of MEG information or for which the set of channels is fixed a priori.
Resumo:
The procedure used should guarantee that the risk of decision asserted from the observations is at most some specified value P*.
Resumo:
Many studies on birds focus on the collection of data through an experimental design, suitable for investigation in a classical analysis of variance (ANOVA) framework. Although many findings are confirmed by one or more experts, expert information is rarely used in conjunction with the survey data to enhance the explanatory and predictive power of the model. We explore this neglected aspect of ecological modelling through a study on Australian woodland birds, focusing on the potential impact of different intensities of commercial cattle grazing on bird density in woodland habitat. We examine a number of Bayesian hierarchical random effects models, which cater for overdispersion and a high frequency of zeros in the data using WinBUGS and explore the variation between and within different grazing regimes and species. The impact and value of expert information is investigated through the inclusion of priors that reflect the experience of 20 experts in the field of bird responses to disturbance. Results indicate that expert information moderates the survey data, especially in situations where there are little or no data. When experts agreed, credible intervals for predictions were tightened considerably. When experts failed to agree, results were similar to those evaluated in the absence of expert information. Overall, we found that without expert opinion our knowledge was quite weak. The fact that the survey data is quite consistent, in general, with expert opinion shows that we do know something about birds and grazing and we could learn a lot faster if we used this approach more in ecology, where data are scarce. Copyright (c) 2005 John Wiley & Sons, Ltd.
Resumo:
Fitting statistical models is computationally challenging when the sample size or the dimension of the dataset is huge. An attractive approach for down-scaling the problem size is to first partition the dataset into subsets and then fit using distributed algorithms. The dataset can be partitioned either horizontally (in the sample space) or vertically (in the feature space), and the challenge arise in defining an algorithm with low communication, theoretical guarantees and excellent practical performance in general settings. For sample space partitioning, I propose a MEdian Selection Subset AGgregation Estimator ({\em message}) algorithm for solving these issues. The algorithm applies feature selection in parallel for each subset using regularized regression or Bayesian variable selection method, calculates the `median' feature inclusion index, estimates coefficients for the selected features in parallel for each subset, and then averages these estimates. The algorithm is simple, involves very minimal communication, scales efficiently in sample size, and has theoretical guarantees. I provide extensive experiments to show excellent performance in feature selection, estimation, prediction, and computation time relative to usual competitors.
While sample space partitioning is useful in handling datasets with large sample size, feature space partitioning is more effective when the data dimension is high. Existing methods for partitioning features, however, are either vulnerable to high correlations or inefficient in reducing the model dimension. In the thesis, I propose a new embarrassingly parallel framework named {\em DECO} for distributed variable selection and parameter estimation. In {\em DECO}, variables are first partitioned and allocated to m distributed workers. The decorrelated subset data within each worker are then fitted via any algorithm designed for high-dimensional problems. We show that by incorporating the decorrelation step, DECO can achieve consistent variable selection and parameter estimation on each subset with (almost) no assumptions. In addition, the convergence rate is nearly minimax optimal for both sparse and weakly sparse models and does NOT depend on the partition number m. Extensive numerical experiments are provided to illustrate the performance of the new framework.
For datasets with both large sample sizes and high dimensionality, I propose a new "divided-and-conquer" framework {\em DEME} (DECO-message) by leveraging both the {\em DECO} and the {\em message} algorithm. The new framework first partitions the dataset in the sample space into row cubes using {\em message} and then partition the feature space of the cubes using {\em DECO}. This procedure is equivalent to partitioning the original data matrix into multiple small blocks, each with a feasible size that can be stored and fitted in a computer in parallel. The results are then synthezied via the {\em DECO} and {\em message} algorithm in a reverse order to produce the final output. The whole framework is extremely scalable.
Resumo:
La dernière décennie a connu un intérêt croissant pour les problèmes posés par les variables instrumentales faibles dans la littérature économétrique, c’est-à-dire les situations où les variables instrumentales sont faiblement corrélées avec la variable à instrumenter. En effet, il est bien connu que lorsque les instruments sont faibles, les distributions des statistiques de Student, de Wald, du ratio de vraisemblance et du multiplicateur de Lagrange ne sont plus standard et dépendent souvent de paramètres de nuisance. Plusieurs études empiriques portant notamment sur les modèles de rendements à l’éducation [Angrist et Krueger (1991, 1995), Angrist et al. (1999), Bound et al. (1995), Dufour et Taamouti (2007)] et d’évaluation des actifs financiers (C-CAPM) [Hansen et Singleton (1982,1983), Stock et Wright (2000)], où les variables instrumentales sont faiblement corrélées avec la variable à instrumenter, ont montré que l’utilisation de ces statistiques conduit souvent à des résultats peu fiables. Un remède à ce problème est l’utilisation de tests robustes à l’identification [Anderson et Rubin (1949), Moreira (2002), Kleibergen (2003), Dufour et Taamouti (2007)]. Cependant, il n’existe aucune littérature économétrique sur la qualité des procédures robustes à l’identification lorsque les instruments disponibles sont endogènes ou à la fois endogènes et faibles. Cela soulève la question de savoir ce qui arrive aux procédures d’inférence robustes à l’identification lorsque certaines variables instrumentales supposées exogènes ne le sont pas effectivement. Plus précisément, qu’arrive-t-il si une variable instrumentale invalide est ajoutée à un ensemble d’instruments valides? Ces procédures se comportent-elles différemment? Et si l’endogénéité des variables instrumentales pose des difficultés majeures à l’inférence statistique, peut-on proposer des procédures de tests qui sélectionnent les instruments lorsqu’ils sont à la fois forts et valides? Est-il possible de proposer les proédures de sélection d’instruments qui demeurent valides même en présence d’identification faible? Cette thèse se focalise sur les modèles structurels (modèles à équations simultanées) et apporte des réponses à ces questions à travers quatre essais. Le premier essai est publié dans Journal of Statistical Planning and Inference 138 (2008) 2649 – 2661. Dans cet essai, nous analysons les effets de l’endogénéité des instruments sur deux statistiques de test robustes à l’identification: la statistique d’Anderson et Rubin (AR, 1949) et la statistique de Kleibergen (K, 2003), avec ou sans instruments faibles. D’abord, lorsque le paramètre qui contrôle l’endogénéité des instruments est fixe (ne dépend pas de la taille de l’échantillon), nous montrons que toutes ces procédures sont en général convergentes contre la présence d’instruments invalides (c’est-à-dire détectent la présence d’instruments invalides) indépendamment de leur qualité (forts ou faibles). Nous décrivons aussi des cas où cette convergence peut ne pas tenir, mais la distribution asymptotique est modifiée d’une manière qui pourrait conduire à des distorsions de niveau même pour de grands échantillons. Ceci inclut, en particulier, les cas où l’estimateur des double moindres carrés demeure convergent, mais les tests sont asymptotiquement invalides. Ensuite, lorsque les instruments sont localement exogènes (c’est-à-dire le paramètre d’endogénéité converge vers zéro lorsque la taille de l’échantillon augmente), nous montrons que ces tests convergent vers des distributions chi-carré non centrées, que les instruments soient forts ou faibles. Nous caractérisons aussi les situations où le paramètre de non centralité est nul et la distribution asymptotique des statistiques demeure la même que dans le cas des instruments valides (malgré la présence des instruments invalides). Le deuxième essai étudie l’impact des instruments faibles sur les tests de spécification du type Durbin-Wu-Hausman (DWH) ainsi que le test de Revankar et Hartley (1973). Nous proposons une analyse en petit et grand échantillon de la distribution de ces tests sous l’hypothèse nulle (niveau) et l’alternative (puissance), incluant les cas où l’identification est déficiente ou faible (instruments faibles). Notre analyse en petit échantillon founit plusieurs perspectives ainsi que des extensions des précédentes procédures. En effet, la caractérisation de la distribution de ces statistiques en petit échantillon permet la construction des tests de Monte Carlo exacts pour l’exogénéité même avec les erreurs non Gaussiens. Nous montrons que ces tests sont typiquement robustes aux intruments faibles (le niveau est contrôlé). De plus, nous fournissons une caractérisation de la puissance des tests, qui exhibe clairement les facteurs qui déterminent la puissance. Nous montrons que les tests n’ont pas de puissance lorsque tous les instruments sont faibles [similaire à Guggenberger(2008)]. Cependant, la puissance existe tant qu’au moins un seul instruments est fort. La conclusion de Guggenberger (2008) concerne le cas où tous les instruments sont faibles (un cas d’intérêt mineur en pratique). Notre théorie asymptotique sous les hypothèses affaiblies confirme la théorie en échantillon fini. Par ailleurs, nous présentons une analyse de Monte Carlo indiquant que: (1) l’estimateur des moindres carrés ordinaires est plus efficace que celui des doubles moindres carrés lorsque les instruments sont faibles et l’endogenéité modérée [conclusion similaire à celle de Kiviet and Niemczyk (2007)]; (2) les estimateurs pré-test basés sur les tests d’exogenété ont une excellente performance par rapport aux doubles moindres carrés. Ceci suggère que la méthode des variables instrumentales ne devrait être appliquée que si l’on a la certitude d’avoir des instruments forts. Donc, les conclusions de Guggenberger (2008) sont mitigées et pourraient être trompeuses. Nous illustrons nos résultats théoriques à travers des expériences de simulation et deux applications empiriques: la relation entre le taux d’ouverture et la croissance économique et le problème bien connu du rendement à l’éducation. Le troisième essai étend le test d’exogénéité du type Wald proposé par Dufour (1987) aux cas où les erreurs de la régression ont une distribution non-normale. Nous proposons une nouvelle version du précédent test qui est valide même en présence d’erreurs non-Gaussiens. Contrairement aux procédures de test d’exogénéité usuelles (tests de Durbin-Wu-Hausman et de Rvankar- Hartley), le test de Wald permet de résoudre un problème courant dans les travaux empiriques qui consiste à tester l’exogénéité partielle d’un sous ensemble de variables. Nous proposons deux nouveaux estimateurs pré-test basés sur le test de Wald qui performent mieux (en terme d’erreur quadratique moyenne) que l’estimateur IV usuel lorsque les variables instrumentales sont faibles et l’endogénéité modérée. Nous montrons également que ce test peut servir de procédure de sélection de variables instrumentales. Nous illustrons les résultats théoriques par deux applications empiriques: le modèle bien connu d’équation du salaire [Angist et Krueger (1991, 1999)] et les rendements d’échelle [Nerlove (1963)]. Nos résultats suggèrent que l’éducation de la mère expliquerait le décrochage de son fils, que l’output est une variable endogène dans l’estimation du coût de la firme et que le prix du fuel en est un instrument valide pour l’output. Le quatrième essai résout deux problèmes très importants dans la littérature économétrique. D’abord, bien que le test de Wald initial ou étendu permette de construire les régions de confiance et de tester les restrictions linéaires sur les covariances, il suppose que les paramètres du modèle sont identifiés. Lorsque l’identification est faible (instruments faiblement corrélés avec la variable à instrumenter), ce test n’est en général plus valide. Cet essai développe une procédure d’inférence robuste à l’identification (instruments faibles) qui permet de construire des régions de confiance pour la matrices de covariances entre les erreurs de la régression et les variables explicatives (possiblement endogènes). Nous fournissons les expressions analytiques des régions de confiance et caractérisons les conditions nécessaires et suffisantes sous lesquelles ils sont bornés. La procédure proposée demeure valide même pour de petits échantillons et elle est aussi asymptotiquement robuste à l’hétéroscédasticité et l’autocorrélation des erreurs. Ensuite, les résultats sont utilisés pour développer les tests d’exogénéité partielle robustes à l’identification. Les simulations Monte Carlo indiquent que ces tests contrôlent le niveau et ont de la puissance même si les instruments sont faibles. Ceci nous permet de proposer une procédure valide de sélection de variables instrumentales même s’il y a un problème d’identification. La procédure de sélection des instruments est basée sur deux nouveaux estimateurs pré-test qui combinent l’estimateur IV usuel et les estimateurs IV partiels. Nos simulations montrent que: (1) tout comme l’estimateur des moindres carrés ordinaires, les estimateurs IV partiels sont plus efficaces que l’estimateur IV usuel lorsque les instruments sont faibles et l’endogénéité modérée; (2) les estimateurs pré-test ont globalement une excellente performance comparés à l’estimateur IV usuel. Nous illustrons nos résultats théoriques par deux applications empiriques: la relation entre le taux d’ouverture et la croissance économique et le modèle de rendements à l’éducation. Dans la première application, les études antérieures ont conclu que les instruments n’étaient pas trop faibles [Dufour et Taamouti (2007)] alors qu’ils le sont fortement dans la seconde [Bound (1995), Doko et Dufour (2009)]. Conformément à nos résultats théoriques, nous trouvons les régions de confiance non bornées pour la covariance dans le cas où les instruments sont assez faibles.
Resumo:
Vector Autoregressive Moving Average (VARMA) models have many theoretical properties which should make them popular among empirical macroeconomists. However, they are rarely used in practice due to over-parameterization concerns, difficulties in ensuring identification and computational challenges. With the growing interest in multivariate time series models of high dimension, these problems with VARMAs become even more acute, accounting for the dominance of VARs in this field. In this paper, we develop a Bayesian approach for inference in VARMAs which surmounts these problems. It jointly ensures identification and parsimony in the context of an efficient Markov chain Monte Carlo (MCMC) algorithm. We use this approach in a macroeconomic application involving up to twelve dependent variables. We find our algorithm to work successfully and provide insights beyond those provided by VARs.
Resumo:
This paper argues that the strategic use of debt favours the revelationof information in dynamic adverse selection problems. Our argument is basedon the idea that debt is a credible commitment to end long term relationships.Consequently, debt encourages a privately informed party to disclose itsinformation at early stages of a relationship. We illustrate our pointwith the financing decision of a monopolist selling a good to a buyerwhose valuation is private information. A high level of (renegotiable)debt, by increasing the scope for liquidation, may induce the highvaluation buyer to buy early at a high price and thus increase themonopolist's expected payoff. By affecting the buyer's strategy, it mayreduce the probability of excessive liquidation. We investigate theconsequences of good durability and we examine the way debt mayalleviate the ratchet effect.
Resumo:
This paper is concerned with the selection of inputs for classification models based on ratios of measured quantities. For this purpose, all possible ratios are built from the quantities involved and variable selection techniques are used to choose a convenient subset of ratios. In this context, two selection techniques are proposed: one based on a pre-selection procedure and another based on a genetic algorithm. In an example involving the financial distress prediction of companies, the models obtained from ratios selected by the proposed techniques compare favorably to a model using ratios usually found in the financial distress literature.
Resumo:
Analyzes the use of linear and neural network models for financial distress classification, with emphasis on the issues of input variable selection and model pruning. A data-driven method for selecting input variables (financial ratios, in this case) is proposed. A case study involving 60 British firms in the period 1997-2000 is used for illustration. It is shown that the use of the Optimal Brain Damage pruning technique can considerably improve the generalization ability of a neural model. Moreover, the set of financial ratios obtained with the proposed selection procedure is shown to be an appropriate alternative to the ratios usually employed by practitioners.
Resumo:
The objective of this dissertation is to re-examine classical issues in corporate finance, applying a new analytical tool. The single-crossing property, also called Spence-irrlees condition, is not required in the models developed here. This property has been a standard assumption in adverse selection and signaling models developed so far. The classical papers by Guesnerie and Laffont (1984) and Riley (1979) assume it. In the simplest case, for a consumer with a privately known taste, the single-crossing property states that the marginal utility of a good is monotone with respect to the taste. This assumption has an important consequence to the result of the model: the relationship between the private parameter and the quantity of the good assigned to the agent is monotone. While single crossing is a reasonable property for the utility of an ordinary consumer, this property is frequently absent in the objective function of the agents for more elaborate models. The lack of a characterization for the non-single crossing context has hindered the exploration of models that generate objective functions without this property. The first work that characterizes the optimal contract without the single-crossing property is Araújo and Moreira (2001a) and, for the competitive case, Araújo and Moreira (2001b). The main implication is that a partial separation of types may be observed. Two sets of disconnected types of agents may choose the same contract, in adverse selection problems, or signal with the same levei of signal, in signaling models.
Resumo:
Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)
Resumo:
High-throughput gene expression technologies such as microarrays have been utilized in a variety of scientific applications. Most of the work has been on assessing univariate associations between gene expression with clinical outcome (variable selection) or on developing classification procedures with gene expression data (supervised learning). We consider a hybrid variable selection/classification approach that is based on linear combinations of the gene expression profiles that maximize an accuracy measure summarized using the receiver operating characteristic curve. Under a specific probability model, this leads to consideration of linear discriminant functions. We incorporate an automated variable selection approach using LASSO. An equivalence between LASSO estimation with support vector machines allows for model fitting using standard software. We apply the proposed method to simulated data as well as data from a recently published prostate cancer study.
Resumo:
The purpose of this research and development project was to develop a method, a design, and a prototype for gathering, managing, and presenting data about occupational injuries.^ State-of-the-art systems analysis and design methodologies were applied to the long standing problem in the field of occupational safety and health of processing workplace injuries data into information for safety and health program management as well as preliminary research about accident etiologies. The top-down planning and bottom-up implementation approach was utilized to design an occupational injury management information system. A description of a managerial control system and a comprehensive system to integrate safety and health program management was provided.^ The project showed that current management information systems (MIS) theory and methods could be applied successfully to the problems of employee injury surveillance and control program performance evaluation. The model developed in the first section was applied at The University of Texas Health Science Center at Houston (UTHSCH).^ The system in current use at the UTHSCH was described and evaluated, and a prototype was developed for the UTHSCH. The prototype incorporated procedures for collecting, storing, and retrieving records of injuries and the procedures necessary to prepare reports, analyses, and graphics for management in the Health Science Center. Examples of reports, analyses, and graphics presenting UTHSCH and computer generated data were included.^ It was concluded that a pilot test of this MIS should be implemented and evaluated at the UTHSCH and other settings. Further research and development efforts for the total safety and health management information systems, control systems, component systems, and variable selection should be pursued. Finally, integration of the safety and health program MIS into the comprehensive or executive MIS was recommended. ^