843 results for Model selection
Abstract:
The starting point of this article is the question "How to retrieve fingerprints of rhythm in written texts?" We address this problem in the case of Brazilian and European Portuguese. These two dialects of Modern Portuguese share the same lexicon and most of the sentences they produce are superficially identical. Yet they are conjectured, on linguistic grounds, to implement different rhythms. We show that this linguistic question can be formulated as a problem of model selection in the class of variable length Markov chains. To carry out this approach, we compare texts from European and Brazilian Portuguese. These texts are first encoded according to some basic rhythmic features of the sentences which can be automatically retrieved. This is an entirely new approach from the linguistic point of view. Our statistical contribution is the introduction of the smallest maximizer criterion, which is a constant-free procedure for model selection. As a by-product, this provides a solution for the problem of optimal choice of the penalty constant when using the BIC to select a variable length Markov chain. Besides proving the consistency of the smallest maximizer criterion when the sample size diverges, we also make a simulation study comparing our approach with both the standard BIC selection and the Peres-Shields order estimation. Applied to the linguistic sample constituted for our case study, the smallest maximizer criterion assigns different context-tree models to the two dialects of Portuguese. The features of the selected models are compatible with current conjectures discussed in the linguistic literature.
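To make the role of the BIC penalty constant concrete, the following is a minimal sketch of BIC-based order selection for a fixed-order Markov chain. It is a deliberate simplification: it is neither the context-tree (variable length) selection nor the smallest maximizer criterion described in the abstract. The names `markov_log_likelihood`, `select_order_bic` and `penalty_constant` are illustrative assumptions.

```python
import math
from collections import Counter


def markov_log_likelihood(seq, order, start):
    """Maximised log-likelihood of a fixed-order Markov chain on seq[start:].

    All orders are scored on the same positions (from `start` on) so that
    their likelihoods are comparable.
    """
    context_counts = Counter()
    transition_counts = Counter()
    for i in range(start, len(seq)):
        ctx = tuple(seq[i - order:i])
        context_counts[ctx] += 1
        transition_counts[(ctx, seq[i])] += 1
    return sum(
        n * math.log(n / context_counts[ctx])
        for (ctx, _symbol), n in transition_counts.items()
    )


def select_order_bic(seq, max_order, penalty_constant=1.0):
    """Return the order in 0..max_order with the highest penalised likelihood.

    penalty_constant scales the usual BIC penalty; it is the quantity whose
    choice the abstract describes as problematic (assumed parameter name).
    """
    alphabet_size = len(set(seq))
    n = len(seq) - max_order
    best_order, best_score = 0, -math.inf
    for order in range(max_order + 1):
        log_lik = markov_log_likelihood(seq, order, start=max_order)
        n_params = (alphabet_size ** order) * (alphabet_size - 1)
        score = log_lik - penalty_constant * 0.5 * n_params * math.log(n)
        if score > best_score:
            best_order, best_score = order, score
    return best_order
```

For a strictly alternating binary sequence the sketch selects order 1, and for the repeating pattern `001` it selects order 2, since those are the shortest memories that make the next symbol predictable.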
Abstract:
In this article, we propose a new Bayesian flexible cure rate survival model, which generalises the stochastic model of Klebanov et al. [Klebanov LB, Rachev ST and Yakovlev AY. A stochastic-model of radiation carcinogenesis - latent time distributions and their properties. Math Biosci 1993; 113: 51-75], and has much in common with the destructive model formulated by Rodrigues et al. [Rodrigues J, de Castro M, Balakrishnan N and Cancho VG. Destructive weighted Poisson cure rate models. Technical Report, Universidade Federal de Sao Carlos, Sao Carlos-SP. Brazil, 2009 (accepted in Lifetime Data Analysis)]. In our approach, the accumulated number of lesions or altered cells follows a compound weighted Poisson distribution. This model is more flexible than the promotion time cure model in terms of dispersion. Moreover, it possesses an interesting and realistic interpretation of the biological mechanism of the occurrence of the event of interest as it includes a destructive process of tumour cells after an initial treatment or the capacity of an individual exposed to irradiation to repair altered cells that results in cancer induction. In other words, what is recorded is only the damaged portion of the original number of altered cells not eliminated by the treatment or repaired by the repair system of an individual. Markov Chain Monte Carlo (MCMC) methods are then used to develop Bayesian inference for the proposed model. Also, some discussions on the model selection and an illustration with a cutaneous melanoma data set analysed by Rodrigues et al. [Rodrigues J, de Castro M, Balakrishnan N and Cancho VG. Destructive weighted Poisson cure rate models. Technical Report, Universidade Federal de Sao Carlos, Sao Carlos-SP. Brazil, 2009 (accepted in Lifetime Data Analysis)] are presented.
Abstract:
In the setting of high-dimensional linear models with Gaussian noise, we investigate the possibility of confidence statements connected to model selection. Although there exist numerous procedures for adaptive (point) estimation, the construction of adaptive confidence regions is severely limited (cf. Li in Ann Stat 17:1001–1008, 1989). The present paper sheds new light on this gap. We develop exact and adaptive confidence regions for the best approximating model in terms of risk. One of our constructions is based on a multiscale procedure and a particular coupling argument. Utilizing exponential inequalities for noncentral χ²-distributions, we show that the risk and quadratic loss of all models within our confidence region are uniformly bounded by the minimal risk times a factor close to one.
Abstract:
Racing algorithms have recently been proposed as a general-purpose method for performing model selection in machine learning algorithms. In this paper, we present an empirical study of the Hoeffding racing algorithm for selecting the k parameter in a simple k-nearest neighbor classifier. Fifteen widely-used classification datasets from UCI are used and experiments conducted across different confidence levels for racing. The results reveal a significant amount of sensitivity of the k-nn classifier to its model parameter value. The Hoeffding racing algorithm also varies widely in its performance, in terms of the computational savings gained over an exhaustive evaluation. While in some cases the savings gained are quite small, the racing algorithm proved to be highly robust to the possibility of erroneously eliminating the optimal models. All results were strongly dependent on the datasets used.
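The elimination rule behind Hoeffding racing can be sketched as follows. This is a generic illustration under stated assumptions, not the experimental setup of the paper: losses are assumed to lie in [0, loss_range], `loss_fn(candidate, i)` is a hypothetical callback returning the loss of a candidate (for example a value of k) on the i-th evaluation point, and `delta` is the per-comparison confidence parameter.

```python
import math


def hoeffding_race(candidates, loss_fn, n_points, delta=0.05, loss_range=1.0):
    """Race candidates against each other, dropping clearly worse ones early.

    After n evaluation points each surviving candidate has a mean loss with
    a Hoeffding half-width eps. A candidate is eliminated once its lower
    bound exceeds the smallest upper bound among the survivors.
    """
    totals = {c: 0.0 for c in candidates}
    alive = list(candidates)
    for n in range(1, n_points + 1):
        # Only surviving candidates are evaluated, which is where the
        # computational saving over exhaustive evaluation comes from.
        for c in alive:
            totals[c] += loss_fn(c, n - 1)
        eps = loss_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))
        means = {c: totals[c] / n for c in alive}
        best_upper = min(means[c] + eps for c in alive)
        alive = [c for c in alive if means[c] - eps <= best_upper]
    return alive
```

With few evaluation points the bounds are wide and nothing is eliminated; as points accumulate, the interval shrinks at rate 1/sqrt(n) and clearly inferior candidates drop out.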
Abstract:
We discuss aggregation of data from neuropsychological patients and the process of evaluating models using data from a series of patients. We argue that aggregation can be misleading but not aggregating can also result in information loss. The basis for combining data needs to be theoretically defined, and the particular method of aggregation depends on the theoretical question and characteristics of the data. We present examples, often drawn from our own research, to illustrate these points. We also argue that statistical models and formal methods of model selection are a useful way to test theoretical accounts using data from several patients in multiple-case studies or case series. Statistical models can often measure fit in a way that explicitly captures what a theory allows; the parameter values that result from model fitting often measure theoretically important dimensions and can lead to more constrained theories or new predictions; and model selection allows the strength of evidence for models to be quantified without forcing this into the artificial binary choice that characterizes hypothesis testing methods. Methods that aggregate and then formally model patient data, however, are not automatically preferred to other methods. Which method is preferred depends on the question to be addressed, characteristics of the data, and practical issues like availability of suitable patients, but case series, multiple-case studies, single-case studies, statistical models, and process models should be complementary methods when guided by theory development.
Abstract:
The use of chemical control measures to reduce the impact of parasite and pest species has frequently resulted in the development of resistance. Thus, resistance management has become a key concern in human and veterinary medicine, and in agricultural production. Although it is known that factors such as gene flow between susceptible and resistant populations, drug type, application methods, and costs of resistance can affect the rate of resistance evolution, less is known about the impacts of density-dependent eco-evolutionary processes that could be altered by drug-induced mortality. The overall aim of this thesis was to take an experimental evolution approach to assess how life history traits respond to drug selection, using a free-living dioecious worm (Caenorhabditis remanei) as a model. In Chapter 2, I defined the relationship between C. remanei survival and Ivermectin dose over a range of concentrations, in order to control the intensity of selection used in the selection experiment described in Chapter 4. The dose-response data were also used to appraise curve-fitting methods, using Akaike Information Criterion (AIC) model selection to compare a series of nonlinear models. The type of model fitted to the dose response data had a significant effect on the estimates of LD50 and LD99, suggesting that failure to fit an appropriate model could give misleading estimates of resistance status. In addition, simulated data were used to establish that a potential cost of resistance could be predicted by comparing survival at the upper asymptote of dose-response curves for resistant and susceptible populations, even when differences were as low as 4%. This approach to dose-response modeling ensures that the maximum amount of useful information relating to resistance is gathered in one study. In Chapter 3, I asked how simulations could be used to inform important design choices used in selection experiments. 
Specifically, I focused on the effects of both within- and between-line variation on estimated power, when detecting small, medium and large effect sizes. Using mixed-effect models on simulated data, I demonstrated that commonly used designs with realistic levels of variation could be underpowered for substantial effect sizes. Thus, simulation-based power analysis provides an effective way to avoid under- or overpowering study designs that incorporate variation due to random effects. In Chapter 4, I investigated how Ivermectin dosage and changes in population density affect the rate of resistance evolution. I exposed replicate lines of C. remanei to two doses of Ivermectin (high and low) to assess relative survival of lines selected in drug-treated environments compared to untreated controls over 10 generations. Additionally, I maintained lines where mortality was imposed randomly to control for differences in density between drug treatments and to distinguish between the evolutionary consequences of drug treatment versus ecological processes affected by changes in density-dependent feedback. Intriguingly, both drug-selected and random-mortality lines showed an increase in survivorship when challenged with Ivermectin; the magnitude of this increase varied with the intensity of selection and life-history stage. The results suggest that interactions between density-dependent processes and life history may mediate evolved changes in susceptibility to control measures, which could result in misleading conclusions about the evolution of heritable resistance following drug treatment. In Chapter 5, I investigated whether the apparent changes in drug susceptibility found in Chapter 4 were related to evolved changes in life-history of C. remanei populations after selection in drug-treated and random-mortality environments. Rapid passage of lines in the drug-free environment had no effect on the measured life-history traits. 
In the drug-free environment, adult size and fecundity of drug-selected lines increased compared to the controls but drug selection did not affect lifespan. In the treated environment, drug-selected lines showed increased lifespan and fecundity relative to controls. Adult size of randomly culled lines responded in a similar way to drug-selected lines in the drug-free environment, but no change in fecundity or lifespan was observed in either environment. The results suggest that life histories of nematodes can respond to selection as a result of the application of control measures. Failure to take these responses into account when applying control measures could result in adverse outcomes, such as larger and more fecund parasites, as well as over-estimation of the development of genetically controlled resistance. In conclusion, my thesis shows that there may be a complex relationship between drug selection, density-dependent regulatory processes and life history of populations challenged with control measures. This relationship could have implications for how resistance is monitored and managed if life histories of parasitic species show such eco-evolutionary responses to drug application.
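The AIC comparison of candidate dose-response curves mentioned for Chapter 2 can be illustrated with a small sketch. It assumes each curve has already been fitted by least squares with Gaussian errors, so that AIC can be computed (up to an additive constant shared by all models) from the residual sum of squares. The function names and the numbers in the usage lines are hypothetical and are not taken from the thesis.

```python
import math


def aic_least_squares(rss, n_obs, n_params):
    """AIC for a least-squares fit with Gaussian errors, up to a constant.

    n_params should be counted the same way for every model compared
    (for example, always including the error variance).
    """
    return n_obs * math.log(rss / n_obs) + 2 * n_params


def akaike_weights(aic_values):
    """Relative support for each model, normalised to sum to one."""
    best = min(aic_values)
    relative = [math.exp(-0.5 * (a - best)) for a in aic_values]
    total = sum(relative)
    return [r / total for r in relative]


# Hypothetical residual sums of squares for three candidate curves fitted
# to the same 40 observations (made-up numbers for illustration only).
candidates = {"2-parameter": (3.1, 2), "3-parameter": (2.2, 3), "4-parameter": (2.15, 4)}
aics = [aic_least_squares(rss, 40, k) for rss, k in candidates.values()]
weights = akaike_weights(aics)
```

The weights express how strongly the data favour each curve relative to the others, which is more informative than reporting only the single lowest AIC.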
Abstract:
Spoken term detection (STD) popularly involves performing word or sub-word level speech recognition and indexing the result. This work challenges the assumption that improved speech recognition accuracy implies better indexing for STD. Using an index derived from phone lattices, this paper examines the effect of language model selection on the relationship between phone recognition accuracy and STD accuracy. Results suggest that language models usually improve phone recognition accuracy but their inclusion does not always translate to improved STD accuracy. The findings suggest that using phone recognition accuracy to measure the quality of an STD index can be problematic, and highlight the need for an alternative that is more closely aligned with the goals of the specific detection task.
Abstract:
Over recent years a significant amount of research has been undertaken to develop prognostic models that can be used to predict the remaining useful life of engineering assets. Implementations by industry have only had limited success. By design, models are subject to specific assumptions and approximations, some of which are mathematical, while others relate to practical implementation issues such as the amount of data required to validate and verify a proposed model. Therefore, appropriate model selection for successful practical implementation requires not only a mathematical understanding of each model type, but also an appreciation of how a particular business intends to utilise a model and its outputs. This paper discusses business issues that need to be considered when selecting an appropriate modelling approach for trial. It also presents classification tables and process flow diagrams to assist industry and research personnel in selecting appropriate prognostic models for predicting the remaining useful life of engineering assets within their specific business environment. The paper then explores the strengths and weaknesses of the main prognostics model classes to establish what makes them better suited to certain applications than to others and summarises how each has been applied to engineering prognostics. Consequently, this paper should provide a starting point for young researchers first considering options for remaining useful life prediction. The models described in this paper are Knowledge-based (expert and fuzzy), Life expectancy (stochastic and statistical), Artificial Neural Networks, and Physical models.
Abstract:
Kernel-based learning algorithms work by embedding the data into a Euclidean space, and then searching for linear relations among the embedded data points. The embedding is performed implicitly, by specifying the inner products between each pair of points in the embedding space. This information is contained in the so-called kernel matrix, a symmetric and positive semidefinite matrix that encodes the relative positions of all points. Specifying this matrix amounts to specifying the geometry of the embedding space and inducing a notion of similarity in the input space - classical model selection problems in machine learning. In this paper we show how the kernel matrix can be learned from data via semidefinite programming (SDP) techniques. When applied to a kernel matrix associated with both training and test data this gives a powerful transductive algorithm - using the labeled part of the data one can learn an embedding also for the unlabeled part. The similarity between test points is inferred from training points and their labels. Importantly, these learning problems are convex, so we obtain a method for learning both the model class and the function without local minima. Furthermore, this approach leads directly to a convex method for learning the 2-norm soft margin parameter in support vector machines, solving an important open problem.
Abstract:
Kernel-based learning algorithms work by embedding the data into a Euclidean space, and then searching for linear relations among the embedded data points. The embedding is performed implicitly, by specifying the inner products between each pair of points in the embedding space. This information is contained in the so-called kernel matrix, a symmetric and positive definite matrix that encodes the relative positions of all points. Specifying this matrix amounts to specifying the geometry of the embedding space and inducing a notion of similarity in the input space -- classical model selection problems in machine learning. In this paper we show how the kernel matrix can be learned from data via semi-definite programming (SDP) techniques. When applied to a kernel matrix associated with both training and test data this gives a powerful transductive algorithm -- using the labelled part of the data one can learn an embedding also for the unlabelled part. The similarity between test points is inferred from training points and their labels. Importantly, these learning problems are convex, so we obtain a method for learning both the model class and the function without local minima. Furthermore, this approach leads directly to a convex method to learn the 2-norm soft margin parameter in support vector machines, solving another important open problem. Finally, the novel approach presented in the paper is supported by positive empirical results.
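The statement that a kernel matrix collects pairwise inner products, and is therefore symmetric and positive semidefinite, can be checked on a toy case. The sketch below uses the plain linear kernel on explicit points, so it only illustrates the object being learned and not the semidefinite programming procedure of the paper. The function names and the sample points are illustrative assumptions.

```python
def linear_kernel_matrix(points):
    """Gram matrix K[i][j] = <x_i, x_j> for a list of equal-length vectors."""
    return [[sum(a * b for a, b in zip(p, q)) for q in points] for p in points]


def quadratic_form(kernel, coeffs):
    """Compute c^T K c, which equals ||sum_i c_i x_i||^2 for a Gram matrix."""
    size = len(coeffs)
    return sum(
        coeffs[i] * kernel[i][j] * coeffs[j]
        for i in range(size)
        for j in range(size)
    )


# Made-up points and coefficients for illustration.
points = [(1.0, 0.0), (0.0, 2.0), (1.0, 1.0)]
K = linear_kernel_matrix(points)
value = quadratic_form(K, [1.0, -2.0, 0.5])
```

Because the quadratic form is a squared norm of a linear combination of the embedded points, it can never be negative, which is exactly the positive semidefiniteness constraint that an SDP formulation enforces on a learned kernel matrix.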
Abstract:
Hybrid system representations have been applied to many challenging modeling situations. In these hybrid system representations, a mixture of continuous and discrete states is used to capture the dominating behavioural features of a nonlinear, possibly uncertain, model under approximation. Unfortunately, the problem of how to best design a suitable hybrid system model has not yet been fully addressed. This paper proposes a new joint state measurement relative entropy rate based approach for this design purpose. Design examples and simulation studies are presented which highlight the benefits of our proposed design approaches.
Abstract:
The quick detection of abrupt (unknown) parameter changes in an observed hidden Markov model (HMM) is important in several applications. Motivated by the recent application of relative entropy concepts in the robust sequential change detection problem (and the related model selection problem), this paper proposes a sequential unknown change detection algorithm based on a relative entropy based HMM parameter estimator. Our proposed approach is able to overcome the lack of knowledge of post-change parameters, and is illustrated to have similar performance to the popular cumulative sum (CUSUM) algorithm (which requires knowledge of the post-change parameter values) when examined, on both simulated and real data, in a vision-based aircraft manoeuvre detection problem.
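For reference, the CUSUM baseline mentioned above can be sketched for the simplest case of a mean shift in independent Gaussian observations. This illustrates only the classical algorithm, which needs the post-change mean `mu1` to be known; it is not the relative entropy based detector proposed in the paper, and it does not involve an HMM. All parameter names are assumptions.

```python
def cusum_alarm(samples, mu0, mu1, sigma, threshold):
    """Return the index of the first CUSUM alarm, or None if none is raised.

    The statistic accumulates the log-likelihood ratio of the post-change
    model N(mu1, sigma^2) against the pre-change model N(mu0, sigma^2) and
    is reset to zero whenever it would become negative.
    """
    statistic = 0.0
    for t, x in enumerate(samples):
        log_lr = (mu1 - mu0) * (x - 0.5 * (mu0 + mu1)) / sigma ** 2
        statistic = max(0.0, statistic + log_lr)
        if statistic > threshold:
            return t
    return None
```

A larger threshold lowers the false alarm rate at the cost of a longer detection delay, which is the trade-off any sequential change detector has to manage.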
Abstract:
We investigate the utility to computational Bayesian analyses of a particular family of recursive marginal likelihood estimators characterized by the (equivalent) algorithms known as "biased sampling" or "reverse logistic regression" in the statistics literature and "the density of states" in physics. Through a pair of numerical examples (including mixture modeling of the well-known galaxy dataset) we highlight the remarkable diversity of sampling schemes amenable to such recursive normalization, as well as the notable efficiency of the resulting pseudo-mixture distributions for gauging prior-sensitivity in the Bayesian model selection context. Our key theoretical contributions are to introduce a novel heuristic ("thermodynamic integration via importance sampling") for qualifying the role of the bridging sequence in this procedure, and to reveal various connections between these recursive estimators and the nested sampling technique.
Abstract:
Research problem: Overfitting and collinearity problems commonly exist in current construction cost estimation applications and obstruct researchers and practitioners in achieving better modelling results. Research objective and method: A hybrid approach of Akaike information criterion (AIC) stepwise regression and principal component regression (PCR) is proposed to help solve overfitting and collinearity problems. Utilization of this approach in linear regression is validated by comparing it with other commonly used approaches. The mean square error obtained by leave-one-out cross validation (MSE_LOOCV) is used in model selection in deciding predictive variables.
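The leave-one-out mean square error used for model selection here can be sketched in a few lines. The example compares an intercept-only model with a one-predictor linear regression; it does not reproduce the AIC stepwise or PCR steps of the paper, and the helper names (`fit_mean`, `fit_line`, `loocv_mse`) are illustrative assumptions.

```python
def fit_mean(xs, ys):
    """Intercept-only model: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Ordinary least squares with one predictor (xs must not be constant)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def loocv_mse(xs, ys, fit):
    """Refit the model n times, each time predicting the held-out point."""
    squared_errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        predict = fit(train_x, train_y)
        squared_errors.append((ys[i] - predict(xs[i])) ** 2)
    return sum(squared_errors) / len(squared_errors)
```

The model with the smaller leave-one-out error would be preferred; because each prediction is made for a point that the fit never saw, adding a useless predictor tends to raise this error rather than lower it.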