28 results for leave one out cross validation
Abstract:
In this chapter, we elaborate on the well-known relationship between Gaussian processes (GP) and Support Vector Machines (SVM). We then present approximate solutions for two computational problems arising in GP and SVM. The first is the calculation of the posterior mean for GP classifiers using a 'naive' mean field approach. The second is a leave-one-out estimator for the generalization error of SVM based on a linear response method. Simulation results on a benchmark dataset show similar performance for the GP mean field algorithm and the SVM algorithm. The approximate leave-one-out estimator is found to be in very good agreement with the exact leave-one-out error.
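Purely as an illustration of the quantity being approximated, the sketch below computes the exact leave-one-out error of an SVM by refitting once per held-out example, using scikit-learn and a stock dataset as a stand-in benchmark; the chapter's linear response estimator itself is not reproduced here.

```python
# A minimal sketch (not the chapter's estimator): the exact leave-one-out error
# of an SVM, i.e. the reference an approximate LOO estimator is checked against.
from sklearn.datasets import load_breast_cancer  # stand-in benchmark dataset
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))

# Exact LOO: refit the model n times, each time leaving one example out.
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("exact leave-one-out error:", 1.0 - scores.mean())
```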
Abstract:
Background - Vaccine development in the post-genomic era often begins with the in silico screening of genome information, with the most probable protective antigens being predicted rather than requiring causative microorganisms to be grown. Despite the obvious advantages of this approach – such as speed and cost efficiency – its success remains dependent on the accuracy of antigen prediction. Most approaches use sequence alignment to identify antigens. This is problematic for several reasons. Some proteins lack obvious sequence similarity, although they may share similar structures and biological properties. The antigenicity of a sequence may be encoded in a subtle and recondite manner not amenable to direct identification by sequence alignment. The discovery of truly novel antigens will be frustrated by their lack of similarity to antigens of known provenance. To overcome the limitations of alignment-dependent methods, we propose a new alignment-free approach for antigen prediction, which is based on auto cross covariance (ACC) transformation of protein sequences into uniform vectors of principal amino acid properties. Results - Bacterial, viral and tumour protein datasets were used to derive models for prediction of whole protein antigenicity. Every set consisted of 100 known antigens and 100 non-antigens. The derived models were tested by internal leave-one-out cross-validation and external validation using test sets. An additional five training sets for each class of antigens were used to test the stability of the discrimination between antigens and non-antigens. The models performed well in both validations showing prediction accuracy of 70% to 89%. The models were implemented in a server, which we call VaxiJen. Conclusion - VaxiJen is the first server for alignment-independent prediction of protective antigens. It was developed to allow antigen classification solely based on the physicochemical properties of proteins without recourse to sequence alignment. The server can be used on its own or in combination with alignment-based prediction methods.
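As a rough illustration of the ACC idea, the sketch below transforms a protein sequence into a fixed-length vector of auto cross covariance terms. The property scale and sequence are placeholders chosen for this example, not the principal amino acid property descriptors used by VaxiJen.

```python
import numpy as np

# Hydropathy-style property scale (illustrative values; not VaxiJen's descriptors).
HYDROPATHY = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8, "G": -0.4,
    "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8, "M": 1.9, "N": -3.5,
    "P": -1.6, "Q": -3.5, "R": -4.5, "S": -0.8, "T": -0.7, "V": 4.2,
    "W": -0.9, "Y": -1.3,
}

def acc_transform(sequence, scales, max_lag=8):
    """Auto cross covariance terms A_{jk}(l) for lags l = 1..max_lag."""
    props = np.array([[scale[aa] for aa in sequence] for scale in scales])  # (p, n)
    props = props - props.mean(axis=1, keepdims=True)
    p, n = props.shape
    terms = []
    for lag in range(1, max_lag + 1):
        for j in range(p):
            for k in range(p):
                terms.append(np.dot(props[j, :n - lag], props[k, lag:]) / (n - lag))
    return np.array(terms)  # fixed length regardless of sequence length

vec = acc_transform("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", [HYDROPATHY], max_lag=4)
print(vec.shape)  # (4,) here; p * p * max_lag terms in general
```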
Abstract:
The accurate identification of T-cell epitopes remains a principal goal of bioinformatics within immunology. As the immunogenicity of peptide epitopes is dependent on their binding to major histocompatibility complex (MHC) molecules, the prediction of binding affinity is a prerequisite to the reliable prediction of epitopes. The iterative self-consistent (ISC) partial-least-squares (PLS)-based additive method is a recently developed bioinformatic approach for predicting class II peptide-MHC binding affinity. The ISC-PLS method overcomes many of the conceptual difficulties inherent in the prediction of class II peptide-MHC affinity, such as the binding of a mixed population of peptide lengths due to the open-ended class II binding site. The method has applications in both the accurate prediction of class II epitopes and the manipulation of affinity for heteroclitic and competitor peptides. The method is applied here to six class II mouse alleles (I-Ab, I-Ad, I-Ak, I-As, I-Ed, and I-Ek) and includes peptides up to 25 amino acids in length. A series of regression equations highlighting the quantitative contributions of individual amino acids at each peptide position was established. The initial model for each allele exhibited only moderate predictivity. Once the set of selected peptide subsequences had converged, the final models exhibited satisfactory predictive power. Convergence was reached between the 4th and 17th iterations, and the leave-one-out cross-validation statistical terms - q2, SEP, and NC - ranged between 0.732 and 0.925, 0.418 and 0.816, and 1 and 6, respectively. The non-cross-validated statistical terms r2 and SEE ranged between 0.98 and 0.995 and 0.089 and 0.180, respectively. The peptides used in this study are available from the AntiJen database (http://www.jenner.ac.uk/AntiJen). The PLS method is available commercially in the SYBYL molecular modeling software package. The resulting models, which can be used for accurate T-cell epitope prediction, will be made freely available online (http://www.jenner.ac.uk/MHCPred).
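For readers unfamiliar with the statistics quoted (q2 and SEP), the sketch below computes a leave-one-out q2 and SEP for a PLS regression on synthetic descriptors; it is not the ISC-PLS additive method itself, and the descriptors stand in for the per-position amino acid terms of that method.

```python
# A minimal sketch of the leave-one-out q^2 and SEP statistics for PLS regression.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))                          # placeholder peptide descriptors
y = X[:, :3].sum(axis=1) + 0.1 * rng.normal(size=60)   # placeholder log affinities

pls = PLSRegression(n_components=3)
y_loo = cross_val_predict(pls, X, y, cv=LeaveOneOut()).ravel()

press = np.sum((y - y_loo) ** 2)           # predictive residual sum of squares
ss = np.sum((y - y.mean()) ** 2)
q2 = 1.0 - press / ss                      # cross-validated q^2
sep = np.sqrt(press / len(y))              # one common definition of SEP
print(f"q2 = {q2:.3f}, SEP = {sep:.3f}")
```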
Abstract:
An interoperable web processing service (WPS) for the automatic interpolation of environmental data has been developed within the framework of the INTAMAP project. In order to assess the performance of the interpolation method implemented, a validation WPS has also been developed. This validation WPS can be used to perform leave-one-out and K-fold cross-validation: a full dataset is submitted and a range of validation statistics and diagnostic plots (e.g. histograms, variogram of residuals, mean errors) is received in return. This paper presents the architecture of the validation WPS and uses a case study to briefly illustrate its use in practice. We conclude with a discussion of the current limitations of the system and make proposals for further developments.
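The sketch below mimics the kind of validation summary such a service returns (leave-one-out and K-fold residual statistics), with a k-nearest-neighbour regressor standing in for the INTAMAP interpolation method and synthetic point data in place of a submitted dataset.

```python
# A minimal sketch of hold-out validation for a spatial interpolator:
# withhold points, re-interpolate them, and summarise the residuals.
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_predict
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
coords = rng.uniform(0, 100, size=(200, 2))               # measurement locations
values = np.sin(coords[:, 0] / 15) + 0.1 * rng.normal(size=200)

interpolator = KNeighborsRegressor(n_neighbors=5, weights="distance")

for name, cv in [("10-fold", KFold(n_splits=10, shuffle=True, random_state=0)),
                 ("leave-one-out", LeaveOneOut())]:
    pred = cross_val_predict(interpolator, coords, values, cv=cv)
    residuals = values - pred
    print(f"{name}: mean error = {residuals.mean():.3f}, "
          f"RMSE = {np.sqrt((residuals ** 2).mean()):.3f}")
```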
Abstract:
It is known theoretically that an algorithm cannot be good for an arbitrary prior. We show that in practical terms this also applies to the technique of "cross validation", which has been widely regarded as defying this general rule. Numerical examples are analysed in detail. Their implications for research on learning algorithms are discussed.
Abstract:
We derive a mean field algorithm for binary classification with Gaussian processes which is based on the TAP approach originally proposed in the statistical physics of disordered systems. The theory also yields an approximate leave-one-out estimator for the generalization error which is computed with no extra computational cost. We show that from the TAP approach, it is possible to derive both a simpler 'naive' mean field theory and support vector machines (SVM) as limiting cases. For both mean field algorithms and support vector machines, simulation results for three small benchmark data sets are presented. They show (1) that one may get state-of-the-art performance by using the leave-one-out estimator for model selection and (2) that the built-in leave-one-out estimators are extremely precise when compared to the exact leave-one-out estimate. The latter result is taken as strong support for the internal consistency of the mean field approach.
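As context for the comparison reported above, the sketch below computes the exact leave-one-out error of a Gaussian process classifier the expensive way, by refitting once per held-out point; the TAP/mean field built-in estimator approximates this quantity without any refitting and is not implemented here.

```python
# A minimal sketch of the exact leave-one-out baseline for a GP classifier.
from sklearn.datasets import load_iris
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
y = (y == 0).astype(int)                      # reduce to a binary problem

gpc = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0))
scores = cross_val_score(gpc, X, y, cv=LeaveOneOut())   # one refit per point
print("exact leave-one-out error:", 1.0 - scores.mean())
```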
Abstract:
Background: Identifying biological markers to aid diagnosis of bipolar disorder (BD) is critically important. To be considered a possible biological marker, neural patterns in BD should be discriminable from those in healthy individuals (HI). We examined patterns of neuromagnetic responses revealed by magnetoencephalography (MEG) during implicit emotion processing using emotional (happy, fearful, sad) and neutral facial expressions, in sixteen individuals with BD and sixteen age- and gender-matched healthy individuals. Methods: Neuromagnetic data were recorded using a 306-channel whole-head MEG ELEKTA Neuromag System and preprocessed using Signal Space Separation as implemented in MaxFilter (ELEKTA). Custom Matlab programs removed EOG and ECG signals from the filtered MEG data and computed means of epoched data (0-250 ms, 250-500 ms, 500-750 ms). A generalized linear model with three factors (individual, emotion intensity and time) compared BD and HI. A principal component analysis (PCA) of normalized mean channel data in selected brain regions identified principal components that explained 95% of the data variation. These components were used in a quadratic support vector machine (SVM) pattern classifier, whose performance was assessed using the leave-one-out approach. Results: BD and HI showed significantly different patterns of activation for 0-250 ms within both left occipital and temporal regions, specifically for neutral facial expressions. The PCA revealed significant differences between BD and HI for mild fearful, happy, and sad facial expressions within 250-500 ms. The quadratic SVM classifier showed the greatest accuracy (84%) and sensitivity (92%) for neutral faces, in left occipital regions within 500-750 ms. Conclusions: MEG responses may be used in the search for disease-specific neural markers.
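A minimal sketch of the analysis pipeline described (PCA retaining 95% of the variance, followed by a quadratic SVM assessed by leave-one-out) is given below, run on synthetic data in place of the MEG channel means.

```python
# A minimal sketch of PCA (95% variance) + quadratic SVM with LOO assessment.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(32, 100))        # 16 + 16 subjects, placeholder channel means
y = np.repeat([1, 0], 16)             # 1 = BD, 0 = HI
X[y == 1] += 0.4                      # synthetic group difference

clf = make_pipeline(StandardScaler(),
                    PCA(n_components=0.95),    # components explaining 95% variance
                    SVC(kernel="poly", degree=2))

pred = cross_val_predict(clf, X, y, cv=LeaveOneOut())
print("LOO accuracy:", accuracy_score(y, pred))
print("LOO sensitivity:", recall_score(y, pred))  # sensitivity for the BD class
```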
Abstract:
This project explored how consumers in emerging economies evaluate brand extensions, using China as a case. Two separate but related studies were conducted, and university students were used as respondents in both. Study one, a replication study, tested Aaker and Keller's brand extension model in China. Adopting methods similar to Aaker and Keller's, six well-recognised brands were chosen as parent brands and each was extended to three product categories. In total, 469 respondents completed the survey questionnaire; as each evaluated six extensions, this yielded 2,814 cases. The data were analysed using the Optimal Least Square regression approach and the "residual centred" approach, respectively. The results confirmed most of the findings observed in developed countries. Specifically, consumers' attitude towards the extension is primarily driven by the brand affect, the fit between the two product categories, and the difficulty of making the extension, and is moderated via the interactions between the brand affect and the fit variables. Study two refined and extended Aaker and Keller's model by adding new variables and making methodological adjustments. The same stimuli and data analysis techniques as those in the replication were employed. 252 respondents participated in the survey and each evaluated six extensions, making 1,512 cases. In addition to re-verifying the findings of the replication and providing cross-validation of those findings, the extended study found that the image consistency between the parent brand and the extension and the competition intensity of the extension product market were important in determining the success of the extension. Further, consumers differed in evaluating durable and non-durable extensions. The thesis detailed the two studies above and discussed the findings and their implications in relation to the branding literature, the general situation of emerging economies, and the realities of China. It also presented the limitations of the research and directions for future research.
Abstract:
The K-means algorithm is one of the most popular clustering algorithms in current use as it is relatively fast yet simple to understand and deploy in practice. Nevertheless, its use entails certain restrictive assumptions about the data, the negative consequences of which are not always immediately apparent, as we demonstrate. While more flexible algorithms have been developed, their widespread use has been hindered by their computational and technical complexity. Motivated by these considerations, we present a flexible alternative to K-means that relaxes most of the assumptions whilst remaining almost as fast and simple. This novel algorithm, which we call MAP-DP (maximum a-posteriori Dirichlet process mixtures), is statistically rigorous as it is based on nonparametric Bayesian Dirichlet process mixture modeling. This approach allows us to overcome most of the limitations imposed by K-means. The number of clusters K is estimated from the data instead of being fixed a priori as in K-means. In addition, while K-means is restricted to continuous data, the MAP-DP framework can be applied to many kinds of data, for example binary, count or ordinal data. Also, it can efficiently separate outliers from the data. This additional flexibility does not incur a significant computational overhead compared to K-means, with MAP-DP convergence typically achieved in the order of seconds for many practical problems. Finally, in contrast to K-means, since the algorithm is based on an underlying statistical model, the MAP-DP framework can deal with missing data and enables model testing such as cross-validation in a principled way. We demonstrate the simplicity and effectiveness of this algorithm on the health informatics problem of clinical sub-typing in a cluster of diseases known as parkinsonism.
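The sketch below illustrates the central contrast, with scikit-learn's variational Dirichlet process Gaussian mixture used as an accessible stand-in for MAP-DP (it is not the MAP-DP algorithm itself): K-means requires K in advance, whereas the DP mixture infers the number of occupied clusters from the data.

```python
# A minimal sketch: fixed-K K-means vs. a DP mixture that infers K from the data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=0)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print("K-means used K =", kmeans.n_clusters, "(fixed in advance)")

dpgmm = BayesianGaussianMixture(
    n_components=20,                              # upper bound, not the answer
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X)
labels = dpgmm.predict(X)
print("clusters actually used by the DP mixture:", np.unique(labels).size)
```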
Abstract:
The Dirichlet process mixture model (DPMM) is a ubiquitous, flexible Bayesian nonparametric statistical model. However, full probabilistic inference in this model is analytically intractable, so computationally intensive techniques such as Gibbs sampling are required. As a result, DPMM-based methods, which have considerable potential, are restricted to applications in which computational resources and time for inference are plentiful. For example, they would not be practical for digital signal processing on embedded hardware, where computational resources are at a serious premium. Here, we develop a simplified yet statistically rigorous approximate maximum a-posteriori (MAP) inference algorithm for DPMMs. This algorithm is as simple as DP-means clustering and solves the MAP problem as well as Gibbs sampling does, while requiring only a fraction of the computational effort. (For freely available code that implements the MAP-DP algorithm for Gaussian mixtures see http://www.maxlittle.net/.) Unlike related small variance asymptotics (SVA), our method is non-degenerate and so inherits the “rich get richer” property of the Dirichlet process. It also retains a non-degenerate closed-form likelihood, which enables out-of-sample calculations and the use of standard tools such as cross-validation. We illustrate the benefits of our algorithm on a range of examples and contrast it with variational, SVA and sampling approaches from both a computational complexity perspective and in terms of clustering performance. We demonstrate the wide applicability of our approach by presenting an approximate MAP inference method for the infinite hidden Markov model, whose performance contrasts favorably with a recently proposed hybrid SVA approach. Similarly, we show how our algorithm can be applied to a semiparametric mixed-effects regression model in which the random effects distribution is modelled using an infinite mixture model, as used in longitudinal progression modelling in population health science. Finally, we propose directions for future research on approximate MAP inference in Bayesian nonparametrics.
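The sketch below illustrates the out-of-sample use that a non-degenerate likelihood enables: held-out log-likelihood computed by cross-validation. As before, scikit-learn's variational DP Gaussian mixture stands in for the paper's MAP-DP model.

```python
# A minimal sketch of cross-validated held-out log-likelihood for a DP mixture.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture
from sklearn.model_selection import KFold

X, _ = make_blobs(n_samples=600, centers=3, random_state=1)

held_out = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=1).split(X):
    model = BayesianGaussianMixture(
        n_components=15,
        weight_concentration_prior_type="dirichlet_process",
        random_state=1,
    ).fit(X[train_idx])
    held_out.append(model.score(X[test_idx]))   # mean log-likelihood per point
print("cross-validated log-likelihood:", np.mean(held_out))
```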
Abstract:
The thrust of this report concerns spline theory and some of the background to spline theory, and follows the development in Wahba (1991). We also review methods for determining hyper-parameters, such as the smoothing parameter, by Generalised Cross Validation. Splines have an advantage over Gaussian Process based procedures in that we can readily impose atmospherically sensible smoothness constraints and maintain computational efficiency. Vector splines enable us to penalise gradients of vorticity and divergence in wind fields. Two similar techniques are summarised, and improvements based on robust error functions and restricted numbers of basis functions are given. A final, brief discussion of the application of vector splines to the problem of scatterometer data assimilation highlights the problems of ambiguous solutions.
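As a concrete illustration of Generalised Cross Validation for a smoothing parameter, the sketch below fits a penalised spline (a truncated power basis with a ridge penalty on the knot terms, an assumption of this example rather than the report's vector splines) and selects the penalty weight by minimising the GCV score.

```python
# A minimal sketch of GCV for choosing the smoothing parameter of a penalised spline.
import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, 120))
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=x.size)

knots = np.linspace(0.05, 0.95, 20)
# Design matrix: cubic polynomial part plus truncated cubic terms at each knot.
B = np.column_stack([np.ones_like(x), x, x**2, x**3,
                     np.clip(x[:, None] - knots[None, :], 0, None) ** 3])
P = np.diag([0.0] * 4 + [1.0] * knots.size)      # penalise only the knot terms

def gcv_score(lam):
    # Influence (hat) matrix A(lam) = B (B'B + lam P)^{-1} B'
    A = B @ np.linalg.solve(B.T @ B + lam * P, B.T)
    resid = y - A @ y
    n = y.size
    return n * np.sum(resid**2) / (n - np.trace(A)) ** 2

lams = 10.0 ** np.arange(-8, 3)
best = min(lams, key=gcv_score)
print("GCV-selected smoothing parameter:", best)
```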
Abstract:
This Letter addresses image segmentation via a generative model approach. A Bayesian network (BNT) in the space of dyadic wavelet transform coefficients is introduced to model texture images. The model is similar to a hidden Markov model (HMM), but with non-stationary transition conditional probability distributions. It is composed of discrete hidden variables and observable Gaussian outputs for wavelet coefficients. In particular, the Gabor wavelet transform is considered. The introduced model is compared with the simplest joint Gaussian probabilistic model for Gabor wavelet coefficients for several textures from the Brodatz album [1]. The comparison is based on cross-validation and includes probabilistic model ensembles instead of single models. In addition, the robustness of the models to additive Gaussian noise is investigated. We further study the feasibility of the introduced generative model for image segmentation in the novelty detection framework [2]. Two examples are considered: (i) sea surface pollution detection from intensity images and (ii) image segmentation of still images with varying illumination across the scene.
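The sketch below illustrates the baseline side of the comparison: fitting the simplest joint Gaussian model to Gabor wavelet coefficients and scoring it by cross-validated log-likelihood. It uses a synthetic texture and skimage's Gabor filter, not the Brodatz images or the Bayesian network model itself.

```python
# A minimal sketch: a joint Gaussian model for Gabor coefficients, scored by CV.
import numpy as np
from scipy.stats import multivariate_normal
from skimage.filters import gabor
from sklearn.model_selection import KFold

rng = np.random.default_rng(4)
texture = rng.normal(size=(128, 128))                       # placeholder texture

# Gabor responses at a few orientations form the coefficient vector per pixel.
responses = [gabor(texture, frequency=0.2, theta=t)[0]      # real part
             for t in (0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)]
coeffs = np.stack([r.ravel() for r in responses], axis=1)   # (n_pixels, 4)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(coeffs):
    mean = coeffs[train_idx].mean(axis=0)
    cov = np.cov(coeffs[train_idx], rowvar=False)
    scores.append(multivariate_normal(mean, cov).logpdf(coeffs[test_idx]).mean())
print("cross-validated log-likelihood per pixel:", np.mean(scores))
```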
Abstract:
The work presented in this thesis was aimed at assessing the efficacy of lithium in the acute treatment of mania and for the prophylaxis of bipolar disorder, and at investigating the value of plasma haloperidol concentration for predicting response to treatment in schizophrenia. The pharmacogenetics of psychotropic drugs is critically appraised to provide insights into interindividual variability in response to pharmacotherapy. In clinical trials of acute mania, a number of measures have been used to assess the severity of illness and its response to treatment. Rating instruments need to be validated in order for a clinical study to provide reliable and meaningful estimates of treatment effects. Eight symptom-rating scales were identified and critically assessed. The Mania Rating Scale (MRS) was the most commonly used for assessing treatment response. The advantage of the MRS is that there is a relatively extensive database of studies based on it, and this will no doubt ensure that it remains a gold standard for the foreseeable future. Other useful rating scales are available for measuring mania, but further cross-validation and validation against clinically meaningful global changes are required. A total of 658 patients from 12 trials were included in an evaluation of the efficacy of lithium in the treatment of acute mania. Treatment periods ranged from 3 to 4 weeks. Efficacy was estimated using (i) the differences in the reduction in mania severity scores, and (ii) the ratio and difference in improvement response rates. The response rate ratio for lithium against placebo was 1.95 (95% CI 1.17 to 3.23). The mean number needed to treat was 5 (95% CI 3 to 20). Patients were twice as likely to obtain remission with lithium as with chlorpromazine (rate ratio = 1.96, 95% CI 1.02 to 3.77). The mean number needed to treat (NNT) was 4 (95% CI 3 to 9). Neither carbamazepine nor valproate was more effective than lithium. The response rate ratios were 1.01 (95% CI 0.54 to 1.88) for lithium compared to carbamazepine and 1.22 (95% CI 0.91 to 1.64) for lithium against valproate. Haloperidol was no better than lithium on the basis of improvement in global severity assessments. The differences in effects between lithium and risperidone were -2.79 (95% CI -4.22 to -1.36) in favour of risperidone with respect to symptom severity improvement and -0.76 (95% CI -1.11 to -0.41) on the basis of reduction in global severity of disease. Symptom and global severity were at least as well controlled with lithium as with verapamil. Lithium caused more side-effects than placebo and verapamil, but no more than carbamazepine or valproate. A total of 554 patients from 13 trials were included in the statistical analysis of lithium's efficacy in the prophylaxis of bipolar disorder. The mean follow-up period ranged from 5 to 34 months. The relapse risk ratio for lithium versus placebo was 0.47 (95% CI 0.26 to 0.86) and the NNT was 3 (95% CI 2 to 7). The relapse risk ratio for lithium versus imipramine was 0.62 (95% CI 0.46 to 0.84) and the NNT was 4 (95% CI 3 to 7). The combination of lithium and imipramine was no more effective than lithium alone. The risk of relapse was greater with lithium alone than with the lithium-divalproate combination; a risk difference of 0.60 (95% CI 0.21 to 0.99) and an NNT of 2 (95% CI 1 to 5) were obtained. Lithium was as effective as carbamazepine.
Based on individual data concerning plasma haloperidol concentration and percent improvement in psychotic symptoms, our results suggest an acceptable concentration range of 11.20-30.30 ng/mL. A minimum of 2 weeks should be allowed before evaluating therapeutic response. Monitoring of drug plasma levels seems not to be necessary unless behavioural toxicity or noncompliance is suspected. Pharmacokinetics and pharmacodynamics, which are mainly determined by genetic factors, contribute to interindividual and interethnic variations in clinical response to drugs. These variations are primarily due to differences in drug metabolism. Variability in the pharmacokinetics of a number of drugs is associated with oxidation polymorphism. Debrisoquine/sparteine hydroxylase (CYP2D6) and S-mephenytoin hydroxylase (CYP2C19) are polymorphic P450 enzymes of particular importance in psychopharmacotherapy. These enzymes are responsible for the metabolism of many commonly used antipsychotic and antidepressant drugs. The incidence of poor metabolisers of debrisoquine and S-mephenytoin varies widely among populations. Ethnic variations in polymorphic isoenzymes may, at least in part, explain ethnic differences in response to antipsychotic and antidepressant pharmacotherapy.
Abstract:
Fifteen Miscanthus genotypes grown in five locations across Europe were analysed to investigate the influence of genetic and environmental factors on cell wall composition. Chemometric techniques combining near infrared reflectance spectroscopy (NIRS) and conventional chemical analyses were used to construct calibration models for determination of acid detergent lignin (ADL), acid detergent fibre (ADF), and neutral detergent fibre (NDF) from sample spectra. The results generated were subsequently converted to lignin, cellulose and hemicellulose content and used to assess the genetic and environmental variation in cell wall composition of Miscanthus and to identify genotypes which display quality traits suitable for exploitation in a range of energy conversion systems. The NIRS calibration models developed were found to predict concentrations with a good degree of accuracy based on the coefficient of determination (R2), standard error of calibration (SEC), and standard error of cross-validation (SECV) values. Across all sites, mean lignin, cellulose and hemicellulose values in the winter harvest ranged from 76 to 115 g kg⁻¹, 412 to 529 g kg⁻¹, and 235 to 338 g kg⁻¹, respectively. Overall, of the 15 genotypes, Miscanthus x giganteus and Miscanthus sacchariflorus contained the highest lignin and cellulose concentrations in the winter harvest. The degree of observed genotypic variation in cell wall composition indicates good potential for plant breeding and for matching feedstocks to different energy conversion processes.
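For readers unfamiliar with the calibration statistics quoted (SEC, SECV), the sketch below computes both for a PLS model on synthetic spectra standing in for the NIRS data; degrees-of-freedom conventions for SEC vary, so one common form is used.

```python
# A minimal sketch of SEC and SECV for a PLS calibration model.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(5)
spectra = rng.normal(size=(80, 300))                  # placeholder NIR spectra
lignin = spectra[:, 50:60].mean(axis=1) * 100 + 95    # placeholder reference values

pls = PLSRegression(n_components=6).fit(spectra, lignin)

fitted = pls.predict(spectra).ravel()
sec = np.sqrt(np.sum((lignin - fitted) ** 2) / (len(lignin) - 1))   # calibration error

cv_pred = cross_val_predict(pls, spectra, lignin,
                            cv=KFold(n_splits=10, shuffle=True, random_state=0)).ravel()
secv = np.sqrt(np.mean((lignin - cv_pred) ** 2))                    # cross-validation error
print(f"SEC = {sec:.2f} g kg^-1, SECV = {secv:.2f} g kg^-1")
```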
Abstract:
We investigate two numerical procedures for the Cauchy problem in linear elasticity, involving the relaxation of either the given boundary displacements (Dirichlet data) or the prescribed boundary tractions (Neumann data) on the over-specified boundary, in the alternating iterative algorithm of Kozlov et al. (1991). The two mixed direct (well-posed) problems associated with each iteration are solved using the method of fundamental solutions (MFS), in conjunction with the Tikhonov regularization method, while the optimal value of the regularization parameter is chosen via the generalized cross-validation (GCV) criterion. An efficient regularizing stopping criterion, which ceases the iterative procedure at the point where the accumulation of noise becomes dominant and the errors in predicting the exact solutions increase, is also presented. The MFS-based iterative algorithms with relaxation are tested for Cauchy problems for isotropic linear elastic materials in various geometries to confirm the numerical convergence, stability, accuracy and computational efficiency of the proposed method.
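As an illustration of the GCV criterion used to pick the Tikhonov regularization parameter, the sketch below applies it to a generic discrete ill-posed system via the SVD; the MFS matrices and the Cauchy problem itself are not reproduced.

```python
# A minimal sketch of Tikhonov regularization with the parameter chosen by GCV.
import numpy as np

rng = np.random.default_rng(6)
n = 60
A = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])  # ill-conditioned (Hilbert) matrix
x_true = np.sin(np.linspace(0, np.pi, n))
b = A @ x_true + 1e-4 * rng.normal(size=n)

U, s, Vt = np.linalg.svd(A)
beta = U.T @ b

def gcv(lam):
    # Tikhonov filter factors f_i = s_i^2 / (s_i^2 + lam^2)
    f = s**2 / (s**2 + lam**2)
    resid2 = np.sum(((1 - f) * beta) ** 2)        # squared residual norm
    return resid2 / (n - np.sum(f)) ** 2          # GCV function of lam

lams = 10.0 ** np.linspace(-8, 0, 50)
lam_opt = min(lams, key=gcv)
x_reg = Vt.T @ ((s / (s**2 + lam_opt**2)) * beta)  # regularised solution
print("GCV-selected parameter:", lam_opt)
print("relative error:", np.linalg.norm(x_reg - x_true) / np.linalg.norm(x_true))
```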