927 results for TRAINING SET


Abstract:

Gaussian processes (GPs) are promising Bayesian methods for classification and regression problems. Designing a GP classifier and making predictions with it are, however, computationally demanding, especially when the training set is large. Sparse GP classifiers are known to overcome this limitation. In this letter, we propose and study a validation-based method for sparse GP classifier design. The proposed method uses a negative log predictive (NLP) loss measure, which is easy to compute for GP models. We use this measure for both basis vector selection and hyperparameter adaptation. The experimental results on several real-world benchmark data sets show better or comparable generalization performance over existing methods.
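
As a rough illustration of the loss measure named in this abstract, the sketch below computes a negative log predictive loss on a validation set for binary labels. The averaging over examples, the function name `nlp_loss`, and the toy probabilities are assumptions made for illustration; the abstract does not give the exact form used by the authors.

```python
import math


def nlp_loss(probs_positive, labels):
    """Mean negative log predictive probability for binary labels in {0, 1}.

    probs_positive[i] is the model's predicted probability that example i
    belongs to class 1. Averaging over examples is an assumption here.
    """
    eps = 1e-12  # guard against log(0) for overconfident predictions
    total = 0.0
    for p, y in zip(probs_positive, labels):
        p_true = p if y == 1 else 1.0 - p  # probability given to the true class
        total += -math.log(max(p_true, eps))
    return total / len(labels)


# Toy validation predictions (made up for illustration only).
validation_probs = [0.9, 0.2, 0.7, 0.4]
validation_labels = [1, 0, 1, 0]
loss = nlp_loss(validation_probs, validation_labels)
print(f"validation NLP loss: {loss:.4f}")
```

A lower value means the predictive probabilities assigned to the true labels are higher, which is why a measure of this kind can serve as a selection criterion for basis vectors or hyperparameters.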

Abstract:

A fuzzy dynamic flood routing model (FDFRM) for natural channels is presented, wherein the flood wave can be approximated by a monoclinal wave. This study modifies an earlier published work by the same authors, in which the wave was of the gravity type. The momentum equation of the dynamic wave model is replaced by a fuzzy rule based model, while the continuity equation is retained in its complete form. Hence, the FDFRM avoids the assumptions associated with the momentum equation. It also removes the need to calculate the friction slope (S_f) in flood routing, and hence the associated uncertainties are eliminated. The fuzzy rule based model is developed on an equation for wave velocity, which is obtained in terms of discontinuities in the gradient of flow parameters. The channel reach is divided into a number of approximately uniform sub-reaches. The training set required for development of the fuzzy rule based model for each sub-reach is obtained from the discharge-area relationship at its mean section. For highly heterogeneous sub-reaches, optimized fuzzy rule based models are obtained by means of a neuro-fuzzy algorithm. For demonstration, the FDFRM is applied to flood routing problems in a fictitious channel with a single uniform reach, in a fictitious channel with two uniform sub-reaches, and in a natural channel with a number of approximately uniform sub-reaches. It is observed that for the fictitious channels, the FDFRM outputs match well with those of an implicit numerical model (INM), which solves the dynamic wave equations using an implicit numerical scheme. For the natural channel, the FDFRM outputs are comparable to those of the HEC-RAS model.

Abstract:

BACKGROUND Polygenic risk scores comprising established susceptibility variants have been shown to be informative classifiers for several complex diseases, including prostate cancer. For prostate cancer, it is unknown whether inclusion of genetic markers that have so far not been associated with prostate cancer risk at a genome-wide significant level will improve disease prediction. METHODS We built polygenic risk scores in a large training set comprising over 25,000 individuals. Initially, 65 established prostate cancer susceptibility variants were selected. After LD pruning, additional variants were prioritized based on their association with prostate cancer. Six-fold cross validation was performed to assess genetic risk scores and optimize the number of additional variants to be included. The final model was evaluated in an independent study population including 1,370 cases and 1,239 controls. RESULTS The polygenic risk score with 65 established susceptibility variants provided an area under the curve (AUC) of 0.67. Adding 68 novel variants significantly increased the AUC to 0.68 (P = 0.0012) and the net reclassification index by 0.21 (P = 8.5E-08). All novel variants were located in genomic regions established as associated with prostate cancer risk. CONCLUSIONS Inclusion of additional genetic variants from established prostate cancer susceptibility regions improves disease prediction. Prostate 75:1467–1474, 2015. © 2015 Wiley Periodicals, Inc.
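
For readers unfamiliar with the quantities in this abstract, the sketch below shows one common way to form a polygenic risk score (a weighted sum of risk-allele counts) and to compute the AUC as the fraction of case/control pairs ranked correctly. The weights, genotypes, and function names are hypothetical toy values; the abstract does not state how the authors weighted variants, so this should not be read as their procedure.

```python
def polygenic_score(allele_counts, weights):
    """Weighted sum of risk-allele counts (0, 1 or 2 per variant)."""
    return sum(w * c for w, c in zip(weights, allele_counts))


def auc(case_scores, control_scores):
    """AUC as the probability that a random case outscores a random control.

    Ties count as half a correct ranking.
    """
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical per-variant weights (e.g. log odds ratios) and toy genotypes.
weights = [0.2, 0.1, 0.3]
cases = [[2, 1, 1], [1, 2, 2]]
controls = [[0, 1, 0], [1, 0, 1], [2, 2, 1]]

case_scores = [polygenic_score(g, weights) for g in cases]
control_scores = [polygenic_score(g, weights) for g in controls]
toy_auc = auc(case_scores, control_scores)
print(f"toy AUC: {toy_auc:.3f}")
```

In this toy data one control outscores one case, so five of the six case/control pairs are ranked correctly.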

Abstract:

The swelling pressure of soil depends upon various soil parameters such as mineralogy, clay content, Atterberg's limits, dry density, moisture content, and initial degree of saturation, along with structural and environmental factors. It is very difficult to model and analyze swelling pressure effectively while taking all of these aspects into consideration. Various statistical and empirical methods have been used to predict the swelling pressure from index properties of soil. In this paper, two computational intelligence techniques, artificial neural networks and support vector machines, are used to develop models based on the available experimental results to predict swelling pressure from the following inputs: natural moisture content, dry density, liquid limit, plasticity index, and clay fraction. The generalization of the models to new data outside the training set, which is required for successful application of a model, is discussed. A detailed study of the relative performance of the two techniques has been carried out based on different statistical performance criteria.

Abstract:

In this paper we show the applicability of Ant Colony Optimisation (ACO) techniques to the pattern classification problem that arises in tool wear monitoring. In an earlier study, artificial neural networks and genetic programming were successfully applied to the tool wear monitoring problem. ACO is a recent addition to evolutionary computation techniques that has gained attention for its ability to extract the underlying data relationships and express them in the form of simple rules. Rules for data classification are extracted using a training set of data points. These rules are then applied to the data in the testing/validation set to obtain the classification accuracy. A major attraction of ACO-based classification is the possibility of obtaining expert-system-like rules that the user can apply directly in his/her application. The classification accuracy obtained with the ACO-based approach is as good as that obtained with other biologically inspired techniques.

Abstract:

This paper discusses a method for scaling SVMs with a Gaussian kernel function to large data sets by using a selective sampling strategy for the training set. It employs a scalable hierarchical clustering algorithm to construct cluster indexing structures of the training data in the kernel-induced feature space. These are then used for selective sampling of the training data, which makes the SVM training process scalable. Empirical studies on real-world data sets show that the proposed strategy performs well on large data sets.

Abstract:

A two-stage iterative algorithm for selecting a subset of a training set of samples for use in a condensed nearest neighbor (CNN) decision rule is introduced. The proposed method uses the concept of mutual nearest neighborhood for selecting samples close to the decision line. The efficacy of the algorithm is brought out by means of an example.
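
For context, the sketch below implements the classic condensed nearest neighbor idea (Hart's rule), which keeps only the training samples needed for a 1-NN classifier to label the full training set correctly. It is not the two-stage mutual-nearest-neighborhood method proposed in this abstract, whose details are not given here; the function names and the one-dimensional toy data are assumptions for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_label(store, point):
    """Label of the stored sample closest to `point` (1-NN rule)."""
    return min(store, key=lambda item: sq_dist(item[0], point))[1]


def condense(samples):
    """Classic condensed nearest neighbor selection.

    `samples` is a list of (point, label) pairs. Samples misclassified by
    the current store are added to it, and passes repeat until a full pass
    adds nothing. Assumes no identical points carry conflicting labels.
    """
    store = [samples[0]]
    changed = True
    while changed:
        changed = False
        for point, label in samples:
            misclassified = nearest_label(store, point) != label
            if misclassified and (point, label) not in store:
                store.append((point, label))
                changed = True
    return store


# Toy one-dimensional training set with two well-separated classes.
training = [((0.0,), "a"), ((1.0,), "a"), ((2.0,), "a"),
            ((8.0,), "b"), ((9.0,), "b"), ((10.0,), "b")]
subset = condense(training)
print(subset)
```

On well-separated data such as this, the condensed subset is much smaller than the training set while still reproducing its labels under the 1-NN rule.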

Abstract:

We address the problem of designing codes for specific applications using deterministic annealing. Designing a block code over any finite-dimensional space may be thought of as forming the corresponding number of clusters over that space. We have shown that the total distortion incurred in encoding a training set is related to the probability of correct reception over a symmetric channel. While conventional deterministic annealing makes use of the squared Euclidean distance measure, we have developed an algorithm that can be used for clustering with the Hamming distance as the distance measure, which is required in the error-correcting scenario.

Abstract:

Glioblastoma (GBM) is the most common and aggressive primary brain tumor, with very poor median patient survival. To identify a microRNA (miRNA) expression signature that can predict GBM patient survival, we analyzed the miRNA expression data of GBM patients (n = 222) derived from The Cancer Genome Atlas (TCGA) dataset. We divided the patients randomly into training and testing sets with an equal number in each group. We identified 10 significant miRNAs using Cox regression analysis on the training set and formulated a risk score based on the expression signature of these miRNAs that segregated the patients into high and low risk groups with significantly different survival times (hazard ratio [HR] = 2.4; 95% CI = 1.4-3.8; p < 0.0001). Of these 10 miRNAs, 7 were found to be risky miRNAs and 3 were found to be protective. This signature was independently validated in the testing set (HR = 1.7; 95% CI = 1.1-2.8; p = 0.002). GBM patients with high risk scores had overall poor survival compared to the patients with low risk scores. Overall survival among the entire patient set was 35.0% at 2 years, 21.5% at 3 years, 18.5% at 4 years and 11.8% at 5 years in the low risk group, versus 11.0%, 5.5%, 0.0% and 0.0% respectively in the high risk group (HR = 2.0; 95% CI = 1.4-2.8; p < 0.0001). Cox multivariate analysis with patient age as a covariate on the entire patient set identified the risk score based on the 10 miRNA expression signature as an independent predictor of patient survival (HR = 1.120; 95% CI = 1.04-1.20; p = 0.003). Thus we have identified a miRNA expression signature that can predict GBM patient survival. These findings may have implications for the understanding of gliomagenesis, the development of targeted therapy and the selection of high risk cancer patients for adjuvant therapy.
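
A signature-based risk score of the kind described in this abstract is often formed as a coefficient-weighted sum of expression values, with patients then split into high and low risk groups at a cutoff. The sketch below assumes that form and a median cutoff; neither is stated in the abstract. The miRNA names, coefficients, and expression values are hypothetical placeholders, not data from the study.

```python
from statistics import median


def risk_score(expression, coefficients):
    """Weighted sum of expression values.

    A positive coefficient marks a 'risky' miRNA, a negative one a
    'protective' miRNA (assumed convention for this sketch).
    """
    return sum(coefficients[name] * expression[name] for name in coefficients)


def split_by_median(scores):
    """Assign 'high' to scores above the median and 'low' otherwise."""
    cutoff = median(scores.values())
    return {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}


# Hypothetical coefficients and expression values, for illustration only.
coefficients = {"miR-A": 0.8, "miR-B": -0.5}
patients = {
    "p1": {"miR-A": 2.0, "miR-B": 1.0},
    "p2": {"miR-A": 0.5, "miR-B": 2.0},
    "p3": {"miR-A": 1.0, "miR-B": 1.0},
    "p4": {"miR-A": 3.0, "miR-B": 0.5},
}

scores = {pid: risk_score(expr, coefficients) for pid, expr in patients.items()}
groups = split_by_median(scores)
print(groups)
```

In a train/test design like the one in the abstract, the coefficients and the cutoff would be fixed on the training set and then applied unchanged to the testing set.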

Abstract:

Depth measures the extent of atom/residue burial within a protein. It correlates with properties such as protein stability, hydrogen exchange rate, protein-protein interaction hot spots, post-translational modification sites and sequence variability. Our server, DEPTH, accurately computes depth and solvent-accessible surface area (SASA) values. We show that depth can be used to predict small molecule ligand binding cavities in proteins. Often, some of the residues lining a ligand binding cavity are both deep and solvent exposed. Using the depth-SASA pair values for a residue, its likelihood of forming part of a small molecule binding cavity is estimated. The parameters of the method were calibrated over a training set of 900 high-resolution X-ray crystal structures of single-domain proteins bound to small molecules (molecular weight < 1.5 kDa). The prediction accuracy of DEPTH is comparable to that of other geometry-based prediction methods including LIGSITE, SURFNET and Pocket-Finder (all with a Matthews correlation coefficient of approximately 0.4) over a testing set of 225 single and multi-chain protein structures. Users have the option of tuning several parameters to detect cavities of different sizes, for example, geometrically flat binding sites. The input to the server is a protein 3D structure in PDB format. The users have the option of tuning the values of four parameters associated with the computation of residue depth and the prediction of binding cavities. The computed depths, SASA and binding cavity predictions are displayed in 2D plots and mapped onto 3D representations of the protein structure using Jmol. Links are provided to download the outputs. Our server is useful for all structural analysis based on residue depth and SASA, such as guiding site-directed mutagenesis experiments and small molecule docking exercises, in the context of protein functional annotation and drug discovery.

Abstract:

In many real-world prediction problems the output is a structured object like a sequence, a tree or a graph. Such problems range from natural language processing to computational biology or computer vision and have been tackled using algorithms referred to as structured output learning algorithms. We consider the problem of structured classification. In the last few years, large margin classifiers like support vector machines (SVMs) have shown much promise for structured output learning. The related optimization problem is a convex quadratic program (QP) with a large number of constraints, which makes the problem intractable for large data sets. This paper proposes a fast sequential dual method (SDM) for structural SVMs. The method makes repeated passes over the training set and optimizes the dual variables associated with one example at a time. The use of additional heuristics makes the proposed method more efficient. We present an extensive empirical evaluation of the proposed method on several sequence learning problems. Our experiments on large data sets demonstrate that the proposed method is an order of magnitude faster than state-of-the-art methods like the cutting-plane method and the stochastic gradient descent (SGD) method. Further, SDM reaches steady-state generalization performance faster than the SGD method. The proposed SDM is thus a useful alternative for large-scale structured output learning.

Abstract:

In the design of practical web page classification systems one often encounters a situation in which the labeled training set is created by choosing some examples from each class; but, the class proportions in this set are not the same as those in the test distribution to which the classifier will be actually applied. The problem is made worse when the amount of training data is also small. In this paper we explore and adapt binary SVM methods that make use of unlabeled data from the test distribution, viz., Transductive SVMs (TSVMs) and expectation regularization/constraint (ER/EC) methods to deal with this situation. We empirically show that when the labeled training data is small, TSVM designed using the class ratio tuned by minimizing the loss on the labeled set yields the best performance; its performance is good even when the deviation between the class ratios of the labeled training set and the test set is quite large. When the labeled training data is sufficiently large, an unsupervised Gaussian mixture model can be used to get a very good estimate of the class ratio in the test set; also, when this estimate is used, both TSVM and EC/ER give their best possible performance, with TSVM coming out superior. The ideas in the paper can be easily extended to multi-class SVMs and MaxEnt models.

Abstract:

In this paper, we describe a method for feature extraction and classification of characters manually isolated from scene or natural images. Characters in a scene image may be affected by low resolution, uneven illumination or occlusion. We propose a novel method to binarize gray scale images by minimizing an energy functional. The Discrete Cosine Transform and the Angular Radial Transform are used to extract features from the characters after normalization for scale and translation. We have evaluated our method on the complete test set of the Chars74k dataset for English and Kannada scripts, which consists of handwritten and synthesized characters as well as characters extracted from camera-captured images. We use only the synthesized and handwritten characters from this dataset as the training set. Nearest neighbor classification is used in our experiments.

Abstract:

In this paper, we explore noise-tolerant learning of classifiers. We formulate the problem as follows. We assume that there is an unobservable training set that is noise free. The actual training set given to the learning algorithm is obtained from this ideal data set by corrupting the class label of each example. The probability that the class label of an example is corrupted is a function of the feature vector of the example. This would account for most kinds of noisy data one encounters in practice. We say that a learning method is noise tolerant if the classifiers learnt with noise-free data and with noisy data both have the same classification accuracy on the noise-free data. In this paper, we analyze the noise-tolerance properties of risk minimization (under different loss functions). We show that risk minimization under the 0-1 loss function has impressive noise-tolerance properties and that risk minimization under the squared error loss is tolerant only to uniform noise; risk minimization under other loss functions is not noise tolerant. We conclude this paper with some discussion of the implications of these theoretical results.