968 resultados para Training Sample
Resumo:
This study investigates face recognition with partial occlusion, illumination variation and their combination, assuming no prior information about the mismatch, and limited training data for each person. The authors extend their previous posterior union model (PUM) to give a new method capable of dealing with all these problems. PUM is an approach for selecting the optimal local image features for recognition to improve robustness to partial occlusion. The extension is in two stages. First, authors extend PUM from a probability-based formulation to a similarity-based formulation, so that it operates with as little as one single training sample to offer robustness to partial occlusion. Second, they extend this new formulation to make it robust to illumination variation, and to combined illumination variation and partial occlusion, by a novel combination of multicondition relighting and optimal feature selection. To evaluate the new methods, a number of databases with various simulated and realistic occlusion/illumination mismatches have been used. The results have demonstrated the improved robustness of the new methods.
Resumo:
The design of binary morphological operators that are translation-invariant and locally defined by a finite neighborhood window corresponds to the problem of designing Boolean functions. As in any supervised classification problem, morphological operators designed from a training sample also suffer from overfitting. Large neighborhood tends to lead to performance degradation of the designed operator. This work proposes a multilevel design approach to deal with the issue of designing large neighborhood-based operators. The main idea is inspired by stacked generalization (a multilevel classifier design approach) and consists of, at each training level, combining the outcomes of the previous level operators. The final operator is a multilevel operator that ultimately depends on a larger neighborhood than of the individual operators that have been combined. Experimental results show that two-level operators obtained by combining operators designed on subwindows of a large window consistently outperform the single-level operators designed on the full window. They also show that iterating two-level operators is an effective multilevel approach to obtain better results.
Resumo:
A new clustering technique, based on the concept of immediato neighbourhood, with a novel capability to self-learn the number of clusters expected in the unsupervized environment, has been developed. The method compares favourably with other clustering schemes based on distance measures, both in terms of conceptual innovations and computational economy. Test implementation of the scheme using C-l flight line training sample data in a simulated unsupervized mode has brought out the efficacy of the technique. The technique can easily be implemented as a front end to established pattern classification systems with supervized learning capabilities to derive unified learning systems capable of operating in both supervized and unsupervized environments. This makes the technique an attractive proposition in the context of remotely sensed earth resources data analysis wherein it is essential to have such a unified learning system capability.
Resumo:
The problem of identifying user intent has received considerable attention in recent years, particularly in the context of improving the search experience via query contextualization. Intent can be characterized by multiple dimensions, which are often not observed from query words alone. Accurate identification of Intent from query words remains a challenging problem primarily because it is extremely difficult to discover these dimensions. The problem is often significantly compounded due to lack of representative training sample. We present a generic, extensible framework for learning the multi-dimensional representation of user intent from the query words. The approach models the latent relationships between facets using tree structured distribution which leads to an efficient and convergent algorithm, FastQ, for identifying the multi-faceted intent of users based on just the query words. We also incorporated WordNet to extend the system capabilities to queries which contain words that do not appear in the training data. Empirical results show that FastQ yields accurate identification of intent when compared to a gold standard.
Resumo:
This document aims to describe an update of the implementation of the J48Consolidated class within WEKA platform. The J48Consolidated class implements the CTC algorithm [2][3] which builds a unique decision tree based on a set of samples. The J48Consolidated class extends WEKA’s J48 class which implements the well-known C4.5 algorithm. This implementation was described in the technical report "J48Consolidated: An implementation of CTC algorithm for WEKA". The main, but not only, change in this update is the integration of the notion of coverage in order to determine the number of samples to be generated to build a consolidated tree. We define coverage as the percentage of examples of the training sample present in –or covered by– the set of generated subsamples. So, depending on the type of samples that we use, we will need more or less samples in order to achieve a specific value of coverage.
Resumo:
This work presents two new score functions based on the Bayesian Dirichlet equivalent uniform (BDeu) score for learning Bayesian network structures. They consider the sensitivity of BDeu to varying parameters of the Dirichlet prior. The scores take on the most adversary and the most beneficial priors among those within a contamination set around the symmetric one. We build these scores in such way that they are decomposable and can be computed efficiently. Because of that, they can be integrated into any state-of-the-art structure learning method that explores the space of directed acyclic graphs and allows decomposable scores. Empirical results suggest that our scores outperform the standard BDeu score in terms of the likelihood of unseen data and in terms of edge discovery with respect to the true network, at least when the training sample size is small. We discuss the relation between these new scores and the accuracy of inferred models. Moreover, our new criteria can be used to identify the amount of data after which learning is saturated, that is, additional data are of little help to improve the resulting model.
Resumo:
Polygenic risk scores have shown great promise in predicting complex disease risk and will become more accurate as training sample sizes increase. The standard approach for calculating risk scores involves linkage disequilibrium (LD)-based marker pruning and applying a p value threshold to association statistics, but this discards information and can reduce predictive accuracy. We introduce LDpred, a method that infers the posterior mean effect size of each marker by using a prior on effect sizes and LD information from an external reference panel. Theory and simulations show that LDpred outperforms the approach of pruning followed by thresholding, particularly at large sample sizes. Accordingly, predicted R(2) increased from 20.1% to 25.3% in a large schizophrenia dataset and from 9.8% to 12.0% in a large multiple sclerosis dataset. A similar relative improvement in accuracy was observed for three additional large disease datasets and for non-European schizophrenia samples. The advantage of LDpred over existing methods will grow as sample sizes increase.
Resumo:
The purpose of this study is to examine the impact of the choice of cut-off points, sampling procedures, and the business cycle on the accuracy of bankruptcy prediction models. Misclassification can result in erroneous predictions leading to prohibitive costs to firms, investors and the economy. To test the impact of the choice of cut-off points and sampling procedures, three bankruptcy prediction models are assessed- Bayesian, Hazard and Mixed Logit. A salient feature of the study is that the analysis includes both parametric and nonparametric bankruptcy prediction models. A sample of firms from Lynn M. LoPucki Bankruptcy Research Database in the U. S. was used to evaluate the relative performance of the three models. The choice of a cut-off point and sampling procedures were found to affect the rankings of the various models. In general, the results indicate that the empirical cut-off point estimated from the training sample resulted in the lowest misclassification costs for all three models. Although the Hazard and Mixed Logit models resulted in lower costs of misclassification in the randomly selected samples, the Mixed Logit model did not perform as well across varying business-cycles. In general, the Hazard model has the highest predictive power. However, the higher predictive power of the Bayesian model, when the ratio of the cost of Type I errors to the cost of Type II errors is high, is relatively consistent across all sampling methods. Such an advantage of the Bayesian model may make it more attractive in the current economic environment. This study extends recent research comparing the performance of bankruptcy prediction models by identifying under what conditions a model performs better. It also allays a range of user groups, including auditors, shareholders, employees, suppliers, rating agencies, and creditors' concerns with respect to assessing failure risk.
Resumo:
The main objective of this letter is to formulate a new approach of learning a Mahalanobis distance metric for nearest neighbor regression from a training sample set. We propose a modified version of the large margin nearest neighbor metric learning method to deal with regression problems. As an application, the prediction of post-operative trunk 3-D shapes in scoliosis surgery using nearest neighbor regression is described. Accuracy of the proposed method is quantitatively evaluated through experiments on real medical data.
Resumo:
A condition needed for testing nested hypotheses from a Bayesian viewpoint is that the prior for the alternative model concentrates mass around the small, or null, model. For testing independence in contingency tables, the intrinsic priors satisfy this requirement. Further, the degree of concentration of the priors is controlled by a discrete parameter m, the training sample size, which plays an important role in the resulting answer regardless of the sample size. In this paper we study robustness of the tests of independence in contingency tables with respect to the intrinsic priors with different degree of concentration around the null, and compare with other “robust” results by Good and Crook. Consistency of the intrinsic Bayesian tests is established. We also discuss conditioning issues and sampling schemes, and argue that conditioning should be on either one margin or the table total, but not on both margins. Examples using real are simulated data are given
Resumo:
Airborne lidar provides accurate height information of objects on the earth and has been recognized as a reliable and accurate surveying tool in many applications. In particular, lidar data offer vital and significant features for urban land-cover classification, which is an important task in urban land-use studies. In this article, we present an effective approach in which lidar data fused with its co-registered images (i.e. aerial colour images containing red, green and blue (RGB) bands and near-infrared (NIR) images) and other derived features are used effectively for accurate urban land-cover classification. The proposed approach begins with an initial classification performed by the Dempster–Shafer theory of evidence with a specifically designed basic probability assignment function. It outputs two results, i.e. the initial classification and pseudo-training samples, which are selected automatically according to the combined probability masses. Second, a support vector machine (SVM)-based probability estimator is adopted to compute the class conditional probability (CCP) for each pixel from the pseudo-training samples. Finally, a Markov random field (MRF) model is established to combine spatial contextual information into the classification. In this stage, the initial classification result and the CCP are exploited. An efficient belief propagation (EBP) algorithm is developed to search for the global minimum-energy solution for the maximum a posteriori (MAP)-MRF framework in which three techniques are developed to speed up the standard belief propagation (BP) algorithm. Lidar and its co-registered data acquired by Toposys Falcon II are used in performance tests. The experimental results prove that fusing the height data and optical images is particularly suited for urban land-cover classification. There is no training sample needed in the proposed approach, and the computational cost is relatively low. An average classification accuracy of 93.63% is achieved.
Resumo:
This paper introduces a novel approach for free-text keystroke dynamics authentication which incorporates the use of the keyboard’s key-layout. The method extracts timing features from specific key-pairs. The Euclidean distance is then utilized to find the level of similarity between a user’s profile data and his/her test data. The results obtained from this method are reasonable for free-text authentication while maintaining the maximum level of user relaxation. Moreover, it has been proven in this study that flight time yields better authentication results when compared with dwell time. In particular, the results were obtained with only one training sample for the purpose of practicality and ease of real life application.
Resumo:
A Facilitação Neuromuscular Proprioceptiva – FNP – é uma técnica que cada vez mais vem sendo utilizada no treinamento muscular de pessoas saudáveis e atletas. Pesquisas vêm mostrando que exercícios de resistência, dentre eles a FNP, são capazes de converter o tipo das fibras musculares treinadas. Esta pesquisa teve como objetivo verificar a eficiência da FNP no acréscimo de força muscular e verificar por métodos não invasivos se haveria indicativo de conversão de tipo de fibra muscular após o treinamento. Um grupo amostral de 22 jovens, universitárias do sexo feminino com idade entre 18 e 25 anos e fisicamente ativas, foi dividido em: grupo controle (GC n=10) e grupo experimental (GE n=12). Foram inicialmente mensurados: I - força da Contração Voluntária Máxima - CVM do músculo quadríceps por dinamometria analógica e root mean square - RMS e II - área de ativação muscular por eletromiografia de superfície (EMG) de todos os sujeitos. Após a primeira coleta de dados o GE realizou treinamento baseado na FNP no membro inferior dominante por 15 sessões em 5 semanas. Ao final, nova mensuração foi feita em todos. Quanto à força muscular, houve acréscimo em ambos os grupos, significativa no GC (p<0,01) e no GE (p<0,05); para RMS e tempo de CVM, houve aumento não significativo no GE, mas a interação Vxt aumentou significativamente para este grupo. Os resultados corroboram a literatura ao mostrar que músculos com predomínio de fibras resistentes (fibras I/ II A) possuem maior tempo de contração com mais ativação elétrica e de que a FNP é capaz fibras tipo II B para II A. Concluiu-se que para a amostra estudada o treinamento foi eficiente no acréscimo de força muscular e os dados EMG apresentados mostram fortes evidências da conversão das fibras do músculo treinado.
Resumo:
Recently, screening tests for monitoring the prevalence of transmissible spongiform encephalopathies specifically in sheep and goats became available. Although most countries require comprehensive test validation prior to approval, little is known about their performance under normal operating conditions. Switzerland was one of the first countries to implement 2 of these tests, an enzyme-linked immunosorbent assay (ELISA) and a Western blot, in a 1-year active surveillance program. Slaughtered animals (n = 32,777) were analyzed in either of the 2 tests with immunohistochemistry for confirmation of initial reactive results, and fallen stock samples (n = 3,193) were subjected to both screening tests and immunohistochemistry in parallel. Initial reactive and false-positive rates were recorded over time. Both tests revealed an excellent diagnostic specificity (>99.5%). However, initial reactive rates were elevated at the beginning of the program but dropped to levels below 1% with routine and enhanced staff training. Only those in the ELISA increased again in the second half of the program and correlated with the degree of tissue autolysis in the fallen stock samples. It is noteworthy that the Western blot missed 1 of the 3 atypical scrapie cases in the fallen stock, indicating potential differences in the diagnostic sensitivities between the 2 screening tests. However, an estimation of the diagnostic sensitivity for both tests on field samples remained difficult due to the low disease prevalence. Taken together, these results highlight the importance of staff training, sample quality, and interlaboratory comparison trials when such screening tests are implemented in the field.
Resumo:
* The work is supported by RFBR, grant 04-01-00858-a