984 results for Training Sample


Relevance:

100.00%

Publisher:

Abstract:

Introduction: As part of the MicroArray Quality Control (MAQC)-II project, this analysis examines how the choice of univariate feature-selection methods and classification algorithms may influence the performance of genomic predictors under varying degrees of prediction difficulty represented by three clinically relevant endpoints. Methods: We used gene-expression data from 230 breast cancers (grouped into training and independent validation sets), and we examined 40 predictors (five univariate feature-selection methods combined with eight different classifiers) for each of the three endpoints. Their classification performance was estimated on the training set by using two different resampling methods and compared with the accuracy observed in the independent validation set. Results: A ranking of the three classification problems was obtained, and the performance of 120 models was estimated and assessed on an independent validation set. The bootstrapping estimates were closer to the validation performance than were the cross-validation estimates. The required sample size for each endpoint was estimated, and both gene-level and pathway-level analyses were performed on the obtained models. Conclusions: We showed that genomic predictor accuracy is determined largely by an interplay between sample size and classification difficulty. Variations on univariate feature-selection methods and choice of classification algorithm have only a modest impact on predictor performance, and several statistically equally good predictors can be developed for any given classification problem.
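
For orientation, a minimal sketch in scikit-learn of the comparison the study performs: estimate a predictor's accuracy on the training set by cross-validation and by bootstrap out-of-bag resampling, then compare both with the accuracy observed on a held-out validation set. The synthetic data, the filter-plus-logistic-regression predictor, and the 50-replicate bootstrap are illustrative assumptions, not the MAQC-II configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the 230-sample expression data.
X, y = make_classification(n_samples=230, n_features=500, n_informative=20,
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.4,
                                            random_state=0)

# One of many possible predictors: a univariate filter plus a classifier.
model = make_pipeline(SelectKBest(f_classif, k=50),
                      LogisticRegression(max_iter=1000))

# Resampling estimate 1: cross-validation on the training set.
cv_acc = cross_val_score(model, X_tr, y_tr, cv=10).mean()

# Resampling estimate 2: bootstrap out-of-bag accuracy on the training set.
rng = np.random.RandomState(0)
oob_accs = []
for _ in range(50):
    idx = rng.randint(0, len(y_tr), len(y_tr))
    oob = np.setdiff1d(np.arange(len(y_tr)), idx)
    oob_accs.append(model.fit(X_tr[idx], y_tr[idx]).score(X_tr[oob], y_tr[oob]))
boot_acc = np.mean(oob_accs)

# Reference point: accuracy on the independent validation set.
val_acc = model.fit(X_tr, y_tr).score(X_val, y_val)
print(cv_acc, boot_acc, val_acc)
```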

Relevance:

70.00%

Publisher:

Abstract:

The design of binary morphological operators that are translation-invariant and locally defined by a finite neighborhood window corresponds to the problem of designing Boolean functions. As in any supervised classification problem, morphological operators designed from a training sample also suffer from overfitting: large neighborhoods tend to degrade the performance of the designed operator. This work proposes a multilevel design approach to deal with the issue of designing operators over large neighborhoods. The main idea is inspired by stacked generalization (a multilevel classifier design approach) and consists of combining, at each training level, the outcomes of the previous-level operators. The final operator is a multilevel operator that ultimately depends on a larger neighborhood than that of the individual operators that have been combined. Experimental results show that two-level operators obtained by combining operators designed on subwindows of a large window consistently outperform single-level operators designed on the full window. They also show that iterating two-level operators is an effective multilevel approach to obtaining better results.
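
A minimal sketch of the two-level design idea, assuming decision trees as stand-ins for the learned Boolean functions and toy 3x3 binary patterns in place of real observed/ideal image pairs:

```python
# Level-1 operators are designed on subwindows of a large window; a level-2
# operator combines their outputs, so the combined operator effectively
# depends on the whole window.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
window = rng.randint(0, 2, (2000, 9))            # 3x3 neighborhood patterns
target = (window.sum(axis=1) > 4).astype(int)    # stand-in for the ideal output

subwindows = [slice(0, 3), slice(3, 6), slice(6, 9)]  # three 1x3 subwindows

# Level 1: one operator (Boolean function) per subwindow.
level1 = [DecisionTreeClassifier().fit(window[:, s], target) for s in subwindows]

# Level 2: combine level-1 outputs (a faithful stacked-generalization design
# would train this level on data held out from level 1).
z = np.column_stack([op.predict(window[:, s]) for op, s in zip(level1, subwindows)])
level2 = DecisionTreeClassifier().fit(z, target)
print(level2.score(z, target))
```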

Relevance:

60.00%

Publisher:

Abstract:

The classification rules of linear discriminant analysis are defined by the true mean vectors and the common covariance matrix of the populations from which the data come. Because these true parameters are generally unknown, they are commonly estimated by the sample mean vector and covariance matrix of the data in a training sample randomly drawn from each population. However, these sample statistics are notoriously susceptible to contamination by outliers, a problem compounded by the fact that the outliers may be invisible to conventional diagnostics. High-breakdown estimation is a procedure designed to remove this cause for concern by producing estimates that are immune to serious distortion by a minority of outliers, regardless of their severity. In this article we motivate and develop a high-breakdown criterion for linear discriminant analysis and give an algorithm for its implementation. The procedure is intended to supplement rather than replace the usual sample-moment methodology of discriminant analysis either by providing indications that the dataset is not seriously affected by outliers (supporting the usual analysis) or by identifying apparently aberrant points and giving resistant estimators that are not affected by them.
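
As a rough illustration of the high-breakdown idea (not the article's own criterion), the sample moments can be replaced by minimum covariance determinant (MCD) estimates, which tolerate a large fraction of outliers; a sketch using scikit-learn's MinCovDet:

```python
import numpy as np
from sklearn.covariance import MinCovDet

def robust_lda_fit(X, y):
    """Robust class means and a pooled robust covariance via MCD."""
    means, pooled = {}, np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(y):
        mcd = MinCovDet(random_state=0).fit(X[y == c])
        means[c] = mcd.location_
        pooled += mcd.covariance_ * (y == c).sum()
    return means, np.linalg.inv(pooled / len(y))

def robust_lda_predict(X, means, prec):
    # Assign each point to the class whose robust mean is nearest in
    # Mahalanobis distance under the pooled robust covariance
    # (equal class priors assumed for simplicity).
    classes = sorted(means)
    d = np.column_stack([np.einsum('ij,jk,ik->i', X - means[c], prec, X - means[c])
                         for c in classes])
    return np.array(classes)[d.argmin(axis=1)]
```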


Relevance:

60.00%

Publisher:

Abstract:

The purpose of this study is to examine the impact of the choice of cut-off points, sampling procedures, and the business cycle on the accuracy of bankruptcy prediction models. Misclassification can result in erroneous predictions, leading to prohibitive costs to firms, investors, and the economy. To test the impact of the choice of cut-off points and sampling procedures, three bankruptcy prediction models are assessed: Bayesian, Hazard, and Mixed Logit. A salient feature of the study is that the analysis includes both parametric and nonparametric bankruptcy prediction models. A sample of firms from the Lynn M. LoPucki Bankruptcy Research Database in the U.S. was used to evaluate the relative performance of the three models. The choice of cut-off point and sampling procedure was found to affect the rankings of the various models. In general, the results indicate that the empirical cut-off point estimated from the training sample resulted in the lowest misclassification costs for all three models. Although the Hazard and Mixed Logit models resulted in lower misclassification costs in the randomly selected samples, the Mixed Logit model did not perform as well across varying business cycles. In general, the Hazard model has the highest predictive power. However, the higher predictive power of the Bayesian model when the ratio of the cost of Type I errors to the cost of Type II errors is high is relatively consistent across all sampling methods. Such an advantage may make the Bayesian model more attractive in the current economic environment. This study extends recent research comparing the performance of bankruptcy prediction models by identifying the conditions under which a model performs better. It also addresses the concerns of a range of user groups, including auditors, shareholders, employees, suppliers, rating agencies, and creditors, with respect to assessing failure risk.
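
A minimal sketch of the empirical cut-off idea: sweep candidate thresholds over the training sample's predicted probabilities and keep the one minimizing total misclassification cost. The 10:1 cost ratio of Type I (missed bankruptcy) to Type II errors is an assumed illustration.

```python
import numpy as np

def empirical_cutoff(scores, y, cost_type1=10.0, cost_type2=1.0):
    """scores: predicted bankruptcy probabilities; y: 1 = bankrupt."""
    best_t, best_cost = 0.5, np.inf
    for t in np.unique(scores):
        pred = (scores >= t).astype(int)
        cost = (cost_type1 * ((y == 1) & (pred == 0)).sum()    # Type I errors
                + cost_type2 * ((y == 0) & (pred == 1)).sum())  # Type II errors
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t
```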

Relevance:

60.00%

Publisher:

Abstract:

The main objective of this letter is to formulate a new approach to learning a Mahalanobis distance metric for nearest neighbor regression from a training sample set. We propose a modified version of the large margin nearest neighbor (LMNN) metric learning method to deal with regression problems. As an application, the prediction of post-operative trunk 3-D shapes in scoliosis surgery using nearest neighbor regression is described. The accuracy of the proposed method is quantitatively evaluated through experiments on real medical data.
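
A minimal sketch of nearest neighbor regression under a learned Mahalanobis metric d_M(x, x') = (x - x')^T M (x - x'). Here M is a placeholder (the identity, which recovers plain Euclidean kNN); in the letter it would come from the LMNN-style learner adapted to regression.

```python
import numpy as np

def mahalanobis_knn_regress(X_train, y_train, X_test, M, k=3):
    """Predict by averaging the targets of the k nearest neighbors under
    the Mahalanobis metric defined by positive-definite M."""
    L = np.linalg.cholesky(M)          # M = L L^T, so d_M = ||L^T (x - x')||^2
    Z_tr, Z_te = X_train @ L, X_test @ L
    preds = []
    for z in Z_te:
        idx = np.argsort(((Z_tr - z) ** 2).sum(axis=1))[:k]
        preds.append(y_train[idx].mean())
    return np.array(preds)

# Tiny usage on random data; the identity metric gives Euclidean kNN regression.
rng = np.random.RandomState(0)
X_tr, y_tr = rng.rand(50, 4), rng.rand(50)
print(mahalanobis_knn_regress(X_tr, y_tr, rng.rand(2, 4), M=np.eye(4)))
```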

Relevance:

60.00%

Publisher:

Abstract:

A condition needed for testing nested hypotheses from a Bayesian viewpoint is that the prior for the alternative model concentrates mass around the small, or null, model. For testing independence in contingency tables, the intrinsic priors satisfy this requirement. Further, the degree of concentration of the priors is controlled by a discrete parameter m, the training sample size, which plays an important role in the resulting answer regardless of the sample size. In this paper we study the robustness of the tests of independence in contingency tables with respect to intrinsic priors with different degrees of concentration around the null, and compare with other "robust" results by Good and Crook. Consistency of the intrinsic Bayesian tests is established. We also discuss conditioning issues and sampling schemes, and argue that conditioning should be on either one margin or the table total, but not on both margins. Examples using real and simulated data are given.
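
For orientation, the general intrinsic-prior construction this setting relies on (stated from the standard Berger and Pericchi theory rather than copied from this paper): the default prior on the alternative is reweighted through an imaginary training sample y(m) of size m,

```latex
\pi^{I}(\theta_1 \mid m)
  \;=\; \pi^{N}(\theta_1)\,
  \mathbb{E}_{\,y(m)\mid\theta_1}\!\left[
    \frac{m_0^{N}\bigl(y(m)\bigr)}{m_1^{N}\bigl(y(m)\bigr)}
  \right],
```

where \pi^{N} is a default (noninformative) prior and m_0^N, m_1^N are the marginal likelihoods of the training sample under the null and alternative defaults. Larger m pulls more prior mass toward the null, which is exactly the degree of concentration whose effect the paper studies.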

Relevance:

60.00%

Publisher:

Abstract:

Airborne lidar provides accurate height information for objects on the earth and has been recognized as a reliable and accurate surveying tool in many applications. In particular, lidar data offer vital and significant features for urban land-cover classification, which is an important task in urban land-use studies. In this article, we present an approach in which lidar data, fused with co-registered images (i.e. aerial colour images containing red, green, and blue (RGB) bands and near-infrared (NIR) images) and other derived features, are used for accurate urban land-cover classification. The proposed approach begins with an initial classification performed by the Dempster–Shafer theory of evidence with a specifically designed basic probability assignment function. It outputs two results, i.e. the initial classification and pseudo-training samples, which are selected automatically according to the combined probability masses. Second, a support vector machine (SVM)-based probability estimator is adopted to compute the class conditional probability (CCP) for each pixel from the pseudo-training samples. Finally, a Markov random field (MRF) model is established to combine spatial contextual information into the classification; in this stage, the initial classification result and the CCP are exploited. An efficient belief propagation (EBP) algorithm is developed to search for the global minimum-energy solution of the maximum a posteriori (MAP)-MRF framework, with three techniques developed to speed up the standard belief propagation (BP) algorithm. Lidar and co-registered data acquired by Toposys Falcon II are used in performance tests. The experimental results show that fusing the height data and optical images is particularly well suited to urban land-cover classification. No manually labelled training sample is needed in the proposed approach, and the computational cost is relatively low. An average classification accuracy of 93.63% is achieved.
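
A minimal sketch of the middle stage only (the SVM-based CCP estimator), with hypothetical per-pixel features and labels standing in for the pseudo-training samples produced by the evidence-fusion step:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Hypothetical per-pixel features: lidar height, R, G, B, NIR.
X_pseudo = rng.rand(300, 5)
y_pseudo = rng.randint(0, 4, 300)      # e.g. building/tree/grass/road labels

# Calibrated SVM outputs serve as class conditional probabilities (CCP).
svm = SVC(probability=True, random_state=0).fit(X_pseudo, y_pseudo)
ccp = svm.predict_proba(rng.rand(1000, 5))   # one probability row per pixel,
                                             # later fed into the MRF stage
```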

Relevance:

60.00%

Publisher:

Abstract:

This paper introduces a novel approach to free-text keystroke dynamics authentication which incorporates the keyboard's key layout. The method extracts timing features from specific key-pairs. The Euclidean distance is then used to measure the similarity between a user's profile data and his or her test data. The results obtained with this method are reasonable for free-text authentication while imposing minimal constraints on the user. Moreover, the study shows that flight time yields better authentication results than dwell time. In particular, the results were obtained with only one training sample, for practicality and ease of real-life application.
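
A minimal sketch of the matching step: key-pair flight times from a test session are compared with the user's profile by Euclidean distance over the key-pairs both samples share (the key-pairs and timings below are illustrative):

```python
import numpy as np

def keystroke_distance(profile, test):
    """profile/test: dict mapping key-pair -> mean flight time (ms)."""
    shared = sorted(set(profile) & set(test))
    p = np.array([profile[k] for k in shared])
    t = np.array([test[k] for k in shared])
    return np.linalg.norm(p - t)    # Euclidean distance over shared key-pairs

profile = {('t', 'h'): 95.0, ('h', 'e'): 110.0, ('i', 'n'): 88.0}
test    = {('t', 'h'): 102.0, ('h', 'e'): 107.0, ('o', 'n'): 90.0}
print(keystroke_distance(profile, test))    # accept if below a threshold
```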

Relevance:

60.00%

Publisher:

Abstract:

Proprioceptive Neuromuscular Facilitation (PNF) is a technique increasingly used in the muscle training of healthy individuals and athletes. Research has shown that resistance exercises, PNF among them, are capable of converting the type of the trained muscle fibres. This study aimed to verify the efficiency of PNF in increasing muscle strength and to verify, by non-invasive methods, whether there was an indication of muscle fibre type conversion after training. A sample of 22 physically active female university students aged 18 to 25 years was divided into a control group (CG, n=10) and an experimental group (EG, n=12). The following were initially measured for all subjects: (i) maximum voluntary contraction (MVC) strength of the quadriceps muscle by analogue dynamometry and root mean square (RMS), and (ii) muscle activation area by surface electromyography (EMG). After the first data collection, the EG trained the dominant lower limb with a PNF-based protocol for 15 sessions over 5 weeks. At the end, all subjects were measured again. Muscle strength increased in both groups, significantly in the CG (p<0.01) and in the EG (p<0.05); RMS and MVC time increased non-significantly in the EG, but the V×t interaction increased significantly for this group. The results corroborate the literature in showing that muscles with a predominance of fatigue-resistant fibres (types I/IIA) have longer contraction times with greater electrical activation, and that PNF is capable of converting type IIB fibres to IIA. It was concluded that, for the sample studied, the training was efficient in increasing muscle strength, and the EMG data presented show strong evidence of fibre conversion in the trained muscle.
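
A one-formula aside: the RMS amplitude reported from the surface EMG is simply the square root of the mean squared signal over the analysis window; a minimal sketch on a synthetic trace (sampling rate assumed):

```python
import numpy as np

fs = 1000                               # sampling rate in Hz (assumed)
t = np.arange(0, 5, 1 / fs)             # 5 s contraction window
emg = 0.3 * np.random.randn(t.size)     # stand-in for a recorded EMG trace
rms = np.sqrt(np.mean(emg ** 2))        # RMS amplitude of the window
print(rms)
```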

Relevance:

60.00%

Publisher:

Abstract:

Recently, screening tests for monitoring the prevalence of transmissible spongiform encephalopathies specifically in sheep and goats became available. Although most countries require comprehensive test validation prior to approval, little is known about their performance under normal operating conditions. Switzerland was one of the first countries to implement 2 of these tests, an enzyme-linked immunosorbent assay (ELISA) and a Western blot, in a 1-year active surveillance program. Slaughtered animals (n = 32,777) were analyzed in either of the 2 tests with immunohistochemistry for confirmation of initial reactive results, and fallen stock samples (n = 3,193) were subjected to both screening tests and immunohistochemistry in parallel. Initial reactive and false-positive rates were recorded over time. Both tests revealed an excellent diagnostic specificity (>99.5%). However, initial reactive rates were elevated at the beginning of the program but dropped to levels below 1% with routine and enhanced staff training. Only those in the ELISA increased again in the second half of the program and correlated with the degree of tissue autolysis in the fallen stock samples. It is noteworthy that the Western blot missed 1 of the 3 atypical scrapie cases in the fallen stock, indicating potential differences in the diagnostic sensitivities between the 2 screening tests. However, an estimation of the diagnostic sensitivity for both tests on field samples remained difficult due to the low disease prevalence. Taken together, these results highlight the importance of staff training, sample quality, and interlaboratory comparison trials when such screening tests are implemented in the field.
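
For orientation, the two performance figures tracked in the programme reduce to simple proportions; a sketch with hypothetical counts (only the 32,777 total and the >99.5% specificity come from the study):

```python
n_tested = 32777           # slaughtered animals screened
initial_reactive = 120     # reactive on first screening (assumed count)
confirmed = 3              # confirmed by immunohistochemistry (assumed count)

false_positive = initial_reactive - confirmed
true_negative = n_tested - initial_reactive  # treating non-reactives as true negatives
specificity = true_negative / (true_negative + false_positive)
initial_reactive_rate = initial_reactive / n_tested
print(f"specificity = {specificity:.4%}, "
      f"initial reactive rate = {initial_reactive_rate:.2%}")
```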

Relevance:

60.00%

Publisher:

Abstract:

* The work is supported by RFBR, grant 04-01-00858-a

Relevance:

60.00%

Publisher:

Abstract:

A technology for the classification of electronic documents based on the perturbation theory of pseudoinverse matrices is proposed.
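
A minimal sketch of a pseudoinverse-based linear classifier (an illustration of the general technique; the paper's contribution concerns perturbation theory for pseudoinverse matrices, which this sketch does not reproduce):

```python
import numpy as np

def fit_pinv_classifier(X, y, n_classes):
    """Least-squares weights for one-hot targets via the Moore-Penrose
    pseudoinverse of the bias-augmented document-feature matrix."""
    Y = np.eye(n_classes)[y]                      # one-hot targets
    Xb = np.hstack([X, np.ones((len(X), 1))])     # append bias column
    return np.linalg.pinv(Xb) @ Y

def predict(W, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (Xb @ W).argmax(axis=1)                # highest-scoring class
```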

Relevance:

60.00%

Publisher:

Abstract:

* This work was supported by RFBR, grants 07-01-00331-a and 08-01-00944-a

Relevance:

60.00%

Publisher:

Abstract:

Previously developed models for predicting absolute risk of invasive epithelial ovarian cancer have included a limited number of risk factors and have had low discriminatory power (area under the receiver operating characteristic curve (AUC) < 0.60). Because of this, we developed and internally validated a relative risk prediction model that incorporates 17 established epidemiologic risk factors and 17 genome-wide significant single nucleotide polymorphisms (SNPs) using data from 11 case-control studies in the United States (5,793 cases; 9,512 controls) from the Ovarian Cancer Association Consortium (data accrued from 1992 to 2010). We developed a hierarchical logistic regression model for predicting case-control status that included imputation of missing data. We randomly divided the data into an 80% training sample and used the remaining 20% for model evaluation. The AUC for the full model was 0.664. A reduced model without SNPs performed similarly (AUC = 0.649). Both models performed better than a baseline model that included age and study site only (AUC = 0.563). The best predictive power was obtained in the full model among women younger than 50 years of age (AUC = 0.714); however, the addition of SNPs increased the AUC the most for women older than 50 years of age (AUC = 0.638 vs. 0.616). Adapting this improved model to estimate absolute risk and evaluating it in prospective data sets is warranted.
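
A minimal sketch of the evaluation protocol described: fit a logistic model for case-control status on an 80% training sample and compute the AUC on the held-out 20%. Synthetic covariates stand in for the 17 risk factors and 17 SNPs; the study's actual model was hierarchical and handled missing data by imputation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# 5,793 cases + 9,512 controls; 34 predictors as in the full model.
X, y = make_classification(n_samples=15305, n_features=34, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"held-out AUC = {auc:.3f}")
```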