44 results for Discriminant Analysis, Network Theory, Cross-Validation, Validation.
at Université de Lausanne, Switzerland
Abstract:
BACKGROUND/OBJECTIVES: (1) To cross-validate tetra- (4-BIA) and octopolar (8-BIA) bioelectrical impedance analysis vs dual-energy X-ray absorptiometry (DXA) for the assessment of total and appendicular body composition and (2) to evaluate the accuracy of external 4-BIA algorithms for the prediction of total body composition, in a representative sample of Swiss children. SUBJECTS/METHODS: A representative sample of 333 Swiss children aged 6-13 years from the Kinder-Sportstudie (KISS) (ISRCTN15360785). Whole-body fat-free mass (FFM) and appendicular lean tissue mass were measured with DXA. Body resistance (R) was measured at 50 kHz with 4-BIA and segmental body resistance at 5, 50, 250 and 500 kHz with 8-BIA. The resistance index (RI) was calculated as height²/R. Selection of predictors (gender, age, weight, RI4 and RI8) for BIA algorithms was performed using bootstrapped stepwise linear regression on 1000 samples. We calculated 95% confidence intervals (CI) of regression coefficients and measures of model fit using bootstrap analysis. Limits of agreement were used as measures of interchangeability of BIA with DXA. RESULTS: 8-BIA was more accurate than 4-BIA for the assessment of FFM (root mean square error (RMSE) = 0.90 (95% CI 0.82-0.98) vs 1.12 kg (1.01-1.24); limits of agreement 1.80 to -1.80 kg vs 2.24 to -2.24 kg). 8-BIA also gave accurate estimates of appendicular body composition, with RMSE ≤ 0.10 kg for arms and ≤ 0.24 kg for legs. All external 4-BIA algorithms performed poorly, with substantial negative proportional bias (r ≥ 0.48, P < 0.001). CONCLUSIONS: In a representative sample of young Swiss children, (1) 8-BIA was superior to 4-BIA for the prediction of FFM, (2) external 4-BIA algorithms gave biased predictions of FFM and (3) 8-BIA was an accurate predictor of segmental body composition.
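The interchangeability criterion used above (limits of agreement) together with the RMSE can be sketched as follows. The data are simulated, not the KISS measurements; the sample size, means and error scale are hypothetical choices for illustration only:

```python
import numpy as np

# Hypothetical illustration of Bland-Altman limits of agreement and RMSE
# for a reference method (DXA) vs an indirect one (BIA). All numbers are
# simulated; nothing here reproduces the study's data.
rng = np.random.default_rng(0)
dxa = rng.normal(25.0, 5.0, 333)        # simulated DXA fat-free mass (kg)
bia = dxa + rng.normal(0.0, 0.9, 333)   # simulated 8-BIA estimate (kg)

diff = bia - dxa
bias = diff.mean()
sd = diff.std(ddof=1)
loa_low, loa_high = bias - 1.96 * sd, bias + 1.96 * sd  # limits of agreement
rmse = np.sqrt(np.mean(diff ** 2))
```

Interchangeability is then judged by whether the interval bias ± 1.96·SD of the differences is narrow enough to be clinically acceptable.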
Batch effect confounding leads to strong bias in performance estimates obtained by cross-validation.
Abstract:
BACKGROUND: With the large amount of biological data that is currently publicly available, many investigators combine multiple data sets to increase the sample size and potentially also the power of their analyses. However, technical differences ("batch effects") as well as differences in sample composition between the data sets may significantly affect the ability to draw generalizable conclusions from such studies. FOCUS: The current study focuses on the construction of classifiers, and the use of cross-validation to estimate their performance. In particular, we investigate the impact of batch effects and differences in sample composition between batches on the accuracy of the classification performance estimate obtained via cross-validation. The focus on estimation bias is a main difference compared to previous studies, which have mostly focused on the predictive performance and how it relates to the presence of batch effects. DATA: We work on simulated data sets. To have realistic intensity distributions, we use real gene expression data as the basis for our simulation. Random samples from this expression matrix are selected and assigned to group 1 (e.g., 'control') or group 2 (e.g., 'treated'). We introduce batch effects and select some features to be differentially expressed between the two groups. We consider several scenarios for our study, most importantly different levels of confounding between groups and batch effects. METHODS: We focus on well-known classifiers: logistic regression, Support Vector Machines (SVM), k-nearest neighbors (kNN) and Random Forests (RF). Feature selection is performed with the Wilcoxon test or the lasso. Parameter tuning and feature selection, as well as the estimation of the prediction performance of each classifier, are performed within a nested cross-validation scheme. The estimated classification performance is then compared to what is obtained when applying the classifier to independent data.
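A nested cross-validation scheme of the kind described, with parameter tuning in an inner loop and performance estimation in an outer loop, can be sketched with scikit-learn. The simulated data, the additive "batch effect" aligned with the groups, and the SVM parameter grid are all illustrative assumptions, not the study's simulation design:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 20))                       # simulated expression matrix
y = (X[:, 0] + 0.5 * rng.normal(size=120) > 0).astype(int)
X[y == 1] += 0.8  # crude additive batch effect fully confounded with the groups

# inner loop: tune C; outer loop: estimate accuracy on held-out folds
inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3)
scores = cross_val_score(inner, X, y, cv=5)
```

Because the batch shift is confounded with the class labels, the cross-validated accuracy here would overstate performance on independent, batch-free data, which is exactly the bias the study investigates.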
Abstract:
In the last five years, Deep Brain Stimulation (DBS) has become the most popular and effective surgical technique for the treatment of Parkinson's disease (PD). The Subthalamic Nucleus (STN) is the usual target when applying DBS. Unfortunately, the STN is generally not visible in common medical imaging modalities. Therefore, atlas-based segmentation is commonly used to locate it in the images. In this paper, we propose a scheme that allows both a comparison between different registration algorithms and an evaluation of their ability to locate the STN automatically. Using this scheme we can evaluate the expert variability against the error of the algorithms, and we demonstrate that automatic STN location is possible and as accurate as the methods currently used.
Abstract:
In occupational exposure assessment of airborne contaminants, exposure levels can either be estimated through repeated measurements of the pollutant concentration in air, expert judgment or through exposure models that use information on the conditions of exposure as input. In this report, we propose an empirical hierarchical Bayesian model to unify these approaches. Prior to any measurement, the hygienist conducts an assessment to generate prior distributions of exposure determinants. Monte-Carlo samples from these distributions feed two level-2 models: a physical, two-compartment model, and a non-parametric, neural network model trained with existing exposure data. The outputs of these two models are weighted according to the expert's assessment of their relevance to yield predictive distributions of the long-term geometric mean and geometric standard deviation of the worker's exposure profile (level-1 model). Bayesian inferences are then drawn iteratively from subsequent measurements of worker exposure. Any traditional decision strategy based on a comparison with occupational exposure limits (e.g. mean exposure, exceedance strategies) can then be applied. Data on 82 workers exposed to 18 contaminants in 14 companies were used to validate the model with cross-validation techniques. A user-friendly program running the model is available upon request.
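The model-pooling step, in which the outputs of the physical and the data-driven model are weighted by the expert's relevance assessment, can be sketched very roughly as a mixture of predictive distributions. The two lognormal distributions, the 0.7 weight, and the sample size below are purely hypothetical; the actual report uses a full hierarchical Bayesian formulation, not this simplification:

```python
import numpy as np

# Toy sketch: pool Monte-Carlo draws from two exposure models according to
# an expert-assigned relevance weight. All parameters are hypothetical.
rng = np.random.default_rng(7)
phys = rng.lognormal(mean=np.log(0.5), sigma=0.4, size=10000)  # physical model draws (mg/m3)
nn = rng.lognormal(mean=np.log(0.8), sigma=0.6, size=10000)    # neural-network model draws

w = 0.7  # hypothetical expert weight on the physical model
pooled = np.where(rng.uniform(size=10000) < w, phys, nn)  # mixture of the two predictives
gm = np.exp(np.log(pooled).mean())  # long-term geometric mean of the pooled draws
```

The pooled draws then serve as the prior predictive distribution that subsequent exposure measurements update.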
Abstract:
The paper deals with the development and application of a generic methodology for automatic processing (mapping and classification) of environmental data. General Regression Neural Network (GRNN) is considered in detail and is proposed as an efficient tool to solve the problem of spatial data mapping (regression). The Probabilistic Neural Network (PNN) is considered as an automatic tool for spatial classification. The automatic tuning of isotropic and anisotropic GRNN/PNN models using a cross-validation procedure is presented. Results are compared with the k-Nearest-Neighbours (k-NN) interpolation algorithm using an independent validation data set. Real case studies are based on decision-oriented mapping and classification of radioactively contaminated territories.
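A GRNN is, in essence, Gaussian-kernel weighted averaging (the Nadaraya-Watson form), and its kernel width can be tuned by cross-validation as the abstract describes. A minimal numpy sketch on simulated coordinates, using leave-one-out error over an assumed candidate grid of isotropic kernel widths (the grid and data are illustrative, not the paper's case study):

```python
import numpy as np

def grnn_predict(X_train, y_train, X_query, sigma):
    # Gaussian-kernel weighted average of training targets (GRNN form)
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    return (w @ y_train) / w.sum(axis=1)

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(80, 2))                 # simulated 2-D sample locations
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=80) # simulated noisy field values

def loo_mse(sigma):
    # leave-one-out cross-validation error for a given kernel width
    errs = []
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        pred = grnn_predict(X[mask], y[mask], X[i:i + 1], sigma)
        errs.append((pred[0] - y[i]) ** 2)
    return np.mean(errs)

sigmas = [0.05, 0.1, 0.2, 0.4]          # hypothetical candidate widths
best = min(sigmas, key=loo_mse)         # "automatic tuning" of the isotropic model
```

Anisotropic tuning proceeds the same way, with one width per coordinate axis inside the kernel.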
Abstract:
Introduction: Pliny the Younger donated a sum of money to build a library and to ensure its operation. A citizen donated to the Municipality of Balerna two apartment buildings with the aim of keeping rents moderate and making the rooms available as primary residences for the population of the Commune. These two examples illustrate cases of donations encumbered with a charge. From them it emerges that the modal donation consists of two elements: a gift and a purpose attached to it. Such a donation exemplifies the idea of the otherness of unity, like Janus, god of the Roman pantheon, who is depicted with two faces that allowed him to see the past and scrutinize the future. Thus the present work, itself a paradigm of the otherness of unity, aims to build a bridge between the past and the potential future, drawing also on philosophical and literary texts, without neglecting its legal dimension. An invocation of Janus is justified: Seneca too, in his mocking poem on the transformation of the emperor Claudius into a pumpkin, the Apocolocyntosis, recounts that the god championed him, as a skilled orator versed in the forensic art, since his two faces indicate metaphorically, and ironically, the capacity to examine questions from every side. That said, it should be noted that the modal donation is made up of two cognitive shortcuts that expose its otherness: the first, the gift, evokes the ideas of gratuitousness, benevolence, friendship and reciprocity; the second, the charge, points to onerousness, exchange, contract and commutativity. Thus, to bring out the unity of a donation encumbered with a charge, one must examine its otherness and follow the paths it proposes, gratuitousness and onerousness, and then draw the results back into the unity of the institution. The present work indeed examines the modal donation.
It is nonetheless seasoned with more general reflections on contract theory and the philosophy of law, in a temporal digression that leads from Roman law to Swiss law, passing through certain glossators and humanists and also through some regional legislation of the nineteenth century. Hence a work that, in the final analysis, embraces several themes, divided as set out below. The first chapter, introducing the topic, is a compendium of the Latin terminology of the gift, obtained by identifying examples drawn from the legal sources and events narrated by Latin writers. Since the law manifests itself through writing, a set of signs expressing concepts, it seemed appropriate to dwell on the linguistic facets of the "gift". Roman society depended on manifold relationships of friendship, whose contents later gave rise to the so-called gratuitous contracts. In Rome, gratuitousness was a two-headed concept. There existed gratuitousness properly so called and a gratuitousness qualified as lucrative. The one excluded the other. Each enjoyed an autonomous and independent field of application. These two notions are the subject of the second chapter. Given that the modal donation is a blend of gratuitousness and onerousness, after setting out the criteria of gratuitousness, the third chapter focuses on the analysis of the testamentary modus and the concept of remuneratory donation. The fourth chapter, a prelude to the very core of the thesis, outlines the legal development of the donation in Roman law: from a causa of acts to an independent contract. After tracing the contours of the elements that merge in the concept of a donation encumbered with a charge (gift, gratuitousness, onerousness), the fifth chapter contains the basis of the present study: the analysis of the modal donation. Assembling the results of the preceding chapters, the institution is defined and delimited, and its historical and conceptual maturation is worked out.
The progression of the thesis, in the Swiss context, follows almost step by step the Romanistic structure of the first title. After a presentation of some nineteenth-century legislation in the prelude, the second chapter concentrates on the notion of gratuitousness as it can be drawn from the Civil Code, the Code of Obligations and the Federal Act on Debt Enforcement and Bankruptcy. The third chapter deals with onerousness, limited to the Civil Code and, principally, to the testamentary charge. The thesis closes, in the fourth chapter, with a digression on the donation encumbered with a charge in Swiss law. The underlying questions this work seeks to answer concern the nature and structure of the institution under examination: what type of contract is it? Is it gratuitous, lucrative or onerous? How is a defaulting donee liable? A preliminary caveat: the literary, legal and philosophical texts on which the first title rests are a couple of millennia old and, before reaching our libraries and homes, followed a tortuous path. Some were altered, others modified. Transcription errors add to adaptations made for new historical realities. That said, the literary and philosophical texts are presented in the versions indicated in the bibliography and have not been subjected to any particular interpolationist scrutiny; such critical questions do surface, however, for the legal fragments most relevant to the present work.
Abstract:
Recent studies have started to use media data to measure party positions and issue salience. The aim of this article is to compare and cross-validate this alternative approach with the more commonly used party manifestos, expert judgments and mass surveys. To this purpose, we present two methods to generate indicators of party positions and issue salience from media coverage: the core sentence approach and political claims analysis. Our cross-validation shows that with regard to party positions, indicators derived from the media converge with traditionally used measurements from party manifestos, mass surveys and expert judgments, but that salience indicators measure different underlying constructs. We conclude with a discussion of specific research questions for which media data offer potential advantages over more established methods.
Abstract:
The paper deals with the development and application of a methodology for automatic mapping of pollution/contamination data. General Regression Neural Network (GRNN) is considered in detail and is proposed as an efficient tool to solve this problem. The automatic tuning of isotropic and anisotropic GRNN models using a cross-validation procedure is presented. Results are compared with the k-nearest-neighbours interpolation algorithm using an independent validation data set. Quality of mapping is controlled by the analysis of raw data and the residuals using variography. Maps of probabilities of exceeding a given decision level and "thick" isoline visualization of the uncertainties are presented as examples of decision-oriented mapping. The real case study is based on mapping of radioactively contaminated territories.
Abstract:
Introduction: As part of the MicroArray Quality Control (MAQC)-II project, this analysis examines how the choice of univariate feature-selection methods and classification algorithms may influence the performance of genomic predictors under varying degrees of prediction difficulty represented by three clinically relevant endpoints. Methods: We used gene-expression data from 230 breast cancers (grouped into training and independent validation sets), and we examined 40 predictors (five univariate feature-selection methods combined with eight different classifiers) for each of the three endpoints. Their classification performance was estimated on the training set by using two different resampling methods and compared with the accuracy observed in the independent validation set. Results: A ranking of the three classification problems was obtained, and the performance of 120 models was estimated and assessed on an independent validation set. The bootstrapping estimates were closer to the validation performance than were the cross-validation estimates. The required sample size for each endpoint was estimated, and both gene-level and pathway-level analyses were performed on the obtained models. Conclusions: We showed that genomic predictor accuracy is determined largely by an interplay between sample size and classification difficulty. Variations on univariate feature-selection methods and choice of classification algorithm have only a modest impact on predictor performance, and several statistically equally good predictors can be developed for any given classification problem.
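The comparison at the heart of this design, a resampling estimate computed on the training set versus the accuracy observed on an independent validation set, can be sketched as follows. The simulated data, the k-NN classifier, and the 130/100 split are illustrative assumptions, not the MAQC-II protocol:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
X = rng.normal(size=(230, 30))                       # simulated expression features
y = (X[:, :3].sum(axis=1) + rng.normal(size=230) > 0).astype(int)
train, val = slice(0, 130), slice(130, 230)          # hypothetical train/validation split

clf = KNeighborsClassifier(5)
cv_est = cross_val_score(clf, X[train], y[train], cv=10).mean()  # resampling estimate
val_acc = clf.fit(X[train], y[train]).score(X[val], y[val])      # independent check
```

The gap between `cv_est` and `val_acc` is what the study quantifies across endpoints and resampling schemes (cross-validation vs bootstrapping).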
Abstract:
BACKGROUND: Little information is available on the validity of simple and indirect body-composition methods in non-Western populations. Equations for predicting body composition are population-specific, and body composition differs between blacks and whites. OBJECTIVE: We tested the hypothesis that the validity of equations for predicting total body water (TBW) from bioelectrical impedance analysis measurements is likely to depend on the racial background of the group from which the equations were derived. DESIGN: The hypothesis was tested by comparing, in 36 African women, TBW values measured by deuterium dilution with those predicted by 23 equations developed in white, African American, or African subjects. These cross-validations in our African sample were also compared, whenever possible, with results from other studies in black subjects. RESULTS: Errors in predicting TBW showed acceptable values (1.3-1.9 kg) in all cases, whereas a large range of bias (0.2-6.1 kg) was observed independently of the ethnic origin of the sample from which the equations were derived. Three equations (2 from whites and 1 from blacks) showed nonsignificant bias and could be used in Africans. In all other cases, we observed either an overestimation or underestimation of TBW with variable bias values, regardless of racial background, yielding no clear trend for validity as a function of ethnic origin. CONCLUSIONS: The findings of this cross-validation study emphasize the need for further fundamental research to explore the causes of the poor validity of TBW prediction equations across populations rather than the need to develop new prediction equations for use in Africa.
Abstract:
Models predicting species spatial distribution are increasingly applied to wildlife management issues, emphasising the need for reliable methods to evaluate the accuracy of their predictions. As many available datasets (e.g. museums, herbariums, atlas) do not provide reliable information about species absences, several presence-only based analyses have been developed. However, methods to evaluate the accuracy of their predictions are few and have never been validated. The aim of this paper is to compare existing and new presence-only evaluators to usual presence/absence measures. We use a reliable, diverse, presence/absence dataset of 114 plant species to test how common presence/absence indices (Kappa, MaxKappa, AUC, adjusted D²) compare to presence-only measures (AVI, CVI, Boyce index) for evaluating generalised linear models (GLM). Moreover, we propose a new, threshold-independent evaluator, which we call the "continuous Boyce index". All indices were implemented in the BIOMAPPER software. We show that the presence-only evaluators are fairly correlated (ρ > 0.7) with the presence/absence ones. The Boyce indices are closer to AUC than to MaxKappa and are fairly insensitive to species prevalence. In addition, the Boyce indices provide predicted-to-expected ratio curves that offer further insights into model quality: robustness, habitat suitability resolution and deviation from randomness. This information helps reclassify predicted maps into meaningful habitat suitability classes. The continuous Boyce index is thus both a complement to the usual evaluation of presence/absence models and a reliable measure of presence-only based predictions.
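The discrete Boyce index is the rank correlation between habitat-suitability classes and their predicted-to-expected (P/E) presence ratios. A minimal numpy sketch on simulated suitability scores; the four equal-width classes and the uniform landscape are assumptions for illustration, not the paper's setup:

```python
import numpy as np

# Simulated landscape: uniform suitability scores; presences are sampled
# so that they favour high-suitability cells (purely hypothetical data).
rng = np.random.default_rng(3)
scores = rng.uniform(0, 1, 5000)            # model output over the study area
keep = rng.uniform(0, 1, 5000) < scores     # presence probability ~ suitability
presences = scores[keep]

bins = np.linspace(0, 1, 5)                 # four equal-width suitability classes
p_obs, _ = np.histogram(presences, bins=bins)   # predicted presence frequencies
p_exp, _ = np.histogram(scores, bins=bins)      # expected-by-area frequencies
pe = (p_obs / p_obs.sum()) / (p_exp / p_exp.sum())  # P/E ratio per class

# Boyce index: Spearman rank correlation between class order and P/E ratio
ranks = pe.argsort().argsort()
boyce = np.corrcoef(ranks, np.arange(len(pe)))[0, 1]
```

A good model yields a monotonically increasing P/E curve (index near 1); the continuous variant replaces the fixed classes with a moving window over the suitability range.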
Abstract:
OBJECTIVE: Mild neurocognitive disorders (MND) affect a subset of HIV+ patients under effective combination antiretroviral therapy (cART). In this study, we used an innovative multi-contrast magnetic resonance imaging (MRI) approach at high-field to assess the presence of micro-structural brain alterations in MND+ patients. METHODS: We enrolled 17 MND+ and 19 MND- patients with undetectable HIV-1 RNA and 19 healthy controls (HC). MRI acquisitions at 3T included: MP2RAGE for T1 relaxation times, Magnetization Transfer (MT), T2* and Susceptibility Weighted Imaging (SWI) to probe micro-structural integrity and iron deposition in the brain. Statistical analysis used permutation-based tests and correction for family-wise error rate. Multiple regression analysis was performed between MRI data and (i) neuropsychological results, (ii) HIV infection characteristics. A linear discriminant analysis (LDA) based on MRI data was performed between MND+ and MND- patients and cross-validated with a leave-one-out test. RESULTS: Our data revealed loss of structural integrity and micro-oedema in MND+ compared to HC in the global white and cortical gray matter, as well as in the thalamus and basal ganglia. Multiple regression analysis showed a significant influence of sub-cortical nuclei alterations on the executive index of MND+ patients (p = 0.04 and R² = 95.2). The LDA distinguished MND+ and MND- patients with a classification quality of 73% after cross-validation. CONCLUSION: Our study shows micro-structural brain tissue alterations in MND+ patients under effective therapy and suggests that multi-contrast MRI at high field is a powerful approach to discriminate between HIV+ patients on cART with and without mild neurocognitive deficits.
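The classification step, linear discriminant analysis cross-validated with a leave-one-out test, can be sketched with scikit-learn. The four "MRI features" and the group separation below are simulated stand-ins, not the study's multi-contrast data:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(4)
# Hypothetical MRI-derived features for two small patient groups (17 vs 19)
X = np.vstack([rng.normal(0, 1, (17, 4)), rng.normal(1, 1, (19, 4))])
y = np.array([1] * 17 + [0] * 19)

# Leave-one-out: each patient is classified by a model trained on the others
acc = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=LeaveOneOut()).mean()
```

Leave-one-out is the natural choice here because the groups are too small to spare a held-out validation set.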
Abstract:
Aim: This study used data from temperate forest communities to assess: (1) five different stepwise selection methods with generalized additive models; (2) the effect of weighting absences to ensure a prevalence of 0.5; (3) the effect of limiting absences beyond the environmental envelope defined by presences; (4) four different methods for incorporating spatial autocorrelation; and (5) the effect of integrating an interaction factor defined by a regression tree on the residuals of an initial environmental model. Location: State of Vaud, western Switzerland. Methods: Generalized additive models (GAMs) were fitted using the grasp package (generalized regression analysis and spatial predictions, http://www.cscf.ch/grasp). Results: Model selection based on cross-validation appeared to be the best compromise between model stability and performance (parsimony) among the five methods tested. Weighting absences returned models that perform better than models fitted with the original sample prevalence. This appeared to be mainly due to the impact of very low prevalence values on evaluation statistics. Removing zeroes beyond the range of presences on main environmental gradients changed the set of selected predictors, and potentially their response curve shape. Moreover, removing zeroes slightly improved model performance and stability when compared with the baseline model on the same data set. Incorporating a spatial trend predictor improved model performance and stability significantly. Even better models were obtained when including local spatial autocorrelation. A novel approach to include interactions proved to be an efficient way to account for interactions between all predictors at once.
Main conclusions: Models and spatial predictions of 18 forest communities were significantly improved by using either: (1) cross-validation as a model selection method, (2) weighted absences, (3) limited absences, (4) predictors accounting for spatial autocorrelation, or (5) a factor variable accounting for interactions between all predictors. The final choice of model strategy should depend on the nature of the available data and the specific study aims. Statistical evaluation is useful in searching for the best modelling practice. However, one should not neglect to consider the shapes and interpretability of response curves, as well as the resulting spatial predictions, in the final assessment.
Abstract:
The n-octanol/water partition coefficient (log Po/w) is a key physicochemical parameter for drug discovery, design, and development. Here, we present a physics-based approach that shows a strong linear correlation between the computed solvation free energy in implicit solvents and the experimental log Po/w on a cleansed data set of more than 17,500 molecules. After internal validation by five-fold cross-validation and data randomization, the predictive power of the most interesting multiple linear model, based on two GB/SA parameters solely, was tested on two different external sets of molecules. On the Martel druglike test set, the predictive power of the best model (N = 706, r = 0.64, MAE = 1.18, and RMSE = 1.40) is similar to six well-established empirical methods. On the 17-drug test set, our model outperformed all compared empirical methodologies (N = 17, r = 0.94, MAE = 0.38, and RMSE = 0.52). The physical basis of our original GB/SA approach together with its predictive capacity, computational efficiency (1 to 2 s per molecule), and tridimensional molecular graphics capability lay the foundations for a promising predictor, the implicit log P method (iLOGP), to complement the portfolio of drug design tools developed and provided by the SIB Swiss Institute of Bioinformatics.
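The modelling recipe, a multiple linear model on two descriptors validated by five-fold cross-validation with MAE and RMSE as error measures, can be sketched as follows. The two descriptors and their relationship to log Po/w are simulated, not the actual GB/SA parameters of iLOGP:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(5)
n = 200
solv_fe = rng.normal(-10, 3, n)   # hypothetical implicit-solvent free energy term
sasa = rng.normal(300, 50, n)     # hypothetical surface-area term
# Simulated "experimental" log P with a linear dependence plus noise
logp = 0.2 * (-solv_fe) + 0.01 * sasa + rng.normal(0, 0.5, n)

X = np.column_stack([solv_fe, sasa])
# Five-fold cross-validated predictions of the two-parameter linear model
pred = cross_val_predict(LinearRegression(), X, logp,
                         cv=KFold(5, shuffle=True, random_state=0))
mae = np.mean(np.abs(pred - logp))
rmse = np.sqrt(np.mean((pred - logp) ** 2))
```

Internal validation of this kind (cross-validation plus, in the paper, data randomization) guards against overfitting before the model faces the external test sets.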