950 results for Cross-validation
Abstract:
Motivation: A new method that uses support vector machines (SVMs) to predict protein secondary structure is described and evaluated. The study is designed to develop a reliable prediction method using an alternative technique and to investigate the applicability of SVMs to this type of bioinformatics problem. Methods: Binary SVMs are trained to discriminate between two structural classes. The binary classifiers are combined in several ways to predict multi-class secondary structure. Results: The average three-state prediction accuracy per protein (Q3) is estimated by cross-validation to be 77.07 ± 0.26% with a segment overlap (Sov) score of 73.32 ± 0.39%. The SVM performs similarly to the 'state-of-the-art' PSIPRED prediction method on a non-homologous test set of 121 proteins despite being trained on substantially fewer examples. A simple consensus of the SVM, PSIPRED and PROFsec achieves significantly higher prediction accuracy than the individual methods. Availability: The SVM classifier is available from the authors. Work is in progress to make the method available on-line and to integrate the SVM predictions into the PSIPRED server.
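To make the idea of combining binary classifiers concrete, here is a minimal sketch (not the authors' code) of a one-vs-one combination of binary SVMs evaluated by cross-validation with scikit-learn; the feature matrix and labels are random placeholders for the sequence-profile windows a real predictor would use.

```python
# Minimal sketch: binary SVMs combined into a three-state (H/E/C) predictor,
# with the three-state accuracy (a stand-in for Q3) estimated by cross-validation.
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsOneClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20 * 15))        # hypothetical 15-residue windows of 20-dim profiles
y = rng.choice(["H", "E", "C"], size=1000)  # three secondary-structure states

# One binary SVM per pair of classes, combined by voting into a multi-class predictor.
clf = OneVsOneClassifier(SVC(kernel="rbf", C=1.0, gamma="scale"))

# Cross-validated three-state accuracy (near chance here, since the data are random placeholders).
scores = cross_val_score(clf, X, y, cv=5)
print(f"three-state accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```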
Abstract:
Current methods for estimating vegetation parameters are generally sub-optimal in the way they exploit information and do not generally consider uncertainties. We look forward to a future where operational data assimilation schemes improve estimates by tracking land surface processes and exploiting multiple types of observations. Data assimilation schemes seek to combine observations and models in a statistically optimal way, taking into account uncertainty in both, but have not yet been much exploited in this area. The EO-LDAS scheme and prototype, developed under ESA funding, is designed to exploit the anticipated wealth of data that will be available under GMES missions, such as the Sentinel family of satellites, to provide improved mapping of land surface biophysical parameters. This paper describes the EO-LDAS implementation and explores some of its core functionality. EO-LDAS is a weak constraint variational data assimilation system. The prototype provides a mechanism for constraint based on a prior estimate of the state vector, a linear dynamic model, and Earth Observation data (top-of-canopy reflectance here). The observation operator is a non-linear optical radiative transfer model for a vegetation canopy with a soil lower boundary, operating over the range 400 to 2500 nm. Adjoint codes for all model and operator components are provided in the prototype by automatic differentiation of the computer codes. In this paper, EO-LDAS is applied to the problem of daily estimation of six of the parameters controlling the radiative transfer operator over the course of a year (> 2000 state vector elements). Zero- and first-order process model constraints are implemented and explored as the dynamic model. The assimilation estimates all state vector elements simultaneously. This is performed in the context of a typical Sentinel-2 MSI operating scenario, using synthetic MSI observations simulated with the observation operator, with uncertainties typical of those achieved by optical sensors assumed for the data. The experiments consider a baseline state vector estimation case where dynamic constraints are applied, and assess the impact of dynamic constraints on the a posteriori uncertainties. The results demonstrate that reductions in uncertainty by a factor of up to two might be obtained by applying the sorts of dynamic constraints used here. The hyperparameter (dynamic model uncertainty) required to control the assimilation is estimated by a cross-validation exercise. The result of the assimilation is seen to be robust to missing observations with quite large data gaps.
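As a toy illustration of the hyperparameter point only (not the EO-LDAS prototype), the sketch below treats a first-order process-model constraint as Tikhonov regularisation on day-to-day differences of a scalar state trajectory and selects the constraint strength by cross-validation on held-out synthetic observations.

```python
# Toy sketch: a first-order dynamic constraint acts like a smoothness penalty, and the
# "dynamic model uncertainty" hyperparameter (gamma) is chosen by cross-validation.
import numpy as np

rng = np.random.default_rng(1)
n = 365
truth = 0.5 + 0.3 * np.sin(2 * np.pi * np.arange(n) / 365)    # smooth annual trajectory
obs_idx = np.sort(rng.choice(n, size=120, replace=False))      # sparse, gappy observations
y = truth[obs_idx] + 0.05 * rng.normal(size=obs_idx.size)

D = np.diff(np.eye(n), axis=0)                                 # first-difference operator

def estimate(train_idx, y_train, gamma):
    """Solve min ||x[train_idx] - y_train||^2 + gamma * ||D x||^2."""
    H = np.zeros((train_idx.size, n))
    H[np.arange(train_idx.size), train_idx] = 1.0
    return np.linalg.solve(H.T @ H + gamma * D.T @ D, H.T @ y_train)

best = None
for gamma in [0.1, 1.0, 10.0, 100.0]:
    folds = np.array_split(rng.permutation(obs_idx.size), 5)   # 5-fold CV over observations
    err = 0.0
    for k in range(5):
        test = folds[k]
        train = np.setdiff1d(np.arange(obs_idx.size), test)
        x = estimate(obs_idx[train], y[train], gamma)
        err += np.mean((x[obs_idx[test]] - y[test]) ** 2)
    if best is None or err < best[1]:
        best = (gamma, err)
print("selected gamma:", best[0])
```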
Abstract:
Background. Within a therapeutic gene by environment (G×E) framework, we recently demonstrated that variation in the serotonin transporter promoter polymorphism (5HTTLPR) and marker rs6330 in the nerve growth factor gene (NGF) is associated with poorer outcomes following cognitive behaviour therapy (CBT) for child anxiety disorders. The aim of this study was to explore one potential means of extending the translational reach of G×E data in a way that may be clinically informative. We describe a ‘risk-index’ approach combining genetic, demographic and clinical data and test its ability to predict diagnostic outcome following CBT in anxious children. Method. DNA and clinical data were collected from 384 children with a primary anxiety disorder undergoing CBT. We tested our risk model in five cross-validation training sets. Results. In predicting treatment outcome, six variables had a minimum mean beta value of 0.5: 5HTTLPR, NGF rs6330, gender, primary anxiety severity, comorbid mood disorder and comorbid externalising disorder. A risk index (range 0-8) constructed from these variables had moderate predictive ability (AUC = .62-.69) in this study. Children scoring high on this index (5-8) were approximately three times as likely to retain their primary anxiety disorder at follow-up as children scoring 2 or less. Conclusion. Significant genetic, demographic and clinical predictors of outcome following CBT for anxiety-disordered children were identified. Combining these predictors within a risk index could help identify which children are less likely to be diagnosis-free following CBT alone and thus may require longer or enhanced treatment. The ‘risk-index’ approach represents one means of harnessing the translational potential of G×E data.
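The sketch below illustrates the general risk-index idea on simulated data (it is not the study's analysis): fit a logistic regression with cross-validation to gauge discrimination, then score children by a simple count of risk factors. The predictor set merely mirrors the variables named above.

```python
# Illustrative sketch: cross-validated AUC of a logistic model versus a crude additive risk index.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n = 384
X = rng.integers(0, 2, size=(n, 6))   # hypothetical binary-coded predictors standing in for
                                      # 5HTTLPR, NGF rs6330, gender, severity and comorbidities
y = (X.sum(axis=1) + rng.normal(scale=1.5, size=n) > 3).astype(int)  # simulated retained diagnosis

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
probs = cross_val_predict(LogisticRegression(), X, y, cv=cv, method="predict_proba")[:, 1]
print("cross-validated AUC of the full model:", round(roc_auc_score(y, probs), 2))

# A crude risk index: count of risk factors present (0-6 here; the study's index ranged 0-8).
risk_index = X.sum(axis=1)
print("AUC of the count-based risk index:", round(roc_auc_score(y, risk_index), 2))
```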
Abstract:
We present an efficient graph-based algorithm for quantifying the similarity of household-level energy use profiles, using a notion of similarity that allows for small time-shifts when comparing profiles. Experimental results on a real smart meter data set demonstrate that in cases of practical interest our technique is far faster than the existing method for computing the same similarity measure. Having a fast algorithm for measuring profile similarity improves the efficiency of tasks such as clustering of customers and cross-validation of forecasting methods using historical data. Furthermore, we apply a generalisation of our algorithm to produce substantially better household-level energy use forecasts from historical smart meter data.
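As a point of reference only (the paper's contribution is a much faster graph-based algorithm for such a measure), a naive shift-tolerant dissimilarity between two load profiles can be written directly:

```python
# Naive reference implementation of a shift-tolerant dissimilarity: each point may be
# matched against its neighbours within +/- max_shift time steps.
import numpy as np

def shift_tolerant_distance(a: np.ndarray, b: np.ndarray, max_shift: int = 2) -> float:
    """Sum over t of the smallest squared difference between a[t] and any
    b[t + s] with |s| <= max_shift (window clipped at the profile ends)."""
    n = len(a)
    total = 0.0
    for t in range(n):
        lo, hi = max(0, t - max_shift), min(n, t + max_shift + 1)
        total += np.min((a[t] - b[lo:hi]) ** 2)
    return total

# Example: two half-hourly profiles differing mainly by a one-step shift score as very
# similar, while an unrelated profile does not.
rng = np.random.default_rng(3)
base = np.abs(rng.normal(size=48))
print(shift_tolerant_distance(base, np.roll(base, 1)),
      shift_tolerant_distance(base, np.abs(rng.normal(size=48))))
```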
Abstract:
We propose a new class of neurofuzzy construction algorithms with the aim of maximizing generalization capability, specifically for imbalanced data classification problems, based on leave-one-out (LOO) cross-validation. The algorithms proceed in two stages: first, an initial rule base is constructed by estimating a Gaussian mixture model with analysis-of-variance decomposition from the input data; second, joint weighted least squares parameter estimation and rule selection are carried out using an orthogonal forward subspace selection (OFSS) procedure. We show how different LOO-based rule selection criteria can be incorporated with OFSS, and advocate maximizing either the leave-one-out area under the receiver operating characteristic curve or the leave-one-out F-measure if the data sets exhibit an imbalanced class distribution. Extensive comparative simulations illustrate the effectiveness of the proposed algorithms.
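For readers unfamiliar with the selection criteria, the sketch below computes a leave-one-out AUC and a leave-one-out F-measure by an explicit LOO loop with a generic scikit-learn classifier; the paper's algorithms obtain the corresponding LOO quantities analytically inside the forward selection procedure.

```python
# Sketch of the two LOO-based criteria on a simulated imbalanced two-class problem.
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = (rng.random(200) < 0.15).astype(int)      # roughly 15% positives: imbalanced classes

loo = LeaveOneOut()
scores = cross_val_predict(LogisticRegression(), X, y, cv=loo, method="predict_proba")[:, 1]

print("LOO AUC:      ", roc_auc_score(y, scores))
print("LOO F-measure:", f1_score(y, (scores > 0.5).astype(int)))
```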
Abstract:
We develop an orthogonal forward selection (OFS) approach to construct radial basis function (RBF) network classifiers for two-class problems. Our approach integrates several concepts in probabilistic modelling, including cross-validation, mutual information and Bayesian hyperparameter fitting. At each stage of the OFS procedure, one model term is selected by maximising the leave-one-out mutual information (LOOMI) between the classifier's predicted class labels and the true class labels. We derive the formula of LOOMI within the OFS framework so that the LOOMI can be evaluated efficiently for model term selection. Furthermore, a Bayesian procedure of hyperparameter fitting is integrated into each stage of the OFS to infer the l2-norm based local regularisation parameter from the data. Since each forward stage is effectively the fitting of a one-variable model, this task is very fast. The classifier construction procedure terminates automatically, without the need for an additional stopping criterion, and yields very sparse RBF classifiers with excellent classification generalisation performance, which is particularly useful for noisy data sets with highly overlapping class distributions. A number of benchmark examples are employed to demonstrate the effectiveness of our proposed approach.
Abstract:
Simulation models are widely employed to make probability forecasts of future conditions on seasonal to annual lead times. Added value in such forecasts is reflected in the information they add, either to purely empirical statistical models or to simpler simulation models. An evaluation of seasonal probability forecasts from the Development of a European Multimodel Ensemble system for seasonal to inTERannual prediction (DEMETER) and ENSEMBLES multi-model ensemble experiments is presented. Two particular regions are considered: Nino3.4 in the Pacific and the Main Development Region in the Atlantic; these regions were chosen before any spatial distribution of skill was examined. The ENSEMBLES models are found to have skill against the climatological distribution on seasonal time-scales. For models in ENSEMBLES that have a clearly defined predecessor model in DEMETER, the improvement from DEMETER to ENSEMBLES is discussed. Due to the long lead times of the forecasts and the evolution of observation technology, the forecast-outcome archive for seasonal forecast evaluation is small; arguably, evaluation data for seasonal forecasting will always be precious. Issues of information contamination from in-sample evaluation are discussed and impacts (both positive and negative) of variations in cross-validation protocol are demonstrated. Other difficulties due to the small forecast-outcome archive are identified. The claim that the multi-model ensemble provides a ‘better’ probability forecast than the best single model is examined and challenged. Significant forecast information beyond the climatological distribution is also demonstrated in a persistence probability forecast. The ENSEMBLES probability forecasts add significantly more information to empirical probability forecasts on seasonal time-scales than on decadal scales. Current operational forecasts might be enhanced by melding information from both simulation models and empirical models. Simulation models based on physical principles are sometimes expected, in principle, to outperform empirical models; direct comparison of their forecast skill provides information on progress toward that goal.
Abstract:
An efficient data-based modeling algorithm for nonlinear system identification is introduced for radial basis function (RBF) neural networks, with the aim of maximizing generalization capability based on the concept of leave-one-out (LOO) cross-validation. Each of the RBF kernels has its own kernel width parameter, and the basic idea is to optimize the multiple pairs of regularization parameters and kernel widths, each of which is associated with a kernel, one at a time within the orthogonal forward regression (OFR) procedure. Thus, each OFR step consists of one model term selection based on the LOO mean square error (LOOMSE), followed by the optimization of the associated kernel width and regularization parameter, also based on the LOOMSE. Since the same LOOMSE is adopted for model selection as in our previous state-of-the-art local regularization assisted orthogonal least squares (LROLS) algorithm, the proposed new OFR algorithm is also capable of producing a very sparse RBF model with excellent generalization performance. Unlike the previous LROLS algorithm, which requires an additional iterative loop to optimize the regularization parameters as well as an additional procedure to optimize the kernel width, the proposed new OFR algorithm optimizes both the kernel widths and regularization parameters within a single OFR procedure, and consequently the required computational complexity is dramatically reduced. Nonlinear system identification examples are included to demonstrate the effectiveness of this new approach in comparison to the well-known support vector machine and least absolute shrinkage and selection operator approaches, as well as the LROLS algorithm.
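The reason the LOOMSE is cheap for linear-in-the-parameters models such as RBF networks is the standard leave-one-out identity for least squares; the sketch below demonstrates it with an explicit hat-matrix computation (the OFR algorithm itself uses an equivalent recursive, orthogonalised form, not this brute-force version).

```python
# Sketch: leave-one-out residuals from ordinary residuals and leverages, with no refitting.
import numpy as np

rng = np.random.default_rng(5)
n, m = 200, 10
Phi = rng.normal(size=(n, m))                   # design matrix of selected regressors
y = Phi @ rng.normal(size=m) + 0.1 * rng.normal(size=n)

theta = np.linalg.lstsq(Phi, y, rcond=None)[0]
resid = y - Phi @ theta
H = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)   # hat matrix
leverage = np.diag(H)

loo_resid = resid / (1.0 - leverage)            # e_i^{(-i)} = e_i / (1 - h_ii)
print("LOO mean square error:", np.mean(loo_resid ** 2))
```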
Abstract:
Accurate and reliable rain rate estimates are important for various hydrometeorological applications. Consequently, rain sensors of different types have been deployed in many regions. In this work, measurements from different instruments, namely, rain gauge, weather radar, and microwave link, are combined for the first time to estimate with greater accuracy the spatial distribution and intensity of rainfall. The objective is to retrieve the rain rate that is consistent with all these measurements while incorporating the uncertainty associated with the different sources of information. Assuming the problem is not strongly nonlinear, a variational approach is implemented and the Gauss–Newton method is used to minimize the cost function containing proper error estimates from all sensors. Furthermore, the method can be flexibly adapted to additional data sources. The proposed approach is tested using data from 14 rain gauges and 14 operational microwave links located in the Zürich area (Switzerland) to correct the prior rain rate provided by the operational radar rain product from the Swiss meteorological service (MeteoSwiss). A cross-validation approach demonstrates the improvement of rain rate estimates when assimilating rain gauge and microwave link information.
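A toy sketch of the variational idea (not the operational implementation): minimise a cost that penalises departures from the radar prior and from the point observations, each weighted by its error covariance, using Gauss-Newton iterations. The grid, sensor placement and error values below are invented for illustration, and the observation operator is kept linear for simplicity.

```python
# Toy variational blending of a prior rain field with point observations via Gauss-Newton.
import numpy as np

rng = np.random.default_rng(6)
n_pix, n_obs = 50, 14
x_b = np.abs(rng.normal(loc=2.0, size=n_pix))          # prior rain rates from radar (mm/h)
H = np.zeros((n_obs, n_pix))                           # each sensor samples one pixel
H[np.arange(n_obs), rng.choice(n_pix, n_obs, replace=False)] = 1.0
y = H @ x_b + rng.normal(scale=0.3, size=n_obs)        # simulated gauge/link observations

B_inv = np.eye(n_pix) / 1.0**2                         # inverse prior error covariance
R_inv = np.eye(n_obs) / 0.3**2                         # inverse observation error covariance

x = x_b.copy()
for _ in range(5):                                     # Gauss-Newton iterations
    grad = B_inv @ (x - x_b) - H.T @ R_inv @ (y - H @ x)
    hess = B_inv + H.T @ R_inv @ H                     # Gauss-Newton Hessian approximation
    x = x - np.linalg.solve(hess, grad)
print("max analysis increment (mm/h):", np.max(np.abs(x - x_b)))
```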
Abstract:
This paper investigates the feasibility of using approximate Bayesian computation (ABC) to calibrate and evaluate complex individual-based models (IBMs). As ABC evolves, various versions are emerging, but here we only explore the most accessible version, rejection-ABC. Rejection-ABC involves running models a large number of times, with parameters drawn randomly from their prior distributions, and then retaining the simulations closest to the observations. Although well-established in some fields, whether ABC will work with ecological IBMs is still uncertain. Rejection-ABC was applied to an existing 14-parameter earthworm energy budget IBM for which the available data consist of body mass growth and cocoon production in four experiments. ABC was able to narrow the posterior distributions of seven parameters, estimating credible intervals for each. ABC’s accepted values produced slightly better fits than literature values do. The accuracy of the analysis was assessed using cross-validation and coverage, currently the best available tests. Of the seven unnarrowed parameters, ABC revealed that three were correlated with other parameters, while the remaining four were found to be not estimable given the data available. It is often desirable to compare models to see whether all component modules are necessary. Here we used ABC model selection to compare the full model with a simplified version which removed the earthworm’s movement and much of the energy budget. We are able to show that inclusion of the energy budget is necessary for a good fit to the data. We show how our methodology can inform future modelling cycles, and briefly discuss how more advanced versions of ABC may be applicable to IBMs. We conclude that ABC has the potential to represent uncertainty in model structure, parameters and predictions, and to embed the often complex process of optimizing an IBM’s structure and parameters within an established statistical framework, thereby making the process more transparent and objective.
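Rejection-ABC itself is straightforward to sketch. The example below applies it to a deliberately simple stand-in growth model rather than the earthworm IBM: draw parameters from the prior, simulate, and keep the draws whose simulated output is closest to the (here simulated) observations.

```python
# Minimal rejection-ABC sketch for a single growth-rate parameter of a toy model.
import numpy as np

rng = np.random.default_rng(7)
t = np.arange(0, 30)
true_rate = 0.12
observed = 0.5 * np.exp(true_rate * t) + rng.normal(scale=0.5, size=t.size)

def simulate(rate):
    return 0.5 * np.exp(rate * t)

n_sims, keep = 50_000, 500
prior_draws = rng.uniform(0.0, 0.3, size=n_sims)                   # uniform prior on the rate
distances = np.array([np.sqrt(np.mean((simulate(r) - observed) ** 2))
                      for r in prior_draws])
accepted = prior_draws[np.argsort(distances)[:keep]]                # retain the closest 1%

lo, hi = np.percentile(accepted, [2.5, 97.5])
print(f"approximate 95% credible interval for the rate: [{lo:.3f}, {hi:.3f}]")
```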
Abstract:
The clear cell subtype of renal cell carcinoma (RCC) is the most lethal and prevalent cancer of the urinary system. To investigate the molecular changes associated with malignant transformation in clear cell RCC, the gene expression profiles of matched samples of tumor and adjacent non-neoplastic tissue were obtained from six patients. A custom-built cDNA microarray platform was used, comprising 2292 probes that map to exons of genes and 822 probes for noncoding RNAs mapping to intronic regions. Intronic transcription was detected in all normal and neoplastic renal tissues. A subset of 55 transcripts was significantly down-regulated in clear cell RCC relative to the matched nontumor tissue, as determined by a combination of two statistical tests and leave-one-out patient cross-validation. Among the down-regulated transcripts, 49 mapped to untranslated or coding exons and 6 were intronic relative to known exons of protein-coding genes. Lower levels of expression of SIN3B, TRIP3, SYNJ2BP and NDE1 (P<0.02), and of intronic transcripts derived from the SND1 and ACTN4 loci (P<0.05), were confirmed in clear cell RCC by real-time RT-PCR. A subset of 25 transcripts was deregulated in six additional non-clear cell RCC samples, pointing to common transcriptional alterations in RCC irrespective of the histological subtype or differentiation state of the tumor. Our results indicate a novel set of tumor suppressor gene candidates, including noncoding intronic RNAs, which may play a significant role in malignant transformation of normal renal cells.
Abstract:
Objective: To investigate whether spirography-based objective measures are able to effectively characterize the severity of unwanted symptom states (Off and dyskinesia) and discriminate them from the motor state of healthy elderly subjects. Background: Sixty-five patients with advanced Parkinson’s disease (PD) and 10 healthy elderly (HE) subjects performed repeated assessments of spirography, using a touch screen telemetry device in their home environments. On inclusion, the patients were either treated with levodopa-carbidopa intestinal gel or were candidates for switching to this treatment. On each test occasion, the subjects were asked to trace a pre-drawn Archimedes spiral shown on the screen, using an ergonomic pen stylus. The test was repeated three times and was performed using the dominant hand. A clinician used a web interface which animated the spiral drawings, allowing him to observe different kinematic features, like accelerations and spatial changes, during the drawing process and to rate different motor impairments. Initially, the motor impairments of drawing speed, irregularity and hesitation were rated on a 0 (normal) to 4 (extremely severe) scale, followed by marking the momentary motor state of the patient in two categories, Off and dyskinesia. A sample of spirals drawn by HE subjects was randomly selected and used in the subsequent analysis. Methods: The raw spiral data, consisting of stylus position and timestamp, were processed using time series analysis techniques, like the discrete wavelet transform, approximate entropy and dynamic time warping, in order to extract 13 quantitative measures representing meaningful motor impairment information. A principal component analysis (PCA) was used to reduce the dimensions of the quantitative measures to 4 principal components (PCs). In order to classify the motor states into three categories (Off, HE and dyskinesia), a logistic regression model was used as a classifier to map the 4 PCs to the corresponding clinically assigned motor state categories. A stratified 10-fold cross-validation (also known as rotation estimation) was applied to assess the generalization ability of the logistic regression classifier to future independent data sets. To investigate mean differences of the 4 PCs across the three categories, a one-way ANOVA test followed by Tukey multiple comparisons was used. Results: The agreements between computed and clinician ratings were very good, with a weighted area under the receiver operating characteristic curve (AUC) coefficient of 0.91. The mean PC scores differed across the three motor state categories, although at different levels. The first 2 PCs were good at discriminating between the motor states, whereas PC3 was good at discriminating between HE subjects and PD patients. The mean scores of PC4 showed a trend across the three states but without significant differences. The Spearman’s rank correlations between the first 2 PCs and clinically assessed motor impairments were as follows: drawing speed (PC1, 0.34; PC2, 0.83), irregularity (PC1, 0.17; PC2, 0.17), and hesitation (PC1, 0.27; PC2, 0.77). Conclusions: These findings suggest that spirography-based objective measures are valid measures of spatial- and time-dependent deficits and can be used to distinguish drug-related motor dysfunctions between Off and dyskinesia in PD. These measures could be useful during clinical evaluation of individualized drug-related complications, such as over- and under-medication, thus maximizing the amount of time patients spend in the On state.
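The analysis pipeline described in the Methods section, PCA followed by logistic regression assessed with stratified 10-fold cross-validation, can be sketched as follows with scikit-learn; the spiral measures and motor-state labels here are simulated stand-ins, so the accuracy is at chance.

```python
# Sketch: PCA to 4 components, multinomial logistic regression over three motor states,
# evaluated with stratified 10-fold cross-validation.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(8)
X = rng.normal(size=(300, 13))                         # 13 quantitative spiral measures (simulated)
y = rng.choice(["Off", "HE", "dyskinesia"], size=300)  # clinician-assigned motor state (simulated)

model = make_pipeline(StandardScaler(), PCA(n_components=4),
                      LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
print("10-fold cross-validated accuracy:", cross_val_score(model, X, y, cv=cv).mean())
```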
Abstract:
This paper presents the development and evaluation of a method for enabling quantitative and automatic scoring of alternating tapping performance of patients with Parkinson’s disease (PD). Ten healthy elderly subjects and 95 patients in different clinical stages of PD have utilized a touch-pad handheld computer to perform alternate tapping tests in their home environments. First, a neurologist used a web-based system to visually assess impairments in four tapping dimensions (‘speed’, ‘accuracy’, ‘fatigue’ and ‘arrhythmia’) and a global tapping severity (GTS). Second, tapping signals were processed with time series analysis and statistical methods to derive 24 quantitative parameters. Third, principal component analysis was used to reduce the dimensions of these parameters and to obtain scores for the four dimensions. Finally, a logistic regression classifier was trained using a 10-fold stratified cross-validation to map the reduced parameters to the corresponding visually assessed GTS scores. Results showed that the computed scores correlated well to visually assessed scores and were significantly different across Unified Parkinson’s Disease Rating Scale scores of upper limb motor performance. In addition, they had good internal consistency, had good ability to discriminate between healthy elderly and patients in different disease stages, had good sensitivity to treatment interventions and could reflect the natural disease progression over time. In conclusion, the automatic method can be useful to objectively assess the tapping performance of PD patients and can be included in telemedicine tools for remote monitoring of tapping.
Abstract:
Objective: To define and evaluate a Computer-Vision (CV) method for scoring Paced Finger-Tapping (PFT) in Parkinson's disease (PD) using quantitative motion analysis of the index fingers, and to compare the obtained scores to the UPDRS (Unified Parkinson's Disease Rating Scale) finger-taps (FT) item. Background: Naked-eye evaluation of PFT in clinical practice offers only coarse resolution for determining PD status. Besides, sensor mechanisms for PFT evaluation may cause patients discomfort. In order to avoid the cost and effort of applying wearable sensors, a CV system for non-invasive PFT evaluation is introduced. Methods: A database of 221 PFT videos from 6 PD patients was processed. The subjects were instructed to position their hands above their shoulders, beside the face, and to tap the index finger against the thumb consistently and with speed. They faced a pivoted camera during recording. The videos were rated by two clinicians on symptom levels 0 to 3 using UPDRS-FT. The CV method incorporates a motion analyzer and a face detector. The method detects the face of the testee in each video frame. The frame is split into two images from the centre of the face rectangle. Two regions of interest are located in each image to detect index-finger motion of the left and right hands, respectively. Tracking the opening and closing phases of the dominant-hand index finger produces a tapping time series. This time series is normalized by the face height. The normalization calibrates the amplitude of the tapping signal, which is affected by the varying distance between camera and subject (the farther the camera, the smaller the amplitude). A total of 15 features were classified using a K-nearest neighbor (KNN) classifier to characterize the symptom levels in UPDRS-FT. The target ratings provided by the raters were averaged. Results: In a 10-fold cross-validation, the KNN classifier assigned the 221 videos to the three symptom levels with 75% accuracy. An area under the receiver operating characteristic curve of 82.6% supports the feasibility of the obtained features for replicating clinical assessments. Conclusions: The system is able to track index-finger motion to estimate tapping symptoms in PD. It has certain advantages compared to other technologies (e.g. magnetic sensors, accelerometers) for PFT evaluation, improving and automating the ratings.
Abstract:
Objective: To develop a method for objective quantification of PD motor symptoms related to Off episodes and peak-dose dyskinesias, using spiral data gathered with a touch screen telemetry device. The aim was to objectively characterize predominant motor phenotypes (bradykinesia and dyskinesia), to help in automating the process of visual interpretation of movement anomalies in spirals as rated by movement disorder specialists. Background: A retrospective analysis was conducted on recordings from 65 patients with advanced idiopathic PD from nine different clinics in Sweden, recruited from January 2006 until August 2010. In addition to the patient group, 10 healthy elderly subjects were recruited. Upper limb movement data were collected using a touch screen telemetry device in the home environments of the subjects. Measurements with the device were performed four times per day during week-long test periods. On each test occasion, the subjects were asked to trace pre-drawn Archimedean spirals, using the dominant hand. The pre-drawn spiral was shown on the screen of the device. The spiral test was repeated three times per test occasion and the subjects were instructed to complete it within 10 seconds. The device had a sampling rate of 10 Hz and measured both the position and time-stamps (in milliseconds) of the pen tip. Methods: Four independent raters (FB, DH, AJ and DN) used a web interface that animated the spiral drawings and allowed them to observe different kinematic features during the drawing process and to rate task performance. Initially, a number of kinematic features were assessed, including ‘impairment’, ‘speed’, ‘irregularity’ and ‘hesitation’, followed by marking the predominant motor phenotype on a 3-category scale: tremor, bradykinesia and/or choreatic dyskinesia. There were only two test occasions that all four raters either classified as tremor or for which they could not identify the motor phenotype; therefore, the two main motor phenotype categories were bradykinesia and dyskinesia. ‘Impairment’ was rated on a scale from 0 (no impairment) to 10 (extremely severe), whereas ‘speed’, ‘irregularity’ and ‘hesitation’ were rated on a scale from 0 (normal) to 4 (extremely severe). The proposed data-driven method consisted of the following steps. Initially, 28 spatiotemporal features were extracted from the time series signals before being presented to a multilayer perceptron (MLP) classifier. The features were based on different kinematic quantities of the spirals, including radius, angle, speed and velocity, with the aim of measuring the severity of involuntary symptoms and of discriminating between PD-specific (bradykinesia) and/or treatment-induced symptoms (dyskinesia). A principal component analysis was applied to the features to reduce their dimensionality; 4 relevant principal components (PCs) were retained and used as inputs to the MLP classifier. Finally, the MLP classifier mapped these components to the corresponding visually assessed motor phenotype scores, automating the process of scoring bradykinesia and dyskinesia in PD patients while they draw spirals using the touch screen device. For motor phenotype (bradykinesia vs. dyskinesia) classification, stratified 10-fold cross-validation was employed.
Results: There were good agreements between the four raters when rating the individual kinematic features, with intra-class correlation coefficients (ICC) of 0.88 for ‘impairment’, 0.74 for ‘speed’ and 0.70 for ‘irregularity’, and moderate agreement when rating ‘hesitation’, with an ICC of 0.49. When assessing the two main motor phenotype categories (bradykinesia or dyskinesia) in animated spirals, the agreements between the four raters ranged from fair to moderate. There were good correlations between the mean ratings of the four raters on individual kinematic features and the computed scores. The MLP classifier classified the motor phenotype, that is bradykinesia or dyskinesia, with an accuracy of 85% in relation to the visual classifications of the four movement disorder specialists. The test-retest reliability of the four PCs across the three spiral test trials was good, with Cronbach’s alpha coefficients of 0.80, 0.82, 0.54 and 0.49, respectively. These results indicate that the computed scores are stable and consistent over time. Significant differences were found between the two groups (patients and healthy elderly subjects) in all the PCs except PC3. Conclusions: The proposed method automatically assessed the severity of unwanted symptoms and could reasonably well discriminate between PD-specific and/or treatment-induced motor symptoms, in relation to visual assessments of movement disorder specialists. The objective assessments could provide a time-effect summary score that could be useful for improving decision-making during symptom evaluation of individualized treatment, when the goal is to maximize functional On time for patients while minimizing their Off episodes and troublesome dyskinesias.
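As a small aside on the test-retest analysis, Cronbach's alpha for a component scored on repeated trials is a short computation; the sketch below uses simulated scores, not the study's data.

```python
# Sketch: Cronbach's alpha for one principal component scored on three repeated spiral trials.
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: (n_subjects, n_trials) array of the same measure across trials."""
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()      # sum of per-trial variances
    total_var = scores.sum(axis=1).var(ddof=1)        # variance of per-subject totals
    return (k / (k - 1)) * (1.0 - item_var / total_var)

rng = np.random.default_rng(9)
subject_level = rng.normal(size=(65, 1))                   # stable per-patient component score
trials = subject_level + 0.4 * rng.normal(size=(65, 3))    # three trials with measurement noise
print("Cronbach's alpha:", round(cronbach_alpha(trials), 2))
```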