40 results for Analisi Discriminante, Teoria dei Network, Cross-Validation, Validazione.
Abstract:
The paper introduces an efficient construction algorithm for obtaining sparse linear-in-the-weights regression models based on an approach of directly optimizing model generalization capability. This is achieved by utilizing the delete-1 cross validation concept and the associated leave-one-out test error, also known as the predicted residual sums of squares (PRESS) statistic, without resorting to any other validation data set for model evaluation in the model construction process. Computational efficiency is ensured using an orthogonal forward regression, but the algorithm incrementally minimizes the PRESS statistic instead of the usual sum of the squared training errors. A local regularization method can naturally be incorporated into the model selection procedure to further enforce model sparsity. The proposed algorithm is fully automatic, and the user is not required to specify any criterion to terminate the model construction procedure. Comparisons with some of the existing state-of-the-art modeling methods are given, and several examples are included to demonstrate the ability of the proposed algorithm to effectively construct sparse models that generalize well.
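The PRESS statistic mentioned in this abstract can be illustrated with a minimal sketch. It assumes a plain least-squares fit of a linear-in-the-weights model and uses the standard hat-matrix identity for leave-one-out residuals; it is not the orthogonal forward regression procedure of the paper, and the names `press_statistic`, `press_brute_force`, and the synthetic data are illustrative assumptions.

```python
import numpy as np

def press_statistic(Phi, y):
    """PRESS for the least-squares model y ~ Phi @ w, without refitting.

    Uses the identity e_loo[i] = e[i] / (1 - h[i]), where h[i] is the
    i-th diagonal entry of the hat matrix H = Phi (Phi^T Phi)^-1 Phi^T.
    """
    H = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)
    residuals = y - H @ y
    loo_residuals = residuals / (1.0 - np.diag(H))
    return float(np.sum(loo_residuals ** 2))

def press_brute_force(Phi, y):
    """Same quantity computed by actually deleting one sample at a time."""
    n = Phi.shape[0]
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        w, *_ = np.linalg.lstsq(Phi[keep], y[keep], rcond=None)
        total += (y[i] - Phi[i] @ w) ** 2
    return float(total)

# Synthetic regression problem (assumed data, for illustration only).
rng = np.random.default_rng(0)
Phi = rng.normal(size=(30, 4))
y = Phi @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=30)

print(press_statistic(Phi, y), press_brute_force(Phi, y))
```

The two functions should agree up to rounding, which is why a leave-one-out criterion can be evaluated from a single fit and needs no separate validation set.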
Abstract:
The identification of non-linear systems using only observed finite datasets has become a mature research area over the last two decades. A class of linear-in-the-parameter models with universal approximation capabilities has been intensively studied and widely used due to the availability of many linear-learning algorithms and their inherent convergence conditions. This article presents a systematic overview of basic research on model selection approaches for linear-in-the-parameter models. One of the fundamental problems in non-linear system identification is to find the minimal model with the best model generalisation performance from observational data only. The important concepts in achieving good model generalisation used in various non-linear system-identification algorithms are first reviewed, including Bayesian parameter regularisation and model selection criteria based on cross validation and experimental design. A significant advance in machine learning has been the development of the support vector machine as a means for identifying kernel models based on the structural risk minimisation principle. Developments in convex optimisation-based model construction algorithms, including the support vector regression algorithms, are outlined. Input selection algorithms and on-line system identification algorithms are also included in this review. Finally, some industrial applications of non-linear models are discussed.
Abstract:
We propose a unified data modeling approach that is equally applicable to supervised regression and classification applications, as well as to unsupervised probability density function estimation. A particle swarm optimization (PSO) aided orthogonal forward regression (OFR) algorithm based on leave-one-out (LOO) criteria is developed to construct parsimonious radial basis function (RBF) networks with tunable nodes. Each stage of the construction process determines the center vector and diagonal covariance matrix of one RBF node by minimizing the LOO statistics. For regression applications, the LOO criterion is chosen to be the LOO mean square error, while the LOO misclassification rate is adopted in two-class classification applications. By adopting the Parzen window estimate as the desired response, the unsupervised density estimation problem is transformed into a constrained regression problem. This PSO aided OFR algorithm for tunable-node RBF networks is capable of constructing very parsimonious RBF models that generalize well, and our analysis and experimental results demonstrate that the algorithm is computationally even simpler than the efficient regularization assisted orthogonal least square algorithm based on LOO criteria for selecting fixed-node RBF models. Another significant advantage of the proposed learning procedure is that it does not have learning hyperparameters that have to be tuned using costly cross validation. The effectiveness of the proposed PSO aided OFR construction procedure is illustrated using several examples taken from regression and classification, as well as density estimation applications.
Abstract:
We develop a particle swarm optimisation (PSO) aided orthogonal forward regression (OFR) approach for constructing radial basis function (RBF) classifiers with tunable nodes. At each stage of the OFR construction process, the centre vector and diagonal covariance matrix of one RBF node are determined efficiently by minimising the leave-one-out (LOO) misclassification rate (MR) using a PSO algorithm. Compared with the state-of-the-art regularisation assisted orthogonal least square algorithm based on the LOO MR for selecting fixed-node RBF classifiers, the proposed PSO aided OFR algorithm for constructing tunable-node RBF classifiers offers significant advantages in terms of better generalisation performance and smaller model size, and it also imposes lower computational complexity in the classifier construction process. Moreover, the proposed algorithm does not have any hyperparameter that requires costly tuning based on cross validation.
Abstract:
Light Detection And Ranging (LIDAR) is an important modality in terrain and land surveying for many environmental, engineering and civil applications. This paper presents the framework for a recently developed unsupervised classification algorithm called Skewness Balancing for object and ground point separation in airborne LIDAR data. The main advantages of the algorithm are threshold-freedom and independence from LIDAR data format and resolution, while preserving object and terrain details. The framework for Skewness Balancing has been built in this contribution with a prediction model in which unknown LIDAR tiles can be categorised as “hilly” or “moderate” terrains. Accuracy assessment of the model is carried out using cross-validation with an overall accuracy of 95%. An extension to the algorithm is developed to address the overclassification issue for hilly terrain. For moderate terrain, the results show that from the classified tiles detached objects (buildings and vegetation) and attached objects (bridges and motorway junctions) are separated from bare earth (ground, roads and yards) which makes Skewness Balancing ideal to be integrated into geographic information system (GIS) software packages.
Abstract:
A new parameter-estimation algorithm, which minimises the cross-validated prediction error for linear-in-the-parameter models, is proposed, based on stacked regression and an evolutionary algorithm. It is initially shown that cross-validation is very important for prediction in linear-in-the-parameter models using a criterion called the mean dispersion error (MDE). Stacked regression, which can be regarded as a sophisticated type of cross-validation, is then introduced based on an evolutionary algorithm, to produce a new parameter-estimation algorithm, which preserves the parsimony of a concise model structure that is determined using the forward orthogonal least-squares (OLS) algorithm. The PRESS prediction errors are used for cross-validation, and the sunspot and Canadian lynx time series are used to demonstrate the new algorithms.
Abstract:
This study investigated the potential application of mid-infrared spectroscopy (MIR 4,000–900 cm−1) for the determination of milk coagulation properties (MCP), titratable acidity (TA), and pH in Brown Swiss milk samples (n = 1,064). Because MCP directly influence the efficiency of the cheese-making process, there is strong industrial interest in developing a rapid method for their assessment. Currently, the determination of MCP involves time-consuming laboratory-based measurements, and it is not feasible to carry out these measurements on the large numbers of milk samples associated with milk recording programs. Mid-infrared spectroscopy is an objective and nondestructive technique providing rapid real-time analysis of food compositional and quality parameters. Analysis of milk rennet coagulation time (RCT, min), curd firmness (a30, mm), TA (SH°/50 mL; SH° = Soxhlet-Henkel degree), and pH was carried out, and MIR data were recorded over the spectral range of 4,000 to 900 cm−1. Models were developed by partial least squares regression using untreated and pretreated spectra. The MCP, TA, and pH prediction models were improved by using the combined spectral ranges of 1,600 to 900 cm−1, 3,040 to 1,700 cm−1, and 4,000 to 3,470 cm−1. The root mean square errors of cross-validation for the developed models were 2.36 min (RCT, range 24.9 min), 6.86 mm (a30, range 58 mm), 0.25 SH°/50 mL (TA, range 3.58 SH°/50 mL), and 0.07 (pH, range 1.15). The most successfully predicted attributes were TA, RCT, and pH. The model for the prediction of TA provided approximate prediction (R2 = 0.66), whereas the predictive models developed for RCT and pH could discriminate between high and low values (R2 = 0.59 to 0.62). It was concluded that, although the models require further development to improve their accuracy before their application in industry, MIR spectroscopy has potential application for the assessment of RCT, TA, and pH during routine milk analysis in the dairy industry. 
The implementation of such models could be a means of improving MCP through phenotypic-based selection programs and to amend milk payment systems to incorporate MCP into their payment criteria.
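The root mean square error of cross-validation (RMSECV) reported in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions: ordinary least squares stands in for the partial least squares regression used in the study, the fold count and the synthetic data are arbitrary, and the function name `rmsecv` is hypothetical.

```python
import numpy as np

def rmsecv(X, y, n_folds=5, seed=0):
    """Root mean square error of k-fold cross-validation.

    Every sample is predicted exactly once by a model that was fitted
    without it; the squared errors are then pooled over all samples.
    NOTE: ordinary least squares is a stand-in here, not PLS regression.
    """
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    sq_err = np.empty(n)
    for test in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, test)
        # Add an intercept column to the training and held-out designs.
        X_train = np.column_stack([np.ones(len(train)), X[train]])
        X_test = np.column_stack([np.ones(len(test)), X[test]])
        coef, *_ = np.linalg.lstsq(X_train, y[train], rcond=None)
        sq_err[test] = (y[test] - X_test @ coef) ** 2
    return float(np.sqrt(sq_err.mean()))

# Synthetic example (assumed data): three predictors, unit-variance noise.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(40, 3))
y_demo = 5.0 + X_demo @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=40)
print(rmsecv(X_demo, y_demo))
```

Because the RMSECV is expressed in the units of the predicted attribute, the abstract reports it alongside the range of each attribute (for example 2.36 min against a range of 24.9 min for RCT).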
Abstract:
The potential of near infrared spectroscopy in conjunction with partial least squares regression to predict Miscanthus × giganteus and short rotation coppice willow quality indices was examined. Moisture, calorific value, ash and carbon content were predicted with a root mean square error of cross validation of 0.90% (R2 = 0.99), 0.13 MJ/kg (R2 = 0.99), 0.42% (R2 = 0.58), and 0.57% (R2 = 0.88), respectively. The moisture and calorific value prediction models had excellent accuracy, while the carbon and ash models were fair and poor, respectively. The results indicate that near infrared spectroscopy has the potential to predict quality indices of dedicated energy crops; however, the models must be further validated on a wider range of samples prior to implementation. The utilization of such models would assist in the optimal use of the feedstock based on its biomass properties.
Abstract:
The objective of this study was to investigate the potential application of mid-infrared spectroscopy for determination of selected sensory attributes in a range of experimentally manufactured processed cheese samples. This study also evaluates mid-infrared spectroscopy against other recently proposed techniques for predicting sensory texture attributes. Processed cheeses (n = 32) of varying compositions were manufactured on a pilot scale. After 2 and 4 wk of storage at 4°C, mid-infrared spectra (640 to 4,000 cm−1) were recorded and samples were scored on a scale of 0 to 100 for 9 attributes using descriptive sensory analysis. Models were developed by partial least squares regression using raw and pretreated spectra. The mouth-coating and mass-forming models were improved by using a reduced spectral range (930 to 1,767 cm−1). The remaining attributes were most successfully modeled using a combined range (930 to 1,767 cm−1 and 2,839 to 4,000 cm−1). The root mean square errors of cross-validation for the models were 7.4 (firmness; range 65.3), 4.6 (rubbery; range 41.7), 7.1 (creamy; range 60.9), 5.1 (chewy; range 43.3), 5.2 (mouth-coating; range 37.4), 5.3 (fragmentable; range 51.0), 7.4 (melting; range 69.3), and 3.1 (mass-forming; range 23.6). These models had good practical utility. Model accuracy ranged from approximate quantitative predictions to excellent predictions (range error ratio = 9.6). In general, the models compared favorably with previously reported instrumental texture models and near-infrared models, although the creamy, chewy, and melting models were slightly weaker than the previously reported near-infrared models. We concluded that mid-infrared spectroscopy could be successfully used for the nondestructive and objective assessment of processed cheese sensory quality.
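The range error ratio quoted in this abstract can be reproduced from the listed figures. The sketch below assumes the usual definition, the range of the reference values divided by the RMSECV; the dictionary name and layout are illustrative, while the numbers are copied from the abstract.

```python
# (RMSECV, range) pairs for each sensory attribute, as listed in the abstract.
rmsecv_and_range = {
    "firmness": (7.4, 65.3),
    "rubbery": (4.6, 41.7),
    "creamy": (7.1, 60.9),
    "chewy": (5.1, 43.3),
    "mouth-coating": (5.2, 37.4),
    "fragmentable": (5.3, 51.0),
    "melting": (7.4, 69.3),
    "mass-forming": (3.1, 23.6),
}

# Range error ratio (assumed definition): attribute range / RMSECV.
rer = {name: value_range / error
       for name, (error, value_range) in rmsecv_and_range.items()}

for name, ratio in sorted(rer.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:14s} RER = {ratio:.1f}")

best = max(rer, key=rer.get)
```

Under this definition the largest ratio belongs to the fragmentable model (51.0 / 5.3, about 9.6), which matches the value given in the abstract.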
Abstract:
The objective of this study was to determine the potential of mid-infrared spectroscopy in conjunction with partial least squares (PLS) regression to predict various quality parameters in cheddar cheese. Cheddar cheeses (n = 24) were manufactured and stored at 8°C for 12 mo. Mid-infrared spectra (640 to 4,000 cm−1) were recorded after 4, 6, 9, and 12 mo storage. At 4, 6, and 9 mo, the water-soluble nitrogen (WSN) content of the samples was determined and the samples were also evaluated for 11 sensory texture attributes using descriptive sensory analysis. The mid-infrared spectra were subjected to a number of pretreatments, and predictive models were developed for all parameters. Age was predicted using scatter-corrected, 1st derivative spectra with a root mean square error of cross-validation (RMSECV) of 1 mo, while WSN was predicted using 1st derivative spectra (RMSECV = 2.6%). The sensory texture attributes most successfully predicted were rubbery, crumbly, chewy, and mass-forming. These attributes were modeled using 2nd derivative spectra and had corresponding RMSECV values in the range of 2.5 to 4.2 on a scale of 0 to 100. It was concluded that mid-infrared spectroscopy has the potential to predict age, WSN, and several sensory texture attributes of cheddar cheese.
Abstract:
Motivation: A new method that uses support vector machines (SVMs) to predict protein secondary structure is described and evaluated. The study is designed to develop a reliable prediction method using an alternative technique and to investigate the applicability of SVMs to this type of bioinformatics problem. Methods: Binary SVMs are trained to discriminate between two structural classes. The binary classifiers are combined in several ways to predict multi-class secondary structure. Results: The average three-state prediction accuracy per protein (Q3) is estimated by cross-validation to be 77.07 ± 0.26% with a segment overlap (Sov) score of 73.32 ± 0.39%. The SVM performs similarly to the 'state-of-the-art' PSIPRED prediction method on a non-homologous test set of 121 proteins despite being trained on substantially fewer examples. A simple consensus of the SVM, PSIPRED and PROFsec achieves significantly higher prediction accuracy than the individual methods. Availability: The SVM classifier is available from the authors. Work is in progress to make the method available on-line and to integrate the SVM predictions into the PSIPRED server.
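The three-state accuracy per protein (Q3) reported in this abstract can be sketched as follows. This is a minimal illustration assuming the common H/E/C (helix, strand, coil) alphabet and per-residue agreement; the function names and the two toy sequences are hypothetical and are not data from the study.

```python
def q3(true_states, predicted_states):
    """Percentage of residues whose three-state label is predicted correctly."""
    if len(true_states) != len(predicted_states) or not true_states:
        raise ValueError("sequences must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_states, predicted_states))
    return 100.0 * correct / len(true_states)

def mean_q3(pairs):
    """Average Q3 over proteins, so each protein counts equally."""
    scores = [q3(true, pred) for true, pred in pairs]
    return sum(scores) / len(scores)

# Toy (observed, predicted) secondary-structure strings, made up for illustration.
examples = [
    ("HHHEEECCC", "HHHEEECCC"),  # all 9 residues correct
    ("HHEECC", "HHCCCC"),        # 4 of 6 residues correct
]
print(mean_q3(examples))
```

Averaging per protein, as opposed to pooling all residues, keeps long proteins from dominating the score.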
Abstract:
Current methods for estimating vegetation parameters are generally sub-optimal in the way they exploit information and do not generally consider uncertainties. We look forward to a future where operational data assimilation schemes improve estimates by tracking land surface processes and exploiting multiple types of observations. Data assimilation schemes seek to combine observations and models in a statistically optimal way taking into account uncertainty in both, but have not yet been much exploited in this area. The EO-LDAS scheme and prototype, developed under ESA funding, is designed to exploit the anticipated wealth of data that will be available under GMES missions, such as the Sentinel family of satellites, to provide improved mapping of land surface biophysical parameters. This paper describes the EO-LDAS implementation, and explores some of its core functionality. EO-LDAS is a weak constraint variational data assimilation system. The prototype provides a mechanism for constraint based on a prior estimate of the state vector, a linear dynamic model, and Earth Observation data (top-of-canopy reflectance here). The observation operator is a non-linear optical radiative transfer model for a vegetation canopy with a soil lower boundary, operating over the range 400 to 2500 nm. Adjoint codes for all model and operator components are provided in the prototype by automatic differentiation of the computer codes. In this paper, EO-LDAS is applied to the problem of daily estimation of six of the parameters controlling the radiative transfer operator over the course of a year (> 2000 state vector elements). Zero and first order process model constraints are implemented and explored as the dynamic model. The assimilation estimates all state vector elements simultaneously.
This is performed in the context of a typical Sentinel-2 MSI operating scenario, using synthetic MSI observations simulated with the observation operator, with uncertainties typical of those achieved by optical sensors assumed for the data. The experiments consider a baseline state vector estimation case where dynamic constraints are applied, and assess the impact of dynamic constraints on the a posteriori uncertainties. The results demonstrate that reductions in uncertainty by a factor of up to two might be obtained by applying the sorts of dynamic constraints used here. The hyperparameters (dynamic model uncertainty) required to control the assimilation are estimated by a cross-validation exercise. The result of the assimilation is seen to be robust to missing observations with quite large data gaps.
Abstract:
Background. Within a therapeutic gene by environment (G×E) framework, we recently demonstrated that variation in the Serotonin Transporter Promoter Polymorphism (5HTTLPR) and marker rs6330 in the Nerve Growth Factor gene (NGF) is associated with poorer outcomes following cognitive behaviour therapy (CBT) for child anxiety disorders. The aim of this study was to explore one potential means of extending the translational reach of G×E data in a way that may be clinically informative. We describe a ‘risk-index’ approach combining genetic, demographic and clinical data and test its ability to predict diagnostic outcome following CBT in anxious children. Method. DNA and clinical data were collected from 384 children with a primary anxiety disorder undergoing CBT. We tested our risk model in five cross-validation training sets. Results. In predicting treatment outcome, six variables had a minimum mean beta value of 0.5: 5HTTLPR, NGF rs6330, gender, primary anxiety severity, comorbid mood disorder and comorbid externalising disorder. A risk index (range 0-8) constructed from these variables had moderate predictive ability (AUC = .62-.69) in this study. Children scoring high on this index (5-8) were approximately three times as likely to retain their primary anxiety disorder at follow-up as compared to those children scoring 2 or less. Conclusion. Significant genetic, demographic and clinical predictors of outcome following CBT for anxiety-disordered children were identified. Combining these predictors within a risk-index could be used to identify which children are less likely to be diagnosis free following CBT alone and thus require longer or enhanced treatment. The ‘risk-index’ approach represents one means of harnessing the translational potential of G×E data.
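The AUC used in this abstract to summarise the predictive ability of an integer risk index can be sketched as follows. The sketch relies on the rank interpretation of the AUC: the probability that a randomly chosen child who retained the diagnosis has a higher index than a randomly chosen child who did not, with ties counted as one half. The function name `auc` and the eight scores and outcomes are made up for illustration and are not data from the study.

```python
def auc(scores, outcomes):
    """Area under the ROC curve via pairwise comparison of scores.

    outcomes: 1 = retained diagnosis (positive), 0 = diagnosis free.
    """
    positives = [s for s, o in zip(scores, outcomes) if o == 1]
    negatives = [s for s, o in zip(scores, outcomes) if o == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties are common with a coarse 0-8 index
    return wins / (len(positives) * len(negatives))

# Hypothetical risk-index values (0-8) and follow-up outcomes.
index_scores = [6, 5, 7, 2, 3, 5, 1, 4]
retained = [1, 1, 1, 0, 0, 0, 0, 1]
print(auc(index_scores, retained))
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which the reported .62-.69 is described as moderate.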
Abstract:
We present an efficient graph-based algorithm for quantifying the similarity of household-level energy use profiles, using a notion of similarity that allows for small time shifts when comparing profiles. Experimental results on a real smart meter data set demonstrate that in cases of practical interest our technique is far faster than the existing method for computing the same similarity measure. Having a fast algorithm for measuring profile similarity improves the efficiency of tasks such as clustering of customers and cross-validation of forecasting methods using historical data. Furthermore, we apply a generalisation of our algorithm to produce substantially better household-level energy use forecasts from historical smart meter data.
Abstract:
We propose a new class of neurofuzzy construction algorithms with the aim of maximizing generalization capability specifically for imbalanced data classification problems based on leave-one-out (LOO) cross validation. The algorithms have two stages: in the first, an initial rule base is constructed by estimating the Gaussian mixture model with analysis of variance decomposition from input data; the second carries out the joint weighted least squares parameter estimation and rule selection using an orthogonal forward subspace selection (OFSS) procedure. We show how different LOO based rule selection criteria can be incorporated with OFSS, and advocate either maximizing the leave-one-out area under curve of the receiver operating characteristics, or maximizing the leave-one-out F-measure if the data sets exhibit imbalanced class distribution. Extensive comparative simulations illustrate the effectiveness of the proposed algorithms.