990 results for imbalanced data
Abstract:
Learning from imbalanced data is critically important in many application domains and can be a bottleneck for conventional learning methods that assume a balanced data distribution. The class imbalance problem arises when one class massively outnumbers the other. If imbalanced data are used directly, the disparity between the majority and minority classes biases the learner and produces unreliable outcomes. Interest in this research area has been growing and a number of algorithms have been developed; however, independent evaluation of these algorithms is limited. This paper evaluates the performance of five representative data sampling methods that deal with class imbalance problems, namely SMOTE, ADASYN, BorderlineSMOTE, SMOTETomek and RUSBoost. A comparative study is conducted and the performance of each method is critically analysed in terms of assessment metrics. © 2013 Springer-Verlag.
Abstract:
Many kernel classifier construction algorithms adopt classification accuracy as the performance metric in model evaluation. Moreover, equal weighting is often applied to each data sample in parameter estimation. These modeling practices often become problematic if the data sets are imbalanced. We present a kernel classifier construction algorithm using orthogonal forward selection (OFS) in order to optimize the model generalization for imbalanced two-class data sets. This kernel classifier identification algorithm is based on a new regularized orthogonal weighted least squares (ROWLS) estimator and the model selection criterion of maximal leave-one-out area under the curve (LOO-AUC) of the receiver operating characteristics (ROCs). It is shown that, owing to the orthogonalization procedure, the LOO-AUC can be calculated via an analytic formula based on the new ROWLS parameter estimator, without actually splitting the estimation data set. The proposed algorithm achieves minimal computational expense via a set of forward recursive updating formulae when searching for model terms with maximal incremental LOO-AUC value. Numerical examples are used to demonstrate the efficacy of the algorithm.
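To illustrate why accuracy can mislead on imbalanced two-class data while the area under the ROC curve does not, here is a minimal sketch. It is not the analytic LOO-AUC of the abstract above; it computes the ordinary rank-based AUC on a toy data set, and the function names `rank_auc` and `accuracy` are illustrative assumptions.

```python
def rank_auc(scores, labels):
    """AUC as the probability that a random positive scores above a random
    negative, counting ties as one half (plain pairwise comparison)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(scores, labels, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Toy data: 2 positives, 18 negatives, and a model that outputs the same
# low score for everything (it always predicts the majority class).
labels = [1] * 2 + [0] * 18
scores = [0.1] * 20

print(accuracy(scores, labels))  # high, although no positive is detected
print(rank_auc(scores, labels))  # chance level, exposing the useless model
```

The constant model reaches 90% accuracy purely from the class ratio, whereas its AUC stays at chance level, which is the motivation for an AUC-based selection criterion.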
Abstract:
We propose a new class of neurofuzzy construction algorithms with the aim of maximizing generalization capability specifically for imbalanced data classification problems, based on leave-one-out (LOO) cross validation. The algorithms have two stages. In the first stage, an initial rule base is constructed by estimating the Gaussian mixture model with analysis of variance decomposition from the input data. The second stage carries out joint weighted least squares parameter estimation and rule selection using the orthogonal forward subspace selection (OFSS) procedure. We show how different LOO-based rule selection criteria can be incorporated with OFSS, and advocate either maximizing the leave-one-out area under the curve of the receiver operating characteristics or maximizing the leave-one-out F-measure if the data sets exhibit an imbalanced class distribution. Extensive comparative simulations illustrate the effectiveness of the proposed algorithms.
Abstract:
The combination of the synthetic minority oversampling technique (SMOTE) and the radial basis function (RBF) classifier is proposed to deal with classification for imbalanced two-class data. In order to enhance the significance of the small and specific region belonging to the positive class in the decision region, the SMOTE is applied to generate synthetic instances for the positive class to balance the training data set. Based on the over-sampled training data, the RBF classifier is constructed by applying the orthogonal forward selection procedure, in which the classifier structure and the parameters of RBF kernels are determined using a particle swarm optimization algorithm based on the criterion of minimizing the leave-one-out misclassification rate. The experimental results on both simulated and real imbalanced data sets are presented to demonstrate the effectiveness of our proposed algorithm.
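The SMOTE step mentioned above can be sketched as follows. This is a simplified variant written for illustration, not the authors' implementation: the base sample is drawn at random rather than iterating over every minority sample, and the function name `smote` and its parameters are assumptions.

```python
import random


def smote(minority, n_synthetic, k=3, seed=0):
    """Create synthetic minority points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # All other minority samples, sorted by squared Euclidean distance.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Usage: enlarge a four-point positive class by six synthetic points.
positives = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
augmented = positives + smote(positives, n_synthetic=6, k=2)
print(len(augmented))
```

Each synthetic point lies on the line segment between two existing positive samples, so the oversampled class stays within the region already occupied by the minority data.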
Abstract:
This contribution proposes a powerful technique for two-class imbalanced classification problems by combining the synthetic minority over-sampling technique (SMOTE) and the particle swarm optimisation (PSO) aided radial basis function (RBF) classifier. In order to enhance the significance of the small and specific region belonging to the positive class in the decision region, the SMOTE is applied to generate synthetic instances for the positive class to balance the training data set. Based on the over-sampled training data, the RBF classifier is constructed by applying the orthogonal forward selection procedure, in which the classifier's structure and the parameters of RBF kernels are determined using a PSO algorithm based on the criterion of minimising the leave-one-out misclassification rate. The experimental results obtained on a simulated imbalanced data set and three real imbalanced data sets are presented to demonstrate the effectiveness of our proposed algorithm.
Abstract:
This contribution proposes a novel probability density function (PDF) estimation based over-sampling (PDFOS) approach for two-class imbalanced classification problems. The classical Parzen-window kernel function is adopted to estimate the PDF of the positive class. Synthetic instances are then generated according to the estimated PDF and used as additional training data. The essential concept is to re-balance the class distribution of the original imbalanced data set under the principle that the synthetic data samples follow the same statistical properties as the original positive-class data. Based on the over-sampled training data, the radial basis function (RBF) classifier is constructed by applying the orthogonal forward selection procedure, in which the classifier’s structure and the parameters of the RBF kernels are determined using a particle swarm optimisation algorithm based on the criterion of minimising the leave-one-out misclassification rate. The effectiveness of the proposed PDFOS approach is demonstrated by an empirical study on several imbalanced data sets.
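Drawing from a Parzen-window density estimate can be sketched in a few lines: with a Gaussian kernel, sampling from the estimate amounts to picking a stored sample at random and perturbing it with Gaussian noise. The sketch below assumes an isotropic kernel with a fixed, hand-chosen bandwidth; how the paper selects the kernel width and shape is not stated in the abstract, and the name `parzen_oversample` is hypothetical.

```python
import random


def parzen_oversample(minority, n_synthetic, bandwidth=0.1, seed=0):
    """Sample from a Gaussian Parzen-window estimate of the minority class:
    choose a kernel centre uniformly, then add N(0, bandwidth^2) noise to
    each coordinate (isotropic kernel, an assumption of this sketch)."""
    if not minority:
        raise ValueError("minority class is empty")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        centre = rng.choice(minority)
        synthetic.append(tuple(x + rng.gauss(0.0, bandwidth) for x in centre))
    return synthetic


# Usage: generate five synthetic positives around three observed ones.
positives = [(0.2, 0.4), (0.3, 0.5), (0.25, 0.45)]
extra = parzen_oversample(positives, n_synthetic=5, bandwidth=0.05)
print(len(extra))
```

Unlike interpolation-based over-sampling, the synthetic points here are not confined to line segments between existing samples; their spread is controlled by the bandwidth.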
Abstract:
There is an increasing interest in the application of Evolutionary Algorithms (EAs) to induce classification rules. This hybrid approach can benefit areas where classical methods for rule induction have not been very successful. One example is the induction of classification rules in imbalanced domains. Imbalanced data occur when one or more classes heavily outnumber other classes. Frequently, classical machine learning (ML) classifiers are not able to learn in the presence of imbalanced data sets, inducing classification models that always predict the most numerous classes. In this work, we propose a novel hybrid approach to deal with this problem. We create several balanced data sets with all minority class cases and a random sample of majority class cases. These balanced data sets are fed to classical ML systems that produce rule sets. The rule sets are combined creating a pool of rules and an EA is used to build a classifier from this pool of rules. This hybrid approach has some advantages over undersampling, since it reduces the amount of discarded information, and some advantages over oversampling, since it avoids overfitting. The proposed approach was experimentally analysed and the experimental results show an improvement in the classification performance measured as the area under the receiver operating characteristics (ROC) curve.
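The data preparation step described above (several balanced data sets, each holding all minority cases plus a random sample of majority cases) can be sketched as follows. The rule induction and the evolutionary combination of rules are not shown; the function name `balanced_subsets` and the toy data are assumptions.

```python
import random


def balanced_subsets(majority, minority, n_subsets, seed=0):
    """Build n_subsets balanced data sets. Each one keeps every minority
    case and adds an equally sized random sample (without replacement)
    of majority cases."""
    if len(majority) < len(minority):
        raise ValueError("majority class is smaller than minority class")
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_subsets):
        sampled = rng.sample(majority, len(minority))
        subsets.append(list(minority) + sampled)
    return subsets


# Usage: 100 majority cases, 5 minority cases, 4 balanced data sets.
majority = [("neg", i) for i in range(100)]
minority = [("pos", i) for i in range(5)]
for subset in balanced_subsets(majority, minority, n_subsets=4):
    print(len(subset))
```

Because each subset draws a different majority sample, the pool of subsets together covers more majority cases than a single undersampled set would.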
Abstract:
Being able to accurately predict the risk of falling is crucial in patients with Parkinson’s disease (PD). This is due to the unfavorable effect of falls, which can lower the quality of life as well as directly impact survival. The three methods considered for predicting falls are decision trees (DT), Bayesian networks (BN), and support vector machines (SVM). Data from a 1-year prospective study of 51 people with PD, conducted at IHBI, Australia, are used. Data processing is conducted using the rpart and e1071 packages in R for DT and SVM, respectively, and Bayes Server 5.5 for the BN. The results show that BN and SVM produce consistently higher accuracy over the 12 monthly evaluation time points (average sensitivity and specificity > 92%) than DT (average sensitivity 88%, average specificity 72%). DT is sensitive to imbalanced data and therefore needs an adjustment of the misclassification cost. However, DT provides a straightforward, interpretable result and is thus appealing for helping to identify important items related to falls and to generate fallers’ profiles.
Abstract:
In recent years, the number of road traffic accident victims per million inhabitants in Portugal has been higher than the European Union average. At the national level, a better understanding of accident data and of the effect of the vehicle on accident severity is urgently needed. The main objective of this research was to develop models for predicting accident severity, both for accidents involving a single vehicle and for collisions involving two vehicles. In addition, the research included the development of an integrated analysis to assess vehicle performance in terms of safety, energy efficiency and pollutant emissions. Accident data were collected from the Portuguese Guarda Nacional Republicana for the Porto metropolitan area for the period 2006-2010. A total of 1,374 accidents were collected: 500 single-vehicle accidents and 874 collisions. Logistic regression models were used for the safety analysis. For single-vehicle accidents, the effect of vehicle characteristics on the risk of serious injuries and/or fatalities (response variable defined as binary) was explored. For two-vehicle collisions, two additional binary variables were created: one to predict the probability of serious injuries and/or fatalities in one of the vehicles (designated vehicle V1) and another to predict the probability of serious injuries and/or fatalities in the other vehicle involved (designated vehicle V2). To overcome the challenge and limitations related to the sample size and the imbalance between the cases analysed (only 5.1% of accidents were severe), a methodology based on a resampling strategy was developed, and 10 randomly generated stratified samples were used to validate the models.
During the modelling phase, the effect of vehicle characteristics such as weight, engine displacement, wheelbase and vehicle age was analysed. The CORINAIR methodology was applied to analyse fuel consumption and emissions. The emissions data were subsequently fitted with linear regressions. Finally, an integrated analysis indicator (named “SEG”) was developed, providing a classification method to assess vehicle performance in terms of road safety, fuel consumption and pollutant emissions. According to the results, for single-vehicle accidents the severity risk prediction model identified vehicle age and engine displacement as statistically significant for predicting the occurrence of serious injuries and/or fatalities, at the 5% significance level. The accuracy of the model was 58.0% (standard deviation (SD) 3.1). For two-vehicle collisions, when predicting the probability of serious injuries and/or fatalities in vehicle V1, the engine displacement of the opposing vehicle (vehicle V2) increased the risk for the occupants of vehicle V1, at the 10% significance level. The model for predicting the severity risk in vehicle V1 showed good performance, with an accuracy of 61.2% (SD 2.4). When predicting the probability of serious injuries and/or fatalities in vehicle V2, the engine displacement of vehicle V1 increased the risk for the occupants of vehicle V2, at the 5% significance level. The model for predicting the severity risk in vehicle V2 also showed satisfactory performance, with an accuracy of 40.5% (SD 2.1). The results of the integrated SEG indicator showed that newer vehicles rank better in all three domains: safety, consumption and emissions. This research demonstrates that there is no conflict between safety, energy efficiency and emissions with respect to vehicle performance.
Abstract:
Biplot graphics are widely employed in the study of genotype-environment interactions, but they are only a graphical tool without a statistical hypothesis test. The singular values and scores (singular vectors) used in biplots correspond to specific estimates of the model parameters, and the use of uncertainty measures may lead to conclusions different from those provided by a simple visual evaluation. The aim of this work is to estimate genotype-environment interactions using AMMI analysis through a Bayesian approach, so that credibility intervals can be used for decision-making in different analysis situations. This makes it possible to verify the consistency of the selection and recommendation of cultivars. Two analyses were performed. The first looked into 10 regular commercial hybrids and all 45 possible hybrids obtained from them, assessed in 15 locations. The second evaluated 28 hybrids in 35 different environments, with imbalanced data. The ellipses were grouped according to the pattern of interaction in the biplot. The AMMI analysis with a Bayesian approach proved to be a complete analysis of stability and adaptability, providing important information that may help breeders in their decisions. The regions of credibility built in the biplots allow an accurate selection and a precise genotype recommendation, with a level of credibility. Genotypes and environments can be grouped according to the existing interaction pattern, which makes it possible to formulate specific recommendations. Moreover, the environments can be evaluated in order to find out which ones contribute similarly to the interaction and which should be discarded. The method makes it possible to deal with imbalanced data in a natural way, showing efficiency for multi-environment trials.
The prediction takes into account the instability and the interaction pattern of the observed data, in order to establish a direct comparison between genotypes of both the 1st and 2nd seasons.
Abstract:
In population studies, most current methods focus on identifying one outcome-related SNP at a time by testing for differences of genotype frequencies between disease and healthy groups or among different population groups. However, testing a great number of SNPs simultaneously raises the problem of multiple testing and will give false-positive results. Although this problem can be effectively dealt with through several approaches such as Bonferroni correction, permutation testing and false discovery rates, patterns of the joint effects of several genes, each with a weak effect, might not be detectable. With the availability of high-throughput genotyping technology, searching for multiple scattered SNPs over the whole genome and modeling their joint effect on the target variable has become possible. Exhaustive search of all SNP subsets is computationally infeasible for millions of SNPs in a genome-wide study. Several effective feature selection methods combined with classification functions have been proposed to search for an optimal SNP subset among big data sets where the number of feature SNPs far exceeds the number of observations. In this study, we take two steps to achieve the goal. First we selected 1000 SNPs through an effective filter method, and then we performed a feature selection wrapped around a classifier to identify an optimal SNP subset for predicting disease. We also developed a novel classification method, the sequential information bottleneck method, wrapped inside different search algorithms to identify an optimal subset of SNPs for classifying the outcome variable. This new method was compared with classical linear discriminant analysis in terms of classification performance. Finally, we performed chi-square tests to look at the relationship between each SNP and disease from another point of view.
In general, our results show that filtering features using the harmonic mean of sensitivity and specificity (HMSS) through linear discriminant analysis (LDA) is better than using LDA training accuracy or mutual information in our study. Our results also demonstrate that exhaustive search of a small subset with one SNP, two SNPs or a 3-SNP subset based on the best 100 composite 2-SNPs can find an optimal subset, and that further inclusion of more SNPs through a heuristic algorithm does not always increase the performance of SNP subsets. Although sequential forward floating selection can be applied to prevent the nesting effect of forward selection, it does not always outperform the latter, due to overfitting from observing more complex subset states. Our results also indicate that HMSS, as a criterion to evaluate the classification ability of a function, can be used with imbalanced data without modifying the original dataset, in contrast to classification accuracy. Our four studies suggest that Sequential Information Bottleneck (sIB), a new unsupervised technique, can be adopted to predict the outcome, and that its ability to detect the target status is superior to that of the traditional LDA in the study. From our results we can see that the best test probability-HMSS for predicting CVD, stroke, CAD and psoriasis through sIB is 0.59406, 0.641815, 0.645315 and 0.678658, respectively. In terms of group prediction accuracy, the highest test accuracy of sIB for diagnosing a normal status among controls can reach 0.708999, 0.863216, 0.639918 and 0.850275, respectively, in the four studies if the test accuracy among cases is required to be not less than 0.4. On the other hand, the highest test accuracy of sIB for diagnosing a disease among cases can reach 0.748644, 0.789916, 0.705701 and 0.749436, respectively, in the four studies if the test accuracy among controls is required to be at least 0.4.
A further genome-wide association study through the chi-square test shows that no significant SNPs are detected at the cut-off level 9.09451E-08 in the Framingham heart study of CVD. Study results in WTCCC can only detect two significant SNPs that are associated with CAD. In the genome-wide study of psoriasis, most of the top 20 SNP markers with impressive classification accuracy are also significantly associated with the disease through the chi-square test at the cut-off value 1.11E-07. Although our classification methods can achieve high accuracy in the study, complete descriptions of those classification results (95% confidence intervals or statistical tests of differences) require more cost-effective methods or a more efficient computing system, neither of which can currently be accomplished in our genome-wide study. We should also note that the purpose of this study is to identify subsets of SNPs with high prediction ability, and those SNPs with good discriminant power are not necessarily causal markers for the disease.
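The harmonic mean of sensitivity and specificity (HMSS) used in the abstract above can be computed as in the following minimal sketch, which also shows why it is more informative than accuracy on imbalanced data. The function name `hmss`, the label coding (1 for cases, 0 for controls) and the toy data are assumptions.

```python
def hmss(y_true, y_pred):
    """Harmonic mean of sensitivity (recall on cases, label 1) and
    specificity (recall on controls, label 0)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both cases and controls must be present")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    if sensitivity + specificity == 0:
        return 0.0
    return 2 * sensitivity * specificity / (sensitivity + specificity)


# Toy data: 10 cases and 90 controls; the classifier labels everyone a control.
y_true = [1] * 10 + [0] * 90
y_pred = [0] * 100
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)        # dominated by the control group
print(hmss(y_true, y_pred))  # zero, because no case is detected
```

Because the harmonic mean is pulled towards the smaller of its two arguments, a classifier cannot score well on HMSS by favouring the larger group alone.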
Abstract:
The family of Boosting algorithms represents a type of classification and regression approach that has been shown to be very effective in Computer Vision problems. Such is the case of detection, tracking and recognition of faces, people, deformable objects and actions. The first and most popular algorithm, AdaBoost, was introduced in the context of binary classification. Since then, many works have been proposed to extend it to more general domains: multi-class, multi-label, cost-sensitive, etc. Our interest is centered on extending AdaBoost to two problems in the multi-class field, considering it a first step for upcoming generalizations.
In this dissertation we propose two Boosting algorithms for multi-class classification based on new generalizations of the concept of margin. The first of them, PIBoost, is conceived to tackle the multi-class problem by solving many binary sub-problems. We use a vectorial codification to represent class labels and a multi-class exponential loss function to evaluate classifier responses. This representation produces a set of margin values that provide a range of penalties for failures and rewards for successes. The stagewise optimization of this model introduces an asymmetric Boosting procedure whose costs depend on the number of classes separated by each weak-learner. In this way the Boosting procedure takes into account class imbalances when building the ensemble. The resulting algorithm is a well grounded method that canonically extends the original AdaBoost. The second algorithm proposed, BAdaCost, is conceived for multi-class problems endowed with a cost matrix. Motivated by the few cost-sensitive extensions of AdaBoost to the multi-class field, we propose a new margin that, in turn, yields a new loss function appropriate for evaluating costs. Since BAdaCost generalizes SAMME, Cost-Sensitive AdaBoost and PIBoost algorithms, we consider our algorithm as a canonical extension of AdaBoost to this kind of problems. We additionally suggest a simple procedure to compute cost matrices that improve the performance of Boosting in standard and unbalanced problems. A set of experiments is carried out to demonstrate the effectiveness of both methods against other relevant Boosting algorithms in their respective areas. In the experiments we resort to benchmark data sets used in the Machine Learning community, firstly for minimizing classification errors and secondly for minimizing costs. In addition, we successfully applied BAdaCost to a segmentation task, a particular problem in presence of imbalanced data. 
We conclude the thesis justifying the horizon of future improvements encompassed in our framework, due to its applicability and theoretical flexibility.
Abstract:
The nonlinear filtering of a 10 Gb/s data stream in a dispersion-imbalanced fibre loop mirror has been demonstrated over a wide spectral range of 28 nm. A relative extinction ratio of -30 dB for the cw background has been achieved across the whole spectral range.
Abstract:
The assimilation of measurements from the stratosphere and mesosphere is becoming increasingly common as the lids of weather prediction and climate models rise into the mesosphere and thermosphere. However, the dynamics of the middle atmosphere pose specific challenges to the assimilation of measurements from this region. Forecast-error variances can be very large in the mesosphere, and this can render assimilation schemes very sensitive to the details of the specification of forecast-error correlations. An example is shown where observations in the stratosphere are able to produce increments in the mesosphere. Such sensitivity of the assimilation scheme to misspecification of covariances can also amplify any existing biases in measurements or forecasts. Since both models and measurements of the middle atmosphere are known to have biases, the separation of these sources of bias remains an issue. Finally, well-known deficiencies of assimilation schemes, such as the production of imbalanced states or the assumption of zero bias, are proposed explanations for the inaccurate transport resulting from assimilated winds. The inability of assimilated winds to accurately transport constituents in the middle atmosphere remains a fundamental issue limiting the use of assimilated products for applications involving longer time-scales.
Abstract:
There has been a significant increase in the skill and resolution of numerical weather prediction models (NWPs) in recent decades, extending the time scales of useful weather predictions. The land-surface models (LSMs) of NWPs are often employed in hydrological applications, which raises the question of how hydrologically representative LSMs really are. In this paper, precipitation (P), evaporation (E) and runoff (R) from the European Centre for Medium-Range Weather Forecasts (ECMWF) global models were evaluated against observational products. The forecasts differ substantially from observed data for key hydrological variables. In addition, imbalanced surface water budgets, mostly caused by data assimilation, were found on both global (P-E) and basin scales (P-E-R), with the latter being more important. Modeled surface fluxes should be used with care in hydrological applications and further improvement in LSMs in terms of process descriptions, resolution and estimation of uncertainties is needed to accurately describe the land-surface water budgets.