999 resultados para imbalanced data


Relevância:

60.00% 60.00%

Publicador:

Resumo:

La familia de algoritmos de Boosting son un tipo de técnicas de clasificación y regresión que han demostrado ser muy eficaces en problemas de Visión Computacional. Tal es el caso de los problemas de detección, de seguimiento o bien de reconocimiento de caras, personas, objetos deformables y acciones. El primer y más popular algoritmo de Boosting, AdaBoost, fue concebido para problemas binarios. Desde entonces, muchas han sido las propuestas que han aparecido con objeto de trasladarlo a otros dominios más generales: multiclase, multilabel, con costes, etc. Nuestro interés se centra en extender AdaBoost al terreno de la clasificación multiclase, considerándolo como un primer paso para posteriores ampliaciones. En la presente tesis proponemos dos algoritmos de Boosting para problemas multiclase basados en nuevas derivaciones del concepto margen. El primero de ellos, PIBoost, está concebido para abordar el problema descomponiéndolo en subproblemas binarios. Por un lado, usamos una codificación vectorial para representar etiquetas y, por otro, utilizamos la función de pérdida exponencial multiclase para evaluar las respuestas. Esta codificación produce un conjunto de valores margen que conllevan un rango de penalizaciones en caso de fallo y recompensas en caso de acierto. La optimización iterativa del modelo genera un proceso de Boosting asimétrico cuyos costes dependen del número de etiquetas separadas por cada clasificador débil. De este modo nuestro algoritmo de Boosting tiene en cuenta el desbalanceo debido a las clases a la hora de construir el clasificador. El resultado es un método bien fundamentado que extiende de manera canónica al AdaBoost original. El segundo algoritmo propuesto, BAdaCost, está concebido para problemas multiclase dotados de una matriz de costes. Motivados por los escasos trabajos dedicados a generalizar AdaBoost al terreno multiclase con costes, hemos propuesto un nuevo concepto de margen que, a su vez, permite derivar una función de pérdida adecuada para evaluar costes. Consideramos nuestro algoritmo como la extensión más canónica de AdaBoost para este tipo de problemas, ya que generaliza a los algoritmos SAMME, Cost-Sensitive AdaBoost y PIBoost. Por otro lado, sugerimos un simple procedimiento para calcular matrices de coste adecuadas para mejorar el rendimiento de Boosting a la hora de abordar problemas estándar y problemas con datos desbalanceados. Una serie de experimentos nos sirven para demostrar la efectividad de ambos métodos frente a otros conocidos algoritmos de Boosting multiclase en sus respectivas áreas. En dichos experimentos se usan bases de datos de referencia en el área de Machine Learning, en primer lugar para minimizar errores y en segundo lugar para minimizar costes. Además, hemos podido aplicar BAdaCost con éxito a un proceso de segmentación, un caso particular de problema con datos desbalanceados. Concluimos justificando el horizonte de futuro que encierra el marco de trabajo que presentamos, tanto por su aplicabilidad como por su flexibilidad teórica. Abstract The family of Boosting algorithms represents a type of classification and regression approach that has shown to be very effective in Computer Vision problems. Such is the case of detection, tracking and recognition of faces, people, deformable objects and actions. The first and most popular algorithm, AdaBoost, was introduced in the context of binary classification. Since then, many works have been proposed to extend it to the more general multi-class, multi-label, costsensitive, etc... domains. Our interest is centered in extending AdaBoost to two problems in the multi-class field, considering it a first step for upcoming generalizations. In this dissertation we propose two Boosting algorithms for multi-class classification based on new generalizations of the concept of margin. The first of them, PIBoost, is conceived to tackle the multi-class problem by solving many binary sub-problems. We use a vectorial codification to represent class labels and a multi-class exponential loss function to evaluate classifier responses. This representation produces a set of margin values that provide a range of penalties for failures and rewards for successes. The stagewise optimization of this model introduces an asymmetric Boosting procedure whose costs depend on the number of classes separated by each weak-learner. In this way the Boosting procedure takes into account class imbalances when building the ensemble. The resulting algorithm is a well grounded method that canonically extends the original AdaBoost. The second algorithm proposed, BAdaCost, is conceived for multi-class problems endowed with a cost matrix. Motivated by the few cost-sensitive extensions of AdaBoost to the multi-class field, we propose a new margin that, in turn, yields a new loss function appropriate for evaluating costs. Since BAdaCost generalizes SAMME, Cost-Sensitive AdaBoost and PIBoost algorithms, we consider our algorithm as a canonical extension of AdaBoost to this kind of problems. We additionally suggest a simple procedure to compute cost matrices that improve the performance of Boosting in standard and unbalanced problems. A set of experiments is carried out to demonstrate the effectiveness of both methods against other relevant Boosting algorithms in their respective areas. In the experiments we resort to benchmark data sets used in the Machine Learning community, firstly for minimizing classification errors and secondly for minimizing costs. In addition, we successfully applied BAdaCost to a segmentation task, a particular problem in presence of imbalanced data. We conclude the thesis justifying the horizon of future improvements encompassed in our framework, due to its applicability and theoretical flexibility.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

The severe class distribution shews the presence of underrepresented data, which has great effects on the performance of learning algorithm, is still a challenge of data mining and machine learning. Lots of researches currently focus on experimental comparison of the existing re-sampling approaches. We believe it requires new ways of constructing better algorithms to further balance and analyse the data set. This paper presents a Fuzzy-based Information Decomposition oversampling (FIDoS) algorithm used for handling the imbalanced data. Generally speaking, this is a new way of addressing imbalanced learning problems from missing data perspective. First, we assume that there are missing instances in the minority class that result in the imbalanced dataset. Then the proposed algorithm which takes advantages of fuzzy membership function is used to transfer information to the missing minority class instances. Finally, the experimental results demonstrate that the proposed algorithm is more practical and applicable compared to sampling techniques.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Being an important source for real-time information dissemination in recent years, Twitter is inevitably a prime target of spammers. It has been showed that the damage caused by Twitter spam can reach far beyond the social media platform itself. To mitigate the threat, a lot of recent studies use machine learning techniques to classify Twitter spam and report very satisfactory results. However, most of the studies overlook a fundamental issue that is widely seen in real-world Twitter data, i.e., the class imbalance problem. In this paper, we show that the unequal distribution between spam and non-spam classes in the data has a great impact on spam detection rate. To address the problem, we propose an ensemble learning approach, which involves three steps. In the first step, we adjust the class distribution in the imbalanced data set using various strategies, including random oversampling, random undersampling and fuzzy-based oversampling. In the next step, a classification model is built upon each of the redistributed data sets. In the final step, a majority voting scheme is introduced to combine all the classification models. Experimental results obtained using real-world Twitter data indicate that the proposed approach can significantly improve the spam detection rate in data sets with imbalanced class distribution.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Data in many biological problems are often compounded by imbalanced class distribution. That is, the positive examples may largely outnumbered by the negative examples. Many classification algorithms such as support vector machine (SVM) are sensitive to data with imbalanced class distribution, and result in a suboptimal classification. It is desirable to compensate the imbalance effect in model training for more accurate classification. In this study, we propose a sample subset optimization technique for classifying biological data with moderate and extremely high imbalanced class distributions. By using this optimization technique with an ensemble of SVMs, we build multiple roughly balanced SVM base classifiers, each trained on an optimized sample subset. The experimental results demonstrate that the ensemble of SVMs created by our sample subset optimization technique can achieve higher area under the ROC curve (AUC) value than popular sampling approaches such as random over-/under-sampling; SMOTE sampling, and those in widely used ensemble approaches such as bagging and boosting.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The nonlinear filtering of a 10Gb/s data stream in a dispersion-imbalanced fibre loop mirror has been demonstrated over a wide spectral range of 28nm. A relative extinction ratio of - 30 dB for the cw background has been achieved across the whole spectral range.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The assimilation of measurements from the stratosphere and mesosphere is becoming increasingly common as the lids of weather prediction and climate models rise into the mesosphere and thermosphere. However, the dynamics of the middle atmosphere pose specific challenges to the assimilation of measurements from this region. Forecast-error variances can be very large in the mesosphere and this can render assimilation schemes very sensitive to the details of the specification of forecast error correlations. An example is shown where observations in the stratosphere are able to produce increments in the mesosphere. Such sensitivity of the assimilation scheme to misspecification of covariances can also amplify any existing biases in measurements or forecasts. Since both models and measurements of the middle atmosphere are known to have biases, the separation of these sources of bias remains a issue. Finally, well-known deficiencies of assimilation schemes, such as the production of imbalanced states or the assumption of zero bias, are proposed explanations for the inaccurate transport resulting from assimilated winds. The inability of assimilated winds to accurately transport constituents in the middle atmosphere remains a fundamental issue limiting the use of assimilated products for applications involving longer time-scales.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

There has been a significant increase in the skill and resolution of numerical weather prediction models (NWPs) in recent decades, extending the time scales of useful weather predictions. The land-surface models (LSMs) of NWPs are often employed in hydrological applications, which raises the question of how hydrologically representative LSMs really are. In this paper, precipitation (P), evaporation (E) and runoff (R) from the European Centre for Medium-Range Weather Forecasts (ECMWF) global models were evaluated against observational products. The forecasts differ substantially from observed data for key hydrological variables. In addition, imbalanced surface water budgets, mostly caused by data assimilation, were found on both global (P-E) and basin scales (P-E-R), with the latter being more important. Modeled surface fluxes should be used with care in hydrological applications and further improvement in LSMs in terms of process descriptions, resolution and estimation of uncertainties is needed to accurately describe the land-surface water budgets.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Background
Medical and biological data are commonly with small sample size, missing values, and most importantly, imbalanced class distribution. In this study we propose a particle swarm based hybrid system for remedying the class imbalance problem in medical and biological data mining. This hybrid system combines the particle swarm optimization (PSO) algorithm with multiple classifiers and evaluation metrics for evaluation fusion. Samples from the majority class are ranked using multiple objectives according to their merit in compensating the class imbalance, and then combined with the minority class to form a balanced dataset.

Results
One important finding of this study is that different classifiers and metrics often provide different evaluation results. Nevertheless, the proposed hybrid system demonstrates consistent improvements over several alternative methods with three different metrics. The sampling results also demonstrate good generalization on different types of classification algorithms, indicating the advantage of information fusion applied in the hybrid system.

Conclusion
The experimental results demonstrate that unlike many currently available methods which often perform unevenly with different datasets the proposed hybrid system has a better generalization property which alleviates the method-data dependency problem. From the biological perspective, the system provides indication for further investigation of the highly ranked samples, which may result in the discovery of new conditions or disease subtypes.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The pathogenic mechanisms of thromboangiitis obliterans (TAO) are not entirely known and the imbalance of matrix metalloproteinases (MMPs) plays a role in vascular diseases. We evaluated the MMP-2 and MMP-9 circulating levels and their endogenous tissue inhibitors of metalloproteinases (TIMP-1 and TIMP-2) in TAO patients with clinical manifestations. The study included 20 TAO patients (n = 10 female, n = 10 male) aged 38-59 years under clinical follow-up. The patients were classified into two groups: (1) TAO former smokers (n = 11) and (2) TAO active smokers (n = 9); the control group included normal volunteer non-smokers (n = 10) and active smokers without peripheral artery disease (n = 10). Patient plasma samples were used to analyze MMP-2 and MMP-9 levels using zymography, and TIMP-1 and TIMP-2 concentrations were determined by enzyme-linked immunosorbent assays. The analysis of MMP-2/TIMP-2 and MMP-9/TIMP-1 ratios (which were used as indices of net MMP-2 and MMP-9 activity, respectively) showed significantly higher MMP-9/TIMP-1 ratios in TAO patients (p < 0.05). We found no significant differences in MMP-2/TIMP-2 ratios (p > 0.05). We found higher MMP-9 levels and decreased levels of TIMP-1 in the TAO groups (active smokers and former smokers), especially in active smokers compared with the other groups (all p < 0.05). MMP-2 and TIMP-2 were not significantly different in patients with TAO as compared to the control group (p > 0.05). In conclusion, our results showed increased MMP-9 and reduced TIMP-1 activity in TAO patients, especially in active smokers compared with non-TAO patients. These data suggest that smoke compounds could activate MMP-9 production or inhibit TIMP-1 activity.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

By analyzing a comprehensive dataset on transport transactions in Japan, we describe a directional imbalance in freight rates by transport mode and examine its potential sources, such as economies of density and directionally imbalanced transport flow. There are certain numbers of observed links which show asymmetric transport costs. Instrumental variable analysis is used to show that economies of density account for deviation from symmetric freight rates between prefectures. Our results show that a 10% increase in outbound transport flow relative to inbound transport flow leads to a 2.1% decrease in outbound freight rate relative to inbound freight rate.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Thanks to the advanced technologies and social networks that allow the data to be widely shared among the Internet, there is an explosion of pervasive multimedia data, generating high demands of multimedia services and applications in various areas for people to easily access and manage multimedia data. Towards such demands, multimedia big data analysis has become an emerging hot topic in both industry and academia, which ranges from basic infrastructure, management, search, and mining to security, privacy, and applications. Within the scope of this dissertation, a multimedia big data analysis framework is proposed for semantic information management and retrieval with a focus on rare event detection in videos. The proposed framework is able to explore hidden semantic feature groups in multimedia data and incorporate temporal semantics, especially for video event detection. First, a hierarchical semantic data representation is presented to alleviate the semantic gap issue, and the Hidden Coherent Feature Group (HCFG) analysis method is proposed to capture the correlation between features and separate the original feature set into semantic groups, seamlessly integrating multimedia data in multiple modalities. Next, an Importance Factor based Temporal Multiple Correspondence Analysis (i.e., IF-TMCA) approach is presented for effective event detection. Specifically, the HCFG algorithm is integrated with the Hierarchical Information Gain Analysis (HIGA) method to generate the Importance Factor (IF) for producing the initial detection results. Then, the TMCA algorithm is proposed to efficiently incorporate temporal semantics for re-ranking and improving the final performance. At last, a sampling-based ensemble learning mechanism is applied to further accommodate the imbalanced datasets. In addition to the multimedia semantic representation and class imbalance problems, lack of organization is another critical issue for multimedia big data analysis. In this framework, an affinity propagation-based summarization method is also proposed to transform the unorganized data into a better structure with clean and well-organized information. The whole framework has been thoroughly evaluated across multiple domains, such as soccer goal event detection and disaster information management.

Relevância:

20.00% 20.00%

Publicador: