Biblioteca Digital

145 resultados para Machine learning methods

Enhanced Twitter sentiment analysis by using feature selection and combination

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Tweet sentiment analysis is an important research topic. An accurate and timely analysis report could give good indications on the general public's opinions. After reviewing the current research, we identify the need of effective and efficient methods to conduct tweet sentiment analysis. This paper aims to achieve a high level of performance for classifying tweets with sentiment information. We propose a feasible solution which improves the level of accuracy with good time efficiency. Specifically, we develop a novel feature combination scheme which utilizes the sentiment lexicons and the extracted tweet unigrams of high information gain. We evaluate the performance of six popular machine learning classifiers among which the Naive Bayes Multinomial (NBM) classifier achieves the accuracy rate of 84.60% and takes a few minutes to complete classifying thousands of tweets.

Veja mais

Robust traffic classification with mislabelled training samples

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Traffic classification plays the significant role in the network security and management. However, accurate classification is challenging if the training data is contaminated with unclean traffic. Recent researches often assume clean training data, and hence performance reduced on real-time network traffic. To meet this challenge, in this paper, we propose a robust method, Unclean Traffic Classification (UTC), which incorporates noise elimination and suspected noise reweighting. Firstly, UTC eliminates strong noisy training data identified by a consensus filtering with multiple classifiers. Furthermore, UTC estimates the relevance of remaining training data and learns a robust traffic classifier. Through a number of experiments on a real-world traffic dataset, we show that the new method outperforms existing state-of-the-art traffic classification methods, under the extremely difficult circumstance with unclean training data.

Veja mais

Multi-layer attribute selection and classification algorithm for the diagnosis of cardiac autonomic neuropathy based on HRV attributes

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Cardiac autonomic neuropathy (CAN) poses an important clinical problem, which often remains undetected due difficulty of conducting the current tests and their lack of sensitivity. CAN has been associated with growth in the risk of unexpected death in cardiac patients with diabetes mellitus. Heart rate variability (HRV) attributes have been actively investigated, since they are important for diagnostics in diabetes, Parkinson's disease, cardiac and renal disease. Due to the adverse effects of CAN it is important to obtain a robust and highly accurate diagnostic tool for identification of early CAN, when treatment has the best outcome. Use of HRV attributes to enhance the effectiveness of diagnosis of CAN progression may provide such a tool. In the present paper we propose a new machine learning algorithm, the Multi-Layer Attribute Selection and Classification (MLASC), for the diagnosis of CAN progression based on HRV attributes. It incorporates our new automated attribute selection procedure, Double Wrapper Subset Evaluator with Particle Swarm Optimization (DWSE-PSO). We present the results of experiments, which compare MLASC with other simpler versions and counterpart methods. The experiments used our large and well-known diabetes complications database. The results of experiments demonstrate that MLASC has significantly outperformed other simpler techniques.

Veja mais

Traffic identification in big internet data

Relevância:

90.00% 90.00%

Publicador:

Resumo:

The era of big data brings new challenges to the network traffic technique that is an essential tool for network management and security. To deal with the problems of dynamic ports and encrypted payload in traditional port-based and payload-basedmethods, the state-of-the-art method employs flow statistical features and machine learning techniques to identify network traffic. This chapter reviews the statistical-feature based traffic classification methods, that have been proposed in the last decade. We also examine a new problem: unclean traffic in the training stage of machine learning due to the labeling mistake and complex composition of big Internet data. This chapter further evaluates the performance of typical machine learning algorithms with unclean training data. The review and the empirical study can provide a guide for academia and practitioners in choosing proper traffic classification methods in real-world scenarios.

Veja mais

Fitting aggregation functions to data: part I-linearization and regularization

Relevância:

90.00% 90.00%

Publicador:

Resumo:

The use of supervised learning techniques for fitting weights and/or generator functions of weighted quasi-arithmetic means – a special class of idempotent and nondecreasing aggregation functions – to empirical data has already been considered in a number of papers. Nevertheless, there are still some important issues that have not been discussed in the literature yet. In the first part of this two-part contribution we deal with the concept of regularization, a quite standard technique from machine learning applied so as to increase the fit quality on test and validation data samples. Due to the constraints on the weighting vector, it turns out that quite different methods can be used in the current framework, as compared to regression models. Moreover, it is worth noting that so far fitting weighted quasi-arithmetic means to empirical data has only been performed approximately, via the so-called linearization technique. In this paper we consider exact solutions to such special optimization tasks and indicate cases where linearization leads to much worse solutions.

Veja mais

An ensemble learning approach for addressing the class imbalance problem in twitter spam detection

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Being an important source for real-time information dissemination in recent years, Twitter is inevitably a prime target of spammers. It has been showed that the damage caused by Twitter spam can reach far beyond the social media platform itself. To mitigate the threat, a lot of recent studies use machine learning techniques to classify Twitter spam and report very satisfactory results. However, most of the studies overlook a fundamental issue that is widely seen in real-world Twitter data, i.e., the class imbalance problem. In this paper, we show that the unequal distribution between spam and non-spam classes in the data has a great impact on spam detection rate. To address the problem, we propose an ensemble learning approach, which involves three steps. In the first step, we adjust the class distribution in the imbalanced data set using various strategies, including random oversampling, random undersampling and fuzzy-based oversampling. In the next step, a classification model is built upon each of the redistributed data sets. In the final step, a majority voting scheme is introduced to combine all the classification models. Experimental results obtained using real-world Twitter data indicate that the proposed approach can significantly improve the spam detection rate in data sets with imbalanced class distribution.

Veja mais

Statistical detection of online drifting twitter spam

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Spam has become a critical problem in online social networks. This paper focuses on Twitter spam detection. Recent research works focus on applying machine learning techniques for Twitter spam detection, which make use of the statistical features of tweets. We observe existing machine learning based detection methods suffer from the problem of Twitter spam drift, i.e., the statistical properties of spam tweets vary over time. To avoid this problem, an effective solution is to train one twitter spam classifier every day. However, it faces a challenge of the small number of imbalanced training data because labelling spam samples is time-consuming. This paper proposes a new method to address this challenge. The new method employs two new techniques, fuzzy-based redistribution and asymmetric sampling. We develop a fuzzy-based information decomposition technique to re-distribute the spam class and generate more spam samples. Moreover, an asymmetric sampling technique is proposed to re-balance the sizes of spam samples and non-spam samples in the training data. Finally, we apply the ensemble technique to combine the spam classifiers over two different training sets. A number of experiments are performed on a real-world 10-day ground-truth dataset to evaluate the new method. Experiments results show that the new method can significantly improve the detection performance for drifting Twitter spam.

Veja mais

Unsupervised parameter estimation for one-class support vector machines

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Although the hyper-plane based One-Class Support Vector Machine (OCSVM) and the hyper-spherical based Support Vector Data Description (SVDD) algorithms have been shown to be very effective in detecting outliers, their performance on noisy and unlabeled training data has not been widely studied. Moreover, only a few heuristic approaches have been proposed to set the different parameters of these methods in an unsupervised manner. In this paper, we propose two unsupervised methods for estimating the optimal parameter settings to train OCSVM and SVDD models, based on analysing the structure of the data. We show that our heuristic is substantially faster than existing parameter estimation approaches while its accuracy is comparable with supervised parameter learning methods, such as grid-search with crossvalidation on labeled data. In addition, our proposed approaches can be used to prepare a labeled data set for a OCSVM or a SVDD from unlabeled data.

Veja mais

Boosting imbalanced data learning with Wiener process oversampling

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Learning from imbalanced data is a challenging task in a wide range of applications, which attracts significant research efforts from machine learning and data mining community. As a natural approach to this issue, oversampling balances the training samples through replicating existing samples or synthesizing new samples. In general, synthesization outperforms replication by supplying additional information on the minority class. However, the additional information needs to follow the same normal distribution of the training set, which further constrains the new samples within the predefined range of training set. In this paper, we present the Wiener process oversampling (WPO) technique that brings the physics phenomena into sample synthesization. WPO constructs a robust decision region by expanding the attribute ranges in training set while keeping the same normal distribution. The satisfactory performance of WPO can be achieved with much lower computing complexity. In addition, by integrating WPO with ensemble learning, the WPOBoost algorithm outperformsmany prevalent imbalance learning solutions.

Veja mais

Adopting an active learning approach to teaching in a research-intensive higher education context transformed staff teaching attitudes and behaviours

Relevância:

90.00% 90.00%

Publicador:

Resumo:

The conventional lecture has significant limitations in the higher education context, often leading to a passive learning experience for students. This paper reports a process of transforming teaching and learning with active learning strategies in a research-intensive educational context across a faculty of 45 academic staff and more than 1000 students. A phased approach was used, involving nine staff in a pilot phase during which a common vision and principles were developed. In short, our approach was to mandate a move away from didactic lectures to classes that involved students interacting with content, with each other and with instructors in order to attain domain-specific learning outcomes and generic skills. After refinement, an implementation phase commenced within all first-year subjects, involving 12 staff including three from the pilot group. The staff use of active learning methods in classes increased by sixfold and sevenfold in the pilot and implementation phases, respectively. An analysis of implementation phase exam questions indicated that staff increased their use of questions addressing higher order cognitive skills by 51%. Results of a staff survey indicated that this change in practice was caused by the involvement of staff in the active learning approach. Fifty-six percent of staff respondents indicated that they had maintained constructive alignment as they introduced active learning. After the pilot, only three out of nine staff agreed that they understood what makes for an effective active learning exercise. This rose to seven out of nine staff at the completion of the implementation phase. The development of a common approach with explicit vision and principles and the evaluation and refinement of active learning were effective elements of our transformational change management strategy. Future efforts will focus on ensuring that all staff have the time, skills and pedagogical understanding required to embed constructively aligned active learning within the approach.

Veja mais

145 resultados para Machine learning methods

Filtro por publicador