999 resultados para Concept Drift


Relevância:

100.00% 100.00%

Publicador:

Resumo:

An important application of Big Data Analytics is the real-time analysis of streaming data. Streaming data imposes unique challenges to data mining algorithms, such as concept drifts, the need to analyse the data on the fly due to unbounded data streams and scalable algorithms due to potentially high throughput of data. Real-time classification algorithms that are adaptive to concept drifts and fast exist, however, most approaches are not naturally parallel and are thus limited in their scalability. This paper presents work on the Micro-Cluster Nearest Neighbour (MC-NN) classifier. MC-NN is based on an adaptive statistical data summary based on Micro-Clusters. MC-NN is very fast and adaptive to concept drift whilst maintaining the parallel properties of the base KNN classifier. Also MC-NN is competitive compared with existing data stream classifiers in terms of accuracy and speed.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

In this paper we present a multiple window incremental learning algorithm that distinguishes between virtual concept drift and real concept drift. The algorithm is unsupervised and uses a novel approach to tracking concept drift that involves the use of competing windows to interpret the data. Unlike previous methods which use a single window to determine the drift in the data, our algorithm uses three windows of different sizes to estimate the change in the data. The advantage of this approach is that it allows the system to progressively adapt and predict the change thus enabling it to deal more effectively with different types of drift. We give a detailed description of the algorithm and present the results obtained from its application to two real world problems: background image processing and sound recognition. We also compare its performance with FLORA, an existing concept drift tracking algorithm.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

In this paper we describe a supervised learning algorithm that uses selective memory to track concept drift. Unlike previous methods to track concept drift that use window heuristics to adapt to changes, we present an improved approach that discriminates between the instances observed. The advantage of this method is that it allows the system to both adapt to and track drift more accurately as well as filter the noise in the data more effectively. We present the algorithm and compare its performance with FLORA a well known concept drift tracking algorithm.

Relevância:

100.00% 100.00%

Publicador:

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Concept drift is a problem of increasing importance in machine learning and data mining. Data sets under analysis are no longer only static databases, but also data streams in which concepts and data distributions may not be stable over time. However, most learning algorithms produced so far are based on the assumption that data comes from a fixed distribution, so they are not suitable to handle concept drifts. Moreover, some concept drifts applications requires fast response, which means an algorithm must always be (re) trained with the latest available data. But the process of labeling data is usually expensive and/or time consuming when compared to unlabeled data acquisition, thus only a small fraction of the incoming data may be effectively labeled. Semi-supervised learning methods may help in this scenario, as they use both labeled and unlabeled data in the training process. However, most of them are also based on the assumption that the data is static. Therefore, semi-supervised learning with concept drifts is still an open challenge in machine learning. Recently, a particle competition and cooperation approach was used to realize graph-based semi-supervised learning from static data. In this paper, we extend that approach to handle data streams and concept drift. The result is a passive algorithm using a single classifier, which naturally adapts to concept changes, without any explicit drift detection mechanism. Its built-in mechanisms provide a natural way of learning from new data, gradually forgetting older knowledge as older labeled data items became less influent on the classification of newer data items. Some computer simulation are presented, showing the effectiveness of the proposed method.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Concept drift, which refers to non stationary learning problems over time, has increasing importance in machine learning and data mining. Many concept drift applications require fast response, which means an algorithm must always be (re)trained with the latest available data. But the process of data labeling is usually expensive and/or time consuming when compared to acquisition of unlabeled data, thus usually only a small fraction of the incoming data may be effectively labeled. Semi-supervised learning methods may help in this scenario, as they use both labeled and unlabeled data in the training process. However, most of them are based on assumptions that the data is static. Therefore, semi-supervised learning with concept drifts is still an open challenging task in machine learning. Recently, a particle competition and cooperation approach has been developed to realize graph-based semi-supervised learning from static data. We have extend that approach to handle data streams and concept drift. The result is a passive algorithm which uses a single classifier approach, naturally adapted to concept changes without any explicit drift detection mechanism. It has built-in mechanisms that provide a natural way of learning from new data, gradually "forgetting" older knowledge as older data items are no longer useful for the classification of newer data items. The proposed algorithm is applied to the KDD Cup 1999 Data of network intrusion, showing its effectiveness.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Algorithms for concept drift handling are important for various applications including video analysis and smart grids. In this paper we present decision tree ensemble classication method based on the Random Forest algorithm for concept drift. The weighted majority voting ensemble aggregation rule is employed based on the ideas of Accuracy Weighted Ensemble (AWE) method. Base learner weight in our case is computed for each sample evaluation using base learners accuracy and intrinsic proximity measure of Random Forest. Our algorithm exploits both temporal weighting of samples and ensemble pruning as a forgetting strategy. We present results of empirical comparison of our method with îriginal random forest with incorporated replace-the-looser forgetting andother state-of-the-art concept-drift classiers like AWE2.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Business processes are prone to continuous and unexpected changes. Process workers may start executing a process differently in order to adjust to changes in workload, season, guidelines or regulations for example. Early detection of business process changes based on their event logs – also known as business process drift detection – enables analysts to identify and act upon changes that may otherwise affect process performance. Previous methods for business process drift detection are based on an exploration of a potentially large feature space and in some cases they require users to manually identify the specific features that characterize the drift. Depending on the explored feature set, these methods may miss certain types of changes. This paper proposes a fully automated and statistically grounded method for detecting process drift. The core idea is to perform statistical tests over the distributions of runs observed in two consecutive time windows. By adaptively sizing the window, the method strikes a trade-off between classification accuracy and drift detection delay. A validation on synthetic and real-life logs shows that the method accurately detects typical change patterns and scales up to the extent it is applicable for online drift detection.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Existing business process drift detection methods do not work with event streams. As such, they are designed to detect inter-trace drifts only, i.e. drifts that occur between complete process executions (traces), as recorded in event logs. However, process drift may also occur during the execution of a process, and may impact ongoing executions. Existing methods either do not detect such intra-trace drifts, or detect them with a long delay. Moreover, they do not perform well with unpredictable processes, i.e. processes whose logs exhibit a high number of distinct executions to the total number of executions. We address these two issues by proposing a fully automated and scalable method for online detection of process drift from event streams. We perform statistical tests over distributions of behavioral relations between events, as observed in two adjacent windows of adaptive size, sliding along with the stream. An extensive evaluation on synthetic and real-life logs shows that our method is fast and accurate in the detection of typical change patterns, and performs significantly better than the state of the art.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Many organizations realize that increasing amounts of data (“Big Data”) need to be dealt with intelligently in order to compete with other organizations in terms of efficiency, speed and services. The goal is not to collect as much data as possible, but to turn event data into valuable insights that can be used to improve business processes. However, data-oriented analysis approaches fail to relate event data to process models. At the same time, large organizations are generating piles of process models that are disconnected from the real processes and information systems. In this chapter we propose to manage large collections of process models and event data in an integrated manner. Observed and modeled behavior need to be continuously compared and aligned. This results in a “liquid” business process model collection, i.e. a collection of process models that is in sync with the actual organizational behavior. The collection should self-adapt to evolving organizational behavior and incorporate relevant execution data (e.g. process performance and resource utilization) extracted from the logs, thereby allowing insightful reports to be produced from factual organizational data.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

With the emergence of large-volume and high-speed streaming data, the recent techniques for stream mining of CFIpsilas (closed frequent itemsets) will become inefficient. When concept drift occurs at a slow rate in high speed data streams, the rate of change of information across different sliding windows will be negligible. So, the user wonpsilat be devoid of change in information if we slide window by multiple transactions at a time. Therefore, we propose a novel approach for mining CFIpsilas cumulatively by making sliding width(ges1) over high speed data streams. However, it is nontrivial to mine CFIpsilas cumulatively over stream, because such growth may lead to the generation of exponential number of candidates for closure checking. In this study, we develop an efficient algorithm, stream-close, for mining CFIpsilas over stream by exploring some interesting properties. Our performance study reveals that stream-close achieves good scalability and has promising results.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Owing to continuous advances in the computational power of handheld devices like smartphones and tablet computers, it has become possible to perform Big Data operations including modern data mining processes onboard these small devices. A decade of research has proved the feasibility of what has been termed as Mobile Data Mining, with a focus on one mobile device running data mining processes. However, it is not before 2010 until the authors of this book initiated the Pocket Data Mining (PDM) project exploiting the seamless communication among handheld devices performing data analysis tasks that were infeasible until recently. PDM is the process of collaboratively extracting knowledge from distributed data streams in a mobile computing environment. This book provides the reader with an in-depth treatment on this emerging area of research. Details of techniques used and thorough experimental studies are given. More importantly and exclusive to this book, the authors provide detailed practical guide on the deployment of PDM in the mobile environment. An important extension to the basic implementation of PDM dealing with concept drift is also reported. In the era of Big Data, potential applications of paramount importance offered by PDM in a variety of domains including security, business and telemedicine are discussed.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Advances in hardware and software technologies allow to capture streaming data. The area of Data Stream Mining (DSM) is concerned with the analysis of these vast amounts of data as it is generated in real-time. Data stream classification is one of the most important DSM techniques allowing to classify previously unseen data instances. Different to traditional classifiers for static data, data stream classifiers need to adapt to concept changes (concept drift) in the stream in real-time in order to reflect the most recent concept in the data as accurately as possible. A recent addition to the data stream classifier toolbox is eRules which induces and updates a set of expressive rules that can easily be interpreted by humans. However, like most rule-based data stream classifiers, eRules exhibits a poor computational performance when confronted with continuous attributes. In this work, we propose an approach to deal with continuous data effectively and accurately in rule-based classifiers by using the Gaussian distribution as heuristic for building rule terms on continuous attributes. We show on the example of eRules that incorporating our method for continuous attributes indeed speeds up the real-time rule induction process while maintaining a similar level of accuracy compared with the original eRules classifier. We termed this new version of eRules with our approach G-eRules.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

At first blush, user modeling appears to be a prime candidate for straightforward application of standard machine learning techniques. Observations of the user's behavior can provide training examples that a machine learning system can use to form a model designed to predict future actions. However, user modeling poses a number of challenges for machine learning that have hindered its application in user modeling, including: the need for large data sets; the need for labeled data; concept drift; and computational complexity. This paper examines each of these issues and reviews approaches to resolving them.