28 resultados para Frequent itemset

em Deakin Research Online - Australia


Relevância:

100.00% 100.00%

Publicador:

Resumo:

To efficiently and yet accurately cluster Web documents is of great interests to Web users and is a key component of the searching accuracy of a Web search engine. To achieve this, this paper introduces a new approach for the clustering of Web documents, which is called maximal frequent itemset (MFI) approach. Iterative clustering algorithms, such as K-means and expectation-maximization (EM), are sensitive to their initial conditions. MFI approach firstly locates the center points of high density clusters precisely. These center points then are used as initial points for the K-means algorithm. Our experimental results tested on 3 Web document sets show that our MFI approach outperforms the other methods we compared in most cases, particularly in the case of large number of categories in Web document sets.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Data perturbation is a popular method to achieve privacy-preserving data mining. However, distorted databases bring enormous overheads to mining algorithms as compared to original databases. In this paper, we present the GrC-FIM algorithm to address the efficiency problem in mining frequent itemsets from distorted databases. Two measures are introduced to overcome the weakness in existing work: firstly, the concept of independent granule is introduced, and granule inference is used to distinguish between non-independent itemsets and independent itemsets. We further prove that the support counts of non-independent itemsets can be directly derived from subitemsets, so that the error-prone reconstruction process can be avoided. This could improve the efficiency of the algorithm, and bring more accurate results; secondly, through the granular-bitmap representation, the support counts can be calculated in an efficient way. The empirical results on representative synthetic and real-world databases indicate that the proposed GrC-FIM algorithm outperforms the popular EMASK algorithm in both the efficiency and the support count reconstruction accuracy.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

The Apriori algorithm’s frequent itemset approach has become the standard approach to discovering association rules. However, the computation requirements of the frequent itemset approach are infeasible for dense data and the approach is unable to discover infrequent associations. OPUS AR is an efficient algorithm for association rule discovery that does not utilize frequent itemsets and hence avoids these problems. It can reduce search time by using additional constraints on the search space as well as constraints on itemset frequency. However, the effectiveness of the pruning rules used during search will determine the efficiency of its search. This paper presents and analyses pruning rules for use with OPUS AR. We demonstrate that application of OPUS AR is feasible for a number of datasets for which application of the frequent itemset approach is infeasible and that the new pruning rules can reduce compute time by more than 40%.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

In today’s high speed networks it is becoming increasingly challenging for network managers to understand the nature of the traffic that is carried in their network. A major problem for traffic analysis in this context is how to extract a concise yet accurate summary of the relevant aggregate traffic flows that are present in network traces. In this paper, we present two summarization techniques to minimize the size of the traffic flow report that is generated by a hierarchical cluster analysis tool. By analyzing the accuracy and compaction gain of our approach on a standard benchmark dataset, we demonstrate that our approach achieves more accurate summaries than those of an existing tool that is based on frequent itemset mining.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Background
AMP-activated protein kinase (AMPK) has emerged as a significant signaling intermediary that regulates metabolisms in response to energy demand and supply. An investigation into the degree of activation and deactivation of AMPK subunits under exercise can provide valuable data for understanding AMPK. In particular, the effect of AMPK on muscle cellular energy status makes this protein a promising pharmacological target for disease treatment. As more AMPK regulation data are accumulated, data mining techniques can play an important role in identifying frequent patterns in the data. Association rule mining, which is commonly used in market basket analysis, can be applied to AMPK regulation.

Results
This paper proposes a framework that can identify the potential correlation, either between the state of isoforms of α, β and γ subunits of AMPK, or between stimulus factors and the state of isoforms. Our approach is to apply item constraints in the closed interpretation to the itemset generation so that a threshold is specified in terms of the amount of results, rather than a fixed threshold value for all itemsets of all sizes. The derived rules from experiments are roughly analyzed. It is found that most of the extracted association rules have biological meaning and some of them were previously unknown. They indicate direction for further research.

Conclusion
Our findings indicate that AMPK has a great impact on most metabolic actions that are related to energy demand and supply. Those actions are adjusted via its subunit isoforms under specific physical training. Thus, there are strong co-relationships between AMPK subunit isoforms and exercises. Furthermore, the subunit isoforms are correlated with each other in some cases. The methods developed here could be used when predicting these essential relationships and enable an understanding of the functions and metabolic pathways regarding AMPK.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Protein kinases, a family of enzymes, have been viewed as an important signaling intermediary by living organisms for regulating critical biological processes such as memory, hormone response and cell growth. The
unbalanced kinases are known to cause cancer and other diseases. With the increasing efforts to collect, store and disseminate information about the entire kinase family, it not only leads to valuable data set to understand cell regulation but also poses a big challenge to extract valuable knowledge about metabolic pathway from the data. Data mining techniques that have been widely used to find frequent patterns in large datasets can be extended and adapted to kinase data as well. This paper proposes a framework for mining frequent itemsets from the collected kinase dataset. An experiment using AMPK regulation data demonstrates that our approaches are useful and efficient in analyzing kinase regulation data.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Data mining refers to extracting or "mining" knowledge from large amounts of data. It is also called a method of "knowledge presentation" where visualization and knowledge representation techniques are used to present the mined knowledge to the user. Efficient algorithms to mine frequent patterns are crucial to many tasks in data mining. Since the Apriori algorithm was proposed in 1994, there have been several methods proposed to improve its performance. However, most still adopt its candidate set generation-and-test approach. In addition, many methods do not generate all frequent patterns, making them inadequate to derive association rules. The Pattern Decomposition (PD) algorithm that can significantly reduce the size of the dataset on each pass makes it more efficient to mine all frequent patterns in a large dataset. This algorithm avoids the costly process of candidate set generation and saves a large amount of counting time to evaluate support with reduced datasets. In this paper, some existing frequent pattern generation algorithms are explored and their comparisons are discussed. The results show that the PD algorithm outperforms an improved version of Apriori named Direct Count of candidates & Prune transactions (DCP) by one order of magnitude and is faster than an improved FP-tree named as Predictive Item Pruning (PIP). Further, PD is also more scalable than both DCP and PIP.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Data streams are usually generated in an online fashion characterized by huge volume, rapid unpredictable rates, and fast changing data characteristics. It has been hence recognized that mining over streaming data requires the problem of limited computational resources to be adequately addressed. Since the arrival rate of data streams can significantly increase and exceed the CPU capacity, the machinery must adapt to this change to guarantee the timeliness of the results. We present an online algorithm to approximate a set of frequent patterns from a sliding window over the underlying data stream - given apriori CPU capacity. The algorithm automatically detects overload situations and can adaptively shed unprocessed data to guarantee the timely results. We theoretically prove, using probabilistic and deterministic techniques, that the error on the output results is bounded within a pre-specified threshold. The empirical results on various datasets also confirmed the feasiblity of our proposal.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

In data stream applications, a good approximation obtained in a timely  manner is often better than the exact answer that’s delayed beyond the window of opportunity. Of course, the quality of the approximate is as important as its timely delivery. Unfortunately, algorithms capable of online processing do not conform strictly to a precise error guarantee. Since online processing is essential and so is the precision of the error, it is necessary that stream algorithms meet both criteria. Yet, this is not the case for mining frequent sets in data streams. We present EStream, a novel algorithm that allows online processing while producing results strictly within the error bound. Our theoretical and experimental results show that EStream is a better candidate for finding frequent sets in data streams, when both constraints need to be satisfied.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Most algorithms that focus on discovering frequent patterns from data streams assumed that the machinery is capable of managing all the incoming transactions without any delay; or without the need to drop transactions. However, this assumption is often impractical due to the inherent characteristics of data stream environments. Especially under high load conditions, there is often a shortage of system resources to process the incoming transactions. This causes unwanted latencies that in turn, affects the applicability of the data mining models produced – which often has a small window of opportunity. We propose a load shedding algorithm to address this issue. The algorithm adaptively detects overload situations and drops transactions from data streams using a probabilistic model. We tested our algorithm on both synthetic and real-life datasets to verify the feasibility of our algorithm.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

For most data stream applications, the volume of data is too huge to be stored in permanent devices or to be thoroughly scanned more than once. It is hence recognized that approximate answers are usually sufficient, where a good approximation obtained in a timely manner is often better than the exact answer that is delayed beyond the window of opportunity. Unfortunately, this is not the case for mining frequent patterns over data streams where algorithms capable of online processing data streams do not conform strictly to a precise error guarantee. Since the quality of approximate answers is as important as their timely delivery, it is necessary to design algorithms to meet both criteria at the same time. In this paper, we propose an algorithm that allows online processing of streaming data and yet guaranteeing the support error of frequent patterns strictly within a user-specified threshold. Our theoretical and experimental studies show that our algorithm is an effective and reliable method for finding frequent sets in data stream environments when both constraints need to be satisfied.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Background: Although there are significant benefits to frequent nocturnal home haemodialysis (NHHD) there has been a low acceptance of this therapy in Australia.

Aim: The aim of this paper is to explore and discuss the literature relating to the nursing barriers to frequent nocturnal home haemodialysis.

Methods:
A search of nursing, medical, social work and psychological literature was performed.

Results:
Nurses are key contributors to the increase of NHHD within the dialysis population. Knowledge, culture and nurse satisfaction are key areas to address to increase NHHD uptake.

Conclusion:
Nurses need to challenge the cultural and organisational barriers that are preventing further uptake of NHHD. If nurses do not we cannot claim to be helping patients attain their best possible outcome.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This study tests a model of Brand Knowledge and Brand Equity of brands of beer on new and frequent users in two populations that differ in their stage of the beer product life cycle and culture. Using Multiple Logistic Regression (MLR) and Binomial Logistic Regression (BLR), models based on the respondents' Brand Knowledge are able to correctly identify Chinese respondents’ preferred brand of beer 56% of the time, while correctly identifying 77% of respondents in an Australian sample when three top brands are tested. The model could further identify 67% of those that stay or switch in both the Australian and the Chinese samples.