161 results for contrast mining


Relevance: 20.00%

Abstract:

In data stream applications, a good approximation obtained in a timely manner is often better than an exact answer that is delayed beyond the window of opportunity. Of course, the quality of the approximation is as important as its timely delivery. Unfortunately, algorithms capable of online processing do not conform strictly to a precise error guarantee. Since both online processing and a precise error bound are essential, stream algorithms need to meet both criteria. Yet this is not the case for mining frequent sets in data streams. We present EStream, a novel algorithm that allows online processing while producing results strictly within the error bound. Our theoretical and experimental results show that EStream is a better candidate for finding frequent sets in data streams when both constraints need to be satisfied.
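To make the idea of an online frequency count with a fixed error bound concrete, here is a minimal sketch of the classic Lossy Counting scheme for single items. It is not EStream, whose details the abstract does not give; the function names and the restriction to single items rather than itemsets are assumptions made for illustration.

```python
from math import ceil


def lossy_count(stream, epsilon):
    """One-pass approximate counts; each count is low by at most epsilon * n."""
    width = ceil(1 / epsilon)  # number of items per bucket
    counts = {}                # item -> (observed count, maximum possible undercount)
    n = 0
    for item in stream:
        n += 1
        bucket = ceil(n / width)
        if item in counts:
            seen, delta = counts[item]
            counts[item] = (seen + 1, delta)
        else:
            # The item may have been seen and pruned in earlier buckets.
            counts[item] = (1, bucket - 1)
        if n % width == 0:
            # At each bucket boundary, drop items that cannot be frequent.
            counts = {k: (c, d) for k, (c, d) in counts.items() if c + d > bucket}
    return counts, n


def frequent_items(counts, n, support, epsilon):
    """Items whose true frequency may reach `support`; no truly frequent item is missed."""
    return {k for k, (c, _) in counts.items() if c >= (support - epsilon) * n}


stream = ["a"] * 50 + ["b"] * 30 + [f"rare{i}" for i in range(20)]
counts, n = lossy_count(stream, epsilon=0.1)
print(frequent_items(counts, n, support=0.2, epsilon=0.1))
```

The sketch keeps memory small by pruning at bucket boundaries, and the stored `delta` bounds how much each count can be underestimated, which is what gives the error guarantee.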

Relevance: 20.00%

Abstract:

Most algorithms that focus on discovering frequent patterns from data streams assume that the machinery is capable of managing all incoming transactions without any delay or the need to drop transactions. However, this assumption is often impractical due to the inherent characteristics of data stream environments. Under high load conditions in particular, there is often a shortage of system resources to process the incoming transactions. This causes unwanted latencies that, in turn, affect the applicability of the data mining models produced, which often have a small window of opportunity. We propose a load shedding algorithm to address this issue. The algorithm adaptively detects overload situations and drops transactions from data streams using a probabilistic model. We tested the algorithm on both synthetic and real-life datasets to verify its feasibility.
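A rough sketch of probabilistic load shedding follows. The abstract does not describe the paper's overload detector or its probabilistic model, so the rule used here (keep each transaction with probability capacity / arrival rate when overloaded) and all names are assumptions chosen only to illustrate the general idea.

```python
import random


def keep_probability(arrival_rate, capacity):
    """Fraction of transactions to keep; 1.0 means the system is not overloaded."""
    if arrival_rate <= capacity:
        return 1.0
    return capacity / arrival_rate


def shed(transactions, arrival_rate, capacity, rng):
    """Drop transactions at random so the expected kept load matches capacity."""
    p = keep_probability(arrival_rate, capacity)
    return [t for t in transactions if rng.random() < p]


rng = random.Random(0)  # seeded only to make the example repeatable
batch = list(range(10_000))
kept = shed(batch, arrival_rate=400, capacity=100, rng=rng)
print(len(kept) / len(batch))  # close to 0.25 under this rule
```

Because dropping is uniform at random, support counts measured on the kept transactions can be scaled by 1 / p to estimate counts over the full stream.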

Relevance: 20.00%

Abstract:

Understanding the differences between contrasting groups is a fundamental task in data analysis. This realization has led to the development of a new special-purpose data mining technique, contrast-set mining. We undertook a study with a retail collaborator to compare contrast-set mining with existing rule-discovery techniques. To our surprise we observed that straightforward application of an existing commercial rule-discovery system, Magnum Opus, could successfully perform the contrast-set-mining task. This led to the realization that contrast-set mining is a special case of the more general rule-discovery task. We present the results of our study together with a proof of this conclusion.
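As an illustration of the contrast-set idea, the sketch below lists itemsets whose support differs between two groups by at least a chosen threshold. It is a brute-force toy under stated assumptions: the function names and the sample data are hypothetical, and the statistical significance tests that practical contrast-set miners apply are omitted.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    if not transactions:
        return 0.0
    return sum(itemset <= t for t in transactions) / len(transactions)


def contrast_sets(group_a, group_b, min_diff, max_size=2):
    """Return (itemset, support difference) pairs with |difference| >= min_diff."""
    items = sorted(set().union(*group_a, *group_b))
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            diff = support(candidate, group_a) - support(candidate, group_b)
            if abs(diff) >= min_diff:
                found.append((combo, diff))
    return found


# Hypothetical baskets for two customer groups.
group_a = [{"milk", "bread"}, {"milk"}, {"milk", "eggs"}, {"bread"}]
group_b = [{"eggs"}, {"eggs", "bread"}, {"milk", "eggs"}, {"eggs"}]
print(contrast_sets(group_a, group_b, min_diff=0.5))
```

Each reported itemset can also be read as a rule of the form "itemset implies group", which is one informal way to see the link to rule discovery that the abstract describes.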

Relevance: 20.00%

Abstract:

In this paper, we propose a model for discovering frequent sequential patterns, or phrases, which can be used as profile descriptors of documents. Data mining algorithms can undoubtedly produce numerous phrases. However, it is difficult to use these phrases effectively to answer what users want. Therefore, we present a pattern taxonomy extraction model which performs the task of extracting descriptive frequent sequential patterns by pruning the meaningless ones. The model is then extended and tested by applying it to an information filtering system. The results of the experiment show that pattern-based methods outperform keyword-based methods. The results also indicate that removal of meaningless patterns not only reduces the cost of computation but also improves the effectiveness of the system.

Relevance: 20.00%

Abstract:

Data mining plays an important role in decision making for business activities and governmental administration. Since many organizations or their divisions do not possess the in-house expertise and infrastructure for data mining, it is beneficial to delegate data mining tasks to external service providers. However, the organizations or divisions may lose private information during the delegation process. In this paper, we present a Bloom filter based solution that enables organizations or their divisions to delegate the task of mining association rules while protecting data privacy. Our approach can achieve high precision in data mining by trading off only storage requirements, instead of trading off the level of privacy preservation.
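For readers unfamiliar with the underlying data structure, here is a minimal Bloom filter sketch. It shows only the generic structure, not the paper's delegation protocol; the class name, the SHA-256 based hashing, and the parameter values are assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Set membership with no false negatives and a tunable false-positive rate."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter(num_bits=1024, num_hashes=3)
for item in ["milk", "bread", "eggs"]:
    bf.add(item)
print("milk" in bf)  # an added item is always reported as present
```

The filter stores only bit positions, not the items themselves, and increasing `num_bits` lowers the false-positive rate at the cost of more storage, which matches the storage-for-precision trade-off the abstract mentions.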

Relevance: 20.00%

Abstract:

This study is motivated by How [How, J., 2000. The initial and long run performances of mining IPOs in Australia. Aust. J. Manage. 25, 95–118] who examined 100 Australian gold mining initial public offerings (IPOs) from 1979 to 1990 to report an average 119.51% underpricing return by those IPOs. This study updates that analysis by investigating 114 Australian gold mining IPOs from 1994 to 2004 and finds a significantly lower 13.3% average first day return. Options offered to underwriters can in part explain these returns as can the change in either the Gold Index or the All Ordinaries Index from the date of the prospectus to the date of listing.

Relevance: 20.00%

Abstract:

Arsenic is a proven carcinogen often found at high concentrations in association with gold and other heavy metals. The freshwater yabby, Cherax destructor Clark (Decapoda, Parastacidae), is a ubiquitous species native to Australia's central and eastern regions, with a growing international commercial market. However, in this region of Australia, yabby farmers often harvest organisms from old mine tailings dams with elevated environmental arsenic levels. Yabbies exposed to elevated environmental arsenic were found to accumulate and store as much as 100 μg/g arsenic in their tissues. The accumulation is proportional to the concentration of arsenic in the sediment and is high enough to be of concern for people who eat the yabbies. A comparison of arsenic levels in wild and lab-fed animals also was performed. Although there was no significant difference in the level of arsenic in the various organs of the wild animals, the animals purchased from a yabby farm showed a significantly higher arsenic concentration in their hepatopancreas (3.7 ± 0.9 μg/g) compared to other organs (0.6–1.8 μg/g). Furthermore, after a 40-d exposure to food containing 200 to 300 μg/g inorganic arsenic, arsenate (As[V])-exposed animals showed a significant increase in tissue-specific arsenic accumulation, whereas arsenite (As[III])-exposed animals showed a lower, nonsignificant increase in As uptake, primarily in the hepatopancreas. These results have important implications for yabby growers and consumers alike.

Relevance: 20.00%

Abstract:

For most data stream applications, the volume of data is too large to be stored on permanent devices or to be thoroughly scanned more than once. It is hence recognized that approximate answers are usually sufficient, where a good approximation obtained in a timely manner is often better than an exact answer that is delayed beyond the window of opportunity. Unfortunately, this is not the case for mining frequent patterns over data streams, where algorithms capable of processing data streams online do not conform strictly to a precise error guarantee. Since the quality of approximate answers is as important as their timely delivery, it is necessary to design algorithms that meet both criteria at the same time. In this paper, we propose an algorithm that allows online processing of streaming data while guaranteeing that the support error of frequent patterns stays strictly within a user-specified threshold. Our theoretical and experimental studies show that our algorithm is an effective and reliable method for finding frequent sets in data stream environments when both constraints need to be satisfied.

Relevance: 20.00%

Abstract:

The objective is to measure the utility of real-time commercial decision making. This is important because of the higher possibility of mistakes in real-time decisions, problems with recording actual occurrences, and the significant costs associated with predictions produced by algorithms. The first contribution is to use overall utility and represent individual utility with a monetary value instead of a prediction. The second is to calculate the benefit from predictions using the utility-based decision threshold. The third is to incorporate the cost of predictions. For experiments, overall utility is used to evaluate communal and spike detection, and their adaptive versions. The overall utility results show that with fewer alerts, communal detection is better than spike detection. With more alerts, adaptive communal and spike detection are better than their static versions. To maximise overall utility with all algorithms, only the highest 1% to 4% of predictions should be alerts.