852 results for 280103 Information Storage, Retrieval and Management
Abstract:
A major task of traditional temporal event sequence mining is to predict the occurrences of a special type of event (called target event) in a long temporal sequence. Our previous work has defined a new type of pattern, called event-oriented pattern, which can potentially predict the target event within a certain period of time. However, in the event-oriented pattern discovery, because the size of interval for prediction is pre-defined, the mining results could be inaccurate and carry misleading information. In this paper, we introduce a new concept, called temporal feature, to rectify this shortcoming. Generally, for any event-oriented pattern discovered under the pre-given size of interval, the temporal feature is the minimal size of interval that makes the pattern interesting. Thus, by further investigating the temporal features of discovered event-oriented patterns, we can refine the knowledge for the target event prediction.
Abstract:
Sharing data among organizations often leads to mutual benefit. Recent technology in data mining has enabled efficient extraction of knowledge from large databases. This, however, increases the risk of disclosing sensitive knowledge when the database is released to other parties. To address this privacy issue, one may sanitize the original database so that the sensitive knowledge is hidden. The challenge is to minimize the side effect on the quality of the sanitized database so that non-sensitive knowledge can still be mined. In this paper, we study this problem in the context of hiding sensitive frequent itemsets by judiciously modifying the transactions in the database. To preserve the non-sensitive frequent itemsets, we propose a border-based approach to efficiently evaluate the impact of any modification to the database during the hiding process. The quality of the database can be well maintained by greedily selecting the modifications with minimal side effect. Experimental results are also reported to show the effectiveness of the proposed approach. © 2005 IEEE
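The greedy hiding idea in this abstract can be illustrated with a minimal sketch. It is not the paper's border-based method: the side-effect measure here is a simplified stand-in (the number of itemsets to preserve that would lose a supporting transaction), and the names `support`, `hide_itemset`, `keep` and `min_support` are hypothetical.

```python
def support(db, itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)


def hide_itemset(db, sensitive, keep, min_support):
    """Greedily delete items until support(sensitive) < min_support.

    Assumes min_support >= 1. `keep` is a list of non-sensitive itemsets
    whose support we would like to preserve (a simplified stand-in for
    the border used in the paper).
    """
    db = [set(t) for t in db]  # work on a copy
    while support(db, sensitive) >= min_support:
        best = None
        for idx, t in enumerate(db):
            if not sensitive <= t:
                continue  # only supporting transactions are candidates
            for item in sorted(sensitive):
                # Side effect: itemsets in `keep` that this transaction
                # would stop supporting if `item` were deleted from it.
                cost = sum(1 for k in keep if item in k and k <= t)
                if best is None or cost < best[0]:
                    best = (cost, idx, item)
        _, idx, item = best
        db[idx].discard(item)
    return db
```

For example, with transactions `{a,b,c}`, `{a,b}`, `{a,b,c}`, `{b,c}`, sensitive itemset `{a,b}` and threshold 2, the sketch deletes `a` rather than `b` from supporting transactions, so the support of `{b,c}` is left unchanged.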
Abstract:
Collaborative filtering is regarded as one of the most promising recommendation algorithms. Item-based approaches to collaborative filtering identify the similarity between two items by comparing users' ratings of them. In these approaches, ratings produced at different times are weighted equally; that is, changes in user purchase interest are not taken into consideration. Yet an item that was rated recently by a user should have a bigger impact on the prediction of future user behaviour than an item that was rated a long time ago. In this paper, we present a novel algorithm that computes time weights for different items so that old data receive decreasing weight. Users' purchase habits vary, and even the same user has quite different attitudes towards different items. Our proposed algorithm therefore uses clustering to discriminate between different kinds of items. For each item cluster, we trace each user's change in purchase interest and introduce a personalized decay factor according to the user's own purchase behaviour. Empirical studies have shown that our new algorithm substantially improves the precision of item-based collaborative filtering without introducing higher-order computational complexity.
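As a rough illustration of time-weighted prediction, the sketch below down-weights older ratings in a standard item-based weighted average. The exponential form of the weight and the names `time_weight`, `predict_rating` and `decay_rate` are assumptions made for this example; the abstract does not state the weighting function, and the paper's personalized, per-cluster decay factor is represented here only by a single `decay_rate` parameter.

```python
import math


def time_weight(age_days, decay_rate):
    """Assumed exponential decay: 1.0 for a rating made now, less for older ones."""
    return math.exp(-decay_rate * age_days)


def predict_rating(similarities, ratings, ages_days, decay_rate):
    """Time-weighted item-based prediction for one user and one target item.

    similarities[i]: similarity between the target item and rated item i
    ratings[i]:      the user's rating of item i
    ages_days[i]:    how long ago that rating was made, in days
    decay_rate:      hypothetical decay factor (per user and item cluster
                     in the paper; a plain number here)
    Returns None when no rated item carries any weight.
    """
    numerator = 0.0
    denominator = 0.0
    for sim, rating, age in zip(similarities, ratings, ages_days):
        w = time_weight(age, decay_rate)
        numerator += sim * rating * w
        denominator += abs(sim) * w
    return numerator / denominator if denominator else None
```

With `decay_rate = 0` this reduces to the usual similarity-weighted average; with a positive decay rate, a recent rating dominates an equally similar but much older one.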
Abstract:
Conventionally, research on document classification focuses on improving the learning capabilities of classifiers. Nevertheless, according to our observation, the effectiveness of classification is limited by the suitability of the document representation. Intuitively, the more features that are used in a representation, the more comprehensively documents are represented. However, if a representation contains too many irrelevant features, the classifier suffers not only from the curse of dimensionality but also from overfitting. To address this problem of the suitability of document representations, we present a classifier-independent approach to measuring their effectiveness. Our approach utilises a labelled document corpus to estimate the distribution of documents in the feature space. By looking at documents in this way, we can clearly identify the contributions made by different features to document classification. Some experiments have been performed to show how the effectiveness is evaluated. Our approach can be used as a tool to assist feature selection, dimensionality reduction and document classification.
Abstract:
Management of collaborative business processes that span multiple business entities has emerged as a key requirement for business success. These processes are embedded in sets of rules describing complex message-based interactions between parties such that if a logical expression defined on the set of received messages is satisfied, one or more outgoing messages are dispatched. The execution of these processes presents significant challenges, since each content-rich message may contribute towards the evaluation of multiple expressions in different ways and the sequence of message arrival cannot be predicted. These challenges must be overcome in order to develop an efficient execution strategy for collaborative processes in an intensive operating environment with a large number of rules and very high message throughput. In this paper, we discuss issues relevant to the evaluation of such expressions and describe a basic query-based method for this purpose, including suggested indexes for improved performance. We conclude by identifying several potential future research directions in this area. © 2010 IEEE. All rights reserved
Abstract:
The paper provides evidence that spatial indexing structures offer faster resolution of Formal Concept Analysis queries than B-Tree/Hash methods. We show that many Formal Concept Analysis operations, such as computing the contingent and extent sizes and listing the matching objects, enjoy improved performance with the use of spatial indexing structures such as the RD-Tree. Queries run up to eighty times faster, depending on the data and query. The motivation for our study is the application of Formal Concept Analysis to Semantic File Systems. In such applications millions of formal objects must be dealt with. It has been found that spatial indexing also provides an effective indexing technique for more general-purpose applications requiring scalability in Formal Concept Analysis systems. The coverage and benchmarking are presented with general applications in mind.
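To make the operations named in this abstract concrete, the following sketch computes the extent and the object contingent for a query over a tiny formal context. It uses a naive linear scan over Python sets, not the RD-Tree or any other index evaluated in the paper; the function names and the file-system example data are hypothetical, and the contingent is taken in its usual sense (objects whose attribute set equals the concept's intent).

```python
def extent(context, attrs):
    """Objects that have every attribute in `attrs`."""
    return {obj for obj, obj_attrs in context.items() if attrs <= obj_attrs}


def intent(context, objs, all_attrs):
    """Attributes shared by every object in `objs` (all attributes if empty)."""
    shared = set(all_attrs)
    for obj in objs:
        shared &= context[obj]
    return shared


def contingent(context, attrs, all_attrs):
    """Objects in the extent whose attribute set is exactly the concept's intent."""
    ext = extent(context, attrs)
    closed = intent(context, ext, all_attrs)
    return {obj for obj in ext if context[obj] == closed}


# Hypothetical context: files as formal objects, tags as attributes.
files = {
    "a.txt": {"text", "small"},
    "b.txt": {"text"},
    "c.jpg": {"image", "small"},
}
tags = {"text", "image", "small"}
```

Every call here scans all objects, so the cost grows linearly with the number of formal objects; that per-query scan is what an index is meant to avoid when millions of objects are involved.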
Abstract:
In multimedia retrieval, a query is typically interactively refined towards the ‘optimal’ answers by exploiting user feedback. However, in existing work, the refined query is re-evaluated in each iteration. This is not only inefficient but also fails to exploit the answers that may be common between iterations. In this paper, we introduce a new approach called SaveRF (Save random accesses in Relevance Feedback) for iterative relevance feedback search. SaveRF predicts the potential candidates for the next iteration and maintains this small set for efficient sequential scan. By doing so, repeated candidate accesses can be saved, hence reducing the number of random accesses. In addition, an efficient scan of the overlap before the search starts also tightens the search space with a smaller pruning radius. We implemented SaveRF, and our experimental study on real-life data sets shows that it can reduce the I/O cost significantly.