254 results for Data Mining, Rough Sets, Multi-Dimension, Association Rules, Constraint


Relevance:

100.00% 100.00%

Publisher:

Abstract:

An information filtering (IF) system monitors an incoming document stream to find the documents that match the information needs specified by the user profiles. Learning to use the user profiles effectively is one of the most challenging tasks when developing an IF system. With the document selection criteria better defined based on the users’ needs, filtering large streams of information can be more efficient and effective. To learn the user profiles, term-based approaches have been widely used in the IF community because of their simplicity and directness. Term-based approaches are relatively well established. However, these approaches have problems when dealing with polysemy and synonymy, which often lead to an information overload problem. Recently, pattern-based approaches (or Pattern Taxonomy Models (PTM) [160]) have been proposed for IF by the data mining community. These approaches are better at capturing semantic information and have shown encouraging results for improving the effectiveness of the IF system. On the other hand, pattern discovery from large data streams is not computationally efficient. Also, these approaches have to deal with low-frequency pattern issues. The measures used by data mining techniques (for example, “support” and “confidence”) to learn the profile have turned out to be unsuitable for filtering. They can lead to a mismatch problem. This thesis uses rough set-based reasoning (term-based) and a pattern mining approach as a unified framework for information filtering to overcome the aforementioned problems. This system consists of two stages: topic filtering and pattern mining. The topic filtering stage is intended to minimize information overload by filtering out the most likely irrelevant information based on the user profiles. A novel user-profile learning method and a theoretical model of the threshold setting have been developed by using rough set decision theory.
The second stage (pattern mining) aims at solving the problem of the information mismatch. This stage is precision-oriented. A new document-ranking function has been derived by exploiting the patterns in the pattern taxonomy. The most likely relevant documents were assigned higher scores by the ranking function. Because a relatively small number of documents remain after the first stage, the computational cost is markedly reduced; at the same time, pattern discovery yields more accurate results. The overall performance of the system was improved significantly. The new two-stage information filtering model has been evaluated by extensive experiments. Tests were based on the well-known IR benchmarking processes, using the latest version of the Reuters dataset, namely the Reuters Corpus Volume 1 (RCV1). The performance of the new two-stage model was compared with both the term-based and data mining-based IF models. The results demonstrate that the proposed information filtering system significantly outperforms the other IF systems, such as the traditional Rocchio IF model, state-of-the-art term-based models including BM25 and Support Vector Machines (SVM), and the Pattern Taxonomy Model (PTM).
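The "support" and "confidence" measures mentioned in the abstract above can be illustrated with a minimal sketch. It uses the standard textbook definitions, not the thesis's own implementation; the function names and the toy term sets (`docs`) are hypothetical and serve only as an illustration.

```python
from typing import FrozenSet, List


def support(itemset: FrozenSet[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent: FrozenSet[str],
               consequent: FrozenSet[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Confidence of the rule antecedent -> consequent."""
    base = support(antecedent, transactions)
    if base == 0.0:
        return 0.0  # the rule never fires, so report zero confidence
    return support(antecedent | consequent, transactions) / base


# Hypothetical toy data: each document is reduced to a set of terms.
docs = [
    frozenset({"data", "mining", "pattern"}),
    frozenset({"data", "mining"}),
    frozenset({"data", "filter"}),
    frozenset({"rough", "set"}),
]

s = support(frozenset({"data", "mining"}), docs)
c = confidence(frozenset({"data"}), frozenset({"mining"}), docs)
```

Both measures depend only on how often terms co-occur, which is consistent with the abstract's remark that they may not match a user's actual information need.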

Relevance:

100.00% 100.00%

Publisher:

Abstract:

This study assesses the recently proposed data-driven background dataset refinement technique for speaker verification using alternate SVM feature sets to the GMM supervector features for which it was originally designed. The performance improvements brought about in each trialled SVM configuration demonstrate the versatility of background dataset refinement. This work also extends the originally proposed technique to exploit support vector coefficients as an impostor suitability metric in the data-driven selection process. Using support vector coefficients improved the performance of the refined datasets in the evaluation of unseen data. Further, attempts are made to exploit the differences in impostor example suitability measures from varying feature spaces to provide added robustness.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Artificial neural network (ANN) learning methods provide a robust and non-linear approach to approximating the target function for many classification, regression and clustering problems. ANNs have demonstrated good predictive performance in a wide variety of practical problems. However, there are strong arguments as to why ANNs are not sufficient for the general representation of knowledge: the poor comprehensibility of the learned ANN, and the inability to represent explanation structures. The overall objective of this thesis is to address these issues by: (1) explaining the decision process in ANNs in the form of symbolic rules (predicate rules with variables); and (2) providing explanatory capability by mapping the general conceptual knowledge that is learned by the neural networks into a knowledge base to be used in a rule-based reasoning system. A multi-stage methodology, GYAN, is developed and evaluated for the task of extracting knowledge from trained ANNs. The extracted knowledge is represented in the form of restricted first-order logic rules, and subsequently allows user interaction by interfacing with a knowledge-based reasoner. The performance of GYAN is demonstrated using a number of real-world and artificial data sets. The empirical results demonstrate that: (1) an equivalent symbolic interpretation is derived describing the overall behaviour of the ANN with high accuracy and fidelity, and (2) a concise explanation is given (in terms of rules, facts and predicates activated in a reasoning episode) as to why a particular instance is being classified into a certain category.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

This paper presents a practical framework to synthesize multi-sensor navigation information for localization of a rotary-wing unmanned aerial vehicle (RUAV) and estimation of unknown ship positions when the RUAV approaches the landing deck. The estimation performance of the visual tracking sensor can also be improved through integrated navigation. Three different sensors (inertial navigation, Global Positioning System, and visual tracking sensor) are used in a complementary manner to perform the navigation tasks required for an automatic landing. An extended Kalman filter (EKF) is developed to fuse data from the various navigation sensors and provide reliable navigation information. The performance of the fusion algorithm has been evaluated using real ship motion data. Simulation results suggest that the proposed method can be used to construct a practical navigation system for a UAV-ship landing system.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

It is a big challenge to acquire correct user profiles for personalized text classification, since users may be unsure when describing their interests. Traditional approaches to user profiling adopt machine learning (ML) to automatically discover classification knowledge from explicit user feedback describing personal interests. However, the accuracy of ML-based methods cannot be significantly improved in many cases due to the term independence assumption and the uncertainties associated with them. This paper presents a novel relevance feedback approach for personalized text classification. It applies data mining to discover knowledge from relevant and non-relevant text and constrains specific knowledge by reasoning rules to eliminate some conflicting information. We also developed a Dempster-Shafer (DS) approach as the means to utilise the specific knowledge to build high-quality data models for classification. The experimental results on Reuters Corpus Volume 1 and TREC topics show that the proposed technique achieves encouraging performance compared with state-of-the-art relevance feedback models.
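The abstract above does not say how the Dempster-Shafer approach is applied. As background only, the following minimal sketch shows the standard Dempster's rule of combination for two mass functions. The frame of discernment (`rel`, `non`), the mass values, and the function name `combine` are hypothetical and are not taken from the paper.

```python
from typing import Dict, FrozenSet

Mass = Dict[FrozenSet[str], float]


def combine(m1: Mass, m2: Mass) -> Mass:
    """Combine two mass functions with Dempster's rule."""
    combined: Mass = {}
    conflict = 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + wa * wb
            else:
                conflict += wa * wb  # mass that falls on the empty set
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; rule is undefined")
    norm = 1.0 - conflict
    return {focal: w / norm for focal, w in combined.items()}


# Hypothetical evidence about whether a document is relevant.
REL = frozenset({"rel"})
NON = frozenset({"non"})
THETA = REL | NON  # the whole frame, i.e. "don't know"

m_terms = {REL: 0.6, THETA: 0.4}
m_patterns = {REL: 0.5, NON: 0.3, THETA: 0.2}

fused = combine(m_terms, m_patterns)
```

The normalisation by `1 - conflict` redistributes the mass assigned to contradictory evidence, which is one way that conflicting information can be discounted.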

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Big data is big news in almost every sector, including crisis communication. However, not everyone has access to big data, and even when we do, we often lack the tools needed to analyze and cross-reference such a large data set. This paper therefore looks at patterns in small data sets that we are able to collect with our current tools, to understand whether we can find actionable information in what we already have. We analyzed 164,390 tweets collected during a 2011 earthquake to find out what type of location-specific information people mention in their tweets and when they talk about it. Based on our analysis, we find that even a small data set, containing far less data than a big data set, can be useful for quickly finding priority disaster-specific areas.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

This paper presents a novel framework to further advance the recent trend of using query decomposition and high-order term relationships in query language modeling, which takes into account terms implicitly associated with different subsets of query terms. Existing approaches, most notably the language model based on the Information Flow method, are however unable to capture multiple levels of associations and also suffer from a high computational overhead. In this paper, we propose to compute association rules from pseudo feedback documents that are segmented into variable-length chunks via multiple sliding windows of different sizes. Extensive experiments have been conducted on various TREC collections, and our approach significantly outperforms a baseline Query Likelihood language model, the Relevance Model and the Information Flow model.
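The segmentation step described above, which cuts a document into variable-length chunks using sliding windows of several sizes, can be sketched as follows. The window sizes, the step of one token, and the function names are assumptions made for illustration; the abstract does not specify them.

```python
from typing import Dict, List


def sliding_chunks(tokens: List[str], window: int, step: int = 1) -> List[List[str]]:
    """Return every window of `window` consecutive tokens, moving by `step`."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if not tokens:
        return []
    if len(tokens) <= window:
        return [list(tokens)]  # document shorter than the window: one chunk
    return [tokens[i:i + window] for i in range(0, len(tokens) - window + 1, step)]


def multi_window_chunks(tokens: List[str], window_sizes: List[int]) -> Dict[int, List[List[str]]]:
    """Segment the same token sequence once per window size."""
    return {w: sliding_chunks(tokens, w) for w in window_sizes}


# Hypothetical pseudo feedback document, already tokenised.
doc = "query language model association".split()
chunks = multi_window_chunks(doc, [2, 3])
```

Each chunk can then be treated as one transaction when mining association rules, so that small windows capture tight term associations and larger windows capture looser ones.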

Relevance:

100.00% 100.00%

Publisher:

Abstract:

High-Order Co-Clustering (HOCC) methods have attracted considerable attention in recent years because of their ability to cluster multiple types of objects simultaneously using all available information. During the clustering process, HOCC methods exploit object co-occurrence information, i.e., inter-type relationships amongst different types of objects, as well as object affinity information, i.e., intra-type relationships amongst the same types of objects. However, it is difficult to learn accurate intra-type relationships in the presence of noise and outliers. Existing HOCC methods consider the p nearest neighbours based on Euclidean distance for the intra-type relationships, which leads to incomplete and inaccurate intra-type relationships. In this paper, we propose a novel HOCC method that incorporates multiple subspace learning with a heterogeneous manifold ensemble to learn complete and accurate intra-type relationships. Multiple subspace learning reconstructs the similarity between any pair of objects that belong to the same subspace. The heterogeneous manifold ensemble is created based on two types of intra-type relationships, learnt using a p-nearest-neighbour graph and multiple subspace learning. Moreover, to ensure the robustness of the clustering process, we introduce a sparse error matrix into the matrix decomposition and develop a novel iterative algorithm. Empirical experiments show that the proposed method achieves improved results over state-of-the-art HOCC methods in terms of FScore and NMI.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Over the past few decades, frog species have been experiencing a dramatic decline around the world. The reasons for this decline include habitat loss, invasive species, climate change and so on. To better understand the status of frog species, classifying frogs has become increasingly important. In this study, acoustic features are investigated for multi-level classification of Australian frogs by family, genus and species, covering three families, eleven genera and eighty-five species collected from Queensland, Australia. For each frog species, six instances are selected, from which ten acoustic features are calculated. Then, the multicollinearity among the ten features is studied in order to select non-correlated features for subsequent analysis. A decision tree (DT) classifier is used to visually and explicitly determine which acoustic features are relatively important for classifying family, which for genus, and which for species. Finally, a weighted support vector machine (SVM) classifier is used for the multi-level classification, with the three most important acoustic features for each level. Our experimental results indicate that using different acoustic feature sets can successfully classify frogs at different levels, and the average classification accuracy can be up to 85.6%, 86.1% and 56.2% for family, genus and species respectively.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

STUDY QUESTION Are single-nucleotide polymorphisms (SNPs) at the interleukin 1A (IL1A) gene locus associated with endometriosis risk? SUMMARY ANSWER We found evidence for strong association between IL1A SNPs and endometriosis risk. WHAT IS KNOWN ALREADY Genetic factors contribute substantially to the complex aetiology of endometriosis and the disease has an estimated heritability of ∼51%. We, and others, have conducted genome-wide association (GWA) studies for endometriosis, which identified a total of nine independent risk loci. Recently, two small Japanese studies reported eight SNPs (rs6542095, rs11677416, rs3783550, rs3783525, rs3783553, rs2856836, rs1304037 and rs17561) at the IL1A gene locus as suggestively associated with endometriosis risk. There is also evidence of a link between inflammation and endometriosis. STUDY DESIGN, SIZE, DURATION We sought to further investigate the eight IL1A SNPs for association with endometriosis using an independent sample of 3908 endometriosis cases and 8568 controls of European and Japanese ancestry. The study was conducted between October 2013 and July 2014. PARTICIPANTS/MATERIALS, SETTING, METHODS By leveraging GWA data from our previous multi-ethnic GWA meta-analysis for endometriosis, we imputed variants in the IL1A region, using a recent 1000 Genomes reference panel. After combining summary statistics for the eight SNPs from our European and Japanese imputed data with the published results, a fixed-effect meta-analysis was performed. An additional meta-analysis restricted to endometriosis cases with moderate-to-severe (revised American Fertility Society stage 3 or 4) disease versus controls was also performed. MAIN RESULTS AND THE ROLE OF CHANCE All eight IL1A SNPs successfully replicated at P < 0.014 in the European imputed data with concordant direction and similar size to the effects reported in the original Japanese studies. 
Of these, three SNPs (rs6542095, rs3783550 and rs3783525) also showed association with endometriosis at a nominal P < 0.05 in our independent Japanese sample. Fixed-effect meta-analysis of the eight SNPs for moderate-to-severe endometriosis produced a genome-wide significant association for rs6542095 (odds ratio = 1.21; 95% confidence interval = 1.13–1.29; P = 3.43 × 10⁻⁸). LIMITATIONS, REASONS FOR CAUTION The meta-analysis for moderate-to-severe endometriosis included results of moderate-to-severe endometriosis cases from our European data sets and all endometriosis cases from the Japanese data sets, as disease stage information was not available for endometriosis cases in the Japanese data sets. WIDER IMPLICATIONS OF THE FINDINGS SNP rs6542095 is located ∼2.3 kb downstream of the IL1A gene and ∼6.9 kb upstream of cytoskeleton-associated protein 2-like (CKAP2L) gene. The IL1A gene encodes the IL1a protein, a member of the interleukin 1 cytokine family which is involved in various immune responses and inflammatory processes. These results provide important replication in an independent Japanese sample and, for the first time, association of the IL1A locus in endometriosis patients of European ancestry. SNPs within the IL1A locus may regulate other genes, but if IL1A is the target, our results provide supporting evidence for a link between inflammatory responses and the pathogenesis of endometriosis. STUDY FUNDING/COMPETING INTEREST(S) The research was funded by grants from the Australian National Health and Medical Research Council and Wellcome Trust. None of the authors has competing interests for the study.
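For readers unfamiliar with the fixed-effect meta-analysis mentioned in the abstract above, the following minimal sketch shows the standard inverse-variance pooling of per-study log odds ratios. It is a generic illustration, not the authors' pipeline; the function name and the two input studies are hypothetical, and the numbers are not taken from the paper.

```python
import math
from typing import List, Tuple


def fixed_effect_meta(log_ors: List[float], ses: List[float]) -> Tuple[float, float, float, float]:
    """Inverse-variance fixed-effect pooling.

    Returns (pooled odds ratio, 95% CI lower, 95% CI upper, two-sided P).
    """
    if not log_ors or len(log_ors) != len(ses):
        raise ValueError("need one standard error per log odds ratio")
    weights = [1.0 / se ** 2 for se in ses]  # precise studies weigh more
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, log_ors)) / total
    pooled_se = math.sqrt(1.0 / total)
    z = pooled / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal P value
    z95 = 1.959963984540054  # 97.5th percentile of the standard normal
    return (math.exp(pooled),
            math.exp(pooled - z95 * pooled_se),
            math.exp(pooled + z95 * pooled_se),
            p_value)


# Hypothetical per-study estimates (log odds ratio, standard error).
pooled_or, ci_low, ci_high, p = fixed_effect_meta([0.0, 0.3], [0.1, 0.2])
```

Because the weights are the inverse variances, the pooled estimate here sits closer to the first, more precise study than to the simple average of the two.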

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Multi-document summarization, which addresses the problem of information overload, has been widely used in various real-world applications. Most existing approaches adopt a term-based representation for documents, which limits the performance of multi-document summarization systems. In this paper, we propose a novel pattern-based topic model (PBTMSum) for the task of multi-document summarization. PBTMSum combines pattern mining techniques with LDA topic modelling to generate discriminative and semantically rich representations for topics and documents, so that the most representative and non-redundant sentences can be selected to form a succinct and informative summary. Extensive experiments are conducted on the Document Understanding Conference (DUC) 2007 data. The results demonstrate the effectiveness and efficiency of our proposed approach.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

The problem of determining the script and language of a document image has a number of important applications in the field of document analysis, such as indexing and sorting of large collections of such images, or as a precursor to optical character recognition (OCR). In this paper, we investigate the use of texture as a tool for determining the script of a document image, based on the observation that text has a distinct visual texture. An experimental evaluation of a number of commonly used texture features is conducted on a newly created script database, providing a qualitative measure of which features are most appropriate for this task. Strategies for improving classification results in situations with limited training data and multiple font types are also proposed.