77 results for Information Filtering, Pattern Mining, Relevance Feature Discovery, Text Mining


Relevance:

40.00%

Publisher:

Abstract:

The Fish-net algorithm is a novel field learning algorithm that derives classification rules by looking at the range of values of each attribute instead of the individual point values. In this paper, we present a Feature Selection Fish-net learning algorithm to solve the dual imbalance problem in text classification. Dual imbalance comprises instance imbalance and feature imbalance. Instance imbalance is caused by unevenly distributed classes, and feature imbalance is due to differing document lengths. The proposed approach consists of two phases: (1) select a feature subset consisting of the features that are more supportive of the difficult minority class; (2) construct classification rules based on the original Fish-net algorithm. Our experimental results on Reuters-21578 show that the proposed approach achieves a better balanced accuracy rate on both the majority and minority classes than Naive Bayes Multinomial and SVM.
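The abstract does not define its balanced accuracy measure. A minimal sketch, assuming the common reading of balanced accuracy as the mean of per-class recall, shows why plain accuracy can hide poor minority-class performance. The function names, class labels, and toy predictions below are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Return {label: fraction of that label's instances predicted correctly}."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (assumed definition, see lead-in)."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Toy data: 8 majority documents, 2 minority documents, and a classifier
# that always predicts the majority class.
y_true = ["maj"] * 8 + ["min"] * 2
y_pred = ["maj"] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy, balanced_accuracy(y_true, y_pred))
```

In this toy case the always-majority classifier looks strong on plain accuracy, while the per-class view exposes that the minority class is never recovered.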

Relevance:

40.00%

Publisher:

Abstract:

Determining the causal structure of a domain is a key task in the area of Data Mining and Knowledge Discovery. The algorithm proposed by Wallace et al. [15] has demonstrated a strong ability to discover Linear Causal Models from given data sets. However, some experiments showed that this algorithm has difficulty discovering linear relations with small deviation, and it occasionally gives a negative message length, which should not be allowed. In this paper, a more efficient and precise MML encoding scheme is proposed to describe the model structure and the nodes in a Linear Causal Model. The estimation of different parameters is also derived. Empirical results show that the new algorithm outperforms the previous MML-based algorithm in terms of both speed and precision.

Relevance:

40.00%

Publisher:

Abstract:

Discovering a precise causal structure that accurately reflects the given data is one of the most essential tasks in the area of data mining and machine learning. One of the successful causal discovery approaches is the information-theoretic approach using the Minimum Message Length Principle [19]. This paper presents an improved version of the MML discovery algorithm together with further experimental results. We introduce a new encoding scheme for measuring the cost of describing the causal structure. Stirling's approximation is also applied to reduce the computational cost, so the algorithm runs more efficiently. The experimental results of the current version of the discovery system show that: (1) the current version is capable of discovering what was discovered by the previous system; (2) the current system is capable of discovering more complicated causal models with a large number of variables; (3) the new version works more efficiently than the previous version in terms of time complexity.
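The abstract does not say where the Stirling function enters the encoding. A minimal sketch, assuming it replaces log-factorial terms of the kind that often appear in message length calculations, compares Stirling's approximation of ln(n!) with the exact value. The function names are illustrative, not the authors'.

```python
import math


def log_factorial_stirling(n):
    """Approximate ln(n!) as n*ln(n) - n + 0.5*ln(2*pi*n)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return 0.0  # ln(0!) = 0; the formula itself is undefined at n = 0
    return n * math.log(n) - n + 0.5 * math.log(2 * math.pi * n)


def log_factorial_exact(n):
    """Exact ln(n!) via the log-gamma function: ln(n!) = lgamma(n + 1)."""
    return math.lgamma(n + 1)


for n in (5, 50, 500):
    approx = log_factorial_stirling(n)
    exact = log_factorial_exact(n)
    print(n, approx, exact, exact - approx)
```

The absolute error of this form shrinks roughly as 1/(12n), so the approximation is tight for the larger counts that arise with many variables.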

Relevance:

40.00%

Publisher:

Abstract:

The approaches proposed in the past for discovering sequential patterns mainly focused on single sequential data. In the real world, however, some sequential patterns are hidden among multi-sequential event data. It has been noted that knowledge discovery with user-specified constraints, templates, or skeletons is receiving wide attention because it is more efficient and avoids the tedious selection of useful patterns from the mass-produced results. In this paper, we present a novel pattern in correlated multi-sequential event data, together with an approach for mining it. We call this pattern a sequential causal pattern. A group of skeletons of sequential causal patterns, which may be specified by the user or generated by the program, are verified or mined by embedding them into the mining engine. Experiments show that this method, when applied to discovering the occurrence regularities of a crop pest in a region, is successful in mining sequential causal patterns with user-specified skeletons in multi-sequential event data.

Relevance:

40.00%

Publisher:

Abstract:

Spam is commonly defined as unsolicited email messages, and the goal of spam categorization is to distinguish between spam and legitimate email messages. Many researchers have tried to separate spam from legitimate emails using machine learning algorithms based on statistical learning methods. In this paper, an innovative and intelligent spam filtering model based on the support vector machine (SVM) is proposed. This model combines linear and nonlinear SVM techniques, where the linear SVM performs better for text-based spam messages that share similar characteristics. The proposed model considers both text-based and image-based email messages for classification by selecting an appropriate kernel function for information transformation.
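The abstract does not name the nonlinear kernel it uses. A minimal sketch, assuming the widely used RBF (Gaussian) kernel as the nonlinear choice, contrasts it with the linear kernel on small feature vectors. The function names, the `gamma` value, and the toy vectors are illustrative assumptions.

```python
import math


def linear_kernel(x, z):
    """Linear kernel: the plain dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, z))


def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


# Toy term-count vectors for two messages (made-up values).
message_a = [1.0, 2.0]
message_b = [3.0, 4.0]

print(linear_kernel(message_a, message_b))
print(rbf_kernel(message_a, message_b))
```

The linear kernel grows with the magnitude of the vectors, whereas the RBF kernel depends only on the distance between them and is bounded between 0 and 1, which is one reason different input types may suit different kernels.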

Relevance:

40.00%

Publisher:

Abstract:

In text categorization applications, class imbalance, which refers to an uneven data distribution where one class is represented by far fewer instances than the others, is a commonly encountered problem. In such a situation, conventional classifiers tend to have a strong performance bias, which results in a high accuracy rate on the majority class but a very low rate on the minority classes. An extreme strategy for unbalanced learning is to discard the majority instances and apply one-class classification to the minority class. However, this could easily cause another type of bias, which increases the accuracy rate on the minority classes by sacrificing the majority class. This paper aims to investigate approaches that reduce these two types of performance bias and improve the reliability of discovered classification rules. Experimental results show that the inexact field learning method and parameter-optimized one-class classifiers achieve more balanced performance than the standard approaches.

Relevance:

40.00%

Publisher:

Abstract:

This paper presents a new methodology for the automatic development of a multilingual Web portal for multilingual knowledge discovery and management. It aims to provide an efficient and effective framework for selecting and organizing knowledge from voluminous, linguistically diverse Web content. To achieve this, a concept-based approach that incorporates text mining and Web content mining using neural network and fuzzy techniques is proposed. First, a concept-based taxonomy of themes, which will act as the hierarchical backbone of the Web portal, is automatically generated. Second, a concept-based multilingual Web crawler is developed to intelligently harvest relevant multilingual documents from the Web. Finally, a concept-based multilingual text categorization technique is proposed to organize multilingual documents by concepts. As such, correlated multilingual Web documents can be gathered, filtered, and organised based on their semantic content to facilitate high-performance multilingual information access.

Relevance:

40.00%

Publisher:

Abstract:

In this paper I describe the discursive strategies related to the writer–reader textual reciprocity. I focus on one way of achieving such reciprocity -- the employment by the writer of facilitative schematic structures and metalanguage where one text segment signposts information conveyed in the segment that follows. I refer to these facilitative schematic structures as "organising relational schemata". I see organising relations as the most explicit components of the rhetorical structure of texts: they illuminate the main message and aid the reader's cognitive processes in the orientation of how information is conveyed by text.
This paper discusses the way the choices of organising relations and associated metalanguage by the writers in different cultures and different discourse communities contribute to the communicative homeostasis in the world of text. It shows how the influence of a native culture and intellectual style together with the forces operating within the writer's international disciplinary community interact in the authorial guidance in the scholarly prose.
I introduce and exemplify three types of organising relational structures: Advance Organisers, Introducers and Enumerators. I trace the utilisation of these three types of relations in sociology research papers written in English and produced in "Anglo" and Polish academic discourse communities by native English speaking and native Polish speaking scholars. The relational typology adopted is based on a study by Golebiowski (2002), which proposed a theoretical framework for the examination of the discoursal structure of research papers, referred to as FARS – Framework for the Analysis of the Rhetorical Structure of Texts. FARS entails a relational taxonomy which displays a pattern of rhetorical relations utilised by the writer to achieve textual coherence.
I describe intertextual differences in the frequency of occurrence of organising relations, their degree of explicitness and their positioning in the hierarchical structure of texts. Differences in the mode of employment of textual organisers suggest that the rhetorical structure of English research prose produced by non-native speakers cannot escape being shaped by the characteristics and conventions of the authors’ first language. They are also attributed to cultural norms and conventions as well as educational systems prevailing within the discourse communities which constitute the social contexts of texts.

Relevance:

40.00%

Publisher:

Abstract:

Wireless sensor networks (WSN) are attractive for information gathering in large-scale data rich environments. In order to fully exploit the data gathering and dissemination capabilities of these networks, energy-efficient and scalable solutions for data storage and information discovery are essential. In this paper, we formulate the information discovery problem as a load-balancing problem, with the combined aim being to maximize network lifetime and minimize query processing delay resulting in QoS improvements. We propose a novel information storage and distribution mechanism that takes into account the residual energy levels in individual sensors. Further, we propose a hybrid push-pull strategy that enables fast response to information discovery queries.

Simulation results prove that the proposed method(s) of information discovery offer significant QoS benefits for global as well as individual queries in comparison to previous approaches.

Relevance:

40.00%

Publisher:

Abstract:

Wireless sensor networks (WSN) are attractive for information gathering in large-scale data rich environments. Emerging WSN applications require dissemination of information to interested clients within the network requiring support for differing traffic patterns. Further, in-network query processing capabilities are required for autonomic information discovery. In this paper, we formulate the information discovery problem as a load-balancing problem, with the combined aim being to maximize network lifetime and minimize query processing delay. We propose novel methods for data dissemination, information discovery and data aggregation that are designed to provide significant QoS benefits. We make use of affinity propagation to group "similar" sensors and have developed efficient mechanisms that can resolve both ALL-type and ANY-type queries in-network with improved energy-efficiency and query resolution time. Simulation results prove the proposed method(s) of information discovery offer significant QoS benefits for ALL-type and ANY-type queries in comparison to previous approaches.

Relevance:

40.00%

Publisher:

Abstract:

Multi-database mining is an urgent task. This thesis solves four key problems in multi-database mining: application-independent database classification, a local instance analysis model, useful pattern discovery, and pattern synthesis.

Relevance:

40.00%

Publisher:

Abstract:

This thesis proposes an innovative adaptive multi-classifier spam filtering model, with a grey-list analyser and a dynamic feature selection method, to overcome false-positive problems in email classification. It also presents additional techniques to minimize the added complexity. Empirical evidence indicates the success of this model over existing approaches.

Relevance:

40.00%

Publisher:

Abstract:

An RNA pseudoknot consists of nonnested double-stranded stems connected by single-stranded loops. There is increasing recognition that RNA pseudoknots are one of the most prevalent RNA structures and fulfill a diverse set of biological roles within cells, and the number of studies of RNA pseudoknotted structures is growing, as is the allocation of function to them. These studies not only produce valuable structural data but also facilitate an understanding of structural and functional characteristics of RNA molecules. PseudoBase is a database providing structural, functional, and sequence data related to RNA pseudoknots. To capture the features of RNA pseudoknots, we present a novel framework using quantitative association rule mining to analyze the pseudoknot data. The derived rules are classified into specified association groups regarding structure, function, and category of RNA pseudoknots. The discovered association rules assist biologists in filtering out significant knowledge of structure-function and structure-category relationships. A brief biological interpretation of the relationships is presented, and their potential correlations with each other are highlighted.
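The abstract does not give the rule format or the attributes used. A minimal sketch, assuming a quantitative association rule of the form "numeric attribute falls in an interval, therefore category is X", computes the two standard rule measures, support and confidence. The field names `stem_length` and `category`, the interval, and the records are hypothetical toy values, not PseudoBase data.

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    support    = fraction of all records matching both sides
    confidence = fraction of antecedent matches that also match the consequent
    """
    matches_antecedent = [r for r in records if antecedent(r)]
    matches_both = [r for r in matches_antecedent if consequent(r)]
    support = len(matches_both) / len(records) if records else 0.0
    confidence = (
        len(matches_both) / len(matches_antecedent) if matches_antecedent else 0.0
    )
    return support, confidence


# Hypothetical toy records (made-up values for illustration only).
records = [
    {"stem_length": 5, "category": "A"},
    {"stem_length": 6, "category": "A"},
    {"stem_length": 7, "category": "B"},
    {"stem_length": 3, "category": "B"},
]

# Rule: stem_length in [5, 8)  ->  category == "A"
support, confidence = rule_stats(
    records,
    antecedent=lambda r: 5 <= r["stem_length"] < 8,
    consequent=lambda r: r["category"] == "A",
)
print(support, confidence)
```

A mining procedure would enumerate many candidate intervals and keep the rules whose support and confidence exceed chosen thresholds; this sketch only scores a single, hand-written rule.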