105 results for Text categorization

in Deakin Research Online - Australia


Relevance:

100.00%

Publisher:

Abstract:

Classification methods are commonly used to categorize text documents; examples include the Rocchio method, Naïve Bayes based methods, and SVM based text classification. These methods learn from labeled text documents and then construct classifiers. The generated classifiers can predict the category of a new incoming text document. The keywords in a document are often used to form rules for categorizing text documents; for example, “kw = computer” can be a rule for the IT documents category. However, the number of keywords is very large, and selecting keywords from such a large set is a challenging task. Recently, a rule generation method based on enumeration of all possible keyword combinations has been proposed [2]. In this method, a crucial problem remains: how to prune irrelevant combinations at the early stages of the rule generation procedure. In this paper, we propose a method that can effectively prune irrelevant keyword combinations at an early stage.
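The keyword-rule idea the abstract describes (e.g. “kw = computer” implies the IT category) can be sketched as follows. The rules, categories, and documents here are invented for illustration; the paper's actual rule generation and pruning method is not reproduced.

```python
# Minimal sketch of keyword-rule text categorization (hypothetical rules/data).

def matches(rule, document_keywords):
    """A rule fires when every keyword it requires appears in the document."""
    return rule["keywords"].issubset(document_keywords)

def categorize(document, rules, default="unknown"):
    """Assign the category of the first rule whose keywords all occur."""
    tokens = set(document.lower().split())
    for rule in rules:
        if matches(rule, tokens):
            return rule["category"]
    return default

# Example rule: the keyword "computer" indicates the IT category.
rules = [
    {"keywords": {"computer"}, "category": "IT"},
    {"keywords": {"football", "goal"}, "category": "Sports"},
]

print(categorize("a new computer virus was found", rules))  # IT
```

With a realistic vocabulary the rule set explodes combinatorially, which is exactly the pruning problem the paper targets.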

Relevance:

100.00%

Publisher:

Abstract:

Text categorization (TC) is one of the main applications of machine learning. Many methods have been proposed, such as the Rocchio method, Naive Bayes based methods, and SVM based text classification. These methods learn from labeled text documents and then construct a classifier, which can predict the category of a new incoming text document. However, these methods do not provide a description of each category. In the machine learning field, there are many concept learning algorithms, such as ID3 and CN2. This paper proposes a more robust algorithm for inducing concepts from training examples, based on enumeration of all possible keyword combinations. Experimental results show that the rules produced by our approach are more precise and simpler than those of other methods.
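The enumeration idea — generate keyword combinations and keep those that characterize a single category — can be sketched as below. The documents, labels, and the simple purity criterion are illustrative assumptions, not the paper's algorithm.

```python
# Sketch: induce category rules by enumerating keyword combinations and
# keeping combinations whose matching documents all share one label.
from itertools import combinations

def induce_rules(docs, labels, vocab, max_len=2):
    rules = []
    for r in range(1, max_len + 1):
        for combo in combinations(vocab, r):
            hit_labels = {lab for doc, lab in zip(docs, labels)
                          if set(combo) <= doc}
            if len(hit_labels) == 1:        # pure: matches one category only
                rules.append((set(combo), hit_labels.pop()))
    return rules

docs = [{"computer", "virus"}, {"computer", "game"}, {"football", "game"}]
labels = ["IT", "IT", "Sports"]
vocab = ["computer", "virus", "football", "game"]
for keywords, category in induce_rules(docs, labels, vocab):
    print(sorted(keywords), "->", category)
```

Unlike a black-box classifier, each induced rule doubles as a readable description of its category, which is the advantage the abstract emphasizes.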

Relevance:

100.00%

Publisher:

Abstract:

This paper explores effective multi-label classification methods for multi-semantic image and text categorization. We perform an experimental study of clustering based multi-label classification (CBMLC) for the target problem. Experimental evaluation is conducted to identify the impact of different clustering algorithms and base classifiers on the predictive performance and efficiency of CBMLC. In the experimental setting, three widely used clustering algorithms and six popular multi-label classification algorithms are evaluated on multi-label image and text datasets. A multi-label classification evaluation metric, the micro F1-measure, is used to report the predictive performance of the classifiers. Experimental results reveal that clustering based multi-label learning algorithms are more effective than their non-clustering counterparts.
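The CBMLC scheme — cluster the training data first, then run a multi-label base classifier inside each cluster — can be sketched as follows. The 2-D features, fixed cluster centres, label sets, and 1-NN base classifier are all invented stand-ins; real CBMLC pairs an actual clustering algorithm (e.g. k-means) with stronger multi-label learners.

```python
# Sketch of clustering-based multi-label classification: route each instance
# to its nearest cluster, then predict labels using that cluster's data only.

def nearest(point, centers):
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])))

def train(X, Y, centers):
    """Partition training data by nearest centre."""
    clusters = {i: [] for i in range(len(centers))}
    for x, y in zip(X, Y):
        clusters[nearest(x, centers)].append((x, y))
    return clusters

def predict(x, centers, clusters):
    """Route x to its cluster, then use 1-NN inside that cluster."""
    members = clusters[nearest(x, centers)]
    _, labels = min(members,
                    key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m[0])))
    return labels

centers = [(0.0, 0.0), (5.0, 5.0)]
X = [(0.1, 0.2), (0.3, 0.1), (5.1, 4.9), (4.8, 5.2)]
Y = [{"politics"}, {"politics", "economy"}, {"sport"}, {"sport", "tv"}]
clusters = train(X, Y, centers)
print(predict((5.05, 4.95), centers, clusters))
```

The point of the clustering step is that each base classifier only has to model a more homogeneous subset of the data, which is where the reported efficiency gains come from.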

Relevance:

70.00%

Publisher:

Abstract:

In text categorization applications, class imbalance, an uneven data distribution in which one class is represented by far fewer instances than the others, is a commonly encountered problem. In such a situation, conventional classifiers tend to exhibit a strong performance bias, yielding a high accuracy rate on the majority class but a very low rate on the minority classes. An extreme strategy for unbalanced learning is to discard the majority instances and apply one-class classification to the minority class. However, this can easily cause another type of bias, increasing the accuracy rate on the minority classes by sacrificing the majority. This paper investigates approaches that reduce these two types of performance bias and improve the reliability of the discovered classification rules. Experimental results show that the inexact field learning method and parameter-optimized one-class classifiers achieve more balanced performance than the standard approaches.
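A toy one-class classifier makes the trade-off concrete: accept a point iff it lies within some radius of the minority-class centroid. The data are invented, and the radius stands in for the kind of parameter the paper tunes — too large and everything is accepted (bias against the majority), too small and genuine minority instances are rejected.

```python
# Minimal one-class classifier for the minority class (illustrative only):
# fit = centroid of minority points; accept iff within `radius` of it.

def fit(minority_points):
    n = len(minority_points)
    dim = len(minority_points[0])
    return tuple(sum(p[d] for p in minority_points) / n for d in range(dim))

def accepts(centroid, point, radius):
    dist2 = sum((p - c) ** 2 for p, c in zip(point, centroid))
    return dist2 <= radius ** 2

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1)]
c = fit(minority)
print(accepts(c, (1.0, 1.0), radius=0.5))   # inside the boundary -> True
print(accepts(c, (4.0, 4.0), radius=0.5))   # far from the centroid -> False
```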

Relevance:

60.00%

Publisher:

Abstract:

This paper presents a new methodology for the automatic development of a multilingual Web portal for multilingual knowledge discovery and management. It aims to provide an efficient and effective framework for selecting and organizing knowledge from voluminous, linguistically diverse Web content. To achieve this, a concept-based approach that incorporates text mining and Web content mining using neural network and fuzzy techniques is proposed. First, a concept-based taxonomy of themes, which acts as the hierarchical backbone of the Web portal, is automatically generated. Second, a concept-based multilingual Web crawler is developed to intelligently harvest relevant multilingual documents from the Web. Finally, a concept-based multilingual text categorization technique is proposed to organize multilingual documents by concept. As such, correlated multilingual Web documents can be gathered, filtered, and organized based on their semantic content to facilitate high-performance multilingual information access.

Relevance:

60.00%

Publisher:

Abstract:

In the past decade the massive growth of the Internet has brought huge changes to the way humans live their daily lives; however, the biggest concern with the rapid growth of digital information is how to efficiently manage and filter unwanted data. In this paper, we propose a method for managing RSS feeds from various news websites. A Web service was developed to provide filtered news items extracted from RSS feeds, categorized using classical text categorization algorithms. A client application consuming this Web service retrieves and displays the filtered information. A prototype was implemented using RapidMiner 4.3 as the data mining tool and SVM as the classification algorithm. Experimental results suggest that the proposed method is effective and saves a significant amount of user processing time.
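The categorization step in such a pipeline can be sketched with a bag-of-words linear classifier; a simple perceptron stands in here for the SVM the prototype uses, and the vocabulary, categories, and training snippets are invented for illustration.

```python
# Sketch: categorize news-item text with a bag-of-words linear classifier
# (perceptron as a stand-in for the SVM used in the described prototype).

def featurize(text, vocab):
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def train_perceptron(samples, vocab, epochs=10):
    w = [0.0] * len(vocab)
    b = 0.0
    for _ in range(epochs):
        for text, label in samples:      # label: +1 = sports, -1 = politics
            x = featurize(text, vocab)
            if label * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + label * xi for wi, xi in zip(w, x)]
                b += label
    return w, b

vocab = ["match", "goal", "election", "minister"]
samples = [("great goal in the match", +1),
           ("the minister won the election", -1)]
w, b = train_perceptron(samples, vocab)

def category(text):
    score = sum(wi * xi for wi, xi in zip(w, featurize(text, vocab))) + b
    return "sports" if score > 0 else "politics"

print(category("another goal"))       # sports
print(category("election results"))   # politics
```

In the described system the same per-item category labels are what the Web service uses to filter which news items reach the client.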

Relevance:

60.00%

Publisher:

Abstract:

In the past decade there has been massive growth of data on the Internet. Many people rely on XML-based RSS feeds to receive updates from websites. In this paper, we propose a method for managing RSS feeds from various news websites. A web service is developed to deliver filtered news items from RSS feeds to a mobile client. Each news item is indexed, and the indexes are subsequently used for filtering. Indexing is done in two steps: first, classical text categorization algorithms assign a category to each news item; second, geoparsing assigns geolocation data to each item. An Android application is developed to access the filtered news items by consuming the proposed web service. A prototype is implemented using RapidMiner 5.0 as the data mining tool and SVM as the classification algorithm. Geoparsing and geocoding web services, and the Android API, are used to implement location-based access to news items. Experimental results show that the proposed approach is effective and saves a significant amount of information-overload processing time.
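The two-step index (category, then geolocation) can be sketched as below. The gazetteer-lookup geoparser, the place names, and the coarse degree-based radius filter are hypothetical simplifications; the real system calls external geoparsing and geocoding web services.

```python
# Sketch: index news items by category and geolocation, then filter by both.

GAZETTEER = {"melbourne": (-37.81, 144.96), "geelong": (-38.15, 144.36)}

def geoparse(text):
    """Return the first known place name's coordinates, if any."""
    for word in text.lower().split():
        if word in GAZETTEER:
            return GAZETTEER[word]
    return None

def index_item(title, category):
    return {"title": title, "category": category, "location": geoparse(title)}

def filter_items(items, category=None, near=None, radius_deg=1.0):
    """Keep items matching the category and roughly within radius of `near`."""
    result = []
    for it in items:
        if category and it["category"] != category:
            continue
        if near:
            loc = it["location"]
            if loc is None or max(abs(loc[0] - near[0]),
                                  abs(loc[1] - near[1])) > radius_deg:
                continue
        result.append(it)
    return result

items = [index_item("Storm hits Melbourne CBD", "weather"),
         index_item("Geelong wins the final", "sports")]
print(filter_items(items, category="weather"))
```

Separating indexing from filtering is what lets the mobile client request only the items relevant to its category interests and current location.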

Relevance:

30.00%

Publisher:

Abstract:

Many classification methods have been proposed to find patterns in text documents. According to Occam's razor principle, "the explanation of any phenomenon should make as few assumptions as possible", short patterns are usually more explainable and meaningful for classifying text documents. In this paper, we propose a depth-first pattern generation algorithm that can find short patterns in text documents more effectively than a breadth-first algorithm.
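A depth-first search over keyword combinations can be sketched as follows, assuming a minimum-support pruning rule (any superset of an infrequent pattern is also infrequent, so the branch can be cut). The documents, vocabulary, and thresholds are invented; this is not the paper's algorithm.

```python
# Sketch: depth-first generation of short keyword patterns with
# minimum-support pruning of entire branches.

def support(pattern, documents):
    return sum(1 for doc in documents if pattern <= doc)

def dfs_patterns(documents, vocab, min_support, max_len,
                 prefix=frozenset(), start=0, out=None):
    if out is None:
        out = []
    for i in range(start, len(vocab)):
        candidate = prefix | {vocab[i]}
        # Prune: any superset of an infrequent pattern is also infrequent.
        if support(candidate, documents) < min_support:
            continue
        out.append(candidate)
        if len(candidate) < max_len:
            dfs_patterns(documents, vocab, min_support, max_len,
                         candidate, i + 1, out)
    return out

docs = [{"cpu", "ram", "disk"}, {"cpu", "ram"}, {"cpu", "cache"}]
print(dfs_patterns(docs, ["cpu", "ram", "disk", "cache"],
                   min_support=2, max_len=2))
```

Because pruning happens as soon as a prefix becomes infrequent, the depth-first version avoids materialising whole levels of candidates the way a breadth-first enumeration would.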

Relevance:

20.00%

Publisher:

Abstract:

Transcription of interview data is a common practice in qualitative health research. However, in nursing publications there has been little discussion of the techniques of transcription and the issues inherent in using transcription as a strategy for managing qualitative data. The process of transcription may disclose or obscure certain information. Researchers need to question practices of transcription that have been taken for granted and make transparent the processes used to preserve the integrity of data. This paper first examines research reported in nursing and allied health journals employing interviews for data collection and the attention given to the transcription phase. It then deals with issues of concern regarding the transcription of interviews, and offers suggestions for promoting validity.

Relevance:

20.00%

Publisher:

Abstract:

A version of this article was first presented at the Drama Australia Conference, Fremantle, July 2002. It draws upon Freebody and Luke's four resources literacy framework, where they describe four kinds of literacy practices. It shows how this model is used within the literacy community and argues that this model is useful to describe the contribution that drama can make to literacy development. Freebody and Luke's model is used and promoted throughout Australia and the author argues that it is politically astute for drama teachers to reclaim and promote their links to the English/Literacy curriculum.

Relevance:

20.00%

Publisher:

Abstract:

The Fish-net algorithm is a novel field learning algorithm that derives classification rules by examining the range of values of each attribute rather than individual point values. In this paper, we present a Feature Selection Fish-net learning algorithm to address the dual imbalance problem in text classification. Dual imbalance comprises instance imbalance and feature imbalance: instance imbalance is caused by unevenly distributed classes, while feature imbalance is due to differing document lengths. The proposed approach consists of two phases: (1) select a feature subset consisting of the features that are most supportive of the difficult minority class; (2) construct classification rules based on the original Fish-net algorithm. Our experimental results on Reuters-21578 show that the proposed approach achieves a better balanced accuracy rate on both majority and minority classes than multinomial Naive Bayes and SVM.
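The range-based idea behind field learning can be sketched as: learn the [min, max] interval of each attribute over one class's training instances, then score a new instance by how many of its values fall inside those ranges. The attributes and data below are invented, and the real Fish-net algorithm is considerably richer than this interval check.

```python
# Sketch of field (range-based) learning: per-attribute value ranges
# instead of individual point values.

def learn_fields(instances):
    """Per-attribute [min, max] range observed in the training instances."""
    return [(min(col), max(col)) for col in zip(*instances)]

def in_range_score(fields, instance):
    """Fraction of attributes whose value lies inside the learned range."""
    hits = sum(lo <= v <= hi for (lo, hi), v in zip(fields, instance))
    return hits / len(fields)

positives = [(0.2, 5.0), (0.4, 6.0), (0.3, 5.5)]
fields = learn_fields(positives)           # [(0.2, 0.4), (5.0, 6.0)]
print(in_range_score(fields, (0.3, 5.2)))  # 1.0 -- both attributes in range
print(in_range_score(fields, (0.9, 9.0)))  # 0.0 -- both out of range
```

Because ranges summarise a whole field of values rather than memorising points, such rules tend to be less sensitive to sparse minority-class data, which is the motivation for applying them to the imbalance problem.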