994 resultados para Text categorization


Relevância:

100.00% 100.00%

Publicador:

Resumo:

Abstract. Latent Dirichlet Allocation (LDA) is a document level language model. In general, LDA employ the symmetry Dirichlet distribution as prior of the topic-words’ distributions to implement model smoothing. In this paper, we propose a data-driven smoothing strategy in which probability mass is allocated from smoothing-data to latent variables by the intrinsic inference procedure of LDA. In such a way, the arbitrariness of choosing latent variables'priors for the multi-level graphical model is overcome. Following this data-driven strategy,two concrete methods, Laplacian smoothing and Jelinek-Mercer smoothing, are employed to LDA model. Evaluations on different text categorization collections show data-driven smoothing can significantly improve the performance in balanced and unbalanced corpora.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Classification methods are usually used to categorize text documents, such as, Rocchio method, Naïve bayes based method, and SVM based text classification method. These methods learn labeled text documents and then construct classifiers. The generated classifiers can predict which category is located for a new coming text document. The keywords in the document are often used to form rules to categorize text documents, for example “kw = computer” can be a rule for the IT documents category. However, the number of keywords is very large. To select keywords from the large number of keywords is a challenging work. Recently, a rule generation method based on enumeration of all possible keywords combinations has been proposed [2]. In this method, there remains a crucial problem: how to prune irrelevant combinations at the early stages of the rule generation procedure. In this paper, we propose a method than can effectively prune irrelative keywords at an early stage.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Text categorization (TC) is one of the main applications of machine learning. Many methods have been proposed, such as Rocchio method, Naive bayes based method, and SVM based text classification method. These methods learn labeled text documents and then construct a classifier. A new coming text document's category can be predicted. However, these methods do not give the description of each category. In the machine learning field, there are many concept learning algorithms, such as, ID3 and CN2. This paper proposes a more robust algorithm to induce concepts from training examples, which is based on enumeration of all possible keywords combinations. Experimental results show that the rules produced by our approach have more precision and simplicity than that of other methods.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This paper explores effective multi-label classification methods for multi-semantic image and text categorization. We perform an experimental study of clustering based multi-label classification (CBMLC) for the target problem. Experimental evaluation is conducted for identifying the impact of different clustering algorithms and base classifiers on the predictive performance and efficiency of CBMLC. In the experimental setting, three widely used clustering algorithms and six popular multi-label classification algorithms are used and evaluated on multi-label image and text datasets. A multi-label classification evaluation metrics, micro F1-measure, is used for presenting predictive performances of the classifications. Experimental evaluation results reveal that clustering based multi-label learning algorithms are more effective compared to their non-clustering counterparts.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This paper discusses a novel hybrid approach for text categorization that combines a machine learning algorithm, which provides a base model trained with a labeled corpus, with a rule-based expert system, which is used to improve the results provided by the previous classifier, by filtering false positives and dealing with false negatives. The main advantage is that the system can be easily fine-tuned by adding specific rules for those noisy or conflicting categories that have not been successfully trained. We also describe an implementation based on k-Nearest Neighbor and a simple rule language to express lists of positive, negative and relevant (multiword) terms appearing in the input text. The system is evaluated in several scenarios, including the popular Reuters-21578 news corpus for comparison to other approaches, and categorization using IPTC metadata, EUROVOC thesaurus and others. Results show that this approach achieves a precision that is comparable to top ranked methods, with the added value that it does not require a demanding human expert workload to train

Relevância:

100.00% 100.00%

Publicador:

Resumo:

In this paper we investigate whether conventional text categorization methods may suffice to infer different verbal intelligence levels. This research goal relies on the hypothesis that the vocabulary that speakers make use of reflects their verbal intelligence levels. Automatic verbal intelligence estimation of users in a spoken language dialog system may be useful when defining an optimal dialog strategy by improving its adaptation capabilities. The work is based on a corpus containing descriptions (i.e. monologs) of a short film by test persons yielding different educational backgrounds and the verbal intelligence scores of the speakers. First, a one-way analysis of variance was performed to compare the monologs with the film transcription and to demonstrate that there are differences in the vocabulary used by the test persons yielding different verbal intelligence levels. Then, for the classification task, the monologs were represented as feature vectors using the classical TF–IDF weighting scheme. The Naive Bayes, k-nearest neighbors and Rocchio classifiers were tested. In this paper we describe and compare these classification approaches, define the optimal classification parameters and discuss the classification results obtained.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

In this paper, we describe new results and improvements to a lan-guage identification (LID) system based on PPRLM previously introduced in [1] and [2]. In this case, we use as parallel phone recognizers the ones provided by the Brno University of Technology for Czech, Hungarian, and Russian lan-guages, and instead of using traditional n-gram language models we use a lan-guage model that is created using a ranking with the most frequent and discrim-inative n-grams. In this language model approach, the distance between the ranking for the input sentence and the ranking for each language is computed, based on the difference in relative positions for each n-gram. This approach is able to model reliably longer span information than in traditional language models obtaining more reliable estimations. We also describe the modifications that we have being introducing along the time to the original ranking technique, e.g., different discriminative formulas to establish the ranking, variations of the template size, the suppression of repeated consecutive phones, and a new clus-tering technique for the ranking scores. Results show that this technique pro-vides a 12.9% relative improvement over PPRLM. Finally, we also describe re-sults where the traditional PPRLM and our ranking technique are combined.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

International audience

Relevância:

70.00% 70.00%

Publicador:

Resumo:

This article examines manual textual categorisation by human coders with the hypothesis that the law of total probability may be violated for difficult categories. An empirical evaluation was conducted to compare a one step categorisation task with a two step categorisation task using crowdsourcing. It was found that the law of total probability was violated. Both a quantum and classical probabilistic interpretations for this violation are presented. Further studies are required to resolve whether quantum models are more appropriate for this task.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

In text categorization applications, class imbalance, which refers to an uneven data distribution where one class is represented by far more less instances than the others, is a commonly encountered problem. In such a situation, conventional classifiers tend to have a strong performance bias, which results in high accuracy rate on the majority class but very low rate on the minorities. An extreme strategy for unbalanced, learning is to discard the majority instances and apply one-class classification to the minority class. However, this could easily cause another type of bias, which increases the accuracy rate on minorities by sacrificing the majorities. This paper aims to investigate approaches that reduce these two types of performance bias and improve the reliability of discovered classification rules. Experimental results show that the inexact field learning method and parameter optimized one-class classifiers achieve more balanced performance than the standard approaches.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Information is nowadays a key resource: machine learning and data mining techniques have been developed to extract high-level information from great amounts of data. As most data comes in form of unstructured text in natural languages, research on text mining is currently very active and dealing with practical problems. Among these, text categorization deals with the automatic organization of large quantities of documents in priorly defined taxonomies of topic categories, possibly arranged in large hierarchies. In commonly proposed machine learning approaches, classifiers are automatically trained from pre-labeled documents: they can perform very accurate classification, but often require a consistent training set and notable computational effort. Methods for cross-domain text categorization have been proposed, allowing to leverage a set of labeled documents of one domain to classify those of another one. Most methods use advanced statistical techniques, usually involving tuning of parameters. A first contribution presented here is a method based on nearest centroid classification, where profiles of categories are generated from the known domain and then iteratively adapted to the unknown one. Despite being conceptually simple and having easily tuned parameters, this method achieves state-of-the-art accuracy in most benchmark datasets with fast running times. A second, deeper contribution involves the design of a domain-independent model to distinguish the degree and type of relatedness between arbitrary documents and topics, inferred from the different types of semantic relationships between respective representative words, identified by specific search algorithms. The application of this model is tested on both flat and hierarchical text categorization, where it potentially allows the efficient addition of new categories during classification. Results show that classification accuracy still requires improvements, but models generated from one domain are shown to be effectively able to be reused in a different one.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

This paper presents new methodology towards the automatic development of multilingual Web portal for multilingual knowledge discovery and management. It aims to provide an efficient and effective framework for selecting and organizing knowledge from voluminous linguistically diverse Web contents. To achieve this, a concept-based approach that incorporates text mining and Web content mining using neural network and fuzzy techniques is proposed. First, a concept-based taxonomy of themes, which will act as the hierarchical backbone of the Web portal, is automatically generated. Second, a concept-based multilingual Web crawler is developed to intelligently harvest relevant multilingual documents from the Web. Finally, a concept-based multilingual text categorization technique is proposed to organize multilingual documents by concepts. As such, correlated multilingual Web documents can be gathered/filtered/organised/ based on their semantic content to facilitate high-performance multilingual information access.