994 results for Text categorization


Relevance:

60.00%

Publisher:

Abstract:

In the past decade, the massive growth of the Internet has brought huge changes to the way people live their daily lives; however, the biggest concern with the rapid growth of digital information is how to manage it efficiently and filter out unwanted data. In this paper, we propose a method for managing RSS feeds from various news websites. A Web service was developed to provide filtered news items extracted from RSS feeds, which were categorized using classical text categorization algorithms. A client application consuming this Web service retrieves and displays the filtered information. A prototype was implemented using RapidMiner 4.3 as the data mining tool and SVM as the classification algorithm. Experimental results suggest that the proposed method is effective and saves a significant amount of user processing time.

Relevance:

60.00%

Publisher:

Abstract:

In the past decade there has been massive growth of data on the Internet. Many people rely on XML-based RSS feeds to receive updates from websites. In this paper, we propose a method for managing the RSS feeds from various news websites. A web service is developed to deliver filtered news items from RSS feeds to a mobile client. Each news item is indexed, and the indexes are then used for filtering news items. Indexing is done in two steps: first, classical text categorization algorithms are used to assign a category to each news item; second, geoparsing is used to assign geolocation data to each news item. An Android application is developed to access filtered news items by consuming the proposed web service. A prototype is implemented using RapidMiner 5.0 as the data mining tool and SVM as the classification algorithm. Geoparsing and geocoding web services and the Android API are used to implement location-based access to news items. Experimental results indicate that the proposed approach is effective and saves a significant amount of the time spent coping with information overload.
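
The two-step indexing described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a keyword lookup stands in for the SVM classifier, a small in-memory gazetteer stands in for the geoparsing web service, and every name here (`CATEGORY_KEYWORDS`, `GAZETTEER`, `index_feed`, `filter_items`, `SAMPLE_RSS`) is hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Assumption: keyword sets stand in for a trained classifier (the paper uses SVM).
CATEGORY_KEYWORDS = {
    "sports": {"match", "league", "goal"},
    "business": {"market", "shares", "bank"},
}
# Assumption: a tiny gazetteer stands in for a geoparsing/geocoding web service.
GAZETTEER = {"london": (51.5074, -0.1278), "paris": (48.8566, 2.3522)}


@dataclass
class IndexedItem:
    title: str
    category: str
    places: dict  # place name -> (latitude, longitude)


def tokenize(text):
    """Lowercase the text and strip surrounding punctuation from each word."""
    return [w.strip(".,;:!?()\"'").lower() for w in text.split()]


def categorize(text):
    """Step 1: assign the category whose keywords overlap the text most."""
    tokens = set(tokenize(text))
    best = max(CATEGORY_KEYWORDS, key=lambda c: len(tokens & CATEGORY_KEYWORDS[c]))
    return best if tokens & CATEGORY_KEYWORDS[best] else "other"


def geoparse(text):
    """Step 2: attach coordinates for every known place name in the text."""
    return {t: GAZETTEER[t] for t in tokenize(text) if t in GAZETTEER}


def index_feed(rss_xml):
    """Parse an RSS 2.0 document and index each <item> by category and place."""
    root = ET.fromstring(rss_xml)
    indexed = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        text = f"{title} {description}"
        indexed.append(IndexedItem(title, categorize(text), geoparse(text)))
    return indexed


def filter_items(items, category=None, place=None):
    """Return the items matching the requested category and/or place."""
    return [
        i for i in items
        if (category is None or i.category == category)
        and (place is None or place in i.places)
    ]


SAMPLE_RSS = """<rss version="2.0"><channel>
<item><title>London derby ends with late goal</title><description>League match report.</description></item>
<item><title>Paris market rallies</title><description>Bank shares lead gains.</description></item>
</channel></rss>"""

if __name__ == "__main__":
    for item in filter_items(index_feed(SAMPLE_RSS), place="paris"):
        print(item.title, item.category)
```

Keeping the category and the location as separate index fields lets a client filter on either one, or on both at once.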

Relevance:

60.00%

Publisher:

Abstract:

Document classification is a supervised machine learning process in which predefined category labels are assigned to documents based on a hypothesis derived from a training set of labelled documents. Documents cannot be directly interpreted by a computer system unless they have been modelled as a collection of computable features. Rogati and Yang [M. Rogati and Y. Yang, Resource selection for domain-specific cross-lingual IR, in SIGIR 2004: Proceedings of the 27th annual international conference on Research and Development in Information Retrieval, ACM Press, Sheffield: United Kingdom, pp. 154-161.] pointed out that the effectiveness of a document classification system may vary across domains. This implies that the quality of the document model contributes to the effectiveness of document classification. Conventionally, model evaluation is accomplished by comparing the effectiveness scores of classifiers on model candidates. However, this kind of evaluation method may encounter either under-fitting or over-fitting problems, because the effectiveness scores are restricted by the learning capacities of the classifiers. We propose a model fitness evaluation method to determine whether a model is sufficient to distinguish positive and negative instances while still being able to provide satisfactory effectiveness with a small feature subset. Our experiments demonstrate how the fitness of models is assessed. The results of our work contribute to research on feature selection, dimensionality reduction and document classification.

Relevance:

60.00%

Publisher:

Abstract:

In Information Filtering (IF) a user may be interested in several topics in parallel. But IF systems have been built on representational models derived from Information Retrieval and Text Categorization, which assume independence between terms. The linearity of these models results in user profiles that can only represent one topic of interest. We present a methodology that takes into account term dependencies to construct a single profile representation for multiple topics, in the form of a hierarchical term network. We also introduce a series of non-linear functions for evaluating documents against the profile. Initial experiments produced positive results.

Relevance:

60.00%

Publisher:

Abstract:

The Semantic Annotation component is a software application that provides support for automated text classification, a process grounded in a cohesion-centered representation of discourse that facilitates topic extraction. The component enables the semantic meta-annotation of text resources, including automated classification, thus facilitating information retrieval within the RAGE ecosystem. It is available in the ReaderBench framework (http://readerbench.com/) which integrates advanced Natural Language Processing (NLP) techniques. The component makes use of Cohesion Network Analysis (CNA) in order to ensure an in-depth representation of discourse, useful for mining keywords and performing automated text categorization. Our component automatically classifies documents into the categories provided by the ACM Computing Classification System (http://dl.acm.org/ccs_flat.cfm), but also into the categories from a high level serious games categorization provisionally developed by RAGE. English and French languages are already covered by the provided web service, whereas the entire framework can be extended in order to support additional languages.

Relevance:

30.00%

Publisher:

Abstract:

This paper seeks to discover in what sense we can classify vocabulary items as technical terms in the later medieval period. In order to arrive at a principled categorization of technicality, distribution is taken as a diagnostic factor: vocabulary shared across the widest range of text types may be assumed to be both prototypical for the semantic field and the most general, and therefore least technical, since lexical items derive at least part of their meaning from context, and a wider range of contexts implies a wider range of senses. A further way of addressing the question of technicality is tested through the classification of the lexis into semantic hierarchies: in the terms of componential analysis, having more components of meaning puts a term lower in the semantic hierarchy and flags it as having a greater specificity of sense, and thus as more technical. The various text types are interrogated through comparison of the number of levels in their hierarchies and the number of lexical items at each level within the hierarchies. Focusing on the vocabulary of a single semantic field, DRESS AND TEXTILES, this paper investigates how four medieval text types (wills, sumptuary laws, petitions, and romances) employ technical terminology in the establishment of the conventions of their genres.

Relevance:

30.00%

Publisher:

Abstract:

Expert systems have become increasingly popular owing to their commercial importance. A rule based system is a special type of expert system, which consists of a set of ‘if-then’ rules and can be applied as a decision support system in many areas such as healthcare, transportation and security. Rule based systems can be constructed from both expert knowledge and data. This paper aims to introduce the theory of rule based systems, especially the categorization and construction of such systems, from a conceptual point of view. This paper also introduces rule based systems for classification tasks in detail.
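
As a rough illustration of a rule based system used for classification, the sketch below stores ‘if-then’ rules as an ordered list and returns the label of the first rule whose condition holds. The healthcare-style attributes (`temperature_c`, `cough`), the thresholds, and the labels are hypothetical and are not taken from the paper.

```python
from typing import Any, Callable, Dict, List, Tuple

# A rule pairs an "if" condition with a "then" label.
Rule = Tuple[Callable[[Dict[str, Any]], bool], str]

# Assumption: illustrative rules only; ordered from most to least specific.
RULES: List[Rule] = [
    (lambda r: r["temperature_c"] >= 38.0 and r["cough"], "refer"),
    (lambda r: r["temperature_c"] >= 38.0, "monitor"),
]


def classify(record: Dict[str, Any], rules: List[Rule] = RULES,
             default: str = "no_action") -> str:
    """Return the label of the first rule that fires, else the default label."""
    for condition, label in rules:
        if condition(record):
            return label
    return default


if __name__ == "__main__":
    print(classify({"temperature_c": 39.0, "cough": True}))
```

Because the first matching rule wins, more specific rules need to come before more general ones; the default label covers records that no rule matches.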

Relevance:

30.00%

Publisher:

Abstract:

Studies show cross-linguistic differences in motion event encoding, such that English speakers preferentially encode manner of motion more than Spanish speakers, who preferentially encode path of motion. Focusing on native Spanish speaking children (aged 5;00-9;00) learning L2 English, we studied path and manner verb preferences during descriptions of motion stimuli, and tested the linguistic relativity hypothesis by investigating categorization preferences in a non-verbal similarity judgement task of motion clip triads. Results revealed L2 influence on L1 motion event encoding, such that bilinguals used more manner verbs and fewer path verbs in their L1, under the influence of English. We found no effects of linguistic structure on non-verbal similarity judgements, and demonstrate for the first time effects of L2 on L1 lexicalization in child L2 learners in the domain of motion events. This pattern of verbal behaviour supports theories of bilingual semantic representation that postulate a merged lexico-semantic system in early bilinguals.

Relevance:

30.00%

Publisher:

Abstract:

Many classification methods have been proposed to find patterns in text documents. However, according to the principle of Occam's razor ("the explanation of any phenomenon should make as few assumptions as possible"), short patterns are usually more explainable and meaningful for classifying text documents. In this paper, we propose a depth-first pattern generation algorithm, which can find short patterns in text documents more effectively than a breadth-first algorithm.
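
The abstract does not spell out the algorithm, so the following is only a generic sketch of depth-first enumeration of short frequent term patterns, not the authors' method. The function name `mine_short_patterns` and the parameters `min_support` and `max_len` are assumptions introduced for illustration.

```python
def mine_short_patterns(docs, min_support=2, max_len=2):
    """Depth-first search for term sets that occur in at least `min_support`
    documents and contain at most `max_len` terms.

    Returns a dict mapping each pattern (a tuple of terms in sorted order)
    to the number of documents that contain all of its terms.
    """
    transactions = [set(d.lower().split()) for d in docs]
    terms = sorted({t for tx in transactions for t in tx})
    patterns = {}

    def dfs(prefix, doc_ids, start):
        for i in range(start, len(terms)):
            # Keep only the documents that also contain the next term.
            matching = [k for k in doc_ids if terms[i] in transactions[k]]
            if len(matching) < min_support:
                continue  # prune: no extension of this pattern can be frequent
            pattern = prefix + (terms[i],)
            patterns[pattern] = len(matching)
            if len(pattern) < max_len:
                dfs(pattern, matching, i + 1)  # go deeper before moving sideways

    dfs((), list(range(len(transactions))), 0)
    return patterns


if __name__ == "__main__":
    sample = ["cheap loan offer", "cheap loan today", "meeting agenda today"]
    for pattern, support in mine_short_patterns(sample).items():
        print(pattern, support)
```

The `max_len` cap bounds the recursion depth, so the search never expands long patterns, and the document-id list carried down each branch avoids rescanning the whole collection.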

Relevance:

30.00%

Publisher:

Abstract:

This study examined how older and younger adults learn the categories of new tunes. Tunes were presented either one or three times along with a category name to see if multiple repetitions aid category memory. Additionally, to examine whether an association may help some listeners, especially older ones, to better remember category information, some tunes were presented with a short associative fact; this fact was either neutral or emotional. Participants were tested on song recognition, fact recognition, and category memory. For all tasks, there was a benefit of three presentations. There were no age differences in fact recognition. For both song recognition and categorization, the memory burden of a neutral association was lessened when the association was emotional.

Relevance:

30.00%

Publisher:

Abstract:

Large amounts of animal health care data are present in veterinary electronic medical records (EMRs), and they present an opportunity for companion animal disease surveillance. Veterinary patient records are largely in free text without clinical coding or fixed vocabulary. Text-mining, a computer and information technology application, is needed to identify cases of interest and to add structure to the otherwise unstructured data. In this study, EMRs were extracted from the veterinary management programs of 12 participating veterinary practices and stored in a data warehouse. Using commercially available text-mining software (WordStat™), we developed a categorization dictionary that could be used to automatically classify and extract enteric syndrome cases from the warehoused electronic medical records. The diagnostic accuracy of the text-miner for retrieving cases of enteric syndrome was measured against human reviewers who independently categorized a random sample of 2500 cases as enteric syndrome positive or negative. Compared to the reviewers, the text-miner retrieved cases with enteric signs with a sensitivity of 87.6% (95% CI, 80.4-92.9%) and a specificity of 99.3% (95% CI, 98.9-99.6%). Automatic and accurate detection of enteric syndrome cases provides an opportunity for community surveillance of enteric pathogens in companion animals.
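
To make the evaluation concrete, the sketch below flags free-text notes with a small term dictionary and then computes sensitivity and specificity against reviewer labels. It does not reproduce WordStat or the study's dictionary: the terms in `ENTERIC_TERMS`, the example notes, and the reviewer labels are invented for illustration, and plain term matching ignores negation such as "no vomiting".

```python
# Assumption: a toy dictionary, not the categorization dictionary from the study.
ENTERIC_TERMS = {"vomiting", "diarrhea"}


def flag_enteric(note):
    """Return True if any dictionary term appears as a word in the note."""
    tokens = {w.strip(".,;:!?()").lower() for w in note.split()}
    return bool(tokens & ENTERIC_TERMS)


def sensitivity_specificity(predicted, reference):
    """Compare boolean predictions with boolean reference (reviewer) labels.

    sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    tp = sum(1 for p, r in zip(predicted, reference) if p and r)
    fn = sum(1 for p, r in zip(predicted, reference) if not p and r)
    tn = sum(1 for p, r in zip(predicted, reference) if not p and not r)
    fp = sum(1 for p, r in zip(predicted, reference) if p and not r)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative reference case")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Hypothetical notes paired with hypothetical reviewer labels.
    cases = [
        ("Vomiting and diarrhea for two days", True),
        ("Annual vaccination, no concerns", False),
        ("Soft stool reported by owner", True),
        ("Lameness in left hind limb", False),
    ]
    predicted = [flag_enteric(note) for note, _ in cases]
    reference = [label for _, label in cases]
    print(sensitivity_specificity(predicted, reference))
```

In this toy data the third note is a missed case because "stool" is not in the dictionary, which is the kind of gap that lowers sensitivity while leaving specificity untouched.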

Relevance:

30.00%

Publisher:

Abstract:

In this paper, we describe NewsCATS (news categorization and trading system), a system implemented to predict stock price trends for the time immediately after the publication of press releases. NewsCATS consists mainly of three components. The first component retrieves relevant information from press releases through the application of text preprocessing techniques. The second component sorts the press releases into predefined categories. Finally, appropriate trading strategies are derived by the third component by means of the earlier categorization. The findings indicate that a categorization of press releases is able to provide additional information that can be used to forecast stock price trends, but that an adequate trading strategy is essential for the results of the categorization to be fully exploited.