979 results for Text classification


Relevance:

60.00%

Publisher:

Abstract:

Rhythm analysis of written texts has traditionally focused on literary analysis, mainly of poetry. In this paper we investigate the relevance of rhythmic features for categorizing prose texts belonging to different genres. Our contribution is threefold. First, we define a set of rhythmic features for written texts. Second, we extract these features from three corpora of speeches, essays, and newspaper articles. Third, we perform feature selection by means of statistical analyses and determine a subset of features that efficiently discriminates between the three genres. We find that, using as few as eight rhythmic features, documents can be adequately assigned to a given genre with an accuracy of around 80%, significantly higher than the 33% baseline that results from random assignment.
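
A minimal sketch of the pipeline this abstract describes, assuming statistical feature selection followed by a genre classifier; the synthetic feature matrix, the choice of SelectKBest/f_classif, and the logistic-regression classifier are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: select 8 discriminative rhythmic features, then classify genre.
# Feature values and data are placeholders; the original feature set and models differ.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))    # 20 candidate rhythmic features per document (synthetic)
y = rng.integers(0, 3, size=300)  # 3 genres: speeches, essays, newspaper articles

pipeline = make_pipeline(
    SelectKBest(score_func=f_classif, k=8),  # keep the 8 most discriminative features
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"mean accuracy: {scores.mean():.2f} (random baseline ~0.33)")
```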

Relevance:

60.00%

Publisher:

Abstract:

The Semantic Annotation component is a software application that provides support for automated text classification, a process grounded in a cohesion-centered representation of discourse that facilitates topic extraction. The component enables the semantic meta-annotation of text resources, including automated classification, thus facilitating information retrieval within the RAGE ecosystem. It is available in the ReaderBench framework (http://readerbench.com/), which integrates advanced Natural Language Processing (NLP) techniques. The component uses Cohesion Network Analysis (CNA) to ensure an in-depth representation of discourse, useful for mining keywords and performing automated text categorization. Our component automatically classifies documents into the categories provided by the ACM Computing Classification System (http://dl.acm.org/ccs_flat.cfm), as well as into the categories of a high-level serious-games classification provisionally developed by RAGE. English and French are already covered by the provided web service, and the entire framework can be extended to support additional languages.
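
The component is exposed as a web service; the sketch below shows how a client might call such a service. The endpoint URL, request parameters, and response fields are hypothetical placeholders, not the documented ReaderBench API.

```python
# Hedged sketch of a client for a semantic-annotation web service.
# The endpoint, parameters, and response shape are hypothetical, not the real ReaderBench API.
import requests

def classify_text(text: str, lang: str = "en") -> dict:
    """Send a document to a (hypothetical) semantic-annotation endpoint."""
    response = requests.post(
        "https://example.org/api/semantic-annotation",  # placeholder URL
        json={"text": text, "language": lang},
        timeout=30,
    )
    response.raise_for_status()
    # assumed response shape, e.g. {"acm_categories": [...], "keywords": [...]}
    return response.json()

# print(classify_text("Serious games support situated learning.", lang="en"))
```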

Relevance:

60.00%

Publisher:

Abstract:

Natural language processing has achieved great success in a wide range of applications, producing both commercial language services and open-source language tools. However, most methods take a static or batch approach, assuming that the model has all the information it needs and makes a one-time prediction. In this dissertation, we study dynamic problems where the input arrives as a sequence instead of all at once, and the output must be produced while the input is still arriving. In these problems, predictions are often made based only on partial information. This dynamic setting appears in many real-time, interactive applications. These problems usually involve a trade-off between the amount of input received (cost) and the quality of the output prediction (accuracy), so evaluation considers both objectives (e.g., plotting a Pareto curve). Our goal is to develop a formal understanding of sequential prediction and decision-making problems in natural language processing and to propose efficient solutions. Toward this end, we present meta-algorithms that take an existing batch model and produce a dynamic model that handles sequential inputs and outputs. We build our framework upon the theory of Markov Decision Processes (MDPs), which allows competing objectives to be traded off in a principled way. The main machine learning techniques we use come from imitation learning and reinforcement learning, and we advance current techniques to tackle problems arising in our settings. We evaluate our algorithms on a variety of applications, including dependency parsing, machine translation, and question answering, and show that our approach achieves a better cost-accuracy trade-off than batch and heuristic-based decision-making approaches. We first propose a general framework for cost-sensitive prediction, where different parts of the input come at different costs. We formulate a decision-making process that selects pieces of the input sequentially, with the selection adapted to each instance. Our approach is evaluated on both standard classification tasks and a structured prediction task (dependency parsing); it achieves prediction quality similar to methods that use all of the input while incurring a much smaller cost. Next, we extend the framework to problems where the input is revealed incrementally in a fixed order. We study two applications: simultaneous machine translation and quiz bowl (incremental text classification). We discuss challenges in this setting and show that adding domain knowledge eases the decision-making problem. A central theme throughout the chapters is an MDP formulation of a challenging problem with sequential input/output and trade-off decisions, accompanied by a learning algorithm that solves the MDP.
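
A minimal sketch of the kind of MDP formulation the dissertation describes, applied to incremental text classification: the state is the prefix read so far, the actions are to wait for more input or to predict now, and the reward trades prediction accuracy against the amount of input consumed. The confidence-threshold policy, the cost weight, and the helper names are illustrative assumptions, not the dissertation's algorithm.

```python
# Hedged sketch of an MDP-style episode for incremental text classification:
# at each step, either WAIT for the next token or PREDICT now, trading cost for accuracy.
# The threshold policy and reward shaping are illustrative, not the dissertation's method.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Outcome:
    label: str
    tokens_consumed: int
    reward: float

def run_episode(
    tokens: List[str],
    true_label: str,
    classify: Callable[[str], Tuple[str, float]],  # returns (label, confidence) for a prefix
    threshold: float = 0.8,        # commit once the model is confident enough (assumed policy)
    cost_per_token: float = 0.01,
) -> Outcome:
    prefix: List[str] = []
    for token in tokens:
        prefix.append(token)       # action = WAIT: consume one more token of input
        label, confidence = classify(" ".join(prefix))
        if confidence >= threshold or len(prefix) == len(tokens):
            # action = PREDICT: reward = accuracy term minus the cost of the input consumed
            reward = (1.0 if label == true_label else 0.0) - cost_per_token * len(prefix)
            return Outcome(label, len(prefix), reward)
    raise ValueError("tokens must be non-empty")
```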

Relevance:

60.00%

Publisher:

Abstract:

Master's dissertation, Universidade de Brasília, Instituto de Ciências Exatas, Departamento de Ciência da Computação, 2015.

Relevance:

60.00%

Publisher:

Abstract:

Descriptions of tourism products in the hotel, aviation, rent-a-car, and holiday-package sectors rely mainly on highly heterogeneous natural-language text, with very different styles, presentations, and contents. Offer descriptions, in the order of thousands, are structured in different ways, possibly written in different languages, and may complement and/or overlap one another. Since the tourism sector is highly dynamic and its products and offers change constantly, manually normalizing all of this information is not feasible. In this work, a prototype was built for the automatic classification of tourism product descriptions and the extraction of information from them. The information is first classified by type; the relevant elements of each type are then extracted and turned into easily computable objects. From the extracted objects, the prototype uses text and image templates to automatically generate normalized descriptions targeted at a given market. This versatility enables a new set of services for promoting and selling these products that would be impossible to implement with the original information. Although it can be applied to other domains, the prototype was evaluated on the normalization of hotel descriptions. Descriptive sentences about a hotel are classified according to their type (Location, Services, and/or Equipment) by a machine-learning algorithm that achieves an average recall of 96% and precision of 72%. Recall was considered the most important measure, since maximizing it ensures that sentences are not lost for later processing stages. The work also produced a hotel database, built and populated by the prototype, that supports searching for hotels by their characteristics, a capability that would not be possible with the original contents.
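
A minimal sketch of the kind of sentence classifier the prototype evaluates, assigning hotel sentences to Location, Services, or Equipment and reporting per-class precision and recall. The TF-IDF features, the linear model, and the toy sentences are assumptions; the prototype's actual algorithm is not specified in the abstract.

```python
# Hedged sketch: classify hotel description sentences by type and report precision/recall.
# Features, model, and example sentences are illustrative, not the prototype's pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

sentences = [
    "The hotel is located 200 m from the beach.",
    "A shuttle to the airport is available on request.",
    "All rooms have air conditioning and a flat-screen TV.",
    "Situated in the historic city centre.",
    "Breakfast and laundry services are offered daily.",
    "Rooms are equipped with a minibar and a safe.",
]
labels = ["Location", "Services", "Equipment", "Location", "Services", "Equipment"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(sentences, labels)
print(classification_report(labels, model.predict(sentences)))  # per-class precision/recall
```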

Relevance:

60.00%

Publisher:

Abstract:

This thesis studies the use of web crawling, web scraping, and Natural Language Processing techniques to automatically build a dataset of documents and a knowledge base of verb-object pairs usable for text classification. After a brief introduction to the techniques employed, the generation method is presented, first in a theoretical form that can be generalized to any classification based on a set of topics, and then concretely through a case study: the SDG Detector software. The latter is a practical application of the method, building a collection of information useful for classifying documents according to the presence of one or more Sustainable Development Goals. The classification part is handled by the co-author of this application; the present work focuses on an analysis of correctness and performance based on expanding the dataset and the resulting knowledge base.
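
A minimal sketch of how verb-object pairs can be extracted from crawled text with a dependency parser to populate a knowledge base of the kind described. The use of spaCy and the en_core_web_sm model is an assumption; the thesis' own toolchain may differ.

```python
# Hedged sketch: extract (verb, direct-object) pairs from text via dependency parsing.
# Requires spaCy and its small English model:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def verb_object_pairs(text: str) -> list[tuple[str, str]]:
    """Return lemmatized (verb, object) pairs found in the text."""
    doc = nlp(text)
    pairs = []
    for token in doc:
        if token.dep_ == "dobj" and token.head.pos_ == "VERB":
            pairs.append((token.head.lemma_, token.lemma_))
    return pairs

print(verb_object_pairs("The project reduces poverty and improves access to clean water."))
# expected output along the lines of [('reduce', 'poverty'), ('improve', 'access')]
```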

Relevance:

60.00%

Publisher:

Abstract:

Natural Language Processing (NLP) has seen tremendous improvements over the last few years. Transformer architectures have achieved impressive results on almost any NLP task, such as Text Classification, Machine Translation, and Language Generation. Over time, transformers have continued to improve thanks to larger corpora and bigger networks, reaching hundreds of billions of parameters. Training and deploying such large models has become prohibitively expensive, so that only large technology companies can afford to train them. Therefore, a lot of research has been dedicated to reducing model size. In this thesis, we investigate the effects of Vocabulary Transfer and Knowledge Distillation for compressing large Language Models. The goal is to combine these two methodologies to further compress models without significant loss of performance. In particular, we designed different combination strategies and conducted a series of experiments on different vertical domains (medical, legal, news) and downstream tasks (Text Classification and Named Entity Recognition). Four methods involving Vocabulary Transfer (VIPI), with and without a Masked Language Modelling (MLM) step and with and without Knowledge Distillation, are compared against a baseline that assigns random vectors to new elements of the vocabulary. Results indicate that VIPI effectively transfers information from the original vocabulary and that the MLM step is beneficial. We also find that vocabulary transfer and knowledge distillation are orthogonal and can be applied jointly, and that applying knowledge distillation first, followed by vocabulary transfer, is recommended. Finally, model performance after vocabulary transfer does not always follow a consistent trend as the vocabulary size is reduced, so the vocabulary size should be selected empirically by evaluating on the downstream task, much like hyperparameter tuning.
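
A minimal sketch of the knowledge-distillation objective the thesis refers to: the student is trained on a mix of the hard-label cross-entropy and a temperature-softened KL term against the teacher's logits. The temperature, mixing weight, and tensor shapes are illustrative assumptions, not the thesis' settings.

```python
# Hedged sketch of a standard knowledge-distillation loss (soft teacher targets + hard labels).
# The temperature T and mixing weight alpha are assumptions, not the thesis' hyperparameters.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      T: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    # KL divergence between temperature-softened teacher and student distributions
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # standard cross-entropy on the ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Example: a batch of 4 examples with 3 classes
loss = distillation_loss(torch.randn(4, 3), torch.randn(4, 3), torch.tensor([0, 2, 1, 0]))
```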

Relevance:

30.00%

Publisher:

Abstract:

Forest cover of the Maringá municipality, located in northern Paraná State, was mapped in this study. Mapping was carried out using high-resolution HRC sensor imagery and medium-resolution CCD sensor imagery from the CBERS satellite. Images were georeferenced, and forest vegetation patches (TOFs - trees outside forests) were classified using two methods of digital classification: one based on the reflectance or digital number of each pixel, and an object-oriented one. The area of each polygon was calculated, which allowed the polygons to be segregated into size classes. Thematic maps were built from the resulting polygon size classes, and summary statistics were generated for each size class in each area. Most forest fragments in Maringá were found to be smaller than 500 m². There was also a difference of 58.44% in the amount of vegetation detected between the high-resolution and medium-resolution imagery, due to the distinct spatial resolutions of the sensors. It was concluded that high-resolution geotechnology is essential to provide reliable information on urban green areas and forest cover in highly human-perturbed landscapes.
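
A minimal sketch of the patch-size analysis described, in which polygon areas are binned into size classes and summarized per class. The class boundaries and the sample areas are illustrative assumptions, not the study's values.

```python
# Hedged sketch: group forest-patch areas (m²) into size classes and summarize each class.
# The class boundaries and the sample areas are illustrative assumptions.
import pandas as pd

areas_m2 = pd.Series([120, 340, 480, 950, 2300, 7800, 15200, 60400])  # placeholder patch areas
bins = [0, 500, 1000, 5000, 10000, 50000, float("inf")]
class_labels = ["<500", "500-1000", "1000-5000", "5000-10000", "10000-50000", ">50000"]

size_class = pd.cut(areas_m2, bins=bins, labels=class_labels)
summary = areas_m2.groupby(size_class).agg(["count", "sum"])  # patches and total area per class
print(summary)
```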

Relevance:

30.00%

Publisher:

Abstract:

The objective of this work was to study the distribution of values of the coefficient of variation (CV) in papaya (Carica papaya L.) crop experiments and to propose ranges to guide researchers in evaluating this statistic for different characters in the field. The data used in this study were obtained through a bibliographical review of Brazilian journals, dissertations, and theses. The following characters were considered: stalk diameter, insertion height of the first fruit, plant height, number of fruits per plant, fruit biomass, fruit length, equatorial diameter of the fruit, pulp thickness, fruit firmness, soluble solids, and internal cavity diameter. For each character, ranges of CV values were obtained based on the methodologies proposed by Garcia and by Costa and on the standard classification of Pimentel-Gomes. The results indicated that the CV ranges differed considerably among characters, which justifies the need for a specific evaluation range for each character. In addition, the use of the classification ranges obtained with the methodology of Costa is recommended.
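
A worked sketch of how evaluation ranges of the kind mentioned can be derived for one character, assuming Garcia-style bounds built from the mean and standard deviation of the collected CV values. The CV values and the use of mean ± 1 and 2 standard deviations as class boundaries are illustrative assumptions.

```python
# Hedged sketch: derive classification ranges for coefficient-of-variation (CV) values of one
# character from the mean and standard deviation of CVs collected in the literature
# (Garcia-style bounds assumed; the sample CV values are placeholders).
import statistics

cvs = [8.4, 10.1, 12.7, 15.3, 18.9, 22.4, 25.0, 31.6]  # placeholder CV values (%)

mean, sd = statistics.mean(cvs), statistics.stdev(cvs)
ranges = {
    "low":       f"CV <= {mean - sd:.1f}%",
    "medium":    f"{mean - sd:.1f}% < CV <= {mean + sd:.1f}%",
    "high":      f"{mean + sd:.1f}% < CV <= {mean + 2 * sd:.1f}%",
    "very high": f"CV > {mean + 2 * sd:.1f}%",
}
for label, rule in ranges.items():
    print(f"{label:>9}: {rule}")
```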

Relevance:

30.00%

Publisher:

Abstract:

INTRODUCTION: The correct identification of the underlying cause of death and its precise assignment to a code from the International Classification of Diseases are important for achieving accurate and universally comparable mortality statistics. These factors, among others, led to the development of computer software programs that automatically identify the underlying cause of death. OBJECTIVE: This work was conceived to compare the underlying causes of death processed by the Automated Classification of Medical Entities (ACME) and the "Sistema de Seleção de Causa Básica de Morte" (SCB) programs. MATERIAL AND METHOD: The comparative evaluation of the underlying causes of death processed by the ACME and SCB systems was performed using the input data file for the ACME system, which included deaths that occurred in the State of S. Paulo from June to December 1993, totalling 129,104 records from the corresponding death certificates. The differences between the underlying causes selected by the ACME and SCB systems in the month of June, when treated as SCB errors, were used to correct and improve the SCB processing logic and its decision tables. RESULTS: Processing of the underlying causes of death by the ACME and SCB systems resulted in 3,278 differences, which were analysed and ascribed to unanswered dialogue boxes during processing, to deaths due to human immunodeficiency virus (HIV) disease, for which there was no specific provision in either system, to coding and/or keying errors, and to actual problems. Detailed analysis of the latter showed that the majority of the underlying causes of death processed by the SCB system were correct, that the two systems interpreted some mortality coding rules differently, that some particular problems could not be explained with the available documentation, and that a smaller proportion of problems were identified as SCB errors. CONCLUSION: These results, disclosing a very low and insignificant number of actual problems, warrant the use of this version of the SCB system for the Ninth Revision of the International Classification of Diseases and assure the continuity of the work being undertaken for the Tenth Revision version.

Relevance:

30.00%

Publisher:

Abstract:

OBJECTIVE: To develop a Charlson-like comorbidity index based on the clinical conditions and weights of the original Charlson comorbidity index. METHODS: Clinical conditions and weights were adapted from the International Classification of Diseases, 10th revision, and applied to a single hospital admission diagnosis. The study included 3,733 patients over 18 years of age admitted to a public general hospital in the city of Rio de Janeiro, southeastern Brazil, between January 2001 and January 2003. The index distribution was analyzed by gender, type of admission, blood transfusion, intensive care unit admission, age and length of hospital stay. Two logistic regression models were developed to predict in-hospital mortality: a) a full model including the aforementioned variables and the risk-adjustment index; and b) a reduced model including only the risk-adjustment index and patient age. RESULTS: Of all patients analyzed, 22.3% had risk scores >1; the in-hospital mortality rate was 4.5%, and 66.0% of the patients who died had scores >1. Except for gender and type of admission, all variables were retained in the logistic regression. The models including the developed risk index had an area under the receiver operating characteristic curve of 0.86 (full model) and 0.76 (reduced model). Each unit increase in the risk score was associated with a nearly 50% increase in the odds of in-hospital death. CONCLUSIONS: The developed risk index effectively discriminated the odds of in-hospital death, which can be useful when only limited information is available from hospital databases.
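
A minimal sketch of the reduced model described in the abstract: in-hospital death regressed on the comorbidity risk score and age, with discrimination summarized by the area under the ROC curve and effect size by the odds ratio per unit of score. The synthetic data and model settings are illustrative assumptions, not the study's data.

```python
# Hedged sketch of the "reduced model": logistic regression of in-hospital death on a
# comorbidity risk score and age, evaluated by the area under the ROC curve.
# Data are synthetic; the real index is built from ICD-10-coded conditions and Charlson weights.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 3733
score = rng.poisson(0.4, size=n)                      # placeholder comorbidity scores
age = rng.normal(55, 18, size=n).clip(18, 95)         # placeholder ages
logit = -5.0 + 0.4 * score + 0.03 * age               # assumed relationship for the simulation
death = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))  # simulated in-hospital deaths

X = np.column_stack([score, age])
model = LogisticRegression(max_iter=1000).fit(X, death)
auc = roc_auc_score(death, model.predict_proba(X)[:, 1])
odds_ratio = np.exp(model.coef_[0][0])                # odds ratio per unit increase in the score
print(f"AUC: {auc:.2f}, OR per score unit: {odds_ratio:.2f}")
```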

Relevance:

30.00%

Publisher:

Abstract:

Studies were made on the biochemical behavior of 100 strains of P. pestis isolated in Northeastern Brazil with regard to the production of nitrous acid, the reduction of nitrates to nitrites, and the acidification of glycerol. The results showed that 98 strains can be classified as the "orientalis" variety, while the remaining two could not be included in any of the existing varieties.

Relevance:

30.00%

Publisher:

Abstract:

We report a retrospective histopathological classification, carried out under laboratory conditions using the Ridley & Jopling method, of 1,108 skin biopsies from patients in Bahia, Northeast Brazil, who were clinically suspected of having leprosy.