944 results for automated text classification
Abstract:
The current study builds upon a previous study, which examined the degree to which the lexical properties of students' essays could predict their vocabulary scores. We expand on this previous research by incorporating new natural language processing indices related to both the surface and discourse levels of students' essays. Additionally, we investigate the degree to which these NLP indices can be used to account for variance in students' reading comprehension skills. We calculated linguistic essay features using our framework, ReaderBench, an automated text analysis tool that calculates indices related to linguistic and rhetorical features of text. University students (n = 108) produced timed (25 minutes), argumentative essays, which were then analyzed by ReaderBench. Additionally, they completed the Gates-MacGinitie Vocabulary and Reading comprehension tests. The results of this study indicated that two indices were able to account for 32.4% of the variance in vocabulary scores and 31.6% of the variance in reading comprehension scores. Follow-up analyses revealed that these models further improved when only considering essays that contained multiple paragraphs (R² values = .61 and .49, respectively). Overall, the results of the current study suggest that natural language processing techniques can help to inform models of individual differences among student writers.
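A minimal sketch of the regression analysis described above, assuming the ReaderBench indices have been exported as a numeric feature table (the file and column names here are hypothetical placeholders, not the study's actual indices):

```python
# Sketch: regressing test scores on essay-level NLP indices.
# Assumes ReaderBench indices were exported to a CSV with one row per essay;
# column names are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.read_csv("readerbench_indices.csv")
X = data[["word_frequency", "cohesion_score"]]  # two indices, as in the study
y = data["vocabulary_score"]                    # Gates-MacGinitie score

model = LinearRegression().fit(X, y)
print(f"R^2 = {model.score(X, y):.3f}")  # proportion of variance explained
```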
Abstract:
Rhythm analysis of written texts has traditionally focused on literary analysis, mainly of poetry. In this paper we investigate the relevance of rhythmic features for categorizing texts in prosaic form pertaining to different genres. Our contribution is threefold. First, we define a set of rhythmic features for written texts. Second, we extract these features from three corpora, of speeches, essays, and newspaper articles. Third, we perform feature selection by means of statistical analyses and determine a subset of features which efficiently discriminates between the three genres. We find that using as few as eight rhythmic features, documents can be adequately assigned to a given genre with an accuracy of around 80%, significantly higher than the 33% baseline that results from random assignment.
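As an illustration of the approach, the sketch below extracts two crude rhythm proxies (mean and variance of sentence length, standing in for the paper's eight statistically selected features) and fits a simple classifier:

```python
# Sketch: genre classification from simple rhythmic features.
# The two features below are illustrative stand-ins for the paper's
# eight selected rhythmic features.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def rhythmic_features(text):
    """Mean and variance of sentence length as crude rhythm proxies."""
    lengths = [len(s.split()) for s in text.split(".") if s.strip()]
    return [np.mean(lengths), np.var(lengths)]

texts = ["Quiet. Calm. Still the river runs on.",
         "A long, winding argument unfolds over many carefully joined clauses."]
genres = ["speech", "essay"]                       # toy labels
X = np.array([rhythmic_features(t) for t in texts])
clf = DecisionTreeClassifier().fit(X, genres)
print(clf.predict(X))                              # sanity check on the toy data
```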
Abstract:
Natural language processing has achieved great success in a wide range of applications, producing both commercial language services and open-source language tools. However, most methods take a static or batch approach, assuming that the model has all the information it needs and makes a one-time prediction. In this dissertation, we study dynamic problems where the input comes in a sequence instead of all at once, and the output must be produced while the input is arriving. In these problems, predictions are often made based only on partial information. We see this dynamic setting in many real-time, interactive applications. These problems usually involve a trade-off between the amount of input received (cost) and the quality of the output prediction (accuracy). Therefore, the evaluation considers both objectives (e.g., plotting a Pareto curve). Our goal is to develop a formal understanding of sequential prediction and decision-making problems in natural language processing and to propose efficient solutions. Toward this end, we present meta-algorithms that take an existing batch model and produce a dynamic model to handle sequential inputs and outputs. We build our framework upon the theory of Markov Decision Processes (MDPs), which allows learning to trade off competing objectives in a principled way. The main machine learning techniques we use are from imitation learning and reinforcement learning, and we advance current techniques to tackle problems arising in our settings. We evaluate our algorithm on a variety of applications, including dependency parsing, machine translation, and question answering. We show that our approach achieves a better cost-accuracy trade-off than the batch approach and heuristic-based decision-making approaches. We first propose a general framework for cost-sensitive prediction, where different parts of the input come at different costs. We formulate a decision-making process that selects pieces of the input sequentially, and the selection is adaptive to each instance. Our approach is evaluated on both standard classification tasks and a structured prediction task (dependency parsing). We show that it achieves prediction quality similar to methods that use all input, while incurring a much smaller cost. Next, we extend the framework to problems where the input is revealed incrementally in a fixed order. We study two applications: simultaneous machine translation and quiz bowl (incremental text classification). We discuss challenges in this setting and show that adding domain knowledge eases the decision-making problem. A central theme throughout the chapters is an MDP formulation of a challenging problem with sequential input/output and trade-off decisions, accompanied by a learning algorithm that solves the MDP.
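A minimal sketch of the incremental setting, with a fixed confidence threshold standing in for the learned MDP policy; `classifier` is an assumed black-box batch model that returns a label and a confidence for any input prefix:

```python
# Sketch: quiz-bowl-style incremental classification.
# A confidence threshold plays the role of the learned wait/answer policy;
# `classifier` is any batch model returning (label, confidence) for a prefix.
def incremental_classify(words, classifier, threshold=0.9):
    """Read the input word by word; answer as soon as the model is confident."""
    label, confidence, seen = None, 0.0, []
    for word in words:
        seen.append(word)
        label, confidence = classifier(" ".join(seen))
        if confidence >= threshold:   # answer early: lower cost (fewer words)
            break
    return label, len(seen)           # prediction and the cost actually paid
```

Sweeping `threshold` traces out the cost-accuracy Pareto curve used for evaluation in this line of work.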
Abstract:
Master's dissertation, Universidade de Brasília, Instituto de Ciências Exatas, Departamento de Ciência da Computação, 2015.
Abstract:
Descriptions of tourism products in the hotel, aviation, rent-a-car, and holiday-package sectors rely mostly on highly heterogeneous natural-language text, with very different styles, layouts, and contents. Since the tourism sector is highly dynamic and its products and offers change constantly, manual normalization of all this information is not feasible. Offer descriptions, in the order of thousands, are structured in different ways, possibly comprising different languages, complementing and/or overlapping one another. In this work we built a prototype for the automatic classification and extraction of information from descriptions of tourism products. The information is first classified by type; the relevant elements of each type are then extracted and turned into easily computable objects. From the extracted objects, the prototype uses text and image templates to automatically generate normalized descriptions targeted at a given market. This versatility enables a new set of services for promoting and selling the products that would be impossible to implement with the original information. Although applicable to other domains, the prototype was evaluated on the normalization of hotel descriptions. Descriptive sentences about a hotel are classified by type (Location, Services, and/or Equipment) by a machine learning algorithm that achieves an average recall of 96% and precision of 72%. Recall was considered the most important measure, since maximizing it ensures that sentences are not lost to later processing stages. This work also produced a populated database of hotels that supports searching for hotels by their characteristics, a functionality that would not be possible with the original contents.
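A minimal sketch of the sentence-type classification step, using a generic TF-IDF pipeline with tiny illustrative data; the prototype's actual features and training corpus are not specified here, and the real task is multi-label, since a sentence may describe several types:

```python
# Sketch: classifying hotel sentences by type (Location, Services, Equipment).
# Tiny illustrative training set; the real prototype used an annotated corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = [
    "The hotel is 5 minutes from the beach.",
    "Room service is available 24 hours a day.",
    "All rooms have air conditioning and a minibar.",
]
labels = ["Location", "Services", "Equipment"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(sentences, labels)
print(model.predict(["Breakfast is served until 10 am."]))
```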
Abstract:
This thesis studies the use of web crawling, web scraping, and Natural Language Processing techniques to automatically build a dataset of documents and a knowledge base of verb-object pairs usable for text classification. After a brief introduction to the techniques employed, the generation method is presented, first in a theoretical form generalizable to any classification based on a set of topics, and then concretely through a case study: the SDG Detector software. The latter is a practical application of the proposed method, building a collection of information useful for classifying documents according to the presence of one or more Sustainable Development Goals. The classification part is handled by the co-author of this application; the present work instead focuses on an analysis of correctness and performance based on expanding the dataset and the resulting knowledge base.
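A minimal sketch of verb-object pair extraction of the kind accumulated into such a knowledge base, here using spaCy's dependency parser on English text; the thesis's own pipeline may differ:

```python
# Sketch: extracting verb-object pairs with spaCy's dependency parser.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def verb_object_pairs(text):
    doc = nlp(text)
    return [
        (token.head.lemma_, token.lemma_)
        for token in doc
        if token.dep_ == "dobj"      # direct objects and their governing verbs
    ]

print(verb_object_pairs("The program reduces poverty and improves health."))
# e.g. [('reduce', 'poverty'), ('improve', 'health')]
```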
Abstract:
Natural Language Processing (NLP) has seen tremendous improvements over the last few years. Transformer architectures achieved impressive results in almost any NLP task, such as Text Classification, Machine Translation, and Language Generation. As time went by, transformers continued to improve thanks to larger corpora and bigger networks, reaching hundreds of billions of parameters. Training and deploying such large models has become prohibitively expensive, so that only large technology companies can afford to train them. Therefore, a lot of research has been dedicated to reducing a model's size. In this thesis, we investigate the effects of Vocabulary Transfer and Knowledge Distillation for compressing large Language Models. The goal is to combine these two methodologies to further compress models without significant loss of performance. In particular, we designed different combination strategies and conducted a series of experiments on different vertical domains (medical, legal, news) and downstream tasks (Text Classification and Named Entity Recognition). Four different methods involving Vocabulary Transfer (VIPI), with and without a Masked Language Modelling (MLM) step and with and without Knowledge Distillation, are compared against a baseline that assigns random vectors to new elements of the vocabulary. Results indicate that VIPI effectively transfers information from the original vocabulary and that MLM is beneficial. It is also noted that vocabulary transfer and knowledge distillation are orthogonal to one another and may be applied jointly. Applying knowledge distillation first, before subsequently applying vocabulary transfer, is recommended. Finally, model performance under vocabulary transfer does not always show a consistent trend as the vocabulary size is reduced. Hence, the vocabulary size should be selected empirically by evaluation on the downstream task, similar to hyperparameter tuning.
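For reference, a minimal sketch of the standard knowledge-distillation loss (Hinton et al.) that this kind of compression pipeline builds on; the exact formulation used in the thesis may differ:

```python
# Sketch: the standard knowledge-distillation loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-target KL loss (temperature T) with hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                          # rescale gradients for the temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```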
Abstract:
INTRODUCTION: The correct identification of the underlying cause of death and its precise assignment to a code from the International Classification of Diseases are important issues for achieving accurate and universally comparable mortality statistics. These factors, among others, led to the development of computer software programs to automatically identify the underlying cause of death. OBJECTIVE: This work was conceived to compare the underlying causes of death processed respectively by the Automated Classification of Medical Entities (ACME) and the "Sistema de Seleção de Causa Básica de Morte" (SCB) programs. MATERIAL AND METHOD: The comparative evaluation of the underlying causes of death processed by the ACME and SCB systems was performed using the input data file for the ACME system, which included deaths that occurred in the State of S. Paulo from June to December 1993, totalling 129,104 records of the corresponding death certificates. The differences between the underlying causes selected by the ACME and SCB systems verified in the month of June, when considered as SCB errors, were used to correct and improve the SCB processing logic and its decision tables. RESULTS: The processing of the underlying causes of death by the ACME and SCB systems resulted in 3,278 differences, which were analysed and ascribed to: lack of answers to dialogue boxes during processing; deaths due to human immunodeficiency virus [HIV] disease, for which there was no specific provision in either of the systems; coding and/or keying errors; and actual problems. Detailed analysis of the latter disclosed that the majority of the underlying causes of death processed by the SCB system were correct, that the two systems interpreted some mortality coding rules differently, that some particular problems could not be explained with the available documentation, and that a smaller proportion of problems were identified as SCB errors. CONCLUSION: These results, disclosing a very low and insignificant number of actual problems, guarantee the use of this version of the SCB system for the Ninth Revision of the International Classification of Diseases and assure the continuity of the work being undertaken for the Tenth Revision version.
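A minimal sketch of the kind of record-by-record comparison performed, assuming the codes selected by the two systems are available side by side (the file and column names are hypothetical):

```python
# Sketch: tallying disagreements between two automated cause-of-death coders.
# Column names are hypothetical placeholders for the exported results.
import pandas as pd

results = pd.read_csv("underlying_causes.csv")   # one row per death certificate
diffs = results[results["acme_code"] != results["scb_code"]]
print(f"{len(diffs)} differences out of {len(results)} records")
# most frequent disagreeing code pairs, for manual review
print(diffs.groupby(["acme_code", "scb_code"]).size()
          .sort_values(ascending=False).head())
```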
Abstract:
Flow Cytometry analyzers have become trusted companions due to their ability to perform fast and accurate analyses of human blood. The aim of these analyses is to determine the possible existence of abnormalities in the blood that have been correlated with serious disease states, such as infectious mononucleosis, leukemia, and various cancers. Though these analyzers provide important feedback, it is always desired to improve the accuracy of the results. This is evidenced by the occurrences of misclassifications reported by some users of these devices. It is advantageous to provide a pattern interpretation framework that is able to provide better classification ability than is currently available. Toward this end, the purpose of this dissertation was to establish a feature extraction and pattern classification framework capable of providing improved accuracy for detecting specific hematological abnormalities in flow cytometric blood data. This involved extracting a unique and powerful set of shift-invariant statistical features from the multi-dimensional flow cytometry data and then using these features as inputs to a pattern classification engine composed of an artificial neural network (ANN). The contribution of this method consisted of developing a descriptor matrix that can be used to reliably assess if a donor's blood pattern exhibits a clinically abnormal level of variant lymphocytes, which are blood cells that are potentially indicative of disorders such as leukemia and infectious mononucleosis. This study showed that the set of shift-and-rotation-invariant statistical features extracted from the eigensystem of the flow cytometric data pattern performs better than other commonly-used features in this type of disease detection, exhibiting an accuracy of 80.7%, a sensitivity of 72.3%, and a specificity of 89.2%. This performance represents a major improvement for this type of hematological classifier, which has historically been plagued by poor performance, with accuracies as low as 60% in some cases. This research ultimately shows that an improved feature space was developed that can deliver improved performance for the detection of variant lymphocytes in human blood, thus providing significant utility in the realm of suspect flagging algorithms for the detection of blood-related diseases.
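A minimal sketch of the underlying idea, under simplifying assumptions: the eigenvalues of an event cloud's covariance matrix are invariant to shifts of the cloud (and, once sorted, to rotations), and can feed a small neural network. The dissertation's descriptor matrix is more elaborate, and the data below are placeholders:

```python
# Sketch: shift-invariant eigensystem features from flow cytometry events,
# fed to a small neural network classifier.
import numpy as np
from sklearn.neural_network import MLPClassifier

def eigensystem_features(events):
    """Sorted eigenvalues of the covariance matrix: invariant to shifts
    (and to rotations) of the event cloud."""
    cov = np.cov(events, rowvar=False)   # events: (n_events, n_channels)
    return np.sort(np.linalg.eigvalsh(cov))[::-1]

# One feature vector per donor sample; labels are random placeholders.
X = np.array([eigensystem_features(np.random.rand(500, 4)) for _ in range(20)])
y = np.random.randint(0, 2, size=20)     # normal vs. abnormal (placeholder)
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500).fit(X, y)
```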
Abstract:
Given the limitations of different types of remote sensing images, automated land-cover classifications of the Amazon várzea may yield poor accuracy indexes. One way to improve accuracy is through the combination of images from different sensors, by either image fusion or multi-sensor classifications. Therefore, the objective of this study was to determine which classification method is more efficient in improving land cover classification accuracies for the Amazon várzea and similar wetland environments: (a) synthetically fused optical and SAR images or (b) multi-sensor classification of paired SAR and optical images. Land cover classifications based on images from a single sensor (Landsat TM or Radarsat-2) are compared with multi-sensor and image fusion classifications. Object-based image analysis (OBIA) and the J48 data-mining algorithm were used for automated classification, and classification accuracies were assessed using the kappa index of agreement and the recently proposed allocation and quantity disagreement measures. Overall, optical-based classifications had better accuracy than SAR-based classifications. Once both datasets were combined using the multi-sensor approach, there was a 2% decrease in allocation disagreement, as the method was able to overcome part of the limitations present in both images. Accuracy decreased when image fusion methods were used, however. We therefore concluded that the multi-sensor classification method is more appropriate for classifying land cover in the Amazon várzea.
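For reference, a minimal sketch of the quantity and allocation disagreement measures (Pontius and Millones, 2011) computed from a confusion matrix:

```python
# Sketch: quantity and allocation disagreement from a confusion matrix.
import numpy as np

def disagreement(confusion):
    """confusion[i, j]: count of units mapped to class i with reference class j."""
    p = confusion / confusion.sum()                  # proportions
    quantity = 0.5 * np.abs(p.sum(axis=0) - p.sum(axis=1)).sum()
    total = 1.0 - np.trace(p)                        # overall disagreement
    return quantity, total - quantity                # (quantity, allocation)

cm = np.array([[80, 5], [10, 105]])                  # toy two-class example
q, a = disagreement(cm)
print(f"quantity = {q:.3f}, allocation = {a:.3f}")
```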
Abstract:
OBJECTIVE - The aim of our study was to assess the performance of a wrist monitor, the Omron Model HEM-608, compared with the indirect method for blood pressure measurement. METHODS - Our study population consisted of 100 subjects, 29 normotensive and 71 hypertensive. Participants had their blood pressure checked 8 times with alternating techniques, 4 by the indirect method and 4 with the Omron wrist monitor. The validation criteria used to test this device were based on internationally recognized protocols. RESULTS - Our data showed that the Omron HEM-608 reached classification B for systolic and A for diastolic blood pressure, according to one of these protocols. The mean differences between blood pressure values obtained with each of the methods were -2.3 ± 7.9 mmHg for systolic and 0.97 ± 5.5 mmHg for diastolic blood pressure. Therefore, we considered this type of device approved according to the criteria selected. CONCLUSION - Our study leads us to conclude that this wrist monitor is not only easy to use, but also produces results very similar to those obtained by the standard indirect method.
Abstract:
OBJECTIVE: To assess the Dixtal DX2710 automated oscillometric device for blood pressure measurement according to the protocols of the BHS and the AAMI. METHODS: Three blood pressure measurements were taken in 94 patients (53 females; ages 15 to 80 years). The measurements were taken randomly by 2 observers trained to measure blood pressure with a mercury column device connected to the automated device. The device was classified according to the protocols of the BHS and the AAMI. RESULTS: The mean of the blood pressure levels obtained by the observers was 148±38/93±25 mmHg, and that obtained with the device was 148±37/89±26 mmHg. Considering the differences between the measurements obtained by the observers and those obtained with the automated device, according to the criteria of the BHS the following classification was adopted: "A" for systolic pressure (69% of the differences < 5; 90% < 10; and 97% < 15 mmHg) and "B" for diastolic pressure (63% of the differences < 5; 83% < 10; and 93% < 15 mmHg). The mean and standard deviation of the differences were 0±6.27 mmHg for systolic pressure and 3.82±6.21 mmHg for diastolic pressure. CONCLUSION: The Dixtal DX2710 device was approved according to the international recommendations.
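A minimal sketch of BHS grading as applied in both device studies above: a grade requires the cumulative percentages of absolute observer-device differences within 5, 10, and 15 mmHg to meet all three cutoffs (the input differences below are simulated):

```python
# Sketch: grading a device under the BHS protocol from paired differences.
import numpy as np

BHS_GRADES = {  # grade: (% within 5, % within 10, % within 15 mmHg)
    "A": (60, 85, 95),
    "B": (50, 75, 90),
    "C": (40, 65, 85),
}

def bhs_grade(differences):
    """differences: observer reading minus device reading, in mmHg."""
    d = np.abs(np.asarray(differences))
    pct = [100 * np.mean(d <= t) for t in (5, 10, 15)]
    for grade, cutoffs in BHS_GRADES.items():        # checked best grade first
        if all(p >= c for p, c in zip(pct, cutoffs)):
            return grade
    return "D"

print(bhs_grade(np.random.normal(0, 6, size=282)))   # simulated differences
```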
Abstract:
Diagnosis of several neurological disorders is based on the detection of typical pathological patterns in the electroencephalogram (EEG). This is a time-consuming task requiring significant training and experience. Automatic detection of these EEG patterns would greatly assist in quantitative analysis and interpretation. We present a method that allows automatic detection of epileptiform events and discrimination of them from eye blinks, based on features derived using a novel application of independent component analysis. The algorithm was trained and cross-validated using seven EEGs with epileptiform activity. For epileptiform events with compensation for eye blinks, the sensitivity was 65 +/- 22% at a specificity of 86 +/- 7% (mean +/- SD). With feature extraction by PCA or classification of raw data, specificity reduced to 76% and 74%, respectively, for the same sensitivity. On exactly the same data, the commercially available software Reveal had a maximum sensitivity of 30% with a concurrent specificity of 77%. Our algorithm performed well at detecting epileptiform events in this preliminary test and offers a flexible tool that is intended to be generalized to the simultaneous classification of many waveforms in the EEG.
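A minimal sketch of the general recipe (ICA-derived features feeding a classifier), with placeholder data; the paper's actual feature derivation and classifier are more involved:

```python
# Sketch: ICA-based features for classifying EEG epochs.
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
epochs = rng.standard_normal((100, 19 * 128))    # placeholder: 100 epochs,
labels = rng.integers(0, 2, size=100)            # 19 channels x 128 samples

ica = FastICA(n_components=10, random_state=0)
sources = ica.fit_transform(epochs)              # unmixed component activations
clf = LogisticRegression().fit(sources, labels)  # epileptiform vs. eye blink
```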
Abstract:
Fluent health information flow is critical for clinical decision-making. However, a considerable part of this information is free-form text, and the inability to utilize it creates risks to patient safety and cost-effective hospital administration. Methods for automated processing of clinical text are emerging. The aim of this doctoral dissertation is to study machine learning and clinical text in order to support health information flow. First, by analyzing the content of authentic patient records, the aim is to specify clinical needs in order to guide the development of machine learning applications. The contributions are a model of the ideal information flow, a model of the problems and challenges in reality, and a road map for the technology development. Second, by developing applications for practical cases, the aim is to concretize ways to support health information flow. Altogether, five machine learning applications for three practical cases are described: The first two applications are binary classification and regression related to the practical case of topic labeling and relevance ranking. The third and fourth applications are supervised and unsupervised multi-class classification for the practical case of topic segmentation and labeling. These four applications are tested with Finnish intensive care patient records. The fifth application is multi-label classification for the practical task of diagnosis coding. It is tested with English radiology reports. The performance of all these applications is promising. Third, the aim is to study how the quality of machine learning applications can be reliably evaluated. The associations between performance evaluation measures and methods are addressed, and a new hold-out method is introduced. This method contributes not only to processing time but also to evaluation diversity and quality. The main conclusion is that developing machine learning applications for text requires interdisciplinary, international collaboration. Practical cases are very different, and hence development must begin from genuine user needs and domain expertise. The technological expertise must cover linguistics, machine learning, and information systems. Finally, the methods must be evaluated both statistically and through authentic user feedback.
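A minimal sketch of the fifth application's setting, multi-label diagnosis coding, using a generic one-vs-rest pipeline with tiny illustrative data; the labels are hypothetical placeholders, not actual diagnosis codes:

```python
# Sketch: multi-label classification for diagnosis coding.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

reports = [
    "Chest radiograph shows right lower lobe pneumonia.",
    "No acute cardiopulmonary abnormality.",
    "Pneumonia with small pleural effusion.",
]
codes = [["pneumonia"], [], ["pneumonia", "effusion"]]   # hypothetical labels

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(codes)                             # one column per code
model = make_pipeline(TfidfVectorizer(),
                      OneVsRestClassifier(LogisticRegression()))
model.fit(reports, Y)
print(mlb.inverse_transform(model.predict(["Findings consistent with pneumonia."])))
```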