991 results for Word Classification
Abstract:
Major Depressive Disorder (MDD) has been associated with biased processing and abnormal regulation of negative and positive information, which may result from compromised coordinated activity of prefrontal and subcortical brain regions involved in evaluating emotional information. We tested whether patients with MDD show distributed changes in functional connectivity with a set of independently derived brain networks that have shown high correspondence with different task demands, including stimulus salience and emotional processing. We further explored whether connectivity during emotional word processing was related to the tendency to engage in positive or negative emotional states. In this study, 25 medication-free MDD patients without current or past comorbidity and matched controls (n=25) performed an emotional word-evaluation task during functional MRI. Using a dual regression approach, individual spatial connectivity maps representing each subject's connectivity with each standard network were used to evaluate between-group differences and effects of positive and negative emotionality (extraversion and neuroticism, respectively, as measured with the NEO-FFI). Results showed decreased functional connectivity of the medial prefrontal cortex, ventrolateral prefrontal cortex, and ventral striatum with the fronto-opercular salience network in MDD patients compared to controls. In patients, abnormal connectivity was related to extraversion, but not neuroticism. These results confirm the hypothesis of a relative (para)limbic-cortical decoupling that may explain dysregulated affect in MDD. As connectivity of these regions with the salience network was related to extraversion, but not to general depression severity or negative emotionality, dysfunction of this network may be responsible for the failure to sustain engagement in rewarding behavior.
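The dual regression step described above has a compact two-stage form. Below is a minimal NumPy sketch of that generic procedure, assuming toy array shapes and random data (it is not the authors' pipeline, which operates on real 4D fMRI volumes): stage 1 regresses the standard network maps against each subject's data to obtain per-network timecourses, and stage 2 regresses those timecourses back against the data to obtain subject-specific spatial maps.

```python
import numpy as np

def dual_regression(data: np.ndarray, group_maps: np.ndarray):
    """Generic two-stage dual regression (a minimal sketch).

    data:       one subject's fMRI data reshaped to (timepoints, voxels)
    group_maps: standard network spatial maps, shape (networks, voxels)
    """
    # Stage 1: spatial regression -> network timecourses (timepoints, networks)
    tcs, *_ = np.linalg.lstsq(group_maps.T, data.T, rcond=None)
    timecourses = tcs.T
    # Stage 2: temporal regression -> subject spatial maps (networks, voxels)
    subject_maps, *_ = np.linalg.lstsq(timecourses, data, rcond=None)
    return timecourses, subject_maps

# Toy dimensions: 100 timepoints, 500 voxels, 10 standard networks.
rng = np.random.default_rng(0)
data = rng.standard_normal((100, 500))
group_maps = rng.standard_normal((10, 500))
tcs, maps = dual_regression(data, group_maps)
print(tcs.shape, maps.shape)  # (100, 10) (10, 500)
```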
Abstract:
The value of business process models depends not only on the choice of graphical elements in the model, but also on their annotation with additional textual and graphical information. This research discusses the use of text and icons for labeling the graphical constructs in a process model. We use two established verb classification schemes to examine the choice of activity labels in process modeling practice. Based on our findings, we synthesize a set of twenty-five activity label categories. We propose a systematic approach for graphically representing these label categories through graphical icons, such that the resulting process models are easier for end users to understand. Our findings add to an ongoing stream of research investigating the practice of process modeling and thereby contribute to the body of knowledge about conceptual modeling quality overall.
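As a rough illustration of the underlying idea, the sketch below maps an activity label's leading verb to a label category and then to an icon; the categories, verb lists, and icon file names are invented placeholders, not the paper's actual twenty-five categories.

```python
# Invented placeholder categories and icons, for illustration only.
LABEL_CATEGORIES = {
    "create": {"create", "generate", "produce"},
    "check":  {"check", "verify", "validate"},
    "send":   {"send", "transmit", "forward"},
}
CATEGORY_ICONS = {"create": "icon_plus.svg", "check": "icon_magnifier.svg",
                  "send": "icon_envelope.svg"}

def icon_for_label(label: str) -> str:
    """Pick an icon for an activity label such as 'Verify invoice'
    based on its leading verb."""
    verb = label.split()[0].lower()
    for category, verbs in LABEL_CATEGORIES.items():
        if verb in verbs:
            return CATEGORY_ICONS[category]
    return "icon_default.svg"

print(icon_for_label("Verify invoice"))  # -> icon_magnifier.svg
```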
Abstract:
This field study first addressed communication between people through verbal codes, the relations between language and thought, and the worldview perceived by the individual or the community through language use. It then examined the school system, which is committed to transmitting a particular worldview characteristic of a given nationality and/or to maintaining the values and ways of life that identify a particular community. These considerations justify the importance given to literacy and to mass literacy programs in developing countries such as Brazil. On this basis, the study investigated the everyday vocabulary of thirty-seven MOBRAL students in Nova Friburgo, relating it to the informants' social indices (origin, years living in the geographic area, age, sex, and occupation), to thematic variables (food, health/illness, occupation/chores, life expectations, life memories, leisure/entertainment) chosen after a preliminary survey of the informants' living conditions and interests, and to linguistic variables, namely word classes. This exploratory study also set out to verify the extent to which the written material of MOBRAL's continued-reading books and of the class A (Jornal do Brasil) and class C (O Dia and Última Hora) newspapers relates to the vocabulary used by the MOBRAL students in Nova Friburgo. To survey the interviewees' vocabulary, fifty speech samples were recorded following the methodology used in sociolinguistic research. The data obtained from this recorded corpus were analyzed quantitatively using the SPSS software package. The study of the relations among the variables (morphological classification, theme, age, sex, occupation) led to the construction of multivariate contingency tables. The analysis of the results yielded conclusions such as the constant use of nouns and verbs in the utterances. Although the technique of eliciting available words was introduced during the interviews, the number of nouns in this study did not change, because the informants did not name things in isolation but in complete utterances. Based on this result, some pedagogically relevant suggestions were proposed for MOBRAL: first, not to emphasize nouns (the static process of language) at the expense of verbs (the dynamic process); second, the use of sentences in literacy strategies. In the examination of the relations among the variables, the large variation detected was due to theme. When the variety of words used by the interviewees in Nova Friburgo was investigated, it was observed that out of 11,337 noun occurrences there were 2,222 distinct noun forms corresponding to 1,590 lexemes; out of 17,604 verb occurrences, 2,365 distinct verb forms and 588 lexemes; and out of 1,980 adjective occurrences, 660 distinct adjective forms and 488 lexemes. It was concluded that this group's vocabulary may differ from that of other areas and of other social strata in other contexts, but it is neither limited nor restricted; it expresses the informants' view of and expectations about the world around them.
A further recommendation for MOBRAL: the words to be taught should be collected in the various communities where the literacy classes operate, and the motivation for their selection should be tied to the adults' everyday needs, with the generative word integrated into sentences.
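The multivariate contingency tables described above can be reproduced today without SPSS. The sketch below, with invented token-level records and column names, builds a word class by theme table with pandas and quantifies the dependence with a chi-squared test, mirroring the kind of variation-by-theme analysis the study reports.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical token-level records: one row per word occurrence,
# tagged with its word class and the interview theme (invented data).
tokens = pd.DataFrame({
    "word_class": ["noun", "verb", "noun", "adjective", "verb", "noun"],
    "theme":      ["food", "food", "health", "health", "leisure", "leisure"],
})

# Contingency table of word class by theme, with marginal totals.
print(pd.crosstab(tokens["word_class"], tokens["theme"], margins=True))

# Chi-squared test of independence: quantifies how strongly word-class
# usage depends on theme.
chi2, p, dof, expected = chi2_contingency(
    pd.crosstab(tokens["word_class"], tokens["theme"]))
print(f"chi2={chi2:.2f}, p={p:.3f}, dof={dof}")
```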
Abstract:
Internet traffic classification is a relevant and mature research field, yet one of growing importance and with still-open technical challenges, not least because of the pervasive presence of Internet-connected devices in everyday life. We claim the need for innovative traffic classification solutions that are lightweight, adopt a domain-based approach, and not only concentrate on application-level protocol categorization but also classify Internet traffic by subject. To this purpose, this paper originally proposes a classification solution that leverages domain name information extracted from IPFIX summaries, DNS logs, and DHCP leases, and that can be applied to any kind of traffic. Our proposed solution is based on an extension of Word2vec unsupervised learning techniques running on a specialized Apache Spark cluster. In particular, learning techniques are leveraged to generate word embeddings from a mixed dataset composed of domain names and natural-language corpora, in a lightweight way and with general applicability. The paper also reports lessons learnt from our implementation and deployment experience, which demonstrates that our solution can process 5500 IPFIX summaries per second on an Apache Spark cluster with 1 slave instance in Amazon EC2, at a cost of $3860 per year. Reported experimental results for Precision, Recall, F-Measure, Accuracy, and Cohen's Kappa show the feasibility and effectiveness of the proposal. The experiments prove that the words contained in domain names do have a relation to the kind of traffic directed towards them; therefore, using specifically trained word embeddings, we are able to classify them into customizable categories. We also show that training word embeddings on larger natural-language corpora leads to improvements in precision of up to 180%.
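As a rough, non-authoritative sketch of the core idea (not the paper's Spark-based implementation), the code below trains Word2vec embeddings with gensim on a toy mix of tokenized domain names and natural-language sentences, then assigns a domain to the customizable category whose seed-word vectors are closest on average; all names, seeds, and categories are illustrative.

```python
import numpy as np
from gensim.models import Word2Vec

# Toy corpus: tokenized domain names (split on dots and dashes) mixed with
# natural-language sentences; all names and categories are illustrative.
sentences = [
    ["news", "bbc", "com"], ["headlines", "reuters", "com"],
    ["the", "latest", "news", "and", "headlines", "today"],
    ["video", "streaming", "netflix", "com"],
    ["watch", "video", "streaming", "online"],
]
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, epochs=100)

CATEGORY_SEEDS = {"news": ["news", "headlines"], "video": ["video", "streaming"]}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify_domain(domain: str) -> str:
    """Assign a domain to the category whose seed-word vectors are closest
    to the average vector of the domain's tokens."""
    toks = [t for t in domain.replace("-", ".").split(".") if t in model.wv]
    if not toks:
        return "unknown"
    vec = np.mean([model.wv[t] for t in toks], axis=0)
    return max(CATEGORY_SEEDS,
               key=lambda cat: np.mean([cosine(vec, model.wv[s])
                                        for s in CATEGORY_SEEDS[cat]]))

print(classify_domain("netflix.com"))  # toy data, so the output may vary
```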
Abstract:
pt. 1. German and English -- pt. 2. English and German.
Abstract:
Objective: To evaluate the effects of Optical Character Recognition (OCR) on the automatic cancer classification of pathology reports. Method: Scanned images of pathology reports were converted to electronic free text using a commercial OCR system. A state-of-the-art cancer classification system, the Medical Text Extraction (MEDTEX) system, was used to automatically classify the OCR reports. Classifications produced by MEDTEX on the OCR versions of the reports were compared with classifications from a human-amended version of the OCR reports. Results: The OCR system was found to recognise scanned pathology reports with up to 99.12% character accuracy and up to 98.95% word accuracy. Errors in the OCR processing were found to have minimal impact on the automatic classification of scanned pathology reports into notifiable groups. However, the impact of OCR errors is not negligible for the extraction of cancer notification items, such as primary site and histological type. Conclusions: The automatic cancer classification system used in this work, MEDTEX, has proven robust to errors introduced by acquiring free-text pathology reports from scanned images through OCR software. However, issues emerge when extracting cancer notification items.
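The evaluation design, comparing the classifier's output on OCR text against its output on human-amended text, can be sketched as follows; `classify_report` is a hypothetical stand-in for MEDTEX, which is not publicly available.

```python
def classify_report(text: str) -> str:
    """Hypothetical stand-in for the MEDTEX classifier: maps a pathology
    report to a notifiable group. The real system is far more elaborate."""
    return "notifiable" if "carcinoma" in text.lower() else "not_notifiable"

def classification_agreement(ocr_reports, amended_reports):
    """Fraction of reports for which the OCR text and the human-amended
    text receive the same classification."""
    labels = [(classify_report(o), classify_report(a))
              for o, a in zip(ocr_reports, amended_reports)]
    return sum(o == a for o, a in labels) / len(labels)

# Toy pair: an OCR error ('carc1noma') flips the classification.
print(classification_agreement(["ductal carc1noma in situ"],
                               ["ductal carcinoma in situ"]))  # -> 0.0
```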
Abstract:
Description of a patient's injuries is recorded in narrative text form by hospital emergency departments. For statistical reporting, this text data needs to be mapped to pre-defined codes. Existing research in this field uses the Naïve Bayes probabilistic method to build classifiers for mapping. In this paper, we focus on providing guidance on the selection of a classification method. We build a number of classifiers belonging to different classification families: decision tree, probabilistic, neural network, instance-based, ensemble-based, and kernel-based linear classifiers. Extensive pre-processing is carried out to ensure the quality of the data and, hence, of the classification outcome. Records with a null entry in the injury description are removed. Misspelling correction is carried out by finding each misspelt word and replacing it with a sound-alike word. Meaningful phrases are identified and kept, instead of removing part of a phrase as a stop word. Abbreviations that appear in many variant forms are manually identified and normalised to a single form. Clustering is used to discriminate between non-frequent and frequent terms; this process reduced the number of text features dramatically, from about 28,000 to 5000. The medical narrative-text injury dataset under consideration is composed of many short documents. The data can be characterised as high-dimensional and sparse: few features are irrelevant, but features are correlated with one another. Therefore, matrix factorization techniques such as Singular Value Decomposition (SVD) and Non-Negative Matrix Factorization (NNMF) have been used to map the processed feature space to a lower-dimensional feature space, and classifiers have been built on these reduced feature spaces. In experiments, a set of tests is conducted to determine which classification method is best for medical text classification. The Non-Negative Matrix Factorization with Support Vector Machine method achieves 93% precision, higher than all the tested traditional classifiers. We also found that TF/IDF weighting, which works well for long-text classification, is inferior to binary weighting for short-document classification. Another finding is that the top-n terms should only be removed in consultation with medical experts, as their removal affects classification performance.
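A minimal scikit-learn sketch of the best-performing combination reported above, binary term weighting plus NNMF dimensionality reduction plus an SVM, is shown below with invented toy narratives (the real dataset is not public):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF
from sklearn.svm import LinearSVC

# Invented toy injury narratives and codes standing in for the real data.
texts = ["fell from ladder fracture left wrist",
         "burn to right hand from hot oil",
         "fell down stairs fracture ankle",
         "scald burn from boiling water"]
codes = ["fall", "burn", "fall", "burn"]

pipeline = Pipeline([
    # Binary term weighting: the paper found it beats TF/IDF on short texts.
    ("vectorize", CountVectorizer(binary=True)),
    # NNMF maps the sparse term space to a low-dimensional latent space.
    ("reduce", NMF(n_components=2, init="nndsvda", max_iter=500)),
    ("classify", LinearSVC()),
])
pipeline.fit(texts, codes)
print(pipeline.predict(["fracture of wrist after fall"]))
```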
Abstract:
Social media platforms that foster user-generated content have altered the ways consumers search for product-related information. Conducting online searches, reading product reviews, and comparing product ratings is becoming a more common information-seeking pathway. This research demonstrates that info-active consumers are becoming less reliant on information provided by retailers or manufacturers; hence, marketing-generated online content may have a reduced impact on their purchasing behaviour. The results of this study indicate that, beyond traditional methods of segmenting consumers, new classifications in the online context, such as info-active and info-passive, would be beneficial in digital marketing. This cross-sectional, mixed-methods study is based on 43 in-depth interviews and an online survey of 500 consumers from 30 countries.
Abstract:
In this paper, we propose a novel heuristic approach to segment recognizable symbols from online Kannada word data and perform recognition of the entire word. Two different estimates of the first derivative are extracted from the preprocessed stroke groups and used as features for classification. Estimate 2 proved better, resulting in 88% accuracy, 3% more than that achieved with estimate 1. Classification is performed by a statistical dynamic space warping (SDSW) classifier, which uses the X and Y co-ordinates and their first derivatives as features. The classifier is trained with data from 40 writers and handles 295 classes, covering Kannada aksharas along with Kannada numerals, Indo-Arabic numerals, punctuation, and other special symbols such as $ and #. Classification accuracies obtained are 88% at the akshara level and 80% at the word level, which shows the scope for further improvement in the segmentation algorithm.
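The feature extraction and the warping-based matching can be sketched as follows. This is plain dynamic time warping over (x, y, dx, dy) feature sequences, a simplification: the paper's SDSW classifier additionally models per-class feature statistics, and `np.gradient` here stands in for one of the two derivative estimates, whose exact definitions the abstract does not give.

```python
import numpy as np

def derivative_features(stroke: np.ndarray) -> np.ndarray:
    """Append finite-difference first-derivative estimates of the x, y
    coordinates to each point of a preprocessed stroke group (N, 2) -> (N, 4)."""
    dxy = np.gradient(stroke, axis=0)
    return np.hstack([stroke, dxy])  # features: x, y, dx, dy

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Plain dynamic time warping distance between two feature sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Toy strokes: two similar stroke groups should have a small distance.
s1 = derivative_features(np.array([[0., 0.], [1., 1.], [2., 1.5], [3., 2.]]))
s2 = derivative_features(np.array([[0., 0.], [1.1, 0.9], [2.9, 2.1]]))
print(dtw_distance(s1, s2))
```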
Abstract:
Classification of a large document collection involves dealing with a huge feature space in which each distinct word is a feature. In such an environment, classification is a costly task in terms of both running time and computing resources. Furthermore, using every feature does not guarantee optimal results, because the classifier is likely to overfit. In such a context, feature selection is inevitable. This work analyses feature selection methods, explores the relations among them, and attempts to find a minimal subset of features that is discriminative for document classification.
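As a concrete example of one common method in this family, the sketch below scores terms by their chi-squared dependence on the class label and keeps only the top k; the corpus and k are toy values.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy two-class corpus: 0 = automotive, 1 = medical.
docs = ["engine oil and brake repair", "clinical trial of a new drug",
        "tire pressure and engine noise", "drug dosage in patients"]
labels = [0, 1, 0, 1]

vec = CountVectorizer()
X = vec.fit_transform(docs)

# Keep the k terms with the highest chi-squared dependence on the label.
selector = SelectKBest(chi2, k=4).fit(X, labels)
kept = selector.get_support(indices=True)
print([vec.get_feature_names_out()[i] for i in kept])
```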
Abstract:
There are many popular models available for the classification of documents, such as the Naïve Bayes classifier, k-Nearest Neighbors, and Support Vector Machines. In all these cases, the representation is based on the "Bag of Words" model. This model does not capture the actual semantic meaning of a word in a particular document; semantics are better captured by the proximity of words and their occurrence in the document. We propose a new "Bag of Phrases" model to capture this discriminative power of phrases for text classification. We present a novel algorithm to extract phrases from the corpus using the well-known topic model Latent Dirichlet Allocation (LDA), and to integrate them into the vector space model for classification. Experiments show better classifier performance with the new Bag of Phrases model than with related representation models.
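The abstract does not spell out the extraction algorithm, so the sketch below shows one plausible reading, offered only as an illustration: train an LDA model with gensim, then merge adjacent tokens whose dominant topics coincide into candidate phrases.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy pre-tokenized corpus; real phrase extraction would use a large corpus.
docs = [["support", "vector", "machine", "text", "classification"],
        ["support", "vector", "machine", "learning"],
        ["topic", "model", "latent", "dirichlet", "allocation"],
        ["latent", "dirichlet", "allocation", "topic", "model"]]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=20,
               random_state=0)

def dominant_topic(word: str) -> int:
    """Most probable topic for a word under the trained LDA model."""
    topics = lda.get_term_topics(dictionary.token2id[word],
                                 minimum_probability=0.0)
    return max(topics, key=lambda t: t[1])[0] if topics else -1

def extract_phrases(tokens):
    """Merge adjacent tokens sharing a dominant topic into candidate phrases."""
    phrases, current = [], [tokens[0]]
    for prev, word in zip(tokens, tokens[1:]):
        if dominant_topic(prev) == dominant_topic(word) != -1:
            current.append(word)
        else:
            phrases.append("_".join(current))
            current = [word]
    phrases.append("_".join(current))
    return phrases

print(extract_phrases(["support", "vector", "machine", "text"]))
```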
Abstract:
This paper investigates several approaches to bootstrapping a new spoken language understanding (SLU) component in a target language, given a large dataset of semantically annotated utterances in some other source language. The aim is to reduce the cost associated with porting a spoken dialogue system from one language to another by minimising the amount of data required in the target language. Since word-level semantic annotations are costly, Semantic Tuple Classifiers (STCs) are used in conjunction with statistical machine translation models, both of which are trained from unaligned data to further reduce development time. The paper presents experiments in which a French SLU component in the tourist information domain is bootstrapped from English data. Results show that training STCs on automatically translated data produced the best performance for predicting the utterance's dialogue act type; however, individual slot/value pairs are best predicted by training STCs on the source language and using them to decode translated utterances.
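A simplified sketch of the STC idea, one binary classifier per semantic tuple over word n-gram features, is given below; the utterances, tuples, and the use of scikit-learn are illustrative assumptions, not the paper's exact setup, and the machine translation step is omitted.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Toy utterances (imagine these machine-translated into the target language)
# paired with the semantic tuples they contain; all values are illustrative.
utterances = ["cheap restaurant in the centre", "expensive hotel near the river",
              "cheap hotel in the centre", "restaurant near the river"]
tuples = [{"price=cheap", "type=restaurant"}, {"price=expensive", "type=hotel"},
          {"price=cheap", "type=hotel"}, {"type=restaurant"}]

# One binary classifier per semantic tuple, trained on word n-gram features.
classifiers = {}
for t in sorted(set().union(*tuples)):
    y = [t in ts for ts in tuples]
    clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC())
    classifiers[t] = clf.fit(utterances, y)

def decode(utterance: str) -> set:
    """Predict the set of semantic tuples present in an utterance."""
    return {t for t, clf in classifiers.items() if clf.predict([utterance])[0]}

print(decode("cheap restaurant near the river"))
```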
Abstract:
Life is full of difficult choices. Everyone has their own way of dealing with these, some effective, some not. The problem is particularly acute in engineering design because of the vast amount of information designers have to process. This paper deals with a subset of this set of problems: the subset of selecting materials and processes, and their links to the design of products. Even these, though, present many of the generic problems of choice, and the challenges in creating tools to assist the designer in making them. The key elements are those of classification, of indexing, of reaching decisions using incomplete data in many different formats, and of devising effective strategies for selection. This final element - that of selection strategies - poses particular challenges. Product design, as an example, is an intricate blend of the technical and (for want of a better word) the aesthetic. To meet these needs, a tool that allows selection by analysis, by analogy, by association and simply by 'browsing' is necessary. An example of such a tool, its successes and remaining challenges, will be described.
Abstract:
In this paper, a method to incorporate linguistic information about single-word and compound verbs is proposed, as a first step towards an SMT model based on linguistically classified phrases. By substituting these verb structures with the base form of the head verb, we achieve better statistical word alignment performance, and are able to better estimate the translation model and generalize to unseen verb forms during translation. Preliminary experiments for the English-Spanish language pair are performed, and future research lines are detailed.
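A simplified version of this preprocessing can be sketched with spaCy (an assumption for illustration; the paper predates spaCy and its exact tooling is not stated): each verb token is replaced by its lemma, while compound-verb handling (e.g., phrasal verbs) would need additional rules.

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def normalize_verbs(sentence: str) -> str:
    """Replace each verb (and auxiliary) token with its base form, a
    simplified stand-in for the verb normalization described above."""
    doc = nlp(sentence)
    return " ".join(tok.lemma_ if tok.pos_ in ("VERB", "AUX") else tok.text
                    for tok in doc)

print(normalize_verbs("She has been looking for a new translation model"))
# -> "She have be look for a new translation model"
```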
Abstract:
Research in emotion analysis of text suggests that emotion-lexicon-based features are superior to corpus-based n-gram features. However, the static nature of general-purpose emotion lexicons makes them less suited to social media analysis, where the need to adapt to changes in vocabulary usage and context is crucial. In this paper, we propose a set of methods to extract a word-emotion lexicon automatically from an emotion-labelled corpus of tweets. Our results confirm that the features derived from these lexicons outperform standard Bag-of-Words features when applied to an emotion classification task. Furthermore, a comparative analysis with both manually crafted lexicons and a state-of-the-art lexicon generated using Point-wise Mutual Information shows that the lexicons generated by the proposed methods lead to significantly better classification performance.
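At its core, the PMI baseline mentioned above scores each (word, emotion) pair by how much more often they co-occur than chance would predict. A minimal sketch with invented toy tweets:

```python
import math
from collections import Counter

# Invented emotion-labelled tweets; real work used a large labelled corpus.
tweets = [("so happy about the sunshine", "joy"),
          ("furious about the traffic today", "anger"),
          ("sunshine makes me smile", "joy"),
          ("traffic jam again so annoyed", "anger")]

word_emotion, word_count, emotion_count, total = Counter(), Counter(), Counter(), 0
for text, emotion in tweets:
    for word in set(text.split()):
        word_emotion[(word, emotion)] += 1
        word_count[word] += 1
        emotion_count[emotion] += 1
        total += 1

def pmi(word: str, emotion: str) -> float:
    """Point-wise mutual information between a word and an emotion label:
    log2( p(word, emotion) / (p(word) * p(emotion)) )."""
    joint = word_emotion[(word, emotion)] / total
    if joint == 0:
        return float("-inf")
    return math.log2(joint / ((word_count[word] / total) *
                              (emotion_count[emotion] / total)))

print(pmi("sunshine", "joy"), pmi("traffic", "anger"))  # both positive
```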