900 results for Hierarchy of text classifiers
Abstract:
Sentiment analysis is concerned with automatically identifying the sentiment or opinion expressed in a given piece of text. Most prior work either uses prior lexical knowledge, defined as the sentiment polarity of words, or views the task as a text classification problem and relies on labeled corpora to train a sentiment classifier. While lexicon-based approaches do not adapt well to different domains, corpus-based approaches require expensive manual annotation effort. In this paper, we propose a novel framework where an initial classifier is learned by incorporating prior information extracted from an existing sentiment lexicon, with preferences on expectations of sentiment labels of those lexicon words being expressed using generalized expectation criteria. Documents classified with high confidence are then used as pseudo-labeled examples for automatic domain-specific feature acquisition. The word-class distributions of such self-learned features are estimated from the pseudo-labeled examples and are used to train another classifier by constraining the model's predictions on unlabeled instances. Experiments on both the movie-review data and the multi-domain sentiment dataset show that our approach attains comparable or better performance than existing weakly-supervised sentiment classification methods despite using no labeled documents.
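The pseudo-labeling step described in this abstract can be illustrated with a minimal sketch. It is a deliberate simplification: it replaces the generalized expectation criteria with plain lexicon word counts, and the lexicon, the threshold and all function names are hypothetical and not taken from the paper.

```python
from collections import Counter

# Hypothetical toy lexicon; the lexicon actually used in the paper is not given here.
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "poor", "terrible"}


def lexicon_score(tokens):
    """Return (label, confidence) from simple lexicon word counts."""
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    total = pos + neg
    if total == 0:
        return None, 0.0  # no lexicon evidence at all
    label = "pos" if pos >= neg else "neg"
    return label, max(pos, neg) / total


def pseudo_label(docs, threshold=0.8):
    """Keep only documents that the lexicon labels with high confidence."""
    kept = []
    for doc in docs:
        tokens = doc.lower().split()
        label, confidence = lexicon_score(tokens)
        if label is not None and confidence >= threshold:
            kept.append((tokens, label))
    return kept


def word_class_counts(pseudo_labeled):
    """Count how often each word occurs under each pseudo-label."""
    counts = {"pos": Counter(), "neg": Counter()}
    for tokens, label in pseudo_labeled:
        counts[label].update(tokens)
    return counts


docs = ["great plot excellent cast", "terrible pacing bad script", "good but bad"]
labeled = pseudo_label(docs)
counts = word_class_counts(labeled)
```

In this toy run the third document is dropped because its lexicon evidence is split evenly, and the per-class word counts stand in for the word-class distributions that a second classifier would be trained on.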
Abstract:
To date, more than 16 million citations of published articles in the biomedical domain are available in the MEDLINE database. These articles describe the new discoveries that have accompanied the tremendous development of biomedicine during the last decade. It is crucial for biomedical researchers to retrieve and mine specific knowledge from this huge quantity of published articles efficiently. Researchers have been developing text mining tools to find knowledge, such as protein-protein interactions, that is most relevant and useful for specific analysis tasks. This chapter provides a road map to the various information extraction methods in the biomedical domain, such as protein name recognition and the discovery of protein-protein interactions. Disciplines involved in analyzing and processing unstructured text are summarized. Current work in biomedical information extraction is categorized. Challenges in the field are also presented and possible solutions are discussed.
Abstract:
This article is linked to my major study on the Poetik des Extremen by classifying the monstrous works of Marianne Fritz among a genealogy of extremist writing in German-speaking literature. Her literary project Festung, which represents in all likelihood the most extensive ‘novel’ in Western literary history, is first analysed by looking at the exponential growth of its components from a paperback of 108 pages to the not yet completed novel Naturgemäß, which will most probably comprise 15 volumes, mostly of A4 size and a length that should be equivalent to over 20,000 standard pages. In parallel to the quantitative explosion of form, the article also explores the transgression of traditional narration and Fritz’s typographical innovations of text presentation. Using reproductions of the late facsimile volumes, an exemplary ‘close reading’ of one page from Naturgemäß II is undertaken to demonstrate the enormous density of Festung. Finally, the article seeks to differentiate Fritz’s opus magnum from other out-sized works of literature by focussing on the specific interconnection between the quantitative and stylistic explosion of the form of the novel, which makes it incomparable to the major works of writers such as Robert Musil or Arno Schmidt.
Abstract:
In this article, I explore issues of commitment to truth in dating ads that use apparently impossible categorizations to project identities for ad writers and their desired others. The article begins with a brief overview of relevant aspects of Text World Theory (especially Gavins's work on dating ads), Sinclair's model of fictional worlds and Routledge and Chapman's account of truth-commitment in discourse, and proposes the need for a framework that allows for a partial suspension of commitment to truth. I then draw on the work of Ivanič and Weldon on identity in writing, in order to develop an account that offers a discourse- and genre-based discussion of how the intertextual metaphors in such ads are interpreted in relation to truth values. I suggest the default stance is that of positive commitment to literal truth and that, when this is not possible, a fall-back mode of negative commitment to metaphorical truth is preferred over an interpretation in which questions of truth are truly suspended. Finally, I consider a related category, of apparently negative dating ad identities, in order to suggest a functional motivation for the inclusion of elements that cannot be interpreted in truth-committed mode. Copyright © 2008 SAGE Publications.
Abstract:
In this paper, we propose a new similarity measure to compute the pair-wise similarity of text-based documents based on patterns of the words in the documents. First, we develop a kappa measure for pair-wise comparison of documents; then we use the ordered weighted averaging (OWA) operator to define a document similarity measure for a set of documents.
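The two ingredients named in this abstract can be sketched as follows. The sketch assumes that the kappa measure is Cohen's kappa computed over binary word-presence vectors on a shared vocabulary, which the abstract does not confirm; the OWA operator follows its standard definition (sort the values in descending order, then take a weighted sum). Function names and the example vectors are illustrative only.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length binary word-presence vectors."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    # Agreement expected by chance, from the marginal presence rates.
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1.0:
        return 1.0  # degenerate case: both vectors are constant and identical
    return (observed - expected) / (1 - expected)


def owa(values, weights):
    """Ordered weighted averaging: weights apply to values sorted descending."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))


# Word-presence vectors over a hypothetical four-word vocabulary.
doc_a = [1, 0, 1, 0]
doc_b = [1, 0, 1, 0]
doc_c = [1, 1, 0, 0]

pairwise = [cohen_kappa(doc_a, doc_b), cohen_kappa(doc_a, doc_c), cohen_kappa(doc_b, doc_c)]
set_similarity = owa(pairwise, [0.5, 0.25, 0.25])
```

The choice of OWA weights controls the aggregation: weights of `[1, 0, 0]` return the largest pair-wise score, `[0, 0, 1]` the smallest, and uniform weights the plain mean.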
Abstract:
Social streams have proven to be the most up-to-date and inclusive source of information on current events. In this paper we propose a novel probabilistic modelling framework, called the violence detection model (VDM), which enables the identification of text containing violent content and the extraction of violence-related topics over social media data. The proposed VDM model does not require any labeled corpora for training; instead, it only needs the incorporation of word prior knowledge, which captures whether a word indicates violence or not. We propose a novel approach to deriving word prior knowledge using the relative entropy measurement of words, based on the intuition that low-entropy words are indicative of semantically coherent topics and therefore more informative, while high-entropy words are words whose usage is more topically diverse and therefore less informative. Our proposed VDM model has been evaluated on the TREC Microblog 2011 dataset to identify topics related to violence. Experimental results show that deriving word priors using our proposed relative entropy method is more effective than the widely used information gain method. Moreover, VDM gives higher violence classification results and produces more coherent violence-related topics compared to a few competitive baselines.
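The intuition about low-entropy and high-entropy words can be made concrete with a small sketch. It computes plain Shannon entropy of a word's usage counts across categories, which is only an assumed stand-in for the relative entropy measurement used in the paper; the words and counts are invented for illustration.

```python
import math


def word_entropy(counts):
    """Shannon entropy (in bits) of a word's usage counts across categories."""
    total = sum(counts)
    if total == 0:
        raise ValueError("word has no occurrences")
    entropy = 0.0
    for c in counts:
        if c > 0:  # zero counts contribute nothing to the sum
            p = c / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical occurrence counts of each word in two categories.
usage = {
    "riot": [10, 0],   # concentrated in one category
    "today": [5, 5],   # spread evenly across categories
}
ranked = sorted(usage, key=lambda w: word_entropy(usage[w]))
```

Under this reading, words at the front of `ranked` are the concentrated, more informative ones and would be the better candidates for word priors.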
Abstract:
The influence of text messaging on language has been hotly debated, especially in relation to spelling and the lexicon, but the impact of SMS on syntax has received less attention. This article focuses on manipulations within the verbal domain, as language evolution points towards a consistent trend going from synthetic to analytical forms (Bybee et al. 1994), which goes against the need for concision in texting. Based on an authentic corpus of about 500 SMS (Fairon et al. 2006b), the present study shows condensation strategies that are similar to those already described, yet reveals specific features such as the absence of aphaeresis and the scarcity of apocope, as well as the overuse of synthetic forms. It can thus be concluded that while SMS writing displays oral characteristics, it clearly cannot be assimilated to speech; in addition, it may well slow down language evolution and support the conservation of short standard forms.
Abstract:
This paper addresses the task of learning classifiers from streams of labelled data. In this setting, the underlying concepts can change over time. The paper studies two mechanisms developed for dealing with changing concepts. Both are based on the time window idea. The first one forgets gradually, by assigning to the examples a weight that gradually decreases over time. The second one uses a statistical test to detect changes in concept and then optimizes the size of the time window, aiming to maximise the classification accuracy on the new examples. Both methods are general in nature and can be used with any learning algorithm. The objectives of the conducted experiments were to compare the mechanisms and explore whether they can be combined to achieve a synergetic effect. Results from experiments with three basic learning algorithms (kNN, ID3 and NBC) using four datasets are reported and discussed.
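The gradual forgetting mechanism can be sketched in a few lines. The exponential decay used here is an assumption chosen for simplicity; the abstract only states that weights decrease gradually over time and does not give the forgetting function. The function names and the `decay` parameter are hypothetical.

```python
def forgetting_weights(n, decay=0.9):
    """Weights for n examples ordered oldest to newest; the newest gets 1.0."""
    return [decay ** (n - 1 - i) for i in range(n)]


def weighted_majority(labels, weights):
    """Return the label with the largest total weight."""
    totals = {}
    for label, weight in zip(labels, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)


# A stream whose concept drifts from "a" to "b" in the most recent examples.
stream = ["a", "a", "a", "b", "b"]
with_forgetting = weighted_majority(stream, forgetting_weights(len(stream), decay=0.5))
without_forgetting = weighted_majority(stream, forgetting_weights(len(stream), decay=1.0))
```

With `decay=1.0` every example counts equally and the older concept wins by raw count, whereas a smaller decay lets the recent examples dominate. The same weights could be passed to any learner that accepts per-example weights.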
Abstract:
* The work is partially supported by a grant of the National Academy of Sciences of Ukraine for the support of scientific research by young scientists, No. 24-7/05, "Development of a Desktop Grid system and optimization of its performance".
Abstract:
* The following text was originally published in the Proceedings of the Language Resources and Evaluation Conference held in Lisbon, Portugal, 2004, under the title "Towards Intelligent Written Cultural Heritage Processing - Lexical processing". I present here a revised version of that paper and add the latest efforts made at the Center for Computational Linguistics in Prague in the field under discussion.
Abstract:
In this paper, a new method for offline handwriting recognition is presented. A robust algorithm for handwriting segmentation is described, with which individual characters can be segmented from a word selected from a paragraph of a handwritten text image given as input to the module. Each segmented character is then converted into a column vector of 625 values, which is later fed into the advanced neural network setup, designed in the form of text files. The networks are designed as quadruple-layered neural networks with 625 input and 26 output neurons, each output corresponding to a character from a to z. The outputs of all four networks are fed into a genetic algorithm, developed using the concepts of correlation, which optimizes the overall network and yields recognized outputs with an efficiency of 71%.
Abstract:
One of the ultimate aims of Natural Language Processing is to automate the analysis of the meaning of text. A fundamental step in that direction consists in enabling effective ways to automatically link textual references to their referents, that is, real world objects. The work presented in this paper addresses the problem of attributing a sense to proper names in a given text, i.e., automatically associating words representing Named Entities with their referents. The method for Named Entity Disambiguation proposed here is based on the concept of semantic relatedness, which in this work is obtained via a graph-based model over Wikipedia. We show that, without building the traditional bag of words representation of the text, but instead only considering named entities within the text, the proposed method achieves results competitive with the state-of-the-art on two different datasets.
Abstract:
Boyko Bl. Banchev - A rationale and description are presented of a programming language in a compositional style, intended for experimental and educational purposes. By "compositional" we mean a functional style of programming in which computation is a hierarchy of compositions and applications of functions. One of the language's data types is that of geometric figures, which can be obtained through simple rules of relative placement and thus also form hierarchical compositions. The language is strongly influenced by GeomLab, but differs from it significantly in a number of properties. The article discusses the main features of the language; its detailed description and its figure-construction capabilities will be presented in an accompanying publication.
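The compositional style described in this abstract, where both computations and figures are built as hierarchies of compositions, can be illustrated in Python. This is not the language from the article nor GeomLab syntax; figures are modelled here as lists of text rows, and `compose`, `beside` and `above` are hypothetical helper names.

```python
from functools import reduce


def compose(*funcs):
    """Compose functions right to left: compose(f, g)(x) == f(g(x))."""
    return reduce(lambda f, g: lambda x: f(g(x)), funcs, lambda x: x)


def beside(left, right):
    """Place two figures (lists of text rows) side by side."""
    if len(left) != len(right):
        raise ValueError("figures must have the same height")
    return [l + r for l, r in zip(left, right)]


def above(top, bottom):
    """Stack one figure on top of another."""
    return top + bottom


def double(fig):
    """Put a figure next to a copy of itself."""
    return beside(fig, fig)


def stack(fig):
    """Put a figure on top of a copy of itself."""
    return above(fig, fig)


square = ["##", "##"]
grid = compose(stack, double)(square)  # first double, then stack
```

The figure `grid` is itself a composition of compositions: a square doubled horizontally, then the result doubled vertically.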
Abstract:
The immune system is perhaps the largest yet most diffuse and distributed somatic system in vertebrates. It plays vital roles in fighting infection and in the homeostatic control of chronic disease. As such, the immune system in both pathological and healthy states is a prime target for therapeutic interventions by drugs, both small-molecule and biologic. Comprising both the innate and adaptive immune systems, human immunity is awash with potential unexploited molecular targets. Key examples include the pattern recognition receptors of the innate immune system and the major histocompatibility complex of the adaptive immune system. Moreover, the immune system is also the source of many current and, hopefully, future drugs, of which the prime example is the monoclonal antibody, the most exciting and profitable type of present-day drug moiety. This brief review explores the identity and synergies of the hierarchy of drug targets represented by the human immune system, with particular emphasis on the emerging paradigm of systems pharmacology. © the authors, publisher and licensee Libertas Academica Limited.
Abstract:
Starting with a description of the software and hardware used for corpus linguistics in the late 1980s to early 1990s, this contribution discusses difficulties faced by the software designer when attempting to allow users to study text. Future human-machine interfaces may develop to be much more sophisticated, and certainly the aspects of text which can be studied will progress beyond plain text without images. Another area which will develop further is the study of patternings involving not just single words but word-relations across large stretches of text.