994 results for 280205 Text Processing
Abstract:
The exponential increase in subjective, user-generated content since the birth of the Social Web has led to the need for automatic text processing systems able to extract, process and present relevant knowledge. In this paper, we tackle the Opinion Retrieval, Mining and Summarization task by proposing a unified framework composed of three crucial components (information retrieval, opinion mining and text summarization) that allow the retrieval, classification and summarization of subjective information. An extensive analysis is conducted in which different configurations of the framework are proposed and evaluated, in order to determine which performs best and under which conditions. The evaluation carried out and the results obtained show the appropriateness of the individual components, as well as of the framework as a whole. With an improvement of over 10% compared to state-of-the-art approaches in the context of blogs, we conclude that subjective text can be handled effectively by means of the proposed framework.
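A minimal Python sketch of such a three-stage pipeline is given below, assuming a toy corpus, a keyword-overlap retriever, a lexicon-based polarity classifier and a first-sentence summarizer; these placeholder components are invented for illustration and are not the components of the framework described above.

```python
import re
from dataclasses import dataclass

# Toy opinion lexicons (assumption for illustration only).
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

@dataclass
class Post:
    doc_id: str
    text: str

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z']+", text.lower()))

def retrieve(query: str, corpus: list[Post]) -> list[Post]:
    """Stage 1 (toy retrieval): keep posts sharing at least one term with the query."""
    return [p for p in corpus if tokens(query) & tokens(p.text)]

def classify_opinion(post: Post) -> str:
    """Stage 2 (toy opinion mining): lexicon-based polarity."""
    words = tokens(post.text)
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def summarize(posts: list[Post], max_sentences: int = 2) -> str:
    """Stage 3 (toy summarization): take the first sentence of each post."""
    first = [p.text.split(".")[0].strip() for p in posts]
    return ". ".join(first[:max_sentences]) + "."

if __name__ == "__main__":
    corpus = [
        Post("b1", "I love this phone. The battery lasts two days."),
        Post("b2", "Terrible phone. The screen cracked after a week."),
        Post("b3", "The weather was great today."),
    ]
    by_polarity: dict[str, list[Post]] = {}
    for post in retrieve("phone", corpus):
        by_polarity.setdefault(classify_opinion(post), []).append(post)
    for polarity, posts in by_polarity.items():
        print(polarity, "->", summarize(posts))
```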
Abstract:
The field of natural language processing (NLP) has grown considerably in recent years; its research areas include information retrieval and extraction, data mining, machine translation, question answering systems, automatic summarization and sentiment analysis, among others. This article presents concepts and some tools intended to contribute to the understanding of text processing with NLP techniques, with the aim of extracting relevant information that can be used in a wide range of applications. Automatic classifiers can be developed to categorize documents and recommend tags; these classifiers should be platform-independent, easily customizable so that they can be integrated into different projects, and able to learn from examples. This article introduces these classification algorithms, reviews some open-source tools currently available for carrying out these tasks, and compares several implementations using the F measure to evaluate the classifiers.
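As a hedged illustration of the kind of comparison mentioned above, the following Python sketch trains two off-the-shelf text classifiers with scikit-learn and scores both with the F measure; the toy corpus, labels and choice of classifiers are assumptions for illustration, not the implementations compared in the article.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score

# Tiny invented corpus with two categories.
train_texts = ["stock markets fall", "team wins the final", "central bank raises rates",
               "player scores twice", "inflation slows down", "coach announces lineup"]
train_labels = ["economy", "sports", "economy", "sports", "economy", "sports"]
test_texts = ["bank cuts interest rates", "striker scores the winning goal"]
test_labels = ["economy", "sports"]

# Compare two classifiers under the same bag-of-words (TF-IDF) representation.
for name, clf in [("Naive Bayes", MultinomialNB()), ("Linear SVM", LinearSVC())]:
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(train_texts, train_labels)
    pred = model.predict(test_texts)
    print(name, "macro F1 =", f1_score(test_labels, pred, average="macro"))
```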
Abstract:
In this paper we present a complete system for handling both the geographical and the temporal dimension in text, and its application to information retrieval. The system has been evaluated in the GeoTime task of the 8th and 9th NTCIR workshops, in 2010 and 2011 respectively, making it possible to compare it with contemporary approaches to the topic. In order to participate in this task, we added the temporal dimension to our GIR system. The system proposed here has a modular architecture so that features can be added or modified. In developing it, we followed a QA-based approach and used multiple search engines to improve performance.
Abstract:
This paper describes the methodology followed to automatically generate titles for a corpus of questions belonging to sociological opinion polls. Titles for questions have a twofold function: (1) they serve as the input for user searches and (2) they convey the overall content of the question and its possible answer options. Thus, title generation can be considered a case of automatic summarization. However, the fact that summarization had to be performed over very short texts, together with the aforementioned quality requirements imposed on the generated titles, led the authors to adopt knowledge-rich, domain-dependent summarization strategies rather than the more common extractive techniques.
Abstract:
Adobe's Acrobat software, released in June 1993, is based on a new Portable Document Format (PDF) which makes it possible to view and exchange electronic documents, independent of the originating software, across a wide variety of supported hardware platforms (PC, Macintosh, Sun UNIX, etc.). The fact that Acrobat's imageable objects are rendered with full use of Level 2 PostScript means that the most demanding requirements can be met in terms of high-quality typography and device-independent colour. These qualities will be very desirable components in future multimedia and hypermedia systems. The current capabilities of Acrobat and PDF are described, in particular the presence of hypertext links, bookmarks and yellow-sticker annotations in release 1.0, together with article threads and multimedia plug-ins in version 2.0. This article also describes the CAJUN project (CD-ROM Acrobat Journals Using Networks), which has been investigating the automated placement of PDF hypertextual features from various front-end text processing systems. CAJUN has also been experimenting with the dissemination of PDF over e-mail, via the World Wide Web and on CD-ROM.
Abstract:
Electronic Publishing -- Origination, Dissemination and Design (EP-odd) is an academic journal which publishes refereed papers in the subject area of electronic publishing. The authors of the present paper are, respectively, editor-in-chief, system software consultant and senior production manager for the journal. EP-odd's policy is that editors, authors, referees and production staff will work closely together using electronic mail. Authors are also encouraged to originate their papers using one of the approved text-processing packages together with the appropriate set of macros which enforce the layout style for the journal. This same software will then be used by the publisher in the production phase. Our experiences with these strategies are presented, and two recently developed suites of software are described: one of these makes the macro sets available over electronic mail and the other automates the flow of papers through the refereeing process. The decision to produce EP-odd in this way means that the publisher has to adopt production procedures which differ markedly from those employed for a conventional journal.
Abstract:
Starting in December 1982 the University of Nottingham decided to phototypeset almost all of its examination papers 'in house' using the troff, tbl and eqn programs running under UNIX. This tutorial lecture highlights the features of the three programs with particular reference to their strengths and weaknesses in a production environment. The following issues are particularly addressed: Standards -- all three software packages require the embedding of commands and the invocation of pre-written macros, rather than 'what you see is what you get'. This can help to enforce standards, in the absence of traditional compositor skills. Hardware and Software -- the requirements are analysed for an inexpensive preview facility and a low-level interface to the phototypesetter. Mathematical and Technical papers -- the fine-tuning of eqn to impose a standard house style. Staff skills and training -- systems of this kind do not require the operators to have had previous experience of phototypesetting. Of much greater importance is willingness and flexibility in learning how to use computer systems.
Abstract:
This paper presents a study in a field poorly explored for the Portuguese language: modality and its automatic tagging. Our main goal was to find a set of attributes for building automatic taggers with improved performance over the bag-of-words (bow) approach. Performance was measured using precision, recall and F1. Because it is a relatively unexplored field, the study covers the creation of the corpus (composed of eleven verbs), the use of a parser to extract syntactic and semantic information from the sentences, and a machine learning approach to identify modality values. Based on three different sets of attributes, drawn from the trigger itself, the trigger's path in the parse tree, and the context, the system creates a tagger for each verb, achieving (for almost every verb) an improvement in F1 when compared to the traditional bow approach.
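The sketch below illustrates, under assumptions, the general shape of such a per-verb tagger: features drawn from the trigger, its parse-tree path and its surrounding context are vectorized and fed to a standard classifier. The feature names, toy parse paths and modality labels are invented for illustration and do not reproduce the paper's exact attribute sets.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def features(trigger, path, context):
    """Build a feature dict from the trigger, its parse-tree path and its context."""
    feats = {"trigger": trigger, "path": ">".join(path)}
    for w in context:
        feats[f"ctx={w}"] = 1.0
    return feats

# Toy training instances for one verb ("poder"), labelled with a modality value.
X = [
    features("poder", ["VP", "S"], ["ele", "pode", "sair"]),
    features("poder", ["VP", "SBAR"], ["pode", "ser", "que"]),
    features("poder", ["VP", "S"], ["ela", "pode", "nadar"]),
    features("poder", ["VP", "SBAR"], ["pode", "ser", "verdade"]),
]
y = ["ability", "epistemic", "ability", "epistemic"]

# One tagger is trained per verb, as in the per-verb setup described above.
tagger = make_pipeline(DictVectorizer(), LogisticRegression())
tagger.fit(X, y)
print(tagger.predict([features("poder", ["VP", "S"], ["eles", "podem", "correr"])]))
```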
Abstract:
An implementation of a computational tool to generate new summaries from new source texts by means of the connectionist approach (artificial neural networks) is presented. Among other contributions that this work intends to bring to natural language processing research, the use of a more biologically plausible connectionist architecture and training procedure for automatic summarization is emphasized. This choice relies on the expectation that it may bring an increase in computational efficiency when compared to the so-called biologically implausible algorithms.
Abstract:
An electrocardiogram (ECG) monitoring system deals with several challenges related to noise sources. The main goal of this work was the study of adaptive signal processing algorithms for ECG noise reduction applied to real signals. This document presents an adaptive filtering technique based on the Least Mean Square (LMS) algorithm to remove the artefacts caused by electromyographic (EMG) activity and power-line noise in the ECG signal. Real noise signals were used in the experiments, mainly to observe the difference between real and simulated noise sources. Very good results were obtained, owing to the noise-removal capability that can be reached with this technique.
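The sketch below shows a minimal LMS adaptive noise canceller of the kind described above, applied to a synthetic signal; the toy 'ECG' waveform, the 50 Hz interference, the reference signal and the parameter values are assumptions for illustration, not the real recordings or settings used in the study.

```python
import numpy as np

def lms_cancel(primary, reference, n_taps=16, mu=0.01):
    """Adaptive noise canceller: primary = signal + noise, reference = correlated noise."""
    w = np.zeros(n_taps)                   # adaptive filter weights
    cleaned = np.zeros_like(primary)
    for n in range(n_taps, len(primary)):
        x = reference[n - n_taps:n][::-1]  # most recent reference samples
        y = w @ x                          # current noise estimate
        e = primary[n] - y                 # error = cleaned output sample
        w += 2 * mu * e * x                # LMS weight update
        cleaned[n] = e
    return cleaned

# Toy demo: 1 Hz "ECG-like" wave corrupted by 50 Hz power-line interference.
fs = 500
t = np.arange(0, 4, 1 / fs)
ecg = np.sin(2 * np.pi * 1.0 * t)
powerline = 0.5 * np.sin(2 * np.pi * 50 * t)
primary = ecg + powerline
reference = np.sin(2 * np.pi * 50 * t + 0.3)   # phase-shifted reference of the interference

cleaned = lms_cancel(primary, reference)
print("residual noise power after 1 s:", np.mean((cleaned[fs:] - ecg[fs:]) ** 2))
```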
Abstract:
Arguably, the most difficult task in text classification is to choose an appropriate set of features that allows machine learning algorithms to provide accurate classification. Most state-of-the-art techniques for this task involve careful feature engineering and a pre-processing stage, which may be too expensive in the emerging context of massive collections of electronic texts. In this paper, we propose efficient methods for text classification based on information-theoretic dissimilarity measures, which are used to define dissimilarity-based representations. These methods dispense with any feature design or engineering by mapping texts into a feature space using universal dissimilarity measures; in this space, classical classifiers (e.g. nearest neighbor or support vector machines) can then be used. The reported experimental evaluation of the proposed methods, on sentiment polarity analysis and authorship attribution problems, reveals that they approximate, and sometimes even outperform, previous state-of-the-art techniques, despite being much simpler in the sense that they do not require any text pre-processing or feature engineering.
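A minimal sketch of the dissimilarity-based idea follows, using the Normalized Compression Distance (computed with zlib) as one plausible instance of a universal, information-theoretic dissimilarity measure, followed by a nearest-neighbour decision; the tiny corpus is invented, and the specific measure and classifier are assumptions rather than the exact choices made in the paper.

```python
import zlib

def c(s: str) -> int:
    """Compressed length of a string, used as a computable proxy for its complexity."""
    return len(zlib.compress(s.encode("utf-8")))

def ncd(a: str, b: str) -> float:
    """Normalized Compression Distance between two raw texts (no feature engineering)."""
    ca, cb, cab = c(a), c(b), c(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)

# Invented labelled texts for sentiment polarity.
train = [("I loved this film, wonderful acting", "pos"),
         ("great movie, I enjoyed every minute", "pos"),
         ("boring plot and terrible dialogue", "neg"),
         ("a dreadful, tedious waste of time", "neg")]

def classify_1nn(text: str) -> str:
    """Nearest neighbour in the dissimilarity space defined by NCD."""
    return min(train, key=lambda pair: ncd(text, pair[0]))[1]

print(classify_1nn("wonderful film, great acting"))
print(classify_1nn("tedious and boring"))
```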
Abstract:
Development and standardization of reliable methods for detection of Mycobacterium tuberculosis in clinical samples is an important goal in laboratories throughout the world. In this work, lung and spleen fragments from a patient who died with a diagnosis of miliary tuberculosis were used to evaluate the influence of the type of fixative, as well as the fixation and paraffin-embedding protocols, on PCR performance in paraffin-embedded specimens. Tissue fragments were fixed for 4 h to 48 h using either 10% non-buffered or 10% buffered formalin, and embedded in pure paraffin or paraffin mixed with beeswax. Specimens were submitted to PCR for amplification of the human beta-actin gene and, separately, for amplification of the insertion sequence IS6110, specific to the M. tuberculosis complex. Amplification of the beta-actin gene was positive in all samples. No amplicons were generated by PCR-IS6110 when lung tissue fragments were fixed using 10% non-buffered formalin and embedded in paraffin containing beeswax. In conclusion, combined inhibitory factors interfere with the detection of M. tuberculosis in stored material. It is important to control these inhibitory factors in order to implement molecular diagnosis in pathology laboratories.
Abstract:
Vibrio cholerae represents a significant threat to human health in developing countries. This pathogen forms biofilms, which favor its attachment to surfaces and its survival and transmission through water or food. This work evaluated the in vitro biofilm formation of V. cholerae isolated from clinical and environmental sources on stainless steel of the type used in food processing, using environmental scanning electron microscopy (ESEM). Results showed no cell adhesion at 4 h and scarce surface colonization at 24 h. Biofilms from the environmental strain were observed at 48 h, with large cellular aggregations embedded in Vibrio exopolysaccharide (VPS), while less confluence and VPS production, with microcolonies of elongated cells, were observed in biofilms produced by the clinical strain. At 96 h the biofilms of the environmental strain were released from the surface, leaving coccoid cells and residual structures, whereas biofilms of the clinical strain formed highly organized structures such as channels, mushroom-like formations and pillars. This is the first study to show the in vitro ability of V. cholerae to colonize and form biofilms on stainless steel used in food processing.