911 resultados para Machine Learning,Natural Language Processing,Descriptive Text Mining,POIROT,Transformer
Resumo:
Active machine learning algorithms are used when large numbers of unlabeled examples are available and getting labels for them is costly (e.g. requiring consulting a human expert). Many conventional active learning algorithms focus on refining the decision boundary, at the expense of exploring new regions that the current hypothesis misclassifies. We propose a new active learning algorithm that balances such exploration with refining of the decision boundary by dynamically adjusting the probability to explore at each step. Our experimental results demonstrate improved performance on data sets that require extensive exploration while remaining competitive on data sets that do not. Our algorithm also shows significant tolerance of noise.
Resumo:
The realization that statistical physics methods can be applied to analyze written texts represented as complex networks has led to several developments in natural language processing, including automatic summarization and evaluation of machine translation. Most importantly, so far only a few metrics of complex networks have been used and therefore there is ample opportunity to enhance the statistics-based methods as new measures of network topology and dynamics are created. In this paper, we employ for the first time the metrics betweenness, vulnerability and diversity to analyze written texts in Brazilian Portuguese. Using strategies based on diversity metrics, a better performance in automatic summarization is achieved in comparison to previous work employing complex networks. With an optimized method the Rouge score (an automatic evaluation method used in summarization) was 0.5089, which is the best value ever achieved for an extractive summarizer with statistical methods based on complex networks for Brazilian Portuguese. Furthermore, the diversity metric can detect keywords with high precision, which is why we believe it is suitable to produce good summaries. It is also shown that incorporating linguistic knowledge through a syntactic parser does enhance the performance of the automatic summarizers, as expected, but the increase in the Rouge score is only minor. These results reinforce the suitability of complex network methods for improving automatic summarizers in particular, and treating text in general. (C) 2011 Elsevier B.V. All rights reserved.
Resumo:
This thesis concerns artificially intelligent natural language processing systems that are capable of learning the properties of lexical items (properties like verbal valency or inflectional class membership) autonomously while they are fulfilling their tasks for which they have been deployed in the first place. Many of these tasks require a deep analysis of language input, which can be characterized as a mapping of utterances in a given input C to a set S of linguistically motivated structures with the help of linguistic information encoded in a grammar G and a lexicon L: G + L + C → S (1) The idea that underlies intelligent lexical acquisition systems is to modify this schematic formula in such a way that the system is able to exploit the information encoded in S to create a new, improved version of the lexicon: G + L + S → L' (2) Moreover, the thesis claims that a system can only be considered intelligent if it does not just make maximum usage of the learning opportunities in C, but if it is also able to revise falsely acquired lexical knowledge. So, one of the central elements in this work is the formulation of a couple of criteria for intelligent lexical acquisition systems subsumed under one paradigm: the Learn-Alpha design rule. The thesis describes the design and quality of a prototype for such a system, whose acquisition components have been developed from scratch and built on top of one of the state-of-the-art Head-driven Phrase Structure Grammar (HPSG) processing systems. The quality of this prototype is investigated in a series of experiments, in which the system is fed with extracts of a large English corpus. While the idea of using machine-readable language input to automatically acquire lexical knowledge is not new, we are not aware of a system that fulfills Learn-Alpha and is able to deal with large corpora. To instance four major challenges of constructing such a system, it should be mentioned that a) the high number of possible structural descriptions caused by highly underspeci ed lexical entries demands for a parser with a very effective ambiguity management system, b) the automatic construction of concise lexical entries out of a bulk of observed lexical facts requires a special technique of data alignment, c) the reliability of these entries depends on the system's decision on whether it has seen 'enough' input and d) general properties of language might render some lexical features indeterminable if the system tries to acquire them with a too high precision. The cornerstone of this dissertation is the motivation and development of a general theory of automatic lexical acquisition that is applicable to every language and independent of any particular theory of grammar or lexicon. This work is divided into five chapters. The introductory chapter first contrasts three different and mutually incompatible approaches to (artificial) lexical acquisition: cue-based queries, head-lexicalized probabilistic context free grammars and learning by unification. Then the postulation of the Learn-Alpha design rule is presented. The second chapter outlines the theory that underlies Learn-Alpha and exposes all the related notions and concepts required for a proper understanding of artificial lexical acquisition. Chapter 3 develops the prototyped acquisition method, called ANALYZE-LEARN-REDUCE, a framework which implements Learn-Alpha. The fourth chapter presents the design and results of a bootstrapping experiment conducted on this prototype: lexeme detection, learning of verbal valency, categorization into nominal count/mass classes, selection of prepositions and sentential complements, among others. The thesis concludes with a review of the conclusions and motivation for further improvements as well as proposals for future research on the automatic induction of lexical features.
Resumo:
Written text is an important component in the process of knowledge acquisition and communication. Poorly written text fails to deliver clear ideas to the reader no matter how revolutionary and ground-breaking these ideas are. Providing text with good writing style is essential to transfer ideas smoothly. While we have sophisticated tools to check for stylistic problems in program code, we do not apply the same techniques for written text. In this paper we present TextLint, a rule-based tool to check for common style errors in natural language. TextLint provides a structural model of written text and an extensible rule-based checking mechanism.
Resumo:
Accurate quantitative estimation of exposure using retrospective data has been one of the most challenging tasks in the exposure assessment field. To improve these estimates, some models have been developed using published exposure databases with their corresponding exposure determinants. These models are designed to be applied to reported exposure determinants obtained from study subjects or exposure levels assigned by an industrial hygienist, so quantitative exposure estimates can be obtained. ^ In an effort to improve the prediction accuracy and generalizability of these models, and taking into account that the limitations encountered in previous studies might be due to limitations in the applicability of traditional statistical methods and concepts, the use of computer science- derived data analysis methods, predominantly machine learning approaches, were proposed and explored in this study. ^ The goal of this study was to develop a set of models using decision trees/ensemble and neural networks methods to predict occupational outcomes based on literature-derived databases, and compare, using cross-validation and data splitting techniques, the resulting prediction capacity to that of traditional regression models. Two cases were addressed: the categorical case, where the exposure level was measured as an exposure rating following the American Industrial Hygiene Association guidelines and the continuous case, where the result of the exposure is expressed as a concentration value. Previously developed literature-based exposure databases for 1,1,1 trichloroethane, methylene dichloride and, trichloroethylene were used. ^ When compared to regression estimations, results showed better accuracy of decision trees/ensemble techniques for the categorical case while neural networks were better for estimation of continuous exposure values. Overrepresentation of classes and overfitting were the main causes for poor neural network performance and accuracy. Estimations based on literature-based databases using machine learning techniques might provide an advantage when they are applied to other methodologies that combine `expert inputs' with current exposure measurements, like the Bayesian Decision Analysis tool. The use of machine learning techniques to more accurately estimate exposures from literature-based exposure databases might represent the starting point for the independence from the expert judgment.^
Resumo:
BACKGROUND: Clinical Trials (CTs) are essential for bridging the gap between experimental research on new drugs and their clinical application. Just like CTs for traditional drugs and biologics have helped accelerate the translation of biomedical findings into medical practice, CTs for nanodrugs and nanodevices could advance novel nanomaterials as agents for diagnosis and therapy. Although there is publicly available information about nanomedicine-related CTs, the online archiving of this information is carried out without adhering to criteria that discriminate between studies involving nanomaterials or nanotechnology-based processes (nano), and CTs that do not involve nanotechnology (non-nano). Finding out whether nanodrugs and nanodevices were involved in a study from CT summaries alone is a challenging task. At the time of writing, CTs archived in the well-known online registry ClinicalTrials.gov are not easily told apart as to whether they are nano or non-nano CTs-even when performed by domain experts, due to the lack of both a common definition for nanotechnology and of standards for reporting nanomedical experiments and results. METHODS: We propose a supervised learning approach for classifying CT summaries from ClinicalTrials.gov according to whether they fall into the nano or the non-nano categories. Our method involves several stages: i) extraction and manual annotation of CTs as nano vs. non-nano, ii) pre-processing and automatic classification, and iii) performance evaluation using several state-of-the-art classifiers under different transformations of the original dataset. RESULTS AND CONCLUSIONS: The performance of the best automated classifier closely matches that of experts (AUC over 0.95), suggesting that it is feasible to automatically detect the presence of nanotechnology products in CT summaries with a high degree of accuracy. This can significantly speed up the process of finding whether reports on ClinicalTrials.gov might be relevant to a particular nanoparticle or nanodevice, which is essential to discover any precedents for nanotoxicity events or advantages for targeted drug therapy.
Resumo:
La nanotecnología es un área de investigación de reciente creación que trata con la manipulación y el control de la materia con dimensiones comprendidas entre 1 y 100 nanómetros. A escala nanométrica, los materiales exhiben fenómenos físicos, químicos y biológicos singulares, muy distintos a los que manifiestan a escala convencional. En medicina, los compuestos miniaturizados a nanoescala y los materiales nanoestructurados ofrecen una mayor eficacia con respecto a las formulaciones químicas tradicionales, así como una mejora en la focalización del medicamento hacia la diana terapéutica, revelando así nuevas propiedades diagnósticas y terapéuticas. A su vez, la complejidad de la información a nivel nano es mucho mayor que en los niveles biológicos convencionales (desde el nivel de población hasta el nivel de célula) y, por tanto, cualquier flujo de trabajo en nanomedicina requiere, de forma inherente, estrategias de gestión de información avanzadas. Desafortunadamente, la informática biomédica todavía no ha proporcionado el marco de trabajo que permita lidiar con estos retos de la información a nivel nano, ni ha adaptado sus métodos y herramientas a este nuevo campo de investigación. En este contexto, la nueva área de la nanoinformática pretende detectar y establecer los vínculos existentes entre la medicina, la nanotecnología y la informática, fomentando así la aplicación de métodos computacionales para resolver las cuestiones y problemas que surgen con la información en la amplia intersección entre la biomedicina y la nanotecnología. Las observaciones expuestas previamente determinan el contexto de esta tesis doctoral, la cual se centra en analizar el dominio de la nanomedicina en profundidad, así como en el desarrollo de estrategias y herramientas para establecer correspondencias entre las distintas disciplinas, fuentes de datos, recursos computacionales y técnicas orientadas a la extracción de información y la minería de textos, con el objetivo final de hacer uso de los datos nanomédicos disponibles. El autor analiza, a través de casos reales, alguna de las tareas de investigación en nanomedicina que requieren o que pueden beneficiarse del uso de métodos y herramientas nanoinformáticas, ilustrando de esta forma los inconvenientes y limitaciones actuales de los enfoques de informática biomédica a la hora de tratar con datos pertenecientes al dominio nanomédico. Se discuten tres escenarios diferentes como ejemplos de actividades que los investigadores realizan mientras llevan a cabo su investigación, comparando los contextos biomédico y nanomédico: i) búsqueda en la Web de fuentes de datos y recursos computacionales que den soporte a su investigación; ii) búsqueda en la literatura científica de resultados experimentales y publicaciones relacionadas con su investigación; iii) búsqueda en registros de ensayos clínicos de resultados clínicos relacionados con su investigación. El desarrollo de estas actividades requiere el uso de herramientas y servicios informáticos, como exploradores Web, bases de datos de referencias bibliográficas indexando la literatura biomédica y registros online de ensayos clínicos, respectivamente. Para cada escenario, este documento proporciona un análisis detallado de los posibles obstáculos que pueden dificultar el desarrollo y el resultado de las diferentes tareas de investigación en cada uno de los dos campos citados (biomedicina y nanomedicina), poniendo especial énfasis en los retos existentes en la investigación nanomédica, campo en el que se han detectado las mayores dificultades. El autor ilustra cómo la aplicación de metodologías provenientes de la informática biomédica a estos escenarios resulta efectiva en el dominio biomédico, mientras que dichas metodologías presentan serias limitaciones cuando son aplicadas al contexto nanomédico. Para abordar dichas limitaciones, el autor propone un enfoque nanoinformático, original, diseñado específicamente para tratar con las características especiales que la información presenta a nivel nano. El enfoque consiste en un análisis en profundidad de la literatura científica y de los registros de ensayos clínicos disponibles para extraer información relevante sobre experimentos y resultados en nanomedicina —patrones textuales, vocabulario en común, descriptores de experimentos, parámetros de caracterización, etc.—, seguido del desarrollo de mecanismos para estructurar y analizar dicha información automáticamente. Este análisis concluye con la generación de un modelo de datos de referencia (gold standard) —un conjunto de datos de entrenamiento y de test anotados manualmente—, el cual ha sido aplicado a la clasificación de registros de ensayos clínicos, permitiendo distinguir automáticamente los estudios centrados en nanodrogas y nanodispositivos de aquellos enfocados a testear productos farmacéuticos tradicionales. El presente trabajo pretende proporcionar los métodos necesarios para organizar, depurar, filtrar y validar parte de los datos nanomédicos existentes en la actualidad a una escala adecuada para la toma de decisiones. Análisis similares para otras tareas de investigación en nanomedicina ayudarían a detectar qué recursos nanoinformáticos se requieren para cumplir los objetivos actuales en el área, así como a generar conjunto de datos de referencia, estructurados y densos en información, a partir de literatura y otros fuentes no estructuradas para poder aplicar nuevos algoritmos e inferir nueva información de valor para la investigación en nanomedicina. ABSTRACT Nanotechnology is a research area of recent development that deals with the manipulation and control of matter with dimensions ranging from 1 to 100 nanometers. At the nanoscale, materials exhibit singular physical, chemical and biological phenomena, very different from those manifested at the conventional scale. In medicine, nanosized compounds and nanostructured materials offer improved drug targeting and efficacy with respect to traditional formulations, and reveal novel diagnostic and therapeutic properties. Nevertheless, the complexity of information at the nano level is much higher than the complexity at the conventional biological levels (from populations to the cell). Thus, any nanomedical research workflow inherently demands advanced information management. Unfortunately, Biomedical Informatics (BMI) has not yet provided the necessary framework to deal with such information challenges, nor adapted its methods and tools to the new research field. In this context, the novel area of nanoinformatics aims to build new bridges between medicine, nanotechnology and informatics, allowing the application of computational methods to solve informational issues at the wide intersection between biomedicine and nanotechnology. The above observations determine the context of this doctoral dissertation, which is focused on analyzing the nanomedical domain in-depth, and developing nanoinformatics strategies and tools to map across disciplines, data sources, computational resources, and information extraction and text mining techniques, for leveraging available nanomedical data. The author analyzes, through real-life case studies, some research tasks in nanomedicine that would require or could benefit from the use of nanoinformatics methods and tools, illustrating present drawbacks and limitations of BMI approaches to deal with data belonging to the nanomedical domain. Three different scenarios, comparing both the biomedical and nanomedical contexts, are discussed as examples of activities that researchers would perform while conducting their research: i) searching over the Web for data sources and computational resources supporting their research; ii) searching the literature for experimental results and publications related to their research, and iii) searching clinical trial registries for clinical results related to their research. The development of these activities will depend on the use of informatics tools and services, such as web browsers, databases of citations and abstracts indexing the biomedical literature, and web-based clinical trial registries, respectively. For each scenario, this document provides a detailed analysis of the potential information barriers that could hamper the successful development of the different research tasks in both fields (biomedicine and nanomedicine), emphasizing the existing challenges for nanomedical research —where the major barriers have been found. The author illustrates how the application of BMI methodologies to these scenarios can be proven successful in the biomedical domain, whilst these methodologies present severe limitations when applied to the nanomedical context. To address such limitations, the author proposes an original nanoinformatics approach specifically designed to deal with the special characteristics of information at the nano level. This approach consists of an in-depth analysis of the scientific literature and available clinical trial registries to extract relevant information about experiments and results in nanomedicine —textual patterns, common vocabulary, experiment descriptors, characterization parameters, etc.—, followed by the development of mechanisms to automatically structure and analyze this information. This analysis resulted in the generation of a gold standard —a manually annotated training or reference set—, which was applied to the automatic classification of clinical trial summaries, distinguishing studies focused on nanodrugs and nanodevices from those aimed at testing traditional pharmaceuticals. The present work aims to provide the necessary methods for organizing, curating and validating existing nanomedical data on a scale suitable for decision-making. Similar analysis for different nanomedical research tasks would help to detect which nanoinformatics resources are required to meet current goals in the field, as well as to generate densely populated and machine-interpretable reference datasets from the literature and other unstructured sources for further testing novel algorithms and inferring new valuable information for nanomedicine.
Resumo:
This paper describes our participation at SemEval- 2014 sentiment analysis task, in both contextual and message polarity classification. Our idea was to com- pare two different techniques for sentiment analysis. First, a machine learning classifier specifically built for the task using the provided training corpus. On the other hand, a lexicon-based approach using natural language processing techniques, developed for a ge- neric sentiment analysis task with no adaptation to the provided training corpus. Results, though far from the best runs, prove that the generic model is more robust as it achieves a more balanced evaluation for message polarity along the different test sets.
Resumo:
Los medios sociales han revolucionado la manera en la que los consumidores se relacionan entre sí y con las marcas. Las opiniones publicadas en dichos medios tienen un poder de influencia en las decisiones de compra tan importante como las campañas de publicidad. En consecuencia, los profesionales del marketing cada vez dedican mayores esfuerzos e inversión a la obtención de indicadores que permitan medir el estado de salud de las marcas a partir de los contenidos digitales generados por sus consumidores. Dada la naturaleza no estructurada de los contenidos publicados en los medios sociales, la tecnología usada para procesar dichos contenidos ha menudo implementa técnicas de Inteligencia Artificial, tales como algoritmos de procesamiento de lenguaje natural, aprendizaje automático y análisis semántico. Esta tesis, contribuye al estado de la cuestión, con un modelo que permite estructurar e integrar la información publicada en medios sociales, y una serie de técnicas cuyos objetivos son la identificación de consumidores, así como la segmentación psicográfica y sociodemográfica de los mismos. La técnica de identificación de consumidores se basa en la huella digital de los dispositivos que utilizan para navegar por la Web y es tolerante a los cambios que se producen con frecuencia en dicha huella digital. Las técnicas de segmentación psicográfica descritas obtienen la posición en el embudo de compra de los consumidores y permiten clasificar las opiniones en función de una serie de atributos de marketing. Finalmente, las técnicas de segmentación sociodemográfica permiten obtener el lugar de residencia y el género de los consumidores. ABSTRACT Social media has revolutionised the way in which consumers relate to each other and with brands. The opinions published in social media have a power of influencing purchase decisions as important as advertising campaigns. Consequently, marketers are increasing efforts and investments for obtaining indicators to measure brand health from the digital content generated by consumers. Given the unstructured nature of social media contents, the technology used for processing such contents often implements Artificial Intelligence techniques, such as natural language processing, machine learning and semantic analysis algorithms. This thesis contributes to the State of the Art, with a model for structuring and integrating the information posted on social media, and a number of techniques whose objectives are the identification of consumers, as well as their socio-demographic and psychographic segmentation. The consumer identification technique is based on the fingerprint of the devices they use to surf the Web and is tolerant to the changes that occur frequently in such fingerprint. The psychographic profiling techniques described infer the position of consumer in the purchase funnel, and allow to classify the opinions based on a series of marketing attributes. Finally, the socio-demographic profiling techniques allow to obtain the residence and gender of consumers.
Resumo:
El objetivo principal de este proyecto ha sido introducir aprendizaje automático en la aplicación FleSe. FleSe es una aplicación web que permite realizar consultas borrosas sobre bases de datos nítidos. Para llevar a cabo esta función la aplicación utiliza unos criterios para definir los conceptos borrosos usados para llevar a cabo las consultas. FleSe además permite que el usuario cambie estas personalizaciones. Es aquí donde introduciremos el aprendizaje automático, de tal manera que los criterios por defecto cambien y aprendan en función de las personalizaciones que van realizando los usuarios. Los objetivos secundarios han sido familiarizarse con el desarrollo y diseño web, al igual que recordar y ampliar el conocimiento sobre lógica borrosa y el lenguaje de programación lógica Ciao-Prolog. A lo largo de la realización del proyecto y sobre todo después del estudio de los resultados se demuestra que la agrupación de los usuarios marca la diferencia con la última versión de la aplicación. Esto se basa en la siguiente idea, podemos usar un algoritmo de aprendizaje automático sobre las personalizaciones de los criterios de todos los usuarios, pero la gran diversidad de opiniones de los usuarios puede llevar al algoritmo a concluir criterios erróneos o no representativos. Para solucionar este problema agrupamos a los usuarios intentando que cada grupo tengan la misma opinión o mismo criterio sobre el concepto. Y después de haber realizado las agrupaciones usar el algoritmo de aprendizaje automático para precisar el criterio por defecto de cada grupo de usuarios. Como posibles mejoras para futuras versiones de la aplicación FleSe sería un mejor control y manejo del ejecutable plserver. Este archivo se encarga de permitir a la aplicación web usar el lenguaje de programación lógica Ciao-Prolog para llevar a cabo la lógica borrosa relacionada con las consultas. Uno de los problemas más importantes que ofrece plserver es que bloquea el hilo de ejecución al intentar cargar un archivo con errores y en caso de ocurrir repetidas veces bloquea todas las peticiones siguientes bloqueando la aplicación. Pensando en los usuarios y posibles clientes, sería también importante permitir que FleSe trabajase con bases de datos de SQL en vez de almacenar la base de datos en los archivos de Prolog. Otra posible mejora basarse en distintas características a la hora de agrupar los usuarios dependiendo de los conceptos borrosos que se van ha utilizar en las consultas. Con esto se conseguiría que para cada concepto borroso, se generasen distintos grupos de usuarios, los cuales tendrían opiniones distintas sobre el concepto en cuestión. Así se generarían criterios por defecto más precisos para cada usuario y cada concepto borroso.---ABSTRACT---The main objective of this project has been to introduce machine learning in the application FleSe. FleSe is a web application that makes fuzzy queries over databases with precise information, using defined criteria to define the fuzzy concepts used by the queries. The application allows the users to change and custom these criteria. On this point is where the machine learning would be introduced, so FleSe learn from every new user customization of the criteria in order to generate a new default value of it. The secondary objectives of this project were get familiar with web development and web design in order to understand the how the application works, as well as refresh and improve the knowledge about fuzzy logic and logic programing. During the realization of the project and after the study of the results, I realized that clustering the users in different groups makes the difference between this new version of the application and the previous. This conclusion follows the next idea, we can use an algorithm to introduce machine learning over the criteria that people have, but the problem is the diversity of opinions and judgements that exists, making impossible to generate a unique correct criteria for all the users. In order to solve this problem, before using the machine learning methods, we cluster the users in order to make groups that have the same opinion, and afterwards, use the machine learning methods to precise the default criteria of each users group. The future improvements that could be important for the next versions of FleSe will be to control better the behaviour of the plserver file, that cost many troubles at the beginning of this project and it also generate important errors in the previous version. The file plserver allows the web application to use Ciao-Prolog, a logic programming language that control and manage all the fuzzy logic. One of the main problems with plserver is that when the user uploads a file with errors, it will block the thread and when this happens multiple times it will start blocking all the requests. Oriented to the customer, would be important as well to allow FleSe to manage and work with SQL databases instead of store the data in the Prolog files. Another possible improvement would that the cluster algorithm would be based on different criteria depending on the fuzzy concepts that the selected Prolog file have. This will generate more meaningful clusters, and therefore, the default criteria offered to the users will be more precise.
Resumo:
The development of the Web 2.0 led to the birth of new textual genres such as blogs, reviews or forum entries. The increasing number of such texts and the highly diverse topics they discuss make blogs a rich source for analysis. This paper presents a comparative study on open domain and opinion QA systems. A collection of opinion and mixed fact-opinion questions in English is defined and two Question Answering systems are employed to retrieve the answers to these queries. The first one is generic, while the second is specific for emotions. We comparatively evaluate and analyze the systems’ results, concluding that opinion Question Answering requires the use of specific resources and methods.
Resumo:
EmotiBlog is a corpus labelled with the homonymous annotation schema designed for detecting subjectivity in the new textual genres. Preliminary research demonstrated its relevance as a Machine Learning resource to detect opinionated data. In this paper we compare EmotiBlog with the JRC corpus in order to check the EmotiBlog robustness of annotation. For this research we concentrate on its coarse-grained labels. We carry out a deep ML experimentation also with the inclusion of lexical resources. The results obtained show a similarity with the ones obtained with the JRC demonstrating the EmotiBlog validity as a resource for the SA task.
Resumo:
In this paper we address two issues. The first one analyzes whether the performance of a text summarization method depends on the topic of a document. The second one is concerned with how certain linguistic properties of a text may affect the performance of a number of automatic text summarization methods. For this we consider semantic analysis methods, such as textual entailment and anaphora resolution, and we study how they are related to proper noun, pronoun and noun ratios calculated over original documents that are grouped into related topics. Given the obtained results, we can conclude that although our first hypothesis is not supported, since it has been found no evident relationship between the topic of a document and the performance of the methods employed, adapting summarization systems to the linguistic properties of input documents benefits the process of summarization.
Resumo:
The great amount of text produced every day in the Web turned it as one of the main sources for obtaining linguistic corpora, that are further analyzed with Natural Language Processing techniques. On a global scale, languages such as Portuguese - official in 9 countries - appear on the Web in several varieties, with lexical, morphological and syntactic (among others) differences. Besides, a unified spelling system for Portuguese has been recently approved, and its implementation process has already started in some countries. However, it will last several years, so different varieties and spelling systems coexist. Since PoS-taggers for Portuguese are specifically built for a particular variety, this work analyzes different training corpora and lexica combinations aimed at building a model with high-precision annotation in several varieties and spelling systems of this language. Moreover, this paper presents different dictionaries of the new orthography (Spelling Agreement) as well as a new freely available testing corpus, containing different varieties and textual typologies.
Resumo:
This introduction provides an overview of the state-of-the-art technology in Applications of Natural Language to Information Systems. Specifically, we analyze the need for such technologies to successfully address the new challenges of modern information systems, in which the exploitation of the Web as a main data source on business systems becomes a key requirement. It will also discuss the reasons why Human Language Technologies themselves have shifted their focus onto new areas of interest very directly linked to the development of technology for the treatment and understanding of Web 2.0. These new technologies are expected to be future interfaces for the new information systems to come. Moreover, we will review current topics of interest to this research community, and will present the selection of manuscripts that have been chosen by the program committee of the NLDB 2011 conference as representative cornerstone research works, especially highlighting their contribution to the advancement of such technologies.