Digital Library

974 results for Language processing

Tree edit distance as a baseline approach for paraphrase representation

Relevance:

60.00%

Abstract:

Finding an adequate paraphrase representation formalism is a challenging issue in Natural Language Processing. In this paper, we analyse the performance of Tree Edit Distance as a paraphrase representation baseline. Our experiments using Edit Distance Textual Entailment Suite show that, as Tree Edit Distance consists of a purely syntactic approach, paraphrase alternations not based on structural reorganizations do not find an adequate representation. They also show that there is much scope for better modelling of the way trees are aligned.
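As a rough illustration of what a purely syntactic tree edit distance measures, the sketch below computes the unit-cost edit distance between two ordered, labelled trees using the standard rightmost-root forest recursion. It is not the implementation used in the Edit Distance Textual Entailment Suite; the tuple-based tree encoding, the unit costs and the toy parse trees are all assumptions made for this example.

```python
from functools import lru_cache

# A tree is a tuple (label, children), where children is a tuple of trees.
# A forest is a tuple of trees. This encoding is an assumption of the sketch.


def forest_size(forest):
    """Number of nodes in a forest."""
    return sum(1 + forest_size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Unit-cost edit distance between two ordered forests."""
    if not f1:
        return forest_size(f2)  # insert every remaining node
    if not f2:
        return forest_size(f1)  # delete every remaining node
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    rest1, rest2 = f1[:-1], f2[:-1]
    return min(
        # delete the root of the rightmost tree in f1
        forest_distance(rest1 + kids1, f2) + 1,
        # insert the root of the rightmost tree in f2
        forest_distance(f1, rest2 + kids2) + 1,
        # map the two rightmost roots onto each other (relabel if they differ)
        forest_distance(rest1, rest2)
        + forest_distance(kids1, kids2)
        + (label1 != label2),
    )


def tree_edit_distance(t1, t2):
    """Unit-cost edit distance between two ordered, labelled trees."""
    return forest_distance((t1,), (t2,))


# Toy trees (hypothetical): a structural reordering and a lexical substitution.
original = ("S", (("NP", (("car", ()),)), ("VP", (("stops", ()),))))
reordered = ("S", (("VP", (("stops", ()),)), ("NP", (("car", ()),))))
substituted = ("S", (("NP", (("automobile", ()),)), ("VP", (("stops", ()),))))

print(tree_edit_distance(original, reordered))
print(tree_edit_distance(original, substituted))
```

In this sketch a lexical substitution is just a single relabelling with no notion of meaning, which is one way to picture why non-structural paraphrase alternations are poorly captured by the measure.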

Paraphrase concept and typology. A linguistically based and computationally oriented approach

Relevance:

60.00%

Abstract:

In this paper, we present a critical analysis of the state of the art in the definition and typologies of paraphrasing. This analysis shows that there exists no characterization of paraphrasing that is comprehensive, linguistically based and computationally tractable at the same time. The following sets out to define and delimit the concept on the basis of the propositional content. We present a general, inclusive and computationally oriented typology of the linguistic mechanisms that give rise to form variations between paraphrase pairs.

Plagiarism meets paraphrasing: insights for the new generation in automatic plagiarism detection

Relevance:

60.00%

Abstract:

Although paraphrasing is the linguistic mechanism underlying many plagiarism cases, little attention has been paid to its analysis in the framework of automatic plagiarism detection. Therefore, state-of-the-art plagiarism detectors find it difficult to detect cases of paraphrase plagiarism. In this article, we analyse the relationship between paraphrasing and plagiarism, paying special attention to which paraphrase phenomena underlie acts of plagiarism and which of them are detected by plagiarism detection systems. With this aim in mind, we created the P4P corpus, a new resource which uses a paraphrase typology to annotate a subset of the PAN-PC-10 corpus for automatic plagiarism detection. The results of the Second International Competition on Plagiarism Detection were analysed in the light of this annotation. The presented experiments show that (i) more complex paraphrase phenomena and a high density of paraphrase mechanisms make plagiarism detection more difficult, (ii) lexical substitutions are the paraphrase mechanisms used the most when plagiarising, and (iii) paraphrase mechanisms tend to shorten the plagiarized text. For the first time, the paraphrase mechanisms behind plagiarism have been analysed, providing critical insights for the improvement of automatic plagiarism detection systems.

Llengua al límit. Una llengua, dos camins. L'encontre necessari.

Relevance:

60.00%

Abstract:

In this work, two television programmes are transcribed in order to find out what kind of language these media use to address their audience. It would be absurd, however, to ignore other channels for which language is essential. I am referring above all to cinema. Although it is not considered a mass medium, it is also a very important element as regards the treatment and transmission of language, and many cinema productions end up being shown on television. The written press and, as a special case, the Internet also have a great deal to say on the matter.

Decreased pyramidal neuron size in Brodmann areas 44 and 45 in patients with autism.

Relevance:

60.00%

Abstract:

Autism is a neurodevelopmental disorder characterized by deficits in social interaction and social communication, as well as by the presence of repetitive and stereotyped behaviors and interests. Brodmann areas 44 and 45 in the inferior frontal cortex, which are involved in language processing, imitation function, and sociality processing networks, have been implicated in this complex disorder. Using a stereologic approach, this study aims to explore the presence of neuropathological differences in areas 44 and 45 in patients with autism compared to age- and hemisphere-matched controls. Based on previous evidence in the fusiform gyrus, we expected to find a decrease in the number and size of pyramidal neurons as well as an increase in volume of layers III, V, and VI in patients with autism. We observed significantly smaller pyramidal neurons in patients with autism compared to controls, although there was no difference in pyramidal neuron numbers or layer volumes. The reduced pyramidal neuron size suggests that a certain degree of dysfunction of areas 44 and 45 plays a role in the pathology of autism. Our results also support previous studies that have shown specific cellular neuropathology in autism with regionally specific reduction in neuron size, and provide further evidence for the possible involvement of the mirror neuron system, as well as impairment of neuronal networks relevant to communication and social behaviors, in this disorder.

Reconciling phonological neighborhood effects in speech production through single trial analysis

Relevance:

60.00%

Abstract:

A crucial step for understanding how lexical knowledge is represented is to describe the relative similarity of lexical items, and how it influences language processing. Previous studies of the effects of form similarity on word production have reported conflicting results, notably within and across languages. The aim of the present study was to clarify this empirical issue to provide specific constraints for theoretical models of language production. We investigated the role of phonological neighborhood density in a large-scale picture naming experiment using fine-grained statistical models. The results showed that increasing phonological neighborhood density has a detrimental effect on naming latencies, and re-analyses of independently obtained data sets provide supplementary evidence for this effect. Finally, we reviewed a large body of evidence concerning phonological neighborhood density effects in word production, and discussed the occurrence of facilitatory and inhibitory effects in accuracy measures. The overall pattern shows that phonological neighborhood generates two opposite forces, one facilitatory and one inhibitory. In cases where speech production is disrupted (e.g. certain aphasic symptoms), the facilitatory component may emerge, but inhibitory processes dominate in efficient naming by healthy speakers. These findings are difficult to accommodate in terms of monitoring processes, but can be explained within interactive activation accounts combining phonological facilitation and lexical competition.

Source separation techniques applied to linear prediction

Relevance:

60.00%

Abstract:

Prediction filters are well-known models for signal estimation in communications, control and many other areas. The classical method for deriving linear prediction coding (LPC) filters is often based on the minimization of a mean square error (MSE). Consequently, only second-order statistics are required, but the estimation is optimal only if the residue is independent and identically distributed (iid) Gaussian. In this paper, we derive the ML estimate of the prediction filter. Relationships with robust estimation of auto-regressive (AR) processes, with blind deconvolution and with source separation based on mutual information minimization are then detailed. The algorithm, based on the minimization of a high-order statistics criterion, uses on-line estimation of the residue statistics. Experimental results highlight the interest of this approach.
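For orientation, the classical MSE-based derivation that the abstract takes as its starting point can be sketched with the autocorrelation method and the Levinson-Durbin recursion, which uses only second-order statistics. This is a minimal sketch of that baseline and not the ML estimator proposed in the paper; the function names and the predictor convention x_hat[n] = sum_k a[k] * x[n-1-k] are assumptions of this example.

```python
from typing import List, Tuple


def autocorrelation(x: List[float], max_lag: int) -> List[float]:
    """Unnormalized autocorrelation r[0..max_lag] of a finite signal."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def levinson_durbin(r: List[float], order: int) -> Tuple[List[float], float]:
    """Solve the Toeplitz normal equations sum_j a[j] * r[|i-j|] = r[i+1].

    Returns the predictor coefficients a[0..order-1] and the final
    prediction error energy. Assumes r[0] > 0.
    """
    a: List[float] = []
    err = r[0]
    for m in range(order):
        # Reflection coefficient for extending the predictor from order m to m+1.
        acc = r[m + 1] - sum(a[j] * r[m - j] for j in range(m))
        k = acc / err
        # Update the existing coefficients and append the new one.
        a = [a[j] - k * a[m - 1 - j] for j in range(m)] + [k]
        err *= 1.0 - k * k
    return a, err


# Autocorrelation of an idealized AR(1) process with coefficient 0.5
# (hypothetical values chosen for illustration).
coeffs, residual_energy = levinson_durbin([1.0, 0.5, 0.25], order=2)
print(coeffs, residual_energy)
```

Because the recursion sees the signal only through its autocorrelation, it cannot exploit any non-Gaussian structure in the residue, which is the limitation the abstract points to.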

ClInt: A bilingual Spanish-Catalan spoken corpus of clinical interviews

Relevance:

60.00%

Abstract:

In this paper we present ClInt (Clinical Interview), a bilingual Spanish-Catalan spoken corpus that contains 15 hours of clinical interviews. It consists of audio files aligned with multiple-level transcriptions comprising orthographic, phonetic and morphological information, as well as linguistic and extralinguistic encoding. This is a previously non-existent resource for these languages and it offers a wide-ranging exploitation potential in a broad variety of disciplines such as Linguistics, Natural Language Processing and related fields.

CoCo, a web interface for corpora compilation

Relevance:

60.00%

Abstract:

CoCo is a collaborative web interface for the compilation of linguistic resources. In this demo we are presenting one of its possible applications: paraphrase acquisition.

New kernel functions and learning methods for text and data mining

Relevance:

60.00%

Abstract:

Recent advances in machine learning methods increasingly enable the automatic construction of various types of computer-assisted methods that have been difficult or laborious for human experts to program. The tasks for which tools of this kind are needed arise in many areas, here especially in the fields of bioinformatics and natural language processing. Machine learning methods may not work satisfactorily if they are not appropriately tailored to the task in question. However, their learning performance can often be improved by taking advantage of deeper insight into the application domain or the learning problem at hand. This thesis considers developing kernel-based learning algorithms that incorporate this kind of prior knowledge of the task in question in an advantageous way. Moreover, computationally efficient algorithms for training the learning machines for specific tasks are presented. In the context of kernel-based learning methods, the incorporation of prior knowledge is often done by designing appropriate kernel functions. Another well-known way is to develop cost functions that fit the task under consideration. For disambiguation tasks in natural language, we develop kernel functions that take account of the positional information and the mutual similarities of words. It is shown that the use of this information significantly improves the disambiguation performance of the learning machine. Further, we design a new cost function that is better suited to the task of information retrieval and to more general ranking problems than the cost functions designed for regression and classification. We also consider other applications of the kernel-based learning algorithms, such as text categorization and pattern recognition in differential display. We develop computationally efficient algorithms for training the considered learning machines with the proposed kernel functions.
We also design a fast cross-validation algorithm for regularized least-squares type of learning algorithm. Further, an efficient version of the regularized least-squares algorithm that can be used together with the new cost function for preference learning and ranking tasks is proposed. In summary, we demonstrate that the incorporation of prior knowledge is possible and beneficial, and novel advanced kernels and cost functions can be used in algorithms efficiently.

Opinion Mining

Relevance:

60.00%

Abstract:

In this thesis we study the field of opinion mining by giving a comprehensive review of the available research that has been done in this topic. Also using this available knowledge we present a case study of a multilevel opinion mining system for a student organization's sales management system. We describe the field of opinion mining by discussing its historical roots, its motivations and applications as well as the different scientific approaches that have been used to solve this challenging problem of mining opinions. To deal with this huge subfield of natural language processing, we first give an abstraction of the problem of opinion mining and describe the theoretical frameworks that are available for dealing with appraisal language. Then we discuss the relation between opinion mining and computational linguistics which is a crucial pre-processing step for the accuracy of the subsequent steps of opinion mining. The second part of our thesis deals with the semantics of opinions where we describe the different ways used to collect lists of opinion words as well as the methods and techniques available for extracting knowledge from opinions present in unstructured textual data. In the part about collecting lists of opinion words we describe manual, semi manual and automatic ways to do so and give a review of the available lists that are used as gold standards in opinion mining research. For the methods and techniques of opinion mining we divide the task into three levels that are the document, sentence and feature level. The techniques that are presented in the document and sentence level are divided into supervised and unsupervised approaches that are used to determine the subjectivity and polarity of texts and sentences at these levels of analysis. At the feature level we give a description of the techniques available for finding the opinion targets, the polarity of the opinions about these opinion targets and the opinion holders. 
Also at the feature level we discuss the various ways to summarize and visualize the results of this level of analysis. In the third part of our thesis we present a case study of a sales management system that uses free form text and that can benefit from an opinion mining system. Using the knowledge gathered in the review of this field we provide a theoretical multi level opinion mining system (MLOM) that can perform most of the tasks needed from an opinion mining system. Based on the previous research we give some hints that many of the laborious market research tasks that are done by the sales force, which uses this sales management system, can improve their insight about their partners and by that increase the quality of their sales services and their overall results.

Learning Preferences with Kernel-Based Methods

Relevance:

60.00%

Abstract:

Learning of preference relations has recently received significant attention in the machine learning community. It is closely related to classification and regression analysis and can be reduced to these tasks. However, preference learning involves predicting an ordering of the data points rather than a single numerical value, as in regression, or a class label, as in classification. Therefore, studying preference relations within a separate framework not only facilitates a better theoretical understanding of the problem, but also motivates the development of efficient algorithms for the task. Preference learning has many applications in domains such as information retrieval, bioinformatics and natural language processing. For example, algorithms that learn to rank are frequently used in search engines for ordering the documents retrieved by a query. Preference learning methods have also been applied to collaborative filtering problems for predicting individual customer choices from the vast amount of user-generated feedback. In this thesis we propose several algorithms for learning preference relations. These algorithms stem from the well-founded and robust class of regularized least-squares methods and have many attractive computational properties. In order to improve the performance of our methods, we introduce several non-linear kernel functions. Thus, the contribution of this thesis is twofold: kernel functions for structured data that are used to take advantage of various non-vectorial data representations, and preference learning algorithms that are suitable for different tasks, namely efficient learning of preference relations, learning with large amounts of training data, and semi-supervised preference learning. The proposed kernel-based algorithms and kernels are applied to the parse ranking task in natural language processing, document ranking in information retrieval, and remote homology detection in the bioinformatics domain.
Training of kernel-based ranking algorithms can be infeasible when the size of the training set is large. This problem is addressed by proposing a preference learning algorithm whose computation complexity scales linearly with the number of training data points. We also introduce sparse approximation of the algorithm that can be efficiently trained with large amount of data. For situations when small amount of labeled data but a large amount of unlabeled data is available, we propose a co-regularized preference learning algorithm. To conclude, the methods presented in this thesis address not only the problem of the efficient training of the algorithms but also fast regularization parameter selection, multiple output prediction, and cross-validation. Furthermore, proposed algorithms lead to notably better performance in many preference learning tasks considered.

Imaging bilinguals: When the neurosciences meet the language sciences

Relevance:

60.00%

Abstract:

The starting point of our investigation was the longstanding notion that bilingual individuals need effective mechanisms to prevent interference from one language while processing material in the other (e.g. Penfield and Roberts, 1959). To demonstrate how the prevention of interference is implemented in the brain we employed event-related brain potentials (ERPs; see Münte, Urbach, Düzel and Kutas, 2000, for an introductory review) and functional magnetic resonance imaging (fMRI) techniques, thus pursuing a combined temporal and spatial imaging approach. In contrast to previous investigations using neuroimaging techniques in bilinguals, which had been mainly concerned with the localization of the primary and secondary languages (e.g. Perani, Paulesu, Galles, Dupoux, Dehaene, Bettinardi, Cappa, Fazio and Mehler, 1998; Chee, Caplan, Soon, Sriram, Tan, Thiel and Weekes, 1999), our study addressed the dynamic aspects of bilingual language processing.

Functional neuroanatomy of meaning acquisition from context

Relevance:

60.00%

Abstract:

An important issue in language learning is how new words are integrated in the brain representations that sustain language processing. To identify the brain regions involved in meaning acquisition and word learning, we conducted a functional magnetic resonance imaging study. Young participants were required to deduce the meaning of a novel word presented within increasingly constrained sentence contexts that were read silently during the scanning session. Inconsistent contexts were also presented in which no meaning could be assigned to the novel word. Participants showed meaning acquisition in the consistent but not in the inconsistent condition. A distributed brain network was identified comprising the left anterior inferior frontal gyrus (BA 45), the middle temporal gyrus (BA 21), the parahippocampal gyrus, and several subcortical structures (the thalamus and the striatum). Drawing on previous neuroimaging evidence, we tentatively identify the roles of these brain areas in the retrieval, selection, and encoding of the meaning.

Biomedical Event Extraction with Machine Learning

Relevance:

60.00%

Abstract:

Biomedical natural language processing (BioNLP) is a subfield of natural language processing, an area of computational linguistics concerned with developing programs that work with natural language: written texts and speech. Biomedical relation extraction concerns the detection of semantic relations such as protein-protein interactions (PPI) from scientific texts. The aim is to enhance information retrieval by detecting relations between concepts, not just individual concepts as with a keyword search. In recent years, events have been proposed as a more detailed alternative for simple pairwise PPI relations. Events provide a systematic, structural representation for annotating the content of natural language texts. Events are characterized by annotated trigger words, directed and typed arguments and the ability to nest other events. For example, the sentence “Protein A causes protein B to bind protein C” can be annotated with the nested event structure CAUSE(A, BIND(B, C)). Converted to such formal representations, the information of natural language texts can be used by computational applications. Biomedical event annotations were introduced by the BioInfer and GENIA corpora, and event extraction was popularized by the BioNLP'09 Shared Task on Event Extraction. In this thesis we present a method for automated event extraction, implemented as the Turku Event Extraction System (TEES). A unified graph format is defined for representing event annotations and the problem of extracting complex event structures is decomposed into a number of independent classification tasks. These classification tasks are solved using SVM and RLS classifiers, utilizing rich feature representations built from full dependency parsing. Building on earlier work on pairwise relation extraction and using a generalized graph representation, the resulting TEES system is capable of detecting binary relations as well as complex event structures. 
We show that this event extraction system has good performance, reaching the first place in the BioNLP'09 Shared Task on Event Extraction. Subsequently, TEES has achieved several first ranks in the BioNLP'11 and BioNLP'13 Shared Tasks, as well as shown competitive performance in the binary relation Drug-Drug Interaction Extraction 2011 and 2013 shared tasks. The Turku Event Extraction System is published as a freely available open-source project, documenting the research in detail as well as making the method available for practical applications. In particular, in this thesis we describe the application of the event extraction method to PubMed-scale text mining, showing how the developed approach not only shows good performance, but is generalizable and applicable to large-scale real-world text mining projects. Finally, we discuss related literature, summarize the contributions of the work and present some thoughts on future directions for biomedical event extraction. This thesis includes and builds on six original research publications. The first of these introduces the analysis of dependency parses that leads to development of TEES. The entries in the three BioNLP Shared Tasks, as well as in the DDIExtraction 2011 task are covered in four publications, and the sixth one demonstrates the application of the system to PubMed-scale text mining.
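The nested event structure quoted in the abstract, CAUSE(A, BIND(B, C)), can be pictured as a small recursive data type. The sketch below is purely illustrative: the names `Entity`, `Event`, `render` and `nesting_depth` are hypothetical and are not taken from TEES or its graph format, and trigger words and argument roles are left out for brevity.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Entity:
    """A named entity such as a protein (hypothetical type for illustration)."""
    name: str


@dataclass(frozen=True)
class Event:
    """An event with a type and ordered arguments, which may themselves be events."""
    type: str
    args: Tuple[Union["Entity", "Event"], ...]


def render(node: Union[Entity, Event]) -> str:
    """Render an entity or event in TYPE(arg, ...) notation."""
    if isinstance(node, Entity):
        return node.name
    return f"{node.type}({', '.join(render(arg) for arg in node.args)})"


def nesting_depth(node: Union[Entity, Event]) -> int:
    """Number of event levels above the deepest entity."""
    if isinstance(node, Entity):
        return 0
    return 1 + max((nesting_depth(arg) for arg in node.args), default=0)


# "Protein A causes protein B to bind protein C"
a, b, c = Entity("A"), Entity("B"), Entity("C")
nested = Event("CAUSE", (a, Event("BIND", (b, c))))

print(render(nested))         # CAUSE(A, BIND(B, C))
print(nesting_depth(nested))  # 2
```

A flat pairwise relation such as a single protein-protein interaction corresponds to an event of depth 1 in this picture, while nesting lets one event act as an argument of another.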
