914 results for computational linguistics
Abstract:
Sequence problems are among the most challenging interdisciplinary topics of our time. They are ubiquitous in science and daily life, appearing, for example, as DNA sequences encoding the information of an organism, as text (natural or formal), or as computer programs. Sequence problems therefore occur in many variations in computational biology (drug development), coding theory, data compression, and quantitative and computational linguistics (e.g. machine translation). In recent years, several proposals have appeared to formulate sequence problems such as the closest string problem (CSP) and the farthest string problem (FSP) as an Integer Linear Programming Problem (ILPP). In this talk we present a novel general approach to reducing the size of the ILPP by grouping isomorphous columns of the string matrix. The approach is of practical use, since solving sequence problems is very time consuming, in particular when the sequences are long.
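As a hedged illustration of the column-grouping idea, here is a minimal Python sketch. It assumes, as is common in CSP/FSP formulations, that "isomorphous" columns are columns of the string matrix that are identical up to a renaming of the alphabet; the function names and example data are ours, not the talk's.

```python
from collections import Counter

def canonical_column(column):
    """Relabel symbols by order of first occurrence, so columns that are
    identical up to renaming the alphabet map to the same canonical form."""
    relabel = {}
    return tuple(relabel.setdefault(s, len(relabel)) for s in column)

def group_isomorphous_columns(strings):
    """Group the columns of the string matrix into isomorphism classes.
    The resulting multiplicities let one ILP variable block stand in for
    a whole class of columns instead of one block per column."""
    n = len(strings[0])
    assert all(len(s) == n for s in strings), "strings must have equal length"
    columns = (tuple(s[j] for s in strings) for j in range(n))
    return Counter(canonical_column(c) for c in columns)

# Example: 6 columns collapse into 3 isomorphism classes.
strings = ["ACGTAC", "ACGTCA", "ATGTAC"]
for cls, count in group_isomorphous_columns(strings).items():
    print(cls, "x", count)
```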
Abstract:
The purpose of this research study was to demonstrate the practical linguistic study and evaluation of dissertations using two examples of the latest technology, the microcomputer and the optical scanner. This involved developing efficient methods for data entry and creating computer algorithms appropriate for personal linguistic studies. The goal was to develop a prototype investigation demonstrating practical solutions for maximizing the linguistic potential of the dissertation database. Text was entered from a Dest PC Scan 1000 optical scanner, whose function was to copy the complete stack of education dissertations from the Florida Atlantic University Library into an IBM XT microcomputer. The optical scanner demonstrated its practical value by copying 15,900 pages of dissertation text directly into the microcomputer. A total of 199 dissertations, or 72% of the entire stack of education dissertations (277), were successfully copied into the microcomputer's word processor, where each dissertation was analyzed for a variety of syntax frequencies. The results of the study demonstrated the practical use of the optical scanner for data entry, the microcomputer for data and statistical analysis, and the availability of the college library as a natural setting for text studies. A supplemental benefit was the establishment of a computerized dissertation corpus for future research and study. The final step was to build a linguistic model of the differences in dissertation writing styles by creating 7 factors from 55 dependent variables through principal components factor analysis. The 7 factors (textual components) were then named and described on a hypothetical construct defined as a continuum from a conversational, interactional style to a formal, academic writing style. The 7 factors were then grouped through discriminant analysis to create discriminant functions for each of the 7 independent variables. The results indicated that a conversational, interactional writing style was associated with more recent dissertations (1972-1987), increasing author age, female authors, and the department of Curriculum and Instruction. A formal, academic writing style was associated with older dissertations, younger authors, male authors, and the department of Administration and Supervision. It was concluded that there were no significant differences in writing style due to subject matter (community college studies) compared to other subject matter. It was also concluded that there were no significant differences in writing style due to the location of dissertation origin (Florida Atlantic University, University of Central Florida, Florida International University).
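A hedged scikit-learn sketch of the two statistical steps named above, principal components factor analysis reducing 55 variables to 7 factors followed by discriminant analysis on the factor scores; the random data and the grouping variable are placeholders, not the study's.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(199, 55))       # placeholder: 199 dissertations x 55 syntax frequencies
group = rng.integers(0, 2, size=199) # placeholder grouping variable (e.g. department)

# Step 1: principal components factor analysis, 7 factors from 55 variables.
pca = PCA(n_components=7)
factors = pca.fit_transform(X)
print("variance explained:", pca.explained_variance_ratio_.sum())

# Step 2: discriminant analysis on the 7 factor scores.
lda = LinearDiscriminantAnalysis().fit(factors, group)
print("classification accuracy:", lda.score(factors, group))
```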
Abstract:
Persuasive communication is the process of shaping, reinforcing and changing others' responses. In political debates, speakers express their views towards the debated topics by choosing both the content of their discourse and the argumentation process. In this work we study the use of semantic frames for modelling argumentation in speakers' discourse. We investigate the impact of a speaker's argumentation style and its effect in influencing an audience to support their candidature. We model the influence index of each candidate based on their relative standings in the polls released prior to the debate and present a system which ranks speakers in terms of their relative influence using a combination of content and persuasive argumentation features. Our results show that although content alone is predictive of a speaker's influence rank, persuasive argumentation also affects such indices.
Abstract:
Event extraction from texts aims to detect structured information such as what has happened, to whom, where and when. Event extraction and visualization are typically treated as two separate tasks. In this paper, we propose a novel approach based on probabilistic modelling to jointly extract and visualize events from tweets, so that each task benefits from the other. We model each event as a joint distribution over named entities, a date, a location and event-related keywords. Moreover, both tweets and event instances are associated with coordinates in the visualization space. The manifold assumption that the intrinsic geometry of tweets is a low-rank, non-linear manifold within the high-dimensional space is incorporated into the learning framework through a regularization term. Experimental results show that the proposed approach can effectively handle both event extraction and visualization and performs remarkably better than both a state-of-the-art event extraction method and a pipeline approach to event extraction and visualization.
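One standard way to encode such a manifold assumption, offered here only as a guess at the general form rather than the paper's exact model, is a graph Laplacian regularizer that penalizes placing similar tweets at distant visualization coordinates. A minimal numpy sketch with illustrative names:

```python
import numpy as np

def laplacian_regularizer(Y, W):
    """Graph Laplacian penalty: sum_ij W[i,j] * ||Y[i] - Y[j]||^2 = 2 * tr(Y^T L Y).
    Y: (n, d) visualization coordinates; W: (n, n) symmetric tweet similarities."""
    L = np.diag(W.sum(axis=1)) - W  # unnormalized graph Laplacian
    return 2.0 * np.trace(Y.T @ L @ Y)

# Adding this term to the model's objective pushes tweets with high
# similarity W[i,j] towards nearby coordinates in the visualization space.
rng = np.random.default_rng(1)
Y = rng.normal(size=(5, 2))
W = np.abs(rng.normal(size=(5, 5)))
W = (W + W.T) / 2
np.fill_diagonal(W, 0)
print(laplacian_regularizer(Y, W))
```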
Abstract:
Conventional topic models are ineffective for topic extraction from microblog messages, since the lack of structure and context among the posts results in poor message-level word co-occurrence patterns. In this work, we organize microblog posts as conversation trees based on reposting and replying relations, which enriches context information and alleviates data sparseness. Our model generates words according to topic dependencies derived from the conversation structures. Specifically, we differentiate leader messages, which initiate key aspects of previously focused topics or shift the focus to different topics, from follower messages, which do not introduce new information but simply echo topics from the messages they repost or reply to. Our model captures the different extents to which leader and follower messages may contain key topical words, thus further enhancing the quality of the induced topics. The results of thorough experiments demonstrate the effectiveness of our proposed model.
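As a sketch of the first, purely structural step, organizing posts into conversation trees from repost and reply links; the field names 'id' and 'parent_id' are hypothetical stand-ins for whatever the dataset provides.

```python
from collections import defaultdict

def build_conversation_trees(posts):
    """Organize posts into conversation trees via repost/reply links.
    Each post is a dict with hypothetical fields 'id' and 'parent_id'
    (None marks a conversation root)."""
    children = defaultdict(list)
    roots = []
    for p in posts:
        if p["parent_id"] is None:
            roots.append(p["id"])
        else:
            children[p["parent_id"]].append(p["id"])

    def subtree(node):
        return {node: [subtree(c) for c in children[node]]}

    return [subtree(r) for r in roots]

posts = [
    {"id": 1, "parent_id": None},  # candidate leader message
    {"id": 2, "parent_id": 1},     # follower: echoes the message it replies to
    {"id": 3, "parent_id": 1},
    {"id": 4, "parent_id": None},
]
print(build_conversation_trees(posts))  # [{1: [{2: []}, {3: []}]}, {4: []}]
```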
Abstract:
We study the problem of detecting sentences describing adverse drug reactions (ADRs) and frame it as binary classification. We investigate different neural network (NN) architectures for ADR classification. In particular, we propose two new neural network models: the Convolutional Recurrent Neural Network (CRNN), which concatenates convolutional neural networks with recurrent neural networks, and the Convolutional Neural Network with Attention (CNNA), which adds attention weights to convolutional neural networks. We evaluate various NN architectures on a Twitter dataset containing informal language and an Adverse Drug Effects (ADE) dataset constructed by sampling from MEDLINE case reports. Experimental results show that on both datasets all the NN architectures considerably outperform traditional maximum entropy classifiers trained on n-grams with different weighting strategies. On the Twitter dataset, all the NN architectures perform similarly, but on the ADE dataset the plain CNN performs better than the more complex CNN variants. Nevertheless, CNNA allows the visualisation of the attention weights of words when making classification decisions and is hence more appropriate for extracting word subsequences describing ADRs.
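A minimal PyTorch sketch in the spirit of CNNA, a convolutional sentence classifier whose learned attention weights over word positions can be inspected; the layer sizes and other details are our assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CNNA(nn.Module):
    """Convolutional classifier with positional attention: attention weights
    over the convolved positions pool the features, so the weights can be
    visualized per word when making a classification decision."""
    def __init__(self, vocab_size, emb_dim=100, n_filters=100, kernel=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel, padding=kernel // 2)
        self.attn = nn.Linear(n_filters, 1)
        self.out = nn.Linear(n_filters, 1)

    def forward(self, tokens):                         # tokens: (batch, seq_len)
        h = self.emb(tokens).transpose(1, 2)           # (batch, emb_dim, seq_len)
        h = torch.relu(self.conv(h)).transpose(1, 2)   # (batch, seq_len, n_filters)
        a = torch.softmax(self.attn(h).squeeze(-1), dim=1)  # (batch, seq_len)
        pooled = (a.unsqueeze(-1) * h).sum(dim=1)      # attention-weighted sum
        return self.out(pooled).squeeze(-1), a         # logit + weights to inspect

model = CNNA(vocab_size=5000)
logits, weights = model(torch.randint(0, 5000, (8, 40)))
print(logits.shape, weights.shape)  # torch.Size([8]) torch.Size([8, 40])
```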
Abstract:
This paper presents a study in a field poorly explored for the Portuguese language: modality and its automatic tagging. Our main goal was to find a set of attributes for creating automatic taggers with improved performance over the bag-of-words (bow) approach. Performance was measured using precision, recall and F1. Because it is a relatively unexplored field, the study covers the creation of the corpus (composed of eleven verbs), the use of a parser to extract syntactic and semantic information from the sentences, and a machine learning approach to identify modality values. Based on three different sets of attributes – from the trigger itself, the trigger's path (from the parse tree) and the context – the system creates a tagger for each verb, achieving (for almost every verb) an improvement in F1 when compared to the traditional bow approach.
Abstract:
This paper describes various experiments investigating author profiling of tweets in 4 different languages – English, Dutch, Italian, and Spanish. Profiling consists of age and gender classification, as well as regression on 5 personality dimensions – extroversion, stability, agreeableness, openness, and conscientiousness. Different sets of features were tested – bag-of-words, word ngrams, POS ngrams, and averaged word embeddings. SVM was used as the classifier. Tf-idf worked best for most English tasks, while for most tasks in the other languages the combination of the best features worked better.
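A minimal scikit-learn sketch of this setup, tf-idf features with a linear SVM for classification and a linear SVR for one personality dimension; the tiny dataset and labels are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC, LinearSVR

tweets = ["sample tweet one", "another sample tweet",
          "yet another one", "more text here"]  # placeholder tweets
gender = ["f", "m", "f", "m"]                   # placeholder labels
extroversion = [0.2, -0.1, 0.4, 0.0]            # placeholder scores

# Classification (e.g. gender): tf-idf features + linear SVM.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(tweets, gender)

# Regression on one personality dimension with the same features.
reg = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVR())
reg.fit(tweets, extroversion)

print(clf.predict(["one more sample"]), reg.predict(["one more sample"]))
```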
Abstract:
The central thesis of this report is that human language is NP-complete. That is, the process of comprehending and producing utterances is bounded above by the class NP, and below by NP-hardness. This constructive complexity thesis has two empirical consequences. The first is to predict that a linguistic theory outside NP is unnaturally powerful. The second is to predict that a linguistic theory easier than NP-hard is descriptively inadequate. To prove the lower bound, I show that the following three subproblems of language comprehension are all NP-hard: deciding whether a given sound is a possible sound of a given language; disambiguating a sequence of words; and computing the antecedents of pronouns. The proofs are based directly on the empirical facts of the language user's knowledge, under an appropriate idealization, and are therefore invariant across linguistic theories. (For this reason, no knowledge of linguistic theory is needed to understand the proofs, only knowledge of English.) To illustrate the usefulness of the upper bound, I show that two widely accepted analyses of the language user's knowledge (of syntactic ellipsis and phonological dependencies) lead to complexity outside of NP (PSPACE-hard and undecidable, respectively). Next, guided by the complexity proofs, I construct alternate linguistic analyses that are strictly superior on descriptive grounds, as well as being less complex computationally (in NP). The report also presents a new framework for linguistic theorizing that resolves important puzzles in generative linguistics and guides the mathematical investigation of human language.
Abstract:
This study investigates plagiarism detection, with an application in forensic contexts. Two types of data were collected for the purposes of this study. Data in the form of written texts were obtained from two Portuguese universities and from a Portuguese newspaper; these were analysed linguistically to identify instances of verbatim, morpho-syntactic, lexical and discursive overlap. Data in the form of surveys were obtained from two higher education institutions in Portugal and another two in the United Kingdom; these were analysed using a 2 by 2 between-groups univariate analysis of variance (ANOVA) to reveal cross-cultural divergences in the perceptions of plagiarism. The study discusses the legal and social circumstances that may contribute to adopting a punitive approach to plagiarism or, conversely, to rejecting punishment. The research adopts a critical approach to plagiarism detection: on the one hand, it describes the linguistic strategies adopted by plagiarists when borrowing from other sources; on the other, it discusses the relationship between these instances of plagiarism and the context in which they appear. A focus of this study is whether plagiarism involves an intention to deceive and, in this case, whether forensic linguistic evidence can provide clues to this intentionality. It also evaluates current computational approaches to plagiarism detection and identifies strategies that these systems fail to detect. Specifically, a method is proposed to detect translingual plagiarism. The findings indicate that, although cross-cultural aspects influence the different perceptions of plagiarism, a distinction needs to be made between intentional and unintentional plagiarism. The linguistic analysis demonstrates that linguistic elements can contribute to finding clues to the plagiarist's intentionality. Furthermore, the findings show that translingual plagiarism can be detected using the proposed method, and that plagiarism detection software can be improved using existing computer tools.
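A hedged sketch of a 2 by 2 between-groups ANOVA of the kind described, using statsmodels; the scores and the two factors (country and a second two-level factor, here respondent role) are placeholders for the survey's actual design.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Placeholder survey data: perception-of-plagiarism scores in a 2x2
# between-groups design (country x respondent role).
df = pd.DataFrame({
    "score":   [4, 5, 3, 4, 2, 3, 5, 4, 3, 2, 4, 5],
    "country": ["PT", "PT", "PT", "UK", "UK", "UK"] * 2,
    "role":    ["student"] * 6 + ["staff"] * 6,
})

model = ols("score ~ C(country) * C(role)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # main effects and interaction
```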
Abstract:
This paper discusses the claim of the situatedness of research in both theoretical and applied linguistics and some of its implications and argues that it is linked to the performativity of all assertions, including scientific ones. More importantly, I argue that it is the regressive infinity of performativity that makes inevitable the passage from presumably 'dispassionate' research to militancy.
Abstract:
Motivated by a recently proposed biologically inspired face recognition approach, we investigated the relation between human behavior and a computational model based on Fourier-Bessel (FB) spatial patterns. We measured human recognition performance on FB-filtered face images using an 8-alternative forced-choice method. Test stimuli were generated by converting the images from the spatial to the FB domain, filtering the resulting coefficients with a band-pass filter, and finally taking the inverse FB transformation of the filtered coefficients. The performance of the computational models was tested using a simulation of the psychophysical experiment. In the FB model, face images were first filtered by simulated V1-type neurons and later analyzed globally for their content of FB components. In general, human contrast sensitivity was higher for radially than for angularly filtered images, but both functions peaked at the 11.3-16 frequency interval. The FB-based model showed similar behavior with regard to peak position and relative sensitivity, but had a wider frequency bandwidth and a narrower response range. The response patterns of two alternative models, based on local FB analysis and on raw luminance, diverged strongly from the human behavior patterns. These results suggest that human performance can be constrained by the type of information conveyed by polar patterns, and consequently that humans might use FB-like spatial patterns in face processing.
Abstract:
We evaluated the performance of a novel procedure for segmenting mammograms and detecting clustered microcalcifications in two types of image sets obtained by digitizing mammograms with either a laser scanner or a conventional "optical" scanner. Specific regions of the digital mammograms, with and without clustered microcalcifications, were identified and selected. A remarkable increase in image intensity was noticed in the images from the optical scanner compared with the original mammograms. A procedure based on a polynomial correction was developed to compensate for the differences between the scanners' characteristic curves and those of the films. The processing scheme was applied to both sets, before and after the polynomial correction. The results clearly indicated the influence of mammogram digitization on the performance of processing schemes intended to detect microcalcifications. Without the polynomial intensity correction, the image processing techniques showed better sensitivity in detecting microcalcifications in the images from the laser scanner. However, when the polynomial correction was applied to the images from the optical scanner, no difference in performance was observed between the two types of images. (C) 2008 SPIE and IS&T [DOI: 10.1117/1.3013544]
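A hedged numpy sketch of such a polynomial correction: fit a polynomial that maps the optical scanner's characteristic curve onto the reference (film/laser) curve, then apply it to image intensities. The curves below are synthetic placeholders, not the paper's measurements.

```python
import numpy as np

# Placeholder characteristic curves: gray level recorded by each device
# as a function of film optical density; the optical scanner reads brighter.
density        = np.linspace(0.2, 3.0, 15)
optical_gray   = 40 + 70 * density + 4 * density**2  # synthetic optical-scanner curve
reference_gray = 30 + 75 * density                   # synthetic film/laser reference

# Fit a polynomial mapping optical-scanner gray levels to the reference curve.
coeffs = np.polyfit(optical_gray, reference_gray, deg=3)
correct = np.poly1d(coeffs)

image = np.array([[60.0, 120.0], [180.0, 240.0]])    # toy optical-scanner image
print(correct(image))                                # intensity-corrected image
```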
Abstract:
Chemical reactivity, photolability, and computational studies of the ruthenium nitrosyl complex with a substituted cyclam, fac-[Ru(NO)Cl2(κ³N4,N8,N11(1-carboxypropyl)cyclam)]Cl·H2O ((1-carboxypropyl)cyclam = 3-(1,4,8,11-tetraazacyclotetradecan-1-yl)propionic acid) (I), are described. The chloride ligands do not undergo aquation reactions (at 25 °C, pH 3). The rate of nitric oxide (NO) dissociation (k_obs-NO) upon reduction of I is 2.8 s⁻¹ at 25 ± 1 °C (in 0.5 mol L⁻¹ HCl), which is close to the highest value found for related complexes. The uncoordinated carboxyl of I has a pKa of ~3.3, close to that of the carboxyl of the non-coordinated (1-carboxypropyl)cyclam (pKa = 3.4). Two additional pKa values were found for I, at ~8.0 and ~11.5. Upon electrochemical reduction or under irradiation with light (λirr = 350 or 520 nm; pH 7.4), I releases NO in aqueous solution. The cyclam ring N bound to the carboxypropyl group is not coordinated, resulting in a fac configuration that affects the properties and chemical reactivities of I, especially as an NO donor, compared with analogous trans complexes. Among the computational models tested, B3LYP/ECP28MDF, cc-pVDZ resulted in the smallest errors for the geometry of I. The computational data helped clarify the experimental acid-base equilibria and indicated the most favourable site for the second deprotonation, which follows that of the carboxyl group. Furthermore, it showed that by changing the pH it is possible to modulate the electron density of I through deprotonation. The calculated N-O bond length and the Ru/NO charge ratio indicated that the predominant canonical structure is [Ru(III)NO], but the Ru-N-O bond angles and bond index (b.i.) values were less clear: the angles suggested that [Ru(II)NO⁺] could contribute to the electronic structure of I, and the b.i. values indicated a contribution from [Ru(IV)NO⁻]. Considering that some experimental data are consistent with a [Ru(II)NO⁺] description, while others are in agreement with [Ru(III)NO], the best description of I would be a linear combination of the three canonical forms, with a higher weight for [Ru(II)NO⁺] and [Ru(III)NO].