Abstract:
This work is a case study of applying nonparametric statistical methods to corpus data. We show how to use ideas from permutation testing to answer linguistic questions related to morphological productivity and type richness. In particular, we study the use of the suffixes -ity and -ness in the 17th-century part of the Corpus of Early English Correspondence within the framework of historical sociolinguistics. Our hypothesis is that the productivity of -ity, as measured by type counts, is significantly low in letters written by women. To test such hypotheses, and to facilitate exploratory data analysis, we take the approach of computing accumulation curves for types and hapax legomena. We have developed an open source computer program which uses Monte Carlo sampling to compute the upper and lower bounds of these curves for one or more levels of statistical significance. By comparing the type accumulation from women’s letters with the bounds, we are able to confirm our hypothesis.
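The core of the bound computation can be sketched briefly. The following Python fragment is an illustrative sketch of the Monte Carlo idea, not the authors' released program: shuffle the corpus tokens repeatedly, record each permutation's running type count, and take pointwise quantiles as the bounds at a given significance level.

    import random

    def type_accumulation(tokens):
        # Running count of distinct types after each token.
        seen, curve = set(), []
        for tok in tokens:
            seen.add(tok)
            curve.append(len(seen))
        return curve

    def monte_carlo_bounds(tokens, n_permutations=1000, alpha=0.05):
        # Pointwise lower/upper bounds on the type accumulation curve
        # under random orderings of the corpus tokens.
        tokens = list(tokens)
        curves = []
        for _ in range(n_permutations):
            random.shuffle(tokens)
            curves.append(type_accumulation(tokens))
        k = int(alpha / 2 * n_permutations)
        lower, upper = [], []
        for i in range(len(tokens)):
            column = sorted(c[i] for c in curves)
            lower.append(column[k])
            upper.append(column[-k - 1])
        return lower, upper

An observed curve, such as the accumulation of -ity types in women's letters, that dips below the lower bound is then significantly low at roughly the chosen level.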
Abstract:
This study is a pragmatic description of the evolution of the genre of English witchcraft pamphlets from the mid-sixteenth century to the end of the seventeenth century. Witchcraft pamphlets were produced for a new kind of readership: the semi-literate, uneducated masses. The central hypothesis of this study is that publishing for the masses entailed rethinking the ways of writing and printing texts. Analysis of the use of typographical variation and illustrations indicates how printers and publishers catered to the tastes and expectations of this new audience. Analysis of the language of witchcraft pamphlets shows how pamphlet writers took the new readership into account by transforming formal written source materials (trial proceedings) into more immediate ways of writing. The material for this study comes from the Corpus of Early Modern English Witchcraft Pamphlets, which has been compiled by the author. The multidisciplinary analysis incorporates both visual and linguistic aspects of the texts, with methodologies and theoretical insights adopted eclectically from historical pragmatics, genre studies, book history, corpus linguistics, systemic functional linguistics and cognitive psychology. The findings are anchored in the socio-historical context of early modern publishing, reading, literacy and witchcraft beliefs. The study shows not only how consideration of a new audience by both authors and printers influenced the development of a genre, but also the value of combining visual and linguistic features in pragmatic analyses of texts.
De "de" : Estudio histórico-comparativo de los usos y la semántica de la preposición "de" en español
Abstract:
This study is an attempt to describe and analyse the use of the preposition "de" on the basis of a diachronic corpus, with emphasis on the different semantic relations it establishes. Starting from a total of more than 16,000 instances of "de", we established 48 categories of use, corresponding to four types of syntactic construction, namely the use of "de" as a complement of nouns (CN), verbs (CV) and adjectives (CA), and, finally, its use as the nucleus of independent adverbial expressions (CI). The study consists of three main parts. Part I introduces Cognitive Linguistics, the essential theoretical basis of the work; more precisely, it introduces concepts such as prototype theory, conceptual metaphor theory and Cognitive Grammar, especially the notions of "reference point" and "intrinsic relationship" (Langacker 1995, 1999). Part II contains the analysis of the 48 categories, presenting and discussing almost 2,000 examples of the contextual use of "de" drawn from the diachronic corpus. The main results of the analysis can be summarized in the following points: The use of "de" remains essentially the same today as it was 800 years ago, in the sense that all 48 categories are attested in every period of the corpus. The use of "de" as a nominal complement increases over time, whereas its use as a verbal complement decreases. In the nominal context it is above all the more abstract possessive relations that become more frequent, while in the verbal context the relations that become less frequent are those of separation/removal, cause, agent and indefinite partitive. The eighteenth century stands out as a period of transition between an earlier state of affairs and a later one, especially with regard to the increasingly abstract character of the possessive relations and the decline of the adverbal categories of cause, agent and partitive. Despite variation in the immediate context of use, the semantic core of "de" remains unchanged. Part III takes the results of the analysis in Part II as its starting point and attempts to distinguish the semantic contribution of the preposition "de" to its context of use from the value of the relation as a whole. Drawing on the methodology for determining the basic meaning of a preposition and the methodology for determining what constitute its distinct meanings (Tyler & Evans 2003a, 2003b), we arrive at the hypothesis that "de" has four basic meanings, namely 'point of departure', 'topic/matter', 'part/whole' and 'possession'. This hypothesis, based on Tyler and Evans's methodologies and on the results of the corpus analysis, is then tested empirically by means of two questionnaires designed to establish to what extent the semantic distinctions arrived at theoretically are recognized by native speakers of the language (cf. Raukko 2003). The combined result of the two approaches both reinforces and refines the hypothesis. The questionnaire data seem to support the idea that the semantic core of "de" is complex, consisting of the four values mentioned. However, each of these basic values constitutes a local prototype around which a complex of semantic nuances derived from the prototype is built.
The final conclusion is that speakers are aware of the four postulated basic values, but that they also distinguish more detailed nuances, such as the notions of 'cause', 'agent', 'instrument', 'purpose', 'quality', etc. In other words, "de" is a complex polysemous element whose semantic structure can be described as a family resemblance centred on four basic values, around which there is a series of more specific nuances that also constitute values of the preposition in their own right. We further believe that this semantic characterization holds for all periods in the history of Spanish, with small shifts in the relative weight of the various nuances, which is related to the diachronic variation observed in the use of "de".
Abstract:
The aim of the study is to investigate the use of finlandisms from a historical perspective: how they have been viewed from the mid-19th century to the present day, and the effect of language planning on their use. A finlandism is a word, a phrase, or a structure that is used only in the Swedish varieties used in Finland (i.e. in Finland Swedish), or that is used in these varieties in a different meaning than in the Swedish used in Sweden. Various aspects of Finland-Swedish language planning are discussed in relation to language planning generally; in addition, the relation of Finland Swedish to Standard Swedish and standard regional varieties is discussed, and various types of finlandisms are analysed in detail. A comprehensive picture is provided of the emergence and evolution of the ideology of language planning from the mid-19th century up until today. A theoretical model of corpus planning is presented and its effect on linguistic praxis described. One result of the study is that the belief among Finland-Swedish language planners that the Swedish language in Finland must not be allowed to become distanced from Standard Swedish has been widely adopted by the average Finland Swede, particularly during the interwar period, following the publication of Hugo Bergroth's work Finlandssvenska in 1917. Criticism of this language-planning ideology started to appear in the 1950s and intensified in the 1970s. However, language planning and the basis for this conception of language continue to enjoy strong support among Swedish-speaking Finns. I show that the editing of Finnish literary texts written in Swedish has often been somewhat amateurish and the results not always linguistically appropriate, and that Swedish publishers have in fact adopted a rather liberal attitude towards finlandisms. My conclusion is that language planning has achieved rather modest results in its resistance to finlandisms: most of the finlandisms used in 1915 were still in use in 2005. Finlandisms occur among speakers of all ages, and even among academically educated people despite their more elevated style. The most common finlandisms were used by informants of all ages. The most firmly rooted finlandisms are the most established ones, in other words those that are stylistically neutral and seemingly genuinely Swedish, but that are nevertheless strongly supported by Finnish and display a shift in meaning compared with Standard Swedish.
Abstract:
Language Documentation and Description as Language Planning: Working with Three Signed Minority Languages

Sign languages are minority languages that typically have a low status in society. Language planning has traditionally been controlled from outside the sign-language community. Even though signed languages lack a written form, dictionaries have played an important role in language description and as tools in foreign language learning. The background to the present study on sign language documentation and description as language planning is empirical research in three dictionary projects in Finland-Swedish Sign Language, Albanian Sign Language, and Kosovar Sign Language. The study consists of an introductory article and five detailed studies which address language planning from different perspectives. The theoretical basis of the study is sociocultural linguistics. The research methods used were participant observation, interviews, focus group discussions, and document analysis. The primary research questions are the following: (1) What is the role of dictionary and lexicographic work in language planning, in research on undocumented signed language, and in relation to the language community as such? (2) What factors are particular challenges in the documentation of a sign language and should therefore be given special attention during lexicographic work? (3) Is a conventional dictionary a valid tool for describing an undocumented sign language? The results indicate that lexicographic work has a central part to play in language documentation, both as part of basic research on undocumented sign languages and for status planning. Existing dictionary work has contributed new knowledge about the languages and the language communities. The lexicographic work adds to the linguistic advocacy work done by the community itself with the aim of vitalizing the language, empowering the community, receiving governmental recognition for the language, and improving the linguistic (human) rights of the language users. The history of signed languages as low status languages has consequences for language planning and lexicography. One challenge that the study discusses is the relationship between the sign-language community and the hearing sign linguist. In order to make it possible for the community itself to take the lead in a language planning process, raising linguistic awareness within the community is crucial. The results give rise to questions of whether lexicographic work is of more importance for status planning than for corpus planning. The use of a conventional dictionary as a tool for describing an undocumented sign language is criticised. The study discusses differences between signed and spoken/written languages that are challenging for lexicographic presentations. Alternative electronic lexicographic approaches including both lexicon and grammar are also discussed.

Keywords: sign language, Finland-Swedish Sign Language, Albanian Sign Language, Kosovar Sign Language, language documentation and description, language planning, lexicography
Abstract:
In this paper we present simple methods for construction and evaluation of finite-state spell-checking tools using an existing finite-state lexical automaton, freely available finite-state tools and Internet corpora acquired from projects such as Wikipedia. As an example, we use a freely available open-source implementation of Finnish morphology, made with traditional finite-state morphology tools, and demonstrate rapid building of Northern Sámi and English spell checkers from tools and resources available from the Internet.
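The pipeline can be illustrated with a plain-Python sketch in which a word set harvested from a corpus dump (e.g. extracted Wikipedia text) stands in for the finite-state lexical automaton, and an edit-distance-1 generator stands in for a weighted error-model transducer; a finite-state implementation would compose these instead. The Latin alphabet default below is an assumption and would need extending for Finnish or Northern Sámi.

    import re
    from collections import Counter

    def build_lexicon_and_freqs(corpus_text):
        # Word list and frequencies from a plain-text corpus dump.
        words = re.findall(r"\w+", corpus_text.lower())
        freqs = Counter(words)
        return set(freqs), freqs

    def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
        # All strings one edit operation away from `word`.
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = {a + b[1:] for a, b in splits if b}
        swaps = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
        replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
        inserts = {a + c + b for a, b in splits for c in alphabet}
        return deletes | swaps | replaces | inserts

    def suggest(word, lexicon, freqs, n=5):
        # Accept words in the lexicon; otherwise rank edit-distance-1
        # candidates by corpus frequency.
        if word in lexicon:
            return [word]
        candidates = edits1(word) & lexicon
        return sorted(candidates, key=lambda w: -freqs[w])[:n]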
Abstract:
Language software applications encounter new words, e.g., acronyms, technical terminology, names or compounds of such words. In order to add new words to a lexicon, we need to indicate their inflectional paradigm. We present a new, generally applicable method for creating an entry generator, i.e. a paradigm guesser, for finite-state transducer lexicons. As a guesser tends to produce numerous suggestions, it is important that the correct suggestions be among the first few candidates. We prove some formal properties of the method and evaluate it on Finnish, English and Swedish full-scale transducer lexicons. We use the open-source Helsinki Finite-State Technology to create finite-state transducer lexicons from existing lexical resources and automatically derive guessers for unknown words. The method has a recall of 82-87% and a precision of 71-76% for the three test languages. The model needs no external corpus and can therefore serve as a baseline.
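The idea of a suffix-based paradigm guesser can be sketched as follows; the plain (word, paradigm) input and the paradigm identifiers are hypothetical simplifications, since the paper derives its guessers directly from transducer lexicons. Candidates are ranked by frequency so that the correct suggestion tends to appear among the first few.

    from collections import defaultdict, Counter

    class ParadigmGuesser:
        # Guess inflectional paradigms for unknown words from the
        # final letters of known lexicon entries.

        def __init__(self, lexicon, max_suffix=6):
            # lexicon: iterable of (word, paradigm_id) pairs
            self.max_suffix = max_suffix
            self.stats = defaultdict(Counter)
            for word, paradigm in lexicon:
                for k in range(1, min(max_suffix, len(word)) + 1):
                    self.stats[word[-k:]][paradigm] += 1

        def guess(self, word, n=3):
            # Longest suffix with evidence wins; rank paradigms by
            # frequency so the right guess is among the first few.
            for k in range(min(self.max_suffix, len(word)), 0, -1):
                counts = self.stats.get(word[-k:])
                if counts:
                    return [p for p, _ in counts.most_common(n)]
            return []

    # Usage with hypothetical paradigm ids:
    lexicon = [("talo", "N_TALO"), ("valo", "N_TALO"), ("lasi", "N_LASI")]
    guesser = ParadigmGuesser(lexicon)
    print(guesser.guess("kalo"))   # -> ['N_TALO']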
Abstract:
N-gram language models and lexicon-based word recognition are popular methods in the literature for improving recognition accuracies on online and offline handwritten data. However, very few works deal with the application of these techniques to online Tamil handwritten data. In this paper, we explore methods of developing symbol-level language models and a lexicon from a large Tamil text corpus, and their application to improving symbol and word recognition accuracies. On a test database of around 2000 words, we find that bigram language models improve symbol (3%) and word (8%) recognition accuracies, and that while lexicon methods offer much greater improvements (30%) in word recognition, they depend heavily on choosing the right lexicon. For comparison with the lexicon- and language-model-based methods, we have also explored re-evaluation techniques which use expert classifiers to improve symbol and word recognition accuracies.
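A hedged sketch of how a bigram symbol language model can rescore recognizer output: a Viterbi search over the per-position candidate symbols combines the classifier's scores with bigram log-probabilities. The interface (score dictionaries, an interpolation weight) is illustrative, not the paper's exact formulation.

    def viterbi_rescore(candidates, bigram_logprob, lm_weight=0.5):
        # candidates: list of dicts {symbol: classifier_log_score}, one per position
        # bigram_logprob: function (prev_symbol, symbol) -> log P(symbol | prev)
        # Returns the best-scoring symbol sequence.
        # best[s] = (score of best path ending in s, that path)
        best = {s: (score, [s]) for s, score in candidates[0].items()}
        for frame in candidates[1:]:
            new_best = {}
            for sym, score in frame.items():
                # Best predecessor under classifier + weighted LM score.
                prev_sym, (prev_score, path) = max(
                    best.items(),
                    key=lambda kv: kv[1][0] + lm_weight * bigram_logprob(kv[0], sym),
                )
                total = prev_score + score + lm_weight * bigram_logprob(prev_sym, sym)
                new_best[sym] = (total, path + [sym])
            best = new_best
        return max(best.values(), key=lambda v: v[0])[1]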
Abstract:
In this paper, we present a novel approach that makes use of topic models based on Latent Dirichlet Allocation (LDA) for generating single-document summaries. Our approach is distinguished from other LDA-based approaches in that we identify the summary topics which best describe a given document and only extract sentences from those paragraphs within the document which are highly correlated given the summary topics. This ensures that our summaries always highlight the crux of the document without paying any attention to the grammar and the structure of the documents. Finally, we evaluate our summaries on the DUC 2002 single-document summarization corpus using ROUGE measures. Our summaries had higher ROUGE values and better semantic similarity with the documents than the DUC summaries.
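The gist of the approach can be sketched with scikit-learn (an illustrative approximation; the paper's topic selection and paragraph-correlation criteria are more involved): fit LDA over the document's paragraphs, pick the topics that dominate the document as a whole, and extract the sentences whose topic mixtures concentrate on those summary topics.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    def summarize(paragraphs, sentences, n_topics=5,
                  n_summary_topics=2, n_sentences=3):
        # Pick sentences whose topic mixtures concentrate on the topics
        # that best describe the document as a whole.
        vectorizer = CountVectorizer(stop_words="english")
        X = vectorizer.fit_transform(paragraphs)
        lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
        para_topics = lda.fit_transform(X)        # paragraph-topic mixtures
        doc_topics = para_topics.mean(axis=0)     # document-level mixture
        summary_topics = np.argsort(doc_topics)[-n_summary_topics:]
        sent_topics = lda.transform(vectorizer.transform(sentences))
        scores = sent_topics[:, summary_topics].sum(axis=1)
        chosen = sorted(np.argsort(scores)[-n_sentences:])  # keep document order
        return [sentences[i] for i in chosen]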
Abstract:
When a document corpus is very large, we often need to reduce the number of features. But it is not possible to apply conventional Non-negative Matrix Factorization (NMF) to a billion-by-million matrix, as the matrix may not fit in memory. Here we present a novel online NMF algorithm. Using online NMF, we reduce the original high-dimensional space to a low-dimensional space, and then cluster all the documents in the reduced space using the k-means algorithm. We show experimentally that by processing small subsets of documents we are able to achieve good performance. The proposed method outperforms existing algorithms.
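The reduce-then-cluster pipeline can be sketched with scikit-learn's MiniBatchNMF (available from scikit-learn 1.1), used here as a stand-in for the authors' online NMF; the point is that the factorization is fitted on small batches of documents rather than on the full matrix.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import MiniBatchNMF
    from sklearn.cluster import KMeans

    def cluster_documents(documents, n_topics=100, n_clusters=20):
        # Reduce documents to a low-dimensional NMF space in mini-batches,
        # then cluster them with k-means in the reduced space.
        X = TfidfVectorizer(stop_words="english").fit_transform(documents)
        nmf = MiniBatchNMF(n_components=n_topics, batch_size=1024,
                           random_state=0)
        W = nmf.fit_transform(X)   # documents in the reduced space
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        return km.fit_predict(W)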
Abstract:
There are many popular models available for the classification of documents, such as the Naïve Bayes classifier, k-Nearest Neighbors and the Support Vector Machine. In all these cases, the representation is based on the “bag of words” model, which does not capture the actual semantic meaning of a word in a particular document. Semantics are better captured by the proximity of words and their occurrence in the document. We propose a new “Bag of Phrases” model to capture this discriminative power of phrases for text classification. We present a novel algorithm to extract phrases from the corpus using the well-known topic model, Latent Dirichlet Allocation (LDA), and to integrate them into the vector space model for classification. Experiments show better classifier performance with the new Bag of Phrases model than with related representation models.
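One hedged way to realize LDA-driven phrase extraction (a simplification, not the paper's exact criterion): treat an adjacent word pair as a phrase candidate when both words rank among the top words of the same LDA topic.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    def topic_phrases(documents, n_topics=20, top_k=30):
        # Candidate phrases: adjacent word pairs whose members are both
        # top words of the same LDA topic.
        vec = CountVectorizer(stop_words="english")
        X = vec.fit_transform(documents)
        lda = LatentDirichletAllocation(n_components=n_topics,
                                        random_state=0).fit(X)
        vocab = np.array(vec.get_feature_names_out())
        topic_top = [set(vocab[np.argsort(t)[-top_k:]])
                     for t in lda.components_]

        bigram_vec = CountVectorizer(ngram_range=(2, 2))
        bigram_vec.fit(documents)
        return [p for p in bigram_vec.get_feature_names_out()
                if any(set(p.split()) <= top for top in topic_top)]

The selected phrases can then be added to the vectorizer vocabulary alongside the unigrams, and the resulting vector space representation fed to any standard classifier.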
Abstract:
Latent variable methods such as PLCA (Probabilistic Latent Component Analysis) have been successfully used for the analysis of non-negative signal representations. In this paper, we formulate PLCS (Probabilistic Latent Component Segmentation), which models each time frame of a spectrogram as a spectral distribution. Given the signal spectrogram, the segmentation boundaries are estimated using a maximum-likelihood approach. For an efficient solution, the algorithm imposes a hard constraint that each segment is modelled by a single latent component; this constraint makes ML boundary estimation solvable by dynamic programming. Unlike earlier ML segmentation techniques, the PLCS framework does not impose a parametric assumption. PLCS can be naturally extended to model coarticulation between successive phones. Experiments on the TIMIT corpus show that the proposed technique is promising compared to most state-of-the-art speech segmentation algorithms.
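The dynamic-programming boundary estimation can be illustrated in simplified form: each candidate segment is modelled by a single average spectral distribution (one latent component), and the split of the frame sequence into a fixed number of segments that maximizes the total log-likelihood is found by DP. The exact PLCS objective and its coarticulation extension are beyond this sketch.

    import numpy as np

    def segment_spectrogram(S, n_segments):
        # S: (n_frames, n_bins) non-negative array; rows are normalized
        # to spectral distributions. Returns segment start indices.
        P = S / S.sum(axis=1, keepdims=True)
        n = len(P)
        csum = np.vstack([np.zeros(P.shape[1]), np.cumsum(P, axis=0)])

        def score(i, j):
            # Log-likelihood of frames i..j-1 under the segment's mean
            # distribution (the ML single component for this cost).
            mean = (csum[j] - csum[i]) / (j - i)
            return float((csum[j] - csum[i]) @ np.log(mean + 1e-12))

        best = np.full((n_segments + 1, n + 1), -np.inf)
        back = np.zeros((n_segments + 1, n + 1), dtype=int)
        best[0, 0] = 0.0
        for k in range(1, n_segments + 1):
            for j in range(k, n + 1):
                for i in range(k - 1, j):
                    s = best[k - 1, i] + score(i, j)
                    if s > best[k, j]:
                        best[k, j], back[k, j] = s, i
        # Trace back the segment boundaries.
        bounds, j = [], n
        for k in range(n_segments, 0, -1):
            j = back[k, j]
            bounds.append(j)
        return bounds[::-1]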
Abstract:
Scatter/Gather systems are increasingly useful for browsing document corpora. The usability of present-day systems is restricted to monolingual corpora, and their methods for clustering and labeling do not easily extend to the multilingual setting, especially in the absence of dictionaries/machine translation. In this paper, we study the cluster labeling problem for multilingual corpora in the absence of machine translation, using comparable corpora instead. Using a variational approach, we show that multilingual topic models can effectively handle the cluster labeling problem, which in turn allows us to design a novel Scatter/Gather system, ShoBha. Experimental results on three datasets, namely the Canadian Hansards corpus, the entire overlapping Wikipedia of English, Hindi and Bengali articles, and a trilingual news corpus containing 41,000 articles, confirm the utility of the proposed system.
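Assuming a multilingual topic model has already been fitted on the comparable corpora, so that the topic space is shared across languages, labeling a cluster reduces to reading off topic-word distributions in each language; the matrix layouts in this sketch are assumptions, not the paper's implementation.

    import numpy as np

    def label_cluster(cluster_doc_topics, topic_word, vocabularies, n_words=5):
        # cluster_doc_topics: (n_docs, n_topics) mixtures of the cluster's docs
        # topic_word: dict lang -> (n_topics, vocab_size) topic-word matrix
        # vocabularies: dict lang -> word list, index-aligned with topic_word
        dominant = int(np.argmax(cluster_doc_topics.mean(axis=0)))
        labels = {}
        for lang, tw in topic_word.items():
            top = np.argsort(tw[dominant])[-n_words:][::-1]
            labels[lang] = [vocabularies[lang][i] for i in top]
        return labels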
Abstract:
Automatic and accurate detection of the closure-burst transition events of stops and affricates serves many applications in speech processing. A temporal measure named the plosion index is proposed to detect such events, which are characterized by an abrupt increase in energy. Using the maxima of the pitch-synchronous normalized cross correlation as an additional temporal feature, a rule-based algorithm is designed that aims at selecting only those events associated with the closure-burst transitions of stops and affricates. The performance of the algorithm, characterized by receiver operating characteristic curves and temporal accuracy, is evaluated using the labeled closure-burst transitions of stops and affricates of the entire TIMIT test and training databases. The robustness of the algorithm is studied with respect to global white and babble noise as well as local noise using the TIMIT test set and on telephone quality speech using the NTIMIT test set. For these experiments, the proposed algorithm, which does not require explicit statistical training and is based on two one-dimensional temporal measures, gives a performance comparable to or better than the state-of-the-art methods. In addition, to test the scalability, the algorithm is applied on the Buckeye conversational speech corpus and databases of two Indian languages. (C) 2014 Acoustical Society of America.
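One plausible formulation of such a measure (the window offsets below are illustrative, not the values used in the paper): the absolute amplitude at each sample divided by the average absolute amplitude over a window some milliseconds earlier, so that abrupt energy onsets yield large values.

    import numpy as np

    def plosion_index(signal, fs, gap_ms=6.0, win_ms=16.0):
        # Ratio of the absolute amplitude at each sample to the average
        # absolute amplitude in a preceding window; large values flag
        # abrupt increases in energy such as closure-burst transitions.
        x = np.abs(np.asarray(signal, dtype=float))
        gap = int(gap_ms * fs / 1000)
        win = int(win_ms * fs / 1000)
        pi = np.zeros_like(x)
        for n in range(gap + win, len(x)):
            baseline = x[n - gap - win : n - gap].mean()
            pi[n] = x[n] / (baseline + 1e-8)
        return pi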
Abstract:
This paper describes a spatio-temporal registration approach for speech articulation data obtained from electromagnetic articulography (EMA) and real-time Magnetic Resonance Imaging (rtMRI). This is motivated by the potential for combining the complementary advantages of both types of data. The registration method is validated on EMA and rtMRI datasets obtained at different times, but using the same stimuli. The aligned corpus offers the advantages of high temporal resolution (from EMA) and a complete mid-sagittal view (from rtMRI). The co-registration also yields optimum placement of EMA sensors as articulatory landmarks on the magnetic resonance images, thus providing richer spatio-temporal information about articulatory dynamics. (C) 2014 Acoustical Society of America.
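As a hedged illustration of the temporal side of such registration (the paper's actual method may differ), dynamic time warping is a common way to align two feature sequences recorded for the same stimuli, e.g. features derived from EMA trajectories and from rtMRI frames.

    import numpy as np

    def dtw_align(A, B):
        # A, B: np.ndarray feature sequences of shape (frames, dims).
        # Returns the optimal warping path as a list of (i, j) frame pairs.
        n, m = len(A), len(B)
        dist = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                D[i, j] = dist[i - 1, j - 1] + min(
                    D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        # Trace back the optimal path from the end.
        path, i, j = [], n, m
        while i > 0 and j > 0:
            path.append((i - 1, j - 1))
            step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
            if step == 0:
                i, j = i - 1, j - 1
            elif step == 1:
                i -= 1
            else:
                j -= 1
        return path[::-1]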