867 results for corpus, collocations, corpus linguistics, EPTIC
Abstract:
This research aims to highlight the relevance of using the reader's letter genre in Portuguese language classes as a resource for expanding students' textual and argumentative competence. By exploring the functional and structural characteristics of the letter, it is possible to awaken in students knowledge applicable to other argumentative genres and, consequently, to enrich their observations as readers, since work with the letter promotes analysis of and reflection on its content and its compositional elements. Drawing on the principles of Systemic-Functional Linguistics, the study was developed following the assumptions of the Metafunctions of Language (HALLIDAY; MATTHIESSEN, 2004) and Appraisal Theory (MARTIN; WHITE, 2005). The corpus analyzed consists of four reader's letters, evaluated demonstratively, and another twenty-six, presented with qualitative and quantitative evaluations in a table; selected excerpts from a further ten reader's letters are also used. The material covers interpersonal exchanges and occurrences of Engagement, Judgement, and Appreciation. The letter writers' texts were analyzed by identifying linguistic markers of interpersonality and, above all, of Appraisal. In addition, other thematic axes are explored, such as the cuts imposed on the letters, which give rise to the editor's letter, and the importance of textual revision. The quantitative and qualitative results of the research reveal that the letters are written to exchange information and opinions about human behaviour and, at times, aesthetic evaluations of composition, reaction, or social value, in which the author's stance is revealed through authorial Engagement.
Thus, using reader's letters in the classroom enables students to broaden their competence in presenting ideas, interacting with an interlocutor, and taking part in different social practices.
Abstract:
This research focuses on observing how students in basic education use argumentative operators as a textual cohesion resource in a dissertative-argumentative essay following the Enem model. The texts were produced during the 2014 school year by 2nd-year students at a public secondary school. The work was developed in light of Argumentation Theory and Textual Linguistics, as well as conceptions of text and textual genre. The corpus analyzed consists of 20 texts written by 10 students from a basic education school: 10 essays produced at the beginning of the school year and 10 at its end. Based on these texts, it was possible to compare the use of argumentative operators as a cohesive resource. These elements were found to be used in both the first and the second essay; however, the emphasis on the use and diversity of these resources was greater in the second-semester texts. These results were presented qualitatively and quantitatively, with graphs for better visualization. The reflections on reading and text production, and on discursive, textual, and linguistic problems, developed over the school year with the group of students who produced the corpus texts, are factors contributing to this result.
Abstract:
This paper presents an agenda-based user simulator which has been extended to be trainable on real data with the aim of more closely modelling the complex rational behaviour exhibited by real users. The trainable part is formed by a set of random decision points that may be encountered during the process of receiving a system act and responding with a user act. A sample-based method is presented for using real user data to estimate the parameters that control these decisions. Evaluation results are given both in terms of statistics of generated user behaviour and the quality of policies trained with different simulators. Compared to a handcrafted simulator, the trained system provides a much better fit to corpus data and evaluations suggest that this better fit should result in improved dialogue performance. © 2010 Association for Computational Linguistics.
Abstract:
This report describes a computational system with which phonologists may describe a natural language in terms of autosegmental phonology, currently the most advanced theory pertaining to the sound systems of human languages. This system allows linguists to easily test autosegmental hypotheses against a large corpus of data. The system was designed primarily with tonal systems in mind, but also provides support for tree or feature matrix representation of phonemes (as in The Sound Pattern of English), as well as syllable structures and other aspects of phonological theory. Underspecification is allowed, and trees may be specified before, during, and after rule application. The association convention is automatically applied, and other principles such as the conjunctivity condition are supported. The method of representation was designed such that rules are designated in as close a fashion as possible to the existing conventions of autosegmental theory while adhering to a textual constraint for maximum portability.
Abstract:
The Zipf curves of log of frequency against log of rank for a large English corpus of 500 million word tokens, 689,000 word types and for a large Spanish corpus of 16 million word tokens, 139,000 word types are shown to have the usual slope close to –1 for rank less than 5,000, but then for higher ranks they turn to give a slope close to –2. This is apparently mainly due to foreign words and place names. Other Zipf curves for highly inflected Indo-European languages, Irish and ancient Latin, are also given. Because of the larger number of word types per lemma, they remain flatter than the English curve, maintaining a slope of –1 until turning points of about ranks 30,000 for Irish and 10,000 for Latin. A formula which calculates the number of tokens given the number of types is derived in terms of the rank at the turning point: 5,000 for both English and Spanish, 30,000 for Irish and 10,000 for Latin.
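The rank-frequency fit described above is straightforward to reproduce on any word-frequency list. The sketch below is a minimal illustration, not the authors' code: the `zipf_slope` helper and the synthetic counts are invented for the example. It ranks types by frequency and fits a least-squares slope to log frequency against log rank, which for an ideally Zipfian distribution comes out close to –1.

```python
import math
from collections import Counter

def zipf_slope(counts, max_rank=None):
    """Least-squares slope of log(frequency) vs. log(rank)."""
    freqs = sorted(counts.values(), reverse=True)
    if max_rank is not None:
        freqs = freqs[:max_rank]
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Synthetic Zipfian counts: frequency proportional to 1/rank,
# standing in for a real corpus frequency list.
counts = Counter({f"w{rank}": round(100_000 / rank) for rank in range(1, 2001)})
slope = zipf_slope(counts)  # close to -1 for this synthetic data
```

On real corpus data, fitting separate slopes below and above the turning-point rank would reproduce the –1/–2 break the abstract reports.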
Abstract:
This article explores ‘temporal framing’ in the oral conte. The starting point is a recent theoretical debate around the temporal structure of narrative discourse which has highlighted a fundamental tension between the approaches of two of the most influential current theoretical models, one of which is ‘framing theory’. The specific issue concerns the role of temporal adverbials appearing at the head of the clause (e.g. dates, relative temporal adverbials such as le lendemain) versus that of temporal ‘connectives’ such as puis, ensuite, etc. Through an analysis of a corpus of contes performed at the Conservatoire contemporain de Littérature Orale, I shall explore temporal framing in the light of this theoretical debate, and shall argue that, as with other types of narrative discourse, framing is primarily a structural rather than a temporal device in oral narrative. In a final section, I shall further argue, using Kintsch’s construction-integration model of narrative processing, that framing is fundamental to the cognitive processes involved in oral story performance.
Abstract:
This article is concerned with the description of data relating to a case of morphosyntactic variation and change in French: verbal agreement patterns with the noun majorité (‘majority’). Quantitative analysis of apparent-time data from cloze (‘gap-fill’) tests suggests that plural agreement with this noun is increasing; data from a diachronic corpus study confirm that this is the case. A number of factors are found to constrain this variation and change, including most importantly the presence and number of a post-modifying noun phrase (e.g. la majorité des hommes). The implications of these findings for the definition of collective nouns in French are discussed.
Abstract:
In this paper, we introduce an application of matrix factorization to produce corpus-derived, distributional models of semantics that demonstrate cognitive plausibility. We find that word representations learned by Non-Negative Sparse Embedding (NNSE), a variant of matrix factorization, are sparse, effective, and highly interpretable. To the best of our knowledge, this is the first approach which yields semantic representations of words satisfying these three desirable properties. Through extensive experimental evaluations on multiple real-world tasks and datasets, we demonstrate the superiority of semantic models learned by NNSE over other state-of-the-art baselines.
Abstract:
In most previous research on distributional semantics, Vector Space Models (VSMs) of words are built either from topical information (e.g., documents in which a word is present), or from syntactic/semantic types of words (e.g., dependency parse links of a word in sentences), but not both. In this paper, we explore the utility of combining these two representations to build VSMs for the task of semantic composition of adjective-noun phrases. Through extensive experiments on benchmark datasets, we find that even though a type-based VSM is effective for semantic composition, it is often outperformed by a VSM built using a combination of topic- and type-based statistics. We also introduce a new evaluation task wherein we predict the composed vector representation of a phrase from the brain activity of a human subject reading that phrase. We exploit a large syntactically parsed corpus of 16 billion tokens to build our VSMs, with vectors for both phrases and words, and make them publicly available.
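Composition of adjective-noun phrase vectors, as evaluated here, is commonly implemented with simple element-wise operators such as addition and point-wise multiplication. The abstract does not name the authors' composition functions, so the toy vectors and the two operators below are illustrative assumptions only.

```python
# Element-wise composition of two word vectors into a phrase vector.
# Toy 4-dimensional vectors; a real VSM would have thousands of dimensions.

def compose_add(u, v):
    """Additive composition: p[i] = u[i] + v[i]."""
    return [a + b for a, b in zip(u, v)]

def compose_mult(u, v):
    """Point-wise multiplicative composition: p[i] = u[i] * v[i]."""
    return [a * b for a, b in zip(u, v)]

adjective = [0.25, 0.0, 0.75, 0.5]  # hypothetical vector for an adjective
noun      = [0.5, 0.25, 0.5, 0.0]   # hypothetical vector for a noun

phrase_add  = compose_add(adjective, noun)   # [0.75, 0.25, 1.25, 0.5]
phrase_mult = compose_mult(adjective, noun)  # [0.125, 0.0, 0.375, 0.0]
```

A composed phrase vector like these could then be compared (e.g. by cosine similarity) against a vector predicted from brain activity, as in the evaluation task the abstract introduces.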
Abstract:
Vector space models (VSMs) represent word meanings as points in a high-dimensional space. VSMs are typically created from large text corpora, and so represent word semantics as observed in text. We present a new algorithm (JNNSE) that can incorporate a measure of semantics not previously used to create VSMs: brain activation data recorded while people read words. The resulting model takes advantage of the complementary strengths and weaknesses of corpus and brain activation data to give a more complete representation of semantics. Evaluations show that the model 1) matches a behavioral measure of semantics more closely, 2) can be used to predict corpus data for unseen words and 3) has predictive power that generalizes across brain imaging technologies and across subjects. We believe that the model is thus a more faithful representation of mental vocabularies.
Abstract:
This paper presents a machine learning approach to sarcasm detection on Twitter in two languages – English and Czech. Although there has been some research in sarcasm detection in languages other than English (e.g., Dutch, Italian, and Brazilian Portuguese), our work is the first attempt at sarcasm detection in the Czech language. We created a large Czech Twitter corpus consisting of 7,000 manually-labeled tweets and provide it to the community. We evaluate two classifiers with various combinations of features on both the Czech and English datasets. Furthermore, we tackle the issues of rich Czech morphology by examining different preprocessing techniques. Experiments show that our language-independent approach significantly outperforms adapted state-of-the-art methods in English (F-measure 0.947) and also represents a strong baseline for further research in Czech (F-measure 0.582).
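The classifier-plus-features setup evaluated in the paper can be illustrated with a minimal bag-of-words Naive Bayes baseline. This is a sketch only: the toy tweets below are invented, and the authors' actual systems use richer feature combinations, morphological preprocessing, and different classifiers.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train a multinomial Naive Bayes model from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    label_counts = Counter()            # label -> number of documents
    vocab = set()
    for text, label in docs:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def classify(model, text):
    """Return the most probable label, with add-one smoothing."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in text.lower().split():
            score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data (invented examples, not from the Czech/English corpus).
train = [
    ("oh great another monday", "sarcastic"),
    ("i just love waiting in line", "sarcastic"),
    ("the weather is nice today", "literal"),
    ("this coffee tastes good", "literal"),
]
model = train_nb(train)
```

Replacing the whitespace tokenizer with language-specific preprocessing is where the paper's handling of rich Czech morphology would plug in.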
Abstract:
While reading times are often used to measure working memory load, frequency effects (such as surprisal or n-gram frequencies) also have strong confounding effects on reading times. This work uses a naturalistic audio corpus with magnetoencephalographic (MEG) annotations to measure working memory load during sentence processing. Alpha oscillations in posterior regions of the brain have been found to correlate with working memory load in non-linguistic tasks (Jensen et al., 2002), and the present study extends these findings to working memory load caused by syntactic center embeddings. Moreover, this work finds that frequency effects in naturally-occurring stimuli do not significantly contribute to neural oscillations in any frequency band, which suggests that many modeling claims could be tested on this sort of data even without controlling for frequency effects.
Abstract:
Doctoral thesis, Linguistics (Applied Linguistics), Universidade de Lisboa, Faculdade de Letras, 2016
Abstract:
This research deals with the theory of the parts of speech in Greek Antiquity, and more specifically with ancient thinking about conjunctions. The first chapter focuses on the definition of the conjunction found in Aristotle's Poetics, as well as on the other passages of the Aristotelian corpus that help delineate the contours of a grammatical entity that was still poorly defined. The second chapter deals with the conjunction in the logico-grammatical research of the Stoic school. The Stoic definition of the conjunction, as well as the different categories of conjunctions identified by the Stoics, are examined. The central role of conjunctions within the Stoic theory of complex propositions is emphasized, highlighting the close interrelation between logic and grammar at this point in the development of grammatical theory. The final chapter deals with the definition and categories of conjunctions found in the Tekhnè grammatikè, a short grammar manual attributed to the Alexandrian philologist Dionysius Thrax. The influence of the Stoic theory of complex propositions on this first attempt at grammatical systematization is brought to light, as is the interference of philological concerns.
Abstract:
One of the most striking aspects of the technological advances of the last fifteen years concerns computer-mediated communication: chat, instant messaging, e-mail, discussion forums, blogs, social networking sites, etc. Besides having had a significant impact on contemporary society, these communication tools have substantially altered writing practices. Our object of study is group chat, which allows writers to communicate with one another simultaneously. This communication tool has two important characteristics at the discursive and communicational levels. First, it is generally accepted that chat is a hybrid form of communication: the code used is writing, but the exchanges of messages follow a dialogue structure reminiscent of speech. Second, the spontaneous character of chat imposes speed, both for encoding and for decoding messages. In a comparative study of the writing practices of francophone chat users (Tatossian and Dagenais 2008), we established four general categories to account for all the scriptural variants in our corpus: abbreviation processes, grapheme substitutions, neutralizations in absolute final position, and expressive devices. We now wish to test the robustness of our typology for languages whose degree of phoneme-grapheme correspondence differs.
Under the orthographic depth hypothesis (ODH; Katz and Frost 1992), according to which a transparent orthographic system (such as Italian, Spanish, or Serbo-Croatian) maps phonemes directly onto the orthography, we will test whether our results for French can be generalized to languages whose orthography is said to be "transparent" (Spanish) as compared with languages whose orthography is said to be "opaque" (French and English). For each language, we sought to answer two questions: 1. How can the attested scriptural usages be classified? 2. Are these graphic usages the same among adolescents and adults, qualitatively and quantitatively? The scriptural phenomena of chat also involve generational identity. Adolescence is a period characterized by the quest for identity. Sebba's (2003) study of English shows that there is a link between "orthographic subversion" and identity construction among adolescents (e.g. graffiti, CMC). Moreover, in these communicational spaces we witness the formation of communities of users based on shared interests (Crystal 2006), such as the community of adolescents. For corpus collection, we use exchanges carried out via the Internet Relay Chat (IRC) protocol. For the purposes of our study, we delimit in each language two sociolinguistically distinct subcorpora: the first built from chat forums aimed at adolescents, the second from forums for adults. For each language, we analyzed 4,520 utterances extracted from various IRC channels for adolescents and for adults. We first draw up a quantified inventory of the different scriptural phenomena observed and then compare the results.