998 results for corpus analysis
Abstract:
This paper discusses three important aspects of John Sinclair’s legacy: the corpus, lexicography, and the notion of ‘corpus-driven’. The corpus represents his concern with the nature of linguistic evidence. Lexicography is for him the canonical mode of language description at the lexical level. And his belief that the corpus should ‘drive’ the description is reflected in his constant attempts to utilize the emergent computer technologies to automate the initial stages of analysis and defer the intuitive, interpretative contributions of linguists to increasingly later stages in the process. Sinclair’s model of corpus-driven lexicography has spread far beyond its initial implementation at Cobuild, to most EFL dictionaries, to native-speaker dictionaries (e.g. the New Oxford Dictionary of English, and many national language dictionaries in emerging or re-emerging speech communities) and bilingual dictionaries (e.g. Collins, Oxford-Hachette).
Abstract:
Previous research into formulaic language has focussed on specialised populations (e.g. infants acquiring an L1 and adults acquiring an L2), with ordinary adult native speakers of English receiving less attention. Additionally, whilst some features of formulaic language have been used as evidence of authorship (e.g. the Unabomber’s use of you can’t eat your cake and have it too), there has been no systematic investigation into this as a potential marker of authorship. This thesis reports the first full-scale study into the use of formulaic sequences by individual authors. The theory of formulaic language hypothesises that formulaic sequences contained in the mental lexicon are shaped by experience combined with what each individual has found to be communicatively effective. Each author’s repertoire of formulaic sequences should therefore differ. To test this assertion, three automated approaches to the identification of formulaic sequences are tested on a specially constructed corpus containing 100 short narratives. The first approach explores a limited subset of formulaic sequences using recurrence across a series of texts as the criterion for identification. The second approach focuses on a word which frequently occurs as part of formulaic sequences and also investigates alternative non-formulaic realisations of the same semantic content. Finally, a reference list approach is used. Whilst claiming authority for any reference list can be difficult, the proposed method utilises internet examples derived from lists prepared by others, a procedure which, it is argued, is akin to asking large groups of judges to reach consensus about what is formulaic. The empirical evidence supports the notion that formulaic sequences have potential as a marker of authorship since in some cases a Questioned Document was correctly attributed. Although this marker of authorship is not universally applicable, it does promise to become a viable new tool in the forensic linguist’s tool-kit.
Abstract:
This research focuses on Native Language Identification (NLID), and in particular, on the linguistic identifiers of L1 Persian speakers writing in English. This project comprises three sub-studies; the first study devises a coding system to account for interlingual features present in a corpus of L1 Persian speakers blogging in English, and a corpus of L1 English blogs. Study One then demonstrates that it is possible to use interlingual identifiers to distinguish authorship by L1 Persian speakers. Study Two examines the coding system in relation to the L1 Persian corpus and a corpus of L1 Azeri and L1 Pashto speakers. The findings of this section indicate that the NLID method and features designed are able to discriminate between L1 influences from different languages. Study Three focuses on elicited data, in which participants were tasked with disguising their language to appear as L1 Persian speakers writing in English. This study indicated that there was a significant difference between the features in the L1 Persian corpus, and the corpus of disguise texts. The findings of this research indicate that NLID and the coding system devised have a very strong potential to aid forensic authorship analysis in investigative situations. Unlike existing research, this project focuses predominantly on blogs, as opposed to student data, making the findings more appropriate to forensic casework data.
Abstract:
The paper presents our considerations related to the creation of a digital corpus of Bulgarian dialects. The dialectological archive of the Bulgarian language consists of more than 250 audio tapes. All tapes were recorded between 1955 and 1965 in the course of regular dialectological expeditions throughout the country. The recordings typically contain interviews with inhabitants of small villages in Bulgaria. The topics covered are usually related to such issues as birth, everyday life, marriage, family relationships, death, etc. Only a few tapes contain folk songs from different regions of the country. Taking into account the progressive deterioration of the magnetic media and the realistic prospect of data loss, the Institute for Bulgarian Language at the Academy of Sciences launched a project in 1997 aimed at the restoration and digital preservation of the dialectological archive. Within the framework of this project, more than half of the recordings were digitized, de-noised and stored on digital recording media. Since then, restoration and digitization activities have been carried out at the Institute on a regular basis. As a result, a large collection of sound files has been gathered. Our further efforts are aimed at the creation of a digital corpus of Bulgarian dialects, which will be made available for phonological and linguistic research. Besides the sound files, such corpora typically include two basic elements: a transcription aligned with the sound file, and a set of standardized metadata that defines the corpus. In our work we present considerations on how these tasks could be realized in the case of the corpus of Bulgarian dialects. Our suggestions are based on a comparative analysis of existing methods and techniques for building such corpora and on selecting those that best fit our particular needs. Our experience can be used by similar institutions storing folklore archives, history-related spoken records, etc.
Abstract:
This paper describes the methodology followed to automatically generate titles for a corpus of questions that belong to sociological opinion polls. Titles for questions have a twofold function: (1) they are the input of user searches and (2) they inform about the whole contents of the question and possible answer options. Thus, generation of titles can be considered a case of automatic summarization. However, the fact that summarization had to be performed over very short texts, together with the aforementioned quality conditions imposed on newly generated titles, led the authors to follow knowledge-rich and domain-dependent strategies for summarization, disregarding the more frequent extractive techniques.
Abstract:
This study analyses the current role of police-suspect interview discourse in the England & Wales criminal justice system, with a focus on its use as evidence. A central premise is that the interview should be viewed not as an isolated and self-contained discursive event, but as one link in a chain of events which together constitute the criminal justice process. It examines: (1) the format changes undergone by interview data after the interview has taken place, and (2) how the other links in the chain – both before and after the interview – affect the interview-room interaction itself. It thus examines the police interview as a multi-format, multi-purpose and multi-audience mode of discourse. An interdisciplinary and multi-method discourse-analytic approach is taken, combining elements of conversation analysis, pragmatics, sociolinguistics and critical discourse analysis. Data from a new corpus of recent police-suspect interviews, collected for this study, are used to illustrate previously unaddressed problems with the current process, mainly in the form of two detailed case studies. Additional data are taken from the case of Dr. Harold Shipman. The analysis reveals several causes for concern, both in aspects of the interaction in the interview room, and in the subsequent treatment of interview material as evidence, especially in the light of s.34 of the Criminal Justice and Public Order Act 1994. The implications of the findings for criminal justice are considered, along with some practical recommendations for improvements. Overall, this study demonstrates the need for increased awareness within the criminal justice system of the many linguistic factors affecting interview evidence.
Abstract:
In this article I argue that the study of the linguistic aspects of epistemology has become unhelpfully focused on the corpus-based study of hedging and that a corpus-driven approach can help to improve upon this. Through focusing on a corpus of texts from one discourse community (that of genetics) and identifying frequent tri-lexical clusters containing highly frequent lexical items identified as keywords, I undertake an inductive analysis identifying patterns of epistemic significance. Several of these patterns are shown to be hedging devices and the whole corpus frequencies of the most salient of these, candidate and putative, are then compared to the whole corpus frequencies for comparable wordforms and clusters of epistemic significance. Finally I interviewed a ‘friendly geneticist’ in order to check my interpretation of some of the terms used and to get an expert interpretation of the overall findings. In summary I argue that the highly unexpected patterns of hedging found in genetics demonstrate the value of adopting a corpus-driven approach and constitute an advance in our current understanding of how to approach the relationship between language and epistemology.
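The cluster-identification step described above can be illustrated with a minimal sketch. It is not the author's procedure: the function name `trigram_clusters`, the `min_freq` threshold, and the toy sentence are assumptions made for illustration, and the keywords are simply passed in rather than derived from a keyness calculation.

```python
from collections import Counter

def trigram_clusters(tokens, keywords, min_freq=2):
    """Count three-word clusters that contain at least one keyword.

    `keywords` is a set supplied by the caller (hypothetical: the
    keyword-identification step itself is not sketched here).
    """
    counts = Counter(
        tuple(tokens[i:i + 3])
        for i in range(len(tokens) - 2)
        if keywords & set(tokens[i:i + 3])
    )
    # Keep only clusters that recur at least `min_freq` times.
    return {cluster: n for cluster, n in counts.items() if n >= min_freq}

# Toy text, invented for illustration only.
text = ("a candidate gene for the disorder was mapped and "
        "a candidate gene for the trait was cloned")
tokens = text.lower().split()
clusters = trigram_clusters(tokens, {"candidate"})
for cluster, n in sorted(clusters.items()):
    print(" ".join(cluster), n)
```

On a real corpus the recurring clusters would then be read in context, since frequency alone does not show whether a cluster carries epistemic meaning.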
Abstract:
Sentiment classification over Twitter is usually affected by the noisy nature (abbreviations, irregular forms) of tweet data. A popular procedure to reduce the noise of textual data is to remove stopwords by using pre-compiled stopword lists or more sophisticated methods for dynamic stopword identification. However, the effectiveness of removing stopwords in the context of Twitter sentiment classification has been debated in the last few years. In this paper we investigate whether removing stopwords helps or hampers the effectiveness of Twitter sentiment classification methods. To this end, we apply six different stopword identification methods to Twitter data from six different datasets and observe how removing stopwords affects two well-known supervised sentiment classification methods. We assess the impact of removing stopwords by observing fluctuations in the level of data sparsity, the size of the classifier's feature space and its classification performance. Our results show that using pre-compiled lists of stopwords negatively impacts the performance of Twitter sentiment classification approaches. On the other hand, the dynamic generation of stopword lists, by removing those infrequent terms appearing only once in the corpus, appears to be the optimal method for maintaining a high classification performance while reducing the data sparsity and substantially shrinking the feature space.
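The dynamic approach mentioned above, treating terms that occur only once in the corpus as stopwords, might look roughly like the following sketch. The function names, the whitespace tokenization, and the toy tweets are assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter

def singleton_stopwords(tokenized_tweets):
    """Return the set of terms that occur exactly once in the whole corpus."""
    counts = Counter(tok for tweet in tokenized_tweets for tok in tweet)
    return {term for term, n in counts.items() if n == 1}

def remove_terms(tokenized_tweets, terms):
    """Drop every token that belongs to `terms` from each tweet."""
    return [[tok for tok in tweet if tok not in terms]
            for tweet in tokenized_tweets]

# Toy corpus, invented for illustration only.
tweets = [
    "love this phone".split(),
    "hate this phone so much".split(),
    "love love the battery".split(),
]
stop = singleton_stopwords(tweets)
filtered = remove_terms(tweets, stop)

# The feature space is the vocabulary before and after filtering.
vocab_before = {tok for tweet in tweets for tok in tweet}
vocab_after = {tok for tweet in filtered for tok in tweet}
print(len(vocab_before), len(vocab_after))
```

The toy data also shows a side effect worth checking on real data: a sentiment-bearing word such as "hate" is discarded when it happens to occur only once.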
A critical discourse analysis on the (self) representation of Hillary R. Clinton in public discourse
Abstract:
The role of women in society has been, and still is, a highly controversial subject. Even in our society, debates arise over whether women should be allowed to enter certain professional fields that have always been dominated by a strong male presence, such as politics. Moreover, in many countries women's rights are still not even recognized, while in other cultures, although the law protects human rights regardless of race, religion or gender, the reality is that even in the most developed cultures there is gender inequality and there are stereotypes that affect how women fare. However, although gender inequality is still present in society, it is undeniable that the current situation is far more favorable to women's involvement, even in areas of society where it would have been unthinkable decades ago, such as politics. Accordingly, this situation has attracted the interest of many researchers and linguists, who have devoted time to research on the relationship between discourse and gender, on the media representation of women with some influence in the public sphere, and on how gender inequality affects their public image. Although politics was long dominated by men, the situation has now changed. In recent decades, a large presence of women in politics has become evident: women who, in contrast to the situation decades ago, can now even stand as presidential candidates, as in the case of Hillary Clinton. In this respect, many feminist movements have contributed greatly to this new situation. In view of all this, the present study attempts to answer the following questions. To what extent are gender stereotypes still present in society? Is the media representation of a political figure really based on her conduct and discursive activity, or is it influenced by preconceived gender schemas and ideas? Given that there is now a greater female presence in politics, one of my initial hypotheses is that gender stereotyping has decreased. In addition, the way in which Hillary Clinton represents herself as a woman and as a politician is expected to be less constrained by these schemas. The aim of this study is, first, to analyze ten speeches by Hillary Clinton, delivered between 15 June 2015, the date on which she launched her presidential candidacy, and 26 April 2016, in order to identify how she characterizes herself in her political speeches and whether conventional gender schemas affect her self-representation. To this end, the study centers on quantitative and qualitative analyses of word frequency, followed by a critical discourse analysis of Clinton's self-representation in her speeches. In addition, following the line of research of Tannen (1996), the uses of the pronouns “we” and “I” are analyzed to gain a broader perspective on this situation. Next, given that the media reflect social ideologies, this study was also designed to analyze ten news articles about the previously analyzed speeches. In this way, the study examines whether gender stereotypes are present in the media representation of Hillary Clinton, and then whether the media's interpretation of the presidential candidate is really related to the speeches analyzed or is instead influenced by gender stereotypes and schemas. To meet this objective, the data compiled for this corpus consist of exactly ten articles reporting on the speeches studied in the first analysis and on Hillary Clinton's performance. These articles were collected from four of the most important newspapers in the United States: the New York Times, the Wall Street Journal, the Los Angeles Times and The Washington Post. In this case the analysis focuses on word frequency and on the use of reporting verbs, following the line of research of Caldas-Coulthard (1995). It is hoped that the present study can serve as a basis for further research on gender issues and thus contribute to the development of theories that can better explain the situation of women in politics. Finally, much remains to be investigated in this discipline, and even more to be discovered.
Abstract:
This research analyzes the stressed front mid vowels [ε] and [e] and back mid vowels [ɔ] and [o] in nominal and verbal forms in the 1st person singular and 3rd person singular and plural of the present tense, specifically the umlaut process whereby the mid vowels /e/ and /o/ assimilate to /ε/ and /ɔ/ in stressed position. The general objective of this research is to describe and quantify the occurrence of umlaut and subsequently to analyze in which words it is regular. As specific objectives we have: i) to compile and to label an oral, spontaneous, synchronic and regional corpus, from radio programs produced in the city of Ituiutaba, Minas Gerais; ii) to describe the characteristics of the corpus to be compiled; iii) to investigate the alternating timbre of mid vowels in stressed position; iv) to identify instances of nominal and verbal umlaut of the mid vowels in stressed position; v) to describe the identified cases of nominal and verbal umlaut; vi) to analyze the probable causes for the variation of the mid vowels. To perform the proposed analysis, we have adopted as a theoretical-methodological basis multi-representational models: Phonology of Use (BYBEE, 2001) and Exemplar Theory (PIERREHUMBERT, 2001) combined with the precepts of Corpus Linguistics (BEBER SARDINHA, 2004). The corpus consisted of 16 radio programs (eight political and eight religious) from the city of Ituiutaba-MG, with recordings of about 20 to 40 minutes each. We note, by means of the results generated by WordSmith Tools® software, version 6.0 (SCOTT, 2012), that the analyzed forms show little variation, which shows that the umlaut is a process already lexicalized among the participants of the radio programs analyzed. We conclude that the results converge with the proposal of the Phonology of Use (BYBEE, 2001; PHILLIPS, 1984) that less frequent words with no phonetic environment conducive to change are changed first.
Abstract:
This research investigated the nasality of vowels in the spontaneous speech of inhabitants of the quilombola communities of Brejo dos Crioulos and Poções (MG). As a theoretical framework, we drew on the assumptions of Phonetics and Phonology, as developed by renowned scholars of nasality (CAGLIARI, 1977; CÂMARA JR., 1984, 2013; BISOL, 2013; ABAURRE; PAGOTTO, 1996; SILVA, 2015), with support from Corpus Linguistics. Its general goal was to investigate the occurrence of nasality in the dialect of these quilombola communities and its linguistic behavior, considering the linguistic factors that can interfere in the phenomenon. Specifically, it aimed to a) detect the occurrence of nasalized vowels with the help of the resources that Corpus Linguistics provides (Praat and WordSmith Tools); b) discriminate the different types of contexts in which nasalized vowels occur; c) make quantitative and qualitative analyses of the nasalized vowels in the study corpus; d) describe and analyze the behavior of nasalized vowels; and e) contrast the values of F1 and F2 of the oral and nasalized vowels. It was hypothesized that nasality is conditioned by the nasal segment following the nasalized vowel (the phonological process of “assimilation”), by primary stress position, and by grammatical category. It was believed that the quilombola communities of Brejo dos Crioulos and Poções produce nasalized vowels in their speech and that this linguistic phenomenon is favored by the adjacent presence of nasal consonants or vowels. Furthermore, it was hypothesized that the values of F1 and F2 of oral and nasalized vowels in these communities are distinct. The following research questions were elaborated: (i) is the presence of nasalized vowels in the speech of these quilombola communities conditioned by the presence of a nasal sound segment? (ii) does the nasal sound segment following the nasalized vowel favor the occurrence of the nasality phenomenon? (iii) is there a difference between the values of F1 and F2 of the oral and nasalized vowels in both quilombola communities considered? To compose our corpus, 24 interview recordings were used (12 female speakers and 12 male speakers), a total of 24 participants. It was found that the following nasal sound segment tends to condition the nasalized vowel. In general, the vowel assimilates the soft-palate lowering of the immediately following nasal consonant, but there are also cases involving a nasal vowel segment (regressive assimilation); the stressed syllable tends to favor nasality, but it occurs in pretonic and posttonic position as well; F1 and F2 values of oral and nasalized vowels in the quilombola communities of Poções and Brejo dos Crioulos are distinct: the Brejo dos Crioulos group tends to produce the F1 of oral and nasalized vowels lower than the Poções group, and the F2 in a more anterior position. Nasality tends to occur in verbs and nouns, although it is not specific to a grammatical category. This research found cases of spurious nasalization, confirming previous studies. In turn, it revealed lexical items with a favorable context for nasalization in which nasalization did not occur. In this last case, where lowering of the soft palate is considered uniform in PB (Brazilian Portuguese), vowels were pronounced without soft-palate lowering. That is, variation was detected in the phenomenon of nasalization in PB. This work promotes discussion of nasality in order to contribute to linguistic studies of the functioning of Brazilian Portuguese in this geographical context.
Abstract:
In this paper we compare the robustness of several types of stylistic markers to help discriminate authorship at sentence level. We train an SVM-based classifier using each set of features separately and perform sentence-level authorship analysis over a corpus of editorials published in a Portuguese quality newspaper. Results show that features based on POS information, punctuation and word / sentence length contribute to a more robust sentence-level authorship analysis. © Springer-Verlag Berlin Heidelberg 2010.
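As a rough illustration of the kind of sentence-level stylistic markers described above, the sketch below extracts punctuation and length features from a single sentence. The feature names and the function `sentence_features` are hypothetical. POS-based features are omitted because they would need a tagger, and the SVM training step is not shown.

```python
import re
import string

def sentence_features(sentence):
    """Extract simple punctuation and length markers from one sentence.

    The feature set is an assumption for illustration, not the one
    used in the paper.
    """
    words = re.findall(r"\w+", sentence)
    n_words = len(words)
    # Count ASCII punctuation characters only.
    n_punct = sum(ch in string.punctuation for ch in sentence)
    avg_word_len = sum(len(w) for w in words) / n_words if n_words else 0.0
    return {
        "n_words": n_words,            # sentence length in words
        "avg_word_len": avg_word_len,  # mean word length in characters
        "n_punct": n_punct,            # all punctuation marks
        "n_commas": sentence.count(","),
    }

feats = sentence_features("Well, this is, frankly, absurd!")
print(feats)
```

Each sentence's feature dictionary could then be turned into a numeric vector and passed to any standard classifier.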
Abstract:
In assessing Plutarch's contacts with other contemporary cultures, scholars have not yet reached a consensus on the relationship between the Chaeronean and early Christian literature. A good example of this arises when considering the motif of the creation of the human soul. The aim of the following pages is, after an analysis of the Plutarchan texts, to examine these possible contacts with the NHC, the heresiologists, and the Corpus Hermeticum in order to elucidate their similarities and differences.
Abstract:
Based on the concept of the triple basic structure of human communication by Poyatos (1994a, 1994b) and on the analytical and theoretical implications that derive from this, the present paper conceives of human communication as an indivisible whole in which verbal communication cannot be separated from body behavior. This paper analyzes nonverbal categories used in oral communication. The corpus consists of an oral narration in Galician from which we highlighted certain kinemes (minimum units of body movement with meaning) by using the model proposed by Bouvet (2001), in order to explain the nonverbal categories with examples taken from said recordings.
Abstract:
From ecological tourism to ecotourism: lexical analysis of an emerging tourism. This article deals with the lexicon created in connection with a recent form of tourism: ecological tourism, or ecotourism. The rise of this type of tourism encourages the creation of new concepts and products, which are named with new words and expressions formed by different procedures. Starting from the name itself, ecological tourism, later expressed as the acronym ecotourism, we analyze the formation of other related words, as well as their formal variation and use. For this, we have worked with a specific corpus of electronic tourist texts and different digital sources and databases.