949 results for corpus analysis


Relevance: 30.00%

Abstract:

The goal of this study is to determine if various measures of contraction rate are regionally patterned in written Standard American English. In order to answer this question, this study employs a corpus-based approach to data collection and a statistical approach to data analysis. Based on a spatial autocorrelation analysis of the values of eleven measures of contraction across a 25 million word corpus of letters to the editor representing the language of 200 cities from across the contiguous United States, two primary regional patterns were identified: easterners tend to produce relatively few standard contractions (not contraction, verb contraction) compared to westerners, and northeasterners tend to produce relatively few non-standard contractions (to contraction, non-standard not contraction) compared to southeasterners. These findings demonstrate that regional linguistic variation exists in written Standard American English and that regional linguistic variation is more common than is generally assumed.
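
A minimal sketch of the kind of spatial autocorrelation test the study describes: one hypothetical contraction rate per city and a global Moran's I over inverse-distance weights. The four cities, their rates, and the weighting scheme are invented for illustration; the study's own eleven measures and spatial weights are not reproduced here.

```python
# Global Moran's I for one contraction measure across cities.
# The tiny data set and inverse-distance weights are illustrative assumptions.
import numpy as np

# Hypothetical per-city values: rate of "not" contraction,
# i.e. don't / (don't + do not).
cities = {
    "Boston":  (42.36, -71.06, 0.55),
    "Atlanta": (33.75, -84.39, 0.63),
    "Denver":  (39.74, -104.99, 0.71),
    "Seattle": (47.61, -122.33, 0.74),
}

coords = np.array([(lat, lon) for lat, lon, _ in cities.values()])
x = np.array([rate for _, _, rate in cities.values()])

# Inverse-distance spatial weights, zero on the diagonal.
d = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1))
w = np.zeros_like(d)
mask = d > 0
w[mask] = 1.0 / d[mask]

# Global Moran's I: I = (n / sum(w)) * (z' W z) / (z' z)
z = x - x.mean()
n = len(x)
moran_i = (n / w.sum()) * (z @ w @ z) / (z @ z)
expected = -1.0 / (n - 1)  # expectation under spatial randomness

print(f"Moran's I = {moran_i:.3f} (E[I] = {expected:.3f})")
# I > E[I] suggests neighbouring cities have similar contraction rates,
# i.e. the measure is regionally patterned.
```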

Relevance: 30.00%

Abstract:

University students encounter difficulties with academic English because of its vocabulary, phraseology, and variability, and also because academic English differs in many respects from general English, the language they have experienced before starting their university studies. Although students have access to many dictionaries that contain some helpful information on words used in academic English, these dictionaries remain focused on the uses of words in general English. There is therefore a gap in the dictionary market for a dictionary for university students, and this thesis provides a proposal for such a dictionary (the Dictionary of Academic English; DOAE) in the form of a model which depicts how the dictionary should be designed, compiled, and offered to students. The model draws on state-of-the-art techniques in lexicography, dictionary-use research, and corpus linguistics. It required the creation of a completely new corpus of academic language (the Corpus of Academic Journal Articles; CAJA), whose main advantages are its large size (83.5 million words) and its balance. Access to a large corpus of academic language was essential for a corpus-driven approach to data analysis, and good corpus balance in terms of domains enabled a detailed domain-labelling of senses, patterns, collocates, etc. in the dictionary database, which was then used to tailor the output to the needs of different types of student. The model proposes a dictionary conceived as an online resource from the outset, one that is revolutionary in the way it addresses the needs of different types of student: a dynamic dictionary whose contents can be customised according to the user's native language, subject of study, variant spelling preferences, and/or visual preferences (e.g. black and white).
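
A minimal sketch of how a domain-labelled database could drive the tailored output the model proposes, assuming a toy entry structure; the field names and the `lookup` logic are hypothetical illustrations, not the DOAE's actual design.

```python
# Tailoring dictionary output from a domain-labelled sense database.
# Entry structure and field names are hypothetical, not the DOAE schema.
ENTRIES = {
    "analysis": [
        {"sense": "detailed examination of structure", "domains": ["general"]},
        {"sense": "statistical evaluation of data",
         "domains": ["medicine", "engineering"]},
    ],
}

def lookup(headword, profile):
    """Return senses matching the user's subject, general ones as fallback."""
    senses = ENTRIES.get(headword, [])
    tailored = [s for s in senses if profile["subject"] in s["domains"]]
    return tailored or [s for s in senses if "general" in s["domains"]]

user = {"subject": "medicine", "l1": "German"}
for s in lookup("analysis", user):
    print(s["sense"])
```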

Relevance: 30.00%

Abstract:

This research sets out to compare the values in British and German political discourse, especially the discourse of social policy, and to analyse their relationship to political culture through an analysis of the values invoked in health care reform. The work proceeds from the hypothesis that the known differences in political culture between the two countries will be reflected in the values of political discourse, and takes a comparison of two major recent legislative debates on health care reform as a case study. The starting point, in the first chapter, is a brief comparative survey of the post-war political cultures of the two countries, including a brief account of the historical background to their development and an overview of explanatory theoretical models; from this are developed the contrasts in values expected under the hypothesis. The second chapter explains the basis for selecting the corpus texts and the contextual information which needs to be recorded to make a comparative analysis, including the context and content of the reform proposals which comprise the case study, and examines the contextual factors which may need to be taken into account in the analysis. The third and fourth chapters explain the analytical method, which centres on the use of definition-based taxonomies of value items and value appeal methods to identify, on a sentence-by-sentence basis, the value items in the corpus texts and the methods used to appeal to those value items: the third chapter is concerned with the classification and analysis of values, the fourth with the classification and analysis of value appeal methods. The fifth chapter presents and explains the results of the analysis, and the sixth summarizes the conclusions and makes suggestions for further research.
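
A minimal sketch of sentence-by-sentence value coding of the kind described, with a toy keyword lexicon standing in for the thesis's definition-based taxonomies of value items:

```python
# Sentence-by-sentence tagging of value items via a keyword taxonomy.
# The taxonomy below is a toy stand-in for the thesis's definition-based one.
import re
from collections import Counter

VALUE_TAXONOMY = {
    "solidarity": {"solidarity", "together", "shared", "community"},
    "efficiency": {"efficiency", "efficient", "cost", "savings"},
    "choice":     {"choice", "choose", "freedom", "options"},
}

def code_sentences(text):
    counts = Counter()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = set(re.findall(r"[a-z]+", sentence.lower()))
        for value, cues in VALUE_TAXONOMY.items():
            if tokens & cues:  # sentence appeals to this value item
                counts[value] += 1
    return counts

debate = ("Reform must give patients real choice. "
          "We owe solidarity to those who cannot pay. "
          "The new system brings efficiency and savings.")
print(code_sentences(debate))
# Counter({'choice': 1, 'solidarity': 1, 'efficiency': 1})
```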

Relevance: 30.00%

Abstract:

We analyze a Big Data set of geo-tagged tweets collected over one year (Oct. 2013–Oct. 2014) to understand regional linguistic variation in the U.S. Prior work on regional linguistic variation usually required lengthy data collection and focused on either rural or urban areas. Geo-tagged Twitter data offer an unprecedented database with rich linguistic representation at fine spatiotemporal resolution and continuity. From the one-year Twitter corpus, we extract lexical characteristics for Twitter users by summarizing the frequencies of a set of lexical alternations that each user has used. We spatially aggregate and smooth each lexical characteristic to derive county-based linguistic variables, from which orthogonal dimensions are extracted using principal component analysis (PCA). Finally, a regionalization method is used to discover hierarchical dialect regions from the PCA components. The regionalization results reveal interesting regional linguistic variation in the U.S. The discovered regions not only confirm past findings in the literature but also provide new insights and a more detailed understanding of very recent linguistic patterns in the U.S.
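
A compact sketch of the pipeline on synthetic data: county-level lexical variables, PCA, then hierarchical clustering cut at several levels. The paper's actual regionalization method also enforces spatial contiguity between counties, which this simplified sketch omits; all sizes and data are placeholders.

```python
# County-level lexical variables -> PCA -> hierarchical dialect regions.
# Synthetic data; the real method additionally enforces spatial contiguity.
import numpy as np
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
n_counties, n_variables = 300, 40               # hypothetical sizes
X = rng.normal(size=(n_counties, n_variables))  # smoothed lexical frequencies

# Extract orthogonal dimensions of lexical variation.
pca = PCA(n_components=5)
components = pca.fit_transform(X)
print("variance explained:", pca.explained_variance_ratio_.round(3))

# Cut the Ward dendrogram at several levels to get hierarchical regions.
tree = linkage(components, method="ward")
for k in (2, 4, 8):
    regions = fcluster(tree, t=k, criterion="maxclust")
    print(f"{k} regions, county counts:", np.bincount(regions)[1:])
```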

Relevance: 30.00%

Abstract:

Sentiment analysis is concerned with automatically identifying the sentiment or opinion expressed in a given piece of text. Most prior work either uses prior lexical knowledge, defined as the sentiment polarity of words, or views the task as a text classification problem and relies on labeled corpora to train a sentiment classifier. While lexicon-based approaches do not adapt well to different domains, corpus-based approaches require expensive manual annotation effort. In this paper, we propose a novel framework in which an initial classifier is learned by incorporating prior information extracted from an existing sentiment lexicon, with preferences on the expected sentiment labels of those lexicon words expressed using generalized expectation criteria. Documents classified with high confidence are then used as pseudo-labeled examples for automatic domain-specific feature acquisition. The word-class distributions of such self-learned features are estimated from the pseudo-labeled examples and are used to train another classifier by constraining the model's predictions on unlabeled instances. Experiments on both the movie-review data and the multi-domain sentiment dataset show that our approach attains comparable or better performance than existing weakly-supervised sentiment classification methods, despite using no labeled documents.
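
A simplified sketch of the framework's overall shape: a lexicon provides initial labels, a classifier is trained on them, and confidently classified documents are fed back as pseudo-labeled examples. The generalized expectation criteria themselves are not reproduced; a plain Naive Bayes self-training loop stands in for them, with a toy lexicon and corpus.

```python
# Lexicon-seeded sentiment classification with one round of self-training.
# Naive Bayes stands in for the paper's generalized-expectation learning.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

LEXICON = {"great": 1, "excellent": 1, "awful": 0, "boring": 0}

docs = [
    "great plot and excellent acting",   # lexicon-labelled positive
    "awful pacing and boring script",    # lexicon-labelled negative
    "the acting was excellent",
    "boring script and awful sound",
    "the plot and acting worked",        # no lexicon words: unlabeled
    "that script and pacing dragged",    # no lexicon words: unlabeled
]

def lexicon_label(doc):
    """Majority vote of lexicon word polarities; None if no lexicon word."""
    votes = [LEXICON[w] for w in doc.lower().split() if w in LEXICON]
    return round(sum(votes) / len(votes)) if votes else None

seed = [i for i, d in enumerate(docs) if lexicon_label(d) is not None]
y_seed = [lexicon_label(docs[i]) for i in seed]

vec = CountVectorizer()
X = vec.fit_transform(docs)
clf = MultinomialNB().fit(X[seed], y_seed)

# Pseudo-label unlabeled documents where the model is confident, retrain.
rest = [i for i in range(len(docs)) if i not in seed]
proba = clf.predict_proba(X[rest])
confident = [(i, int(p.argmax())) for i, p in zip(rest, proba) if p.max() > 0.7]
X_aug = X[seed + [i for i, _ in confident]]
y_aug = y_seed + [label for _, label in confident]
clf = MultinomialNB().fit(X_aug, y_aug)

print(clf.predict(vec.transform(["boring plot but excellent acting"])))
```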

Relevance: 30.00%

Abstract:

This paper discusses three important aspects of John Sinclair’s legacy: the corpus, lexicography, and the notion of ‘corpus-driven’. The corpus represents his concern with the nature of linguistic evidence. Lexicography is for him the canonical mode of language description at the lexical level. And his belief that the corpus should ‘drive’ the description is reflected in his constant attempts to utilize the emergent computer technologies to automate the initial stages of analysis and defer the intuitive, interpretative contributions of linguists to increasingly later stages in the process. Sinclair’s model of corpus-driven lexicography has spread far beyond its initial implementation at Cobuild, to most EFL dictionaries, to native-speaker dictionaries (e.g. the New Oxford Dictionary of English, and many national language dictionaries in emerging or re-emerging speech communities) and bilingual dictionaries (e.g. Collins, Oxford-Hachette).

Relevance: 30.00%

Abstract:

Previous research into formulaic language has focussed on specialised groups of people (e.g. L1 acquisition by infants and adult L2 acquisition) with ordinary adult native speakers of English receiving less attention. Additionally, whilst some features of formulaic language have been used as evidence of authorship (e.g. the Unabomber’s use of you can’t eat your cake and have it too) there has been no systematic investigation into this as a potential marker of authorship. This thesis reports the first full-scale study into the use of formulaic sequences by individual authors. The theory of formulaic language hypothesises that formulaic sequences contained in the mental lexicon are shaped by experience combined with what each individual has found to be communicatively effective. Each author’s repertoire of formulaic sequences should therefore differ. To test this assertion, three automated approaches to the identification of formulaic sequences are tested on a specially constructed corpus containing 100 short narratives. The first approach explores a limited subset of formulaic sequences using recurrence across a series of texts as the criterion for identification. The second approach focuses on a word which frequently occurs as part of formulaic sequences and also investigates alternative non-formulaic realisations of the same semantic content. Finally, a reference list approach is used. Whilst claiming authority for any reference list can be difficult, the proposed method utilises internet examples derived from lists prepared by others, a procedure which, it is argued, is akin to asking large groups of judges to reach consensus about what is formulaic. The empirical evidence supports the notion that formulaic sequences have potential as a marker of authorship since in some cases a Questioned Document was correctly attributed. Although this marker of authorship is not universally applicable, it does promise to become a viable new tool in the forensic linguist’s tool-kit.
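
A minimal sketch of the first identification approach, using recurrence across texts as the criterion: word trigrams that appear in at least a given number of different narratives are returned as candidate formulaic sequences. The n-gram length and threshold are illustrative choices, not the thesis's settings.

```python
# Candidate formulaic sequences = n-grams recurring across several texts.
import re
from collections import defaultdict

def ngrams(tokens, n=3):
    """Sliding window of n-grams over a token list."""
    return zip(*(tokens[i:] for i in range(n)))

def candidate_sequences(texts, n=3, min_texts=2):
    doc_freq = defaultdict(set)
    for doc_id, text in enumerate(texts):
        tokens = re.findall(r"[a-z']+", text.lower())
        for gram in set(ngrams(tokens, n)):
            doc_freq[gram].add(doc_id)
    return {" ".join(g) for g, ids in doc_freq.items() if len(ids) >= min_texts}

narratives = [
    "at the end of the day we went home",
    "at the end of the day it was worth it",
    "the day was long but we went home happy",
]
print(candidate_sequences(narratives))
# {'at the end', 'the end of', 'end of the', 'of the day', 'we went home'}
```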

Relevance: 30.00%

Abstract:

This research focuses on Native Language Identification (NLID) and, in particular, on the linguistic identifiers of L1 Persian speakers writing in English. The project comprises three sub-studies: the first devises a coding system to account for the interlingual features present in a corpus of L1 Persian speakers blogging in English, compared against a corpus of L1 English blogs. Study One then demonstrates that it is possible to use interlingual identifiers to distinguish authorship by L1 Persian speakers. Study Two examines the coding system in relation to the L1 Persian corpus and a corpus of L1 Azeri and L1 Pashto speakers; the findings indicate that the NLID method and the features designed are able to discriminate between L1 influences from different languages. Study Three focuses on elicited data, in which participants were tasked with disguising their language so as to appear to be L1 Persian speakers writing in English. This study indicated a significant difference between the features in the L1 Persian corpus and those in the corpus of disguised texts. The findings of this research indicate that NLID and the coding system devised have very strong potential to aid forensic authorship analysis in investigative situations. Unlike existing research, this project focuses predominantly on blogs, as opposed to student data, making the findings more applicable to forensic casework.
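
A minimal sketch of NLID-style feature profiling, counting a checklist of surface features per 1,000 words so that corpora can be compared. The two regex "features" are invented placeholders, not the coding system devised in the thesis.

```python
# Profile texts by the rate of hypothesized interlingual features.
# Both features below are invented placeholders for illustration only.
import re

FEATURES = {
    "definite_article_rate": r"\bthe\b",          # Persian lacks a definite article
    "be_auxiliary_rate":     r"\b(?:am|is|are)\b",  # crude auxiliary-use cue
}

def profile(text):
    words = len(re.findall(r"\w+", text))
    return {name: round(1000 * len(re.findall(pat, text.lower())) / words, 1)
            for name, pat in FEATURES.items()}

persian_l1_blog = "I am going to university and I am study English now"
english_l1_blog = "I went to the university and the course was great"
print(profile(persian_l1_blog))
print(profile(english_l1_blog))
```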

Relevance: 30.00%

Abstract:

The paper presents our considerations related to the creation of a digital corpus of Bulgarian dialects. The dialectological archive of the Bulgarian language consists of more than 250 audio tapes, all recorded between 1955 and 1965 in the course of regular dialectological expeditions throughout the country. The records typically contain interviews with inhabitants of small villages in Bulgaria, and the topics covered usually relate to such matters as birth, everyday life, marriage, family relationships, death, etc. Only a few tapes contain folk songs from different regions of the country. Taking into account the progressive deterioration of the magnetic media and the realistic prospect of data loss, the Institute for Bulgarian Language at the Academy of Sciences launched a project in 1997 aimed at the restoration and digital preservation of the dialectological archive. Within the framework of this project, more than half of the records were digitized, de-noised and stored on digital media, and restoration and digitization have continued at the Institute on a regular basis since then. As a result, a large collection of sound files has been gathered. Our further efforts are aimed at the creation of a digital corpus of Bulgarian dialects, which will be made available for phonological and linguistic research. Besides the sound files, such corpora typically include two basic elements: a transcription aligned with the sound file, and a set of standardized metadata describing the corpus. We present considerations on how these tasks could be realized in the case of the corpus of Bulgarian dialects, based on a comparative analysis of existing methods and techniques for building such corpora and a selection of those that best fit the particular needs. Our experience may be useful to similar institutions storing folklore archives, history-related spoken records, etc.
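
A minimal sketch of the two elements named above, standardized metadata plus a time-aligned transcription, serialized as a simple XML record. The field names are illustrative; in practice an established standard such as TEI or IMDI metadata would likely be adopted.

```python
# Build a minimal corpus record: metadata plus a time-aligned transcription.
# Field names and values are illustrative, not a real archive record.
import xml.etree.ElementTree as ET

rec = ET.Element("recording", id="BG-1962-044")

meta = ET.SubElement(rec, "metadata")
for field, value in [("region", "Rhodope"), ("village", "Shiroka Laka"),
                     ("year", "1962"), ("topic", "marriage customs"),
                     ("audio", "BG-1962-044.wav")]:
    ET.SubElement(meta, field).text = value

# Each segment is aligned to the sound file by start/end times in seconds.
transcript = ET.SubElement(rec, "transcription")
for start, end, text in [(0.0, 3.2, "first utterance ..."),
                         (3.2, 7.9, "second utterance ...")]:
    seg = ET.SubElement(transcript, "segment", start=str(start), end=str(end))
    seg.text = text

print(ET.tostring(rec, encoding="unicode"))
```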

Relevance: 30.00%

Abstract:

This paper describes the methodology we followed to automatically generate titles for a corpus of questions from sociological opinion polls. Titles for questions have a twofold function: (1) they are the input of user searches, and (2) they describe the whole content of the question and its possible answer options. Title generation can thus be considered a case of automatic summarization. However, the fact that summarization had to be performed over very short texts, together with the quality conditions imposed on newly generated titles, led the authors to follow knowledge-rich and domain-dependent strategies for summarization, disregarding the more common extractive techniques.
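
A minimal sketch of a knowledge-rich, template-based titling strategy of the general kind described, as opposed to extractive summarization; the domain terms, rules, and template are invented examples, not the paper's actual resources.

```python
# Template-based title generation for poll questions (abstractive, rule-based).
# Domain terms, rules, and the output template are invented examples.
import re

DOMAIN_TERMS = ["government", "economy", "health care", "education"]

def make_title(question, options):
    q = question.lower()
    topic = next((t for t in DOMAIN_TERMS if t in q), "general")
    if re.match(r"\s*do you (approve|support)", q):
        kind = "Approval"
    elif "how would you rate" in q:
        kind = "Rating"
    else:
        kind = "Opinion"
    return f"{kind} of {topic} ({len(options)} answer options)"

q = "Do you approve of the way the government is handling the economy?"
print(make_title(q, ["Approve", "Disapprove", "Don't know"]))
# -> "Approval of government (3 answer options)"
```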

Relevance: 30.00%

Abstract:

This study analyses the current role of police-suspect interview discourse in the England & Wales criminal justice system, with a focus on its use as evidence. A central premise is that the interview should be viewed not as an isolated and self-contained discursive event, but as one link in a chain of events which together constitute the criminal justice process. It examines: (1) the format changes undergone by interview data after the interview has taken place, and (2) how the other links in the chain – both before and after the interview – affect the interview-room interaction itself. It thus examines the police interview as a multi-format, multi-purpose and multi-audience mode of discourse. An interdisciplinary and multi-method discourse-analytic approach is taken, combining elements of conversation analysis, pragmatics, sociolinguistics and critical discourse analysis. Data from a new corpus of recent police-suspect interviews, collected for this study, are used to illustrate previously unaddressed problems with the current process, mainly in the form of two detailed case studies. Additional data are taken from the case of Dr. Harold Shipman. The analysis reveals several causes for concern, both in aspects of the interaction in the interview room, and in the subsequent treatment of interview material as evidence, especially in the light of s.34 of the Criminal Justice and Public Order Act 1994. The implications of the findings for criminal justice are considered, along with some practical recommendations for improvements. Overall, this study demonstrates the need for increased awareness within the criminal justice system of the many linguistic factors affecting interview evidence.

Relevance: 30.00%

Abstract:

In this article I argue that the study of the linguistic aspects of epistemology has become unhelpfully focused on the corpus-based study of hedging and that a corpus-driven approach can help to improve upon this. Through focusing on a corpus of texts from one discourse community (that of genetics) and identifying frequent tri-lexical clusters containing highly frequent lexical items identified as keywords, I undertake an inductive analysis identifying patterns of epistemic significance. Several of these patterns are shown to be hedging devices and the whole corpus frequencies of the most salient of these, candidate and putative, are then compared to the whole corpus frequencies for comparable wordforms and clusters of epistemic significance. Finally I interviewed a ‘friendly geneticist’ in order to check my interpretation of some of the terms used and to get an expert interpretation of the overall findings. In summary I argue that the highly unexpected patterns of hedging found in genetics demonstrate the value of adopting a corpus-driven approach and constitute an advance in our current understanding of how to approach the relationship between language and epistemology.
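
A minimal sketch of the corpus-driven step described: collect the tri-lexical clusters (trigrams) containing a keyword and rank them by frequency. The two-sentence "corpus" merely stands in for the genetics corpus.

```python
# Rank trigrams containing a given keyword by corpus frequency.
import re
from collections import Counter

def keyword_trigrams(corpus, keyword):
    counts = Counter()
    for text in corpus:
        tokens = re.findall(r"[a-z]+", text.lower())
        for i in range(len(tokens) - 2):
            tri = tuple(tokens[i:i + 3])
            if keyword in tri:
                counts[" ".join(tri)] += 1
    return counts

genetics = [
    "We identified a candidate gene for the disease.",
    "The candidate gene encodes a putative membrane protein.",
]
print(keyword_trigrams(genetics, "candidate").most_common(3))
```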

Relevance: 30.00%

Abstract:

Sentiment classification over Twitter is usually affected by the noisy nature (abbreviations, irregular forms) of tweet data. A popular procedure for reducing the noise of textual data is to remove stopwords, either by using pre-compiled stopword lists or by using more sophisticated methods for dynamic stopword identification. However, the effectiveness of removing stopwords in the context of Twitter sentiment classification has been debated in recent years. In this paper we investigate whether removing stopwords helps or hampers the effectiveness of Twitter sentiment classification methods. To this end, we apply six different stopword identification methods to Twitter data from six different datasets and observe how removing stopwords affects two well-known supervised sentiment classification methods. We assess the impact of removing stopwords by observing fluctuations in the level of data sparsity, the size of the classifier's feature space and its classification performance. Our results show that using pre-compiled lists of stopwords negatively impacts the performance of Twitter sentiment classification approaches. On the other hand, dynamically generating the stopword list by removing those infrequent terms appearing only once in the corpus appears to be the optimal method, maintaining high classification performance while reducing data sparsity and substantially shrinking the feature space.
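
A minimal sketch of the best-performing strategy reported: build the stopword list dynamically from terms occurring only once in the corpus, then compare the feature-space size before and after removal. The four "tweets" are invented.

```python
# Dynamic stopword list = terms occurring exactly once in the corpus.
from collections import Counter

tweets = ["omg this phone is gr8", "this phone is amazing",
          "worst phone ever tbh", "amazing battery on this phone"]

freq = Counter(w for t in tweets for w in t.split())
singletons = {w for w, c in freq.items() if c == 1}

# Remove the dynamic stopwords and measure the feature-space shrinkage.
cleaned = [" ".join(w for w in t.split() if w not in singletons)
           for t in tweets]

print("stopwords:", sorted(singletons))
print("features before:", len(freq), "after:", len(freq) - len(singletons))
print(cleaned)
```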

Relevance: 30.00%

Abstract:

The role of women in society has been, and still is, a highly controversial topic. Even in our own society, debates arise over whether women should be allowed to occupy certain professional spheres that have always been dominated by a strong male presence, as is the case of politics. Moreover, in many countries women's rights are not even recognized yet, while in other cultures, even though the law protects human rights regardless of race, religion or gender, the reality is that even the most developed societies show gender inequality and stereotypes that constrain women's advancement. Nevertheless, although gender inequality is still present in society, it is undeniable that the current situation is far more favourable to women's involvement, even in spheres of society where it would have been unthinkable decades ago, such as politics. This situation has attracted the interest of many researchers and linguists, who have devoted themselves to research on the relationship between discourse and gender, on the media representation of women who hold some influence in the public sphere, and on how gender inequality affects their public image. While it is true that politics was long dominated by a male presence, the situation has now changed: in recent decades a strong presence of women in politics has become evident, women who now have the possibility of standing even as presidential candidates, as is the case of Hillary Clinton. Many feminist movements have contributed greatly to this new situation. In view of all this, the present study attempts to answer the following questions: To what extent are gender stereotypes still present in society? Is the media representation of a political figure actually based on her conduct and her discursive activity, or is it shaped by preconceived gender schemas and ideas? Given the greater female presence in politics today, one of my initial hypotheses is that gender stereotyping has diminished; it is also expected that the way Hillary Clinton represents herself as a woman and as a politician is less affected by these schemas. The aim of this study is, first, to analyse ten speeches by Hillary Clinton, from 15 June 2015, the date on which she launched her presidential candidacy, to 26 April 2016, in order to identify how she characterizes herself in her political speeches and whether conventional gender schemas affect her self-representation. To this end, the study combines quantitative and qualitative analyses of word frequency with a critical discourse analysis of her self-representation in the speeches. In addition, following Tannen's (1996) line of research, the uses of the pronouns "we" and "I" are analysed to gain further perspective on this situation.

Next, given that the media reflect social ideologies, this study is also designed to analyse ten news articles reporting on the previously analysed speeches. In this way, it examines whether gender stereotypes are present in the media representation of Hillary Clinton, and then whether the media's interpretation of the presidential candidate actually reflects the speeches analysed or is instead influenced by gender stereotypes and schemas. To this end, the data compiled for this corpus consist of exactly ten articles reporting on the speeches studied in the first analysis and on Hillary Clinton's performance, collected from four of the most important newspapers in the United States: the New York Times, the Wall Street Journal, the Los Angeles Times and The Washington Post. Here the analysis focuses on word frequency and on the use of reporting verbs, following Caldas-Coulthard's (1995) line of research. It is hoped that the present study can inform further research on gender issues and thus contribute to theories that better explain the situation of women in politics. Finally, much remains to be investigated in this discipline, and even more to be discovered.
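
A minimal sketch of the study's quantitative step, computing per-speech relative frequencies of the pronouns "we" and "I"; the one-line "speech" is a placeholder, not a quotation from the analysed corpus.

```python
# Relative frequency of "we" vs "I" per speech (occurrences per 1,000 tokens).
import re
from collections import Counter

def pronoun_rates(speech):
    tokens = re.findall(r"[a-z']+", speech.lower())
    counts = Counter(tokens)
    return {p: round(1000 * counts[p] / len(tokens), 1) for p in ("we", "i")}

speech = ("I believe we can do this together, because we are stronger "
          "when we fight for what I know this country deserves.")
print(pronoun_rates(speech))
```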

Relevance: 30.00%

Abstract:

This research analyzes the stressed mid front vowels [ε] and [e] and mid back vowels [ɔ] and [o] in nominal forms and in verbal forms of the 1st person singular and the 3rd person singular and plural in the present tense, specifically the umlaut process of the mid vowels /e/ and /o/, which assimilate to /ε/ and /ɔ/ in stressed position. The general objective of this research is to describe and quantify the occurrence of umlaut and subsequently to analyze in which words there is regularity and in which there is not. The specific objectives are: i) to compile and label an oral, spontaneous, synchronic and regional corpus based on radio programs produced in the city of Ituiutaba, Minas Gerais; ii) to describe the characteristics of the compiled corpus; iii) to investigate the alternating timbre of mid vowels in stressed position; iv) to identify instances of nominal and verbal umlaut of the mid vowels in stressed position; v) to describe the identified cases of nominal and verbal umlaut; vi) to analyze the probable causes of the variation of the mid vowels. To perform the proposed analysis, we adopted multi-representational models as a theoretical-methodological basis: Phonology of Use (BYBEE, 2001) and Exemplar Theory (PIERREHUMBERT, 2001), combined with the precepts of Corpus Linguistics (BERBER SARDINHA, 2004). The corpus consisted of 16 radio programs – eight political and eight religious – from the city of Ituiutaba-MG, with recordings of about 20 to 40 minutes each. The results generated with the WordSmith Tools® software, version 6.0 (SCOTT, 2012), show that the analyzed forms exhibit little variation, which indicates that umlaut is a process already lexicalized for the participants of the radio programs analyzed. We conclude that the results converge with the proposal of the Phonology of Use (BYBEE, 2001; PHILLIPS, 1984) that, in changes lacking a conducive phonetic environment, less frequent words are changed first.
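
A minimal sketch of how the variant counts behind such an analysis could be tallied, assuming an invented "word/vowel" annotation convention for the transcripts; the tokens shown are placeholders, not data from the Ituiutaba corpus.

```python
# Tally lowered ([ε], [ɔ]) versus non-lowered ([e], [o]) variants per word.
# The "word/variant" annotation convention is invented for this sketch.
from collections import Counter

# Hypothetical transcript tokens: orthographic form + observed mid vowel.
transcript = ["pego/ε", "pego/e", "jogo/ɔ", "jogo/ɔ", "pego/ε", "jogo/o"]

counts = Counter(tuple(tok.split("/")) for tok in transcript)
for word in {w for w, _ in counts}:
    total = sum(c for (w, _), c in counts.items() if w == word)
    for (w, v), c in sorted(counts.items()):
        if w == word:
            print(f"{word} [{v}]: {c}/{total} = {c / total:.0%}")
```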