946 results for Corpus-based phraseology
Abstract:
This paper evaluates the efficiency of a number of popular corpus-based distributional models in performing discovery on very large document sets, including online collections. Literature-based discovery is the process of identifying previously unknown connections in text, often published literature, that could lead to the development of new techniques or technologies. Literature-based discovery has attracted growing research interest ever since Swanson's serendipitous discovery of the therapeutic effects of fish oil on Raynaud's disease in 1986. The successful application of distributional models to automating the identification of the indirect associations underpinning literature-based discovery has been amply demonstrated in the medical domain. However, we wish to investigate the computational complexity of distributional models for literature-based discovery on much larger document collections, as they may provide computationally tractable solutions to tasks such as predicting future disruptive innovations. In this paper we perform a computational complexity analysis of four successful corpus-based distributional models to evaluate their fit for such tasks. Our results indicate that corpus-based distributional models that store their representations in fixed dimensions provide superior efficiency on literature-based discovery tasks.
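To make the efficiency argument concrete, here is a minimal sketch of a fixed-dimension distributional representation built by random indexing. It illustrates the general technique only, not the four models analysed in the paper; the dimensionality, function names, and toy corpus are all invented. The key property is that every term vector has a constant dimensionality D, so comparing any two terms costs O(D) regardless of vocabulary or corpus size.

```python
# Sketch of a fixed-dimension distributional model (random indexing).
# Illustrative only; not the paper's implementation.
import numpy as np

D = 512        # fixed dimensionality of every term vector
SEED_NNZ = 8   # non-zero entries in each term's random "index vector"

rng = np.random.default_rng(0)

def index_vector():
    """Sparse ternary random vector, near-orthogonal to the others."""
    v = np.zeros(D)
    pos = rng.choice(D, SEED_NNZ, replace=False)
    v[pos] = rng.choice([-1.0, 1.0], SEED_NNZ)
    return v

def train(sentences):
    """Accumulate each term's context vector in fixed D dimensions."""
    idx, ctx = {}, {}
    for sent in sentences:
        toks = sent.lower().split()
        for t in toks:
            idx.setdefault(t, index_vector())
            ctx.setdefault(t, np.zeros(D))
        for i, t in enumerate(toks):  # add the index vectors of +/-2 neighbours
            for j in range(max(0, i - 2), min(len(toks), i + 3)):
                if i != j:
                    ctx[t] += idx[toks[j]]
    return ctx

def candidates(ctx, a, k=5):
    """Rank terms by cosine similarity to `a`; terms sharing context words
    score high even without direct co-occurrence with `a`."""
    av = ctx[a] / (np.linalg.norm(ctx[a]) + 1e-9)
    sims = {t: float(v @ av / (np.linalg.norm(v) + 1e-9))
            for t, v in ctx.items() if t != a}
    return sorted(sims, key=sims.get, reverse=True)[:k]

corpus = ["fish oil reduces blood viscosity",
          "raynaud disease involves high blood viscosity"]
print(candidates(train(corpus), "fish"))
```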
Abstract:
In this dissertation, I present an overall methodological framework for studying linguistic alternations, focusing specifically on lexical variation in denoting a single meaning, that is, synonymy. As a practical example, I employ the synonymous set of the four most common Finnish verbs denoting THINK, namely ajatella, miettiä, pohtia and harkita ‘think, reflect, ponder, consider’. As a continuation of previous work, I describe in considerable detail the extension of statistical methods from dichotomous linguistic settings (e.g., Gries 2003; Bresnan et al. 2007) to polytomous ones, that is, settings with more than two possible alternative outcomes. The applied statistical methods are arranged in a succession of stages of increasing complexity, proceeding from univariate via bivariate to multivariate techniques. As the central multivariate method, I argue for the use of polytomous logistic regression and demonstrate its practical application to the studied phenomenon, thus extending the work of Bresnan et al. (2007), who applied simple (binary) logistic regression to a dichotomous structural alternation in English. The results of the various statistical analyses confirm that a wide range of contextual features across different categories are indeed associated with the use and selection of the selected THINK lexemes; however, a substantial proportion of these features is not exemplified in current Finnish lexicographical descriptions. The multivariate results indicate that semantic classifications of syntactic argument types are on average the most distinctive feature category, followed by overall semantic characterizations of the verb chains, then syntactic argument types alone, with morphological features of the verb chain and extra-linguistic features relegated to last position. In terms of the overall performance of the multivariate analysis and modeling, prediction accuracy seems to reach a ceiling at a recall rate of roughly two-thirds of the sentences in the research corpus. The analysis of these results suggests a limit to what can be explained and determined within the immediate sentential context using the conventional descriptive and analytical apparatus of currently available linguistic theories and models. The results also support Bresnan's (2007) and others' (e.g., Bod et al. 2003) probabilistic view of the relationship between linguistic usage and the underlying linguistic system, in which only a minority of linguistic choices are categorical given the known context, represented as a feature cluster, that can be analytically grasped and identified. Instead, most contexts exhibit degrees of variation in their outcomes, resulting in proportionate choices over longer stretches of usage in text or speech.
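For readers unfamiliar with polytomous (multinomial) logistic regression, the sketch below shows the shape of such a model for a four-way lexical choice, using Python and scikit-learn rather than the statistical environment used in the dissertation. The contextual features and toy data are invented for illustration; note how the model returns a probability distribution over the four lexemes, mirroring the probabilistic view of choice described above.

```python
# Multinomial logistic regression over a four-way lexical choice.
# Toy data and feature names are illustrative, not the dissertation's.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Each observation: a feature cluster describing the sentential context,
# and the THINK lexeme actually chosen.
data = [
    ({"agent": "individual", "patient_sem": "activity", "tense": "present"}, "ajatella"),
    ({"agent": "group",      "patient_sem": "notion",   "tense": "past"},    "pohtia"),
    ({"agent": "individual", "patient_sem": "notion",   "tense": "present"}, "miettiä"),
    ({"agent": "individual", "patient_sem": "activity", "tense": "past"},    "harkita"),
] * 25  # repeat the toy rows so the model has something to fit

X_dicts, y = zip(*data)
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(X_dicts)

model = LogisticRegression(max_iter=1000)  # multinomial for multiclass targets
model.fit(X, y)

# The model yields proportionate choices, not a single categorical one.
probs = model.predict_proba(vec.transform(
    [{"agent": "individual", "patient_sem": "notion", "tense": "present"}]))
print(dict(zip(model.classes_, probs[0].round(3))))
```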
Abstract:
The present study provides a usage-based account of how three grammatical structures (declarative content clauses, interrogative content clauses, and as-predicative constructions) are used in academic research articles. These structures may be used in both knowledge claims and citations, and they often express evaluative meanings. Using the methodology of quantitative corpus linguistics, I investigate how the culture of an academic discipline influences the way in which these constructions are used in research articles. The study compares the rates of occurrence of these grammatical structures and investigates their co-occurrence patterns in articles representing four different disciplines (medicine, physics, law, and literary criticism). The analysis is based on a purpose-built, part-of-speech-tagged corpus of two million words. The analysis demonstrates that the use of these grammatical structures varies between disciplines, and further shows that the differences observed in the corpus data are linked with differences in the nature of knowledge and the patterns of enquiry. The constructions in focus tend to be used more frequently in the soft disciplines, law and literary criticism, where their co-occurrence patterns are also more varied. This reflects both the greater variety of topics discussed in these disciplines and the higher frequency of references to statements made by other researchers. Knowledge-building in the soft fields normally requires careful contextualisation of the arguments, giving rise to statements that report earlier research and employ the constructions in focus. In contrast, knowledge-building in the hard fields is typically a cumulative process based on agreed-upon methods of analysis. This characteristic is reflected in the structure and contents of research reports, which offer fewer opportunities for using these constructions.
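A technical note on comparing rates of occurrence: because disciplinary subcorpora differ in size, raw construction counts must be normalised before they can be compared. A minimal sketch, with invented placeholder counts rather than the study's data:

```python
# Normalising raw construction counts to rates per 10,000 words so that
# differently sized subcorpora stay comparable. All figures are invented.
subcorpora = {  # discipline -> (construction hits, subcorpus size in words)
    "medicine":           (310, 520_000),
    "physics":            (205, 480_000),
    "law":                (890, 500_000),
    "literary criticism": (815, 500_000),
}

for discipline, (hits, size) in subcorpora.items():
    rate = hits / size * 10_000
    print(f"{discipline:20s} {rate:6.1f} per 10,000 words")
```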
Abstract:
This master's thesis (pro gradu) examines the so-called core set of English modal auxiliary verbs: will, would, can, could, shall, should, may, might and must. Semantically, these auxiliaries are particularly complex: their interpretation often involves considerable nuance, even though traditional grammars suggest that each has two or three clearly distinct meanings. They therefore pose particular challenges in foreign-language learning. Recent developments in corpus-linguistic methods have produced ever more precise descriptions of how modal auxiliaries are used in present-day English and of the direction in which their use has been developing even over short spans of time. The aim of this study has been to compare the findings of this recent research with the reality that Finnish upper secondary school English teaching materials offer the student. My starting point was that the authenticity and communicativeness required by the national curriculum should be reflected in an even-handed treatment of the modal auxiliaries. My initial hypothesis, however, was that there are discrepancies between how modality appears in authentic settings and how it is presented in textbooks. My approach in this study was corpus-driven. From two upper secondary school textbook series I selected the volumes in which the modal auxiliaries are explicitly discussed. I scanned every complete text in the four books and built a small corpus from this material. Using a corpus-analysis program, I then retrieved from this corpus all sentences containing modal auxiliaries, and analysed each modal auxiliary semantically in its sentential context. On the basis of this analysis I was able to construct tables and compare the results with those of the most recent research. This study shows that discrepancies do exist. In general, the relative frequencies of the modal auxiliaries pointed in the right direction: no auxiliary was used markedly more or less often than recent research would lead one to expect. The semantic distribution of the auxiliaries, by contrast, showed at times considerable differences between the meanings emphasised in the textbooks and those that appear more frequent in present-day English. Can and must stood out in particular, in that the picture the textbooks give of their use is the opposite of what one would expect: can was used predominantly to mean 'ability' rather than 'possibility', which in the light of current research is its principal use. Must, for its part, overwhelmingly expressed 'obligation' in the material, whereas in present-day English it expresses 'inference' about as often as 'obligation'. In addition, 'permission' was asked for strikingly rarely in the material. On the basis of these results I suggest that textbook authors should, in general, abandon the fossilised notions of traditional grammars and dare to expose students to the full range of modal auxiliary meanings.
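The extraction step described above can be sketched as follows. This is a hypothetical reconstruction in Python, not the corpus-analysis program used in the study, and the corpus file name is illustrative: every sentence containing one of the nine core modals is pulled out for manual semantic coding, and per-modal frequencies are tallied.

```python
# Concordance-style extraction of modal-auxiliary sentences from a small
# plain-text corpus. The file name "textbook_corpus.txt" is illustrative.
import re
from collections import Counter

MODALS = r"\b(will|would|can|could|shall|should|may|might|must)\b"

def modal_sentences(text):
    # naive sentence split; a real study would use a proper tokenizer
    for sent in re.split(r"(?<=[.!?])\s+", text):
        hits = re.findall(MODALS, sent, flags=re.IGNORECASE)
        if hits:
            yield sent.strip(), [h.lower() for h in hits]

with open("textbook_corpus.txt", encoding="utf-8") as f:
    freq = Counter()
    for sent, hits in modal_sentences(f.read()):
        freq.update(hits)          # frequency table for later comparison
        print(sent, "->", hits)    # sentence queued for semantic coding
    print(freq.most_common())
```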
Abstract:
Temporal dynamics and speaker characteristics are two important features of speech that distinguish it from noise. In this paper, we propose a method to maximally extract these two features of speech for speech enhancement. We demonstrate that this can reduce the requirement for prior information about the noise, which can be difficult to estimate for fast-varying noise. Given noisy speech, the new approach estimates clean speech by recognizing long segments of the clean speech as whole units. In the recognition, clean speech sentences, taken from a speech corpus, are used as examples. Matching segments are identified between the noisy sentence and the corpus sentences. The estimate is formed from the longest matching segments found in the corpus sentences. Longer speech segments, as whole units, contain more distinct dynamics and richer speaker characteristics, and can be identified more accurately in noise than shorter speech segments. Therefore, estimation based on the longest recognized segments increases the noise immunity and hence the estimation accuracy. The new approach consists of a statistical model to represent up to sentence-long temporal dynamics in the corpus speech, and an algorithm to identify the longest matching segments between the noisy sentence and the corpus sentences. The algorithm is made more robust to noise uncertainty by introducing missing-feature-based noise compensation into the corpus sentences. Experiments have been conducted on the TIMIT database for speech enhancement from various types of nonstationary noise, including song, music, and crosstalk speech. The new approach has shown improved performance over conventional enhancement algorithms in both objective and subjective evaluations.
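The segment-matching core of the approach can be illustrated with a toy version: a longest-common-substring search over discrete frame labels. The real system matches acoustic segments under a statistical model, so everything below, including the label sequences, is a simplification of that idea rather than the authors' algorithm.

```python
# Toy longest-matching-segment search: classic longest-common-substring
# dynamic programming over two sequences of (illustrative) frame labels.
def longest_matching_segment(noisy, corpus_sent):
    """Return (length, start_in_noisy, start_in_corpus_sent) of the
    longest exactly matching segment between the two label sequences."""
    best = (0, 0, 0)
    prev = [0] * (len(corpus_sent) + 1)
    for i in range(1, len(noisy) + 1):
        cur = [0] * (len(corpus_sent) + 1)
        for j in range(1, len(corpus_sent) + 1):
            if noisy[i - 1] == corpus_sent[j - 1]:
                cur[j] = prev[j - 1] + 1   # extend the diagonal run
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j], j - cur[j])
        prev = cur
    return best

noisy  = list("xxABCDEFyy")                      # noisy utterance's labels
corpus = [list("zzABCDEFGH"), list("qqqCDEr")]   # clean corpus sentences
# The longest segment over all corpus sentences forms the estimate.
print(max(longest_matching_segment(noisy, s) for s in corpus))
```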
Abstract:
Computational models of meaning trained on naturally occurring text successfully model human performance on tasks involving simple similarity measures, but they characterize meaning in terms of undifferentiated bags of words or topical dimensions. This has led some to question their psychological plausibility (Murphy, 2002; Schunn, 1999). We present here a fully automatic method for extracting a structured and comprehensive set of concept descriptions directly from an English part-of-speech-tagged corpus. Concepts are characterized by weighted properties, enriched with concept-property types that approximate classical relations such as hypernymy and function. Our model outperforms comparable algorithms in cognitive tasks pertaining not only to concept-internal structures (discovering properties of concepts, grouping properties by property type) but also to inter-concept relations (clustering into superordinates), suggesting the empirical validity of the property-based approach.
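The extraction idea can be sketched in a few lines: part-of-speech patterns over the tagged corpus yield candidate concept-property pairs, which are then weighted. The single pattern below (adjacent adjective + noun) and the PMI weighting are illustrative stand-ins for the paper's richer pattern inventory and typed properties.

```python
# Pattern-based property extraction from a POS-tagged token stream.
# One illustrative pattern (JJ + NN) and PMI weighting; a toy stand-in.
import math
from collections import Counter

tagged = [("the", "DT"), ("red", "JJ"), ("apple", "NN"),
          ("a", "DT"), ("sweet", "JJ"), ("apple", "NN"),
          ("the", "DT"), ("red", "JJ"), ("car", "NN")]

pair_c, prop_c, conc_c = Counter(), Counter(), Counter()
for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
    if t1.startswith("JJ") and t2.startswith("NN"):
        pair_c[(w2, w1)] += 1        # concept -> property candidate
        prop_c[w1] += 1
        conc_c[w2] += 1

n = sum(pair_c.values())
for (concept, prop), c in pair_c.items():
    pmi = math.log((c / n) / ((conc_c[concept] / n) * (prop_c[prop] / n)))
    print(f"{concept} --has-property--> {prop}  (PMI {pmi:.2f})")
```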
Abstract:
This paper presents a new approach to single-channel speech enhancement involving both noise and channel distortion (i.e., convolutional noise). The approach is based on finding longest matching segments (LMS) from a corpus of clean, wideband speech. The approach adds three novel developments to our previous LMS research. First, we address the problem of channel distortion as well as additive noise. Second, we present an improved method for modeling noise. Third, we present an iterative algorithm for improved speech estimates. In speech recognition experiments on the Aurora 4 database, using our enhancement approach as a preprocessor for feature extraction significantly improved the performance of a baseline recognition system. In another comparison against conventional enhancement algorithms, both the PESQ and the segmental SNR ratings of the LMS algorithm were superior to those of the other methods for noisy speech enhancement. Index Terms: corpus-based speech model, longest matching segment, speech enhancement, speech recognition
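The iterative scheme can be caricatured as alternating two steps: estimate the channel as the average log-spectral offset between the observed speech and the current clean estimate, then re-estimate the clean speech from the corpus after removing that offset. The toy below substitutes nearest-neighbour frame lookup for the LMS machinery, and all data are synthetic; it is a sketch of the alternation, not the paper's algorithm.

```python
# Toy alternating estimation of a stationary convolutional channel
# (additive in the log-spectral domain) and the clean speech estimate.
import numpy as np

rng = np.random.default_rng(1)
corpus = rng.normal(size=(200, 16))            # clean log-spectral frames
true_channel = np.full(16, 0.8)                # constant channel offset
observed = corpus[rng.choice(200, 50)] + true_channel \
           + 0.1 * rng.normal(size=(50, 16))   # channel + slight noise

channel = np.zeros(16)
for it in range(5):
    compensated = observed - channel
    # (b) nearest clean corpus frame for every compensated frame
    idx = ((compensated[:, None, :] - corpus[None, :, :]) ** 2).sum(-1).argmin(1)
    estimate = corpus[idx]
    # (a) re-estimate the channel from the residual offset
    channel = (observed - estimate).mean(0)
    print(f"iter {it}: channel estimate mean {channel.mean():.3f} (true 0.8)")
```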
Abstract:
Master's dissertation, Natural Language Processing & Human Language Technology, Faculdade de Ciências Humanas e Sociais, Univ. do Algarve, 2011
Abstract:
Research in social psychology has shown that public attitudes towards feminism are mostly based on stereotypical views linking feminism with leftist politics and lesbian orientation. It is claimed that such attitudes are due to the negative and sexualised media construction of feminism. Studies concerned with the media representation of feminism seem to confirm this tendency. While most of this research provides significant insights into the representation of feminism, the findings are often based on a small sample of texts. Also, most of the research was conducted in an Anglo-American setting. This study attempts to address some of the shortcomings of previous work by examining the discourse of feminism in a large corpus of German and British newspaper data. It does so by employing the tools of Corpus Linguistics. By investigating the collocation profiles of the search term feminism, we provide evidence of salient discourse patterns surrounding feminism in two different cultural contexts.
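Collocation profiling of this kind can be sketched compactly: count the words within a window around the node word and rank them by an association measure. The sketch below uses a PMI-style score on a toy text; the study itself works with large newspaper corpora, and corpus-linguistic practice often prefers stricter statistics such as log-likelihood.

```python
# Collocation profile of a node word: words in a +/-4 token window,
# ranked by a PMI-style association score. Toy text for illustration.
import math
from collections import Counter

def collocates(tokens, node, window=4, min_count=1):
    total = Counter(tokens)
    near = Counter()
    n = len(tokens)
    for i, tok in enumerate(tokens):
        if tok == node:
            for j in range(max(0, i - window), min(n, i + window + 1)):
                if j != i:
                    near[tokens[j]] += 1
    n_near = sum(near.values())
    # log( P(word | near node) / P(word) ): positive = attracted to node
    scores = {w: math.log((c / n_near) / (total[w] / n))
              for w, c in near.items() if c >= min_count}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

text = ("debates about feminism often link feminism with politics "
        "while media coverage frames feminism negatively").split()
print(collocates(text, "feminism"))
```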