19 results for lexicon

in Aston University Research Archive


Relevance: 20.00%

Abstract:

In this article, we present the first open-access lexical database that provides phonological representations for 120,000 Italian word forms. Each of these also includes syllable boundaries and stress markings and a comprehensive range of lexical statistics. Using data derived from this lexicon, we have also generated a set of derived databases and provided estimates of positional frequency use for Italian phonemes, syllables, syllable onsets and codas, and character and phoneme bigrams. These databases are freely available from phonitalia.org. This article describes the methods, content, and summarizing statistics for these databases. In a first application of this database, we also demonstrate how the distribution of phonological substitution errors made by Italian aphasic patients is related to phoneme frequency. © 2013 Psychonomic Society, Inc.
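The positional frequency and bigram estimates mentioned above can be illustrated with a minimal sketch. The toy lexicon, its transcriptions, and its frequency counts are invented for illustration and are not taken from the phonitalia.org databases; the function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy lexicon: word -> (phoneme list, corpus frequency).
# Transcriptions and counts are invented for illustration only.
TOY_LEXICON = {
    "casa": (["k", "a", "s", "a"], 120),
    "cane": (["k", "a", "n", "e"], 80),
    "sano": (["s", "a", "n", "o"], 20),
}


def positional_phoneme_counts(lexicon):
    """Token-weighted count of each phoneme at each position in the word."""
    counts = Counter()
    for phonemes, freq in lexicon.values():
        for position, phoneme in enumerate(phonemes):
            counts[(position, phoneme)] += freq
    return counts


def phoneme_bigram_counts(lexicon):
    """Token-weighted count of each pair of adjacent phonemes."""
    counts = Counter()
    for phonemes, freq in lexicon.values():
        for first, second in zip(phonemes, phonemes[1:]):
            counts[(first, second)] += freq
    return counts


positional = positional_phoneme_counts(TOY_LEXICON)
bigrams = phoneme_bigram_counts(TOY_LEXICON)
```

Weighting each word by its corpus frequency gives token counts; replacing `freq` with 1 would give type counts instead.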

Relevance: 20.00%

Abstract:

We describe the case of a dysgraphic aphasic individual, S.G.W., who, in writing to dictation, produced high rates of formally related errors consisting of both lexical substitutions and what we call morphological-compound errors involving legal or illegal combinations of morphemes. These errors were produced in the context of a minimal number of semantic errors. We could exclude problems with phonological discrimination and phonological short-term memory. We also excluded rapid decay of lexical information and/or weak activation of word forms and letter representations since S.G.W.'s spelling showed no effect of delay and no consistent length effects, but, instead, paradoxical complexity effects with segmental, lexical, and morphological errors that were more complex than the target. The case of S.G.W. strongly resembles that of another dysgraphic individual reported in the literature, D.W., suggesting that this pattern of errors can be replicated across patients. In particular, both patients show unusual errors resulting in the production of neologistic compounds (e.g., "bed button" in response to "bed"). These patterns can be explained if we accept two claims: (a) Brain damage can produce both a reduction and an increase in lexical activation; and (b) there are direct connections between phonological and orthographic lexical representations (a third spelling route). We suggest that both patients are suffering from a difficulty of lexical selection resulting from excessive activation of formally related lexical representations. This hypothesis is strongly supported by S.G.W.'s worse performance in spelling to dictation than in written naming, which shows that a phonological input, activating a cohort of formally related lexical representations, increases selection difficulties. © 2014 Taylor & Francis.

Relevance: 10.00%

Abstract:

Derivational morphology proposes meaningful connections between words and is largely unrepresented in lexical databases. This thesis presents a project to enrich a lexical database with morphological links and to evaluate their contribution to disambiguation. A lexical database with sense distinctions was required. WordNet was chosen because of its free availability and widespread use. Its suitability was assessed through critical evaluation with respect to specifications and criticisms, using a transparent, extensible model. The identification of serious shortcomings suggested a portable enrichment methodology, applicable to alternative resources. Although 40% of the most frequent words are prepositions, they have been largely ignored by computational linguists, so addition of prepositions was also required. The preferred approach to morphological enrichment was to infer relations from phenomena discovered algorithmically. Both existing databases and existing algorithms can capture regular morphological relations, but cannot capture exceptions correctly; neither of them provides any semantic information. Some morphological analysis algorithms are subject to the fallacy that morphological analysis can be performed simply by segmentation. Morphological rules, grounded in observation and etymology, govern associations between and attachment of suffixes and contribute to defining the meaning of morphological relationships. Specifying character substitutions circumvents the segmentation fallacy. Morphological rules are prone to undergeneration, minimised through a variable lexical validity requirement, and overgeneration, minimised by rule reformulation and restricting monosyllabic output. Rules take into account the morphology of ancestor languages through co-occurrences of morphological patterns. Multiple rules applicable to an input suffix need their precedence established.
The resistance of prefixations to segmentation has been addressed by identifying linking vowel exceptions and irregular prefixes. The automatic affix discovery algorithm applies heuristics to identify meaningful affixes and is combined with morphological rules into a hybrid model, fed only with empirical data, collected without supervision. Further algorithms apply the rules optimally to automatically pre-identified suffixes and break words into their component morphemes. To handle exceptions, stoplists were created in response to initial errors and fed back into the model through iterative development, leading to 100% precision, contestable only on lexicographic criteria. Stoplist length is minimised by special treatment of monosyllables and reformulation of rules. 96% of words and phrases are analysed. 218,802 directed derivational links have been encoded in the lexicon rather than the wordnet component of the model because the lexicon provides the optimal clustering of word senses. Both links and analyser are portable to an alternative lexicon. The evaluation uses the extended gloss overlaps disambiguation algorithm. The enriched model outperformed WordNet in terms of recall without loss of precision. Failure of all experiments to outperform disambiguation by frequency reflects on WordNet sense distinctions.
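The idea of specifying character substitutions, checked against a lexical validity requirement, can be sketched roughly as follows. This is only an illustration under stated assumptions: the rules, the toy lexicon, and the function `derive_base` are hypothetical and are not the thesis's actual rule set or analyser. List order stands in for rule precedence.

```python
# Hypothetical toy lexicon and rules, invented for illustration.
LEXICON = {"create", "examine", "happy", "dark", "nation"}

# Each rule rewrites a word-final character string:
# (ending of the derived word, ending of the proposed base).
# Rules are tried in list order, which stands in for rule precedence.
RULES = [
    ("ation", "ate"),
    ("ation", "e"),
    ("iness", "y"),
    ("ness", ""),
]


def derive_base(word, lexicon=LEXICON, rules=RULES):
    """Return the first rule-derived base that is a valid lexicon entry, else None."""
    for derived_ending, base_ending in rules:
        if word.endswith(derived_ending):
            candidate = word[: len(word) - len(derived_ending)] + base_ending
            # Lexical validity requirement: only accept attested bases.
            if candidate in lexicon:
                return candidate
    return None
```

In this sketch, "examination" is linked to "examine" by substitution rather than by stripping a suffix, and "nation" is left unanalysed because no candidate base passes the validity check, which is the kind of false split that pure segmentation would risk.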

Relevance: 10.00%

Abstract:

This paper asserts the increasing importance of academic English in an increasingly Anglophone world, and looks at the differences between academic English and general English, especially in terms of vocabulary. The creation of wordlists has played an important role in trying to establish the academic English lexicon, but these wordlists are not based on appropriate data, or are implemented inappropriately. There is as yet no adequate dictionary of academic English, and this paper reports on new efforts at Aston University to create a suitable corpus on which such a dictionary could be based.

Relevance: 10.00%

Abstract:

Few names resonate more loudly from the French Fourth Republic than that of Pierre Poujade, and few terms exude such a sulfurous odour as le poujadisme. Between 1953 and 1958, the Poujadists secured their place in modern French history, winning 52 seats in the National Assembly and inscribing a lasting entry in the lexicon of political protest. Taking as its starting point the fiftieth anniversary of Poujade’s movement held in its birthplace of Saint-Céré in July 2003, this article reassesses Poujadism fifty years on from its heyday. It considers Poujadism as the first important anti-globalisation movement in post-war France, a locus for the conflict between ‘stalemate’ traditionalism and socio-economic modernisation. It examines the trajectory of the Poujadists from anti-tax movement to political party, arguing the difficulty of defining Poujadism in classic political terms. In particular, the article takes issue with the perception of Poujadism as an extreme-right ideology and interprets it instead as a form of populist protest lacking a solid doctrinal core and opportunistic in its exploitation of political issues and allies. As such, it is argued, Poujadism represents a complex synthesis of both right-wing and left-wing values and discourses, as impervious to definition today as it was fifty years ago. The article considers the brief alliance of convenience between Poujade and Le Pen, and locates in Le Pen’s early Poujadist experience some of the methods and even some of the arguments used by the FN today. It concludes by discussing Poujade’s political activities after 1958, tracing his long-term conversion from violent opposition to the State under the Fourth Republic to co-operation under the Fifth. The author draws here on correspondence with Pierre Poujade up until his death in August 2003.

Relevance: 10.00%

Abstract:

Single word production requires that phoneme activation is maintained while articulatory conversion is taking place. Word serial recall, connected speech and non-word production (repetition and spelling) are all assumed to involve a phonological output buffer. A crucial question is whether the same memory resources are also involved in single word production. We investigate this question by assessing length and positional effects in the single word repetition and reading of six aphasic patients. We expect a damaged buffer to result in error rates per phoneme which increase with word length and in position effects. Although our patients had trouble with phoneme activation (they made mainly errors of phoneme selection), they did not show the effects expected from a buffer impairment. These results show that phoneme activation cannot be automatically equated with a buffer. We hypothesize that the phonemes of existing words are kept active through permanent links to the word node. Thus, the sustained activation needed for their articulation will come from the lexicon and will have different characteristics from the activation needed for the short-term retention of an unbound set of units. We conclude that there is no need and no evidence for a phonological buffer in single word production.
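The length prediction above (error rate per phoneme rising with word length under buffer damage) can be checked with a small calculation. The sketch below assumes, for simplicity, that responses contain only substitutions, so target and response align position by position; the trial data and the function name are invented for illustration.

```python
from collections import defaultdict


def error_rate_per_phoneme_by_length(trials):
    """Map word length -> proportion of phonemes produced incorrectly."""
    errors = defaultdict(int)
    totals = defaultdict(int)
    for target, response in trials:
        if len(target) != len(response):
            raise ValueError("sketch assumes substitution errors only")
        length = len(target)
        totals[length] += length
        errors[length] += sum(t != r for t, r in zip(target, response))
    return {length: errors[length] / totals[length] for length in sorted(totals)}


# Invented (target, response) phoneme sequences.
TRIALS = [
    (["k", "a", "t"], ["k", "a", "t"]),
    (["d", "o", "g"], ["d", "o", "k"]),
    (["t", "a", "v", "o", "l", "o"], ["t", "a", "v", "o", "l", "o"]),
    (["b", "a", "n", "a", "n", "a"], ["b", "a", "m", "a", "n", "a"]),
]

rates = error_rate_per_phoneme_by_length(TRIALS)
```

Dividing by the number of phonemes, rather than the number of words, matters here: longer words offer more opportunities for error, so a per-word rate would rise with length even without any buffer impairment.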

Relevance: 10.00%

Abstract:

Recently discovered sources indicate that the Jewish population of East Frisia in Northwest Germany used a variety based on Western Yiddish as an in-group vernacular well into the 20th century. The East Frisian Jewish variety shows contact-induced traces of Low German, mainly in the lexicon but also in a number of morphological structures. This study not only analyzes the influence of Low German on the East Frisian Jewish variety but also asks whether three hundred years of language contact have led to traces of the Jewish variety in East Frisian Low German.

Relevance: 10.00%

Abstract:

Sentiment analysis is concerned with automatically identifying the sentiment or opinion expressed in a given piece of text. Most prior work either uses prior lexical knowledge, defined as the sentiment polarity of words, or views the task as a text classification problem and relies on labeled corpora to train a sentiment classifier. While lexicon-based approaches do not adapt well to different domains, corpus-based approaches require expensive manual annotation effort. In this paper, we propose a novel framework where an initial classifier is learned by incorporating prior information extracted from an existing sentiment lexicon, with preferences on expectations of sentiment labels of those lexicon words being expressed using generalized expectation criteria. Documents classified with high confidence are then used as pseudo-labeled examples for automatic domain-specific feature acquisition. The word-class distributions of such self-learned features are estimated from the pseudo-labeled examples and are used to train another classifier by constraining the model's predictions on unlabeled instances. Experiments on both the movie-review data and the multi-domain sentiment dataset show that our approach attains comparable or better performance than existing weakly-supervised sentiment classification methods despite using no labeled documents.
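The two-stage idea (lexicon-seeded pseudo-labels, then self-learned word-class distributions) can be sketched as below. This is a deliberately simplified stand-in: a seed-word vote replaces the generalized expectation criteria, and a smoothed naive-Bayes-style scorer replaces the constrained second classifier. The seed lexicon, documents, threshold, and function names are all invented for illustration.

```python
import math
from collections import Counter

# Hypothetical seed lexicon and unlabeled documents.
SEED_LEXICON = {"good": "pos", "great": "pos", "bad": "neg", "awful": "neg"}
DOCS = [
    ["good", "moving", "story"],
    ["great", "moving", "cast"],
    ["bad", "boring", "story"],
    ["awful", "boring", "cast"],
    ["good", "bad", "plot"],
]


def lexicon_score(tokens):
    """Return (label, confidence) from seed-word votes; (None, 0.0) on a tie."""
    votes = Counter(SEED_LEXICON[t] for t in tokens if t in SEED_LEXICON)
    pos, neg = votes["pos"], votes["neg"]
    if pos == neg:
        return None, 0.0
    label = "pos" if pos > neg else "neg"
    return label, abs(pos - neg) / (pos + neg)


def pseudo_label(docs, threshold=1.0):
    """Keep only documents that the seed lexicon labels with high confidence."""
    labeled = []
    for tokens in docs:
        label, confidence = lexicon_score(tokens)
        if label is not None and confidence >= threshold:
            labeled.append((tokens, label))
    return labeled


def word_class_distributions(labeled, smoothing=1.0):
    """Estimate smoothed P(word | class) from the pseudo-labeled documents."""
    counts = {"pos": Counter(), "neg": Counter()}
    for tokens, label in labeled:
        counts[label].update(tokens)
    vocab = set(counts["pos"]) | set(counts["neg"])
    dists = {}
    for label, counter in counts.items():
        total = sum(counter.values()) + smoothing * len(vocab)
        dists[label] = {w: (counter[w] + smoothing) / total for w in vocab}
    return dists


def classify(tokens, dists):
    """Pick the class with the higher log-likelihood over known words."""
    scores = {
        label: sum(math.log(dist[t]) for t in tokens if t in dist)
        for label, dist in dists.items()
    }
    return max(scores, key=scores.get)


labeled = pseudo_label(DOCS)
dists = word_class_distributions(labeled)
```

In this toy run, "boring" is not in the seed lexicon but acquires a negative leaning from the pseudo-labeled documents, which is the domain-specific feature acquisition step in miniature.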

Relevance: 10.00%

Abstract:

This article presents two novel approaches for incorporating sentiment prior knowledge into the topic model for weakly supervised sentiment analysis where sentiment labels are considered as topics. One is by modifying the Dirichlet prior for topic-word distribution (LDA-DP), the other is by augmenting the model objective function through adding terms that express preferences on expectations of sentiment labels of the lexicon words using generalized expectation criteria (LDA-GE). We conducted extensive experiments on English movie review data and the multi-domain sentiment dataset as well as Chinese product reviews about mobile phones, digital cameras, MP3 players, and monitors. The results show that while both LDA-DP and LDA-GE perform comparably to existing weakly supervised sentiment classification algorithms, they are much simpler and more computationally efficient, rendering them more suitable for online and real-time sentiment classification on the Web. We observed that LDA-GE is more effective than LDA-DP, suggesting that it should be preferred when considering employing the topic model for sentiment analysis. Moreover, both models are able to extract highly domain-salient polarity words from text.
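One way to picture the LDA-DP idea of modifying the Dirichlet prior for the topic-word distribution is to build an asymmetric prior in which a lexicon word receives less prior mass under sentiment topics that contradict its polarity. The sketch below is an assumption-laden illustration: the function name, the base value, and the `mismatch_weight` are arbitrary choices, not the values used in the article.

```python
def build_topic_word_prior(vocab, lexicon, labels, base=0.01, mismatch_weight=0.05):
    """Return {label: [prior value per vocab word]} for sentiment topics.

    Words outside the lexicon keep the symmetric base prior. A lexicon word
    keeps the base prior under its own sentiment label and is down-weighted
    under every other label. All numeric values are illustrative assumptions.
    """
    prior = {}
    for label in labels:
        row = []
        for word in vocab:
            word_label = lexicon.get(word)
            if word_label is None or word_label == label:
                row.append(base)
            else:
                row.append(base * mismatch_weight)
        prior[label] = row
    return prior


# Hypothetical vocabulary and sentiment lexicon.
VOCAB = ["good", "bad", "camera", "battery"]
LEXICON = {"good": "pos", "bad": "neg"}
PRIOR = build_topic_word_prior(VOCAB, LEXICON, labels=["pos", "neg"])
```

The resulting rows could then be passed as the topic-word prior to any LDA implementation that accepts an asymmetric prior, so that sampling is nudged toward the lexicon's polarities without any labeled documents.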

Relevance: 10.00%

Abstract:

We propose a novel framework where an initial classifier is learned by incorporating prior information extracted from an existing sentiment lexicon. Preferences on expectations of sentiment labels of those lexicon words are expressed using generalized expectation criteria. Documents classified with high confidence are then used as pseudo-labeled examples for automatic domain-specific feature acquisition. The word-class distributions of such self-learned features are estimated from the pseudo-labeled examples and are used to train another classifier by constraining the model's predictions on unlabeled instances. Experiments on both the movie review data and the multi-domain sentiment dataset show that our approach attains comparable or better performance than existing weakly-supervised sentiment classification methods despite using no labeled documents.

Relevance: 10.00%

Abstract:

Previous research into formulaic language has focussed on specialised groups of people (e.g. L1 acquisition by infants and adult L2 acquisition) with ordinary adult native speakers of English receiving less attention. Additionally, whilst some features of formulaic language have been used as evidence of authorship (e.g. the Unabomber’s use of you can’t eat your cake and have it too) there has been no systematic investigation into this as a potential marker of authorship. This thesis reports the first full-scale study into the use of formulaic sequences by individual authors. The theory of formulaic language hypothesises that formulaic sequences contained in the mental lexicon are shaped by experience combined with what each individual has found to be communicatively effective. Each author’s repertoire of formulaic sequences should therefore differ. To test this assertion, three automated approaches to the identification of formulaic sequences are tested on a specially constructed corpus containing 100 short narratives. The first approach explores a limited subset of formulaic sequences using recurrence across a series of texts as the criterion for identification. The second approach focuses on a word which frequently occurs as part of formulaic sequences and also investigates alternative non-formulaic realisations of the same semantic content. Finally, a reference list approach is used. Whilst claiming authority for any reference list can be difficult, the proposed method utilises internet examples derived from lists prepared by others, a procedure which, it is argued, is akin to asking large groups of judges to reach consensus about what is formulaic. The empirical evidence supports the notion that formulaic sequences have potential as a marker of authorship since in some cases a Questioned Document was correctly attributed. 
Although this marker of authorship is not universally applicable, it does promise to become a viable new tool in the forensic linguist’s tool-kit.
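The first approach, which uses recurrence across a series of texts as the identification criterion, might look roughly like the sketch below. The abstract does not give the sequence length, the tokenisation, or the recurrence threshold, so `n`, `min_texts`, the whitespace tokeniser, and the sample texts are all assumptions made for illustration.

```python
from collections import defaultdict


def recurrent_ngrams(texts, n=3, min_texts=2):
    """Return the word n-grams that occur in at least `min_texts` different texts."""
    seen_in = defaultdict(set)
    for text_id, text in enumerate(texts):
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            seen_in[tuple(tokens[i:i + n])].add(text_id)
    return {ngram for ngram, ids in seen_in.items() if len(ids) >= min_texts}


# Invented sample narratives.
TEXTS = [
    "at the end of the day we left",
    "it was fine at the end of it",
    "we left early",
]

candidates = recurrent_ngrams(TEXTS)
```

Counting distinct texts rather than raw occurrences keeps a phrase repeated many times within a single narrative from qualifying on its own.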

Relevance: 10.00%

Abstract:

Browsing constitutes an important part of the user information searching process on the Web. In this paper, we present a browser plug-in called ESpotter, which recognizes entities of various types on Web pages and highlights them according to their types to assist user browsing. ESpotter uses a range of standard named entity recognition techniques. In addition, a key new feature of ESpotter is that it addresses the problem of multiple domains on the Web by adapting lexicon and patterns to these domains.
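Adapting the lexicon and patterns to the domain of the page being browsed could be organised along the following lines. This is a hypothetical sketch, not ESpotter's actual design: the domain table, the entity types, the regular expression, and the function names are invented for illustration.

```python
import re
from urllib.parse import urlparse

# Hypothetical per-domain resources; the domain, names, and patterns are
# invented for illustration and are not taken from ESpotter itself.
DEFAULT_RESOURCES = {"lexicon": {}, "patterns": []}
DOMAIN_RESOURCES = {
    "example.ac.uk": {
        "lexicon": {"Aston University": "Organization"},
        "patterns": [(re.compile(r"\bProf\.? [A-Z][a-z]+"), "Person")],
    },
}


def resources_for(url):
    """Select the lexicon and patterns registered for the URL's domain."""
    host = urlparse(url).hostname or ""
    for domain, resources in DOMAIN_RESOURCES.items():
        if host == domain or host.endswith("." + domain):
            return resources
    return DEFAULT_RESOURCES


def spot_entities(text, url):
    """Return (entity text, entity type) pairs found with domain-specific resources."""
    resources = resources_for(url)
    found = []
    for name, entity_type in resources["lexicon"].items():
        if name in text:
            found.append((name, entity_type))
    for pattern, entity_type in resources["patterns"]:
        for match in pattern.finditer(text):
            found.append((match.group(0), entity_type))
    return found
```

For example, `spot_entities("Prof Smith works at Aston University", "http://www.example.ac.uk/staff")` uses the academic-domain resources, while the same text on an unregistered domain falls back to the empty defaults.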

Relevance: 10.00%

Abstract:

The influence of text messaging on language has been hotly debated, especially in relation to spelling and the lexicon, but the impact of SMS on syntax has received less attention. This article focuses on manipulations within the verbal domain, as language evolution points towards a consistent trend going from synthetic to analytical forms (Bybee et al. 1994), which goes against the need for concision in texting. Based on an authentic corpus of about 500 SMS (Fairon et al. 2006b), the present study shows condensation strategies that are similar to those already described, yet reveals specific features such as the absence of aphaeresis and the scarcity of apocope, as well as the overuse of synthetic forms. It can thus be concluded that while SMS writing displays oral characteristics, it cannot obviously be assimilated to speech; in addition, it may well slow down language evolution and support the conservation of short standard forms.

Relevance: 10.00%

Abstract:

Sentiment analysis on Twitter has attracted much attention recently due to its wide applications in both commercial and public sectors. In this paper we present SentiCircles, a lexicon-based approach for sentiment analysis on Twitter. Different from typical lexicon-based approaches, which offer fixed and static prior sentiment polarities of words regardless of their context, SentiCircles takes into account the co-occurrence patterns of words in different contexts in tweets to capture their semantics and update their pre-assigned strength and polarity in sentiment lexicons accordingly. Our approach allows for the detection of sentiment at both entity-level and tweet-level. We evaluate our proposed approach on three Twitter datasets using three different sentiment lexicons to derive word prior sentiments. Results show that our approach significantly outperforms the baselines in accuracy and F-measure for entity-level subjectivity (neutral vs. polar) and polarity (positive vs. negative) detection. For tweet-level sentiment detection, our approach performs better than the state-of-the-art SentiStrength by 4-5% in accuracy in two datasets, but falls marginally behind by 1% in F-measure in the third dataset.
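The general idea of updating a word's prior sentiment from its co-occurrence context can be sketched as follows. This is not the SentiCircles representation itself, only a simplified stand-in that blends a term's prior score with the co-occurrence-weighted average prior of the words it appears with. The prior scores, tweets, blending weight, and function name are invented for illustration.

```python
from collections import Counter

# Hypothetical prior sentiment scores in [-1, 1] and sample tweets.
PRIOR = {"kill": -0.8, "great": 1.0, "love": 0.8, "awful": -1.0}
TWEETS = [
    "they kill it every show great band",
    "love how they kill it live",
    "awful weather today",
]


def contextual_sentiment(term, tweets, prior, weight=0.5):
    """Blend the term's prior score with the average prior of its context words."""
    cooccurrences = Counter()
    for tweet in tweets:
        tokens = tweet.lower().split()
        if term in tokens:
            cooccurrences.update(t for t in tokens if t != term and t in prior)
    base = prior.get(term, 0.0)
    total = sum(cooccurrences.values())
    if total == 0:
        return base  # no sentiment-bearing context: keep the prior
    context = sum(prior[w] * c for w, c in cooccurrences.items()) / total
    return (1 - weight) * base + weight * context


adapted_kill = contextual_sentiment("kill", TWEETS, PRIOR)
```

In these toy tweets "kill" has a negative prior but appears alongside positive words, so its adapted score moves toward the positive side, which illustrates the kind of context-dependent shift a static lexicon cannot express.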

Relevance: 10.00%

Abstract:

Sentiment lexicons for sentiment analysis offer a simple, yet effective way to obtain the prior sentiment information of opinionated words in texts. However, words' sentiment orientations and strengths often change throughout various contexts in which the words appear. In this paper, we propose a lexicon adaptation approach that uses the contextual semantics of words to capture their contexts in tweet messages and update their prior sentiment orientations and/or strengths accordingly. We evaluate our approach on one state-of-the-art sentiment lexicon using three different Twitter datasets. Results show that the sentiment lexicons adapted by our approach outperform the original lexicon in accuracy and F-measure in two datasets, but give similar accuracy and slightly lower F-measure in one dataset.