973 results for lexical analysis
Abstract:
Two decades after its inception, Latent Semantic Analysis (LSA) has become part and parcel of every modern introduction to Information Retrieval. For any tool that matures so quickly, it is important to check its lore and limitations, or else stagnation will set in. We focus here on the three main aspects of LSA that are well accepted, the gist of which can be summarized as follows: (1) that LSA recovers latent semantic factors underlying the document space, (2) that this can be accomplished through lossy compression of the document space by eliminating lexical noise, and (3) that the latter is best achieved by Singular Value Decomposition. For each aspect we performed experiments analogous to those reported in the LSA literature and compared the evidence brought to bear in each case. On the negative side, we show that the above claims about LSA are much more limited than commonly believed. Even a simple example shows that LSA does not recover the optimal semantic factors intended in the pedagogical example used in many LSA publications. Additionally, and remarkably deviating from LSA lore, LSA does not scale up well: the larger the document space, the more unlikely it is that LSA recovers an optimal set of semantic factors. On the positive side, we describe new algorithms to replace LSA (and more recent alternatives such as pLSA, LDA, and kernel methods) by trading its l2 space for an l1 space, thereby guaranteeing an optimal set of semantic factors. These algorithms seem to salvage the spirit of LSA as we think it was initially conceived.
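The lossy-compression step the abstract contests (claim 3) is a truncated Singular Value Decomposition of the term-document matrix. A minimal sketch with NumPy; the toy vocabulary and counts are invented for illustration and are not taken from the paper:

```python
import numpy as np

# Toy term-document matrix (rows: terms, columns: documents).
# The terms and counts are illustrative inventions.
X = np.array([
    [1, 1, 0, 0],  # "ship"
    [1, 0, 1, 0],  # "boat"
    [0, 1, 1, 0],  # "ocean"
    [0, 0, 0, 1],  # "tree"
    [0, 0, 1, 1],  # "wood"
], dtype=float)

# LSA-style compression: keep only the k largest singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = (U[:, :k] * s[:k]) @ Vt[:k, :]  # rank-k approximation of X

# Documents as points in the k-dimensional "latent semantic" space.
doc_vectors = (s[:k, None] * Vt[:k, :]).T  # one row per document
```

By the Eckart-Young theorem, `X_k` is the closest rank-k matrix to `X` in the Frobenius (l2) norm; that l2 sense of "optimal" is exactly what the abstract's proposed l1 alternative trades away.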
Abstract:
This dissertation is a cross-linguistic study of lexical iconicity, based on a genealogically stratified sample of 237 languages. The aim is to contribute an empirical study to the growing dialogue on different forms of lexical iconicity. The conceptual framework of the present study is based on an analysis of the types and means of lexical iconicity in the sample languages. Archaeological and cultural evidence is used to tie lexical iconicity to its context. Phenomena related to lexical iconicity are studied both cross-linguistically and language-specifically. The cognitive difference between imitation and symbolism is essential. Lexical iconicity concerns not only the iconic relationship between form and referents, but also how certain iconic properties may become conventionalised means of creating sound symbolism. All the sample languages show some evidence of lexical iconicity, demonstrating that it is a universal feature. Nine comparisons of onomatopoeic verbs and nouns, with samples varying between six and 141 languages, show that typologically very different languages use similar means of creating words based on sound imitation. Two cross-linguistic comparisons of bird names demonstrate that a vast majority of the Eurasian names of the common cuckoo and the world-wide names of crow and raven of the 141 genera are onomatopoeic.
Abstract:
The aim was to analyse the growth and compositional development of the receptive and expressive lexicons between the ages of 0;9 and 2;0 in full-term (FT) and very-low-birth-weight (VLBW) children acquiring Finnish. The associations between the expressive lexicon and grammar at 1;6 and 2;0 in the FT children were also studied. In addition, the language skills of the VLBW children at 2;0 were analysed, as well as the predictive value of the early lexicon for later language performance. Four groups took part in the studies: longitudinal (N = 35) and cross-sectional (N = 146) samples of FT children, and longitudinal (N = 32) and cross-sectional (N = 66) samples of VLBW children. The data were gathered by applying a structured parental rating method (the Finnish version of the Communicative Development Inventory), through analysis of the children's spontaneous speech, and by administering a formal test (Reynell Developmental Language Scales). The FT children acquired their receptive lexicons earlier, at a faster rate and with larger individual variation than their expressive lexicons. The acquisition rate of the expressive lexicon increased from slow to faster in most children (91%). Highly parallel developmental paths for lexical semantic categories were detected in the receptive and expressive lexicons of the Finnish children when analysed in relation to the growth of lexicon size, as described in the literature for children acquiring other languages. The emergence of grammar was closely associated with expressive lexical growth. The VLBW children acquired their receptive lexicons at a slower rate and had weaker language skills at 2;0 than the full-term children. The compositional development of both lexicons proceeded at a slower rate in the VLBW children than in the FT controls.
However, when compositional development was analysed in relation to the growth of lexicon size, it occurred in a nearly parallel manner in the VLBW and FT children. Early receptive and expressive lexicon sizes were significantly associated with later language skills in both groups. The effect of background variables (gender, length of the mother's basic education, birth weight) on language development differed between the FT and VLBW children. The results provide new information on early language acquisition by Finnish FT and VLBW children. They support the view that the early acquisition of lexical semantic categories is related to lexicon growth. The current findings also suggest that early grammatical acquisition is closely related to the growth of expressive vocabulary size. The language development of VLBW children should be followed in clinical work.
Abstract:
This paper presents a preliminary analysis of Kannada WordNet and a set of relevant computational tools. Although the design has been inspired by the famous English WordNet and, to a certain extent, by the Hindi WordNet, the unique features of Kannada WordNet are graded antonyms and meronymy relationships, nominal as well as verbal compounding, complex verb constructions and an efficient underlying database design (designed to handle storage and display of Kannada Unicode characters). Kannada WordNet will not only add to the sparse collection of machine-readable Kannada dictionaries, but will also give new insights into the Kannada vocabulary. It provides a sufficient interface for applications in Kannada machine translation, spell checking and semantic analysis.
Abstract:
For sign languages used by deaf communities, linguistic corpora have until recently been unavailable, due to the lack of a writing system and a written culture in these communities, and the very recent advent of digital video. Recent improvements in video and computer technology have now made larger sign language datasets possible; however, large sign language datasets that are fully machine-readable are still elusive. This is due to two challenges: (1) inconsistencies that arise when signs are annotated by means of spoken/written language, and (2) the fact that many parts of signed interaction are not necessarily fully composed of lexical signs (the equivalent of words), instead consisting of constructions that are less conventionalised. As sign language corpus building progresses, the potential for some standards in annotation is beginning to emerge, but before this project there were no attempts to standardise these practices across corpora, which is required for comparing data cross-linguistically. This project thus had the following aims: (1) to develop annotation standards for glosses (lexical/word level); (2) to test their reliability and validity; (3) to improve current software tools that facilitate a reliable workflow. Overall, the project aimed not only to set a standard for the whole field of sign language studies throughout the world but also to make significant advances toward two of the world's largest machine-readable datasets for sign languages – specifically the BSL Corpus (British Sign Language, http://bslcorpusproject.org) and the Corpus NGT (Sign Language of the Netherlands, http://www.ru.nl/corpusngt).
Abstract:
The main objective of this work is to study strategies of sense indeterminacy in a corpus of intercepted telephone conversations, considering that the production of meaning is a context-dependent cognitive process. We situate this research within cognitive linguistics in order to better understand the foundations and guiding assumptions of the Theory of Idealized Cognitive Models (ICM) and Conceptual Blending Theory, drawing mainly on the studies of Lakoff (1987), Fauconnier (1997) and Fauconnier and Turner (2002). Throughout the work we set out to answer the following research questions: (a) which strategies of sense indeterminacy are most frequently used in these conversations? (b) which elements of context and co-text allow the delimitation of the sense of a lexical item in a given conversation? (c) how do the strategies of sense indeterminacy operate in the corpus, and how do they help sustain a particular type of interpersonal relationship? To answer these questions, out of 22 recordings of telephone conversations of social actors involved in arms and drug trafficking, kidnapping and extortion, provided by the Security and Intelligence Coordination Office of the Public Prosecutor's Office of Rio de Janeiro, we selected 10 conversations on the basis of their sound quality, to be transcribed and submitted to a qualitative analysis of the use of polysemy and lexical vagueness. From the theoretical discussions and the analyses carried out, we conclude that polysemy is the most frequent strategy of sense indeterminacy in the corpus of this research and that it can be understood as a process of conceptual blending subject to social and cultural influences: it is the dynamicity of thought and language that generates polysemy. We also conclude that lexical vagueness is used in the corpus as a linguistic resource for referring to illicit matters.
The lexical items analysed instantiate abstract mental schemas whose senses are realized through linguistic and extralinguistic cues pointing to an interactional process that can be understood as a frame of commercial transactions (drug trafficking).
Abstract:
Infants' speech perception abilities change through the first year of life, from broad sensitivity to a wide range of speech contrasts to becoming more finely attuned to their native language. What remains unclear, however, is how this perceptual change relates to brain responses to native language contrasts in terms of the functional specialization of the left and right hemispheres. Here, to elucidate the developmental changes in functional lateralization accompanying this perceptual change, we conducted two experiments on Japanese infants using Japanese lexical pitch-accent, which changes word meanings with the pitch pattern within words. In the first behavioral experiment, using visual habituation, we confirmed that infants at both 4 and 10 months have sensitivities to the lexical pitch-accent pattern change embedded in disyllabic words. In the second experiment, near-infrared spectroscopy was used to measure cortical hemodynamic responses in the left and right hemispheres to the same lexical pitch-accent pattern changes and their pure tone counterparts. We found that brain responses to the pitch change within words differed between 4- and 10-month-old infants in terms of functional lateralization: Left hemisphere dominance for the perception of the pitch change embedded in words was seen only in the 10-month-olds. These results suggest that the perceptual change in Japanese lexical pitch-accent may be related to a shift in functional lateralization from bilateral to left hemisphere dominance.
Abstract:
The percentage of subjects recalling each unit in a list or prose passage is considered as a dependent measure. When the same units are recalled in different tasks, processing is assumed to be the same; when different units are recalled, processing is assumed to be different. Two collections of memory tasks are presented, one for lists and one for prose. The relations found in these two collections are supported by an extensive reanalysis of the existing prose memory literature. The same set of words was learned by 13 different groups of subjects under 13 different conditions. Included were intentional free-recall tasks, incidental free recall following lexical decision, and incidental free recall following ratings of orthographic distinctiveness and emotionality. Although the nine free-recall tasks varied widely with regard to the amount of recall, the relative probability of recall for the words was very similar among the tasks. Imagery encoding and recognition produced relative probabilities of recall that were different from each other and from the free-recall tasks. Similar results were obtained with a prose passage. A story was learned by 13 different groups of subjects under 13 different conditions. Eight free-recall tasks, which varied with respect to incidental or intentional learning, retention interval, and the age of the subjects, produced similar relative probabilities of recall, whereas recognition and prompted recall produced relative probabilities of recall that were different from each other and from the free-recall tasks. A review of the prose literature was undertaken to test the generality of these results. Analysis of variance is the most common statistical procedure in this literature. If the relative probability of recall of units varied across conditions, a units-by-condition interaction would be expected.
For the 12 studies that manipulated retention interval, an average of 21% of the variance was accounted for by the main effect of retention interval, 17% by the main effect of units, and only 2% by the retention interval by units interaction. Similarly, for the 12 studies that varied the age of the subjects, 6% of the variance was accounted for by the main effect of age, 32% by the main effect of units, and only 1% by the interaction of age by units.(ABSTRACT TRUNCATED AT 400 WORDS)
Abstract:
We present the results of exploratory experiments using lexical valence extracted from brain activity using electroencephalography (EEG) for sentiment analysis. We selected 78 English words (36 for training and 42 for testing), presented as stimuli to three native English speakers. EEG signals were recorded from the subjects while they performed a mental imaging task for each word stimulus. Wavelet decomposition was employed to extract EEG features from the time-frequency domain. The extracted features were used as inputs to a sparse multinomial logistic regression (SMLR) classifier for valence classification, after univariate ANOVA feature selection. After mapping EEG signals to sentiment valences, we exploited the lexical polarity extracted from the brain data to predict the valence of 12 sentences taken from the SemEval-2007 shared task, and compared it against existing lexical resources.
Abstract:
In order to explore the impact of a degraded semantic system on the structure of language production, we analysed transcripts from autobiographical memory interviews to identify naturally-occurring speech errors by eight patients with semantic dementia (SD) and eight age-matched normal speakers. Relative to controls, patients were significantly more likely to (a) substitute and omit open class words, (b) substitute (but not omit) closed class words, (c) substitute incorrect complex morphological forms and (d) produce semantically and/or syntactically anomalous sentences. Phonological errors were scarce in both groups. The study confirms previous evidence of SD patients’ problems with open class content words which are replaced by higher frequency, less specific terms. It presents the first evidence that SD patients have problems with closed class items and make syntactic as well as semantic speech errors, although these grammatical abnormalities are mostly subtle rather than gross. The results can be explained by the semantic deficit which disrupts the representation of a pre-verbal message, lexical retrieval and the early stages of grammatical encoding.
Abstract:
Treffers-Daller and Korybski propose to operationalize language dominance on the basis of measures of lexical diversity, computed, in this particular study, on transcripts of stories told by Polish-English bilinguals in each of their languages. They compute four different Indices of Language Dominance (ILD) on the basis of two different measures of lexical diversity, the Index of Guiraud (Guiraud, 1954) and HD-D (McCarthy & Jarvis, 2007). They compare simple indices, based on subtracting scores for one language from scores for the other language, to more complex indices based on the formula Birdsong borrowed from the field of handedness, namely the ratio (Difference in Scores) / (Sum of Scores). Positive scores on each of these Indices of Language Dominance mean that informants are more English-dominant, and negative scores that they are more Polish-dominant. The authors address the difficulty of comparing scores across languages by carefully lemmatizing the data. Following Flege, Mackay and Piske (2002), they also look into the validity of these indices by investigating to what extent they can predict scores on other, independently measured variables. They use correlations and regression analysis for this, which has the advantage that the dominance indices are used as continuous variables and arbitrary cut-off points between balanced and dominant bilinguals need not be chosen. However, they also show how the computation of z-scores can help facilitate a discussion about the appropriateness of different cut-off points across different data sets and measurement scales in those cases where researchers consider it necessary to make categorical distinctions between balanced and dominant bilinguals. Treffers-Daller and Korybski correlate the ILD scores with four other variables, namely length of residence in the UK, attitudes towards English and life in the UK, frequency of use of English at home, and frequency of code-switching.
They found that the indices correlated significantly with most of these variables, but there were clear differences between the Guiraud-based indices and the HD-D-based indices. In a regression analysis, three of the measures were also found to be significant predictors of English language use at home. They conclude that the correlations and the regression analyses lend strong support to the validity of their approach to language dominance.
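The two formulas named in the abstract are simple enough to state directly: the Index of Guiraud is the number of word types divided by the square root of the number of tokens, and the Birdsong-style dominance index is the ratio (difference in scores) / (sum of scores). A minimal sketch; the function names and sample tokens are our own illustrations, not the study's materials:

```python
import math

def guiraud(tokens):
    """Index of Guiraud (Guiraud, 1954): types / sqrt(tokens),
    a length-corrected measure of lexical diversity."""
    return len(set(tokens)) / math.sqrt(len(tokens))

def dominance_index(score_a, score_b):
    """Birdsong-style ratio: (difference in scores) / (sum of scores).
    Positive values indicate dominance in language A, negative in B."""
    return (score_a - score_b) / (score_a + score_b)

# Illustrative only: an ILD computed from two invented mini-transcripts.
english_diversity = guiraud(["the", "cat", "sat", "on", "the", "mat"])
polish_diversity = guiraud(["kot", "siedzi", "na", "macie"])
ild = dominance_index(english_diversity, polish_diversity)
```

The ratio form bounds the index between -1 and +1 regardless of the underlying diversity scale, which is why it travels well from handedness research to bilingualism.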
Abstract:
This article suggests a theoretical and methodological framework for a systematic contrastive discourse analysis across languages and discourse communities through keywords, constituting a lexical approach to discourse analysis that is considered particularly fruitful for comparative analysis. We use a corpus-assisted methodology, presuming meaning to be constituted, revealed and constrained by the collocation environment. We compare the use of the keywords intégration and Integration in French and German public discourses about migration on the basis of newspaper corpora built from two French and German newspapers from 1998 to 2011. We look at the frequency of these keywords over the given time span, group collocates into thematic categories, and discuss indicators of discursive salience by comparing the development of collocation profiles over time in both corpora, as well as the occurrence of neologisms and compounds based on intégration/Integration.
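The collocation profiles the authors compare can be operationalised as windowed co-occurrence counts around a keyword. A minimal sketch; the window size, tokenisation, and toy text are our assumptions, not the authors' settings:

```python
from collections import Counter

def collocation_profile(tokens, keyword, window=4):
    """Count tokens occurring within +/- `window` positions of each
    occurrence of `keyword` -- a basic collocation profile."""
    profile = Counter()
    for i, tok in enumerate(tokens):
        if tok != keyword:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # do not count the keyword node itself
                profile[tokens[j]] += 1
    return profile

# Illustrative toy corpus (invented), keyword "integration":
text = "debate on integration policy and successful integration of migrants"
profile = collocation_profile(text.split(), "integration", window=2)
```

Profiles built per year can then be compared over time, and collocates grouped into thematic categories, as the article does for intégration/Integration.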
Abstract:
This paper presents an approach for assisting low-literacy readers in accessing online Web information. The "Educational FACILITA" tool is a Web content adaptation tool that provides innovative features and follows more intuitive interaction models with regard to accessibility concerns. In particular, we propose an interaction model and a Web application that explore the natural language processing tasks of lexical elaboration and named entity labeling for improving Web accessibility. We report on the results of a pilot usability study carried out with low-literacy users. The preliminary results show that Educational FACILITA improves the comprehension of text elements, although the assistance mechanisms may also confuse users when word sense ambiguity is introduced by gathering, for a complex word, a list of synonyms with multiple meanings. This fact suggests a future solution in which the correct sense for a complex word in a sentence is identified, addressing this pervasive characteristic of natural languages. The pilot study also identified that experienced computer users find the tool more useful than novice computer users do.
Abstract:
This study aims to test Robertson's lexical transfer principle, which posits that Chinese learners use demonstratives (particularly this) and the numeral one as markers of definiteness and indefiniteness. This is tested by analysing Chinese learners' written production collected from the Spoken and Written English Corpus of Chinese Learners 2.0 (SWECCL 2.0). The purpose is to understand the variation in article usage by adult Chinese learners of English. More specifically, the study examines to what extent articles, possessive and demonstrative pronouns are used in Chinese learners' English and how definite and indefinite articles are used by the Chinese learners. The findings corroborate Robertson's lexical transfer principle. In addition, Chinese learners prefer to use demonstrative determiners, the possessive determiner our, and the numeral one to perform the function of marking definiteness and indefiniteness. In particular, the learners try to use the demonstrative determiners that and this in the anaphoric function instead of the definite article, and the demonstrative determiner those is frequently used in the cataphoric function. What is more, the learners use the numeral one as a marker of indefiniteness, and it is also used as a marker of definiteness in the anaphoric function. Further, the possessive determiner our is used as a marker of definiteness in larger situation uses referring to something unique. The study is thus able to show that the definite article is used to perform the function of marking indefiniteness, and that in some particular contexts the definite article functions as a Chinese specifier in Chinese learners' English. Also, the indefinite article is frequently used in quantifier phrases but rarely in other functions. There are three main reasons that may explain why Chinese learners use determiners in such varied ways.
Firstly, the choice of determiners by Chinese learners is influenced by linguistic contexts. Secondly, because of learning strategies, Chinese learners tend to ignore the anaphoric and cataphoric functions that they are not yet ready to process in article usage. Thirdly, interlanguage grammar influences optionality in the use of articles.
Abstract:
Graduate Program in Linguistic Studies - IBILCE