974 resultados para Lexical items


Relevância:

70.00% 70.00%

Publicador:

Resumo:

Pós-graduação em Estudos Linguísticos - IBILCE

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq)

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Pós-graduação em Estudos Linguísticos - IBILCE

Relevância:

70.00% 70.00%

Publicador:

Resumo:

This thesis concerns artificially intelligent natural language processing systems that are capable of learning the properties of lexical items (properties like verbal valency or inflectional class membership) autonomously while they are fulfilling their tasks for which they have been deployed in the first place. Many of these tasks require a deep analysis of language input, which can be characterized as a mapping of utterances in a given input C to a set S of linguistically motivated structures with the help of linguistic information encoded in a grammar G and a lexicon L: G + L + C → S (1) The idea that underlies intelligent lexical acquisition systems is to modify this schematic formula in such a way that the system is able to exploit the information encoded in S to create a new, improved version of the lexicon: G + L + S → L' (2) Moreover, the thesis claims that a system can only be considered intelligent if it does not just make maximum usage of the learning opportunities in C, but if it is also able to revise falsely acquired lexical knowledge. So, one of the central elements in this work is the formulation of a couple of criteria for intelligent lexical acquisition systems subsumed under one paradigm: the Learn-Alpha design rule. The thesis describes the design and quality of a prototype for such a system, whose acquisition components have been developed from scratch and built on top of one of the state-of-the-art Head-driven Phrase Structure Grammar (HPSG) processing systems. The quality of this prototype is investigated in a series of experiments, in which the system is fed with extracts of a large English corpus. While the idea of using machine-readable language input to automatically acquire lexical knowledge is not new, we are not aware of a system that fulfills Learn-Alpha and is able to deal with large corpora. To instance four major challenges of constructing such a system, it should be mentioned that a) the high number of possible structural descriptions caused by highly underspeci ed lexical entries demands for a parser with a very effective ambiguity management system, b) the automatic construction of concise lexical entries out of a bulk of observed lexical facts requires a special technique of data alignment, c) the reliability of these entries depends on the system's decision on whether it has seen 'enough' input and d) general properties of language might render some lexical features indeterminable if the system tries to acquire them with a too high precision. The cornerstone of this dissertation is the motivation and development of a general theory of automatic lexical acquisition that is applicable to every language and independent of any particular theory of grammar or lexicon. This work is divided into five chapters. The introductory chapter first contrasts three different and mutually incompatible approaches to (artificial) lexical acquisition: cue-based queries, head-lexicalized probabilistic context free grammars and learning by unification. Then the postulation of the Learn-Alpha design rule is presented. The second chapter outlines the theory that underlies Learn-Alpha and exposes all the related notions and concepts required for a proper understanding of artificial lexical acquisition. Chapter 3 develops the prototyped acquisition method, called ANALYZE-LEARN-REDUCE, a framework which implements Learn-Alpha. The fourth chapter presents the design and results of a bootstrapping experiment conducted on this prototype: lexeme detection, learning of verbal valency, categorization into nominal count/mass classes, selection of prepositions and sentential complements, among others. The thesis concludes with a review of the conclusions and motivation for further improvements as well as proposals for future research on the automatic induction of lexical features.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

The group analysed some syntactic and phonological phenomena that presuppose the existence of interrelated components within the lexicon, which motivate the assumption that there are some sublexicons within the global lexicon of a speaker. This result is confirmed by experimental findings in neurolinguistics. Hungarian speaking agrammatic aphasics were tested in several ways, the results showing that the sublexicon of closed-class lexical items provides a highly automated complex device for processing surface sentence structure. Analysing Hungarian ellipsis data from a semantic-syntactic aspect, the group established that the lexicon is best conceived of being as split into at least two main sublexicons: the store of semantic-syntactic feature bundles and a separate store of sound forms. On this basis they proposed a format for representing open-class lexical items whose meanings are connected via certain semantic relations. They also proposed a new classification of verbs to account for the contribution of the aspectual reading of the sentence depending on the referential type of the argument, and a new account of the syntactic and semantic behaviour of aspectual prefixes. The partitioned sets of lexical items are sublexicons on phonological grounds. These sublexicons differ in terms of phonotactic grammaticality. The degrees of phonotactic grammaticality are tied up with the problem of psychological reality, of how many degrees of this native speakers are sensitive to. The group developed a hierarchical construction network as an extension of the original General Inheritance Network formalism and this framework was then used as a platform for the implementation of the grammar fragments.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

In the context of a synchronic lexical study of the Ede varieties of West Africa, this paper investigates whether the use of different criteria sets to judge the similarity of lexical features in different language varieties yields matching conclusions regarding the relative relationships and clustering of the investigated varieties and thus leads to similar recommendations for further sociolinguistic research. Word lists elicited in 28 Ede varieties were analyzed with the inspection method. To explore the effects of different similarity judgment criteria, two different similarity judgment criteria sets were applied to the elicited data to identify similar lexical items. The quantification of these similarity decisions led to the computation of two similarity matrices which were subsequently analyzed by means of correlation analysis and multidimensional scaling. The findings of this analysis suggest compatible conclusions regarding the relative relationships and clustering of the investigated Ede varieties. However, the matching clustering results do not necessarily lead to the same recommendations for more in-depth sociolinguistic research, when interpreted in terms of an absolute lexical similarity threshold. The indicated ambiguities suggest the usefulness of focusing on the relative, rather than absolute in establishing recommendations for further sociolinguistic research.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

In the present study we focus on the interaction between the acquisition of new words and text organisation. In the acquisition of new words we emphasise the acquisition of paradigmatic relations such as hyponymy, meronymy and semantic sets. We work with a group of girls attending a private school for adolescents in serious difficulties. The subjects are from disadvantaged families. Their writing skills were very poor. When asked to describe a garden, they write a short text of a single paragraph, the lexical items were generic, there were no adjectives, and all of them use mainly existential verbs. The intervention plan assumed that subjects must to be exposed to new words, working out its meaning. In presence of referents subjects were taught new words making explicit the intended relation of the new term to a term already known. In the classroom subjects were asked to write all the words they knew drawing the relationships among them. They talk about the words specifying the relation making explicit pragmatic directions like is a kind of, is a part of or are all x. After that subjects were exposed to the task of choosing perspective. The work presented in this paper accounts for significant differences in the text of the subjects before and after the intervention. While working new words subjects were organising their lexicon and learning to present a whole entity in perspective.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Aim: In Western Europe, HIV/AIDS prevention has been based on the provision of information intended to lead the public to voluntarily adapt their behaviour so as to avoid the risk of virus transmission. Whether conveyed in a written or oral form, the messages of prevention are essentially verbal. Sociolinguistic research confirms that, even within a given culture, the meaning attributed to lexical items varies. It was hypothesised that understandings of the terms used in HIV/AIDS prevention in French-speaking Switzerland would vary, and research was undertaken to identify the level and nature of this variation both between and among those who transmit (prevention providers) and those who receive (the public) the messages. Method/issue: All HIV/AIDS prevention material available in French-speaking Switzerland in 2004 was assembled and a corpus of 50 key documents identified. Two series of lexical items were generated from this corpus: one composed of technical terms potentially difficult to understand, and the other, of terms used in everyday language with implicit, and therefore potentially variable, meaning. The two lists of terms were investigated in qualitative interviews in stratified purposive samples of the general public (n=60) and prevention providers (n=30), using standard socio-linguistic methodology. A further quantitative study (CATI) in the general population (17 - 49 yrs.; n=500) investigated understandings of 15 key prevention terms found in the qualitative research to have been associated with high levels of dissension. Results/comments: Selected aspects of the results will be presented. In illustration: meanings attributed to the different terms in both the public and the providers varied. For example, when a relationship is described as "stable", this may be understood as implying exclusive sexual relations or long duration, with an interaction between the two traits; the term "sexual intercourse" may or may not be used to refer to oral sex; "making love" may or may not necessarily include an act of penetration; the pre-ejaculate is qualified by some as sperm, and by others not... Understanding of frequently used "technical" terms in prevention was far from universal; for example, around only a half of respondents understood the meaning of "safer sex". Degree of understanding of these terms was linked to education, whereas variability in meaning in everyday language was not linked to socio-economic variables. Discussion: Findings indicate the need for more awareness regarding the heterogeneity of meaning around the terms regularly used in prevention. Greater attention should be paid to the formulation of prevention messages, and providers should take precautions to ensure that the meanings they wish to convey are those perceived by the receivers of their messages. Wherever possible, terms used should be defined and meanings rendered explicit.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

A crucial step for understanding how lexical knowledge is represented is to describe the relative similarity of lexical items, and how it influences language processing. Previous studies of the effects of form similarity on word production have reported conflicting results, notably within and across languages. The aim of the present study was to clarify this empirical issue to provide specific constraints for theoretical models of language production. We investigated the role of phonological neighborhood density in a large-scale picture naming experiment using fine-grained statistical models. The results showed that increasing phonological neighborhood density has a detrimental effect on naming latencies, and re-analyses of independently obtained data sets provide supplementary evidence for this effect. Finally, we reviewed a large body of evidence concerning phonological neighborhood density effects in word production, and discussed the occurrence of facilitatory and inhibitory effects in accuracy measures. The overall pattern shows that phonological neighborhood generates two opposite forces, one facilitatory and one inhibitory. In cases where speech production is disrupted (e.g. certain aphasic symptoms), the facilitatory component may emerge, but inhibitory processes dominate in efficient naming by healthy speakers. These findings are difficult to accommodate in terms of monitoring processes, but can be explained within interactive activation accounts combining phonological facilitation and lexical competition.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

L'objectiu d'aquest article és el d'examinar les característiques lèxiques del castellà parlat pels catalanoparlants a partir de les ocurrències recollides en la ciutat de Lleida. Cal tenir en compte que els trets que de fineixen els ítems lèxics que hi analitzem estan determinats per les condicions sociolingüístiques de les dues llengües en contacte. La transferència contínua que es produeix entre els dos sistemes lingüístics fa que el lèxic emprat sigui també peculiar i estigui restringit als territoris de parla catalana.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Can crowdsourcing solutions serve many masters? Can they be beneficial for both, for the layman or native speakers of minority languages on the one hand and serious linguistic research on the other? How did an infrastructure that was designed to support linguistics turn out to be a solution for raising awareness of native languages? Since 2012 the National Library of Finland has been developing the Digitisation Project for Kindred Languages, in which the key objective is to support a culture of openness and interaction in linguistic research, but also to promote crowdsourcing as a tool for participation of the language community in research. In the course of the project, over 1,200 monographs and nearly 111,000 pages of newspapers in Finno-Ugric languages will be digitised and made available in the Fenno-Ugrica digital collection. This material was published in the Soviet Union in the 1920s and 1930s, and users have had only sporadic access to the material. The publication of open-access and searchable materials from this period is a goldmine for researchers. Historians, social scientists and laymen with an interest in specific local publications can now find text materials pertinent to their studies. The linguistically-oriented population can also find writings to delight them: (1) lexical items specific to a given publication, and (2) orthographically-documented specifics of phonetics. In addition to the open access collection, we developed an open source code OCR editor that enables the editing of machine-encoded text for the benefit of linguistic research. This tool was necessary since these rare and peripheral prints often include already archaic characters, which are neglected by modern OCR software developers but belong to the historical context of kindred languages, and are thus an essential part of the linguistic heritage. When modelling the OCR editor, it was essential to consider both the needs of researchers and the capabilities of lay citizens, and to have them participate in the planning and execution of the project from the very beginning. By implementing the feedback iteratively from both groups, it was possible to transform the requested changes as tools for research that not only supported the work of linguistics but also encouraged the citizen scientists to face the challenge and work with the crowdsourcing tools for the benefit of research. This presentation will not only deal with the technical aspects, developments and achievements of the infrastructure but will highlight the way in which user groups, researchers and lay citizens were engaged in a process as an active and communicative group of users and how their contributions were made to mutual benefit.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

The topic of the present doctoral dissertation is the analysis of the phonological and tonal structures of a previously largely undescribed language, namely Samue. It is a Gur language belonging to the Niger-Congo language phulym, which is spoken in Burkina Faso. The data were collected during the fieldwork period in a Sama village; the data include 1800 lexical items, thousands of elicited sentences and 30 oral texts. The data were first transcribed phonetically and then the phonological and tonal analyses were conducted. The results show that the phonological system of Samue with the phoneme inventory and phonological processes has the same characteristics as other related Gur languages, although some particularities were found, such as the voicing and lenition of stop consonants in medial positions. Tonal analysis revealed three level tones, which have both lexical and grammatical functions. A particularity of the tonal system is the regressive Mid tone spreading in the verb phrase. The theoretical framework used in the study is Optimality theory. Optimality theory is rarely used in the analysis of an entire language system, and thus an objective was to see whether the theory was applicable to this type of work. Within the tonal analysis especially, some language specific constraints had to be created, although the basic Optimality Theory principle is the universal nature of the constraints. These constraints define the well-formedness of the language structures and they are differently ranked in different languages. This study gives new insights about typological phenomena in Gur languages. It is also a fundamental starting point for the Samue language in relation to the establishment of an orthography. From the theoretical point of view, the study proves that Optimality theory is largely applicable in the analysis of an entire sound system.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

The emerging technologies have recently challenged the libraries to reconsider their role as a mere mediator between the collections, researchers, and wider audiences (Sula, 2013), and libraries, especially the nationwide institutions like national libraries, haven’t always managed to face the challenge (Nygren et al., 2014). In the Digitization Project of Kindred Languages, the National Library of Finland has become a node that connects the partners to interplay and work for shared goals and objectives. In this paper, I will be drawing a picture of the crowdsourcing methods that have been established during the project to support both linguistic research and lingual diversity. The National Library of Finland has been executing the Digitization Project of Kindred Languages since 2012. The project seeks to digitize and publish approximately 1,200 monograph titles and more than 100 newspapers titles in various, and in some cases endangered Uralic languages. Once the digitization has been completed in 2015, the Fenno-Ugrica online collection will consist of 110,000 monograph pages and around 90,000 newspaper pages to which all users will have open access regardless of their place of residence. The majority of the digitized literature was originally published in the 1920s and 1930s in the Soviet Union, and it was the genesis and consolidation period of literary languages. This was the era when many Uralic languages were converted into media of popular education, enlightenment, and dissemination of information pertinent to the developing political agenda of the Soviet state. The ‘deluge’ of popular literature in the 1920s to 1930s suddenly challenged the lexical orthographic norms of the limited ecclesiastical publications from the 1880s onward. Newspapers were now written in orthographies and in word forms that the locals would understand. Textbooks were written to address the separate needs of both adults and children. New concepts were introduced in the language. This was the beginning of a renaissance and period of enlightenment (Rueter, 2013). The linguistically oriented population can also find writings to their delight, especially lexical items specific to a given publication, and orthographically documented specifics of phonetics. The project is financially supported by the Kone Foundation in Helsinki and is part of the Foundation’s Language Programme. One of the key objectives of the Kone Foundation Language Programme is to support a culture of openness and interaction in linguistic research, but also to promote citizen science as a tool for the participation of the language community in research. In addition to sharing this aspiration, our objective within the Language Programme is to make sure that old and new corpora in Uralic languages are made available for the open and interactive use of the academic community as well as the language societies. Wordlists are available in 17 languages, but without tokenization, lemmatization, and so on. This approach was verified with the scholars, and we consider the wordlists as raw data for linguists. Our data is used for creating the morphological analyzers and online dictionaries at the Helsinki and Tromsø Universities, for instance. In order to reach the targets, we will produce not only the digitized materials but also their development tools for supporting linguistic research and citizen science. The Digitization Project of Kindred Languages is thus linked with the research of language technology. The mission is to improve the usage and usability of digitized content. During the project, we have advanced methods that will refine the raw data for further use, especially in the linguistic research. How does the library meet the objectives, which appears to be beyond its traditional playground? The written materials from this period are a gold mine, so how could we retrieve these hidden treasures of languages out of the stack that contains more than 200,000 pages of literature in various Uralic languages? The problem is that the machined-encoded text (OCR) contains often too many mistakes to be used as such in research. The mistakes in OCRed texts must be corrected. For enhancing the OCRed texts, the National Library of Finland developed an open-source code OCR editor that enabled the editing of machine-encoded text for the benefit of linguistic research. This tool was necessary to implement, since these rare and peripheral prints did often include already perished characters, which are sadly neglected by the modern OCR software developers, but belong to the historical context of kindred languages and thus are an essential part of the linguistic heritage (van Hemel, 2014). Our crowdsourcing tool application is essentially an editor of Alto XML format. It consists of a back-end for managing users, permissions, and files, communicating through a REST API with a front-end interface—that is, the actual editor for correcting the OCRed text. The enhanced XML files can be retrieved from the Fenno-Ugrica collection for further purposes. Could the crowd do this work to support the academic research? The challenge in crowdsourcing lies in its nature. The targets in the traditional crowdsourcing have often been split into several microtasks that do not require any special skills from the anonymous people, a faceless crowd. This way of crowdsourcing may produce quantitative results, but from the research’s point of view, there is a danger that the needs of linguists are not necessarily met. Also, the remarkable downside is the lack of shared goal or the social affinity. There is no reward in the traditional methods of crowdsourcing (de Boer et al., 2012). Also, there has been criticism that digital humanities makes the humanities too data-driven and oriented towards quantitative methods, losing the values of critical qualitative methods (Fish, 2012). And on top of that, the downsides of the traditional crowdsourcing become more imminent when you leave the Anglophone world. Our potential crowd is geographically scattered in Russia. This crowd is linguistically heterogeneous, speaking 17 different languages. In many cases languages are close to extinction or longing for language revitalization, and the native speakers do not always have Internet access, so an open call for crowdsourcing would not have produced appeasing results for linguists. Thus, one has to identify carefully the potential niches to complete the needed tasks. When using the help of a crowd in a project that is aiming to support both linguistic research and survival of endangered languages, the approach has to be a different one. In nichesourcing, the tasks are distributed amongst a small crowd of citizen scientists (communities). Although communities provide smaller pools to draw resources, their specific richness in skill is suited for complex tasks with high-quality product expectations found in nichesourcing. Communities have a purpose and identity, and their regular interaction engenders social trust and reputation. These communities can correspond to research more precisely (de Boer et al., 2012). Instead of repetitive and rather trivial tasks, we are trying to utilize the knowledge and skills of citizen scientists to provide qualitative results. In nichesourcing, we hand in such assignments that would precisely fill the gaps in linguistic research. A typical task would be editing and collecting the words in such fields of vocabularies where the researchers do require more information. For instance, there is lack of Hill Mari words and terminology in anatomy. We have digitized the books in medicine, and we could try to track the words related to human organs by assigning the citizen scientists to edit and collect words with the OCR editor. From the nichesourcing’s perspective, it is essential that altruism play a central role when the language communities are involved. In nichesourcing, our goal is to reach a certain level of interplay, where the language communities would benefit from the results. For instance, the corrected words in Ingrian will be added to an online dictionary, which is made freely available for the public, so the society can benefit, too. This objective of interplay can be understood as an aspiration to support the endangered languages and the maintenance of lingual diversity, but also as a servant of ‘two masters’: research and society.