932 results for Lexical database
Abstract:
Derivational morphology proposes meaningful connections between words and is largely unrepresented in lexical databases. This thesis presents a project to enrich a lexical database with morphological links and to evaluate their contribution to disambiguation. A lexical database with sense distinctions was required; WordNet was chosen because of its free availability and widespread use. Its suitability was assessed through critical evaluation with respect to specifications and criticisms, using a transparent, extensible model. The identification of serious shortcomings suggested a portable enrichment methodology, applicable to alternative resources. Although 40% of the most frequent words are prepositions, they have been largely ignored by computational linguists, so the addition of prepositions was also required. The preferred approach to morphological enrichment was to infer relations from phenomena discovered algorithmically. Both existing databases and existing algorithms can capture regular morphological relations, but cannot capture exceptions correctly; neither provides any semantic information. Some morphological analysis algorithms are subject to the fallacy that morphological analysis can be performed simply by segmentation. Morphological rules, grounded in observation and etymology, govern associations between and attachment of suffixes, and contribute to defining the meaning of morphological relationships. Specifying character substitutions circumvents the segmentation fallacy. Morphological rules are prone to undergeneration, minimised through a variable lexical validity requirement, and to overgeneration, minimised by rule reformulation and by restricting monosyllabic output. The rules take into account the morphology of ancestor languages through co-occurrences of morphological patterns. Where multiple rules apply to an input suffix, their precedence must be established.
The resistance of prefixations to segmentation has been addressed by identifying linking vowel exceptions and irregular prefixes. The automatic affix discovery algorithm applies heuristics to identify meaningful affixes and is combined with morphological rules into a hybrid model, fed only with empirical data, collected without supervision. Further algorithms apply the rules optimally to automatically pre-identified suffixes and break words into their component morphemes. To handle exceptions, stoplists were created in response to initial errors and fed back into the model through iterative development, leading to 100% precision, contestable only on lexicographic criteria. Stoplist length is minimised by special treatment of monosyllables and reformulation of rules. 96% of words and phrases are analysed. 218,802 directed derivational links have been encoded in the lexicon rather than the wordnet component of the model because the lexicon provides the optimal clustering of word senses. Both links and analyser are portable to an alternative lexicon. The evaluation uses the extended gloss overlaps disambiguation algorithm. The enriched model outperformed WordNet in terms of recall without loss of precision. Failure of all experiments to outperform disambiguation by frequency reflects on WordNet sense distinctions.
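The character-substitution idea described above, with its lexical validity requirement, can be sketched roughly as follows; the rule list, toy lexicon, and function name here are illustrative assumptions, not the thesis's actual rules:

```python
# A minimal sketch (hypothetical names) of morphological rules expressed as
# character substitutions rather than segmentations: a rule replaces one
# word-final pattern with another, and a candidate relation is kept only if
# the output is lexically valid, avoiding the segmentation fallacy.

LEXICON = {"navigate", "navigation", "decide", "decision", "create", "creation"}

# Each rule is (input suffix, output suffix); substitution, not segmentation.
RULES = [
    ("ation", "ate"),   # navigation -> navigate
    ("ision", "ide"),   # decision -> decide
]

def derive_source(word, lexicon=LEXICON, rules=RULES):
    """Return the lexically valid base a derived word maps to, or None."""
    for in_suf, out_suf in rules:
        if word.endswith(in_suf):
            candidate = word[: -len(in_suf)] + out_suf
            if candidate in lexicon:        # lexical validity requirement
                return candidate
    return None
```

Note how "station" would be rejected: segmentation alone would strip "-ation", but the substituted output "state" is not a lexically valid base here, so no link is proposed.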
Abstract:
The Princeton WordNet (WN.Pr) lexical database has motivated efficient compilations of bulky relational lexicons since its inception in the 1980s. The EuroWordNet project, the first multilingual initiative built upon WN.Pr, opened up ways of building individual wordnets and interrelating them by means of the so-called Inter-Lingual-Index, an unstructured list of the WN.Pr synsets. Another important initiative, relying on a slightly different method of building multilingual wordnets, is the MultiWordNet project, whose key strategy is building language-specific wordnets while keeping as much as possible of the semantic relations available in WN.Pr. This paper, in particular, stresses that an additional advantage of using the WN.Pr lexical database as a resource for building wordnets for other languages is the possibility of implementing an automatic procedure to map WN.Pr conceptual relations such as hyponymy, co-hyponymy, troponymy, meronymy, cause, and entailment onto the lexical database of the wordnet under construction; this is viable because those are language-independent relations that hold between lexicalized concepts, not between lexical units. Accordingly, combining methods from both initiatives, this paper presents the ongoing implementation of the WN.Br lexical database and the aforementioned automatic procedure, illustrated with a sample of the automatic encoding of the hyponymy and co-hyponymy relations.
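The mapping idea described above, i.e. projecting language-independent conceptual relations through a synset alignment, can be sketched as follows; the synset identifiers and alignment table are invented for illustration and are not the project's actual data:

```python
# Illustrative sketch: because hyponymy holds between concepts rather than
# lexical units, a relation encoded between two WN.Pr synsets can be copied
# automatically to the corresponding synsets of a new wordnet, given a
# synset-to-synset alignment.

# WN.Pr hyponymy edges: (hyponym synset id, hypernym synset id).
WNPR_HYPONYMY = [("wnpr:dog", "wnpr:canine"), ("wnpr:cat", "wnpr:feline")]

# Alignment of WN.Pr synsets to target-language synsets (e.g. WN.Br).
ALIGNMENT = {"wnpr:dog": "wnbr:cão", "wnpr:canine": "wnbr:canídeo"}

def map_relations(edges, alignment):
    """Project conceptual-semantic edges onto an aligned wordnet,
    keeping only edges whose two endpoints are both aligned."""
    projected = []
    for hypo, hyper in edges:
        if hypo in alignment and hyper in alignment:
            projected.append((alignment[hypo], alignment[hyper]))
    return projected
```

Edges with an unaligned endpoint (here, the cat/feline pair) are simply skipped, leaving them for manual encoding.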
Abstract:
This work describes the languages that form the basis of the construction of the Semantic Web: XML, RDF and OWL. In particular, it presents a study of the WordNet lexical database. Finally, it presents the design and implementation of an ontology for representing the lexical relations of Catalan words. From this ontology, a small database is built around the topic of pets. This practical case allows conclusions to be drawn about the advantages of introducing metadata into electronic documents, and about the support that current applications offer for developing this type of document.
Abstract:
The object of this study is the productivity of morphosyntactic compound-word structures in written standard Finnish. The study's main aim is to determine how actively the various possibilities afforded by the Finnish language are exploited in the formation of new compound words. Research charting practical productivity complements the picture of the language given by grammars and lexical descriptions. The language variety under study is the written standard language shared by all language users. The research data consist of 28,091 new compound words collected from the language of the printed media between 2000 and 2009. The data are drawn from the Modern Finnish lexical database (Nykysuomen sanastotietokanta) of the Institute for the Languages of Finland (Kotimaisten kielten keskus), into which new words, and words used in new ways, are collected primarily for dictionary work and language planning. The research topic is approached through several subquestions concerning the form, word class, number, and length of compound constituents. The study proceeds from the treatment of individual variables to models examining the relationships between variables. The study uses a two-stage method: the first step is a statistical analysis of the type frequency of the structures observed in the new vocabulary; the second step is a qualitative examination of structures that are low in frequency or that otherwise prove exceptional in the statistical analysis. The method was developed specifically for this study, since the methods previously used to measure productivity are not as such suited to research on Finnish compound structures. Developing research methods is a central goal of the study. The study shows that new Finnish compound words are structurally more homogeneous than earlier descriptions of the language would lead one to expect. A new Finnish compound word is most likely a noun formed by combining two nouns, with the first constituent in the nominative and not agreeing with the second constituent. Compounds beginning with a prefix-like first constituent are also considerably more common than expected, whereas compounds with a genitive first constituent are rarer than expected. Not all grammatically possible compound structures are productive at the level of actual usage. The study is basic research charting the structure of the language. The most important fields of application of its results are language technology and the teaching of word formation. The study opens several avenues for further research.
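The first, quantitative step of the two-stage method described above can be sketched as a type-frequency count over structure labels; the structure labels and sample below are invented for illustration, not the study's actual categories:

```python
# Sketch of the quantitative step: count the type frequency of each
# compound structure observed in a sample of new compounds, so that
# low-frequency structures can be singled out for qualitative study.

from collections import Counter

# Each new compound is labelled with its structure, e.g.
# "N[nom]+N" = noun + noun, first constituent in the nominative.
observed = [
    "N[nom]+N", "N[nom]+N", "N[gen]+N", "N[nom]+N", "A+N",
]

def type_frequencies(structures):
    """Type frequency of each structure, most productive first."""
    return Counter(structures).most_common()
```

On this toy sample the nominative-initial noun+noun structure dominates, mirroring the study's finding that it is the most likely shape of a new compound.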
Abstract:
This research concerns the interface between lexical semantics and syntax, and is part of the DiCo lexical database project (an acronym for Dictionnaire de combinatoire) at the Observatoire de Linguistique Sens-Texte (OLST) of the Université de Montréal. The project stems from a desire to record, concisely and completely, directly in the dictionary, the syntactic behaviour typical of each lexical unit. With this in mind, we encode the cooccurrence of DiCo nominal lexical units with their actants in a government pattern table (also known as a valency schema, argument structure, subcategorization frame, predicate-argument structure, etc.), noting among other things the surface syntactic dependencies involved. In this thesis, we present the syntactic properties of a French nominal dependency, which we have named attributive adnominal, so as to set out a methodology for identifying and characterizing surface syntactic dependencies. We also provide the list of governed nominal dependencies identified in the course of this work. We then describe the creation of a database of generalized government patterns of French, named CARNAVAL. Finally, we discuss possible applications of our work, particularly with respect to the creation of a typology of French government patterns.
Abstract:
In Fillmore's frame semantics, words take their meaning from the eventive or situational context in which they occur. FrameNet, a lexical resource for English, defines around 1,000 conceptual frames covering the bulk of possible contexts. Within a conceptual frame, a predicate calls arguments to fill the various semantic roles associated with the frame (for example: Victim, Manner, Recipient, Speaker). We aim to annotate these semantic roles automatically, given the semantic frame and the predicate. To do so, we train a machine learning algorithm on arguments whose role is known, in order to generalize to arguments whose role is unknown. In particular, we use lexical properties of semantic proximity among the most representative words of the arguments, notably via vector representations of the words of the lexicon.
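The idea of generalizing from known to unknown arguments via vector representations can be sketched with a nearest-centroid classifier; the word vectors, roles, and function names below are toy assumptions, not FrameNet data or the thesis's actual model:

```python
# Minimal sketch: each role is summarized by the centroid of the vectors
# of its known arguments' head words, and a new argument is assigned the
# role whose centroid is nearest by cosine similarity.

import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def assign_role(arg_vec, role_examples):
    """role_examples: {role: [vectors of known arguments]}.
    Returns the role with the most similar centroid."""
    cents = {role: centroid(vs) for role, vs in role_examples.items()}
    return max(cents, key=lambda r: cosine(arg_vec, cents[r]))
```

A real system would of course combine such distributional features with syntactic ones in a trained classifier; this only isolates the semantic-proximity component.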
Abstract:
This paper discusses particular linguistic challenges in the task of reusing published dictionaries, conceived as structured sources of lexical information, in the compilation process of a machine-tractable, thesaurus-like lexical database for Brazilian Portuguese. After delimiting the scope of the polysemous term thesaurus, the paper focuses on the improvement of the resulting object by a small team, in a form compatible with and inspired by WordNet guidelines; comments on the dictionary entries; addresses selected problems found in the process of extracting the relevant lexical information from the selected dictionaries; and provides some strategies to overcome them.
Abstract:
This paper presents the overall methodology that has been used to encode both the Brazilian Portuguese WordNet (WordNet.Br) standard language-independent conceptual-semantic relations (hyponymy, co-hyponymy, meronymy, cause, and entailment) and the so-called cross-lingual conceptual-semantic relations between different wordnets. Accordingly, after contextualizing the project and outlining the current lexical database structure and statistics, it describes the WordNet.Br editing GUI that was designed to aid the linguist in carrying out the tasks of building synsets, selecting sample sentences from corpora, writing synset concept glosses, and encoding both language-independent conceptual-semantic relations and cross-lingual conceptual-semantic relations between WordNet.Br and Princeton WordNet. © Springer-Verlag Berlin Heidelberg 2006.
Abstract:
The need to represent both semantics and common sense, and to organize them in a lexical database or knowledge base, has motivated the development of large projects such as wordnets, CYC, and Mikrokosmos. Besides the generic bases, another approach is the construction of ontologies for specific domains. Among the advantages of such an approach is the possibility of a greater and more detailed coverage of a specific domain and its terminology. Domain ontologies are important resources in several tasks related to language processing, especially those related to information retrieval and extraction from textual bases. Information retrieval or even question-answering systems can benefit from the domain knowledge represented in an ontology. Besides embracing the terminology of the field, the ontology makes the relationships among the terms explicit. Copyright 2007 ACM.
Abstract:
In the architecture of a natural language processing system based on linguistic knowledge, two types of component are important: the knowledge databases and the processing modules. One of the knowledge databases is the lexical database, which is responsible for providing the lexical units and their properties to the processing modules. Systems that process two or more languages require bilingual and/or multilingual lexical databases. These databases can be constructed by aligning distinct monolingual databases. In this paper, we present the interlingua and the strategy for aligning the two monolingual databases in REBECA, which stores concepts only from the “wheeled vehicle” domain.
Abstract:
In this paper we present a complete Natural Language Processing (NLP) system for Spanish. The core of this system is the parser, which uses the grammatical formalism of Lexical-Functional Grammar (LFG). Another important component of this system is the anaphora resolution module. To resolve anaphora, this module applies a method based on linguistic information (lexical, morphological, syntactic, and semantic), structural information (the anaphoric accessibility space in which the anaphor obtains its antecedent), and statistical information. This method is based on constraints and preferences, and resolves pronouns and definite descriptions. Moreover, the system handles both dialogue and non-dialogue discourse features. The anaphora resolution module uses several resources, such as a lexical database (Spanish WordNet) to provide semantic information, and a POS tagger providing the part of speech and root of each word, to make the resolution process easier.
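The constraints-and-preferences scheme described above can be sketched as a filter-then-rank pipeline; the candidate attributes and the recency preference used here are illustrative assumptions, not the system's actual feature set:

```python
# Sketch of constraints-and-preferences anaphora resolution:
# hard constraints (e.g. gender and number agreement) filter candidate
# antecedents, then a preference (here, recency) ranks the survivors.

def resolve(pronoun, candidates):
    """pronoun: {'gender', 'number'};
    candidates: list of {'word', 'gender', 'number', 'distance'}.
    Returns the preferred antecedent word, or None."""
    # Constraints: agreement must hold.
    survivors = [c for c in candidates
                 if c["gender"] == pronoun["gender"]
                 and c["number"] == pronoun["number"]]
    if not survivors:
        return None
    # Preference: the closest agreeing candidate wins.
    return min(survivors, key=lambda c: c["distance"])["word"]
```

A fuller system would weight several preferences (syntactic parallelism, salience, statistics) instead of a single recency criterion.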
Abstract:
This paper presents a Java-based hyperbolic-style browser designed to render RDF files as structured ontological maps. The program was motivated by the need to browse the content of a web-accessible ontology server, WEB KB-2. The ontology server contains descriptions of over 74,500 object types derived from the WordNet 1.7 lexical database and can be accessed using RDF syntax. Such a structure creates complications for hyperbolic-style displays. In WEB KB-2 there are 140 stable ontology link types, and a hyperbolic display needs to filter and iconify the view so that different link relations can be distinguished in multi-link views. Our browsing tool, OntoRama, is therefore motivated by two possibly interfering aims: first, to display up to 10 times as many nodes in a hyperbolic-style view as in a conventional graphics display; secondly, to render the ontology with its multiple links comprehensible in that view.
Abstract:
In this article, we present the first open-access lexical database that provides phonological representations for 120,000 Italian word forms. Each entry also includes syllable boundaries, stress markings, and a comprehensive range of lexical statistics. Using data derived from this lexicon, we have also generated a set of derived databases and provided estimates of positional frequency of use for Italian phonemes, syllables, syllable onsets and codas, and character and phoneme bigrams. These databases are freely available from phonitalia.org. This article describes the methods, content, and summary statistics for these databases. In a first application of this database, we also demonstrate how the distribution of phonological substitution errors made by Italian aphasic patients is related to phoneme frequency. © 2013 Psychonomic Society, Inc.
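One of the derived statistics mentioned above, positional phoneme frequency, can be sketched as a simple count over phonemic transcriptions; the toy transcriptions and function name below are illustrative, not the database's actual data or method:

```python
# Sketch: positional frequency of phonemes, i.e. how often each phoneme
# occurs at each within-word position across a list of transcriptions.

from collections import Counter

def positional_frequency(transcriptions):
    """Count (position, phoneme) pairs over a list of phoneme lists."""
    counts = Counter()
    for phonemes in transcriptions:
        for pos, ph in enumerate(phonemes):
            counts[(pos, ph)] += 1
    return counts
```

The same counting scheme extends directly to syllables, onsets, codas, and bigrams by changing the unit being enumerated.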