2 resultados para 080107 Natural Language Processing

em Central European University - Research Support Scheme


Relevância:

100.00% 100.00%

Publicador:

Resumo:

Mr. Kubon's project was inspired by the growing need for an automatic, syntactic analyser (parser) of Czech, which could be used in the syntactic processing of large amounts of texts. Mr. Kubon notes that such a tool would be very useful, especially in the field of corpus linguistics, where creating a large-scale "tree bank" (a collection of syntactic representations of natural language sentences) is a very important step towards the investigation of the properties of a given language. The work involved in syntactically parsing a whole corpus in order to get a representative set of syntactic structures would be almost inconceivable without the help of some kind of robust (semi)automatic parser. The need for the automatic natural language parser to be robust increases with the size of the linguistic data in the corpus or in any other kind of text which is going to be parsed. Practical experience shows that apart from syntactically correct sentences, there are many sentences which contain a "real" grammatical error. These sentences may be corrected in small-scale texts, but not generally in the whole corpus. In order to be able to complete the overall project, it was necessary to address a number of smaller problems. These were; 1. the adaptation of a suitable formalism able to describe the formal grammar of the system; 2. the definition of the structure of the system's dictionary containing all relevant lexico-syntactic information, and the development of a formal grammar able to robustly parse Czech sentences from the test suite; 3. filling the syntactic dictionary with sample data allowing the system to be tested and debugged during its development (about 1000 words); 4. the development of a set of sample sentences containing a reasonable amount of grammatical and ungrammatical phenomena covering some of the most typical syntactic constructions being used in Czech. Number 3, building a formal grammar, was the main task of the project. The grammar is of course far from complete (Mr. Kubon notes that it is debatable whether any formal grammar describing a natural language may ever be complete), but it covers the most frequent syntactic phenomena, allowing for the representation of a syntactic structure of simple clauses and also the structure of certain types of complex sentences. The stress was not so much on building a wide coverage grammar, but on the description and demonstration of a method. This method uses a similar approach as that of grammar-based grammar checking. The problem of reconstructing the "correct" form of the syntactic representation of a sentence is closely related to the problem of localisation and identification of syntactic errors. Without a precise knowledge of the nature and location of syntactic errors it is not possible to build a reliable estimation of a "correct" syntactic tree. The incremental way of building the grammar used in this project is also an important methodological issue. Experience from previous projects showed that building a grammar by creating a huge block of metarules is more complicated than the incremental method, which begins with the metarules covering most common syntactic phenomena first, and adds less important ones later, especially from the point of view of testing and debugging the grammar. The sample of the syntactic dictionary containing lexico-syntactical information (task 4) now has slightly more than 1000 lexical items representing all classes of words. During the creation of the dictionary it turned out that the task of assigning complete and correct lexico-syntactic information to verbs is a very complicated and time-consuming process which would itself be worth a separate project. The final task undertaken in this project was the development of a method allowing effective testing and debugging of the grammar during the process of its development. The problem of the consistency of new and modified rules of the formal grammar with the rules already existing is one of the crucial problems of every project aiming at the development of a large-scale formal grammar of a natural language. This method allows for the detection of any discrepancy or inconsistency of the grammar with respect to a test-bed of sentences containing all syntactic phenomena covered by the grammar. This is not only the first robust parser of Czech, but also one of the first robust parsers of a Slavic language. Since Slavic languages display a wide range of common features, it is reasonable to claim that this system may serve as a pattern for similar systems in other languages. To transfer the system into any other language it is only necessary to revise the grammar and to change the data contained in the dictionary (but not necessarily the structure of primary lexico-syntactic information). The formalism and methods used in this project can be used in other Slavic languages without substantial changes.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Grigorij Kreidlin (Russia). A Comparative Study of Two Semantic Systems: Body Russian and Russian Phraseology. Mr. Kreidlin teaches in the Department of Theoretical and Applied Linguistics of the State University of Humanities in Moscow and worked on this project from August 1996 to July 1998. The classical approach to non-verbal and verbal oral communication is based on a traditional separation of body and mind. Linguists studied words and phrasemes, the products of mind activities, while gestures, facial expressions, postures and other forms of body language were left to anthropologists, psychologists, physiologists, and indeed to anyone but linguists. Only recently have linguists begun to turn their attention to gestures and semiotic and cognitive paradigms are now appearing that raise the question of designing an integral model for the unified description of non-verbal and verbal communicative behaviour. This project attempted to elaborate lexical and semantic fragments of such a model, producing a co-ordinated semantic description of the main Russian gestures (including gestures proper, postures and facial expressions) and their natural language analogues. The concept of emblematic gestures and gestural phrasemes and of their semantic links permitted an appropriate description of the transformation of a body as a purely physical substance into a body as a carrier of essential attributes of Russian culture - the semiotic process called the culturalisation of the human body. Here the human body embodies a system of cultural values and displays them in a text within the area of phraseology and some other important language domains. The goal of this research was to develop a theory that would account for the fundamental peculiarities of the process. The model proposed is based on the unified lexicographic representation of verbal and non-verbal units in the Dictionary of Russian Gestures, which the Mr. Kreidlin had earlier complied in collaboration with a group of his students. The Dictionary was originally oriented only towards reflecting how the lexical competence of Russian body language is represented in the Russian mind. Now a special type of phraseological zone has been designed to reflect explicitly semantic relationships between the gestures in the entries and phrasemes and to provide the necessary information for a detailed description of these. All the definitions, rules of usage and the established correlations are written in a semantic meta-language. Several classes of Russian gestural phrasemes were identified, including those phrasemes and idioms with semantic definitions close to those of the corresponding gestures, those phraseological units that have lost touch with the related gestures (although etymologically they are derived from gestures that have gone out of use), and phrasemes and idioms which have semantic traces or reflexes inherited from the meaning of the related gestures. The basic assumptions and practical considerations underlying the work were as follows. (1) To compare meanings one has to be able to state them. To state the meaning of a gesture or a phraseological expression, one needs a formal semantic meta-language of propositional character that represents the cognitive and mental aspects of the codes. (2) The semantic contrastive analysis of any semiotic codes used in person-to-person communication also requires a single semantic meta-language, i.e. a formal semantic language of description,. This language must be as linguistically and culturally independent as possible and yet must be open to interpretation through any culture and code. Another possible method of conducting comparative verbal-non-verbal semantic research is to work with different semantic meta-languages and semantic nets and to learn how to combine them, translate from one to another, etc. in order to reach a common basis for the subsequent comparison of units. (3) The practical work in defining phraseological units and organising the phraseological zone in the Dictionary of Russian Gestures unexpectedly showed that semantic links between gestures and gestural phrasemes are reflected not only in common semantic elements and syntactic structure of semantic propositions, but also in general and partial cognitive operations that are made over semantic definitions. (4) In comparative semantic analysis one should take into account different values and roles of inner form and image components in the semantic representation of non-verbal and verbal units. (5) For the most part, gestural phrasemes are direct semantic derivatives of gestures. The cognitive and formal techniques can be regarded as typological features for the future functional-semantic classification of gestural phrasemes: two phrasemes whose meaning can be obtained by the same cognitive or purely syntactic operations (or types of operations) over the meanings of the corresponding gestures, belong by definition to one and the same class. The nature of many cognitive operations has not been studied well so far, but the first steps towards its comprehension and description have been taken. The research identified 25 logically possible classes of relationships between a gesture and a gestural phraseme. The calculation is based on theoretically possible formal (set-theory) correlations between signifiers and signified of the non-verbal and verbal units. However, in order to examine which of them are realised in practice a complete semantic and lexicographic description of all (not only central) everyday emblems and gestural phrasemes is required and this unfortunately does not yet exist. Mr. Kreidlin suggests that the results of the comparative analysis of verbal and non-verbal units could also be used in other research areas such as the lexicography of emotions.