Digital Library

969 results for Natural language interface

A Fuzzy Ontology-Driven Approach to Semantic Interoperability in e-Government Big Data

Relevance:

80.00%

Publisher:

Abstract:

With the increasing production of information from e-government initiatives, there is also a need to transform a large volume of unstructured data into useful information for society. All this information should be easily accessible and made available in a meaningful and effective way in order to achieve semantic interoperability in electronic government services, a challenge being pursued by governments around the world. Our aim is to discuss the context of e-Government Big Data and to present a framework that promotes semantic interoperability through the automatic generation of ontologies from unstructured information found on the Internet. We propose the use of fuzzy mechanisms to deal with natural language terms and present related work in this area. The results of this study consist of the architectural definition, the major components and the requirements of the proposed framework. With this, it is possible to take advantage of the large volume of information generated by e-Government initiatives and use it to benefit society.

Probing the statistical properties of unknown texts: application to the Voynich manuscript

Relevance:

80.00%

Publisher:

Abstract:

While the use of statistical physics methods to analyze large corpora has been useful to unveil many patterns in texts, no comprehensive investigation has been performed on the interdependence between syntactic and semantic factors. In this study we propose a framework for determining whether a text (e.g., written in an unknown alphabet) is compatible with a natural language and to which language it could belong. The approach is based on three types of statistical measurements, i.e. obtained from first-order statistics of word properties in a text, from the topology of complex networks representing texts, and from intermittency concepts where text is treated as a time series. Comparative experiments were performed with the New Testament in 15 different languages and with distinct books in English and Portuguese in order to quantify the dependency of the different measurements on the language and on the story being told in the book. The metrics found to be informative in distinguishing real texts from their shuffled versions include assortativity, degree and selectivity of words. As an illustration, we analyze an undeciphered medieval manuscript known as the Voynich Manuscript. We show that it is mostly compatible with natural languages and incompatible with random texts. We also obtain candidates for keywords of the Voynich Manuscript which could be helpful in the effort of deciphering it. Because we were able to identify statistical measurements that are more dependent on the syntax than on the semantics, the framework may also serve for text analysis in language-dependent applications.
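
One of the three measurement families mentioned in this abstract treats a text as a time series. As a minimal sketch, and assuming that intermittency is measured as the coefficient of variation of the gaps between successive occurrences of a word (the abstract does not give the formula, so this definition and all function names are illustrative assumptions), a word's value can be compared with its average over shuffled versions of the same text:

```python
import random
import statistics


def recurrence_intervals(tokens, word):
    """Distances (in tokens) between successive occurrences of `word`."""
    positions = [i for i, tok in enumerate(tokens) if tok == word]
    return [b - a for a, b in zip(positions, positions[1:])]


def intermittency(tokens, word):
    """Coefficient of variation (std / mean) of the recurrence intervals.

    Assumed definition, for illustration only. Returns None when the word
    occurs fewer than three times (fewer than two intervals).
    """
    gaps = recurrence_intervals(tokens, word)
    if len(gaps) < 2:
        return None
    return statistics.pstdev(gaps) / statistics.fmean(gaps)


def shuffled_baseline(tokens, word, trials=200, seed=0):
    """Average intermittency of `word` over shuffled copies of the text."""
    rng = random.Random(seed)
    work = list(tokens)
    values = []
    for _ in range(trials):
        rng.shuffle(work)
        value = intermittency(work, word)
        if value is not None:
            values.append(value)
    return statistics.fmean(values) if values else None


if __name__ == "__main__":
    # Toy text: "key" appears in a burst at the start and once at the end.
    text = ["key", "key", "key"] + ["w"] * 47 + ["key"]
    print(intermittency(text, "key"), shuffled_baseline(text, "key"))
```

A word that clusters in bursts scores well above its shuffled baseline, while a word spread evenly through the text scores close to zero.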

Security control brought back to the user

Relevance:

80.00%

Publisher:

Abstract:

The activity of the Ph.D. student Juri Luca De Coi concerned the research field of policy languages and can be divided into three parts. The first part of the Ph.D. work investigated the state of the art in policy languages and resulted in: (i) identifying the requirements that up-to-date policy languages have to fulfill; (ii) defining a policy language able to fulfill such requirements (namely, the Protune policy language); and (iii) implementing an infrastructure able to enforce policies expressed in the Protune policy language. The second part of the Ph.D. work focused on simplifying the activity of defining policies and resulted in: (i) identifying a subset of the controlled natural language ACE to express Protune policies; (ii) implementing a mapping between ACE policies and Protune policies; and (iii) adapting the ACE Editor to guide users step by step when defining ACE policies. The third part of the Ph.D. work tested the feasibility of the chosen approach by applying it to meaningful real-world problems, including: (i) the development of a security layer on top of RDF stores; and (ii) efficient policy-aware access to metadata stores. The research activity was performed in close collaboration with the Leibniz Universität Hannover and further European partners within the projects REWERSE, TENCompetence and OKKAM.

Frame-driven Extraction of Linked Data and Ontologies from Text

Relevance:

80.00%

Publisher:

Abstract:

Ontology design and population, core aspects of semantic technologies, have recently become fields of great interest due to the increasing need for domain-specific knowledge bases that can boost the use of the Semantic Web. For building such knowledge resources, state-of-the-art tools for ontology design require a lot of human work. Producing meaningful schemas and populating them with domain-specific data is in fact a very difficult and time-consuming task, even more so if the task consists of modelling knowledge at web scale. The primary aim of this work is to investigate a novel and flexible methodology for automatically learning ontologies from textual data, lightening the human workload required for conceptualizing domain-specific knowledge and populating an extracted schema with real data, and thus speeding up the whole ontology production process. Here computational linguistics plays a fundamental role, from automatically identifying facts in natural language and extracting frames of relations among recognized entities, to producing linked data with which to extend existing knowledge bases or create new ones. In the state of the art, automatic ontology learning systems are mainly based on plain pipelines of linguistic classifiers performing tasks such as Named Entity recognition, Entity resolution, and Taxonomy and Relation extraction [11]. These approaches present some weaknesses, especially in capturing the structures through which the meaning of complex concepts is expressed [24]. Humans, in fact, tend to organize knowledge in well-defined patterns, which include participant entities and meaningful relations linking entities with each other. In the literature, these structures have been called Semantic Frames by Fillmore [20], or more recently Knowledge Patterns [23]. Some NLP studies have recently shown the possibility of performing more accurate deep parsing with the ability to logically understand the structure of discourse [7].
In this work, some of these technologies have been investigated and employed to produce accurate ontology schemas. The long-term goal is to collect large amounts of semantically structured information from the web of crowds, through an automated process, in order to identify and investigate the cognitive patterns humans use to organize their knowledge.

Automatic induction of lexical features

Relevance:

80.00%

Publisher:

Abstract:

This thesis concerns artificially intelligent natural language processing systems that are capable of learning the properties of lexical items (properties like verbal valency or inflectional class membership) autonomously while they are fulfilling the tasks for which they were deployed in the first place. Many of these tasks require a deep analysis of language input, which can be characterized as a mapping of utterances in a given input C to a set S of linguistically motivated structures with the help of linguistic information encoded in a grammar G and a lexicon L:

G + L + C → S (1)

The idea that underlies intelligent lexical acquisition systems is to modify this schematic formula in such a way that the system is able to exploit the information encoded in S to create a new, improved version of the lexicon:

G + L + S → L' (2)

Moreover, the thesis claims that a system can only be considered intelligent if it not only makes maximum use of the learning opportunities in C but is also able to revise falsely acquired lexical knowledge. So, one of the central elements in this work is the formulation of a couple of criteria for intelligent lexical acquisition systems subsumed under one paradigm: the Learn-Alpha design rule. The thesis describes the design and quality of a prototype for such a system, whose acquisition components have been developed from scratch and built on top of one of the state-of-the-art Head-driven Phrase Structure Grammar (HPSG) processing systems. The quality of this prototype is investigated in a series of experiments, in which the system is fed with extracts of a large English corpus. While the idea of using machine-readable language input to automatically acquire lexical knowledge is not new, we are not aware of a system that fulfills Learn-Alpha and is able to deal with large corpora.
To name four major challenges of constructing such a system: a) the high number of possible structural descriptions caused by highly underspecified lexical entries calls for a parser with a very effective ambiguity management system; b) the automatic construction of concise lexical entries out of a bulk of observed lexical facts requires a special technique of data alignment; c) the reliability of these entries depends on the system's decision on whether it has seen 'enough' input; and d) general properties of language might render some lexical features indeterminable if the system tries to acquire them with too high a precision. The cornerstone of this dissertation is the motivation and development of a general theory of automatic lexical acquisition that is applicable to every language and independent of any particular theory of grammar or lexicon. This work is divided into five chapters. The introductory chapter first contrasts three different and mutually incompatible approaches to (artificial) lexical acquisition: cue-based queries, head-lexicalized probabilistic context-free grammars and learning by unification. Then the postulation of the Learn-Alpha design rule is presented. The second chapter outlines the theory that underlies Learn-Alpha and sets out all the related notions and concepts required for a proper understanding of artificial lexical acquisition. Chapter 3 develops the prototyped acquisition method, called ANALYZE-LEARN-REDUCE, a framework which implements Learn-Alpha. The fourth chapter presents the design and results of a bootstrapping experiment conducted on this prototype: lexeme detection, learning of verbal valency, categorization into nominal count/mass classes, and selection of prepositions and sentential complements, among others. The thesis concludes with a review of the conclusions and motivation for further improvements as well as proposals for future research on the automatic induction of lexical features.
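
The schematic step from analysed structures to an improved lexicon, including the revision of falsely acquired knowledge, can be illustrated with a toy. The sketch below is not the thesis's ANALYZE-LEARN-REDUCE method: the grammar is left out entirely, the structures S are assumed to be already reduced to (verb, valency frame) pairs, and the function name `update_lexicon` and the threshold `min_share` are hypothetical choices made only for this example.

```python
from collections import Counter


def update_lexicon(lexicon, structures, min_share=0.2):
    """Toy stand-in for the step L + S -> L'.

    lexicon:    dict mapping a verb to a Counter of valency frames
    structures: iterable of (verb, frame) pairs observed in analysed input
    min_share:  hypothetical threshold; a frame is dropped (revised away)
                when it accounts for less than this share of the verb's
                total evidence
    """
    # Copy the counters so the original lexicon is left untouched.
    counts = {verb: Counter(frames) for verb, frames in lexicon.items()}
    for verb, frame in structures:
        counts.setdefault(verb, Counter())[frame] += 1

    revised = {}
    for verb, frames in counts.items():
        total = sum(frames.values())
        revised[verb] = Counter(
            {frame: n for frame, n in frames.items() if n / total >= min_share}
        )
    return revised


if __name__ == "__main__":
    # "sleep" was wrongly recorded as transitive once; new evidence outweighs it.
    old = {"sleep": Counter({"intransitive": 1, "transitive": 1})}
    seen = [("sleep", "intransitive")] * 8
    print(update_lexicon(old, seen))
```

The point of the toy is only that learning and revising use the same evidence: frames are added as they are observed and removed again once later input makes them implausible.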

Quantificazione del rischio occupazionale: Indicatori, indici e metodologia fuzzy

Relevance:

80.00%

Publisher:

Abstract:

The thesis addresses the concept of exposure to occupational risk. Its purpose is to investigate the work environment and the behaviour of workers, with the aim of lowering the incidence rate of occupational injuries and reducing risks. First, a new methodology called MIMOSA (Methodology for the Implementation and Monitoring of Occupational SAfety) is proposed, which quantifies the "health and safety" level of any company. Achieving this required a multidisciplinary approach in which engineering and psychology concepts were combined to develop a methodology for predicting accidents and improving safety at work. The results of testing MIMOSA prompted the use of fuzzy logic in the field of occupational safety, both to improve the methodology itself and to overcome the problems caused by uncertainty in data collection. The literature shows that human factors, risk perception and workers' behaviour in relation to perceived risk play a very important role in the occurrence of accidents. This consideration led to a new approach and to a second methodology, which aims to prevent accidents on grounds other than the analysis of their past dynamics alone. The methodology evaluates an index based on workers' proactive behaviours and on the potential damage of the accident events that were avoided. The innovation lies in applying fuzzy logic to account for the "vagueness" of human behaviour and of its natural language. In particular, the application focuses on workers' proactivity and aims to prevent the "injury" event by generating a kind of leading indicator. This procedure was tested at an Italian petrochemical company.

Vantaggi e limiti dell'integrazione tra sistemi di traduzione assistita e sistemi di traduzione automatica

Relevance:

80.00%

Publisher:

Abstract:

Computer-assisted translation (or computer-aided translation, CAT) is a form of language translation in which a human translator uses computer software to facilitate the translation process. Machine translation (MT) is the automated process by which a computerized system produces a translated text or speech from one natural language to another. Both are leading and promising technologies in the translation industry; it therefore seems important that translation students and professional translators become familiar with these relatively new types of technology. When used together, these two different types of systems might not only reduce translation time but also lead to further improvement in the field of translation technologies. The dissertation consists of four chapters. The first surveys the chronological development of MT and CAT tools, the emergence of pre-editing, post-editing and controlled language, and the latest frontiers in this sector. The second provides a general overview of the four main CAT tools that are in use today and were tested here. The third chapter is dedicated to the experiments conducted in order to analyze and evaluate the performance of the four integrated systems that are the core subject of this dissertation. Finally, the fourth chapter deals with the issue of terminological equivalence in interlinguistic translation. The purpose of this dissertation is not to provide an objective and definitive solution to the complex issues that constantly arise in the field of translation technologies, an aim that is still far from being achieved, but to supply information about the limits and potential of those instruments, which are now essential to any professional translator.

BabelNet e il problema della Word Sense Disambiguation

Relevance:

80.00%

Publisher:

Abstract:

Word Sense Disambiguation is a computational problem in the field of Natural Language Processing that consists of determining the sense of a word according to the context in which it is used. While such a process may seem trivial to a human being, it can prove extraordinarily complicated when one tries to encode it as a series of instructions executable by a machine. The first and main problem to be tackled is that of knowledge: to disambiguate the terms of a text, a computer must be able to draw on a lexicon that is as consistent as possible with that of a human being. Although other ways of proceeding exist, creating a machine-readable knowledge source is certainly the method that addresses the problem most directly. This thesis first explains what Word Sense Disambiguation consists of, through a brief but as detailed as possible description of the problem. Chapter 1 presents it starting from some historical background and then moves on to describe the fundamental components to be considered during the work. It illustrates concepts that are taken up again later, ranging from the normalization of the input text to a summary of the classification methods commonly used in this field. Chapter 2 is devoted to the description of BabelNet, a recently built multilingual lexical-semantic resource created at the Università La Sapienza in Rome. The two sources from which BabelNet draws its knowledge, WordNet and Wikipedia, are described first. The steps of its creation are then illustrated, from the mapping between the two base resources to the definition of all the relations that link the sets of terms within the lexicon.
Finally, a series of experiments is proposed to put BabelNet to the test, first to verify the consistency of its construction method, and then to compare its performance with that of other state-of-the-art systems by subjecting it to various tasks drawn from SemEval, the international events dedicated to the evaluation of WSD problems, which in effect define the standards of this field. The final chapter develops some considerations on disambiguation, introduced by a list of the main application areas of the problem. It outlines possible future developments of the research, as well as known problems and the paths recently taken to try to push the performance of Word Sense Disambiguation beyond the limits established so far.
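
To make the disambiguation task concrete, here is a minimal sketch of a simplified Lesk-style approach: each candidate sense has a gloss, and the sense whose gloss shares the most content words with the context wins. This is a classic knowledge-based baseline chosen for illustration, not the method evaluated in the thesis, and it does not call BabelNet; the two-sense inventory, the sense identifiers and the stopword list are all made up for the example.

```python
import re

# Hypothetical stopword list and sense inventory, for illustration only.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "or",
             "that", "is", "for", "by", "at"}

INVENTORY = {
    "bank": {
        "bank#finance": "an institution that accepts deposits of money and makes loans",
        "bank#river": "sloping land beside a river or other body of water",
    }
}


def content_words(text):
    """Lower-cased alphabetic tokens of `text`, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def simplified_lesk(word, context, inventory):
    """Pick the sense whose gloss overlaps most with the context words.

    Ties are resolved in favour of the sense listed first.
    """
    context_words = content_words(context) - {word}
    best_sense, best_overlap = None, -1
    for sense_id, gloss in inventory[word].items():
        overlap = len(context_words & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense_id, overlap
    return best_sense


if __name__ == "__main__":
    print(simplified_lesk("bank", "She sat on the bank of the river and watched the water", INVENTORY))
    print(simplified_lesk("bank", "He deposited money at the bank", INVENTORY))
```

The sketch also shows why the knowledge source matters: the quality of the result depends entirely on how well the glosses cover the vocabulary of the contexts in which the word appears.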

Deep learning for computer vision: a comparison between convolutional neural networks and hierarchical temporal memories on object recognition tasks

Relevance:

80.00%

Publisher:

Abstract:

In recent years, Deep Learning techniques have been shown to perform well on a large variety of problems in both Computer Vision and Natural Language Processing, reaching and often surpassing the state of the art on many tasks. The rise of deep learning is also revolutionizing the entire field of Machine Learning and Pattern Recognition, pushing forward the concepts of automatic feature extraction and unsupervised learning in general. However, despite its strong success in both science and business, deep learning has its own limitations. It is often questioned whether such techniques are merely brute-force statistical approaches and whether they can only work in the context of High Performance Computing with very large amounts of data. Another important question is whether they are really biologically inspired, as claimed in certain cases, and whether they can scale well in terms of "intelligence". The dissertation tries to answer these key questions in the context of Computer Vision and, in particular, Object Recognition, a task that has been heavily revolutionized by recent advances in the field. In practice, the answers are based on an exhaustive comparison between two very different deep learning techniques on this task: the Convolutional Neural Network (CNN) and Hierarchical Temporal Memory (HTM). They stand for two different approaches and points of view under the broad umbrella of deep learning, and comparing them is a good way to understand and point out the strengths and weaknesses of each. The CNN is considered one of the most classic and powerful supervised methods used today in machine learning and pattern recognition, especially in object recognition. CNNs are well received and accepted by the scientific community and are already deployed in large corporations such as Google and Facebook for solving face recognition and image auto-tagging problems.
HTM, on the other hand, is known as a new, emerging paradigm and a mainly unsupervised method that is more biologically inspired. It draws on insights from the computational neuroscience community in order to incorporate into the learning process concepts such as time, context and attention, which are typical of the human brain. In the end, the thesis aims to show that in certain cases, with a smaller quantity of data, HTM can outperform CNN.

Named Entity Extraction: la piattaforma Gate/Annie

Relevance:

80.00%

Publisher:

Abstract:

This work introduces the basic concepts of Natural Language Processing, focusing on Information Extraction and analysing its application areas, its main activities and how it differs from Information Retrieval. It then analyses the process of Named Entity Recognition, concentrating on the main problems of text annotation and on methods for evaluating the quality of entity extraction. Finally, it provides an overview of the open-source language processing software platform GATE/ANNIE, describing its architecture and main components, with a closer look at the tools GATE offers for the rule-based approach to Named Entity Recognition.
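
Entity extraction quality is commonly reported as precision, recall and F1 against a gold-standard annotation. The sketch below assumes strict matching, where a predicted entity counts only if its start offset, end offset and type all agree with a gold entity; it is a generic illustration and does not use GATE's own evaluation tools, and the function name and tuple layout are assumptions made for the example.

```python
def evaluate_entities(gold, predicted):
    """Strict-match precision, recall and F1 for entity annotations.

    Each annotation is a (start, end, entity_type) tuple; a prediction is
    correct only if the identical tuple appears in the gold set.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    gold = [(0, 5, "Person"), (10, 16, "Location"), (20, 24, "Date"), (30, 38, "Organization")]
    predicted = [(0, 5, "Person"), (10, 16, "Location")]
    print(evaluate_entities(gold, predicted))
```

Strict matching is the harshest option; a lenient variant would also credit predictions whose span only partially overlaps a gold entity of the same type.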

Un’applicazione di elaborazione del linguaggio naturale per il gioco di "Indovina chi?" su robot umanoide

Relevance:

80.00%

Publisher:

Abstract:

The thesis centres on the game "Indovina chi?" ("Guess Who?"), in which the robot Nao identifies a character from its description. In particular, the description takes place through questions and answers. The aim of the thesis is to design a system able to understand and process data communicated in a subset of natural language, extract the key information and match it against previously given information. The Nao robot was therefore programmed to play a game of "Indovina chi?" against a human, communicating in natural language. Extraction and categorization rules for text understanding were implemented using Cogito, a technology patented by the company Expert System. In this way the robot can understand the answers and reply to the questions that the human formulates in natural language. Speech recognition uses the Google API, with PyAudio for microphone access. The program was implemented in Python, and the character data are stored in a database that the robot queries and modifies. The game algorithm is based on probabilistic calculations of the robot's chance of winning and on choosing the questions to ask according to the answers previously received from the human. The semantic rules that were developed allow the player to formulate sentences in natural language; in addition, the robot can pick out the information that concerns the character to be guessed without being misled. The robot's win rate over 20 games was 50%. The database was designed so that a complete identikit of a person can be built, beyond those of the game's characters. The project can therefore be extended to other purposes besides the game, in the field of identification.
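
The abstract does not spell out how questions are chosen, so the following is only a sketch of one common heuristic for this kind of game: ask the yes/no question that splits the remaining candidates as evenly as possible, then filter the candidates by the answer. The character data, attribute names and function names are hypothetical, and the thesis's own probabilistic calculations may differ.

```python
def best_question(candidates):
    """Return the (attribute, value) question with the most even yes/no split.

    candidates: dict mapping a character name to a dict of attributes.
    Questions that every candidate or no candidate satisfies are skipped,
    since their answer carries no information. Returns None if no
    informative question exists.
    """
    n = len(candidates)
    pairs = sorted({(a, v) for attrs in candidates.values() for a, v in attrs.items()})
    best, best_gap = None, None
    for attribute, value in pairs:
        yes = sum(1 for attrs in candidates.values() if attrs.get(attribute) == value)
        gap = abs(2 * yes - n)  # 0 means a perfect half/half split
        if 0 < yes < n and (best_gap is None or gap < best_gap):
            best, best_gap = (attribute, value), gap
    return best


def apply_answer(candidates, question, answer_is_yes):
    """Keep only the candidates consistent with the answer to `question`."""
    attribute, value = question
    return {name: attrs for name, attrs in candidates.items()
            if (attrs.get(attribute) == value) == answer_is_yes}


if __name__ == "__main__":
    characters = {
        "Anna": {"hair": "blond", "glasses": "yes"},
        "Bruno": {"hair": "brown", "glasses": "yes"},
        "Carla": {"hair": "brown", "glasses": "no"},
        "Dario": {"hair": "black", "glasses": "no"},
    }
    question = best_question(characters)
    print(question, apply_answer(characters, question, False))
```

An even split halves the candidate set whatever the answer, which keeps the expected number of questions low.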

Tags in the Catalogue: Insights From a Usability Study of LibraryThing for Libraries

Relevance:

80.00%

Publisher:

Abstract:

Library of Congress Subject Headings (LCSH), the standard subject language used in library catalogues, are often criticized for their lack of currency, biased language, and atypical syndetic structure. Conversely, folksonomies (or tags), which rely on the natural language of their users, offer a flexibility often lacking in controlled vocabularies and may offer a means of augmenting more rigid controlled vocabularies such as LCSH. Content analysis studies have demonstrated the potential for folksonomies to be used as a means of enhancing subject access to materials, and libraries are beginning to integrate tagging systems into their catalogues. This study examines the utility of tags as a means of enhancing subject access to materials in library online public access catalogues (OPACs) through usability testing with the LibraryThing for Libraries catalogue enhancements. Findings indicate that while they cannot replace LCSH, tags do show promise for aiding information seeking in OPACs. In the context of information systems design, the study revealed that while folksonomies have the potential to enhance subject access to materials, that potential is severely limited by the current inability of catalogue interfaces to support tag-based searches alongside standard catalogue searches.

Recent Development in ParaSol: Breadth for Depth and XSLT based web concordancing with CWB

Relevance:

80.00%

Publisher:

A Comparative Study of Two Semantic Systems: Body Russian and Russian Phraseology

Relevance:

80.00%

Publisher:

Abstract:

Grigorij Kreidlin (Russia). A Comparative Study of Two Semantic Systems: Body Russian and Russian Phraseology. Mr. Kreidlin teaches in the Department of Theoretical and Applied Linguistics of the State University of Humanities in Moscow and worked on this project from August 1996 to July 1998. The classical approach to non-verbal and verbal oral communication is based on a traditional separation of body and mind. Linguists studied words and phrasemes, the products of mind activities, while gestures, facial expressions, postures and other forms of body language were left to anthropologists, psychologists, physiologists, and indeed to anyone but linguists. Only recently have linguists begun to turn their attention to gestures, and semiotic and cognitive paradigms are now appearing that raise the question of designing an integral model for the unified description of non-verbal and verbal communicative behaviour. This project attempted to elaborate lexical and semantic fragments of such a model, producing a co-ordinated semantic description of the main Russian gestures (including gestures proper, postures and facial expressions) and their natural language analogues. The concept of emblematic gestures and gestural phrasemes and of their semantic links permitted an appropriate description of the transformation of a body as a purely physical substance into a body as a carrier of essential attributes of Russian culture, the semiotic process called the culturalisation of the human body. Here the human body embodies a system of cultural values and displays them in a text within the area of phraseology and some other important language domains. The goal of this research was to develop a theory that would account for the fundamental peculiarities of the process. The model proposed is based on the unified lexicographic representation of verbal and non-verbal units in the Dictionary of Russian Gestures, which Mr. Kreidlin had earlier compiled in collaboration with a group of his students. The Dictionary was originally oriented only towards reflecting how the lexical competence of Russian body language is represented in the Russian mind. Now a special type of phraseological zone has been designed to reflect explicitly the semantic relationships between the gestures in the entries and phrasemes and to provide the necessary information for a detailed description of these. All the definitions, rules of usage and the established correlations are written in a semantic meta-language. Several classes of Russian gestural phrasemes were identified, including those phrasemes and idioms with semantic definitions close to those of the corresponding gestures, those phraseological units that have lost touch with the related gestures (although etymologically they are derived from gestures that have gone out of use), and phrasemes and idioms which have semantic traces or reflexes inherited from the meaning of the related gestures. The basic assumptions and practical considerations underlying the work were as follows. (1) To compare meanings one has to be able to state them. To state the meaning of a gesture or a phraseological expression, one needs a formal semantic meta-language of propositional character that represents the cognitive and mental aspects of the codes. (2) The semantic contrastive analysis of any semiotic codes used in person-to-person communication also requires a single semantic meta-language, i.e. a formal semantic language of description. This language must be as linguistically and culturally independent as possible and yet must be open to interpretation through any culture and code. Another possible method of conducting comparative verbal-non-verbal semantic research is to work with different semantic meta-languages and semantic nets and to learn how to combine them, translate from one to another, etc., in order to reach a common basis for the subsequent comparison of units. (3) The practical work in defining phraseological units and organising the phraseological zone in the Dictionary of Russian Gestures unexpectedly showed that semantic links between gestures and gestural phrasemes are reflected not only in common semantic elements and the syntactic structure of semantic propositions, but also in general and partial cognitive operations that are made over semantic definitions. (4) In comparative semantic analysis one should take into account the different values and roles of inner form and image components in the semantic representation of non-verbal and verbal units. (5) For the most part, gestural phrasemes are direct semantic derivatives of gestures. The cognitive and formal techniques can be regarded as typological features for the future functional-semantic classification of gestural phrasemes: two phrasemes whose meaning can be obtained by the same cognitive or purely syntactic operations (or types of operations) over the meanings of the corresponding gestures belong by definition to one and the same class. The nature of many cognitive operations has not been studied well so far, but the first steps towards its comprehension and description have been taken. The research identified 25 logically possible classes of relationships between a gesture and a gestural phraseme. The calculation is based on theoretically possible formal (set-theory) correlations between signifiers and signified of the non-verbal and verbal units. However, in order to examine which of them are realised in practice, a complete semantic and lexicographic description of all (not only central) everyday emblems and gestural phrasemes is required, and this unfortunately does not yet exist. Mr. Kreidlin suggests that the results of the comparative analysis of verbal and non-verbal units could also be used in other research areas such as the lexicography of emotions.

A Robust Parser of Czech

Relevance:

80.00%

Publisher:

Abstract:

Mr. Kubon's project was inspired by the growing need for an automatic syntactic analyser (parser) of Czech, which could be used in the syntactic processing of large amounts of text. Mr. Kubon notes that such a tool would be very useful, especially in the field of corpus linguistics, where creating a large-scale "tree bank" (a collection of syntactic representations of natural language sentences) is a very important step towards the investigation of the properties of a given language. The work involved in syntactically parsing a whole corpus in order to get a representative set of syntactic structures would be almost inconceivable without the help of some kind of robust (semi)automatic parser. The need for the automatic natural language parser to be robust increases with the size of the linguistic data in the corpus or in any other kind of text which is going to be parsed. Practical experience shows that apart from syntactically correct sentences, there are many sentences which contain a "real" grammatical error. These sentences may be corrected in small-scale texts, but not generally in the whole corpus. In order to be able to complete the overall project, it was necessary to address a number of smaller problems. These were: 1. the adaptation of a suitable formalism able to describe the formal grammar of the system; 2. the definition of the structure of the system's dictionary containing all relevant lexico-syntactic information; 3. the development of a formal grammar able to robustly parse Czech sentences from the test suite; 4. filling the syntactic dictionary with sample data allowing the system to be tested and debugged during its development (about 1000 words); 5. the development of a set of sample sentences containing a reasonable amount of grammatical and ungrammatical phenomena covering some of the most typical syntactic constructions used in Czech. Number 3, building a formal grammar, was the main task of the project.
The grammar is of course far from complete (Mr. Kubon notes that it is debatable whether any formal grammar describing a natural language may ever be complete), but it covers the most frequent syntactic phenomena, allowing for the representation of the syntactic structure of simple clauses and also the structure of certain types of complex sentences. The stress was not so much on building a wide-coverage grammar as on the description and demonstration of a method. This method uses an approach similar to that of grammar-based grammar checking. The problem of reconstructing the "correct" form of the syntactic representation of a sentence is closely related to the problem of localisation and identification of syntactic errors. Without precise knowledge of the nature and location of syntactic errors it is not possible to build a reliable estimate of a "correct" syntactic tree. The incremental way of building the grammar used in this project is also an important methodological issue. Experience from previous projects showed that building a grammar by creating a huge block of metarules is more complicated, especially from the point of view of testing and debugging the grammar, than the incremental method, which begins with the metarules covering the most common syntactic phenomena and adds less important ones later. The sample of the syntactic dictionary containing lexico-syntactic information (task 4) now has slightly more than 1000 lexical items representing all classes of words. During the creation of the dictionary it turned out that the task of assigning complete and correct lexico-syntactic information to verbs is a very complicated and time-consuming process which would itself be worth a separate project. The final task undertaken in this project was the development of a method allowing effective testing and debugging of the grammar during the process of its development.
The problem of the consistency of new and modified rules of the formal grammar with the rules already existing is one of the crucial problems of every project aiming at the development of a large-scale formal grammar of a natural language. This method allows for the detection of any discrepancy or inconsistency of the grammar with respect to a test-bed of sentences containing all syntactic phenomena covered by the grammar. This is not only the first robust parser of Czech, but also one of the first robust parsers of a Slavic language. Since Slavic languages display a wide range of common features, it is reasonable to claim that this system may serve as a pattern for similar systems in other languages. To transfer the system into any other language it is only necessary to revise the grammar and to change the data contained in the dictionary (but not necessarily the structure of primary lexico-syntactic information). The formalism and methods used in this project can be used in other Slavic languages without substantial changes.
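
The consistency check described in this abstract, comparing a changed grammar against a test-bed of sentences, can be sketched as a small regression harness. Everything below is a toy under stated assumptions: the "grammar" is just a set of allowed part-of-speech sequences over a made-up English tag dictionary, and the names `find_regressions`, `make_grammar` and `TEST_BED` are hypothetical; the project's actual formalism and method are not reproduced here.

```python
# Hypothetical toy lexicon mapping words to part-of-speech tags.
TAGS = {"the": "DET", "dog": "N", "cat": "N", "sees": "V", "sleeps": "V"}


def make_grammar(patterns):
    """Build an `accepts(sentence)` function from allowed tag sequences."""
    def accepts(sentence):
        tags = tuple(TAGS.get(word, "?") for word in sentence.split())
        return tags in patterns
    return accepts


def find_regressions(accepts, test_bed):
    """Return (sentence, expected, actual) for every test-bed sentence whose
    verdict under the current grammar differs from the recorded one."""
    regressions = []
    for sentence, expected in test_bed:
        actual = accepts(sentence)
        if actual != expected:
            regressions.append((sentence, expected, actual))
    return regressions


# Test-bed: sentences paired with the verdict the grammar is expected to give.
TEST_BED = [
    ("the dog sleeps", True),
    ("the dog sees the cat", True),
    ("dog the sleeps", False),
]

if __name__ == "__main__":
    old_grammar = make_grammar({("DET", "N", "V"), ("DET", "N", "V", "DET", "N")})
    new_grammar = make_grammar({("DET", "N", "V")})  # a rule was dropped by mistake
    print(find_regressions(old_grammar, TEST_BED))
    print(find_regressions(new_grammar, TEST_BED))
```

Re-running such a check after every rule change is what makes the incremental way of building a grammar practical: a modification that breaks previously covered constructions shows up immediately.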
