967 results for Natural language techniques, Semantic spaces, Random projection, Documents
                                
Abstract:
Graduate Program in Linguistic Studies - IBILCE
                                
Abstract:
This paper seeks to understand the process by which kindergarten children build the idea of number. To that end, we carried out a qualitative study with a phenomenological approach, involving classroom fieldwork with children aged four and five. Starting from the children's real-world contexts and experiences, and using natural language, tasks were designed to help them go beyond what they already knew, while we analyzed how they think and what knowledge they bring from their lived experience. Through these interventions, the mathematical ideas they had already acquired were expanded. The analysis and interpretation of the research data show that children build the idea of number from all kinds of relationships they create between objects and the world around them; the more diverse these experiences, the greater the opportunities for understanding and for developing mathematical skills and competencies. The study also showed that, in kindergarten, children follow only a few paths to build the idea of number.
                                
Abstract:
The realization that statistical physics methods can be applied to analyze written texts represented as complex networks has led to several developments in natural language processing, including automatic summarization and evaluation of machine translation. Most importantly, so far only a few metrics of complex networks have been used and therefore there is ample opportunity to enhance the statistics-based methods as new measures of network topology and dynamics are created. In this paper, we employ for the first time the metrics betweenness, vulnerability and diversity to analyze written texts in Brazilian Portuguese. Using strategies based on diversity metrics, a better performance in automatic summarization is achieved in comparison to previous work employing complex networks. With an optimized method the Rouge score (an automatic evaluation method used in summarization) was 0.5089, which is the best value ever achieved for an extractive summarizer with statistical methods based on complex networks for Brazilian Portuguese. Furthermore, the diversity metric can detect keywords with high precision, which is why we believe it is suitable to produce good summaries. It is also shown that incorporating linguistic knowledge through a syntactic parser does enhance the performance of the automatic summarizers, as expected, but the increase in the Rouge score is only minor. These results reinforce the suitability of complex network methods for improving automatic summarizers in particular, and treating text in general. (C) 2011 Elsevier B.V. All rights reserved.
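
As a rough illustration of the network-based view described in this abstract, the sketch below builds a word adjacency network from a few sentences and computes betweenness with Brandes' algorithm. It is a minimal sketch under stated assumptions, not the authors' method: the function names (`build_word_network`, `betweenness`, `rank_sentences`), the whitespace tokenization, and the choice to score a sentence by the mean betweenness of its distinct words are all illustrative, and the diversity and vulnerability metrics are not reproduced here.

```python
from collections import defaultdict, deque


def build_word_network(sentences):
    """Link every pair of adjacent words; returns an undirected adjacency map."""
    graph = defaultdict(set)
    for sentence in sentences:
        words = sentence.lower().split()  # assumption: naive whitespace tokens
        for left, right in zip(words, words[1:]):
            if left != right:
                graph[left].add(right)
                graph[right].add(left)
    return dict(graph)


def betweenness(graph):
    """Brandes' algorithm for an unweighted, undirected graph."""
    score = {node: 0.0 for node in graph}
    for source in graph:
        stack = []
        preds = {node: [] for node in graph}
        sigma = {node: 0 for node in graph}   # number of shortest paths
        dist = {node: -1 for node in graph}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:                          # breadth-first search
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {node: 0.0 for node in graph}
        while stack:                          # accumulate dependencies
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each unordered pair of endpoints was visited twice.
    return {node: value / 2 for node, value in score.items()}


def rank_sentences(sentences):
    """Order sentences by the mean betweenness of their distinct words."""
    score = betweenness(build_word_network(sentences))

    def sentence_score(sentence):
        words = set(sentence.lower().split())
        if not words:
            return 0.0
        return sum(score.get(word, 0.0) for word in words) / len(words)

    return sorted(sentences, key=sentence_score, reverse=True)
```

An extractive summary would then keep the first few sentences returned by `rank_sentences`. A real system would also need stop-word handling and lemmatization, which this sketch leaves out.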
                                
Abstract:
The activity of the Ph.D. student Juri Luca De Coi concerned the research field of policy languages and can be divided into three parts. The first part of the Ph.D. work investigated the state of the art in policy languages, resulting in: (i) identifying the requirements that up-to-date policy languages have to fulfill; (ii) defining a policy language able to fulfill such requirements (namely, the Protune policy language); and (iii) implementing an infrastructure able to enforce policies expressed in the Protune policy language. The second part of the Ph.D. work focused on simplifying the activity of defining policies and resulted in: (i) identifying a subset of the controlled natural language ACE to express Protune policies; (ii) implementing a mapping between ACE policies and Protune policies; and (iii) adapting the ACE Editor to guide users step by step when defining ACE policies. The third part of the Ph.D. work tested the feasibility of the chosen approach by applying it to meaningful real-world problems, among which: (i) development of a security layer on top of RDF stores; and (ii) efficient policy-aware access to metadata stores. The research activity was performed in close collaboration with the Leibniz Universität Hannover and further European partners within the projects REWERSE, TENCompetence and OKKAM.
                                
Abstract:
This thesis concerns artificially intelligent natural language processing systems that are capable of learning the properties of lexical items (properties like verbal valency or inflectional class membership) autonomously while they are fulfilling the tasks for which they were deployed in the first place. Many of these tasks require a deep analysis of language input, which can be characterized as a mapping of utterances in a given input C to a set S of linguistically motivated structures with the help of linguistic information encoded in a grammar G and a lexicon L: G + L + C → S (1). The idea that underlies intelligent lexical acquisition systems is to modify this schematic formula in such a way that the system is able to exploit the information encoded in S to create a new, improved version of the lexicon: G + L + S → L' (2). Moreover, the thesis claims that a system can only be considered intelligent if it not only makes maximum use of the learning opportunities in C but is also able to revise falsely acquired lexical knowledge. So, one of the central elements in this work is the formulation of a set of criteria for intelligent lexical acquisition systems subsumed under one paradigm: the Learn-Alpha design rule. The thesis describes the design and quality of a prototype for such a system, whose acquisition components have been developed from scratch and built on top of one of the state-of-the-art Head-driven Phrase Structure Grammar (HPSG) processing systems. The quality of this prototype is investigated in a series of experiments, in which the system is fed with extracts of a large English corpus. While the idea of using machine-readable language input to automatically acquire lexical knowledge is not new, we are not aware of a system that fulfills Learn-Alpha and is able to deal with large corpora.
To name four major challenges of constructing such a system: a) the high number of possible structural descriptions caused by highly underspecified lexical entries demands a parser with a very effective ambiguity management system; b) the automatic construction of concise lexical entries out of a bulk of observed lexical facts requires a special technique of data alignment; c) the reliability of these entries depends on the system's decision on whether it has seen 'enough' input; and d) general properties of language might render some lexical features indeterminable if the system tries to acquire them with too high a precision. The cornerstone of this dissertation is the motivation and development of a general theory of automatic lexical acquisition that is applicable to every language and independent of any particular theory of grammar or lexicon. This work is divided into five chapters. The introductory chapter first contrasts three different and mutually incompatible approaches to (artificial) lexical acquisition: cue-based queries, head-lexicalized probabilistic context-free grammars and learning by unification. Then the Learn-Alpha design rule is postulated. The second chapter outlines the theory that underlies Learn-Alpha and sets out all the related notions and concepts required for a proper understanding of artificial lexical acquisition. Chapter 3 develops the prototyped acquisition method, called ANALYZE-LEARN-REDUCE, a framework which implements Learn-Alpha. The fourth chapter presents the design and results of a bootstrapping experiment conducted on this prototype: lexeme detection, learning of verbal valency, categorization into nominal count/mass classes, selection of prepositions and sentential complements, among others. The thesis concludes with a review of the conclusions and motivation for further improvements as well as proposals for future research on the automatic induction of lexical features.
                                
Abstract:
The thesis addresses the concept of occupational risk exposure. Its purpose is to investigate the work environment and workers' behavior, with the aim of reducing the incidence rate of workplace injuries and achieving risk reduction. First, a new methodology called MIMOSA (Methodology for the Implementation and Monitoring of Occupational SAfety) is proposed, which quantifies the "health and safety" level of any company. Reaching this goal required a multidisciplinary approach in which concepts from engineering and psychology were combined to develop a methodology for predicting accidents and improving occupational safety. The results of testing MIMOSA prompted the use of fuzzy logic in the occupational safety field, both to improve the methodology itself and to overcome the problems encountered with uncertainty in data collection. The literature shows that human factors, risk perception and workers' behavior in relation to perceived risk play a very important role in the occurrence of accidents. This consideration led to a new approach and to a second methodology, which consists in preventing accidents not only on the basis of an analysis of their past dynamics. Indeed, the methodology considers the evaluation of an index based on workers' proactive behaviors and on the potential damage of the accidental events avoided. The innovation lies in applying fuzzy logic to account for the "vagueness" of human behavior and of its natural language. In particular, the application focuses on workers' proactivity and aims to prevent the "injury" event by generating a kind of leading indicator. This procedure was tested at an Italian petrochemical company.
                                
Abstract:
Computer-assisted translation (or computer-aided translation, CAT) is a form of language translation in which a human translator uses computer software to facilitate the translation process. Machine translation (MT) is the automated process by which a computerized system produces translated text or speech from one natural language to another. Both are leading and promising technologies in the translation industry; it therefore seems important that translation students and professional translators become familiar with these relatively new types of technology. When used together, these two different types of systems might not only reduce translation time but also lead to further improvement in the field of translation technologies. The dissertation consists of four chapters. The first surveys the chronological development of MT and CAT tools, the emergence of pre-editing, post-editing and controlled language, and the latest frontiers in this sector. The second provides a general overview of the four main CAT tools that are used nowadays and are tested here. The third chapter is dedicated to the experiments that were conducted in order to analyze and evaluate the performance of the four integrated systems that are the core subject of this dissertation. Finally, the fourth chapter deals with the issue of terminological equivalence in interlinguistic translation. The purpose of this dissertation is not to provide an objective and definitive solution to the complex issues that constantly arise in the field of translation technologies, an aim that is still far from being achieved, but to supply information about the limits and potential of those instruments which are now essential to any professional translator.
                                
Abstract:
Word Sense Disambiguation is a computational problem in the field of Natural Language Processing that consists in determining the sense of a word according to the context in which it is used. While such a process may seem trivial for a human being, it can prove extraordinarily complicated when one tries to encode it as a series of instructions executable by a machine. The first and main problem to be tackled is that of knowledge: to disambiguate the terms of a text, a computer must be able to draw on a lexicon that is as consistent as possible with that of a human being. Although there are other ways of proceeding, creating a machine-readable knowledge source is certainly the method that addresses the problem most directly. This thesis first tries to explain what Word Sense Disambiguation consists of, through a description of the problem that is brief yet as detailed as possible. Chapter 1 presents the problem, starting with some historical background and then moving on to the fundamental components to be taken into account during the work. It illustrates concepts that are taken up again later, ranging from the normalization of the input text to a summary of the classification methods commonly used in this field. Chapter 2 is devoted to the description of BabelNet, a recently built multilingual lexical-semantic resource created at La Sapienza University of Rome. First, the two sources from which BabelNet draws its knowledge, WordNet and Wikipedia, are described. Then the steps of its creation are illustrated, from the mapping between the two base resources to the definition of all the relations that link the sets of terms within the lexicon.
Finally, a series of experiments is proposed that puts BabelNet to the test, first to verify the soundness of its construction method and then to compare it, in terms of performance, with other state-of-the-art systems by subjecting it to several tasks taken from SemEval, the international events dedicated to the evaluation of WSD problems, which define the de facto standards of this field. The final chapter develops some considerations on disambiguation, introduced by a list of the main application areas of the problem. Possible future developments of the research are outlined here, along with the known problems and the paths recently taken to try to push the performance of Word Sense Disambiguation beyond the limits defined so far.
                                
Abstract:
In contrast to formal semantics, the conjunction ‘and’ is nonsymmetrical in pragmatics. The events in ‘Marc went to bed and fell asleep’ seem to have occurred chronologically although no explicit time reference is given. As the temporal interpretation appears to be weaker in ‘Mia ate chocolate and drank milk’, it seems that the kind and nature of the events presented in a context influence the interpretation of the conjunction. This work focuses on contextual influences on the interpretation of the German conjunction und (‘and’). A variety of theoretical approaches are concerned with whether ‘and’ contributes to the establishment of discourse coherence via pragmatic processes or whether the conjunction has complex semantic meaning. These approaches are discussed with respect to how they explain the temporal and additive interpretation of the conjunction and the role of context in the interpretation process. It turned out that most theoretical approaches do not consider the importance of different kinds of context in the interpretation process.
In experimental pragmatics there are currently only very few studies that investigate the interpretation of the conjunction. As there are no studies that systematically investigate contextual influences on the interpretation of und or that investigate preschoolers' interpretation of the conjunction, research questions such as ‘How do (preschool) children interpret und?’ and ‘Does the kind of events conjoined influence children’s and adults’ interpretation?’ are yet to be answered. Therefore, this dissertation systematically investigates how different types of context influence children’s interpretation of und. Three auditory comprehension studies were conducted in German. Of special interest was whether and how the order of events introduced in a context contributes to the temporal reading of the conjunction und.
Results indicate that the interpretation of und is, at least in German, context-dependent: the conjunction is interpreted temporally more often when events that typically occur in a certain order are connected (typical contexts) than when events without a typical order are connected (neutral contexts). This suggests that the type of events conjoined influences the interpretation process. Moreover, older children and adults interpret the conjunction temporally more often than the younger cohorts if the conjoined events typically occur in a certain order. In neutral contexts, additive interpretations increase with age. 5-year-olds reject reversed order statements more often in typical contexts compared to neutral contexts. However, they have more difficulties with reversed order statements in typical contexts, where they perform at chance level. This suggests that not only the type of event but also other age-dependent factors such as knowledge about scripts influence children’s performance. The type of event conjoined influences children’s and adults’ interpretation of the conjunction. Therefore, the influence of different event types and script knowledge on the interpretation process has to be considered not only in future experimental studies on language acquisition and pragmatics but also in experimental pragmatics in general. In linguistic theories, context has to be given a central role, and a common definition of context that considers the consequences arising from different event types has to be agreed upon.
                                
Abstract:
This work introduces the basic concepts of Natural Language Processing, dwelling on Information Extraction and analyzing its application areas, its main activities and how it differs from Information Retrieval. It then analyzes the Named Entity Recognition process, focusing on the main issues in text annotation and on methods for evaluating the quality of entity extraction. Finally, it provides an overview of the open-source language processing software platform GATE/ANNIE, describing its architecture and main components, with a closer look at the tools GATE offers for the rule-based approach to Named Entity Recognition.
                                
Abstract:
The thesis centers on the game «Indovina chi?» ("Guess Who?"), in which the Nao robot identifies a character from its description. Specifically, the description is given through questions and answers. The aim of the thesis is to design a system able to understand and process data communicated in a subset of natural language, extract the key information and match it against previously supplied information. The Nao robot was therefore programmed to play a game of «Indovina chi?» against a human, communicating in natural language. Extraction and categorization rules for text understanding were implemented using Cogito, a technology patented by the company Expert System. In this way the robot is able to understand the answers and to answer the questions formulated by the human in natural language. Speech recognition relies on the Google API, with PyAudio used for microphone access. The program was implemented in Python, and the character data are stored in a database that the robot queries and modifies. The game algorithm is based on probabilistic calculations of the robot's chances of winning and on choosing which questions to ask according to the answers previously received from the human. The semantic rules allow the player to formulate sentences in natural language; moreover, the robot is able to pick out the information that concerns the character to be guessed without being misled. The robot's win rate over 20 games was 50%. The database was designed so that it can hold a complete identikit of a person, in addition to those of the game's characters. The project can therefore be extended to purposes other than the game, in the field of identification.
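
The question-selection idea mentioned in this abstract can be illustrated with a small sketch. The thesis does not spell out its algorithm, so everything below is an assumption: the helper names `best_question` and `apply_answer`, the dictionary-based character data, and the heuristic of asking the question whose yes/no split is most balanced, which is a common strategy for this kind of game and may differ from what the author implemented.

```python
def best_question(candidates):
    """Pick the (attribute, value) pair that splits the candidates most evenly.

    `candidates` maps a character name to a dict of traits (hypothetical data).
    """
    best, best_gap = None, None
    pairs = sorted({(attribute, value)
                    for traits in candidates.values()
                    for attribute, value in traits.items()})
    for attribute, value in pairs:
        yes = sum(1 for traits in candidates.values()
                  if traits.get(attribute) == value)
        # A gap of 0 means exactly half of the candidates would answer "yes".
        gap = abs(2 * yes - len(candidates))
        if best_gap is None or gap < best_gap:
            best, best_gap = (attribute, value), gap
    return best


def apply_answer(candidates, question, answer_is_yes):
    """Keep only the candidates consistent with the human's answer."""
    attribute, value = question
    return {name: traits for name, traits in candidates.items()
            if (traits.get(attribute) == value) == answer_is_yes}


# Hypothetical character data, for illustration only.
characters = {
    "Anna": {"hair": "red", "glasses": "yes"},
    "Bruno": {"hair": "black", "glasses": "yes"},
    "Carla": {"hair": "black", "glasses": "no"},
    "Dario": {"hair": "blond", "glasses": "no"},
}
question = best_question(characters)
remaining = apply_answer(characters, question, answer_is_yes=True)
```

Repeating the two calls until a single candidate remains gives a complete, if simplistic, game loop. Speech recognition and language understanding are outside the scope of this sketch.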
                                
Abstract:
Library of Congress Subject Headings (LCSH), the standard subject language used in library catalogues, are often criticized for their lack of currency, biased language, and atypical syndetic structure. Conversely, folksonomies (or tags), which rely on the natural language of their users, offer a flexibility often lacking in controlled vocabularies and may offer a means of augmenting more rigid controlled vocabularies such as LCSH. Content analysis studies have demonstrated the potential for folksonomies to be used as a means of enhancing subject access to materials, and libraries are beginning to integrate tagging systems into their catalogues. This study examines the utility of tags as a means of enhancing subject access to materials in library online public access catalogues (OPACs) through usability testing with the LibraryThing for Libraries catalogue enhancements. Findings indicate that while they cannot replace LCSH, tags do show promise for aiding information seeking in OPACs. In the context of information systems design, the study revealed that while folksonomies have the potential to enhance subject access to materials, that potential is severely limited by the current inability of catalogue interfaces to support tag-based searches alongside standard catalogue searches.
                                
                                
Abstract:
Mr. Kubon's project was inspired by the growing need for an automatic syntactic analyser (parser) of Czech, which could be used in the syntactic processing of large amounts of text. Mr. Kubon notes that such a tool would be very useful, especially in the field of corpus linguistics, where creating a large-scale "tree bank" (a collection of syntactic representations of natural language sentences) is a very important step towards the investigation of the properties of a given language. The work involved in syntactically parsing a whole corpus in order to get a representative set of syntactic structures would be almost inconceivable without the help of some kind of robust (semi)automatic parser. The need for the automatic natural language parser to be robust increases with the size of the linguistic data in the corpus or in any other kind of text which is going to be parsed. Practical experience shows that apart from syntactically correct sentences, there are many sentences which contain a "real" grammatical error. These sentences may be corrected in small-scale texts, but not generally in the whole corpus. In order to be able to complete the overall project, it was necessary to address a number of smaller problems. These were: 1. the adaptation of a suitable formalism able to describe the formal grammar of the system; 2. the definition of the structure of the system's dictionary containing all relevant lexico-syntactic information; 3. the development of a formal grammar able to robustly parse Czech sentences from the test suite; 4. filling the syntactic dictionary with sample data allowing the system to be tested and debugged during its development (about 1000 words); 5. the development of a set of sample sentences containing a reasonable amount of grammatical and ungrammatical phenomena covering some of the most typical syntactic constructions being used in Czech. Number 3, building a formal grammar, was the main task of the project.
The grammar is of course far from complete (Mr. Kubon notes that it is debatable whether any formal grammar describing a natural language may ever be complete), but it covers the most frequent syntactic phenomena, allowing for the representation of a syntactic structure of simple clauses and also the structure of certain types of complex sentences. The stress was not so much on building a wide coverage grammar, but on the description and demonstration of a method. This method uses a similar approach as that of grammar-based grammar checking. The problem of reconstructing the "correct" form of the syntactic representation of a sentence is closely related to the problem of localisation and identification of syntactic errors. Without a precise knowledge of the nature and location of syntactic errors it is not possible to build a reliable estimation of a "correct" syntactic tree. The incremental way of building the grammar used in this project is also an important methodological issue. Experience from previous projects showed that building a grammar by creating a huge block of metarules is more complicated than the incremental method, which begins with the metarules covering most common syntactic phenomena first, and adds less important ones later, especially from the point of view of testing and debugging the grammar. The sample of the syntactic dictionary containing lexico-syntactical information (task 4) now has slightly more than 1000 lexical items representing all classes of words. During the creation of the dictionary it turned out that the task of assigning complete and correct lexico-syntactic information to verbs is a very complicated and time-consuming process which would itself be worth a separate project. The final task undertaken in this project was the development of a method allowing effective testing and debugging of the grammar during the process of its development. 
The problem of the consistency of new and modified rules of the formal grammar with the rules already existing is one of the crucial problems of every project aiming at the development of a large-scale formal grammar of a natural language. This method allows for the detection of any discrepancy or inconsistency of the grammar with respect to a test-bed of sentences containing all syntactic phenomena covered by the grammar. This is not only the first robust parser of Czech, but also one of the first robust parsers of a Slavic language. Since Slavic languages display a wide range of common features, it is reasonable to claim that this system may serve as a pattern for similar systems in other languages. To transfer the system into any other language it is only necessary to revise the grammar and to change the data contained in the dictionary (but not necessarily the structure of primary lexico-syntactic information). The formalism and methods used in this project can be used in other Slavic languages without substantial changes.
 
                    