18 resultados para Natural Language Queries, NLPX, Bricks, XML-IR, Users
em Helda - Digital Repository of University of Helsinki
Resumo:
This paper introduces the META-NORD project which develops Nordic and Baltic part of the European open language resource infrastructure. META-NORD works on assembling, linking across languages, and making widely available the basic language resources used by developers, professionals and researchers to build specific products and applications. The goals of the project, overall approach and specific focus lines on wordnets, terminology resources and treebanks are described. Moreover, results achieved in first five months of the project, i.e. language whitepapers, metadata specification and IPR, are presented.
Resumo:
This dissertation is a theoretical study of finite-state based grammars used in natural language processing. The study is concerned with certain varieties of finite-state intersection grammars (FSIG) whose parsers define regular relations between surface strings and annotated surface strings. The study focuses on the following three aspects of FSIGs: (i) Computational complexity of grammars under limiting parameters In the study, the computational complexity in practical natural language processing is approached through performance-motivated parameters on structural complexity. Each parameter splits some grammars in the Chomsky hierarchy into an infinite set of subset approximations. When the approximations are regular, they seem to fall into the logarithmic-time hierarchyand the dot-depth hierarchy of star-free regular languages. This theoretical result is important and possibly relevant to grammar induction. (ii) Linguistically applicable structural representations Related to the linguistically applicable representations of syntactic entities, the study contains new bracketing schemes that cope with dependency links, left- and right branching, crossing dependencies and spurious ambiguity. New grammar representations that resemble the Chomsky-Schützenberger representation of context-free languages are presented in the study, and they include, in particular, representations for mildly context-sensitive non-projective dependency grammars whose performance-motivated approximations are linear time parseable. (iii) Compilation and simplification of linguistic constraints Efficient compilation methods for certain regular operations such as generalized restriction are presented. These include an elegant algorithm that has already been adopted as the approach in a proprietary finite-state tool. In addition to the compilation methods, an approach to on-the-fly simplifications of finite-state representations for parse forests is sketched. These findings are tightly coupled with each other under the theme of locality. I argue that the findings help us to develop better, linguistically oriented formalisms for finite-state parsing and to develop more efficient parsers for natural language processing. Avainsanat: syntactic parsing, finite-state automata, dependency grammar, first-order logic, linguistic performance, star-free regular approximations, mildly context-sensitive grammars
Resumo:
This research is based on the problems in secondary school algebra I have noticed in my own work as a teacher of mathematics. Algebra does not touch the pupil, it remains knowledge that is not used or tested. Furthermore the performance level in algebra is quite low. This study presents a model for 7th grade algebra instruction in order to make algebra more natural and useful to students. I refer to the instruction model as the Idea-based Algebra (IDEAA). The basic ideas of this IDEAA model are 1) to combine children's own informal mathematics with scientific mathematics ("math math") and 2) to structure algebra content as a "map of big ideas", not as a traditional sequence of powers, polynomials, equations, and word problems. This research project is a kind of design process or design research. As such, this project has three, intertwined goals: research, design and pedagogical practice. I also assume three roles. As a researcher, I want to learn about learning and school algebra, its problems and possibilities. As a designer, I use research in the intervention to develop a shared artefact, the instruction model. In addition, I want to improve the practice through intervention and research. A design research like this is quite challenging. Its goals and means are intertwined and change in the research process. Theory emerges from the inquiry; it is not given a priori. The aim to improve instruction is normative, as one should take into account what "good" means in school algebra. An important part of my study is to work out these paradigmatic questions. The result of the study is threefold. The main result is the instruction model designed in the study. The second result is the theory that is developed of the teaching, learning and algebra. The third result is knowledge of the design process. The instruction model (IDEAA) is connected to four main features of good algebra education: 1) the situationality of learning, 2) learning as knowledge building, in which natural language and intuitive thinking work as "intermediaries", 3) the emergence and diversity of algebra, and 4) the development of high performance skills at any stage of instruction.
Resumo:
Topic detection and tracking (TDT) is an area of information retrieval research the focus of which revolves around news events. The problems TDT deals with relate to segmenting news text into cohesive stories, detecting something new, previously unreported, tracking the development of a previously reported event, and grouping together news that discuss the same event. The performance of the traditional information retrieval techniques based on full-text similarity has remained inadequate for online production systems. It has been difficult to make the distinction between same and similar events. In this work, we explore ways of representing and comparing news documents in order to detect new events and track their development. First, however, we put forward a conceptual analysis of the notions of topic and event. The purpose is to clarify the terminology and align it with the process of news-making and the tradition of story-telling. Second, we present a framework for document similarity that is based on semantic classes, i.e., groups of words with similar meaning. We adopt people, organizations, and locations as semantic classes in addition to general terms. As each semantic class can be assigned its own similarity measure, document similarity can make use of ontologies, e.g., geographical taxonomies. The documents are compared class-wise, and the outcome is a weighted combination of class-wise similarities. Third, we incorporate temporal information into document similarity. We formalize the natural language temporal expressions occurring in the text, and use them to anchor the rest of the terms onto the time-line. Upon comparing documents for event-based similarity, we look not only at matching terms, but also how near their anchors are on the time-line. Fourth, we experiment with an adaptive variant of the semantic class similarity system. The news reflect changes in the real world, and in order to keep up, the system has to change its behavior based on the contents of the news stream. We put forward two strategies for rebuilding the topic representations and report experiment results. We run experiments with three annotated TDT corpora. The use of semantic classes increased the effectiveness of topic tracking by 10-30\% depending on the experimental setup. The gain in spotting new events remained lower, around 3-4\%. The anchoring the text to a time-line based on the temporal expressions gave a further 10\% increase the effectiveness of topic tracking. The gains in detecting new events, again, remained smaller. The adaptive systems did not improve the tracking results.
Resumo:
In this thesis we present and evaluate two pattern matching based methods for answer extraction in textual question answering systems. A textual question answering system is a system that seeks answers to natural language questions from unstructured text. Textual question answering systems are an important research problem because as the amount of natural language text in digital format grows all the time, the need for novel methods for pinpointing important knowledge from the vast textual databases becomes more and more urgent. We concentrate on developing methods for the automatic creation of answer extraction patterns. A new type of extraction pattern is developed also. The pattern matching based approach chosen is interesting because of its language and application independence. The answer extraction methods are developed in the framework of our own question answering system. Publicly available datasets in English are used as training and evaluation data for the methods. The techniques developed are based on the well known methods of sequence alignment and hierarchical clustering. The similarity metric used is based on edit distance. The main conclusions of the research are that answer extraction patterns consisting of the most important words of the question and of the following information extracted from the answer context: plain words, part-of-speech tags, punctuation marks and capitalization patterns, can be used in the answer extraction module of a question answering system. This type of patterns and the two new methods for generating answer extraction patterns provide average results when compared to those produced by other systems using the same dataset. However, most answer extraction methods in the question answering systems tested with the same dataset are both hand crafted and based on a system-specific and fine-grained question classification. The the new methods developed in this thesis require no manual creation of answer extraction patterns. As a source of knowledge, they require a dataset of sample questions and answers, as well as a set of text documents that contain answers to most of the questions. The question classification used in the training data is a standard one and provided already in the publicly available data.
Resumo:
Automatisk språkprocessering har efter mer än ett halvt sekel av forskning blivit ett mycket viktigt område inom datavetenskapen. Flera vetenskapligt viktiga problem har lösts och praktiska applikationer har nått programvarumarknaden. Disambiguering av ord innebär att hitta rätt betydelse för ett mångtydigt ord. Sammanhanget, de omkringliggande orden och kunskap om ämnesområdet är faktorer som kan användas för att disambiguera ett ord. Automatisk sammanfattning innebär att förkorta en text utan att den relevanta informationen går förlorad. Relevanta meningar kan plockas ur texten, eller så kan en ny, kortare text genereras på basen av fakta i den ursprungliga texten. Avhandlingen ger en allmän översikt och kort historik av språkprocesseringen och jämför några metoder för disambiguering av ord och automatisk sammanfattning. Problemområdenas likheter och skillnader lyfts fram och metodernas ställning inom datavetenskapen belyses.
Resumo:
We have presented an overview of the FSIG approach and related FSIG gram- mars to issues of very low complexity and parsing strategy. We ended up with serious optimism according to which most FSIG grammars could be decom- posed in a reasonable way and then processed efficiently.
Resumo:
FinnWordNet is a wordnet for Finnish that complies with the format of the Princeton WordNet (PWN) (Fellbaum, 1998). It was built by translating the PrincetonWordNet 3.0 synsets into Finnish by human translators. It is open source and contains 117000 synsets. The Finnish translations were inserted into the PWN structure resulting in a bilingual lexical database. In natural language processing (NLP), wordnets have been used for infusing computers with semantic knowledge assuming that humans already have a sufficient amount of this knowledge. In this paper we present a case study of using wordnets as an electronic dictionary. We tested whether native Finnish speakers benefit from using a wordnet while completing English sentence completion tasks. We found that using either an English wordnet or a bilingual English Finnish wordnet significantly improves performance in the task. This should be taken into account when setting standards and comparing human and computer performance on these tasks.
Resumo:
We use parallel weighted finite-state transducers to implement a part-of-speech tagger, which obtains state-of-the-art accuracy when used to tag the Europarl corpora for Finnish, Swedish and English. Our system consists of a weighted lexicon and a guesser combined with a bigram model factored into two weighted transducers. We use both lemmas and tag sequences in the bigram model, which guarantees reliable bigram estimates.
Resumo:
Finite-state methods have been adopted widely in computational morphology and related linguistic applications. To enable efficient development of finite-state based linguistic descriptions, these methods should be a freely available resource for academic language research and the language technology industry. The following needs can be identified: (i) a registry that maps the existing approaches, implementations and descriptions, (ii) managing the incompatibilities of the existing tools, (iii) increasing synergy and complementary functionality of the tools, (iv) persistent availability of the tools used to manipulate the archived descriptions, (v) an archive for free finite-state based tools and linguistic descriptions. Addressing these challenges contributes to building a common research infrastructure for advanced language technology.
Resumo:
Kirjallisuuden- ja kulttuurintutkimus on viimeisten kolmen vuosikymmenen aikana tullut yhä enenevässä määrin tietoiseksi tieteen ja taiteen suhteen monimutkaisesta luonteesta. Nykyään näiden kahden kulttuurin tutkimus muodostaa oman kenttänsä, jolla niiden suhdetta tarkastellaan ennen kaikkea dynaamisena vuorovaikutuksena, joka heijastaa kulttuurimme kieltä, arvoja ja ideologisia sisältöjä. Toisin kuin aiemmat näkemykset, jotka pitävät tiedettä ja taidetta toisilleen enemmän tai vähemmän vastakkaisina pyrkimyksinä, nykytutkimus lähtee oletuksesta, jonka mukaan ne ovat kulttuurillisesti rakentuneita diskursseja, jotka kohtaavat usein samankaltaisia todellisuuden mallintamiseen liittyviä ongelmia, vaikka niiden käyttämät metodit eroavatkin toisistaan. Väitöskirjani keskittyy yllä mainitun suhteen osa-alueista popularisoidun tietokirjallisuuden (muun muassa Paul Davies, James Gleick ja Richard Dawkins) käyttämän kielen ja luonnontieteistä ideoita ammentavan kaunokirjallisuuden (muun muassa Jeanette Winterson, Tom Stoppard ja Richard Powers) hyödyntämien keinojen tarkasteluun nojautuen yli 30 teoksen kattavaa aineistoa koskevaan tyylin ja teemojen tekstianalyysiin. Populaarin tietokirjallisuuden osalta tarkoituksenani on osoittaa, että sen käyttämä kieli rakentuu huomattavassa määrin sellaisille rakenteille, jotka tarjoavat mahdollisuuden esittää todellisuutta koskevia argumentteja mahdollisimman vakuuttavalla tavalla. Tässä tehtävässä monilla klassisen retoriikan määrittelemillä kuvioilla on tärkeä rooli, koska ne auttavat liittämään sanotun sisällön ja muodon tiukasti toisiinsa: retoristen kuvioiden käyttö ei näin ollen edusta pelkkää tyylikeinoa, vaan se myös usein kiteyttää argumenttien taustalla olevat tieteenfilosofiset olettamukset ja auttaa vakiinnuttamaan argumentoinnin logiikan. Koska monet aikaisemmin ilmestyneistä tutkimuksista ovat keskittyneet pelkästään metaforan rooliin tieteellisissä argumenteissa, tämä väitöskirja pyrkii laajentamaan tutkimuskenttää analysoimalla myös toisenlaisten kuvioiden käyttöä. Osoitan myös, että retoristen kuvioiden käyttö muodostaa yhtymäkohdan tieteellisiä ideoita hyödyntävään kaunokirjallisuuteen. Siinä missä popularisoitu tiede käyttää retoriikkaa vahvistaakseen sekä argumentatiivisia että kaunokirjallisia ominaisuuksiaan, kuvaa tällainen sanataide tiedettä tavoilla, jotka usein heijastelevat tietokirjallisuuden kielellisiä rakenteita. Toisaalta on myös mahdollista nähdä, miten kaunokirjallisuuden keinot heijastuvat popularisoidun tieteen kerrontatapoihin ja kieleen todistaen kahden kulttuurin dynaamisesta vuorovaikutuksesta. Nykyaikaisen populaaritieteen retoristen elementtien ja kaunokirjallisuuden keinojen vertailu näyttää lisäksi, kuinka tiede ja taide osallistuvat keskusteluun kulttuurimme tiettyjen peruskäsitteiden kuten identiteetin, tiedon ja ajan merkityksestä. Tällä tavoin on mahdollista nähdä, että molemmat ovat perustavanlaatuisia osia merkityksenantoprosessissa, jonka kautta niin tieteelliset ideat kuin ihmiselämän suuret kysymyksetkin saavat kulttuurillisesti rakentuneen merkityksensä.
Resumo:
XML documents are becoming more and more common in various environments. In particular, enterprise-scale document management is commonly centred around XML, and desktop applications as well as online document collections are soon to follow. The growing number of XML documents increases the importance of appropriate indexing methods and search tools in keeping the information accessible. Therefore, we focus on content that is stored in XML format as we develop such indexing methods. Because XML is used for different kinds of content ranging all the way from records of data fields to narrative full-texts, the methods for Information Retrieval are facing a new challenge in identifying which content is subject to data queries and which should be indexed for full-text search. In response to this challenge, we analyse the relation of character content and XML tags in XML documents in order to separate the full-text from data. As a result, we are able to both reduce the size of the index by 5-6\% and improve the retrieval precision as we select the XML fragments to be indexed. Besides being challenging, XML comes with many unexplored opportunities which are not paid much attention in the literature. For example, authors often tag the content they want to emphasise by using a typeface that stands out. The tagged content constitutes phrases that are descriptive of the content and useful for full-text search. They are simple to detect in XML documents, but also possible to confuse with other inline-level text. Nonetheless, the search results seem to improve when the detected phrases are given additional weight in the index. Similar improvements are reported when related content is associated with the indexed full-text including titles, captions, and references. Experimental results show that for certain types of document collections, at least, the proposed methods help us find the relevant answers. Even when we know nothing about the document structure but the XML syntax, we are able to take advantage of the XML structure when the content is indexed for full-text search.
Resumo:
Wireless technologies are continuously evolving. Second generation cellular networks have gained worldwide acceptance. Wireless LANs are commonly deployed in corporations or university campuses, and their diffusion in public hotspots is growing. Third generation cellular systems are yet to affirm everywhere; still, there is an impressive amount of research ongoing for deploying beyond 3G systems. These new wireless technologies combine the characteristics of WLAN based and cellular networks to provide increased bandwidth. The common direction where all the efforts in wireless technologies are headed is towards an IP-based communication. Telephony services have been the killer application for cellular systems; their evolution to packet-switched networks is a natural path. Effective IP telephony signaling protocols, such as the Session Initiation Protocol (SIP) and the H 323 protocol are needed to establish IP-based telephony sessions. However, IP telephony is just one service example of IP-based communication. IP-based multimedia sessions are expected to become popular and offer a wider range of communication capabilities than pure telephony. In order to conjoin the advances of the future wireless technologies with the potential of IP-based multimedia communication, the next step would be to obtain ubiquitous communication capabilities. According to this vision, people must be able to communicate also when no support from an infrastructured network is available, needed or desired. In order to achieve ubiquitous communication, end devices must integrate all the capabilities necessary for IP-based distributed and decentralized communication. Such capabilities are currently missing. For example, it is not possible to utilize native IP telephony signaling protocols in a totally decentralized way. This dissertation presents a solution for deploying the SIP protocol in a decentralized fashion without support of infrastructure servers. The proposed solution is mainly designed to fit the needs of decentralized mobile environments, and can be applied to small scale ad-hoc networks or also bigger networks with hundreds of nodes. A framework allowing discovery of SIP users in ad-hoc networks and the establishment of SIP sessions among them, in a fully distributed and secure way, is described and evaluated. Security support allows ad-hoc users to authenticate the sender of a message, and to verify the integrity of a received message. The distributed session management framework has been extended in order to achieve interoperability with the Internet, and the native Internet applications. With limited extensions to the SIP protocol, we have designed and experimentally validated a SIP gateway allowing SIP signaling between ad-hoc networks with private addressing space and native SIP applications in the Internet. The design is completed by an application level relay that permits instant messaging sessions to be established in heterogeneous environments. The resulting framework constitutes a flexible and effective approach for the pervasive deployment of real time applications.