19 results for XML, Information, Retrieval, Query, Language
in Helda - Digital Repository of University of Helsinki
Abstract:
A visual world eye-tracking study investigated the activation and persistence of implicit causality information in spoken language comprehension. We showed that people infer the implicit causality of verbs as soon as they encounter such verbs in discourse, as is predicted by proponents of the immediate focusing account (Greene & McKoon, 1995; Koornneef & Van Berkum, 2006; Van Berkum, Koornneef, Otten, & Nieuwland, 2007). Interestingly, we observed activation of implicit causality information even before people encountered the causal conjunction. However, while implicit causality information was persistent as the discourse unfolded, it did not have a privileged role as a focusing cue immediately at the ambiguous pronoun when people were resolving its antecedent. Instead, our study indicated that implicit causality does not affect all referents to the same extent, rather it interacts with other cues in the discourse, especially when one of the referents is already prominently in focus.
Abstract:
XML documents are becoming more and more common in various environments. In particular, enterprise-scale document management is commonly centred around XML, and desktop applications as well as online document collections are soon to follow. The growing number of XML documents increases the importance of appropriate indexing methods and search tools in keeping the information accessible. Therefore, we focus on content that is stored in XML format as we develop such indexing methods. Because XML is used for different kinds of content, ranging all the way from records of data fields to narrative full-texts, the methods for Information Retrieval are facing a new challenge in identifying which content is subject to data queries and which should be indexed for full-text search. In response to this challenge, we analyse the relation of character content and XML tags in XML documents in order to separate the full-text from data. As a result, we are able to both reduce the size of the index by 5-6% and improve the retrieval precision as we select the XML fragments to be indexed. Besides being challenging, XML comes with many unexplored opportunities that have received little attention in the literature. For example, authors often tag the content they want to emphasise by using a typeface that stands out. The tagged content constitutes phrases that are descriptive of the content and useful for full-text search. They are simple to detect in XML documents, but can also be confused with other inline-level text. Nonetheless, the search results seem to improve when the detected phrases are given additional weight in the index. Similar improvements are reported when related content, including titles, captions, and references, is associated with the indexed full-text. Experimental results show that, at least for certain types of document collections, the proposed methods help us find the relevant answers.
Even when we know nothing about the document structure but the XML syntax, we are able to take advantage of the XML structure when the content is indexed for full-text search.
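The idea of separating full-text from data by relating character content to XML tags can be illustrated with a minimal sketch. The measure used here (characters of text per element tag in a subtree) and the threshold of 10.0 are assumptions chosen for illustration; the abstract does not state the exact measure or cut-off used in the work.

```python
import xml.etree.ElementTree as ET

def text_per_tag(elem):
    """Characters of text per element tag in the subtree rooted at elem."""
    n_chars = sum(len(t.strip()) for t in elem.itertext())
    n_tags = sum(1 for _ in elem.iter())  # counts elem itself, so never zero
    return n_chars / n_tags

def classify(elem, threshold=10.0):
    """Hypothetical rule: text-dense fragments are full-text, tag-dense ones are data."""
    return "full-text" if text_per_tag(elem) >= threshold else "data"

doc = ET.fromstring(
    "<doc>"
    "<record><id>17</id><year>2006</year><lang>en</lang></record>"
    "<section><p>XML documents are becoming <b>more common</b>"
    " in many environments.</p></section>"
    "</doc>"
)

# Classify each top-level fragment of the toy document.
labels = {child.tag: classify(child) for child in doc}
print(labels)
```

In this toy document the record fragment has about two characters per tag and the narrative section about nineteen, so only the section would be selected for full-text indexing.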
Abstract:
The usual task in music information retrieval (MIR) is to find occurrences of a monophonic query pattern within a music database, which can contain both monophonic and polyphonic content. The so-called query-by-humming systems are a famous instance of content-based MIR. In such a system, the user's hummed query is converted into symbolic form to perform search operations in a similarly encoded database. The symbolic representation (e.g., textual, MIDI or vector data) is typically a quantized and simplified version of the sampled audio data, yielding faster search algorithms and space requirements that can be met in real-life situations. In this thesis, we investigate geometric approaches to MIR. We first study some musicological properties often needed in MIR algorithms, and then give a literature review on traditional (e.g., string-matching-based) MIR algorithms and novel techniques based on geometry. We also introduce some concepts from digital image processing, namely mathematical morphology, which we will use to develop and implement four algorithms for geometric music retrieval. The symbolic representation in the case of our algorithms is a binary 2-D image. We use various morphological pre- and post-processing operations on the query and the database images to perform template matching / pattern recognition for the images. The algorithms are basically extensions to classic image correlation and hit-or-miss transformation techniques used widely in template matching applications. They aim to be a future extension to the retrieval engine of C-BRAHMS, which is a research project of the Department of Computer Science at the University of Helsinki.
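Template matching on a binary 2-D image can be sketched with plain sets of (time, pitch) points. The following is a simplified illustration of binary erosion (the foreground half of a hit-or-miss transform), not one of the four algorithms of the thesis; the function name and the toy data are hypothetical.

```python
def find_occurrences(database, query):
    """Return every (dt, dp) shift at which all query notes land on database notes.

    Both arguments are iterables of (time, pitch) pairs, i.e. the foreground
    pixels of a binary piano-roll image. This is binary erosion of the
    database image by the query pattern.
    """
    db = set(database)
    q = sorted(query)
    t0, p0 = q[0]  # anchor: the first query note
    matches = []
    for t, p in sorted(db):
        dt, dp = t - t0, p - p0
        if all((qt + dt, qp + dp) in db for qt, qp in q):
            matches.append((dt, dp))
    return matches

# Polyphonic database: a rising three-note figure appears twice,
# with an unrelated bass note sounding at time 2.
database = [(0, 60), (1, 62), (2, 64), (2, 55), (4, 67), (5, 69), (6, 71)]
query = [(0, 0), (1, 2), (2, 4)]  # monophonic pattern, relative pitches

print(find_occurrences(database, query))
```

Because the match only requires the query points to be present, extra simultaneous notes in the database do not prevent a hit, which is what makes this kind of matching usable on polyphonic content. Matching shifts are found in both time and pitch, so transposed occurrences are reported as well.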
Abstract:
A large fraction of an XML document typically consists of text data. The XPath query language allows text search via the equal, contains, and starts-with predicates. Such predicates can be efficiently implemented using a compressed self-index of the document's text nodes. Most queries, however, contain some parts querying the text of the document, plus some parts querying the tree structure. It is therefore a challenge to choose an appropriate evaluation order for a given query, which optimally leverages the execution speeds of the text and tree indexes. Here the SXSI system is introduced. It stores the tree structure of an XML document using a bit array of opening and closing brackets plus a sequence of labels, and stores the text nodes of the document using a global compressed self-index. On top of these indexes sits an XPath query engine that is based on tree automata. The engine uses fast counting queries of the text index in order to dynamically determine whether to evaluate top-down or bottom-up with respect to the tree structure. The resulting system has several advantages over existing systems: (1) on pure tree queries (without text search) such as the XPathMark queries, the SXSI system performs on par or better than the fastest known systems MonetDB and Qizx, (2) on queries that use text search, SXSI outperforms the existing systems by 1-3 orders of magnitude (depending on the size of the result set), and (3) with respect to memory consumption, SXSI outperforms all other systems for counting-only queries.
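The tree representation described above (a bit array of opening and closing brackets plus a sequence of labels, with text stored separately) can be sketched as follows. This is an uncompressed toy version; the direction rule in `choose_strategy` is an assumption made for illustration and is not taken from SXSI, which the abstract only says uses counting queries on the text index to decide.

```python
import xml.etree.ElementTree as ET

def encode(elem, bits, labels, texts):
    """Append the balanced-parentheses encoding of elem's subtree (1 = open, 0 = close)."""
    bits.append(1)
    labels.append(elem.tag)
    if elem.text and elem.text.strip():
        texts.append(elem.text.strip())  # tail text is ignored in this sketch
    for child in elem:
        encode(child, bits, labels, texts)
    bits.append(0)

def choose_strategy(texts, labels, needle, tag):
    """Hypothetical rule: start from whichever side has fewer candidates."""
    text_hits = sum(1 for t in texts if needle in t)  # stands in for a counting query
    tag_hits = labels.count(tag)
    return "bottom-up" if text_hits < tag_hits else "top-down"

doc = ET.fromstring(
    "<lib>"
    "<book><title>XML indexing</title></book>"
    "<book><title>Tree automata</title></book>"
    "<book><title>Text search</title></book>"
    "</lib>"
)
bits, labels, texts = [], [], []
encode(doc, bits, labels, texts)

print("".join(map(str, bits)))
print(labels)
# Roughly corresponds to //title[contains(., 'automata')]
print(choose_strategy(texts, labels, "automata", "title"))
```

With a selective text predicate (one matching text node against three title elements) the sketch starts from the text matches and works upward; with an unselective predicate it falls back to a top-down traversal of the tree.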
Abstract:
The aim was to analyse the growth and compositional development of the receptive and expressive lexicons between the ages of 0;9 and 2;0 in the full-term (FT) and the very-low-birth-weight (VLBW) children who are acquiring Finnish. The associations between the expressive lexicon and grammar at 1;6 and 2;0 in the FT children were also studied. In addition, the language skills of the VLBW children at 2;0 were analysed, as well as the predictive value of the early lexicon for later language performance. Four groups took part in the studies: the longitudinal (N = 35) and cross-sectional (N = 146) samples of the FT children, and the longitudinal (N = 32) and cross-sectional (N = 66) samples of VLBW children. The data were gathered by applying the structured parental rating method (the Finnish version of the Communicative Development Inventory), through analysis of the children's spontaneous speech and by administering a formal test (Reynell Developmental Language Scales). The FT children acquired their receptive lexicons earlier, at a faster rate and with larger individual variation than their expressive lexicons. The acquisition rate of the expressive lexicon increased from slow to faster in most children (91%). Highly parallel developmental paths for lexical semantic categories were detected in the receptive and expressive lexicons of the Finnish children when they were analysed in relation to the growth of the lexicon size, as described in the literature for children acquiring other languages. The emergence of grammar was closely associated with expressive lexical growth. The VLBW children acquired their receptive lexicons at a slower rate and had weaker language skills at 2;0 than the full-term children. The compositional development of both lexicons happened at a slower rate in the VLBW children when compared to the FT controls.
However, when the compositional development was analysed in relation to the growth of lexicon size, this development occurred qualitatively in a nearly parallel manner in the VLBW children as in the FT children. Early receptive and expressive lexicon sizes were significantly associated with later language skills in both groups. The effect of the background variables (gender, length of the mother's basic education, birth weight) on the language development in the FT and the VLBW children differed. The results provide new information on early language acquisition by the Finnish FT and VLBW children. The results support the view that the early acquisition of the semantic lexical categories is related to lexicon growth. The current findings also suggest that early grammatical acquisition is closely related to the growth of expressive vocabulary size. The language development of the VLBW children should be followed in clinical work.
Abstract:
Topic detection and tracking (TDT) is an area of information retrieval research the focus of which revolves around news events. The problems TDT deals with relate to segmenting news text into cohesive stories, detecting something new, previously unreported, tracking the development of a previously reported event, and grouping together news that discuss the same event. The performance of the traditional information retrieval techniques based on full-text similarity has remained inadequate for online production systems. It has been difficult to make the distinction between same and similar events. In this work, we explore ways of representing and comparing news documents in order to detect new events and track their development. First, however, we put forward a conceptual analysis of the notions of topic and event. The purpose is to clarify the terminology and align it with the process of news-making and the tradition of story-telling. Second, we present a framework for document similarity that is based on semantic classes, i.e., groups of words with similar meaning. We adopt people, organizations, and locations as semantic classes in addition to general terms. As each semantic class can be assigned its own similarity measure, document similarity can make use of ontologies, e.g., geographical taxonomies. The documents are compared class-wise, and the outcome is a weighted combination of class-wise similarities. Third, we incorporate temporal information into document similarity. We formalize the natural language temporal expressions occurring in the text, and use them to anchor the rest of the terms onto the time-line. Upon comparing documents for event-based similarity, we look not only at matching terms, but also how near their anchors are on the time-line. Fourth, we experiment with an adaptive variant of the semantic class similarity system. 
The news reflects changes in the real world, and in order to keep up, the system has to change its behavior based on the contents of the news stream. We put forward two strategies for rebuilding the topic representations and report experiment results. We run experiments with three annotated TDT corpora. The use of semantic classes increased the effectiveness of topic tracking by 10-30% depending on the experimental setup. The gain in spotting new events remained lower, around 3-4%. Anchoring the text to a time-line based on the temporal expressions gave a further 10% increase in the effectiveness of topic tracking. The gains in detecting new events, again, remained smaller. The adaptive systems did not improve the tracking results.
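The class-wise comparison of documents can be sketched as a weighted combination of per-class similarities. In this minimal illustration every class uses the same cosine measure over term counts and the weights are invented; the framework described above allows each semantic class its own measure (for example one based on a geographical taxonomy), which is not shown here.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def document_similarity(doc_a, doc_b, weights):
    """Weighted sum of class-wise similarities; a missing class contributes 0."""
    return sum(
        w * cosine(Counter(doc_a.get(cls, [])), Counter(doc_b.get(cls, [])))
        for cls, w in weights.items()
    )

# Hypothetical weights and toy documents, split into semantic classes.
weights = {"people": 0.3, "locations": 0.3, "terms": 0.4}
doc_a = {"people": ["obama"], "locations": ["helsinki"], "terms": ["summit", "talks"]}
doc_b = {"people": ["obama"], "locations": ["oslo"], "terms": ["summit", "prize"]}

print(document_similarity(doc_a, doc_b, weights))
```

The two toy stories share a person and half of their general terms but differ in location, which is the kind of signal that helps tell the same event apart from a merely similar one.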
Abstract:
One way to improve results in information retrieval is query expansion. In query expansion, the user's original query is extended with terms that concern the same topic. Queries that have a high similarity value with a document can be thought to describe the document well and can therefore serve as a source of good expansion terms. If previous queries are stored, terms found with their help can be used as candidates for query expansion terms. The thesis presents and compares three methods for using previous queries in query expansion. To evaluate the effectiveness of the methods, they are compared using the Lucene search engine and a small collection of documents on cancer research. The unmodified queries and a simple pseudo-relevance feedback method that does not use previous queries serve as baselines. None of the query expansion methods performed particularly well, which is because the document collection and the test queries constitute a difficult environment for this type of method.
Abstract:
Current smartphones have a storage capacity of several gigabytes, and more and more information is stored on mobile devices. To meet the challenge of information organization, we turn to desktop search. Users often possess multiple devices and synchronize (subsets of) information between them, which makes file synchronization more important. This thesis presents Dessy, a desktop search and synchronization framework for mobile devices. Dessy uses desktop search techniques, such as indexing, query and index term stemming, and search relevance ranking. Dessy finds files by their content, metadata, and context information. For example, PDF files may be found by their author, subject, title, or text. EXIF data of JPEG files may be used in finding them. User-defined tags can be added to files to organize and retrieve them later. Retrieved files are ranked according to their relevance to the search query. The Dessy prototype uses the BM25 ranking function, which is widely used in information retrieval. Dessy provides an interface for locating files for both users and applications. Dessy is closely integrated with the Syxaw file synchronizer, which provides efficient file and metadata synchronization, optimizing network usage. Dessy supports synchronization of search results, individual files, and directory trees. It allows finding and synchronizing files that reside on remote computers or on the Internet. Dessy is designed to solve the problem of efficient mobile desktop search and synchronization, also supporting remote and Internet search. Remote searches may be carried out offline using a downloaded index, or while connected to the remote machine on a weak network. To secure user data, transmissions between the Dessy client and server are encrypted using symmetric encryption. Symmetric encryption keys are exchanged with RSA key exchange. Dessy emphasizes extensibility, and the cryptography can also be extended. Users may tag their files with context tags and control custom file metadata.
Adding new indexed file types, metadata fields, ranking methods, and index types is easy. Finding files is done with virtual directories, which are views into the user’s files, browseable by regular file managers. On mobile devices, the Dessy GUI provides easy access to the search and synchronization system. This thesis includes results of Dessy synchronization and search experiments, including power usage measurements. Finally, Dessy has been designed with mobility and device constraints in mind. It requires only MIDP 2.0 Mobile Java with FileConnection support, and Java 1.5 on desktop machines.
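For readers unfamiliar with BM25, the ranking function mentioned above can be sketched in a few lines. This is the common Okapi BM25 formulation with a non-negative IDF variant; the parameter values k1 = 1.2 and b = 0.75 are conventional defaults assumed here, and the toy "files" are invented. Nothing in this sketch is taken from the Dessy implementation.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in docs against the query terms."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs          # average document length
    df = Counter(term for d in docs for term in set(d))  # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(d) / avgdl
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

# Toy index: each "file" is a list of terms from its content and metadata.
docs = [
    ["holiday", "photo", "beach"],
    ["tax", "report", "pdf"],
    ["beach", "beach", "trip", "notes"],
]
scores = bm25_scores(["beach"], docs)
ranking = sorted(range(len(docs)), key=lambda i: -scores[i])
print(ranking, scores)
```

Term frequency raises a score with diminishing returns, while the length normalization controlled by b penalizes longer documents, so the third file ranks first here despite being the longest.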
Abstract:
National anniversaries such as independence days demand precise coordination in order to make citizens change their routines to forego work and spend the day at rest or at festivities that provide social focus and spectacle. The complex social construction of national days is taken for granted and operates as a given in the news media, which are the main agents responsible for coordinating these planned disruptions of normal routines. This study examines the language used in the news to construct the rather unnatural idea of national days and to align people in observing them. The data for the study consist of news stories about the Fourth of July in the New York Times, sampled over 150 years and are supplemented by material from other sources and other countries. The study is multidimensional, applying concepts from pragmatics (speech acts, politeness, information structure), systemic functional linguistics (the interpersonal metafunction and the Appraisal framework) and cognitive linguistics (frames, metaphor) as well as journalism and communications to arrive at an interdisciplinary understanding of how resources for meaning are used by writers and readers of the news stories. The analysis shows that on national anniversaries, nations tend to be metaphorized as persons having birthdays, to whom politeness should be shown. The face of the nation is to be respected in the sense of identifying the nation's interests as one's own (positive face) and speaking of citizen responsibilities rather than rights (negative face). Resources are available for both positive and negative evaluations of events and participants and the newspaper deftly changes footings (Goffman 1981) to demonstrate the required politeness while also heteroglossically allowing for a certain amount of disattention and even protest - within limits, for state holidays are almost never construed as Bakhtinian festivals, as they tend to reaffirm the hierarchy rather than invert it. 
Celebrations are evaluated mainly for impressiveness, and for the essentially contested quality of appropriateness, which covers norms of predictability, size, audience response, aesthetics, and explicit reference to the past. Events may also be negatively evaluated as dull ("banal") or inauthentic ("hoopla"). Audiences are evaluated chiefly in terms of their enthusiasm, or production of appropriate displays for emotional response, for national days are supposed to be occasions of flooding-out of nationalistic feeling. By making these evaluations, the newspaper reinforces its powerful position as an independent critic, while at the same time playing an active role in the construction and reproduction of emotional order embodied in "the nation's birthday." As an occasion for mobilization and demonstrations of power, national days may be seen to stand to war in the relation of play to fighting (Bateson 1955). Evidence from the newspaper's coverage of recent conflicts is adduced to support this analysis. In the course of the investigation, methods are developed for analyzing large collections of newspaper content, particularly topical soft news and feature materials that have hitherto been considered less influential and worthy of study than so-called hard news. In his work on evaluation in newspaper stories, White (1998) proposed that the classic hard news story is focused on an event that threatens the social order, but news of holidays and celebrations in general does not fit this pattern, in fact its central event is a reproduction of the social order. Thus in the system of news values (Galtung and Ruge 1965), national holiday news draws on "ground" news values such as continuity and predictability rather than "figure" news values such as negativity and surprise. 
It is argued that this ground helps form a necessary space for hard news to be seen as important, similar to the way in which the information structure of language is seen to rely on the regular alternation of given and new information (Chafe 1994).
Abstract:
The work is based on the assumption that words with similar syntactic usage have similar meaning, which was proposed by Zellig S. Harris (1954, 1968). We study his assumption from two aspects: Firstly, different meanings (word senses) of a word should manifest themselves in different usages (contexts), and secondly, similar usages (contexts) should lead to similar meanings (word senses). If we start with the different meanings of a word, we should be able to find distinct contexts for the meanings in text corpora. We separate the meanings by grouping and labeling contexts in an unsupervised or weakly supervised manner (Publication 1, 2 and 3). We are confronted with the question of how best to represent contexts in order to induce effective classifiers of contexts, because differences in context are the only means we have to separate word senses. If we start with words in similar contexts, we should be able to discover similarities in meaning. We can do this monolingually or multilingually. In the monolingual material, we find synonyms and other related words in an unsupervised way (Publication 4). In the multilingual material, we find translations by supervised learning of transliterations (Publication 5). In both the monolingual and multilingual case, we first discover words with similar contexts, i.e., synonym or translation lists. In the monolingual case we also aim at finding structure in the lists by discovering groups of similar words, e.g., synonym sets. In this introduction to the publications of the thesis, we consider the larger background issues of how meaning arises, how it is quantized into word senses, and how it is modeled. We also consider how to define, collect and represent contexts. We discuss how to evaluate the trained context classifiers and discovered word sense classifications, and finally we present the word sense discovery and disambiguation methods of the publications.
This work supports Harris' hypothesis by implementing three new methods modeled on his hypothesis. The methods have practical consequences for creating thesauruses and translation dictionaries, e.g., for information retrieval and machine translation purposes. Keywords: Word senses, Context, Evaluation, Word sense disambiguation, Word sense discovery.
Abstract:
Developing a textile ontology for the Semantic Web and connecting it to museum cataloging data. The goal of the Semantic Web is to share concept-based information in a versatile way on the Internet. This is achievable using formal data structures called ontologies. The goal of this research is to increase the usability of museum cataloging data in information retrieval. The work is interdisciplinary, involving craft science, terminology science, computer science, and museology. In the first part of the dissertation an ontology of concepts of textiles, garments, and accessories is developed for museum cataloging work. The ontology work was done with the help of thesauri, vocabularies, research reports, and standards. The basis of the ontology development was the Museoalan asiasanasto MASA, a thesaurus for museum cataloging work which has been enriched by other vocabularies. Concepts and terms concerning the research object, as well as the material names of textiles, costumes, and accessories, were focused on. The research method was terminological concept analysis complemented by an ontological view of the Semantic Web. The concept structure was based on the hierarchical generic relation. Attention was also paid to other relations between terms and concepts, and between concepts themselves. Altogether 977 concept classes were created. Issues including how to choose and name concepts for the ontology hierarchy and how deep and broad the hierarchy could be are discussed from the viewpoint of the ontology developer and museum cataloger. The second part of the dissertation analyzes why some of the cataloged terms did not match with the developed textile ontology. This problem is significant because it prevents automatic ontological content integration of the cataloged data on the Semantic Web. The research datasets, i.e. the cataloged museum data on textile collections, came from three museums: Espoo City Museum, Lahti City Museum and The National Museum of Finland.
The data included 1803 textile, costume, and accessory objects. Unmatched object and textile material names were analyzed. In the case of the object names six categories (475 cases), and of the material names eight categories (423 cases), were found where automatic annotation was not possible. The most common explanation was that the cataloged field was filled with a long sentence comprised of many terms. Sometimes in the compound term, the object name and material, or the name and the way of usage, were combined. As well, numeric values in the material name cataloging field prevented annotation and so did the absence of a corresponding concept in the ontology. Ready-made drop-down lists of materials used in one cataloging system facilitated the annotation. In the case of naming objects and materials, one should use terms in basic form without attributes. The developed textile ontology has been applied in two cultural portals, MuseumFinland and Culturesampo, where one can search for and browse information based on cataloged data using integrated ontologies in an interoperable way. The textile ontology is also part of the national FinnONTO ontology infrastructure. Keywords: annotation, concept, concept analysis, cataloging, museum collection, ontology, Semantic Web, textile collection, textile material
Abstract:
A child learns new things, creates social relationships and participates in play with the help of language. How can a child overcome these challenges if the surrounding language is not his mother tongue? The objective of learning a new language in Pre-school education is active bilingualism in all fields of the language. The theoretical context of the research arises from bilingualism, language learning, language skills and their evaluation. The object of the research was to understand the language skills of a child from a different linguistic and cultural background in the final stage of Pre-school education and to clarify how learning Finnish was supported during the Pre-school year. The research issues are addressed through the following questions: 1) What kind of language skills does a child from a different linguistic and cultural background have at the final stage of Pre-school education?, 1.1) What kind of listening comprehension skills?, 1.2) What kind of speech and vocabulary skills?, 1.3) What kind of structural skills?, 2) What kind of individual differences are there in the language skills of children from different linguistic and cultural backgrounds?, and 3) How has a child from a different linguistic and cultural background been supported in learning Finnish during Pre-school education? The view of language skills in this research is holistic even though they are analysed in separate fields. The aim of this research is to form an overall impression of the Finnish skills of the children participating in the research. Eight Pre-school-aged children with different linguistic and cultural backgrounds and their kindergarten teachers participated in this research. The children had taken part in Finnish activities for about three years. The research material consists of the test series (KITA), which evaluates children's language skills, and of a questionnaire to the kindergarten teachers.
The purpose of the questionnaire was to provide additional information on the children's language skills in Pre-school teaching situations and on supporting Finnish in Pre-school education. This research is qualitative, and processing of the material is based on content analysis. According to the kindergarten teachers, the children's social language skills were sufficient to cope in everyday life, but the children needed assistance with longer instructions. The same phenomenon could also be seen in the KITA tests, in which long and abstract instructions turned out to be difficult. Individual differences between the children were perceived in productive skills, which were realised as fluent or non-fluent speech. The children were supported in learning Finnish individually, in small groups and in the activities of the whole group. 'Finnish as a second language' small groups were the most common form of support in learning the language. Support for understanding activities was emphasized in whole-group situations as well as in individual situations while assisting the child's language skills. Generally, the children's language skills were on the same level as developing basic language skills. The data of this research help to understand children's language skills after three years of acquiring Finnish. The results can be utilised in the planning and evaluation of teaching another language.
Abstract:
Information retrieval of concise and consistent text passages is called passage retrieval. Passages can be used in an information retrieval system to improve its user interface and performance. In this thesis passage retrieval is compared to other forms of information retrieval. Implementation of passage retrieval as a feature of an information retrieval system is discussed. Various existing passage retrieval methods, their implementation and their efficiency are compared. I evaluated two different implementations of passage retrieval: direct passage retrieval and combined passage retrieval. In comparison combined passage retrieval turned out to be more efficient.
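As a rough illustration of what retrieving a passage rather than a whole document means, the following sketch slides a fixed-size window over a text and returns the window containing the most query terms. It is a generic baseline with invented parameters (`window`, `step`) and is neither the direct nor the combined passage retrieval evaluated in the thesis, whose details the abstract does not give.

```python
def best_passage(text, query, window=8, step=4):
    """Return (score, passage) for the word window with the most query-term hits.

    Tokenization is a plain whitespace split, so punctuation stays attached
    to words; ties are resolved in favour of the earliest window.
    """
    words = text.lower().split()
    terms = {t.lower() for t in query}
    best_score, best_start = 0, 0
    for start in range(0, len(words), step):
        score = sum(1 for w in words[start:start + window] if w in terms)
        if score > best_score:
            best_score, best_start = score, start
    return best_score, " ".join(words[best_start:best_start + window])

text = (
    "alpha beta gamma delta epsilon zeta eta theta "
    "passage retrieval finds short passage units inside long documents quickly"
)
score, passage = best_passage(text, ["passage", "retrieval"])
print(score, "|", passage)
```

Overlapping windows (a step smaller than the window size) reduce the chance that a relevant stretch of text is split across two passages.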
Abstract:
Lullabies in Kvevlax. Linguistic structures and constructions. The study is a linguistic analysis of constructions that shape the texts used in lullabies in Kvevlax in Ostrobothnia in Finland. The empirical goal is to identify linguistic constructions in traditional lullabies that make use of the dialect of the region. The theoretical goal was to test the usability of Construction Grammar (CxG) in analyses of this type of material, and to further develop the formal description of Construction Grammar in such a way as to make it possible to analyze all kinds of linguistically complex texts. The material that I collected in the 1960s comprises approximately 600 lullabies and concomitant interviews with the singers on the use of lullabies. In 1991 I collected additional material in Kvevlax. The number of informants is close to 250. Supplementary material covering the Swedish-language regions in Finland was compiled from the archives of the Society of Swedish Literature in Finland. The first part of the study is mainly based on traditional grammar and gives general information about the language and the structures used in the lullabies. In the detailed study of the Kvevlax lullabies in the latter part of the study I use a version of Construction Grammar intended for the linguistic analysis of usage-based texts. The analysis focuses on the most salient constructions in the lullabies. The study shows that Construction Grammar as a method has more general applicability than traditional linguistic methods. The study identifies important constructions, including elements typical of this genre, that structure the text in different variants of the same lullabies. In addition, CxG made it possible to study pragmatic aspects of the interactional, cultural and contextual language that is used in communication with small children. The constructions found in lullabies are also used in language in general. 
In addition to being able to give detailed linguistic descriptions of the texts, Construction Grammar can also explain the multidimensionality of language and the variations in the texts. The use of CxG made it possible to show that variations are not random but follow prototypical linguistic patterns, constructions. Constructions are thus found to be linguistic resources with built-in variation potentials.