17 results for Geographical information retrieval

in Helda - Digital Repository of University of Helsinki


Relevance:

90.00% 90.00%

Publisher:

Abstract:

Topic detection and tracking (TDT) is an area of information retrieval research the focus of which revolves around news events. The problems TDT deals with relate to segmenting news text into cohesive stories, detecting something new, previously unreported, tracking the development of a previously reported event, and grouping together news that discuss the same event. The performance of the traditional information retrieval techniques based on full-text similarity has remained inadequate for online production systems. It has been difficult to make the distinction between same and similar events. In this work, we explore ways of representing and comparing news documents in order to detect new events and track their development. First, however, we put forward a conceptual analysis of the notions of topic and event. The purpose is to clarify the terminology and align it with the process of news-making and the tradition of story-telling. Second, we present a framework for document similarity that is based on semantic classes, i.e., groups of words with similar meaning. We adopt people, organizations, and locations as semantic classes in addition to general terms. As each semantic class can be assigned its own similarity measure, document similarity can make use of ontologies, e.g., geographical taxonomies. The documents are compared class-wise, and the outcome is a weighted combination of class-wise similarities. Third, we incorporate temporal information into document similarity. We formalize the natural language temporal expressions occurring in the text, and use them to anchor the rest of the terms onto the time-line. Upon comparing documents for event-based similarity, we look not only at matching terms, but also how near their anchors are on the time-line. Fourth, we experiment with an adaptive variant of the semantic class similarity system. 
The news reflects changes in the real world, and in order to keep up, the system has to change its behavior based on the contents of the news stream. We put forward two strategies for rebuilding the topic representations and report experimental results. We run experiments with three annotated TDT corpora. The use of semantic classes increased the effectiveness of topic tracking by 10-30% depending on the experimental setup. The gain in spotting new events remained lower, around 3-4%. Anchoring the text to a time-line based on the temporal expressions gave a further 10% increase in the effectiveness of topic tracking. The gains in detecting new events, again, remained smaller. The adaptive systems did not improve the tracking results.
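
The class-wise comparison described in this abstract can be illustrated with a minimal sketch. It assumes that each document is already split into bags of words per semantic class, that plain cosine similarity is used for every class (the abstract notes that each class may have its own measure, for example one based on a geographical taxonomy), and that the weights in `CLASS_WEIGHTS` are hypothetical values chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical class weights; the abstract does not state the actual values.
CLASS_WEIGHTS = {
    "terms": 0.4,
    "people": 0.2,
    "organizations": 0.2,
    "locations": 0.2,
}


def cosine(a, b):
    """Cosine similarity between two bags of words (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def document_similarity(doc_a, doc_b, weights=CLASS_WEIGHTS):
    """Weighted combination of class-wise similarities.

    Each document is a dict mapping a semantic class name to a Counter
    of the words that belong to that class.
    """
    total = 0.0
    for cls, weight in weights.items():
        total += weight * cosine(doc_a.get(cls, Counter()),
                                 doc_b.get(cls, Counter()))
    return total
```

With these assumed weights, two stories that share only their location vocabulary score 0.2, whereas two identical stories score 1.0. Swapping `cosine` for a class-specific measure only requires changing the call inside the loop.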

Relevance:

90.00% 90.00%

Publisher:

Abstract:

The usual task in music information retrieval (MIR) is to find occurrences of a monophonic query pattern within a music database, which can contain both monophonic and polyphonic content. The so-called query-by-humming systems are a famous instance of content-based MIR. In such a system, the user's hummed query is converted into symbolic form to perform search operations in a similarly encoded database. The symbolic representation (e.g., textual, MIDI or vector data) is typically a quantized and simplified version of the sampled audio data, yielding faster search algorithms and space requirements that can be met in real-life situations. In this thesis, we investigate geometric approaches to MIR. We first study some musicological properties often needed in MIR algorithms, and then give a literature review on traditional (e.g., string-matching-based) MIR algorithms and novel techniques based on geometry. We also introduce some concepts from digital image processing, namely mathematical morphology, which we use to develop and implement four algorithms for geometric music retrieval. The symbolic representation in the case of our algorithms is a binary 2-D image. We use various morphological pre- and post-processing operations on the query and the database images to perform template matching and pattern recognition on the images. The algorithms are essentially extensions of the classic image correlation and hit-or-miss transformation techniques that are widely used in template matching applications. They aim to be a future extension to the retrieval engine of C-BRAHMS, which is a research project of the Department of Computer Science at the University of Helsinki.
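
The idea of matching a query against a binary 2-D image of the music can be illustrated with a small sketch. This is not one of the four algorithms of the thesis; it is a plain binary erosion, assuming the music is given as a set of (onset time, pitch) pixels and the query pattern is used as the structuring element. The function name `find_occurrences` and the note encoding are assumptions made for this example.

```python
def find_occurrences(database, query):
    """Return every (dt, dp) shift at which all query pixels hit database pixels.

    Both arguments are iterables of (time, pitch) integer pairs. The result is
    the binary erosion of the database image by the query pattern, expressed
    as the list of translations of the query that fit entirely inside it.
    """
    db = set(database)
    pattern = sorted(query)
    if not pattern:
        return []
    t0, p0 = pattern[0]
    matches = []
    # Try to align the first query pixel with every database pixel in turn.
    for t, p in sorted(db):
        dt, dp = t - t0, p - p0
        if all((qt + dt, qp + dp) in db for qt, qp in pattern):
            matches.append((dt, dp))
    return matches
```

Because the shift covers both axes, the sketch finds transposed occurrences as well as occurrences at other time positions. It only handles exact matches; tolerance to errors in a hummed query would need the kind of pre- and post-processing the abstract mentions.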

Relevance:

80.00% 80.00%

Publisher:

Abstract:

The work is based on the assumption that words with similar syntactic usage have similar meaning, which was proposed by Zellig S. Harris (1954, 1968). We study his assumption from two aspects: Firstly, different meanings (word senses) of a word should manifest themselves in different usages (contexts), and secondly, similar usages (contexts) should lead to similar meanings (word senses). If we start with the different meanings of a word, we should be able to find distinct contexts for the meanings in text corpora. We separate the meanings by grouping and labeling contexts in an unsupervised or weakly supervised manner (Publication 1, 2 and 3). We are confronted with the question of how best to represent contexts in order to induce effective classifiers of contexts, because differences in context are the only means we have to separate word senses. If we start with words in similar contexts, we should be able to discover similarities in meaning. We can do this monolingually or multilingually. In the monolingual material, we find synonyms and other related words in an unsupervised way (Publication 4). In the multilingual material, we find translations by supervised learning of transliterations (Publication 5). In both the monolingual and multilingual case, we first discover words with similar contexts, i.e., synonym or translation lists. In the monolingual case we also aim at finding structure in the lists by discovering groups of similar words, e.g., synonym sets. In this introduction to the publications of the thesis, we consider the larger background issues of how meaning arises, how it is quantized into word senses, and how it is modeled. We also consider how to define, collect and represent contexts. We discuss how to evaluate the trained context classifiers and discovered word sense classifications, and finally we present the word sense discovery and disambiguation methods of the publications. 
This work supports Harris' hypothesis by implementing three new methods modeled on his hypothesis. The methods have practical consequences for creating thesauruses and translation dictionaries, e.g., for information retrieval and machine translation purposes. Keywords: Word senses, Context, Evaluation, Word sense disambiguation, Word sense discovery.

Relevance:

80.00% 80.00%

Publisher:

Abstract:

The aim of this study is to examine the relationship of the Roman villa to its environment. The villa was an important feature of the countryside intended both for agricultural production and for leisure. Manuals of Roman agriculture give instructions on how to select a location for an estate. The ideal location was a moderate slope facing east or south in a healthy area and good neighborhood, near good water resources and fertile soils. A road or a navigable river or the sea was needed for transportation of produce. A market for selling the produce, a town or a village, should have been nearby. The research area is the surroundings of the city of Rome, a key area for the development of the villa. The materials used consist of archaeological settlement sites, literary and epigraphical evidence as well as environmental data. The sites include all settlement sites from the 7th century BC to 5th century AD to examine changes in the tradition of site selection. Geographical Information Systems were used to analyze the data. Six aspects of location were examined: geology, soils, water resources, terrain, visibility/viewability and relationship to roads and habitation centers. Geology was important for finding building materials and the large villas from the 2nd century BC onwards are close to sources of building stones. Fertile soils were sought even in the period of the densest settlement. The area is rich in water, both rainfall and groundwater, and finding a water supply was fairly easy. A certain kind of terrain was sought over very long periods: a small spur or ridge shoulder facing preferably south with an open area in front of the site. The most popular villa resorts are located on the slopes visible from almost the entire Roman region. A visible villa served the social and political aspirations of the owner, whereas being in the villa created a sense of privacy. The area has a very dense road network ensuring good connectivity from almost anywhere in the region. 
The best visibility/viewability, dense settlement and most burials by roads coincide, creating a good neighborhood. The locations featuring the most qualities cover nearly a quarter of the area and more than half of the settlement sites are located in them. The ideal location was based on centuries of practical experience and rationalized by the literary tradition.

Relevance:

80.00% 80.00%

Publisher:

Abstract:

DEVELOPING A TEXTILE ONTOLOGY FOR THE SEMANTIC WEB AND CONNECTING IT TO MUSEUM CATALOGING DATA
The goal of the Semantic Web is to share concept-based information in a versatile way on the Internet. This is achievable using formal data structures called ontologies. The goal of this research is to increase the usability of museum cataloging data in information retrieval. The work is interdisciplinary, involving craft science, terminology science, computer science, and museology. In the first part of the dissertation an ontology of concepts of textiles, garments, and accessories is developed for museum cataloging work. The ontology work was done with the help of thesauri, vocabularies, research reports, and standards. The basis of the ontology development was the Museoalan asiasanasto MASA, a thesaurus for museum cataloging work which has been enriched by other vocabularies. Concepts and terms concerning the research object, as well as the material names of textiles, costumes, and accessories, were focused on. The research method was terminological concept analysis complemented by an ontological view of the Semantic Web. The concept structure was based on the hierarchical generic relation. Attention was also paid to other relations between terms and concepts, and between concepts themselves. Altogether 977 concept classes were created. Issues including how to choose and name concepts for the ontology hierarchy and how deep and broad the hierarchy could be are discussed from the viewpoint of the ontology developer and museum cataloger. The second part of the dissertation analyzes why some of the cataloged terms did not match with the developed textile ontology. This problem is significant because it prevents automatic ontological content integration of the cataloged data on the Semantic Web. The research datasets, i.e. the cataloged museum data on textile collections, came from three museums: Espoo City Museum, Lahti City Museum and The National Museum of Finland. 
The data included 1803 textile, costume, and accessory objects. Unmatched object and textile material names were analyzed. In the case of the object names six categories (475 cases), and of the material names eight categories (423 cases), were found where automatic annotation was not possible. The most common explanation was that the cataloged field was filled with a long sentence comprised of many terms. Sometimes in the compound term, the object name and material, or the name and the way of usage, were combined. As well, numeric values in the material name cataloging field prevented annotation and so did the absence of a corresponding concept in the ontology. Ready-made drop-down lists of materials used in one cataloging system facilitated the annotation. In the case of naming objects and materials, one should use terms in basic form without attributes. The developed textile ontology has been applied in two cultural portals, MuseumFinland and Culturesampo, where one can search for and browse information based on cataloged data using integrated ontologies in an interoperable way. The textile ontology is also part of the national FinnONTO ontology infrastructure. Keywords: annotation, concept, concept analysis, cataloging, museum collection, ontology, Semantic Web, textile collection, textile material

Relevance:

80.00% 80.00%

Publisher:

Abstract:

XML documents are becoming more and more common in various environments. In particular, enterprise-scale document management is commonly centred around XML, and desktop applications as well as online document collections are soon to follow. The growing number of XML documents increases the importance of appropriate indexing methods and search tools in keeping the information accessible. Therefore, we focus on content that is stored in XML format as we develop such indexing methods. Because XML is used for different kinds of content ranging all the way from records of data fields to narrative full-texts, the methods for Information Retrieval are facing a new challenge in identifying which content is subject to data queries and which should be indexed for full-text search. In response to this challenge, we analyse the relation of character content and XML tags in XML documents in order to separate the full-text from data. As a result, we are able to both reduce the size of the index by 5-6% and improve the retrieval precision as we select the XML fragments to be indexed. Besides being challenging, XML comes with many unexplored opportunities which have received little attention in the literature. For example, authors often tag the content they want to emphasise by using a typeface that stands out. The tagged content constitutes phrases that are descriptive of the content and useful for full-text search. They are simple to detect in XML documents, but are also easily confused with other inline-level text. Nonetheless, the search results seem to improve when the detected phrases are given additional weight in the index. Similar improvements are reported when related content is associated with the indexed full-text including titles, captions, and references. Experimental results show that for certain types of document collections, at least, the proposed methods help us find the relevant answers. 
Even when we know nothing about the document structure but the XML syntax, we are able to take advantage of the XML structure when the content is indexed for full-text search.
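
The idea of giving inline-tagged phrases additional weight can be illustrated with a short sketch. It is not the method of the thesis: the rule used here to spot inline-level elements (an element whose parent also contains character data of its own) and the `boost` value are assumptions made for this example, and the sketch does not try to tell emphasised phrases apart from other inline-level text. It uses only the XML syntax, not any knowledge of tag names.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def has_mixed_content(elem):
    """True if the element holds character data of its own next to child elements."""
    if elem.text and elem.text.strip():
        return True
    return any(child.tail and child.tail.strip() for child in elem)


def weighted_term_counts(xml_text, boost=2.0):
    """Count lower-cased words, weighting words inside inline elements by `boost`."""
    counts = Counter()

    def add_words(text, weight):
        if text:
            for word in text.lower().split():
                counts[word] += weight

    def visit(elem, weight):
        add_words(elem.text, weight)
        # Children of a mixed-content element are treated as inline-level.
        child_weight = max(weight, boost) if has_mixed_content(elem) else weight
        for child in elem:
            visit(child, child_weight)
            add_words(child.tail, weight)  # text after the child belongs to elem

    visit(ET.fromstring(xml_text), 1.0)
    return counts
```

For the fragment `<p>We build an <em>inverted index</em> for search.</p>`, the words "inverted" and "index" receive the boosted weight, while the surrounding words keep weight 1.0. Tokenisation is a plain whitespace split, so punctuation stays attached to words.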

Relevance:

80.00% 80.00%

Publisher:

Abstract:

One way to improve results in information retrieval is query expansion. In query expansion, the user's original query is extended with terms that relate to the same topic. Queries that have a high similarity score with a document can be assumed to describe the document well and can therefore serve as a source of good expansion terms. If earlier queries have been stored, terms found with their help can be used as candidate query expansion terms. This thesis presents and compares three methods for using earlier queries in query expansion. To evaluate the effectiveness of the methods, they are compared using the Lucene search engine and a small collection of documents on cancer research. The unmodified queries and a simple pseudo-relevance feedback method that does not use earlier queries serve as baselines. None of the query expansion methods performed particularly well, which is explained by the fact that the document collection and the test queries form a difficult environment for this type of method.

Relevance:

80.00% 80.00%

Publisher:

Abstract:

Information retrieval of concise and consistent text passages is called passage retrieval. Passages can be used in an information retrieval system to improve its user interface and performance. In this thesis, passage retrieval is compared to other forms of information retrieval. The implementation of passage retrieval as a feature of an information retrieval system is discussed. Various existing passage retrieval methods, their implementation and their efficiency are compared. I evaluated two different implementations of passage retrieval: direct passage retrieval and combined passage retrieval. In the comparison, combined passage retrieval turned out to be more efficient.

Relevance:

80.00% 80.00%

Publisher:

Abstract:

Current smartphones have a storage capacity of several gigabytes. More and more information is stored on mobile devices. To meet the challenge of information organization, we turn to desktop search. Users often possess multiple devices, and synchronize (subsets of) information between them. This makes file synchronization more important. This thesis presents Dessy, a desktop search and synchronization framework for mobile devices. Dessy uses desktop search techniques, such as indexing, query and index term stemming, and search relevance ranking. Dessy finds files by their content, metadata, and context information. For example, PDF files may be found by their author, subject, title, or text. EXIF data of JPEG files may be used in finding them. User-defined tags can be added to files to organize and retrieve them later. Retrieved files are ranked according to their relevance to the search query. The Dessy prototype uses the BM25 ranking function, which is widely used in information retrieval. Dessy provides an interface for locating files for both users and applications. Dessy is closely integrated with the Syxaw file synchronizer, which provides efficient file and metadata synchronization, optimizing network usage. Dessy supports synchronization of search results, individual files, and directory trees. It allows finding and synchronizing files that reside on remote computers or on the Internet. Dessy is designed to solve the problem of efficient mobile desktop search and synchronization, also supporting remote and Internet search. Remote searches may be carried out offline using a downloaded index, or while connected to the remote machine on a weak network. To secure user data, transmissions between the Dessy client and server are encrypted using symmetric encryption. Symmetric encryption keys are exchanged with RSA key exchange. Dessy emphasizes extensibility. The cryptography can also be extended. Users may tag their files with context tags and control custom file metadata. 
Adding new indexed file types, metadata fields, ranking methods, and index types is easy. Finding files is done with virtual directories, which are views into the user’s files, browseable by regular file managers. On mobile devices, the Dessy GUI provides easy access to the search and synchronization system. This thesis includes results of Dessy synchronization and search experiments, including power usage measurements. Finally, Dessy has been designed with mobility and device constraints in mind. It requires only MIDP 2.0 Mobile Java with FileConnection support, and Java 1.5 on desktop machines.
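
The BM25 ranking function named in this abstract can be sketched as follows. This is the textbook Okapi BM25 formula and not Dessy's implementation: the parameter values `k1=1.2` and `b=0.75` are common defaults, the logarithm with `+ 1` inside is the non-negative IDF variant, and documents are assumed to be pre-tokenised lists of terms.

```python
import math
from collections import Counter


def bm25_scores(query_terms, documents, k1=1.2, b=0.75):
    """Score each document (a list of terms) against the query with Okapi BM25."""
    n_docs = len(documents)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in documents) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            # Length normalisation: long documents need more hits for the same score.
            denom = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores
```

Sorting file identifiers by these scores in descending order gives the ranked result list. In a metadata-aware system the term lists could be built from fields such as author, title, or tags, but how Dessy combines such fields is not stated in the abstract.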

Relevance:

80.00% 80.00%

Publisher:

Abstract:

Topics in Spatial Econometrics — With Applications to House Prices Spatial effects in data occur when geographical closeness of observations influences the relation between the observations. When two points on a map are close to each other, the observed values on a variable at those points tend to be similar. The further away the two points are from each other, the less similar the observed values tend to be. Recent technical developments, geographical information systems (GIS) and global positioning systems (GPS) have brought about a renewed interest in spatial matters. For instance, it is possible to observe the exact location of an observation and combine it with other characteristics. Spatial econometrics integrates spatial aspects into econometric models and analysis. The thesis concentrates mainly on methodological issues, but the findings are illustrated by empirical studies on house price data. The thesis consists of an introductory chapter and four essays. The introductory chapter presents an overview of topics and problems in spatial econometrics. It discusses spatial effects, spatial weights matrices, especially k-nearest neighbours weights matrices, and various spatial econometric models, as well as estimation methods and inference. Further, the problem of omitted variables, a few computational and empirical aspects, the bootstrap procedure and the spatial J-test are presented. In addition, a discussion on hedonic house price models is included. In the first essay a comparison is made between spatial econometrics and time series analysis. By restricting the attention to unilateral spatial autoregressive processes, it is shown that a unilateral spatial autoregression, which enjoys similar properties as an autoregression with time series, can be defined. 
Using an empirical study on house price data, the second essay shows that it is possible to form coordinate-based, spatially autoregressive variables, which are at least to some extent able to replace the spatial structure in a spatial econometric model. In the third essay a strategy for specifying a k-nearest neighbours weights matrix by applying the spatial J-test is suggested, studied and demonstrated. In the fourth and final essay the properties of the asymptotic spatial J-test are further examined. A simulation study shows that the spatial J-test can be used for distinguishing between general spatial models with different k-nearest neighbours weights matrices. A bootstrap spatial J-test is suggested to correct the size of the asymptotic test in small samples.
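
A k-nearest neighbours weights matrix of the kind discussed in this abstract can be built as in the sketch below. It assumes planar coordinates, Euclidean distance, and row-standardised weights (each of the k neighbours gets weight 1/k), which is a common convention but is not stated in the abstract. Ties in distance are broken by observation index.

```python
import math


def knn_weights(coords, k):
    """Row-standardised k-nearest neighbours spatial weights matrix.

    coords is a list of (x, y) pairs. Entry [i][j] is 1/k if observation j
    is among the k nearest neighbours of observation i, and 0 otherwise.
    """
    n = len(coords)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of observations")
    weights = [[0.0] * n for _ in range(n)]
    for i, (xi, yi) in enumerate(coords):
        others = [j for j in range(n) if j != i]
        # sorted() is stable, so equal distances keep index order.
        others.sort(key=lambda j: math.hypot(coords[j][0] - xi,
                                             coords[j][1] - yi))
        for j in others[:k]:
            weights[i][j] = 1.0 / k
    return weights
```

Note that the resulting matrix is generally not symmetric: observation j can be one of the nearest neighbours of i without i being one of the nearest neighbours of j. Choosing between matrices built with different values of k is the specification problem that the abstract addresses with the spatial J-test.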

Relevance:

80.00% 80.00%

Publisher:

Abstract:

The area of Östersundom (29.1 square kilometers) was annexed to Helsinki at the beginning of 2009. Östersundom is formed mostly from the municipality of Sipoo, and partly from the city of Vantaa. Östersundom is still quite rural, but city planning has already started, and there are plans to develop it into a district with 45 000 inhabitants. In this study, the headwaters, streams and small lakes of Östersundom were studied to produce information as a basis for city planning. There are six main streams and five small lakes in Östersundom. The main methodology used in this study was the examination of the physical and chemical quality of the water. The hygienic quality of the water was also studied. It was also examined whether the waters are in their natural state or whether they have been treated and transformed by man. In addition, other factors affecting the waters were examined. Geographical information data was produced as a result of this work. Östersundom is the main area looked at in this study, but some factors are examined at the scale of the catchment areas. Water samples were collected in three sampling periods: 31.8.–4.9.2009, 3.–4.2.2010, and 10.–14.4.2010. There were 20 sampling points in Östersundom (5 in small lakes, 15 in streams). In the winter sampling period, only six samples were collected, of which one was taken from a small lake. Field measurements associated with water sampling included water temperature, oxygen concentration, pH and electrical conductivity. Water samples were analyzed in the Laboratories of Physical Geography at the University of Helsinki for the following properties: total suspended solids (TSS), total dissolved substances (TDS), organic matter, alkalinity, colour, principal anions and cations, and trace elements. Metropolilab analyzed the amount of faecal coliform bacteria in the samples. 
The waters in Östersundom can be divided into three classes according to water quality and other characteristics: the upper course of the streams, the lower course of the streams and the small lakes. The streams in their upper course are in general acidic, and their acid neutralization capacity is low. The proportion of organic matter is high. The concentrations of aluminium and iron also tend to be high. The streams in the lower course have acidity closer to neutral, and the buffering capacity is good. The amounts of TSS and TDS are high, and as a result, the concentrations of many ions and trace elements are high as well. Bacteria were detected at times in the streams of the lower course. Four of the five small lakes in Östersundom are humic and acidic. TSS and TDS concentrations tend to be low, but the proportion of organic matter is often high. There were no bacteria in the small lakes. The fifth small lake (Landbonlampi) differs from the others by its water colour, which is very clear. This lake is very acidic, and its buffering capacity is extremely low. Compared to the headwaters in Finland in general, the concentrations of many ions and trace elements are higher in Östersundom. On the other hand, the characteristics of the water differed between the three classes: upper course streams, lower course streams, and small lakes. Generally, the best water quality was observed in the stream of Gumbölenpuro and in the lakes Storträsk, Genaträsk, Hältingträsk and Landbonlampi. Several valuable waters in their natural state were discovered in the area. The most representative example is the stream of Östersundominpuro in its lower course, where the stream flows through a broad-leaf forest area. The small lakes of Östersundom, and the biggest stream Krapuoja, with its meandering channel, are also valuable in their natural state.

Relevance:

80.00% 80.00%

Publisher: