12 results for similarity retrieval
in Helda - Digital Repository of University of Helsinki
Abstract:
A wide range of models used in agriculture, ecology, carbon cycling, climate and other related studies require information on the amount of leaf material present in a given environment to correctly represent radiation, heat, momentum, water, and various gas exchanges with the overlying atmosphere or the underlying soil. Leaf area index (LAI) thus often features as a critical land surface variable in parameterisations of global and regional climate models, e.g., radiation uptake, precipitation interception, energy conversion, gas exchange and momentum, as all areas are substantially determined by the vegetation surface. Optical wavelengths of remote sensing are the common electromagnetic regions used for LAI estimation and for vegetation studies in general. The main purpose of this dissertation was to enhance the determination of LAI using close-range remote sensing (hemispherical photography), airborne remote sensing (high resolution colour and colour infrared imagery), and satellite remote sensing (high resolution SPOT 5 HRG imagery) optical observations. The commonly used light extinction models were applied at all levels of optical observations. For the sake of comparative analysis, LAI was further determined using statistical relationships between spectral vegetation index (SVI) and ground based LAI. The study areas of this dissertation focus on two regions, one located in Taita Hills, South-East Kenya characterised by tropical cloud forest and exotic plantations, and the other in Gatineau Park, Southern Quebec, Canada dominated by temperate hardwood forest. The procedure for sampling the sky map of gap fraction and gap size from hemispherical photographs proved to be one of the most crucial steps in the accurate determination of LAI. LAI and clumping index estimates were significantly affected by the variation of the size of sky segments for given zenith angle ranges.
On sloping ground, gap fraction and size distributions present strong upslope/downslope asymmetry of foliage elements, and thus the correction and the sensitivity analysis for both LAI and clumping index computations were demonstrated. Several SVIs can be used for LAI mapping using empirical regression analysis provided that the sensitivities of SVIs at varying ranges of LAI are large enough. Large scale LAI inversion algorithms were demonstrated and proved to be a considerably efficient alternative approach for LAI mapping. LAI can be estimated nonparametrically from the information contained solely in the remotely sensed dataset given that the upper-end (saturated SVI) value is accurately determined. However, further study is still required to devise a methodology as well as instrumentation to retrieve on-ground green leaf area index. Subsequently, the large scale LAI inversion algorithms presented in this work can be precisely validated. Finally, based on literature review and this dissertation, potential future research prospects and directions were recommended.
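The abstract above mentions light extinction models without stating one. As a minimal sketch, assuming the widely used Beer-Lambert form P(θ) = exp(−G · Ω · LAI / cos θ), gap fraction can be inverted for LAI as follows. The function names, the default projection coefficient G = 0.5 (a spherical leaf angle distribution) and the way the clumping index Ω enters are assumptions made for illustration, not the dissertation's own procedure.

```python
import math


def effective_lai(gap_fraction, zenith_deg, g=0.5):
    """Invert P(theta) = exp(-G * LAI_e / cos(theta)) for effective LAI.

    gap_fraction: measured canopy gap fraction P(theta), in (0, 1].
    zenith_deg:   view zenith angle theta in degrees.
    g:            leaf projection coefficient G (0.5 is an assumed default).
    """
    if not 0.0 < gap_fraction <= 1.0:
        raise ValueError("gap fraction must be in (0, 1]")
    theta = math.radians(zenith_deg)
    return -math.cos(theta) * math.log(gap_fraction) / g


def clumping_corrected_lai(gap_fraction, zenith_deg, g=0.5, clumping=1.0):
    """Divide effective LAI by the clumping index (1.0 means random foliage)."""
    if clumping <= 0.0:
        raise ValueError("clumping index must be positive")
    return effective_lai(gap_fraction, zenith_deg, g) / clumping


if __name__ == "__main__":
    # Hypothetical reading: 20 % gap fraction at a 57.5 degree zenith angle.
    print(effective_lai(0.20, 57.5))
    print(clumping_corrected_lai(0.20, 57.5, clumping=0.8))
```

A clumping index below 1 raises the estimate, which is consistent with the abstract's remark that both quantities are sensitive to how the sky map is sampled.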
Abstract:
Topic detection and tracking (TDT) is an area of information retrieval research whose focus is news events. The problems TDT deals with relate to segmenting news text into cohesive stories, detecting something new, previously unreported, tracking the development of a previously reported event, and grouping together news stories that discuss the same event. The performance of the traditional information retrieval techniques based on full-text similarity has remained inadequate for online production systems. It has been difficult to make the distinction between same and similar events. In this work, we explore ways of representing and comparing news documents in order to detect new events and track their development. First, however, we put forward a conceptual analysis of the notions of topic and event. The purpose is to clarify the terminology and align it with the process of news-making and the tradition of story-telling. Second, we present a framework for document similarity that is based on semantic classes, i.e., groups of words with similar meaning. We adopt people, organizations, and locations as semantic classes in addition to general terms. As each semantic class can be assigned its own similarity measure, document similarity can make use of ontologies, e.g., geographical taxonomies. The documents are compared class-wise, and the outcome is a weighted combination of class-wise similarities. Third, we incorporate temporal information into document similarity. We formalize the natural language temporal expressions occurring in the text, and use them to anchor the rest of the terms onto the time-line. Upon comparing documents for event-based similarity, we look not only at matching terms, but also at how near their anchors are on the time-line. Fourth, we experiment with an adaptive variant of the semantic class similarity system.
The news reflects changes in the real world, and in order to keep up, the system has to change its behavior based on the contents of the news stream. We put forward two strategies for rebuilding the topic representations and report experiment results. We run experiments with three annotated TDT corpora. The use of semantic classes increased the effectiveness of topic tracking by 10-30% depending on the experimental setup. The gain in spotting new events remained lower, around 3-4%. Anchoring the text to a time-line based on the temporal expressions gave a further 10% increase in the effectiveness of topic tracking. The gains in detecting new events, again, remained smaller. The adaptive systems did not improve the tracking results.
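The class-wise comparison described in this abstract can be sketched roughly as below. This is only an illustration: it uses plain cosine similarity for every semantic class, whereas the abstract says each class may have its own measure (for example one based on a geographical taxonomy). The class names follow the abstract; the function names and the weights are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bags of words given as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty class contributes nothing
    return dot / (norm_a * norm_b)


def classwise_similarity(doc_a, doc_b, weights):
    """Weighted combination of per-class similarities, normalised to [0, 1].

    doc_a, doc_b: dicts mapping a semantic class name to a list of tokens.
    weights:      dict mapping a class name to a non-negative weight.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    score = 0.0
    for cls, weight in weights.items():
        bag_a = Counter(doc_a.get(cls, []))
        bag_b = Counter(doc_b.get(cls, []))
        score += weight * cosine(bag_a, bag_b)
    return score / total_weight


if __name__ == "__main__":
    # Hypothetical documents and weights, for illustration only.
    story_1 = {"terms": ["earthquake", "rescue"], "locations": ["Lima"]}
    story_2 = {"terms": ["earthquake", "rescue"], "locations": ["Quito"]}
    weights = {"terms": 1.0, "people": 1.0, "organizations": 1.0, "locations": 2.0}
    print(classwise_similarity(story_1, story_2, weights))
```

Giving the location class a larger weight is one way such a scheme could separate same events from merely similar ones, since two stories with the same vocabulary but different places then score lower.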
Abstract:
The usual task in music information retrieval (MIR) is to find occurrences of a monophonic query pattern within a music database, which can contain both monophonic and polyphonic content. The so-called query-by-humming systems are a famous instance of content-based MIR. In such a system, the user's hummed query is converted into symbolic form to perform search operations in a similarly encoded database. The symbolic representation (e.g., textual, MIDI or vector data) is typically a quantized and simplified version of the sampled audio data, yielding faster search algorithms and space requirements that can be met in real-life situations. In this thesis, we investigate geometric approaches to MIR. We first study some musicological properties often needed in MIR algorithms, and then give a literature review on traditional (e.g., string-matching-based) MIR algorithms and novel techniques based on geometry. We also introduce some concepts from digital image processing, namely mathematical morphology, which we will use to develop and implement four algorithms for geometric music retrieval. The symbolic representation in the case of our algorithms is a binary 2-D image. We use various morphological pre- and post-processing operations on the query and the database images to perform template matching / pattern recognition for the images. The algorithms are basically extensions to classic image correlation and hit-or-miss transformation techniques used widely in template matching applications. They aim to be a future extension to the retrieval engine of C-BRAHMS, which is a research project of the Department of Computer Science at the University of Helsinki.
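The abstract does not describe the four algorithms, so the following is only a sketch of the underlying idea: binary erosion, the "hit" half of a hit-or-miss transform, applied to a piano-roll style representation. Notes are treated as (onset, pitch) foreground pixels, and a match is any translation in time and pitch that places every query note on a database note. The function name and the toy data are hypothetical.

```python
def find_occurrences(database, query):
    """Return every (time_shift, pitch_shift) that maps the whole query
    onto notes of the database (binary erosion on point sets).

    database, query: iterables of (onset, pitch) integer pairs.
    """
    db = set(database)
    pattern = sorted(set(query))
    if not pattern:
        return []
    anchor_t, anchor_p = pattern[0]
    hits = []
    # The first query note must land on some database note, so only those
    # translations need to be tested.
    for t, p in sorted(db):
        dt, dp = t - anchor_t, p - anchor_p
        if all((qt + dt, qp + dp) in db for qt, qp in pattern):
            hits.append((dt, dp))
    return hits


if __name__ == "__main__":
    # Toy polyphonic excerpt: the motif C-D-E occurs at onsets 0 and 3.
    database = [(0, 60), (1, 62), (2, 64), (2, 67), (3, 60), (4, 62), (5, 64)]
    query = [(0, 0), (1, 2), (2, 4)]  # rising whole steps, relative pitches
    print(find_occurrences(database, query))  # [(0, 60), (3, 60)]
```

Because extra database notes never block a match, the search works on polyphonic content. A full hit-or-miss transform would also require certain background pixels to be empty, which this sketch omits.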
Abstract:
A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees. However, a suffix tree occupies O(N log N) bits, which very soon inhibits in-memory analyses. Recent advances in full-text self-indexing reduce the space of the suffix tree to O(N log σ) bits, where σ is the alphabet size. In practice, the space reduction is more than 10-fold, for example on the suffix tree of the human genome. However, this reduction factor remains constant when more sequences are added to the collection. We develop a new family of self-indexes suited for the repetitive sequence collection setting. Their expected space requirement depends only on the length n of the base sequence and the number s of variations in its repeated copies. That is, the space reduction factor is no longer constant, but depends on N / n. We believe the structures developed in this work will provide a fundamental basis for storage and retrieval of individual genomes as they become available due to rapid progress in the sequencing technologies.
Abstract:
A straightforward computation of the list of the words (the 'tail words' of the list) that are distributionally most similar to a given word (the 'head word' of the list) leads to the question: how semantically similar to the head word are the tail words; that is, how similar are their meanings to its meaning? And can we do better? The experiment was done on the nearly 18,000 most frequent nouns in a Finnish newsgroup corpus. These nouns are considered to be distributionally similar to the extent that they occur in the same direct dependency relations with the same nouns, adjectives and verbs. The extent of the similarity of their computational representations is quantified with the information radius. The semantic classification of head-tail pairs is intuitive; some tail words seem to be semantically similar to the head word, some do not. Each such pair is also associated with a number of further distributional variables. Individually, their overlap for the semantic classes is large, but the trained classification-tree models have some success in using combinations to predict the semantic class. The training data consists of a random sample of 400 head-tail pairs with the tail word ranked among the 20 distributionally most similar to the head word, excluding names. The models are then tested on a random sample of another 100 such pairs. The best success rates range from 70% to 92% of the test pairs, where a success means that the model predicted my intuitive semantic class of the pair. This seems somewhat promising when distributional similarity is used to capture semantically similar words. This analysis also includes a general discussion of several different similarity formulas, arranged in three groups: those that apply to sets with graded membership, those that apply to the members of a vector space, and those that apply to probability mass functions.
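For reference, the information radius named in this abstract can be computed as in the sketch below. It assumes the common definition IRad(p, q) = D(p ‖ m) + D(q ‖ m) with m = (p + q) / 2, which is twice the Jensen-Shannon divergence; the abstract does not say which normalisation the thesis uses, and the example distributions are invented.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.

    Terms with p_i == 0 contribute nothing; q_i must be positive wherever
    p_i is positive.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def information_radius(p, q):
    """IRad(p, q) = D(p || m) + D(q || m), where m is the average of p and q.

    The result is symmetric and lies between 0 (identical distributions)
    and 2 bits (distributions with disjoint support).
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return kl_divergence(p, m) + kl_divergence(q, m)


if __name__ == "__main__":
    # Hypothetical distributions of two nouns over three dependency contexts.
    head = [0.7, 0.2, 0.1]
    tail = [0.6, 0.3, 0.1]
    print(information_radius(head, tail))
```

Averaging the two distributions before taking the divergence keeps the measure finite even when one word never occurs in a context where the other does, which plain Kullback-Leibler divergence does not guarantee.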
Abstract:
Based on the Aristotelian criterion referred to as 'abductio', Peirce suggests a method of hypothetical inference, which operates in a different way than the deductive and inductive methods. “Abduction is nothing but guessing” (Peirce, 7.219). This principle is of extreme value for the study of our understanding of mathematical self-similarity in both of its typical presentations: relative or absolute. For the first case, abduction incarnates the quantitative/qualitative relationships of a self-similar object or process; for the second case, abduction makes understandable the statistical treatment of self-similarity, 'guessing' the continuity of geometric features to infinity through the use of a systematic stereotype (for instance, the assumption that the general shape of the Sierpiński triangle continues identically into its particular shapes). The metaphor coined by Peirce, of an exact map containing itself the same exact map (a map of itself), is not only the most important precedent of Mandelbrot’s problem of measuring the boundaries of a continuous irregular surface with a logarithmic ruler, but is also still a useful abstraction for the conceptualisation of relative and absolute self-similarity, and its mechanisms of implementation. It is useful, also, for explaining some of the most basic geometric ontologies as mental constructions: in the notion of infinite convergence of points in the corners of a triangle, or the intuition for defining two parallel straight lines as two lines in a plane that 'never' intersect.
Abstract:
Self-similarity, a concept taken from mathematics, is gradually becoming a keyword in musicology. Although a polysemic term, self-similarity often refers to the multi-scalar feature repetition in a set of relationships, and it is commonly valued as an indication of musical coherence and consistency. This investigation provides a theory of musical meaning formation in the context of intersemiosis, that is, the translation of meaning from one cognitive domain to another cognitive domain (e.g. from mathematics to music, or to speech or graphic forms). From this perspective, the degree of coherence of a musical system relies on a synecdochic intersemiosis: a system of related signs within other comparable and correlated systems. This research analyzes the modalities of such correlations, exploring their general and particular traits, and their operational bounds. Looking forward in this direction, the notion of analogy is used as a rich concept through its two definitions quoted in the Classical literature: proportion and paradigm, enormously valuable in establishing measurement, likeness and affinity criteria. Using quantitative/qualitative methods, evidence is presented to justify a parallel study of different modalities of musical self-similarity. For this purpose, original arguments by Benoît B. Mandelbrot are revised, alongside a systematic critique of the literature on the subject. Furthermore, connecting Charles S. Peirce's synechism with Mandelbrot's fractality is one of the main developments of the present study. This study provides elements for explaining Bolognesi's (1983) conjecture, which states that the most primitive, intuitive and basic musical device is self-reference, extending its functions and operations to self-similar surfaces.
In this sense, this research suggests that, with various modalities of self-similarity, synecdochic intersemiosis acts as a system of systems in coordination with greater or lesser development of structural consistency, and with a greater or lesser contextual dependence.