169 resultados para Compressed text search
em Helda - Digital Repository of University of Helsinki
Resumo:
XML documents are becoming more and more common in various environments. In particular, enterprise-scale document management is commonly centred around XML, and desktop applications as well as online document collections are soon to follow. The growing number of XML documents increases the importance of appropriate indexing methods and search tools in keeping the information accessible. Therefore, we focus on content that is stored in XML format as we develop such indexing methods. Because XML is used for different kinds of content ranging all the way from records of data fields to narrative full-texts, the methods for Information Retrieval are facing a new challenge in identifying which content is subject to data queries and which should be indexed for full-text search. In response to this challenge, we analyse the relation of character content and XML tags in XML documents in order to separate the full-text from data. As a result, we are able to both reduce the size of the index by 5-6\% and improve the retrieval precision as we select the XML fragments to be indexed. Besides being challenging, XML comes with many unexplored opportunities which are not paid much attention in the literature. For example, authors often tag the content they want to emphasise by using a typeface that stands out. The tagged content constitutes phrases that are descriptive of the content and useful for full-text search. They are simple to detect in XML documents, but also possible to confuse with other inline-level text. Nonetheless, the search results seem to improve when the detected phrases are given additional weight in the index. Similar improvements are reported when related content is associated with the indexed full-text including titles, captions, and references. Experimental results show that for certain types of document collections, at least, the proposed methods help us find the relevant answers. Even when we know nothing about the document structure but the XML syntax, we are able to take advantage of the XML structure when the content is indexed for full-text search.
Resumo:
A large fraction of an XML document typically consists of text data. The XPath query language allows text search via the equal, contains, and starts-with predicates. Such predicates can be efficiently implemented using a compressed self-index of the document's text nodes. Most queries, however, contain some parts querying the text of the document, plus some parts querying the tree structure. It is therefore a challenge to choose an appropriate evaluation order for a given query, which optimally leverages the execution speeds of the text and tree indexes. Here the SXSI system is introduced. It stores the tree structure of an XML document using a bit array of opening and closing brackets plus a sequence of labels, and stores the text nodes of the document using a global compressed self-index. On top of these indexes sits an XPath query engine that is based on tree automata. The engine uses fast counting queries of the text index in order to dynamically determine whether to evaluate top-down or bottom-up with respect to the tree structure. The resulting system has several advantages over existing systems: (1) on pure tree queries (without text search) such as the XPathMark queries, the SXSI system performs on par or better than the fastest known systems MonetDB and Qizx, (2) on queries that use text search, SXSI outperforms the existing systems by 1-3 orders of magnitude (depending on the size of the result set), and (3) with respect to memory consumption, SXSI outperforms all other systems for counting-only queries.
Resumo:
In visual search one tries to find the currently relevant item among other, irrelevant items. In the present study, visual search performance for complex objects (characters, faces, computer icons and words) was investigated, and the contribution of different stimulus properties, such as luminance contrast between characters and background, set size, stimulus size, colour contrast, spatial frequency, and stimulus layout were investigated. Subjects were required to search for a target object among distracter objects in two-dimensional stimulus arrays. The outcome measure was threshold search time, that is, the presentation duration of the stimulus array required by the subject to find the target with a certain probability. It reflects the time used for visual processing separated from the time used for decision making and manual reactions. The duration of stimulus presentation was controlled by an adaptive staircase method. The number and duration of eye fixations, saccade amplitude, and perceptual span, i.e., the number of items that can be processed during a single fixation, were measured. It was found that search performance was correlated with the number of fixations needed to find the target. Search time and the number of fixations increased with increasing stimulus set size. On the other hand, several complex objects could be processed during a single fixation, i.e., within the perceptual span. Search time and the number of fixations depended on object type as well as luminance contrast. The size of the perceptual span was smaller for more complex objects, and decreased with decreasing luminance contrast within object type, especially for very low contrasts. In addition, the size and shape of perceptual span explained the changes in search performance for different stimulus layouts in word search. Perceptual span was scale invariant for a 16-fold range of stimulus sizes, i.e., the number of items processed during a single fixation was independent of retinal stimulus size or viewing distance. It is suggested that saccadic visual search consists of both serial (eye movements) and parallel (processing within perceptual span) components, and that the size of the perceptual span may explain the effectiveness of saccadic search in different stimulus conditions. Further, low-level visual factors, such as the anatomical structure of the retina, peripheral stimulus visibility and resolution requirements for the identification of different object types are proposed to constrain the size of the perceptual span, and thus, limit visual search performance. Similar methods were used in a clinical study to characterise the visual search performance and eye movements of neurological patients with chronic solvent-induced encephalopathy (CSE). In addition, the data about the effects of different stimulus properties on visual search in normal subjects were presented as simple practical guidelines, so that the limits of human visual perception could be taken into account in the design of user interfaces.
Resumo:
Human growth and attained height are determined by a combination of genetic and environmental effects and in modern Western societies > 80% of the observed variation in height is determined by genetic factors. Height is a fundamental human trait that is associated with many socioeconomic and psychosocial factors and health measures, however little is known of the identity of the specific genes that influence height variation in the general population. This thesis work aimed to identify the genetic variants that influence height in the general population by genome-wide linkage analysis utilizing large family samples. The study focused on analysis of three separate sets of families consisting of: 1) 1,417 individuals from 277 Finnish families (FinnHeight), 2) 8,450 individuals from 3,817 families from Australia and Europe (EUHeight) and 3) 9,306 individuals from 3,302 families from the United States (USHeight). The most significant finding in this study was found in the Finnish family sample where we a locus in the chromosomal region 1p21 was linked to adult height. Several regions showed evidence for linkage in the Australian, European and US families with 8q21 and 15q25 being the most significant. The region on 1p21 was followed up with further studies and we were able to show that the collagen 11-alpha-1 gene (COL11A1) residing at this location was associated with adult height. This association was also confirmed in an independent Finnish population cohort (Health 2000) consisting of 6,542 individuals. From this population sample, we estimated that homozygous males and females for this gene variant were 1.1 and 0.6 cm taller than the respective controls. In this thesis work we identified a gene variant in the COL11A1 gene that influences human height, although this variant alone explains only 0.1% of height variation in the Finnish population. We also demonstrated in this study that special stratification strategies such as performing sex-limited analyses, focusing on dizygous twin pairs, analyzing ethnic groups within a population separately and utilizing homogenous populations such as the Finns can improve the statistical power of finding QTL significantly. Also, we concluded from the results of this study that even though genetic effects explain a great proportion of height variance, it is likely that there are tens or even hundreds of genes with small individual effects underlying the genetic architecture of height.
Resumo:
Schizophrenia is a severe mental disorder affecting 0.4-1% of the population worldwide. It is characterized by impairments in the perception of reality and by significant social or occupational dysfunction. The disorder is one of the major contributors to the global burden of diseases. Studies of twins, families, and adopted children point to strong genetic components for schizophrenia, but environmental factors also play a role in the pathogenesis of disease. Molecular genetic studies have identified several potential positional candidate genes. The strongest evidence for putative schizophrenia susceptibility loci relates to the genes encoding dysbindin (DTNBP1) and neuregulin (NRG1), but studies lack impressive consistency in the precise genetic regions and alleles implicated. We have studied the role of three potential candidate genes by genotyping 28 single nucleotide polymorphisms in the DNTBP1, NRG1, and AKT1 genes in a large schizophrenia family sample consisting of 441 families with 865 affected individuals from Finland. Our results do not support a major role for these genes in the pathogenesis of schizophrenia in Finland. We have previously identified a region on chromosome 5q21-34 as a susceptibility locus for schizophrenia in a Finnish family sample. Recently, two studies reported association between the γ-aminobutyric acid type A receptor cluster of genes in this region and one study showed suggestive evidence for association with another regional gene encoding clathrin interactor 1 (CLINT1, also called Epsin 4 and ENTH). To further address the significance of these genes under the linkage peak in the Finnish families, we genotyped SNPs of these genes, and observed statistically significant association of variants between GABRG2 and schizophrenia. Furthermore, these variants also seem to affect the functioning of the working memory. Fetal events and obstetric complications are associated with schizophrenia. Rh incompatibility has been implicated as a risk factor for schizophrenia in several epidemiological studies. We conducted a family-based candidate-gene study that assessed the role of maternal-fetal genotype incompatibility at the RhD locus in schizophrenia. There was significant evidence for an RhD maternal-fetal genotype incompatibility, and the risk ratio was estimated at 2.3. This is the first candidate-gene study to explicitly test for and provide evidence of a maternal-fetal genotype incompatibility mechanism in schizophrenia. In conclusion, in this thesis we found evidence that one GABA receptor subunit, GABRG2, is significantly associated with schizophrenia. Furthermore, it also seems to affect to the functioning of the working memory. In addition, an RhD maternal-fetal genotype incompatibility increases the risk of schizophrenia by two-fold.
Resumo:
Buffer zones are vegetated strip-edges of agricultural fields along watercourses. As linear habitats in agricultural ecosystems, buffer strips dominate and play a leading ecological role in many areas. This thesis focuses on the plant species diversity of the buffer zones in a Finnish agricultural landscape. The main objective of the present study is to identify the determinants of floral species diversity in arable buffer zones from local to regional levels. This study was conducted in a watershed area of a farmland landscape of southern Finland. The study area, Lepsämänjoki, is situated in the Nurmijärvi commune 30 km to the north of Helsinki, Finland. The biotope mosaics were mapped in GIS. A total of 59 buffer zones were surveyed, of which 29 buffer strips surveyed were also sampled by plot. Firstly, two diversity components (species richness and evenness) were investigated to determine whether the relationship between the two is equal and predictable. I found no correlation between species richness and evenness. The relationship between richness and evenness is unpredictable in a small-scale human-shaped ecosystem. Ordination and correlation analyses show that richness and evenness may result from different ecological processes, and thus should be considered separately. Species richness correlated negatively with phosphorus content, and species evenness correlated negatively with the ratio of organic carbon to total nitrogen in soil. The lack of a consistent pattern in the relationship between these two components may be due to site-specific variation in resource utilization by plant species. Within-habitat configuration (width, length, and area) were investigated to determine which is more effective for predicting species richness. More species per unit area increment could be obtained from widening the buffer strip than from lengthening it. The width of the strips is an effective determinant of plant species richness. The increase in species diversity with an increase in the width of buffer strips may be due to cross-sectional habitat gradients within the linear patches. This result can serve as a reference for policy makers, and has application value in agricultural management. In the framework of metacommunity theory, I found that both mass effect(connectivity) and species sorting (resource heterogeneity) were likely to explain species composition and diversity on a local and regional scale. The local and regional processes were interactively dominated by the degree to which dispersal perturbs local communities. In the lowly and intermediately connected regions, species sorting was of primary importance to explain species diversity, while the mass effect surpassed species sorting in the highly connected region. Increasing connectivity in communities containing high habitat heterogeneity can lead to the homogenization of local communities, and consequently, to lower regional diversity, while local species richness was unrelated to the habitat connectivity. Of all species found, Anthriscus sylvestris, Phalaris arundinacea, and Phleum pretense significantly responded to connectivity, and showed high abundance in the highly connected region. We suggest that these species may play a role in switching the force from local resources to regional connectivity shaping the community structure. On the landscape context level, the different responses of local species richness and evenness to landscape context were investigated. Seven landscape structural parameters served to indicate landscape context on five scales. On all scales but the smallest scales, the Shannon-Wiener diversity of land covers (H') correlated positively with the local richness. The factor (H') showed the highest correlation coefficients in species richness on the second largest scale. The edge density of arable field was the only predictor that correlated with species evenness on all scales, which showed the highest predictive power on the second smallest scale. The different predictive power of the factors on different scales showed a scaledependent relationship between the landscape context and local plant species diversity, and indicated that different ecological processes determine species richness and evenness. The local richness of species depends on a regional process on large scales, which may relate to the regional species pool, while species evenness depends on a fine- or coarse-grained farming system, which may relate to the patch quality of the habitats of field edges near the buffer strips. My results suggested some guidelines of species diversity conservation in the agricultural ecosystem. To maintain a high level of species diversity in the strips, a high level of phosphorus in strip soil should be avoided. Widening the strips is the most effective mean to improve species richness. Habitat connectivity is not always favorable to species diversity because increasing connectivity in communities containing high habitat heterogeneity can lead to the homogenization of local communities (beta diversity) and, consequently, to lower regional diversity. Overall, a synthesis of local and regional factors emerged as the model that best explain variations in plant species diversity. The studies also suggest that the effects of determinants on species diversity have a complex relationship with scale.
Resumo:
Segmentation is a data mining technique yielding simplified representations of sequences of ordered points. A sequence is divided into some number of homogeneous blocks, and all points within a segment are described by a single value. The focus in this thesis is on piecewise-constant segments, where the most likely description for each segment and the most likely segmentation into some number of blocks can be computed efficiently. Representing sequences as segmentations is useful in, e.g., storage and indexing tasks in sequence databases, and segmentation can be used as a tool in learning about the structure of a given sequence. The discussion in this thesis begins with basic questions related to segmentation analysis, such as choosing the number of segments, and evaluating the obtained segmentations. Standard model selection techniques are shown to perform well for the sequence segmentation task. Segmentation evaluation is proposed with respect to a known segmentation structure. Applying segmentation on certain features of a sequence is shown to yield segmentations that are significantly close to the known underlying structure. Two extensions to the basic segmentation framework are introduced: unimodal segmentation and basis segmentation. The former is concerned with segmentations where the segment descriptions first increase and then decrease, and the latter with the interplay between different dimensions and segments in the sequence. These problems are formally defined and algorithms for solving them are provided and analyzed. Practical applications for segmentation techniques include time series and data stream analysis, text analysis, and biological sequence analysis. In this thesis segmentation applications are demonstrated in analyzing genomic sequences.
Resumo:
Analyzing statistical dependencies is a fundamental problem in all empirical science. Dependencies help us understand causes and effects, create new scientific theories, and invent cures to problems. Nowadays, large amounts of data is available, but efficient computational tools for analyzing the data are missing. In this research, we develop efficient algorithms for a commonly occurring search problem - searching for the statistically most significant dependency rules in binary data. We consider dependency rules of the form X->A or X->not A, where X is a set of positive-valued attributes and A is a single attribute. Such rules describe which factors either increase or decrease the probability of the consequent A. A classical example are genetic and environmental factors, which can either cause or prevent a disease. The emphasis in this research is that the discovered dependencies should be genuine - i.e. they should also hold in future data. This is an important distinction from the traditional association rules, which - in spite of their name and a similar appearance to dependency rules - do not necessarily represent statistical dependencies at all or represent only spurious connections, which occur by chance. Therefore, the principal objective is to search for the rules with statistical significance measures. Another important objective is to search for only non-redundant rules, which express the real causes of dependence, without any occasional extra factors. The extra factors do not add any new information on the dependence, but can only blur it and make it less accurate in future data. The problem is computationally very demanding, because the number of all possible rules increases exponentially with the number of attributes. In addition, neither the statistical dependency nor the statistical significance are monotonic properties, which means that the traditional pruning techniques do not work. As a solution, we first derive the mathematical basis for pruning the search space with any well-behaving statistical significance measures. The mathematical theory is complemented by a new algorithmic invention, which enables an efficient search without any heuristic restrictions. The resulting algorithm can be used to search for both positive and negative dependencies with any commonly used statistical measures, like Fisher's exact test, the chi-squared measure, mutual information, and z scores. According to our experiments, the algorithm is well-scalable, especially with Fisher's exact test. It can easily handle even the densest data sets with 10000-20000 attributes. Still, the results are globally optimal, which is a remarkable improvement over the existing solutions. In practice, this means that the user does not have to worry whether the dependencies hold in future data or if the data still contains better, but undiscovered dependencies.
Resumo:
Current smartphones have a storage capacity of several gigabytes. More and more information is stored on mobile devices. To meet the challenge of information organization, we turn to desktop search. Users often possess multiple devices, and synchronize (subsets of) information between them. This makes file synchronization more important. This thesis presents Dessy, a desktop search and synchronization framework for mobile devices. Dessy uses desktop search techniques, such as indexing, query and index term stemming, and search relevance ranking. Dessy finds files by their content, metadata, and context information. For example, PDF files may be found by their author, subject, title, or text. EXIF data of JPEG files may be used in finding them. User–defined tags can be added to files to organize and retrieve them later. Retrieved files are ranked according to their relevance to the search query. The Dessy prototype uses the BM25 ranking function, used widely in information retrieval. Dessy provides an interface for locating files for both users and applications. Dessy is closely integrated with the Syxaw file synchronizer, which provides efficient file and metadata synchronization, optimizing network usage. Dessy supports synchronization of search results, individual files, and directory trees. It allows finding and synchronizing files that reside on remote computers, or the Internet. Dessy is designed to solve the problem of efficient mobile desktop search and synchronization, also supporting remote and Internet search. Remote searches may be carried out offline using a downloaded index, or while connected to the remote machine on a weak network. To secure user data, transmissions between the Dessy client and server are encrypted using symmetric encryption. Symmetric encryption keys are exchanged with RSA key exchange. Dessy emphasizes extensibility. Also the cryptography can be extended. Users may tag their files with context tags and control custom file metadata. Adding new indexed file types, metadata fields, ranking methods, and index types is easy. Finding files is done with virtual directories, which are views into the user’s files, browseable by regular file managers. On mobile devices, the Dessy GUI provides easy access to the search and synchronization system. This thesis includes results of Dessy synchronization and search experiments, including power usage measurements. Finally, Dessy has been designed with mobility and device constraints in mind. It requires only MIDP 2.0 Mobile Java with FileConnection support, and Java 1.5 on desktop machines.