111 resultados para iterated local search


Relevância:

20.00% 20.00%

Publicador:

Resumo:

Segmentation is a data mining technique yielding simplified representations of sequences of ordered points. A sequence is divided into some number of homogeneous blocks, and all points within a segment are described by a single value. The focus in this thesis is on piecewise-constant segments, where the most likely description for each segment and the most likely segmentation into some number of blocks can be computed efficiently. Representing sequences as segmentations is useful in, e.g., storage and indexing tasks in sequence databases, and segmentation can be used as a tool in learning about the structure of a given sequence. The discussion in this thesis begins with basic questions related to segmentation analysis, such as choosing the number of segments, and evaluating the obtained segmentations. Standard model selection techniques are shown to perform well for the sequence segmentation task. Segmentation evaluation is proposed with respect to a known segmentation structure. Applying segmentation on certain features of a sequence is shown to yield segmentations that are significantly close to the known underlying structure. Two extensions to the basic segmentation framework are introduced: unimodal segmentation and basis segmentation. The former is concerned with segmentations where the segment descriptions first increase and then decrease, and the latter with the interplay between different dimensions and segments in the sequence. These problems are formally defined and algorithms for solving them are provided and analyzed. Practical applications for segmentation techniques include time series and data stream analysis, text analysis, and biological sequence analysis. In this thesis segmentation applications are demonstrated in analyzing genomic sequences.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

XML documents are becoming more and more common in various environments. In particular, enterprise-scale document management is commonly centred around XML, and desktop applications as well as online document collections are soon to follow. The growing number of XML documents increases the importance of appropriate indexing methods and search tools in keeping the information accessible. Therefore, we focus on content that is stored in XML format as we develop such indexing methods. Because XML is used for different kinds of content ranging all the way from records of data fields to narrative full-texts, the methods for Information Retrieval are facing a new challenge in identifying which content is subject to data queries and which should be indexed for full-text search. In response to this challenge, we analyse the relation of character content and XML tags in XML documents in order to separate the full-text from data. As a result, we are able to both reduce the size of the index by 5-6\% and improve the retrieval precision as we select the XML fragments to be indexed. Besides being challenging, XML comes with many unexplored opportunities which are not paid much attention in the literature. For example, authors often tag the content they want to emphasise by using a typeface that stands out. The tagged content constitutes phrases that are descriptive of the content and useful for full-text search. They are simple to detect in XML documents, but also possible to confuse with other inline-level text. Nonetheless, the search results seem to improve when the detected phrases are given additional weight in the index. Similar improvements are reported when related content is associated with the indexed full-text including titles, captions, and references. Experimental results show that for certain types of document collections, at least, the proposed methods help us find the relevant answers. Even when we know nothing about the document structure but the XML syntax, we are able to take advantage of the XML structure when the content is indexed for full-text search.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This thesis which consists of an introduction and four peer-reviewed original publications studies the problems of haplotype inference (haplotyping) and local alignment significance. The problems studied here belong to the broad area of bioinformatics and computational biology. The presented solutions are computationally fast and accurate, which makes them practical in high-throughput sequence data analysis. Haplotype inference is a computational problem where the goal is to estimate haplotypes from a sample of genotypes as accurately as possible. This problem is important as the direct measurement of haplotypes is difficult, whereas the genotypes are easier to quantify. Haplotypes are the key-players when studying for example the genetic causes of diseases. In this thesis, three methods are presented for the haplotype inference problem referred to as HaploParser, HIT, and BACH. HaploParser is based on a combinatorial mosaic model and hierarchical parsing that together mimic recombinations and point-mutations in a biologically plausible way. In this mosaic model, the current population is assumed to be evolved from a small founder population. Thus, the haplotypes of the current population are recombinations of the (implicit) founder haplotypes with some point--mutations. HIT (Haplotype Inference Technique) uses a hidden Markov model for haplotypes and efficient algorithms are presented to learn this model from genotype data. The model structure of HIT is analogous to the mosaic model of HaploParser with founder haplotypes. Therefore, it can be seen as a probabilistic model of recombinations and point-mutations. BACH (Bayesian Context-based Haplotyping) utilizes a context tree weighting algorithm to efficiently sum over all variable-length Markov chains to evaluate the posterior probability of a haplotype configuration. Algorithms are presented that find haplotype configurations with high posterior probability. BACH is the most accurate method presented in this thesis and has comparable performance to the best available software for haplotype inference. Local alignment significance is a computational problem where one is interested in whether the local similarities in two sequences are due to the fact that the sequences are related or just by chance. Similarity of sequences is measured by their best local alignment score and from that, a p-value is computed. This p-value is the probability of picking two sequences from the null model that have as good or better best local alignment score. Local alignment significance is used routinely for example in homology searches. In this thesis, a general framework is sketched that allows one to compute a tight upper bound for the p-value of a local pairwise alignment score. Unlike the previous methods, the presented framework is not affeced by so-called edge-effects and can handle gaps (deletions and insertions) without troublesome sampling and curve fitting.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Analyzing statistical dependencies is a fundamental problem in all empirical science. Dependencies help us understand causes and effects, create new scientific theories, and invent cures to problems. Nowadays, large amounts of data is available, but efficient computational tools for analyzing the data are missing. In this research, we develop efficient algorithms for a commonly occurring search problem - searching for the statistically most significant dependency rules in binary data. We consider dependency rules of the form X->A or X->not A, where X is a set of positive-valued attributes and A is a single attribute. Such rules describe which factors either increase or decrease the probability of the consequent A. A classical example are genetic and environmental factors, which can either cause or prevent a disease. The emphasis in this research is that the discovered dependencies should be genuine - i.e. they should also hold in future data. This is an important distinction from the traditional association rules, which - in spite of their name and a similar appearance to dependency rules - do not necessarily represent statistical dependencies at all or represent only spurious connections, which occur by chance. Therefore, the principal objective is to search for the rules with statistical significance measures. Another important objective is to search for only non-redundant rules, which express the real causes of dependence, without any occasional extra factors. The extra factors do not add any new information on the dependence, but can only blur it and make it less accurate in future data. The problem is computationally very demanding, because the number of all possible rules increases exponentially with the number of attributes. In addition, neither the statistical dependency nor the statistical significance are monotonic properties, which means that the traditional pruning techniques do not work. As a solution, we first derive the mathematical basis for pruning the search space with any well-behaving statistical significance measures. The mathematical theory is complemented by a new algorithmic invention, which enables an efficient search without any heuristic restrictions. The resulting algorithm can be used to search for both positive and negative dependencies with any commonly used statistical measures, like Fisher's exact test, the chi-squared measure, mutual information, and z scores. According to our experiments, the algorithm is well-scalable, especially with Fisher's exact test. It can easily handle even the densest data sets with 10000-20000 attributes. Still, the results are globally optimal, which is a remarkable improvement over the existing solutions. In practice, this means that the user does not have to worry whether the dependencies hold in future data or if the data still contains better, but undiscovered dependencies.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This thesis studies optimisation problems related to modern large-scale distributed systems, such as wireless sensor networks and wireless ad-hoc networks. The concrete tasks that we use as motivating examples are the following: (i) maximising the lifetime of a battery-powered wireless sensor network, (ii) maximising the capacity of a wireless communication network, and (iii) minimising the number of sensors in a surveillance application. A sensor node consumes energy both when it is transmitting or forwarding data, and when it is performing measurements. Hence task (i), lifetime maximisation, can be approached from two different perspectives. First, we can seek for optimal data flows that make the most out of the energy resources available in the network; such optimisation problems are examples of so-called max-min linear programs. Second, we can conserve energy by putting redundant sensors into sleep mode; we arrive at the sleep scheduling problem, in which the objective is to find an optimal schedule that determines when each sensor node is asleep and when it is awake. In a wireless network simultaneous radio transmissions may interfere with each other. Task (ii), capacity maximisation, therefore gives rise to another scheduling problem, the activity scheduling problem, in which the objective is to find a minimum-length conflict-free schedule that satisfies the data transmission requirements of all wireless communication links. Task (iii), minimising the number of sensors, is related to the classical graph problem of finding a minimum dominating set. However, if we are not only interested in detecting an intruder but also locating the intruder, it is not sufficient to solve the dominating set problem; formulations such as minimum-size identifying codes and locating dominating codes are more appropriate. This thesis presents approximation algorithms for each of these optimisation problems, i.e., for max-min linear programs, sleep scheduling, activity scheduling, identifying codes, and locating dominating codes. Two complementary approaches are taken. The main focus is on local algorithms, which are constant-time distributed algorithms. The contributions include local approximation algorithms for max-min linear programs, sleep scheduling, and activity scheduling. In the case of max-min linear programs, tight upper and lower bounds are proved for the best possible approximation ratio that can be achieved by any local algorithm. The second approach is the study of centralised polynomial-time algorithms in local graphs these are geometric graphs whose structure exhibits spatial locality. Among other contributions, it is shown that while identifying codes and locating dominating codes are hard to approximate in general graphs, they admit a polynomial-time approximation scheme in local graphs.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Current smartphones have a storage capacity of several gigabytes. More and more information is stored on mobile devices. To meet the challenge of information organization, we turn to desktop search. Users often possess multiple devices, and synchronize (subsets of) information between them. This makes file synchronization more important. This thesis presents Dessy, a desktop search and synchronization framework for mobile devices. Dessy uses desktop search techniques, such as indexing, query and index term stemming, and search relevance ranking. Dessy finds files by their content, metadata, and context information. For example, PDF files may be found by their author, subject, title, or text. EXIF data of JPEG files may be used in finding them. User–defined tags can be added to files to organize and retrieve them later. Retrieved files are ranked according to their relevance to the search query. The Dessy prototype uses the BM25 ranking function, used widely in information retrieval. Dessy provides an interface for locating files for both users and applications. Dessy is closely integrated with the Syxaw file synchronizer, which provides efficient file and metadata synchronization, optimizing network usage. Dessy supports synchronization of search results, individual files, and directory trees. It allows finding and synchronizing files that reside on remote computers, or the Internet. Dessy is designed to solve the problem of efficient mobile desktop search and synchronization, also supporting remote and Internet search. Remote searches may be carried out offline using a downloaded index, or while connected to the remote machine on a weak network. To secure user data, transmissions between the Dessy client and server are encrypted using symmetric encryption. Symmetric encryption keys are exchanged with RSA key exchange. Dessy emphasizes extensibility. Also the cryptography can be extended. Users may tag their files with context tags and control custom file metadata. Adding new indexed file types, metadata fields, ranking methods, and index types is easy. Finding files is done with virtual directories, which are views into the user’s files, browseable by regular file managers. On mobile devices, the Dessy GUI provides easy access to the search and synchronization system. This thesis includes results of Dessy synchronization and search experiments, including power usage measurements. Finally, Dessy has been designed with mobility and device constraints in mind. It requires only MIDP 2.0 Mobile Java with FileConnection support, and Java 1.5 on desktop machines.

Relevância:

20.00% 20.00%

Publicador:

Relevância:

20.00% 20.00%

Publicador:

Resumo:

By detecting leading protons produced in the Central Exclusive Diffractive process, p+p → p+X+p, one can measure the missing mass, and scan for possible new particle states such as the Higgs boson. This process augments - in a model independent way - the standard methods for new particle searches at the Large Hadron Collider (LHC) and will allow detailed analyses of the produced central system, such as the spin-parity properties of the Higgs boson. The exclusive central diffractive process makes possible precision studies of gluons at the LHC and complements the physics scenarios foreseen at the next e+e− linear collider. This thesis first presents the conclusions of the first systematic analysis of the expected precision measurement of the leading proton momentum and the accuracy of the reconstructed missing mass. In this initial analysis, the scattered protons are tracked along the LHC beam line and the uncertainties expected in beam transport and detection of the scattered leading protons are accounted for. The main focus of the thesis is in developing the necessary radiation hard precision detector technology for coping with the extremely demanding experimental environment of the LHC. This will be achieved by using a 3D silicon detector design, which in addition to the radiation hardness of up to 5×10^15 neutrons/cm2, offers properties such as a high signal-to- noise ratio, fast signal response to radiation and sensitivity close to the very edge of the detector. This work reports on the development of a novel semi-3D detector design that simplifies the 3D fabrication process, but conserves the necessary properties of the 3D detector design required in the LHC and in other imaging applications.