823 resultados para Edit Distance


Relevância:

100.00% 100.00%

Publicador:

Resumo:

Background: Identifying local similarity between two or more sequences, or identifying repeats occurring at least twice in a sequence, is an essential part in the analysis of biological sequences and of their phylogenetic relationship. Finding such fragments while allowing for a certain number of insertions, deletions, and substitutions, is however known to be a computationally expensive task, and consequently exact methods can usually not be applied in practice. Results: The filter TUIUIU that we introduce in this paper provides a possible solution to this problem. It can be used as a preprocessing step to any multiple alignment or repeats inference method, eliminating a possibly large fraction of the input that is guaranteed not to contain any approximate repeat. It consists in the verification of several strong necessary conditions that can be checked in a fast way. We implemented three versions of the filter. The first is simply a straightforward extension to the case of multiple sequences of an application of conditions already existing in the literature. The second uses a stronger condition which, as our results show, enable to filter sensibly more with negligible (if any) additional time. The third version uses an additional condition and pushes the sensibility of the filter even further with a non negligible additional time in many circumstances; our experiments show that it is particularly useful with large error rates. The latter version was applied as a preprocessing of a multiple alignment tool, obtaining an overall time (filter plus alignment) on average 63 and at best 530 times smaller than before (direct alignment), with in most cases a better quality alignment. Conclusion: To the best of our knowledge, TUIUIU is the first filter designed for multiple repeats and for dealing with error rates greater than 10% of the repeats length.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Finding an adequate paraphrase representation formalism is a challenging issue in Natural Language Processing. In this paper, we analyse the performance of Tree Edit Distance as a paraphrase representation baseline. Our experiments using Edit Distance Textual Entailment Suite show that, as Tree Edit Distance consists of a purely syntactic approach, paraphrase alternations not based on structural reorganizations do not find an adequate representation. They also show that there is much scope for better modelling of the way trees are aligned.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

The design of a large and reliable DNA codeword library is a key problem in DNA based computing. DNA codes, namely sets of fixed length edit metric codewords over the alphabet {A, C, G, T}, satisfy certain combinatorial constraints with respect to biological and chemical restrictions of DNA strands. The primary constraints that we consider are the reverse--complement constraint and the fixed GC--content constraint, as well as the basic edit distance constraint between codewords. We focus on exploring the theory underlying DNA codes and discuss several approaches to searching for optimal DNA codes. We use Conway's lexicode algorithm and an exhaustive search algorithm to produce provably optimal DNA codes for codes with small parameter values. And a genetic algorithm is proposed to search for some sub--optimal DNA codes with relatively large parameter values, where we can consider their sizes as reasonable lower bounds of DNA codes. Furthermore, we provide tables of bounds on sizes of DNA codes with length from 1 to 9 and minimum distance from 1 to 9.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

We introduce a problem called maximum common characters in blocks (MCCB), which arises in applications of approximate string comparison, particularly in the unification of possibly erroneous textual data coming from different sources. We show that this problem is NP-complete, but can nevertheless be solved satisfactorily using integer linear programming for instances of practical interest. Two integer linear formulations are proposed and compared in terms of their linear relaxations. We also compare the results of the approximate matching with other known measures such as the Levenshtein (edit) distance. (C) 2008 Elsevier B.V. All rights reserved.