801 results for TREE EDIT DISTANCE


Relevance:

80.00%

Publisher:

Abstract:

Plasmodium falciparum is the parasite responsible for the most acute form of malaria in humans. Recently, the serine repeat antigen (SERA) in P. falciparum has attracted attention as a potential vaccine and drug target, and it has been shown to be a member of a large gene family. To clarify the relationships among the numerous P. falciparum SERAs and to identify orthologs to SERA5 and SERA6 in Plasmodium species affecting rodents, gene trees were inferred from nucleotide and amino acid sequence data for 33 putative SERA homologs in seven different species. (A distance method for nucleotide sequences that is specifically designed to accommodate differing GC content yielded results that were largely compatible with the amino acid tree. Standard-distance and maximum-likelihood methods for nucleotide sequences, on the other hand, yielded gene trees that differed in important respects.) To infer the pattern of duplication, speciation, and gene loss events in the SERA gene family history, the resulting gene trees were then "reconciled" with two competing Plasmodium species tree topologies that have been identified by previous phylogenetic studies. Parsimony of reconciliation was used as a criterion for selecting a gene tree/species tree pair and provided (1) support for one of the two species trees and for the core topology of the amino acid-derived gene tree, (2) a basis for critiquing fine detail in a poorly resolved region of the gene tree, (3) a set of predicted "missing genes" in some species, (4) clarification of the relationship among the P. falciparum SERAs, and (5) some information about SERA5 and SERA6 orthologs in the rodent malaria parasites. Parsimony of reconciliation and a second criterion (the implied mutational pattern at two key active sites in the SERA proteins) were also seen to be useful supplements to standard "bootstrap" analysis for inferred topologies.

Relevance:

80.00%

Publisher:

Abstract:

We introduce a problem called maximum common characters in blocks (MCCB), which arises in applications of approximate string comparison, particularly in the unification of possibly erroneous textual data coming from different sources. We show that this problem is NP-complete, but can nevertheless be solved satisfactorily using integer linear programming for instances of practical interest. Two integer linear formulations are proposed and compared in terms of their linear relaxations. We also compare the results of the approximate matching with other known measures such as the Levenshtein (edit) distance. (C) 2008 Elsevier B.V. All rights reserved.
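For reference, the Levenshtein (edit) distance that the abstract names as a comparison measure can be sketched as below. This is the textbook dynamic program, not the MCCB formulation or the integer linear programs of the paper; the function name `levenshtein` is chosen here for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # prev[j] is the distance between the already-processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # Turning a[:i] into the empty string takes i deletions.
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca by cb (or match)
            ))
        prev = curr
    return prev[-1]
```

Keeping only two rows of the table holds memory at O(len(b)) while the running time stays O(len(a) * len(b)).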

Relevance:

80.00%

Publisher:

Abstract:

Information Retrieval systems normally have to work with rather heterogeneous sources, such as Web sites or documents from Optical Character Recognition tools. The correct conversion of these sources into flat text files is not a trivial task, since noise may easily be introduced as a result of spelling or typesetting errors. Interestingly, this is not a great drawback when the corpus is sufficiently large, since redundancy helps to overcome noise problems. However, noise becomes a serious problem in restricted-domain Information Retrieval, especially when the corpus is small and has little or no redundancy. This paper devises an approach which adds noise tolerance to Information Retrieval systems. A set of experiments carried out in the agricultural domain demonstrates the effectiveness of the approach presented.

Relevance:

80.00%

Publisher:

Abstract:

The aims of the project were twofold: 1) to investigate classification procedures for remotely sensed digital data, in order to develop modifications to existing algorithms and propose novel classification procedures; and 2) to investigate and develop algorithms for contextual enhancement of classified imagery in order to increase classification accuracy. The following classifiers were examined: box, decision tree, minimum distance, maximum likelihood. In addition to these, the following algorithms were developed during the course of the research: deviant distance, look-up table and an automated decision tree classifier using expert systems technology. Clustering techniques for unsupervised classification were also investigated. Contextual enhancements investigated were: mode filters, small area replacement and Wharton's CONAN algorithm. Additionally, methods for noise and edge based declassification and contextual reclassification, non-probabilistic relaxation and relaxation based on Markov chain theory were developed. The advantages of per-field classifiers and Geographical Information Systems were investigated. The conclusions presented suggest suitable combinations of classifier and contextual enhancement, given user accuracy requirements and time constraints. These were then tested for validity using a different data set. A brief examination of the utility of the recommended contextual algorithms for reducing the effects of data noise was also carried out.

Relevance:

80.00%

Publisher:

Abstract:

We analyze an approach to a similarity preserving coding of symbol sequences based on neural distributed representations and show that it can be viewed as a metric embedding process.

Relevance:

80.00%

Publisher:

Abstract:

This thesis addressed the problem of risk analysis in mental healthcare, with respect to the GRiST project at Aston University. That project provides a risk-screening tool based on the knowledge of 46 experts, captured as mind maps that describe relationships between risks and patterns of behavioural cues. Mind mapping, though, fails to impose control over content, and is not considered to formally represent knowledge. In contrast, this thesis treated GRiST's mind maps as a rich knowledge base in need of refinement; that process drew on existing techniques for designing databases and knowledge bases. Identifying well-defined mind map concepts, though, was hindered by spelling mistakes, and by ambiguity and lack of coverage in the tools used for researching words. A novel use of the Edit Distance overcame those problems, by assessing similarities between mind map texts, and between spelling mistakes and suggested corrections. That algorithm further identified stems, the shortest text string found in related word-forms. As opposed to existing approaches’ reliance on built-in linguistic knowledge, this thesis devised a novel, more flexible text-based technique. An additional tool, Correspondence Analysis, found patterns in word usage that allowed machines to determine likely intended meanings for ambiguous words. Correspondence Analysis further produced clusters of related concepts, which in turn drove the automatic generation of novel mind maps. Such maps underpinned adjuncts to the mind mapping software used by GRiST; one such new facility generated novel mind maps, to reflect the collected expert knowledge on any specified concept. Mind maps from GRiST are stored as XML, which suggested storing them in an XML database. In fact, the entire approach here is "XML-centric", in that all stages rely on XML as far as possible. An XML-based query language allows users to retrieve information from the mind map knowledge base.
The approach, it was concluded, will prove valuable to mind mapping in general, and to detecting patterns in any type of digital information.

Relevance:

80.00%

Publisher:

Abstract:

Bangla OCR (Optical Character Recognition) is long-awaited software for the Bengali community all over the world. Numerous efforts suggest that, due to the inherently complex nature of the Bangla alphabet and its word formation process, development of a high-fidelity OCR producing reasonably acceptable output still remains a challenge. One possible way of improvement is post-processing of the OCR's output; algorithms such as Edit Distance and the use of n-gram statistical information have been used to rectify misspelled words in language processing. This work presents the first known approach to use these algorithms to replace misrecognized words produced by Bangla OCR. The assessment is made on a set of fifty documents written in Bangla script and uses a dictionary of 541,167 words. The proposed correction model can correct several words, lowering the recognition error rate by 2.87% and 3.18% for the character-based n-gram and edit distance algorithms respectively. The developed system suggests a list of five alternatives for a misspelled word. In 33.82% of cases, the correct word is the topmost suggestion in the five-word list for the n-gram algorithm, while with the edit distance algorithm the first suggestion is the correct word in 36.31% of cases. This work should prompt further ideas for improving character recognition.
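The post-processing idea described above (ranking dictionary words against a misrecognized word and returning a short suggestion list) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' system: the function names, the padding character `#`, the use of character bigrams with a Dice-style overlap score, the alphabetical tie-breaking and the toy English word list are all hypothetical choices made here.

```python
from collections import Counter

def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def char_ngrams(word, n=2):
    """Multiset of character n-grams of the word, padded with '#' (assumed marker)."""
    padded = f"#{word}#"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def ngram_similarity(a, b, n=2):
    """Dice-style overlap of character n-grams, in [0, 1]."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    overlap = sum((ga & gb).values())
    total = sum(ga.values()) + sum(gb.values())
    return 2 * overlap / total if total else 0.0

def suggest(word, dictionary, k=5, method="edit"):
    """Return the k dictionary words closest to `word`, ties broken alphabetically."""
    if method == "edit":
        key = lambda w: (edit_distance(word, w), w)
    else:
        key = lambda w: (-ngram_similarity(word, w), w)
    return sorted(dictionary, key=key)[:k]

# Toy dictionary for illustration only; the paper uses a Bangla word list.
words = ["tree", "three", "free", "trie", "true", "treat", "street"]
print(suggest("tre", words, k=5, method="edit"))
print(suggest("tre", words, k=5, method="ngram"))
```

Scanning the whole dictionary for every word is linear in the dictionary size; for a list of several hundred thousand entries a real system would probably pre-filter candidates (for example by length or shared n-grams) before scoring.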

Relevance:

40.00%

Publisher:

Abstract:

Approximately 7.2% of the Atlantic rainforest remains in Brazil, with only 16% of this forest remaining in the State of Rio de Janeiro, all of it distributed in fragments. This forest fragmentation can produce biotic and abiotic differences between edges and the fragment interior. In this study, we compared the structure and richness of tree communities in three habitats - an anthropogenic edge (AE), a natural edge (NE) and the fragment interior (FI) - of a fragment of Atlantic forest in the State of Rio de Janeiro, Brazil (22°50'S and 42°28'W). One thousand and seventy-six trees with a diameter at breast height > 4.8 cm, belonging to 132 morphospecies and 39 families, were sampled in a total study area of 0.75 ha. NE had the greatest basal area and the trees in this habitat had the greatest diameter:height allometric coefficient, whereas AE had a lower richness and greater variation in the height of the first tree branch. Tree density, diameter, height and the proportion of standing dead trees did not differ among the habitats. There was marked heterogeneity among replicates within each habitat. These results indicate that the forest interior and the fragment edges (natural or anthropogenic) do not differ markedly considering the studied parameters. Other factors, such as the age from the edge, type of matrix and proximity of gaps, may play a more important role in plant community structure than the proximity from edges.

Relevance:

40.00%

Publisher:

Abstract:

The importance of dispersal for the maintenance of biodiversity, while long-recognized, has remained unresolved. We used molecular markers to measure effective dispersal in a natural population of the vertebrate-dispersed Neotropical tree, Simarouba amara (Simaroubaceae) by comparing the distances between maternal parents and their offspring and comparing gene movement via seed and pollen in the 50 ha plot of the Barro Colorado Island forest, Central Panama. In all cases (parent-pair, mother-offspring, father-offspring, sib-sib) distances between related pairs were significantly greater than distances to nearest possible neighbours within each category. Long-distance seedling establishment was frequent: 74% of assigned seedlings established > 100 m from the maternal parent [mean = 392 +/- 234.6 m (SD), range = 9.3-1000.5 m] and pollen-mediated gene flow was comparable to that of seed [mean = 345.0 +/- 157.7 m (SD), range 57.6-739.7 m]. For S. amara we found approximately a 10-fold difference between distances estimated by inverse modelling and mean seedling recruitment distances (39 m vs. 392 m). Our findings have important implications for future studies in forest demography and regeneration, with most seedlings establishing at distances far exceeding those demonstrated by negative density-dependent effects.

Relevance:

40.00%

Publisher:

Abstract:

We have developed an alignment-free method that calculates phylogenetic distances using a maximum-likelihood approach for a model of sequence change on patterns that are discovered in unaligned sequences. To evaluate the phylogenetic accuracy of our method, and to conduct a comprehensive comparison of existing alignment-free methods (freely available as Python package decaf+py at http://www.bioinformatics.org.au), we have created a data set of reference trees covering a wide range of phylogenetic distances. Amino acid sequences were evolved along the trees and input to the tested methods; from their calculated distances we inferred trees whose topologies we compared to the reference trees. We find our pattern-based method statistically superior to all other tested alignment-free methods. We also demonstrate the general advantage of alignment-free methods over an approach based on automated alignments when sequences violate the assumption of collinearity. Similarly, we compare methods on empirical data from an existing alignment benchmark set that we used to derive reference distances and trees. Our pattern-based approach yields distances that show a linear relationship to reference distances over a substantially longer range than other alignment-free methods. The pattern-based approach outperforms the other alignment-free methods, and its phylogenetic accuracy is statistically indistinguishable from that of alignment-based distances.

Relevance:

30.00%

Publisher:

Abstract:

Ficus arpazusa Casaretto is a fig tree native to the Atlantic Rain Forest sensu lato. High levels of genetic diversity and no inbreeding were observed in Ficus arpazusa. This genetic pattern is due to the action of its pollinator, Pegoscapus sp., which disperses pollen an estimated distance of 5.6 km, and of Ficus arpazusa's mating system which, in the study area, is allogamous. This study highlights the importance of adding both ecological and genetic data into population studies, allowing a better understanding of evolutionary processes and in turn increasing the efficacy of forest management and revegetation projects, as well as species conservation.

Relevance:

30.00%

Publisher:

Abstract:

Darwin's paradigm holds that the diversity of present-day organisms has arisen via a process of genetic descent with modification, as on a bifurcating tree. Evidence is accumulating that genes are sometimes transferred not along lineages but rather across lineages. To the extent that this is so, Darwin's paradigm can apply only imperfectly to genomes, potentially complicating or perhaps undermining attempts to reconstruct historical relationships among genomes (i.e., a genome tree). Whether most genes in a genome have arisen via treelike (vertical) descent or by lateral transfer across lineages can be tested if enough complete genome sequences are used. We define a phylogenetically discordant sequence (PDS) as an open reading frame (ORF) that exhibits patterns of similarity relationships statistically distinguishable from those of most other ORFs in the same genome. PDSs represent between 6.0 and 16.8% (mean, 10.8%) of the analyzable ORFs in the genomes of 28 bacteria, eight archaea, and one eukaryote (Saccharomyces cerevisiae). In this study we developed and assessed a distance-based approach, based on mean pairwise sequence similarity, for generating genome trees. Exclusion of PDSs improved bootstrap support for basal nodes but altered few topological features, indicating that there is little systematic bias among PDSs. Many but not all features of the genome tree from which PDSs were excluded are consistent with the 16S rRNA tree.

Relevance:

30.00%

Publisher:

Abstract:

A multilocus mixed-mating model was used to evaluate the mating system of a population of Couratari multiflora, an emergent tree species found in low densities (1 individual/10 ha) in lowland forests of central Amazonia. We surveyed and observed phenologically 41 trees in an area of 400 ha. Of these, only four mother trees were analyzed here, because few trees set fruit and the fruits also suffered high predation. No difference was observed between the population multilocus outcrossing rate (t mp = 0.953 ± 0.040) and the average single locus rate (t sp = 0.968 ± 0.132). The four mother trees were highly outcrossed (t m ~ 1). Two out of five loci showed departures from Hardy-Weinberg Equilibrium (HWE) expectations, and the same results occurred with the mixed-mating model. Despite the low number of trees analyzed, the proportion of loci in HWE suggests random mating in the population. However, the pollen pool was heterogeneous among families, probably due to both the small sample size and the flowering of trees at different times of the flowering season. Reproductive phenology of the population and the results presented here suggest, at least for part of the population, long-distance pollen movement of around 1,000 m.

Relevance:

30.00%

Publisher:

Abstract:

Altitudinal tree lines are mainly constrained by temperature, but can also be influenced by factors such as human activity, particularly in the European Alps, where centuries of agricultural use have affected the tree-line. Over the last decades this trend has been reversed due to changing agricultural practices and land-abandonment. We aimed to combine a statistical land-abandonment model with a forest dynamics model, to take into account the combined effects of climate and human land-use on the Alpine tree-line in Switzerland. Land-abandonment probability was expressed by a logistic regression function of degree-day sum, distance from forest edge, soil stoniness, slope, proportion of employees in the secondary and tertiary sectors, proportion of commuters and proportion of full-time farms. This was implemented in the TreeMig spatio-temporal forest model. Distance from forest edge and degree-day sum vary through feed-back from the dynamics part of TreeMig and climate change scenarios, while the other variables remain constant for each grid cell over time. The new model, TreeMig-LAb, was tested on theoretical landscapes, where the variables in the land-abandonment model were varied one by one. This confirmed the strong influence of distance from forest and slope on the abandonment probability. Degree-day sum has a more complex role, with opposite influences on land-abandonment and forest growth. TreeMig-LAb was also applied to a case study area in the Upper Engadine (Swiss Alps), along with a model where abandonment probability was a constant. Two scenarios were used: natural succession only (100% probability) and a probability of abandonment based on past transition proportions in that area (2.1% per decade). The former showed new forest growing in all but the highest-altitude locations. The latter was more realistic as to numbers of newly forested cells, but their location was random and the resulting landscape heterogeneous. 
Using the logistic regression model gave results consistent with observed patterns of land-abandonment: existing forests expanded and gaps closed, leading to an increasingly homogeneous landscape.
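The land-abandonment probability described above, a logistic function of several per-cell predictors, can be sketched as follows. The abstract reports neither coefficients nor their signs, so every number below is a hypothetical placeholder chosen only to make the example runnable, and the predictor names are illustrative labels for the variables the abstract lists; this is not the TreeMig-LAb implementation.

```python
import math

# Hypothetical coefficients (values and signs are assumptions, not fitted results).
COEFFS = {
    "degree_day_sum": 0.001,
    "distance_from_forest_edge": -0.01,
    "soil_stoniness": 0.5,
    "slope": 0.05,
    "share_secondary_tertiary_employees": 1.0,
    "share_commuters": 0.8,
    "share_full_time_farms": -1.2,
}
INTERCEPT = -3.0  # hypothetical

def abandonment_probability(cell, coeffs=COEFFS, intercept=INTERCEPT):
    """Logistic function of a linear combination of the cell's predictors."""
    z = intercept + sum(coeffs[name] * cell[name] for name in coeffs)
    # Numerically stable evaluation of 1 / (1 + exp(-z)).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Example grid cell with made-up predictor values.
cell = {
    "degree_day_sum": 1200.0,
    "distance_from_forest_edge": 50.0,
    "soil_stoniness": 0.3,
    "slope": 25.0,
    "share_secondary_tertiary_employees": 0.6,
    "share_commuters": 0.4,
    "share_full_time_farms": 0.2,
}
print(abandonment_probability(cell))
```

In a spatio-temporal model of the kind described, such a function would be re-evaluated per grid cell and time step, since distance from forest edge and degree-day sum change over time while the other predictors stay fixed.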