17 results for SEQUENCE DATA
in Helda - Digital Repository of University of Helsinki
Abstract:
Evolutionary genetics incorporates traditional population genetics and studies of the origins of genetic variation by mutation and recombination, and the molecular evolution of genomes. Among the primary forces that have the potential to affect genetic variation within and among populations, including those that may lead to adaptation and speciation, are genetic drift, gene flow, mutation and natural selection. The main challenges in understanding the genetic basis of evolutionary change are to distinguish the adaptive selective forces that give rise to existing DNA sequence variants and to identify the nucleotide differences responsible for the observed phenotypic variation. To understand the effects of these forces, interpretation of gene sequence variation has been the principal basis of many evolutionary genetic studies. The main aim of this thesis was to assess different forms of teleost gene sequence polymorphisms in evolutionary genetic studies of Atlantic salmon (Salmo salar) and other species. Firstly, the extent to which Darwinian adaptive evolution affected coding regions of the growth hormone (GH) gene during teleost evolution was investigated using sequence data available in public databases. Secondly, a target gene approach was used to identify within-population variation in the growth hormone 1 (GH1) gene in salmon. Then, a new strategy for single nucleotide polymorphism (SNP) discovery in salmonid fishes was introduced and, finally, the usefulness of a limited number of SNP markers as molecular tools in several applications of population genetics in Atlantic salmon was assessed. This thesis showed that the gene sequences in databases can be utilized to perform comparative studies of molecular evolution, and some putative evidence of Darwinian selection during teleost GH evolution was presented.
In addition, existing sequence data were exploited to investigate GH1 gene variation within Atlantic salmon populations throughout the species' range. Purifying selection is suggested to be the predominant evolutionary force controlling the genetic variation of this gene in salmon, and some support for gene flow between continents was also observed. The novel approach to SNP discovery in species with duplicated genome fragments introduced here proved to be an effective method, and it may have several applications in evolutionary genetics with different species, e.g. when developing gene-targeted markers to investigate quantitative genetic variation. The thesis also demonstrated that even a small number of SNPs gave signals highly similar to those of microsatellite markers in some of the population genetic analyses. This may have useful applications when estimating genetic diversity in genes having a potential role in ecological and conservation issues, or when using difficult biological samples in genetic studies, as SNPs can be applied to relatively highly degraded DNA.
Abstract:
This thesis, which consists of an introduction and four peer-reviewed original publications, studies the problems of haplotype inference (haplotyping) and local alignment significance. The problems studied here belong to the broad area of bioinformatics and computational biology. The presented solutions are computationally fast and accurate, which makes them practical in high-throughput sequence data analysis. Haplotype inference is a computational problem where the goal is to estimate haplotypes from a sample of genotypes as accurately as possible. This problem is important as the direct measurement of haplotypes is difficult, whereas the genotypes are easier to quantify. Haplotypes are key players when studying, for example, the genetic causes of diseases. In this thesis, three methods are presented for the haplotype inference problem, referred to as HaploParser, HIT, and BACH. HaploParser is based on a combinatorial mosaic model and hierarchical parsing that together mimic recombinations and point mutations in a biologically plausible way. In this mosaic model, the current population is assumed to have evolved from a small founder population. Thus, the haplotypes of the current population are recombinations of the (implicit) founder haplotypes with some point mutations. HIT (Haplotype Inference Technique) uses a hidden Markov model for haplotypes, and efficient algorithms are presented to learn this model from genotype data. The model structure of HIT is analogous to the mosaic model of HaploParser with founder haplotypes. Therefore, it can be seen as a probabilistic model of recombinations and point mutations. BACH (Bayesian Context-based Haplotyping) utilizes a context tree weighting algorithm to efficiently sum over all variable-length Markov chains to evaluate the posterior probability of a haplotype configuration. Algorithms are presented that find haplotype configurations with high posterior probability.
BACH is the most accurate method presented in this thesis and has comparable performance to the best available software for haplotype inference. Local alignment significance is a computational problem where one is interested in whether the local similarities in two sequences are due to the fact that the sequences are related or just due to chance. Similarity of sequences is measured by their best local alignment score, and from that a p-value is computed. This p-value is the probability that two sequences picked from the null model have a best local alignment score that is as good or better. Local alignment significance is used routinely, for example, in homology searches. In this thesis, a general framework is sketched that allows one to compute a tight upper bound for the p-value of a local pairwise alignment score. Unlike previous methods, the presented framework is not affected by so-called edge effects and can handle gaps (deletions and insertions) without troublesome sampling and curve fitting.
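To make the p-value definition above concrete, here is a minimal Python sketch. It is not the framework of the thesis, which computes an upper bound without sampling; it is the plain sampling baseline that such a framework aims to avoid. The scoring values (match +1, mismatch -1, linear gap -2), the shuffling null model, and the function names `local_score` and `empirical_p_value` are all assumptions made for illustration.

```python
import random


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty).

    The scoring values are illustrative assumptions, not taken from the thesis.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            s = max(
                0,                                            # start a new local alignment
                prev[j - 1] + (match if x == y else mismatch),  # align x with y
                prev[j] + gap,                                # gap in b
                cur[j - 1] + gap,                             # gap in a
            )
            cur.append(s)
            best = max(best, s)
        prev = cur
    return best


def empirical_p_value(a, b, trials=200, seed=0):
    """Estimate P(score >= observed) under an assumed null model.

    Assumed null model: b is replaced by a random shuffle of its own letters.
    """
    rng = random.Random(seed)
    observed = local_score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)
        if local_score(a, "".join(letters)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (trials + 1)
```

The sampling estimate can never resolve p-values much smaller than one over the number of trials, which is one reason an analytical upper bound is attractive.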
Abstract:
Phylogenetic studies of cyanobacterial lichens. Lichens are symbiotic assemblages between fungi (mycobiont) and green algae (phycobiont) and/or cyanobacteria (cyanobiont). Fossil records show that lichen-like symbioses already occurred 600 million years ago. Lichen symbiosis has since then become an important life strategy for the Fungi, particularly for species in the phylum Ascomycota, as approximately 98% of the lichenized fungal species are ascomycetes. The taxonomy of lichen associations is based on the mycobiont. We reconstructed, using DNA sequence data, hypotheses of phylogenetic relationships of lichen-forming fungi that include species associated with cyanobacteria. These hypotheses of phylogeny should form the basis for the taxonomy. They also allowed studies of the origin and the evolution of specific symbioses. Genetic diversity and phylogenetic relationships of symbiotic cyanobionts were also studied in order to examine selectivity of cyanobionts and mycobionts as well as possible co-evolution between partners involved in lichen associations. The suggested circumscription of the family Stereocaulaceae to include Stereocaulon and Lepraria is supported. The recently described crustose Stereocaulon species seem to be correctly placed in the genus, although Stereocaulon traditionally included only fruticose species. The monospecific crustose genus Muhria is also shown to be best placed in Stereocaulon. Family Lobariaceae as currently delimited is monophyletic. Within Lobariaceae, the genus Sticta, including Dendriscocaulon dendroides, forms a monophyletic group, while the genera Lobaria and Pseudocyphellaria are non-monophyletic. A new classification of Lobariaceae is clearly needed. Further studies are, however, required before a final proposal for a new classification can be made. Our results show that the cyanobacterial symbiotic state has been gained repeatedly in the Ascomycota, while losses of symbiotic cyanobacteria appear to be rare.
The symbiosis with green algae is confirmed to have been gained repeatedly in Ascomycota but also repeatedly lost. Cyanobacterial symbioses therefore seem to be more stable than green algal associations. Cyanobacteria are perhaps more beneficial for the lichen fungi and therefore maintained. The results indicate a dynamic association of the lichen symbiosis. This evolutionary instability will perhaps be important for the lichen fungi as the utilization of options will perhaps enable lichens to colonize new substrates and survive environmental changes. Some cyanobacterial lichen genera seem to be highly selective towards the cyanobiont while others form symbioses with a broad spectrum of cyanobacteria. No evidence of co-evolution between fungi and cyanobacteria in cyanolichens could be demonstrated.
Abstract:
Lead contamination in the environment is of particular concern, as it is a known toxin. Until recently, however, much less attention has been given to the local contamination caused by activities at shooting ranges compared to large-scale industrial contamination. In Finland, more than 500 tons of Pb is produced each year for shotgun ammunition. The contaminant threatens various organisms, ground water and the health of human populations. However, the forest at shooting ranges usually shows no visible sign of stress compared to nearby clean environments. The aboveground biota normally reflects the belowground ecosystem. Thus, the soil microbial communities appear to bear strong resistance to contamination, despite the influence of lead. The studies forming this thesis investigated a shooting range site at Hälvälä in Southern Finland, which is heavily contaminated by lead pellets. Previously it was experimentally shown that the growth of grasses and degradation of litter are retarded. Measurements of acute toxicity of the contaminated soil or soil extracts gave conflicting results, as enchytraeid worms used as toxicity reporters were strongly affected, while reporter bacteria showed no or very minor decreases in viability. Measurements using sensitive inducible luminescent reporter bacteria suggested that the bioavailability of lead in the soil is indeed low, and this notion was supported by the very low water extractability of the lead. Nevertheless, the frequency of lead-resistant cultivable bacteria was elevated based on the isolation of cultivable strains. The bacterial and fungal diversity in heavily lead contaminated shooting sectors were compared with those of pristine sections of the shooting range area. The bacterial 16S rRNA gene and fungal ITS rRNA gene were amplified, cloned and sequenced using total DNA extracted from the soil humus layer as the template. 
Altogether, 917 sequenced bacterial clones and 649 sequenced fungal clones revealed a high soil microbial diversity. No effect of lead contamination was found on bacterial richness or diversity, while fungal richness and diversity significantly differed between lead contaminated and clean control areas. However, even in the case of fungi, genera that were deemed sensitive were not totally absent from the contaminated area: only their relative frequency was significantly reduced. Some operational taxonomic units (OTUs) assigned to Basidiomycota were clearly affected, and were much rarer in the lead contaminated areas. The studies of this thesis surveyed EcM sporocarps, analyzed morphotyped EcM root tips by direct sequencing, and 454-pyrosequenced fungal communities in in-growth bags. A total of 32 EcM fungi that formed conspicuous sporocarps, 27 EcM fungal OTUs from 294 root tips, and 116 EcM fungal OTUs from a total of 8 194 ITS2 454 sequences were recorded. The ordination analyses by non-parametric multidimensional scaling (NMS) indicated that Pb enrichment induced a shift in the EcM community composition. This was visible as indicative trends in the sporocarp and root tip datasets, but explicitly clear in the communities observed in the in-growth bags. The compositional shift in the EcM community was mainly attributable to an increase in the frequencies of OTUs assigned to the genus Thelephora, and to a decrease in the OTUs assigned to Pseudotomentella, Suillus and Tylospora in Pb-contaminated areas when compared to the control. The enrichment of Thelephora in contaminated areas was also observed when examining the total fungal communities in soil using DNA cloning and sequencing technology. While the compositional shifts are clear, their functional consequences for the dominant trees or soil ecosystem remain undetermined. The results indicate that at the Hälvälä shooting range, lead influences the fungal communities but not the bacterial communities. 
The forest ecosystem shows apparent functional redundancy, since no significant effects were seen on forest trees. Recently, by means of 454 pyrosequencing, the number of sequences in a single analysis run can be up to one million. The technique has been applied in microbial ecology studies to characterize microbial communities. The handling of sequence data with traditional programs is becoming difficult and exceedingly time-consuming, and novel tools are needed to handle the vast amounts of data being generated. The field of microbial ecology has recently benefited from the availability of a number of tools for describing and comparing microbial communities using robust statistical methods. However, although these programs provide methods for rapid calculation, it has become necessary to make them more amenable to the larger datasets and numbers of samples produced by pyrosequencing. As part of this thesis, a new program, MuSSA (Multi-Sample Sequence Analyser), was developed to handle sequence data from novel high-throughput sequencing approaches in microbial community analyses. The greatest advantage of the program is that large volumes of sequence data can be manipulated, and general OTU series with a frequency value can be calculated among a large number of samples.
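The idea of an OTU series with a frequency value per sample can be illustrated with a small Python sketch. This is not MuSSA's implementation or interface, which the abstract does not describe; the function name `otu_table`, the input layout (one list of OTU labels per sample), and the sample names are hypothetical.

```python
from collections import Counter


def otu_table(samples):
    """Build a table of relative OTU frequencies across samples.

    samples: dict mapping a sample name to a list of OTU labels, one label
    per sequence read (an assumed input layout, for illustration only).
    Returns (otus, table) where table[sample][otu] is a relative frequency.
    """
    # Shared, sorted list of every OTU seen in any sample.
    otus = sorted({otu for reads in samples.values() for otu in reads})
    table = {}
    for name, reads in samples.items():
        counts = Counter(reads)
        total = len(reads)
        # OTUs absent from a sample get frequency 0.0 so rows are comparable.
        table[name] = {
            otu: (counts[otu] / total if total else 0.0) for otu in otus
        }
    return otus, table


if __name__ == "__main__":
    demo = {
        "control": ["OTU1", "OTU1", "OTU2", "OTU3"],
        "contaminated": ["OTU1", "OTU4"],
    }
    otus, table = otu_table(demo)
    for sample, row in table.items():
        print(sample, [round(row[o], 2) for o in otus])
```

Keeping every sample on the same OTU axis is what makes downstream comparisons between many samples straightforward.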
Abstract:
This study addressed the large-scale molecular zoogeography in two brackish water bivalve molluscs, Macoma balthica and Cerastoderma glaucum, and genetic signatures of the postglacial colonization of Northern Europe by them. The traditional view poses that M. balthica in the Baltic, White and Barents seas (i.e. marginal seas) represent direct postglacial descendants of the adjacent Northeast Atlantic populations, but this has recently been challenged by observations of close genetic affinities between these marginal populations and those of the Northeast Pacific. The primary aim of the thesis was to verify, quantify and characterize the Pacific genetic contribution across North European populations of M. balthica and to resolve the phylogeographic histories of the two bivalve taxa in range-wide studies using information from mitochondrial DNA (mtDNA) and nuclear allozyme polymorphisms. The presence of recent Pacific genetic influence in M. balthica of the Baltic, White and Barents seas, along with an Atlantic element, was confirmed by mtDNA sequence data. On a broader temporal and geographical scale, altogether four independent trans-Arctic invasions of Macoma from the Pacific since the Miocene seem to have been involved in generating the current North Atlantic lineage diversity. The latest trans-Arctic invasion that affected the current Baltic, White and Barents Sea populations probably took place in the early post-glacial. The nuclear genetic compositions of these marginal sea populations are intermediate between those of pure Pacific and Atlantic subspecies. In the marginal sea populations of mixed ancestry (Barents, White and Northern Baltic seas), the Pacific and Atlantic components are now randomly associated in the genomes of individual clams, which indicates both pervasive historical interbreeding between the previously long-isolated lineages (subspecies), and current isolation of these populations from the adjacent pure Atlantic populations. 
These mixed populations can be characterized as self-supporting hybrid swarms, and they arguably represent the most extensive marine animal hybrid swarms so far documented. Each of the three swarms still has a distinct genetic composition, and the relative Pacific contributions vary from 30 to 90 % in local populations. This diversity highlights the potential of introgressive hybridization to rapidly give rise to new evolutionarily and ecologically significant units in the marine realm. South of the Danish straits and in the Southern Baltic Sea, a broad genetic transition zone links the pure North Sea subspecies M. balthica rubra to the inner Baltic hybrid swarm, which has a Pacific contribution of about 60 % in its genome. This transition zone has no regular smooth clinal structure, but its populations show strong genotypic disequilibria typical of a hybrid zone maintained by the interplay of selection and gene flow by dispersing pelagic larvae. The structure of the genetic transition is partly in line with features of Baltic water circulation and salinity stratification, with greater penetration of Atlantic genes on the Baltic south coast and in deeper water populations. In all, the scenarios of historical isolation and secondary contact that arise from the phylogeographic studies of both Macoma and Cerastoderma shed light on the more general but enigmatic patterns seen in marine phylogeography, where deep genetic breaks are often seen in species with high dispersal potential.
Abstract:
Social behaviour affects dispersal of animals and is an important modifier of genetic population structures. The female sex is often philopatric, which maintains coancestry within the breeding groups and promotes cooperative behaviours. This enables also inclusive fitness returns from altruism and explains why some individuals sacrifice personal reproduction for the good of others in social insects such as ants. However, reduced dispersal and population substructuring at the level of colonies may also entail inbreeding, loss of genetic diversity, and vulnerability. In addition, the most vulnerable ants are species that are evolved to parasitize colonies of other ants, and which compromise between abilities to disperse and the efficiency to parasitize the host. On the other hand, certain social organisations of ant colonies may facilitate a species to disperse outside its natural range and become a pest. Altogether, knowledge on genetic structuring of ant populations, as well as the evolution of their life histories can contribute to conservation biology and population management. The aim of this thesis was to investigate population structures and phylogenetic evolution of the ant Plagiolepis pygmaea and its two obligatory, workerless social parasites (inquilines) P. xene and P. grassei with genetic markers and DNA sequence data. The results support the general assumption that populations of inquiline parasites are highly fragmented and genetically vulnerable. Comparison of the two parasites suggests that differences in their relative abundance may follow from their interaction with the host, i.e. how well the species is adapted to reproduce in the host colonies. The results also indicate that the most recent free living ancestor to these two parasite species is their common host. This is considered to provide evidence for the controversial issue of sympatric speciation. 
Further, given that the level of adaptations to parasitic life history depends on the evolutionary time since the free-living ancestor, the results establish a link between species rarity and its evolutionary age. The populations of the host species P. pygmaea displayed significantly reduced dispersal both among the females (queens) and males, and high levels of inbreeding which may enhance worker altruism. In addition, the queens were found to mate with multiple males. Given the high relatedness between the queens and their mates, this occurs probably for non-genetic reasons, e.g. without benefits associated in genetically more diverse offspring. The results hence caution that the contribution of non-genetic factors to the prevailing mating patterns and genetic population structures should not be underestimated.
Abstract:
Phylogenetic analyses of the Hypnales usually show the same picture of poorly resolved trees with a large number of polyphyletic taxa and low support for the few reconstructed clades. One odd clade, however, consisting of three genera that are currently treated either within the Leskeaceae (Miyabea) or Neckeraceae (Homaliadelphus and Bissetia), was retrieved in a previously published phylogeny based on chloroplast rbcL. In order to elucidate the reliability of the observed Homaliadelphus - Miyabea - Bissetia - clade (HMB-clade) and to reveal its phylogenetic relationships a molecular study based on a representative set of hypnalean taxa was performed. Sequence data from all three genomes, namely the ITS1 and 2 (nuclear), the trnS-rps4-trnT-trnL-trnF cluster (plastid), the nad5 intron (mitochondrial), were analyzed. Although the phylogenetic reconstruction of the combined data set was not fully resolved regarding the backbone it clearly indicated the polyphyletic nature of various hypnalean families, such as the Leskeaceae, Hypnaceae, Hylocomiaceae, Neckeraceae, Leptodontaceae and Anomodontaceae with respect to the included taxa. In addition the results favor the inclusion of the Leptodontaceae and Thamnobryaceae in the Neckeraceae. The maximally supported HMB-clade consisting of the three genera Homaliadelphus (2-3 species), Miyabea (3 species) and Bissetia (1 species) is resolved sister to a so far unnamed clade comprising Taxiphyllum aomoriense, Glossadelphus ogatae and Leptopterigynandrum. The well-resolved and supported HMB-clade, here formally described as the Miyabeaceae, fam. nov. is additionally supported by morphological characters such as strongly incrassate, porose leaf cells, a relatively weak and diffuse costa and the presence of dwarf males. The latter are absent in the Neckeraceae and the Leskeaceae. It is essentially an East Asian family, with one species occurring in North America.
Abstract:
Earlier phylogenetic studies, including species belonging to the Neckeraceae, have indicated that this pleurocarpous moss family shares a strongly supported sister group relationship with the Lembophyllaceae, but the family delimitation of the former needs adjustment. To test the monophyly of the Neckeraceae, as well as to redefine the family circumscription and to pinpoint its phylogenetic position in a larger context, a phylogenetic study based on molecular data was carried out. Sequence data were compiled, combining data from all three genomes: nuclear ITS1 and 2, plastid trnS-rps4-trnT-trnL-trnF and rpl16, and mitochondrial nad5 intron. The Neckeraceae have sometimes been divided into the two families, Neckeraceae and Thamnobryaceae, a division rejected here. Both parsimony and Bayesian analyses of molecular data revealed that the family concept of the Neckeraceae needs several further adjustments, such as the exclusion of some individual species and smaller genera as well as the inclusion of the Leptodontaceae. Within the family three well-supported clades (A, B and C) can be distinguished. Members of clade A are mainly non-Asiatic and nontropical. Most species have a weak costa and immersed capsules with reduced peristomes (mainly Neckera spp.) and the teeth at the leaf margins are usually unicellular. Clade B members are also mainly non-Asiatic. They are typically fairly robust, distinctly stipilate, having a single, at least relatively strong costa, long setae (capsules exserted), and the peristomes are well developed or only somewhat reduced. Members of clade C are essentially Asiatic and tropical. The species of this clade usually have a strong costa and a long seta, the seta often being mammillose in its upper part. The peristome types in this clade are mixed, since both reduced and unreduced types are found. Several neckeraceous genera that were recognised on a morphological basis are polyphyletic (e.g. Neckera, Homalia, Thamnobryum, Porotrichum). 
Ancestral state reconstructions revealed that currently used diagnostic traits, such as the leaf asymmetry and costa strength are highly homoplastic. Similarly, the reconstructions revealed that the 'reduced' sporophyte features have evolved independently in each of the three clades.
Abstract:
This dissertation is focused on the taxonomy, phylogeny, and ecology of the vagrant, erratic and allied terricolous and saxicolous species of the genera Aspicilia A. Massal. and Circinaria Link (Megasporaceae), particularly those traditionally referred to as manna lichens. The group has previously been defined on the basis of few morphological characters. The phylogeny of the family Megasporaceae is inferred from the combined dataset of nuLSU and mtSSU sequences. Five genera, Aspicilia, Circinaria, Lobothallia, Megaspora, and Sagedia, are recognized. Lobothallia is sister to the four other genera, while Aspicilia and Sagedia form the next clade. All these genera have small asci with eight spores. Circinaria is a sister genus of Megaspora, and these two have in common asci with (1–4) 6–8 large spores. Circinaria forms a monophyletic group, and sphaerothallioid species form a monophyletic group within Circinaria. The presence of certain morphological characters such as pseudocyphellae, thickness of cortex and medulla layers, as well as ecological differences in sphaerothallioid species distinguish it from some other crustose species, especially those containing aspicilin and characterised by thin cortex and medulla layers, conidium length c. 6–12 µm and absence of pseudocyphellae. If sphaerothallioid species are accepted as a distinct genus, the rest of the Circinaria species would remain as a paraphyletic assemblage. Currently, the genus Circinaria includes all the sphaerothallioid species and its generic position is confirmed and accepted. Thus, it is proposed as a correct generic name also for the manna lichens described originally in other genera. Phylogeny at the species level was studied using nrITS sequence data. Traditionally, morphological characters have been used for the recognition of species. They were re-evaluated in the light of molecular data.
Since characters such as vagrant, erratic and crustose growth forms proved to be misleading for the recognition of some species, a combination of several characters (including molecular data) is recommended. Vagrant growth form seems to have evolved several times among the distantly related lineages and even within a single population. The reasons behind the high plasticity in the external morphology of the sphaerothallioid Circinaria remain, however, unknown. Six new species are recognized: Aspicilia tibetica, Circinaria arida, C. digitata nom. provis., C. gyrosa nom. provis., C. rogeri nom. provis., and C. rostamii nom. provis. Based on an analysis of the nrITS dataset, three new erratic, vagrant and crustose species were also recognized, but these require additional study. The results also reveal that C. elmorei and C. hispida are not monophyletic as currently understood. In addition, 13 new combinations in the genus Circinaria are proposed.
Abstract:
Microbes in natural and artificial environments as well as in the human body are a key part of the functional properties of these complex systems. The presence or absence of certain microbial taxa is a correlate of functional status, such as the risk of disease or the course of metabolic processes of a microbial community. As microbes are highly diverse and mostly not cultivable, molecular markers like gene sequences are a potential basis for detection and identification of key types. The goal of this thesis was to study molecular methods for identification of microbial DNA in order to develop a tool for analysis of environmental and clinical DNA samples. Particular emphasis was placed on specificity of detection, which is a major challenge when analyzing complex microbial communities. The approach taken in this study was the application and optimization of enzymatic ligation of DNA probes coupled with microarray read-out for high-throughput microbial profiling. The results show that fungal phylotypes and human papillomavirus genotypes could be accurately identified from pools of PCR amplicons generated from purified sample DNA. Approximately 1 ng/μl of sample DNA was needed for representative PCR amplification as measured by comparisons between clone sequencing and microarray. A minimum of 0.25 amol/μl of PCR amplicons was detectable from amongst 5 ng/μl of background DNA, suggesting that the detection limit of the test, comprising a ligation reaction followed by microarray read-out, was approximately 0.04%. Detection from sample DNA directly was shown to be feasible with probes forming a circular molecule upon ligation followed by PCR amplification of the probe. In this approach, the minimum detectable relative amount of target genome was found to be 1% of all genomes in the sample as estimated from 454 deep sequencing results.
The signal-to-noise ratio of contact-printed microarrays could be improved by using an internal microarray hybridization control oligonucleotide probe together with a computational algorithm. The algorithm was based on identification of a bias in the microarray data and correction of the bias, as shown by simulated and real data. The results further suggest semiquantitative detection to be possible by ligation detection, allowing estimation of target abundance in a sample. However, in practice, comprehensive sequence information of full-length rRNA genes is needed to support probe design with complex samples. This study shows that DNA microarrays have the potential to serve as an accurate microbial diagnostic platform that takes advantage of increasing sequence data and replaces traditional, less efficient methods that still dominate routine testing in laboratories. The data suggest that a ligation-reaction-based microarray assay can be optimized to a degree that allows a good signal-to-noise ratio and semiquantitative detection.
Abstract:
Advancements in the analysis techniques have led to a rapid accumulation of biological data in databases. Such data often are in the form of sequences of observations, examples including DNA sequences and amino acid sequences of proteins. The scale and quality of the data give promises of answering various biologically relevant questions in more detail than what has been possible before. For example, one may wish to identify areas in an amino acid sequence, which are important for the function of the corresponding protein, or investigate how characteristics on the level of DNA sequence affect the adaptation of a bacterial species to its environment. Many of the interesting questions are intimately associated with the understanding of the evolutionary relationships among the items under consideration. The aim of this work is to develop novel statistical models and computational techniques to meet with the challenge of deriving meaning from the increasing amounts of data. Our main concern is on modeling the evolutionary relationships based on the observed molecular data. We operate within a Bayesian statistical framework, which allows a probabilistic quantification of the uncertainties related to a particular solution. As the basis of our modeling approach we utilize a partition model, which is used to describe the structure of data by appropriately dividing the data items into clusters of related items. Generalizations and modifications of the partition model are developed and applied to various problems. Large-scale data sets provide also a computational challenge. The models used to describe the data must be realistic enough to capture the essential features of the current modeling task but, at the same time, simple enough to make it possible to carry out the inference in practice. The partition model fulfills these two requirements. The problem-specific features can be taken into account by modifying the prior probability distributions of the model parameters. 
The computational efficiency stems from the ability to integrate out the parameters of the partition model analytically, which enables the use of efficient stochastic search algorithms.
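To illustrate what integrating out the parameters analytically can look like, here is a minimal sketch in Python. It assumes a simple product-of-categoricals model over aligned DNA sequences with a symmetric Dirichlet prior at each site; the abstract does not specify the thesis's actual likelihood or priors, so the model, the function names `log_marginal_cluster` and `log_marginal_partition`, and the `alpha` parameter are all illustrative assumptions.

```python
from collections import Counter
from math import lgamma

ALPHABET = "ACGT"  # assumed alphabet; sequences are assumed aligned and gap-free


def log_marginal_cluster(seqs, alpha=1.0):
    """Log marginal likelihood of one cluster of aligned sequences.

    Each site has its own categorical distribution over ALPHABET with a
    symmetric Dirichlet(alpha) prior. The categorical parameters are
    integrated out in closed form (Dirichlet-multinomial), so no
    parameter values need to be sampled or estimated.
    """
    if not seqs:
        return 0.0
    k = len(ALPHABET)
    total = 0.0
    for column in zip(*seqs):  # iterate over alignment sites
        counts = Counter(column)
        n = len(column)
        total += lgamma(k * alpha) - lgamma(k * alpha + n)
        for a in ALPHABET:
            total += lgamma(alpha + counts[a]) - lgamma(alpha)
    return total


def log_marginal_partition(clusters, alpha=1.0):
    """Score a whole partition: clusters are independent given the partition."""
    return sum(log_marginal_cluster(c, alpha) for c in clusters)


if __name__ == "__main__":
    same = ["ACGT", "ACGT"]
    different = ["AAAA", "CCCC"]
    # Identical sequences score higher when clustered together,
    # dissimilar ones when kept apart.
    print(log_marginal_partition([same]), log_marginal_partition([[s] for s in same]))
    print(log_marginal_partition([different]), log_marginal_partition([[s] for s in different]))
```

Because a partition can be scored in closed form like this, a stochastic search could in principle move items between clusters and compare scores directly, without sampling the per-cluster parameters.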
Abstract:
The analysis of sequential data is required in many diverse areas such as telecommunications, stock market analysis, and bioinformatics. A basic problem related to the analysis of sequential data is the sequence segmentation problem. A sequence segmentation is a partition of the sequence into a number of non-overlapping segments that cover all data points, such that each segment is as homogeneous as possible. This problem can be solved optimally using a standard dynamic programming algorithm. In the first part of the thesis, we present a new approximation algorithm for the sequence segmentation problem. This algorithm runs faster than the optimal dynamic programming algorithm while guaranteeing a bounded approximation ratio. The basic idea is to divide the input sequence into subsequences, solve the problem optimally in each subsequence, and then appropriately combine the solutions to the subproblems into one final solution. In the second part of the thesis, we study alternative segmentation models that are devised to better fit the data. More specifically, we focus on clustered segmentations and segmentations with rearrangements. While in the standard segmentation of a multidimensional sequence all dimensions share the same segment boundaries, in a clustered segmentation the multidimensional sequence is segmented in such a way that dimensions are allowed to form clusters. Each cluster of dimensions is then segmented separately. We formally define the problem of clustered segmentations and we show experimentally that segmenting sequences using this segmentation model leads to solutions with smaller error for the same model cost. Segmentation with rearrangements is a novel variation of the segmentation problem: in addition to partitioning the sequence, we also seek to apply a limited amount of reordering, so that the overall representation error is minimized. 
We formulate the problem of segmentation with rearrangements and we show that it is an NP-hard problem to solve or even to approximate. We devise effective algorithms for the proposed problem, combining ideas from dynamic programming and outlier detection algorithms in sequences. In the final part of the thesis, we discuss the problem of aggregating results of segmentation algorithms on the same set of data points. In this case, we are interested in producing a partitioning of the data that agrees as much as possible with the input partitions. We show that this problem can be solved optimally in polynomial time using dynamic programming. Furthermore, we show that not all data points are candidates for segment boundaries in the optimal solution.
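The standard dynamic programming solution mentioned in this abstract can be sketched as follows. This is a minimal illustration, assuming a one-dimensional sequence, piecewise-constant segments, and sum of squared errors as the homogeneity measure; the function name `optimal_segmentation` is hypothetical, and the code is not the thesis's approximation algorithm.

```python
def optimal_segmentation(xs, k):
    """Split xs into k contiguous segments minimizing the total squared error.

    Each segment is represented by its mean. Returns (error, boundaries),
    where boundaries lists the start index of every segment after the first.
    Runs in O(n^2 * k) time using prefix sums for O(1) segment costs.
    """
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(xs)")

    # Prefix sums of values and squared values.
    s1, s2 = [0.0], [0.0]
    for x in xs:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def sse(i, j):
        """Squared error of xs[i:j] around its mean (requires i < j)."""
        total = s1[j] - s1[i]
        return max(s2[j] - s2[i] - total * total / (j - i), 0.0)

    inf = float("inf")
    # cost[m][j]: best error for covering xs[:j] with exactly m segments.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):  # last segment is xs[i:j]
                c = cost[m - 1][i] + sse(i, j)
                if c < cost[m][j]:
                    cost[m][j] = c
                    back[m][j] = i

    # Walk the back-pointers to recover the segment start indices.
    starts, j = [], n
    for m in range(k, 0, -1):
        j = back[m][j]
        starts.append(j)
    starts.reverse()
    return cost[k][n], starts[1:]


if __name__ == "__main__":
    print(optimal_segmentation([1, 1, 1, 5, 5, 9, 9, 9], 3))  # (0.0, [3, 5])
```

The quadratic dependence on the sequence length in this sketch is the kind of cost that motivates faster approximate approaches such as dividing the input into subsequences.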
Abstract:
Segmentation is a data mining technique yielding simplified representations of sequences of ordered points. A sequence is divided into some number of homogeneous segments, and all points within a segment are described by a single value. The focus in this thesis is on piecewise-constant segments, where the most likely description for each segment and the most likely segmentation into some number of segments can be computed efficiently. Representing sequences as segmentations is useful in, e.g., storage and indexing tasks in sequence databases, and segmentation can be used as a tool in learning about the structure of a given sequence. The discussion in this thesis begins with basic questions related to segmentation analysis, such as choosing the number of segments and evaluating the obtained segmentations. Standard model selection techniques are shown to perform well for the sequence segmentation task. Segmentation evaluation is proposed with respect to a known segmentation structure. Applying segmentation to certain features of a sequence is shown to yield segmentations that are significantly close to the known underlying structure. Two extensions to the basic segmentation framework are introduced: unimodal segmentation and basis segmentation. The former is concerned with segmentations where the segment descriptions first increase and then decrease, and the latter with the interplay between different dimensions and segments in the sequence. These problems are formally defined, and algorithms for solving them are provided and analyzed. Practical applications of segmentation techniques include time series and data stream analysis, text analysis, and biological sequence analysis. In this thesis segmentation applications are demonstrated in analyzing genomic sequences.
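As an illustration of choosing the number of segments with a standard model selection technique, the sketch below scores piecewise-constant segmentations with the Bayesian information criterion (BIC). The abstract does not say which criteria the thesis evaluates, so the choice of BIC, the Gaussian noise assumption, the parameter count, and the names `segmentation_errors` and `choose_k_by_bic` are all assumptions made for this example.

```python
from math import log


def segmentation_errors(xs, k_max):
    """Return a list whose entry k-1 is the minimum squared error of any
    piecewise-constant segmentation of xs into exactly k segments."""
    n = len(xs)
    if not 1 <= k_max <= n:
        raise ValueError("k_max must satisfy 1 <= k_max <= len(xs)")
    s1, s2 = [0.0], [0.0]
    for x in xs:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def sse(i, j):
        total = s1[j] - s1[i]
        return max(s2[j] - s2[i] - total * total / (j - i), 0.0)

    inf = float("inf")
    prev = [0.0] + [inf] * n  # best errors using zero segments
    errors = []
    for m in range(1, k_max + 1):
        cur = [inf] * (n + 1)
        for j in range(m, n + 1):
            cur[j] = min(prev[i] + sse(i, j) for i in range(m - 1, j))
        errors.append(cur[n])
        prev = cur
    return errors


def choose_k_by_bic(xs, k_max):
    """Pick the segment count with the lowest BIC under Gaussian noise.

    Assumed parameter count for k segments: k levels, k - 1 boundaries,
    and one shared noise variance, i.e. 2 * k in total.
    """
    n = len(xs)
    best_k, best_bic = None, float("inf")
    for k, err in enumerate(segmentation_errors(xs, k_max), start=1):
        variance = max(err / n, 1e-12)  # guard against log(0) on perfect fits
        bic = n * log(variance) + 2 * k * log(n)
        if bic < best_bic:
            best_k, best_bic = k, bic
    return best_k


if __name__ == "__main__":
    levels = [0.0, 5.0, 10.0]
    xs = [lvl + (0.1 if i % 2 == 0 else -0.1) for lvl in levels for i in range(6)]
    print(choose_k_by_bic(xs, 6))
```

The penalty term grows with the number of segments, so adding a segment only pays off when it reduces the fitting error substantially; other criteria would trade off fit and complexity differently.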