6 results for sequential data
in Helda - Digital Repository of University of Helsinki
Abstract:
Segmentation is a data mining technique yielding simplified representations of sequences of ordered points. A sequence is divided into some number of homogeneous blocks, and all points within a segment are described by a single value. The focus in this thesis is on piecewise-constant segments, where the most likely description for each segment and the most likely segmentation into some number of blocks can be computed efficiently. Representing sequences as segmentations is useful in, e.g., storage and indexing tasks in sequence databases, and segmentation can be used as a tool in learning about the structure of a given sequence. The discussion in this thesis begins with basic questions related to segmentation analysis, such as choosing the number of segments, and evaluating the obtained segmentations. Standard model selection techniques are shown to perform well for the sequence segmentation task. Segmentation evaluation is proposed with respect to a known segmentation structure. Applying segmentation on certain features of a sequence is shown to yield segmentations that are significantly close to the known underlying structure. Two extensions to the basic segmentation framework are introduced: unimodal segmentation and basis segmentation. The former is concerned with segmentations where the segment descriptions first increase and then decrease, and the latter with the interplay between different dimensions and segments in the sequence. These problems are formally defined and algorithms for solving them are provided and analyzed. Practical applications for segmentation techniques include time series and data stream analysis, text analysis, and biological sequence analysis. In this thesis segmentation applications are demonstrated in analyzing genomic sequences.
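As an illustration of how an optimal piecewise-constant segmentation can be computed, the following is a minimal sketch of the textbook dynamic program under a squared-error cost (which corresponds to maximum likelihood for Gaussian noise with a fixed variance). The function name `segment`, the cost choice, and the example data are assumptions made for illustration and are not taken from the thesis.

```python
def segment(seq, k):
    """Optimal piecewise-constant k-segmentation under squared error.

    Illustrative sketch: returns (total_error, blocks), where blocks are
    half-open index ranges (i, j) and each block is described by its mean.
    """
    n = len(seq)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(seq)")

    # Prefix sums give the squared error of any block in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(seq):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def cost(i, j):
        # Squared error of seq[i:j] around its own mean.
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[m][j]: minimal error of splitting seq[:j] into m blocks.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = best[m - 1][i] + cost(i, j)
                if c < best[m][j]:
                    best[m][j], back[m][j] = c, i

    # Recover the block boundaries by walking the back-pointers.
    blocks, j = [], n
    for m in range(k, 0, -1):
        blocks.append((back[m][j], j))
        j = back[m][j]
    return best[k][n], blocks[::-1]


error, blocks = segment([1, 1, 1, 5, 5, 5, 9, 9], 3)
```

This sketch runs in O(k n^2) time for a sequence of n points, which is why the number of segments k and the sequence length both matter in practice.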
Abstract:
The analysis of sequential data is required in many diverse areas such as telecommunications, stock market analysis, and bioinformatics. A basic problem related to the analysis of sequential data is the sequence segmentation problem. A sequence segmentation is a partition of the sequence into a number of non-overlapping segments that cover all data points, such that each segment is as homogeneous as possible. This problem can be solved optimally using a standard dynamic programming algorithm. In the first part of the thesis, we present a new approximation algorithm for the sequence segmentation problem. This algorithm has a smaller running time than the optimal dynamic programming algorithm while retaining a bounded approximation ratio. The basic idea is to divide the input sequence into subsequences, solve the problem optimally in each subsequence, and then appropriately combine the solutions to the subproblems into one final solution. In the second part of the thesis, we study alternative segmentation models that are devised to better fit the data. More specifically, we focus on clustered segmentations and segmentations with rearrangements. While in the standard segmentation of a multidimensional sequence all dimensions share the same segment boundaries, in a clustered segmentation the multidimensional sequence is segmented in such a way that dimensions are allowed to form clusters. Each cluster of dimensions is then segmented separately. We formally define the problem of clustered segmentations and we experimentally show that segmenting sequences using this segmentation model leads to solutions with smaller error for the same model cost. Segmentation with rearrangements is a novel variation of the segmentation problem: in addition to partitioning the sequence we also seek to apply a limited amount of reordering, so that the overall representation error is minimized.
We formulate the problem of segmentation with rearrangements and we show that it is an NP-hard problem to solve or even to approximate. We devise effective algorithms for the proposed problem, combining ideas from dynamic programming and outlier detection algorithms in sequences. In the final part of the thesis, we discuss the problem of aggregating results of segmentation algorithms on the same set of data points. In this case, we are interested in producing a partitioning of the data that agrees as much as possible with the input partitions. We show that this problem can be solved optimally in polynomial time using dynamic programming. Furthermore, we show that not all data points are candidates for segment boundaries in the optimal solution.
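The divide, solve, and combine idea described for the approximation algorithm can be sketched as follows. This is only one plausible reading of the abstract: the combination step used here (replacing each sub-segment by its mean, weighted by its length, and then segmenting those representatives) is an assumption, as are the names `weighted_segment` and `divide_and_segment` and the squared-error cost.

```python
def weighted_segment(points, k):
    """Optimal k-segmentation of (value, weight) pairs, weighted squared error.

    Returns (error, blocks) with blocks as half-open index ranges (i, j).
    """
    n = len(points)
    w0 = [0.0] * (n + 1)  # prefix sums of weights
    w1 = [0.0] * (n + 1)  # prefix sums of weight * value
    w2 = [0.0] * (n + 1)  # prefix sums of weight * value**2
    for i, (x, w) in enumerate(points):
        w0[i + 1] = w0[i] + w
        w1[i + 1] = w1[i] + w * x
        w2[i + 1] = w2[i] + w * x * x

    def cost(i, j):
        total = w1[j] - w1[i]
        return (w2[j] - w2[i]) - total * total / (w0[j] - w0[i])

    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = best[m - 1][i] + cost(i, j)
                if c < best[m][j]:
                    best[m][j], back[m][j] = c, i
    blocks, j = [], n
    for m in range(k, 0, -1):
        blocks.append((back[m][j], j))
        j = back[m][j]
    return best[k][n], blocks[::-1]


def divide_and_segment(seq, k, n_chunks):
    """Approximate k-segmentation: segment chunks, then segment their summaries."""
    n = len(seq)
    size = -(-n // n_chunks)  # ceiling division
    reps = []  # (mean, length, start, end) for every block found in any chunk
    for start in range(0, n, size):
        chunk = seq[start:start + size]
        _, blocks = weighted_segment([(x, 1) for x in chunk], min(k, len(chunk)))
        for i, j in blocks:
            mean = sum(chunk[i:j]) / (j - i)
            reps.append((mean, j - i, start + i, start + j))
    # Combine: segment the weighted representatives instead of the raw points.
    _, blocks = weighted_segment([(m, w) for m, w, _, _ in reps], min(k, len(reps)))
    # Map blocks of representatives back to index ranges in the original sequence.
    return [(reps[i][2], reps[j - 1][3]) for i, j in blocks]


bounds = divide_and_segment([1, 1, 1, 5, 5, 5, 9, 9, 9], 3, 2)
```

The saving comes from running the quadratic dynamic program only on short chunks and on a short list of representatives rather than on the full sequence. The price is that chunk borders may force boundaries the optimal solution would not choose.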
Abstract:
This thesis is an empirical study of how two words in Icelandic, "nú" and "núna", are used in contemporary Icelandic conversation. My aims in this study are, first, to explain the differences between the temporal functions of "nú" and "núna", and, second, to describe the non-temporal functions of "nú". In the analysis, a focus is placed on comparing the sequential placement of the two words, on their syntactical distribution, and on their prosodic realization. The empirical data comprise 14 hours and 11 minutes of naturally occurring conversation recorded between 1996 and 2003. The selected conversations represent a wide range of interactional contexts including informal dinner parties, institutional and non-institutional telephone conversations, radio programs for teenagers, phone-in programs, and, finally, a political debate on television. The theoretical and methodological framework is interactional linguistics, which can be described as linguistically oriented conversation analysis (CA). A comparison of "nú" and "núna" shows that the two words have different syntactic distributions. "Nú" has a clear tendency to occur in the front field, before the finite verb, while "núna" typically occurs in the end field, after the object. It is argued that this syntactic difference reflects a functional difference between "nú" and "núna". A sequential analysis of "núna" shows that the word refers to an unspecified period of time which includes the utterance time as well as some time in the past and in the future. This temporal relation is referred to as reference time. "Nú", by contrast, is mainly used in three different environments: 1) in temporal comparisons, 2) in transitions, and 3) when the speaker is taking an affective stance. The non-temporal functions of "nú" are divided into three categories: 1) "nú" as a tone particle, 2) "nú" as an utterance particle, and 3) "nú" as a dialogue particle.
"Nú" as a tone particle is syntactically integrated and can occur in two syntactic positions: pre-verbally and post-verbally. I argue that these instances are employed in utterances in which a speaker is foregrounding information or marking it as particularly important. The study shows that, although these instances are typically prosodically non-prominent and unstressed, they are in some cases delivered with stress and with a higher pitch than the surrounding talk. "Nú" as an utterance particle occurs turn-initially and is syntactically non-integrated. By using "nú", speakers show continuity between turns and link new turns to prior ones. These instances initiate either continuations by the same speaker or new turns after speaker shifts. "Nú" as a dialogue particle occurs as a turn of its own. The study shows that these instances register informings in prior turns as unexpected or as a departure from the normal state of affairs. "Nú" as a dialogue particle is often delivered with a prolonged vowel and a recognizable intonation contour. A comparative sequential and prosodic analysis shows that in these cases there is a correlation between the function of "nú" and the intonation contour by which it is delivered. Finally, I argue that despite the many functions of "nú", all the instances can be said to have a common denominator, which is to display attention towards the present moment and the utterances which are produced prior to or after the production of "nú". Instead of anchoring the utterances in external time or reference time, these instances position the utterance in discourse internal time, or discourse time.
Abstract:
Whether a statistician wants to complement a probability model for observed data with a prior distribution and carry out fully probabilistic inference, or base the inference only on the likelihood function, may be a fundamental question in theory, but in practice it may well be of less importance if the likelihood contains much more information than the prior. Maximum likelihood inference can be justified as a Gaussian approximation at the posterior mode, using flat priors. However, in situations where parametric assumptions in standard statistical models would be too rigid, more flexible model formulation, combined with fully probabilistic inference, can be achieved using hierarchical Bayesian parametrization. This work includes five articles, all of which apply probability modeling to various problems involving incomplete observation. Three of the papers apply maximum likelihood estimation and two of them hierarchical Bayesian modeling. Because maximum likelihood may be presented as a special case of Bayesian inference, but not the other way round, in the introductory part of this work we present a framework for probability-based inference using only Bayesian concepts. We also re-derive some results presented in the original articles using the toolbox developed herein, to show that they are also justifiable under this more general framework. Here the assumption of exchangeability and de Finetti's representation theorem are applied repeatedly for justifying the use of standard parametric probability models with conditionally independent likelihood contributions. It is argued that this same reasoning can be applied also under sampling from a finite population. The main emphasis here is on probability-based inference under incomplete observation due to study design. This is illustrated using a generic two-phase cohort sampling design as an example.
The alternative approaches presented for analysis of such a design are full likelihood, which utilizes all observed information, and conditional likelihood, which is restricted to a completely observed set, conditioning on the rule that generated that set. Conditional likelihood inference is also applied for a joint analysis of prevalence and incidence data, a situation subject to both left censoring and left truncation. Other topics covered are model uncertainty and causal inference using posterior predictive distributions. We formulate a non-parametric monotonic regression model for one or more covariates and a Bayesian estimation procedure, and apply the model in the context of optimal sequential treatment regimes, demonstrating that inference based on posterior predictive distributions is feasible also in this case.
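The remark that maximum likelihood can be read as a Gaussian approximation at the posterior mode under flat priors can be spelled out with the standard second-order (Laplace) expansion. The notation below, with parameter \theta, data y, likelihood L, and observed information J, is assumed for illustration and is not taken from the thesis.

```latex
\begin{align*}
p(\theta \mid y) &\propto L(\theta; y)\, p(\theta) \propto L(\theta; y)
  && \text{(flat prior, } p(\theta) \propto 1\text{)} \\
\hat{\theta} &= \arg\max_{\theta} \log L(\theta; y)
  && \text{(posterior mode = maximum likelihood estimate)} \\
\log p(\theta \mid y) &\approx \log p(\hat{\theta} \mid y)
  - \tfrac{1}{2} (\theta - \hat{\theta})^{\top} J(\hat{\theta}) (\theta - \hat{\theta}),
  \qquad J(\hat{\theta}) = -\nabla^{2}_{\theta} \log L(\theta; y) \big|_{\theta = \hat{\theta}} \\
\theta \mid y &\overset{\text{approx.}}{\sim} \mathcal{N}\!\big(\hat{\theta},\, J(\hat{\theta})^{-1}\big)
\end{align*}
```

Under these assumptions the posterior mean and covariance coincide, to second order, with the maximum likelihood estimate and the inverse observed information.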
Abstract:
The study of soil microbiota and their activities is central to the understanding of many ecosystem processes such as decomposition and nutrient cycling. The collection of microbiological data from soils generally involves several sequential steps of sampling, pretreatment and laboratory measurements. The reliability of results is dependent on reliable methods in every step. The aim of this thesis was to critically evaluate some central methods and procedures used in soil microbiological studies in order to increase our understanding of the factors that affect the measurement results and to provide guidance and new approaches for the design of experiments. The thesis focuses on four major themes: 1) soil microbiological heterogeneity and sampling, 2) storage of soil samples, 3) DNA extraction from soil, and 4) quantification of specific microbial groups by the most-probable-number (MPN) procedure. Soil heterogeneity and sampling are discussed as a single theme because knowledge on spatial (horizontal and vertical) and temporal variation is crucial when designing sampling procedures. Comparison of adjacent forest, meadow and cropped field plots showed that land use has a strong impact on the degree of horizontal variation of soil enzyme activities and bacterial community structure. However, regardless of the land use, the variation of microbiological characteristics appeared not to have predictable spatial structure at 0.5-10 m. Temporal and soil depth-related patterns were studied in relation to plant growth in cropped soil. The results showed that most enzyme activities and microbial biomass have a clear decreasing trend in the top 40 cm soil profile and a temporal pattern during the growing season. A new procedure for sampling of soil microbiological characteristics based on stratified sampling and pre-characterisation of samples was developed. 
A practical example demonstrated the potential of the new procedure to reduce the analysis efforts involved in laborious microbiological measurements without loss of precision. The investigation of storage of soil samples revealed that freezing (-20 °C) of small sample aliquots retains the activity of hydrolytic enzymes and the structure of the bacterial community in different soil matrices relatively well whereas air-drying cannot be recommended as a storage method for soil microbiological properties due to large reductions in activity. Freezing below -70 °C was the preferred method of storage for samples with high organic matter content. Comparison of different direct DNA extraction methods showed that the cell lysis treatment has a strong impact on the molecular size of DNA obtained and on the bacterial community structure detected. An improved MPN method for the enumeration of soil naphthalene degraders was introduced as an alternative to more complex MPN protocols or the DNA-based quantification approach. The main advantage of the new method is the simple protocol and the possibility to analyse a large number of samples and replicates simultaneously.
Abstract:
An inverse problem for the wave equation is a mathematical formulation of the problem of converting measurements of sound waves into information about the wave speed governing the propagation of the waves. This doctoral thesis extends the theory on the inverse problems for the wave equation in cases with partial measurement data and also considers detection of discontinuous interfaces in the wave speed. A possible application of the theory is obstetric sonography in which ultrasound measurements are transformed into an image of the fetus in its mother's uterus. The wave speed inside the body cannot be directly observed but sound waves can be produced outside the body and their echoes from the body can be recorded. The present work contains five research articles. In the first and the fifth articles we show that it is possible to determine the wave speed uniquely by using far-apart sound sources and receivers. This extends a previously known result which requires the sound waves to be produced and recorded in the same place. Our result is motivated by a possible application to reflection seismology which seeks to create an image of the Earth's crust from recordings of echoes stimulated for example by explosions. For this purpose, the receivers cannot typically lie near the powerful sound sources. In the second article we present a sound source that allows us to recover many essential features of the wave speed from the echo produced by the source. Moreover, these features are known to determine the wave speed under certain geometric assumptions. Previously known results permitted the same features to be recovered only by sequential measurement of echoes produced by multiple different sources. The reduced number of measurements could increase the number of possible applications of acoustic probing. In the third and fourth articles we develop an acoustic probing method to locate discontinuous interfaces in the wave speed.
These interfaces typically correspond to interfaces between different materials and their locations are of interest in many applications. There are many previous approaches to this problem but none of them exploits sound sources varying freely in time. Our use of more variable sources could allow more robust implementation of the probing.