973 results for sequential data


Relevance:

100.00% 100.00%

Publisher:

Abstract:

Segmentation is a data mining technique yielding simplified representations of sequences of ordered points. A sequence is divided into some number of homogeneous blocks, and all points within a segment are described by a single value. The focus in this thesis is on piecewise-constant segments, where the most likely description for each segment and the most likely segmentation into some number of blocks can be computed efficiently. Representing sequences as segmentations is useful in, e.g., storage and indexing tasks in sequence databases, and segmentation can be used as a tool in learning about the structure of a given sequence. The discussion in this thesis begins with basic questions related to segmentation analysis, such as choosing the number of segments, and evaluating the obtained segmentations. Standard model selection techniques are shown to perform well for the sequence segmentation task. Segmentation evaluation is proposed with respect to a known segmentation structure. Applying segmentation on certain features of a sequence is shown to yield segmentations that are significantly close to the known underlying structure. Two extensions to the basic segmentation framework are introduced: unimodal segmentation and basis segmentation. The former is concerned with segmentations where the segment descriptions first increase and then decrease, and the latter with the interplay between different dimensions and segments in the sequence. These problems are formally defined and algorithms for solving them are provided and analyzed. Practical applications for segmentation techniques include time series and data stream analysis, text analysis, and biological sequence analysis. In this thesis segmentation applications are demonstrated in analyzing genomic sequences.
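The piecewise-constant segmentation described in this abstract can be illustrated with a standard dynamic program. This is a minimal sketch, not the thesis's own algorithm: it assumes one-dimensional points, squared error as the cost, and the segment mean as the single describing value; the names `segment` and `cost` are hypothetical.

```python
from typing import List, Tuple


def segment(points: List[float], k: int) -> Tuple[float, List[int]]:
    """Split `points` into k contiguous blocks minimising total squared error.

    Each block is described by its mean (an assumption of this sketch).
    Returns (total squared error, start index of each block).
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(points)")

    # Prefix sums of x and x^2 give the cost of any block in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(points):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def cost(i: int, j: int) -> float:
        """Squared error of points[i:j] around their mean (requires j > i)."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[m][j]: minimal error when the first j points form m blocks.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                candidate = best[m - 1][i] + cost(i, j)
                if candidate < best[m][j]:
                    best[m][j] = candidate
                    back[m][j] = i

    # Walk the back-pointers to recover where each block starts.
    starts: List[int] = []
    j = n
    for m in range(k, 0, -1):
        j = back[m][j]
        starts.append(j)
    starts.reverse()
    return best[k][n], starts


error, starts = segment([1, 1, 1, 5, 5, 5, 9, 9], 3)
```

The triple loop costs O(k n^2) time, which is the usual price of the exact dynamic program; choosing the number of blocks `k` is a separate model selection question, as the abstract notes.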

Relevance:

100.00% 100.00%

Publisher:

Abstract:

This dissertation deals with aspects of sequential data assimilation (in particular ensemble Kalman filtering) and numerical weather forecasting. In the first part, the recently formulated Ensemble Kalman-Bucy filter (EnKBF) is revisited. It is shown that the previously used numerical integration scheme fails when the magnitude of the background error covariance grows beyond that of the observational error covariance in the forecast window. Therefore, we present a suitable integration scheme that handles the stiffening of the differential equations involved and does not add computational expense. Moreover, a transform-based alternative to the EnKBF is developed: under this scheme, the operations are performed in the ensemble space instead of in the state space. Advantages of this formulation are explained. For the first time, the EnKBF is implemented in an atmospheric model. The second part of this work deals with ensemble clustering, a phenomenon that arises when performing data assimilation using deterministic ensemble square root filters (EnSRFs) in highly nonlinear forecast models. Namely, an M-member ensemble detaches into an outlier and a cluster of M-1 members. Previous works may suggest that this issue represents a failure of EnSRFs; this work dispels that notion. It is shown that ensemble clustering can also be reversed by nonlinear processes, in particular the alternation between nonlinear expansion and compression of the ensemble for different regions of the attractor. Some EnSRFs that use random rotations have been developed to overcome this issue; these formulations are analyzed and their advantages and disadvantages with respect to common EnSRFs are discussed. The third and last part contains the implementation of the Robert-Asselin-Williams (RAW) filter in an atmospheric model. 
The RAW filter is an improvement to the widely popular Robert-Asselin filter that successfully suppresses spurious computational waves while avoiding any distortion in the mean value of the function. Using statistical significance tests both at the local and field level, it is shown that the climatology of the SPEEDY model is not modified by the changed time stepping scheme; hence, no retuning of the parameterizations is required. It is found that the accuracy of the medium-term forecasts is increased by using the RAW filter.
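To make the difference between the two time filters concrete, the sketch below applies leapfrog time stepping to the oscillation equation dx/dt = i·omega·x, a common test problem that is an assumption here and unrelated to the SPEEDY model. The function name `leapfrog_filtered` and the parameter values are illustrative only; `alpha = 1` reproduces the classical Robert-Asselin filter, while a value slightly above 0.5 gives the RAW variant.

```python
import cmath


def leapfrog_filtered(omega: float, dt: float, steps: int,
                      nu: float = 0.2, alpha: float = 0.53) -> float:
    """Integrate dx/dt = i*omega*x with leapfrog plus a RAW-type filter.

    Returns the amplitude |x| after `steps` time steps. The exact solution
    keeps |x| = 1, so any departure from 1 is numerical error.
    """
    prev = 1.0 + 0.0j                     # filtered value at step n-1
    curr = cmath.exp(1j * omega * dt)     # exact value used to start step n
    for _ in range(steps - 1):
        # Leapfrog: centred difference across two time steps.
        nxt = prev + 2.0 * dt * (1j * omega * curr)
        # Filter displacement, proportional to the discrete second derivative.
        d = 0.5 * nu * (prev - 2.0 * curr + nxt)
        # Robert-Asselin part: nudge the middle value by alpha * d.
        prev = curr + alpha * d
        # Williams' addition: nudge the newest value by (alpha - 1) * d.
        curr = nxt - (1.0 - alpha) * d
    return abs(curr)


ra_amplitude = leapfrog_filtered(1.0, 0.2, 200, alpha=1.0)    # Robert-Asselin
raw_amplitude = leapfrog_filtered(1.0, 0.2, 200, alpha=0.53)  # RAW
```

With `alpha = 0.5` the two nudges cancel in the three-point sum, which is the sense in which the mean value is left undistorted; values just above 0.5 are typically chosen to retain a small amount of damping.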

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Recently, a lot of effort has been spent on the efficient computation of kriging predictors when observations are assimilated sequentially. In particular, kriging update formulae enabling significant computational savings were derived. Taking advantage of the previous kriging mean and variance computations helps avoid a costly matrix inversion when adding one observation to the already available ones. In addition to traditional update formulae taking into account a single new observation, Emery (2009) proposed formulae for the batch-sequential case, i.e. when several new observations are simultaneously assimilated. However, the kriging variance and covariance formulae given in Emery (2009) for the batch-sequential case are not correct. In this paper, we fix this issue and establish correct expressions for updated kriging variances and covariances when assimilating observations in parallel. An application in sequential conditional simulation finally shows that coupling update and residual substitution approaches may enable significant speed-ups.
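For orientation, the batch update has a compact form in the simple kriging (known mean, Gaussian) setting, where it coincides with standard Gaussian conditioning. The block below is a sketch under that assumption and is not copied from the paper. All symbols are notation introduced here: $m_n$ and $c_n$ are the kriging mean and covariance after $n$ observations, $X_{\mathrm{new}}$ collects the $q$ new locations, $Z_{\mathrm{new}}$ the corresponding observed values, $\mathbf{c}_n(x)$ is the vector of covariances $c_n(x, x_{n+i})$, and $C_n$ is the $q \times q$ matrix with entries $c_n(x_{n+i}, x_{n+j})$.

```latex
\begin{align}
m_{n+q}(x) &= m_n(x) + \mathbf{c}_n(x)^{\top} C_n^{-1}
              \bigl( Z_{\mathrm{new}} - m_n(X_{\mathrm{new}}) \bigr), \\
c_{n+q}(x, x') &= c_n(x, x') - \mathbf{c}_n(x)^{\top} C_n^{-1}\, \mathbf{c}_n(x'), \\
s^2_{n+q}(x) &= c_{n+q}(x, x)
              = s^2_n(x) - \mathbf{c}_n(x)^{\top} C_n^{-1}\, \mathbf{c}_n(x).
\end{align}
```

Only the $q \times q$ matrix $C_n$ has to be inverted, which is where the saving over recomputing the predictor from all $n + q$ observations comes from. For $q = 1$ the expressions reduce to the single-observation update, with $C_n^{-1} = 1 / s^2_n(x_{n+1})$.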

Relevance:

70.00% 70.00%

Publisher:

Abstract:

In many applications, e.g., bioinformatics, web access traces and system utilisation logs, the data is naturally in the form of sequences. There has been great interest in analysing sequential data and finding the inherent characteristics or relationships within it. Sequential association rule mining is one of the possible methods used to analyse this data. As conventional sequential association rule mining very often generates a huge number of association rules, many of which are redundant, it is desirable to find a way to eliminate the unnecessary rules. Because of the complexity and temporally ordered nature of sequential data, current research on sequential association rule mining is limited. Although several sequential association rule prediction models using either sequence constraints or temporal constraints have been proposed, none of them considers the redundancy problem in rule mining. The main contribution of this research is a non-redundant association rule mining method based on closed frequent sequences and minimal sequential generators. We also give a definition of non-redundant sequential rules, which are sequential rules with minimal antecedents but maximal consequents. A new algorithm called CSGM (closed sequential and generator mining) for generating closed sequences and minimal sequential generators is also introduced. A further experiment compares the performance of generating non-redundant sequential rules with that of generating full sequential rules. In addition, CSGM is evaluated against other closed sequential pattern mining and generator mining algorithms. We also use the generated non-redundant sequential rules for query expansion in order to improve recommendations for infrequently purchased products.
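The notions this abstract relies on (support, rule confidence, closed sequences) can be shown on a toy database. The sketch below is not the CSGM algorithm; it assumes sequences of single items rather than itemsets, checks closedness only against an explicitly supplied candidate list, and all function names are hypothetical.

```python
from typing import List, Sequence


def is_subsequence(pattern: Sequence[str], sequence: Sequence[str]) -> bool:
    """True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    remaining = iter(sequence)
    # `item in remaining` consumes the iterator, which enforces the ordering.
    return all(item in remaining for item in pattern)


def support(pattern: Sequence[str], database: List[Sequence[str]]) -> int:
    """Number of database sequences that contain `pattern`."""
    return sum(1 for seq in database if is_subsequence(pattern, seq))


def confidence(antecedent: Sequence[str], consequent: Sequence[str],
               database: List[Sequence[str]]) -> float:
    """Confidence of the sequential rule antecedent => consequent."""
    base = support(antecedent, database)
    if base == 0:
        return 0.0
    return support(list(antecedent) + list(consequent), database) / base


def is_closed(pattern: Sequence[str], candidates: List[Sequence[str]],
              database: List[Sequence[str]]) -> bool:
    """Closed means no longer candidate containing `pattern` has equal support."""
    own = support(pattern, database)
    return not any(
        len(c) > len(pattern)
        and is_subsequence(pattern, c)
        and support(c, database) == own
        for c in candidates
    )


db = [["a", "b", "c"], ["a", "c"], ["a", "b", "c", "d"], ["b", "c"]]
rule_conf = confidence(["a"], ["c"], db)
ab_closed = is_closed(["a", "b"], [["a", "b", "c"]], db)
```

Here `["a", "b"]` is not closed because the longer sequence `["a", "b", "c"]` appears in exactly the same database sequences, so keeping both would be redundant.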

Relevance:

70.00% 70.00%

Publisher:

Abstract:

Over the past decade, many powerful data mining techniques have been developed to analyze temporal and sequential data. The time is now fertile for addressing problems of larger scope under the purview of temporal data mining. The fourth SIGKDD workshop on temporal data mining focused on the question: What can we infer about the structure of a complex dynamical system from observed temporal data? The goals of the workshop were to critically evaluate the need in this area by bringing together leading researchers from industry and academia, and to identify promising technologies and methodologies for doing the same. We provide a brief summary of the workshop proceedings and ideas arising out of the discussions.

Relevance:

70.00% 70.00%

Publisher:

Abstract:

Data mining is concerned with analysing large volumes of (often unstructured) data to automatically discover interesting regularities or relationships, which in turn lead to a better understanding of the underlying processes. The field of temporal data mining is concerned with such analysis in the case of ordered data streams with temporal interdependencies. Over the last decade many interesting techniques of temporal data mining were proposed and shown to be useful in many applications. Since temporal data mining brings together techniques from different fields such as statistics, machine learning and databases, the literature is scattered among many different sources. In this article, we present an overview of techniques of temporal data mining. We mainly concentrate on algorithms for pattern discovery in sequential data streams. We also describe some recent results regarding statistical analysis of pattern discovery methods.

Relevance:

70.00% 70.00%

Publisher:

Abstract:

Narrative therapy is a postmodern therapy that takes the position that people create self-narratives to make sense of their experiences. To date, narrative therapy has compiled virtually no quantitative and very little qualitative research, leaving gaps in almost all areas of process and outcome. White (2006a), one of the therapy's founders, has recently utilized Vygotsky's (1934/1987) theories of the zone of proximal development (ZPD) and concept formation to describe the process of change in narrative therapy with children. In collaboration with the child client, the narrative therapist formalizes therapeutic concepts and submits them to increasing levels of generalization to create a ZPD. This study sought to determine whether the child's development proceeds through the stages of concept formation over the course of a session, and whether therapists' utterances scaffold this movement. A sequential analysis was used due to its unique ability to measure dynamic processes in social interactions. Stages of concept formation and scaffolding were coded over time. A hierarchical log-linear analysis was performed on the sequential data to develop a model of therapist scaffolding and child concept development. This was intended to determine what patterns occur and whether the stated intent of narrative therapy matches its actual process. In accordance with narrative therapy theory, the log-linear analysis produced a final model with interactions between therapist and child utterances, and between both therapist and child utterances and time. Specifically, the child and youth participants in therapy tended to respond to therapist scaffolding at the corresponding level of concept formation. Both children and youth and therapists also tended to move away from earlier and toward later stages of White's scaffolding conversations map as the therapy session advanced. 
These findings provide support for White's contention that narrative therapists promote child development by scaffolding child concept formation in therapy.

Relevance:

70.00% 70.00%

Publisher:

Abstract:

Here we make an initial step toward the development of an ocean assimilation system that can constrain the modelled Atlantic Meridional Overturning Circulation (AMOC) to support climate predictions. A detailed comparison is presented of 1° and 1/4° resolution global model simulations with and without sequential data assimilation, to the observations and transport estimates from the RAPID mooring array across 26.5° N in the Atlantic. Comparisons of modelled water properties with the observations from the merged RAPID boundary arrays demonstrate the ability of in situ data assimilation to accurately constrain the east-west density gradient between these mooring arrays. However, the presence of an unconstrained "western boundary wedge" between Abaco Island and the RAPID mooring site WB2 (16 km offshore) leads to the intensification of an erroneous southwards flow in this region when in situ data are assimilated. The result is an overly intense southward upper mid-ocean transport (0–1100 m) as compared to the estimates derived from the RAPID array. Correction of upper layer zonal density gradients is found to compensate mostly for a weak subtropical gyre circulation in the free model run (i.e. with no assimilation). Despite the important changes to the density structure and transports in the upper layer imposed by the assimilation, very little change is found in the amplitude and sub-seasonal variability of the AMOC. This shows that assimilation of upper layer density information projects mainly on the gyre circulation with little effect on the AMOC at 26° N due to the absence of corrections to density gradients below 2000 m (the maximum depth of Argo). The sensitivity to initial conditions was explored through two additional experiments using a climatological initial condition. 
These experiments showed that the weak bias in gyre intensity in the control simulation (without data assimilation) develops over a period of about 6 months, but does so independently from the overturning, with no change to the AMOC. However, differences in the properties and volume transport of North Atlantic Deep Water (NADW) persisted throughout the 3 year simulations resulting in a difference of 3 Sv in AMOC intensity. The persistence of these dense water anomalies and their influence on the AMOC is promising for the development of decadal forecasting capabilities. The results suggest that the deeper waters must be accurately reproduced in order to constrain the AMOC.

Relevance:

70.00% 70.00%

Publisher:

Abstract:

Recently major processor manufacturers have announced a dramatic shift in their paradigm to increase computing power over the coming years. Instead of focusing on faster clock speeds and more powerful single-core CPUs, the trend clearly goes towards multi-core systems. This will also result in a paradigm shift for the development of algorithms for computationally expensive tasks, such as data mining applications. Obviously, work on parallel algorithms is not new per se, but concentrated efforts in the many application domains are still missing. Multi-core systems, but also clusters of workstations and even large-scale distributed computing infrastructures, provide new opportunities and pose new challenges for the design of parallel and distributed algorithms. Since data mining and machine learning systems rely on high performance computing systems, research on the corresponding algorithms must be at the forefront of parallel algorithm research in order to keep pushing data mining and machine learning applications to be more powerful and, especially for the former, interactive. To bring together researchers and practitioners working in this exciting field, a workshop on parallel data mining was organized as part of PKDD/ECML 2006 (Berlin, Germany). The six contributions selected for the program describe various aspects of data mining and machine learning approaches featuring low to high degrees of parallelism. The first contribution addresses the classic problem of distributed association rule mining and focuses on communication efficiency to improve the state of the art. After this, a parallelization technique for speeding up decision tree construction by means of thread-level parallelism for shared memory systems is presented. The next paper discusses the design of a parallel approach to the frequent subgraph mining problem for distributed memory systems. 
This approach is based on a hierarchical communication topology to solve issues related to multi-domain computational environments. The fourth paper describes the combined use and customization of software packages to facilitate top-down parallelism in the tuning of Support Vector Machines (SVMs), and the next contribution presents an interesting idea concerning parallel training of Conditional Random Fields (CRFs) and motivates their use in labeling sequential data. The last contribution finally focuses on very efficient feature selection. It describes a parallel algorithm for feature selection from random subsets. Selecting the papers included in this volume would not have been possible without the help of an international Program Committee that has provided detailed reviews for each paper. We would also like to thank Matthew Otey, who helped with publicity for the workshop.

Relevance:

70.00% 70.00%

Publisher:

Abstract:

Radiometric data in the visible domain acquired by satellite remote sensing have proven to be powerful for monitoring the states of the ocean, both physical and biological. With the help of these data it is possible to understand certain variations in biological responses of marine phytoplankton on ecological time scales. Here, we implement a sequential data-assimilation technique to estimate from a conventional nutrient–phytoplankton–zooplankton (NPZ) model the time variations of observed and unobserved variables. In addition, we estimate the time evolution of two biological parameters, namely, the specific growth rate and specific mortality of phytoplankton. Our study demonstrates that: (i) the series of time-varying estimates of specific growth rate obtained by sequential data assimilation improves the fitting of the NPZ model to the satellite-derived time series: the model trajectories are closer to the observations than those obtained by implementing static values of the parameter; (ii) the estimates of unobserved variables, i.e., nutrient and zooplankton, obtained from an NPZ model by implementation of a pre-defined parameter evolution can be different from those obtained on applying the sequences of parameters estimated by assimilation; and (iii) the maximum estimated specific growth rate of phytoplankton in the study area is more sensitive to the sea-surface temperature than would be predicted by temperature-dependent functions reported previously. The overall results of the study are potentially useful for enhancing our understanding of the biological response of phytoplankton in a changing environment.

Relevance:

70.00% 70.00%

Publisher:

Abstract:

Spatial data warehouses (SDWs) allow for spatial analysis together with analytical multidimensional queries over huge volumes of data. The challenge is to retrieve data related to ad hoc spatial query windows according to spatial predicates, avoiding the high cost of joining large tables. Therefore, mechanisms to provide efficient query processing over SDWs are essential. In this paper, we propose two efficient indices for SDWs: the SB-index and the HSB-index. The proposed indices share the following characteristics. They enable multidimensional queries with spatial predicates over SDWs and also support predefined spatial hierarchies. Furthermore, they compute the spatial predicate and transform it into a conventional one, which can be evaluated together with other conventional predicates by accessing a star-join Bitmap index. While the SB-index has a sequential data structure, the HSB-index uses a hierarchical data structure to enable spatial object clustering and a specialized buffer-pool to decrease the number of disk accesses. The advantages of the SB-index and the HSB-index over the DBMS resources for SDW indexing (i.e. star-join computation and materialized views) were investigated through performance tests, which issued roll-up operations extended with containment and intersection range queries. The performance results showed that improvements ranged from 68% up to 99% over both the star-join computation and the materialized view. Furthermore, the proposed indices proved to be very compact, adding less than 1% to the storage requirements. Therefore, both the SB-index and the HSB-index are excellent choices for SDW indexing. Choosing between the SB-index and the HSB-index mainly depends on the query selectivity of spatial predicates. While low query selectivity benefits the HSB-index, the SB-index provides better performance for higher query selectivity.

Relevance:

70.00% 70.00%

Publisher:

Abstract:

The joint modeling of longitudinal and survival data is a new approach to many applications such as HIV, cancer vaccine trials and quality of life studies. There are recent developments of the methodologies with respect to each of the components of the joint model as well as the statistical processes that link them together. Among these, second-order polynomial random effect models and linear mixed effects models are the most commonly used for the longitudinal trajectory function. In this study, we first relax the parametric constraints for polynomial random effect models by using Dirichlet process priors, and consider three longitudinal markers rather than only one in a single joint model. Second, we use a linear mixed effect model for the longitudinal process in a joint model analyzing the three markers. These methods were applied to the Primary Biliary Cirrhosis sequential data, which were collected from a clinical trial of primary biliary cirrhosis (PBC) of the liver conducted between 1974 and 1984 at the Mayo Clinic. The effects of three longitudinal markers, (1) total serum bilirubin, (2) serum albumin and (3) serum glutamic-oxaloacetic transaminase (SGOT), on patients' survival were investigated. The proportion of treatment effect is also studied using the proposed joint modeling approaches. Based on the results, we conclude that the proposed modeling approaches yield a better fit to the data and give less biased parameter estimates for these trajectory functions than previous methods. Model fit is also improved by considering three longitudinal markers instead of only one. The analysis of the proportion of treatment effect from these joint models leads to the same conclusion as the final model of Fleming and Harrington (1991): bilirubin and albumin together have a stronger impact in predicting patients' survival and as surrogate endpoints for treatment.

Relevance:

60.00% 60.00%

Publisher:

Abstract:

Information behavior models generally focus on one of many aspects of information behavior: information finding (conceptualized as information seeking, information foraging or information sense-making), information organizing, or information using. This ongoing study is developing an integrated model of information behavior. The research design involves a 2-week-long daily information journal self-maintained by the participants, combined with two interviews, one before and one after the journal-keeping period. The data from the study will be analyzed using grounded theory to identify when the participants engage in the various behaviors that have already been observed, identified, and defined in previous models, in order to generate useful sequential data and an integrated model.