875 resultados para document clustering


Relevância:

20.00% 20.00%

Publicador:

Resumo:

This paper presents some developments in query expansion and document representation of our spoken document retrieval system and shows how various retrieval techniques affect performance for different sets of transcriptions derived from a common speech source. Modifications of the document representation are used, which combine several techniques for query expansion, knowledge-based on one hand and statistics-based on the other. Taken together, these techniques can improve Average Precision by over 19% relative to a system similar to that which we presented at TREC-7. These new experiments have also confirmed that the degradation of Average Precision due to a word error rate (WER) of 25% is quite small (3.7% relative) and can be reduced to almost zero (0.2% relative). The overall improvement of the retrieval system can also be observed for seven different sets of transcriptions from different recognition engines with a WER ranging from 24.8% to 61.5%. We hope to repeat these experiments when larger document collections become available, in order to evaluate the scalability of these techniques.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The standard, ad-hoc stopping criteria used in decision tree-based context clustering are known to be sub-optimal and require parameters to be tuned. This paper proposes a new approach for decision tree-based context clustering based on cross validation and hierarchical priors. Combination of cross validation and hierarchical priors within decision tree-based context clustering offers better model selection and more robust parameter estimation than conventional approaches, with no tuning parameters. Experimental results on HMM-based speech synthesis show that the proposed approach achieved significant improvements in naturalness of synthesized speech over the conventional approaches. © 2011 IEEE.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This paper focuses on document data, one of the most significant sources for technology intelligence. To help organisations use their knowledge in documents effectively, this research aims to identify what organizations really want from documents and what might be possible to obtain from them. The research involves a literature review, a series of in-depth/on-site interviews and a descriptive analysis of document mining applications. The output of the research includes: a document mining framework; an analysis of the current condition of document mining in technology-based organisations together with their future requirements; and guidelines for introducing document mining into an organisation along with a discussion on the practical issues that are faced by users. Copyright © 2011 Inderscience Enterprises Ltd.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

We propose a novel model for the spatio-temporal clustering of trajectories based on motion, which applies to challenging street-view video sequences of pedestrians captured by a mobile camera. A key contribution of our work is the introduction of novel probabilistic region trajectories, motivated by the non-repeatability of segmentation of frames in a video sequence. Hierarchical image segments are obtained by using a state-of-the-art hierarchical segmentation algorithm, and connected from adjacent frames in a directed acyclic graph. The region trajectories and measures of confidence are extracted from this graph using a dynamic programming-based optimisation. Our second main contribution is a Bayesian framework with a twofold goal: to learn the optimal, in a maximum likelihood sense, Random Forests classifier of motion patterns based on video features, and construct a unique graph from region trajectories of different frames, lengths and hierarchical levels. Finally, we demonstrate the use of Isomap for effective spatio-temporal clustering of the region trajectories of pedestrians. We support our claims with experimental results on new and existing challenging video sequences. © 2011 IEEE.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Chapter 20 Clustering User Data for User Modelling in the GUIDE Multi-modal Set- top Box PM Langdon and P. Biswas 20.1 ... It utilises advanced user modelling and simulation in conjunction with a single layer interface that permits a ...

Relevância:

20.00% 20.00%

Publicador:

Resumo:

We present a new co-clustering problem of images and visual features. The problem involves a set of non-object images in addition to a set of object images and features to be co-clustered. Co-clustering is performed in a way that maximises discrimination of object images from non-object images, thus emphasizing discriminative features. This provides a way of obtaining perceptual joint-clusters of object images and features. We tackle the problem by simultaneously boosting multiple strong classifiers which compete for images by their expertise. Each boosting classifier is an aggregation of weak-learners, i.e. simple visual features. The obtained classifiers are useful for object detection tasks which exhibit multimodalities, e.g. multi-category and multi-view object detection tasks. Experiments on a set of pedestrian images and a face data set demonstrate that the method yields intuitive image clusters with associated features and is much superior to conventional boosting classifiers in object detection tasks.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

For many applications, it is necessary to produce speech transcriptions in a causal fashion. To produce high quality transcripts, speaker adaptation is often used. This requires online speaker clustering and incremental adaptation techniques to be developed. This paper presents an integrated approach to online speaker clustering and adaptation which allows efficient clustering of speakers using the same accumulated statistics that are normally used for adaptation. Using a consistent criterion for both clustering and adaptation should yield gains for both stages. The proposed approach is evaluated on a meetings transcription task using audio from multiple distant microphones. Consistent gains over standard clustering and adaptation were obtained. Copyright © 2011 ISCA.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This research proposes a method for extracting technology intelligence (TI) systematically from a large set of document data. To do this, the internal and external sources in the form of documents, which might be valuable for TI, are first identified. Then the existing techniques and software systems applicable to document analysis are examined. Finally, based on the reviews, a document-mining framework designed for TI is suggested and guidelines for software selection are proposed. The research output is expected to support intelligence operatives in finding suitable techniques and software systems for getting value from document-mining and thus facilitate effective knowledge management. Copyright © 2012 Inderscience Enterprises Ltd.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

MOTIVATION: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct-but often complementary-information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured through parameters that describe the agreement among the datasets. RESULTS: Using a set of six artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real Saccharomyces cerevisiae datasets. In the two-dataset case, we show that MDI's performance is comparable with the present state-of-the-art. We then move beyond the capabilities of current approaches and integrate gene expression, chromatin immunoprecipitation-chip and protein-protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques-as well as to non-integrative approaches-demonstrate that MDI is competitive, while also providing information that would be difficult or impossible to extract using other methods.