943 results for speaker clustering


Relevance:

20.00% 20.00%

Publisher:

Abstract:

Cluster analysis has played a key role in data understanding. When such an important data mining task is extended to the context of data streams, it becomes more challenging because the data arrive at the mining system in a one-pass manner. The problem is even more difficult when the clustering task is considered in a sliding window model, which requires that the elimination of outdated data be handled properly. We propose the SWEM algorithm, which exploits the Expectation Maximization technique to address these challenges. SWEM is not only able to process the stream incrementally, but is also capable of adapting to changes in the underlying stream distribution.

Relevance:

20.00% 20.00%

Publisher:

Abstract:

Cluster analysis has played a key role in data stream understanding. The problem is difficult when the clustering task is considered in a sliding window model, in which the elimination of outdated data must be handled properly. We propose the SWEM algorithm, which is based on the Expectation Maximization technique, to address these challenges. SWEM can compute clusters incrementally using a small number of statistics summarized over the stream, and it can adapt to changes in the stream distribution. The feasibility of SWEM has been verified in a number of experiments, and we show that it is superior to the CluStream algorithm on both synthetic and real datasets.
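The idea of clustering from "a small number of statistics summarized over the stream" while expiring outdated data can be illustrated with a minimal sketch. This is not the SWEM algorithm itself: it assumes a single one-dimensional Gaussian component and a hypothetical class name, `SlidingWindowStats`, and only shows how a count, a sum and a sum of squares can be updated as points enter and leave the window.

```python
from collections import deque


class SlidingWindowStats:
    """Keep a sum and a sum of squares for the last `window` values.

    Hypothetical helper for illustration; not part of SWEM.
    """

    def __init__(self, window):
        self.window = window
        self.values = deque()
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, x):
        # Fold the new point into the summary statistics.
        self.values.append(x)
        self.total += x
        self.total_sq += x * x
        # Expire the oldest point once the window is exceeded.
        if len(self.values) > self.window:
            old = self.values.popleft()
            self.total -= old
            self.total_sq -= old * old

    def mean(self):
        return self.total / len(self.values)

    def variance(self):
        # Population variance from the summaries; clamp tiny negative
        # values caused by floating-point rounding.
        m = self.mean()
        return max(self.total_sq / len(self.values) - m * m, 0.0)
```

Each update costs constant time, which is the property that makes one-pass processing feasible. A mixture model would keep one such summary per component, weighted by membership probabilities.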

Relevance:

20.00% 20.00%

Publisher:

Abstract:

This paper presents a novel multi-label classification framework for domains with large numbers of labels. Automatic image annotation is such a domain, as the available semantic concepts typically number in the hundreds. The proposed framework comprises an initial clustering phase that breaks the original training set into several disjoint clusters of data. It then trains a multi-label classifier on the data of each cluster. Given a new test instance, the framework first finds the nearest cluster and then applies the corresponding model. Empirical results using two clustering algorithms, four multi-label classification algorithms and three image annotation data sets suggest that the proposed approach can improve the performance and reduce the training time of standard multi-label classification algorithms, particularly when the number of labels is large.
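The prediction step described above (find the nearest cluster, then apply that cluster's model) can be sketched as follows. The sketch assumes Euclidean distance to cluster centroids and represents each per-cluster classifier as a plain callable. The function names and these choices are illustrative assumptions, not details taken from the paper.

```python
import math


def nearest_cluster(centroids, x):
    """Return the index of the centroid closest to x (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(centroids[i], x))


def predict(centroids, models, x):
    """Route x to its nearest cluster and apply that cluster's model.

    `models[i]` is assumed to be a callable trained only on the data
    of cluster i and returning a set of labels.
    """
    return models[nearest_cluster(centroids, x)](x)
```

Because each model sees only one cluster's data, each training problem is smaller, which is consistent with the reduced training time the abstract reports.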

Relevance:

20.00% 20.00%

Publisher:

Abstract:

Lung nodules can be detected by examining CT scans. An automated lung nodule classification system is presented in this paper. The system employs random forests as its base classifier. A unique architecture for classification aided by clustering is presented. Four experiments are conducted to study the performance of the developed system, using 5721 CT lung image slices from the LIDC database. According to the experimental results, the system achieves a highest sensitivity of 97.92% and specificity of 96.28%. The results demonstrate that the system improves on the performance of its tested counterparts.

Relevance:

20.00% 20.00%

Publisher:

Abstract:

An automated lung nodule detection system can help spot lung abnormalities in CT lung images. Lung nodule detection can be achieved using template-based, segmentation-based, and classification-based methods. Existing systems that include a classification component in their structure have demonstrated better performance than their counterparts. Ensemble learners combine the decisions of multiple classifiers to form an integrated output. To improve the performance of automated lung nodule detection, an ensemble classification aided by clustering (CAC) method is proposed. The method takes advantage of the random forest algorithm and offers a structure for hybrid random-forest-based lung nodule classification aided by clustering. Several experiments are carried out involving the proposed method as well as two other existing methods. The parameters of the classifiers are varied to identify the best-performing classifiers. The experiments are conducted using lung scans of 32 patients, comprising 5721 images in which nodule locations are marked by expert radiologists. Overall, the best sensitivity of 98.33% and specificity of 97.11% have been recorded for the proposed system. A high receiver operating characteristic (ROC) Az of 0.9786 has also been achieved.
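For readers less familiar with the reported metrics, sensitivity and specificity follow directly from confusion-matrix counts. The sketch below uses the standard definitions; the function and parameter names are chosen here for illustration, and no counts from the paper are assumed.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of actual nodules that the system flags (true positive rate)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of non-nodule regions correctly rejected (true negative rate)."""
    return true_neg / (true_neg + false_pos)
```

With made-up counts, `sensitivity(90, 10)` gives 0.9 and `specificity(80, 20)` gives 0.8.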

Relevance:

20.00% 20.00%

Publisher:

Abstract:

Model systems of sodium iodide dissolved in dimethyl ether or 1,2-dimethoxyethane (glyme) were studied in order to investigate the structural and dynamic properties of ionic solutions in small and polymeric ethers. Full molecular dynamics simulations were performed at a range of different salt concentrations. An algorithm was designed which assigns ions to clusters and then calculates all the terms which contribute to ionic conductivity. In dilute solutions, free ions are the most common ionic species, followed by ion pairs. As the concentration increases, pairs become the most common species, with significant concentrations of clusters with 3 through 6 ions. Changing the solvent from dimethyl ether to glyme significantly decreases the ion clustering due to the chelate effect in which the two oxygens on a solvent stabilize an associated cation. The conductivity in stable systems is shown to be primarily the result of the movement of free ions and the relative movement of ions within neutral pairs. The Nernst-Einstein relation, commonly used in the discussion of polymer electrolytes, is shown to be inadequate to quantitatively describe conductivity in the model systems.
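One common way to assign ions to clusters is to link any two ions closer than a cutoff distance and take connected components. The abstract does not state the criterion the authors used, so the sketch below is an assumption for illustration: the cutoff value is a free parameter, the function names are hypothetical, and periodic boundary conditions are ignored.

```python
import math
from collections import Counter


def cluster_ions(positions, cutoff):
    """Label ions so that any two within `cutoff` of each other share a label."""
    parent = list(range(len(positions)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            if math.dist(positions[i], positions[j]) <= cutoff:
                parent[find(i)] = find(j)

    return [find(i) for i in range(len(positions))]


def cluster_size_histogram(labels):
    """Map cluster size to the number of clusters of that size."""
    return Counter(Counter(labels).values())
```

In this representation a cluster of size 1 is a free ion and a cluster of size 2 is an ion pair, so the histogram corresponds to the species populations the abstract discusses.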

Relevance:

20.00% 20.00%

Publisher:

Abstract:

Selecting a suitable proximity measure is one of the fundamental tasks in clustering. How to effectively utilize all available side information, including instance-level information in the form of pair-wise constraints and attribute-level information in the form of attribute order preferences, is an essential problem in metric learning. In this paper, we propose a learning framework in which both the pair-wise constraints and the attribute order preferences can be incorporated simultaneously. The theory behind it and the related parameter adjustment technique are described in detail. Experimental results on benchmark data sets demonstrate the effectiveness of the proposed method.

Relevance:

20.00% 20.00%

Publisher:

Abstract:

Patterns can be found in different modalities of a domain, where the modalities are the data sources that constitute different aspects of that domain. In particular, the domain of our discussion is crime, and the modalities are the different data sources in the crime domain, such as offender data, weapon data, etc. In addition, patterns exist at different levels of granularity for each modality. To gain a thorough understanding of a domain, it is important to reveal the hidden patterns by exploring the data at different levels of granularity and for each modality. Therefore, this paper presents a new model for identifying patterns that exist at different levels of granularity for different modes of crime data. A hierarchical clustering approach, growing self-organising maps (GSOM), has been deployed. Furthermore, the model is enhanced with experiments that demonstrate the significance of exploring data at different granularities.

Relevance:

20.00% 20.00%

Publisher:

Abstract:

Important words, which usually appear in the Title, Subject and Keywords fields, can briefly reflect the main topic of a document. In recent years, it has become common practice to exploit the semantic topic of documents and use important words to achieve document clustering, especially for short texts such as news articles. This paper proposes a novel method that extracts important words from the Subject and Keywords of articles and then partitions documents using only those important words. Because the frequencies of important words are usually low and the scale matrix dataset for important words is small, a normalization method is then proposed to normalize the scale dataset so that more accurate results can be achieved by fully exploiting the limited information. The experiments validate the effectiveness of our method.

Relevance:

20.00% 20.00%

Publisher:

Abstract:

One reason semi-supervised clustering fails to deliver satisfactory performance in document clustering is that the transformed optimization problem can have many candidate solutions, but existing methods provide no mechanism for selecting a suitable one from among those candidates. This paper alleviates the problem by posing the same task as a soft-constrained optimization problem, and introduces the salient degree measure as an information guide to control the search for an optimal solution. Experimental results show that the proposed method improves performance, especially when the amount of prior domain knowledge is limited.

Relevance:

20.00% 20.00%

Publisher:

Abstract:

This paper presents a salience-based technique for the annotation of directly quoted speech from fiction text. In particular, this paper determines to what extent a naïve (without the use of complex machine learning or knowledge-based techniques) scoring technique can be used for the identification of the speaker of speech quotes. The presented technique makes use of a scoring technique, similar to that commonly found in knowledge-poor anaphora resolution research, as well as a set of hand-coded rules for the final identification of the speaker of each quote in the text. Speaker identification is shown to be achieved using three tasks: the identification of a speech-verb associated with a quote with a recall of 94.41%; the identification of the actor associated with a quote with a recall of 88.22%; and the selection of a speaker with an accuracy of 79.40%.

Relevance:

20.00% 20.00%

Publisher:

Abstract:

This paper presents a hierarchical pattern matching and generalisation technique which is applied to the problem of locating the correct speaker of quoted speech found in fiction books. Patterns from a training set are generalised to create a small number of rules, which can be used to identify items of interest within the text. The pattern matching technique is applied to finding the Speech-Verb, Actor and Speaker of quotes found in fiction books. The technique performs well over the training data, resulting in rule-sets many times smaller than the training set, but providing very high accuracy. While the rule-set generalised from one book is less effective when applied to different books than an approach based on hand-coded heuristics, performance is comparable when testing on data closely related to the training set.