125 resultados para constrained clustering


Relevância:

80.00% 80.00%

Publicador:

Resumo:

With the growing size and variety of social media files on the web, it’s becoming critical to efficiently organize them into clusters for further processing. This paper presents a novel scalable constrained document clustering method that harnesses the power of search engines capable of dealing with large text data. Instead of calculating distance between the documents and all of the clusters’ centroids, a neighborhood of best cluster candidates is chosen using a document ranking scheme. To make the method faster and less memory dependable, the in-memory and in-database processing are combined in a semi-incremental manner. This method has been extensively tested in the social event detection application. Empirical analysis shows that the proposed method is efficient both in computation and memory usage while producing notable accuracy.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

This paper elaborates the approach used by the Applied Data Mining Research Group (ADMRG) for the Social Event Detection (SED) Tasks of the 2013 MediaEval Benchmark. We extended the constrained clustering algorithm to apply to the first semi-supervised clustering task, and we compared several classifiers with Latent Dirichlet Allocation as feature selector in the second event classification task. The proposed approach focuses on scalability and efficient memory allocation when applied to a high dimensional data with large clusters. Results of the first task show the effectiveness of the proposed method. Results from task 2 indicate that attention on the imbalance categories distributions is needed.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This paper proposes a novel Hybrid Clustering approach for XML documents (HCX) that first determines the structural similarity in the form of frequent subtrees and then uses these frequent subtrees to represent the constrained content of the XML documents in order to determine the content similarity. The empirical analysis reveals that the proposed method is scalable and accurate.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

XML document clustering is essential for many document handling applications such as information storage, retrieval, integration and transformation. An XML clustering algorithm should process both the structural and the content information of XML documents in order to improve the accuracy and meaning of the clustering solution. However, the inclusion of both kinds of information in the clustering process results in a huge overhead for the underlying clustering algorithm because of the high dimensionality of the data. This paper introduces a novel approach that first determines the structural similarity in the form of frequent subtrees and then uses these frequent subtrees to represent the constrained content of the XML documents in order to determine the content similarity. The proposed method reduces the high dimensionality of input data by using only the structure-constrained content. The empirical analysis reveals that the proposed method can effectively cluster even very large XML datasets and outperform other existing methods.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Mobile ad-hoc networks (MANETs) are temporary wireless networks useful in emergency rescue services, battlefields operations, mobile conferencing and a variety of other useful applications. Due to dynamic nature and lack of centralized monitoring points, these networks are highly vulnerable to attacks. Intrusion detection systems (IDS) provide audit and monitoring capabilities that offer the local security to a node and help to perceive the specific trust level of other nodes. We take benefit of the clustering concept in MANETs for the effective communication between nodes, where each cluster involves a number of member nodes and is managed by a cluster-head. It can be taken as an advantage in these battery and memory constrained networks for the purpose of intrusion detection, by separating tasks for the head and member nodes, at the same time providing opportunity for launching collaborative detection approach. The clustering schemes are generally used for the routing purposes to enhance the route efficiency. However, the effect of change of a cluster tends to change the route; thus degrades the performance. This paper presents a low overhead clustering algorithm for the benefit of detecting intrusion rather than efficient routing. It also discusses the intrusion detection techniques with the help of this simplified clustering scheme.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

With the growing number of XML documents on theWeb it becomes essential to effectively organise these XML documents in order to retrieve useful information from them. A possible solution is to apply clustering on the XML documents to discover knowledge that promotes effective data management, information retrieval and query processing. However, many issues arise in discovering knowledge from these types of semi-structured documents due to their heterogeneity and structural irregularity. Most of the existing research on clustering techniques focuses only on one feature of the XML documents, this being either their structure or their content due to scalability and complexity problems. The knowledge gained in the form of clusters based on the structure or the content is not suitable for reallife datasets. It therefore becomes essential to include both the structure and content of XML documents in order to improve the accuracy and meaning of the clustering solution. However, the inclusion of both these kinds of information in the clustering process results in a huge overhead for the underlying clustering algorithm because of the high dimensionality of the data. The overall objective of this thesis is to address these issues by: (1) proposing methods to utilise frequent pattern mining techniques to reduce the dimension; (2) developing models to effectively combine the structure and content of XML documents; and (3) utilising the proposed models in clustering. This research first determines the structural similarity in the form of frequent subtrees and then uses these frequent subtrees to represent the constrained content of the XML documents in order to determine the content similarity. A clustering framework with two types of models, implicit and explicit, is developed. The implicit model uses a Vector Space Model (VSM) to combine the structure and the content information. The explicit model uses a higher order model, namely a 3- order Tensor Space Model (TSM), to explicitly combine the structure and the content information. This thesis also proposes a novel incremental technique to decompose largesized tensor models to utilise the decomposed solution for clustering the XML documents. The proposed framework and its components were extensively evaluated on several real-life datasets exhibiting extreme characteristics to understand the usefulness of the proposed framework in real-life situations. Additionally, this research evaluates the outcome of the clustering process on the collection selection problem in the information retrieval on the Wikipedia dataset. The experimental results demonstrate that the proposed frequent pattern mining and clustering methods outperform the related state-of-the-art approaches. In particular, the proposed framework of utilising frequent structures for constraining the content shows an improvement in accuracy over content-only and structure-only clustering results. The scalability evaluation experiments conducted on large scaled datasets clearly show the strengths of the proposed methods over state-of-the-art methods. In particular, this thesis work contributes to effectively combining the structure and the content of XML documents for clustering, in order to improve the accuracy of the clustering solution. In addition, it also contributes by addressing the research gaps in frequent pattern mining to generate efficient and concise frequent subtrees with various node relationships that could be used in clustering.

Relevância:

20.00% 20.00%

Publicador: